DSPy: Optimize LLM Pipelines with Declarative Programming

In LLM applications, optimizing a multi-step pipeline has long been a challenge: prompts are predefined strings that offer no clear direction for optimization. The DSPy framework introduced in this article can provide significant practical help with this kind of optimization. The article is primarily based on the paper “DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines.”

In today’s LLM applications, practitioners explore prompt engineering techniques and combine them into pipelines to solve complex tasks. Unfortunately, existing LM pipelines are typically implemented with hardcoded “prompt templates.” To provide a more systematic approach to developing and optimizing LM pipelines, this article introduces DSPy: a programming model that abstracts LM pipelines as text transformation graphs. DSPy modules are parameterized, meaning they can learn how to apply combinations of prompting, fine-tuning, augmentation, and reasoning techniques by creating and collecting examples. Case studies show that concise DSPy programs can express and optimize complex LLM pipelines. The open-source repository is available at https://github.com/stanfordnlp/dspy.

Introduction

Language models (LMs) have enabled researchers to build NLP systems at a higher level of abstraction and with lower data requirements, driving the development of prompting techniques and lightweight fine-tuning methods. While these techniques were mostly explored individually, they are now increasingly used to construct multi-stage pipelines that break down complex tasks into more manageable LM calls. However, LMs are highly sensitive to prompting methods, especially in pipelines requiring multiple LM calls. Existing methods often rely on handcrafted “prompt templates,” with instructions produced through trial and error. Though common, this approach is fragile and unscalable, akin to manually tuning classifier weights, and it generalizes poorly across different pipelines, LMs, data domains, or inputs.

To design a more systematic approach to building AI pipelines, this article introduces the DSPy programming model. DSPy shifts the construction of new LM pipelines from manipulating free-form strings to programming: combining modular operators to build text transformation graphs, where a compiler automatically generates optimized LM call strategies and prompts from the program. The design is inspired by neural network abstractions, in which many general-purpose layers can be composed modularly into arbitrarily complex architectures and model weights are trained by optimizers rather than tuned by hand.

First, string-based prompting techniques, including complex and task-dependent techniques like Chain of Thought and ReAct, are translated into declarative modules with natural language type signatures. DSPy modules are task-adaptive components (similar to neural network layers) abstracting any specific text transformation, such as answering questions or summarizing papers. Each module is then parameterized so that it can learn its expected behavior by iteratively bootstrapping useful demonstrations in the pipeline. Inspired by PyTorch abstractions, DSPy modules use expressive define-by-run computation graphs. Pipelines are expressed by declaring the required modules and connecting them with any logical control flow (e.g., if statements, for loops, exceptions).

The article then develops the DSPy compiler, which optimizes any DSPy program to improve quality or reduce cost. The compiler input is a program, some training inputs with optional labels, and a validation metric. The compiler simulates versions of the program on the inputs and bootstraps example trajectories for each module for self-improvement, using them to build effective few-shot prompts or fine-tune small LMs to perform pipeline steps. Optimization in DSPy is highly modular: it’s done by teleprompters, which are general optimization strategies that determine how modules learn from data. In this way, the compiler automatically maps declarative modules to high-quality combinations of prompting, fine-tuning, inference, and augmentation.

While a programming model like DSPy can be evaluated along multiple dimensions, this article focuses on the role of expert-crafted prompts in shaping system performance. The article attempts to reduce or even eliminate their role through DSPy modules (versions of popular techniques like Chain of Thought) and teleprompters. Two case studies are conducted: mathematical word problems (GSM8K) and multi-hop question answering (HotPotQA), exploring chain of thought, multi-chain reflection, multi-hop retrieval, retrieval-augmented question answering, and agent loops. The evaluation uses several different compilation strategies and shows that simple DSPy programs outperform systems using handcrafted prompts while also allowing the programs to effectively use smaller and more efficient LMs.

Overall, this article presents the first programming model that transforms prompting techniques into parameterized declarative modules and introduces an effective compiler with general optimization strategies (teleprompters) to optimize arbitrary pipelines of these modules. The main contributions are empirical and algorithmic: with DSPy, the article finds that very short programs can bootstrap self-improving multi-stage NLP systems using relatively small LMs such as llama2-13b-chat and T5-Large (770M parameters). Without handcrafted prompts and with only minutes to tens of minutes of compilation time, combinations of DSPy modules can improve the quality of simple programs from 33% to 82% and from 32% to 46% (with GPT-3.5), and from 9% to 47% and from 22% to 41% (with llama2-13b-chat).

Related Work

Inspired by the role that frameworks such as Torch, Theano, and Chainer played in the development of deep learning by providing powerful abstractions, this article aims to provide a solid conceptual framework and programming abstractions for foundation model programming. The article draws on differentiable programming but applies it to LMs rather than neural networks.

In-context learning is a key mechanism for foundation model programming. A growing body of research shows that complex LM behaviors can be elicited through prompts, especially with instruction tuning. Similarly, forms of weak supervision that used to require task-specific heuristics are now handled by LMs.

In-context learning methods now often call tools, leading to the emergence of LM pipelines that use retrieval models, multimodal foundation models, and more traditional tools like APIs and calculators. Popular offerings include LangChain, Semantic Kernel, LlamaIndex, and many other retrieval and agent libraries. These toolkits provide pre-packaged chains and agents that connect LMs with numerous accessible tools. However, they face the same prompt engineering challenges that this article addresses with DSPy: they express task-specific behaviors through handwritten prompt templates.

Researchers have begun applying discrete optimization and reinforcement learning (RL) to find effective prompts, often targeting a single logical LM call. DSPy aims to generalize this space: it optimizes arbitrary pipelines from high-level declarative signatures by bootstrapping high-quality multi-stage demonstrations and attaching constraints. Within this framework, DSPy teleprompters can be optimized using model selection techniques such as cross-validation and, in principle, could also use sophisticated techniques involving RL, LM feedback, or Bayesian hyperparameter optimization.

This article aims to motivate DSPy as a programming model and report new findings from applying the DSPy compiler. It focuses on showcasing how DSPy and its compiler enable the construction of excellent LM systems without handwriting prompt strings, but rather through truly modular units, opening the door to systematically exploring a rich design space at a very high programming abstraction level.

The DSPy Programming Model

DSPy views language models (LMs) as abstract devices for text generation and optimizes their use in arbitrary computation graphs. DSPy programs are written in Python: each program receives task inputs (e.g., a question to answer or a paper to summarize) and returns outputs (e.g., an answer or a summary) after a series of steps. DSPy provides three abstractions for automatic optimization: signatures, modules, and teleprompters.

Signatures: Abstracting Prompting Expressions

Unlike free-form string prompts, DSPy programs use natural language signatures to assign work to language models (LMs). DSPy signatures are natural language type declarations for functions: a concise declarative specification telling DSPy what a text transformation needs to do (e.g., “receive a question and return an answer”), rather than how to prompt a specific LM to achieve that behavior. Here are some signature examples for popular LLM tasks:

  • Question answering: “question -> answer”
  • Retrieval-augmented question answering: “context, question -> answer”
  • Multi-choice question answering with reasoning: “question, choices -> reasoning, selection”

Signatures offer two advantages over prompts. First, they can be compiled into self-improving, pipeline-adaptive prompts or fine-tuning data, primarily by bootstrapping useful examples for each signature. Second, they handle structured formatting and parsing logic, reducing (or ideally avoiding) fragile string manipulation in user programs. Here is an example of declaring and applying a signature (a sketch adapted from the paper’s example; the completion shown in the comment is illustrative and will vary by LM):
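
import dspy

# Declare a module with the question-answering signature and call it.
qa = dspy.Predict("question -> answer")
qa(question="Where is Guaraní spoken?")
# Out: Prediction(answer='Guaraní is spoken mainly in South America.')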

Modules: Abstracting Prompting Techniques

Similar to type signatures in programming languages, DSPy signatures only define an interface and provide type-like hints about the expected behavior. To use a signature, a module must be declared with it; such a declaration returns a function having that signature.

Predict Module: In DSPy, the core module for using signatures is Predict. Internally, Predict stores the provided signature, an optional LM (initially None, but it can override the module’s default LM), and a list of demonstrations for prompting (initially empty). Like layers in PyTorch, instantiated modules behave as callable functions: they receive keyword arguments corresponding to the signature’s input fields (e.g., question), format a prompt that implements the signature and includes appropriate demonstrations, call the LM, and parse the output fields. When Predict detects it is being used in compilation mode, it also internally tracks input/output traces to help teleprompters bootstrap demonstrations.

Other Built-in Modules: DSPy modules transform prompting techniques into modular functions that support any signature, contrasting with the standard approach of prompting LMs with task-specific details (e.g., handwritten few-shot examples). To this end, DSPy includes many more complex modules such as ChainOfThought, ProgramOfThought, MultiChainComparison, and ReAct. All these modules can be used interchangeably to implement DSPy signatures. For example, simply changing Predict to ChainOfThought in the above program results in a system that thinks step-by-step before submitting output fields.

Importantly, all these modules are implemented by extending user-defined signatures and calling Predict once or multiple times on new signatures, in just a few lines of code. For example, below is a simplified implementation of the built-in ChainOfThought. This is a fully-fledged module capable of learning effective few-shot prompts for any LM or task.
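
A sketch of that simplified implementation, adapted from the paper’s listing (helper names such as prepend_output_field follow the paper’s pseudocode and may differ from the current DSPy API), looks roughly like this:

import dspy

class ChainOfThought(dspy.Module):
    def __init__(self, signature):
        super().__init__()
        # Modify the signature from `*inputs -> *outputs`
        # to `*inputs -> rationale, *outputs`.
        rationale = dspy.OutputField(prefix="Reasoning: Let's think step by step.")
        signature = dspy.Signature(signature).prepend_output_field(rationale)
        # Declare a Predict sub-module with the modified signature.
        self.predict = dspy.Predict(signature)

    def forward(self, **kwargs):
        # The sub-module formats the prompt, calls the LM, and parses the output.
        return self.predict(**kwargs)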

Parameterization: DSPy uniquely parameterizes these prompting techniques. To understand this parameterization, note that the parameters that need to be specified for any LM call attempting to implement a specific signature include: (1) the specific LM to call, (2) prompt instructions and string prefixes for each signature field, (3) examples to use as few-shot prompts (for frozen LMs) or as training data (for fine-tuning). This article primarily focuses on automatically generating and selecting useful examples. In our case studies, we find that bootstrapping good examples provides a powerful way to systematically teach LMs complex new pipeline behaviors.

Tools: DSPy programs can use tools, which are modules that perform computations. We support retrieval models through the dspy.Retrieve module.

Programs: DSPy modules can be arbitrarily combined into pipelines in define-by-run interfaces. Directly inspired by PyTorch and Chainer, pipelines are expressed by first declaring the required modules at initialization, allowing DSPy to track them for optimization, and then calling these modules with arbitrary code in a forward method to express the pipeline. As a simple example, we provide the following simple but complete retrieval-augmented generation (RAG) system.
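
A sketch close to the paper’s listing (it assumes a default LM and retrieval model have been configured, e.g., via dspy.settings.configure):

import dspy

class RAG(dspy.Module):
    def __init__(self, num_passages=3):
        super().__init__()
        # Retrieve uses the retrieval model configured by the user.
        self.retrieve = dspy.Retrieve(k=num_passages)
        # Generate an answer from the retrieved context and the question.
        self.generate_answer = dspy.ChainOfThought("context, question -> answer")

    def forward(self, question):
        context = self.retrieve(question).passages
        return self.generate_answer(context=context, question=question)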

To highlight modularity, we use ChainOfThought as a direct replacement for basic Predict. It can now be used by simply writing RAG()("Where is Guaraní spoken?"). Note that if we used the signature "context, question -> search_query", we would get a system that generates search queries instead of answers.

Teleprompters: Automating Prompting

When compiling a DSPy program, a teleprompter is typically called, which is an optimizer that receives a program, training set, and metric, and returns a new optimized program. Different teleprompters will employ different optimization strategies.

In DSPy, training sets can be small, potentially with just a few examples, though larger-scale data enables more powerful optimization. Training examples can be incomplete, only requiring input values. Labels for pipeline steps are not required unless they need to be used in the metric. In practice, we typically assume only the final output of the program has labels (at most), not intermediate steps. This label efficiency is crucial for modularity: building a new pipeline in DSPy only requires recompiling the code for the new pipeline, not annotating data specific to the new pipeline.

Metrics can be simple notions like exact match (EM) or F1, but they can also be full DSPy programs that balance multiple concerns. For example, we can compile the above RAG module against a training set of question-answer pairs qa_trainset and the EM metric, with the goal of bootstrapping effective few-shot demonstrations. The sketch below, adapted from the paper’s example (the tiny training set is purely illustrative), accomplishes this:
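
# A tiny illustrative training set of question-answer pairs.
qa_trainset = [
    dspy.Example(question="What is the capital of France?",
                 answer="Paris").with_inputs("question"),
]

# Bootstrap few-shot demonstrations that satisfy the exact-match metric.
teleprompter = dspy.BootstrapFewShot(metric=dspy.evaluate.answer_exact_match)
compiled_rag = teleprompter.compile(RAG(), trainset=qa_trainset)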

In this example, the BootstrapFewShot teleprompter simulates RAG on the training examples. It collects demonstrations for each module (i.e., examples of its input-output behavior) that collectively lead to effective outputs (i.e., conforming to the signature and the metric). If we wanted to push the compiled program toward answers that are grounded in the retrieved contexts, we could define a custom metric in place of dspy.evaluate.answer_exact_match. A possible sketch is shown below; it assumes the program’s prediction also exposes the retrieved passages in a context field:
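
def answer_and_context_match(example, pred, trace=None):
    # The prediction must match the gold answer...
    answer_match = dspy.evaluate.answer_exact_match(example, pred)
    # ...and also appear (case-insensitively) in at least one retrieved passage.
    context_match = any(pred.answer.lower() in passage.lower()
                        for passage in pred.context)
    return answer_match and context_match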

The DSPy Compiler

An important source of expressiveness in DSPy is its ability to compile, or automatically optimize, any program in the programming model. Compilation relies on an optimizer called a teleprompter, which improves the quality (or reduces the cost) of DSPy program modules through prompting or fine-tuning. A typical teleprompter proceeds in three stages.

Stage 1: Candidate Generation

The compiler first (recursively) finds all unique Predict modules (predictors) in the program, including those nested inside other modules. For each unique predictor P, the teleprompter can generate candidate values for P’s parameters: instructions, field descriptions, or demonstrations (i.e., input-output example pairs). Although language models (LMs) can be highly unreliable, we find them quite efficient at searching the multi-stage solution space. In a well-decomposed program, the LM typically succeeds on at least a few training examples under the constraints enforced by the signatures and the metric, and the process can bootstrap iteratively if needed.
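
As a schematic illustration only (not DSPy’s actual implementation; run_with_trace is a hypothetical helper that returns a prediction plus the per-predictor calls made while producing it), the core bootstrapping loop amounts to keeping the traces of runs that pass the metric:

def bootstrap_demos(program, trainset, metric):
    demos = {}  # predictor name -> list of (inputs, outputs) demonstrations
    for example in trainset:
        # Hypothetical helper: run the program and record every predictor call.
        pred, trace = run_with_trace(program, example)
        if metric(example, pred):  # keep only traces that satisfy the metric
            for predictor_name, inputs, outputs in trace:
                demos.setdefault(predictor_name, []).append((inputs, outputs))
    return demos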

Stage 2: Parameter Optimization

Now each parameter has a set of discrete candidate values: demonstrations, instructions, etc. Many hyperparameter tuning algorithms (e.g., random search or tree-structured Parzen estimators, as in HyperOpt and Optuna) can be applied to select among these candidates. Another type of optimization is fine-tuning with BootstrapFinetune, where the demonstrations are used to update the LM weights for each predictor; in that case, the LM parameter of each module is updated to the new weights. Typically, we optimize average quality under the metric, with cross-validation over the training set or a validation set. Depending on the nature of the metric, this works even when no pipeline stage has labels.
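
A schematic sketch of the random-search variant (not DSPy’s actual code; apply_demos and evaluate are hypothetical helpers that install a demonstration set on each predictor and compute the average metric on a validation set):

import random

def random_search(program, candidate_demos, valset, metric, num_trials=16, k=4):
    best_score, best_program = float("-inf"), program
    for _ in range(num_trials):
        # Sample up to k demonstrations per predictor.
        config = {name: random.sample(demos, min(k, len(demos)))
                  for name, demos in candidate_demos.items()}
        candidate = apply_demos(program, config)      # hypothetical helper
        score = evaluate(candidate, valset, metric)   # hypothetical helper
        if score > best_score:
            best_score, best_program = score, candidate
    return best_program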

Stage 3: Higher-Order Program Optimization

Another type of optimization supported by the DSPy compiler is modifying the program’s control flow. One of the simplest forms, which we use in our case studies, is ensembling. An ensemble will bootstrap multiple copies of the same program and then replace the program with a new one that runs all copies in parallel and combines their predictions into one through a custom function (e.g., majority voting). In future work, this stage can easily accommodate more dynamic (i.e., test-time) bootstrapping as well as techniques for automatic backtracking logic.
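
For illustration, a hand-rolled majority-vote ensemble over several compiled copies of a program might look like the following sketch (run sequentially here for simplicity; DSPy ships an Ensemble teleprompter with a similar effect, and this is not its implementation):

import dspy

class MajorityEnsemble(dspy.Module):
    def __init__(self, programs):
        super().__init__()
        self.programs = programs  # e.g., several bootstrapped copies of RAG

    def forward(self, **kwargs):
        predictions = [program(**kwargs) for program in self.programs]
        answers = [pred.answer for pred in predictions]
        # Majority vote over the final answer field.
        winner = max(set(answers), key=answers.count)
        return dspy.Prediction(answer=winner)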

Case Study: Mathematical Word Problems

We evaluate on the popular GSM8K dataset, which contains elementary school math problems. We sample 200 and 300 question-answer pairs from the official training set for training and development, respectively. Final evaluation uses the official 1.3k test set examples. To avoid overfitting to the test set, we conduct extensive comparisons on the development set. Following previous work on GSM8K, we evaluate the accuracy of the final numerical value that appears in the LM output.

Programs Used: For this task, we consider three simple DSPy programs: a single-step Predict module (vanilla), a two-step ChainOfThought module (CoT), and a multi-stage ComparerOfThoughts module (ThoughtReflection). These programs are sketched below (adapted from the paper’s listings; constructor arguments such as n and M may differ slightly across DSPy versions):
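
import dspy

vanilla = dspy.Predict("question -> answer")        # single-step prediction

cot = dspy.ChainOfThought("question -> answer")     # reason, then answer

class ThoughtReflection(dspy.Module):
    def __init__(self, num_attempts=5):
        super().__init__()
        # Sample several reasoning chains...
        self.predict = dspy.ChainOfThought("question -> answer", n=num_attempts)
        # ...and compare them to produce a final answer.
        self.compare = dspy.MultiChainComparison("question -> answer", M=num_attempts)

    def forward(self, question):
        completions = self.predict(question=question).completions
        return self.compare(question=question, completions=completions)

reflection = ThoughtReflection(num_attempts=5)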

In reflection, five reasoning chains (and their answers) are sampled from the LM and compared in parallel through the built-in MultiChainComparison module, which generalizes previous work. This generates a new answer that takes into account patterns from the five attempts. Importantly, all the modules used are generic, with none being specific to math problems or any particular LM.

Compilation Process

As discussed earlier, DSPy programs can be compiled into new, optimized programs. In our experiments, we evaluated zero-shot (no compilation) as well as several compilation strategies. The simplest teleprompter is LabeledFewShot:

from dspy.teleprompt import LabeledFewShot

# LabeledFewShot takes no metric; it simply attaches randomly sampled
# labeled training examples as demonstrations.
compiled_program = LabeledFewShot(k=8).compile(program, trainset=trainset)

Here, program can be any DSPy module. This simply randomly samples k=8 examples from the training set for the fields common to the training examples and the signature, which in this case are question and answer, but not reasoning. Because this sampling is random, we report averages over 3-5 runs (depending on the setting).

Next, we also considered using random search to bootstrap few-shot examples:

from dspy.teleprompt import BootstrapFewShotWithRandomSearch

compiled_program = BootstrapFewShotWithRandomSearch(
    metric=metric,
    max_bootstrapped_demos=8,
    max_labeled_demos=4,
    num_candidate_programs=20,
    # Optionally bootstrap demonstrations with a stronger teacher LM.
    teacher_settings=dict(lm=dspy.OpenAI(model='gpt-3.5-turbo')),
).compile(program, trainset=trainset)

This generates demonstration chains for examples in the training set and optimizes the process of selecting these demonstrations to self-improve the program’s modules. As the name suggests, this is done through random search, treating demonstration selection as a parameter to optimize.

Furthermore, if needed, this bootstrapping process can be nested within DSPy. Specifically, an optimized bootstrapped program can be used to further bootstrap another program. For example, this approach is relevant when the original zero-shot program performs poorly:

from dspy.teleprompt import BootstrapFewShotWithRandomSearch

# First, compile a bootstrapped version of the program...
bootstrap = BootstrapFewShotWithRandomSearch(
    metric=metric,
    max_bootstrapped_demos=4,
    max_labeled_demos=2,
    num_candidate_programs=10,
).compile(program, trainset=trainset)

# ...then use it as the teacher to bootstrap the program again ("bootstrap x2").
compiled_program = BootstrapFewShotWithRandomSearch(
    metric=metric,
    max_bootstrapped_demos=8,
    max_labeled_demos=4,
    num_candidate_programs=20,
).compile(program, teacher=bootstrap, trainset=trainset)

GSM8K includes human reasoning chains, which the training set described above does not use. We therefore also evaluated a +human CoT setting that extends the training examples with these human reasoning strings. The two datasets can be used interchangeably as the value of the trainset parameter above. It’s worth noting that compilation typically completes in minutes (or tens of minutes), since even costly setups only require running the program a few thousand times (e.g., 10-20 trials on 150-300 validation examples) and can be parallelized.
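
As an illustrative sketch (the field names below are assumptions, not the paper’s code), GSM8K’s raw records store the human rationale and the final answer in a single string separated by “####”, so a +human CoT training set can be built by splitting that string and attaching the rationale as an extra field:

import dspy

def to_example(record, with_human_cot=False):
    # GSM8K stores "rationale #### final_answer" in the raw answer string.
    rationale, final_answer = record["answer"].split("####")
    fields = dict(question=record["question"], answer=final_answer.strip())
    if with_human_cot:
        fields["gold_reasoning"] = rationale.strip()  # assumed field name
    return dspy.Example(**fields).with_inputs("question")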

Results

Our experimental results are summarized in the paper’s results table, which reports development results as well as test-set evaluations of a promising representative of each approach. First, the results for the vanilla program show that GPT-3.5 and llama2-13b-chat perform poorly on math word problems when asked to predict answers directly, i.e., without a reasoning chain. This is most evident in the absence of good demonstrations, as seen in the no-compile setting (i.e., zero-shot instructions) and the few-shot setting (i.e., randomly sampled QA pairs). Interestingly, however, vanilla was helped significantly by bootstrapped compilation and by iterating that process (bootstrap×2). Inspecting the bootstrapped prompts revealed that they allowed the LM to first use the answer field for reasoning, which is permitted because the metric extracts the final numerical value for evaluation.

Next, we considered the CoT program. While expert human reasoning chains (+human CoT) provided a significant boost when available, using bootstrapping was able to match or exceed this, validating our hypothesis that DSPy can reduce the need for handcrafted prompts. Beyond this, the reflection program, while only a few lines longer than the other programs, was clearly the winner, although CoT was also quite effective when ensembled. Overall, the bootstrapped compilation process brought large gains to each program, spanning both LMs. In fact, all programs in the table are expressed by combining two to four DSPy modules and prompters, revealing overall that in the new paradigm prescribed by DSPy, it is the combination of appropriate generic modules, rather than the manipulation of string prompts, that improved accuracies for different LMs from 4-20% to 49-88%.
