Prompt Cache is a technique that speeds up inference in large language models (LLMs) by reusing attention states across different prompts. By precomputing and caching the attention states of frequently occurring text segments, it substantially reduces the time to generate the first token, especially for long, template-heavy prompts.

Understanding Token Generation in LLMs

Figure 1 compares three approaches to token generation in LLMs, with each panel broken into successive generation steps. Each box represents a token; blue boxes indicate prompt tokens.

  • (a) An LLM receives a prompt (blue token) and predicts the next token (A) (1). It appends the generated token (A) to the prompt to predict the next token (B) (2). This autoregressive process continues until a stopping condition is met.
  • (b) KV caching computes the prompt’s attention states only once in the first step (1) and reuses them in subsequent steps; a minimal sketch of this loop appears after the list.
  • (c) Prompt Cache allows for the reuse of KV states across services, bypassing the need for recalculating prompt attention. When a pattern is loaded, Prompt Cache fills its cache and reuses cached states for prompts derived from that pattern (1). Figure 2 provides further details on Step 1.
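
To make (a) and (b) concrete, here is a minimal sketch of autoregressive decoding with a KV cache using the Hugging Face transformers API. The model name and prompt text are placeholders, not anything taken from the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")            # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt_ids = tok("Plan a three-day trip to Paris.", return_tensors="pt").input_ids

with torch.no_grad():
    # Step (1): the attention states of the whole prompt are computed once
    # and returned as past_key_values.
    out = model(prompt_ids, use_cache=True)
    past = out.past_key_values
    next_id = out.logits[:, -1:].argmax(dim=-1)
    generated = [next_id]

    # Steps (2), (3), ...: only the newest token is fed in; the cached
    # attention states of all earlier tokens are reused.
    for _ in range(16):
        out = model(next_id, past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_id = out.logits[:, -1:].argmax(dim=-1)
        generated.append(next_id)

print(tok.decode(torch.cat(generated, dim=-1)[0]))
```

Without the cache, every step would re-encode the entire prefix; Prompt Cache extends this kind of reuse across different prompts rather than only across the steps of a single prompt.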

Identifying the Problem

Many input prompts exhibit structural overlap, including system messages, prompt templates, and document context. These overlapping segments can be pre-computed and stored, allowing their attention states to be reused when they appear in user prompts.
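
As an illustration (the text below is invented, not taken from the paper), consider two requests that share a system message and a document context and differ only in the final question:

```python
SYSTEM = "You are a helpful travel assistant."  # shared system message
GUIDE = "Paris city guide: ... (a long shared document would go here) ..."  # shared context

prompt_a = f"{SYSTEM}\n{GUIDE}\nQuestion: Which museums should I visit?"
prompt_b = f"{SYSTEM}\n{GUIDE}\nQuestion: Suggest a three-day itinerary."

# The attention states for SYSTEM and GUIDE could be computed once and reused
# by both prompts; only the question at the end differs.
```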

The Mechanics of Prompt Cache Technology

Prompt Cache relies on a markup format known as Prompt Markup Language (PML) to define reusable text segments called prompt modules. PML keeps token positions consistent when attention states are reused and gives users an interface for referencing cached modules within their prompts.
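
The exact PML syntax is specified in the paper; the snippet below is only a rough, hypothetical sketch of the idea, with illustrative tag and attribute names:

```python
# Hypothetical PML pattern: reusable modules, one of which exposes a parameter.
PATTERN_PML = """
<pattern name="trip-planner">
  <module name="system">You are a helpful travel assistant.</module>
  <module name="city-guide"> ...long shared document context... </module>
  <module name="itinerary">Plan an itinerary lasting <param name="duration"/>.</module>
</pattern>
"""

# Hypothetical importing prompt: selects modules, fills the parameter,
# and appends new text at the end.
PROMPT_PML = """
<prompt pattern="trip-planner">
  <system/>
  <itinerary><duration>3 days</duration></itinerary>
  Which museums should I prioritize?
</prompt>
"""
```

A prompt built from a pattern imports modules by name, supplies values for their parameters, and may append new text at the end, which is exactly the workflow described below.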


Workflow of Prompt Cache

When a pattern is loaded, Prompt Cache computes the attention states of its prompt modules and caches them. These cached states are then reused both for the modules imported by a given prompt and for any other prompt derived from the same pattern.
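
A minimal sketch of the bookkeeping this implies, assuming cached states are simply keyed by pattern and module name (the paper's system also tracks each module's token positions within the pattern, which is omitted here):

```python
from typing import Dict, List, Optional, Tuple
import torch

# (pattern name, module name) -> per-layer (key, value) tensors for the module's tokens
ModuleKV = List[Tuple[torch.Tensor, torch.Tensor]]
module_cache: Dict[Tuple[str, str], ModuleKV] = {}

def store_module(pattern: str, module: str, kv: ModuleKV) -> None:
    """Cache the pre-computed attention states of one prompt module."""
    module_cache[(pattern, module)] = kv

def lookup_module(pattern: str, module: str) -> Optional[ModuleKV]:
    """Return the cached states if the module was pre-computed, else None."""
    return module_cache.get((pattern, module))
```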

Figure 2 outlines the reuse mechanism in Prompt Cache:

  1. PML makes reusable prompt modules explicit in both patterns and prompts. Prompt modules can expose parameters (here, an itinerary module with a duration parameter), and the importing prompt supplies values for them (e.g., a duration of 3 days). New text segments can appear at the positions of excluded modules and parameters, and additional text can be appended at the end.
  2. Prompt Cache pre-computes the attention states of every module in the pattern (1) and caches them for future reuse.
  3. When a prompt is provided, Prompt Cache retrieves the cached attention states of the imported prompt modules (2), computes states for the parameter values (3) and new text segments (4), and combines them to produce the attention states for the entire prompt (5); a sketch of this assembly step follows the list. This process elaborates Step 1 in Figure 1c.
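
Under the same simplifying assumptions (states keyed by module name, positional details omitted), steps (2) through (5) amount to looking up cached key/value tensors, encoding only the new tokens, and concatenating everything along the sequence axis. The helper names below are hypothetical:

```python
from typing import Callable, Dict, List, Tuple
import torch

# Per-layer (key, value) tensors, each of shape (batch, heads, seq_len, head_dim).
LayerKV = List[Tuple[torch.Tensor, torch.Tensor]]

def assemble_prompt_kv(
    imported: List[str],               # names of modules reused from the cache   (2)
    new_segments: List[torch.Tensor],  # token ids for parameters / appended text (3)(4)
    module_cache: Dict[str, LayerKV],
    encode_fn: Callable[[torch.Tensor], LayerKV],
) -> LayerKV:
    """Combine cached and freshly computed attention states for a full prompt (5)."""
    parts = [module_cache[name] for name in imported]   # reuse cached states
    parts += [encode_fn(seg) for seg in new_segments]   # compute only what is new

    num_layers = len(parts[0])
    combined: LayerKV = []
    for layer in range(num_layers):
        keys = torch.cat([p[layer][0] for p in parts], dim=2)    # concat along sequence axis
        values = torch.cat([p[layer][1] for p in parts], dim=2)
        combined.append((keys, values))
    return combined
```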

Design and Implementation

The design of Prompt Cache covers how prompt structure is made explicit, how prompt modules are encoded, and how cached inference proceeds. The implementation builds on Hugging Face’s transformers library and has been evaluated on both CPUs and GPUs.
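
Because the implementation builds on transformers, the contiguous-prefix special case can be sketched with the stock API alone. This is only an illustration of reusing a precomputed segment, not the paper's code, which also handles modules at arbitrary, discontinuous positions:

```python
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")            # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

# Encode a shared segment (e.g., a system message) once.
shared_ids = tok("You are a helpful travel assistant. ", return_tensors="pt").input_ids
with torch.no_grad():
    shared_kv = model(shared_ids, use_cache=True).past_key_values

# Reuse the cached states for two different continuations.
for question in ["List three museums.", "Plan a three-day itinerary."]:
    ids = tok(question, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, past_key_values=copy.deepcopy(shared_kv), use_cache=True)
    next_token = out.logits[:, -1].argmax(-1)   # only the new tokens were encoded from scratch
    print(tok.decode(next_token))
```

Reusing modules that do not sit at the start of a prompt additionally requires position-aware caching, which is what PML and the pattern structure provide.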


Prototype evaluations across multiple LLMs demonstrate that Prompt Cache significantly reduces the latency of generating the first token, particularly for the long prompts used in document-based question answering and recommendations. Time-to-first-token improvements range from 8x for GPU-based inference to 60x for CPU-based inference, all while maintaining output accuracy and without modifying model parameters.
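
For context on the metric, here is a hedged sketch of how time to first token could be measured for a transformers model on CPU (placeholder model and prompt; not the paper's benchmarking harness):

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")            # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

ids = tok("A long document-grounded prompt " * 50, return_tensors="pt").input_ids

with torch.no_grad():
    start = time.perf_counter()
    out = model(ids, use_cache=True)             # prefill pass dominates TTFT
    first_token = out.logits[:, -1].argmax(-1)   # the first generated token
    ttft_ms = (time.perf_counter() - start) * 1000

print(f"TTFT: {ttft_ms:.1f} ms for {ids.shape[1]} prompt tokens")
```

Prompt Cache's gains come from shrinking this prefill pass, since most of the prompt's attention states are already cached.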


Performance Metrics

GPU Latency Measurement: Time to first token (TTFT) across eight LongBench datasets on three NVIDIA GPUs.

CPU Latency Measurement: Time to first token (TTFT) across eight LongBench datasets on two CPUs.


Conclusion

Prompt Cache technology marks a significant advancement in optimizing the inference processes of large language models. By effectively reusing attention states, it enhances performance while preserving the integrity of model outputs. This innovation is crucial for applications requiring efficient processing of extensive prompts, paving the way for more responsive and capable AI systems.
