RAG 2.0, introduced by contextual.ai, pre-trains, fine-tunes, and aligns all components as a single system, backpropagating through both the language model and the retriever to maximize end-to-end performance. It addresses the problem that individually effective RAG components still combine into an overall system that is far from optimal.
Google DeepMind has proposed a novel approach called RICHES (Retrieval Interlaced with Sequence Generation). This method natively interweaves text generation with document retrieval through a single LLM and decoding process. It eliminates the need for separate retrievers and generators, directly decoding document content or relevant natural language retrieval keys. Without additional training, it can adapt to diverse new tasks through prompting.
How RICHES Works
- Model Initialization: Select a suitable pre-trained large language model (LLM).
- Retrieval Key Definition: Determine document identifiers for retrieval, such as titles, paragraphs, sentences, or propositions.
- Index Construction: Build an index for the corpus using techniques like FM-Index to optimize retrieval efficiency.
- Input Reception: Receive the user’s question or query as input.
- Alternating Generation: The LLM alternates between free text generation and constrained retrieval-key generation (see the sketch after this list).
- Constraint Application: During generation, use the index to constrain retrieval keys, ensuring they correspond to valid documents in the corpus.
- Document Retrieval: Retrieve relevant documents or information snippets from the corpus based on generated retrieval keys.
- Integration and Output: Combine retrieved content with generated text to form a complete answer or solution.
- Evaluation: Assess output results using appropriate metrics (e.g., F1 score, AutoAIS).
- Iterative Optimization: Improve the model and process based on evaluation results.
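A minimal sketch of this alternating loop is below. The three helper callables (generate_until, constrained_generate, lookup_docs) are hypothetical stand-ins, not APIs from the paper; the actual method runs everything inside a single constrained beam-decoding pass rather than separate generation calls.

```python
# Sketch of a RICHES-style alternating decode loop.
KEY_OPEN, KEY_CLOSE = "«", "»"

def riches_decode(question, generate_until, constrained_generate,
                  lookup_docs, max_hops=4):
    output = question
    for _ in range(max_hops):
        # 1) Free generation until the model opens a retrieval key
        #    (assumed to include the stop token) or emits its answer.
        chunk = generate_until(output, stop=[KEY_OPEN])
        output += chunk
        if not chunk.endswith(KEY_OPEN):
            break  # no further retrieval requested
        # 2) Constrained generation: the key must be a valid entry in K,
        #    enforced token-by-token via the corpus index.
        key = constrained_generate(output, stop=KEY_CLOSE)
        output += key + KEY_CLOSE
        # 3) Map the decoded key back to its document(s) and continue.
        output += "".join(lookup_docs(key))
    return output
```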
RICHES Detailed Principles
Interweaving Retrieval and Generation:
RICHES retrieves documents by directly decoding either document content or natural-language retrieval keys that point back to the documents they come from. This lets text generation and retrieval be interwoven in a single decoding pass, avoiding separate retriever and generator components.
Retrieval Key Definition:
Retrieval keys are token sequences drawn from a predefined finite set K, with each key associated with one or more documents in the underlying corpus C. Special tokens « and » mark the beginning and end of each retrieval key in the output sequence y.
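For illustration, a decoded output sequence might look like the made-up string below, with the keys recoverable by scanning for the delimiters:

```python
import re

# Hypothetical output sequence y containing one retrieval key.
y = ("I need the birthplace first. "
     "«Barack Obama was born in Honolulu, Hawaii.» "
     "So the answer is Honolulu.")

# Extract every span enclosed by « and ».
keys = re.findall(r"«(.*?)»", y)
print(keys)  # ['Barack Obama was born in Honolulu, Hawaii.']
```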
Probability Model Update:
The standard autoregressive language-modeling probability Pθ(y|x) is extended to cover retrieval keys by introducing an indicator function 1K(q). The model achieves constrained decoding by zeroing out the continuation probability of any disallowed sequence.
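Written out under these definitions (a plausible reconstruction; the paper's exact factorization may differ in detail):

$$
P_\theta(y \mid x) \;\propto\; \prod_{t=1}^{|y|} P_\theta(y_t \mid y_{<t}, x) \cdot \prod_{q \in \mathrm{keys}(y)} \mathbb{1}_K(q)
$$

where keys(y) is the set of spans of y enclosed by « and », and 1K(q) equals 1 if q ∈ K and 0 otherwise, so any hypothesis containing an invalid key receives zero probability.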
Constrained Beam Decoding:
Beam search is used as the decoding strategy and can be viewed as a heuristic best-first search. At each time step, the LLM scores each candidate continuation (node), and the highest-scoring hypotheses are kept in a fixed-size queue, the beam.
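A minimal sketch of one constrained beam step, assuming a score_fn that returns per-token log-probabilities with disallowed tokens already masked to -inf (as the constrain function below produces):

```python
import heapq

def beam_step(beams, score_fn, beam_size):
    """One step of beam search over constrained hypotheses.

    beams: list of (cumulative_logprob, token_list) hypotheses.
    score_fn: maps a token list to {token: logprob} for the next
              position, with disallowed tokens masked to -inf.
    """
    candidates = []
    for logp, seq in beams:
        for token, token_logp in score_fn(seq).items():
            if token_logp == float("-inf"):
                continue  # pruned by the corpus constraint
            candidates.append((logp + token_logp, seq + [token]))
    # Keep only the top hypotheses (the fixed-size queue).
    return heapq.nlargest(beam_size, candidates, key=lambda c: c[0])
```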
Efficient Constraints via FM-Index:
FM-Index (Ferragina and Manzini, 2000) is used to constrain model output during decoding, ensuring that every generated retrieval key is a substring that actually occurs in the corpus. The FM-index is a compressed suffix array supporting fast substring search operations. The constraint can be sketched as follows:
```python
import numpy as np

# fm_index, LLM, and vocab are assumed to be defined,
# as in the paper's pseudocode.
def constrain(input_prefix):
    # Fetch the tokens that can legally continue this prefix,
    # i.e. keep the output a substring of the corpus.
    allowed_tokens = fm_index.get_continuations(input_prefix)
    # Get next-token log-probabilities from the LLM.
    logprobs = LLM.logprobs(input_prefix)
    # Disallowed tokens are masked to -inf so they are never decoded.
    for i in range(len(logprobs)):
        if vocab[i] not in allowed_tokens:
            logprobs[i] = -np.inf
    return logprobs
```
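A real FM-index answers get_continuations in time independent of corpus size via backward search. As a rough stand-in for intuition only, a naive linear-scan version of the same interface might look like this:

```python
def naive_get_continuations(corpus_text, prefix):
    """Naive O(n) stand-in for FM-index continuation lookup.

    Returns the set of characters c such that prefix + c occurs
    somewhere in the corpus. A real FM-index answers this with
    backward search on a compressed suffix array, not a scan.
    """
    continuations = set()
    start = corpus_text.find(prefix)
    while start != -1:
        end = start + len(prefix)
        if end < len(corpus_text):
            continuations.add(corpus_text[end])
        start = corpus_text.find(prefix, start + 1)
    return continuations

# naive_get_continuations("the cat sat on the mat", "the ")
# -> {'c', 'm'}
```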
Adaptive Beam Size:
An adaptive decoding strategy dynamically adjusts the beam size to the differing needs of constrained and unconstrained spans of the generated sequence: constrained spans must exactly match a target retrieval key, while unconstrained spans are more flexible.
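One simple way to realize this, sketched under the assumption that constrained spans are simply given a larger beam (the paper's exact schedule may differ):

```python
def beam_size_for(in_constrained_span, free_beam=1, constrained_beam=8):
    # Unconstrained text tolerates a small (even greedy) beam, while a
    # constrained span must land exactly on a corpus key, so it benefits
    # from exploring more hypotheses in parallel. Both sizes here are
    # illustrative defaults, not values from the paper.
    return constrained_beam if in_constrained_span else free_beam
```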
Indexing Strategies:
RICHES supports various indexing strategies, including document titles, paragraph substrings, sentence substrings, and proposition indexing. The choice of document representation significantly impacts retrieval effectiveness.
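A sketch of how the key set and its document mapping might be built under a few of these strategies (the proposition case would additionally need a model that rewrites sentences into standalone facts, omitted here):

```python
def build_keys(docs, strategy="sentence"):
    """Map each retrieval key to the titles of the documents it points to.

    docs: dict of title -> document text. The splitting rules are
    deliberately simple placeholders for real segmentation.
    """
    keys = {}
    for title, text in docs.items():
        if strategy == "title":
            keys.setdefault(title, []).append(title)
        elif strategy == "paragraph":
            for para in text.split("\n\n"):
                keys.setdefault(para.strip(), []).append(title)
        elif strategy == "sentence":
            for sent in text.split(". "):
                keys.setdefault(sent.strip(), []).append(title)
    return keys
```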
Performance and Comparisons
RICHES demonstrates strong performance on open-domain question answering tasks (attributed QA, multi-hop QA, and retrieval interleaved with thoughts). It particularly excels on multi-hop QA (HotpotQA) compared to conventional retrieval-augmented generation pipelines, producing more accurate answers within a single decoding pass.
The appendix includes few-shot prompt templates for multi-hop and single-hop question answering, as well as a template for extracting answers from propositions.
In conclusion, RICHES represents a significant advancement in retrieval-augmented generation, offering a more integrated and efficient approach to combining information retrieval and text generation. Its performance across various tasks demonstrates its potential to enhance the capabilities of large language models in handling complex, knowledge-intensive queries.