Comprehensively evaluating Retrieval-Augmented Generation (RAG) systems is challenging: the systems are modular (a retriever feeding a generator), the responses to be judged are long-form text, and existing evaluation metrics are not reliable enough.
RAGChecker Overview
Amazon AWS AI has open-sourced RAGChecker, a fine-grained evaluation framework built on claim-level entailment checking. The framework extracts claims from the model response and from the ground-truth answer, then checks whether each claim is entailed by the other texts involved (the answer, the response, and the retrieved chunks).
The metrics proposed in RAGChecker can be pictured as a Venn diagram comparing the model response with the ground-truth answer, which separates correct claims, incorrect claims, and missing claims (ground-truth claims absent from the response). Retrieved chunks are likewise split into two types, relevant and irrelevant, according to whether they entail any ground-truth claim. On this basis, metrics are defined for overall performance, retriever performance, and generator performance, so that each component of the RAG system can be evaluated.
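To make this concrete, below is a minimal sketch (not the official implementation) of how the overall metrics follow from claim-level entailment labels, following the definitions in the RAGChecker paper; the real framework obtains these boolean labels with an LLM-based claim extractor and checker.

```python
from typing import List

def overall_metrics(response_claims_correct: List[bool],
                    gt_claims_in_response: List[bool]) -> dict:
    """Claim-level precision, recall, and F1 (simplified sketch).

    response_claims_correct[j] -- is the j-th response claim entailed by the
                                  ground-truth answer (i.e. correct)?
    gt_claims_in_response[i]   -- is the i-th ground-truth claim entailed by
                                  the response (i.e. not missing)?
    """
    precision = sum(response_claims_correct) / max(len(response_claims_correct), 1)
    recall = sum(gt_claims_in_response) / max(len(gt_claims_in_response), 1)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Example: 3 of 4 response claims are correct; 3 of 5 ground-truth claims are covered.
print(overall_metrics([True, True, True, False], [True, True, True, False, False]))
# -> precision 0.75, recall 0.6, F1 ≈ 0.667
```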
RAGChecker enables developers and researchers to evaluate, diagnose, and improve their RAG systems with precision:
- Comprehensive Evaluation: RAGChecker provides overall metrics for assessing the entire RAG pipeline end to end (a minimal usage sketch follows this list).
- Diagnostic Metrics: Separate diagnostic metrics for the retriever and for the generator pinpoint which component is responsible for errors; such insights are invaluable for targeted improvements.
- Fine-Grained Assessment: Claim-level entailment checks allow evaluation at the level of individual claims rather than whole responses.
- Benchmark Dataset: A comprehensive RAG benchmark dataset containing 4,000 questions across 10 domains is forthcoming.
- Meta-Evaluation: A human-annotated preference dataset is available for assessing the correlation between RAGChecker results and human judgments.
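As an illustration of what running the open-source package looks like, here is a sketch modeled on the example in the GitHub README at the time of writing; class names, field names, and backend identifiers may have changed, so treat it as an assumption to verify against the repository.

```python
import json

from ragchecker import RAGChecker, RAGResults
from ragchecker.metrics import all_metrics

# Each record bundles the query, ground-truth answer, model response, and
# retrieved chunks; RAGChecker extracts and checks claims from these texts.
data = {
    "results": [{
        "query_id": "0",
        "query": "What does RAGChecker evaluate?",
        "gt_answer": "RAGChecker evaluates both the retriever and the generator of a RAG system.",
        "response": "RAGChecker evaluates RAG systems with claim-level entailment checks.",
        "retrieved_context": [
            {"doc_id": "d1", "text": "RAGChecker is a fine-grained evaluation framework for RAG."},
        ],
    }]
}
rag_results = RAGResults.from_json(json.dumps(data))

# The extractor and checker are LLM backends (here, Llama 3 served via Bedrock).
evaluator = RAGChecker(
    extractor_name="bedrock/meta.llama3-70b-instruct-v1:0",
    checker_name="bedrock/meta.llama3-70b-instruct-v1:0",
)
evaluator.evaluate(rag_results, all_metrics)  # overall, retriever, and generator metrics
print(rag_results)
```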
In comprehensive experiments on public datasets from 10 domains, covering eight state-of-the-art RAG systems, RAGChecker shows significantly stronger correlation with human judgments than existing evaluation metrics and yields insights into the trade-offs inherent in the behavior and design of RAG components.
RAG Benchmark Statistics
The benchmark is built by repurposing public datasets from 10 domains, for a total of 4,162 questions. The domains range from finance and lifestyle to entertainment, technology, science, and fiction; where the source datasets provide only short answers, these are expanded into long-form ground-truth answers using GPT-4.
The meta-evaluation reports how well each framework's scores correlate with human judgments of correctness, completeness, and overall quality. Each baseline framework (TruLens, RAGAS, ARES, CRUD-RAG) is compared using its most relevant metrics.
The metrics in RAGChecker help researchers and practitioners build more effective RAG systems by indicating which settings to adjust, such as the number of retrieved chunks (top-k), the chunk size, the chunk overlap ratio, and the generation prompt.
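To make these knobs concrete, here is a purely illustrative token-window chunker exposing a chunk-size and overlap-ratio parameter; RAGChecker itself simply evaluates whatever chunks your own pipeline produces.

```python
def chunk_tokens(tokens: list, chunk_size: int = 300, overlap_ratio: float = 0.2) -> list:
    """Split a token sequence into fixed-size chunks with the given overlap ratio.
    chunk_size and overlap_ratio correspond to the 'chunk size' and 'chunk overlap
    ratio' settings mentioned above (illustrative only)."""
    step = max(1, int(chunk_size * (1 - overlap_ratio)))
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), step)]

# Example: 1,000 tokens, 300-token chunks, 20% overlap -> a new chunk starts every 240 tokens.
chunks = chunk_tokens([f"tok{i}" for i in range(1000)])
print(len(chunks), [len(c) for c in chunks[:3]])  # 5 chunks; the first three have 300 tokens each
```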
Average Evaluation Results Across Different RAG Systems
The average evaluation results of the RAG systems across the 10 datasets report precision (Prec.), recall (Rec.), and F1 for overall performance. The retriever is evaluated with claim recall (CR) and context precision (CP), while the generator is diagnosed with context utilization (CU), noise sensitivity to relevant chunks (NS(I)), noise sensitivity to irrelevant chunks (NS(II)), hallucination (Hallu.), self-knowledge (SK), and faithfulness (Faith.). The average number of claims per response is also reported for each RAG system.
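For reference, the sketch below shows how these retriever and generator diagnostics can be derived from claim-level entailment labels. It follows the definitions in the RAGChecker paper in simplified form, and the function and argument names are my own; consult the paper for the exact formulations.

```python
from typing import List

def diagnostic_metrics(
    gt_in_retrieved: List[bool],    # ground-truth claim i entailed by the retrieved chunks?
    gt_in_response: List[bool],     # ground-truth claim i entailed by the response?
    resp_correct: List[bool],       # response claim j entailed by the ground-truth answer?
    resp_in_relevant: List[bool],   # response claim j entailed by some relevant chunk?
    resp_in_irrelevant: List[bool], # response claim j entailed by some irrelevant chunk?
    chunk_is_relevant: List[bool],  # does retrieved chunk c entail at least one ground-truth claim?
) -> dict:
    n_resp = max(len(resp_correct), 1)
    retrieved_gt = sum(gt_in_retrieved)
    in_context = [rel or irr for rel, irr in zip(resp_in_relevant, resp_in_irrelevant)]
    return {
        # Retriever diagnostics
        "claim_recall": retrieved_gt / max(len(gt_in_retrieved), 1),
        "context_precision": sum(chunk_is_relevant) / max(len(chunk_is_relevant), 1),
        # Generator diagnostics
        "context_utilization": sum(r and g for r, g in zip(gt_in_retrieved, gt_in_response))
                               / max(retrieved_gt, 1),
        "noise_sensitivity_relevant": sum(not c and rel for c, rel in zip(resp_correct, resp_in_relevant)) / n_resp,
        "noise_sensitivity_irrelevant": sum(not c and irr for c, irr in zip(resp_correct, resp_in_irrelevant)) / n_resp,
        "hallucination": sum(not c and not ctx for c, ctx in zip(resp_correct, in_context)) / n_resp,
        "self_knowledge": sum(c and not ctx for c, ctx in zip(resp_correct, in_context)) / n_resp,
        "faithfulness": sum(in_context) / n_resp,
    }
```

Note that, under these definitions, hallucination, self-knowledge, and faithfulness sum to 1, since every response claim is either entailed by the retrieved context or not.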
Appendix
The figure compares the answer-similarity scores predicted by RAGChecker and by RAGAS. Each point represents an instance from the meta-evaluation dataset: the x-axis gives the corresponding human preference label, and the y-axis gives the score predicted by each framework. The colored areas show the distribution of predicted scores, with the dashed lines marking the means.
Diagnosing for More Faithful Generation
Both the number of retrieved chunks (top-k) and the chunk size trade off the amount of noise against the amount of useful information presented to the generator, though in different ways. The corresponding results are shown in the two figures below.
Increasing k brings in more potentially irrelevant context, whereas increasing the chunk size mainly adds surrounding context around the relevant facts. Consequently, context precision tends to drop as k grows but improves with larger chunks, while both adjustments raise claim recall at the retrieval stage.
The generator is usually more faithful when given more context, although this trend is weaker for Llama3, which is already highly faithful. Because more context also brings more noise, context utilization typically degrades, and noise sensitivity to relevant chunks rises.
End-to-end RAG performance is slightly better with more context, mainly thanks to higher recall. A reasonable recommendation is to increase both parameters moderately for more faithful generation, while watching for saturation at large values, since the amount of genuinely useful information is limited. When the context length is constrained, preferring larger chunks with a smaller k is particularly advantageous on the easier datasets (finance, writing). This is most apparent when comparing a chunk size of 150 with k=20 against a chunk size of 300 with k=10, two settings that supply roughly the same total amount of retrieved text (150 × 20 = 300 × 10).
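As a sketch of how such an ablation could be scripted around RAGChecker: the `retrieve` and `generate` functions below are placeholder stubs standing in for your own retriever and generator, and the evaluator call mirrors the README-style usage shown earlier (again, verify the names against the repository).

```python
import itertools
import json

from ragchecker import RAGChecker, RAGResults
from ragchecker.metrics import all_metrics

def retrieve(query: str, chunk_size: int, top_k: int) -> list:
    """Placeholder retriever: swap in your own index lookup."""
    return [{"doc_id": f"d{i}", "text": f"({chunk_size}-token chunk {i}) for: {query}"} for i in range(top_k)]

def generate(query: str, chunks: list) -> str:
    """Placeholder generator: swap in your own LLM call."""
    return f"Answer to '{query}' grounded in {len(chunks)} retrieved chunks."

questions = [{"query": "What is RAGChecker?", "gt_answer": "A fine-grained RAG evaluation framework."}]

evaluator = RAGChecker(
    extractor_name="bedrock/meta.llama3-70b-instruct-v1:0",
    checker_name="bedrock/meta.llama3-70b-instruct-v1:0",
)

# Sweep chunk size and top-k, then compare faithfulness, claim recall, etc. per setting.
for chunk_size, top_k in itertools.product([150, 300], [5, 10, 20]):
    results = []
    for i, q in enumerate(questions):
        chunks = retrieve(q["query"], chunk_size, top_k)
        results.append({
            "query_id": str(i),
            "query": q["query"],
            "gt_answer": q["gt_answer"],
            "response": generate(q["query"], chunks),
            "retrieved_context": chunks,
        })
    rag_results = RAGResults.from_json(json.dumps({"results": results}))
    evaluator.evaluate(rag_results, all_metrics)
    print(chunk_size, top_k, rag_results)
```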
Conclusion
RAGChecker provides a robust framework for diagnosing and enhancing RAG systems, offering significant insights into their performance and areas for improvement. By leveraging this tool, developers and researchers can refine their systems to achieve better outcomes in various applications.
For more information, see the RAGChecker GitHub repository and the accompanying research paper.