Microsoft’s GraphRAG: Next-Gen Open Source RAG Sensation

Microsoft recently open-sourced its GraphRAG project, which quickly gained popularity, collecting 5.9K stars on GitHub in less than a week. GraphRAG is a structured, hierarchical approach to Retrieval-Augmented Generation (RAG), in contrast to semantic search methods that operate on plain text snippets. The process extracts a knowledge graph from raw text, builds a community hierarchy over it, generates summaries for those communities, and then uses these structures when performing RAG-based tasks.

What is RAG?

In artificial intelligence and natural language processing, RAG (Retrieval-Augmented Generation) combines information retrieval with text generation, making AI systems more flexible and more accurate.

This approach enhances AI’s ability to access and utilize vast amounts of external knowledge by retrieving relevant information before generating responses. Not only does this improve AI’s answering capabilities, but it also significantly increases the reliability and relevance of its outputs. RAG shows immense potential in various applications, including question-answering systems, virtual assistants, and content creation.
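The retrieve-then-generate idea can be sketched in a few lines. This is only an illustration under stated assumptions: the snippets are toy data, the similarity measure is a plain bag-of-words cosine rather than a learned embedding, and the helper names (`retrieve`, `build_prompt`) are hypothetical. The final prompt would be passed to a language model, which is left out here.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(question, snippets, k=2):
    """Return the k snippets most similar to the question."""
    q = Counter(tokenize(question))
    ranked = sorted(
        snippets,
        key=lambda s: cosine(q, Counter(tokenize(s))),
        reverse=True,
    )
    return ranked[:k]

def build_prompt(question, snippets, k=2):
    """Place the retrieved snippets in front of the question."""
    context = "\n".join(f"- {s}" for s in retrieve(question, snippets, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

# Toy snippets, invented for this example.
SNIPPETS = [
    "Scrooge is a miserly businessman in London.",
    "Bob Cratchit works as a clerk for Scrooge.",
    "The Leiden algorithm detects communities in graphs.",
]

prompt = build_prompt("Who works for Scrooge?", SNIPPETS)
print(prompt)
```

The unrelated snippet about the Leiden algorithm shares no words with the question, so it is ranked last and never reaches the prompt.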

GitHub: https://github.com/microsoft/graphrag

The Power of GraphRAG

GraphRAG can connect information across large datasets and use these connections to answer questions that are difficult or impossible to address using keyword and vector-based search mechanisms. It enables systems to answer questions that span multiple documents, as well as dataset-wide queries such as “What are the most important themes in this dataset?”

GraphRAG vs. Baseline RAG

While most RAG methods use vector similarity as the search technique (Baseline RAG), GraphRAG utilizes knowledge graphs to achieve significant improvements in question-answering performance when dealing with complex information. Baseline RAG struggles in certain situations:

  • Connecting the dots: answering the question requires traversing disparate pieces of information through their shared attributes in order to provide new, synthesized insights.
  • Holistically understanding summarized semantic concepts across large data collections or even single large documents.

Microsoft Research’s GraphRAG addresses these issues by using LLMs to create a knowledge graph based on the input corpus. This graph, along with summaries and graph machine learning outputs, is used to augment prompts at query time, demonstrating substantial improvements in answering the aforementioned types of questions.
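To make the "connecting the dots" case concrete, here is a minimal sketch of how a graph lets a system link two entities that never appear in the same document. The entities, relationships, and document names are invented for illustration, and the breadth-first search is a generic technique, not GraphRAG's own query logic.

```python
from collections import deque

# Made-up relationships; each edge records the document it was found in.
EDGES = [
    ("Alice", "Acme Corp", "doc1"),
    ("Acme Corp", "Project Orion", "doc2"),
    ("Project Orion", "Bob", "doc3"),
]

def build_adjacency(edges):
    """Turn the edge list into an undirected adjacency map."""
    adjacency = {}
    for source, target, doc in edges:
        adjacency.setdefault(source, []).append((target, doc))
        adjacency.setdefault(target, []).append((source, doc))
    return adjacency

def find_path(adjacency, start, goal):
    """Breadth-first search; returns the hops as (entity, document) pairs."""
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for neighbor, doc in adjacency.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, path + [(neighbor, doc)]))
    return None  # the two entities are not connected

path = find_path(build_adjacency(EDGES), "Alice", "Bob")
print(path)
```

No single document mentions both Alice and Bob, so a similarity search over individual snippets has nothing to match; the path through the graph draws on three documents at once.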

Baseline RAG:

The term “Novorossiya” is not mentioned in the provided context. However, to provide some information, Novorossiya, or New Russia, is a historical term referring to a region in the south of Ukraine that was conquered by the Russian Empire in the 18th century. In recent years, the term has been revived by Russian-backed separatists in Eastern Ukraine to refer to the territories they claim, but it is not officially recognized as a separate entity by international bodies.

GraphRAG:

Novorossiya, also known as New Russia, is a term with historical and contemporary significance. Historically, it referred to a region in the southern part of the Russian Empire that is now part of Ukraine. In the contemporary context, it has been used by pro-Russian separatists to refer to certain areas of Ukraine [Data: Entities (6494, 912, 6493)].

The term Novorossiya has been used in the context of a political movement in Ukraine, particularly by Russian nationalists. This movement has been associated with efforts to destabilize Ukraine and create a so-called “Novorossiya” [Data: Entities (912); Relationships (15211, 15206)]. The term has also been used to denote a separatist state in Ukraine, known as the Federal Republic of Novorossiya (FRN) [Data: Entities (7690)].
source: Microsoft Research Blog

The GraphRAG Process

Indexing Stage

  1. Slice the input documents into a series of text units, which serve as analyzable units for the entire process and provide fine-grained references for the output.
  2. Extract all entities, relationships, and key statements from these text units using a large language model (LLM); together they form the knowledge graph.
  3. Perform hierarchical clustering of the graph using the Leiden algorithm. When the clustered graph is visualized, each circle represents an entity (e.g., a person, place, or organization), with the size indicating the entity’s importance and the color representing its community.
  4. Generate summaries for each community and its constituents from the bottom up, aiding in the comprehensive understanding of the dataset.
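The four indexing steps can be sketched end to end on a toy input. This is not GraphRAG's implementation: sentences stand in for text units, a regular expression for capitalized words stands in for LLM-based entity extraction, connected components stand in for Leiden clustering, and the summary is a placeholder string where a real pipeline would call an LLM. All function names are hypothetical.

```python
import re
from itertools import combinations

def slice_text_units(text):
    """Step 1: split the text into units (here, simply sentences)."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

def extract_entities(unit):
    """Step 2 stand-in: treat capitalized words as entities.
    A real pipeline would prompt an LLM instead of using a regex."""
    return sorted(set(re.findall(r"\b[A-Z][a-z]+\b", unit)))

def build_graph(units):
    """Link entities that co-occur in the same text unit."""
    graph = {}
    for unit in units:
        entities = extract_entities(unit)
        for entity in entities:
            graph.setdefault(entity, set())
        for a, b in combinations(entities, 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph

def find_communities(graph):
    """Step 3 stand-in: connected components, NOT the Leiden algorithm."""
    seen, communities = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, members = [start], set()
        while stack:
            node = stack.pop()
            if node in members:
                continue
            members.add(node)
            stack.extend(graph[node] - members)
        seen |= members
        communities.append(sorted(members))
    return communities

def summarize(community):
    """Step 4 stand-in: a real pipeline would ask an LLM for a summary."""
    return "Community of " + ", ".join(community)

# Toy text, invented for this example.
TEXT = (
    "Scrooge employs Cratchit as a clerk. "
    "Marley once partnered with Scrooge in business. "
    "Paris inspired Monet to paint gardens."
)

communities = find_communities(build_graph(slice_text_units(TEXT)))
for community in communities:
    print(summarize(community))
```

Connected components only separate entities that share no link at all. Leiden goes further by splitting a connected graph into densely linked groups at several levels, which is what yields the hierarchy described above.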

Query Stage

The generated structures are used to provide materials for the large language model’s context window when answering questions. The main query modes are:

Global Search: Used for reasoning about global questions concerning the entire document set, primarily utilizing community summaries.
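One way to picture Global Search is as a map-reduce pass over the community summaries: each summary is turned into a scored partial answer, and the best partial answers are combined. The sketch below assumes that shape and replaces both LLM calls with simple stand-ins (word overlap for scoring, concatenation for the final answer). The summaries and function names are invented for illustration.

```python
import re

def words(text):
    """Set of lowercase words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))

def map_step(question, summary):
    """Stand-in for an LLM call that rates how helpful a summary is.
    Here the rating is a crude count of words shared with the question."""
    return {"point": summary, "score": len(words(question) & words(summary))}

def reduce_step(partials, top_n=2):
    """Keep the highest-rated partial answers and join them."""
    useful = [p for p in partials if p["score"] > 0]
    useful.sort(key=lambda p: p["score"], reverse=True)
    return " ".join(p["point"] for p in useful[:top_n])

def global_search(question, community_summaries, top_n=2):
    partials = [map_step(question, s) for s in community_summaries]
    return reduce_step(partials, top_n)

# Toy community summaries, invented for this example.
SUMMARIES = [
    "Redemption and generosity are central themes of the story.",
    "Scrooge and Marley ran a counting house together.",
    "Poverty and family warmth are themes around the Cratchit household.",
]

answer = global_search("What are the top themes in this story?", SUMMARIES)
print(answer)
```

Because every community contributes a partial answer before anything is discarded, the question is checked against the whole dataset rather than against a handful of nearest snippets.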

Local Search: Used for reasoning about specific entities by expanding to their neighbors and related concepts.
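The neighbor expansion behind Local Search can be illustrated with a small adjacency map. This sketch only collects the context lines that would be placed in the prompt; the graph, the descriptions, and the `local_context` helper are assumptions made for the example, and a real system would also rank and trim the collected material.

```python
# Toy graph and entity descriptions, invented for this example.
GRAPH = {
    "Scrooge": ["Cratchit", "Marley"],
    "Cratchit": ["Scrooge", "Tiny Tim"],
    "Marley": ["Scrooge"],
    "Tiny Tim": ["Cratchit"],
}
DESCRIPTIONS = {
    "Scrooge": "miserly businessman",
    "Cratchit": "clerk employed by Scrooge",
    "Marley": "former business partner of Scrooge",
    "Tiny Tim": "son of Cratchit",
}

def local_context(graph, descriptions, entity, hops=1):
    """Collect the entity plus every neighbor up to `hops` edges away."""
    frontier, selected = {entity}, {entity}
    for _ in range(hops):
        # Move one edge outward, skipping entities already collected.
        frontier = {n for node in frontier for n in graph.get(node, ())} - selected
        selected |= frontier
    return [
        f"{name}: {descriptions.get(name, 'no description')}"
        for name in sorted(selected)
    ]

for line in local_context(GRAPH, DESCRIPTIONS, "Scrooge", hops=1):
    print(line)
```

Raising `hops` widens the context: with `hops=2`, Tiny Tim is pulled in through Cratchit even though he has no direct edge to Scrooge.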

Getting Started with GraphRAG

Before using GraphRAG, ensure that Python 3.10-3.12 is installed on your machine.

  1. Install graphrag:
   pip install graphrag
  2. Run the indexer:

Create a dataset directory:

mkdir -p ./ragtest/input

Download the dataset:

curl https://www.gutenberg.org/cache/epub/24022/pg24022.txt > ./ragtest/input/book.txt

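Initialize the workspace. This creates the configuration files in ./ragtest, including a .env file in which you set GRAPHRAG_API_KEY to your OpenAI or Azure OpenAI API key before indexing: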

python -m graphrag.index --init --root ./ragtest

Run the indexing pipeline:

python -m graphrag.index --root ./ragtest
  3. Run the query engine:

Global Search:

python -m graphrag.query \
--root ./ragtest \
--method global \
"What are the top themes in this story?"

Local Search:

python -m graphrag.query \
--root ./ragtest \
--method local \
"Who is Scrooge, and what are his main relationships?"

For more information on how to use GraphRAG, refer to the official GraphRAG documentation.
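If you prefer to drive the query commands above from a script, a thin wrapper around the command line is enough. The sketch below assumes the CLI shape used in this article (`python -m graphrag.query` with `--root` and `--method`); other releases of GraphRAG may expose a different command line, and the helper names are hypothetical. Only `run_query` needs graphrag installed and an indexed workspace.

```python
import subprocess
import sys

def build_query_command(root, method, question):
    """Build the argument list for the query CLI as shown in this article."""
    if method not in ("global", "local"):
        raise ValueError("method must be 'global' or 'local'")
    return [
        sys.executable, "-m", "graphrag.query",
        "--root", root,
        "--method", method,
        question,
    ]

def run_query(root, method, question):
    """Run the query and return its output.
    Requires graphrag to be installed and the workspace to be indexed."""
    result = subprocess.run(
        build_query_command(root, method, question),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout

command = build_query_command(
    "./ragtest", "global", "What are the top themes in this story?"
)
print(" ".join(command[1:]))
```

Passing the arguments as a list avoids shell quoting problems with questions that contain spaces or punctuation.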
