Build a Fast Local AI Search Engine with Llama 3

In an era where data is abundant, the ability to efficiently search and extract insights from local files is paramount. This article presents an open-source generative AI search engine that uses the Llama 3 model to run semantic searches over local files. The project serves as a local, fully open alternative to tools like Microsoft Copilot, with every component available for inspection and modification.

System Architecture

To construct a local generative search engine or assistant, several components are necessary:

  • Content Indexing System: This component is responsible for storing the content of local files and is equipped with an information retrieval engine to efficiently search for the most relevant documents related to user queries.
  • Language Model: The Llama 3 model analyzes the selected local document content and generates concise summary answers based on it.
  • User Interface: An intuitive interface that allows users to easily query and obtain information.

Interaction Between Components

These components interact as follows: Qdrant serves as the vector store and Streamlit provides the user interface, while LangChain handles document chunking. The Llama 3 model can be accessed through the NVIDIA NIM API (the 70B-parameter instruct version) or downloaded from Hugging Face (the 8B-parameter version).

Semantic Indexing

Semantic indexing is crucial for surfacing the most relevant documents by comparing the similarity between file content and queries. Qdrant serves as the vector store; in its local mode it runs embedded in the application, keeping the index in memory or in a local folder, so no separate server installation is required.

Initializing Qdrant

When initializing Qdrant, it is necessary to predefine the vectorization method and metrics used. Here’s how to set it up:

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

client = QdrantClient(path="qdrant/")
collection_name = "MyCollection"

if client.collection_exists(collection_name):
    client.delete_collection(collection_name)

client.create_collection(collection_name, vectors_config=VectorParams(size=768, distance=Distance.DOT))
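
The vector size of 768 and the dot-product distance match the embedding model introduced below, sentence-transformers/msmarco-bert-base-dot-v5, which outputs 768-dimensional vectors trained for dot-product comparison.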

Document Embedding

To build the vector index, documents on the hard drive must undergo embedding processing. Selecting an appropriate embedding method and vector comparison metric is crucial, as different paragraph, sentence, or word embedding techniques yield varying results.

One of the main challenges in document vector searches is the asymmetric search problem, which is prevalent in information retrieval, especially when matching short queries with long documents.

In this implementation, we selected a model fine-tuned on the MSMARCO dataset, named sentence-transformers/msmarco-bert-base-dot-v5. This model is based on the BERT architecture and is specifically optimized for dot-product similarity measures.
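
As a hedged sketch of how this embedding model and the Qdrant collection from the previous snippet might be wired together through LangChain (the exact setup is not shown here; the variable names hf and qdrant are assumptions), consider:

from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_qdrant import Qdrant

# Assumed wiring: wrap the Qdrant client and collection created above in a
# LangChain vector store backed by the MSMARCO embedding model
hf = HuggingFaceEmbeddings(model_name="sentence-transformers/msmarco-bert-base-dot-v5")
qdrant = Qdrant(client=client, collection_name=collection_name, embeddings=hf)

The qdrant object produced this way is the vector store whose add_texts and similarity_search calls appear in the snippets that follow.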

Chunking Documents

To address the limitations of BERT models, which can only handle a maximum of 512 tokens, we opted for document chunking. This process utilizes LangChain’s built-in chunking tool:

from langchain_text_splitters import TokenTextSplitter

text_splitter = TokenTextSplitter(chunk_size=500, chunk_overlap=50)
texts = text_splitter.split_text(file_content)
metadata = [{"path": file} for _ in texts]
qdrant.add_texts(texts, metadatas=metadata)

This code splits the text into segments of 500 tokens, with a 50-token overlap to maintain contextual continuity.

Generating the Index

Before indexing file content, it is essential to read these files. The project simplifies this process by allowing users to specify the folder they wish to index. The indexer recursively searches through the specified folder and its subfolders for all supported file types, such as PDF, Word, PPT, and TXT formats.

Retrieving Files

Here’s a recursive method to retrieve all files within a given folder:

import os

def get_files(directory):
    file_list = []
    for f in os.listdir(directory):
        full_path = os.path.join(directory, f)
        if os.path.isfile(full_path):
            file_list.append(full_path)
        elif os.path.isdir(full_path):
            file_list += get_files(full_path)
    return file_list
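
The list returned here is what the indexing loop below iterates over as onlyfiles; a minimal usage example (the folder path is a placeholder, not from the source) would be:

onlyfiles = get_files("/path/to/your/documents")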

Reading File Content

The project supports reading various formats, including MS Word documents (.docx), PDF documents, MS PowerPoint presentations (.pptx), and plain text files (.txt). Below are examples of how to read these formats:

For MS Word documents:

import docx

def getTextFromWord(filename):
    doc = docx.Document(filename)
    fullText = [para.text for para in doc.paragraphs]
    return '\n'.join(fullText)

For PDF files:

import PyPDF2

def getTextFromPDF(filename):
    reader = PyPDF2.PdfReader(filename)
    return " ".join([reader.pages[i].extract_text() for i in range(len(reader.pages))])

Complete Indexing Function

The complete indexing function is structured as follows:

for file in onlyfiles:
    file_content = ""
    if file.endswith(".pdf"):
        print("Indexing " + file)
        file_content = getTextFromPDF(file)
    elif file.endswith(".txt"):
        print("Indexing " + file)
        with open(file, 'r') as f:
            file_content = f.read()
    elif file.endswith(".docx"):
        print("Indexing " + file)
        file_content = getTextFromWord(file)
    elif file.endswith(".pptx"):
        print("Indexing " + file)
        file_content = getTextFromPPTX(file)
    else:
        continue

    texts = text_splitter.split_text(file_content)
    metadata = [{"path": file} for _ in texts]
    qdrant.add_texts(texts, metadatas=metadata)

print("Finished indexing!")

Generative Search API

The web service is built with the FastAPI framework and hosts the generative search engine. The API connects to the Qdrant index established earlier, retrieves the most relevant chunks via vector similarity search, and passes them to the Llama 3 model to generate a precise, referenced answer.

Setting Up the API

Here’s how to configure and introduce the key components of the generative search:

from fastapi import FastAPI
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_qdrant import Qdrant
from qdrant_client import QdrantClient
from pydantic import BaseModel

app = FastAPI()

class Item(BaseModel):
    query: str

@app.get("/")
async def root():
    return {"message": "Hello World"}

Search Functionality

To make the API operational, two endpoints are defined: one performs a plain semantic search, and the other takes the top 10 most relevant chunks as context and asks the Llama 3 model to generate an answer grounded in them.

@app.post("/search")
def search(Item: Item):
    query = Item.query
    search_result = qdrant.similarity_search(query=query, k=10)
    list_res = [{"id": i, "path": res.metadata.get("path"), "content": res.page_content} for i, res in enumerate(search_result)]
    return list_res
@app.post("/ask_localai")
async def ask_localai(Item: Item):
    query = Item.query
    search_result = qdrant.similarity_search(query=query, k=10)
    context = ""
    mappings = {}
    for i, res in enumerate(search_result):
        context += f"{i}n{res.page_content}nn"
        mappings[i] = res.metadata.get("path")

    rolemsg = {
        "role": "system",
        "content": "Answer user's question using documents given in the context. Please always reference document id (in square brackets, for example [0],[1]) of the document that was used to make a claim."
    }
    messages = [rolemsg, {"role": "user", "content": f"Documents:\n{context}\n\nQuestion: {query}"}]

    # Call to Llama 3 model for generating the answer
    completion = client_ai.chat.completions.create(
        model="meta/llama3-70b-instruct",
        messages=messages,
        temperature=0.5,
        top_p=1,
        max_tokens=1024,
        stream=False
    )
    response = completion.choices[0].message.content
    return {"response": response}

Conclusion

This article has outlined the process of building a generative AI search engine for local files by combining Qdrant's semantic search with the Llama 3 language model. The resulting system implements a retrieval-augmented generation (RAG) workflow, allowing users to search their local files and receive concise answers with references to the source documents.

The project illustrates the potential of open-source AI tools to enhance productivity and knowledge discovery. By leveraging large language models like Llama 3, developers can create intelligent search solutions that go beyond simple keyword matching, truly understanding the meaning behind user requests.

As AI continues to advance, we can expect to see more innovative applications like this generative search engine emerge, empowering users to effortlessly navigate and extract insights from their local data. The future of AI-powered search is bright, and projects like this one are leading the way.
