As the landscape of natural language processing (NLP) evolves, the need for effective retrieval-augmented generation (RAG) techniques has become increasingly vital. This article covers advances in BGE (BAAI General Embedding) text embeddings and explores Landmark Embedding, an approach that addresses common challenges in processing long-context inputs.

I. Recommendations and Results for Embedding Selection

Embedding Selection Guidelines

When selecting embedding models, several factors must be considered, including token sequence length and domain-specific performance. Here are some key recommendations based on recent evaluations:

  • Sequence Lengths: Most models support a sequence length of 512 tokens. For longer contexts, consider models like tao-8k for 8192 tokens and stella for 1024 tokens.
  • Performance in Specialized Domains: In professional data settings, embedding models often underperform compared to traditional methods like BM25. However, fine-tuning can significantly enhance their effectiveness.
  • Recommended Models: For users with limited experience in model training but requiring fine-tuning, the BGE series is highly recommended due to its comprehensive training scripts and negative example mining capabilities. Other models based on BERT can also be referenced for similar training scripts.
  • Re-ranking Models: Options for re-ranking models are limited, but bge-reranker is a solid choice and supports fine-tuning. Note that re-ranking models typically require GPU deployment due to their input size; a minimal usage sketch follows this list.
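
As a reference, here is a minimal re-ranking sketch using FlagEmbedding's FlagReranker. The bge-reranker-large checkpoint and the example query–passage pairs are illustrative choices, not results from the evaluations above.

from FlagEmbedding import FlagReranker

# use_fp16=True speeds up inference on GPU at a small precision cost
reranker = FlagReranker('BAAI/bge-reranker-large', use_fp16=True)

# Each pair is [query, passage]; higher scores indicate stronger relevance
scores = reranker.compute_score([
    ['what is BGE?', 'BGE is an open-source embedding model released by BAAI.'],
    ['what is BGE?', 'BM25 is a classical lexical retrieval function.'],
])
print(scores)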

Selection Results

  1. PEG
  • Author: Tencent
  • Model Address: PEG Model
  • Paper: PEG Paper
  • Focus: Optimizing retrieval capabilities.
  2. GTE Series
  3. Piccolo Series
  • Author: SenseTime
  • Address: Piccolo Model
  • Notes: Includes fine-tuning tips.
  4. Stella Series
  5. BGE Series
  6. M3E Series
  • Author: MokaAI
  • Address: M3E Model
  • GitHub: M3E GitHub
  • Significance: Early pioneer in Chinese general embedding models.
  7. Multilingual E5 Large
  8. Tao-8k
  • Address: Tao-8k Model
  • Note: Supports 8192-token sequences, but with limited information available.

II. Overview of BGE

In the era of large models, addressing issues such as hallucinations, knowledge obsolescence, and ultra-long text processing is crucial. However, high-quality semantic vector models in the Chinese context remain scarce and often closed-source.

To tackle these challenges, BAAI has released BGE (BAAI General Embedding), a robust open-source model for Chinese and English semantic vectors. BGE outperforms all similar models in the community, including OpenAI’s text-embedding-ada-002, in both semantic retrieval accuracy and overall representation capabilities. Notably, BGE has the smallest vector dimension among models of comparable parameter scale, resulting in lower usage costs.

The paper titled “C-Pack: Packed Resources For General Chinese Embeddings” introduces a resource package that includes:

  1. C-MTP: A large text embedding training set with extensive unsupervised and high-quality supervised corpora.
  2. C-MTEB: A benchmark covering six tasks and thirty-five datasets for Chinese text embeddings.
  3. BGE: Multi-scale text embedding models.

Figure: Overview of the C-Pack resource package.

III. Utilizing BGE Models

Here are several ways to use BGE models with FlagEmbedding, Sentence-Transformers, LangChain, and Hugging Face Transformers.

1. Using FlagEmbedding

pip install -U FlagEmbedding

from FlagEmbedding import FlagModel

# Two sets of sentences to compare against each other
sentences_1 = ["Sample data-1", "Sample data-2"]
sentences_2 = ["Sample data-3", "Sample data-4"]

# The retrieval instruction is added to queries by encode_queries();
# plain encode() is used for passages and ordinary sentences
model = FlagModel(
    'BAAI/bge-large-zh',
    query_instruction_for_retrieval="Generate a representation for this sentence to be used for retrieving relevant articles:"
)

embeddings_1 = model.encode(sentences_1)
embeddings_2 = model.encode(sentences_2)

# FlagModel normalizes embeddings by default, so the inner product gives cosine similarity
similarity = embeddings_1 @ embeddings_2.T
print(similarity)

# Retrieval setting: encode queries (with the instruction) and passages separately
queries = ['query_1', 'query_2']
passages = ["Sample document-1", "Sample document-2"]

q_embeddings = model.encode_queries(queries)
p_embeddings = model.encode(passages)

scores = q_embeddings @ p_embeddings.T

2. Using Sentence-Transformers

pip install -U sentence-transformers

from sentence_transformers import SentenceTransformer

# Two sets of sentences to compare against each other
sentences_1 = ["Sample data-1", "Sample data-2"]
sentences_2 = ["Sample data-3", "Sample data-4"]

model = SentenceTransformer('BAAI/bge-large-zh')

# Normalize so the inner product equals cosine similarity
embeddings_1 = model.encode(sentences_1, normalize_embeddings=True)
embeddings_2 = model.encode(sentences_2, normalize_embeddings=True)

similarity = embeddings_1 @ embeddings_2.T
print(similarity)

# Retrieval setting: prepend the instruction to queries only, not to passages
queries = ['query_1', 'query_2']
passages = ["Sample document-1", "Sample document-2"]

instruction = "Generate a representation for this sentence to be used for retrieving relevant articles:"

q_embeddings = model.encode([instruction + q for q in queries], normalize_embeddings=True)
p_embeddings = model.encode(passages, normalize_embeddings=True)

scores = q_embeddings @ p_embeddings.T

3. Using LangChain

# Older LangChain versions; newer releases expose this class from langchain_community.embeddings
from langchain.embeddings import HuggingFaceBgeEmbeddings

model_name = "BAAI/bge-small-en"
model_kwargs = {'device': 'cpu'}
encode_kwargs = {'normalize_embeddings': True}  # normalize so cosine similarity is a plain dot product

model = HuggingFaceBgeEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs
)
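
A quick usage check through LangChain's standard Embeddings interface (embed_query and embed_documents); the sample strings are placeholders:

query_embedding = model.embed_query("Sample query")
doc_embeddings = model.embed_documents(["Sample document-1", "Sample document-2"])
print(len(query_embedding), len(doc_embeddings))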

4. Using HuggingFace Transformers

from transformers import AutoTokenizer, AutoModel
import torch

sentences = ["Sample data-1", "Sample data-2"]

tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-large-zh')
model = AutoModel.from_pretrained('BAAI/bge-large-zh')

# Tokenize with padding/truncation and return PyTorch tensors
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Inference only: no gradients needed
with torch.no_grad():
    model_output = model(**encoded_input)

# BGE uses the [CLS] token's last hidden state as the sentence embedding
sentence_embeddings = model_output[0][:, 0]
# L2-normalize so similarities can be computed with a plain dot product
sentence_embeddings = torch.nn.functional.normalize(sentence_embeddings, p=2, dim=1)

print("Sentence embeddings:", sentence_embeddings)

IV. Understanding BGE Landmark Embedding

Large language models (LLMs) often need to process long sequence inputs for various applications. Retrieval augmentation is a highly effective method for managing long-context language modeling. However, existing retrieval techniques typically work with chunked contexts, which can lead to subpar semantic representations and incomplete information retrieval.

The paper “BGE Landmark Embedding: A Chunking-Free Embedding Method For Retrieval Augmented Long-Context Large Language Models,” published in February 2024, proposes a way to address incomplete information retrieval over long contexts. Its chunking-free retrieval approach preserves the coherence of the context, and a position-aware objective function used during training emphasizes the last sentence of a continuous information span. Together, these significantly improve long-context retrieval augmentation while keeping embedding quality comparable to standard sentence embeddings.

Innovations in BGE Landmark Embedding


BGE Landmark Embedding presents three key innovations:

  • Chunking-Free Architecture: This design eliminates the issues associated with traditional chunking methods, which often disrupt context coherence (see the sketch after this list).
  • Position-Aware Objective Function: This function enhances the model’s ability to perceive the significance of information segments, ensuring that critical details are not overlooked.
  • Multi-Stage Training Algorithm: This approach optimizes the training process, allowing the model to adapt more effectively to long-context scenarios.
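
As a rough illustration only (not the paper's architecture or training code), the sketch below shows the retrieval granularity that the chunking-free idea argues for: score the query against sentence-ending boundaries rather than fixed-size chunks, then return the continuous span that ends at each selected boundary. The generic sentence encoder and the window parameter are stand-in assumptions.

import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer('BAAI/bge-small-en')  # stand-in encoder, not the landmark encoder

def retrieve_spans(query, sentences, top_k=2, window=3):
    # One embedding per sentence boundary instead of one per fixed-size chunk
    q = encoder.encode([query], normalize_embeddings=True)
    s = encoder.encode(sentences, normalize_embeddings=True)
    scores = (q @ s.T).ravel()
    # For each top-scoring boundary, return the continuous span ending there,
    # keeping the sentences that lead up to it so the context stays coherent
    best = np.argsort(-scores)[:top_k]
    return [" ".join(sentences[max(0, i - window + 1): i + 1]) for i in best]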

The results indicate that Landmark Embedding is a powerful tool capable of achieving more accurate and efficient information retrieval across various long-context tasks.

For further reading, refer to the full paper: BGE Landmark Embedding Paper.
