As the landscape of natural language processing (NLP) evolves, the need for effective retrieval-augmented generation (RAG) techniques has become increasingly vital. This article covers advances in the BGE (BAAI General Embedding) family of text embedding models and explores Landmark Embedding, an approach that addresses common challenges in processing long-context inputs.
I. Recommendations and Results for Embedding Selection
Embedding Selection Guidelines
When selecting embedding models, several factors must be considered, including token sequence length and domain-specific performance. Here are some key recommendations based on recent evaluations:
- Sequence Lengths: Most models support a sequence length of 512 tokens. For longer contexts, consider models like tao-8k for 8192 tokens and stella for 1024 tokens.
- Performance in Specialized Domains: In professional data settings, embedding models often underperform compared to traditional methods like BM25. However, fine-tuning can significantly enhance their effectiveness.
- Recommended Models: For users with limited experience in model training but requiring fine-tuning, the BGE series is highly recommended due to its comprehensive training scripts and negative example mining capabilities. Other models based on BERT can also be referenced for similar training scripts.
- Re-ranking Models: Options for re-ranking models are limited, but bge-reranker is a solid choice and supports fine-tuning. Note that re-ranking models score full query–passage pairs, so they typically require GPU deployment (see the usage sketch below).
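As a quick illustration of how a re-ranker is typically invoked, here is a minimal sketch using the FlagReranker class from the FlagEmbedding library; the model name and example pairs are placeholders, and the exact API should be checked against the FlagEmbedding README.

from FlagEmbedding import FlagReranker

# use_fp16 speeds up GPU inference with a small loss in precision.
reranker = FlagReranker('BAAI/bge-reranker-large', use_fp16=True)

# A re-ranker scores query-passage pairs directly (cross-encoder style),
# which is why GPU deployment is usually recommended.
pairs = [['query_1', 'Sample document-1'], ['query_1', 'Sample document-2']]
scores = reranker.compute_score(pairs)
print(scores)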
Selection Results
- PEG
- GTE Series
- Piccolo Series
  - Author: SenseTime
  - Address: Piccolo Model
  - Notes: Includes fine-tuning tips.
- Stella Series
  - Address: Stella Model
  - Blog: Stella Blog
  - Details: Based on the Piccolo model, with support for 1024-token sequences.
- BGE Series
  - Author: BAAI
  - Address: BGE Model
  - Paper: BGE Paper
  - GitHub: FlagEmbedding Repository
  - Highlights: Offers extensive open-source information and fine-tuning examples (see the training-data sketch after this list).
- M3E Series
  - Author: MokaAI
  - Address: M3E Model
  - GitHub: M3E GitHub
  - Significance: Early pioneer among Chinese general-purpose embedding models.
- Multilingual E5 Large
  - Address: Multilingual E5 Model
  - Paper: E5 Paper
  - Feature: Supports multiple languages.
- Tao-8k
  - Address: Tao-8k Model
  - Note: Supports 8192-token sequences, but limited information is available.
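Related to the fine-tuning support noted for the BGE series above, the FlagEmbedding repository documents a simple JSONL training-data format, in which each line holds a query together with positive and mined negative passages. The sketch below writes a toy file in that format; the field names follow the repository's examples, but verify against the current README before training.

import json

# Toy training examples in the query / pos / neg format used by the
# FlagEmbedding fine-tuning scripts (hard negatives can be mined automatically).
examples = [
    {"query": "query_1", "pos": ["relevant passage"], "neg": ["irrelevant passage"]},
    {"query": "query_2", "pos": ["another relevant passage"], "neg": ["another irrelevant passage"]},
]

with open("finetune_data.jsonl", "w", encoding="utf-8") as f:
    for ex in examples:
        f.write(json.dumps(ex, ensure_ascii=False) + "\n")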
II. Overview of BGE
In the era of large models, addressing issues such as hallucinations, knowledge obsolescence, and ultra-long text processing is crucial. However, high-quality semantic vector models in the Chinese context remain scarce and often closed-source.
To tackle these challenges, BAAI has released BGE (BAAI General Embedding), a robust open-source model for Chinese and English semantic vectors. BGE outperforms all similar models in the community, including OpenAI's text-embedding-ada-002, in both semantic retrieval accuracy and overall representation capability. Notably, BGE has the smallest vector dimension among models of comparable parameter scale, resulting in lower usage costs.
The paper titled “C-Pack: Packed Resources For General Chinese Embeddings” introduces a resource package that includes:
- C-MTP: A large text embedding training set with extensive unsupervised and high-quality supervised corpora.
- C-MTEB: A benchmark covering six tasks and thirty-five datasets for Chinese text embeddings.
- BGE: Multi-scale text embedding models.
III. Utilizing BGE Models
Below are several ways to use BGE models with FlagEmbedding, Sentence-Transformers, LangChain, and HuggingFace Transformers.
1. Using FlagEmbedding
pip install -U FlagEmbedding
from FlagEmbedding import FlagModel

sentences_1 = ["Sample data-1", "Sample data-2"]
sentences_2 = ["Sample data-3", "Sample data-4"]
# The retrieval instruction is prepended to queries only, not to passages.
model = FlagModel(
    'BAAI/bge-large-zh',
    query_instruction_for_retrieval="Generate a representation for this sentence to be used for retrieving relevant articles:"
)
# Embeddings are normalized, so the inner product is a cosine-similarity score.
embeddings_1 = model.encode(sentences_1)
embeddings_2 = model.encode(sentences_2)
similarity = embeddings_1 @ embeddings_2.T
print(similarity)
# For retrieval, encode_queries() prepends the instruction to each query;
# passages are encoded with plain encode().
queries = ['query_1', 'query_2']
passages = ["Sample document-1", "Sample document-2"]
q_embeddings = model.encode_queries(queries)
p_embeddings = model.encode(passages)
scores = q_embeddings @ p_embeddings.T
2. Using Sentence-Transformers
pip install -U sentence-transformers
from sentence_transformers import SentenceTransformer

sentences_1 = ["Sample data-1", "Sample data-2"]
sentences_2 = ["Sample data-3", "Sample data-4"]
model = SentenceTransformer('BAAI/bge-large-zh')
# Normalize embeddings so that the inner product equals cosine similarity.
embeddings_1 = model.encode(sentences_1, normalize_embeddings=True)
embeddings_2 = model.encode(sentences_2, normalize_embeddings=True)
similarity = embeddings_1 @ embeddings_2.T
print(similarity)
# For retrieval, prepend the instruction to queries only; passages are encoded without it.
queries = ['query_1', 'query_2']
passages = ["Sample document-1", "Sample document-2"]
instruction = "Generate a representation for this sentence to be used for retrieving relevant articles:"
q_embeddings = model.encode([instruction + q for q in queries], normalize_embeddings=True)
p_embeddings = model.encode(passages, normalize_embeddings=True)
scores = q_embeddings @ p_embeddings.T
3. Using LangChain
# In newer LangChain releases this class lives in langchain_community.embeddings.
from langchain.embeddings import HuggingFaceBgeEmbeddings

model_name = "BAAI/bge-small-en"
model_kwargs = {'device': 'cpu'}
encode_kwargs = {'normalize_embeddings': True}  # normalize for cosine similarity
model = HuggingFaceBgeEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs
)
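Once constructed, the object is used through LangChain's standard Embeddings interface. A minimal usage sketch with placeholder texts:

# Minimal usage of the LangChain Embeddings interface.
query_vector = model.embed_query("Sample query")  # one string -> one vector
doc_vectors = model.embed_documents(["Sample document-1", "Sample document-2"])  # list -> list of vectors
print(len(query_vector), len(doc_vectors))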
4. Using HuggingFace Transformers
from transformers import AutoTokenizer, AutoModel
import torch

sentences = ["Sample data-1", "Sample data-2"]
tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-large-zh')
model = AutoModel.from_pretrained('BAAI/bge-large-zh')
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
with torch.no_grad():
    model_output = model(**encoded_input)
    # Use the hidden state of the [CLS] token as the sentence embedding.
    sentence_embeddings = model_output[0][:, 0]
# L2-normalize so that inner products correspond to cosine similarity.
sentence_embeddings = torch.nn.functional.normalize(sentence_embeddings, p=2, dim=1)
print("Sentence embeddings:", sentence_embeddings)
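Because the embeddings above are L2-normalized, pairwise similarity follows directly from a matrix product; a small illustrative follow-up:

# Pairwise cosine similarity between the sample sentences.
similarity = sentence_embeddings @ sentence_embeddings.T
print(similarity)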
IV. Understanding BGE Landmark Embedding
Large language models (LLMs) often need to process long sequence inputs for various applications. Retrieval augmentation is a highly effective method for managing long-context language modeling. However, existing retrieval techniques typically work with chunked contexts, which can lead to subpar semantic representations and incomplete information retrieval.
The paper “BGE Landmark Embedding: A Chunking-Free Embedding Method For Retrieval Augmented Long-Context Large Language Models,” published in February 2024, proposes a method for addressing incomplete information retrieval over long contexts. By introducing a chunking-free retrieval approach, it preserves the coherence of the context, and through a position-aware function used during training it emphasizes the final sentence of a continuous information span. This markedly improves long-context retrieval augmentation while retaining a level of detail comparable to sentence embeddings.
Innovations in BGE Landmark Embedding
BGE Landmark Embedding presents three key innovations:
- Chunking-Free Architecture: This design eliminates the issues associated with traditional chunking methods, which often disrupt context coherence.
- Position-Aware Objective Function: This function enhances the model's ability to perceive the significance of information segments, ensuring that critical details are not overlooked (a toy illustration of the idea follows this list).
- Multi-Stage Training Algorithm: This approach optimizes the training process, allowing the model to adapt more effectively to long-context scenarios.
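The precise objective is defined in the paper; purely as a rough, illustrative sketch of the position-aware idea (not the paper's formula), the snippet below discounts landmark weights within a relevant span so that the span's final sentence carries the strongest signal. The decay factor alpha and the overall setup are assumptions for illustration only.

import numpy as np

def position_aware_weights(span_length, alpha=0.5):
    # Toy illustration: the landmark closing a relevant span gets weight 1,
    # while earlier landmarks are exponentially discounted, so training
    # emphasizes the sentence that completes a coherent information span.
    # (alpha is an assumed decay factor, not a value from the paper.)
    return np.array([alpha ** (span_length - 1 - i) for i in range(span_length)])

print(position_aware_weights(4))  # [0.125 0.25  0.5   1.   ]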
The results indicate that Landmark Embedding is a powerful tool capable of achieving more accurate and efficient information retrieval across various long-context tasks.
For further reading, refer to the full paper: BGE Landmark Embedding Paper.