Jina AI Introduces PE-Rank for Efficient Listwise Passage Reranking
Following the release of Jina Reranker v2, Jina AI has open-sourced PE-Rank, a new LLM-based reranker for efficient listwise passage reranking. Instead of feeding raw text into the LLM's context window, PE-Rank uses an embedding model to represent each passage as a single special token. It then feeds instruction + query + passage tokens into the LLM. During inference, PE-Rank constrains the output space to these special tokens, enabling more efficient decoding. This dramatically reduces the latency of reranking 100 documents, from 21 seconds to just 3 seconds.
Figure: Comparison of RankGPT and PE-Rank. RankGPT (top) takes entire passages as input and outputs ordered numbers, while PE-Rank (bottom) uses a list of special tokens as both input and output. The right side shows the reranking results on DL19 using different forms of input.
The Appeal and Challenges of Using LLMs as Rerankers
Using large language models (LLMs) as rerankers offers several attractive features:
- Flexible instructions for new tasks
- Zero-shot capabilities
- Contextual reasoning
However, in practice, several factors hinder the use of LLMs as rerankers:
- Context length: Reranking 100 documents with 1,000 tokens each essentially requires a context length of 100,000 tokens.
- Finding a needle in a haystack: Performance may fluctuate as important information can get lost in long contexts.
- Susceptibility to prompt injection: Instructions and queries may be overridden by candidate documents.
- Output format issues: Ensuring the output follows the correct ranking format (e.g., d1 > d3 > d2 > d7) can be challenging; the model may produce malformed or overly verbose output.
How PE-Rank Works
First, the input to the LLM is essentially the instruction + query + passages, with each passage represented by a single special token derived from its embedding (a minimal sketch follows below).
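As a rough illustration of this input construction and the constrained decoding, here is a minimal sketch written against a HuggingFace-style base model that accepts inputs_embeds. The names build_inputs, constrained_rank, and project (the mapping MLP described in the next paragraph) are illustrative assumptions, not the authors' actual API:

```python
import torch

def build_inputs(instruction_ids, query_ids, passage_embs, token_embedding, project):
    """Concatenate instruction + query token embeddings with projected passage embeddings."""
    text_embs = token_embedding(torch.cat([instruction_ids, query_ids]))  # (L, d_model)
    passage_tokens = project(passage_embs)                                # (N, d_model): one "token" per passage
    return torch.cat([text_embs, passage_tokens], dim=0), passage_tokens

def constrained_rank(llm, inputs_embeds, passage_tokens):
    """Greedy decoding restricted to the passage tokens; each passage is emitted exactly once."""
    order, remaining = [], list(range(passage_tokens.size(0)))
    while remaining:
        hidden = llm(inputs_embeds=inputs_embeds.unsqueeze(0)).last_hidden_state[0, -1]  # (d_model,)
        scores = passage_tokens[remaining] @ hidden      # score only the not-yet-ranked passages
        best = remaining[int(scores.argmax())]
        order.append(best)
        remaining.remove(best)
        # feed the chosen passage token back in and keep decoding
        inputs_embeds = torch.cat([inputs_embeds, passage_tokens[best : best + 1]], dim=0)
    return order  # passage indices, most to least relevant
```

Because each decoding step only scores the remaining passage tokens rather than the whole vocabulary, the output is guaranteed to be a valid permutation, which is what sidesteps the output-format issues listed above.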
Second, the idea of using special passage tokens in place of the original text is similar to soft prompting. In PE-Rank, however, an external embedding model such as Jina or BGE encodes the passages, which introduces a mismatch between the external embedding space and the backbone LLM's own token-embedding space, so a mapping function has to be learned. To do this, both the embedding model and the LLM are kept frozen, and only a 2-layer MLP is trained to project the external embeddings into the LLM's embedding space.
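A minimal sketch of that mapping, assuming 768-dimensional external embeddings and Mistral-7B's 4096-dimensional token-embedding space; the exact layer sizes and activation are assumptions, not details confirmed by the paper:

```python
import torch.nn as nn

class EmbeddingProjector(nn.Module):
    """2-layer MLP that maps external passage embeddings into the LLM's token-embedding space."""
    def __init__(self, emb_dim: int = 768, llm_dim: int = 4096):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(emb_dim, llm_dim),
            nn.GELU(),                      # activation choice is an assumption
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, passage_embs):        # (N, emb_dim) -> (N, llm_dim)
        return self.mlp(passage_embs)

# During training, the embedding model and the LLM stay frozen;
# only this projector receives gradient updates.
```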
How is the model trained? Classic supervised fine-tuning (SFT) is not particularly helpful here: the decoding space is constrained to the special passage tokens, so applying standard SFT is not straightforward. PE-Rank instead combines two losses: ListMLE, which at each step maximizes the probability of generating the token of the next most relevant passage; and Contextual ListMLE, which applies the same objective while the original passage text is also present in the context. This strengthens the model's ability to exploit token-level interactions between the query and the passages and helps transfer that ability to ranking from embeddings alone.
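To make the ListMLE part concrete, here is a sketch of the per-step objective over the constrained output space. Here step_logits (logits over all passage tokens at each decoding step) and gold_order (the target ranking) are hypothetical inputs; the Contextual ListMLE variant would apply the same loss while the raw passage text is also in the context:

```python
import torch
import torch.nn.functional as F

def listmle_loss(step_logits: torch.Tensor, gold_order: list) -> torch.Tensor:
    """step_logits: (num_steps, num_passages); gold_order: passage indices, most relevant first."""
    loss = torch.zeros(())
    remaining = list(range(step_logits.size(1)))
    for t, gold in enumerate(gold_order):
        logits_t = step_logits[t, remaining]                 # restrict to not-yet-ranked passages
        target = torch.tensor([remaining.index(gold)])
        loss = loss + F.cross_entropy(logits_t.unsqueeze(0), target)
        remaining.remove(gold)                               # the gold passage is "consumed"
    return loss / len(gold_order)
```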
PE-Rank Performance Evaluation
Using Mistral-7B-Instruct-v0.2 as PE-Rank's base LLM and Jina-embeddings-v2/BGE-v1.5 as the external embedding models, PE-Rank achieves performance comparable to feeding the original documents into GPT-4 (RankGPT4 in the table), but with only about one-sixth the latency, cutting the total time cost from 20 seconds to 3 seconds. When reranking only the top 20 candidates, latency drops further to 0.5 seconds per query, which makes it quite practical for real-world applications.
When switching between Jina and BGE embeddings, PE-Rank consistently improves the performance of the underlying retriever, whether that is BM25, Jina, or BGE. Interestingly, although BGE scores higher than Jina on MTEB, reranking BM25 retrieval results with BGE embeddings is consistently worse than with Jina embeddings across three different datasets. This suggests that models excelling on general embedding benchmarks like MTEB do not necessarily perform well in this specific setting, while Jina embeddings generalize better here.
Key Takeaways
- PE-Rank is a new LLM-based reranker that uses passage embeddings for efficient listwise reranking, reducing latency from 21 seconds to 3 seconds for reranking 100 documents.
- PE-Rank represents passages as special tokens and constrains the LLM’s output space to these tokens during inference for more efficient decoding.
- Using LLMs as rerankers offers benefits like flexible instructions, zero-shot capabilities, and contextual reasoning, but faces challenges related to context length, information loss, prompt injection, and output formatting.
- PE-Rank combines ListMLE and Contextual ListMLE losses to enhance the model’s ability to leverage token-level interactions between queries and passages.
- With an optimized setup, PE-Rank achieves performance comparable to using GPT-4 for reranking but with significantly lower latency, making it practical for real-world applications.
Code: https://github.com/liuqi6777/pe_rank
Paper: Leveraging Passage Embeddings for Efficient Listwise Reranking with Large Language Models (https://arxiv.org/pdf/2406.14848)