Jina AI Introduces PE-Rank for Efficient Listwise Passage Reranking

Following the release of Jina Reranker v2, Jina AI has open-sourced PE-Rank, a new LLM-based reranker for efficient listwise passage reranking. Instead of feeding raw passage text into the LLM's context window, PE-Rank uses an embedding model to represent each passage as a single special token. The LLM then receives the instruction + query + passage tokens as input. During inference, PE-Rank constrains the output space to these special tokens, enabling more efficient decoding. This dramatically reduces the latency of reranking 100 documents from 21 seconds to just 3 seconds.
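
To make the constrained decoding concrete, here is a minimal sketch (with a toy vocabulary and hypothetical passage token IDs, not the released implementation) of a greedy decoding step whose output space is masked down to the passage tokens that have not been emitted yet:

```python
import torch

def constrained_decode_step(logits: torch.Tensor,
                            passage_token_ids: list[int],
                            already_ranked: set[int]) -> int:
    """One greedy decoding step restricted to the special passage tokens.

    `logits` is the LLM's next-token distribution over the full vocabulary.
    Everything except the passage tokens not yet emitted is masked out,
    so each step deterministically yields exactly one ranked passage.
    """
    mask = torch.full_like(logits, float("-inf"))
    allowed = [t for t in passage_token_ids if t not in already_ranked]
    mask[allowed] = 0.0                       # keep only unranked passage tokens
    next_token = int(torch.argmax(logits + mask))
    already_ranked.add(next_token)
    return next_token

# Toy usage: rank 3 passages whose special tokens occupy IDs 7-9
# in a 10-token vocabulary.
logits = torch.randn(10)
passage_ids, ranked = [7, 8, 9], set()
order = [constrained_decode_step(logits, passage_ids, ranked) for _ in range(3)]
```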

Comparison of RankGPT and PE-Rank

RankGPT (top) takes entire paragraphs as input and outputs ordered numbers, while PE-Rank (bottom) uses a list of special tokens as both input and output. The right side shows the reranking results on DL19 using different forms of input.

The Appeal and Challenges of Using LLMs as Rerankers

Using large language models (LLMs) as rerankers offers several attractive features:

  • Flexible instructions for new tasks
  • Zero-shot capabilities
  • Contextual reasoning

However, in practice, several factors hinder the use of LLMs as rerankers:

  • Context length: Reranking 100 documents of 1,000 tokens each requires a context window of roughly 100,000 tokens (see the back-of-the-envelope sketch after this list).
  • Finding a needle in a haystack: Performance may fluctuate as important information can get lost in long contexts.
  • Susceptibility to prompt injection: Instructions and queries may be overridden by candidate documents.
  • Output format issues: Ensuring the output follows the correct order format (e.g., d1 > d3 > d2 > d7) can be challenging. Sometimes you may get grammatical errors or overly verbose results.
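
To see why one token per passage matters, here is a back-of-the-envelope comparison of the two input formats (the prompt overhead figure is an assumption):

```python
# Context budget for listwise reranking (prompt overhead is an assumption).
num_docs = 100
tokens_per_doc = 1_000
prompt_overhead = 200           # instruction + query, assumed

raw_text_tokens = num_docs * tokens_per_doc + prompt_overhead  # ~100,200 tokens
pe_rank_tokens = num_docs * 1 + prompt_overhead                # ~300: one token per passage

print(raw_text_tokens, pe_rank_tokens)
```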

How PE-Rank Works

With PE-Rank, the input to the LLM is essentially the instruction + query + embedded passages, each passage represented by a single special token.
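
As a rough sketch of how that input can be assembled (the function and argument names here are illustrative assumptions, not the repo's actual API), ordinary token embeddings for the instruction and query are concatenated with one projected embedding per passage:

```python
import torch

def build_inputs_embeds(llm_embed,           # the LLM's token-embedding layer
                        projector,           # MLP: external embeddings -> LLM space
                        text_token_ids,      # (T,) tokenized instruction + query
                        passage_embs):       # (N, emb_dim), one vector per passage
    """Concatenate instruction/query token embeddings with one projected
    passage embedding per candidate, yielding a (T+N, d_model) input."""
    text_embs = llm_embed(text_token_ids)      # (T, d_model)
    passage_tokens = projector(passage_embs)   # (N, d_model)
    return torch.cat([text_embs, passage_tokens], dim=0)
```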

The idea of using special passage tokens to stand in for the original text is similar to soft prompting. However, PE-Rank uses external embedding models such as Jina or BGE to encode the documents, which introduces a discrepancy between the external embedding space and the backbone LLM's own token-embedding space, so a mapping function must be learned. To do this, the embedding model and the LLM are frozen, and only a 2-layer MLP is trained to project between the two spaces.
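
A minimal sketch of such a projector, assuming illustrative dimensions (e.g., a 768-dimensional external embedding mapped into Mistral-7B's 4096-dimensional token space) and an assumed activation:

```python
import torch.nn as nn

class EmbeddingProjector(nn.Module):
    """2-layer MLP mapping external embeddings (Jina/BGE) into the LLM's
    token-embedding space. Only this module is trained; the embedding
    model and the LLM stay frozen. Dimensions here are assumptions."""
    def __init__(self, emb_dim: int = 768, llm_dim: int = 4096):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(emb_dim, llm_dim),
            nn.GELU(),                    # activation choice is an assumption
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, x):                 # x: (N, emb_dim) -> (N, llm_dim)
        return self.net(x)
```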

PE-Rank overview

But how is the model fine-tuned? Classic supervised fine-tuning (SFT) is not particularly helpful here: because the decoding space is restricted to the special passage tokens, standard next-token SFT does not apply directly. Instead, PE-Rank combines two losses: ListMLE, which maximizes the probability of generating the next most relevant passage token at each step; and Contextual ListMLE, which applies the same objective while additionally conditioning on the original passage content. This combination strengthens the model's ability to exploit token-level interactions between the query and passages, and helps transfer that ability to ranking with embeddings alone.
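
For intuition, here is a minimal sketch of plain ListMLE under the Plackett-Luce factorization (the tensor shapes and toy data are assumptions); Contextual ListMLE applies the same objective while the original passage text is also present in the context:

```python
import torch

def list_mle(scores: torch.Tensor, gold_order: torch.Tensor) -> torch.Tensor:
    """ListMLE: negative log-likelihood of generating the gold ranking
    one passage at a time (Plackett-Luce factorization).
    scores: (N,) model scores for N passages.
    gold_order: (N,) passage indices from most to least relevant."""
    s = scores[gold_order]
    # log P(pick step i | remaining) = s[i] - logsumexp(s[i:])
    suffix_lse = torch.logcumsumexp(s.flip(0), dim=0).flip(0)
    return -(s - suffix_lse).sum()

# Toy usage: passage 0 is most relevant, then 2, then 1.
loss = list_mle(torch.tensor([2.0, 0.5, 1.0]), torch.tensor([0, 2, 1]))
```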

PE-Rank Performance Evaluation

Using Mistral-7B-Instruct-v0.2 as the base model for PE-Rank's LLM and Jina-embeddings-v2/BGE-v1.5 for the external embeddings, PE-Rank achieves performance comparable to feeding the original documents into GPT-4 (RankGPT4 in the table) at a fraction of the latency, cutting the total time cost from about 21 seconds to 3 seconds. When reranking only the top 20 candidates, per-query latency falls further to 0.5 seconds, making it quite practical for real-world applications.

Figures: NDCG@10 results and latency breakdown at different stages

When switching between Jina and BGE embeddings, PE-Rank consistently improves on the underlying retriever, whether that retriever is BM25, Jina, or BGE. Interestingly, although BGE scores higher than Jina on MTEB, reranking BM25 retrieval results with BGE embeddings is consistently worse than with Jina embeddings across three different datasets. This suggests that models excelling on general embedding benchmarks like MTEB do not necessarily perform well in this specific setting, where Jina embeddings transfer better.

Figure: reranking results with BM25, Jina, and BGE as the first-stage retriever

Key Takeaways

  • PE-Rank is a new LLM-based reranker that uses passage embeddings for efficient listwise reranking, reducing latency from 21 seconds to 3 seconds for reranking 100 documents.
  • PE-Rank represents passages as special tokens and constrains the LLM’s output space to these tokens during inference for more efficient decoding.
  • Using LLMs as rerankers offers benefits like flexible instructions, zero-shot capabilities, and contextual reasoning, but faces challenges related to context length, information loss, prompt injection, and output formatting.
  • PE-Rank combines ListMLE and Contextual ListMLE losses to enhance the model’s ability to leverage token-level interactions between queries and passages.
  • With an optimized setup, PE-Rank achieves performance comparable to using GPT-4 for reranking but with significantly lower latency, making it practical for real-world applications.

Paper: Leveraging Passage Embeddings for Efficient Listwise Reranking with Large Language Models (https://arxiv.org/pdf/2406.14848)
Code: https://github.com/liuqi6777/pe_rank
