While the cost per token for large language models has fallen, the time and cost of sending and processing long prompts remain a significant bottleneck. For the foreseeable future, prompt compression will be an essential consideration when building AI applications.
Current attempts to compress lengthy prompts tend to produce compressed prompts with noticeably lower readability and interpretability than the original, which hurts their utility. To address this issue, researchers from King Abdullah University of Science and Technology in Saudi Arabia, South China University of Technology, Ant Group, University of Macau, and Nankai University have proposed PROMPT-SAW (PROMPT compresSion via Relation AWare graphs), an effective strategy that leverages relation-aware graphs for text prompt compression. It outperforms the best baselines by 14.3% and 13.7% in the task-relevant and task-agnostic settings, respectively, while compressing the original prompts by 56.7% and 33.0%.
Although the paper did not provide a GitHub repository or any code, I attempted to reproduce the results using Python based on the algorithm described in the paper. Despite some differences, I achieved broadly similar results using the same data.
How PROMPT-SAW Works
The core idea of PROMPT-SAW is to represent the textual information in a prompt as a graph, which makes it easier to identify the prompt's key elements. By distilling information through this graph structure, a compressed prompt can be generated that stays semantically consistent with the original without degrading its final performance or utility. Specifically, PROMPT-SAW first uses a purpose-built prompt to have a language model construct a graph from the original prompt text. In task-aware scenarios, PROMPT-SAW traverses the graph and retains only task-relevant information as a task-specific subgraph. In task-agnostic scenarios, it uses similarity scores between consecutive information elements in the graph to identify and remove redundant elements, yielding the desired subgraph.
PROMPT-SAW achieves prompt compression through the following main steps:
- Extract: Extract entities (such as person names, place names, organization names, etc.) and their relationships from the original prompt text to construct a knowledge graph.
- Model: Organize the extracted entities and relationships into a graph model, where nodes represent entities and edges represent relationships between entities.
- Graph compression based on specific scenarios:
  - For task-aware scenarios, traverse the graph model and retain only task-relevant information as a task-specific subgraph.
  - For task-agnostic scenarios, calculate the similarity between adjacent information units in the graph and remove redundant elements to obtain the desired subgraph. A binary search algorithm is used to find an appropriate similarity threshold (a sketch of this search follows the list below).
- Encoder: Encode the entities and relationships contained in the compressed subgraph into natural language to generate the final compressed prompt.
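The binary search over the similarity threshold can be pictured roughly as follows. This is a minimal sketch based on my reading of the paper, not the authors' code; `compress_by_threshold` is a hypothetical helper that prunes graph elements whose pairwise similarity exceeds the threshold and returns how many nodes survive.

```python
# A minimal sketch (assumption: not the authors' implementation) of the binary search
# over the similarity threshold. `compress_by_threshold` is a hypothetical helper that
# prunes elements whose similarity exceeds `threshold` and returns the surviving node count.
def find_similarity_threshold(graph, target_ratio, compress_by_threshold,
                              tol=0.01, max_iter=20):
    low, high = 0.0, 1.0
    total = graph.number_of_nodes()
    for _ in range(max_iter):
        mid = (low + high) / 2
        kept_ratio = compress_by_threshold(graph, mid) / total
        if abs(kept_ratio - target_ratio) <= tol:
            return mid
        if kept_ratio < target_ratio:
            low = mid    # pruning too aggressively -> raise the threshold to keep more
        else:
            high = mid   # keeping too much -> lower the threshold to prune more
    return (low + high) / 2
```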
Compared to token-level compression methods like LLMLingua, PROMPT-SAW analyzes and compresses the prompt text by transforming it into a knowledge graph, allowing it to grasp the prompt's key information at a higher semantic level. The resulting compressed prompts also excel in readability and interpretability, as the case comparisons in the paper show.
For example, for the question “Who got the first Nobel Prize in Physics,” LLMLingua’s compressed prompt is: “ITitle: List of Nobelates in The first PrizelWilhelmrad, ofwho receiveds2in en prize: won twoes forg01 theate women prizeertMayer(1963). As of 2017, the prize has been awarded,” which has poor readability and obscures key information. In contrast, PROMPT-SAW’s compressed prompt is: “Wilhelm Conrad Röntgen awarded first Nobel Prize in Physics 1901. William Lawrence Bragg won Nobel Prize in Physics 1915. Maria Goeppert-Mayer won Nobel Prize in Physics 1963,” clearly retaining the key information needed to answer the question.
PROMPT-SAW infers the key information in the prompt using the knowledge graph structure and significantly improves the compression ratio through an adaptive graph compression strategy while maintaining the readability and utility of the compressed prompt. This represents an important advancement in prompt compression techniques. Its ingenious design and superior performance make it highly valuable for researchers engaged in prompt engineering and AI product development to learn from and draw upon.
Experimental Results
To comprehensively evaluate the performance of PROMPT-SAW on task-agnostic prompts, the research team also proposed the GSM8K-AUG dataset, an extended version of the existing GSM8K benchmark. For the task-relevant setting, they used the NaturalQuestions dataset for evaluation.
The experimental results show that, compared to the best baseline models, PROMPT-SAW improves performance by 14.3% and 13.7% in the task-relevant and task-agnostic settings, respectively, while compressing the original prompt text by 56.7% and 33.0%. More importantly, the prompts compressed by PROMPT-SAW also outperform the baselines in terms of readability.
On GSM8K-AUG, compared to the best baseline, PROMPT-SAW improves the EM value by 9.5%, 8.2%, and 13.7% in 1-shot, 2-shot, and 4-shot settings, respectively, while reducing the prompt size by 31.8%, 33.0%, and 33.0%. On NaturalQuestions, PROMPT-SAW improves the Span Accuracy by 14.30%, 10.62%, and 6.52% compared to GPT-4 (the best baseline) when the target compression ratios are 0.5, 0.3, and 0.1, respectively, while reducing the prompt size by 56.7%, 74.0%, and 93.7%.
Comparison with LLMLingua
The researchers also compared the readability of prompts compressed by PROMPT-SAW and LLMLingua. The results showed that the prompts compressed by PROMPT-SAW exhibit better readability and interpretability. This may be because baseline models like LLMLingua ignore the inherent semantic relationships between tokens in the compressed prompts, leading to a loss of interpretability.
PROMPT-SAW significantly improves the performance of prompt compression by leveraging the graph structure to infer key information in the prompt while ensuring the readability and utility of the compressed prompts. This innovative approach is expected to substantially reduce the computational cost and inference latency of LLMs in practical applications, paving the way for further development and application of LLMs. For prompt engineers developing AI products, PROMPT-SAW is undoubtedly an excellent case study to learn from.
The experimental results in the paper fully demonstrate the effectiveness and superiority of PROMPT-SAW. In both task-agnostic and task-relevant settings, PROMPT-SAW significantly outperforms the current best baseline models. More importantly, the compressed prompts generated by PROMPT-SAW are far superior to the baseline models in terms of readability and interpretability. This means that while greatly improving prompt compression performance, PROMPT-SAW also ensures the human-friendliness and practicality of the compressed prompts.
How I Reproduced the Results
The core innovation of PROMPT-SAW lies in using the knowledge graph structure to represent and analyze the textual information in the prompts. By transforming the prompts into a graph structure consisting of entities and relationships, PROMPT-SAW can better infer the key information elements of the prompts, thus achieving more precise and efficient prompt compression. This method is not only theoretically innovative but also demonstrates excellent performance in practice.
Prompt compression sits at the intersection of Natural Language Processing (NLP) and Information Extraction (IE). Andrew Ng, the well-known machine-learning researcher, has strongly recommended studying Microsoft's Medprompt. I have also implemented Medprompt using DSPy in the article “Inject the Wisdom of MedPrompt into Your Q&A System with DSPy: A Plug-and-Play Code Implementation.”
Before introducing my code implementation, I need to cover two concepts. One is TF-IDF, which is used in the compress_graph function; the other is cosine similarity, which you have probably seen often but may not know well. This is only a brief primer; if you are interested, both topics reward deeper study.
TF-IDF
Using a TF-IDF vectorizer to convert node text into a TF-IDF matrix is a key step in text processing for representing text data features. Here is a detailed explanation:
The concept of TF-IDF
TF (Term Frequency): Measures the frequency of a word appearing in a document. Formula: TF = Number of times a word appears in a document / Total number of words in the document.
IDF (Inverse Document Frequency): Measures the importance of a word by accounting for its rarity across the corpus. Formula: IDF = log(Total number of documents in the corpus / (Number of documents containing the word + 1)).
TF-IDF: The product of TF and IDF, used to measure the importance of a word in a document. Formula: TF-IDF = TF × IDF (a small code example follows below).
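As a quick illustration (my own example, not from the paper), scikit-learn's TfidfVectorizer computes this matrix directly; note that it uses a slightly smoothed variant of the IDF formula above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "Wilhelm Conrad Röntgen won the first Nobel Prize in Physics",
    "William Lawrence Bragg won the Nobel Prize in Physics",
    "Maria Goeppert-Mayer won the Nobel Prize in Physics",
]
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(docs)   # shape: (3 documents, vocabulary size)
print(vectorizer.get_feature_names_out())       # the learned vocabulary
print(tfidf_matrix.toarray().round(2))          # TF-IDF weight of each word per document
```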
Cosine Similarity
Cosine similarity matching is a common method for measuring the similarity between two vectors, particularly suitable for comparing high-dimensional vectors. It determines their similarity by calculating the cosine value of the angle between the two vectors. Cosine similarity ranges from -1 to 1, with 1 indicating that the two vectors are identical, 0 indicating that they are orthogonal, and -1 indicating that the two vectors are in opposite directions.
Formula: cosine_similarity(A, B) = (A · B) / (‖A‖ × ‖B‖), i.e. the dot product of the two vectors divided by the product of their magnitudes (the same cosine function you learned in trigonometry class).
Let’s illustrate the application of cosine similarity in vector queries through a simple text processing example.
Suppose we have the following three pieces of text:
- “I love reading books, especially science fiction novels.”
- “Science fiction movies fascinate me; I often go to the cinema.”
- “I enjoy outdoor activities, especially hiking and camping.”
Our goal is to find the text most similar to the query sentence, “I’ve been reading a great science fiction novel recently.”
1) Convert these texts into vectors.
This is usually done through bag-of-words models, TF-IDF, Word2Vec, or other NLP techniques. In this example, we simplify the process and assume that each text has already been converted into a vector.
2) Query vectorization
Convert the query sentence, “I’ve been reading a great science fiction novel recently,” into a vector as well.
3) Calculate cosine similarity
Calculate the cosine similarity between the query vector and each text vector using the formula above.
4) Select the most similar text
After calculating these values, we find that the first text is most similar to the query sentence because it has the highest cosine similarity. This indicates that the first text is closest to the query in topic and content (a runnable version of this example is shown below).
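Here is a runnable version of this toy example (my own sketch), using TF-IDF vectors and scikit-learn's cosine_similarity instead of hand-crafted vectors:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

texts = [
    "I love reading books, especially science fiction novels.",
    "Science fiction movies fascinate me; I often go to the cinema.",
    "I enjoy outdoor activities, especially hiking and camping.",
]
query = "I've been reading a great science fiction novel recently."

vectorizer = TfidfVectorizer()
all_vectors = vectorizer.fit_transform(texts + [query])     # vectorize the texts and the query together
query_vec, text_vecs = all_vectors[len(texts)], all_vectors[:len(texts)]
scores = cosine_similarity(query_vec, text_vecs)[0]         # similarity of the query to each text
print(scores)                                               # the first text should score highest
print("Most similar:", texts[scores.argmax()])
```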
What is cosine similarity used for?
In practical applications, cosine similarity matching is often used in the following situations:
- Information retrieval: Finding the most similar documents to a query among a vast collection of documents by representing texts as vectors.
- Recommendation systems: Calculating the similarity between users and items to provide personalized recommendations.
- Cluster analysis: Determining the similarity between different data points for data classification or clustering.
Why can it solve vector queries?
Cosine similarity matching can solve vector query problems because it focuses on comparing the direction of vectors rather than their magnitude. This means that even if two vectors represent data of different scales, they can be identified as similar as long as their directions are consistent. Moreover, in high-dimensional spaces, cosine similarity is relatively simple to compute and has high computational efficiency, making it suitable for fast matching on large amounts of data.
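To make the "direction, not magnitude" point concrete, here is a tiny sketch with NumPy (my own example): a vector and a scaled copy of it have cosine similarity 1 even though their magnitudes differ, while an orthogonal vector scores 0.

```python
import numpy as np

def cos_sim(a, b):
    # cosine similarity: dot product divided by the product of the magnitudes
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
b = 10 * a                      # same direction, ten times the magnitude
c = np.array([3.0, 0.0, -1.0])  # a different (orthogonal) direction

print(cos_sim(a, b))  # 1.0 -> identical direction, scale is ignored
print(cos_sim(a, c))  # 0.0 -> orthogonal vectors
```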
With this foundation, we can implement it using the Graph approach, which essentially involves four main steps:
- Extract entities and their relations
import spacy

# Load the spaCy English pipeline (assumes the en_core_web_sm model is installed)
nlp = spacy.load("en_core_web_sm")

def extract_entities_relations(text):
    doc = nlp(text)
    # Extract entities and their labels (e.g., PERSON, GPE, ORG)
    entities = [(e.text, e.label_) for e in doc.ents]
    relations = []
    for sent in doc.sents:
        for tok in sent:
            # Use the root token of each sentence as the relation
            if tok.dep_ == "ROOT":
                relation = (tok.text, tok.lemma_, tok.dep_)
                relations.append((sent.text, relation))
    return entities, relations
- Build the graph by adding nodes and edges
import networkx as nx

def build_graph(entities, relations):
    # Create an undirected graph
    g = nx.Graph()
    # Add each entity as a node, with the entity label as a node attribute
    for ent, label in entities:
        g.add_node(ent, label=label)
    # Add relations as edges between entity nodes
    for sent, (word, lemma, dep) in relations:
        rel_parts = sent.split()  # Split the sentence into a list of words
        if len(rel_parts) > 1:
            subj = rel_parts[0]   # Take the first word as the subject
            obj = rel_parts[-1]   # Take the last word as the object
            if subj in g.nodes() and obj in g.nodes():
                # If both subject and object are graph nodes, connect them
                g.add_edge(subj, obj, relation=lemma)  # The edge attribute is the lemma of the relation
    return g
- Compress the graph
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def compress_graph(graph, ratio=0.5):
    nodes = list(graph.nodes())  # Get a list of all nodes in the graph
    num_nodes_to_remove = int(len(nodes) * (1 - ratio))  # Calculate the number of nodes to remove
    vectorizer = TfidfVectorizer()  # Initialize the TF-IDF vectorizer
    node_texts = [str(node) for node in nodes]  # Convert nodes to text
    tfidf_matrix = vectorizer.fit_transform(node_texts)  # Generate the TF-IDF matrix
- Calculate cosine similarity and remove the most similar nodes to achieve compression
    # Compute the average cosine similarity of each node to all other nodes
    avg_similarities = {}
    for i, node in enumerate(nodes):
        similarities = cosine_similarity(tfidf_matrix[i:i+1], tfidf_matrix)[0]
        avg_similarities[node] = sum(similarities) / len(similarities)
    # Remove the most redundant nodes (highest average similarity) until the target ratio is reached
    most_redundant = sorted(avg_similarities, key=avg_similarities.get, reverse=True)
    for node_to_remove in most_redundant[:num_nodes_to_remove]:
        graph.remove_node(node_to_remove)
    return graph
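To tie the three functions together, here is a hypothetical end-to-end run (the prompt text and compression ratio are illustrative only; since I did not reproduce the paper's Encoder step, I simply inspect what remains in the compressed graph):

```python
prompt = ("Wilhelm Conrad Röntgen was awarded the first Nobel Prize in Physics in 1901. "
          "William Lawrence Bragg won the Nobel Prize in Physics in 1915. "
          "Maria Goeppert-Mayer won the Nobel Prize in Physics in 1963.")

entities, relations = extract_entities_relations(prompt)
graph = build_graph(entities, relations)
compressed = compress_graph(graph, ratio=0.5)   # keep roughly half of the nodes

print("Kept nodes:", list(compressed.nodes(data=True)))
print("Kept edges:", list(compressed.edges(data=True)))
```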
The code above is a partial reproduction of PROMPT-SAW and differs slightly from the algorithm described in the paper; in particular, I was unable to implement the natural-language encoding (Encoder) step from the original text. The final results were still obtained by calling the DeepSeek API. Why that API? Only by actually experimenting with different APIs do you come to understand how their parameters shape the generated results; true knowledge comes from practice.
What is PROMPT-SAW and how does it enhance prompt compression?
PROMPT-SAW is a framework designed for compressing textual prompts used in large language models (LLMs). By leveraging graph structures to retain key information, it enhances readability and maintains task performance.
How does prompt compression affect the performance of language models?
Prompt compression optimizes the input size, reducing the number of tokens processed by language models. This leads to faster response times and lower operational costs, making it crucial for applications where efficiency is paramount.
What are the advantages of using graph-based techniques in prompt compression?
Graph-based techniques, like those used in PROMPT-SAW, allow for a more nuanced analysis of prompts by capturing relationships between entities. This results in better retention of essential information and improved output quality compared to traditional token-level methods. For a deeper dive, check out the research paper on PROMPT-SAW.
Can PROMPT-SAW be implemented in existing Python projects?
There is no official code release, but the approach can be reproduced in Python, as the partial implementation above shows. The graph-based pipeline builds on common libraries such as spaCy, NetworkX, and scikit-learn, so it can be folded into existing projects without major code overhauls.
Are there any limitations to using PROMPT-SAW for prompt compression?
While PROMPT-SAW is effective, it may not be suitable for all types of prompts, especially in conversational contexts where nuance is critical. Over-compression can lead to loss of important details, so it is important to evaluate its applicability for each specific use case.