The Cost-Performance Trade-off of Large Language Models
Large Language Models (LLMs) have demonstrated exceptional capabilities across a wide range of tasks. However, as the performance-cost graph in Figure 1 shows, their costs and abilities differ significantly. Generally speaking, more capable models tend to be more expensive than less capable ones. This leads to a dilemma when deploying LLMs in real-world scenarios: routing all queries to the largest, most capable model yields the highest-quality responses but can be very expensive, while routing queries to smaller models saves cost at the risk of lower response quality.
Figure 1: Performance-cost graph of various LLMs. Performance is measured by Elo rating on Chatbot Arena; cost per million tokens assumes a 1:1 input/output token ratio. By routing between two models, we can ideally achieve a better performance-to-cost ratio than either model alone.
RouteLLM: A Solution for Efficient LLM Deployment
LMSYS, the team behind the widely used Chatbot Arena leaderboard, has open-sourced RouteLLM, a system in which each query is first processed by a router that decides which LLM it should be sent to. Ideally, every query that a weaker model can handle is routed to that model, and only the remaining queries are routed to stronger models, minimizing cost while maintaining response quality.
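To make this concrete, here is a short usage sketch adapted from the RouteLLM repository's README; exact class names, router identifiers, and threshold values are version-dependent, so treat the details as illustrative rather than authoritative.

```python
import os
from routellm.controller import Controller

os.environ["OPENAI_API_KEY"] = "sk-..."  # credentials for the strong model

# Route between a strong and a weak model using the matrix-factorization
# ("mf") router. Model identifiers here are illustrative.
client = Controller(
    routers=["mf"],
    strong_model="gpt-4-1106-preview",
    weak_model="mixtral-8x7b-instruct-v0.1",
)

# The model string encodes the router name and a cost threshold: queries
# whose predicted "strong model wins" probability exceeds the threshold
# go to GPT-4; everything else goes to Mixtral.
response = client.chat.completions.create(
    model="router-mf-0.11593",
    messages=[{"role": "user", "content": "Write a binary search in Python."}],
)
print(response.choices[0].message.content)
```

The controller mirrors the OpenAI chat-completions interface, so existing client code can switch to routing by changing only the model string.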
A Principled Framework for LLM Routing
RouteLLM is a principled framework for LLM routing based on preference data. It formalizes the LLM routing problem and explores augmentation techniques to improve router performance. Four different routers were trained using public data from Chatbot Arena, demonstrating that they can significantly reduce costs without impacting quality. Compared to using GPT-4 alone, costs were reduced by over 85% on MT Bench, 45% on MMLU, and 35% on GSM8K, while still achieving 95% of GPT-4’s performance.
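Concretely, the routing problem can be framed as a thresholded binary decision. The notation below is a paraphrase rather than the paper's exact formulation: a router learns from preference data to predict the probability that the strong model's response would be preferred for a given query, and a threshold on that probability trades cost against quality.

```latex
% Paraphrased formulation; notation is illustrative, not the paper's.
% \hat{p}(q): predicted probability that the strong model's response
% to query q would be preferred over the weak model's.
R_\alpha(q) =
\begin{cases}
  M_{\text{strong}}, & \hat{p}(q) \ge \alpha \\
  M_{\text{weak}},   & \hat{p}(q) < \alpha
\end{cases}
```

Sweeping the threshold α from 0 to 1 traces out the cost-quality curve: at α = 0 every query goes to the strong model, and as α approaches 1 nearly all queries go to the weak one.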
Training Routers with Chatbot Arena Data and Augmentation
Four routers were trained using a combination of Chatbot Arena data and data augmentation (a sketch of the matrix-factorization idea follows the list):
- Similarity-Weighted (SW) Ranking Router, which performs a "weighted Elo calculation" in which each Arena comparison is weighted by its similarity to the incoming query
- Matrix Factorization Model, which learns a scoring function estimating how well a model can answer a given prompt
- BERT Classifier, which predicts which model will provide the better response
- Causal LLM Classifier, which likewise predicts which model will provide the better response
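As a minimal sketch of the matrix-factorization idea referenced above: learn a latent vector per model, project the prompt embedding into the same space, and score the match with a dot product. All class, parameter, and dimension choices below are hypothetical; RouteLLM's actual implementation differs in its details.

```python
import torch
import torch.nn as nn

class MFRouterSketch(nn.Module):
    """Hypothetical matrix-factorization scorer: score(model, prompt) is a
    bilinear match between a learned model embedding and a projected
    prompt embedding."""

    def __init__(self, num_models: int, text_dim: int = 768, latent_dim: int = 128):
        super().__init__()
        self.model_emb = nn.Embedding(num_models, latent_dim)
        self.proj = nn.Linear(text_dim, latent_dim)

    def forward(self, prompt_emb: torch.Tensor, model_id: torch.Tensor) -> torch.Tensor:
        q = self.proj(prompt_emb)      # (batch, latent_dim)
        m = self.model_emb(model_id)   # (batch, latent_dim)
        return (q * m).sum(dim=-1)     # higher score = better expected answer

# Training signal: pairwise preferences from Chatbot Arena, e.g. a
# Bradley-Terry style loss where
# P(model_a beats model_b | prompt) = sigmoid(score_a - score_b).
```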
These routers were evaluated on three popular benchmarks: MT Bench, MMLU, and GSM8K, routing between GPT-4 Turbo as the strong model and Mixtral 8x7B as the weak model. A random router, which sends each query to either model with equal probability, served as the baseline.
Impressive Performance and Cost Savings
On MT Bench, Matrix Factorization and SW Ranking both perform strongly when trained only on Arena data. Notably, Matrix Factorization achieves 95% of GPT-4's performance while making GPT-4 calls for only 26% of queries, approximately 48% cheaper than the random baseline.
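To make the "26% of GPT-4 calls for 95% of GPT-4 performance" metric concrete, the sketch below shows one way such a number can be computed: sweep the routing threshold and record the smallest fraction of strong-model calls at which routed quality reaches the target. The function and variable names are mine, not the paper's.

```python
import numpy as np

def strong_call_fraction(win_probs, strong_scores, weak_scores, target=0.95):
    """Illustrative paraphrase of the evaluation metric: for each candidate
    threshold, route queries whose predicted win probability is >= the
    threshold to the strong model, and return the smallest fraction of
    strong-model calls that still reaches `target` times the strong
    model's average score."""
    strong_avg = strong_scores.mean()
    best = 1.0  # worst case: every query goes to the strong model
    for tau in np.unique(win_probs):
        to_strong = win_probs >= tau
        routed_avg = np.where(to_strong, strong_scores, weak_scores).mean()
        if routed_avg >= target * strong_avg:
            best = min(best, to_strong.mean())
    return best
```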
Augmenting the Arena data with an LLM Judge significantly improves the performance of all routers. When trained on this augmented dataset, Matrix Factorization is again the best-performing router, and the share of GPT-4 calls needed to achieve 95% of GPT-4's performance roughly halves again, to just 14% of queries, 75% cheaper than the random baseline.
Figure 2: Router performance on MT Bench when trained only on Arena data (left), and when trained on Arena data augmented with an LLM Judge (right).
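The judge augmentation described above can be pictured as follows: for prompts that lack human votes, collect both models' answers and ask a strong LLM to pick the winner, yielding extra preference labels. The prompt wording and helper below are hypothetical, not the paper's exact judging setup.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

def judge_preference(prompt: str, strong_answer: str, weak_answer: str) -> int:
    """Hypothetical LLM-judge labeler: returns 1 if the strong model's
    answer is judged better, else 0. In practice one would also randomize
    the answer order to counter position bias."""
    verdict = client.chat.completions.create(
        model="gpt-4-1106-preview",
        messages=[{
            "role": "user",
            "content": (
                f"Question:\n{prompt}\n\n"
                f"Answer A:\n{strong_answer}\n\n"
                f"Answer B:\n{weak_answer}\n\n"
                "Which answer is better? Reply with exactly 'A' or 'B'."
            ),
        }],
    )
    return 1 if verdict.choices[0].message.content.strip().startswith("A") else 0
```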
On MMLU, when trained only on the Arena dataset, all routers perform poorly, at near-random levels, which is attributed to most MMLU questions being out-of-distribution for the Arena data. However, augmenting the training dataset with canonical label data from the MMLU validation split significantly improves all routers: the best-performing Causal LLM router now requires GPT-4 calls for only 54% of queries to achieve 95% of GPT-4's performance, 14% cheaper than the random baseline. Importantly, this augmentation dataset of approximately 1,500 samples constitutes less than 2% of the overall training data, demonstrating the effectiveness of data augmentation even with few samples.
Figure 3: Router performance on MMLU when trained only on Arena data (left), and when trained on Arena data augmented with canonical label data from the MMLU validation split (right).
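The canonical-label augmentation can be sketched similarly. Because MMLU questions have known correct answers, a preference label can be derived from correctness alone, with no judge needed; the scheme below is one plausible reading, not necessarily the paper's exact rule.

```python
def canonical_label(strong_correct: bool, weak_correct: bool):
    """One plausible way to turn ground-truth correctness into a routing
    label (illustrative): prefer the weak model whenever it suffices, and
    drop questions where neither model produces a useful signal."""
    if weak_correct:
        return 0      # weak model suffices; route cheap
    if strong_correct:
        return 1      # only the strong model gets it right
    return None       # neither correct: no useful training signal
```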
RouteLLM Demo
LMSYS has also built a small demo. For example, coding problems are routed to GPT-4-1106, while blog-writing requests are routed to Mixtral-8x7B. Try it at:
https://0c83f754b05f4a2208.gradio.live/
For more information, check out:
https://github.com/lm-sys/RouteLLM
https://arxiv.org/abs/2406.18665