NVIDIA Boosts Llama-3’s Context to 128K, Outperforming GPT-4

In the rapidly evolving world of natural language processing, the race to develop more powerful and capable language models has been a constant pursuit. One critical aspect that sets apart the performance of these models is their context length—the ability to process and understand longer sequences of text. In a groundbreaking development, NVIDIA has achieved a significant milestone by extending the context length of the open-source Llama-3 model by an astounding 16 times, surpassing even the renowned GPT-4 in long-context understanding tasks.

Why Context Length Matters in Language Models

To grasp the significance of NVIDIA’s achievement, it is essential to understand the role of context length in language models. A model’s context length determines the maximum number of tokens (words or subwords) it can process in a single input sequence. Longer context lengths enable models to comprehend and generate more coherent and contextually relevant text, as they can consider a larger portion of the input at once.
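
To make the idea concrete, here is a minimal sketch that counts the tokens in a document and shows how much text an 8K-context model must drop compared with a 128K-context model. It uses the Hugging Face tokenizer API; the model name and file path are illustrative assumptions, not details from the article.

```python
from transformers import AutoTokenizer

# Illustrative: any Llama-style tokenizer works; the model name and file are assumptions.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-70B")

with open("long_report.txt") as f:  # hypothetical long document
    text = f.read()

token_ids = tokenizer(text, add_special_tokens=False)["input_ids"]
print(f"Document length: {len(token_ids)} tokens")

# An 8K-context model must drop everything past its window; a 128K model can read the whole text.
for context_window in (8_192, 131_072):
    dropped = max(0, len(token_ids) - context_window)
    print(f"{context_window:>7}-token window drops {dropped} tokens")
```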

The importance of context length becomes evident when considering various natural language tasks. For instance, in document summarization, a model with a longer context length can analyze and condense entire articles or reports, capturing the main ideas and key details more effectively. Similarly, in tasks involving multi-step reasoning, such as answering complex questions or solving mathematical problems, a longer context allows the model to maintain a more comprehensive understanding of the problem and its intermediate steps.

However, extending context length is not a trivial task. It requires advanced techniques and optimizations to efficiently process and store the increased amount of information. This is where NVIDIA’s research team has made significant strides, pushing the boundaries of what is possible with open-source language models.

NVIDIA’s Innovative Approach to Extending Context Length

To extend Llama-3’s context length from 8K to 128K tokens, NVIDIA’s research team employed a combination of innovative techniques. One key step was processing the SlimPajama dataset into a continued pre-training corpus of 128K-token-long sequences, totaling 100 billion tokens. This corpus served as the foundation for training the extended model.

Another crucial optimization was adjusting the base frequency of the RoPE (Rotary Position Embedding) mechanism from 500K to 150M. RoPE encodes the relative position of tokens within the input sequence by rotating pairs of embedding dimensions at position-dependent angles. Raising the base slows those rotations, so positional information stays distinguishable across much longer spans, which enabled the model to handle 128K-token inputs more effectively.
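
The intuition can be checked directly from the RoPE formula: each pair of embedding dimensions rotates at an angular frequency of base^(-2i/d), so a larger base yields slower rotations. The NumPy sketch below is only an illustration of that effect, not NVIDIA’s implementation; the head dimension is an assumed typical value.

```python
import numpy as np

def rope_frequencies(head_dim: int, base: float) -> np.ndarray:
    """Per-dimension rotation frequencies used by Rotary Position Embedding."""
    return base ** (-np.arange(0, head_dim, 2) / head_dim)

head_dim = 128  # assumed typical Llama attention head size

for base in (500_000.0, 150_000_000.0):  # original vs. extended base frequency
    freqs = rope_frequencies(head_dim, base)
    # The slowest-rotating dimension bounds how far apart two positions can be
    # while still producing distinct rotations; its period grows with the base.
    slowest_period = 2 * np.pi / freqs[-1]
    print(f"base={base:>13,.0f}  slowest rotation period ~ {slowest_period:,.0f} positions")
```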

In the post-training stage, NVIDIA introduced a three-stage instruction fine-tuning process to further enhance the model’s performance. This process involved fine-tuning the model using high-quality instruction-following datasets, conversational QA datasets, and long-context datasets. By exposing the model to diverse types of tasks and data, the researchers aimed to improve its instruction-following ability, retrieval-augmented generation (RAG) performance, and long-context understanding.

The team also explored the combination of long-context retrievers with long-context models. By using the E5-mistral embedding model as the retriever and optimizing the chunk size, they were able to achieve better results in retrieving relevant information from large amounts of text.

Paper Title: ChatQA 2: Bridging the Gap to Proprietary LLMs in Long Context and RAG Capabilities

Paper Link: https://arxiv.org/pdf/2407.14482

Why Extend the Context Length of Large Language Models?

Longer contexts come at a price: the longer a large language model’s context, the more compute and memory it consumes during training and inference, and extending a model’s context window is itself a time-consuming, labor-intensive effort. Many readers may therefore wonder why it is worth extending the context length of large models at all.

Extending the context length has the following advantages:

  1. Enhanced long-text understanding: Longer context allows models to process and understand longer documents, conversations, and code segments, which is crucial for tasks like document summarization and long dialogue analysis.
  2. Improved multi-step reasoning: Long context enables models to retain more information in a single inference, helping solve complex multi-step problems such as mathematical proofs or intricate logical reasoning tasks.
  3. Increased coherence in generated content: For long-text generation tasks, longer context helps models maintain better topic consistency and logical coherence.
  4. Reduced information loss: Short-context models need to split and process long texts multiple times, which can lead to information loss. Long context can mitigate this issue.

In short, extending the context length equips large models to handle complex, information-dense tasks far more effectively.

However, there is a significant gap in context length between open-source and closed-source models. For example, the open-source Llama-3 only supports a context length of 8K, while the closed-source GPT-4 Turbo has already reached 128K.

To bridge this gap, the NVIDIA research team used the open-source Llama-3 model as a foundation and employed a series of innovative techniques to extend its context length from 8K to 128K tokens, a 16-fold increase.

The researchers named the extended model Llama3-ChatQA-2-70B, which reached the level of GPT-4 in long-context understanding and even surpassed GPT-4 in certain tasks.

In addition, the research team explored the combination of long-context models and retrieval-augmented generation (RAG) techniques, providing more flexible options for different application scenarios.

How to Extend Model Context Length?

The NVIDIA team employed a series of innovative techniques to extend Llama-3’s context length.

The research team first continued pre-training the model. To improve pre-training quality, they sampled the SlimPajama dataset and generated a total of 100 billion tokens of 128K-length training data.

To accommodate the longer context, the researchers increased the base frequency of RoPE from 500K to 150M.

Through their experiments, they also found that using a special character such as "<s>" to separate different documents during continued pre-training was more effective than using the reserved beginning-of-sequence and end-of-sequence tokens (<BOS> and <EOS>).
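
Taken together, the long-sequence packing and the document separator can be sketched roughly as follows. This is a simplified illustration: the actual SlimPajama sampling, tokenizer, and separator handling in NVIDIA’s pipeline may differ.

```python
from typing import Iterable, Iterator

SEQ_LEN = 131_072       # target 128K-token training sequences
SEPARATOR_TEXT = "<s>"  # document separator character, per the ablation described above

def pack_documents(docs: Iterable[str], tokenizer) -> Iterator[list[int]]:
    """Concatenate tokenized documents, separated by a special character,
    and emit fixed-length 128K-token training sequences."""
    sep_ids = tokenizer(SEPARATOR_TEXT, add_special_tokens=False)["input_ids"]
    buffer: list[int] = []
    for doc in docs:
        buffer += tokenizer(doc, add_special_tokens=False)["input_ids"] + sep_ids
        while len(buffer) >= SEQ_LEN:
            yield buffer[:SEQ_LEN]
            buffer = buffer[SEQ_LEN:]
```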

In the post-training stage, the research team designed a three-stage instruction fine-tuning process (a minimal training sketch follows the list below):

  1. Fine-tuning the model using high-quality instruction-following datasets.
  2. Fine-tuning the model using conversational QA datasets.
  3. Fine-tuning the model on long-context datasets, covering both sequences under 32K tokens and the 32K-128K range.
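
A rough sketch of how such a staged schedule could be wired up with the Hugging Face Trainer is shown below. The datasets are passed in as already-tokenized stage mixtures, and the hyperparameters are placeholders rather than the values used in the paper.

```python
from transformers import Trainer, TrainingArguments

def three_stage_sft(model, stage_datasets, data_collator):
    """Run sequential fine-tuning stages; each stage starts from the previous stage's weights.

    stage_datasets: ordered (name, tokenized dataset) pairs, e.g. instruction-following,
    conversational QA, then long-context data, mirroring the three stages listed above.
    """
    for stage_name, dataset in stage_datasets:
        args = TrainingArguments(
            output_dir=f"checkpoints/{stage_name}",
            num_train_epochs=1,              # placeholder schedule, not the paper's setting
            per_device_train_batch_size=1,
            bf16=True,
        )
        Trainer(model=model, args=args, train_dataset=dataset,
                data_collator=data_collator).train()
    return model
```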

To further enhance the model’s performance in practical applications, the team explored combining long-context retrievers with long-context models. They used the E5-mistral embedding model as the retriever and discovered that using larger chunk sizes while keeping the total token count fixed yielded better results.
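
A simplified version of that retrieval setup is sketched below using the sentence-transformers wrapper around the E5-mistral embedding model. The chunk size, token budget, and query prompt are illustrative of the "fewer, larger chunks" finding, not the paper's exact configuration.

```python
from sentence_transformers import SentenceTransformer, util

retriever = SentenceTransformer("intfloat/e5-mistral-7b-instruct")  # E5-mistral retriever

TOKEN_BUDGET = 3_000  # illustrative fixed budget handed to the long-context model
CHUNK_SIZE = 1_200    # larger chunks, fewer of them -- the setting found to work better

def retrieve_context(query: str, document: str) -> str:
    # Split the document into word-based chunks (a rough stand-in for token-based chunking).
    words = document.split()
    chunks = [" ".join(words[i:i + CHUNK_SIZE]) for i in range(0, len(words), CHUNK_SIZE)]

    # Embed the query and chunks, then rank chunks by cosine similarity.
    # The instruction prompt is an assumption; E5-mistral expects an instructed query.
    query_emb = retriever.encode(query, prompt="Instruct: Retrieve relevant passages.\nQuery: ")
    chunk_embs = retriever.encode(chunks)
    scores = util.cos_sim(query_emb, chunk_embs)[0]

    # Keep the top-ranked chunks up to the fixed token budget.
    top_k = max(1, TOKEN_BUDGET // CHUNK_SIZE)
    best = scores.argsort(descending=True)[:top_k]
    return "\n\n".join(chunks[int(i)] for i in best)
```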

Through these techniques, NVIDIA extended Llama-3’s context length from 8K to 128K, closing the context-length gap between open-source and closed-source models. Moreover, the extended Llama3-ChatQA-2-70B even surpassed GPT-4 in long-context understanding on certain tasks.

Experimental Results

The NVIDIA team designed a comprehensive set of experiments to evaluate the performance of the Llama3-ChatQA-2-70B model. These experiments covered tasks with varying context lengths, from short to ultra-long texts, and compared the model with several top-performing models.

First, in the “needle in a haystack” test, Llama3-ChatQA-2-70B achieved 100% accuracy within a 128K token length, demonstrating its outstanding long-context retrieval capabilities.

For tasks with context lengths exceeding 100K tokens, the team used the InfiniteBench benchmark, testing on four tasks: long-text summarization (En.Sum), long-text question answering (En.QA), long-text multiple-choice (En.MC), and long-text dialogue (En.Dia).

Llama3-ChatQA-2-70B’s average score was 34.11, outperforming GPT-4-Turbo-2024-04-09 (33.16) and Claude 2 (33.96), and only slightly lower than Qwen2-72B-Instruct (34.88). Notably, in the En.QA task, Llama3-ChatQA-2-70B led with a score of 44.22.

In addition, the research team tested on medium-length context tasks within 32K tokens. Llama3-ChatQA-2-70B’s average score was 47.37; although this was lower than GPT-4-Turbo-2024-04-09 (51.93) and Qwen2-72B-Instruct (49.94), it still outperformed Llama-3-70B-Instruct-Gradient-262k (40.51).

For short-text tasks within 4K tokens, the team used ChatRAG Bench. Llama3-ChatQA-2-70B surpassed GPT-4-Turbo-2024-04-09 and Qwen2-72B-Instruct.

The team also compared the effects of retrieval-augmented generation (RAG) and directly using long-context models. For tasks within 32K tokens, directly using long-context models slightly outperformed the RAG method.

However, for tasks exceeding 100K tokens, the RAG method outperformed directly using long-context models.
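
In practice, this comparison suggests a simple routing rule between the two modes. The sketch below is a rough illustration: the 32K threshold is taken from the comparison above rather than a tuned value, and the model and RAG pipeline objects are assumed to be supplied by the caller.

```python
def answer(query: str, document: str, tokenizer, long_context_llm, rag_pipeline,
           threshold_tokens: int = 32_000) -> str:
    """Route between full long-context prompting and RAG, following the comparison above:
    the long-context model alone did slightly better up to ~32K tokens, while RAG
    worked better beyond 100K tokens."""
    n_tokens = len(tokenizer(document, add_special_tokens=False)["input_ids"])
    if n_tokens <= threshold_tokens:
        # Short and medium contexts: hand the full document to the long-context model.
        return long_context_llm(f"{document}\n\nQuestion: {query}")
    # Very long contexts: retrieve relevant chunks first, then generate.
    return rag_pipeline(query, document)
```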

Conclusion

Long context plays a crucial role in enhancing the understanding capabilities of large language models. NVIDIA combined multiple techniques to extend Llama-3’s context length from 8K to 128K, bridging the gap with closed-source models in terms of context length.

The extended model, Llama3-ChatQA-2-70B, surpassed closed-source models like GPT-4 in long-context understanding tasks. The research also revealed the advantages of RAG techniques in specific scenarios, providing more flexible options for different applications.
