DeepMind has unveiled a new technique called JEST (Joint Example Selection) that is poised to significantly alleviate the computational burden of training large AI models. JEST has been shown to reduce the number of training iterations required by up to 13x while cutting the compute typically needed by roughly 90%.
The Importance of High-Quality Training Data
When pre-training models at a large scale, the quality of the training data plays a crucial role in the model’s ultimate performance. Across language, vision, and multimodal models, the use of high-quality datasets can substantially reduce the amount of data needed and dramatically improve the model’s performance.
However, training-data pipelines have traditionally relied on manual curation and filtering, which is expensive and hard to scale. JEST addresses this challenge by selecting entire batches of training data jointly, rather than filtering examples one at a time.
How JEST Works
To determine which data is most worth learning from, JEST employs two models:
- The model currently being trained
- An already trained reference model
Data that is hard for the model being trained but easy for the reference model is considered high quality. JEST scores entire batches of candidate data and samples from them based on those scores, using three scoring criteria:
- Hard learner: the example has a high loss under the model currently being trained
- Easy reference: the example has a low loss under the pretrained reference model
- Learnability: the combination of the two
The learnability score, the difference between the learner's loss and the reference model's loss, prioritizes data that has not yet been learned but is readily learnable, which is what accelerates large-scale training.
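As a rough illustration of these criteria, the following Python sketch (not DeepMind's code; the function names and toy loss values are assumptions) computes the three scores from per-example losses produced by the learner and the reference model:

```python
# Minimal sketch of the three scoring heuristics, assuming per-example losses
# (e.g., contrastive losses) have already been computed by the two models.
# Names and values are illustrative, not taken from the JEST codebase.
import numpy as np

def score_examples(learner_loss: np.ndarray, reference_loss: np.ndarray) -> dict:
    """Return the three JEST-style scores for a set of candidate examples."""
    return {
        "hard_learner": learner_loss,                    # high loss under the current learner
        "easy_reference": -reference_loss,               # low loss under the pretrained reference
        "learnability": learner_loss - reference_loss,   # hard for the learner AND easy for the reference
    }

# Toy usage: the first example is still hard for the learner but easy for the
# reference, so it receives the highest learnability score.
learner_loss = np.array([2.5, 0.3, 2.4, 1.0])
reference_loss = np.array([0.2, 0.1, 2.3, 0.9])
print(score_examples(learner_loss, reference_loss)["learnability"])  # ~[2.3, 0.2, 0.1, 0.1]
```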
Impressive Results
To offset the increased computational overhead of scoring large “super-batches,” the DeepMind team also introduced a more efficient variant called Flexi-JEST.
To reach the same final model performance, JEST needs up to 13x fewer training iterations than the prior state-of-the-art baseline, SigLIP. The Flexi-JEST variant matches SigLIP's score using only about 10% of the training compute.
JEST produces more learnable batches: the learnability of a batch is highly structured and non-diagonal, meaning interactions between examples matter, not just their individual scores. JEST's sequential selection procedure, inspired by blocked Gibbs sampling, discovers highly learnable sub-batches in only a few iterations.
The learnability of the selected sub-batches increases with the filtering ratio, i.e., when selecting from larger super-batches. Training on the most learnable sub-batches that JEST selects from these super-batches significantly accelerates multimodal learning.
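The sketch below shows one way such a joint, sequential selection could look in Python. It is a simplified approximation, not DeepMind's implementation: the learnability matrix `A` is assumed to hold pairwise (learner minus reference) contrastive losses over a super-batch, and examples are added chunk by chunk with probability proportional to their conditional learnability given what has already been picked.

```python
# Simplified, illustrative sketch of joint example selection from a super-batch.
# `A` is an (N, N) learnability matrix over the super-batch: entry (i, j) is the
# learner's pairwise contrastive loss minus the reference model's for pair (i, j).
# The sequential, chunked sampling below is a blocked-Gibbs-style heuristic and
# only approximates the procedure described in the paper.
import numpy as np

def joint_example_selection(A: np.ndarray, batch_size: int, n_chunks: int = 16,
                            seed: int = 0) -> np.ndarray:
    """Pick `batch_size` indices from a super-batch of size A.shape[0]."""
    rng = np.random.default_rng(seed)
    n_super = A.shape[0]
    chunk = batch_size // n_chunks
    selected: list[int] = []
    for _ in range(n_chunks):
        remaining = np.setdiff1d(np.arange(n_super), selected)
        # Conditional learnability: an example's own score plus its interaction
        # with the examples already in the sub-batch.
        scores = A[remaining, remaining]
        if selected:
            scores = scores + A[np.ix_(remaining, selected)].sum(axis=1) \
                            + A[np.ix_(selected, remaining)].sum(axis=0)
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()
        picked = rng.choice(remaining, size=chunk, replace=False, p=probs)
        selected.extend(int(i) for i in picked)
    return np.asarray(selected)

# Toy usage with a random learnability matrix: select 256 examples from a
# super-batch of 1024 (a 4x filtering ratio).
A = np.random.default_rng(1).normal(size=(1024, 1024))
sub_batch = joint_example_selection(A, batch_size=256)
```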
Further findings from the experiments include:
- Jointly prioritizing whole batches of learnable data produces better results than prioritizing individual samples independently.
- Multi-resolution training is crucial for enhancing JEST's performance.
- Because the reference model is fixed, its scores can be cached with the dataset, roughly halving the substantial cost JEST incurs when scoring large super-batches at every iteration (see the sketch below).
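Since the reference model never changes during training, its losses can be computed once, offline, and stored with the data. The sketch below illustrates the idea under that assumption (the function and field names are hypothetical, not the paper's API): per-iteration scoring then needs only a learner forward pass plus a cache lookup.

```python
# Hedged sketch of reference-score caching. Because the reference model is
# frozen, its per-example losses are computed in a single offline pass and
# stored by example id; at training time only the learner is evaluated on the
# super-batch. All names here are illustrative.
import numpy as np

def precompute_reference_scores(dataset, reference_loss_fn) -> dict:
    """One offline pass: map example id -> reference-model loss."""
    return {example["id"]: float(reference_loss_fn(example)) for example in dataset}

def learnability_with_cache(super_batch, learner_loss_fn, ref_cache: dict) -> np.ndarray:
    """Per-iteration scoring: one learner pass plus cache lookups (no reference pass)."""
    learner_loss = np.array([learner_loss_fn(example) for example in super_batch])
    reference_loss = np.array([ref_cache[example["id"]] for example in super_batch])
    return learner_loss - reference_loss
```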
Outperforming Existing Techniques
Compared with existing data-curation techniques, JEST enables models to reach superior performance with the least amount of training. It achieves the best results on both image-to-text (I2T) and text-to-image (T2I) retrieval.
On COCO, where performance is reported as the average of image-to-text and text-to-image retrieval, JEST++ significantly outperforms existing techniques while requiring notably fewer training iterations.
Conclusion
Overall, the JEST method has demonstrated significant advantages in experiments. Compared to traditional independent sample selection methods, JEST not only improves training efficiency but also substantially reduces the required computational resources.
JEST provides a new, highly efficient approach to data curation for multimodal learning, and its ideas are likely to appear in future training pipelines.
For more details, check out the research paper: https://arxiv.org/abs/2406.17711