Apple's New AI Models Beat Mistral and Hugging Face
As excitement builds around the new GPT-4o mini, Apple has expanded its own lineup of small models. The Apple research team, working as part of the DataComp for Language Models (DCLM) project, recently unveiled a family of open DCLM models on Hugging Face.
Model Details and Performance
This release includes two main models: one with 7 billion parameters and another with 1.4 billion parameters. Both models perform well on benchmarks, particularly the larger one, which has outperformed Mistral-7B and is approaching other leading open models, such as Llama 3 and Gemma. Vaishaal Shankar from the Apple ML team described these as the “best-performing” open-source models available. Notably, the project is truly open source, with the release of model weights, training code, and the pretraining dataset.
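Because the weights are published on Hugging Face, one quick way to experiment is to pull them with the transformers library. The snippet below is only a minimal sketch: the repo id "apple/DCLM-7B" is assumed from the release described above, and the checkpoint may require additional integration code beyond the plain AutoModel path shown here.

```python
# Minimal sketch of loading the released 7B base model from the Hugging Face Hub
# with the transformers library. The repo id is assumed from the release above,
# and the actual checkpoint may need extra integration code beyond this path.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "apple/DCLM-7B"  # assumed repo id for the 7B base model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "Dataset curation matters because"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```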
DataComp Project Overview
The DataComp project, led by a multidisciplinary team of researchers from Apple, the University of Washington, Tel Aviv University, and the Toyota Research Institute, aims to design high-quality datasets for training AI models, especially in the multimodal domain. The project uses a standardized framework, with fixed model architectures, training code, hyperparameters, and evaluations, to test different data curation strategies for training highly performant models.
Data Curation and Training
The team found that model-based filtering, in which machine learning models automatically select high-quality data from larger datasets, is crucial for assembling a high-quality training set. To demonstrate this, they used the resulting curated dataset, DCLM-Baseline, to train the new DCLM decoder-only transformer English language models, with 7 billion and 1.4 billion parameters, from scratch.
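The paper's exact filtering pipeline uses its own classifiers and thresholds; the sketch below only illustrates the general idea of model-based filtering, assuming a hypothetical pretrained fastText quality classifier (the file name `quality_classifier.bin`, the label `__label__high_quality`, and the keep rate are all illustrative, not the DCLM-Baseline configuration).

```python
# Minimal sketch of model-based filtering: score each document with a
# pretrained quality classifier and keep only the highest-scoring ones.
# The classifier file, label name, and keep rate are illustrative assumptions.
import fasttext  # pip install fasttext

classifier = fasttext.load_model("quality_classifier.bin")  # hypothetical model file

def quality_score(document: str) -> float:
    """Return the classifier's probability that a document is 'high quality'."""
    labels, probs = classifier.predict(document.replace("\n", " "), k=2)
    scores = dict(zip(labels, probs))
    return scores.get("__label__high_quality", 0.0)  # assumed label name

def filter_corpus(documents: list[str], keep_fraction: float = 0.1) -> list[str]:
    """Keep the top `keep_fraction` of documents ranked by quality score."""
    ranked = sorted(documents, key=quality_score, reverse=True)
    return ranked[: max(1, int(len(ranked) * keep_fraction))]

if __name__ == "__main__":
    raw_docs = ["An explanation of gradient descent ...", "buy cheap pills now!!!"]
    print(filter_corpus(raw_docs, keep_fraction=0.5))
```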
Performance Benchmarks
The 7B model, trained on 2.5 trillion tokens using pretraining recipes based on the OpenLM framework, has a 2K context window and delivers 63.7% 5-shot accuracy on MMLU. This represents a 6.6 percentage point improvement compared to MAP-Neo, the previous state-of-the-art in the open-data language model category, while using 40% less compute for training. Its MMLU performance is comparable to leading open models like Mistral-7B-v0.3 (62.7%), Llama3 8B (66.2%), Google’s Gemma (64.3%), and Microsoft’s Phi-3 (69.9%).
Extended Training Results
The model’s performance on the Core and Extended benchmarks improved further when the researchers extended its context length to 8K by performing an additional 100 billion tokens of training using the Dataset Decomposition technique, though the MMLU result remained unchanged. “Our results highlight the importance of dataset design for training language models and offer a starting point for further research on data curation,” the researchers stated in a paper detailing their work on DataComp-LM.
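At a high level, Dataset Decomposition splits each tokenized document into chunks whose lengths are powers of two and buckets the chunks by length, so that training batches contain same-length sequences. The sketch below illustrates only that decomposition step with a simplified greedy split; the bucket handling and curriculum scheduling in the published method are more involved.

```python
# Illustrative sketch of the document-splitting step behind Dataset
# Decomposition: break each tokenized document into power-of-two-length
# chunks and group the chunks into length buckets. The greedy split and
# bucket sizes here are simplifications of the published method.
from collections import defaultdict

def decompose(tokens: list[int], max_len: int = 8192) -> list[list[int]]:
    """Greedily split a token sequence into power-of-two-length chunks."""
    chunks = []
    i = 0
    while i < len(tokens):
        remaining = len(tokens) - i
        size = min(max_len, 1 << (remaining.bit_length() - 1))  # largest power of two <= remaining
        chunks.append(tokens[i : i + size])
        i += size
    return chunks

def bucket_corpus(docs: list[list[int]], max_len: int = 8192) -> dict[int, list[list[int]]]:
    """Group all chunks from all documents by their (power-of-two) length."""
    buckets: dict[int, list[list[int]]] = defaultdict(list)
    for doc in docs:
        for chunk in decompose(doc, max_len):
            buckets[len(chunk)].append(chunk)
    return buckets

if __name__ == "__main__":
    fake_doc = list(range(5000))              # stand-in for a tokenized document
    sizes = sorted(bucket_corpus([fake_doc]).keys())
    print(sizes)                              # [8, 128, 256, 512, 4096]
```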
Smaller Model Performance
Similarly, the smaller 1.4B version, trained jointly with Toyota Research Institute on 2.6 trillion tokens, also performs impressively across MMLU, Core, and Extended tests. In the 5-shot MMLU test, it scored 41.9%, significantly higher than other models in its category, including Hugging Face’s recently released SmolLM. The 1.7B version of SmolLM scored 39.97%, while Qwen-1.5B and Phi-1.5B scored 37.87% and 35.90%, respectively.
Availability and Licensing
The larger model is available under Apple’s Sample Code License, while the smaller one is released under Apache 2.0, which allows commercial use, distribution, and modification. An instruction-tuned version of the 7B model is also available on Hugging Face. It's important to note that this is early research focused on the effectiveness of data curation: the models are not designed for Apple devices and may exhibit biases or produce harmful responses stemming from their training data.