Apple's New AI Models Beat Mistral and Hugging Face

Image: Apple's 7B and 1.4B DCLM models, with benchmark comparisons against competitors such as Mistral, Llama 3, and Gemma.

As excitement builds around the new GPT-4o mini, Apple has expanded its own lineup of small models. The Apple research team, working as part of the DataComp for Language Models (DCLM) project, recently unveiled a family of open DCLM models on Hugging Face.

Model Details and Performance

This release includes two main models: one with 7 billion parameters and another with 1.4 billion parameters. Both models perform well on benchmarks, particularly the larger one, which has outperformed Mistral-7B and is approaching other leading open models, such as Llama 3 and Gemma. Vaishaal Shankar from the Apple ML team described these as the “best-performing” open-source models available. Notably, the project is truly open source, with the release of model weights, training code, and the pretraining dataset.
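
For readers who want to try the released weights, a minimal loading sketch using the Hugging Face transformers library is shown below. The repository name apple/DCLM-7B and the generation settings are assumptions, and the official model card may require additional dependencies (such as the OpenLM package) before the checkpoint loads.

```python
# Minimal sketch of loading the released 7B weights with Hugging Face transformers.
# The repo id "apple/DCLM-7B" is an assumption; check the official model card, which
# may require extra dependencies (e.g. the OpenLM package) for the checkpoint to load.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "apple/DCLM-7B"  # assumed repository name

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

inputs = tokenizer("Machine learning is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```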

DataComp Project Overview

The DataComp project, led by a multidisciplinary team of researchers from Apple, the University of Washington, Tel Aviv University, and the Toyota Research Institute, aims to design high-quality datasets for training AI models, especially in the multimodal domain. The project uses a standardized framework, with fixed model architectures, training code, hyperparameters, and evaluations, to test different data curation strategies for training highly performant models.
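
In code terms, the protocol looks roughly like the sketch below: the model architecture, training loop, and evaluation suite are frozen, and only the data curation function changes between runs. Every name and value here is a hypothetical placeholder rather than the project's actual tooling.

```python
# Rough sketch of the DataComp-LM protocol: hold the model, training code, and
# evaluation fixed, vary only the data curation step, and compare benchmark scores.
# Every function and value below is a hypothetical placeholder, not the real tooling.

FIXED_RECIPE = {"architecture": "decoder-only transformer", "context_length": 2048}

def curate_heuristic(pool):
    """Baseline curation: simple rule-based filtering."""
    return [doc for doc in pool if len(doc.split()) >= 4]

def curate_model_based(pool):
    """Model-based curation: keep documents a quality scorer likes (stubbed here)."""
    return [doc for doc in pool if "learning" in doc]

def train_and_evaluate(dataset, recipe):
    """Stub for 'train with the fixed recipe, then score on the fixed benchmarks'."""
    return {"core_score": round(len(dataset) / 10, 2)}  # placeholder metric

pool = [
    "an introduction to machine learning optimizers",
    "buy now limited offer",
    "lecture notes on learning theory and generalization",
]
for curate in (curate_heuristic, curate_model_based):
    print(curate.__name__, train_and_evaluate(curate(pool), FIXED_RECIPE))
```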

Data Curation and Training

The team discovered that model-based filtering, where machine learning models automatically select high-quality data from larger datasets, is crucial for assembling a high-quality training set. To demonstrate this, the resulting dataset, DCLM-Baseline, was used to train the new DCLM decoder-only transformer English language models with 7 billion and 1.4 billion parameters from scratch.
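
Model-based filtering generally means training a lightweight classifier to distinguish known high-quality text from ordinary web text and keeping only the documents it scores highly. The scikit-learn sketch below illustrates that idea; the classifier, features, and threshold used in the actual DCLM pipeline are not detailed here, so everything in the snippet is illustrative.

```python
# Illustrative sketch of model-based quality filtering: train a small classifier on
# examples of "high-quality" vs. "random web" text, then keep only the documents
# from a larger pool that it scores above a threshold. The classifier, features,
# and threshold are placeholders; the actual DCLM filtering setup differs in detail.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

high_quality = [
    "a clear explanation of gradient descent with worked examples",
    "a well-edited encyclopedia article about photosynthesis",
]
low_quality = [
    "click here buy now best deals !!!",
    "lorem ipsum placeholder text text text",
]

texts = high_quality + low_quality
labels = [1] * len(high_quality) + [0] * len(low_quality)

vectorizer = TfidfVectorizer()
classifier = LogisticRegression().fit(vectorizer.fit_transform(texts), labels)

web_pool = [
    "an introduction to linear algebra for programmers",
    "win a free prize click this link now",
]
scores = classifier.predict_proba(vectorizer.transform(web_pool))[:, 1]

threshold = 0.5  # in practice, more like a percentile cutoff over the whole pool
curated = [doc for doc, score in zip(web_pool, scores) if score >= threshold]
print(curated)
```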

Performance Benchmarks

The 7B model, trained on 2.5 trillion tokens using pretraining recipes based on the OpenLM framework, has a 2K context window and delivers 63.7% 5-shot accuracy on MMLU. This represents a 6.6 percentage point improvement over MAP-Neo, the previous state of the art in the open-data language model category, while using 40% less compute for training. Its MMLU performance is comparable to that of leading open models such as Mistral-7B-v0.3 (62.7%), Llama 3 8B (66.2%), Google’s Gemma (64.3%), and Microsoft’s Phi-3 (69.9%).
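
The "5-shot" in these MMLU figures means each test question is preceded by five worked examples in the prompt, and the model's predicted answer letter is scored for accuracy. The sketch below shows how such a prompt is typically assembled; the exact template used in the DCLM evaluations may differ.

```python
# Sketch of 5-shot multiple-choice prompting in the style of MMLU: five solved
# examples are prepended to the test question, and the model is asked to produce
# the answer letter. The template is illustrative, not the exact DCLM eval harness.

def format_example(question, choices, answer=None):
    lines = [question]
    lines += [f"{letter}. {choice}" for letter, choice in zip("ABCD", choices)]
    lines.append(f"Answer: {answer}" if answer else "Answer:")
    return "\n".join(lines)

few_shot = [  # in a real run these come from the benchmark's dev split
    ("What is 2 + 2?", ["3", "4", "5", "6"], "B"),
] * 5

test_question = ("Which planet is largest?", ["Earth", "Mars", "Jupiter", "Venus"])

prompt = "\n\n".join(
    [format_example(q, c, a) for q, c, a in few_shot]
    + [format_example(*test_question)]
)
print(prompt)  # the model's next answer letter (A/B/C/D) is compared to the key
```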

Extended Training Results

The model’s performance on the Core and Extended benchmarks improved further when the researchers extended its context length to 8K by performing an additional 100B tokens of training using the Dataset Decomposition technique, though the MMLU result remained unchanged. “Our results highlight the importance of dataset design for training language models and offer a starting point for further research on data curation,” the researchers stated in a paper detailing their work on DataComp-LM.
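
The sketch below assumes Dataset Decomposition works by splitting each tokenized document into chunks whose lengths are powers of two and grouping equal-length chunks into buckets, so that each training batch uses one uniform sequence length drawn from single documents. It covers only that decomposition step and is an assumption about the technique, not the team's actual implementation.

```python
# Assumed sketch of the Dataset Decomposition idea: break each tokenized document
# into power-of-two length chunks (following the binary representation of its
# length) and group equal-length chunks into buckets, so batches can be drawn from
# a single bucket at one uniform sequence length. Illustrative only.
from collections import defaultdict

def decompose(tokens, max_len=8192):
    """Split one document's tokens into power-of-two sized chunks."""
    chunks, start = [], 0
    remaining = len(tokens)
    while remaining > 0:
        size = min(1 << (remaining.bit_length() - 1), max_len)  # largest power of two that fits
        chunks.append(tokens[start:start + size])
        start += size
        remaining -= size
    return chunks

# Group chunks from many documents into equal-length buckets.
buckets = defaultdict(list)
documents = [list(range(10)), list(range(37)), list(range(2048))]  # toy token ids
for doc in documents:
    for chunk in decompose(doc):
        buckets[len(chunk)].append(chunk)

print({length: len(chunks) for length, chunks in sorted(buckets.items())})
```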

Smaller Model Performance

Similarly, the smaller 1.4B version, trained jointly with Toyota Research Institute on 2.6 trillion tokens, also performs impressively across MMLU, Core, and Extended tests. In the 5-shot MMLU test, it scored 41.9%, significantly higher than other models in its category, including Hugging Face’s recently released SmolLM. The 1.7B version of SmolLM scored 39.97%, while Qwen-1.5B and Phi-1.5B scored 37.87% and 35.90%, respectively.

Availability and Licensing

The larger model is available under Apple’s Sample Code License, while the smaller one is released under Apache 2.0, which allows commercial use, distribution, and modification. An instruction-tuned version of the 7B model is also available in the Hugging Face library. It's important to note that this is early research focused on the effectiveness of data curation: these models are not designed for Apple devices and may exhibit biases or produce harmful responses because of their training data.