Inception Unveils Faster, Cheaper AI Model Using Diffusion Technology

Image Source: ChatGPT-4o
Palo Alto startup Inception has emerged from stealth with Mercury, a diffusion-based large language model (dLLM) designed for faster and more efficient AI processing. Developed by Stanford professor Stefano Ermon, Mercury leverages diffusion technology to improve text generation, with the company claiming speeds up to 10x faster and costs up to 10x lower than traditional LLMs.
A New Approach to AI Language Models
Unlike standard LLMs, which generate text sequentially, one token at a time, diffusion models start with a rough estimate of the entire output and refine it in parallel. This approach has been widely used for AI-generated images, video, and audio (e.g., Midjourney, OpenAI’s Sora), but Inception is applying it to text generation at scale for the first time.
Generation Process:
Traditional LLMs: Generate text sequentially, one token at a time, where each token depends on the previous ones. This sequential nature can lead to slower generation speeds.
dLLMs: Utilize a "coarse-to-fine" generation process, starting with random, noisy text and refining it over several steps. This allows for parallel processing, resulting in faster text generation (a toy sketch of this process follows the comparison below).
Reasoning and Error Correction:
Traditional LLMs: Once a token is generated, it's fixed, making real-time error correction challenging.
dLLMs: Can iteratively refine their outputs, enabling them to correct mistakes and reduce hallucinations during the generation process.
Performance and Efficiency:
Traditional LLMs: May require more computational resources due to their sequential processing nature.
dLLMs: Achieve higher speeds and efficiency by leveraging parallel processing, leading to reduced computational costs and faster outputs.
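To make the contrast concrete, here is a minimal, self-contained Python sketch of the coarse-to-fine idea. It is not Inception’s actual architecture: the fake_denoiser function, the fixed TARGET sentence, and the confidence threshold are placeholders standing in for a trained denoising network. The point is only that many positions of the draft get committed per refinement step, rather than one token at a time.

```python
import random

# Toy "document" the stand-in model happens to know; a real dLLM would
# predict per-position token distributions with a trained network.
TARGET = "diffusion models refine every position of the draft in parallel".split()
MASK = "[MASK]"

def fake_denoiser(draft):
    """Stand-in for a denoising model: propose a (token, confidence)
    pair for every position of the current draft."""
    proposals = []
    for i in range(len(draft)):
        if random.random() < 0.7:
            # The toy model is "confident" here and guesses the right word.
            proposals.append((TARGET[i], random.uniform(0.7, 1.0)))
        else:
            # This position is still noisy; low confidence.
            proposals.append(("???", random.uniform(0.0, 0.3)))
    return proposals

def diffusion_decode(length, steps=6, threshold=0.6):
    """Coarse-to-fine decoding: start from an all-[MASK] draft and, at each
    step, commit every still-masked position whose proposal is confident
    enough -- several tokens land in parallel per step."""
    draft = [MASK] * length
    for step in range(1, steps + 1):
        for i, (token, confidence) in enumerate(fake_denoiser(draft)):
            if draft[i] == MASK and confidence >= threshold:
                draft[i] = token
        print(f"step {step}: {' '.join(draft)}")
    return draft

if __name__ == "__main__":
    random.seed(1)
    diffusion_decode(len(TARGET))
```

Running the sketch prints the draft after each step: most positions fill in within the first couple of passes, which is the behavior that lets a dLLM spread work across a GPU instead of waiting on one token at a time.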
Ermon’s years of research at Stanford led to a breakthrough in applying diffusion to text generation, producing what Inception calls a diffusion-based large language model (dLLM).
Mercury Coder – High-Speed Code Generation
Mercury Coder, Inception’s diffusion-based AI model for coding, is designed to deliver exceptional speed and accuracy for software development tasks. The model achieves high performance across industry-standard coding benchmarks while running at significantly higher speeds than competing AI models.
Key Highlights:
Lightning-Fast Throughput – Mercury Coder Mini generates 1,109 tokens per second, while Mercury Coder Small reaches 737 tokens per second—far surpassing models like GPT-4o Mini and Gemini 2.0 Flash-Lite.
Competitive Accuracy – The model scores competitively on coding benchmarks, including HumanEval, MBPP, and EvalPlus, making it a strong contender against leading AI coding models.
Optimized for Developers – Available for testing in a playground environment, Mercury Coder is tailored for fast, reliable code generation across various programming tasks.
Independent evaluations, including those from Artificial Analysis, confirm Mercury Coder’s ability to outperform traditional models in both speed and efficiency.
Faster AI with Lower Costs and Improved Efficiency
Inception claims its dLLMs can run up to 10x faster and cost 10x less than traditional LLMs. The company has already secured customers, including unnamed Fortune 100 companies, looking for AI solutions with lower latency and higher speed.
“What we found is that our models can leverage the GPUs much more efficiently,” Ermon said. “I think this is a big deal. This is going to change the way people build language models.”
Performance Comparison:
Its "mini" model reportedly outperforms Meta’s Llama 3.1 8B, generating over 1,000 tokens per second.
Independent benchmarking by Artificial Analysis found Inception’s models to be 10x faster than leading speed-optimized models like GPT-4o mini and Claude 3.5 Haiku, achieving performance levels previously only possible with specialized hardware.
On Copilot Arena, a leading LLM performance leaderboard, developers have ranked Inception’s model ahead of frontier closed-source models, including GPT-4o.
Mercury runs at over 1,000 tokens per second on NVIDIA H100 GPUs, a speed previously achievable only with custom-built AI hardware (see the quick latency comparison below).
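For a rough sense of what those throughput numbers mean in practice, the back-of-the-envelope calculation below compares wall-clock time for a 1,000-token completion. The Mercury figure comes from the benchmarks cited above; the 60 tokens-per-second baseline is an illustrative assumption for a conventional speed-optimized LLM, not a number reported in this article.

```python
# Rough latency comparison for a 1,000-token completion.
# mercury_tps is the reported Mercury Coder Mini throughput above;
# baseline_tps is an assumed figure for a conventional autoregressive model.
completion_tokens = 1_000
mercury_tps = 1_109
baseline_tps = 60

print(f"Mercury Coder Mini: {completion_tokens / mercury_tps:.2f} s")   # ~0.90 s
print(f"Assumed baseline:   {completion_tokens / baseline_tps:.2f} s")  # ~16.67 s
```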
Business Model and Deployment Options
Inception offers its dLLM technology in multiple formats:
Mercury Coder, a code generation model, is available for testing in a playground environment.
API-based access for developers and businesses (a hypothetical request is sketched after this list).
On-premises and edge deployment for companies needing local AI processing.
Fine-tuning support to customize models for specific applications.
A suite of pre-trained dLLMs optimized for various use cases.
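The article does not document Inception’s API, so the snippet below is purely a hypothetical sketch of what API-based access to a hosted dLLM typically looks like: the endpoint URL, model name, payload fields, and response field are placeholders, not Inception’s real schema.

```python
import json
import urllib.request

# Placeholder endpoint and credentials -- not Inception's real API.
API_URL = "https://api.example-inception-host.com/v1/generate"
API_KEY = "YOUR_API_KEY"

# Hypothetical request payload for a code-generation prompt.
payload = {
    "model": "mercury-coder-small",   # placeholder model name
    "prompt": "Write a Python function that reverses a linked list.",
    "max_tokens": 256,
}

req = urllib.request.Request(
    API_URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    },
)
with urllib.request.urlopen(req) as resp:
    # "text" is also a placeholder field name for the generated output.
    print(json.loads(resp.read())["text"])
```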
While Ermon has declined to discuss funding, TechCrunch reports that Mayfield Fund has invested in the company.
What This Means
If Inception’s claims hold up, diffusion-based large language models (dLLMs) could revolutionize AI text generation, offering significant speed and cost advantages over traditional LLMs. This could mean:
Faster AI-powered applications with lower computing demands.
More efficient GPU usage, reducing infrastructure costs for companies.
Potential competition with AI leaders like OpenAI and Meta.
Looking ahead, Mercury Coder is just the first in a series of upcoming dLLMs. Inception is currently testing a chat-focused model in closed beta, and future diffusion-based models could unlock new AI capabilities, including:
Improved AI agents – Faster and more efficient AI assistants for complex planning tasks.
Advanced reasoning – Real-time error correction to reduce hallucinations and improve AI-generated answers.
Controllable generation – The ability to edit outputs, infill missing text, and align results with user goals.
Edge applications – dLLMs' efficiency makes them ideal for running AI on personal devices like phones and laptops, expanding AI accessibility.
For everyday users, this could lead to quicker, more responsive AI assistants and lower costs for AI-driven services.
Editor’s Note: This article was created by Alicia Shapiro, CMO of AiNews.com, with writing, image, and idea-generation support from ChatGPT, an AI assistant. However, the final perspective and editorial choices are solely Alicia Shapiro’s. Special thanks to ChatGPT for assistance with research and editorial support in crafting this article.