Inception Unveils Faster, Cheaper AI Model Using Diffusion Technology

Image Source: ChatGPT-4o
Palo Alto startup Inception has emerged from stealth with Mercury, a diffusion-based large language model (dLLM) designed for faster and more efficient AI processing. Developed by Stanford professor Stefano Ermon, Mercury leverages diffusion technology to improve text generation, with the company claiming speeds up to 10x faster and costs up to 10x lower than traditional LLMs.
A New Approach to AI Language Models
Unlike standard LLMs, which generate text sequentially, one token at a time, diffusion models start with a rough estimate of the entire output and refine it in parallel. This approach has been widely used for AI-generated images, video, and audio (e.g., Midjourney, OpenAI’s Sora), but Inception is applying it to text generation at scale for the first time.
Generation Process:
Traditional LLMs: Generate text sequentially, one token at a time, where each token depends on the previous ones. This sequential nature can lead to slower generation speeds.
dLLMs: Utilize a "coarse-to-fine" generation process, starting with random, noisy text and refining it over several steps. This allows for parallel processing, resulting in faster text generation (a toy sketch of this process follows the comparison below).
Reasoning and Error Correction:
Traditional LLMs: Once a token is generated, it's fixed, making real-time error correction challenging.
dLLMs: Can iteratively refine their outputs, enabling them to correct mistakes and reduce hallucinations during the generation process.
Performance and Efficiency:
Traditional LLMs: May require more computational resources due to their sequential processing nature.
dLLMs: Achieve higher speeds and efficiency by leveraging parallel processing, leading to reduced computational costs and faster outputs.
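To make the contrast concrete, here is a minimal, self-contained Python sketch of the coarse-to-fine idea. It is not Inception’s actual architecture: the fake_denoiser function, the fixed TARGET sentence, and the confidence threshold are placeholders standing in for a trained denoising network. The point is only that many positions of the draft get committed per refinement step, rather than one token at a time.

```python
import random

# Toy "document" the stand-in model happens to know; a real dLLM would
# predict per-position token distributions with a trained network.
TARGET = "diffusion models refine every position of the draft in parallel".split()
MASK = "[MASK]"

def fake_denoiser(draft):
    """Stand-in for a denoising model: propose a (token, confidence)
    pair for every position of the current draft."""
    proposals = []
    for i in range(len(draft)):
        if random.random() < 0.7:
            # The toy model is "confident" here and guesses the right word.
            proposals.append((TARGET[i], random.uniform(0.7, 1.0)))
        else:
            # This position is still noisy; low confidence.
            proposals.append(("???", random.uniform(0.0, 0.3)))
    return proposals

def diffusion_decode(length, steps=6, threshold=0.6):
    """Coarse-to-fine decoding: start from an all-[MASK] draft and, at each
    step, commit every still-masked position whose proposal is confident
    enough -- several tokens land in parallel per step."""
    draft = [MASK] * length
    for step in range(1, steps + 1):
        for i, (token, confidence) in enumerate(fake_denoiser(draft)):
            if draft[i] == MASK and confidence >= threshold:
                draft[i] = token
        print(f"step {step}: {' '.join(draft)}")
    return draft

if __name__ == "__main__":
    random.seed(1)
    diffusion_decode(len(TARGET))
```

Running the sketch prints the draft after each step: most positions fill in within the first couple of passes, which is the behavior that lets a dLLM spread work across a GPU instead of waiting on one token at a time.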
Ermon’s years of research at Stanford led to a breakthrough in applying diffusion to text generation, producing what Inception calls a diffusion-based large language model (dLLM).
Mercury Coder – High-Speed Code Generation
Mercury Coder, Inception’s diffusion-based AI model for coding, is designed to deliver exceptional speed and accuracy for software development tasks. The model achieves high performance across industry-standard coding benchmarks while running at significantly higher speeds than competing AI models.
Key Highlights:
Lightning-Fast Throughput – Mercury Coder Mini generates 1,109 tokens per second, while Mercury Coder Small reaches 737 tokens per second—far surpassing models like GPT-4o Mini and Gemini 2.0 Flash-Lite.
Competitive Accuracy – The model scores competitively on coding benchmarks, including HumanEval, MBPP, and EvalPlus, making it a strong contender against leading AI coding models.
Optimized for Developers – Available for testing in a playground environment, Mercury Coder is tailored for fast, reliable code generation across various programming tasks.
Independent evaluations, including those from Artificial Analysis, confirm Mercury Coder’s ability to outperform traditional models in both speed and efficiency.
Faster AI with Lower Costs and Improved Efficiency
Inception claims its dLLMs can run up to 10x faster and cost 10x less than traditional LLMs. The company has already secured customers, including unnamed Fortune 100 companies, looking for AI solutions with lower latency and higher speed.
“What we found is that our models can leverage the GPUs much more efficiently,” Ermon said. “I think this is a big deal. This is going to change the way people build language models.”
Performance Comparison:
Its "mini" model reportedly outperforms Meta’s Llama 3.1 8B, generating over 1,000 tokens per second.
Independent benchmarking by Artificial Analysis found Inception’s models to be 10x faster than leading speed-optimized models like GPT-4o mini and Claude 3.5 Haiku, achieving performance levels previously only possible with specialized hardware.
On Copilot Arena, a leading LLM performance leaderboard, developers have ranked Inception’s model ahead of frontier closed-source models, including GPT-4o.
Mercury runs at over 1,000 tokens per second on NVIDIA H100 GPUs, a speed previously achievable only with custom-built AI hardware (see the quick latency comparison below).
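For a rough sense of what those throughput numbers mean in practice, the back-of-the-envelope calculation below compares wall-clock time for a 1,000-token completion. The Mercury figure comes from the benchmarks cited above; the 60 tokens-per-second baseline is an illustrative assumption for a conventional speed-optimized LLM, not a number reported in this article.

```python
# Rough latency comparison for a 1,000-token completion.
# mercury_tps is the reported Mercury Coder Mini throughput above;
# baseline_tps is an assumed figure for a conventional autoregressive model.
completion_tokens = 1_000
mercury_tps = 1_109
baseline_tps = 60

print(f"Mercury Coder Mini: {completion_tokens / mercury_tps:.2f} s")   # ~0.90 s
print(f"Assumed baseline:   {completion_tokens / baseline_tps:.2f} s")  # ~16.67 s
```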
Business Model and Deployment Options
Inception offers its dLLM technology in multiple formats:
Mercury Coder, a code generation model, is available for testing in a playground environment.
API-based access for developers and businesses (a hypothetical request is sketched after this list).
On-premises and edge deployment for companies needing local AI processing.
Fine-tuning support to customize models for specific applications.
A suite of pre-trained dLLMs optimized for various use cases.
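The article does not document Inception’s API, so the snippet below is purely a hypothetical sketch of what API-based access to a hosted dLLM typically looks like: the endpoint URL, model name, payload fields, and response field are placeholders, not Inception’s real schema.

```python
import json
import urllib.request

# Placeholder endpoint and credentials -- not Inception's real API.
API_URL = "https://api.example-inception-host.com/v1/generate"
API_KEY = "YOUR_API_KEY"

# Hypothetical request payload for a code-generation prompt.
payload = {
    "model": "mercury-coder-small",   # placeholder model name
    "prompt": "Write a Python function that reverses a linked list.",
    "max_tokens": 256,
}

req = urllib.request.Request(
    API_URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    },
)
with urllib.request.urlopen(req) as resp:
    # "text" is also a placeholder field name for the generated output.
    print(json.loads(resp.read())["text"])
```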
While Ermon has declined to discuss funding, TechCrunch reports that Mayfield Fund has invested in the company.
What This Means
If Inception’s claims hold up, diffusion-based large language models (dLLMs) could revolutionize AI text generation, offering significant speed and cost advantages over traditional LLMs. This could mean:
Faster AI-powered applications with lower computing demands.
More efficient GPU usage, reducing infrastructure costs for companies.
Potential competition with AI leaders like OpenAI and Meta.
Looking ahead, Mercury Coder is just the first in a series of upcoming dLLMs. Inception is currently testing a chat-focused model in closed beta, and future diffusion-based models could unlock new AI capabilities, including:
Improved AI agents – Faster and more efficient AI assistants for complex planning tasks.
Advanced reasoning – Real-time error correction to reduce hallucinations and improve AI-generated answers.
Controllable generation – The ability to edit outputs, infill missing text, and align results with user goals.
Edge applications – dLLMs' efficiency makes them ideal for running AI on personal devices like phones and laptops, expanding AI accessibility.
For everyday users, this could lead to quicker, more responsive AI assistants and lower costs for AI-driven services.
Editor’s Note: This article was created by Alicia Shapiro, CMO of AiNews.com, with writing, image, and idea-generation support from ChatGPT, an AI assistant. However, the final perspective and editorial choices are solely Alicia Shapiro’s. Special thanks to ChatGPT for assistance with research and editorial support in crafting this article.