DeepSeek-V3 Runs Locally at 20 Tokens/Second, Challenging OpenAI

Image Source: ChatGPT-4o
Chinese AI startup DeepSeek has quietly released its latest large language model, DeepSeek-V3-0324, sending shockwaves through the AI industry. The 641GB model, now live on Hugging Face under an MIT license that permits commercial use, can run locally at over 20 tokens per second on Apple’s high-end Mac Studio with the M3 Ultra chip, a significant shift from the GPU-heavy server infrastructure typically required for models like ChatGPT or Claude.
“The new DeepSeek-V3-0324 in 4-bit runs at > 20 tokens/second on a 512GB M3 Ultra with mlx-lm!” — Awni Hannun, AI Researcher
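For those with the hardware to try it, a minimal run with mlx-lm might look like the sketch below. The Hugging Face repo id is an assumption here; check the mlx-community organization on Hugging Face for the actual 4-bit conversion.

```python
# Minimal sketch: running a 4-bit DeepSeek-V3-0324 build on Apple silicon with mlx-lm.
# Install first:  pip install mlx-lm
# NOTE: the repo id below is an assumption, not a confirmed path.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/DeepSeek-V3-0324-4bit")  # ~352GB of weights

messages = [{"role": "user", "content": "Summarize mixture-of-experts in two sentences."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# verbose=True streams tokens and reports tokens/second, the figure quoted above.
print(generate(model, tokenizer, prompt=prompt, max_tokens=256, verbose=True))
```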
Local AI Performance Redefines Expectations
While the $9,499 Mac Studio may stretch the label of "consumer-grade," the ability to run such a massive model locally is a major leap. In its 4-bit quantized form, the model shrinks to 352GB, allowing for efficient deployment on high-end personal machines.
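The arithmetic behind those figures is straightforward. Assuming the release stores roughly one byte per parameter (DeepSeek ships V3 weights largely in FP8) and half a byte after 4-bit quantization, a back-of-envelope check lands close to both numbers:

```python
# Rough size check; both bytes-per-parameter figures are approximations.
total_params = 685e9                      # parameter count listed on Hugging Face

fp8_gib  = total_params * 1.0 / 2**30     # ~1 byte/param for the FP8 release
int4_gib = total_params * 0.5 / 2**30     # ~0.5 byte/param after 4-bit quantization

print(f"FP8 release: ~{fp8_gib:.0f} GiB")   # ~638 GiB, near the quoted 641GB
print(f"4-bit build: ~{int4_gib:.0f} GiB")  # ~319 GiB; quantization scales and
                                            # higher-precision layers push it toward 352GB
```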
This contrasts sharply with traditional server-based deployments that require extensive GPU resources: the Mac Studio draws under 200 watts during inference, versus the multiple kilowatts consumed by standard AI infrastructure, making local operation far more energy-efficient.
No Hype, Just Performance
DeepSeek released V3-0324 without a whitepaper or launch campaign—just a README and model weights. This low-key strategy deviates from Western AI firms' polished marketing rollouts.
Still, early testers report impressive gains over the previous version:
“Tested the new DeepSeek V3 on my internal bench and it has a huge jump in all metrics on all tests. It is now the best non-reasoning model, dethroning Sonnet 3.5.” — Xeophon, AI Researcher
Unlike Sonnet, which sits behind a subscription paywall, DeepSeek-V3-0324’s weights are freely available for anyone to download and use.
Inside the Model: Smarter, Faster, Leaner
DeepSeek-V3-0324 uses a mixture-of-experts (MoE) architecture, activating only 37 billion of its 685 billion parameters per token, unlike dense models, which engage their full parameter count on every pass. By routing each token to only the most relevant “expert” parameters, the model delivers performance on par with fully activated models at a fraction of the computational cost.
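To make the routing idea concrete, here is a toy top-k MoE layer in NumPy. It illustrates the general technique, not DeepSeek’s implementation; the dimensions, expert count, and k are made-up values.

```python
# Toy top-k mixture-of-experts routing; illustrative only, not DeepSeek's code.
import numpy as np

def moe_layer(x, experts, router, k=2):
    """Route input x to the k highest-scoring experts and mix their outputs."""
    logits = x @ router                       # one score per expert
    scores = np.exp(logits - logits.max())
    scores /= scores.sum()                    # softmax over experts
    top_k = np.argsort(scores)[-k:]           # indices of the k best experts

    # Only the chosen experts run; the rest stay idle. This selective
    # activation is where the compute savings come from.
    out = sum(scores[i] * experts[i](x) for i in top_k)
    return out / scores[top_k].sum()          # renormalize the mixture weights

rng = np.random.default_rng(0)
dim, n_experts = 16, 8
experts = [lambda v, W=rng.normal(size=(dim, dim)): v @ W for _ in range(n_experts)]
router = rng.normal(size=(dim, n_experts))
print(moe_layer(rng.normal(size=dim), experts, router).shape)  # (16,)
```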
Additional innovations include:
Multi-Head Latent Attention (MLA): Boosts context retention across long passages.
Multi-Token Prediction (MTP): Generates several tokens per step, increasing output speed by 80%.
These features—paired with the 4-bit quantized version of the model—enable lightning-fast performance with significantly reduced memory and power consumption.
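As a rough illustration of the MTP idea, the toy loop below drafts one extra token per step and keeps it only when a verification check agrees. In the real model the draft and its verification share a forward pass, which is where the speedup comes from; next_token and draft_token here are hypothetical stand-ins for model calls.

```python
# Toy sketch of multi-token prediction decoding; not DeepSeek's implementation.
def decode(prompt_ids, steps, next_token, draft_token):
    ids = list(prompt_ids)
    for _ in range(steps):
        t1 = next_token(ids)           # ordinary next-token prediction
        t2 = draft_token(ids + [t1])   # MTP head guesses one token further ahead
        ids.append(t1)
        # Keep the draft only if the main model agrees with it. With a high
        # acceptance rate, most steps emit two tokens instead of one,
        # which is how an ~80% throughput gain is possible.
        if t2 == next_token(ids):      # recomputed here for clarity; fused in practice
            ids.append(t2)
    return ids
```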
Strategic Shift: Open Source vs. Closed Walls
DeepSeek’s model is freely available, contrasting sharply with the subscription-only models from OpenAI and Anthropic. This strategy reflects a growing divergence between Chinese and Western AI philosophies:
U.S. firms: Monetize via paywalled APIs and closed models.
Chinese firms: Gain ecosystem dominance by open-sourcing foundational models, creating a “multiplier effect” that enables anyone to build without significant expense.
Major Chinese companies like Baidu, Tencent, and Alibaba are following suit, further accelerating the trend. This open approach supports innovation among smaller players while working around the hardware constraints imposed by Nvidia export restrictions.
What’s Next: DeepSeek-R2 on the Horizon
Rumors suggest DeepSeek-V3-0324 is the foundation for an upcoming reasoning model, DeepSeek-R2, expected within two months. If R2 rivals GPT-5, as many anticipate, it could reshape the AI race.
“This lines up with how they released V3 around Christmas followed by R1 a few weeks later. R2 is rumored for April so this could be it.” — Reddit user mxforest
Notably, Nvidia CEO Jensen Huang recently said that DeepSeek’s R1 model consumes 100 times more compute than non-reasoning AI. That makes DeepSeek’s achievement all the more remarkable: its models match top-tier performance while operating under far tighter resource constraints than their Western counterparts.
How to Access DeepSeek-V3-0324
Developers and researchers can experiment with the model in several ways:
Direct download: Weights available via Hugging Face (641GB uncompressed, 352GB in 4-bit form).
Cloud access: OpenRouter offers free API access and a chat interface (see the example after this list).
DeepSeek Chat: Likely upgraded to V3-0324, though not officially confirmed.
Inference services: Hyperbolic Labs and others are already serving the model.
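For the cloud route, a minimal request against OpenRouter’s OpenAI-compatible endpoint might look like this. The model id shown is an assumption; check openrouter.ai/models for the current listing.

```python
# Minimal sketch of querying DeepSeek-V3-0324 via OpenRouter.
# The model id is an assumption; verify it at openrouter.ai/models.
# For the direct-download route:  huggingface-cli download deepseek-ai/DeepSeek-V3-0324
import requests

resp = requests.post(
    "https://openrouter.ai/api/v1/chat/completions",
    headers={"Authorization": "Bearer YOUR_OPENROUTER_KEY"},
    json={
        "model": "deepseek/deepseek-chat-v3-0324:free",  # assumed id
        "messages": [{"role": "user", "content": "Hello, DeepSeek!"}],
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```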
A New Tone for a New Use Case
Early users have noticed a shift in the model’s communication style. While earlier DeepSeek versions were known for their conversational, human-like tone, V3-0324 adopts a more formal, technically focused persona:
“Is it only me or does this version feel less human like?” asked Reddit user nother_level.
“Yeah, it lost its aloof charm for sure, it feels too intellectual for its own good.” — Reddit user AppearanceHeavy6724
This shift toward a more formal, technical tone likely reflects a pivot toward enterprise and professional use, where precision and consistency are more valuable than conversational charm.
What This Means
DeepSeek’s quiet but powerful launch represents a broader trend: AI is no longer just about raw power—it’s about access, efficiency, and openness. By prioritizing widespread availability and optimized performance on affordable hardware, DeepSeek is transforming the landscape of AI development.
This approach is rapidly narrowing the perceived AI gap between China and the United States. While analysts recently placed China 1–2 years behind, that estimate has shrunk to just 3–6 months—with some areas nearing parity or even showing signs of Chinese leadership.
As more developers gain access to high-performing open models, the center of innovation may shift away from tightly controlled systems and toward a more decentralized, globally collaborative future. The company that enables the most people to build with AI—not just use it—could ultimately shape the next era of technology.
Editor’s Note: This article was created by Alicia Shapiro, CMO of AiNews.com, with writing, image, and idea-generation support from ChatGPT, an AI assistant. However, the final perspective and editorial choices are solely Alicia Shapiro’s. Special thanks to ChatGPT for assistance with research and editorial support in crafting this article.