DeepSeek Enhances AI Inference Efficiency and Cost Optimization

Image Source: ChatGPT-4o
During Open Source Week last week, DeepSeek unveiled a bonus feature: an overview of its optimized inference system for the V3/R1 models, which leverages expert parallelism (EP) and load balancing to maximize efficiency.
Optimizing AI Inference: Higher Throughput, Lower Latency
The DeepSeek-V3/R1 inference system is designed to enhance performance by prioritizing higher throughput and lower latency. To achieve these goals, DeepSeek implements cross-node Expert Parallelism (EP), a method that:
Scales batch size, improving GPU matrix computation efficiency.
Distributes workloads across GPUs, reducing memory access demands and latency (a toy routing sketch follows this list).
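To make the routing idea concrete, here is a toy sketch (not DeepSeek's code) that picks the top experts for a handful of tokens and counts how many token-expert pairs each GPU would receive in the cross-node dispatch. The expert counts mirror the 8-of-256 sparsity and the 32-GPU prefill setup described later in the article; the contiguous expert placement is an assumption made purely for illustration.

```python
import torch

num_experts, top_k, num_gpus = 256, 8, 32       # figures quoted later in the article
experts_per_gpu = num_experts // num_gpus       # toy contiguous placement: 8 experts per GPU

router_logits = torch.randn(16, num_experts)    # router scores for 16 tokens in a micro-batch
topk_scores, topk_experts = router_logits.topk(top_k, dim=-1)

# Map each chosen expert to the GPU that hosts it, then count how many
# (token, expert) pairs each GPU would receive in the all-to-all dispatch.
owner_gpu = topk_experts // experts_per_gpu
tokens_per_gpu = torch.bincount(owner_gpu.flatten(), minlength=num_gpus)
print(tokens_per_gpu)
```

Because each token activates only 8 of 256 experts, most GPUs receive only a small slice of any one token's work, which is what lets EP shrink per-GPU memory traffic while batch sizes grow.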
However, EP introduces challenges, including cross-node communication overhead and load balancing across multiple GPUs. DeepSeek addresses these by:
Overlapping computation with communication to reduce delays and optimize throughput (sketched after this list).
Ensuring balanced workloads across GPUs through Data Parallelism (DP) to prevent bottlenecks.
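As a rough illustration of what "overlapping computation with communication" means in practice, the sketch below pipelines two micro-batches with two CUDA streams so that one micro-batch's all-to-all traffic hides behind the other's compute. The `attention`, `dispatch`, `expert_ffn`, and `combine` callables are hypothetical stand-ins for the real kernels, the code assumes a CUDA-capable GPU, and this is not DeepSeek's implementation.

```python
import torch

comm_stream = torch.cuda.Stream()   # side stream dedicated to cross-node all-to-all traffic

def moe_layer_dual_microbatch(batch_a, batch_b, attention, dispatch, expert_ffn, combine):
    """Pipeline two micro-batches so each one's communication hides behind
    the other's computation (illustrative only)."""
    hidden_a = attention(batch_a)                           # dense compute for micro-batch A
    comm_stream.wait_stream(torch.cuda.current_stream())    # hidden_a must be ready before sending
    with torch.cuda.stream(comm_stream):
        routed_a = dispatch(hidden_a)                       # A's tokens travel to their experts...
    hidden_b = attention(batch_b)                           # ...while B's dense compute runs

    torch.cuda.current_stream().wait_stream(comm_stream)    # A's tokens have arrived
    comm_stream.wait_stream(torch.cuda.current_stream())    # hidden_b must be ready before sending
    with torch.cuda.stream(comm_stream):
        routed_b = dispatch(hidden_b)                       # B's communication goes out...
    out_a = combine(expert_ffn(routed_a))                   # ...while A's experts compute

    torch.cuda.current_stream().wait_stream(comm_stream)    # B's tokens have arrived
    out_b = combine(expert_ffn(routed_b))
    return out_a, out_b
```

The point of the pattern is simply that the GPU is never idle while tokens are in flight between nodes.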
Cross-Node Expert Parallelism (EP) at Scale
DeepSeek-V3/R1 operates with a high level of sparsity—activating only 8 out of 256 experts per layer—requiring a large batch size for efficiency. The system employs different parallelism strategies for the prefilling and decoding phases:
Prefilling Phase: Uses EP32 and DP32 (expert and data parallelism each spanning 32 GPUs). Each deployment unit spans 4 nodes with 32 redundant routed experts, where each GPU handles 9 routed experts and 1 shared expert.
Decoding Phase: Expands to EP144 and DP144 (parallelism across 144 GPUs). Each deployment unit spans 18 nodes with 32 redundant routed experts, where each GPU manages 2 routed experts and 1 shared expert (these per-GPU counts are checked in the sketch after this list).
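These figures are internally consistent; the quick check below reproduces the experts-per-GPU counts, assuming 8 H800 GPUs per node and 256 routed experts per MoE layer.

```python
# Quick arithmetic check of the deployment figures above.
ROUTED_EXPERTS = 256
GPUS_PER_NODE = 8

def experts_per_gpu(nodes, redundant_routed):
    gpus = nodes * GPUS_PER_NODE
    return gpus, (ROUTED_EXPERTS + redundant_routed) / gpus

print(experts_per_gpu(nodes=4, redundant_routed=32))    # prefill: (32, 9.0) -> 9 routed experts per GPU
print(experts_per_gpu(nodes=18, redundant_routed=32))   # decode: (144, 2.0) -> 2 routed experts per GPU
```

Each GPU additionally hosts one shared expert, which sits outside the routed-expert count.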
By implementing large-scale cross-node EP, DeepSeek achieves higher throughput and lower latency while maintaining model accuracy. The diagram below illustrates the architecture of DeepSeek’s inference system, showing how expert parallelism, data distribution, and GPU workloads are structured for optimal efficiency.
Load Balancing for Maximum Efficiency
To prevent performance bottlenecks, DeepSeek employs specialized load-balancing techniques for different processing stages. During the prefilling phase, it uses a dual-microbatch strategy to optimize efficiency by overlapping computation with communication. In the decoding phase, load balancing ensures even KVCache usage and request distribution across GPUs, preventing slowdowns.
Prefill Load Balancer – Balances token distribution and computational workload across GPUs.
Decode Load Balancer – Ensures even KVCache usage and request distribution (a toy version follows this list).
Expert-Parallel Load Balancer – Prevents individual GPUs from being overloaded with expert computations.
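As a rough picture of the decode-side idea, the toy scheduler below places each incoming request on the rank with the least KVCache usage. Real balancers weigh many more signals (expert load, in-flight work, sequence lengths), and all names here are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class DecodeRank:
    rank_id: int
    kvcache_tokens: int = 0     # tokens currently resident in this rank's KV cache
    active_requests: int = 0

def place_request(ranks, prompt_tokens):
    # Pick the rank with the least KV cache usage, breaking ties by request count.
    target = min(ranks, key=lambda r: (r.kvcache_tokens, r.active_requests))
    target.kvcache_tokens += prompt_tokens
    target.active_requests += 1
    return target

ranks = [DecodeRank(i) for i in range(4)]
for prompt in (1200, 300, 4500, 800):
    chosen = place_request(ranks, prompt)
    print(f"{prompt}-token prompt -> rank {chosen.rank_id}")
```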
DeepSeek’s Inference Service Performance & Costs
DeepSeek-V3/R1 inference services run on H800 GPUs, using precision formats optimized for both speed and accuracy. FP8 (8-bit floating point) is used for matrix multiplications and dispatch transmissions to improve efficiency, while BF16 (16-bit floating point) is used for core computations to maintain accuracy. The system dynamically scales based on demand:
Peak node occupancy: 278 nodes (each with 8 H800 GPUs).
Average occupancy: 226.75 nodes.
Daily cost: $87,072 (at $2 per GPU per hour; reproduced in the calculation below).
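The daily-cost figure follows directly from those occupancy numbers, as this short calculation shows.

```python
# Reproducing the daily-cost figure from the occupancy numbers above.
avg_nodes = 226.75
gpus_per_node = 8
price_per_gpu_hour = 2.00      # USD, the H800 rental assumption stated in the article
hours_per_day = 24

daily_cost = avg_nodes * gpus_per_node * price_per_gpu_hour * hours_per_day
print(f"${daily_cost:,.0f}")   # -> $87,072
```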
From Feb 27, 2025, 12:00 PM to Feb 28, 2025, 12:00 PM (UTC+8), DeepSeek processed:
608 billion input tokens, with 56.3% (342B tokens) using on-disk KV cache.
168 billion output tokens, averaging 20–22 tokens per second.
Average throughput: ~73.7k tokens/s per H800 node during prefilling, ~14.8k tokens/s per node during decoding.
If all tokens were billed at DeepSeek-R1’s pricing, theoretical daily revenue would be $562,027, a 545% cost profit margin (roughly reproduced in the sketch after this list). However, actual revenue is lower due to:
Lower pricing for DeepSeek-V3.
Limited monetization of services (web and app access remain free).
Automatic nighttime discounts during off-peak hours.
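The quoted revenue and margin can be roughly reproduced from the token counts above. Note that the per-million-token prices in this sketch are DeepSeek-R1's published API list prices at the time; they are an assumption on our part, not figures stated in this article, and small rounding (the 56.3% cache-hit share) explains the gap from the exact $562,027.

```python
# Rough reproduction of the theoretical revenue and margin from the token counts above.
cache_hit_tokens = 342e9                 # input tokens served from the on-disk KV cache
cache_miss_tokens = 608e9 - 342e9        # remaining input tokens
output_tokens = 168e9

price_hit, price_miss, price_out = 0.14, 0.55, 2.19    # USD per million tokens (assumed R1 pricing)
revenue = (cache_hit_tokens * price_hit
           + cache_miss_tokens * price_miss
           + output_tokens * price_out) / 1e6
daily_cost = 87_072

print(f"theoretical daily revenue ≈ ${revenue:,.0f}")                     # ≈ $562,100 (article: $562,027)
print(f"cost profit margin ≈ {(revenue - daily_cost) / daily_cost:.0%}")  # ≈ 546% here, ~545% with exact figures
```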
Beyond performance optimization, DeepSeek also closely monitors the operational costs and revenue potential of its inference services. The graph below provides a breakdown of GPU usage, token processing rates, and estimated financial metrics, offering insight into the system’s efficiency and cost-effectiveness.
What This Means
DeepSeek’s efficiency-driven approach demonstrates that large-scale AI services can be both scalable and profitable, with a projected 545% cost profit margin. This highlights how AI companies sustain operations and monetize services, insights that may interest investors and industry experts. For everyday users, these optimizations could lead to more accessible AI services, whether through free tiers, lower pricing, or improved performance.
Editor’s Note: This article was created by Alicia Shapiro, CMO of AiNews.com, with writing, image, and idea-generation support from ChatGPT, an AI assistant. However, the final perspective and editorial choices are solely Alicia Shapiro’s. Special thanks to ChatGPT for assistance with research and editorial support in crafting this article.