
DeepSeek Open Source Week Day 3: DeepGEMM Boosts FP8 GEMM Performance

Illustration: rows of NVIDIA Hopper GPUs, glowing data nodes, and matrix grids, symbolizing DeepGEMM's optimization of FP8 matrix multiplication for AI computation.

Image Source: ChatGPT-4o


For Day 3 of its Open Source Week, DeepSeek has unveiled DeepGEMM, a lightweight yet powerful FP8 General Matrix Multiplication (GEMM) library that supports both dense and Mixture-of-Experts (MoE) GEMMs. The library plays a crucial role in training and inference for DeepSeek-V3 and DeepSeek-R1, delivering impressive efficiency gains on NVIDIA Hopper GPUs.
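For readers unfamiliar with what an FP8 GEMM involves, the sketch below emulates the core pattern in plain NumPy: quantize each K-block of the inputs to an e4m3-style 8-bit float with a float32 scale, multiply block by block, and rescale the partial products. Everything here (the crude e4m3 rounding, the block size, the function names) is illustrative only, not DeepGEMM's actual implementation, which runs as CUDA kernels on Hopper tensor cores.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite value representable in e4m3

def quantize_e4m3(x, block=128):
    """Simulate per-block FP8 (e4m3) quantization with float32 scales.

    Fine-grained scaling keeps values inside FP8's narrow dynamic
    range; one scale is kept per (row, K-block).
    """
    m, k = x.shape
    xb = x.reshape(m, k // block, block)
    scales = np.abs(xb).max(axis=-1, keepdims=True) / FP8_E4M3_MAX
    scales = np.maximum(scales, 1e-12)            # guard all-zero blocks
    scaled = xb / scales
    # crude e4m3 emulation: keep 3 mantissa bits per value
    exp = np.floor(np.log2(np.maximum(np.abs(scaled), 1e-12)))
    step = 2.0 ** (exp - 3)
    q = np.round(scaled / step) * step
    return q.reshape(m, k), scales.squeeze(-1)

def fp8_gemm(a, b, block=128):
    """Reference FP8 GEMM: multiply quantized blocks, rescale, accumulate."""
    qa, sa = quantize_e4m3(a, block)              # (m, k), (m, k/block)
    qb, sb = quantize_e4m3(b.T, block)            # quantize B along K too
    out = np.zeros((a.shape[0], b.shape[1]), dtype=np.float32)
    for i in range(a.shape[1] // block):
        blk = slice(i * block, (i + 1) * block)
        partial = qa[:, blk] @ qb[:, blk].T       # low-precision product
        out += partial * np.outer(sa[:, i], sb[:, i])  # undo the scales
    return out

a = np.random.randn(64, 256).astype(np.float32)
b = np.random.randn(256, 32).astype(np.float32)
err = np.abs(fp8_gemm(a, b) - a @ b).max() / np.abs(a @ b).max()
print(f"relative error vs float32 GEMM: {err:.4f}")  # small but nonzero
```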

Key Features of DeepGEMM

  • Exceptional Performance: Achieves 1350+ FP8 TFLOPS on Hopper GPUs.

  • Minimal Dependencies: Designed to be as clean as a tutorial, avoiding unnecessary complexity.

  • Just-In-Time Compilation: Compiles all kernels at runtime, eliminating installation overhead (see the conceptual sketch after this list).

  • Compact Yet Powerful: Core logic spans only ~300 lines of code, yet outperforms expert-tuned kernels on most matrix sizes.

  • Versatile Layout Support: Compatible with dense layout and two MoE layouts.
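
The just-in-time point deserves a brief illustration. DeepGEMM generates and compiles its CUDA kernels at runtime, so matrix shapes and block sizes become compile-time constants in the emitted code. The Python below is only a conceptual analogy of that pattern, runtime code generation specialized per shape, with hypothetical names; it is not DeepGEMM's compiler.

```python
_kernel_cache = {}

def get_matmul_kernel(m: int, n: int, k: int):
    """Return a matmul specialized to a fixed (m, n, k), built on first use.

    Mimics JIT specialization: the shape is baked into the generated
    source, so the compiled function has constant loop bounds.
    """
    key = (m, n, k)
    if key not in _kernel_cache:
        src = f"""
def kernel(a, b, out):
    for i in range({m}):
        for j in range({n}):
            acc = 0.0
            for p in range({k}):
                acc += a[i][p] * b[p][j]
            out[i][j] = acc
"""
        ns = {}
        exec(compile(src, f"<jit-{m}x{n}x{k}>", "exec"), ns)
        _kernel_cache[key] = ns["kernel"]
    return _kernel_cache[key]

kern = get_matmul_kernel(2, 2, 2)            # compiled once, then cached
a = [[1.0, 2.0], [3.0, 4.0]]
out = [[0.0, 0.0], [0.0, 0.0]]
kern(a, a, out)
print(out)                                   # [[7.0, 10.0], [15.0, 22.0]]
```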

Built in CUDA, DeepGEMM avoids heavy reliance on CUTLASS or CuTe templates, prioritizing simplicity while still using CUDA-core two-level accumulation (promotion) to counter FP8 tensor core imprecision. This makes it a valuable resource for understanding and optimizing FP8 matrix multiplication on Hopper tensor cores.
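To see why two-level accumulation matters, consider the simulation below. NumPy's float16 stands in for the tensor core's reduced-precision accumulator (an assumption made purely for illustration): when the running sum stays in low precision for thousands of steps, rounding error compounds, whereas periodically promoting partial sums into a float32 accumulator, as DeepGEMM does on CUDA cores, keeps it in check.

```python
import numpy as np

def simulated_matmul(a, b, promote_every=None):
    """Matmul with a float16 running sum, optionally promoted to float32.

    float16 stands in here for the tensor core's reduced-precision FP8
    accumulator (an illustrative assumption). With promote_every set, the
    low-precision partial sum is flushed into a float32 accumulator every
    few steps: two-level accumulation in miniature.
    """
    a16, b16 = a.astype(np.float16), b.astype(np.float16)
    m, k = a.shape
    out = np.zeros((m, b.shape[1]), dtype=np.float32)  # level 2 (CUDA cores)
    acc = np.zeros((m, b.shape[1]), dtype=np.float16)  # level 1 (tensor cores)
    for p in range(k):
        acc = (acc + np.outer(a16[:, p], b16[p, :])).astype(np.float16)
        if promote_every and (p + 1) % promote_every == 0:
            out += acc.astype(np.float32)  # promote the partial sum
            acc[:] = 0
    return out + acc.astype(np.float32)

a = np.random.randn(8, 4096).astype(np.float32)
b = np.random.randn(4096, 8).astype(np.float32)
exact = a @ b
for label, step in [("fp16 accumulation only", None), ("promoted every 32", 32)]:
    err = np.abs(simulated_matmul(a, b, step) - exact).max()
    print(f"{label}: max abs error {err:.3f}")
```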

Despite its impressive speed and efficiency, DeepGEMM does not perform optimally on certain matrix shapes, and DeepSeek welcomes optimization contributions from the community.

For a detailed breakdown of DeepGEMM’s performance across different matrix shapes, visit the official GitHub repository.

DeepSeek API Off-Peak Discounts

In addition to launching DeepGEMM, DeepSeek has introduced off-peak pricing for its API platform, offering significant savings between 16:30 and 00:30 UTC daily (8:30 AM – 4:30 PM PST).

  • DeepSeek-V3: 50% off

  • DeepSeek-R1: 75% off

These discounts provide a cost-effective way for users to maximize their compute resources during designated off-peak hours.
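
Because the discount window wraps past midnight UTC, a naive start-to-end comparison fails; the small sketch below (a hypothetical helper, not part of any DeepSeek SDK) shows the wrap-around logic and applies the published multipliers to a caller-supplied standard price.

```python
from datetime import datetime, time, timezone

OFF_PEAK_START = time(16, 30)   # 16:30 UTC
OFF_PEAK_END = time(0, 30)      # 00:30 UTC (next day)

def is_off_peak(now=None):
    """True inside the 16:30-00:30 UTC window, when discounts apply."""
    t = (now or datetime.now(timezone.utc)).time()
    # the window crosses midnight, so it is the union of two half-windows
    return t >= OFF_PEAK_START or t < OFF_PEAK_END

# published off-peak multipliers: V3 costs 50% of standard, R1 costs 25%
OFF_PEAK_MULTIPLIER = {"DeepSeek-V3": 0.50, "DeepSeek-R1": 0.25}

def effective_price(model, standard_price, now=None):
    """Apply the off-peak multiplier when the window is active."""
    scale = OFF_PEAK_MULTIPLIER[model] if is_off_peak(now) else 1.0
    return standard_price * scale

when = datetime(2025, 2, 26, 23, 0, tzinfo=timezone.utc)
print(is_off_peak(when))                          # True
print(effective_price("DeepSeek-R1", 1.00, when)) # 0.25
```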

Looking Ahead

DeepGEMM’s introduction underscores DeepSeek’s focus on efficiency and accessibility in AI infrastructure. While DeepGEMM already outperforms many expert-tuned kernels, its open-source release leaves room for further refinement. With DeepSeek actively seeking optimization contributions, developers and researchers have the opportunity to push FP8 GEMM performance even further.

What This Means

DeepGEMM’s release reflects DeepSeek’s commitment to open-source innovation, providing a lightweight yet high-performance FP8 GEMM solution for AI workloads. By simplifying implementation while achieving state-of-the-art performance, DeepGEMM gives developers and researchers a valuable tool for optimizing NVIDIA Hopper-based training and inference. Meanwhile, the new off-peak discounts make DeepSeek’s API platform more cost-effective, encouraging AI practitioners to maximize compute efficiency.

Editor’s Note: This article was created by Alicia Shapiro, CMO of AiNews.com, with writing, image, and idea-generation support from ChatGPT, an AI assistant. However, the final perspective and editorial choices are solely Alicia Shapiro’s. Special thanks to ChatGPT for assistance with research and editorial support in crafting this article.