
sCM Matches Diffusion Model Quality in Two Steps, Offering 50x Speedup

[Image: side-by-side comparison of the multi-step sampling of a traditional diffusion model and the two-step sampling of sCM, producing similar quality at very different speeds.]

Image Source: ChatGPT-4o


Generative AI has made impressive strides in creating realistic images, 3D models, audio, and video—thanks in large part to diffusion models. However, despite their cutting-edge capabilities, diffusion models have a notable limitation: they are slow at sampling, often requiring dozens or even hundreds of steps to generate a single output. To address this, researchers have developed a new method called sCM (simplified Continuous-time Consistency Models), which significantly accelerates the process while maintaining high-quality results.

The Challenge: Slow Sampling in Diffusion Models

Diffusion models generate high-quality samples but are inefficient for real-time applications due to the lengthy sampling process. These models typically rely on multiple denoising steps to gradually refine the output. While various distillation techniques have attempted to speed up this process, they often come with trade-offs like increased computational costs or lower sample quality.
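To make the cost concrete, here is a minimal sketch of the kind of iterative denoising loop a diffusion sampler runs. The `toy_model` and the Euler-style update are illustrative stand-ins, not the researchers' actual sampler; the point is that the model must be called once per step, dozens of times per sample.

```python
import numpy as np

def denoise_step(x, sigma, sigma_next, model):
    """One Euler step of a diffusion sampler: move the sample from
    noise level sigma toward the lower noise level sigma_next."""
    x0_pred = model(x, sigma)            # model's estimate of the clean sample
    d = (x - x0_pred) / sigma            # direction toward the estimate
    return x + (sigma_next - sigma) * d  # small step down the noise schedule

def sample(model, shape, n_steps=50, sigma_max=80.0, sigma_min=0.002, seed=0):
    """Full multi-step sampling loop: start from pure noise and denoise
    repeatedly across a decreasing noise schedule (n_steps model calls)."""
    rng = np.random.default_rng(seed)
    sigmas = np.geomspace(sigma_max, sigma_min, n_steps + 1)
    x = rng.standard_normal(shape) * sigma_max
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        x = denoise_step(x, sigma, sigma_next, model)
    return x

# Toy "model" that simply predicts zero (stands in for a trained network).
toy_model = lambda x, sigma: x * 0.0
out = sample(toy_model, shape=(4, 4), n_steps=50)
```

Fifty model calls for one sample is why real-time use is hard; distillation methods try to collapse this loop into a handful of calls.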

Introducing sCM: A Faster, Scalable Approach

Building on prior research on consistency models, the new sCM simplifies the theoretical framework and stabilizes training for continuous-time consistency models. This has allowed researchers to scale the training to an unprecedented 1.5 billion parameters on ImageNet at a resolution of 512×512. With sCM, only two sampling steps are needed to produce high-quality samples, resulting in a remarkable 50x speedup compared to traditional diffusion models.

For example, their largest model can generate a sample in just 0.11 seconds on a single A100 GPU, without any inference optimization. Further acceleration can be achieved with customized system optimization, making real-time generation of images, audio, and video more feasible than ever before.

Benchmarking Against State-of-the-Art Models

To evaluate sCM’s performance, the researchers compared its sample quality with other leading generative models using the Fréchet Inception Distance (FID) score, where a lower score indicates better quality. The results show that sCM’s two-step process produces samples comparable to the best diffusion models, while using less than 10% of the computational resources required by those models.
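For readers unfamiliar with the metric, FID compares the statistics of Inception-network features extracted from generated and real images. A small sketch of the computation (using SciPy's matrix square root; the random feature arrays here are placeholders for real Inception activations):

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(feats_a, feats_b):
    """Fréchet Inception Distance between two feature sets.
    Each input is an (n_samples, n_features) array of Inception
    activations; lower FID means the two distributions are closer."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    covmean = sqrtm(cov_a @ cov_b)
    if np.iscomplexobj(covmean):   # numerical noise can produce tiny
        covmean = covmean.real     # imaginary parts; discard them
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean))

# Matching distributions score low; shifted distributions score high.
rng = np.random.default_rng(0)
same = rng.standard_normal((500, 8))
shifted = rng.standard_normal((500, 8)) + 2.0
low = fid(same, rng.standard_normal((500, 8)))
high = fid(same, shifted)
```

In practice the features come from a pretrained Inception-v3 network over tens of thousands of images, but the distance formula is exactly this one.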

How Consistency Models Differ from Diffusion Models

Unlike diffusion models, which gradually convert noise into clear samples through multiple denoising steps, consistency models aim to transform noise into noise-free samples in a single step. The key advantage of consistency models lies in their ability to accelerate the sampling process without sacrificing quality, thanks to techniques like consistency training and consistency distillation.
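The two-step sampling the article describes can be sketched as follows. This is a simplified illustration, not the paper's exact procedure: `toy_cm` stands in for a trained consistency model, and the intermediate noise level `sigma_mid` is an assumed hyperparameter.

```python
import numpy as np

def two_step_sample(cm, shape, sigma_max=80.0, sigma_mid=0.8, seed=0):
    """Two-step consistency-model sampling: one jump from pure noise to a
    clean estimate, then partial re-noising and a second jump to refine."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape) * sigma_max
    x0 = cm(x, sigma_max)                            # step 1: noise -> clean
    x = x0 + sigma_mid * rng.standard_normal(shape)  # re-noise to sigma_mid
    return cm(x, sigma_mid)                          # step 2: refine details

# Toy consistency "model" (stands in for a trained network).
toy_cm = lambda x, sigma: x * 0.01
img = two_step_sample(toy_cm, shape=(4, 4))
```

Compared with the 50-plus model calls of a diffusion sampler, this costs exactly two forward passes, which is where the reported speedup comes from.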

In the case of sCM, the model distills knowledge from a pre-trained diffusion model, allowing it to scale up and improve as the underlying diffusion model grows. This means that the difference in quality between sCM and diffusion models decreases as both models increase in size.

Scaling and Quality Improvements

As sCM scales, the gap in sample quality compared to diffusion models diminishes. Even with just two sampling steps, sCM delivers results with less than a 10% difference in FID scores compared to the teacher diffusion model, which typically requires hundreds of steps.

While sCM still relies on pre-trained diffusion models for initialization, its two-step samples are nearly indistinguishable in quality, making it a promising alternative for faster, high-quality generation.

Looking Ahead: Real-Time Generative AI

The development of sCM points to a future where real-time, high-quality AI generation becomes increasingly viable across a range of domains, including image, audio, and video. The researchers acknowledge that while FID scores are a useful benchmark, they do not always perfectly reflect actual sample quality, especially for certain applications. Moving forward, the team aims to further improve both inference speed and sample quality to meet the evolving needs of generative AI.

By addressing the trade-off between speed and quality, sCM could unlock new possibilities for generative AI, especially in real-time applications where rapid, high-quality output is crucial.

You can read the research paper and visit the researchers' blog for more graphs and diagrams.