Innovative AI Inference Techniques at Character.AI - Driving Global Efficiency
Character.AI is making significant strides toward achieving Artificial General Intelligence (AGI), aiming to transform daily life through advanced large language models (LLMs). These models are designed to improve business productivity, entertainment, education, coaching, support, brainstorming, creative writing, and more. A critical component of realizing this vision globally is optimizing AI inference, the process by which LLMs generate responses.
High-Efficiency Inference
As a full-stack AI company, Character.AI builds its model architecture, inference stack, and products from the ground up. This vertical integration creates significant opportunities to improve inference efficiency, cut costs, and scale to meet growing global demand. The company currently serves more than 20,000 inference queries per second, roughly 20% of the query volume handled by Google Search.
Innovations in Serving Stack
To handle such a high volume of queries efficiently, Character.AI has introduced several key innovations across its serving stack.
Memory-efficient Architecture Design
Managing the cache size of attention keys and values (KV) is a significant challenge in LLM inference. Character.AI has implemented several techniques to drastically reduce KV cache size without compromising quality:
Multi-Query Attention: Adopted in all attention layers, this technique reduces KV cache size by eight times compared with the Grouped-Query Attention commonly used in open-source models. Because every query head shares a single key and value head, memory requirements drop substantially.
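To make the memory arithmetic concrete, here is a minimal PyTorch sketch, not Character.AI's production code, comparing the per-layer KV cache footprint of Grouped-Query Attention with eight KV heads (an assumption consistent with the eight-times figure above) against Multi-Query Attention's single shared KV head:

```python
import torch

# Illustrative sketch: per-layer KV cache footprint under GQA vs. MQA.
# Model dimensions below are hypothetical, chosen only for illustration.
n_query_heads = 16
head_dim = 128
batch, seq_len = 1, 1024

def kv_cache(n_kv_heads: int) -> torch.Tensor:
    # One tensor each for keys and values: [2, batch, kv_heads, seq, head_dim]
    return torch.zeros(2, batch, n_kv_heads, seq_len, head_dim)

gqa_cache = kv_cache(n_kv_heads=8)  # GQA: query heads grouped over 8 KV heads
mqa_cache = kv_cache(n_kv_heads=1)  # MQA: all query heads share 1 KV head

print(gqa_cache.numel() / mqa_cache.numel())  # 8.0 -> the 8x cache reduction
```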
Hybrid Attention Horizons: Character.AI interleaves local and global attention layers to balance efficiency and performance. Local attention, trained with sliding windows, reduces computational complexity from O(length^2) to O(length). The attention horizon, which defines how far the model looks back in the text, is reduced to 1024 tokens for most layers. This setup maintains high performance on evaluation metrics, including long-context benchmarks, while optimizing memory use. Only one out of every six layers uses global attention, which considers the entire context.
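The sketch below shows one way such an interleaved schedule could be expressed as attention masks; the placement of global layers at every sixth position is an assumption, since the post only states the one-in-six ratio:

```python
import torch

# Sketch of interleaved attention horizons (illustrative layer schedule,
# not Character.AI's exact configuration): one in six layers attends
# globally, the rest use a 1024-token sliding window.
def attention_mask(seq_len: int, layer_idx: int, window: int = 1024) -> torch.Tensor:
    pos = torch.arange(seq_len)
    causal = pos[None, :] <= pos[:, None]        # token i attends to tokens <= i
    if layer_idx % 6 == 0:                       # assumed global layer position
        return causal                            # full causal mask: whole context
    local = pos[:, None] - pos[None, :] < window # only the last `window` tokens
    return causal & local

mask = attention_mask(seq_len=4096, layer_idx=1)
print(mask[4095].sum().item())  # 1024: a local layer looks back 1024 tokens
```

Because the sliding-window mask bounds how many keys each query attends to, the per-layer attention cost grows linearly with sequence length instead of quadratically, which is exactly the O(length^2) to O(length) reduction described above.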
Cross Layer KV-sharing: By sharing the KV cache across neighboring attention layers, Character.AI reduces the cache size further by two to three times. For global attention layers, KV caches are shared across multiple layers, significantly cutting down memory usage without degrading model quality. This approach ensures that the system can handle long contexts effectively.
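One way to picture this sharing is as a mapping from layer index to cache slot. The grouping below (three neighboring local layers per cache, one cache shared by all global layers) is an illustrative assumption rather than the published layout, but it reproduces a two-to-three-times reduction:

```python
# Hypothetical cross-layer KV sharing layout: adjacent local layers reuse
# one KV cache entry, and all global layers read from a single shared cache.
def kv_slot(layer_idx: int, share_group: int = 3, global_every: int = 6) -> int:
    if layer_idx % global_every == 0:
        return 0                           # all global layers share one cache
    return 1 + (layer_idx // share_group)  # local layers share in groups of 3

n_layers = 24
n_caches = len({kv_slot(i) for i in range(n_layers)})
print(f"{n_layers} attention layers -> {n_caches} KV caches")  # 24 -> 9, ~2.7x
```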
Stateful Caching
To efficiently manage long dialogues, Character.AI developed an advanced caching system for KV values. This system stores KV values on host memory between chat turns, enabling reuse for future queries. Cached KV tensors are organized in a Least Recently Used (LRU) cache with a tree structure, indexed by a rolling hash of prefix tokens. For each new query, the system retrieves the cache for the longest matching prefix, achieving a 95% cache hit rate. This drastically reduces the cost and time associated with refilling KV caches for each turn.
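The snippet below sketches the core idea in simplified form: block-aligned prefixes of the token sequence are hashed with a rolling hash, and an LRU table maps each prefix hash to cached KV tensors. The block size, hash constants, and flat table (standing in for the tree structure described above) are assumptions for illustration:

```python
from collections import OrderedDict

# Illustrative sketch (not Character.AI's implementation) of prefix-keyed
# KV caching indexed by a rolling hash over block-aligned token prefixes.
BLOCK, BASE, MOD = 16, 257, (1 << 61) - 1  # assumed block size and hash params

def prefix_hashes(tokens):
    """Rolling hash of every BLOCK-aligned prefix of `tokens`."""
    h, out = 0, []
    for i, t in enumerate(tokens):
        h = (h * BASE + t + 1) % MOD
        if (i + 1) % BLOCK == 0:
            out.append(h)
    return out

class PrefixKVCache:
    def __init__(self, capacity=1024):
        self.lru = OrderedDict()  # prefix hash -> cached KV tensors
        self.capacity = capacity

    def longest_prefix(self, tokens):
        """Return (num_cached_tokens, kv) for the longest cached prefix."""
        best = (0, None)
        for i, h in enumerate(prefix_hashes(tokens)):
            if h not in self.lru:
                break                     # a longer prefix cannot be cached
            self.lru.move_to_end(h)       # refresh LRU recency
            best = ((i + 1) * BLOCK, self.lru[h])
        return best

    def store(self, tokens, kv):
        # Simplification: the full KV is stored under every prefix hash;
        # a production system would store per-prefix tensors in a tree.
        for h in prefix_hashes(tokens):
            self.lru[h] = kv
            self.lru.move_to_end(h)
        while len(self.lru) > self.capacity:
            self.lru.popitem(last=False)  # evict least recently used
```

On a new turn, the server calls longest_prefix with the full conversation so far, reuses the returned KV tensors, and only computes attention states for the uncached suffix.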
Quantization for Training and Serving
Character.AI employs int8 quantization for model weights, activations, and KV cache. This process involves converting floating-point calculations to 8-bit integer operations, significantly improving efficiency and reducing memory usage. Unlike post-training quantization, Character.AI trains its models in int8 precision from the start, ensuring consistency and high performance during both training and serving.
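For intuition, here is a minimal sketch of symmetric per-row int8 quantization of a weight matrix. Note that this shows a post-hoc round trip of float weights, whereas Character.AI trains natively in int8; the scaling scheme is a common choice assumed here, not a confirmed detail:

```python
import torch

# Minimal sketch of symmetric int8 quantization with per-row scales.
def quantize_int8(w: torch.Tensor):
    scale = w.abs().amax(dim=-1, keepdim=True) / 127.0   # per-row scale factor
    q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) * scale  # recover approximate float weights

w = torch.randn(4, 8)
q, s = quantize_int8(w)
print((w - dequantize(q, s)).abs().max())  # small per-element rounding error
```

Storing weights, activations, and the KV cache as 8-bit integers halves memory relative to 16-bit floats and allows fast int8 matrix kernels; training in int8 from the start avoids the accuracy mismatch that post-training quantization can introduce.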
Building the Future Together
Optimizing AI inference is crucial for scaling AI systems and integrating them seamlessly into daily life. The innovations at Character.AI have reduced serving costs by a factor of 33 since late 2022. Serving the same traffic with leading commercial APIs would cost at least 13.5 times more than the company's in-house system.
Character.AI’s journey is just beginning. The company continues to push the boundaries of AI, inviting others to join in creating a future where scalable AI systems are integral to every interaction, driving innovation and enhancing experiences worldwide.