Anthropic’s Prompt Caching for Claude Models Cuts Costs, Latency

Image: Illustration of prompt caching in AI models, with icons representing reduced costs and latency.

Image Source: ChatGPT


Anthropic has rolled out a new feature called prompt caching, now in public beta for its Claude 3.5 Sonnet and Claude 3 Haiku models, with support for Claude 3 Opus to follow soon. This feature allows developers to store frequently used prompt contexts between API calls, enabling Claude to access necessary background information and example outputs more efficiently. According to Anthropic, prompt caching can significantly reduce costs—by up to 90%—and decrease latency by as much as 85% for lengthy prompts, making it a highly beneficial tool for a variety of applications.
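
As a rough illustration (not Anthropic's official sample code), the sketch below shows what opting in might look like in Python with the requests library: the large, stable part of the system prompt is marked with a cache_control block so that subsequent calls can reuse it. The anthropic-beta header value and cache_control field follow Anthropic's beta documentation at the time of writing, while the model identifier and file path are placeholders; check the current documentation before relying on them.

```python
import os
import requests

# Minimal sketch: mark a long, stable prompt prefix for caching.
# Field names (cache_control, the anthropic-beta header value) follow
# Anthropic's prompt caching beta docs; verify against the current docs.

API_URL = "https://api.anthropic.com/v1/messages"
headers = {
    "x-api-key": os.environ["ANTHROPIC_API_KEY"],
    "anthropic-version": "2023-06-01",
    "anthropic-beta": "prompt-caching-2024-07-31",  # beta opt-in header
    "content-type": "application/json",
}

# Placeholder: any large, stable reference material works here.
long_reference_text = open("reference_document.txt").read()

body = {
    "model": "claude-3-5-sonnet-20240620",
    "max_tokens": 1024,
    # The system prompt is split into content blocks; the large, stable block
    # is tagged with cache_control so later calls can read it from the cache.
    "system": [
        {
            "type": "text",
            "text": "You answer questions about the attached reference document.",
        },
        {
            "type": "text",
            "text": long_reference_text,
            "cache_control": {"type": "ephemeral"},
        },
    ],
    "messages": [
        {"role": "user", "content": "Summarize the key points of the document."}
    ],
}

response = requests.post(API_URL, headers=headers, json=body, timeout=60)
print(response.json())
```

The idea is that the stable reference material sits ahead of the cache breakpoint, while the per-request question stays in the regular messages array and is billed normally.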

Efficiency Through Prompt Caching

Prompt caching is especially useful in scenarios where large prompt contexts are repeatedly needed across multiple API calls. Key applications include:

  • Conversational AI: Enhances performance in extended dialogues, particularly those requiring lengthy instructions or document references.

  • Coding Assistance: Improves code suggestions and Q&A by maintaining a summarized view of the codebase within the prompt context.

  • Processing of Extensive Documents: Facilitates the integration of long-form materials, such as documents and images, without adding to response times.

  • Detailed Instruction Sets: Allows developers to include comprehensive lists of instructions, procedures, and examples, thereby fine-tuning Claude’s output.

  • Agentic Tasks: Boosts efficiency in tasks requiring multiple tool uses and iterative adjustments, which typically involve numerous API calls.

  • Interactive Knowledge Bases: Supports embedding entire documents into the prompt, enabling users to query vast information sources effectively (sketched in the example after this list).
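
For the knowledge-base case in particular, the pattern is simply to resend the same cached document prefix with each new question. The following is a hedged sketch continuing the request shape above; the usage field names (cache_creation_input_tokens, cache_read_input_tokens) are those described in Anthropic's beta documentation and are worth verifying against the current docs.

```python
import os
import requests

API_URL = "https://api.anthropic.com/v1/messages"
HEADERS = {
    "x-api-key": os.environ["ANTHROPIC_API_KEY"],
    "anthropic-version": "2023-06-01",
    "anthropic-beta": "prompt-caching-2024-07-31",
}

# Hypothetical long document embedded as a cached knowledge base.
document = open("employee_handbook.txt").read()

cached_system = [
    {"type": "text", "text": "Answer questions using only the document below."},
    {"type": "text", "text": document, "cache_control": {"type": "ephemeral"}},
]

questions = [
    "How many vacation days do new hires get?",
    "What is the remote-work policy?",
    "Who approves travel expenses?",
]

for question in questions:
    body = {
        "model": "claude-3-5-sonnet-20240620",
        "max_tokens": 512,
        "system": cached_system,  # identical prefix on every call
        "messages": [{"role": "user", "content": question}],
    }
    resp = requests.post(API_URL, headers=HEADERS, json=body, timeout=60).json()
    usage = resp.get("usage", {})
    # The first call should report cache-creation tokens (the write premium);
    # later calls should report cache reads, which is where the savings come from.
    print(question)
    print("  cache write tokens:", usage.get("cache_creation_input_tokens"))
    print("  cache read tokens: ", usage.get("cache_read_input_tokens"))
```

Because pricing differs for cache writes and reads, watching those usage counters is a quick way to confirm that the prefix is actually being reused.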

Performance Gains and Cost Savings

Anthropic reports that early adopters of prompt caching have observed substantial improvements in both speed and cost across diverse use cases. For example:

  • Book Chatting (100,000 tokens cached): Latency was reduced from 11.5 seconds to 2.4 seconds, with a 90% decrease in costs.

  • Many-Shot Prompting (10,000 tokens): Latency dropped from 1.6 seconds to 1.1 seconds, cutting costs by 86%.

  • Multi-Turn Conversations (10-turn dialog with complex prompts): Latency decreased from around 10 seconds to 2.5 seconds, with a 53% reduction in costs.

Cached prompts are priced based on the number of input tokens stored and how often they are read back. Writing to the cache incurs a 25% premium over the base input token price for the specific model, while reading from the cache is substantially cheaper, at just 10% of the base input token cost. A worked example of how this plays out over repeated calls follows the per-model pricing below.

Claude Model Pricing with Prompt Caching

  • Claude 3.5 Sonnet: Described as Anthropic’s most sophisticated model, it features a 200K context window. Prompt cache writing is priced at $3.75 per million tokens (MTok), while reading costs $0.30 per MTok.

  • Claude 3 Opus: Tailored for complex tasks with a 200K context window, this model will soon support prompt caching. Pricing for cache writing is set at $18.75 per MTok, with reading at $1.50 per MTok.

  • Claude 3 Haiku: Known for its speed and cost-efficiency, this model also features a 200K context window. Cache writing is priced at $0.30 per MTok, and reading is just $0.03 per MTok.
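
To make these numbers concrete, here is a rough back-of-the-envelope calculation (not Anthropic's figures) for a 100,000-token prefix reused across 50 calls to Claude 3.5 Sonnet. It assumes a base input price of $3 per MTok, consistent with the $3.75 write and $0.30 read rates above (1.25x and 0.10x the base), and it ignores output tokens and the per-request question tokens.

```python
# Back-of-the-envelope savings estimate for a long cached prefix, using the
# Claude 3.5 Sonnet cache rates above and an assumed base input price of
# $3 per million tokens. Output-token costs are ignored here.

BASE_INPUT = 3.00    # $ per million input tokens (assumed base price)
CACHE_WRITE = 3.75   # $ per million tokens, 25% premium over base
CACHE_READ = 0.30    # $ per million tokens, 10% of base

prefix_tokens = 100_000   # e.g. a cached book or knowledge base
calls = 50                # number of API calls reusing that prefix

def mtok(tokens: int) -> float:
    """Convert a token count to millions of tokens."""
    return tokens / 1_000_000

# Without caching: the full prefix is billed at the base rate on every call.
uncached = calls * mtok(prefix_tokens) * BASE_INPUT

# With caching: one cache write on the first call, cache reads afterwards.
cached = mtok(prefix_tokens) * CACHE_WRITE \
    + (calls - 1) * mtok(prefix_tokens) * CACHE_READ

print(f"Uncached prefix cost: ${uncached:.2f}")   # $15.00
print(f"Cached prefix cost:   ${cached:.2f}")     # $1.85
print(f"Savings:              {100 * (1 - cached / uncached):.0f}%")
```

Under these assumptions the cached version comes out roughly 88% cheaper, in line with the savings Anthropic reports for long prompts; the comparison also assumes every call arrives while the prefix is still held in the cache.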

Notion Enhances AI Assistant with Prompt Caching

Notion, the widely used productivity tool, has integrated prompt caching into its Claude-powered Notion AI assistant. By incorporating this feature, Notion has optimized its internal operations, resulting in faster, more responsive AI interactions for its users. Simon Last, Co-founder at Notion, expressed enthusiasm about the enhancement, stating, "We're excited to use prompt caching to make Notion AI faster and cheaper, all while maintaining state-of-the-art quality."

Getting Started with Prompt Caching

Developers eager to leverage prompt caching can begin using the public beta through the Anthropic API. Comprehensive documentation and pricing details are available on Anthropic’s website.