Here’s an example of how to implement prompt caching with the Messages API using a cache_control block:

curl https://aihubmix.com/v1/messages \
  -H "content-type: application/json" \
  -H "x-api-key: AIHUBMIX_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -d '{
    "stream": true,
    "model": "claude-opus-4-20250514",
    "max_tokens": 20000,
    "system": [
      {
        "type": "text",
        "text": "You are an AI assistant tasked with analyzing literary works. Your goal is to provide insightful commentary on themes, characters, and writing style."
      },
      {
        "type": "text",
        "text": "Pride and Prejudice by Jane Austen... [Place complete text content here]",
        "cache_control": {"type": "ephemeral"}
      }
    ],
    "thinking": {
      "type": "enabled",
      "budget_tokens": 16000
    },
    "messages": [
      {
        "role": "user",
        "content": "Analyze the major themes in Pride and Prejudice."
      }
    ]
  }'

Response usage, first call (cache write):

{"cache_creation_input_tokens":188086,"cache_read_input_tokens":0,"input_tokens":21,"output_tokens":393}

Response usage, subsequent call within the cache lifetime (cache read):

{"cache_creation_input_tokens":0,"cache_read_input_tokens":188086,"input_tokens":21,"output_tokens":393}

In this example, the entire text of “Pride and Prejudice” is cached using the cache_control parameter, so the large text can be reused across multiple API calls without being reprocessed each time. By changing only the user message, you can ask different questions about the book while utilizing the cached content, leading to faster responses and improved efficiency.
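
For example, a follow-up request that keeps the system blocks and their cache_control marker byte-for-byte identical and changes only the user question should report a cache read rather than a cache write, provided it arrives within the cache lifetime (the question below is purely illustrative):

curl https://aihubmix.com/v1/messages \
  -H "content-type: application/json" \
  -H "x-api-key: AIHUBMIX_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -d '{
    "stream": true,
    "model": "claude-opus-4-20250514",
    "max_tokens": 20000,
    "system": [
      {
        "type": "text",
        "text": "You are an AI assistant tasked with analyzing literary works. Your goal is to provide insightful commentary on themes, characters, and writing style."
      },
      {
        "type": "text",
        "text": "Pride and Prejudice by Jane Austen... [Same complete text content as before]",
        "cache_control": {"type": "ephemeral"}
      }
    ],
    "thinking": {
      "type": "enabled",
      "budget_tokens": 16000
    },
    "messages": [
      {
        "role": "user",
        "content": "How does Austen use irony to characterize Mr. Collins?"
      }
    ]
  }'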

How prompt caching works

When you send a request with prompt caching enabled:

  1. The system checks if a prompt prefix, up to a specified cache breakpoint, is already cached from a recent query.
  2. If found, it uses the cached version, reducing processing time and costs.
  3. Otherwise, it processes the full prompt and caches the prefix once the response begins.

Prompt caching is especially useful for:
  • Prompts with many examples
  • Large amounts of context or background information
  • Repetitive tasks with consistent instructions
  • Long multi-turn conversations

By default, the cache has a 5-minute lifetime, and it is refreshed at no additional cost each time the cached content is used. We also offer a 1-hour cache (Beta) for scenarios that require a longer cache duration.

Prompt caching caches the full prefix

Prompt caching references the entire prompt - tools, system, and messages (in that order) up to and including the block designated with cache_control.
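
As an illustration (the get_weather tool is hypothetical and the text is abbreviated), placing the cache breakpoint on the last system block caches the tools array and both system blocks, while the messages that follow are processed as usual:

{
  "tools": [
    {
      "name": "get_weather",
      "description": "Get the current weather for a given location",
      "input_schema": {
        "type": "object",
        "properties": {
          "location": {"type": "string"}
        },
        "required": ["location"]
      }
    }
  ],
  "system": [
    {
      "type": "text",
      "text": "Static instructions..."
    },
    {
      "type": "text",
      "text": "Large reference context...",
      "cache_control": {"type": "ephemeral"}
    }
  ],
  "messages": [
    {
      "role": "user",
      "content": "First question"
    }
  ]
}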

Pricing

Prompt caching introduces a new pricing structure. The table below shows the price per million tokens for each supported model:

Model              Base Input Tokens   5m Cache Writes    1h Cache Writes   Cache Hits & Refreshes   Output Tokens
Claude Opus 4      Platform pricing    1.25x base price   2x base price     0.1x base price          Platform pricing
Claude Sonnet 4    Platform pricing    1.25x base price   2x base price     0.1x base price          Platform pricing
Claude Sonnet 3.7  Platform pricing    1.25x base price   2x base price     0.1x base price          Platform pricing
Claude Sonnet 3.5  Platform pricing    1.25x base price   2x base price     0.1x base price          Platform pricing
Claude Haiku 3.5   Platform pricing    1.25x base price   2x base price     0.1x base price          Platform pricing
Claude Opus 3      Platform pricing    1.25x base price   2x base price     0.1x base price          Platform pricing
Claude Haiku 3     Platform pricing    1.25x base price   2x base price     0.1x base price          Platform pricing

Note:

  • 5-minute cache write tokens are billed at 1.25 times the base input token price
  • 1-hour cache write tokens are billed at 2 times the base input token price
  • Cache read tokens are billed at 0.1 times the base input token price
  • Regular input and output tokens are priced at platform standard rates
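
As a worked example, for a 100,000-token cached prefix the multipliers above translate to:

  5-minute cache write: 100,000 x 1.25 = billed as 125,000 base-rate input tokens
  1-hour cache write:   100,000 x 2    = billed as 200,000 base-rate input tokens
  Each cache hit:       100,000 x 0.1  = billed as 10,000 base-rate input tokens

In other words, every reuse of that prefix costs roughly a tenth of what reprocessing it as regular input would.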

How to implement prompt caching

Supported models

Prompt caching is currently supported on:

  • Claude Opus 4
  • Claude Sonnet 4
  • Claude Sonnet 3.7
  • Claude Sonnet 3.5
  • Claude Haiku 3.5
  • Claude Haiku 3
  • Claude Opus 3

Structuring your prompt

Place static content (tool definitions, system instructions, context, examples) at the beginning of your prompt. Mark the end of the reusable content for caching using the cache_control parameter.

Cache prefixes are created in the following order: tools, system, then messages.

Using the cache_control parameter, you can define up to 4 cache breakpoints, allowing you to cache different reusable sections separately. For each breakpoint, the system will automatically check for cache hits at previous positions and use the longest matching prefix if one is found.
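
As a sketch (block contents are abbreviated and the search_docs tool is purely illustrative), a request could place one breakpoint after the tool definitions, one after the system prompt, and one at the end of an earlier assistant turn, so that if later sections change the system can still hit the cache at an earlier breakpoint:

{
  "tools": [
    {
      "name": "search_docs",
      "description": "Search the project documentation",
      "input_schema": {
        "type": "object",
        "properties": {
          "query": {"type": "string"}
        },
        "required": ["query"]
      },
      "cache_control": {"type": "ephemeral"}
    }
  ],
  "system": [
    {
      "type": "text",
      "text": "Instructions and large shared context...",
      "cache_control": {"type": "ephemeral"}
    }
  ],
  "messages": [
    {
      "role": "user",
      "content": "Earlier question..."
    },
    {
      "role": "assistant",
      "content": [
        {
          "type": "text",
          "text": "Earlier answer...",
          "cache_control": {"type": "ephemeral"}
        }
      ]
    },
    {
      "role": "user",
      "content": "Latest question"
    }
  ]
}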

Cache limitations

The minimum cacheable prompt length is:

  • 1024 tokens for Claude Opus 4, Claude Sonnet 4, Claude Sonnet 3.7, Claude Sonnet 3.5 and Claude Opus 3
  • 2048 tokens for Claude Haiku 3.5 and Claude Haiku 3

Shorter prompts cannot be cached, even if marked with cache_control. Any requests to cache fewer than this number of tokens will be processed without caching. To see if a prompt was cached, see the response usage fields.
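
For instance (with illustrative token counts), if the marked prefix falls below the minimum, the usage block would be expected to show no cache activity, with the full prompt counted as regular input:

{"cache_creation_input_tokens":0,"cache_read_input_tokens":0,"input_tokens":850,"output_tokens":120}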

For concurrent requests, note that a cache entry only becomes available after the first response begins. If you need cache hits for parallel requests, wait for the first response before sending subsequent requests.

Currently, one cache type ("ephemeral") is supported, with two available lifetimes:

  • 5 minutes: the default lifetime
  • 1 hour (Beta): for scenarios requiring a longer cache duration

1-hour cache duration (Beta)

For scenarios requiring longer cache duration, we provide a 1-hour cache option.

To use the extended cache, add extended-cache-ttl-2025-04-11 as a beta header to your request, and then include ttl in the cache_control definition:

curl https://aihubmix.com/v1/messages \
  -H "content-type: application/json" \
  -H "x-api-key: AIHUBMIX_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "anthropic-beta: extended-cache-ttl-2025-04-11" \
  -d '{
    "model": "claude-opus-4-20250514",
    "system": [
      {
        "type": "text",
        "text": "Long-term instructions...",
        "cache_control": {
          "type": "ephemeral",
          "ttl": "1h"
        }
      }
    ],
    "messages": [...]
  }'

In general, the cache_control block accepts an optional ttl field set to "5m" (the default) or "1h":

{
  "cache_control": {
    "type": "ephemeral",
    "ttl": "5m" | "1h"
  }
}

When to use 1-hour cache

1-hour cache is particularly suitable for:

  • Batch processing: Processing large volumes of requests with common prefixes
  • Long-running sessions: Conversations requiring context maintenance over extended periods
  • Large document analysis: Multiple different types of analysis on the same document
  • Codebase Q&A: Multiple queries on the same codebase over extended periods

Mixing different TTLs

You can mix different cache durations within the same request:

{
  "system": [
    {
      "type": "text", 
      "text": "Long-term instructions...",
      "cache_control": {
        "type": "ephemeral",
        "ttl": "1h"
      }
    },
    {
      "type": "text",
      "text": "Short-term context...", 
      "cache_control": {
        "type": "ephemeral",
        "ttl": "5m"
      }
    }
  ]
}
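
In this example the long-lived instructions sit earlier in the prompt than the short-term context: the 1-hour breakpoint covers the stable prefix on its own, while the 5-minute breakpoint extends the cached prefix to include the content that changes more often.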

What can be cached

Most blocks in the request can be designated for caching with cache_control. This includes:

  • Tools: Tool definitions in the tools array
  • System messages: Content blocks in the system array
  • Messages: Content blocks in the messages.content array, for both user and assistant turns
  • Images & Documents: Content blocks in the messages.content array, in user turns
  • Tool use and tool results: Content blocks in the messages.content array, in both user and assistant turns

Each of these elements can be marked with cache_control to enable caching for that portion of the request.
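
For example (with placeholder base64 data), an image in a user turn can be cached by attaching cache_control to the image block itself:

{
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "image",
          "source": {
            "type": "base64",
            "media_type": "image/png",
            "data": "...base64-encoded image data..."
          },
          "cache_control": {"type": "ephemeral"}
        },
        {
          "type": "text",
          "text": "Describe the chart in this image."
        }
      ]
    }
  ]
}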

What cannot be cached

While most request blocks can be cached, there are some exceptions:

  • Thinking blocks cannot be cached directly with cache_control. However, thinking blocks CAN be cached alongside other content when they appear in previous assistant turns. When cached this way, they DO count as input tokens when read from cache.
  • Sub-content blocks (like citations) themselves cannot be cached directly. Instead, cache the top-level block.
  • Empty text blocks cannot be cached.

Tracking cache performance

Monitor cache performance using these API response fields, within usage in the response (or message_start event if streaming):

  • cache_creation_input_tokens: Number of tokens written to the cache when creating a new entry.
  • cache_read_input_tokens: Number of tokens retrieved from the cache for this request.
  • input_tokens: Number of input tokens which were not read from or used to create a cache.
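
Because input_tokens counts only the uncached portion, the total prompt size for a request can be recovered as:

  total prompt tokens = input_tokens + cache_creation_input_tokens + cache_read_input_tokens

In the earlier Pride and Prejudice example, that is 21 + 188,086 + 0 tokens on the first call and 21 + 0 + 188,086 on the second.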

Best practices for effective caching

To optimize prompt caching performance:

  • Cache stable, reusable content like system instructions, background information, large contexts, or frequent tool definitions.
  • Place cached content at the prompt’s beginning for best performance.
  • Use cache breakpoints strategically to separate different cacheable prefix sections.
  • Regularly analyze cache hit rates and adjust your strategy as needed.
  • For long-term content, consider using 1-hour cache for better cost efficiency.

Optimizing for different use cases

Tailor your prompt caching strategy to your scenario:

  • Conversational agents: Reduce cost and latency for extended conversations, especially those with long instructions or uploaded documents.
  • Coding assistants: Improve autocomplete and codebase Q&A by keeping relevant sections or a summarized version of the codebase in the prompt.
  • Large document processing: Incorporate complete long-form material including images in your prompt without increasing response latency.
  • Detailed instruction sets: Share extensive lists of instructions, procedures, and examples to fine-tune Claude’s responses. Developers often include an example or two in the prompt, but with prompt caching you can get even better performance by including 20+ diverse examples of high quality answers.
  • Agentic tool use: Enhance performance for scenarios involving multiple tool calls and iterative code changes, where each step typically requires a new API call.
  • Talk to books, papers, documentation, podcast transcripts, and other longform content: Bring any knowledge base alive by embedding the entire document(s) into the prompt, and letting users ask it questions.

Troubleshooting common issues

If experiencing unexpected behavior:

  • Ensure cached sections are identical and marked with cache_control in the same locations across calls
  • Check that calls are made within the cache lifetime (5 minutes or 1 hour)
  • Verify that tool_choice and image usage remain consistent between calls
  • Validate that you are caching at least the minimum number of tokens
  • While the system will attempt to use previously cached content at positions prior to a cache breakpoint, you may use an additional cache_control parameter to guarantee cache lookup on previous portions of the prompt, which may be useful for queries with very long lists of content blocks

Note that changes to tool_choice or the presence/absence of images anywhere in the prompt will invalidate the cache, requiring a new cache entry to be created.

Cache storage and sharing

  • Organization Isolation: Caches are isolated between organizations. Different organizations never share caches, even if they use identical prompts.
  • Exact Matching: Cache hits require 100% identical prompt segments, including all text and images up to and including the block marked with cache control. The same block must be marked with cache_control during cache reads and creation.
  • Output Token Generation: Prompt caching has no effect on output token generation. The response you receive will be identical to what you would get if prompt caching was not used.

Support across different models

  • Whether Prompt Caching is supported depends on the model itself.
  • If the model inherently supports caching without requiring explicit parameter declarations, it can be supported through OpenAI-compatible forwarding.
  • OpenAI enables Prompt Caching by default: cache writes incur no extra charge, cached tokens are billed at half the normal input rate, and caches are automatically cleared after 5-10 minutes of inactivity. Details
  • Claude requires the native cache_control: { type: "ephemeral" } declaration. Cache writes cost 1.25 times the standard input rate (5-minute) or 2 times (1-hour), cache reads cost 0.1 times the normal rate, and entries have a 5-minute or 1-hour lifetime. Details
  • Deepseek V3 and R1 natively support caching. Caching rate equals the standard input cost, cached token retrieval costs 0.1 times the normal rate. Details
  • Gemini implicit caching support:
    • Implicit Caching: Enabled by default for all Gemini 2.5 models. If your request hits the cache, cost savings are automatically applied. This feature is effective as of May 8, 2025. The minimum input token count for context caching is 1,024 for Gemini 2.5 Flash and 2,048 for Gemini 2.5 Pro.
    • Tips to improve implicit cache hit rate:
      • Try placing large, frequently reused content at the beginning of the prompt.
      • Try sending requests with similar prefixes within a short time window.
    • You can view the number of cache-hit tokens in the usage_metadata field of the response object.
    • Cost savings are calculated based on prefilled cache hits. Only prefill cache and YouTube video preprocessing cache are eligible for implicit caching.