Prompt caching significantly reduces processing time and token costs for repetitive tasks and for prompts that contain consistent, reusable elements. The cached prefix covers tools, system, and messages (in that order) up to and including the block designated with cache_control.

Pricing depends on the model and the cache operation:

Model | Base Input Tokens | 5m Cache Writes | 1h Cache Writes | Cache Hits & Refreshes | Output Tokens |
---|---|---|---|---|---|
Claude Opus 4 | Platform pricing | 1.25x base price | 2x base price | 0.1x base price | Platform pricing |
Claude Sonnet 4 | Platform pricing | 1.25x base price | 2x base price | 0.1x base price | Platform pricing |
Claude Sonnet 3.7 | Platform pricing | 1.25x base price | 2x base price | 0.1x base price | Platform pricing |
Claude Sonnet 3.5 | Platform pricing | 1.25x base price | 2x base price | 0.1x base price | Platform pricing |
Claude Haiku 3.5 | Platform pricing | 1.25x base price | 2x base price | 0.1x base price | Platform pricing |
Claude Opus 3 | Platform pricing | 1.25x base price | 2x base price | 0.1x base price | Platform pricing |
Claude Haiku 3 | Platform pricing | 1.25x base price | 2x base price | 0.1x base price | Platform pricing |
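As a rough illustration of how these multipliers combine (hypothetical token counts; actual per-token prices come from platform pricing):

```python
# Relative cost of reusing a cached prefix vs. re-sending it uncached every call.
PREFIX_TOKENS = 10_000           # hypothetical reusable prefix
WRITE_5M, HIT = 1.25, 0.10       # multipliers from the table above

uncached_10_calls = 10 * PREFIX_TOKENS * 1.0             # base input rate every call
cached_10_calls = PREFIX_TOKENS * (WRITE_5M + 9 * HIT)   # one cache write + nine cache hits

print(cached_10_calls / uncached_10_calls)  # ~0.215, i.e. roughly 78% cheaper over ten calls
```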
Caching is enabled by marking the end of the reusable content with the cache_control parameter. Cache prefixes are created in the following order: tools, system, then messages.

Using the cache_control parameter, you can define up to 4 cache breakpoints, allowing you to cache different reusable sections separately. For each breakpoint, the system automatically checks for cache hits at previous positions and uses the longest matching prefix if one is found.
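A minimal sketch of two breakpoints, assuming the anthropic Python SDK (the model ID, STYLE_GUIDE, and PRIOR_CONVERSATION names are placeholders):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

STYLE_GUIDE = "…long, stable instructions reused across requests…"
PRIOR_CONVERSATION = "…earlier turns of a long conversation…"

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # placeholder model ID
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": STYLE_GUIDE,
            # Breakpoint 1: caches the prefix up to and including the system prompt.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": PRIOR_CONVERSATION,
                    # Breakpoint 2: caches system + this block, so only the final
                    # question below is processed as fresh input on later calls.
                    "cache_control": {"type": "ephemeral"},
                },
                {"type": "text", "text": "What did we decide about the launch date?"},
            ],
        }
    ],
)
```

Each breakpoint pays the cache-write premium once; subsequent requests sharing the same prefix are billed at the cache-hit rate.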
Each model has a minimum cacheable prompt length; any request that attempts to cache fewer tokens than this minimum will be processed without caching. To see whether a prompt was cached, check the response usage fields.
For concurrent requests, note that a cache entry only becomes available after the first response begins. If you need cache hits for parallel requests, wait for the first response before sending subsequent requests.
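A sketch of this warm-then-fan-out pattern, assuming the anthropic Python SDK (the model ID and document text are placeholders):

```python
import concurrent.futures

import anthropic

client = anthropic.Anthropic()

SHARED_SYSTEM = [
    {
        "type": "text",
        "text": "…large reference document, longer than the minimum cacheable length…",
        "cache_control": {"type": "ephemeral"},
    }
]

def ask(question: str):
    return client.messages.create(
        model="claude-sonnet-4-20250514",  # placeholder model ID
        max_tokens=512,
        system=SHARED_SYSTEM,
        messages=[{"role": "user", "content": question}],
    )

# Warm the cache with a single request first...
ask("Give a one-line summary of the document.")

# ...then fan out; these parallel requests can now hit the cached prefix.
questions = ["List the key dates.", "Who are the parties involved?", "Summarize section 3."]
with concurrent.futures.ThreadPoolExecutor() as pool:
    answers = list(pool.map(ask, questions))
```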
Currently, two cache types are supported: a default 5-minute TTL and an extended 1-hour TTL (in beta). To use the extended cache, add extended-cache-ttl-2025-04-11 as a beta header to your request, and then include ttl in the cache_control definition.
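A minimal sketch of the 1-hour TTL, assuming the anthropic Python SDK (the model ID and LONG_REFERENCE_TEXT are placeholders):

```python
import anthropic

client = anthropic.Anthropic()

LONG_REFERENCE_TEXT = "…reusable context exceeding the minimum cacheable length…"

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # placeholder model ID
    max_tokens=1024,
    # Opt in to the extended 1-hour cache TTL (beta).
    extra_headers={"anthropic-beta": "extended-cache-ttl-2025-04-11"},
    system=[
        {
            "type": "text",
            "text": LONG_REFERENCE_TEXT,
            # ttl accepts "5m" (the default) or "1h" once the beta header is set.
            "cache_control": {"type": "ephemeral", "ttl": "1h"},
        }
    ],
    messages=[{"role": "user", "content": "Answer using the reference text above."}],
)
```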
The following request elements can be cached:

- Tools: tool definitions in the tools array
- System messages: content blocks in the system array
- Text messages: content blocks in the messages.content array, for both user and assistant turns
- Images and documents: content blocks in the messages.content array, in user turns
- Tool use and tool results: content blocks in the messages.content array, in both user and assistant turns

Each of these elements can be marked with cache_control to enable caching for that portion of the request (see the sketch below).
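A sketch of caching a tool definition and an image block, assuming the anthropic Python SDK (the model ID, file name, and tool name are illustrative placeholders):

```python
import base64

import anthropic

client = anthropic.Anthropic()

# Placeholder image; any large, frequently reused attachment is a candidate for caching.
with open("diagram.png", "rb") as f:
    IMAGE_BASE64 = base64.b64encode(f.read()).decode()

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # placeholder model ID
    max_tokens=512,
    tools=[
        {
            "name": "search_knowledge_base",  # hypothetical tool
            "description": "Search the internal knowledge base.",
            "input_schema": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
            # A breakpoint on the last tool caches the whole tools array.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": IMAGE_BASE64,
                    },
                    # A breakpoint on the image caches it together with the tools prefix.
                    "cache_control": {"type": "ephemeral"},
                },
                {"type": "text", "text": "What does this architecture diagram show?"},
            ],
        }
    ],
)
```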
Thinking blocks cannot be cached directly with cache_control. However, thinking blocks CAN be cached alongside other content when they appear in previous assistant turns. When cached this way, they DO count as input tokens when read from cache.

Cache performance is reported in usage in the response (or in the message_start event if streaming):

- cache_creation_input_tokens: number of tokens written to the cache when creating a new entry.
- cache_read_input_tokens: number of tokens retrieved from the cache for this request.
- input_tokens: number of input tokens that were not read from or used to create a cache.
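For example, a quick check of these fields with the anthropic Python SDK (placeholder model ID; recent SDK versions expose the cache fields on response.usage):

```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # placeholder model ID
    max_tokens=256,
    system=[
        {
            "type": "text",
            "text": "…long, reusable instructions…",
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Hello"}],
)

usage = response.usage
print("written to cache:", usage.cache_creation_input_tokens)
print("read from cache: ", usage.cache_read_input_tokens)
print("uncached input:  ", usage.input_tokens)
# A cache hit shows up as cache_read_input_tokens > 0 on the second and later calls.
```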
To keep cache hits reliable, ensure tool_choice and image usage remain consistent between calls. You can also add extra breakpoints with the cache_control parameter to guarantee cache lookup on previous portions of the prompt, which may be useful for queries with very long lists of content blocks. Changes to tool_choice or the presence/absence of images anywhere in the prompt will invalidate the cache, requiring a new cache entry to be created.

In summary, caching is enabled per block with the cache_control: { type: "ephemeral" } declaration. Cache writes cost 1.25 times the standard input rate (5-minute TTL) or 2 times (1-hour TTL), cached token reads cost 0.1 times the normal rate, and entries persist for 5 minutes or 1 hour depending on the TTL. Cache usage details appear in the usage field of the response object.