AiHubMix Documentation Hub

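Prompt caching is a key mechanism for reducing model inference costs. It caches prompt content that has already been processed so that later requests can reuse it, which cuts repeated computation, lowers cost, and speeds up responses.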

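How it works

When you send a request with prompt caching enabled, the system checks whether the prompt prefix is already cached from a recent request. If it is, the cached version is used, which shortens processing time and lowers cost. Otherwise, the system processes the full prompt and caches the prefix once the response begins. This is especially useful for:

Prompts with many examples
Large amounts of context or background information
Repetitive tasks with consistent instructions
Long multi-turn conversations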

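Core mechanisms

Caching support varies by model provider:

Automatic caching

Automatic caching needs no extra configuration. The system identifies and caches reusable content on its own. It applies to models from OpenAI, DeepSeek, and others.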

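OpenAI

Minimum prompt length: 1024 tokens. A cache hit occurs automatically when the prefix matches exactly.
Models before GPT-5.6: cache writes are not billed separately; cache reads are billed at the model's cached-read price.
GPT-5.6 and later (officially "GPT-5.6 models and later model families"; currently gpt-5.6-sol / terra / luna): cache writes are billed at 1.25x the input price and reads at 0.1x. These models also add prompt_cache_key and explicit cache breakpoint parameters.
For usage, billing, and cache-hit troubleshooting, see GPT Prompt Caching.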
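
To make these multipliers concrete, here is a minimal sketch of the input-side arithmetic using the GPT-5.6-and-later ratios quoted in this section (1.25x for cache writes, 0.1x for cache reads). The function name, the base price of 2.00 per million tokens, and the token counts are illustrative assumptions, not published prices.

```python
def input_cost(uncached_tokens, cache_write_tokens, cache_read_tokens,
               base_price_per_mtok, write_mult=1.25, read_mult=0.1):
    """Return the input-side cost of one request.

    write_mult and read_mult default to the GPT-5.6+ ratios quoted above;
    base_price_per_mtok is a hypothetical price per million input tokens.
    """
    per_token = base_price_per_mtok / 1_000_000
    return per_token * (
        uncached_tokens
        + cache_write_tokens * write_mult
        + cache_read_tokens * read_mult
    )


# Hypothetical workload: a 10,000-token shared prefix plus a 200-token question.
BASE = 2.00  # assumed price per million input tokens, for illustration only

first_call = input_cost(200, 10_000, 0, BASE)  # prefix is written to the cache
later_call = input_cost(200, 0, 10_000, BASE)  # prefix is read from the cache
no_cache = input_cost(10_200, 0, 0, BASE)      # same request with no caching

print(f"first call: {first_call:.6f}")
print(f"later call: {later_call:.6f}")
print(f"no caching: {no_cache:.6f}")
```

Under these assumed numbers the first call costs slightly more than an uncached call because of the write surcharge, and the saving only appears from the second call onward.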

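Gemini

Implicit context caching is enabled by default and takes effect without manual configuration.
The cache is hit only when the content, model, and parameters are identical; any difference is treated as a new request and misses the cache.
The cache lifetime can be set by the developer or left unset, in which case it defaults to 1 hour. There is no minimum or maximum duration, and the cost depends on the number of cached tokens and how long they are kept.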

DeepSeek / Grok / Moonshot / Groq

Cost: cache writes are free or billed at the regular input price; cache reads are billed below the regular price.

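Explicit caching for Claude models

Caching is enabled through cache_control, either as a top-level request field that sets the breakpoint automatically (it moves forward as the conversation grows) or as content-block-level breakpoints for fine-grained control over what is cached.
All active Claude models are supported. Cache writes cost 1.25x for the 5-minute tier and 2x for the 1-hour tier, and reads cost 0.1x; these ratios are the same across the whole model family.
Applies to Anthropic Claude models.

Claude sets a minimum cacheable token threshold per model (512, 1,024, 2,048, or 4,096; the threshold does not rise with newer versions). For example, Claude Opus 4.8 is 1,024, Claude Opus 4.7 is 2,048, Claude Opus 4.6 / 4.5 and Claude Haiku 4.5 are 4,096, and Claude Fable 5 is 512. A prefix below the threshold is not cached even when cache_control is set explicitly, and no error is returned. You can recognize this case when cache_creation_input_tokens and cache_read_input_tokens are both 0 in the response. For the full tier list and troubleshooting, see Claude Prompt Caching.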
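
Because a prefix below the threshold fails silently, it can help to inspect the two usage counters after every call. The sketch below classifies a usage dict by those counters. The function name and the returned labels are hypothetical; only the two field names come from this page.

```python
def cache_status(usage):
    """Classify a response's usage dict by its two cache counters."""
    created = usage.get("cache_creation_input_tokens") or 0
    read = usage.get("cache_read_input_tokens") or 0
    if read > 0 and created > 0:
        return "partial hit"  # part of the prefix was read, the rest newly written
    if read > 0:
        return "hit"          # prefix served from the cache
    if created > 0:
        return "written"      # prefix cached by this request
    return "not cached"       # both zero, e.g. prefix below the model's minimum


# Hand-written usage dicts for illustration, not real API output.
print(cache_status({"cache_creation_input_tokens": 5000, "cache_read_input_tokens": 0}))
print(cache_status({"cache_creation_input_tokens": 0, "cache_read_input_tokens": 5000}))
print(cache_status({"cache_creation_input_tokens": 0, "cache_read_input_tokens": 0}))
```

With the Anthropic Python SDK, response.usage.model_dump() should produce a dict of this shape, though that is worth confirming against the SDK version in use.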

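OpenAI-compatible interface

You can set cache breakpoints with the cache_control field in system messages, user messages (including images), and tools. The examples below show only the key structure.

System message caching (default 5-minute TTL):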

{
  "model": "claude-opus-4-5",
  "messages": [
    {
      "role": "system",
      "content": [
        {"type": "text", "text": "You are an AI assistant"},
        {
          "type": "text",
          "text": "(long context)",
          "cache_control": {"type": "ephemeral"}
        }
      ]
    },
    {
      "role": "user",
      "content": [{"type": "text", "text": "Hello"}]
    }
  ]
}

User message caching (1-hour TTL):

{
  "model": "claude-opus-4-5",
  "messages": [
    {
      "role": "system",
      "content": [{"type": "text", "text": "You are an AI assistant"}]
    },
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "(long context)",
          "cache_control": {"type": "ephemeral", "ttl": "1h"}
        },
        {"type": "text", "text": "Hello"}
      ]
    }
  ]
}

Image message caching:

{
  "role": "user",
  "content": [
    {
      "type": "image_url",
      "image_url": {"detail": "auto", "url": "data:image/jpeg;base64,/9j/4AAQ..."},
      "cache_control": {"type": "ephemeral"}
    },
    {"type": "text", "text": "What's this?"}
  ]
}

Tool definition caching: place cache_control at the top level of the tool object (alongside type and function):

{
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Get current weather for a location",
      "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"]
      }
    },
    "cache_control": {"type": "ephemeral", "ttl": "1h"}
  }]
}
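
The JSON structures above can also be assembled in code. The following is a minimal sketch that builds the system-message variant as a Python dict; the helper names cached_text and build_payload are hypothetical, and the sketch only prints the payload rather than sending it, since this section does not spell out the endpoint path.

```python
import json


def cached_text(text, ttl=None):
    """Build a text content block that carries a cache breakpoint.

    ttl=None keeps the default 5-minute lifetime; ttl="1h" requests one hour.
    """
    cache_control = {"type": "ephemeral"}
    if ttl is not None:
        cache_control["ttl"] = ttl
    return {"type": "text", "text": text, "cache_control": cache_control}


def build_payload(model, instructions, long_context, question, ttl=None):
    """Assemble a chat payload with the long context cached in the system message."""
    return {
        "model": model,
        "messages": [
            {
                "role": "system",
                "content": [
                    {"type": "text", "text": instructions},
                    cached_text(long_context, ttl),
                ],
            },
            {"role": "user", "content": [{"type": "text", "text": question}]},
        ],
    }


payload = build_payload(
    "claude-opus-4-5", "You are an AI assistant", "(long context)", "Hello", ttl="1h"
)
print(json.dumps(payload, indent=2))
```

Keeping the breakpoint logic in one helper makes it harder to misplace cache_control or mistype the ttl value across many call sites.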

Anthropic-compatible interface

curl https://aihubmix.com/v1/messages \
  -H "content-type: application/json" \
  -H "x-api-key: $AIHUBMIX_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -d '{
    "model": "claude-opus-4-6",
    "max_tokens": 1024,
    "system": [
      {
        "type": "text",
        "text": "You are an AI assistant tasked with analyzing literary works. Your goal is to provide insightful commentary on themes, characters, and writing style.\n"
      },
      {
        "type": "text",
        "text": "<the entire contents of Pride and Prejudice>",
        "cache_control": {"type": "ephemeral"}
      }
    ],
    "messages": [
      {
        "role": "user",
        "content": "Analyze the major themes in Pride and Prejudice."
      }
    ]
  }'

# Call the model again with the same input up to the cache breakpoint
curl https://aihubmix.com/v1/messages # rest of input

from anthropic import Anthropic

# Point the Anthropic SDK at the AiHubMix gateway
client = Anthropic(
    api_key="<AIHUBMIX_API_KEY>",
    base_url="https://aihubmix.com",
)

params = {
    "model": "claude-opus-4-6",
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": "You are an AI assistant tasked with analyzing literary works. Your goal is to provide insightful commentary on themes, characters, and writing style.\n",
        },
        {
            "type": "text",
            "text": "<the entire contents of 'Pride and Prejudice'>",
            "cache_control": {"type": "ephemeral"},
        },
    ],
    "messages": [
        {
            "role": "user",
            "content": "Analyze the major themes in 'Pride and Prejudice'.",
        }
    ],
}
response = client.messages.create(**params)
print(response.usage.model_dump_json())

# Call the model again with the same input up to the cache breakpoint
response = client.messages.create(**params)
print(response.usage.model_dump_json())

Cache duration

Default: 5 minutes
Optional: 1 hour ("ttl": "1h")

For more information, see Claude Prompt Caching.

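Usage recommendations

Keep the prefix stable

Place fixed content at the start of the prompt. Recommended structure:

[System settings / long text / RAG data]
[User question (the variable part)]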
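
The effect of this ordering can be illustrated by measuring how much of two prompts is identical from the start. The sketch below is a character-level approximation only: real caches match on tokens, and the names STATIC_CONTEXT, build_prompt, and shared_prefix_len are hypothetical.

```python
from os.path import commonprefix

# Stand-in for system settings, long text, or RAG data that never changes.
STATIC_CONTEXT = "System rules.\n" + "Reference document text. " * 50


def build_prompt(question, stable_first=True):
    """Join the fixed context and the variable question in either order."""
    if stable_first:
        return STATIC_CONTEXT + "\nQuestion: " + question
    return "Question: " + question + "\n" + STATIC_CONTEXT


def shared_prefix_len(a, b):
    """Number of leading characters two prompts have in common."""
    return len(commonprefix([a, b]))


q1 = "What is the refund policy?"
q2 = "How do I reset my password?"

good = shared_prefix_len(build_prompt(q1), build_prompt(q2))
bad = shared_prefix_len(build_prompt(q1, False), build_prompt(q2, False))

print("stable content first:", good, "shared characters")
print("question first:      ", bad, "shared characters")
```

With the fixed content first, the whole context is shared between the two requests; with the question first, the prompts diverge almost immediately and nothing after that point can be reused.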

Cache large blocks of text

Prioritize caching the following:

RAG data
Long text
CSV / JSON data
Role settings

Control the TTL

Short-lived conversations → 5 minutes
Long-running conversations → 1 hour (more cost-effective)
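
Whether the 1-hour tier is actually cheaper depends on how far apart requests arrive. The sketch below compares the two tiers using the Claude ratios quoted earlier on this page (1.25x and 2x for writes, 0.1x for reads). The function name, the traffic patterns, and the assumption that each hit restarts the TTL are illustrative and not taken from this page.

```python
def prefix_cost(gaps_minutes, ttl_minutes, write_mult, read_mult=0.1):
    """Cost of the cached prefix over a series of requests, expressed in
    multiples of its uncached input price.

    gaps_minutes lists the idle time before each follow-up request.
    Assumes every cache hit restarts the TTL (an assumption of this sketch).
    """
    cost = write_mult            # the first request writes the prefix
    for gap in gaps_minutes:
        if gap < ttl_minutes:
            cost += read_mult    # still cached: pay the read rate
        else:
            cost += write_mult   # expired: pay to write it again
    return cost


# Hypothetical traffic: four follow-ups, either 20 minutes or 1 minute apart.
slow = [20, 20, 20, 20]
fast = [1, 1, 1, 1]

slow_5m = prefix_cost(slow, ttl_minutes=5, write_mult=1.25)
slow_1h = prefix_cost(slow, ttl_minutes=60, write_mult=2.0)
fast_5m = prefix_cost(fast, ttl_minutes=5, write_mult=1.25)
fast_1h = prefix_cost(fast, ttl_minutes=60, write_mult=2.0)

print(f"20-minute gaps: 5m tier {slow_5m:.2f}, 1h tier {slow_1h:.2f}")
print(f" 1-minute gaps: 5m tier {fast_5m:.2f}, 1h tier {fast_1h:.2f}")
```

In this model the 1-hour tier pays off when gaps between requests exceed 5 minutes, while the 5-minute tier stays cheaper for rapid back-and-forth because its write surcharge is lower.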

Reduce cache writes

Keep frequently changing content out of the cache. Do not cache timestamps, user-input variables, or other high-churn data.

Last updated: 2026-07-10
