AiHubMix Documentation Hub

AIHubMix OpenAI-Compatible Interface: Deep Support for Claude Thinking, Caching, and Beta Features

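We have upgraded the OpenAI-compatible interface with deeper optimizations for the Claude model family. You can now control thinking and caching more precisely and conveniently. In particular, interleaved thinking in multi-turn conversations is easier to use: it integrates seamlessly with no extra parameters. You can also enable the beta features that Anthropic provides.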

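1. Model Thinking (Extended Thinking)

1.1 Benefits of Interleaved Thinking

Without interleaved thinking, the model thinks only once, at the start of the assistant turn. After a tool result arrives, it generates the response directly and produces no new thinking block: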

User → [Thinking] → Tool Call → Tool Result → Response

With interleaved thinking enabled, the model inserts a new thinking block each time it receives a tool result, forming a chain of reasoning:

User → [Thinking] → Tool Call → Tool Result → [Thinking] → Response
                                                ↑ Interleaved Thinking

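This allows the model to:

Reason a second time based on the tool result, rather than simply concatenating outputs.
Chain its reasoning across multiple tool calls, so that each decision builds on the analysis of the previous step.

Reference: Anthropic Interleaved Thinking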

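1.2 Enabling Thinking

You can enable thinking in any one of four ways:

Method	Example	Description
`reasoning_effort`	`"reasoning_effort": "low"`	Standard OpenAI parameter, placed at the top level of the request body
`reasoning.effort`	`"reasoning": {"effort": "low"}`	Equivalent to the above, placed inside the reasoning object
`reasoning.max_tokens`	`"reasoning": {"max_tokens": 1024}`	Precisely controls the maximum number of thinking tokens
Model name with the `-think` suffix	`"model": "claude-sonnet-4-5-think"`	The simplest option; no extra parameters needed

Priority (when several methods are used together): reasoning_effort > reasoning.max_tokens > reasoning.effort > -think suffix

Possible effort values: minimal / low / medium / high / xhigh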
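
The priority order can be illustrated with a small client-side sketch. It only models the documented precedence and is not AIHubMix's actual implementation; the function name `resolve_thinking_method` and the sample request bodies are hypothetical.

```python
from typing import Optional

def resolve_thinking_method(body: dict) -> Optional[str]:
    """Return which thinking setting wins under the documented priority,
    or None if the request body does not enable thinking at all."""
    reasoning = body.get("reasoning") or {}
    if "reasoning_effort" in body:
        return "reasoning_effort"
    if "max_tokens" in reasoning:
        return "reasoning.max_tokens"
    if "effort" in reasoning:
        return "reasoning.effort"
    if str(body.get("model", "")).endswith("-think"):
        return "-think suffix"
    return None

# Four request bodies, each enabling thinking in a different way.
examples = [
    {"model": "claude-sonnet-4-5", "reasoning_effort": "low"},
    {"model": "claude-sonnet-4-5", "reasoning": {"effort": "low"}},
    {"model": "claude-sonnet-4-5", "reasoning": {"max_tokens": 1024}},
    {"model": "claude-sonnet-4-5-think"},
]
for body in examples:
    print(resolve_thinking_method(body))

# When several methods are combined, the highest-priority one applies.
mixed = {
    "model": "claude-sonnet-4-5-think",
    "reasoning": {"effort": "high", "max_tokens": 1024},
}
print(resolve_thinking_method(mixed))  # reasoning.max_tokens
```

In practice it is simpler to pick one method per request, so the outcome does not depend on the precedence rules.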

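1.3 Thinking Return Values

The response message gains two additional fields:

reasoning_content: the thinking content as a string, convenient for display.
reasoning_details: the complete structured thinking information, which must be passed back unmodified in multi-turn conversations; its internal structure may differ between providers.

Non-streaming example (unrelated fields omitted):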

{
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "Hello! How can I help you today?",
      "reasoning_content": "The user is just saying hello...",
      "reasoning_details": {
        "type": "thinking",
        "thinking": "The user is just saying hello...",
        "signature": "Er8CCkYI..."
      }
    }
  }]
}
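
As a minimal sketch, the two fields can be read from a parsed response like this. The helper name `extract_thinking` is hypothetical, and the sample dictionary simply mirrors the JSON above.

```python
def extract_thinking(response: dict) -> tuple:
    """Return (reasoning_content, reasoning_details) from a parsed response.

    Either value is None when the response carries no thinking.
    """
    message = response["choices"][0]["message"]
    return message.get("reasoning_content"), message.get("reasoning_details")

sample = {
    "choices": [{
        "message": {
            "role": "assistant",
            "content": "Hello! How can I help you today?",
            "reasoning_content": "The user is just saying hello...",
            "reasoning_details": {
                "type": "thinking",
                "thinking": "The user is just saying hello...",
                "signature": "Er8CCkYI...",
            },
        }
    }]
}

text, details = extract_thinking(sample)
print(text)  # display-friendly string

# reasoning_details is treated as opaque: it is stored and sent back as-is.
next_assistant_msg = {
    "role": "assistant",
    "content": sample["choices"][0]["message"]["content"],
    "reasoning_details": details,
}
```

Keeping `reasoning_details` opaque avoids depending on a structure that may differ between providers.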

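In streaming responses, the thinking content is delivered in chunks through delta.reasoning_content and delta.reasoning_details. For the full logic of concatenating streamed chunks, see the complete examples below.

1.4 Preserving Thinking in Multi-Turn Conversations (Interleaved Thinking Built In, No Extra Parameters)

To let the model continue its reasoning across turns, place the reasoning_details returned by the previous call, unmodified, into the assistant message of the next turn: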

messages = [
    {"role": "user", "content": "What's the weather like in Boston?"},
    {
        "role": "assistant",
        "content": response.choices[0].message.content,
        "tool_calls": response.choices[0].message.tool_calls,
        "reasoning_details": response.choices[0].message.reasoning_details,
    },
    {
        "role": "tool",
        "tool_call_id": "toolu_xxx",
        "content": '{"temperature": 45, "condition": "rainy"}',
    }
]

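When AIHubMix detects historical thinking information in a request, it automatically enables interleaved thinking, so the model can keep reasoning in depth after receiving tool call results, with no extra parameters required.

1.5 Complete Examples

The two examples below demonstrate the complete multi-turn tool call + interleaved thinking flow: the user asks a question → the model thinks and calls a tool → the tool result is injected (preserving reasoning_details) → the model uses interleaved thinking to give the final response.

Non-streaming · Interleaved Thinking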

import os
import json
from openai import OpenAI

client = OpenAI(
    base_url="https://aihubmix.com/v1",
    api_key=os.environ.get("AIHUBMIX_API_KEY", "sk-***"),
)

# ── Tool definition ───────────────────────────────────────────
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a location",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string", "description": "City name"}},
            "required": ["location"]
        }
    }
}]

# ── Mock tool execution ───────────────────────────────────────
WEATHER_DB = {
    "boston": {"temperature": "45°F (7°C)", "condition": "rainy", "humidity": "85%", "wind": "15 mph NE"},
    "tokyo":  {"temperature": "72°F (22°C)", "condition": "sunny", "humidity": "45%", "wind": "5 mph S"},
}

def execute_tool(name: str, args: dict) -> str:
    if name == "get_weather":
        key = next((k for k in WEATHER_DB if k in args.get("location", "").lower()), None)
        return json.dumps(WEATHER_DB.get(key, {"temperature": "65°F", "condition": "clear"}))
    return "{}"

# ── Multi-turn conversation loop ─────────────────────────────
messages = [
    {"role": "user", "content": "What's the weather like in Boston? Then recommend what to wear."}
]

turn = 0
while True:
    turn += 1
    print(f"\n── Turn {turn} ──")

    response = client.chat.completions.create(
        model="claude-sonnet-4-5",
        messages=messages,
        tools=tools,
        extra_body={"reasoning": {"max_tokens": 2000}},
    )
    msg = response.choices[0].message

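    # Print thinking process (getattr: the extra field may be absent from the message)
    reasoning_content = getattr(msg, "reasoning_content", None)
    if reasoning_content:
        label = "Interleaved Thinking" if turn > 1 else "Thinking"
        print(f"[{label}] {reasoning_content}")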

    # Print response content
    if msg.content:
        print(f"[Response] {msg.content}")

    # Print tool calls
    if msg.tool_calls:
        for tc in msg.tool_calls:
            print(f"[Tool Call: {tc.function.name}] {tc.function.arguments}")

    # Build assistant message, preserve reasoning_details (critical!)
    assistant_msg = {"role": "assistant", "content": msg.content}
    if msg.tool_calls:
        assistant_msg["tool_calls"] = msg.tool_calls
    reasoning_details = getattr(msg, "reasoning_details", None)  # extra field, may be absent
    if reasoning_details:
        assistant_msg["reasoning_details"] = reasoning_details  # pass back unmodified
    messages.append(assistant_msg)

    # No tool_calls means conversation is done
    if not msg.tool_calls:
        break

    # Execute tools and append results to messages
    for tc in msg.tool_calls:
        args = json.loads(tc.function.arguments)
        result = execute_tool(tc.function.name, args)
        print(f"[Tool Result: {tc.function.name}] {result}")
        messages.append({"role": "tool", "tool_call_id": tc.id, "content": result})

Streaming · Interleaved Thinking

import os
import sys
import json
from openai import OpenAI

client = OpenAI(
    base_url="https://aihubmix.com/v1",
    api_key=os.environ.get("AIHUBMIX_API_KEY", "sk-***"),
)

# ── Tool definition & mock execution ─────────────────────────
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a location",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string", "description": "City name"}},
            "required": ["location"]
        }
    }
}]

WEATHER_DB = {
    "boston": {"temperature": "45°F (7°C)", "condition": "rainy", "humidity": "85%", "wind": "15 mph NE"},
    "tokyo":  {"temperature": "72°F (22°C)", "condition": "sunny", "humidity": "45%", "wind": "5 mph S"},
}

def execute_tool(name: str, args: dict) -> str:
    if name == "get_weather":
        key = next((k for k in WEATHER_DB if k in args.get("location", "").lower()), None)
        return json.dumps(WEATHER_DB.get(key, {"temperature": "65°F", "condition": "clear"}))
    return "{}"

# ── Stream response collector ────────────────────────────────
def stream_and_collect(turn: int, **kwargs):
    """Stream response, print thinking/content in real-time, accumulate reasoning_details/tool_calls."""
    rd = {}            # accumulated reasoning_details
    content = ""       # accumulated response text
    tc_map = {}        # accumulated tool_calls (by index)
    cur = "none"       # current output section: none / thinking / content

    stream = client.chat.completions.create(stream=True, **kwargs)
    for chunk in stream:
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta

        # ── Handle thinking ──
        rd_delta = getattr(delta, "reasoning_details", None)
        if rd_delta and isinstance(rd_delta, dict):
            for k, v in rd_delta.items():
                if k == "type":
                    rd[k] = v
                elif isinstance(v, str):
                    rd[k] = rd.get(k, "") + v
                elif v is not None:
                    rd[k] = v
            # Print thinking chunks in real-time
            thinking_chunk = rd_delta.get("thinking", "")
            if thinking_chunk:
                if cur != "thinking":
                    cur = "thinking"
                    label = "Interleaved Thinking" if turn > 1 else "Thinking"
                    sys.stdout.write(f"\n[{label}] ")
                sys.stdout.write(thinking_chunk)
                sys.stdout.flush()

        # ── Handle content ──
        if delta.content:
            if cur != "content":
                if cur == "thinking":
                    sys.stdout.write("\n")
                cur = "content"
                sys.stdout.write("\n[Response] ")
            sys.stdout.write(delta.content)
            sys.stdout.flush()
            content += delta.content

        # ── Handle tool_calls ──
        for tc in delta.tool_calls or []:
            i = tc.index
            if i not in tc_map:
                tc_map[i] = {"id": "", "type": "function",
                             "function": {"name": "", "arguments": ""}}
            if tc.id:
                tc_map[i]["id"] = tc.id
            if tc.function:
                tc_map[i]["function"]["name"] += tc.function.name or ""
                tc_map[i]["function"]["arguments"] += tc.function.arguments or ""

    # End current output section
    if cur in ("thinking", "content"):
        sys.stdout.write("\n")

    tool_calls = [tc_map[i] for i in sorted(tc_map)] if tc_map else None
    return {
        "content": content or None,
        "reasoning_details": rd or None,
        "tool_calls": tool_calls,
    }

# ── Multi-turn conversation loop ─────────────────────────────
messages = [
    {"role": "user", "content": "What's the weather like in Boston? Then recommend what to wear."}
]

turn = 0
while True:
    turn += 1
    print(f"\n── Turn {turn} ──")

    result = stream_and_collect(
        turn,
        model="claude-sonnet-4-5",
        messages=messages,
        tools=tools,
        extra_body={"reasoning": {"max_tokens": 2000}},
    )

    # Print tool calls
    if result["tool_calls"]:
        for tc in result["tool_calls"]:
            print(f"[Tool Call: {tc['function']['name']}] {tc['function']['arguments']}")

    # Build assistant message, preserve reasoning_details (critical!)
    assistant_msg = {"role": "assistant", "content": result["content"]}
    if result["tool_calls"]:
        assistant_msg["tool_calls"] = result["tool_calls"]
    if result["reasoning_details"]:
        assistant_msg["reasoning_details"] = result["reasoning_details"]  # pass back unmodified
    messages.append(assistant_msg)

    # No tool_calls means conversation is done
    if not result["tool_calls"]:
        break

    # Execute tools and append results to messages
    for tc in result["tool_calls"]:
        args = json.loads(tc["function"]["arguments"])
        tool_result = execute_tool(tc["function"]["name"], args)
        print(f"[Tool Result: {tc['function']['name']}] {tool_result}")
        messages.append({"role": "tool", "tool_call_id": tc["id"], "content": tool_result})

1.6 Thinking Effort Mapping Rules

Effort mode:

Opus 4.6 / Sonnet 4.6 and later: mapped to Anthropic's native Adaptive Thinking effort levels.
Other models: budget_tokens is computed with the following formula:

budget_tokens = max(min(max_tokens × effort_ratio, 128000), 1024)

effort	effort_ratio
xhigh	0.95
high	0.80
medium	0.50
low	0.20
minimal	0.10
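
The formula and ratio table can be worked through in a few lines. This is a sketch: the function name `budget_tokens` is chosen for illustration, and truncating to an integer with `int()` is an assumption, since the source does not say how fractional results are rounded.

```python
# effort_ratio values copied from the table above.
EFFORT_RATIO = {"xhigh": 0.95, "high": 0.80, "medium": 0.50, "low": 0.20, "minimal": 0.10}

def budget_tokens(max_tokens: int, effort: str) -> int:
    """budget_tokens = max(min(max_tokens * effort_ratio, 128000), 1024)."""
    raw = max_tokens * EFFORT_RATIO[effort]
    # int() truncation is an assumption; the rounding rule is not documented.
    return int(max(min(raw, 128000), 1024))

print(budget_tokens(16000, "medium"))   # 8000   (16000 * 0.50)
print(budget_tokens(4096, "low"))       # 1024   (819.2 raised to the 1024 floor)
print(budget_tokens(200000, "xhigh"))   # 128000 (190000 capped at 128000)
```

The second and third calls show the two clamps: small results are raised to 1024, and large results are capped at 128000.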

Adaptive Thinking effort mapping:

Input effort	Opus 4.6	Sonnet 4.6
xhigh	max	high
high	high	high
medium	medium	medium
low	low	low
minimal	low	low

max_tokens mode: the value is assigned directly to Anthropic's budget_tokens.

-think suffix: Opus/Sonnet 4.6+ use adaptive thinking (effort=medium); other models set budget_tokens = min(10240, max_tokens - 1), with max_tokens defaulting to 4096.
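
The adaptive mapping table and the -think suffix rule can be expressed as a lookup plus one formula. This is an illustrative sketch: the dictionary keys `"opus-4.6"` and `"sonnet-4.6"` and both function names are hypothetical labels, not identifiers used by the API.

```python
from typing import Optional

# Adaptive Thinking effort mapping, copied from the table above.
ADAPTIVE_EFFORT = {
    "opus-4.6":   {"xhigh": "max",  "high": "high", "medium": "medium", "low": "low", "minimal": "low"},
    "sonnet-4.6": {"xhigh": "high", "high": "high", "medium": "medium", "low": "low", "minimal": "low"},
}

def adaptive_effort(family: str, effort: str) -> str:
    """Map a requested effort to the native adaptive effort level."""
    return ADAPTIVE_EFFORT[family][effort]

def think_suffix_budget(max_tokens: Optional[int] = None) -> int:
    """budget_tokens for non-adaptive models selected via the -think suffix."""
    if max_tokens is None:
        max_tokens = 4096  # documented default
    return min(10240, max_tokens - 1)

print(adaptive_effort("opus-4.6", "xhigh"))    # max
print(adaptive_effort("sonnet-4.6", "xhigh"))  # high
print(think_suffix_budget())                   # 4095
print(think_suffix_budget(20000))              # 10240
```

With the default max_tokens of 4096, the -think suffix therefore leaves a single token outside the thinking budget, so setting max_tokens explicitly is usually worthwhile.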

2. Prompt Caching

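When sending requests to Claude models through the Chat interface, you can use Prompt Caching. By setting cache_control breakpoints in messages, large blocks of text (character cards, RAG data, book chapters, and so on) can be cached for reuse; subsequent requests hit the cache directly and cost substantially less.

Official Claude documentation: Prompt Caching

2.1 Cache Pricing

Operation	Price multiplier (relative to the base input price)
Cache write (5-minute TTL)	1.25x
Cache write (1-hour TTL)	2x
Cache read	0.1x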
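
To see how the multipliers affect a bill, the sketch below compares one cache write followed by several reads against sending the same tokens uncached each time. The base price of 3.0 per million tokens and all function names are placeholders for illustration, not actual AIHubMix prices.

```python
# Multipliers copied from the table above.
MULTIPLIER = {"write_5m": 1.25, "write_1h": 2.0, "read": 0.1}

def cache_cost(tokens: int, base_price_per_mtok: float, operation: str) -> float:
    """Cost of one cache operation on `tokens` input tokens."""
    return tokens / 1_000_000 * base_price_per_mtok * MULTIPLIER[operation]

def cached_total(tokens: int, base_price_per_mtok: float, write_op: str, reads: int) -> float:
    """One cache write followed by `reads` cache hits."""
    return (cache_cost(tokens, base_price_per_mtok, write_op)
            + reads * cache_cost(tokens, base_price_per_mtok, "read"))

def uncached_total(tokens: int, base_price_per_mtok: float, requests: int) -> float:
    """The same prefix sent at the full input price on every request."""
    return requests * tokens / 1_000_000 * base_price_per_mtok

BASE = 3.0  # hypothetical base input price per million tokens
print(cached_total(100_000, BASE, "write_5m", reads=3))
print(cached_total(100_000, BASE, "write_1h", reads=3))
print(uncached_total(100_000, BASE, requests=4))
```

Under these assumptions, a single cache hit already offsets the extra cost of a 5-minute write, because 1.25 + 0.1 is less than 2.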

2.2 Supported Models and Minimum Cacheable Length

Model	Minimum cacheable tokens
Claude Opus 4.8	1024
Claude Opus 4.7	2048
Claude Opus 4.6 / Opus 4.5	4096
Claude Sonnet 4.6 / Sonnet 4.5 / Opus 4.1 / Opus 4 / Sonnet 4 / Sonnet 3.7 (deprecated)	1024
Claude Haiku 4.5	4096
Claude Haiku 3.5 (deprecated) / Haiku 3	2048

Breakpoint limit: at most 4 cache_control breakpoints per request.
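
A request can be checked against the breakpoint limit before it is sent. The sketch below assumes breakpoints appear only on message content parts and on tool objects, which are the placements this document describes; the function names are hypothetical.

```python
def count_cache_breakpoints(body: dict) -> int:
    """Count cache_control markers on message content parts and tool objects."""
    count = 0
    for message in body.get("messages", []):
        content = message.get("content")
        if isinstance(content, list):  # string content cannot carry a breakpoint
            count += sum(1 for part in content
                         if isinstance(part, dict) and "cache_control" in part)
    count += sum(1 for tool in body.get("tools", []) if "cache_control" in tool)
    return count

def check_breakpoint_limit(body: dict, limit: int = 4) -> int:
    """Return the breakpoint count, or raise ValueError if it exceeds the limit."""
    n = count_cache_breakpoints(body)
    if n > limit:
        raise ValueError(f"{n} cache_control breakpoints, at most {limit} allowed")
    return n

body = {
    "messages": [
        {"role": "system", "content": [
            {"type": "text", "text": "(long context)", "cache_control": {"type": "ephemeral"}},
        ]},
        {"role": "user", "content": "Hello"},
    ],
    "tools": [{"type": "function", "function": {"name": "get_weather"},
               "cache_control": {"type": "ephemeral"}}],
}
print(check_breakpoint_limit(body))  # 2
```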

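2.3 Cache TTL

TTL	Syntax	Use case
5 minutes (default)	`"cache_control": {"type": "ephemeral"}`	Short conversations, routine requests
1 hour	`"cache_control": {"type": "ephemeral", "ttl": "1h"}`	Long-running conversations, avoiding repeated cache writes

Writes with the 1-hour TTL cost more, but they can lower the total cost by reducing repeated writes in long-running conversations. All models after Claude 4.5 support the 1-hour TTL on every provider (including Anthropic, Amazon Bedrock, and Google Vertex AI).

2.4 Usage

You can set cache breakpoints with the cache_control field in system messages, user messages (including images), and tools. The examples below show only the key structure and omit the long text.

System message caching (default 5-minute TTL):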

{
  "model": "claude-opus-4-5",
  "messages": [
    {
      "role": "system",
      "content": [
        {"type": "text", "text": "You are an AI assistant"},
        {
          "type": "text",
          "text": "(long context)",
          "cache_control": {"type": "ephemeral"}
        }
      ]
    },
    {
      "role": "user",
      "content": [{"type": "text", "text": "Hello"}]
    }
  ]
}

User message caching (1-hour TTL):

{
  "model": "claude-opus-4-5",
  "messages": [
    {
      "role": "system",
      "content": [{"type": "text", "text": "You are an AI assistant"}]
    },
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "(long context)",
          "cache_control": {"type": "ephemeral", "ttl": "1h"}
        },
        {"type": "text", "text": "Hello"}
      ]
    }
  ]
}

Image message caching:

{
  "role": "user",
  "content": [
    {
      "type": "image_url",
      "image_url": {"detail": "auto", "url": "data:image/jpeg;base64,/9j/4AAQ..."},
      "cache_control": {"type": "ephemeral"}
    },
    {"type": "text", "text": "What's this?"}
  ]
}

Tool definition caching: cache_control goes at the top level of the tool object (alongside type and function):

{
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Get current weather for a location",
      "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"]
      }
    },
    "cache_control": {"type": "ephemeral", "ttl": "1h"}
  }]
}
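
The same structures can be sent from Python with the OpenAI SDK used elsewhere in this document. The sketch below mirrors the system message caching JSON; the helper names `build_cached_messages` and `send` are hypothetical, and it assumes the SDK forwards the extra cache_control key inside content parts unchanged.

```python
import os
from typing import Optional

def build_cached_messages(long_context: str, question: str,
                          ttl: Optional[str] = None) -> list:
    """Build messages with a cache breakpoint on the long system text.

    ttl=None uses the default 5-minute TTL; ttl="1h" requests the 1-hour TTL.
    """
    cache_control = {"type": "ephemeral"}
    if ttl:
        cache_control["ttl"] = ttl
    return [
        {"role": "system", "content": [
            {"type": "text", "text": "You are an AI assistant"},
            {"type": "text", "text": long_context, "cache_control": cache_control},
        ]},
        {"role": "user", "content": [{"type": "text", "text": question}]},
    ]

def send(messages: list, model: str = "claude-opus-4-5"):
    """Send the request; requires the openai package and AIHUBMIX_API_KEY."""
    from openai import OpenAI  # imported here so the builder works without it
    client = OpenAI(base_url="https://aihubmix.com/v1",
                    api_key=os.environ["AIHUBMIX_API_KEY"])
    return client.chat.completions.create(model=model, messages=messages)

messages = build_cached_messages("(long context)", "Hello", ttl="1h")
print(messages[0]["content"][1]["cache_control"])
```

Calling `send(messages)` twice within the TTL should show a cache write on the first response and a cache read on the second, provided the cached text meets the model's minimum length.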

2.5 Checking Cache Status

The usage object in the response includes claude_cache_tokens_details, which records detailed cache information.

First request (cache creation):

{
  "usage": {
    "prompt_tokens": 22,
    "completion_tokens": 890,
    "total_tokens": 912,
    "claude_cache_tokens_details": {
      "cache_creation_input_tokens": 6266,
      "cache_read_input_tokens": 0,
      "cache_write_5_minutes_input_tokens": 6266,
      "cache_write_1_hour_input_tokens": 0
    }
  }
}

Subsequent request (cache hit):

{
  "usage": {
    "prompt_tokens": 22,
    "completion_tokens": 810,
    "total_tokens": 832,
    "prompt_tokens_details": {
      "cached_tokens": 6266
    },
    "claude_cache_tokens_details": {
      "cache_creation_input_tokens": 0,
      "cache_read_input_tokens": 6266,
      "cache_write_5_minutes_input_tokens": 0,
      "cache_write_1_hour_input_tokens": 0
    }
  }
}

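Field	Meaning
`cache_creation_input_tokens`	Number of tokens written to the cache in this request
`cache_read_input_tokens`	Number of tokens read from the cache in this request
`cache_write_5_minutes_input_tokens`	Number of tokens written to the 5-minute TTL cache
`cache_write_1_hour_input_tokens`	Number of tokens written to the 1-hour TTL cache
`prompt_tokens_details.cached_tokens`	Number of cached tokens on a cache hit, compatible with the OpenAI format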
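
These fields can be condensed into a quick status check. The sketch below works on a parsed usage dictionary; the function name `summarize_cache_usage` and the status labels are hypothetical, and the two sample dictionaries copy the cache fields from the responses above.

```python
def summarize_cache_usage(usage: dict) -> dict:
    """Classify a response as a cache hit, a cache creation, or neither."""
    details = usage.get("claude_cache_tokens_details") or {}
    written = details.get("cache_creation_input_tokens", 0)
    read = details.get("cache_read_input_tokens", 0)
    if read > 0:
        status = "hit"
    elif written > 0:
        status = "created"
    else:
        status = "none"
    return {"status": status, "written": written, "read": read}

first = {"prompt_tokens": 22, "claude_cache_tokens_details": {
    "cache_creation_input_tokens": 6266, "cache_read_input_tokens": 0,
    "cache_write_5_minutes_input_tokens": 6266, "cache_write_1_hour_input_tokens": 0}}
second = {"prompt_tokens": 22, "claude_cache_tokens_details": {
    "cache_creation_input_tokens": 0, "cache_read_input_tokens": 6266,
    "cache_write_5_minutes_input_tokens": 0, "cache_write_1_hour_input_tokens": 0}}

print(summarize_cache_usage(first))   # {'status': 'created', 'written': 6266, 'read': 0}
print(summarize_cache_usage(second))  # {'status': 'hit', 'written': 0, 'read': 6266}
```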

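3. The anthropic-beta Request Header

You can enable beta features of Claude models through the anthropic-beta HTTP header; AIHubMix forwards it to the Anthropic API.

Usage

Add anthropic-beta to the request headers, with the identifier of the desired beta feature as its value: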

curl "https://aihubmix.com/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-***" \
  -H "anthropic-beta: context-1m-2025-08-07" \
  -d '{
  "model": "claude-opus-4-5",
  "messages": [
    {
      "role": "system",
      "content": [
        {"type": "text", "text": "You are an AI assistant"},
        {
          "type": "text",
          "text": "(long context)",
          "cache_control": {"type": "ephemeral"}
        }
      ]
    },
    {"role": "user", "content": [{"type": "text", "text": "hello"}]}
  ]
}'

For the available beta identifiers, see the Anthropic API documentation.
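
With the OpenAI Python SDK, the same header can be attached per request through the `extra_headers` argument. This is a sketch rather than an official recipe: the helper names `beta_headers` and `send_with_beta` are hypothetical, and the beta identifier is the one from the curl example above.

```python
import os

def beta_headers(beta_id: str) -> dict:
    """Build the extra header that selects an Anthropic beta feature."""
    return {"anthropic-beta": beta_id}

def send_with_beta(beta_id: str, messages: list, model: str = "claude-opus-4-5"):
    """Send a chat request with the anthropic-beta header attached.

    Requires the openai package and the AIHUBMIX_API_KEY environment variable.
    """
    from openai import OpenAI  # imported here so beta_headers works without it
    client = OpenAI(base_url="https://aihubmix.com/v1",
                    api_key=os.environ["AIHUBMIX_API_KEY"])
    return client.chat.completions.create(
        model=model,
        messages=messages,
        extra_headers=beta_headers(beta_id),
    )

print(beta_headers("context-1m-2025-08-07"))
```

To apply the header to every request instead, the same dictionary can be passed as `default_headers` when constructing the client.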

Last updated: 2026-06-01
