AiHubMix Documentation Hub

AIHubMix OpenAI-Compatible Interface Adds In-Depth Support for Claude Thinking, Caching, and Beta Features

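We have upgraded the OpenAI-compatible interface with deeper optimizations specific to the Claude model series. Thinking and caching can now be controlled more precisely and conveniently. In particular, interleaved thinking in multi-turn conversations works without any additional parameters, which makes integration simpler. Beta features offered by Anthropic can also be enabled.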

1. Model Thinking (Extended Thinking)

1.1 Advantages of Interleaved Thinking

When interleaved thinking is not enabled, the model thinks only once, at the start of the assistant turn. After it receives a tool result, the follow-up response is generated directly, without a new thinking block:

User → [Thinking] → Tool Call → Tool Result → Response

When interleaved thinking is enabled, the model inserts a new thinking block each time it receives a tool result, forming a chain of reasoning:

User → [Thinking] → Tool Call → Tool Result → [Thinking] → Response
                                                ↑ Interleaved Thinking

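This allows the model to:

Reason a second time over tool results, rather than simply stitching outputs together.
Chain its reasoning across multiple tool calls, so that each decision builds on the analysis from the previous step.

Reference: Anthropic Interleaved Thinking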

1.2 Enabling Thinking

You can enable thinking using any one of the following four methods:

Method	Example	Description
`reasoning_effort`	`"reasoning_effort": "low"`	Standard OpenAI parameter, placed at the top level of the request body
`reasoning.effort`	`"reasoning": {"effort": "low"}`	Equivalent to the previous method, placed inside the reasoning object
`reasoning.max_tokens`	`"reasoning": {"max_tokens": 1024}`	Precisely controls the maximum number of tokens for thinking
Model name with the `-think` suffix	`"model": "claude-sonnet-4-5-think"`	The simplest method, no additional parameters required

Priority (when several methods are used together): reasoning_effort > reasoning.max_tokens > reasoning.effort > -think suffix

Possible values for effort: minimal / low / medium / high / xhigh
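
The priority order above can be sketched as a small client-side helper. This is only an illustration: `resolve_thinking_method` is a hypothetical name, and the actual resolution happens on the AIHubMix side. The function simply mirrors the documented order, so you can predict which setting takes effect when a request body mixes several methods.

```python
from typing import Optional


def resolve_thinking_method(body: dict) -> Optional[str]:
    """Return which thinking setting wins, following the documented priority.

    Hypothetical helper mirroring the order:
    reasoning_effort > reasoning.max_tokens > reasoning.effort > -think suffix
    """
    reasoning = body.get("reasoning") or {}
    if body.get("reasoning_effort") is not None:
        return "reasoning_effort"
    if reasoning.get("max_tokens") is not None:
        return "reasoning.max_tokens"
    if reasoning.get("effort") is not None:
        return "reasoning.effort"
    if str(body.get("model", "")).endswith("-think"):
        return "-think suffix"
    return None  # thinking was not requested


# Three methods are mixed here; reasoning.max_tokens has the highest priority of them.
body = {
    "model": "claude-sonnet-4-5-think",
    "reasoning": {"effort": "low", "max_tokens": 1024},
}
print(resolve_thinking_method(body))  # reasoning.max_tokens
```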

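1.3 Thinking in the Response

The response message contains two new fields:

reasoning_content: The thinking content (a string), intended for display.
reasoning_details: The complete structured information about the thinking. It must be passed back unchanged in multi-turn conversations, and its internal structure may differ between providers.

Non-streaming example (unrelated fields omitted):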

{
  "choices": [{
    "message": {
      "role": "assistant",
      "content": "Hello! How can I help you today?",
      "reasoning_content": "The user is just saying hello...",
      "reasoning_details": {
        "type": "thinking",
        "thinking": "The user is just saying hello...",
        "signature": "Er8CCkYI..."
      }
    }
  }]
}

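In streaming responses, the thinking content is sent chunk by chunk through delta.reasoning_content and delta.reasoning_details. For the complete logic for joining streamed chunks, see the full examples below.

1.4 Preserving Thinking in Multi-Turn Conversations (Interleaved Thinking Is Built In and Needs No Additional Parameters)

To let the model keep reasoning across a multi-turn conversation, simply place the previously returned reasoning_details, unchanged, in the assistant message of the next round: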

messages = [
    {"role": "user", "content": "What's the weather like in Boston?"},
    {
        "role": "assistant",
        "content": response.choices[0].message.content,
        "tool_calls": response.choices[0].message.tool_calls,
        "reasoning_details": response.choices[0].message.reasoning_details,
    },
    {
        "role": "tool",
        "tool_call_id": "toolu_xxx",
        "content": '{"temperature": 45, "condition": "rainy"}',
    }
]

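When AIHubMix detects past thinking information in a request, it automatically enables interleaved thinking, so the model can continue reasoning in depth after receiving tool call results without any additional parameters.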
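
Because this automatic switch depends on reasoning_details being present in the history, it can help to check the message list before sending it. The sketch below is a hypothetical client-side helper (`missing_reasoning_details` is not part of any SDK). It assumes plain-dict messages like the ones in the snippet above, and it is only meaningful when thinking was enabled for the earlier turns.

```python
def missing_reasoning_details(messages: list) -> list:
    """Return indexes of assistant tool-call messages that lack reasoning_details."""
    missing = []
    for index, message in enumerate(messages):
        if message.get("role") != "assistant" or not message.get("tool_calls"):
            continue  # only assistant turns that called a tool are checked
        if not message.get("reasoning_details"):
            missing.append(index)
    return missing


# Illustrative history: the assistant turn at index 1 dropped its reasoning_details.
history = [
    {"role": "user", "content": "What's the weather like in Boston?"},
    {"role": "assistant", "content": None,
     "tool_calls": [{"id": "toolu_xxx", "type": "function",
                     "function": {"name": "get_weather", "arguments": "{}"}}]},
    {"role": "tool", "tool_call_id": "toolu_xxx", "content": "{}"},
]
print(missing_reasoning_details(history))  # [1]
```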

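1.5 Complete Examples

The following two examples show the complete multi-turn tool call + interleaved thinking flow: the user asks a question → the model thinks and calls a tool → the tool result is injected (with reasoning_details preserved) → the model's interleaved thinking produces the final response.

Non-streaming · Interleaved Thinking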

import os
import json
from openai import OpenAI

client = OpenAI(
    base_url="https://aihubmix.com/v1",
    api_key=os.environ.get("AIHUBMIX_API_KEY", "sk-***"),
)

# ── Tool definition ───────────────────────────────────────────
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a location",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string", "description": "City name"}},
            "required": ["location"]
        }
    }
}]

# ── Mock tool execution ───────────────────────────────────────
WEATHER_DB = {
    "boston": {"temperature": "45°F (7°C)", "condition": "rainy", "humidity": "85%", "wind": "15 mph NE"},
    "tokyo":  {"temperature": "72°F (22°C)", "condition": "sunny", "humidity": "45%", "wind": "5 mph S"},
}

def execute_tool(name: str, args: dict) -> str:
    if name == "get_weather":
        key = next((k for k in WEATHER_DB if k in args.get("location", "").lower()), None)
        return json.dumps(WEATHER_DB.get(key, {"temperature": "65°F", "condition": "clear"}))
    return "{}"

# ── Multi-turn conversation loop ─────────────────────────────
messages = [
    {"role": "user", "content": "What's the weather like in Boston? Then recommend what to wear."}
]

turn = 0
while True:
    turn += 1
    print(f"\n── Turn {turn} ──")

    response = client.chat.completions.create(
        model="claude-sonnet-4-5",
        messages=messages,
        tools=tools,
        extra_body={"reasoning": {"max_tokens": 2000}},
    )
    msg = response.choices[0].message

    # Print thinking process (getattr: this extension field may be absent)
    reasoning_content = getattr(msg, "reasoning_content", None)
    if reasoning_content:
        label = "Interleaved Thinking" if turn > 1 else "Thinking"
        print(f"[{label}] {reasoning_content}")

    # Print response content
    if msg.content:
        print(f"[Response] {msg.content}")

    # Print tool calls
    if msg.tool_calls:
        for tc in msg.tool_calls:
            print(f"[Tool Call: {tc.function.name}] {tc.function.arguments}")

    # Build assistant message, preserve reasoning_details (critical!)
    assistant_msg = {"role": "assistant", "content": msg.content}
    if msg.tool_calls:
        assistant_msg["tool_calls"] = msg.tool_calls
    reasoning_details = getattr(msg, "reasoning_details", None)  # may be absent
    if reasoning_details:
        assistant_msg["reasoning_details"] = reasoning_details  # pass back unmodified
    messages.append(assistant_msg)

    # No tool_calls means conversation is done
    if not msg.tool_calls:
        break

    # Execute tools and append results to messages
    for tc in msg.tool_calls:
        args = json.loads(tc.function.arguments)
        result = execute_tool(tc.function.name, args)
        print(f"[Tool Result: {tc.function.name}] {result}")
        messages.append({"role": "tool", "tool_call_id": tc.id, "content": result})

Streaming · Interleaved Thinking

import os
import sys
import json
from openai import OpenAI

client = OpenAI(
    base_url="https://aihubmix.com/v1",
    api_key=os.environ.get("AIHUBMIX_API_KEY", "sk-***"),
)

# ── Tool definition & mock execution ─────────────────────────
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a location",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string", "description": "City name"}},
            "required": ["location"]
        }
    }
}]

WEATHER_DB = {
    "boston": {"temperature": "45°F (7°C)", "condition": "rainy", "humidity": "85%", "wind": "15 mph NE"},
    "tokyo":  {"temperature": "72°F (22°C)", "condition": "sunny", "humidity": "45%", "wind": "5 mph S"},
}

def execute_tool(name: str, args: dict) -> str:
    if name == "get_weather":
        key = next((k for k in WEATHER_DB if k in args.get("location", "").lower()), None)
        return json.dumps(WEATHER_DB.get(key, {"temperature": "65°F", "condition": "clear"}))
    return "{}"

# ── Stream response collector ────────────────────────────────
def stream_and_collect(turn: int, **kwargs):
    """Stream response, print thinking/content in real-time, accumulate reasoning_details/tool_calls."""
    rd = {}            # accumulated reasoning_details
    content = ""       # accumulated response text
    tc_map = {}        # accumulated tool_calls (by index)
    cur = "none"       # current output section: none / thinking / content

    stream = client.chat.completions.create(stream=True, **kwargs)
    for chunk in stream:
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta

        # ── Handle thinking ──
        rd_delta = getattr(delta, "reasoning_details", None)
        if rd_delta and isinstance(rd_delta, dict):
            for k, v in rd_delta.items():
                if k == "type":
                    rd[k] = v
                elif isinstance(v, str):
                    rd[k] = rd.get(k, "") + v
                elif v is not None:
                    rd[k] = v
            # Print thinking chunks in real-time
            thinking_chunk = rd_delta.get("thinking", "")
            if thinking_chunk:
                if cur != "thinking":
                    cur = "thinking"
                    label = "Interleaved Thinking" if turn > 1 else "Thinking"
                    sys.stdout.write(f"\n[{label}] ")
                sys.stdout.write(thinking_chunk)
                sys.stdout.flush()

        # ── Handle content ──
        if delta.content:
            if cur != "content":
                if cur == "thinking":
                    sys.stdout.write("\n")
                cur = "content"
                sys.stdout.write("\n[Response] ")
            sys.stdout.write(delta.content)
            sys.stdout.flush()
            content += delta.content

        # ── Handle tool_calls ──
        for tc in delta.tool_calls or []:
            i = tc.index
            if i not in tc_map:
                tc_map[i] = {"id": "", "type": "function",
                             "function": {"name": "", "arguments": ""}}
            if tc.id:
                tc_map[i]["id"] = tc.id
            if tc.function:
                tc_map[i]["function"]["name"] += tc.function.name or ""
                tc_map[i]["function"]["arguments"] += tc.function.arguments or ""

    # End current output section
    if cur in ("thinking", "content"):
        sys.stdout.write("\n")

    tool_calls = [tc_map[i] for i in sorted(tc_map)] if tc_map else None
    return {
        "content": content or None,
        "reasoning_details": rd or None,
        "tool_calls": tool_calls,
    }

# ── Multi-turn conversation loop ─────────────────────────────
messages = [
    {"role": "user", "content": "What's the weather like in Boston? Then recommend what to wear."}
]

turn = 0
while True:
    turn += 1
    print(f"\n── Turn {turn} ──")

    result = stream_and_collect(
        turn,
        model="claude-sonnet-4-5",
        messages=messages,
        tools=tools,
        extra_body={"reasoning": {"max_tokens": 2000}},
    )

    # Print tool calls
    if result["tool_calls"]:
        for tc in result["tool_calls"]:
            print(f"[Tool Call: {tc['function']['name']}] {tc['function']['arguments']}")

    # Build assistant message, preserve reasoning_details (critical!)
    assistant_msg = {"role": "assistant", "content": result["content"]}
    if result["tool_calls"]:
        assistant_msg["tool_calls"] = result["tool_calls"]
    if result["reasoning_details"]:
        assistant_msg["reasoning_details"] = result["reasoning_details"]  # pass back unmodified
    messages.append(assistant_msg)

    # No tool_calls means conversation is done
    if not result["tool_calls"]:
        break

    # Execute tools and append results to messages
    for tc in result["tool_calls"]:
        args = json.loads(tc["function"]["arguments"])
        tool_result = execute_tool(tc["function"]["name"], args)
        print(f"[Tool Result: {tc['function']['name']}] {tool_result}")
        messages.append({"role": "tool", "tool_call_id": tc["id"], "content": tool_result})

1.6 Thinking Effort Mapping Rules

Effort mode:

Opus 4.6 / Sonnet 4.6 and later: mapped to Anthropic's native Adaptive Thinking effort levels.
Other models: budget_tokens is computed with the following formula:

budget_tokens = max(min(max_tokens × effort_ratio, 128000), 1024)

effort	effort_ratio
xhigh	0.95
high	0.80
medium	0.50
low	0.20
minimal	0.10
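
As a worked example, the formula and ratio table above can be written as a few lines of Python. This is a sketch under one assumption the text does not state: fractional results are rounded down. The ratios are stored as whole percentages to keep the arithmetic exact, and `budget_tokens` here is an illustrative function name, not an API.

```python
# effort_ratio values from the table above, as whole percentages
EFFORT_RATIO_PERCENT = {"xhigh": 95, "high": 80, "medium": 50, "low": 20, "minimal": 10}


def budget_tokens(max_tokens: int, effort: str) -> int:
    """budget_tokens = max(min(max_tokens * effort_ratio, 128000), 1024)."""
    scaled = max_tokens * EFFORT_RATIO_PERCENT[effort] // 100  # assumption: round down
    return max(min(scaled, 128000), 1024)


print(budget_tokens(4096, "low"))      # 1024   (819 is raised to the 1024 floor)
print(budget_tokens(16000, "high"))    # 12800
print(budget_tokens(200000, "xhigh"))  # 128000 (190000 is capped at 128000)
```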

Adaptive Thinking effort mapping:

Incoming effort	Opus 4.6	Sonnet 4.6
xhigh	max	high
high	high	high
medium	medium	medium
low	low	low
minimal	low	low

max_tokens mode: the value is passed directly as Anthropic's budget_tokens.

-think suffix: Opus/Sonnet 4.6+ use adaptive thinking (effort=medium); other models use budget_tokens = min(10240, max_tokens - 1), with max_tokens defaulting to 4096.
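
For models that are not on adaptive thinking, the -think suffix rule works out as in the short sketch below. The function name `think_suffix_budget` is illustrative; the constants come from the rule stated above.

```python
DEFAULT_MAX_TOKENS = 4096  # default max_tokens stated in the rule above


def think_suffix_budget(max_tokens: int = DEFAULT_MAX_TOKENS) -> int:
    """budget_tokens for a -think model that does not use adaptive thinking."""
    return min(10240, max_tokens - 1)


print(think_suffix_budget())       # 4095  (default max_tokens of 4096)
print(think_suffix_budget(32000))  # 10240 (capped)
```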

2. Prompt Caching

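Prompt Caching is available when you send requests to Claude models through the Chat interface. By setting cache_control breakpoints in messages, you can cache large text blocks (character cards, RAG data, book chapters, and so on) for reuse; subsequent requests hit the cache directly, which can cut costs significantly.

Official Claude documentation: Prompt Caching

2.1 Caching Costs

Operation	Price multiplier (relative to the base input price)
Cache write (5-minute TTL)	1.25x
Cache write (1-hour TTL)	2x
Cache read	0.1x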
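
To see when caching pays off, the multipliers above can be plugged into a small calculation. This is a simplified sketch: it counts only the cached prefix, measures cost in units of one base-price input token, and assumes every request after the first arrives while the cache is still valid. The function names are illustrative.

```python
WRITE_5M, WRITE_1H, READ = 1.25, 2.0, 0.1  # multipliers from the table above


def cached_cost(prefix_tokens: int, requests: int, write_multiplier: float = WRITE_5M) -> float:
    """Cost of the prefix over `requests` calls: one cache write, then cache reads."""
    if requests < 1:
        return 0.0
    return prefix_tokens * (write_multiplier + READ * (requests - 1))


def uncached_cost(prefix_tokens: int, requests: int) -> float:
    """Cost of resending the same prefix at the base input price every time."""
    return float(prefix_tokens * requests)


prefix = 6266  # same prefix size as the usage example in section 2.5
for n in (1, 2, 3, 5):
    print(n,
          round(uncached_cost(prefix, n), 1),
          round(cached_cost(prefix, n, WRITE_5M), 1),
          round(cached_cost(prefix, n, WRITE_1H), 1))
```

Under these assumptions the 5-minute cache is already cheaper on the second request (1.25 + 0.1 = 1.35 versus 2), while the 1-hour cache breaks even on the third (2 + 0.2 = 2.2 versus 3).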

2.2 Supported Models and Minimum Cache Length

Model	Minimum cacheable tokens
Claude Opus 4.8	1024
Claude Opus 4.7	2048
Claude Opus 4.6 / Opus 4.5	4096
Claude Sonnet 4.6 / Sonnet 4.5 / Opus 4.1 / Opus 4 / Sonnet 4 / Sonnet 3.7 (deprecated)	1024
Claude Haiku 4.5	4096
Claude Haiku 3.5 (deprecated) / Haiku 3	2048

Breakpoint limit: at most 4 cache_control breakpoints per request.
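
A request body can be checked against this limit before it is sent. The helper below is a hypothetical client-side sketch, assuming the OpenAI-style body layout used in this document: cache_control sits on content parts inside messages and at the top level of tool objects.

```python
MAX_CACHE_BREAKPOINTS = 4  # limit stated above


def count_cache_breakpoints(body: dict) -> int:
    """Count cache_control markers on message content parts and on tools."""
    count = 0
    for message in body.get("messages", []):
        content = message.get("content")
        if isinstance(content, list):  # string content cannot carry cache_control
            count += sum(1 for part in content
                         if isinstance(part, dict) and "cache_control" in part)
    count += sum(1 for tool in body.get("tools", []) if "cache_control" in tool)
    return count


body = {
    "messages": [
        {"role": "system", "content": [
            {"type": "text", "text": "You are an AI assistant"},
            {"type": "text", "text": "(long context)", "cache_control": {"type": "ephemeral"}},
        ]},
        {"role": "user", "content": "Hello"},
    ],
    "tools": [{"type": "function", "function": {"name": "get_weather"},
               "cache_control": {"type": "ephemeral", "ttl": "1h"}}],
}
print(count_cache_breakpoints(body))  # 2
assert count_cache_breakpoints(body) <= MAX_CACHE_BREAKPOINTS
```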

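2.3 Cache TTL

TTL	Syntax	Use case
5 minutes (default)	`"cache_control": {"type": "ephemeral"}`	Short sessions, ordinary requests
1 hour	`"cache_control": {"type": "ephemeral", "ttl": "1h"}`	Long sessions, avoids repeated cache writes

The 1-hour TTL costs more to write, but in long sessions it reduces repeated writes and can lower the total cost. All Claude 4.5 and later models support the 1-hour TTL on every provider, including Anthropic, Amazon Bedrock, and Google Vertex AI.

2.4 Usage

You can set cache breakpoints with the cache_control field in system messages, user messages (including images), and tools. The following examples show only the main structure and omit the large text blocks.

System message caching (default 5-minute TTL):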

{
  "model": "claude-opus-4-5",
  "messages": [
    {
      "role": "system",
      "content": [
        {"type": "text", "text": "You are an AI assistant"},
        {
          "type": "text",
          "text": "(long context)",
          "cache_control": {"type": "ephemeral"}
        }
      ]
    },
    {
      "role": "user",
      "content": [{"type": "text", "text": "Hello"}]
    }
  ]
}

User message caching (1-hour TTL):

{
  "model": "claude-opus-4-5",
  "messages": [
    {
      "role": "system",
      "content": [{"type": "text", "text": "You are an AI assistant"}]
    },
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "(long context)",
          "cache_control": {"type": "ephemeral", "ttl": "1h"}
        },
        {"type": "text", "text": "Hello"}
      ]
    }
  ]
}

Image message caching:

{
  "role": "user",
  "content": [
    {
      "type": "image_url",
      "image_url": {"detail": "auto", "url": "data:image/jpeg;base64,/9j/4AAQ..."},
      "cache_control": {"type": "ephemeral"}
    },
    {"type": "text", "text": "What's this?"}
  ]
}

Tool definition caching: cache_control is placed at the top level of the tool object (at the same level as type and function):

{
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Get current weather for a location",
      "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"]
      }
    },
    "cache_control": {"type": "ephemeral", "ttl": "1h"}
  }]
}
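
When the request is built in Python, the same structures can be produced with a small helper. The sketch below uses a hypothetical function, `cached_text`, that returns a text content part with a cache breakpoint. The `ttl` values follow the syntax table in section 2.3.

```python
from typing import Optional


def cached_text(text: str, ttl: Optional[str] = None) -> dict:
    """Build a text content part ending in a cache breakpoint.

    ttl=None uses the default 5-minute TTL; ttl="1h" requests the 1-hour TTL.
    """
    cache_control = {"type": "ephemeral"}
    if ttl is not None:
        cache_control["ttl"] = ttl
    return {"type": "text", "text": text, "cache_control": cache_control}


messages = [
    {"role": "system", "content": [{"type": "text", "text": "You are an AI assistant"}]},
    {"role": "user", "content": [
        cached_text("(long context)", ttl="1h"),  # cached prefix
        {"type": "text", "text": "Hello"},         # uncached question
    ]},
]
print(messages[1]["content"][0]["cache_control"])
```

The resulting `messages` list has the same shape as the JSON above and can be passed as the `messages` argument of `client.chat.completions.create`, as in the examples in section 1.5.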

2.5 Checking Cache Status

The usage object in the response includes claude_cache_tokens_details, which records detailed cache information.

First request (cache creation):

{
  "usage": {
    "prompt_tokens": 22,
    "completion_tokens": 890,
    "total_tokens": 912,
    "claude_cache_tokens_details": {
      "cache_creation_input_tokens": 6266,
      "cache_read_input_tokens": 0,
      "cache_write_5_minutes_input_tokens": 6266,
      "cache_write_1_hour_input_tokens": 0
    }
  }
}

Subsequent request (cache hit):

{
  "usage": {
    "prompt_tokens": 22,
    "completion_tokens": 810,
    "total_tokens": 832,
    "prompt_tokens_details": {
      "cached_tokens": 6266
    },
    "claude_cache_tokens_details": {
      "cache_creation_input_tokens": 0,
      "cache_read_input_tokens": 6266,
      "cache_write_5_minutes_input_tokens": 0,
      "cache_write_1_hour_input_tokens": 0
    }
  }
}

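Field	Meaning
`cache_creation_input_tokens`	Number of tokens written to the cache by this request
`cache_read_input_tokens`	Number of tokens read from the cache by this request
`cache_write_5_minutes_input_tokens`	Number of tokens written to the 5-minute TTL cache
`cache_write_1_hour_input_tokens`	Number of tokens written to the 1-hour TTL cache
`prompt_tokens_details.cached_tokens`	Number of cached tokens on a cache hit, compatible with the OpenAI format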
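
These fields make it straightforward to tell a cache write from a cache hit in code. The helper below is an illustrative sketch (`cache_summary` is a hypothetical name) that works on the usage object as a plain dict, such as one parsed from the JSON responses above.

```python
def cache_summary(usage: dict) -> dict:
    """Classify a response as a cache hit, a cache write, or neither."""
    details = usage.get("claude_cache_tokens_details") or {}
    written = details.get("cache_creation_input_tokens", 0)
    read = details.get("cache_read_input_tokens", 0)
    if read:
        status = "hit"
    elif written:
        status = "write"
    else:
        status = "none"
    return {"status": status, "written": written, "read": read}


# Values taken from the two usage examples above
first = {"prompt_tokens": 22, "claude_cache_tokens_details": {
    "cache_creation_input_tokens": 6266, "cache_read_input_tokens": 0}}
later = {"prompt_tokens": 22, "claude_cache_tokens_details": {
    "cache_creation_input_tokens": 0, "cache_read_input_tokens": 6266}}

print(cache_summary(first))  # {'status': 'write', 'written': 6266, 'read': 0}
print(cache_summary(later))  # {'status': 'hit', 'written': 0, 'read': 6266}
```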

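3. The anthropic-beta Request Header

You can enable beta features of Claude models through the anthropic-beta HTTP header, which AIHubMix forwards to the Anthropic API unchanged.

Usage

Add anthropic-beta to the request headers, with the identifier of the desired beta feature as its value: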

curl "https://aihubmix.com/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-***" \
  -H "anthropic-beta: context-1m-2025-08-07" \
  -d '{
  "model": "claude-opus-4-5",
  "messages": [
    {
      "role": "system",
      "content": [
        {"type": "text", "text": "You are an AI assistant"},
        {
          "type": "text",
          "text": "(long context)",
          "cache_control": {"type": "ephemeral"}
        }
      ]
    },
    {"role": "user", "content": [{"type": "text", "text": "hello"}]}
  ]
}'

For the specific beta identifiers available, see the Anthropic API documentation.
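
The curl request above translates to Python as in the following sketch, which uses only the standard library. `build_beta_request` is an illustrative helper, not part of any SDK; it builds the request without sending it, and the API key is read from the AIHUBMIX_API_KEY environment variable as in the earlier examples.

```python
import json
import os
import urllib.request


def build_beta_request(beta: str, body: dict) -> urllib.request.Request:
    """Build a chat completions request that carries the anthropic-beta header."""
    api_key = os.environ.get("AIHUBMIX_API_KEY", "sk-***")
    return urllib.request.Request(
        "https://aihubmix.com/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
            "anthropic-beta": beta,  # forwarded to the Anthropic API
        },
        method="POST",
    )


req = build_beta_request(
    "context-1m-2025-08-07",
    {"model": "claude-opus-4-5",
     "messages": [{"role": "user", "content": "hello"}]},
)
# To send it:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```

With the OpenAI Python SDK, the same header can be supplied through the `extra_headers` argument of `client.chat.completions.create`.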

Last updated: 2026-06-01
