AiHubMix Documentation Hub

Set model=auto once, and the gateway decides which model to use.

The Auto Router analyzes each request and selects the most suitable model in real time from the hundreds of models on the platform. All you do is set model to auto—no picking models, no comparing prices, no tracking model releases.

Requests are billed at the resolved model's price, with no surcharge and no client code changes. The model that handled each request is written into the response headers and body (see How to confirm which model was used), so every decision is traceable.

Use cases

General-purpose apps: when you don’t know what kind of request a user will send, hand it to auto to dispatch by content.
Cost optimization: let simple tasks land on cheaper, faster models automatically (auto is cost-first by default).
Quality optimization: ensure complex requests are routed to more capable models (auto:quality_first).
Latency-critical scenarios: real-time voice and multi-turn agent loops prefer the fastest-responding model (auto:latency_critical).
Single entry point, no model selection: different request types are dispatched to their own optimal models—no need to maintain a “task → model” mapping table, and no need to keep tracking model releases, comparing prices, and swapping names by hand.

Quick start

Set model to auto; the rest of your request body stays exactly the same as a normal call. Use https://aihubmix.com/v1 as the base_url.

curl https://aihubmix.com/v1/chat/completions \
  -H "Authorization: Bearer <AIHUBMIX_API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [
      { "role": "user", "content": "What is the meaning of life?" }
    ]
  }'
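For callers who prefer Python over curl, the same request can be assembled with the standard library. This is a minimal sketch, not an official SDK: the helper name `build_auto_request` is hypothetical, and the request is only built here, not sent.

```python
import json
import urllib.request

BASE_URL = "https://aihubmix.com/v1"  # base_url given on this page


def build_auto_request(api_key: str, prompt: str, model: str = "auto") -> urllib.request.Request:
    """Build (but do not send) a chat completions request with model=auto.

    Hypothetical helper for illustration; only the URL, headers, and body
    shape are taken from the curl example above.
    """
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen` sends it; the returned response object exposes the response headers through its `headers` attribute.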

The Auto Router runs prompt analysis before the request reaches the upstream, treating streaming (stream: true) and non-streaming requests identically, with no extra parameters; the whole decision adds only about 1ms of overhead, with virtually no impact on end-to-end latency.

How to confirm which model was used

This is the trust anchor of the Auto Router: you always know which model ended up handling this request.

The model field in the response body is backfilled with the real resolved model (e.g. mimo-v2.5-pro), not auto.
The response headers give you the full decision detail:

Header	Meaning	Example value
`X-Aihubmix-Router-Resolved-Model`	The model that actually served the request; billing is based on it	`xiaomi-mimo-v2.5-pro`
`X-Aihubmix-Router-Policy`	The policy used for this request	`cost_optimized`
`X-Aihubmix-Router-Dimension`	The detected task dimension	`text.overall`
`X-Aihubmix-Router-Decision-Id`	A unique ID for this decision, for troubleshooting	`05dbad09-33c5-42de-…`
`X-Aihubmix-Router-Reason`	A brief decision summary (policy / dimension / top score / candidate count)	`policy=cost_optimized dim=text.overall top=0.182 survivors=20/33`
`X-Aihubmix-Router-Fallback`	Present only when the no-candidate fallback is triggered	`true`

HTTP response headers are case-insensitive: the table above capitalizes them by convention, but the actual HTTP/2 response returns them lowercased as x-aihubmix-router-*—the two are equivalent.
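Because header names may arrive in either casing, client code is safer when it normalizes them before lookup. The sketch below assumes the headers are available as a plain name-to-value mapping; the helper name `router_headers` is hypothetical.

```python
from typing import Dict, Mapping

PREFIX = "x-aihubmix-router-"


def router_headers(headers: Mapping[str, str]) -> Dict[str, str]:
    """Collect the router headers under lower-cased, prefix-free keys."""
    found = {}
    for name, value in headers.items():
        lowered = name.lower()
        if lowered.startswith(PREFIX):
            found[lowered[len(PREFIX):]] = value
    return found


# Mixed casing on purpose: both spellings end up under the same style of key.
decision = router_headers({
    "Content-Type": "application/json",
    "X-Aihubmix-Router-Policy": "cost_optimized",
    "x-aihubmix-router-resolved-model": "xiaomi-mimo-v2.5-pro",
})
```

A `"fallback"` key in the result would indicate that the no-candidate fallback was triggered, since that header is present only in that case.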

To read the routing decision, inspect the response headers: use curl -i on the command line, or the raw response object in an SDK:

curl -si https://aihubmix.com/v1/chat/completions \
  -H "Authorization: Bearer <AIHUBMIX_API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto",
    "messages": [
      { "role": "user", "content": "What is the meaning of life?" }
    ]
  }' | grep -i "^x-aihubmix-router"

Actual curl output (production; the resolved model varies with the live catalog):

x-aihubmix-router-decision-id: 05dbad09-33c5-42de-85b5-559fdb73eb4c
x-aihubmix-router-dimension: text.overall
x-aihubmix-router-policy: cost_optimized
x-aihubmix-router-reason: policy=cost_optimized dim=text.overall top=0.182 survivors=20/33
x-aihubmix-router-resolved-model: xiaomi-mimo-v2.5-pro

How to read reason: survivors=20/33 means that out of 33 candidates, 20 passed hard filtering and entered scoring; top=0.182 is the winning model’s normalized composite score within the candidate pool (capability / cost / latency weighted by policy).
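For logging or dashboards, the reason string can be split into typed fields. This is a sketch that assumes the space-separated `key=value` layout seen in the single example on this page; the function name `parse_reason` is hypothetical, and any keys other than `top` and `survivors` are kept as plain strings.

```python
def parse_reason(reason: str) -> dict:
    """Parse a string such as
    'policy=cost_optimized dim=text.overall top=0.182 survivors=20/33'.

    The layout is inferred from the example above, not from a published spec.
    """
    # Split into key=value tokens, ignoring anything without an '='.
    fields = dict(part.split("=", 1) for part in reason.split() if "=" in part)
    parsed = dict(fields)
    if "top" in fields:
        parsed["top"] = float(fields["top"])  # winning composite score
    if "survivors" in fields:
        passed, total = fields["survivors"].split("/")
        parsed["survivors"] = (int(passed), int(total))  # (passed filter, candidates)
    return parsed
```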

The Resolved-Model in the example depends on the current candidates and prices in the live catalog, and will change as platform models come and go—which is exactly the value of the Auto Router: you don’t have to track these changes. To keep your decisions auditable, rely on the real model name in the response headers / body rather than assuming it stays fixed.

Routing policies

auto without a suffix uses the default policy cost_optimized. You can use auto:<policy> to explicitly specify the emphasis:

Policy syntax	Emphasis	Use case
`auto` (= `auto:cost_optimized`)	Cost first: pick the cheapest model that meets the capability bar	Batch tasks, cost-sensitive workloads
`auto:balanced`	Balanced: weighs capability / cost / latency	General purpose, a safe choice when unsure
`auto:quality_first`	Quality first: prefer the most capable model	Complex reasoning, critical output
`auto:latency_critical`	Latency first: prefer the fastest-responding model	Real-time voice, agent loops

To specify a policy, just append the suffix to model:

# Quality first + a coding task
curl -si https://aihubmix.com/v1/chat/completions \
  -H "Authorization: Bearer <AIHUBMIX_API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "auto:quality_first",
    "messages": [
      { "role": "user", "content": "Write a Python function to reverse a linked list." }
    ]
  }' | grep -i "^x-aihubmix-router"

The same request resolves to different models under different policies. The results below were measured in production with the prompt “What is the meaning of life?”, which landed on the text.overall dimension in every case:

Policy	Resolved model	top score
`auto` (= `cost_optimized`)	`xiaomi-mimo-v2.5-pro`	0.182
`auto:balanced`	`claude-opus-4-6-think`	0.488
`auto:latency_critical`	`claude-opus-4-6`	0.646
`auto:quality_first`	`claude-opus-4-6-think`	0.758

latency_critical picked the variant without the -think suffix: thinking variants have higher reasoning latency, so the latency policy actively avoids them. This shows that policy weights genuinely act on the “capability / cost / latency” tradeoff, not on capability alone.

Content feeds into the decision too. When the same auto:quality_first policy is applied to a coding task (the request in the example above), the dimension shifts from text.overall to text.coding, and the measured resolved model is claude-opus-4-6-think. The policy and the request content together determine the final model.

An unknown policy suffix (e.g. auto:fast) falls back to the default policy cost_optimized without raising an error.
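Because an unknown suffix falls back silently, a typo in the policy name would go unnoticed unless the client checks it. The sketch below validates the suffix locally before the request is sent; `auto_model` and `KNOWN_POLICIES` are hypothetical names, and the policy list is copied from the table above.

```python
from typing import Optional

# Policies listed in the routing-policy table on this page.
KNOWN_POLICIES = {"cost_optimized", "balanced", "quality_first", "latency_critical"}


def auto_model(policy: Optional[str] = None) -> str:
    """Return the model string for a policy, rejecting unknown suffixes locally."""
    if policy is None:
        return "auto"  # the gateway applies the default policy, cost_optimized
    if policy not in KNOWN_POLICIES:
        raise ValueError(f"unknown Auto Router policy: {policy!r}")
    return f"auto:{policy}"
```

A complementary check is to compare the `X-Aihubmix-Router-Policy` response header with the policy that was requested.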

How it works

Upon receiving model=auto, the gateway turns “intent” into “a concrete model” in three steps:

Extract request features

It analyzes this request’s input / output modalities (text, image, file), content intent (code, math, OCR, diagrams, language, whether web search is on, etc.), and request scale (estimated input / output tokens), normalizing them into a single task dimension. For example: a question containing code → text.coding; an image plus an OCR request → vision.ocr; plain text → text.overall.

Hard-filter the candidates

Models that don’t satisfy hard constraints are excluded outright: not supporting the required input / output modalities, a context window that can’t fit the request, removed by the circuit breaker (see Reliability and fault tolerance), or not within your Key’s allowed model range.

Score by policy weighting

For the candidates that pass filtering, the router takes model capability scores from authoritative industry benchmarks, overlays real-time price and performance data, applies a three-dimensional weighted score over “capability / cost / latency” according to the chosen policy, and picks the highest-scoring one. The final model name is written back into the request and the response headers.
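The hard-filter step described above can be pictured as a chain of exclusion checks. This is an illustrative sketch only, not the gateway's implementation: the `Candidate` fields, the function name, and the model names used with it are all hypothetical, and only input modalities are modelled to keep it short.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    input_modalities: frozenset
    context_window: int          # maximum tokens the model accepts
    breaker_open: bool = False   # True while the circuit breaker has removed it


def hard_filter(candidates, required_inputs, estimated_tokens, allowed_models=None):
    """Keep only candidates that satisfy every hard constraint."""
    survivors = []
    for c in candidates:
        if not required_inputs <= c.input_modalities:
            continue  # missing a required input modality
        if estimated_tokens > c.context_window:
            continue  # request does not fit the context window
        if c.breaker_open:
            continue  # temporarily removed by the circuit breaker
        if allowed_models is not None and c.name not in allowed_models:
            continue  # outside the Key's allowed model range
        survivors.append(c)
    return survivors
```

Only the survivors move on to scoring, which matches the `survivors=20/33` style of count reported in the reason header.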

A real scoring example (under the quality_first policy, the top 3 of the same candidate pool, taken from production decision logs):

Candidate model	Capability score	Relative cost	Latency	Composite score
`claude-opus-4-6-think`	1504	220	1963ms	0.758
`claude-opus-4-6`	1498	220	822ms	0.721
`claude-fable-5`	1510	484	11130ms	0.600

Note that claude-fable-5 has the highest capability score (1510), yet its higher cost and larger latency push its composite score down to third. This is exactly the point of weighted scoring: not “capability above all,” but a policy-driven tradeoff among capability / cost / latency.
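To make the tradeoff concrete, the sketch below ranks the three candidates from the table with a simple weighted sum. It is a toy model under stated assumptions: the weights in `WEIGHTS` are invented for illustration, and min-max normalization over just these three rows stands in for whatever normalization the gateway applies over the full pool, so the scores do not reproduce the production values above. Only the ordering is meant to be instructive.

```python
CANDIDATES = {
    # name: (capability score, relative cost, latency in ms), from the table above
    "claude-opus-4-6-think": (1504, 220, 1963),
    "claude-opus-4-6": (1498, 220, 822),
    "claude-fable-5": (1510, 484, 11130),
}

# Hypothetical (capability, cost, latency) weights; the real ones are not given here.
WEIGHTS = {
    "quality_first": (0.4, 0.3, 0.3),
    "latency_critical": (0.1, 0.2, 0.7),
}


def _normalize(values, higher_is_better):
    """Min-max scale to [0, 1] so that 1.0 is always the best candidate."""
    low, high = min(values), max(values)
    if high == low:
        return [1.0] * len(values)
    scaled = [(v - low) / (high - low) for v in values]
    return scaled if higher_is_better else [1.0 - s for s in scaled]


def rank(policy):
    """Return (name, score) pairs sorted from best to worst for a policy."""
    names = list(CANDIDATES)
    capability, cost, latency = zip(*CANDIDATES.values())
    columns = (
        _normalize(capability, higher_is_better=True),
        _normalize(cost, higher_is_better=False),
        _normalize(latency, higher_is_better=False),
    )
    weights = WEIGHTS[policy]
    scores = [sum(w * col[i] for w, col in zip(weights, columns)) for i in range(len(names))]
    return sorted(zip(names, scores), key=lambda pair: pair[1], reverse=True)
```

With these illustrative weights, `rank("quality_first")` puts claude-fable-5 last despite its top capability score, and `rank("latency_critical")` moves claude-opus-4-6 ahead of its -think variant.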

Dimension detection is automatic—the Auto Router has 30+ fine-grained task dimensions built in (code / math / OCR / diagrams / long text / Chinese / web search…), far more precise than “coarse routing by model family.” With the same auto, different content routes to different dimensions:

Your request	Detected dimension
Plain text question	`text.overall`
Contains code, asks to write / debug a program	`text.coding`
Math proof / solving	`text.math`
A very long question (around 500+ tokens)	`text.longer_query`
A question in Chinese	`text.language.chinese`
Image input + “What is in this image?”	`vision.overall`
Image input + “OCR…” / “extract the text”	`vision.ocr`
Image input + chart / flowchart	`vision.diagram`
Web search enabled	`search.overall`

Dimension detection uses conservative matching (high precision, low false positives): long-tail, ambiguous requests fall back to more general dimensions (e.g. text.overall / vision.overall) rather than being forced into a category, which avoids misrouting.
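The idea of conservative matching with a general fallback can be shown with a toy classifier. Everything in it is an assumption made for illustration: the regular expressions, the order in which signals are checked, and the function name `detect_dimension` are not taken from the gateway, and only a handful of the dimensions in the table are covered.

```python
import re

# Hypothetical signals; the real detector's rules are not published on this page.
CODE_HINT = re.compile(
    r"\bdef \w+\(|\bclass \w+|#include\b|\b(write|debug)\b.*\b(function|program|code)\b",
    re.IGNORECASE,
)
OCR_HINT = re.compile(r"\bocr\b|extract the text", re.IGNORECASE)


def detect_dimension(text: str, has_image: bool = False, web_search: bool = False) -> str:
    """Toy classifier: return a specific dimension only on a clear signal."""
    if web_search:
        return "search.overall"
    if has_image:
        # Specific vision dimension only when the request clearly asks for OCR.
        return "vision.ocr" if OCR_HINT.search(text) else "vision.overall"
    if CODE_HINT.search(text):
        return "text.coding"
    return "text.overall"  # conservative fallback for anything ambiguous
```

The final `return` is the part that mirrors the text above: a request that matches no clear signal lands on a general dimension rather than being forced into a category.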

Image input also goes through the Auto Router: when you include an image in /v1/chat/completions, it is routed by the image task to a model with strong vision capabilities. Measured in production—“OCR this image” → vision.ocr, resolved to qwen3.5-397b-a17b; a general “What is in this image?” → vision.overall, resolved to gpt-5.4-mini. (This refers to image understanding; image generation at /v1/images/* does not yet support auto, see FAQ.)

Reliability and fault tolerance

The Auto Router has multiple layers of fault tolerance built in, so a request on the auto path does not fail without an explicit cause:

Circuit breaker: automatically remove failing models

The gateway maintains a sliding-window failure-rate statistic for each model. When a model fails enough times within the window and its failure rate exceeds the threshold, it is temporarily removed from the candidate pool and automatically recovers after a cooldown period—avoiding sending subsequent requests to a model that is currently misbehaving. The failure signal comes from errors the upstream returns for that request; the gateway’s own “no available channel” does not count (that is not a problem with the model itself).

No-candidate fallback: an empty candidate pool never returns 400

If hard filtering happens to exclude every candidate (for example, a certain modality combination temporarily has no available model), the gateway does not error out. Instead, it assigns a fallback model by output type to guarantee a response, and adds X-Aihubmix-Router-Fallback: true to the response headers to let you know.

Privilege boundary: a restricted Key is not bypassed by the fallback

If your Key limits the allowed model range, the model selected by the Auto Router (including the fallback) is always within that range. If there really is no model within the range that can serve this request, it explicitly returns 403 rather than silently using a model outside the range (which might be more expensive).
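The circuit-breaker behaviour described earlier in this section (sliding window, failure count plus failure rate, cooldown, automatic recovery) can be sketched for a single model as follows. The class name and every threshold value are illustrative assumptions; the page does not state the gateway's actual window, thresholds, or cooldown.

```python
import time
from collections import deque


class CircuitBreaker:
    """Sliding-window breaker for one model; all thresholds are illustrative."""

    def __init__(self, window_s=60.0, min_failures=5, max_failure_rate=0.5,
                 cooldown_s=30.0, clock=time.monotonic):
        self.window_s = window_s
        self.min_failures = min_failures
        self.max_failure_rate = max_failure_rate
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.events = deque()   # (timestamp, failed) pairs inside the window
        self.opened_at = None   # time the model was removed, if any

    def record(self, failed):
        """Record one upstream result; callers should not report gateway-side
        'no available channel' errors here, since those do not count."""
        now = self.clock()
        self.events.append((now, failed))
        while self.events and now - self.events[0][0] > self.window_s:
            self.events.popleft()  # drop results that left the window
        failures = sum(1 for _, f in self.events if f)
        if failures >= self.min_failures and failures / len(self.events) > self.max_failure_rate:
            self.opened_at = now

    def available(self):
        """True when the model may stay in the candidate pool."""
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.cooldown_s:
            self.opened_at = None  # cooldown elapsed: recover automatically
            self.events.clear()
            return True
        return False
```

Requiring both a minimum failure count and a failure rate keeps a single early error from removing a model that has only handled one or two requests.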

Billing

Requests are billed at the list price of the resolved model; the Auto Router itself adds no surcharge. You are billed for whichever model ends up responding, at that model's price, and that model's capability and context limits apply. That model is the value in the response header X-Aihubmix-Router-Resolved-Model and in the response body model field. In other words, the Auto Router will not “secretly use an expensive model”: every resolution is written into the response and can be reconciled line by line.
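One way to reconcile usage line by line is to keep a small ledger keyed by the resolved model. The sketch below assumes each logged record carries the response-body `model` value plus token counts; the `prompt_tokens` and `completion_tokens` field names are an assumption based on the OpenAI-compatible `usage` object, and `tally_by_model` is a hypothetical helper.

```python
from collections import defaultdict


def tally_by_model(records):
    """Sum request counts and token usage per resolved model.

    Prices are deliberately left out: multiply the totals by the list
    prices shown on the models page to estimate spend.
    """
    totals = defaultdict(lambda: {"requests": 0, "prompt_tokens": 0, "completion_tokens": 0})
    for rec in records:
        row = totals[rec["model"]]
        row["requests"] += 1
        row["prompt_tokens"] += rec.get("prompt_tokens", 0)
        row["completion_tokens"] += rec.get("completion_tokens", 0)
    return dict(totals)
```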

Limitations

The Auto Router currently targets the chat completions endpoint /v1/chat/completions (see FAQ: which endpoints are supported).

Sending ?router=off or the request header X-Router-Off together with model=auto returns 400 directly. This explicitly rejects the ambiguous “want auto but also turn routing off” combination instead of silently ignoring one of the two:

curl -i "https://aihubmix.com/v1/chat/completions?router=off" \
  -H "Authorization: Bearer <AIHUBMIX_API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{"model":"auto","messages":[{"role":"user","content":"hi"}]}'
# → HTTP/1.1 400 Bad Request
# {"error":{"message":"auto requires router enabled; remove ?router=off / X-Router-Off", ...}}

The candidate pool changes dynamically with the platform catalog: the same auto may resolve to different models at different times (this is by design, and reviewable via the response headers).

Differences from OpenRouter / LiteLLM

“Automatic model selection” is not unique to AIHubMix; OpenRouter and LiteLLM both offer similar capabilities. The differences are mainly in integration cost and hosting model:

Difference	OpenRouter	LiteLLM	AIHubMix
Automatic model selection by request content	✅	✅	✅
Zero config, works out of the box (no routing rules / utterances to write)	✅	❌	✅
Platform-hosted, no self-built / self-deployed proxy	✅	❌	✅
Cost / quality / latency multi-policy, switchable with one parameter	❌	❌	✅
Traceable resolution decisions (response headers include dimension / policy / reason)	❌	❌	✅
Billed by the final resolved model	✅	❌	✅

FAQ

Q: Which endpoints does the Auto Router support?
A: Currently model=auto supports the OpenAI-compatible chat completions endpoint /v1/chat/completions. Image generation / editing (/v1/images/*), audio, /v1/embeddings, /v1/rerank, and other endpoints do not yet support auto; specify a concrete model directly.

Q: Does the Auto Router support image input?
A: Yes. Asking with an image (image_url) in /v1/chat/completions is image understanding, and it is routed by the image task to a model with strong vision capabilities (vision.ocr / vision.diagram / vision.overall, etc.). Note the distinction: image generation goes through the /v1/images/* endpoint and does not yet support auto.

Q: How do I know which model this request actually used?
A: Check the response header X-Aihubmix-Router-Resolved-Model, or the model field in the response body; both are backfilled with the real model name. See How to confirm which model was used.

Q: Will the Auto Router secretly use an expensive model?
A: No. The default policy cost_optimized is cost-first, and every resolved model is written into the response and billed at its list price, so it can be reconciled line by line. See Billing.

Q: How do I control / estimate cost?
A: Three measures stack together: ① the default auto (cost_optimized) is already cost-first; ② use the Key’s allowed model range to lock candidates within a price level you accept, which effectively sets an upper bound on cost; ③ each resolution is billed at the list price of the model in the response header Resolved-Model, reconcilable line by line. When you need stronger capability, explicitly use auto:quality_first.

Q: What’s the difference between auto and “model mapping / fallback”?
A: Model mapping / fallback is a Key-level fixed alias + an ordered fallback on failure (the same target every time); the Auto Router selects a model dynamically based on the content of each request. The former solves “the client only knows a certain name / switch to a backup when the primary is down,” while the latter solves “I don’t care which one, just give me the most suitable.”

Q: Can I restrict the Auto Router to choose only among a few models?
A: Yes. Constrain it via the Key’s allowed model range: the Auto Router will only choose among the models that Key permits, and out-of-range models will not be resolved.

Q: Are streaming requests supported?
A: Yes. Routing completes before the request reaches the upstream, treating streaming / non-streaming identically.

Q: Why did the same sentence resolve to different models across two calls?
A: The candidate pool and prices change dynamically with the platform catalog; this is by design. Use the Decision-Id and Resolved-Model in the response headers to review each decision.

Q: How do I make requests consistently resolve to the same model (e.g. to reuse prompt caching)?
A: auto selects a model dynamically against the current catalog and does not guarantee determinism. If you need to consistently resolve to the same model (for example, relying on the upstream’s prompt cache, or requiring strict reproducibility), specify a concrete model name directly, or use the Key to limit the allowed range to a single model. Under both approaches the resolution is deterministic.

Related resources

Model Mapping and Fallback: Key-level fixed aliases + failure fallback, complementary to the Auto Router.
Unified Inference Parameters: consistent request parameters across models.
AIHubMix models page: look up model names, prices, and Input Modalities.
