One model=auto, and the gateway decides “which model to use.”
The Auto Router analyzes each request and selects the most suitable model in real time from the hundreds of models on the platform. All you do is set model to auto—no picking models, no comparing prices, no tracking model releases.
Billed by the resolved model, with no surcharge and zero client code changes. Which model was hit is written into the response headers and body (see How to confirm which model was used), so it is fully traceable.
Use cases
- General-purpose apps: when you don’t know what kind of request a user will send, hand it to
autoto dispatch by content. - Cost optimization: let simple tasks land on cheaper, faster models automatically (
autois cost-first by default). - Quality optimization: ensure complex requests are routed to more capable models (
auto:quality_first). - Latency-critical scenarios: real-time voice and multi-turn agent loops prefer the fastest-responding model (
auto:latency_critical). - Single entry point, no model selection: different request types are dispatched to their own optimal models—no need to maintain a “task → model” mapping table, and no need to keep tracking model releases, comparing prices, and swapping names by hand.
Quick start
Setmodel to auto; the rest of your request body stays exactly the same as a normal call. Use https://aihubmix.com/v1 as the base_url.
How to confirm which model was used
This is the trust anchor of the Auto Router: you always know which model ended up handling this request.- The
modelfield in the response body is backfilled with the real resolved model (e.g.mimo-v2.5-pro), notauto. - The response headers give you the full decision detail:
| Header | Meaning | Example value |
|---|---|---|
X-Aihubmix-Router-Resolved-Model | The model actually hit and billed accordingly | xiaomi-mimo-v2.5-pro |
X-Aihubmix-Router-Policy | The policy used for this request | cost_optimized |
X-Aihubmix-Router-Dimension | The detected task dimension | text.overall |
X-Aihubmix-Router-Decision-Id | A unique ID for this decision, for troubleshooting | 05dbad09-33c5-42de-… |
X-Aihubmix-Router-Reason | A brief decision summary (policy / dimension / top score / candidate count) | policy=cost_optimized dim=text.overall top=0.182 survivors=20/33 |
X-Aihubmix-Router-Fallback | Present only when the no-candidate fallback is triggered | true |
HTTP response headers are case-insensitive: the table above capitalizes them by convention, but the actual HTTP/2 response returns them lowercased as x-aihubmix-router-*—the two are equivalent.
Read the routing decision (curl to see the headers; SDKs use the raw response object to get headers):
reason: survivors=20/33 means that out of 33 candidates, 20 passed hard filtering and entered scoring; top=0.182 is the winning model’s normalized composite score within the candidate pool (capability / cost / latency weighted by policy).
The
Resolved-Model in the example depends on the current candidates and prices in the live catalog, and will change as platform models come and go—which is exactly the value of the Auto Router: you don’t have to track these changes. To keep your decisions auditable, rely on the real model name in the response headers / body rather than assuming it stays fixed.Routing policies
auto without a suffix uses the default policy cost_optimized. You can use auto:<policy> to explicitly specify the emphasis:
| Policy syntax | Emphasis | Use case |
|---|---|---|
auto (= auto:cost_optimized) | Cost first: pick the cheapest model that meets the capability bar | Batch tasks, cost-sensitive workloads |
auto:balanced | Balanced: weighs capability / cost / latency | General purpose, a safe choice when unsure |
auto:quality_first | Quality first: prefer the most capable model | Complex reasoning, critical output |
auto:latency_critical | Latency first: prefer the fastest-responding model | Real-time voice, agent loops |
model:
What is the meaning of life?, all landing on the text.overall dimension):
| Policy | Resolved model | top score |
|---|---|---|
auto (= cost_optimized) | xiaomi-mimo-v2.5-pro | 0.182 |
auto:balanced | claude-opus-4-6-think | 0.488 |
auto:latency_critical | claude-opus-4-6 | 0.646 |
auto:quality_first | claude-opus-4-6-think | 0.758 |
latency_criticalpicked the non--thinkversion—thinking variants have higher reasoning latency, so the latency policy actively avoids them. This shows that policy weights genuinely act on the “capability / cost / latency” tradeoff, not on capability alone.
Content changes the result too: applying the sameauto:quality_firstto a coding task (the request in the example above), the dimension shifts fromtext.overalltotext.coding, and the measured resolved model isclaude-opus-4-6-think—the policy and the request content together determine the final model.
An unknown policy suffix (e.g.
auto:fast) falls back to the default policy cost_optimized without raising an error.How it works
Upon receivingmodel=auto, the gateway turns “intent” into “a concrete model” in three steps:
Extract request features
It analyzes this request’s input / output modalities (text, image, file), content intent (code, math, OCR, diagrams, language, whether web search is on, etc.), and request scale (estimated input / output tokens), normalizing them into a single task dimension. For example: a question containing code →
text.coding; an image plus an OCR request → vision.ocr; plain text → text.overall.Hard-filter the candidates
Models that don’t satisfy hard constraints are excluded outright: not supporting the required input / output modalities, a context window that can’t fit the request, removed by the circuit breaker (see Reliability and fault tolerance), or not within your Key’s allowed model range.
Score by policy weighting
For the candidates that pass filtering, the router takes model capability scores from authoritative industry benchmarks, overlays real-time price and performance data, applies a three-dimensional weighted score over “capability / cost / latency” according to the chosen policy, and picks the highest-scoring one. The final model name is written back into the request and the response headers.
quality_first policy, the top 3 of the same candidate pool, taken from production decision logs):
| Candidate model | Capability score | Relative cost | Latency | Composite score |
|---|---|---|---|---|
claude-opus-4-6-think | 1504 | 220 | 1963ms | 0.758 |
claude-opus-4-6 | 1498 | 220 | 822ms | 0.721 |
claude-fable-5 | 1510 | 484 | 11130ms | 0.600 |
Note that claude-fable-5 has the highest capability score (1510), yet its higher cost and larger latency push its composite score down to third. This is exactly the point of weighted scoring: not “capability above all,” but a policy-driven tradeoff among capability / cost / latency.
Dimension detection is automatic—the Auto Router has 30+ fine-grained task dimensions built in (code / math / OCR / diagrams / long text / Chinese / web search…), far more precise than “coarse routing by model family.” With the same auto, different content routes to different dimensions:
| Your request | Detected dimension |
|---|---|
| Plain text question | text.overall |
| Contains code, asks to write / debug a program | text.coding |
| Math proof / solving | text.math |
| A very long question (around 500+ tokens) | text.longer_query |
| A question in Chinese | text.language.chinese |
| Image input + “What is in this image?” | vision.overall |
| Image input + “OCR…” / “extract the text” | vision.ocr |
| Image input + chart / flowchart | vision.diagram |
| Web search enabled | search.overall |
Image input also goes through the Auto Router: when you include an image in
/v1/chat/completions, it is routed by the image task to a model with strong vision capabilities. Measured in production—“OCR this image” → vision.ocr, resolved to qwen3.5-397b-a17b; a general “What is in this image?” → vision.overall, resolved to gpt-5.4-mini. (This refers to image understanding; image generation at /v1/images/* does not yet support auto, see FAQ.)Reliability and fault tolerance
The Auto Router has multiple layers of fault tolerance built in, so theauto path never fails for no reason:
Circuit breaker: automatically remove failing models
Circuit breaker: automatically remove failing models
The gateway maintains a sliding-window failure-rate statistic for each model. When a model fails enough times within the window and its failure rate exceeds the threshold, it is temporarily removed from the candidate pool and automatically recovers after a cooldown period—avoiding sending subsequent requests to a model that is currently misbehaving. The failure signal comes from errors the upstream returns for that request; the gateway’s own “no available channel” does not count (that is not a problem with the model itself).
No-candidate fallback: never return 400 on auto
No-candidate fallback: never return 400 on auto
If hard filtering happens to exclude every candidate (for example, a certain modality combination temporarily has no available model), the gateway does not error out. Instead, it assigns a fallback model by output type to guarantee a response, and adds
X-Aihubmix-Router-Fallback: true to the response headers to let you know.Privilege boundary: a restricted Key is not bypassed by the fallback
Privilege boundary: a restricted Key is not bypassed by the fallback
If your Key limits the allowed model range, the model selected by the Auto Router (including the fallback) is always within that range. If there really is no model within the range that can serve this request, it explicitly returns 403 rather than silently using a model outside the range (which might be more expensive).
Billing
Billed at the list price of the resolved model; the Auto Router itself charges no surcharge. Whichever model ends up responding is what you’re billed for, by that model’s price, capability, and context limits—and that model is the value in the response headerX-Aihubmix-Router-Resolved-Model and the response body model field. In other words, the Auto Router will not “secretly use an expensive model”: every resolution is written into the response and can be reconciled line by line.
Limitations
-
The Auto Router currently targets the chat completions endpoint
/v1/chat/completions(see FAQ: which endpoints are supported). -
?router=offor the request headerX-Router-Offmakesmodel=autoreturn 400 directly—this is an explicit refusal of the ambiguous “want auto but also turn routing off” usage, rather than a silent ignore: -
The candidate pool changes dynamically with the platform catalog: the same
automay resolve to different models at different times (this is by design, and reviewable via the response headers).
Differences from OpenRouter / LiteLLM
“Automatic model selection” is not unique to AIHubMix; OpenRouter and LiteLLM both offer similar capabilities. The differences are mainly in integration cost and hosting model:| Difference | OpenRouter | LiteLLM | AIHubMix |
|---|---|---|---|
| Automatic model selection by request content | ✅ | ✅ | ✅ |
| Zero config, works out of the box (no routing rules / utterances to write) | ✅ | ❌ | ✅ |
| Platform-hosted, no self-built / self-deployed proxy | ✅ | ❌ | ✅ |
| Cost / quality / latency multi-policy, switchable with one parameter | ❌ | ❌ | ✅ |
| Traceable resolution decisions (response headers include dimension / policy / reason) | ❌ | ❌ | ✅ |
| Billed by the final resolved model | ✅ | ❌ | ✅ |
FAQ
Q: Which endpoints does the Auto Router support? A: Currentlymodel=auto supports the OpenAI-compatible chat completions endpoint /v1/chat/completions. Image generation / editing (/v1/images/*), audio, /v1/embeddings, /v1/rerank, and other endpoints do not yet support auto—specify a concrete model directly.
Q: Does the Auto Router support image input?
A: Yes. Asking with an image (image_url) in /v1/chat/completions is image understanding, and it is routed by the image task to a model with strong vision capabilities (vision.ocr / vision.diagram / vision.overall, etc.). Note the distinction: image generation goes through the /v1/images/* endpoint and does not yet support auto.
Q: How do I know which model this request actually used?
A: Check the response header X-Aihubmix-Router-Resolved-Model, or the model field in the response body—both are backfilled with the real model name. See How to confirm which model was used.
Q: Will the Auto Router secretly use an expensive model?
A: No. The default policy cost_optimized is cost-first; and every resolved model is written into the response and billed at its list price, so it can be reconciled line by line. See Billing.
Q: How do I control / estimate cost?
A: Three measures stack together—① the default auto (cost_optimized) is already cost-first; ② use the Key’s allowed model range to lock candidates within a price level you accept, which effectively sets an upper bound on cost; ③ each resolution is billed at the list price of the model in the response header Resolved-Model, reconcilable line by line. When you need stronger capability, explicitly use auto:quality_first.
Q: What’s the difference between auto and “model mapping / fallback”?
A: Model mapping / fallback is a Key-level fixed alias + an ordered fallback on failure (the same target every time); the Auto Router selects a model dynamically based on the content of each request. The former solves “the client only knows a certain name / switch to a backup when the primary is down,” while the latter solves “I don’t care which one, just give me the most suitable.”
Q: Can I restrict the Auto Router to choose only among a few models?
A: Yes—constrain it via the Key’s allowed model range: the Auto Router will only choose among the models that Key permits, and out-of-range models will not be resolved.
Q: Are streaming requests supported?
A: Yes. Routing completes before the request reaches the upstream, treating streaming / non-streaming identically.
Q: Why did the same sentence resolve to different models across two calls?
A: The candidate pool and prices change dynamically with the platform catalog—this is by design. Use the Decision-Id and Resolved-Model in the response headers to review each decision.
Q: How do I make requests consistently resolve to the same model (e.g. to reuse prompt caching)?
A: auto selects a model dynamically against the current catalog and does not guarantee determinism. If you need to consistently resolve to the same model (for example, relying on the upstream’s prompt cache, or requiring strict reproducibility), specify a concrete model name directly, or use the Key to limit the allowed range to a single model—under both approaches the resolution is deterministic.
Related resources
- Model Mapping and Fallback: Key-level fixed aliases + failure fallback, complementary to the Auto Router.
- Unified Inference Parameters: consistent request parameters across models.
- AIHubMix models page: look up model names, prices, and
Input Modalities.