Forwarding for Gemini Models
For the Gemini series, we provide two invocation methods: native API calls and OpenAI-compatible calls. Before you start, make sure to install or update the native dependency by running `pip install google-genai` (or `pip install -U google-genai` to upgrade).
1️⃣ For native integration, Gemini takes care of routing traffic between AI Studio and Vertex AI automatically. Just supply your AiHubMix API key and the appropriate request URL. Remember, this URL is different from the usual `base_url` format and does not include the `v1` endpoint; follow the example below to ensure proper setup.
To return the model's thinking output (see the sketch below):
- Native invocation: pass `include_thoughts=True`
- OpenAI-compatible method: pass `reasoning_effort`
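A minimal sketch of the native call with thinking output enabled. The API key is a placeholder, and the request URL is assumed to be the AiHubMix Gemini forwarding address; use the exact URL from the setup example.

```python
from google import genai
from google.genai import types

# Assumed values: replace with your AiHubMix key and the request URL from the example.
client = genai.Client(
    api_key="YOUR_AIHUBMIX_KEY",
    http_options={"base_url": "https://api.aihubmix.com/gemini"},
)

response = client.models.generate_content(
    model="gemini-2.5-flash-preview-04-17",
    contents="How many prime numbers are there below 100?",
    config=types.GenerateContentConfig(
        # Native way to surface the model's reasoning
        thinking_config=types.ThinkingConfig(include_thoughts=True)
    ),
)

# Thought parts are flagged with part.thought; the rest is the final answer.
for part in response.candidates[0].content.parts:
    print("Thought:" if part.thought else "Answer:", part.text)
```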
About Gemini 2.5 Reasoning Models
- The entire 2.5 series consists of reasoning models.
- 2.5 Flash is a hybrid model, similar to Claude 3.7 Sonnet. You can fine-tune its reasoning behavior by adjusting the `thinking_budget` parameter (see the sketch below).
- 2.5 Pro is a pure reasoning model. Thinking cannot be disabled, and `thinking_budget` should not be explicitly set.
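A sketch of how the two models differ in practice, assuming the same client setup as above. The 2.5 Pro model id shown is illustrative; check the available model list. Only Flash takes an explicit `thinking_budget`.

```python
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_AIHUBMIX_KEY",
                      http_options={"base_url": "https://api.aihubmix.com/gemini"})

# 2.5 Flash (hybrid): tune thinking depth with thinking_budget (0 to 16K tokens).
flash = client.models.generate_content(
    model="gemini-2.5-flash-preview-04-17",
    contents="Explain the birthday paradox briefly.",
    config=types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(thinking_budget=1024)
    ),
)

# 2.5 Pro (pure reasoning): thinking is always on; do not set thinking_budget.
pro = client.models.generate_content(
    model="gemini-2.5-pro-preview-03-25",  # illustrative id, verify before use
    contents="Explain the birthday paradox briefly.",
)

print(flash.text)
print(pro.text)
```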
Gemini 2.5 Flash: Quick Task Support
Example for OpenAI-compatible invocation (see the sketch below):
- For complex tasks, simply set the model id to the default `gemini-2.5-flash-preview-04-17` to enable thinking.
- Gemini 2.5 Flash uses the `budget` parameter to control the depth of thinking, ranging from 0 to 16K. The default budget is 1024, and 16K gives the best marginal effect.
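A minimal OpenAI-compatible sketch. The base_url is assumed to be AiHubMix's OpenAI-style endpoint (use the address from your dashboard if it differs), and `reasoning_effort` is passed through `extra_body` so it also works on older `openai` SDK versions.

```python
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_AIHUBMIX_KEY",
    base_url="https://aihubmix.com/v1",  # assumed OpenAI-compatible endpoint
)

completion = client.chat.completions.create(
    model="gemini-2.5-flash-preview-04-17",  # default id keeps thinking enabled
    messages=[{"role": "user", "content": "Prove that the square root of 2 is irrational."}],
    # OpenAI-compatible reasoning knob mentioned above.
    extra_body={"reasoning_effort": "medium"},
)

print(completion.choices[0].message.content)
```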
Media Understanding
Aihubmix currently supports uploading multimedia files (images, audio, and video) up to 20MB viainline_data
.
For files exceeding 20MB, a File API will be required. This functionality is not yet available; progress tracking and upload_url retrieval are under development.
By setting the media resolution parameter (for example, `MEDIA_RESOLUTION_MEDIUM`), you can lower the image resolution, which significantly reduces input costs and minimizes the risk of errors with large images (see the sketch after the table). Supported media resolution values:
Name | Description |
---|---|
MEDIA_RESOLUTION_UNSPECIFIED | Media resolution has not been set. |
MEDIA_RESOLUTION_LOW | Media resolution set to low (64 tokens). |
MEDIA_RESOLUTION_MEDIUM | Media resolution set to medium (256 tokens). |
MEDIA_RESOLUTION_HIGH | Media resolution set to high (zoomed reframing with 256 tokens). |
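A sketch of an image request using `inline_data` with medium media resolution. The `media_resolution` config field and the `types.MediaResolution` enum follow recent `google-genai` releases; verify the names against your installed version, and note the endpoint and key are placeholders as before.

```python
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_AIHUBMIX_KEY",
                      http_options={"base_url": "https://api.aihubmix.com/gemini"})

with open("photo.jpg", "rb") as f:
    image_bytes = f.read()  # must stay under the 20MB inline_data limit

response = client.models.generate_content(
    model="gemini-2.5-flash-preview-04-17",
    contents=[
        types.Part.from_bytes(data=image_bytes, mime_type="image/jpeg"),  # sent as inline_data
        "Describe this image in one sentence.",
    ],
    config=types.GenerateContentConfig(
        media_resolution=types.MediaResolution.MEDIA_RESOLUTION_MEDIUM  # 256 tokens per image
    ),
)

print(response.text)
```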
Code Execution
The code execution feature enables the model to generate and run Python code and learn iteratively from the results until it arrives at a final output. You can use this code execution capability to build applications that benefit from code-based reasoning and that produce text output. For example, you could use code execution in an application that solves equations or processes text.
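A minimal code-execution sketch, assuming the same placeholder client setup as above. The tool is enabled through `GenerateContentConfig`, and the response interleaves text, the generated code, and its execution output.

```python
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_AIHUBMIX_KEY",
                      http_options={"base_url": "https://api.aihubmix.com/gemini"})

response = client.models.generate_content(
    model="gemini-2.5-flash-preview-04-17",
    contents="What is the sum of the first 50 prime numbers? "
             "Generate and run Python code for the calculation.",
    config=types.GenerateContentConfig(
        tools=[types.Tool(code_execution=types.ToolCodeExecution())]  # enable code execution
    ),
)

# Walk the parts: explanatory text, the generated code, and the result of running it.
for part in response.candidates[0].content.parts:
    if part.text:
        print(part.text)
    if part.executable_code:
        print("Generated code:\n", part.executable_code.code)
    if part.code_execution_result:
        print("Execution output:\n", part.code_execution_result.output)
```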
Context caching
Gemini's native API enables implicit context caching by default, with no setup required. For every `generate_content` request, the system automatically caches the input content. If a subsequent request uses the exact same content, model, and parameters, the cached input is reused instead of being reprocessed, dramatically speeding up response time and potentially reducing input token costs.
- Caching is automatic—no manual configuration needed.
- The cache is only hit when the content, model, and all parameters are exactly the same; any difference will result in a cache miss.
- The cache time-to-live (TTL) can be set by the developer, or left unset (defaults to 1 hour). There is no minimum or maximum TTL enforced by Google. Costs depend on the number of cached tokens and the cache duration.
- While Google places no restriction on TTL, as a forwarding platform, we only support a limited TTL range. For requirements beyond our platform’s limits, please contact us.
Notes
- No guaranteed cost savings: Cache tokens are billed at 25% of the standard input price—so theoretically, caching can save you up to 75% of input token costs. However, Google’s official docs make no guarantee of cost savings; the real-world effect depends on your cache hit rate, token types, and storage duration.
- Cache hit conditions: To maximize cache effectiveness, place repeatable context at the start of your input and dynamic content (like user input) at the end.
- How to detect cache hits: If a response comes from the cache, `response.usage_metadata` will include the `cache_tokens_details` field and `cached_content_token_count`. You can use these to determine cache usage.
Example fields for a cache hit are shown in the sketch below.
Core conclusion: implicit caching is automatic and provides clear cache hit feedback. Developers can check `usage_metadata` for cache status. Cost savings are not guaranteed; the actual benefit depends on request structure and cache hit rates.
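A sketch of inspecting the cache fields on a response. The client setup and the long repeated prefix are placeholders; the field names are the ones quoted above.

```python
from google import genai

client = genai.Client(api_key="YOUR_AIHUBMIX_KEY",
                      http_options={"base_url": "https://api.aihubmix.com/gemini"})

long_context = "Background document text. " * 500   # stable, repeatable prefix goes first
question = "Summarize the key points."               # dynamic part goes last

response = client.models.generate_content(
    model="gemini-2.5-flash-preview-04-17",
    contents=long_context + question,
)

meta = response.usage_metadata
if getattr(meta, "cached_content_token_count", None):
    print("Cache hit, cached tokens:", meta.cached_content_token_count)
    print("Details:", meta.cache_tokens_details)
else:
    print("Cache miss, billed input tokens:", meta.prompt_token_count)
```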
Function calling
By using the openai compatible way to call Gemini’s function calling, you need to pass intool_choice="auto"
in the request body, otherwise it will report an error.
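A sketch of OpenAI-compatible function calling against the assumed `https://aihubmix.com/v1` endpoint. The `get_weather` tool is a made-up example; the important part is that `tool_choice="auto"` is included.

```python
from openai import OpenAI

client = OpenAI(api_key="YOUR_AIHUBMIX_KEY", base_url="https://aihubmix.com/v1")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool for illustration
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

completion = client.chat.completions.create(
    model="gemini-2.5-flash-preview-04-17",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
    tool_choice="auto",  # required here; omitting it causes an error
)

print(completion.choices[0].message.tool_calls)
```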
Token Usage Tracking Made Simple
- Gemini tracks token usage via `usage_metadata`. Here's what each field means:
  - `prompt_token_count`: number of input tokens
  - `candidates_token_count`: number of output tokens
  - `thoughts_token_count`: tokens used during reasoning (also counted as output)
  - `total_token_count`: total tokens used (input + output)
- For APIs using the OpenAI-compatible format, token usage is tracked under `.usage` with the following fields:
  - `usage.prompt_tokens`: number of input tokens
  - `usage.completion_tokens`: number of output tokens (including reasoning)
  - `usage.total_tokens`: total token usage
Here’s how to use it in code:
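A sketch printing both usage shapes, assuming the placeholder AiHubMix endpoints used earlier; the field names match the lists above.

```python
from google import genai
from openai import OpenAI

prompt = "Summarize the plot of Hamlet in two sentences."

# Native call: usage lives in response.usage_metadata.
gclient = genai.Client(api_key="YOUR_AIHUBMIX_KEY",
                       http_options={"base_url": "https://api.aihubmix.com/gemini"})
resp = gclient.models.generate_content(
    model="gemini-2.5-flash-preview-04-17", contents=prompt)
meta = resp.usage_metadata
print("input tokens:   ", meta.prompt_token_count)
print("output tokens:  ", meta.candidates_token_count)
print("thinking tokens:", meta.thoughts_token_count)
print("total tokens:   ", meta.total_token_count)

# OpenAI-compatible call: usage lives in completion.usage.
oclient = OpenAI(api_key="YOUR_AIHUBMIX_KEY", base_url="https://aihubmix.com/v1")
completion = oclient.chat.completions.create(
    model="gemini-2.5-flash-preview-04-17",
    messages=[{"role": "user", "content": prompt}],
)
u = completion.usage
print("input tokens: ", u.prompt_tokens)
print("output tokens:", u.completion_tokens)  # includes reasoning tokens
print("total tokens: ", u.total_tokens)
```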