Gemini Guides
A comprehensive guide to Gemini API calls on our platform.
Forwarding for Gemini Models
For the Gemini series, we provide two invocation methods: native API calls and OpenAI-compatible calls.
Before you start, make sure to install or update the native dependency by running `pip install google-genai` or `pip install -U google-genai`.
1️⃣ For native forwarding, you mainly need to inject your AiHubMix API key and request URL into the internal client setup.
⚡️ Note: the URL format differs from the conventional `base_url` usage. Please refer to the example below:
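A minimal sketch of the native client setup; the API key and request URL below are placeholders, so substitute the values provided by AiHubMix:

```python
from google import genai

# Native forwarding (sketch): the request URL is injected via http_options,
# not via a conventional base_url argument.
client = genai.Client(
    api_key="YOUR_AIHUBMIX_KEY",                              # placeholder
    http_options={"base_url": "https://aihubmix.com/gemini"},  # placeholder URL
)

response = client.models.generate_content(
    model="gemini-2.5-flash-preview-04-17",
    contents="Explain transformers in one sentence.",
)
print(response.text)
```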
2️⃣ For OpenAI-compatible formats, retain the universal `v1` endpoint.
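A minimal sketch of the OpenAI-compatible setup, assuming the platform's `v1` endpoint and a placeholder key:

```python
from openai import OpenAI

# OpenAI-compatible call (sketch): keep the universal v1 endpoint as base_url.
client = OpenAI(
    api_key="YOUR_AIHUBMIX_KEY",          # placeholder
    base_url="https://aihubmix.com/v1",   # placeholder, use the documented v1 endpoint
)

completion = client.chat.completions.create(
    model="gemini-2.5-flash-preview-04-17",
    messages=[{"role": "user", "content": "Say hello in one short sentence."}],
)
print(completion.choices[0].message.content)
```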
3️⃣ For the 2.5 series, if you need to display the reasoning process, there are two ways to do it:
- Native invocation: pass `include_thoughts=True`
- OpenAI-compatible method: pass `reasoning_effort`

You can refer to the code examples below for detailed usage.
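A sketch of both approaches, assuming placeholder credentials, endpoint URLs, and the model id mentioned later in this guide:

```python
from google import genai
from google.genai import types
from openai import OpenAI

# 1) Native invocation: request thought summaries via include_thoughts=True.
native_client = genai.Client(
    api_key="YOUR_AIHUBMIX_KEY",                               # placeholder
    http_options={"base_url": "https://aihubmix.com/gemini"},  # placeholder
)
response = native_client.models.generate_content(
    model="gemini-2.5-flash-preview-04-17",
    contents="Why is the sky blue?",
    config=types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(include_thoughts=True)
    ),
)
for part in response.candidates[0].content.parts:
    prefix = "[thought] " if part.thought else ""
    if part.text:
        print(prefix + part.text)

# 2) OpenAI-compatible method: pass reasoning_effort.
# (On older openai SDK versions you may need to pass it via extra_body instead.)
openai_client = OpenAI(api_key="YOUR_AIHUBMIX_KEY", base_url="https://aihubmix.com/v1")
completion = openai_client.chat.completions.create(
    model="gemini-2.5-flash-preview-04-17",
    reasoning_effort="low",
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
)
print(completion.choices[0].message.content)
```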
About Gemini 2.5 Inference Models
- The entire 2.5 series consists of inference models.
- 2.5 Flash is a hybrid model, similar to Claude 3.7 Sonnet. You can fine-tune its reasoning behavior by adjusting the `thinking_budget` parameter for optimal control.
- 2.5 Pro is a pure inference model. Thinking cannot be disabled, and `thinking_budget` should not be explicitly set.
Python usage examples:
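A sketch of a native call that sets `thinking_budget`; the API key, endpoint URL, and prompt are placeholders:

```python
from google import genai
from google.genai import types

client = genai.Client(
    api_key="YOUR_AIHUBMIX_KEY",                               # placeholder
    http_options={"base_url": "https://aihubmix.com/gemini"},  # placeholder
)

# 2.5 Flash: tune reasoning depth via thinking_budget (0 disables thinking).
response = client.models.generate_content(
    model="gemini-2.5-flash-preview-04-17",
    contents="Plan a three-day itinerary for Kyoto.",
    config=types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(thinking_budget=1024)
    ),
)
print(response.text)
```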
Gemini 2.5 Flash: Quick Task Support
Example for OpenAI-compatible invocation:
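A minimal sketch of an OpenAI-compatible quick-task call, with the API key and base_url as placeholders:

```python
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_AIHUBMIX_KEY",         # placeholder
    base_url="https://aihubmix.com/v1",  # placeholder v1 endpoint
)

completion = client.chat.completions.create(
    model="gemini-2.5-flash-preview-04-17",
    messages=[{"role": "user", "content": "Summarize the plot of Hamlet in two sentences."}],
)
print(completion.choices[0].message.content)
```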
- For complex tasks, simply set the model id to the default `gemini-2.5-flash-preview-04-17` to enable thinking.
- Gemini 2.5 Flash uses the `budget` parameter to control the depth of thinking, ranging from 0 to 16K. The default budget is 1024, and the best marginal effect is reached around 16K.
Media Understanding
Aihubmix currently supports uploading multimedia files (images, audio, and video) up to 20MB via `inline_data`.
For files exceeding 20MB, the File API is required. This functionality is not yet available; progress tracking and `upload_url` retrieval are under development.
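A sketch of an `inline_data` upload with the native SDK; the file path, API key, and endpoint are placeholders:

```python
from google import genai
from google.genai import types

client = genai.Client(
    api_key="YOUR_AIHUBMIX_KEY",                               # placeholder
    http_options={"base_url": "https://aihubmix.com/gemini"},  # placeholder
)

with open("sample.jpg", "rb") as f:  # any image/audio/video file under 20MB
    media_bytes = f.read()

response = client.models.generate_content(
    model="gemini-2.5-flash-preview-04-17",
    contents=[
        types.Part.from_bytes(data=media_bytes, mime_type="image/jpeg"),
        "Describe this image in one sentence.",
    ],
)
print(response.text)
```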
Code Execution
The code execution feature enables the model to generate and run Python code and learn iteratively from the results until it arrives at a final output. You can use this code execution capability to build applications that benefit from code-based reasoning and that produce text output. For example, you could use code execution in an application that solves equations or processes text.
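A sketch of enabling the code execution tool in a native call; credentials and endpoint are placeholders:

```python
from google import genai
from google.genai import types

client = genai.Client(
    api_key="YOUR_AIHUBMIX_KEY",                               # placeholder
    http_options={"base_url": "https://aihubmix.com/gemini"},  # placeholder
)

response = client.models.generate_content(
    model="gemini-2.5-flash-preview-04-17",
    contents="What is the sum of the first 50 prime numbers? Generate and run Python code for the calculation.",
    config=types.GenerateContentConfig(
        tools=[types.Tool(code_execution=types.ToolCodeExecution())]
    ),
)

# The response interleaves text, the generated code, and its execution result.
for part in response.candidates[0].content.parts:
    if part.text:
        print(part.text)
    if part.executable_code:
        print(part.executable_code.code)
    if part.code_execution_result:
        print(part.code_execution_result.output)
```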
Context caching
Gemini's native API enables implicit context caching by default, with no setup required. For every `generate_content` request, the system automatically caches the input content. If a subsequent request uses the exact same content, model, and parameters, the system will instantly return the previous result, dramatically speeding up response time and potentially reducing input token costs.
- Caching is automatic—no manual configuration needed.
- The cache is only hit when the content, model, and all parameters are exactly the same; any difference will result in a cache miss.
- The cache time-to-live (TTL) can be set by the developer, or left unset (defaults to 1 hour). There is no minimum or maximum TTL enforced by Google. Costs depend on the number of cached tokens and the cache duration.
- While Google places no restriction on TTL, as a forwarding platform, we only support a limited TTL range. For requirements beyond our platform’s limits, please contact us.
Notes
- No guaranteed cost savings: cached tokens are billed at 25% of the standard input price, so in theory caching can save you up to 75% of input token costs. However, Google's official docs make no guarantee of cost savings; the real-world effect depends on your cache hit rate, token types, and storage duration.
- Cache hit conditions: to maximize cache effectiveness, place repeatable context at the start of your input and dynamic content (like user input) at the end.
- How to detect cache hits: if a response comes from the cache, `response.usage_metadata` will include the `cache_tokens_details` field and `cached_content_token_count`. You can use these to determine cache usage.
When a cache hit occurs, `response.usage_metadata` will contain the `cache_tokens_details` field and a non-zero `cached_content_token_count`.

Code example:
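The following is a sketch that sends the same request twice and inspects the cache fields; the credentials, endpoint, and content are placeholders:

```python
from google import genai

client = genai.Client(
    api_key="YOUR_AIHUBMIX_KEY",                               # placeholder
    http_options={"base_url": "https://aihubmix.com/gemini"},  # placeholder
)

long_context = "..."  # large, repeatable context placed first (placeholder)
question = "Summarize the key points."  # dynamic content placed last

# Send the same content twice; the second call may be served from the implicit cache.
for attempt in range(2):
    response = client.models.generate_content(
        model="gemini-2.5-flash-preview-04-17",
        contents=long_context + "\n\n" + question,
    )
    meta = response.usage_metadata
    cached = meta.cached_content_token_count or 0
    print(f"attempt {attempt + 1}: prompt={meta.prompt_token_count}, cached={cached}")
    if cached:
        print("cache hit details:", meta.cache_tokens_details)
```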
Core conclusion: implicit caching is automatic and provides clear cache hit feedback. Developers can check `usage_metadata` for cache status. Cost savings are not guaranteed; the actual benefit depends on request structure and cache hit rates.
Function calling
When using the OpenAI-compatible way to call Gemini's function calling, you need to pass `tool_choice="auto"` in the request body; otherwise the request will return an error.
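A sketch of an OpenAI-compatible function calling request with `tool_choice="auto"`; the tool definition, credentials, and endpoint are illustrative placeholders:

```python
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_AIHUBMIX_KEY",         # placeholder
    base_url="https://aihubmix.com/v1",  # placeholder v1 endpoint
)

# Hypothetical tool definition for illustration only.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="gemini-2.5-flash-preview-04-17",
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=tools,
    tool_choice="auto",  # required, or the request is rejected
)
print(response.choices[0].message.tool_calls)
```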
Output Example:
Token Usage Tracking Made Simple
- Gemini tracks token usage via `usage_metadata`. Here's what each field means:
  - `prompt_token_count`: number of input tokens
  - `candidates_token_count`: number of output tokens
  - `thoughts_token_count`: tokens used during reasoning (also counted as output)
  - `total_token_count`: total tokens used (input + output)

  For more details, check out their official docs.
- For APIs using the OpenAI-compatible format, token usage is tracked under `.usage` with the following fields:
  - `usage.prompt_tokens`: number of input tokens
  - `usage.completion_tokens`: number of output tokens (including reasoning)
  - `usage.total_tokens`: total token usage
Here’s how to use it in code:
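A sketch covering both styles; credentials, endpoints, and prompts are placeholders:

```python
# Native: read usage_metadata from a generate_content response.
from google import genai
from openai import OpenAI

client = genai.Client(
    api_key="YOUR_AIHUBMIX_KEY",                               # placeholder
    http_options={"base_url": "https://aihubmix.com/gemini"},  # placeholder
)
response = client.models.generate_content(
    model="gemini-2.5-flash-preview-04-17",
    contents="Give me one fun fact about octopuses.",
)
meta = response.usage_metadata
print("input:", meta.prompt_token_count)
print("output:", meta.candidates_token_count)
print("reasoning:", meta.thoughts_token_count)
print("total:", meta.total_token_count)

# OpenAI-compatible: read the .usage object instead.
oai = OpenAI(api_key="YOUR_AIHUBMIX_KEY", base_url="https://aihubmix.com/v1")  # placeholder
completion = oai.chat.completions.create(
    model="gemini-2.5-flash-preview-04-17",
    messages=[{"role": "user", "content": "Give me one fun fact about octopuses."}],
)
print("input:", completion.usage.prompt_tokens)
print("output (incl. reasoning):", completion.usage.completion_tokens)
print("total:", completion.usage.total_tokens)
```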