Forwarding for Gemini Models
For the Gemini series, we provide two invocation methods: native API calls and OpenAI-compatible calls. Before you start, make sure to install or update the native dependency by running `pip install google-genai` (or `pip install -U google-genai` to upgrade).
1️⃣ For native integration, Gemini takes care of routing traffic between AI Studio and Vertex AI automatically. Just supply your AiHubMix API key and the appropriate request URL. Remember, this URL is different from the usual `base_url` format and does not include the `v1` endpoint; follow the example below to ensure proper setup.
To return the model's thinking output (see the sketch below):
- Native invocation: pass `include_thoughts=True`
- OpenAI-compatible method: pass `reasoning_effort`
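A minimal sketch of the native call with thinking output enabled. The API key is a placeholder, and the request URL is assumed to be the AiHubMix Gemini forwarding address; use the exact URL from the setup example.

```python
from google import genai
from google.genai import types

# Assumed values: replace with your AiHubMix key and the request URL from the example.
client = genai.Client(
    api_key="YOUR_AIHUBMIX_KEY",
    http_options={"base_url": "https://api.aihubmix.com/gemini"},
)

response = client.models.generate_content(
    model="gemini-2.5-flash-preview-04-17",
    contents="How many prime numbers are there below 100?",
    config=types.GenerateContentConfig(
        # Native way to surface the model's reasoning
        thinking_config=types.ThinkingConfig(include_thoughts=True)
    ),
)

# Thought parts are flagged with part.thought; the rest is the final answer.
for part in response.candidates[0].content.parts:
    print("Thought:" if part.thought else "Answer:", part.text)
```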
About Gemini 2.5 Reasoning Models
- The entire 2.5 series consists of reasoning models.
- 2.5 Flash is a hybrid model, similar to Claude 3.7 Sonnet. You can fine-tune its reasoning behavior by adjusting the `thinking_budget` parameter (see the sketch below).
- 2.5 Pro is a pure reasoning model. Thinking cannot be disabled, and `thinking_budget` should not be explicitly set.
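A sketch of how the two models differ in practice, assuming the same client setup as above. The 2.5 Pro model id shown is illustrative; check the available model list. Only Flash takes an explicit `thinking_budget`.

```python
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_AIHUBMIX_KEY",
                      http_options={"base_url": "https://api.aihubmix.com/gemini"})

# 2.5 Flash (hybrid): tune thinking depth with thinking_budget (0 to 16K tokens).
flash = client.models.generate_content(
    model="gemini-2.5-flash-preview-04-17",
    contents="Explain the birthday paradox briefly.",
    config=types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(thinking_budget=1024)
    ),
)

# 2.5 Pro (pure reasoning): thinking is always on; do not set thinking_budget.
pro = client.models.generate_content(
    model="gemini-2.5-pro-preview-03-25",  # illustrative id, verify before use
    contents="Explain the birthday paradox briefly.",
)

print(flash.text)
print(pro.text)
```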
Gemini 2.5 Flash: Quick Task Support
Example for OpenAI-compatible invocation (see the sketch below):
- For complex tasks, simply set the model id to the default `gemini-2.5-flash-preview-04-17` to enable thinking.
- Gemini 2.5 Flash uses the `budget` parameter to control the depth of thinking, ranging from 0 to 16K. The default budget is 1024, and 16K gives the best marginal effect.
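A minimal OpenAI-compatible sketch. The base_url is assumed to be AiHubMix's OpenAI-style endpoint (use the address from your dashboard if it differs), and `reasoning_effort` is passed through `extra_body` so it also works on older `openai` SDK versions.

```python
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_AIHUBMIX_KEY",
    base_url="https://aihubmix.com/v1",  # assumed OpenAI-compatible endpoint
)

completion = client.chat.completions.create(
    model="gemini-2.5-flash-preview-04-17",  # default id keeps thinking enabled
    messages=[{"role": "user", "content": "Prove that the square root of 2 is irrational."}],
    # OpenAI-compatible reasoning knob mentioned above.
    extra_body={"reasoning_effort": "medium"},
)

print(completion.choices[0].message.content)
```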
Media Understanding
Aihubmix currently supports uploading multimedia files (images, audio, and video) up to 20MB viainline_data
.
For files exceeding 20MB, a File API will be required. This functionality is not yet available; progress tracking and upload_url retrieval are under development.
By setting the media resolution parameter (for example, `MEDIA_RESOLUTION_MEDIUM`), you can lower the image resolution, which significantly reduces input costs and minimizes the risk of errors with large images (see the sketch after the table). Supported media resolution values:
Name | Description |
---|---|
MEDIA_RESOLUTION_UNSPECIFIED | Media resolution has not been set. |
MEDIA_RESOLUTION_LOW | Media resolution set to low (64 tokens). |
MEDIA_RESOLUTION_MEDIUM | Media resolution set to medium (256 tokens). |
MEDIA_RESOLUTION_HIGH | Media resolution set to high (zoomed reframing with 256 tokens). |
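A sketch of an image request using `inline_data` with medium media resolution. The `media_resolution` config field and the `types.MediaResolution` enum follow recent `google-genai` releases; verify the names against your installed version, and note the endpoint and key are placeholders as before.

```python
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_AIHUBMIX_KEY",
                      http_options={"base_url": "https://api.aihubmix.com/gemini"})

with open("photo.jpg", "rb") as f:
    image_bytes = f.read()  # must stay under the 20MB inline_data limit

response = client.models.generate_content(
    model="gemini-2.5-flash-preview-04-17",
    contents=[
        types.Part.from_bytes(data=image_bytes, mime_type="image/jpeg"),  # sent as inline_data
        "Describe this image in one sentence.",
    ],
    config=types.GenerateContentConfig(
        media_resolution=types.MediaResolution.MEDIA_RESOLUTION_MEDIUM  # 256 tokens per image
    ),
)

print(response.text)
```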
Code Execution
The code execution feature enables the model to generate and run Python code and learn iteratively from the results until it arrives at a final output. You can use this code execution capability to build applications that benefit from code-based reasoning and that produce text output. For example, you could use code execution in an application that solves equations or processes text.
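A minimal code-execution sketch, assuming the same placeholder client setup as above. The tool is enabled through `GenerateContentConfig`, and the response interleaves text, the generated code, and its execution output.

```python
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_AIHUBMIX_KEY",
                      http_options={"base_url": "https://api.aihubmix.com/gemini"})

response = client.models.generate_content(
    model="gemini-2.5-flash-preview-04-17",
    contents="What is the sum of the first 50 prime numbers? "
             "Generate and run Python code for the calculation.",
    config=types.GenerateContentConfig(
        tools=[types.Tool(code_execution=types.ToolCodeExecution())]  # enable code execution
    ),
)

# Walk the parts: explanatory text, the generated code, and the result of running it.
for part in response.candidates[0].content.parts:
    if part.text:
        print(part.text)
    if part.executable_code:
        print("Generated code:\n", part.executable_code.code)
    if part.code_execution_result:
        print("Execution output:\n", part.code_execution_result.output)
```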
Context caching
Gemini's native API enables implicit context caching by default, with no setup required. For every `generate_content` request, the system automatically caches the input content. If a subsequent request uses the exact same content, model, and parameters, the cached input is reused instead of being reprocessed, dramatically speeding up response time and potentially reducing input token costs.
- Caching is automatic—no manual configuration needed.
- The cache is only hit when the content, model, and all parameters are exactly the same; any difference will result in a cache miss.
- The cache time-to-live (TTL) can be set by the developer, or left unset (defaults to 1 hour). There is no minimum or maximum TTL enforced by Google. Costs depend on the number of cached tokens and the cache duration.
- While Google places no restriction on TTL, as a forwarding platform, we only support a limited TTL range. For requirements beyond our platform’s limits, please contact us.
Notes
- No guaranteed cost savings: Cache tokens are billed at 25% of the standard input price—so theoretically, caching can save you up to 75% of input token costs. However, Google’s official docs make no guarantee of cost savings; the real-world effect depends on your cache hit rate, token types, and storage duration.
- Cache hit conditions: To maximize cache effectiveness, place repeatable context at the start of your input and dynamic content (like user input) at the end.
- How to detect cache hits: If a response comes from the cache, `response.usage_metadata` will include the `cache_tokens_details` field and `cached_content_token_count`. You can use these to determine cache usage.
Example fields for a cache hit are shown in the sketch below.
Core conclusion: implicit caching is automatic and provides clear cache hit feedback. Developers can check `usage_metadata` for cache status. Cost savings are not guaranteed; the actual benefit depends on request structure and cache hit rates.
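A sketch of inspecting the cache fields on a response. The client setup and the long repeated prefix are placeholders; the field names are the ones quoted above.

```python
from google import genai

client = genai.Client(api_key="YOUR_AIHUBMIX_KEY",
                      http_options={"base_url": "https://api.aihubmix.com/gemini"})

long_context = "Background document text. " * 500   # stable, repeatable prefix goes first
question = "Summarize the key points."               # dynamic part goes last

response = client.models.generate_content(
    model="gemini-2.5-flash-preview-04-17",
    contents=long_context + question,
)

meta = response.usage_metadata
if getattr(meta, "cached_content_token_count", None):
    print("Cache hit, cached tokens:", meta.cached_content_token_count)
    print("Details:", meta.cache_tokens_details)
else:
    print("Cache miss, billed input tokens:", meta.prompt_token_count)
```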
Function calling
By using the openai compatible way to call Gemini’s function calling, you need to pass intool_choice="auto"
in the request body, otherwise it will report an error.
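A sketch of OpenAI-compatible function calling against the assumed `https://aihubmix.com/v1` endpoint. The `get_weather` tool is a made-up example; the important part is that `tool_choice="auto"` is included.

```python
from openai import OpenAI

client = OpenAI(api_key="YOUR_AIHUBMIX_KEY", base_url="https://aihubmix.com/v1")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool for illustration
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

completion = client.chat.completions.create(
    model="gemini-2.5-flash-preview-04-17",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
    tool_choice="auto",  # required here; omitting it causes an error
)

print(completion.choices[0].message.tool_calls)
```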
Token Usage Tracking Made Simple
- Gemini tracks token usage via `usage_metadata`. Here's what each field means:
  - `prompt_token_count`: number of input tokens
  - `candidates_token_count`: number of output tokens
  - `thoughts_token_count`: tokens used during reasoning (also counted as output)
  - `total_token_count`: total tokens used (input + output)
- For APIs using the OpenAI-compatible format, token usage is tracked under `.usage` with the following fields:
  - `usage.prompt_tokens`: number of input tokens
  - `usage.completion_tokens`: number of output tokens (including reasoning)
  - `usage.total_tokens`: total token usage
Here’s how to use it in code:
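A sketch printing both usage shapes, assuming the placeholder AiHubMix endpoints used earlier; the field names match the lists above.

```python
from google import genai
from openai import OpenAI

prompt = "Summarize the plot of Hamlet in two sentences."

# Native call: usage lives in response.usage_metadata.
gclient = genai.Client(api_key="YOUR_AIHUBMIX_KEY",
                       http_options={"base_url": "https://api.aihubmix.com/gemini"})
resp = gclient.models.generate_content(
    model="gemini-2.5-flash-preview-04-17", contents=prompt)
meta = resp.usage_metadata
print("input tokens:   ", meta.prompt_token_count)
print("output tokens:  ", meta.candidates_token_count)
print("thinking tokens:", meta.thoughts_token_count)
print("total tokens:   ", meta.total_token_count)

# OpenAI-compatible call: usage lives in completion.usage.
oclient = OpenAI(api_key="YOUR_AIHUBMIX_KEY", base_url="https://aihubmix.com/v1")
completion = oclient.chat.completions.create(
    model="gemini-2.5-flash-preview-04-17",
    messages=[{"role": "user", "content": prompt}],
)
u = completion.usage
print("input tokens: ", u.prompt_tokens)
print("output tokens:", u.completion_tokens)  # includes reasoning tokens
print("total tokens: ", u.total_tokens)
```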