Forwarding for Gemini Models

For the Gemini series, we provide two invocation methods: native API calls and OpenAI-compatible calls.
Before you start, install the native dependency with pip install google-genai, or update an existing installation with pip install -U google-genai.

1️⃣ For native forwarding, you mainly need to pass your AiHubMix API key and the request URL when constructing the client.
⚡️ Note: the URL format differs from the conventional base_url usage. Refer to the example below:

from google import genai

client = genai.Client(
    api_key="sk-***",  # Replace with the key you generated from AiHubMix
    http_options={"base_url": "https://aihubmix.com/gemini"},
)

2️⃣ For OpenAI-compatible formats, retain the universal v1 endpoint.

from openai import OpenAI

client = OpenAI(
    api_key="sk-***", # Replace with the key you generated from AiHubMix
    base_url="https://aihubmix.com/v1",
)

3️⃣ For the 2.5 series, if you need to display the reasoning process, there are two ways to do it:

  1. Native invocation: Pass include_thoughts=True inside thinking_config
  2. OpenAI-compatible method: Pass reasoning_effort

You can refer to the code examples below for detailed usage.
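
For quick reference, here is a minimal sketch of both approaches (the prompt is illustrative; adjust the model id to the 2.5 model you actually use):

# Native invocation: request thought summaries via thinking_config
from google import genai
from google.genai import types

client = genai.Client(
    api_key="sk-***",  # Replace with your AiHubMix key
    http_options={"base_url": "https://aihubmix.com/gemini"},
)

response = client.models.generate_content(
    model="gemini-2.5-flash-preview-04-17",
    contents="How does photosynthesis work?",
    config=types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(include_thoughts=True),
    ),
)
for part in response.candidates[0].content.parts:
    # Parts flagged as thoughts carry the reasoning summary
    prefix = "[Thought] " if part.thought else ""
    print(prefix + (part.text or ""))

# OpenAI-compatible invocation: request reasoning via reasoning_effort
from openai import OpenAI

oai_client = OpenAI(
    api_key="sk-***",  # Replace with your AiHubMix key
    base_url="https://aihubmix.com/v1",
)

completion = oai_client.chat.completions.create(
    model="gemini-2.5-flash-preview-04-17",
    reasoning_effort="low",  # accepted values: "low", "medium", "high"
    messages=[{"role": "user", "content": "How does photosynthesis work?"}],
)
print(completion.choices[0].message.content)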

About Gemini 2.5 Reasoning Models

  1. The entire 2.5 series consists of reasoning models.
  2. 2.5 Flash is a hybrid model, similar to Claude 3.7 Sonnet. You can fine-tune its reasoning behavior by adjusting the thinking_budget parameter.
  3. 2.5 Pro is a pure reasoning model: thinking cannot be disabled, and thinking_budget should not be explicitly set.

Python usage examples:

from google import genai
from google.genai import types

def generate():
    client = genai.Client(
        api_key="sk-***", # 🔑 Replace it by your AiHubMix Key
        http_options={"base_url": "https://aihubmix.com/gemini"},
    )

    model = "gemini-2.0-flash"
    contents = [
        types.Content(
            role="user",
            parts=[
                types.Part.from_text(text="""How do I know I'm not wasting time?"""),
            ],
        ),
    ]

    print(client.models.generate_content(
        model=model,
        contents=contents,
    ))

if __name__ == "__main__":
    generate()

Gemini 2.5 Flash: Quick Task Support

Example for OpenAI-compatible invocation:

from openai import OpenAI

client = OpenAI(
    api_key="AIHUBMIX_API_KEY", # Replace with the key you generated in AiHubMix
    base_url="https://aihubmix.com/v1",
)

completion = client.chat.completions.create(
    model="gemini-2.5-flash-preview-04-17-nothink",
    messages=[
        {
            "role": "user",
            "content": "Explain the Occam's Razor concept and provide everyday examples of it"
        }
    ]
)

print(completion.choices[0].message.content)

  1. For complex tasks, simply set the model id to the default gemini-2.5-flash-preview-04-17 to enable thinking.
  2. Gemini 2.5 Flash uses the budget parameter to control the depth of thinking, ranging from 0 to 16K. The default budget is 1024, and 16K gives the best marginal returns; see the sketch below.
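
As a reference, here is a minimal sketch of setting thinking_budget through the native SDK (the budget value and prompt are illustrative):

from google import genai
from google.genai import types

client = genai.Client(
    api_key="sk-***",  # Replace with your AiHubMix key
    http_options={"base_url": "https://aihubmix.com/gemini"},
)

response = client.models.generate_content(
    model="gemini-2.5-flash-preview-04-17",
    contents="Plan a three-day itinerary for Kyoto on a modest budget.",
    config=types.GenerateContentConfig(
        # 0 disables thinking; larger values allow deeper reasoning, up to 16K
        thinking_config=types.ThinkingConfig(thinking_budget=1024),
    ),
)
print(response.text)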

Media Understanding

AiHubMix currently supports uploading multimedia files (images, audio, and video) of up to 20 MB via inline_data. Files larger than 20 MB require the File API, which is not yet available on the platform; progress tracking and upload_url retrieval are under development.

from google import genai
from google.genai import types

file_path = "yourpath/file.jpeg"
with open(file_path, "rb") as f:
    file_bytes = f.read()

client = genai.Client(
    api_key="sk-***", # Replace with the key you generated in AiHubMix
    http_options={"base_url": "https://aihubmix.com/gemini"}
)

response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=types.Content(
        parts=[
            types.Part(
                inline_data=types.Blob(
                    data=file_bytes,
                    mime_type="image/jpeg"
                )
            ),
            types.Part(
                text="Describe the image."
            )
        ]
    )
)

print(response.text)

Code Execution

The code execution feature enables the model to generate and run Python code and learn iteratively from the results until it arrives at a final output. You can use this code execution capability to build applications that benefit from code-based reasoning and that produce text output. For example, you could use code execution in an application that solves equations or processes text.

Python example:
from google import genai
from google.genai import types

file_path = "yourpath/file.csv"
with open(file_path, "rb") as f:
    file_bytes = f.read()

client = genai.Client(
    api_key="sk-***", # Replace with the key you generated in AiHubMix
    http_options={"base_url": "https://aihubmix.com/gemini"}
)

response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=types.Content(
        parts=[
            types.Part(
                inline_data=types.Blob(
                    data=file_bytes,
                    mime_type="text/csv"
                )
            ),
            types.Part(
                text="Please analyze this CSV and summarize the key statistics. Use code execution if needed."
            )
        ]
    ),
    config=types.GenerateContentConfig(
        tools=[types.Tool(
            code_execution=types.ToolCodeExecution()
        )]
    )
)

for part in response.candidates[0].content.parts:
    if part.text is not None:
        print(part.text)
    if getattr(part, "executable_code", None) is not None:
        print("Generated code:\n", part.executable_code.code)
    if getattr(part, "code_execution_result", None) is not None:
        print("Execution result:\n", part.code_execution_result.output)

Context Caching

Gemini’s native API enables implicit context caching by default, with no setup required. For every generate_content request, the system automatically caches the input content. If a subsequent request uses the exact same content, model, and parameters, the cached input is reused, which speeds up the response and can reduce input token costs.

  • Caching is automatic—no manual configuration needed.
  • The cache is only hit when the content, model, and all parameters are exactly the same; any difference will result in a cache miss.
  • The cache time-to-live (TTL) can be set by the developer, or left unset (defaults to 1 hour). There is no minimum or maximum TTL enforced by Google. Costs depend on the number of cached tokens and the cache duration; see the sketch after this list for setting a TTL explicitly.
    • While Google places no restriction on TTL, as a forwarding platform we only support a limited TTL range. For requirements beyond our platform’s limits, please contact us.
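
The developer-set TTL above applies to explicit caching. Assuming the forwarding endpoint also passes through the native caches API (worth verifying for your account), a minimal sketch looks like this:

from google import genai
from google.genai import types

client = genai.Client(
    api_key="sk-***",  # Replace with your AiHubMix key
    http_options={"base_url": "https://aihubmix.com/gemini"},
)

# Create an explicit cache with a developer-set TTL (here: one hour).
# The cached contents must be long enough to meet the model's minimum cached token count.
cache = client.caches.create(
    model="gemini-2.5-flash-preview-05-20",
    config=types.CreateCachedContentConfig(
        system_instruction="You are a literary analyst.",
        contents=["<the long, reusable context goes here>"],
        ttl="3600s",
    ),
)

# Later requests reference the cache instead of resending the full context
response = client.models.generate_content(
    model="gemini-2.5-flash-preview-05-20",
    contents="Analyze the major themes.",
    config=types.GenerateContentConfig(cached_content=cache.name),
)
print(response.usage_metadata)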

Notes

  • No guaranteed cost savings: Cache tokens are billed at 25% of the standard input price—so theoretically, caching can save you up to 75% of input token costs. However, Google’s official docs make no guarantee of cost savings; the real-world effect depends on your cache hit rate, token types, and storage duration.

  • Cache hit conditions: To maximize cache effectiveness, place repeatable context at the start of your input and dynamic content (like user input) at the end.

  • How to detect cache hits: If a response comes from the cache, response.usage_metadata will include the cache_tokens_details field and cached_content_token_count. You can use these to determine cache usage.
    Example fields when a cache hit occurs:

    cache_tokens_details=[ModalityTokenCount(modality=<MediaModality.TEXT: 'TEXT'>, token_count=2003)]
    cached_content_token_count=2003

Code example:

from google import genai

client = genai.Client(
    http_options={"base_url": "https://aihubmix.com/gemini"},
    api_key="sk-***",  # Replace with your AiHubMix API key
)

prompt = """
        <the entire contents of 'Pride and Prejudice'>
"""

def generate_content_sync():
    response = client.models.generate_content(
        model="gemini-2.5-flash-preview-05-20",
        contents=prompt + "Analyze the major themes in 'Pride and Prejudice'.",
    )
    print(response.usage_metadata)  # When cache is hit, cache_tokens_details and cached_content_token_count will appear
    return response

generate_content_sync()

When a cache hit occurs, response.usage_metadata will contain:

cache_tokens_details=[ModalityTokenCount(modality=<MediaModality.TEXT: 'TEXT'>, token_count=2003)]
cached_content_token_count=2003

Core conclusion: Implicit caching is automatic and provides clear cache hit feedback. Developers can check usage_metadata for cache status. Cost savings are not guaranteed—actual benefit depends on request structure and cache hit rates.

Function Calling

When calling Gemini’s function calling through the OpenAI-compatible endpoint, you need to pass tool_choice="auto" in the request body; otherwise the request will return an error.

from openai import OpenAI

# Define the function declaration for the model
schedule_meeting_function = {
    "name": "schedule_meeting",
    "description": "Schedules a meeting with specified attendees at a given time and date.",
    "parameters": {
        "type": "object",
        "properties": {
            "attendees": {
                "type": "array",
                "items": {"type": "string"},
                "description": "List of people attending the meeting.",
            },
            "date": {
                "type": "string",
                "description": "Date of the meeting (e.g., '2024-07-29')",
            },
            "time": {
                "type": "string",
                "description": "Time of the meeting (e.g., '15:00')",
            },
            "topic": {
                "type": "string",
                "description": "The subject or topic of the meeting.",
            },
        },
        "required": ["attendees", "date", "time", "topic"],
    },
}

# Configure the client
client = OpenAI(
    api_key="AIHUBMIX_API_KEY", # Replace with the key you generated in AiHubMix
    base_url="https://aihubmix.com/v1",
)

# Send request with function declarations using OpenAI compatible format
response = client.chat.completions.create(
    model="gemini-2.0-flash",
    messages=[
        {"role": "user", "content": "Schedule a meeting with Bob and Alice for 03/14/2025 at 10:00 AM about the Q3 planning."}
    ],
    tools=[{"type": "function", "function": schedule_meeting_function}],
    tool_choice="auto" ## 📍 Added Aihubmix compatibility, more stable request method
)

# Check for a function call
if response.choices[0].message.tool_calls:
    tool_call = response.choices[0].message.tool_calls[0]
    function_call = tool_call.function
    print(f"Function to call: {function_call.name}")
    print(f"Arguments: {function_call.arguments}")
    print(response.usage)
    #  In a real app, you would call your function here:
    #  result = schedule_meeting(**json.loads(function_call.arguments))
else:
    print("No function call found in the response.")
    print(response.choices[0].message.content)

Output Example:

Function to call: schedule_meeting
Arguments: {"attendees":["Bob","Alice"],"date":"2025-03-14","time":"10:00","topic":"Q3 planning"}
CompletionUsage(completion_tokens=28, prompt_tokens=111, total_tokens=139, completion_tokens_details=None, prompt_tokens_details=None)
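
To complete the loop, here is a hedged sketch of sending the tool result back for a final answer. It continues the request above; the endpoint is assumed to support the standard OpenAI tool-role follow-up, and the meeting result is made up:

import json

# Suppose we ran schedule_meeting(...) locally and obtained this result
tool_result = {"status": "confirmed", "meeting_id": "hypothetical-123"}

follow_up = client.chat.completions.create(
    model="gemini-2.0-flash",
    messages=[
        {"role": "user", "content": "Schedule a meeting with Bob and Alice for 03/14/2025 at 10:00 AM about the Q3 planning."},
        response.choices[0].message,  # the assistant turn that contains the tool call
        {
            "role": "tool",
            "tool_call_id": tool_call.id,
            "content": json.dumps(tool_result),
        },
    ],
    tools=[{"type": "function", "function": schedule_meeting_function}],
    tool_choice="auto",
)
print(follow_up.choices[0].message.content)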

Token Usage Tracking Made Simple

  1. Gemini tracks token usage using usage_metadata. Here’s what each field means:

    • prompt_token_count: number of input tokens
    • candidates_token_count: number of output tokens
    • thoughts_token_count: tokens used during reasoning (also counted as output)
    • total_token_count: total tokens used (input + output)

    For more details, check out their official docs.

  2. For APIs using the OpenAI-compatible format, token usage is tracked under .usage with the following fields (see the sketch below):

    • usage.prompt_tokens: number of input tokens
    • usage.completion_tokens: number of output tokens (including reasoning)
    • usage.total_tokens: total token usage
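
A minimal sketch of reading these fields through the OpenAI-compatible endpoint (the prompt is illustrative):

from openai import OpenAI

client = OpenAI(
    api_key="sk-***",  # Replace with your AiHubMix key
    base_url="https://aihubmix.com/v1",
)

completion = client.chat.completions.create(
    model="gemini-2.5-flash-preview-04-17",
    messages=[{"role": "user", "content": "Summarize the Rule of 72 in one sentence."}],
)

print(completion.choices[0].message.content)
usage = completion.usage
print(f"input: {usage.prompt_tokens}, output: {usage.completion_tokens}, total: {usage.total_tokens}")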

Here’s how to track usage_metadata with the native SDK:

from google import genai
from google.genai import types

def generate():
    client = genai.Client(
        api_key="sk-***", # Replace this with your key from AiHubMix
        http_options={"base_url": "https://aihubmix.com/gemini"},
    )

    model = "gemini-2.5-pro-preview-03-25"
    contents = [
        types.Content(
            role="user",
            parts=[
                types.Part.from_text(text="""How is the "Rule of 72" derived in the financial world?"""),
            ],
        ),
    ]
    generate_content_config = types.GenerateContentConfig(
        response_mime_type="text/plain",
    )

    final_usage_metadata = None
    
    for chunk in client.models.generate_content_stream(
        model=model,
        contents=contents,
        config=generate_content_config,
    ):
        print(chunk.text, end="")
        if chunk.usage_metadata:
            final_usage_metadata = chunk.usage_metadata
    
    # Once all chunks are processed, print the full token usage
    if final_usage_metadata:
        print(f"\nUsage: {final_usage_metadata}")

if __name__ == "__main__":
    generate()