For the Gemini series, we provide two invocation methods: native API calls and OpenAI-compatible calls.
Before you start, install or update the native dependency: `pip install google-genai` (or `pip install -U google-genai` to upgrade).
1️⃣ For native integration, Gemini automatically routes traffic between AI Studio and Vertex AI; just supply your AiHubMix API key and the dedicated request URL. Note that this URL differs from the usual base_url, so follow the example below to ensure proper setup.
```python
from google import genai

client = genai.Client(
    api_key="sk-***",  # Replace with the key you generated from AiHubMix
    http_options={"base_url": "https://aihubmix.com/gemini"},
)
```
2️⃣ For the OpenAI-compatible format, keep using the universal v1 endpoint:
```python
from openai import OpenAI

client = OpenAI(
    api_key="sk-***",  # Replace with the key you generated from AiHubMix
    base_url="https://aihubmix.com/v1",
)
```
3️⃣ For the 2.5 series, if you need to display the reasoning process, there are two ways to do it:
- Native invocation: pass `include_thoughts=True`
- OpenAI-compatible method: pass `reasoning_effort`
You can refer to the code examples below for detailed usage.
- The entire 2.5 series consists of reasoning models.
- 2.5 Flash is a hybrid model, similar to Claude 3.7 Sonnet; you can fine-tune its reasoning behavior through the `thinking_budget` parameter.
- 2.5 Pro is a pure reasoning model: thinking cannot be disabled, and `thinking_budget` should not be explicitly set.
Python usage examples:
Native API example:

```python
from google import genai
from google.genai import types


def generate():
    client = genai.Client(
        api_key="sk-***",  # 🔑 Replace with your AiHubMix key
        http_options={"base_url": "https://aihubmix.com/gemini"},
    )

    model = "gemini-2.0-flash"
    contents = [
        types.Content(
            role="user",
            parts=[
                types.Part.from_text(text="""How do I know I'm not wasting time?"""),
            ],
        ),
    ]

    print(client.models.generate_content(
        model=model,
        contents=contents,
    ))


if __name__ == "__main__":
    generate()
```
OpenAI-compatible example (the `-nothink` model variant disables thinking):

```python
from openai import OpenAI

client = OpenAI(
    api_key="AIHUBMIX_API_KEY",  # Replace with the key you generated in AiHubMix
    base_url="https://aihubmix.com/v1",
)

completion = client.chat.completions.create(
    model="gemini-2.5-flash-preview-04-17-nothink",
    messages=[
        {
            "role": "user",
            "content": "Explain the Occam's Razor concept and provide everyday examples of it"
        }
    ]
)

print(completion.choices[0].message.content)
```
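To surface the reasoning process via the OpenAI-compatible method, pass `reasoning_effort` as noted in 3️⃣ above. A minimal sketch; the value `"low"` is an illustrative assumption, mirroring the values used by OpenAI's reasoning models:

```python
from openai import OpenAI

client = OpenAI(
    api_key="AIHUBMIX_API_KEY",  # Replace with the key you generated in AiHubMix
    base_url="https://aihubmix.com/v1",
)

completion = client.chat.completions.create(
    model="gemini-2.5-flash-preview-04-17",  # thinking-enabled variant
    reasoning_effort="low",  # assumed values: "low" / "medium" / "high"
    messages=[
        {"role": "user", "content": "Explain the Occam's Razor concept and provide everyday examples of it"}
    ],
)

print(completion.choices[0].message.content)
```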
For complex tasks, simply set the model ID to the default `gemini-2.5-flash-preview-04-17` to enable thinking.
Gemini 2.5 Flash uses the `thinking_budget` parameter to control the depth of thinking, ranging from 0 to 16K tokens. The default budget is 1,024, and 16K gives the best marginal effect.
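For native calls, budget control and thought display both live in `thinking_config`. A minimal sketch, assuming a budget of 8192 (an arbitrary illustrative value) together with `include_thoughts=True` from 3️⃣ above:

```python
from google import genai
from google.genai import types

client = genai.Client(
    api_key="sk-***",  # 🔑 Replace with your AiHubMix key
    http_options={"base_url": "https://aihubmix.com/gemini"},
)

response = client.models.generate_content(
    model="gemini-2.5-flash-preview-04-17",
    contents="How is the 'Rule of 72' derived in the financial world?",
    config=types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(
            thinking_budget=8192,    # 0-16K; an illustrative mid-range value
            include_thoughts=True,   # surface the reasoning process
        ),
    ),
)

# Thought summaries arrive as parts flagged with `thought`
for part in response.candidates[0].content.parts:
    if getattr(part, "thought", False):
        print("Thought summary:", part.text)
    else:
        print(part.text)
```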
AiHubMix currently supports uploading multimedia files (images, audio, and video) up to 20 MB via `inline_data`.
Files exceeding 20 MB will require the File API. This functionality is not yet available; progress tracking and `upload_url` retrieval are under development.
By adding the `media_resolution` parameter (for example, `MEDIA_RESOLUTION_MEDIUM`), you can adjust the image resolution, which significantly reduces input costs and minimizes the risk of errors with large images.
Supported media resolution values:
| Name | Description |
| --- | --- |
| `MEDIA_RESOLUTION_UNSPECIFIED` | Media resolution has not been set. |
| `MEDIA_RESOLUTION_LOW` | Media resolution set to low (64 tokens). |
| `MEDIA_RESOLUTION_MEDIUM` | Media resolution set to medium (256 tokens). |
| `MEDIA_RESOLUTION_HIGH` | Media resolution set to high (zoomed reframing with 256 tokens). |
```python
from google import genai
from google.genai import types

file_path = "yourpath/file.jpeg"
with open(file_path, "rb") as f:
    file_bytes = f.read()

client = genai.Client(
    api_key="sk-***",  # Replace with the key you generated in AiHubMix
    http_options={"base_url": "https://aihubmix.com/gemini"}
)

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=types.Content(
        parts=[
            types.Part(
                inline_data=types.Blob(
                    data=file_bytes,
                    mime_type="image/jpeg"
                )
            ),
            types.Part(
                text="Describe the image."
            )
        ]
    ),
    config=types.GenerateContentConfig(
        system_instruction="You are a helpful assistant that can describe images.",
        max_output_tokens=768,
        temperature=0.1,
        thinking_config=types.ThinkingConfig(
            thinking_budget=0,
            include_thoughts=False
        ),
        media_resolution=types.MediaResolution.MEDIA_RESOLUTION_MEDIUM  # 256 tokens
    )
)

print(response.text)
```
The code execution feature enables the model to generate and run Python code and learn iteratively from the results until it arrives at a final output. You can use this code execution capability to build applications that benefit from code-based reasoning and that produce text output. For example, you could use code execution in an application that solves equations or processes text.
Python example:
```python
from google import genai
from google.genai import types

file_path = "yourpath/file.csv"
with open(file_path, "rb") as f:
    file_bytes = f.read()

client = genai.Client(
    api_key="sk-***",  # Replace with the key you generated in AiHubMix
    http_options={"base_url": "https://aihubmix.com/gemini"}
)

response = client.models.generate_content(
    model="gemini-2.0-flash",
    contents=types.Content(
        parts=[
            types.Part(
                inline_data=types.Blob(
                    data=file_bytes,
                    mime_type="text/csv"
                )
            ),
            types.Part(
                text="Please analyze this CSV and summarize the key statistics. Use code execution if needed."
            )
        ]
    ),
    config=types.GenerateContentConfig(
        tools=[types.Tool(
            code_execution=types.ToolCodeExecution()  # instantiate; passing the class itself fails validation
        )]
    )
)

for part in response.candidates[0].content.parts:
    if part.text is not None:
        print(part.text)
    if getattr(part, "executable_code", None) is not None:
        print("Generated code:\n", part.executable_code.code)
    if getattr(part, "code_execution_result", None) is not None:
        print("Execution result:\n", part.code_execution_result.output)
```
Gemini’s native API enables implicit context caching by default—no setup required. For every generate_content request, the system automatically caches the input content. If a subsequent request uses the exact same content, model, and parameters, the system will instantly return the previous result, dramatically speeding up response time and potentially reducing input token costs.
- Caching is automatic; no manual configuration is needed.
- The cache is only hit when the content, model, and all parameters are exactly the same; any difference results in a cache miss.
- The cache time-to-live (TTL) can be set by the developer, or left unset (defaults to 1 hour). Google enforces no minimum or maximum TTL. Costs depend on the number of cached tokens and the cache duration.
- While Google places no restriction on TTL, as a forwarding platform we only support a limited TTL range. For requirements beyond our platform's limits, please contact us.
- No guaranteed cost savings: cached tokens are billed at 25% of the standard input price, so caching can theoretically save up to 75% of input token costs. However, Google's official docs make no guarantee of savings; the real-world effect depends on your cache hit rate, token types, and storage duration.
- Cache hit conditions: to maximize cache effectiveness, place repeatable context at the start of your input and dynamic content (like user input) at the end.
- How to detect cache hits: if a response comes from the cache, response.usage_metadata will include the cache_tokens_details field and cached_content_token_count. You can use these to determine cache usage.
Example of checking for a cache hit:
```python
from google import genai

client = genai.Client(
    http_options={"base_url": "https://aihubmix.com/gemini"},
    api_key="sk-***",  # Replace with your AiHubMix API key
)

prompt = """
<the entire contents of 'Pride and Prejudice'>
"""

def generate_content_sync():
    response = client.models.generate_content(
        model="gemini-2.5-flash-preview-05-20",
        contents=prompt + "Analyze the major themes in 'Pride and Prejudice'.",
    )
    # When the cache is hit, cache_tokens_details and cached_content_token_count will appear
    print(response.usage_metadata)
    return response

generate_content_sync()
```
When a cache hit occurs, response.usage_metadata will contain the cache_tokens_details field and a non-zero cached_content_token_count.
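A minimal sketch of reading those fields, assuming `response` is returned from `generate_content_sync()` above (the `getattr` check is defensive, since the fields are absent on a cache miss):

```python
response = generate_content_sync()
usage = response.usage_metadata

# On a cache hit, cached_content_token_count is populated
if getattr(usage, "cached_content_token_count", None):
    print(f"Cache hit: {usage.cached_content_token_count} tokens served from cache")
    print("Details:", usage.cache_tokens_details)
else:
    print("No cached tokens for this request")
```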
Core conclusion: Implicit caching is automatic and provides clear cache hit feedback. Developers can check usage_metadata for cache status. Cost savings are not guaranteed—actual benefit depends on request structure and cache hit rates.
When using the OpenAI-compatible method to call Gemini's function calling, you need to pass tool_choice="auto" in the request body; otherwise, the request will return an error.
```python
from openai import OpenAI

# Define the function declaration for the model
schedule_meeting_function = {
    "name": "schedule_meeting",
    "description": "Schedules a meeting with specified attendees at a given time and date.",
    "parameters": {
        "type": "object",
        "properties": {
            "attendees": {
                "type": "array",
                "items": {"type": "string"},
                "description": "List of people attending the meeting.",
            },
            "date": {
                "type": "string",
                "description": "Date of the meeting (e.g., '2024-07-29')",
            },
            "time": {
                "type": "string",
                "description": "Time of the meeting (e.g., '15:00')",
            },
            "topic": {
                "type": "string",
                "description": "The subject or topic of the meeting.",
            },
        },
        "required": ["attendees", "date", "time", "topic"],
    },
}

# Configure the client
client = OpenAI(
    api_key="AIHUBMIX_API_KEY",  # Replace with the key you generated in AiHubMix
    base_url="https://aihubmix.com/v1",
)

# Send the request with function declarations using the OpenAI-compatible format
response = client.chat.completions.create(
    model="gemini-2.0-flash",
    messages=[
        {"role": "user", "content": "Schedule a meeting with Bob and Alice for 03/14/2025 at 10:00 AM about the Q3 planning."}
    ],
    tools=[{"type": "function", "function": schedule_meeting_function}],
    tool_choice="auto",  # 📍 Required for AiHubMix compatibility; omitting it causes an error
)

# Check for a function call
if response.choices[0].message.tool_calls:
    tool_call = response.choices[0].message.tool_calls[0]
    function_call = tool_call.function
    print(f"Function to call: {function_call.name}")
    print(f"Arguments: {function_call.arguments}")
    print(response.usage)
    # In a real app, you would call your function here:
    # result = schedule_meeting(**json.loads(function_call.arguments))
else:
    print("No function call found in the response.")
    print(response.choices[0].message.content)
```
Output Example:
```
Function to call: schedule_meeting
Arguments: {"attendees":["Bob","Alice"],"date":"2025-03-14","time":"10:00","topic":"Q3 planning"}
CompletionUsage(completion_tokens=28, prompt_tokens=111, total_tokens=139, completion_tokens_details=None, prompt_tokens_details=None)
```
For APIs using the OpenAI-compatible format, token usage is tracked under `.usage` with the following fields:
- `usage.prompt_tokens`: number of input tokens
- `usage.completion_tokens`: number of output tokens (including reasoning)
- `usage.total_tokens`: total token usage
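A minimal sketch of reading these fields, reusing the OpenAI-compatible client setup from the examples above:

```python
from openai import OpenAI

client = OpenAI(
    api_key="AIHUBMIX_API_KEY",  # Replace with the key you generated in AiHubMix
    base_url="https://aihubmix.com/v1",
)

completion = client.chat.completions.create(
    model="gemini-2.5-flash-preview-04-17",
    messages=[{"role": "user", "content": "Explain the Rule of 72 in one paragraph."}],
)

print(completion.choices[0].message.content)
print("Input tokens:", completion.usage.prompt_tokens)
print("Output tokens (incl. reasoning):", completion.usage.completion_tokens)
print("Total tokens:", completion.usage.total_tokens)
```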
For the native API, token usage is reported via `usage_metadata`. Here's how to read it when streaming:
```python
from google import genai
from google.genai import types


def generate():
    client = genai.Client(
        api_key="sk-***",  # Replace this with your key from AiHubMix
        http_options={"base_url": "https://aihubmix.com/gemini"},
    )

    model = "gemini-2.5-pro-preview-03-25"
    contents = [
        types.Content(
            role="user",
            parts=[
                types.Part.from_text(text="""How is the "Rule of 72" derived in the financial world?"""),
            ],
        ),
    ]
    generate_content_config = types.GenerateContentConfig(
        response_mime_type="text/plain",
    )

    final_usage_metadata = None
    for chunk in client.models.generate_content_stream(
        model=model,
        contents=contents,
        config=generate_content_config,
    ):
        print(chunk.text, end="")
        if chunk.usage_metadata:
            final_usage_metadata = chunk.usage_metadata

    # Once all chunks are processed, print the full token usage
    if final_usage_metadata:
        print(f"\nUsage: {final_usage_metadata}")


if __name__ == "__main__":
    generate()
```