LiteLLM Overview
LiteLLM is an open-source unified AI gateway developed by BerriAI. It provides a single standardized interface to call almost every major LLM on the market. Repository: https://github.com/BerriAI/litellm
Every LLM provider ships its own SDK and API format — OpenAI, Anthropic, and Google all differ. Switching models or using multiple models at once means maintaining separate codebases. LiteLLM solves this: write once, change one parameter, call any model.
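For example, here is a minimal sketch of that one-parameter switch, using the AiHubMix endpoint and model names that appear later in this guide:
import os
from litellm import completion

# Same call, two different providers: only the model string changes
for model in ["openai/gpt-4o-mini", "openai/claude-sonnet-4-6"]:
    response = completion(
        model=model,
        api_base="https://aihubmix.com/v1",
        api_key=os.environ.get("AIHUBMIX_API_KEY"),
        messages=[{"role": "user", "content": "Hello"}]
    )
    print(f"{model}: {response.choices[0].message.content[:60]}")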
Two Usage Modes
| Mode | Description | Best For |
|---|---|---|
| Python SDK | pip install litellm, call directly in code | Personal projects, rapid prototyping |
| Proxy Server | Standalone deployable AI gateway | Team sharing, enterprise access control |
Core Capabilities
- Unified OpenAI format: supports 100+ providers including OpenAI, Anthropic, Gemini, Bedrock, Azure, and more
- Virtual key management: centrally manage team API keys without exposing the originals
- Cost tracking: monitor token usage and spend per user or project
- Load balancing: automatic traffic distribution across models with failover support
- High performance: P95 latency of ~8ms at 1,000 RPS
Installation
Requirements
Python 3.8+
macOS
Install via Homebrew:
brew install python
Verify:
python3 --version
Windows
Download the installer from python.org/downloads. During installation, check “Add Python to PATH”.
Verify:
python --version
Linux (Ubuntu/Debian)
sudo apt update
sudo apt install python3 python3-pip
pip
pip is usually bundled with Python. Verify it is available:
pip --version
# or
pip3 --version
If not found, install manually:
# Universal method
python3 -m ensurepip --upgrade
# Ubuntu/Debian
sudo apt install python3-pip
# Upgrade to latest
pip install --upgrade pip
Install LiteLLM
Once your environment is ready:
python3 -m pip install litellm
Verify the installation:
python3 -m pip show litellm
Optional Dependencies
Some providers require additional packages:
# AWS Bedrock
pip install 'litellm[bedrock]'
# Google Vertex AI
pip install 'litellm[vertex]'
# All dependencies (not recommended for production)
pip install 'litellm[all]'
Install Proxy Server
To deploy a standalone gateway:
pip install 'litellm[proxy]'
Docker (Optional)
docker pull ghcr.io/berriai/litellm:main-latest
Recommendation: use pip install litellm for personal development; choose Proxy + Docker for team deployments.
Get Your AiHubMix API Key
Go to the aihubmix.com dashboard and create an API key.
Set the Environment Variable
export AIHUBMIX_API_KEY="your-aihubmix-key"
First Call
import os
from litellm import completion
response = completion(
    model="openai/gpt-4o-mini",
    api_base="https://aihubmix.com/v1",
    api_key=os.environ.get("AIHUBMIX_API_KEY"),
    messages=[{"role": "user", "content": "Hello, introduce yourself"}]
)
print(response.choices[0].message.content)
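Note: the openai/ prefix tells LiteLLM to use the OpenAI-compatible protocol against the given api_base, which is why it appears even for non-OpenAI models served through AiHubMix.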
Basic Usage
1. Switching Models
AiHubMix supports all major models. Switching only requires changing the model parameter:
import os
from litellm import completion
response = completion(
    model="openai/claude-sonnet-4-6",  # change this
    api_base="https://aihubmix.com/v1",
    api_key=os.environ.get("AIHUBMIX_API_KEY"),
    messages=[{"role": "user", "content": "Hello, introduce yourself"}]
)
print(response.choices[0].message.content)
2. Streaming
Add stream=True to receive output token by token:
import os
from litellm import completion
response = completion(
    model="openai/claude-sonnet-4-6",
    api_base="https://aihubmix.com/v1",
    api_key=os.environ.get("AIHUBMIX_API_KEY"),
    messages=[{"role": "user", "content": "Explain Python in 100 words"}],
    stream=True
)
for chunk in response:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
print()
3. Multi-Turn Conversation
Pass the conversation history in the messages list so the model remembers context:
import os
from litellm import completion
messages = [
    {"role": "user", "content": "My name is Alex"},
    {"role": "assistant", "content": "Hello, Alex!"},
    {"role": "user", "content": "What is my name?"}
]
response = completion(
    model="openai/claude-sonnet-4-6",
    api_base="https://aihubmix.com/v1",
    api_key=os.environ.get("AIHUBMIX_API_KEY"),
    messages=messages
)
print(response.choices[0].message.content)
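To keep the conversation going, append each reply before the next user turn. Continuing the snippet above (the follow-up question is just an illustration):
# Carry the context forward: append the assistant reply, then the next user turn
messages.append({"role": "assistant", "content": response.choices[0].message.content})
messages.append({"role": "user", "content": "How many letters does it have?"})
follow_up = completion(
    model="openai/claude-sonnet-4-6",
    api_base="https://aihubmix.com/v1",
    api_key=os.environ.get("AIHUBMIX_API_KEY"),
    messages=messages
)
print(follow_up.choices[0].message.content)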
4. Async Calls
Send multiple requests concurrently without waiting for each to finish:
import os
import asyncio
from litellm import acompletion
async def ask(question):
    response = await acompletion(
        model="openai/claude-sonnet-4-6",
        api_base="https://aihubmix.com/v1",
        api_key=os.environ.get("AIHUBMIX_API_KEY"),
        messages=[{"role": "user", "content": question}]
    )
    return response.choices[0].message.content

async def main():
    questions = [
        "What color is an apple?",
        "What color is the sky?",
        "What color is grass?"
    ]
    results = await asyncio.gather(*[ask(q) for q in questions])
    for q, r in zip(questions, results):
        print(f"Q: {q}")
        print(f"A: {r}")
        print()
asyncio.run(main())
5. Timeout and Retry
Guard against hung requests and transient network failures with a timeout and automatic retries:
import os
from litellm import completion
response = completion(
    model="openai/claude-sonnet-4-6",
    api_base="https://aihubmix.com/v1",
    api_key=os.environ.get("AIHUBMIX_API_KEY"),
    messages=[{"role": "user", "content": "Hello"}],
    timeout=10,     # raise an error after 10 seconds
    num_retries=3   # retry up to 3 times on failure
)
print(response.choices[0].message.content)
timeout is in seconds. Keep num_retries at 2-3; higher values make failing requests take longer to surface.
6. Token Usage and Cost Tracking
Every response includes token usage data:
import os
from litellm import completion
response = completion(
    model="openai/claude-sonnet-4-6",
    api_base="https://aihubmix.com/v1",
    api_key=os.environ.get("AIHUBMIX_API_KEY"),
    messages=[{"role": "user", "content": "Explain Python in 100 words"}]
)
print(response.choices[0].message.content)
print()
print("Token usage:")
print(f" Input: {response.usage.prompt_tokens}")
print(f" Output: {response.usage.completion_tokens}")
print(f" Total: {response.usage.total_tokens}")
Track cost per call:
import os
from litellm import completion, completion_cost
response = completion(
    model="openai/claude-sonnet-4-6",
    api_base="https://aihubmix.com/v1",
    api_key=os.environ.get("AIHUBMIX_API_KEY"),
    messages=[{"role": "user", "content": "Explain Python in 100 words"}]
)
cost = completion_cost(completion_response=response)
print(f"Cost: ${cost:.6f}")
7. Load Balancing and Failover
Configure multiple models to automatically distribute traffic or switch to a backup when one fails:
import os
from litellm import Router
router = Router(
    model_list=[
        {
            "model_name": "my-model",
            "litellm_params": {
                "model": "openai/claude-sonnet-4-6",
                "api_base": "https://aihubmix.com/v1",
                "api_key": os.environ.get("AIHUBMIX_API_KEY"),
            }
        },
        {
            "model_name": "my-model",
            "litellm_params": {
                "model": "openai/gpt-4o",
                "api_base": "https://aihubmix.com/v1",
                "api_key": os.environ.get("AIHUBMIX_API_KEY"),
            }
        }
    ]
)
response = router.completion(
    model="my-model",
    messages=[{"role": "user", "content": "Hello"}]
)
print(response.choices[0].message.content)
Both models share the same model_name. LiteLLM round-robins between them and automatically fails over if one returns an error.
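The default is a simple shuffle across matching deployments; the Router also accepts a routing_strategy argument for alternatives such as latency- or usage-based routing (check the LiteLLM Router docs for the strategies your version supports).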
8. Deploy Proxy Server
The Proxy Server is a standalone gateway. Team members route all requests through it without needing their own API keys.
Install
python3 -m pip install 'litellm[proxy]'
Create config.yaml
model_list:
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_base: https://aihubmix.com/v1
      api_key: os.environ/AIHUBMIX_API_KEY
  - model_name: claude-sonnet
    litellm_params:
      model: openai/claude-sonnet-4-6
      api_base: https://aihubmix.com/v1
      api_key: os.environ/AIHUBMIX_API_KEY
  - model_name: gemini-flash
    litellm_params:
      model: openai/gemini-2.0-flash
      api_base: https://aihubmix.com/v1
      api_key: os.environ/AIHUBMIX_API_KEY
Start the server
litellm --config config.yaml --port 4000
A successful start shows:
LiteLLM: Proxy running on http://0.0.0.0:4000
Call the local server
import os
from litellm import completion
response = completion(
    model="gpt-4o",
    api_base="http://localhost:4000",
    api_key="any-string",
    messages=[{"role": "user", "content": "Hello"}]
)
print(response.choices[0].message.content)
The api_key here can be any string. The real AiHubMix key is managed by the Proxy.
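Because the Proxy speaks the OpenAI API format, any OpenAI-compatible client also works. A minimal sketch with the official openai package (assumes pip install openai):
from openai import OpenAI

client = OpenAI(base_url="http://localhost:4000", api_key="any-string")
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello"}]
)
print(response.choices[0].message.content)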
9. Virtual Key Management
Virtual keys let you assign independent keys to different team members or projects, controlling access and usage without exposing the real AiHubMix key.
Prerequisites: virtual keys are stored in a database. Start a PostgreSQL instance:
docker run -d \
  --name litellm-db \
  -e POSTGRES_USER=litellm \
  -e POSTGRES_PASSWORD=litellm \
  -e POSTGRES_DB=litellm \
  -p 5432:5432 \
  postgres
Update config.yaml
model_list:
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_base: https://aihubmix.com/v1
      api_key: os.environ/AIHUBMIX_API_KEY
  - model_name: claude-sonnet
    litellm_params:
      model: openai/claude-sonnet-4-6
      api_base: https://aihubmix.com/v1
      api_key: os.environ/AIHUBMIX_API_KEY

general_settings:
  master_key: sk-my-master-key
  database_url: postgresql://litellm:litellm@localhost:5432/litellm
Restart the server
litellm --config config.yaml --port 4000
Create a virtual key
curl -X POST http://localhost:4000/key/generate \
  -H "Authorization: Bearer sk-my-master-key" \
  -H "Content-Type: application/json" \
  -d '{
    "key_alias": "team-a",
    "max_budget": 10,
    "models": ["gpt-4o", "claude-sonnet"]
  }'
The key field in the response is the virtual key, e.g. sk-xxxxxx.
Use the virtual key
from litellm import completion
response = completion(
    model="claude-sonnet",
    api_base="http://localhost:4000",
    api_key="sk-xxxxxx",
    messages=[{"role": "user", "content": "Hello"}]
)
print(response.choices[0].message.content)
Check usage
curl "http://localhost:4000/key/info?key=sk-xxxxxx" \
  -H "Authorization: Bearer sk-my-master-key"
Each virtual key supports individual model restrictions, budget limits, and expiry times — ideal for multi-member team workflows.
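For example, to issue a key that expires on its own (a sketch using the duration field of the key-generation API; verify the field against your LiteLLM version):
curl -X POST http://localhost:4000/key/generate \
  -H "Authorization: Bearer sk-my-master-key" \
  -H "Content-Type: application/json" \
  -d '{
    "key_alias": "contractor",
    "duration": "30d",
    "max_budget": 5,
    "models": ["gpt-4o"]
  }'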
Practical Example: Multi-Model Comparison
Send the same question to multiple models simultaneously and compare output quality, speed, and token usage.
Set API Key
export AIHUBMIX_API_KEY="your-key"
Run the comparison
import os
import time
import asyncio
from litellm import acompletion
MODELS = [
    "gpt-5.5",
    "claude-opus-4-7",
    "deepseek-v4-flash",
    "coding-glm-5.1-free",
]
QUESTION = "If you could give a programmer only one piece of advice, what would it be?"
async def ask_model(model, question):
    start = time.time()
    try:
        response = await acompletion(
            model=f"openai/{model}",
            api_base="https://aihubmix.com/v1",
            api_key=os.environ.get("AIHUBMIX_API_KEY"),
            messages=[{"role": "user", "content": question}]
        )
        return {
            "model": model,
            "answer": response.choices[0].message.content.strip(),
            "tokens": response.usage.total_tokens,
            "time": round(time.time() - start, 2),
            "error": None
        }
    except Exception as e:
        return {
            "model": model,
            "answer": None,
            "tokens": 0,
            "time": round(time.time() - start, 2),
            "error": str(e)
        }

async def main():
    print(f"Question: {QUESTION}")
    print("=" * 60)
    tasks = [ask_model(m, QUESTION) for m in MODELS]
    results = await asyncio.gather(*tasks)
    for r in results:
        print(f"\nModel: {r['model']}")
        print(f"Time: {r['time']}s | Tokens: {r['tokens']}")
        print("-" * 40)
        if r["error"]:
            print(f"Error: {r['error']}")
        else:
            print(r["answer"])
    print("\n" + "=" * 60)
    print(f"{'Model':<30} {'Time':>8} {'Tokens':>8}")
    print("-" * 50)
    for r in sorted(results, key=lambda x: x["time"]):
        status = f"{r['time']}s" if not r["error"] else "failed"
        print(f"{r['model']:<30} {status:>8} {r['tokens']:>8}")
asyncio.run(main())
Last updated: April 29, 2026