Skip to main content

Introduction

The Text-to-Speech (TTS) API is based on advanced generative AI models that can convert input text into realistic speech audio. It supports a variety of use cases:
  • Narrating written blog articles
  • Generating speech audio in multiple languages
  • Providing real-time audio output streams

Available Models

OpenAI Models

  • gpt-4o-audio-preview — OpenAI’s latest audio generation model, supporting conversational audio generation
  • gpt-4o-mini-tts — The preferred model for smart real-time applications, supporting advanced voice control and allowing various voice characteristics to be controlled via prompts:
    1. Accent
    2. Emotional range
    3. Intonation
    4. Impressions/Style
    5. Speed of speech
    6. Tone
    7. Whispering
  • tts-1-hd — The previous generation TTS model with high-definition audio quality
  • tts-1 — Standard TTS model, balancing quality and speed

Gemini Models

Performance Recommendations:
  1. For the fastest response time, it’s recommended to use wav or pcm as the response format
  2. For high-quality audio, use tts-1-hd
  3. For faster generation speed, use tts-1
  4. For smart voice applications, gpt-4o-mini-tts is recommended
  5. For scenarios requiring multi-speaker dialogues, the Gemini TTS models are recommended

API Endpoint

Request URL

POST https://aihubmix.com/v1/audio/speech

Request Headers

Authorization: Bearer $AIHUBMIX_API_KEY
Content-Type: application/json

Request Parameters

Standard TTS Parameters

The standard parameters applicable to the following TTS models: tts-1, tts-1-hd, gpt-4o-mini-tts, gemini-2.5-flash-preview-tts, and gemini-2.5-pro-preview-tts.
ParameterTypeRequiredDescription
modelstringYesThe model ID to be used. Optional values: tts-1, tts-1-hd, gpt-4o-mini-tts, gemini-2.5-flash-preview-tts, gemini-2.5-pro-preview-tts
inputstringYesThe text to generate audio from, with a maximum length of 4096 characters
voicestringYesThe voice used for synthesis. See the voice list below.
response_formatstringNoAudio output format. Supported audio formats include: mp3, opus, aac, flac, wav, pcm, default is mp3. Note: Gemini models only support wav and pcm formats.
speednumberNoThe speed of the generated audio. Range from 0.25 to 4.0, default is 1.0. Note: gpt-4o-mini-tts and Gemini models do not support this parameter, but speed can be controlled through natural language descriptions.
instructionsstringNoVoice generation instructions, which can specify voice style, intonation, and emotional characteristics in detail, applicable only for gpt-4o-mini-tts and Gemini models.

gpt-4o-audio-preview Parameters

ParameterTypeRequiredDescription
modelstringYesSet to gpt-4o-audio-preview
modalitiesarrayYesSet to ["text", "audio"] to enable audio output
audioobjectYesAudio configuration object containing voice and format fields
messagesarrayYesArray of chat messages, similar to standard chat format

Voice List

OpenAI Voices

Supports the following voice options:
  • alloy - Neutral, balanced
  • ash - Clear, professional
  • ballad - Warm, narrative
  • coral - Friendly, approachable
  • echo - Clear, bright
  • fable - Expressive, dramatic
  • onyx - Deep, authoritative
  • nova - Lively, energetic
  • sage - Mature, knowledgeable
  • shimmer - Soft, soothing
  • verse - Clear, versatile
  • marin - Natural, friendly
  • cedar - Stable, reliable

Gemini Voices

Supports the following 30 voice options:
Voice NameStyleVoice NameStyleVoice NameStyle
ZephyrBrightPuckUpbeatCharonInformative
KoreFirmFenrirExcitableLedaYouthful
OrusFirmAoedeBreezyCallirrhoeEasy-going
AutonoeBrightEnceladusBreathyIapetusClear
UmbrielEasy-goingAlgiebaSmoothDespinaSmooth
ErinomeClearAlgenibGravellyRasalgethiInformative
LaomedeiaUpbeatAchernarSoftAlnilamFirm
SchedarEvenGacruxMaturePulcherrimaForward
AchirdFriendlyZubenelgenubiCasualVindemiatrixGentle
SadachbiaLivelySadaltagerKnowledgeableSulafatWarm

Voice Mapping

When using Gemini models, if an OpenAI voice name is provided, the system will automatically map it to the corresponding Gemini voice:
OpenAI VoiceGemini VoiceOpenAI VoiceGemini Voice
alloyKoreashFenrir
balladAoedecoralLeda
echoPuckfableZephyr
onyxCharonnovaOrus
sageAlgiebashimmerCallirrhoe
verseEnceladusmarinDespina
cedarIapetus

Usage Examples

Standard TTS Model (OpenAI)

curl https://aihubmix.com/v1/audio/speech \
  -H "Authorization: Bearer $AIHUBMIX_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tts-1",
    "input": "The quick brown fox jumped over the lazy dog.",
    "voice": "alloy"
  }' \
  --output speech.mp3

Gemini TTS Model (Single Speaker)

curl https://aihubmix.com/v1/audio/speech \
  -H "Authorization: Bearer $AIHUBMIX_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemini-2.5-flash-preview-tts",
    "input": "Say cheerfully: Have a wonderful day!",
    "voice": "Kore",
    "response_format": "wav"
  }' \
  --output speech.wav

Gemini TTS Model (Multi-Speaker - Controlled by Prompts)

curl https://aihubmix.com/v1/audio/speech \
  -H "Authorization: Bearer $AIHUBMIX_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemini-2.5-flash-preview-tts",
    "input": "TTS the following conversation between Joe and Jane:\nJoe: How'\''s it going today Jane?\nJane: Not too bad, how about you?",
    "voice": "Kore",
    "response_format": "wav",
    "instructions": "Joe should sound firm and professional, Jane should sound upbeat and friendly"
  }' \
  --output conversation.wav

Python Example (OpenAI SDK)

from openai import OpenAI

client = OpenAI(
    api_key="your-aihubmix-api-key",
    base_url="https://aihubmix.com/v1"
)

response = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input="The quick brown fox jumped over the lazy dog."
)

response.stream_to_file("speech.mp3")

Python Example (Gemini TTS)

from openai import OpenAI

client = OpenAI(
    api_key="your-aihubmix-api-key",
    base_url="https://aihubmix.com/v1"
)

# Single Speaker
response = client.audio.speech.create(
    model="gemini-2.5-flash-preview-tts",
    voice="Kore",
    input="Say cheerfully: Have a wonderful day!",
    extra_body={
        "response_format": "wav"
    }
)

response.stream_to_file("speech.wav")

# Multi-Speaker Dialogue
conversation_response = client.audio.speech.create(
    model="gemini-2.5-flash-preview-tts",
    voice="Kore",
    input="""TTS the following conversation between Joe and Jane:
    Joe: How's it going today Jane?
    Jane: Not too bad, how about you?""",
    extra_body={
        "response_format": "wav",
        "instructions": "Joe should sound firm, Jane should sound upbeat"
    }
)

conversation_response.stream_to_file("conversation.wav")

Controlling Voice Style (Gemini Models)

Gemini TTS models support controlling voice style, tone, accent, and speed through natural language prompts. You can provide guidance in the input or instructions parameters.

Single Speaker Style Control

{
  "model": "gemini-2.5-flash-preview-tts",
  "input": "Say in a spooky whisper: By the pricking of my thumbs... Something wicked this way comes",
  "voice": "Enceladus",
  "response_format": "wav"
}

Multi-Speaker Style Control

{
  "model": "gemini-2.5-flash-preview-tts",
  "input": "Speaker1: So... what's on the agenda today?\nSpeaker2: You're never going to guess!",
  "voice": "Kore",
  "response_format": "wav",
  "instructions": "Make Speaker1 sound tired and bored, and Speaker2 sound excited and happy"
}

Prompt Structure Recommendations

For best results, you can use the following structured prompt format:
{
  "model": "gemini-2.5-flash-preview-tts",
  "input": "Your transcript here",
  "voice": "Kore",
  "instructions": "# AUDIO PROFILE: Character Name\n## Role Description\n\n## THE SCENE: Scene Name\nDescribe the environment and mood\n\n### DIRECTOR'S NOTES\nStyle: Describe the style\nPacing: Describe the pacing\nAccent: Specify the accent"
}

Supported Languages

The TTS models automatically detect the input language. The following 24 languages are supported:
LanguageBCP-47 CodeLanguageBCP-47 Code
Arabic (Egypt)ar-EGGerman (Germany)de-DE
English (US)en-USSpanish (US)es-US
French (France)fr-FRHindi (India)hi-IN
Indonesian (Indonesia)id-IDItalian (Italy)it-IT
Japanese (Japan)ja-JPKorean (South Korea)ko-KR
Portuguese (Brazil)pt-BRRussian (Russia)ru-RU
Dutch (Netherlands)nl-NLPolish (Poland)pl-PL
Thai (Thailand)th-THTurkish (Turkey)tr-TR
Vietnamese (Vietnam)vi-VNRomanian (Romania)ro-RO
Ukrainian (Ukraine)uk-UABengali (Bangladesh)bn-BD
English (India)en-IN & hi-INMarathi (India)mr-IN
Tamil (India)ta-INTelugu (India)te-IN

Response Formats

Audio Formats

FormatContent-TypeDescriptionModel Support
mp3audio/mpegDefault format, widely compatibleOpenAI Models
opusaudio/opusSuitable for internet streamingOpenAI Models
aacaudio/aacDigital audio compressionOpenAI Models
flacaudio/flacLossless audio compressionOpenAI Models
wavaudio/wavUncompressed WAV audioAll Models
pcmaudio/pcmRaw PCM audio (24kHz, mono, 16-bit)All Models
Note: The Gemini model natively returns PCM format (24kHz, mono, 16-bit), and the system will automatically convert it to WAV format. For other formats, it’s recommended to use OpenAI models.

Response Body

On success, an audio stream (binary data) is returned, and Content-Type is set according to the response_format parameter. On failure, a JSON error message is returned:
{
  "error": {
    "message": "Error description",
    "type": "error_type",
    "code": "error_code"
  }
}

Billing Information

The TTS API is billed based on the number of characters:
  • The character count of the input text is the billing unit
  • Different models have different price multipliers
  • Maximum input length: 4096 characters

Limitations

  • Maximum input length: 4096 characters
  • Gemini TTS models only support wav and pcm output formats
  • Gemini TTS models do not support the speed parameter (controlled through prompts)
  • Context window limit: 32k tokens (Gemini models)

Frequently Asked Questions

Q: How do I choose the right model?

  • Need quick generation → tts-1 or gemini-2.5-flash-preview-tts
  • Need high-quality audio → tts-1-hd
  • Need intelligent voice control → gpt-4o-mini-tts or Gemini TTS models
  • Need multi-speaker dialogues → Gemini TTS models

Q: What are the differences between Gemini TTS and OpenAI TTS?

  • Gemini TTS: Supports controlling voice style through natural language prompts, supports multiple speakers, but only WAV/PCM formats
  • OpenAI TTS: Supports multiple audio formats, has fixed voice options, and speed can be controlled via parameters

Q: How do I implement multi-speaker dialogues?

Use the Gemini TTS model, format the input as a dialogue, and specify the style for each speaker in the instructions:
{
  "model": "gemini-2.5-flash-preview-tts",
  "input": "Speaker1: Hello!\nSpeaker2: Hi there!",
  "instructions": "Speaker1 should sound professional, Speaker2 should sound casual"
}

Q: Is streaming output supported?

Currently, the TTS API returns complete audio files and does not support streaming output.