Introduction
The Text-to-Speech (TTS) API is based on advanced generative AI models that can convert input text into realistic speech audio. It supports a variety of use cases:- Narrating written blog articles
- Generating speech audio in multiple languages
- Providing real-time audio output streams
Available Models
OpenAI Models
- gpt-4o-audio-preview — OpenAI’s latest audio generation model, supporting conversational audio generation
- gpt-4o-mini-tts — The preferred model for smart real-time applications, supporting advanced voice control and allowing various voice characteristics to be controlled via prompts:
- Accent
- Emotional range
- Intonation
- Impressions/Style
- Speed of speech
- Tone
- Whispering
- tts-1-hd — The previous generation TTS model with high-definition audio quality
- tts-1 — Standard TTS model, balancing quality and speed
Gemini Models
- gemini-2.5-flash-preview-tts — Gemini fast TTS model, supporting single and multiple speaker audio generation
- gemini-2.5-pro-preview-tts — Gemini professional TTS model, supporting single and multiple speaker audio generation
- For the fastest response time, it’s recommended to use
wavorpcmas the response format - For high-quality audio, use
tts-1-hd - For faster generation speed, use
tts-1 - For smart voice applications,
gpt-4o-mini-ttsis recommended - For scenarios requiring multi-speaker dialogues, the Gemini TTS models are recommended
API Endpoint
Request URL
Request Headers
Request Parameters
Standard TTS Parameters
The standard parameters applicable to the following TTS models: tts-1, tts-1-hd, gpt-4o-mini-tts, gemini-2.5-flash-preview-tts, and gemini-2.5-pro-preview-tts.| Parameter | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | The model ID to be used. Optional values: tts-1, tts-1-hd, gpt-4o-mini-tts, gemini-2.5-flash-preview-tts, gemini-2.5-pro-preview-tts |
| input | string | Yes | The text to generate audio from, with a maximum length of 4096 characters |
| voice | string | Yes | The voice used for synthesis. See the voice list below. |
| response_format | string | No | Audio output format. Supported audio formats include: mp3, opus, aac, flac, wav, pcm, default is mp3. Note: Gemini models only support wav and pcm formats. |
| speed | number | No | The speed of the generated audio. Range from 0.25 to 4.0, default is 1.0. Note: gpt-4o-mini-tts and Gemini models do not support this parameter, but speed can be controlled through natural language descriptions. |
| instructions | string | No | Voice generation instructions, which can specify voice style, intonation, and emotional characteristics in detail, applicable only for gpt-4o-mini-tts and Gemini models. |
gpt-4o-audio-preview Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | Set to gpt-4o-audio-preview |
| modalities | array | Yes | Set to ["text", "audio"] to enable audio output |
| audio | object | Yes | Audio configuration object containing voice and format fields |
| messages | array | Yes | Array of chat messages, similar to standard chat format |
Voice List
OpenAI Voices
Supports the following voice options:- alloy - Neutral, balanced
- ash - Clear, professional
- ballad - Warm, narrative
- coral - Friendly, approachable
- echo - Clear, bright
- fable - Expressive, dramatic
- onyx - Deep, authoritative
- nova - Lively, energetic
- sage - Mature, knowledgeable
- shimmer - Soft, soothing
- verse - Clear, versatile
- marin - Natural, friendly
- cedar - Stable, reliable
Gemini Voices
Supports the following 30 voice options:| Voice Name | Style | Voice Name | Style | Voice Name | Style |
|---|---|---|---|---|---|
| Zephyr | Bright | Puck | Upbeat | Charon | Informative |
| Kore | Firm | Fenrir | Excitable | Leda | Youthful |
| Orus | Firm | Aoede | Breezy | Callirrhoe | Easy-going |
| Autonoe | Bright | Enceladus | Breathy | Iapetus | Clear |
| Umbriel | Easy-going | Algieba | Smooth | Despina | Smooth |
| Erinome | Clear | Algenib | Gravelly | Rasalgethi | Informative |
| Laomedeia | Upbeat | Achernar | Soft | Alnilam | Firm |
| Schedar | Even | Gacrux | Mature | Pulcherrima | Forward |
| Achird | Friendly | Zubenelgenubi | Casual | Vindemiatrix | Gentle |
| Sadachbia | Lively | Sadaltager | Knowledgeable | Sulafat | Warm |
Voice Mapping
When using Gemini models, if an OpenAI voice name is provided, the system will automatically map it to the corresponding Gemini voice:| OpenAI Voice | Gemini Voice | OpenAI Voice | Gemini Voice |
|---|---|---|---|
| alloy | Kore | ash | Fenrir |
| ballad | Aoede | coral | Leda |
| echo | Puck | fable | Zephyr |
| onyx | Charon | nova | Orus |
| sage | Algieba | shimmer | Callirrhoe |
| verse | Enceladus | marin | Despina |
| cedar | Iapetus |
Usage Examples
Standard TTS Model (OpenAI)
Gemini TTS Model (Single Speaker)
Gemini TTS Model (Multi-Speaker - Controlled by Prompts)
Python Example (OpenAI SDK)
Python Example (Gemini TTS)
Controlling Voice Style (Gemini Models)
Gemini TTS models support controlling voice style, tone, accent, and speed through natural language prompts. You can provide guidance in theinput or instructions parameters.
Single Speaker Style Control
Multi-Speaker Style Control
Prompt Structure Recommendations
For best results, you can use the following structured prompt format:Supported Languages
The TTS models automatically detect the input language. The following 24 languages are supported:| Language | BCP-47 Code | Language | BCP-47 Code |
|---|---|---|---|
| Arabic (Egypt) | ar-EG | German (Germany) | de-DE |
| English (US) | en-US | Spanish (US) | es-US |
| French (France) | fr-FR | Hindi (India) | hi-IN |
| Indonesian (Indonesia) | id-ID | Italian (Italy) | it-IT |
| Japanese (Japan) | ja-JP | Korean (South Korea) | ko-KR |
| Portuguese (Brazil) | pt-BR | Russian (Russia) | ru-RU |
| Dutch (Netherlands) | nl-NL | Polish (Poland) | pl-PL |
| Thai (Thailand) | th-TH | Turkish (Turkey) | tr-TR |
| Vietnamese (Vietnam) | vi-VN | Romanian (Romania) | ro-RO |
| Ukrainian (Ukraine) | uk-UA | Bengali (Bangladesh) | bn-BD |
| English (India) | en-IN & hi-IN | Marathi (India) | mr-IN |
| Tamil (India) | ta-IN | Telugu (India) | te-IN |
Response Formats
Audio Formats
| Format | Content-Type | Description | Model Support |
|---|---|---|---|
| mp3 | audio/mpeg | Default format, widely compatible | OpenAI Models |
| opus | audio/opus | Suitable for internet streaming | OpenAI Models |
| aac | audio/aac | Digital audio compression | OpenAI Models |
| flac | audio/flac | Lossless audio compression | OpenAI Models |
| wav | audio/wav | Uncompressed WAV audio | All Models |
| pcm | audio/pcm | Raw PCM audio (24kHz, mono, 16-bit) | All Models |
Response Body
On success, an audio stream (binary data) is returned, and Content-Type is set according to theresponse_format parameter.
On failure, a JSON error message is returned:
Billing Information
The TTS API is billed based on the number of characters:- The character count of the input text is the billing unit
- Different models have different price multipliers
- Maximum input length: 4096 characters
Limitations
- Maximum input length: 4096 characters
- Gemini TTS models only support
wavandpcmoutput formats - Gemini TTS models do not support the
speedparameter (controlled through prompts) - Context window limit: 32k tokens (Gemini models)
Frequently Asked Questions
Q: How do I choose the right model?
- Need quick generation →
tts-1orgemini-2.5-flash-preview-tts - Need high-quality audio →
tts-1-hd - Need intelligent voice control →
gpt-4o-mini-ttsor Gemini TTS models - Need multi-speaker dialogues → Gemini TTS models
Q: What are the differences between Gemini TTS and OpenAI TTS?
- Gemini TTS: Supports controlling voice style through natural language prompts, supports multiple speakers, but only WAV/PCM formats
- OpenAI TTS: Supports multiple audio formats, has fixed voice options, and speed can be controlled via parameters
Q: How do I implement multi-speaker dialogues?
Use the Gemini TTS model, format theinput as a dialogue, and specify the style for each speaker in the instructions: