Text to Speech
Use AI models to convert text into natural speech, supporting multiple speech styles and output formats
Introduction
Text to Speech (TTS) API is based on advanced generative AI models that can convert input text into realistic speech audio. It supports multiple uses:
- Voice blog articles
- Generate speech audio in multiple languages
- Provide real-time audio output stream
Available model list:
- gpt-4o-audio-preview - The latest audio generation model from OpenAI, supports conversational audio generation
- gpt-4o-mini-tts - The preferred model for smart real-time applications, supports advanced voice control, and can control multiple voice characteristics through prompts:
- Accent
- Emotional range
- Intonation
- Impressions
- Speed of speech
- Tone
- Whispering
- tts-1-hd - High-quality TTS model
- tts-1 - Standard TTS model, balance quality and speed
Performance suggestions: For the fastest response time, it is recommended to use wav
or pcm
as the response format. For high-quality audio, it is recommended to use tts-1-hd
; for faster generation speed, use tts-1
; for smart voice applications, it is recommended to use gpt-4o-mini-tts
.
Voice preview: You can listen to different voice effects on OpenAI.fm.
Model calling method
Standard TTS model (tts-1, tts-1-hd)
Use the /v1/audio/speech
endpoint, and call the client.audio.speech.create()
method.
gpt-4o-mini-tts
Use the /v1/audio/speech
endpoint, and support the instructions
parameter for advanced voice control.
gpt-4o-audio-preview
Use the /v1/chat/completions
endpoint, and set the modalities: ["text", "audio"]
and audio
configuration.
Request parameters
Standard TTS parameters
applicable to tts-1, tts-1-hd, gpt-4o-mini-tts
The model ID to use. Optional values: tts-1
, tts-1-hd
, gpt-4o-mini-tts
The text to generate audio, with a maximum length of 4096 characters
The voice to use for synthesis. Optional values: alloy
, echo
, fable
, onyx
, nova
, shimmer
The audio output format. Supported formats: mp3
, opus
, aac
, flac
, wav
, pcm
. Default is mp3
The speed of generating audio. The range is 0.25 to 4.0. Default is 1.0. Note: gpt-4o-mini-tts
does not support this parameter, but you can control the speed through natural language description
Voice generation instructions (only applicable to gpt-4o-mini-tts
model), can specify voice style, tone, emotion, etc.
gpt-4o-audio-preview parameters
Set to gpt-4o-audio-preview
Set to ["text", "audio"]
to enable audio output
Audio configuration object, containing voice
and format
fields
Chat message array, same as standard chat format