Introduction

Text to Speech (TTS) API is based on advanced generative AI models that can convert input text into realistic speech audio. It supports multiple uses:

  • Voice blog articles
  • Generate speech audio in multiple languages
  • Provide real-time audio output stream

Available model list:

  • gpt-4o-audio-preview - The latest audio generation model from OpenAI, supports conversational audio generation
  • gpt-4o-mini-tts - The preferred model for smart real-time applications, supports advanced voice control, and can control multiple voice characteristics through prompts:
    • Accent
    • Emotional range
    • Intonation
    • Impressions
    • Speed of speech
    • Tone
    • Whispering
  • tts-1-hd - High-quality TTS model
  • tts-1 - Standard TTS model, balance quality and speed

Performance suggestions: For the fastest response time, it is recommended to use wav or pcm as the response format. For high-quality audio, it is recommended to use tts-1-hd; for faster generation speed, use tts-1; for smart voice applications, it is recommended to use gpt-4o-mini-tts.

Voice preview: You can listen to different voice effects on OpenAI.fm.

Model calling method

Standard TTS model (tts-1, tts-1-hd)

Use the /v1/audio/speech endpoint, and call the client.audio.speech.create() method.

gpt-4o-mini-tts

Use the /v1/audio/speech endpoint, and support the instructions parameter for advanced voice control.

gpt-4o-audio-preview

Use the /v1/chat/completions endpoint, and set the modalities: ["text", "audio"] and audio configuration.

Request parameters

Standard TTS parameters

applicable to tts-1, tts-1-hd, gpt-4o-mini-tts

model
string
required

The model ID to use. Optional values: tts-1, tts-1-hd, gpt-4o-mini-tts

input
string
required

The text to generate audio, with a maximum length of 4096 characters

voice
string
required

The voice to use for synthesis. Optional values: alloy, echo, fable, onyx, nova, shimmer

response_format
string

The audio output format. Supported formats: mp3, opus, aac, flac, wav, pcm. Default is mp3

speed
number

The speed of generating audio. The range is 0.25 to 4.0. Default is 1.0. Note: gpt-4o-mini-tts does not support this parameter, but you can control the speed through natural language description

instructions
string

Voice generation instructions (only applicable to gpt-4o-mini-tts model), can specify voice style, tone, emotion, etc.

gpt-4o-audio-preview parameters

model
string
required

Set to gpt-4o-audio-preview

modalities
array
required

Set to ["text", "audio"] to enable audio output

audio
object
required

Audio configuration object, containing voice and format fields

messages
array
required

Chat message array, same as standard chat format

Usage

curl https://aihubmix.com/v1/audio/speech \
  -H "Authorization: Bearer $AIHUBMIX_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tts-1",
    "input": "The quick brown 🦊 jumped over the lazy dog.",
    "voice": "alloy"
  }' \
  --output speech.mp3