Introduction
The Speech-to-Text (STT) API, based on OpenAI's Whisper models, converts audio files to text. It supports various use cases:
- Transcribing audio files to text
- Translating multilingual audio to English
- Supporting multiple audio format inputs
- Providing multiple output format options
Supported Models
- whisper-large-v3: Latest large Whisper model; supports multiple languages. For Chinese recognition, use with appropriate prompts and low temperature values
- whisper-1: Original Whisper model; stable and reliable, supports multiple languages
- distil-whisper-large-v3-en: Distilled model with faster processing speed but slightly lower accuracy; recommended with low temperature values
Model Usage
Speech Transcription
Use the /v1/audio/transcriptions endpoint via the client.audio.transcriptions.create() method to transcribe audio to text in its original language.
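A minimal sketch of a transcription call using the OpenAI Python SDK. The API key, file name, and default model choice below are placeholders, not values mandated by the API:

```python
def transcribe(path: str, model: str = "whisper-1") -> str:
    """Transcribe a local audio file to text in its original language."""
    from openai import OpenAI  # deferred import keeps the sketch self-contained

    client = OpenAI(api_key="YOUR_API_KEY")  # placeholder credential
    with open(path, "rb") as audio_file:
        result = client.audio.transcriptions.create(
            model=model,
            file=audio_file,
        )
    # The default json response exposes the recognized text as .text
    return result.text
```

For example, `transcribe("speech.mp3", model="whisper-large-v3")` would return the recognized text as a string.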
Speech Translation
Use the /v1/audio/translations endpoint via the client.audio.translations.create() method to translate audio to English text.
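The translation endpoint follows the same shape; a sketch with placeholder key and file name:

```python
def translate_to_english(path: str, model: str = "whisper-1") -> str:
    """Translate a (possibly non-English) audio file to English text."""
    from openai import OpenAI  # deferred import keeps the sketch self-contained

    client = OpenAI(api_key="YOUR_API_KEY")  # placeholder credential
    with open(path, "rb") as audio_file:
        result = client.audio.translations.create(
            model=model,
            file=audio_file,
        )
    return result.text
```

Note that this endpoint always targets English; there is no target-language parameter.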
Request Parameters
Transcription Parameters
- file: Audio file object to transcribe. Supported formats: mp3, mp4, mpeg, mpga, m4a, wav, webm; maximum size 25MB
- model: Model ID to use. Options: whisper-large-v3, whisper-1, distil-whisper-large-v3-en
- language: Language of the input audio in ISO-639-1 format (e.g., 'en', 'zh'). Specifying the language can improve accuracy and latency
- prompt: Optional text prompt to guide the model's style or continue a previous audio segment. The prompt should match the audio language
- response_format: Transcription output format. Options: json (default), text, srt, verbose_json, vtt
- temperature: Sampling temperature between 0 and 1. Higher values make output more random; lower values make it more focused and deterministic. Default is 0
- timestamp_granularities: Timestamp granularities. Options: word, segment. Only available when response_format is verbose_json
Translation Parameters
- file: Audio file object to translate. Same formats as transcription
- model: Model ID to use, same as transcription parameters
- prompt: Optional English text prompt to guide translation style
- response_format: Translation output format, same options as transcription
- temperature: Sampling temperature, same as transcription parameters
Usage Examples
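A sketch combining several of the parameters above: requesting segment-level timestamps via verbose_json. The file name and API key are placeholders:

```python
def transcribe_with_timestamps(path: str):
    """Transcribe audio and return (start, end, text) tuples per segment."""
    from openai import OpenAI  # deferred import keeps the sketch self-contained

    client = OpenAI(api_key="YOUR_API_KEY")  # placeholder credential
    with open(path, "rb") as audio_file:
        result = client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file,
            response_format="verbose_json",
            timestamp_granularities=["segment"],
        )
    # Each segment carries start/end offsets in seconds
    return [(seg.start, seg.end, seg.text) for seg in result.segments]
```

The same pattern with `timestamp_granularities=["word"]` yields word-level offsets instead.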
Response Formats
JSON Format (Default)
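The default json response contains only the recognized text; the sentence below is illustrative:

```json
{
  "text": "Hello, welcome to today's episode."
}
```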
Verbose JSON Format (verbose_json)
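An abbreviated verbose_json response; field values are illustrative, and real responses include additional per-segment fields:

```json
{
  "task": "transcribe",
  "language": "english",
  "duration": 3.5,
  "text": "Hello, welcome to today's episode.",
  "segments": [
    {
      "id": 0,
      "start": 0.0,
      "end": 3.5,
      "text": "Hello, welcome to today's episode."
    }
  ]
}
```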
Text Format
SRT Format
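SRT output is returned as plain text, one numbered cue per block; the content below is illustrative:

```text
1
00:00:00,000 --> 00:00:03,500
Hello, welcome to today's episode.
```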
VTT Format
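VTT output is the same cue structure with a WEBVTT header and dot-separated milliseconds; the content below is illustrative:

```text
WEBVTT

00:00:00.000 --> 00:00:03.500
Hello, welcome to today's episode.
```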
Best Practices
- Chinese Audio Processing: Use the whisper-large-v3 model, set language="zh" and temperature=0.2, and provide appropriate Chinese prompts
- English Audio Processing: Use distil-whisper-large-v3-en for faster processing speed
- Noise Handling: Use prompts to instruct the model to ignore background noise or clean up stammering issues
- Long Audio Processing: The API automatically segments long audio; preprocessing audio quality is recommended for best results
- Timestamp Requirements: Use the verbose_json format and timestamp_granularities when precise timestamps are needed
- Subtitle Creation: Use srt or vtt format output directly, without additional processing
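The subtitle-creation practice can be sketched as follows; file names and the API key are placeholders, and this assumes the SDK returns the srt payload as a plain string when response_format="srt":

```python
def save_subtitles(audio_path: str, srt_path: str) -> None:
    """Transcribe audio and write the API's srt output straight to disk."""
    from openai import OpenAI  # deferred import keeps the sketch self-contained

    client = OpenAI(api_key="YOUR_API_KEY")  # placeholder credential
    with open(audio_path, "rb") as audio_file:
        srt_text = client.audio.transcriptions.create(
            model="whisper-1",
            file=audio_file,
            response_format="srt",
        )
    # No post-processing needed: the payload is already valid SRT
    with open(srt_path, "w", encoding="utf-8") as out:
        out.write(srt_text)
```

Swapping response_format to "vtt" (and the output extension to .vtt) produces WebVTT subtitles the same way.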