Introduction
The Speech-to-Text (STT) API, based on OpenAI’s Whisper models, converts audio files to text. It supports several use cases:
- Transcribing audio files to text
- Translating multilingual audio to English
- Supporting multiple audio format inputs
- Providing multiple output format options
Supported Models:
- `whisper-large-v3`: the latest large Whisper model; supports multiple languages. For Chinese recognition, use it with appropriate prompts and low temperature values
- `whisper-1`: the original Whisper model; stable and reliable, supports multiple languages
- `distil-whisper-large-v3-en`: a distilled English model; faster processing, with slightly lower accuracy. Recommended with low temperature values
Performance Recommendations:
- For Chinese audio, use the `whisper-large-v3` model with appropriate prompts and a lower temperature (e.g., 0.2) to reduce hallucinations
- For English audio, or when faster processing matters, use the `distil-whisper-large-v3-en` model
- Supported audio formats: mp3, mp4, mpeg, mpga, m4a, wav, webm
- File size limit: maximum 25MB
Model Usage
Speech Transcription
Use the `/v1/audio/transcriptions` endpoint via the `client.audio.transcriptions.create()` method to transcribe audio to text in the original language.
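A minimal transcription sketch, assuming the OpenAI-compatible Python SDK (`openai` package); the API key and file name are placeholders, not values from this document:

```python
def transcribe(path: str, model: str = "whisper-large-v3") -> str:
    """Transcribe an audio file to text in its original language."""
    # Deferred import so the sketch can be read without the SDK installed.
    from openai import OpenAI

    client = OpenAI(api_key="YOUR_API_KEY")  # placeholder credentials
    with open(path, "rb") as audio:
        result = client.audio.transcriptions.create(model=model, file=audio)
    return result.text

if __name__ == "__main__":
    print(transcribe("speech.mp3"))  # placeholder file name
```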
Speech Translation
Use the `/v1/audio/translations` endpoint via the `client.audio.translations.create()` method to translate audio to English text.
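The translation call mirrors transcription; again a sketch with placeholder key and file name, assuming the OpenAI-compatible Python SDK:

```python
def translate_to_english(path: str, model: str = "whisper-large-v3") -> str:
    """Translate speech in a supported language to English text."""
    from openai import OpenAI  # deferred; sketch stays readable without the SDK

    client = OpenAI(api_key="YOUR_API_KEY")  # placeholder credentials
    with open(path, "rb") as audio:
        result = client.audio.translations.create(model=model, file=audio)
    return result.text
```

Note that translation always targets English; there is no `language` parameter for the output.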
Request Parameters
Transcription Parameters
- `file`: the audio file object to transcribe. Supported formats: mp3, mp4, mpeg, mpga, m4a, wav, webm; maximum 25MB
- `model`: the model ID to use. Options: `whisper-large-v3`, `whisper-1`, `distil-whisper-large-v3-en`
- `language`: the language of the input audio in ISO-639-1 format (e.g., `en`, `zh`). Specifying the language can improve accuracy and latency
- `prompt`: an optional text prompt to guide the model’s style or continue a previous audio segment. The prompt should match the audio language
- `response_format`: the transcription output format. Options: `json` (default), `text`, `srt`, `verbose_json`, `vtt`
- `temperature`: sampling temperature between 0 and 1. Higher values make output more random; lower values make it more focused and deterministic. Default is 0
- `timestamp_granularities`: timestamp granularity. Options: `word`, `segment`. Only available when `response_format` is `verbose_json`
Translation Parameters
- `file`: the audio file object to translate; same formats as transcription
- `model`: the model ID to use; same options as transcription
- `prompt`: an optional English text prompt to guide translation style
- `response_format`: the translation output format; same options as transcription
- `temperature`: sampling temperature; same as transcription
Usage Examples
Response Formats
JSON Format (Default)
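The default `json` response carries a single `text` field; the value below is a placeholder:

```json
{
  "text": "The transcribed text of the audio."
}
```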
Verbose JSON Format (verbose_json)
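A trimmed example of the typical `verbose_json` shape (values are placeholders; real segments carry additional fields such as token and probability data, omitted here):

```json
{
  "task": "transcribe",
  "language": "english",
  "duration": 4.2,
  "text": "The transcribed text of the audio.",
  "segments": [
    {
      "id": 0,
      "start": 0.0,
      "end": 4.2,
      "text": "The transcribed text of the audio."
    }
  ]
}
```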
Text Format
SRT Format
VTT Format
Best Practices
- Chinese Audio Processing: Use
whisper-large-v3
model, setlanguage="zh"
,temperature=0.2
, and provide appropriate Chinese prompts - English Audio Processing: Use
distil-whisper-large-v3-en
for faster processing speed - Noise Handling: Use prompts to instruct the model to ignore background noise or clean up stammering issues
- Long Audio Processing: API automatically segments long audio; recommend preprocessing audio quality for best results
- Timestamp Requirements: Use
verbose_json
format andtimestamp_granularities
when precise timestamps are needed - Subtitle Creation: Use
srt
orvtt
format output directly without additional processing
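The subtitle workflow above can be sketched as follows (SDK, API key, and file names are placeholders; with a non-JSON `response_format` the SDK returns the formatted text directly):

```python
def save_subtitles(path: str, out_path: str = "speech.srt") -> None:
    """Request SRT output and write it straight to a subtitle file."""
    from openai import OpenAI  # deferred import; sketch only

    client = OpenAI(api_key="YOUR_API_KEY")  # placeholder credentials
    with open(path, "rb") as audio:
        srt_text = client.audio.transcriptions.create(
            model="whisper-large-v3",
            file=audio,
            response_format="srt",  # or "vtt"; both are usable as-is
        )
    with open(out_path, "w", encoding="utf-8") as out:
        out.write(srt_text)
```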