Introduction
The Speech-to-Text (STT) API, based on OpenAI’s Whisper models, converts audio files to text. It supports the following use cases:
Transcribing audio files to text
Translating multilingual audio to English
Supporting multiple audio format inputs
Providing multiple output format options
Available Model List:
whisper-large-v3: Latest large Whisper model; supports multiple languages. For Chinese recognition, pair it with an appropriate prompt and a low temperature value
whisper-1: Original Whisper model, stable and reliable; supports multiple languages
distil-whisper-large-v3-en: Distilled model with faster processing but slightly lower accuracy; a low temperature value is recommended
Performance Recommendations:
For Chinese audio, use the whisper-large-v3 model with an appropriate prompt and a lower temperature value (e.g., 0.2) to reduce hallucinations
For English audio or faster processing, use the distil-whisper-large-v3-en model
Supported audio formats: mp3, mp4, mpeg, mpga, m4a, wav, webm
File size limit: 25 MB maximum
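The format and size limits above can be checked locally before uploading. The following is a minimal sketch, not part of the API: the helper name `check_audio` is hypothetical, and reading "25MB" as 25 × 1024 × 1024 bytes is an assumption (the service may count in decimal megabytes).

```python
# Limits copied from the list above.
SUPPORTED_FORMATS = {"mp3", "mp4", "mpeg", "mpga", "m4a", "wav", "webm"}
MAX_FILE_BYTES = 25 * 1024 * 1024  # assumption: "25MB" interpreted as 25 MiB


def check_audio(file_name: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    # Take the text after the last dot as the extension, case-insensitively.
    extension = file_name.rsplit(".", 1)[-1].lower() if "." in file_name else ""
    if extension not in SUPPORTED_FORMATS:
        problems.append(f"unsupported format: {extension or '(no extension)'}")
    if size_bytes > MAX_FILE_BYTES:
        problems.append(f"file is {size_bytes} bytes, limit is {MAX_FILE_BYTES}")
    return problems
```

In practice the size would come from something like `os.path.getsize(path)`. The check only looks at the file extension, so it cannot tell whether the contents really match the claimed format.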
Model Usage
Speech Transcription
Use the /v1/audio/transcriptions endpoint, via the client.audio.transcriptions.create() method, to transcribe audio to text in the original language.
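A transcription call might look like the sketch below. It assumes the `client` is an OpenAI Python SDK client (third-party package `openai`) and that the base URL is `https://aihubmix.com/v1`, inferred from the curl example on this page. The helper names `make_client` and `transcribe` are hypothetical.

```python
def make_client(api_key: str):
    """Build an SDK client pointed at the host used in the curl example (assumed base URL)."""
    from openai import OpenAI  # third-party: pip install openai

    return OpenAI(api_key=api_key, base_url="https://aihubmix.com/v1")


def transcribe(client, audio_file, model: str = "whisper-large-v3") -> str:
    """Send one audio file to the transcription endpoint and return the text.

    audio_file is a file object opened in binary mode.
    """
    result = client.audio.transcriptions.create(model=model, file=audio_file)
    # With the default json response format, the SDK result exposes the text as .text.
    return result.text


# Example use (requires network access and a valid key):
#   with open("/path/to/file/audio.mp3", "rb") as f:
#       print(transcribe(make_client("YOUR_API_KEY"), f))
```

Passing the client in as an argument keeps the function easy to exercise with a stand-in object when no network is available.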
Speech Translation
Use the /v1/audio/translations endpoint, via the client.audio.translations.create() method, to translate audio to English text.
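Translation follows the same shape as transcription but targets the translations endpoint. This sketch assumes `client` is an OpenAI Python SDK client configured for this service; the function name `translate_to_english` and the choice of `whisper-large-v3` as the default model are illustrative assumptions.

```python
def translate_to_english(client, audio_file, model: str = "whisper-large-v3", prompt=None) -> str:
    """Send one audio file to the translation endpoint and return English text.

    audio_file is a file object opened in binary mode; prompt, if given,
    should be written in English.
    """
    kwargs = {"model": model, "file": audio_file}
    if prompt is not None:
        kwargs["prompt"] = prompt  # only sent when the caller supplies one
    result = client.audio.translations.create(**kwargs)
    return result.text
```

Unlike transcription, no language argument is passed here, since the output is always English.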
Request Parameters
Transcription Parameters
file
Audio file object to transcribe. Supported formats: mp3, mp4, mpeg, mpga, m4a, wav, webm; maximum size 25 MB
model
Model ID to use. Options: whisper-large-v3, whisper-1, distil-whisper-large-v3-en
language
Language of the input audio in ISO-639-1 format (e.g., ‘en’, ‘zh’). Specifying the language can improve accuracy and latency
prompt
Optional text prompt to guide the model’s style or continue a previous audio segment. The prompt should match the audio language
response_format
Transcription output format. Options: json (default), text, srt, verbose_json, vtt
temperature
Sampling temperature between 0 and 1. Higher values make output more random, lower values make it more focused and deterministic. Default is 0
timestamp_granularities[]
Timestamp granularities. Options: word, segment. Only available when response_format is verbose_json
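The constraints described above (allowed output formats, the 0 to 1 temperature range, and timestamp granularities requiring verbose_json) can be enforced before a request is sent. This is a sketch under assumptions: the function `build_transcription_params` is hypothetical, and the keyword names follow the OpenAI Python SDK, where the form field `timestamp_granularities[]` is passed as `timestamp_granularities`.

```python
RESPONSE_FORMATS = {"json", "text", "srt", "verbose_json", "vtt"}
GRANULARITIES = {"word", "segment"}


def build_transcription_params(model, language=None, prompt=None,
                               response_format="json", temperature=0.0,
                               timestamp_granularities=None) -> dict:
    """Validate the options and return keyword arguments for a transcription call."""
    if response_format not in RESPONSE_FORMATS:
        raise ValueError(f"unknown response_format: {response_format}")
    if not 0.0 <= temperature <= 1.0:
        raise ValueError("temperature must be between 0 and 1")
    if timestamp_granularities:
        if response_format != "verbose_json":
            raise ValueError("timestamp_granularities requires response_format='verbose_json'")
        unknown = set(timestamp_granularities) - GRANULARITIES
        if unknown:
            raise ValueError(f"unknown granularities: {sorted(unknown)}")

    params = {"model": model, "response_format": response_format, "temperature": temperature}
    # Optional fields are included only when the caller set them.
    if language:
        params["language"] = language
    if prompt:
        params["prompt"] = prompt
    if timestamp_granularities:
        params["timestamp_granularities"] = list(timestamp_granularities)
    return params
```

The returned dictionary can be unpacked into the call, for example `client.audio.transcriptions.create(file=f, **params)`.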
Translation Parameters
file
Audio file object to translate. Same formats as transcription
model
Model ID to use, same as transcription parameters
prompt
Optional English text prompt to guide translation style
response_format
Translation output format, same as transcription parameters
temperature
Sampling temperature, same as transcription parameters
Usage Examples
Curl Transcription
curl https://aihubmix.com/v1/audio/transcriptions \
  -H "Authorization: Bearer $AIHUBMIX_API_KEY" \
  -H "Content-Type: multipart/form-data" \
  -F file="@/path/to/file/audio.mp3" \
  -F model="whisper-large-v3" \
  -F response_format="text" \
  -F temperature="0.2"
JSON Format
{
  "text": "This is the transcribed text content"
}
Verbose Output Format
{
  "task": "transcribe",
  "language": "english",
  "duration": 8.470000267028809,
  "text": "This is the transcribed text content",
  "segments": [
    {
      "id": 0,
      "seek": 0,
      "start": 0.0,
      "end": 8.470000267028809,
      "text": " This is the transcribed text content",
      "tokens": [50364, 50365, 50365, 50365],
      "temperature": 0.2,
      "avg_logprob": -0.9929364013671875,
      "compression_ratio": 0.8888888888888888,
      "no_speech_prob": 0.0963134765625
    }
  ]
}
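Segment timing can be pulled out of a verbose_json response with the standard library alone. The sketch below assumes the response body has the shape shown in the sample (a `segments` array whose items carry `start`, `end`, and `text`); the function name `segment_times` is hypothetical.

```python
import json


def segment_times(verbose_json: str) -> list[tuple[float, float, str]]:
    """Return (start, end, text) for each segment, with times rounded to milliseconds."""
    data = json.loads(verbose_json)
    return [
        # Segment text in the sample has a leading space, so strip it.
        (round(seg["start"], 3), round(seg["end"], 3), seg["text"].strip())
        for seg in data.get("segments", [])
    ]
```

A response without a `segments` key yields an empty list rather than an error, which is a design choice of this sketch, not documented API behaviour.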
Text Format
This is the transcribed text content
SRT Subtitle Format
1
00:00:00,000 --> 00:00:08,470
This is the transcribed text content
VTT Subtitle Format
WEBVTT

00:00:00.000 --> 00:00:08.470
This is the transcribed text content
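When a verbose_json response has already been fetched for its timestamps, subtitles in the SRT layout shown above can be derived from its segments without a second request. This is an illustrative sketch, not something the API provides; `srt_timestamp` and `segments_to_srt` are hypothetical helper names, and the segment dictionaries are assumed to have `start`, `end`, and `text` keys.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS,mmm, the timestamp style used in SRT files."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(segments) -> str:
    """Build SRT text from segments; cues are numbered from 1 and separated by a blank line."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        cues.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(cues)
```

If only subtitles are needed, requesting srt or vtt output directly is simpler than converting.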
Best Practices
Chinese Audio Processing: Use the whisper-large-v3 model, set language="zh" and temperature=0.2, and provide an appropriate Chinese prompt
English Audio Processing: Use distil-whisper-large-v3-en for faster processing
Noise Handling: Use the prompt to instruct the model to ignore background noise or clean up stammering
Long Audio Processing: The API segments long audio automatically; improving audio quality beforehand is recommended for best results
Timestamp Requirements: Use the verbose_json format with timestamp_granularities when precise timestamps are needed
Subtitle Creation: Request srt or vtt output and use it directly, without additional processing
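The language-specific recommendations above can be captured in a small lookup. This is a sketch only: the function name `recommended_settings` is hypothetical, and falling back to whisper-large-v3 for languages other than Chinese and English is an assumption based on that model being described as multilingual.

```python
def recommended_settings(language: str) -> dict:
    """Map an ISO-639-1 language code to the model and options suggested on this page."""
    if language == "zh":
        # Chinese: large model plus a lower temperature to reduce hallucinations.
        # A Chinese prompt is also recommended and is left to the caller.
        return {"model": "whisper-large-v3", "language": "zh", "temperature": 0.2}
    if language == "en":
        # English: the distilled model is faster; the default temperature of 0 is already low.
        return {"model": "distil-whisper-large-v3-en", "language": "en"}
    # Assumption: other languages use the multilingual large model with default options.
    return {"model": "whisper-large-v3", "language": language}
```

The result is meant to be merged into the keyword arguments of a transcription call.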