Introduction

The Speech-to-Text (STT) API, based on OpenAI's Whisper models, converts audio files to text. It supports various use cases:
  • Transcribing audio files to text
  • Translating multilingual audio to English
  • Supporting multiple audio format inputs
  • Providing multiple output format options
Available Model List:
  • whisper-large-v3: Latest large Whisper model, supports multiple languages. For Chinese recognition, use with an appropriate prompt and a low temperature
  • whisper-1: Original Whisper model, stable and reliable, supports multiple languages
  • distil-whisper-large-v3-en: Distilled English-only model with faster processing but slightly lower accuracy; recommended with low temperature values
Performance Recommendations:
  • For Chinese audio, use the whisper-large-v3 model with an appropriate prompt and a lower temperature (e.g., 0.2) to reduce hallucinations
  • For English audio or faster processing, use distil-whisper-large-v3-en model
  • Supported audio formats: mp3, mp4, mpeg, mpga, m4a, wav, webm
  • File size limit: maximum 25MB
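The format and size limits above can be checked client-side before upload. A minimal pre-flight check in Python (the function and constant names here are illustrative, not part of the API):

```python
import os

# Formats and size limit taken from the list above.
SUPPORTED_FORMATS = {"mp3", "mp4", "mpeg", "mpga", "m4a", "wav", "webm"}
MAX_BYTES = 25 * 1024 * 1024  # 25MB upload limit

def validate_audio(path: str) -> None:
    """Raise ValueError if the file would be rejected by the STT API."""
    ext = os.path.splitext(path)[1].lstrip(".").lower()
    if ext not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {ext or '(none)'}")
    if os.path.getsize(path) > MAX_BYTES:
        raise ValueError("file exceeds the 25MB limit")
```

Running this before the upload gives a clearer error than a rejected HTTP request.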

Model Usage

Speech Transcription

Use the /v1/audio/transcriptions endpoint, via the client.audio.transcriptions.create() method, to transcribe audio to text in its original language.

Speech Translation

Use the /v1/audio/translations endpoint, via the client.audio.translations.create() method, to translate audio into English text.
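The translation call is nearly identical to transcription, just a different method and no language parameter, since the output is always English. A sketch under the same SDK and environment assumptions as above:

```python
import os

def translate_to_english(path: str) -> str:
    """Translate audio in any supported language to English text."""
    # Lazy import so the module loads without the SDK installed.
    from openai import OpenAI  # pip install openai
    client = OpenAI(api_key=os.environ["AIHUBMIX_API_KEY"],
                    base_url="https://aihubmix.com/v1")
    with open(path, "rb") as audio:
        return client.audio.translations.create(
            model="whisper-large-v3",
            file=audio,
            response_format="text",
        )
```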

Request Parameters

Transcription Parameters

  • file (file, required): Audio file object to transcribe. Supported formats: mp3, mp4, mpeg, mpga, m4a, wav, webm; maximum size 25MB
  • model (string, required): Model ID to use. Options: whisper-large-v3, whisper-1, distil-whisper-large-v3-en
  • language (string, optional): Language of the input audio in ISO-639-1 format (e.g., "en", "zh"). Specifying the language improves accuracy and reduces latency
  • prompt (string, optional): Text prompt to guide the model's style or continue a previous audio segment. The prompt should be in the same language as the audio
  • response_format (string, optional): Transcription output format. Options: json (default), text, srt, verbose_json, vtt
  • temperature (number, optional): Sampling temperature between 0 and 1. Higher values make output more random; lower values make it more focused and deterministic. Default is 0
  • timestamp_granularities[] (array, optional): Granularity of timestamps to return. Options: word, segment. Only available when response_format is verbose_json
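A quick illustration of working with verbose_json output: segment start and end times come back as plain floats, so extracting them is ordinary dictionary access. The sample payload is abbreviated from the Response Formats section below; the helper name is illustrative:

```python
def segment_timestamps(response: dict) -> list:
    """Pull (start, end, text) triples from a verbose_json response."""
    return [(s["start"], s["end"], s["text"].strip())
            for s in response.get("segments", [])]

# Abbreviated verbose_json payload, shaped like the sample below.
sample = {
    "task": "transcribe",
    "language": "english",
    "duration": 8.47,
    "text": "This is the transcribed text content",
    "segments": [
        {"id": 0, "start": 0.0, "end": 8.47,
         "text": " This is the transcribed text content"},
    ],
}

print(segment_timestamps(sample))
# [(0.0, 8.47, 'This is the transcribed text content')]
```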

Translation Parameters

  • file (file, required): Audio file object to translate. Same formats and size limit as transcription
  • model (string, required): Model ID to use; same options as transcription
  • prompt (string, optional): English text prompt to guide the translation style
  • response_format (string, optional): Translation output format; same options as transcription
  • temperature (number, optional): Sampling temperature; same as transcription

Usage Examples

curl https://aihubmix.com/v1/audio/transcriptions \
  -H "Authorization: Bearer $AIHUBMIX_API_KEY" \
  -H "Content-Type: multipart/form-data" \
  -F file="@/path/to/file/audio.mp3" \
  -F model="whisper-large-v3" \
  -F response_format="text" \
  -F temperature="0.2"

Response Formats

JSON Format (Default)

{
  "text": "This is the transcribed text content"
}

Verbose JSON Format (verbose_json)

{
  "task": "transcribe",
  "language": "english",
  "duration": 8.470000267028809,
  "text": "This is the transcribed text content",
  "segments": [
    {
      "id": 0,
      "seek": 0,
      "start": 0.0,
      "end": 8.470000267028809,
      "text": " This is the transcribed text content",
      "tokens": [50364, 50365, 50365, 50365],
      "temperature": 0.2,
      "avg_logprob": -0.9929364013671875,
      "compression_ratio": 0.8888888888888888,
      "no_speech_prob": 0.0963134765625
    }
  ]
}

Text Format

This is the transcribed text content

SRT Format

1
00:00:00,000 --> 00:00:08,470
This is the transcribed text content

VTT Format

WEBVTT

00:00:00.000 --> 00:00:08.470
This is the transcribed text content

Best Practices

  1. Chinese Audio Processing: Use whisper-large-v3 model, set language="zh", temperature=0.2, and provide appropriate Chinese prompts
  2. English Audio Processing: Use distil-whisper-large-v3-en for faster processing speed
  3. Noise Handling: Use prompts to tell the model to ignore background noise or to clean up stutters and filler words
  4. Long Audio Processing: The API segments long audio automatically; for best results, improve audio quality (e.g., noise reduction) before upload
  5. Timestamp Requirements: Use verbose_json format and timestamp_granularities when precise timestamps are needed
  6. Subtitle Creation: Use srt or vtt format output directly without additional processing
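When you need both precise timestamps (practice 5) and subtitles, you can render verbose_json segments into SRT yourself. A sketch with illustrative helper names, matching the SRT sample shown under Response Formats:

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, e.g. 8.47 -> 00:00:08,470."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments: list) -> str:
    """Render verbose_json segments as an SRT subtitle file."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(f"{i}\n{srt_timestamp(seg['start'])} --> "
                      f"{srt_timestamp(seg['end'])}\n{seg['text'].strip()}\n")
    return "\n".join(blocks)

print(segments_to_srt([{"start": 0.0, "end": 8.47,
                        "text": " This is the transcribed text content"}]))
```

For plain subtitle jobs, requesting srt or vtt directly (practice 6) remains the simpler path.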