Transcription

Overview

Voicebox uses OpenAI's Whisper for automatic speech recognition (ASR). Transcription powers two flows:

  1. Reference-text auto-fill — when a user records or uploads a voice sample, the backend transcribes it and populates the reference_text field so cloning can use it.
  2. On-demand transcription — a user-facing /transcribe endpoint for arbitrary audio.

On Apple Silicon, the transcription path runs through MLX-Whisper (from mlx-audio) for ~8× faster inference than PyTorch. Everywhere else it runs through the Hugging Face transformers Whisper implementation on PyTorch.

Architecture

Transcription is wired through the same backend abstraction as TTS. The STTBackend protocol lives in backend/backends/__init__.py:

from typing import Optional, Protocol, runtime_checkable

@runtime_checkable
class STTBackend(Protocol):
    async def load_model(self, model_size: str) -> None: ...
    async def transcribe(
        self,
        audio_path: str,
        language: Optional[str] = None,
        model_size: Optional[str] = None,
    ) -> str: ...
    def unload_model(self) -> None: ...
    def is_loaded(self) -> bool: ...

Two implementations ship today:

  • MLXSTTBackend (backends/mlx_backend.py) — uses mlx_audio.stt.load(). Default on Apple Silicon.
  • PyTorchSTTBackend (backends/pytorch_backend.py) — uses transformers.WhisperForConditionalGeneration. Default everywhere else.

get_stt_backend() picks the right one based on get_backend_type(). backend/services/transcribe.py is a thin wrapper that delegates to the backend.
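
The dispatch is simple. A minimal sketch of the pattern, assuming a module-level singleton and a "mlx" return value from get_backend_type() (both details are assumptions, not confirmed internals):

_stt_backend = None

def get_stt_backend() -> STTBackend:
    # One shared instance per process (see Memory Management below).
    global _stt_backend
    if _stt_backend is None:
        if get_backend_type() == "mlx":       # Apple Silicon path
            _stt_backend = MLXSTTBackend()
        else:
            _stt_backend = PyTorchSTTBackend()
    return _stt_backend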

Model Sizes

Five Whisper variants are registered in ModelConfig:

Model    HuggingFace Repo                Size     Notes
Base     openai/whisper-base             ~300 MB  Default; fast, decent quality
Small    openai/whisper-small            ~500 MB  Better quality, still fast
Medium   openai/whisper-medium           ~1.5 GB  High quality
Large    openai/whisper-large-v3         ~3 GB    Best quality, slow on CPU
Turbo    openai/whisper-large-v3-turbo   ~1.5 GB  Large-tier quality, ~5× faster than Large
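
A sketch of what that registration might look like, assuming ModelConfig keeps a simple size-to-repo mapping (the name WHISPER_MODELS and the dict shape are illustrative, not the real field names):

WHISPER_MODELS = {
    "base":   "openai/whisper-base",
    "small":  "openai/whisper-small",
    "medium": "openai/whisper-medium",
    "large":  "openai/whisper-large-v3",
    "turbo":  "openai/whisper-large-v3-turbo",
}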

The tiny model is not exposed — the quality gap to base wasn't worth the download.

Turbo + MLX-Whisper on Apple Silicon dropped user-facing transcription latency from ~20s to ~2-3s in v0.1.10.

Language Hints

Whisper can auto-detect language, but providing a hint improves accuracy on short clips:

text = await backend.transcribe(audio_path, language="en")

Accepted language codes are the standard Whisper set (99+ languages). The frontend typically passes the profile's language if available, or lets Whisper detect otherwise.

Model Loading

Both backends are lazy: the model is loaded on first use and cached in memory. Switching sizes unloads the previous model.

On MLX, the model is loaded via mlx_audio.stt.load(hf_repo). On PyTorch, via:

processor = WhisperProcessor.from_pretrained(hf_repo)
model = WhisperForConditionalGeneration.from_pretrained(hf_repo).to(device)

Both load paths use model_load_progress() from backends/base.py so the frontend sees live download progress on the first use.
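
Put together, the PyTorch path's lazy-load pattern looks roughly like this. This is a sketch only: progress reporting is omitted because model_load_progress()'s signature is internal, and it reuses the illustrative WHISPER_MODELS mapping from Model Sizes:

import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

class PyTorchSTTBackend:
    def __init__(self) -> None:
        self._model = None
        self._processor = None
        self._size = None

    async def load_model(self, model_size: str) -> None:
        if self._model is not None and self._size == model_size:
            return                             # cached: repeat calls are free
        self._model = None                     # switching sizes drops the old model
        repo = WHISPER_MODELS[model_size]      # registry sketch from Model Sizes
        device = "cuda" if torch.cuda.is_available() else "cpu"
        self._processor = WhisperProcessor.from_pretrained(repo)
        self._model = WhisperForConditionalGeneration.from_pretrained(repo).to(device)
        self._size = model_size

    # unload_model() is sketched under Memory Management below.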

Audio Preprocessing

Whisper expects mono 16 kHz audio. The audio utility in backend/utils/audio.py handles resampling and format conversion transparently:

  • Formats: WAV, MP3, FLAC, OGG, M4A (via soundfile / librosa)
  • Target: mono, 16 kHz, float32

Files longer than Whisper's 30-second window are handled by the underlying library's chunking logic — no explicit splitting in Voicebox code.
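
A minimal version of that conversion, assuming librosa handles the decode (the real utility in backend/utils/audio.py may differ in details):

import librosa
import numpy as np

def load_audio_16k(path: str) -> np.ndarray:
    # librosa decodes any supported container, downmixes to mono,
    # resamples to 16 kHz, and returns float32 samples in [-1, 1].
    audio, _ = librosa.load(path, sr=16000, mono=True)
    return audio.astype(np.float32)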

API Endpoints

Method  Endpoint     Description
POST    /transcribe  Transcribe an uploaded audio file

Request

Multipart form data:

POST /transcribe
Content-Type: multipart/form-data

file: <audio_file>
language: en         # optional
model_size: base     # optional (default: "base")

Response

{
  "text": "Hello, this is a test transcription.",
  "duration": 3.5
}
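
Calling it from a script looks like this. This is an illustrative client using httpx, with the host and port assumed (adjust to your deployment):

import httpx

with open("sample.wav", "rb") as f:
    resp = httpx.post(
        "http://localhost:8000/transcribe",      # assumed local dev address
        files={"file": ("sample.wav", f, "audio/wav")},
        data={"language": "en", "model_size": "turbo"},
        timeout=120.0,                           # first call may download the model
    )
resp.raise_for_status()
print(resp.json()["text"])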

Use Cases

Reference Text for Voice Cloning

Adding a voice sample triggers transcription automatically:

  1. User uploads or records audio.
  2. The backend writes the audio file and calls /transcribe internally (or the frontend calls it separately).
  3. The returned text becomes reference_text on the new profile_samples row.
  4. Cloning engines that need reference text (Chatterbox, TADA, etc.) read it from there.
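
Steps 2 and 3 reduce to transcribe-then-store. A sketch with hypothetical names (add_voice_sample and db.insert_profile_sample are illustrative, not real Voicebox functions):

async def add_voice_sample(profile_id: str, audio_path: str) -> None:
    backend = get_stt_backend()
    reference_text = await backend.transcribe(audio_path)   # language hint optional
    # Hypothetical persistence call: stores the text on the new
    # profile_samples row for cloning engines to read later.
    db.insert_profile_sample(profile_id, audio_path=audio_path,
                             reference_text=reference_text)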

Quality Tips

  • Provide a language hint for short clips (under 5 seconds) — auto-detection is unreliable with so little audio.
  • Use Turbo or Large for noisy audio — Base can hallucinate on hard inputs.
  • Prefer clean audio; transcription errors become reference-text errors, which become cloning errors.

Memory Management

unload_model() drops the model reference and clears the CUDA cache if applicable. /models/unload wires this up for manual control.

get_stt_backend() returns a per-backend singleton, so multiple callers share one Whisper instance.
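
On the PyTorch side, that teardown amounts to dropping references and clearing the cache. A sketch continuing the earlier class (the MLX variant skips the CUDA step):

import gc
import torch

# Method on PyTorchSTTBackend, continuing the Model Loading sketch.
def unload_model(self) -> None:
    self._model = None              # drop references so GC can reclaim the weights
    self._processor = None
    self._size = None
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()    # hand cached GPU memory back to the driver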

Error Handling

Error            Cause                        Solution
Model not found  First run + network failure  Retry; check connectivity
OOM on load      Large model on low-VRAM GPU  Switch to Small or Turbo
Empty result     No speech in audio           Confirm input has voice; check trim
Wrong language   Auto-detect misfired         Pass language hint
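
The OOM row, for example, lends itself to a size fallback. A sketch (recent PyTorch raises torch.cuda.OutOfMemoryError, a RuntimeError subclass, on CUDA OOM):

import torch

# Inside an async handler: fall back to Turbo when Large does not fit in VRAM.
backend = get_stt_backend()
try:
    await backend.load_model("large")
except torch.cuda.OutOfMemoryError:
    backend.unload_model()                 # clear the partial load and CUDA cache
    await backend.load_model("turbo")      # Large-tier quality at half the size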

Next Steps