Architecture

System Overview

Voicebox uses a client-server architecture with a React frontend and Python backend. The desktop app is built with Tauri and contains two main layers:

Frontend Layer: A React application that handles the UI components, state management with Zustand, and data fetching with React Query (TanStack Query).

Backend Layer: A Python FastAPI server that hosts the REST API, runs a pluggable registry of TTS and STT engines, manages the SQLite database, and handles audio processing.

These two layers communicate via HTTP on localhost:17493, with the frontend making API requests to the backend. In production the backend is packaged into a standalone binary with PyInstaller and launched as a Tauri sidecar; in development it is run manually via uvicorn.

Frontend Architecture

Tech Stack

  • Framework: React 18 with TypeScript
  • State Management: Zustand stores
  • Data Fetching: React Query (TanStack Query)
  • Styling: Tailwind CSS
  • Audio: WaveSurfer.js
  • Desktop: Tauri (Rust)

Component Structure

Backend Architecture

Tech Stack

  • Framework: FastAPI (Python 3.11+)
  • TTS Engines: Qwen3-TTS, Qwen CustomVoice, LuxTTS, Chatterbox, Chatterbox Turbo, TADA, Kokoro
  • Transcription: Whisper (PyTorch or MLX-Whisper)
  • Inference Backends: MLX (Apple Silicon), PyTorch (CUDA / ROCm / XPU / DirectML / CPU)
  • Database: SQLite via SQLAlchemy
  • Audio: librosa, soundfile, Pedalboard

Layout
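
A sketch of the backend source layout, assembled from the modules referenced throughout this document (not exhaustive):

backend/
  app.py               # FastAPI app factory, CORS, lifecycle
  main.py              # entry point (runs uvicorn)
  server.py            # sidecar launcher + parent-PID watchdog
  routes/
    generate.py        # thin HTTP handlers
  services/
    generation.py      # all generation modes
    task_queue.py      # serial GPU queue
  backends/
    __init__.py        # protocol, ModelConfig registry, factory
    base.py            # shared engine utilities
  database/
    models.py          # SQLAlchemy models
  utils/               # trim, resample, effects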

Request Flow

An HTTP request enters a route handler, which validates input and delegates to a service function. The service looks up the appropriate engine backend through the registry and calls it to run the actual inference. Audio post-processing runs through utils (trim, resample, effects).

Route handlers are intentionally thin — they validate input, delegate to a service function, and format the response. All business logic lives in services/.
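
A minimal sketch of that pattern, assuming a hypothetical request shape and service signature (the real ones live in routes/generate.py and services/generation.py):

from fastapi import APIRouter, HTTPException
from pydantic import BaseModel

from services import generation  # real module per the request flow above

router = APIRouter()

class GenerateRequest(BaseModel):  # fields are illustrative assumptions
    text: str
    profile_id: int
    engine: str

@router.post("/generate")
async def generate_route(req: GenerateRequest):
    # Thin handler: Pydantic already validated the shape; delegate and format.
    try:
        return await generation.generate(
            text=req.text, profile_id=req.profile_id, engine=req.engine
        )
    except ValueError as exc:
        raise HTTPException(status_code=400, detail=str(exc)) from exc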

Multi-Engine Registry

The backend is designed so that adding a new TTS engine only requires touching the backends/ directory and the central registry. There is no per-engine branching in routes or services.

  • TTSBackend Protocol (backends/__init__.py) — defines the contract every engine implements: load_model, create_voice_prompt, combine_voice_prompts, generate, unload_model, is_loaded, _get_model_path.
  • ModelConfig dataclass — central metadata record for each model variant: model_name, display_name, engine, hf_repo_id, size_mb, needs_trim, languages, supports_instruct, etc.
  • TTS_ENGINES dict — maps engine name ("qwen", "kokoro", etc.) to display name.
  • get_tts_backend_for_engine(engine) — thread-safe factory that lazily instantiates and caches the backend for an engine using double-checked locking (sketched below).
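
A condensed sketch of that plumbing with illustrative bodies; the Protocol members shown are a subset of the real contract:

import threading
from typing import Protocol

class TTSBackend(Protocol):  # subset of the contract listed above
    def load_model(self, model_name: str) -> None: ...
    def generate(self, text: str, voice_prompt: object, **kwargs) -> tuple: ...
    def unload_model(self) -> None: ...
    def is_loaded(self) -> bool: ...

_backends: dict[str, TTSBackend] = {}
_lock = threading.Lock()

def _create_backend(engine: str) -> TTSBackend:
    # Hypothetical constructor; the real factory imports engine modules lazily.
    raise NotImplementedError

def get_tts_backend_for_engine(engine: str) -> TTSBackend:
    backend = _backends.get(engine)
    if backend is None:                      # fast path: no lock on a cache hit
        with _lock:
            backend = _backends.get(engine)  # re-check under the lock
            if backend is None:
                backend = _create_backend(engine)
                _backends[engine] = backend
    return backend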

Shipped engines:

Engine key          Display name        Profile type
qwen                Qwen TTS            Cloned
qwen_custom_voice   Qwen CustomVoice    Preset
luxtts              LuxTTS              Cloned
chatterbox          Chatterbox TTS      Cloned
chatterbox_turbo    Chatterbox Turbo    Cloned
tada                TADA                Cloned
kokoro              Kokoro              Preset

See TTS Engines for the full contract and integration phases, and PROJECT_STATUS.md for candidates under evaluation.

Key Modules

  • app.py — FastAPI app factory, CORS, lifecycle events
  • main.py — Entry point (imports app, runs uvicorn)
  • server.py — Tauri sidecar launcher, parent-pid watchdog, frozen-build environment setup
  • services/generation.py — Single function handling all generation modes (generate, retry, regenerate)
  • services/task_queue.py — Serial generation queue for GPU inference
  • backends/__init__.py — Protocol definitions, ModelConfig registry, and engine factory
  • backends/base.py — Shared utilities across all engine implementations (device selection, progress tracking, output trimming)

Inference Backend Selection

The server detects the best inference backend at startup and uses it for all engines that support it:

Platform                      Backend   Acceleration
macOS (Apple Silicon)         MLX       Metal / Neural Engine
Windows / Linux (NVIDIA)      PyTorch   CUDA (cu128)
Linux (AMD)                   PyTorch   ROCm
Windows / Linux (Intel Arc)   PyTorch   XPU (IPEX)
Windows (other GPU)           PyTorch   DirectML
Any                           PyTorch   CPU fallback

See GPU Acceleration for platform-specific notes and manual overrides.
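
A minimal sketch of that detection order, assuming standard availability checks (the shared device-selection code lives in backends/base.py; names here are illustrative):

import platform
import sys

def pick_inference_backend() -> str:
    if sys.platform == "darwin" and platform.machine() == "arm64":
        try:
            import mlx.core  # noqa: F401  # MLX importable -> Metal acceleration
            return "mlx"
        except ImportError:
            pass
    import torch
    if torch.cuda.is_available():            # covers both CUDA and ROCm builds
        return "torch-cuda"
    if hasattr(torch, "xpu") and torch.xpu.is_available():
        return "torch-xpu"                   # Intel Arc (XPU per the table above)
    try:
        import torch_directml
        if torch_directml.is_available():
            return "torch-directml"
    except ImportError:
        pass
    return "torch-cpu"                       # universal fallback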

Data Model

Core tables (see backend/database/models.py):

  • profiles — Voice profiles with voice_type discriminator (cloned | preset | designed), preset_engine, preset_voice_id, and default_engine.
  • profile_samples — Reference audio clips + transcripts for cloned profiles. Empty for preset profiles.
  • generations — Generated audio with text, engine, model, language, seed, and duration.
  • generation_versions — Processed variants of a generation with different effects chains applied.
  • audio_channels + channel_device_mappings + profile_channel_mappings — Multi-output routing.

See Voice Profiles and Effects Pipeline for details.
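
A sketch of the profiles table in SQLAlchemy 2.0 style, using only the columns named above; types, nullability, and the name column are assumptions, not the actual schema:

from sqlalchemy import Integer, String
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column

class Base(DeclarativeBase):
    pass

class Profile(Base):
    __tablename__ = "profiles"

    id: Mapped[int] = mapped_column(Integer, primary_key=True)
    name: Mapped[str] = mapped_column(String)
    voice_type: Mapped[str] = mapped_column(String)  # "cloned" | "preset" | "designed"
    preset_engine: Mapped[str | None] = mapped_column(String, nullable=True)
    preset_voice_id: Mapped[str | None] = mapped_column(String, nullable=True)
    default_engine: Mapped[str | None] = mapped_column(String, nullable=True)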

Desktop App (Tauri)

Rust Backend

Responsibilities

  • Launch Python backend as sidecar process
  • Native file dialogs
  • System tray integration
  • Auto-updates (Tauri updater + custom CUDA backend swap)
  • Parent-PID watchdog so the backend exits if the app crashes

Build Process

Development

just dev              # Starts backend + Tauri app
just dev-web          # Starts backend + web app (no Tauri)
just dev-backend      # Backend only
just dev-frontend     # Tauri app only (backend must be running)

Production

just build            # CPU server binary + Tauri installer
just build-local      # CPU + CUDA binaries + Tauri installer (Windows)
just build-server     # Server binary only
just build-tauri      # Tauri app only

See Building for what PyInstaller does and how the CUDA binary is split and packaged separately.

Data Flow

Generation Flow

  1. User Input — text entered in a React component, engine + profile selected
  2. State Update — Zustand generation form store records the request
  3. API Request — React Query mutation hits POST /generate
  4. Route — routes/generate.py validates input, dispatches to services/generation.py
  5. Voice Prompt — the service creates or retrieves a cached voice prompt via the engine's backend
  6. Queue — services/task_queue.py serializes generation to avoid GPU contention (sketched after this list)
  7. Inference — the engine backend runs generate() and returns audio + sample rate
  8. Post-process — optional trim (for engines that need it), effects chain applied per generation version
  9. Storage — audio written to the generations directory, metadata saved to SQLite
  10. Response — backend returns the generation record; frontend updates React Query cache and plays audio
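
A minimal sketch of step 6, assuming a plain asyncio.Lock (the real services/task_queue.py may use a proper worker queue):

import asyncio

_gpu_lock = asyncio.Lock()

async def run_generation(generate_fn, *args, **kwargs):
    # One job on the GPU at a time; the blocking inference call itself
    # runs in a worker thread so the event loop stays responsive.
    async with _gpu_lock:
        return await asyncio.to_thread(generate_fn, *args, **kwargs)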

Performance Considerations

Frontend

  • Code splitting — lazy-load routes
  • Memoization — React.memo for heavy components
  • Virtual scrolling — for large lists
  • Debouncing — search and input handling

Backend

  • Async I/O — all I/O is async; inference runs in asyncio.to_thread
  • Serial task queue — avoids multiple engines fighting for the GPU
  • Voice prompt caching — engine-specific, keyed by audio hash + reference text (see the sketch after this list)
  • Model pinning — only one model per engine loaded at a time; switching unloads the previous one
  • Per-engine backend cache — engines are only instantiated once per process
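
A minimal sketch of such a cache key, assuming SHA-256 over the clip bytes plus the transcript; the actual scheme is engine-specific:

import hashlib

def voice_prompt_cache_key(audio_bytes: bytes, reference_text: str) -> str:
    h = hashlib.sha256()
    h.update(audio_bytes)                     # content hash of the reference clip
    h.update(reference_text.encode("utf-8"))  # transcript participates in the key
    return h.hexdigest()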

Security

Current

  • Local-only by default (bound to 127.0.0.1:17493)
  • No authentication (localhost trust)
  • File system sandboxing via Tauri

Planned

  • API key authentication for remote mode
  • User accounts
  • Rate limiting
  • HTTPS support

Deployment Modes

Local Mode

  • Backend runs as sidecar
  • All data stays on device
  • No network required

Remote Mode

  • Backend on a separate machine (Docker or bare host)
  • Frontend (desktop or web) connects over HTTP
  • See Remote Mode and Docker

Next Steps