System Overview
Voicebox uses a client-server architecture with a React frontend and Python backend. The desktop app is built with Tauri and contains two main layers:
Frontend Layer: A React application that handles the UI components, state management with Zustand, and data fetching with React Query (TanStack Query).
Backend Layer: A Python FastAPI server that hosts the REST API, runs a pluggable registry of TTS and STT engines, manages the SQLite database, and handles audio processing.
These two layers communicate via HTTP on localhost:17493, with the frontend making API requests to the backend. In production the backend is compiled with PyInstaller and launched as a Tauri sidecar; in development it's run manually via uvicorn.
Frontend Architecture
Tech Stack
- Framework: React 18 with TypeScript
- State Management: Zustand stores
- Data Fetching: React Query (TanStack Query)
- Styling: Tailwind CSS
- Audio: WaveSurfer.js
- Desktop: Tauri (Rust)
Component Structure
Backend Architecture
Tech Stack
- Framework: FastAPI (Python 3.11+)
- TTS Engines: Qwen3-TTS, Qwen CustomVoice, LuxTTS, Chatterbox, Chatterbox Turbo, TADA, Kokoro
- Transcription: Whisper (PyTorch or MLX-Whisper)
- Inference Backends: MLX (Apple Silicon), PyTorch (CUDA / ROCm / XPU / DirectML / CPU)
- Database: SQLite via SQLAlchemy
- Audio: librosa, soundfile, Pedalboard
Layout
Request Flow
An HTTP request enters a route handler, which validates input and delegates to a service function. The service calls into the appropriate engine backend via the registry, which runs the actual inference; audio post-processing (trim, resample, effects) runs through utils.
Route handlers are intentionally thin: they validate, delegate, and format the response. All business logic lives in services/.
Multi-Engine Registry
The backend is designed so that adding a new TTS engine only requires touching the backends/ directory and the central registry. There is no per-engine branching in routes or services.
- `TTSBackendProtocol` (`backends/__init__.py`) — defines the contract every engine implements: `load_model`, `create_voice_prompt`, `combine_voice_prompts`, `generate`, `unload_model`, `is_loaded`, `_get_model_path`.
- `ModelConfig` dataclass — central metadata record for each model variant: `model_name`, `display_name`, `engine`, `hf_repo_id`, `size_mb`, `needs_trim`, `languages`, `supports_instruct`, etc.
- `TTS_ENGINES` dict — maps engine name (`"qwen"`, `"kokoro"`, etc.) to display name.
- `get_tts_backend_for_engine(engine)` — thread-safe factory that lazily instantiates and caches the backend for an engine using double-checked locking.
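As a sketch, the protocol and config record might look like this. Member names come from the contract above, but the signatures, defaults, and abridged field set are assumptions:

```python
from dataclasses import dataclass, field
from typing import Protocol, runtime_checkable

@dataclass
class ModelConfig:
    # Central metadata record for one model variant (fields abridged).
    model_name: str
    display_name: str
    engine: str
    hf_repo_id: str
    size_mb: int = 0
    needs_trim: bool = False
    languages: list[str] = field(default_factory=lambda: ["en"])
    supports_instruct: bool = False

@runtime_checkable
class TTSBackendProtocol(Protocol):
    # Contract every engine implements; signatures are illustrative.
    def load_model(self, model_name: str) -> None: ...
    def create_voice_prompt(self, audio_path: str, ref_text: str) -> object: ...
    def generate(self, text: str, prompt: object) -> tuple[bytes, int]: ...
    def unload_model(self) -> None: ...
    def is_loaded(self) -> bool: ...
```

Because it is a structural protocol, an engine class never inherits from it; implementing the methods is enough for type checkers (and `isinstance` via `runtime_checkable`) to accept it.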
Shipped engines:
| Engine key | Display name | Profile type |
|---|---|---|
| `qwen` | Qwen TTS | Cloned |
| `qwen_custom_voice` | Qwen CustomVoice | Preset |
| `luxtts` | LuxTTS | Cloned |
| `chatterbox` | Chatterbox TTS | Cloned |
| `chatterbox_turbo` | Chatterbox Turbo | Cloned |
| `tada` | TADA | Cloned |
| `kokoro` | Kokoro | Preset |
See TTS Engines for the full contract and integration phases, and PROJECT_STATUS.md for candidates under evaluation.
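The double-checked-locking factory described above can be illustrated like this; `make_backend` is a hypothetical stand-in for the real engine classes in backends/:

```python
import threading

_backends: dict[str, object] = {}
_lock = threading.Lock()

def make_backend(engine: str) -> object:
    # Stand-in for instantiating the real engine backend class.
    return {"engine": engine}

def get_tts_backend_for_engine(engine: str) -> object:
    # Double-checked locking: lock-free fast path for the common case,
    # then re-check under the lock before instantiating, so each engine
    # backend is created exactly once per process.
    backend = _backends.get(engine)
    if backend is None:
        with _lock:
            backend = _backends.get(engine)
            if backend is None:
                backend = make_backend(engine)
                _backends[engine] = backend
    return backend
```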
Key Modules
- `app.py` — FastAPI app factory, CORS, lifecycle events
- `main.py` — Entry point (imports app, runs uvicorn)
- `server.py` — Tauri sidecar launcher, parent-pid watchdog, frozen-build environment setup
- `services/generation.py` — Single function handling all generation modes (generate, retry, regenerate)
- `services/task_queue.py` — Serial generation queue for GPU inference
- `backends/__init__.py` — Protocol definitions, `ModelConfig` registry, and engine factory
- `backends/base.py` — Shared utilities across all engine implementations (device selection, progress tracking, output trimming)
Inference Backend Selection
The server detects the best inference backend at startup and uses it for all engines that support it:
| Platform | Backend | Acceleration |
|---|---|---|
| macOS (Apple Silicon) | MLX | Metal / Neural Engine |
| Windows / Linux (NVIDIA) | PyTorch | CUDA (cu128) |
| Linux (AMD) | PyTorch | ROCm |
| Windows / Linux (Intel Arc) | PyTorch | XPU (IPEX) |
| Windows (other GPU) | PyTorch | DirectML |
| Any | PyTorch | CPU fallback |
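A simplified sketch of the detection order; the real startup probe also checks ROCm, XPU, and DirectML, which are omitted here:

```python
import platform

def detect_backend() -> tuple[str, str]:
    """Pick (framework, device) roughly in the table's priority order.

    Illustrative only: ROCm, XPU, and DirectML probes are omitted.
    """
    # Apple Silicon gets MLX (Metal / Neural Engine).
    if platform.system() == "Darwin" and platform.machine() == "arm64":
        return ("mlx", "metal")
    try:
        import torch
        if torch.cuda.is_available():
            return ("pytorch", "cuda")
    except ImportError:
        pass
    # Last resort: PyTorch on CPU.
    return ("pytorch", "cpu")
```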
See GPU Acceleration for platform-specific notes and manual overrides.
Data Model
Core tables (see backend/database/models.py):
- `profiles` — Voice profiles with a `voice_type` discriminator (`cloned` | `preset` | `designed`), `preset_engine`, `preset_voice_id`, and `default_engine`.
- `profile_samples` — Reference audio clips + transcripts for cloned profiles. Empty for preset profiles.
- `generations` — Generated audio with text, engine, model, language, seed, and duration.
- `generation_versions` — Processed variants of a generation with different effects chains applied.
- `audio_channels` + `channel_device_mappings` + `profile_channel_mappings` — Multi-output routing.
See Voice Profiles and Effects Pipeline for details.
Desktop App (Tauri)
Rust Backend
Responsibilities
- Launch Python backend as sidecar process
- Native file dialogs
- System tray integration
- Auto-updates (Tauri updater + custom CUDA backend swap)
- Parent-PID watchdog so the backend exits if the app crashes
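The parent-PID watchdog idea can be sketched as follows. This uses a POSIX-style existence check; the actual probe in server.py may differ, especially on Windows, where `os.kill` does not support this pattern:

```python
import os
import threading
import time

def watch_parent(parent_pid: int, interval: float = 2.0) -> threading.Thread:
    """Exit if the parent (Tauri) process disappears.

    Sketch only: signal 0 is a POSIX existence check that sends nothing.
    """
    def _loop() -> None:
        while True:
            try:
                os.kill(parent_pid, 0)  # raises OSError if parent is gone
            except OSError:
                os._exit(0)  # parent died: hard-exit the backend
            time.sleep(interval)

    t = threading.Thread(target=_loop, daemon=True)
    t.start()
    return t
```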
Build Process
Development
```shell
just dev           # Starts backend + Tauri app
just dev-web       # Starts backend + web app (no Tauri)
just dev-backend   # Backend only
just dev-frontend  # Tauri app only (backend must be running)
```
Production
```shell
just build         # CPU server binary + Tauri installer
just build-local   # CPU + CUDA binaries + Tauri installer (Windows)
just build-server  # Server binary only
just build-tauri   # Tauri app only
```
See Building for what PyInstaller does and how the CUDA binary is split and packaged separately.
Data Flow
Generation Flow
- User Input — text entered in a React component, engine + profile selected
- State Update — Zustand generation form store records the request
- API Request — React Query mutation hits `POST /generate`
- Route — `routes/generate.py` validates input, dispatches to `services/generation.py`
- Voice Prompt — the service creates or retrieves a cached voice prompt via the engine's backend
- Queue — `services/task_queue.py` serializes generation to avoid GPU contention
- Inference — the engine backend runs `generate()` and returns audio + sample rate
- Post-process — optional trim (for engines that need it), effects chain applied per generation version
- Storage — audio written to the generations directory, metadata saved to SQLite
- Response — backend returns the generation record; frontend updates the React Query cache and plays audio
Performance Considerations
Frontend
- Code splitting — lazy-load routes
- Memoization — `React.memo` for heavy components
- Virtual scrolling — for large lists
- Debouncing — search and input handling
Backend
- Async I/O — all I/O is async; inference runs in `asyncio.to_thread`
- Serial task queue — avoids multiple engines fighting for the GPU
- Voice prompt caching — engine-specific, keyed by audio hash + reference text
- Model pinning — only one model per engine loaded at a time; switching unloads the previous one
- Per-engine backend cache — engines are only instantiated once per process
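The voice-prompt cache key described above could be derived like this; the hash inputs match the description, but the exact scheme is an assumption:

```python
import hashlib

def voice_prompt_cache_key(audio_bytes: bytes, ref_text: str) -> str:
    """Cache key for a voice prompt: digest of the reference audio
    plus its transcript, so either changing invalidates the entry."""
    h = hashlib.sha256()
    h.update(audio_bytes)
    h.update(ref_text.encode("utf-8"))
    return h.hexdigest()
```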
Security
Current
- Local-only by default (bound to `127.0.0.1:17493`)
- No authentication (localhost trust)
- File system sandboxing via Tauri
Planned
- API key authentication for remote mode
- User accounts
- Rate limiting
- HTTPS support
Deployment Modes
Local Mode
- Backend runs as sidecar
- All data stays on device
- No network required
Remote Mode
- Backend on a separate machine (Docker or bare host)
- Frontend (desktop or web) connects over HTTP
- See Remote Mode and Docker
Next Steps
Set up your dev environment
How to add a new engine
Contribute to Voicebox