Architecture

System Overview

Voicebox uses a client-server architecture with a React frontend and Python backend. The desktop app is built with Tauri and contains two main layers:

Frontend Layer: A React application that handles the UI components, state management with Zustand, and data fetching with React Query (TanStack Query).

Backend Layer: A Python FastAPI server that hosts the REST API, runs a pluggable registry of TTS and STT engines, manages the SQLite database, and handles audio processing.

These two layers communicate via HTTP on localhost:17493, with the frontend making API requests to the backend. In production the backend is packaged into a standalone binary with PyInstaller and launched as a Tauri sidecar; in development it is run manually via uvicorn.

Frontend Architecture

Tech Stack

  • Framework: React 18 with TypeScript
  • State Management: Zustand stores
  • Data Fetching: React Query (TanStack Query)
  • Styling: Tailwind CSS
  • Audio: WaveSurfer.js
  • Desktop: Tauri (Rust)

Component Structure

Backend Architecture

Tech Stack

  • Framework: FastAPI (Python 3.11+)
  • TTS Engines: Qwen3-TTS, Qwen CustomVoice, LuxTTS, Chatterbox, Chatterbox Turbo, TADA, Kokoro
  • Transcription: Whisper (PyTorch or MLX-Whisper)
  • Inference Backends: MLX (Apple Silicon), PyTorch (CUDA / ROCm / XPU / DirectML / CPU)
  • Database: SQLite via SQLAlchemy
  • Audio: librosa, soundfile, Pedalboard

Layout
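
A sketch of the backend source layout, assembled from the modules referenced throughout this document (not exhaustive):

backend/
  app.py               # FastAPI app factory, CORS, lifecycle
  main.py              # entry point (runs uvicorn)
  server.py            # sidecar launcher + parent-PID watchdog
  routes/
    generate.py        # thin HTTP handlers
  services/
    generation.py      # all generation modes
    task_queue.py      # serial GPU queue
  backends/
    __init__.py        # protocol, ModelConfig registry, factory
    base.py            # shared engine utilities
  database/
    models.py          # SQLAlchemy models
  utils/               # trim, resample, effects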

Request Flow

An HTTP request enters a route handler, which validates input and delegates to a service function. The service looks up the appropriate engine backend through the registry and calls it to run the actual inference. Audio post-processing runs through utils (trim, resample, effects).

Route handlers are intentionally thin — they validate input, delegate to a service function, and format the response. All business logic lives in services/.
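
A minimal sketch of that pattern, assuming a hypothetical request shape and service signature (the real ones live in routes/generate.py and services/generation.py):

from fastapi import APIRouter, HTTPException
from pydantic import BaseModel

from services import generation  # real module per the request flow above

router = APIRouter()

class GenerateRequest(BaseModel):  # fields are illustrative assumptions
    text: str
    profile_id: int
    engine: str

@router.post("/generate")
async def generate_route(req: GenerateRequest):
    # Thin handler: Pydantic already validated the shape; delegate and format.
    try:
        return await generation.generate(
            text=req.text, profile_id=req.profile_id, engine=req.engine
        )
    except ValueError as exc:
        raise HTTPException(status_code=400, detail=str(exc)) from exc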

Multi-Engine Registry

The backend is designed so that adding a new TTS engine only requires touching the backends/ directory and the central registry. There is no per-engine branching in routes or services.

  • TTSBackend Protocol (backends/__init__.py) — defines the contract every engine implements: load_model, create_voice_prompt, combine_voice_prompts, generate, unload_model, is_loaded, _get_model_path.
  • ModelConfig dataclass — central metadata record for each model variant: model_name, display_name, engine, hf_repo_id, size_mb, needs_trim, languages, supports_instruct, etc.
  • TTS_ENGINES dict — maps engine name ("qwen", "kokoro", etc.) to display name.
  • get_tts_backend_for_engine(engine) — thread-safe factory that lazily instantiates and caches the backend for an engine using double-checked locking (sketched below).
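
A condensed sketch of that plumbing with illustrative bodies; the Protocol members shown are a subset of the real contract:

import threading
from typing import Protocol

class TTSBackend(Protocol):  # subset of the contract listed above
    def load_model(self, model_name: str) -> None: ...
    def generate(self, text: str, voice_prompt: object, **kwargs) -> tuple: ...
    def unload_model(self) -> None: ...
    def is_loaded(self) -> bool: ...

_backends: dict[str, TTSBackend] = {}
_lock = threading.Lock()

def _create_backend(engine: str) -> TTSBackend:
    # Hypothetical constructor; the real factory imports engine modules lazily.
    raise NotImplementedError

def get_tts_backend_for_engine(engine: str) -> TTSBackend:
    backend = _backends.get(engine)
    if backend is None:                      # fast path: no lock on a cache hit
        with _lock:
            backend = _backends.get(engine)  # re-check under the lock
            if backend is None:
                backend = _create_backend(engine)
                _backends[engine] = backend
    return backend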

Shipped engines:

Engine key          Display name        Profile type
qwen                Qwen TTS            Cloned
qwen_custom_voice   Qwen CustomVoice    Preset
luxtts              LuxTTS              Cloned
chatterbox          Chatterbox TTS      Cloned
chatterbox_turbo    Chatterbox Turbo    Cloned
tada                TADA                Cloned
kokoro              Kokoro              Preset

See TTS Engines for the full contract and integration phases, and PROJECT_STATUS.md for candidates under evaluation.

Key Modules

  • app.py — FastAPI app factory, CORS, lifecycle events
  • main.py — Entry point (imports app, runs uvicorn)
  • server.py — Tauri sidecar launcher, parent-pid watchdog, frozen-build environment setup
  • services/generation.py — Single function handling all generation modes (generate, retry, regenerate)
  • services/task_queue.py — Serial generation queue for GPU inference
  • backends/__init__.py — Protocol definitions, ModelConfig registry, and engine factory
  • backends/base.py — Shared utilities across all engine implementations (device selection, progress tracking, output trimming)

Inference Backend Selection

The server detects the best inference backend at startup and uses it for all engines that support it:

Platform                      Backend   Acceleration
macOS (Apple Silicon)         MLX       Metal / Neural Engine
Windows / Linux (NVIDIA)      PyTorch   CUDA (cu128)
Linux (AMD)                   PyTorch   ROCm
Windows / Linux (Intel Arc)   PyTorch   XPU (IPEX)
Windows (other GPU)           PyTorch   DirectML
Any                           PyTorch   CPU fallback

See GPU Acceleration for platform-specific notes and manual overrides.
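
A minimal sketch of that detection order, assuming standard availability checks (the shared device-selection code lives in backends/base.py; names here are illustrative):

import platform
import sys

def pick_inference_backend() -> str:
    if sys.platform == "darwin" and platform.machine() == "arm64":
        try:
            import mlx.core  # noqa: F401  # MLX importable -> Metal acceleration
            return "mlx"
        except ImportError:
            pass
    import torch
    if torch.cuda.is_available():            # covers both CUDA and ROCm builds
        return "torch-cuda"
    if hasattr(torch, "xpu") and torch.xpu.is_available():
        return "torch-xpu"                   # Intel Arc (XPU per the table above)
    try:
        import torch_directml
        if torch_directml.is_available():
            return "torch-directml"
    except ImportError:
        pass
    return "torch-cpu"                       # universal fallback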

Data Model

Core tables (see backend/database/models.py):

  • profiles — Voice profiles with voice_type discriminator (cloned | preset | designed), preset_engine, preset_voice_id, and default_engine.
  • profile_samples — Reference audio clips + transcripts for cloned profiles. Empty for preset profiles.
  • generations — Generated audio with text, engine, model, language, seed, and duration.
  • generation_versions — Processed variants of a generation with different effects chains applied.
  • audio_channels + channel_device_mappings + profile_channel_mappings — Multi-output routing.

See Voice Profiles and Effects Pipeline for details.
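
A sketch of the profiles table in SQLAlchemy 2.0 style, using only the columns named above; types, nullability, and the name column are assumptions, not the actual schema:

from sqlalchemy import Integer, String
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column

class Base(DeclarativeBase):
    pass

class Profile(Base):
    __tablename__ = "profiles"

    id: Mapped[int] = mapped_column(Integer, primary_key=True)
    name: Mapped[str] = mapped_column(String)
    voice_type: Mapped[str] = mapped_column(String)  # "cloned" | "preset" | "designed"
    preset_engine: Mapped[str | None] = mapped_column(String, nullable=True)
    preset_voice_id: Mapped[str | None] = mapped_column(String, nullable=True)
    default_engine: Mapped[str | None] = mapped_column(String, nullable=True)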

Desktop App (Tauri)

Rust Backend

Responsibilities

  • Launch Python backend as sidecar process
  • Native file dialogs
  • System tray integration
  • Auto-updates (Tauri updater + custom CUDA backend swap)
  • Parent-PID watchdog so the backend exits if the app crashes

Build Process

Development

just dev              # Starts backend + Tauri app
just dev-web          # Starts backend + web app (no Tauri)
just dev-backend      # Backend only
just dev-frontend     # Tauri app only (backend must be running)

Production

just build            # CPU server binary + Tauri installer
just build-local      # CPU + CUDA binaries + Tauri installer (Windows)
just build-server     # Server binary only
just build-tauri      # Tauri app only

See Building for what PyInstaller does and how the CUDA binary is split and packaged separately.

Data Flow

Generation Flow

  1. User Input — text entered in a React component, engine + profile selected
  2. State Update — Zustand generation form store records the request
  3. API Request — React Query mutation hits POST /generate
  4. Route — routes/generate.py validates input, dispatches to services/generation.py
  5. Voice Prompt — the service creates or retrieves a cached voice prompt via the engine's backend
  6. Queue — services/task_queue.py serializes generation to avoid GPU contention (sketched after this list)
  7. Inference — the engine backend runs generate() and returns audio + sample rate
  8. Post-process — optional trim (for engines that need it), effects chain applied per generation version
  9. Storage — audio written to the generations directory, metadata saved to SQLite
  10. Response — backend returns the generation record; frontend updates React Query cache and plays audio
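
A minimal sketch of step 6, assuming a plain asyncio.Lock (the real services/task_queue.py may use a proper worker queue):

import asyncio

_gpu_lock = asyncio.Lock()

async def run_generation(generate_fn, *args, **kwargs):
    # One job on the GPU at a time; the blocking inference call itself
    # runs in a worker thread so the event loop stays responsive.
    async with _gpu_lock:
        return await asyncio.to_thread(generate_fn, *args, **kwargs)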

Performance Considerations

Frontend

  • Code splitting — lazy-load routes
  • Memoization — React.memo for heavy components
  • Virtual scrolling — for large lists
  • Debouncing — search and input handling

Backend

  • Async I/O — all I/O is async; inference runs in asyncio.to_thread
  • Serial task queue — avoids multiple engines fighting for the GPU
  • Voice prompt caching — engine-specific, keyed by audio hash + reference text (see the sketch after this list)
  • Model pinning — only one model per engine loaded at a time; switching unloads the previous one
  • Per-engine backend cache — engines are only instantiated once per process
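
A minimal sketch of such a cache key, assuming SHA-256 over the clip bytes plus the transcript; the actual scheme is engine-specific:

import hashlib

def voice_prompt_cache_key(audio_bytes: bytes, reference_text: str) -> str:
    h = hashlib.sha256()
    h.update(audio_bytes)                     # content hash of the reference clip
    h.update(reference_text.encode("utf-8"))  # transcript participates in the key
    return h.hexdigest()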

Security

Current

  • Local-only by default (bound to 127.0.0.1:17493)
  • No authentication (localhost trust)
  • File system sandboxing via Tauri

Planned

  • API key authentication for remote mode
  • User accounts
  • Rate limiting
  • HTTPS support

Deployment Modes

Local Mode

  • Backend runs as sidecar
  • All data stays on device
  • No network required

Remote Mode

  • Backend on a separate machine (Docker or bare host)
  • Frontend (desktop or web) connects over HTTP
  • See Remote Mode and Docker

Next Steps