PROJECT_STATUS

Voicebox Project Status & Roadmap

Last updated: 2026-04-18 | Current version: v0.4.1 | 232 open issues | 12 open PRs


Table of Contents

  1. Architecture Overview
  2. Current State
  3. Open PRs — Triage & Analysis
  4. Open Issues — Categorized
  5. Existing Plan Documents — Status
  6. New Model Integration — Landscape
  7. Architectural Bottlenecks
  8. Recommended Priorities

Architecture Overview

Tauri shell (Rust) hosts a React frontend (app/) that talks over HTTP on localhost:17493 to a FastAPI backend (backend/).

The backend exposes:

  • TTSBackend Protocol with seven concrete engine implementations:
    • Qwen3-TTS (PyTorch or MLX depending on platform)
    • Qwen CustomVoice (predefined speakers with instruct)
    • LuxTTS (fast, CPU-friendly)
    • Chatterbox Multilingual (23 languages)
    • Chatterbox Turbo (English, paralinguistic tags)
    • TADA (1B English, 3B multilingual via HumeAI)
    • Kokoro 82M (pre-built voices, CPU realtime)
  • STTBackend Protocol for Whisper (PyTorch or MLX-Whisper)
  • Profiles / History / Stories services for persistence and timeline editing

Key Files

Layer File Purpose
Backend entry backend/main.py FastAPI app, all API routes (~2850 lines)
TTS protocol backend/backends/__init__.py:32-101 TTSBackend Protocol definition
Model registry backend/backends/__init__.py:17-29,153-366 ModelConfig dataclass + registry helpers
TTS factory backend/backends/__init__.py:382-426 Thread-safe engine registry (double-checked locking)
PyTorch TTS backend/backends/pytorch_backend.py Qwen3-TTS via qwen_tts package
MLX TTS backend/backends/mlx_backend.py Qwen3-TTS via mlx_audio.tts
LuxTTS backend/backends/luxtts_backend.py LuxTTS — fast, CPU-friendly
Chatterbox MTL backend/backends/chatterbox_backend.py Chatterbox Multilingual — 23 languages
Chatterbox Turbo backend/backends/chatterbox_turbo_backend.py Chatterbox Turbo — English, paralinguistic tags
TADA backend/backends/hume_backend.py HumeAI TADA — 1B English + 3B Multilingual
Kokoro backend/backends/kokoro_backend.py Kokoro 82M — CPU realtime, pre-built voices
Qwen CustomVoice backend/backends/qwen_custom_voice_backend.py Qwen CustomVoice — predefined speakers with instruct
Platform detect backend/platform_detect.py Apple Silicon → MLX, else → PyTorch
API types backend/models.py Pydantic request/response models
HF progress backend/utils/hf_progress.py HFProgressTracker (tqdm patching for download progress)
Audio utils backend/utils/audio.py trim_tts_output(), normalize, load/save audio
Frontend API app/src/lib/api/client.ts Hand-written fetch wrapper
Frontend types app/src/lib/api/types.ts TypeScript API types
Engine selector app/src/components/Generation/EngineModelSelector.tsx Shared engine/model dropdown
Generation form app/src/components/Generation/GenerationForm.tsx TTS generation UI
Floating gen box app/src/components/Generation/FloatingGenerateBox.tsx Compact generation UI
Model manager app/src/components/ServerSettings/ModelManagement.tsx Model download/status/progress UI
GPU acceleration app/src/components/ServerSettings/GpuAcceleration.tsx CUDA backend swap UI
Gen form hook app/src/lib/hooks/useGenerationForm.ts Form validation + submission
Language constants app/src/lib/constants/languages.ts Per-engine language maps

How TTS Generation Works (Current Flow)

POST /generate
  1. Look up voice profile from DB
  2. Resolve engine from request (qwen | qwen_custom_voice | luxtts | chatterbox | chatterbox_turbo | tada | kokoro)
  3. Get backend: get_tts_backend_for_engine(engine)  # thread-safe singleton per engine
  4. Check model cache → if missing, trigger background download, return HTTP 202
  5. Load model (lazy): tts_backend.load_model(model_size)
  6. Create voice prompt: profiles.create_voice_prompt_for_profile(engine=engine)
   → tts_backend.create_voice_prompt(audio_path, reference_text)
  7. Generate: tts_backend.generate(text, voice_prompt, language, seed, instruct)
  8. Post-process: trim_tts_output() for Chatterbox engines
  9. Save WAV → data/generations/{id}.wav
  10. Insert history record in SQLite
  11. Return GenerationResponse

Current State

What's Shipped (v0.4.x)

New since v0.3.0:

  • Kokoro 82M TTS engine + voice profile type system (PR #325)
  • Qwen CustomVoice preset engine — predefined speakers with instruct support (PR #328)
  • Intel Arc (XPU) GPU support (PR #320)
  • Blackwell GPU (sm_120) CUDA support (PR #401)
  • Generation cancellation flow (PR #444)
  • Frontend quality gates + TypeScript hardening (PR #418)
  • macOS Intel (x86_64) PyTorch compatibility (PR #416)
  • Frozen-binary import fixes for Kokoro / Chatterbox Multilingual / scipy / transformers (PR #438)
  • Linux PipeWire/PulseAudio monitor detection (PR #457)
  • Server survives GUI close on Windows (PR #402)
  • GPU arch compatibility warning on startup (catches unsupported PyTorch builds)
  • cpal Stream playback reliability (PR #405), clip-splitting stability (PR #403)
  • torch.from_numpy crash with numpy 2.x in frozen binary (PR #361)
  • Async CUDA download lock (PR #428), NUMBA_CACHE_DIR env var (PR #425)
  • "Clear failed" history button (PR #412)
  • External server GUI startup + data refresh (PR #319)
  • Force offline mode for cached Qwen/Whisper models (PR #318)
  • macOS 11 ScreenCaptureKit launch crash fix (PR #424)

Core TTS (cumulative):

  • Qwen3-TTS voice cloning (1.7B and 0.6B models, MLX + PyTorch)
  • Qwen CustomVoice (preset speakers, instruct)
  • LuxTTS — fast, CPU-friendly English TTS (PR #254)
  • Chatterbox Multilingual — 23 languages including Hebrew (PR #257)
  • Chatterbox Turbo — paralinguistic tags, low latency English (PR #258)
  • HumeAI TADA — 1B English + 3B Multilingual (PR #296)
  • Kokoro 82M — CPU-realtime, 8 languages, Apache 2.0 (PR #325)
  • Multi-engine architecture with thread-safe backend registry (PR #254)
  • Chunked TTS generation — engine-agnostic, removes ~500 char limit (PR #266)
  • Async generation queue (PR #269)
  • Post-processing audio effects system (PR #271)
  • Voice profile type system (preset vs cloned, engine compatibility gating)
  • Centralized ModelConfig registry — no per-engine dispatch maps
  • Shared EngineModelSelector component

Infrastructure (cumulative):

  • CUDA backend swap via binary download (PR #252), cu128 upgrade (PR #316), Blackwell/sm_120 (PR #401)
  • CUDA backend split into independently versioned server + libs archives (PR #298)
  • Intel Arc XPU support (PR #320)
  • Docker + web deployment (PR #161)
  • Backend refactor: modular architecture, style guide, tooling (PR #285)
  • Settings overhaul: routed sub-tabs, server logs, changelog, about page (PR #294)
  • Windows support: CUDA detection, cross-platform justfile, server lifecycle (PR #272, #402)
  • Linux audio capture via pactl monitor detection (PR #457)
  • macOS Intel x86_64 compatibility (PR #416)
  • Voice profiles with multi-sample support
  • Stories editor (multi-track DAW timeline)
  • Whisper transcription (base, small, medium, large, turbo variants)
  • Model management UI with inline download progress + folder migration (PR #268)
  • Download cancel/clear UI with error panel (PR #238)
  • Generation history with caching and cancellation (PR #444)
  • Streaming generation endpoint (MLX only)
  • Audio player freeze fix + UX improvements (PR #293)
  • CORS restriction to known local origins (PR #88)

Abandoned / Backlogged Integrations

Model PR / Branch Reason
CosyVoice2/3 PR #311 Output quality too poor. Heavy deps, no PyPI, needed 5+ shims. PR should be closed.
VoxCPM 1.5 / VoxCPM2 voicebox-new-models research (2026-04-18) Backlogged. See detailed analysis below.

VoxCPM — Evaluation Notes (2026-04-18)

Project: OpenBMB/VoxCPM — tokenizer-free TTS, 2B params (VoxCPM2), end-to-end diffusion autoregressive architecture, 30 languages, 48 kHz output, Apache 2.0, pip install voxcpm.

Why it looked interesting:

  • Clean PyPI install (pip install voxcpm)
  • Apache 2.0 — commercially safe
  • Voice cloning via reference_wav_path with optional prompt_wav_path + prompt_text for "ultimate" cloning
  • Streaming API via generate_streaming()
  • Zero-shot cloning + style control via parenthetical prefixes in text ((slightly faster, cheerful tone)...)
  • Relatively high-quality output per demos

Why we backlogged it:

  • Effectively CUDA-only. README states CUDA ≥ 12.0 as hard requirement. Source code's from_pretrained(device=None|"auto") claims "preferring CUDA, then MPS, then CPU," but in practice:
    • MPS (Apple Silicon) broken upstream — OpenBMB/VoxCPM issues #232 (NotImplementedError: Output channels > 65536 not supported at the MPS device) and #248 (IndexError on M3 Mac) are both open with no resolution.
    • CPU unsupported in the Python package — issue #256 shows voxcpm --device cpu rejected with unrecognized arguments. The only CPU path is the third-party VoxCPM.cpp GGML engine, which is a separate ecosystem project, not pip install voxcpm.
    • macOS source install fails — issue #233 open with no resolution.
  • Would require CUDA-only gating in UI (new requires_cuda flag on ModelConfig, lock icon + "Requires NVIDIA GPU" in ModelManagement.tsx / EngineModelSelector.tsx) plus a hard error at load_model() as safety net. Doable but adds first-class platform gating that doesn't exist for any other engine today.
  • Voicebox's user base skews Apple Silicon (MLX is a primary backend). Shipping a CUDA-only model sets a precedent worth a separate scoping discussion (see issues #419 engine sprawl, #420 platform tiers, PR #465).

What would change the decision:

  • Upstream fixes MPS crashes (watch issues #232, #248).
  • We define an "experimental / CUDA-only" engine tier as part of issue #419 / PR #465, and decide it's acceptable to ship engines that are hidden on non-NVIDIA platforms.
  • VoxCPM.cpp matures into a viable CPU path we can wrap (currently separate project, C++/GGML, unclear ergonomics).

Integration shape if we revive it: Zero-shot cloning maps naturally to the Chatterbox-style backend (store ref_audio + ref_text paths in the voice prompt dict, process at generate time). Est. ~250 lines for voxcpm_backend.py + one ModelConfig entry + engine registration in backends/__init__.py. Frontend UI gating is the bigger lift.

What's In-Flight

Feature Branch/PR Status
Platform support tiers PR #465, issue #420 Defining tier-1 (supported) vs tier-2 (community) platforms
Engine sprawl cleanup issue #419 First-class vs experimental TTS backends distinction
Frontend tech-debt burn-down issue #421 Biome + a11y debt before gating CI
Docker registry auto-publish PR #463, issue #453 ghcr.io image on tag push
New model research voicebox-new-models branch Evaluating Fish Speech, XTTS-v2, Pocket TTS, VibeVoice, Fish Audio S2, index-tts2

TTS Engine Comparison

Engine Model Name Profile Type Languages Size Key Features Instruct Support
Qwen3-TTS 1.7B qwen-tts-1.7B Cloned 10 (zh, en, ja, ko, de, fr, ru, pt, es, it) ~3.5 GB Highest quality, voice cloning None (Base model has no instruct path)
Qwen3-TTS 0.6B qwen-tts-0.6B Cloned 10 ~1.2 GB Lighter, faster None
Qwen CustomVoice 1.7B qwen-custom-voice-1.7B Preset 10 ~3.5 GB Predefined speakers, instruct support Yes
Qwen CustomVoice 0.6B qwen-custom-voice-0.6B Preset 10 ~1.2 GB Predefined speakers, instruct support Yes
LuxTTS luxtts Cloned English ~300 MB CPU-friendly, 48 kHz, fast None
Chatterbox chatterbox-tts Cloned 23 (incl. Hebrew, Arabic, Hindi, etc.) ~3.2 GB Zero-shot cloning, multilingual Partial — exaggeration float (0-1)
Chatterbox Turbo chatterbox-turbo Cloned English ~1.5 GB Paralinguistic tags ([laugh], [cough]), 350M params, low latency Partial — inline tags only
TADA 1B tada-1b Cloned English ~4 GB HumeAI speech-language model, 700s+ coherent audio None
TADA 3B Multilingual tada-3b-ml Cloned 10 (en, ar, zh, de, es, fr, it, ja, pl, pt) ~8 GB Multilingual, text-acoustic dual alignment None
Kokoro 82M kokoro Preset 8 (en, es, fr, hi, it, pt, ja, zh) ~350 MB 82M params, CPU realtime, Apache 2.0, pre-built voices None

Multi-Engine Architecture (Shipped)

  • Thread-safe backend registry (_tts_backends dict + _tts_backends_lock) with double-checked locking
  • Per-engine backend instances — each engine gets its own singleton, loaded lazily
  • Engine field on GenerationRequest — frontend sends engine: 'qwen' | 'qwen_custom_voice' | 'luxtts' | 'chatterbox' | 'chatterbox_turbo' | 'tada' | 'kokoro'
  • Per-engine language filteringENGINE_LANGUAGES map in frontend, backend regex accepts all languages
  • Per-engine voice promptscreate_voice_prompt_for_profile() dispatches to the correct backend
  • Profile type system — preset vs cloned profiles, UI grays out incompatible engines and auto-switches on selection
  • Trim post-processingtrim_tts_output() for Chatterbox engines (cuts trailing silence/hallucination)

Known Limitations

  • HF XET progress: Large files downloaded via hf-xet (HuggingFace's new transfer backend) report n=0 in tqdm updates. Progress bars may appear stuck for large .safetensors files even though the download is proceeding. This is a known upstream limitation.
  • Chatterbox Turbo upstream token bug: from_pretrained() passes token=os.getenv("HF_TOKEN") or True which fails without a stored HF token. Our backend works around this by calling snapshot_download(token=None) + from_local().
  • chatterbox-tts must install with --no-deps: It pins numpy<1.26, torch==2.6.0, transformers==4.46.3 — all incompatible with our stack (Python 3.12, torch 2.10, transformers 4.57.3). Sub-deps listed explicitly in requirements.txt.
  • Instruct parameter partially shipped (#224, #303): Qwen CustomVoice (PR #328) now provides real instruct support via predefined speakers. Other backends still silently drop the instruct field — the UI exposes the field broadly but most engines ignore it. The floating generate box was patched to restore instruct for CustomVoice (commit 106aec4).
  • Streaming generation only works for Qwen on MLX. Other engines use the non-streaming /generate endpoint.
  • dicta-onnx (Hebrew diacritization) not included — upstream Chatterbox bug requires model_path arg but calls Dicta() with none. Hebrew works fine without it.
  • Blackwell (RTX 50-series) CUDA: cu128 + sm_120 kernel support shipped (PR #401, #316), but users still report cudaErrorNoKernelImageForDevice (#417, #400, #396, #395, #390, #362) — likely a stale CUDA binary on upgraded installs. Needs a follow-up diagnostic / forced re-download path.
  • Long text 50k character limit (#464, #365, #354): Still hit on GPU despite chunking (PR #266). Chunking reliability needs another pass.
  • ROCm on RDNA 3/4 (#469): HSA_OVERRIDE_GFX_VERSION is hardcoded and harms newer cards.
  • flash-attn is not installed warning on every platform (cosmetic, common user complaint): Our transformer-based engines (Chatterbox / Qwen) emit Warning: flash-attn is not installed. Will only run the manual PyTorch version. Please install flash-attn for faster inference. on every startup, on every platform — we don't pin flash-attn in requirements because installing it is fragile and version-sensitive. Fallback is PyTorch SDPA, which is near-FA2 throughput on Ampere+ and is what actually runs. Per-platform reality: (a) macOS/Apple Silicon — FlashAttention is CUDA-only, irrelevant here; MLX has its own attention kernels. (b) Linuxpip install flash-attn --no-build-isolation works but takes 20+ min to compile. (c) Windows — no official support (Dao-AILab README still says only "Might work"; source builds routinely fail on recent CUDA/MSVC, issues #1715, #1828, #2395). Windows users can install community prebuilt wheels from kingbri1/flash-attention or bdashore3/flash-attention (latest v2.8.3, Aug 2025; win_amd64 wheels for CUDA 12.4/12.8, Torch 2.6–2.9, Python 3.10–3.13) matching their exact CUDA/Torch/Python, or use WSL2. Native-Windows alternatives worth considering as a build-time swap: SageAttention (thu-ml, Apache 2.0, claims 2–5× over FA2) and xformers (official Windows wheels). Action for us: troubleshooting doc now covers it (see docs/content/docs/overview/troubleshooting.mdx), and we should optionally suppress the warning via logging.getLogger(...).setLevel(ERROR) at backend import since the fallback is functionally fine.
  • WebAudio playback dies after audio-session interruption (#41, plus an internal repro where the app is backgrounded long enough): WaveSurfer's AudioContext gets suspended by macOS — either because another app grabs the audio output, or because the WKWebView throttles when backgrounded. play() resolves and timeupdate can still fire, but no audio reaches the output. Only app restart fixes it. Things already tried that didn't work: (a) swapping WaveSurfer backend away from WebAudio — introduced more bugs, not an option; (b) remount hook on the player — doesn't help because a freshly-created AudioContext is born suspended and only resumes on a user gesture. PR #293 was a prior partial fix that doesn't cover this path. Next thing to try (not yet attempted — confirmed via grep of AudioPlayer.tsx): call wavesurfer.getMediaElement().getGainNode().context.resume() on the play button click (the click itself is a valid user gesture), plus a visibilitychange + statechange listener as belt-and-suspenders. The ctx.resume() pattern already exists in the codebase at useStoryPlayback.ts:52 — just not wired into the main player.

Open PRs — Triage & Analysis

Recently Merged (Since Last Update — 2026-03-18 → 2026-04-18)

PR Title Merged
#481 fix(build): pin transformers in MLX requirements to prevent 5.x upgrade 2026-04-19
#470 fix(api-client): declare moved + errors on migrateModels response type 2026-04-18
#457 fix(linux): use pactl to detect PipeWire/PulseAudio monitor 2026-04-18
#450 docs: clarify paralinguistic tag support in quick start 2026-04-18
#447 fix: delete version rows and files in delete_generations_by_profile 2026-04-18
#444 Fix generation cancellation flow 2026-04-18
#440 fix(paths): strip legacy "data/" prefix when resolving stored paths 2026-04-18
#439 Fix migration dialog hanging when no models are present 2026-04-18
#438 fix(build): repair frozen-binary imports for kokoro/chatterbox-multilingual/scipy/transformers 2026-04-18
#433 fix: warn user when no models to migrate during storage change 2026-04-18
#425 Add NUMBA_CACHE_DIR environment variable 2026-04-16
#424 fix: avoid ScreenCaptureKit launch crash on macOS 11 2026-04-16
#418 Frontend quality gates + TypeScript hardening 2026-04-18
#416 fix(deps): relax PyTorch requirement for macOS Intel (x86_64) 2026-04-16
#412 feat(history): add "Clear failed" button 2026-04-16
#405 fix: keep cpal Stream alive until playback completes 2026-04-16
#403 fix: prevent intermittent clip splitting failures 2026-04-16
#402 fix: reliably keep server alive after GUI close on Windows 2026-04-16
#401 feat: add Blackwell GPU (sm_120) CUDA support 2026-04-16
#394 fix(history): populate status/error/engine fields from DB row 2026-04-16
#384 Fix: Resolve ModuleNotFoundError in effects service 2026-04-16
#361 fix: torch.from_numpy crash with numpy 2.x in frozen binary 2026-04-16
#345 Fix: "Failed to Save" preset error by resolving backend import path 2026-03-22
#344 fix: include changelog in docker web build 2026-03-27
#332 Fix links in Get Started section of index.mdx 2026-03-21
#328 feat: add Qwen CustomVoice preset engine 2026-03-27
#325 feat: Kokoro 82M TTS engine + voice profile type system 2026-03-20
#321 fix: allows deletion of failed generations 2026-03-19
#320 feat: Intel Arc (XPU) GPU support 2026-03-21
#319 fix: GUI startup with external server + data refresh on server switch 2026-03-27
#318 fix: force offline mode when loading cached models (Qwen TTS & Whisper) 2026-03-21
#316 Upgrade CUDA backend from cu126 to cu128, fix GPU settings UI 2026-03-18

Currently Open (12 PRs)

PR Title Status Notes
#465 docs: define tier-1 and tier-2 platform support targets Community PR Pairs with issue #420. Important for scoping.
#463 feat(actions): add docker-registry.yml for automatic ghcr.io publishing Community PR Pairs with issue #453. Low risk.
#443 fix: prevent infinite retry loop in offline mode (#434) Community PR Fixes reported bug.
#430 feat: add MiniMax TTS provider support Community PR Cloud TTS provider — new direction (external API). Superset of #331?
#331 feat: add MiniMax Cloud TTS as a built-in engine Community PR Likely superseded by #430. Dedupe.
#311 feat: add CosyVoice2/3 TTS engine Close Abandoned — output quality too poor.
#253 Enhance speech tokenizer with 48kHz version Community PR Qwen tokenizer upgrade. Still worth reviewing.
#227 fix: harden input validation & file safety Community PR Coupled to #225 (custom models).
#225 feat: custom HuggingFace voice model support Community PR Needs rework for multi-engine arch.
#195 feat: per-profile LoRA fine-tuning Draft Complex. 15 new endpoints.
#154 feat: Audiobook tab Community PR Chunked generation now shipped (#266).
#91 fix: CoreAudio device enumeration Draft macOS audio device handling.

Open Issues — Categorized

GPU / Hardware Detection — still the top category

RTX 50-series (Blackwell / sm_120) cluster — NEW: #417, #400, #396, #395, #390, #362 all report cudaErrorNoKernelImageForDevice / "no kernel image available." sm_120 support shipped in PR #401 + cu128 in PR #316, but users on upgraded installs still hit it — likely stale CUDA binary. Needs a diagnostic that detects binary/GPU-arch mismatch and prompts re-download.

AMD / ROCm — NEW: #469 HSA_OVERRIDE_GFX_VERSION is hardcoded and breaks RDNA 3/4 cards. #313 DirectML on AMD Ryzen AI Max+ 395 not working.

Intel Arc: PR #320 shipped XPU support — may resolve #119.

General GPU-not-detected (older): #368, #310, #330, #324, #326, #355 (multi-GPU / eGPU).

Fix path: CUDA backend swap (PR #252) + cu128 (PR #316) + sm_120 (PR #401) + GPU-arch warning (73170d0) are all in. Remaining work is diagnostics + re-download prompts for users whose binary predates the kernel updates.

Model Downloads

Still reported. Users get stuck downloads, can't resume, offline mode edge cases.

Key issues: #475 (MAC CustomVoice install error), #449 (infinite loading macOS), #445 (can't download CustomVoice), #462 (Qwen requires internet even when loaded — regression from #150), #434 (infinite retry loop offline — PR #443 open), #432 (storage location change hangs when empty — partly fixed by PR #439/#433), #348 (TADA 3B Multilingual download fails), #336 (TADA model not listed in app), #275 (No module named 'chatterbox' on download), #304 (whisper-base feature extractor load error), #287 (macOS ARM check_model_inputs ImportError on new version), #181, #180.

Fix path: PR #443 addresses infinite offline retry. CustomVoice-specific download failures (#475, #445) need triage — likely related to frozen-binary import fixes in PR #438. TADA cluster (#336, #348) and macOS ARM import regressions (#287, #275, #304) need a dedicated triage pass.

Qwen 0.6B-downloads-1.7B reports: #485 (2026-04-19), #423 (macOS M1), #329. Originally a stale-fallback bug: mlx-community/Qwen3-TTS-12Hz-0.6B-Base-bf16 wasn't published when MLX support shipped, so the 0.6B slot was aliased to the 1.7B repo. The 0.6B bf16 conversion is live now and both backend/backends/mlx_backend.py and backend/backends/__init__.py point at their correct repos. Qwen CustomVoice is unaffected — it runs via PyTorch on all platforms, both sizes always have dedicated repos.

Language Requests (ongoing)

Strong demand: Hungarian (#479), Indonesian (#458, #247), Thai (#455), Bangla (#454), Arabic (#379), Persian (#162), IndicF5 (#339 — Indian languages), Ukrainian (#109), Chinese UI (#392, #261).

Fix path: Chatterbox Multilingual (PR #257) covers Arabic, Danish, German, Greek, Finnish, Hebrew, Hindi, Dutch, Norwegian, Polish, Swedish, Swahili, Turkish. Still missing: Hungarian, Indonesian, Thai, Bangla, Ukrainian. Issue #411 offers a PR for UI i18n foundation.

New Model Requests (growing)

Issue Model Requested
#478 CosyVoice3 (we tried & abandoned CosyVoice2/3 — see #311)
#407, #347 RVC-style voice-to-voice / seed voice conversion (STS)
#385 Fish Audio S2
#380 OmniVoice
#370 index-tts2
#364 Voxtral-TTS
#335 Faster-Qwen-TTS
#346 Multi-model batch request
#381 Microsoft MAI models
#339 IndicF5
#226 GGUF support
#172 VibeVoice
#138 Export to ONNX/Piper format
#132 LavaSR (transcription)
#147 Facebook Omnilingual ASR
#338 Default voices

The multi-engine architecture makes integration straightforward — see content/docs/developer/tts-engines.mdx. Platform-specific gating (e.g. VoxCPM CUDA-only) doesn't exist yet and would need design.

Platform Scope & Quality Debt — NEW category

Awareness issues filed this cycle — ties into engine sprawl and platform tier work.

  • #419 — Engine sprawl: define first-class vs experimental TTS backends
  • #420 — Formalize tier-1 vs tier-2 platform support targets (PR #465 open)
  • #421 — Track & burn down frontend Biome + a11y debt before gating CI
  • #422 — Code-split web build (main bundle > 1 MB)

Long-Form / Chunking

Still reported despite chunking + queue being merged.

Key issues: #464 (50k char limit on GPU despite 16 GB VRAM — v0.4.0), #365 (FR: >50k chars), #363 (smart chunking to prevent robotic artifacts), #354 (50k limit v0.3.0).

Fix path: Chunking (#266) and queue (#269) shipped. Remaining work is raising/removing the 50k guard and tuning chunk boundaries for prosody.

Feature Requests (ongoing)

Notable:

  • #480 — Noise removal on uploaded recordings
  • #448 — API for non-Qwen models (external integrations)
  • #427 — Task status control
  • #407, #347 — Voice-to-voice / audio-to-audio conversion
  • #387 — Location of downloaded generated voices
  • #383 — Concatenate partial reference audio into generated audio
  • #382 — Lightning.ai support
  • #376 — Remote mode
  • #353 — Audio transcoding
  • #317 — Voice pitch control
  • #189 — "Auto" language option
  • #173 — Vocal intonation/inflection control
  • #165, #270 — Audiobook mode (PR #154 open)
  • #242 — Seed value pinning
  • #228 — Always use 0.6B option
  • #235 — Finetuned Qwen3-TTS tokenizer (PR #253 open)
  • #144 — Copy text to clipboard

Housekeeping / Triage Needed

Issue Reason
#431, #408 Spam — Chinese "free Claude API" promos. Close.
#398 ("Excelente") Non-issue. Close.
#357 Informational — project featured in Awesome MLX. Close after acknowledgement.
#374, #377 Version-release questions, no bug. Close.
#306 ("voice model"), #389 ("New model"), #473 ("New functionality") Title-only issues, no content. Request details or close.
#309 Uninstall/cleanup question. Answer and close.
#241 "How to use in Colab" — support question, not a bug.
#423 / #485 / #329 Stale MLX fallback to 1.7B repo — fixed; 0.6B bf16 conversion now live on mlx-community, registry points at correct repo on both backends.
#336 / #348 TADA download/registration cluster — triage together.
#287 / #275 / #304 macOS ARM import regressions on new version — likely one root cause.
#292, #349 Possibly already fixed by merged PRs (#321/#412 and #345). Verify + close.

~70 older issues (pre-#170) not individually categorized above. Most are long-tail support questions or duplicates of problems now addressed by the multi-engine / model-registry work. A dedicated backlog-sweep pass is overdue.

Bugs (ongoing)

Category Issues
Generation failures #476, #467, #452, #459 (voice clone fetch error), #468 (tada-1b marked error), #437, #300, #301, #282
Audio quality #456 (clipping errors v0.4.0), #436 (emotion labels), #333 (pitch/echo), #307 (by-model breakdown), #340 (all generations say "www...")
Transcription #371 (fails every time), #291 (extract transcription from generated audio)
Effects / presets #349 ("Failed to save" when creating effects presets — possibly fixed by merged #345)
File ops #477 (spacy_pkuseg dict missing on frozen Windows build), #472 (storage location change), #283 (allow longer files for voice creation + in-app trim), #350 (failed to add sample)
History #292 (can't delete failed generations — possibly fixed by merged #321/#412)
Windows #466 (install problem), #375 (WinError 5 access denied), #273 (port 8000 conflict), #201 (model doesn't stay loaded)
Linux #471 (thread-safe PULSE_SOURCE), #413 (Arch build), #409 (Kubuntu build), #351, #341
macOS #441 (older macOS), #369 (malware flag), #334 (microphone permission), #287 (check_model_inputs ImportError — regression), #171 (ARM64 binary won't open)
Profile/UI #360 (Kokoro profile hides others — partly addressed by auto-switch), #299 (drag-drop on Win11), #329 (size selector state bug), #393 (stuck loading screen after reinstall to new dir)
Integrations #397 (SAMMI-bot 422 Unprocessable Entity)
Audio playback / session #41 (macOS: Voicebox goes silent after another app takes audio output; restart restores it) — see deep-dive below
Database #174 (sqlite3 IntegrityError)

Existing Plan Documents — Status

Document Target Version Status Relevance
TTS_PROVIDER_ARCHITECTURE.md v0.1.13 Partially superseded by multi-engine arch + CUDA swap Core concepts implemented differently than planned
CUDA_BACKEND_SWAP.md Shipped (PR #252) CUDA binary download + backend restart
CUDA_BACKEND_SWAP_FINAL.md Shipped (PR #252) Final implementation plan
EXTERNAL_PROVIDERS.md v0.2.0 Not started Remote server support
MLX_AUDIO.md Shipped MLX backend is live
DOCKER_DEPLOYMENT.md v0.2.0 Shipped (PR #161) Docker + web deployment
OPENAI_SUPPORT.md v0.2.0 Not started OpenAI-compatible API layer
PR33_CUDA_PROVIDER_REVIEW.md Reference Analysis of the original provider approach

New Model Integration — Landscape

Status Snapshot (2026-04-18)

Model Cloning Speed Sample Rate Languages VRAM Instruct Cross-platform? Status
Qwen3-TTS 10s zero-shot Medium 24 kHz 10 Medium None MLX + PyTorch Shipped
Qwen CustomVoice Preset speakers Medium 24 kHz 10 Medium Yes PyTorch Shipped (PR #328)
LuxTTS 3s zero-shot 150x RT, CPU ok 48 kHz English <1 GB None All Shipped (PR #254)
Chatterbox MTL 5s zero-shot Medium 24 kHz 23 Medium Partial — exaggeration CPU/CUDA Shipped (PR #257)
Chatterbox Turbo 5s zero-shot Fast 24 kHz English Low Partial — inline tags CPU/CUDA Shipped (PR #258)
HumeAI TADA 1B/3B Zero-shot 5x faster than LLM-TTS 24 kHz EN (1B), 10 (3B) Medium Partial — prosody PyTorch Shipped (PR #296)
Kokoro-82M Preset voices CPU realtime 24 kHz 8 Tiny (82M) None All Shipped (PR #325)
CosyVoice2-0.5B 3-10s zero-shot Very fast 24 kHz Multilingual Low Yes Abandoned (PR #311) — poor output quality
VoxCPM2 Zero-shot ~0.15 RTF streaming 48 kHz 30 Medium Partial — parenthetical style CUDA-only in practice Backlogged (2026-04-18) — see notes above
Fish Speech 10-30s few-shot Real-time 24-44 kHz 50+ Medium Yes — word-level inline All Candidate — license TBD
Fish Audio S2 Candidate (#385)
XTTS-v2 6s zero-shot Mid-GPU 24 kHz 17+ Medium Partial — style transfer from ref All Candidate — CPML license likely blocker
Pocket TTS (Kyutai) Zero-shot + streaming >1x RT on CPU English + several European (FR/DE/PT/IT/ES added by Feb 2026) ~100M None CPU-first Candidate — MIT
MOSS-TTS-Nano Zero-shot Realtime on 4 CPU cores 48 kHz stereo 20 0.1B Partial — MOSS-VoiceGenerator companion does text-to-voice design All (ONNX CPU path dropped 2026-04-17) Top candidate — Apache 2.0, released 2026-04-13, streaming
VibeVoice (Microsoft) Multi-speaker long-form (up to 90 min, 4 speakers) 1.5B Candidate (#172) — Stories-editor fit
index-tts2 Candidate (#370)
Voxtral TTS (Mistral) Zero-shot (short clips) + 20 preset voices Single-GPU 4B (Voxtral-4B-TTS-2603) Presets + cloning CUDA (16 GB+ VRAM) Candidate (#364) — frontier quality claim, open-weight
Dia / Dia2 Watch — emotion-forward, but "rough edges" / artifacts per April reviews
IndicF5 Indian languages Candidate (#339) — fills Indic gap
MiniMax Cloud TTS Cloud N/A (API) N/A Community PR #430, #331 — new direction (external API)
OmniVoice Candidate (#380)
RVC voice conversion N/A (STS) N/A All New modality, not TTS (#407, #347)

Watch list: MioTTS-2.6B (fast LLM-based EN/JP, vLLM compatible), Oolel-Voices (Soynade Research, expressive modular control), Faster-Qwen-TTS (#335), Orpheus / Sesame CSM (on-device fine-tuning discussions), Fish Audio S2 Pro / Fish Speech V1.5 (benchmark leader but research/non-commercial license — same blocker as Fish Speech).

Deep-research pass (2026-04-18): MOSS-TTS-Nano identified as the freshest high-alignment candidate — verified via OpenMOSS/MOSS-TTS README (0.1B params, Apache 2.0, 48 kHz stereo, 4-core CPU realtime, streaming, released 2026-04-13). Dedicated repo: OpenMOSS/MOSS-TTS-Nano. Voxtral TTS verified on HF as mistralai/Voxtral-4B-TTS-2603.

Active Evaluation Criteria (learned from cycle)

  1. Cross-platform first. MLX is a primary backend for our Apple Silicon user base. CUDA-only models require platform gating that doesn't exist yet — shipping one sets a precedent (see VoxCPM notes, issues #419/#420).
  2. PyPI + Apache/MIT licensing preferred. Heavy deps, git-only installs, and --no-deps workarounds are expensive to maintain (Chatterbox taught us this).
  3. Output quality is non-negotiable. CosyVoice was abandoned despite the best instruct API.
  4. Instruct support fills a real gap (#173, #224, #303). Qwen CustomVoice partially addresses it with preset speakers; zero-shot clone-with-instruct is still unmet.
  5. Long-form + streaming are user-requested (#363, #365, #464). Candidates with native streaming (Pocket TTS, Fish Speech) get extra weight.

Adding a New Engine (Now Straightforward)

With the model config registry and shared EngineModelSelector component, adding a new TTS engine requires:

  1. Create backend/backends/<engine>_backend.py — implement TTSBackend protocol (~200-300 lines)
  2. Register in backend/backends/__init__.py — add ModelConfig entry + TTS_ENGINES entry + factory elif
  3. Update backend/models.py — add engine name to regex
  4. Update frontend — add to engine union type, EngineModelSelector options, form schema, language map, profile type gating (icons/labels ~9 files per grep of kokoro)

main.py requires zero changes — the registry handles all dispatch automatically.

Platform gating doesn't exist yet. If we add a CUDA-only model (e.g. VoxCPM), we need a new requires_cuda (or more generally requires: list[device]) flag on ModelConfig, plumbed through /models API and surfaced in ModelManagement.tsx and EngineModelSelector.tsx as a lock icon + "Requires NVIDIA GPU" state. Backend should hard-error at load_model() as a safety net.

Total effort: ~1 day for a well-documented model with a PyPI package, cross-platform. ~2 days if platform gating is required. See content/docs/developer/tts-engines.mdx for the full guide.


Architectural Bottlenecks

1. Single Backend Singleton — RESOLVED

The singleton TTS backend was replaced with a thread-safe per-engine registry in PR #254. Multiple engines can now be loaded simultaneously.

2. main.py Dispatch Point Duplication — RESOLVED

Previously, each engine required updates to 6+ hardcoded dispatch maps across main.py (~320 lines of if/elif chains). A model config registry in backend/backends/__init__.py now centralizes all model metadata (ModelConfig dataclass) with helper functions (load_engine_model(), check_model_loaded(), engine_needs_trim(), etc.). Adding a new engine requires zero changes to main.py.

3. Model Config is Scattered — RESOLVED

Model identifiers, HF repo IDs, display names, and engine metadata are now consolidated in the ModelConfig registry. Backend-aware branching (e.g. MLX vs PyTorch Qwen repo IDs) happens inside the registry. Frontend model options are centralized in EngineModelSelector.tsx.

4. Voice Prompt Cache Assumes PyTorch Tensors

backend/utils/cache.py uses torch.save() / torch.load(). LuxTTS, Chatterbox, and Kokoro backends work around this by storing reference audio paths (or preset voice IDs) instead of tensors in their voice prompt dicts. Not ideal but functional.

5. Frontend Assumes Qwen Model Sizes — RESOLVED

The generation form now uses a flat model dropdown with engine-based routing. Per-engine language filtering is in place. Model size is only sent for Qwen / Qwen CustomVoice.

6. No Platform Gating on Models — NEW

ModelConfig has no way to express hardware requirements. Every engine is shown to every user, regardless of whether it'll actually load. Users on non-CUDA platforms discover failure at load time (or not at all — some fall back silently to CPU and never complete). Blocks shipping CUDA-only engines (VoxCPM) and would improve the Intel Arc / ROCm / CPU-only UX today. See ModelConfig TODO: add requires: list[Literal["cuda", "mps", "xpu", "cpu", "rocm"]] or equivalent, plumb through /models API, render in ModelManagement.tsx + EngineModelSelector.tsx.

7. Engine Sprawl — NEW

Seven TTS engines shipped, more candidates queued. Issue #419 asks for a first-class vs experimental distinction. Related: issue #420 asks for formalized platform support tiers. Combined, these would let us ship more engines more confidently with clearer expectations for users.


Recommended Priorities

Tier 1 — Ship Now

Priority PR/Item Impact Effort
1 RTX 50-series / Blackwell diagnostic — detect stale CUDA binary vs GPU arch, prompt re-download (#417, #400, #396, #395, #390, #362) Large cluster of user-blocking errors Medium
2 CustomVoice download failures (#475, #445) New engine blocked on MAC/Win — regression triage Medium
3 50k char limit on GPU (#464) Regression — chunking should handle this Medium
4 Close PR #311 (CosyVoice) and dedupe #331/#430 (MiniMax) Housekeeping None
5 PR #443 — infinite offline retry loop Bug fix, reviewable Low
6 PR #465 — define tier-1 / tier-2 platforms Unblocks engine-sprawl decision (#419) Low
7 PR #463 — docker registry auto-publish Community PR, low risk Low
8 #253 — 48kHz speech tokenizer Quality improvement for Qwen Medium
9 Kokoro profile UX (#360) — partially addressed by auto-switch Polish Low

Tier 2 — Feature Work

Priority Item Impact Effort
1 Engine tier system (#419) — first-class vs experimental, platform gating in ModelConfig Unblocks CUDA-only engines (VoxCPM, etc.) and frontend polish Medium
2 Frontend tech-debt burn-down (#421) + code-split (#422) Before gating CI on Biome Medium
3 #154 — Audiobook tab Long-form users. Chunking + queue shipped. Medium
4 UI i18n (#411 PR offer, #392, #261) Chinese UI + general localization Medium
5 #225 — Custom HuggingFace models User-supplied models. Needs rework. High
6 OpenAI-compatible API (plan doc exists) — see also #448 (API for non-Qwen) Low effort once API is stable Low
7 LoRA fine-tuning (PR #195) Complex, needs rework for multi-engine Very High
8 Streaming for non-MLX engines Currently MLX-only Medium
9 Voice-to-voice / RVC (#407, #347) New modality — different arch shape High

Tier 3 — Future Engines (cross-platform preferred)

Priority Item Notes
1 MOSS-TTS-Nano 0.1B, Apache 2.0, 4-core CPU realtime, 48 kHz stereo, streaming, 20 langs, released 2026-04-13. Best alignment with our criteria. Verify install ergonomics before committing.
2 Pocket TTS (Kyutai) CPU-first 100M model. MIT. Fills streaming gap without CUDA dependency. Several European langs added by Feb 2026.
3 IndicF5 Fills Indian-language gap (#339). Closes many language-request issues.
4 VibeVoice (Microsoft, #172) 1.5B, long-form multi-speaker (up to 90 min, 4 speakers). Strong Stories-editor fit.
5 Voxtral TTS (Mistral, #364) 4B presets+cloning. Frontier quality claim, but 16 GB+ VRAM — would need the platform-tier work first.
6 Fish Speech / Fish Audio S2 50+ langs, word-level instruct. License clarification first. (#385)
7 XTTS-v2 17+ langs, mature pip. CPML likely kills commercial use — verify.
8 index-tts2 (#370) Unvetted.
VoxCPM2 Backlogged — CUDA-only upstream. Revisit when tier system ships or MPS bugs are fixed upstream.

Previously Prioritized — Now Done

  • Kokoro 82M — finish integration Shipped (PR #325)
  • Qwen CustomVoice Shipped (PR #328)
  • Intel Arc (XPU) support Shipped (PR #320)
  • Blackwell CUDA Shipped (PR #401, follow-up work open)
  • Generation cancellation Shipped (PR #444)
  • macOS Intel x86_64 Shipped (PR #416)

Branch Inventory

Branch PR Status Notes
voicebox-new-models Active New model research (Fish Speech, Pocket TTS, VibeVoice, etc.); VoxCPM evaluated & backlogged
fix/kokoro-pyinstaller-source-files Active Kokoro frozen-build source bundling (parent of voicebox-new-models)
feat/cosyvoice-engine #311 Open — closing CosyVoice2/3 — abandoned, poor quality
feat/kokoro #325 Merged Kokoro 82M + voice profile type system
feat/qwen-custom-voice #328 Merged Qwen CustomVoice preset engine
feat/chatterbox-turbo #258 Merged Chatterbox Turbo + per-engine languages
feat/chatterbox #257 Merged Chatterbox Multilingual
feat/luxtts #254 Merged LuxTTS + multi-engine arch

Quick Reference: API Endpoints

All current endpoints
Endpoint Method Purpose
/health GET Health check, model/GPU status
/profiles POST, GET Create/list voice profiles
/profiles/{id} GET, PUT, DELETE Profile CRUD
/profiles/{id}/samples POST, GET Add/list voice samples
/profiles/{id}/avatar POST, GET, DELETE Avatar management
/profiles/{id}/export GET Export profile as ZIP
/profiles/import POST Import profile from ZIP
/generate POST Generate speech (engine param selects TTS backend)
/generate/stream POST Stream speech (MLX only)
/history GET List generation history
/history/{id} GET, DELETE Get/delete generation
/history/{id}/export GET Export generation ZIP
/history/{id}/export-audio GET Export audio only
/transcribe POST Transcribe audio (Whisper)
/models/status GET All model statuses (Qwen, LuxTTS, Chatterbox, Chatterbox Turbo, TADA, Whisper)
/models/download POST Trigger model download
/models/download/cancel POST Cancel/dismiss download
/models/{name} DELETE Delete downloaded model
/models/load POST Load model into memory
/models/unload POST Unload model
/models/progress/{name} GET SSE download progress
/tasks/active GET Active downloads/generations (with inline progress)
/stories POST, GET Create/list stories
/stories/{id} GET, PUT, DELETE Story CRUD
/stories/{id}/items POST, GET Story items CRUD
/stories/{id}/export GET Export story audio
/channels POST, GET Audio channel CRUD
/channels/{id} PUT, DELETE Channel update/delete
/cache/clear POST Clear voice prompt cache
/server/cuda/status GET CUDA binary availability
/server/cuda/download POST Download CUDA binary
/server/cuda/switch POST Switch to CUDA backend