TTS Engines

For humans: This doc is optimized for AI agents to implement new TTS engines autonomously. It's structured as a phased workflow with explicit gates and a checklist so an agent can do the full integration — dependency research, backend, frontend, bundling — and hand you a draft release or prod build to test locally. It's also a useful reference if you're doing it yourself.

Adding an engine touches ~10 files across 4 layers. The backend protocol work is straightforward — the real time sink is dependency hell, upstream library bugs, and PyInstaller bundling.

Do not start writing code until you complete Phase 0. Reaching v0.2.3 took three patch releases of PyInstaller fixes because dependency research was skipped. Every issue — inspect.getsource() failures, missing native data files, metadata lookups, dtype mismatches — was discoverable by reading the model library's source code before integration began.

Architecture Overview

The backend is split into layers:

| Layer | Purpose | Files Touched |
| --- | --- | --- |
| routes/ | Thin HTTP handlers | None (auto-dispatch) |
| services/ | Business logic | None (auto-dispatch) |
| backends/ | Engine implementations | your_engine_backend.py |
| utils/ | Shared utilities | As needed |

New engines only need to touch backends/ and models.py on the backend side — the route and service layers use a model config registry that handles dispatch automatically.

Phase 0: Dependency Research

This phase is mandatory. Clone the model library and its key dependencies into a temporary directory and inspect them before writing any integration code. The goal is to produce a dependency audit that identifies every PyInstaller-incompatible pattern, every native data file, and every upstream bug you'll need to work around.

0.1 Clone and Inspect the Model Library

# Create a throwaway workspace
mkdir /tmp/engine-research && cd /tmp/engine-research

# Clone the model library
git clone https://github.com/org/model-library.git
cd model-library

Read these files first, in order:

  1. setup.py / setup.cfg / pyproject.toml — Check pinned dependency versions. If the library pins torch==2.6.0 or numpy<1.26, you'll need --no-deps installation and manual sub-dependency listing (this is what happened with chatterbox-tts).

  2. __init__.py and the main model class — Trace the import chain. Look for:

    • from_pretrained() — does it call huggingface_hub internally? Does it pass token=True (which crashes without a stored HF token)?
    • from_local() — does it exist? You may need manual snapshot_download() + from_local() to bypass download bugs.
    • Device handling — does it default to CUDA? Does it support MPS? Many libraries crash on MPS with unsupported operators.
  3. All import statements — Recursively trace what the library imports (a tracing-script sketch follows this list). You're looking for:

    • inspect.getsource() anywhere in the chain (search all .py files)
    • typeguard / @typechecked decorators (these call inspect.getsource() at import time)
    • importlib.metadata.version() or pkg_resources.get_distribution() (need --copy-metadata)
    • lazy_loader (needs --collect-all to bundle .pyi stubs)
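
A minimal sketch of such a tracing script, assuming you run it from the cloned library's root; it only follows static import statements, so keep the grep passes below for dynamic imports.

import ast
from pathlib import Path

def list_imports(root: str) -> set[str]:
    """Collect every top-level package imported by any .py file under root."""
    found: set[str] = set()
    for py in Path(root).rglob("*.py"):
        try:
            tree = ast.parse(py.read_text(encoding="utf-8"))
        except (SyntaxError, UnicodeDecodeError):
            continue  # skip vendored templates or non-UTF-8 files
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                found.update(alias.name.split(".")[0] for alias in node.names)
            elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
                found.add(node.module.split(".")[0])
    return found

if __name__ == "__main__":
    for name in sorted(list_imports(".")):
        print(name)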

0.2 Scan for PyInstaller-Incompatible Patterns

Run these searches against the cloned library and its transitive dependencies:

# inspect.getsource — will crash in frozen binary without --collect-all
grep -r "inspect.getsource\|getsource(" .

# typeguard / @typechecked — calls inspect.getsource at import time
grep -r "@typechecked\|from typeguard" .

# importlib.metadata — needs --copy-metadata
grep -r "importlib.metadata\|pkg_resources.get_distribution\|pkg_resources.require" .

# Data files loaded at runtime — need --collect-all or --collect-data
grep -r "Path(__file__).parent\|os.path.dirname(__file__)\|resources_path\|pkg_resources.resource_filename" .

# Native library paths — may need env var override in frozen builds
grep -r "/usr/share\|/usr/lib\|/usr/local\|espeak\|phonemize" .

# torch.load without map_location — will crash on CPU-only builds
grep -r "torch.load(" . | grep -v "map_location"

# HuggingFace token bugs
grep -r 'token=True\|token=os.getenv' .

# Float64/Float32 assumptions — librosa returns float64, many models assume float32
grep -r "torch.from_numpy\|\.double()\|float64" .

# @torch.jit.script — calls inspect.getsource(), crashes in frozen builds
grep -r "@torch.jit.script\|torch.jit.script" .

# torchaudio.load — requires torchcodec in torchaudio 2.10+, use soundfile.read() instead
grep -r "torchaudio.load\|torchaudio.save" .

# Gated HuggingFace repos — models that hardcode gated repos as tokenizer/config sources
grep -r "from_pretrained\|tokenizer_name\|AutoTokenizer" . | grep -i "llama\|meta-llama\|gated"

0.3 Install and Trace in a Throwaway Venv

# Create isolated venv
python -m venv /tmp/engine-venv
source /tmp/engine-venv/bin/activate

# Install the package (try normally first)
pip install model-package

# Check if it conflicts with our stack
pip install model-package torch==2.10 transformers==4.57.3 "numpy>=1.26"
# If this fails, you need --no-deps:
pip install --no-deps model-package

# Get the full dependency tree
pip show model-package  # Check Requires: field
pip show -f model-package  # List all installed files (look for data files)

# Check for non-PyPI dependencies
pip install model-package 2>&1 | grep -i "no matching distribution"

0.4 Test Model Loading on CPU

Before writing any integration code, verify the model works on CPU in a plain Python script:

import numpy as np

from model_library import ModelClass  # placeholder: import your engine's actual model class

# Force CPU to catch map_location bugs early
model = ModelClass.from_pretrained("org/model", device="cpu")

# Test with a float32 audio array (not float64)
audio = np.random.randn(16000).astype(np.float32)
output = model.generate("Hello world", audio)
print(f"Output shape: {output.shape}, dtype: {output.dtype}, sample rate: {model.sample_rate}")

If this crashes, you've found a bug you'll need to monkey-patch. Common ones:

  • RuntimeError: expected scalar type Float but found Double → needs float32 cast
  • RuntimeError about deserializing a CUDA checkpoint on a CPU-only machine → needs torch.load map_location patch
  • RuntimeError: Unsupported operator aten::... → needs MPS skip

0.5 Produce a Dependency Audit

Before proceeding to Phase 1, write down:

  1. PyPI vs non-PyPI deps — which packages need --find-links, git+https://, or --no-deps?
  2. PyInstaller directives needed — which packages need --collect-all, --copy-metadata, --hidden-import?
  3. Runtime data files — which packages ship data files (YAML, pretrained weights, phoneme tables, shader libraries) that must be bundled?
  4. Native library paths — which packages look for data at system paths that won't exist in a frozen binary?
  5. Monkey-patches needed — torch.load map_location, float64→float32 casts, MPS skip, HF token bypass, etc.
  6. Sample rate — what does the engine output? (24kHz, 44.1kHz, 48kHz)
  7. Model download method — from_pretrained() with library-managed download, or manual snapshot_download() + from_local()?

This audit becomes your implementation plan for Phases 1, 4, and 5.
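
For scale, a finished audit for a hypothetical engine can be as short as:

  • Non-PyPI deps: none, but install with --no-deps because setup.py pins torch==2.6.0
  • PyInstaller directives: --collect-all for the model package (typeguard calls inspect.getsource()), --copy-metadata huggingface-hub
  • Runtime data files: phoneme table YAML shipped inside the package
  • Native library paths: none
  • Monkey-patches: torch.load map_location, float64→float32 cast on reference audio
  • Sample rate: 24kHz
  • Model download: manual snapshot_download() + from_local()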

Phase 1: Backend Implementation

1.1 Create the Backend File

Create backend/backends/<engine>_backend.py (~200-300 lines) implementing the TTSBackend protocol:

class YourBackend:
    """Must satisfy the TTSBackend protocol."""

    async def load_model(self, model_size: str = "default") -> None: ...
    async def create_voice_prompt(self, audio_path: str, reference_text: str, use_cache: bool = True) -> tuple[dict, bool]: ...
    async def combine_voice_prompts(self, audio_paths: list[str], ref_texts: list[str]) -> tuple[np.ndarray, str]: ...
    async def generate(self, text: str, voice_prompt: dict, language: str = "en", seed: int | None = None, instruct: str | None = None) -> tuple[np.ndarray, int]: ...
    def unload_model(self) -> None: ...
    def is_loaded(self) -> bool: ...
    def _get_model_path(self, model_size: str) -> str: ...

Key decisions per engine:

| Decision | Options | Examples |
| --- | --- | --- |
| Voice prompt storage | Pre-computed tensors vs deferred file paths | Qwen stores tensor dicts; Chatterbox stores paths |
| Caching | Use voice prompt cache or skip it | LuxTTS caches with prefix; Chatterbox skips caching |
| Device selection | CUDA / MPS / CPU | Chatterbox forces CPU on macOS (MPS bugs) |
| Model download | Library handles it vs manual snapshot_download | Turbo uses manual download to bypass token=True bug |
| Sample rate | Engine-specific | LuxTTS outputs 48kHz, everything else is 24kHz |

1.2 Voice Prompt Patterns

Pattern A: Pre-computed tensors (Qwen, LuxTTS)

encoded = model.encode_prompt(audio_path)
return encoded, False  # (prompt_dict, was_cached)

Pattern B: Deferred file paths (Chatterbox, MLX)

return {"ref_audio": audio_path, "ref_text": reference_text}, False

Pattern C: Hybrid (possible for new engines)

embedding = model.extract_speaker(audio_path)
return {"embedding": embedding, "ref_audio": audio_path}, False

If caching, prefix your cache keys:

cache_key = "yourengine_" + get_cache_key(audio_path, reference_text)
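
A sketch of a cached create_voice_prompt() under Pattern A; _prompt_cache and self.model.encode_prompt() are illustrative stand-ins, and only the get_cache_key() helper is the one referenced above.

_prompt_cache: dict[str, dict] = {}  # illustrative in-memory stand-in for the prompt cache

async def create_voice_prompt(self, audio_path: str, reference_text: str,
                              use_cache: bool = True) -> tuple[dict, bool]:
    cache_key = "yourengine_" + get_cache_key(audio_path, reference_text)
    if use_cache and cache_key in _prompt_cache:
        return _prompt_cache[cache_key], True       # (prompt_dict, was_cached=True)
    encoded = self.model.encode_prompt(audio_path)  # Pattern A: pre-computed tensors
    if use_cache:
        _prompt_cache[cache_key] = encoded
    return encoded, False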

1.3 Register the Engine

In backend/backends/__init__.py:

Add a ModelConfig entry:

ModelConfig(
    model_name="your-engine",
    display_name="Your Engine",
    engine="your_engine",
    hf_repo_id="org/model-repo",
    size_mb=3200,
    needs_trim=False,  # set True if output needs trim_tts_output()
    languages=["en", "fr", "de"],
),

Add to TTS_ENGINES dict:

TTS_ENGINES = {
    ...
    "your_engine": "Your Engine",
}

Add factory branch:

elif engine == "your_engine":
    from .your_engine_backend import YourBackend
    backend = YourBackend()

1.4 Update Request Models

In backend/models.py:

  • Add engine name to GenerationRequest.engine regex pattern
  • Add any new language codes to the language regex
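
As an illustration, the change in backend/models.py usually amounts to extending two patterns; this sketch assumes Pydantic v2-style Field(pattern=...) and uses made-up engine and language lists, not the real ones.

from pydantic import BaseModel, Field

class GenerationRequest(BaseModel):
    # Append the new engine to the alternation (engine list here is illustrative)
    engine: str = Field("qwen", pattern=r"^(qwen|chatterbox|luxtts|your_engine)$")
    # Extend the language pattern if the engine adds new codes
    language: str = Field("en", pattern=r"^(en|fr|de|es|it)$")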

Phase 2: Route and Service Integration

With the model config registry, route and service layers have zero per-engine dispatch points. All endpoints use registry helpers like get_model_config(), load_engine_model(), engine_needs_trim(), check_model_loaded(), etc.

You don't need to touch any route or service files unless your engine needs custom behavior in the generate pipeline.
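
For context, the registry-driven flow looks roughly like the sketch below; the helper names are the ones listed above, but their exact signatures and the request fields are assumptions.

# Illustrative shape only: real code lives in the generation service.
async def generate_speech(req):
    config = get_model_config(req.model_name)      # registry lookup, no per-engine branching
    backend = await load_engine_model(config)      # returns the loaded backend for that engine
    audio, sample_rate = await backend.generate(
        req.text, req.voice_prompt, language=req.language
    )
    if engine_needs_trim(config.engine):           # driven by ModelConfig.needs_trim
        audio = trim_tts_output(audio, sample_rate)
    return audio, sample_rate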

Post-Processing

If your model produces trailing silence, set needs_trim=True on your ModelConfig. The generation service applies trim_tts_output() automatically.
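
As a rough idea of what trailing-silence trimming does (a sketch, not the actual trim_tts_output() implementation):

import numpy as np

def trim_trailing_silence(audio: np.ndarray, sample_rate: int,
                          threshold: float = 1e-3, keep_ms: int = 100) -> np.ndarray:
    """Drop everything after the last sample above threshold, keeping a short tail."""
    loud = np.flatnonzero(np.abs(audio) > threshold)
    if loud.size == 0:
        return audio
    end = min(len(audio), loud[-1] + 1 + int(sample_rate * keep_ms / 1000))
    return audio[:end]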

Phase 3: Frontend Integration

3.1 TypeScript Types

In app/src/lib/api/types.ts:

  • Add to the engine union type on GenerationRequest

3.2 Language Maps

In app/src/lib/constants/languages.ts:

  • Add entry to ENGINE_LANGUAGES record
  • Add any new language codes to ALL_LANGUAGES if needed

3.3 Engine/Model Selector

In app/src/components/Generation/EngineModelSelector.tsx:

  • Add entry to ENGINE_OPTIONS and ENGINE_DESCRIPTIONS
  • Add to ENGLISH_ONLY_ENGINES if applicable

3.4 Form Hook

In app/src/lib/hooks/useGenerationForm.ts:

  • Add to Zod schema enum for engine
  • Add engine-to-model-name mapping
  • Update payload construction for engine-specific fields

Watch out for model naming inconsistencies. The HuggingFace repo name, the model size label, and the API model name don't always follow predictable patterns. For example, TADA's 3B model is named tada-3b-ml (not tada-3b), because it's a multilingual variant. Always check the actual repo names and build the frontend model name mapping from those, not from assumptions like {engine}-{size}.

3.5 Model Management

In app/src/components/ServerSettings/ModelManagement.tsx:

  • Add description to MODEL_DESCRIPTIONS record
  • Add model name to voiceModels filter condition

3.6 Non-Cloning Engines (Preset Voices)

If your engine uses pre-built voices instead of zero-shot cloning from reference audio (e.g. Kokoro), additional integration is needed:

Backend (a minimal sketch follows this list):

  • In kokoro_backend.py (or your engine), define a VOICES list of (voice_id, display_name, gender, language) tuples
  • create_voice_prompt() should return {"voice_type": "preset", "preset_engine": "<engine>", "preset_voice_id": "<id>"}
  • generate() should read voice_prompt.get("preset_voice_id") to select the voice
  • Add a seed_preset_profiles("<engine>") call in backend/routes/models.py after model download completes
  • The seed_preset_profiles() function in backend/services/profiles.py creates DB profiles with voice_type="preset"
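
A minimal sketch of the backend half for a hypothetical preset engine; the voice list, class name, and synthesize() call are illustrative, while the returned dict shape matches the bullets above.

# Illustrative preset-voice backend pieces; voice ids and synthesize() are stand-ins.
VOICES = [
    ("voice_f1", "Aria", "female", "en"),
    ("voice_m1", "Baker", "male", "en"),
]

class PresetEngineBackend:
    sample_rate = 24000

    async def create_voice_prompt(self, audio_path: str, reference_text: str,
                                  use_cache: bool = True) -> tuple[dict, bool]:
        # Preset engines ignore reference audio; the selected voice comes from the profile
        return {"voice_type": "preset",
                "preset_engine": "your_engine",
                "preset_voice_id": VOICES[0][0]}, False

    async def generate(self, text, voice_prompt, language="en", seed=None, instruct=None):
        voice_id = voice_prompt.get("preset_voice_id", VOICES[0][0])
        audio = self.model.synthesize(text, voice=voice_id)  # stand-in for the engine's real API
        return audio, self.sample_rate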

Frontend:

  • The EngineModelSelector filters options based on selectedProfile.voice_type:
    • "cloned" profiles → only cloning engines shown (Kokoro hidden)
    • "preset" profiles → only the preset's engine shown
  • Profile cards show the engine name as a badge for preset profiles
  • When a preset profile is selected, the engine auto-switches

Profile schema fields for presets:

  • voice_type: "preset" (vs "cloned" for traditional profiles)
  • preset_engine: "<engine>" — which engine owns this voice
  • preset_voice_id: "<id>" — the engine-specific voice identifier

For future "designed" voices (text description instead of audio, e.g. Qwen CustomVoice):

  • Use voice_type: "designed" with design_prompt field
  • create_voice_prompt_for_profile() already returns the design prompt for this type

Phase 4: Dependencies

Use the dependency audit from Phase 0 to drive this phase. You should already know what packages are needed, which conflict, and which require special installation.

4.1 Python Dependencies

Add to backend/requirements.txt. There are three installation patterns, depending on what Phase 0 revealed:

Normal PyPI packages:

some-model-package>=1.0.0

Pinned dependency conflicts (--no-deps) — If the model package pins old versions of torch/numpy/transformers, install with --no-deps and list sub-dependencies manually. This is the pattern used for chatterbox-tts:

# In justfile / CI setup:
pip install --no-deps chatterbox-tts

# In requirements.txt — list each actual sub-dependency:
conformer>=0.3.2
diffusers>=0.31.0
omegaconf>=2.3.0
resemble-perth>=0.0.2
s3tokenizer>=0.1.6

To identify sub-deps: run pip show chatterbox-tts and read the Requires: field, then cross-reference against existing requirements.txt to avoid duplicates.

Non-PyPI packages — Some libraries only exist on GitHub or require custom indexes:

# Git-only packages (no PyPI release)
linacodec @ git+https://github.com/ysharma3501/LinaCodec.git
Zipvoice @ git+https://github.com/ysharma3501/LuxTTS.git

# Custom package indexes (C extensions with platform-specific wheels)
--find-links https://k2-fsa.github.io/icefall/piper_phonemize.html
piper-phonemize>=1.2.0

4.2 Dependency Conflict Resolution

Check for conflicts with the existing stack before adding anything:

# Our current stack pins (approximate):
# Python 3.12+, torch>=2.10, transformers>=4.57, numpy>=1.26

# Test compatibility
pip install model-package torch==2.10 transformers==4.57.3 "numpy>=1.26"

# If it fails, check what the package pins:
pip show model-package | grep Requires
# Look at setup.py/pyproject.toml for version constraints

Known incompatible patterns in the wild:

  • torch==2.6.0 — many older packages pin this
  • numpy<1.26 — conflicts with Python 3.12+
  • transformers==4.46.3 — many packages pin old transformers
  • onnxruntime pinned versions — often conflict with torch

4.3 Update Installation Scripts

Dependencies must be added in multiple places:

| File | What to add |
| --- | --- |
| backend/requirements.txt | Package and version constraint |
| justfile | --no-deps install line if needed (in setup-python and setup-python-release targets) |
| .github/workflows/release.yml | Same --no-deps line in CI build steps |
| Dockerfile | Same install commands for Docker builds |

Phase 5: PyInstaller Bundling (build_binary.py)

This is where most of the pain lives. The v0.2.3 release was entirely dedicated to fixing bundling issues — every new engine that shipped in v0.2.1 (LuxTTS, Chatterbox, Chatterbox Turbo) worked in dev but failed in production builds. Don't skip this phase.

5.1 Register Your Engine in build_binary.py

Every new engine needs entries in backend/build_binary.py. This file drives PyInstaller and is the single most common source of "works in dev, breaks in prod" bugs. You need to decide which PyInstaller directives your engine's dependencies require:

| Directive | What It Does | When You Need It |
| --- | --- | --- |
| --hidden-import <module> | Includes a module PyInstaller can't detect via static analysis | Dynamic imports, lazy imports, plugin architectures |
| --collect-all <package> | Bundles source .py files, data files, AND native libraries | Packages that call inspect.getsource() at import time (e.g. inflect via typeguard's @typechecked), or that ship pretrained model files (e.g. perth ships .pth.tar + hparams.yaml) |
| --collect-data <package> | Bundles only data files (not source or native libs) | Packages with YAML configs, vocab files, etc. |
| --collect-submodules <package> | Bundles all submodules | Packages with deep module trees that PyInstaller misses |
| --copy-metadata <package> | Copies importlib.metadata info | Packages that call importlib.metadata.version() or pkg_resources.get_distribution() at runtime. Already required for: requests, transformers, huggingface-hub, tokenizers, safetensors, tqdm |

Example: adding hidden imports and collect-all for a new engine:

# In build_binary.py, inside the args list:
"--hidden-import",
"backend.backends.your_engine_backend",
"--hidden-import",
"your_engine_package",
"--hidden-import",
"your_engine_package.inference",
"--collect-all",
"some_dependency_that_uses_inspect_getsource",
"--copy-metadata",
"some_dependency_that_checks_its_own_version",

5.2 Lessons from v0.2.3 — Real Failures and Their Fixes

These are actual production failures from shipping new engines. Every one of these passed python -m uvicorn in dev:

| Engine | Failure | Root Cause | Fix |
| --- | --- | --- | --- |
| LuxTTS | "could not get source code" on import | inflect uses typeguard's @typechecked which calls inspect.getsource() — needs .py source files, not just bytecode | --collect-all inflect |
| LuxTTS | espeak-ng-data not found | piper_phonemize C library looks for data at /usr/share/espeak-ng-data/ which doesn't exist in the bundle | --collect-all piper_phonemize + set ESPEAK_DATA_PATH env var at runtime (see 5.3) |
| LuxTTS | inspect.getsource error in Vocos codec | linacodec and zipvoice use source introspection | --collect-all linacodec + --collect-all zipvoice |
| Chatterbox | FileNotFoundError for watermark model | perth ships pretrained model files (hparams.yaml, .pth.tar) that PyInstaller doesn't bundle by default | --collect-all perth |
| All engines | importlib.metadata failures | Frozen binary doesn't include package metadata for huggingface-hub, transformers, etc. | --copy-metadata for each affected package |
| All engines | Download progress bars stuck at 0% | huggingface_hub silently disables tqdm progress bars based on logger level in frozen builds — our progress tracker never receives byte updates | Force-enable tqdm's internal counter in HFProgressTracker |
| TADA | inspect.getsource error in DAC's Snake1d | @torch.jit.script calls inspect.getsource() which fails without .py source files | Wrote a lightweight shim (dac_shim.py) reimplementing Snake1d without @torch.jit.script, registered fake dac.* modules in sys.modules |
| All engines | NameError: name 'obj' is not defined on macOS | Python 3.12.0 has a CPython bug that corrupts bytecode when PyInstaller rewrites code objects | Upgrade to Python 3.12.13+ |
| All engines | resource_tracker subprocess crash | multiprocessing in frozen binaries needs freeze_support() called before anything else | Added to server.py entry point |

5.3 Runtime Frozen-Build Handling (server.py)

Some fixes can't live in build_binary.py — they need runtime detection. The entry point backend/server.py handles these before any heavy imports:

import multiprocessing
import os
import sys

# 1. freeze_support() — MUST be called before any multiprocessing use
multiprocessing.freeze_support()

# 2. Native data paths — redirect C libraries to bundled data
if getattr(sys, 'frozen', False):
    _meipass = getattr(sys, '_MEIPASS', os.path.dirname(sys.executable))
    _espeak_data = os.path.join(_meipass, 'piper_phonemize', 'espeak-ng-data')
    if os.path.isdir(_espeak_data):
        os.environ.setdefault('ESPEAK_DATA_PATH', _espeak_data)

# 3. stdout/stderr safety — PyInstaller --noconsole on Windows sets these to None
if not _is_writable(sys.stdout):
    sys.stdout = open(os.devnull, 'w')

If your engine's dependencies include native libraries that look for data at system paths (like espeak-ng does), you'll need to add a similar os.environ.setdefault() block here.

5.4 CUDA vs CPU Build Branching

build_binary.py produces two different binaries:

  • voicebox-server (CPU) — excludes all nvidia.* packages to avoid bundling ~3 GB of CUDA DLLs
  • voicebox-server-cuda — includes torch.cuda and torch.backends.cudnn

On Windows, if the build environment has CUDA torch installed but you're building the CPU binary, the script temporarily swaps to CPU-only torch and restores CUDA torch afterward. This prevents PyInstaller from accidentally bundling CUDA libraries into the CPU build.

New engine imports go in the common section (not the CUDA or MLX conditional blocks) unless your engine has platform-specific dependencies.
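
The structure is roughly the following sketch; the exact flags and grouping in the real build_binary.py differ, and the nvidia exclusion shown here is an assumption.

# Structure sketch, not the real build_binary.py.
args = list(common_args)                        # new engine imports go in this common section
args += ["--hidden-import", "backend.backends.your_engine_backend"]

if cuda:
    args += ["--hidden-import", "torch.cuda",
             "--hidden-import", "torch.backends.cudnn"]
else:
    args += ["--exclude-module", "nvidia"]      # keep CUDA DLLs out of the CPU binary

if is_apple_silicon() and not cuda:
    args += ["--collect-all", "mlx", "--collect-all", "mlx_audio"]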

5.5 MLX Conditional Inclusion

Apple Silicon builds conditionally include MLX hidden imports and --collect-all mlx / --collect-all mlx_audio. If your engine has an MLX-specific backend variant, add its imports inside the if is_apple_silicon() and not cuda: block.

5.6 Testing Frozen Builds

You can't skip this. Models that work in python -m uvicorn will break in the PyInstaller binary. It took three patch releases (v0.2.1 → v0.2.2 → v0.2.3) to get all engines working in production.

  1. Build: just build
  2. Launch the binary directly (not via python -m)
  3. Test the full chain: download → load → generate → progress tracking
  4. Check stderr for the actual error (logs go to stderr for Tauri sidecar capture)
  5. Fix, rebuild, repeat

Common gotcha: testing only generation with a pre-cached model from your dev install. Always test with a clean model cache to verify downloads work too.

Phase 6: Common Upstream Workarounds

torch.load device mismatch

import torch

_original_torch_load = torch.load

def _patched_torch_load(*args, **kwargs):
    kwargs.setdefault("map_location", "cpu")
    return _original_torch_load(*args, **kwargs)

torch.load = _patched_torch_load

Float64/Float32 dtype mismatch

original_fn = SomeClass.some_method

def patched_fn(self, *args, **kwargs):
    result = original_fn(self, *args, **kwargs)
    return result.float()

SomeClass.some_method = patched_fn

HuggingFace token bug

from huggingface_hub import snapshot_download
local_path = snapshot_download(repo_id=REPO, token=None)
model = ModelClass.from_local(local_path, device=device)

MPS tensor issues

Skip MPS entirely if operators aren't supported:

def _get_device(self):
    if torch.cuda.is_available():
        return "cuda"
    return "cpu"  # Skip MPS

Gated HuggingFace repos as hardcoded config sources

Some models hardcode a gated HuggingFace repo as their tokenizer or config source (e.g., TADA hardcodes "meta-llama/Llama-3.2-1B" in both its AlignerConfig and TadaConfig). This silently fails without HF authentication.

Fix: Download from an ungated mirror and patch the config objects directly:

from huggingface_hub import snapshot_download

# Download tokenizer from ungated mirror
UNGATED_TOKENIZER = "unsloth/Llama-3.2-1B"
tokenizer_path = snapshot_download(UNGATED_TOKENIZER, token=None)

# Patch the model config to use the local path instead of the gated repo
config = ModelConfig.from_pretrained(model_path)
config.tokenizer_name = tokenizer_path
model = ModelClass.from_pretrained(model_path, config=config)

Do NOT monkey-patch AutoTokenizer.from_pretrained — it's a classmethod, and replacing it corrupts the descriptor, which breaks other engines that use different tokenizers (e.g., Qwen uses a Qwen tokenizer via AutoTokenizer). Always patch at the config level, not the class method level.

torchaudio.load() requires torchcodec in 2.10+

As of torchaudio>=2.10, torchaudio.load() requires the torchcodec package for audio I/O. If your engine or backend code uses torchaudio.load(), replace it with soundfile:

# Before (breaks without torchcodec):
import torchaudio
waveform, sr = torchaudio.load("audio.wav")

# After:
import soundfile as sf
import torch
data, sr = sf.read("audio.wav", dtype="float32")
waveform = torch.from_numpy(data).unsqueeze(0)

Note: torchaudio.functional.resample() and other pure-PyTorch math functions work fine without torchcodec — only the I/O functions are affected.
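
For example, resampling the soundfile-loaded tensor from the block above still works without torchcodec:

import torchaudio.functional as F

waveform_24k = F.resample(waveform, orig_freq=sr, new_freq=24000)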

@torch.jit.script breaks in frozen builds

torch.jit.script calls inspect.getsource() to parse the decorated function's source code. In a PyInstaller binary, .py source files aren't available, so this crashes at import time.

Fix: Remove or avoid @torch.jit.script decorators. If the decorated function comes from an upstream dependency, write a shim that reimplements the function without the decorator (see "Toxic dependency chains" below).

Toxic dependency chains — the shim pattern

Sometimes a model library depends on a package with a massive, hostile transitive dependency tree, but only uses a tiny piece of it. When the dependency chain is unbuildable or would pull in dozens of unwanted packages, the right move is to write a lightweight shim.

Example: TADA depends on descript-audio-codec (DAC), which pulls in descript-audiotools → onnx, tensorboard, protobuf, matplotlib, pystoi, etc. The onnx package fails to build from source on macOS. But TADA only uses Snake1d from DAC — a 7-line PyTorch module.

Solution: Create a shim at backend/utils/dac_shim.py that registers fake modules in sys.modules:

import sys
import types
import torch
from torch import nn

def snake(x, alpha):
    """Snake activation — reimplemented without @torch.jit.script."""
    return x + (1.0 / (alpha + 1e-9)) * torch.sin(alpha * x).pow(2)

class Snake1d(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(1, channels, 1))

    def forward(self, x):
        return snake(x, self.alpha)

# Register fake dac.* modules so "from dac.nn.layers import Snake1d" works
_nn = types.ModuleType("dac.nn")
_layers = types.ModuleType("dac.nn.layers")
_layers.Snake1d = Snake1d
_nn.layers = _layers

for name, mod in [("dac", types.ModuleType("dac")),
                  ("dac.nn", _nn), ("dac.nn.layers", _layers)]:
    sys.modules[name] = mod

Key rules for shims:

  • Import the shim before importing the model library (so it finds the fake modules first; see the usage sketch below)
  • Do NOT use @torch.jit.script in the shim (see above)
  • Only reimplement what the model actually uses — check the import chain carefully
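
Usage, per the first rule above; the shim module path follows this doc's layout, and the model import line is illustrative.

import backend.utils.dac_shim  # noqa: F401  (registers the fake dac.* modules as a side effect)
from tada import TadaModel     # illustrative: this import now resolves Snake1d from the shim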

Candidate Engines

The docs/PROJECT_STATUS.md file is the canonical, living list of candidates under evaluation — including why some have been backlogged (e.g. VoxCPM, which is effectively CUDA-only upstream).

At a glance, current top candidates:

| Model | Tier | Size | Cross-platform? | Key Features |
| --- | --- | --- | --- | --- |
| MOSS-TTS-Nano | 1 | 0.1 B | Yes (CPU realtime) | 48 kHz stereo, Apache 2.0, released 2026-04-13 |
| Voxtral TTS | 2 | 4 B | Likely | mistralai/Voxtral-4B-TTS-2603 — presets + cloning |
| VibeVoice | 2 | ~500 M | Yes | Podcast-style multi-speaker dialogue |
| Dia2 | 3 | TBD | TBD | Successor to the original Dia |
| Fish Audio S2 Pro | 3 | Medium | Yes | Word-level control via inline text |

Backlogged:

  • VoxCPM (2B, Apache 2.0) — CUDA ≥12 required upstream; MPS broken in issues #232/#248; CPU path rejected by maintainers (#256). Keep watching for a PR that relaxes the device requirement.

Update PROJECT_STATUS.md when you pick one up or mark one as shipped/backlogged.

Implementation Checklist

Use this as a gate between phases. Do not proceed to the next phase until every item in the current phase is checked.

Phase 0: Dependency Research

  • Cloned model library source into a temp directory
  • Read setup.py / pyproject.toml — noted pinned dependency versions
  • Traced all imports from the model class through to leaf dependencies
  • Searched for inspect.getsource, @typechecked, typeguard in the full dependency tree
  • Searched for importlib.metadata, pkg_resources.get_distribution in the dependency tree
  • Searched for Path(__file__).parent, os.path.dirname(__file__), hardcoded system paths
  • Searched for torch.load calls missing map_location
  • Searched for torch.from_numpy without .float() cast
  • Searched for token=True or token=os.getenv("HF_TOKEN") in HuggingFace calls
  • Searched for @torch.jit.script / torch.jit.script (crashes in frozen builds)
  • Searched for torchaudio.load / torchaudio.save (requires torchcodec in 2.10+)
  • Searched for hardcoded gated HuggingFace repo names (e.g., meta-llama/*)
  • Evaluated whether any dependency is used minimally enough to shim instead of install
  • Tested model loading and generation on CPU in a throwaway venv
  • Tested with a clean HuggingFace cache (no pre-downloaded models)
  • Produced a written dependency audit documenting all findings

Phase 1: Backend Implementation

  • Created backend/backends/<engine>_backend.py implementing TTSBackend protocol
  • Chose voice prompt pattern (pre-computed tensors vs deferred file paths)
  • Implemented all monkey-patches identified in Phase 0
  • Used get_torch_device() from backends/base.py for device selection
  • Used model_load_progress() from backends/base.py for download/load tracking
  • Tested: model downloads correctly
  • Tested: model loads on CPU
  • Tested: generation produces valid audio
  • Tested: voice cloning from reference audio works
  • Registered ModelConfig in backends/__init__.py
  • Added to TTS_ENGINES dict
  • Added factory branch in get_tts_backend_for_engine()
  • Updated engine regex in backend/models.py

Phase 2–3: Route, Service, and Frontend

  • Confirmed zero changes needed in routes/services (or documented why custom behavior is needed)
  • Added engine to TypeScript union type in app/src/lib/api/types.ts
  • Added language map entry in app/src/lib/constants/languages.ts
  • Added to ENGINE_OPTIONS and ENGINE_DESCRIPTIONS in EngineModelSelector.tsx
  • Added to Zod schema and model-name mapping in useGenerationForm.ts
  • Added description in ModelManagement.tsx

Phase 4: Dependencies

  • Added packages to backend/requirements.txt
  • If --no-deps needed: listed sub-dependencies explicitly
  • If git-only packages: added @ git+https://... entries
  • If custom index needed: added --find-links line
  • Updated justfile setup targets
  • Updated .github/workflows/release.yml build steps
  • Updated Dockerfile if applicable
  • Verified pip install succeeds in a clean venv with existing requirements

Phase 5: PyInstaller Bundling

  • Added --hidden-import entries in build_binary.py for:
    • backend.backends.<engine>_backend
    • The model package and its key submodules
  • Added --collect-all for any packages that:
    • Use inspect.getsource() / @typechecked
    • Ship pretrained model data files (.pth.tar, .yaml, etc.)
    • Ship native data files (phoneme tables, shader libraries, etc.)
  • Added --copy-metadata for any packages that use importlib.metadata
  • If engine has native data paths: added os.environ.setdefault() in server.py
  • Built frozen binary with just build
  • Tested in frozen binary with clean model cache (not pre-cached from dev):
    • Model download works with real-time progress
    • Model loading works
    • Generation produces valid audio
    • No errors in stderr logs

Phase 6: Final Verification

  • Engine works in dev mode (just dev)
  • Engine works in frozen binary (just build → run binary directly)
  • Tested on target platform (macOS for MLX, Windows/Linux for CUDA)
  • No regressions in existing engines