For humans: This doc is optimized for AI agents to implement new TTS engines autonomously. It's structured as a phased workflow with explicit gates and a checklist so an agent can do the full integration — dependency research, backend, frontend, bundling — and hand you a draft release or prod build to test locally. It's also a useful reference if you're doing it yourself.
Adding an engine touches ~10 files across 4 layers. The backend protocol work is straightforward — the real time sink is dependency hell, upstream library bugs, and PyInstaller bundling.
Do not start writing code until you complete Phase 0. Shipping the v0.2.1 engines took three patch releases of PyInstaller fixes (through v0.2.3) because dependency research was skipped. Every issue — inspect.getsource() failures, missing native data files, metadata lookups, dtype mismatches — was discoverable by reading the model library's source code before integration began.
Architecture Overview
The backend is split into layers:
| Layer | Purpose | Files Touched |
|---|---|---|
| `routes/` | Thin HTTP handlers | None (auto-dispatch) |
| `services/` | Business logic | None (auto-dispatch) |
| `backends/` | Engine implementations | `your_engine_backend.py` |
| `utils/` | Shared utilities | As needed |
New engines only need to touch backends/ and models.py on the backend side — the route and service layers use a model config registry that handles dispatch automatically.
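To make the dispatch mechanism concrete, here is an illustrative mock of the registry pattern — not the actual definitions in `backends/__init__.py`; the field set and helper names follow this doc's conventions but are simplified:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    """Illustrative subset of the real config's fields."""
    model_name: str
    display_name: str
    engine: str
    hf_repo_id: str
    size_mb: int
    needs_trim: bool = False
    languages: tuple = ("en",)


# The registry: one entry per model, no per-engine branching elsewhere
MODEL_CONFIGS = [
    ModelConfig("your-engine", "Your Engine", "your_engine", "org/model-repo", 3200),
]


def get_model_config(model_name: str) -> ModelConfig:
    # Routes/services call helpers like this instead of switching on engine names
    for cfg in MODEL_CONFIGS:
        if cfg.model_name == model_name:
            return cfg
    raise KeyError(f"unknown model: {model_name}")


def engine_needs_trim(model_name: str) -> bool:
    return get_model_config(model_name).needs_trim
```

Adding an engine then means appending one `ModelConfig` entry; the helpers pick it up automatically.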
Phase 0: Dependency Research
This phase is mandatory. Clone the model library and its key dependencies into a temporary directory and inspect them before writing any integration code. The goal is to produce a dependency audit that identifies every PyInstaller-incompatible pattern, every native data file, and every upstream bug you'll need to work around.
0.1 Clone and Inspect the Model Library
```bash
# Create a throwaway workspace
mkdir /tmp/engine-research && cd /tmp/engine-research

# Clone the model library
git clone https://github.com/org/model-library.git
cd model-library
```
Read these files first, in order:
- `setup.py` / `setup.cfg` / `pyproject.toml` — Check pinned dependency versions. If the library pins `torch==2.6.0` or `numpy<1.26`, you'll need `--no-deps` installation and manual sub-dependency listing (this is what happened with `chatterbox-tts`).
- `__init__.py` and the main model class — Trace the import chain. Look for:
  - `from_pretrained()` — does it call `huggingface_hub` internally? Does it pass `token=True` (which crashes without a stored HF token)?
  - `from_local()` — does it exist? You may need manual `snapshot_download()` + `from_local()` to bypass download bugs.
  - Device handling — does it default to CUDA? Does it support MPS? Many libraries crash on MPS with unsupported operators.
- All `import` statements — Recursively trace what the library imports. You're looking for:
  - `inspect.getsource()` anywhere in the chain (search all `.py` files)
  - `typeguard` / `@typechecked` decorators (these call `inspect.getsource()` at import time)
  - `importlib.metadata.version()` or `pkg_resources.get_distribution()` (need `--copy-metadata`)
  - `lazy_loader` (needs `--collect-all` to bundle `.pyi` stubs)
0.2 Scan for PyInstaller-Incompatible Patterns
Run these searches against the cloned library and its transitive dependencies:
```bash
# inspect.getsource — will crash in frozen binary without --collect-all
grep -r "inspect.getsource\|getsource(" .

# typeguard / @typechecked — calls inspect.getsource at import time
grep -r "@typechecked\|from typeguard" .

# importlib.metadata — needs --copy-metadata
grep -r "importlib.metadata\|pkg_resources.get_distribution\|pkg_resources.require" .

# Data files loaded at runtime — need --collect-all or --collect-data
grep -r "Path(__file__).parent\|os.path.dirname(__file__)\|resources_path\|pkg_resources.resource_filename" .

# Native library paths — may need env var override in frozen builds
grep -r "/usr/share\|/usr/lib\|/usr/local\|espeak\|phonemize" .

# torch.load without map_location — will crash on CPU-only builds
grep -r "torch.load(" . | grep -v "map_location"

# HuggingFace token bugs
grep -r 'token=True\|token=os.getenv' .

# Float64/Float32 assumptions — librosa returns float64, many models assume float32
grep -r "torch.from_numpy\|\.double()\|float64" .

# @torch.jit.script — calls inspect.getsource(), crashes in frozen builds
grep -r "@torch.jit.script\|torch.jit.script" .

# torchaudio.load — requires torchcodec in torchaudio 2.10+, use soundfile.read() instead
grep -r "torchaudio.load\|torchaudio.save" .

# Gated HuggingFace repos — models that hardcode gated repos as tokenizer/config sources
grep -r "from_pretrained\|tokenizer_name\|AutoTokenizer" . | grep -i "llama\|meta-llama\|gated"
```
0.3 Install and Trace in a Throwaway Venv
```bash
# Create isolated venv
python -m venv /tmp/engine-venv
source /tmp/engine-venv/bin/activate

# Install the package (try normally first)
pip install model-package

# Check if it conflicts with our stack (quote the >= spec so the shell
# doesn't treat > as a redirect)
pip install model-package torch==2.10 transformers==4.57.3 "numpy>=1.26"

# If this fails, you need --no-deps:
pip install --no-deps model-package

# Get the full dependency tree
pip show model-package     # Check Requires: field
pip show -f model-package  # List all installed files (look for data files)

# Check for non-PyPI dependencies
pip install model-package 2>&1 | grep -i "no matching distribution"
```
0.4 Test Model Loading on CPU
Before writing any integration code, verify the model works on CPU in a plain Python script:
```python
import numpy as np
import torch

# Force CPU to catch map_location bugs early
model = ModelClass.from_pretrained("org/model", device="cpu")

# Test with a float32 audio array (not float64)
audio = np.random.randn(16000).astype(np.float32)
output = model.generate("Hello world", audio)
print(f"Output shape: {output.shape}, dtype: {output.dtype}, sample rate: {model.sample_rate}")
```
If this crashes, you've found a bug you'll need to monkey-patch. Common ones:
- `RuntimeError: expected scalar type Float but found Double` → needs float32 cast
- `RuntimeError` mentioning `map_location` → needs `torch.load` patch
- `RuntimeError: Unsupported operator aten::...` → needs MPS skip
0.5 Produce a Dependency Audit
Before proceeding to Phase 1, write down:
- PyPI vs non-PyPI deps — which packages need `--find-links`, `git+https://`, or `--no-deps`?
- PyInstaller directives needed — which packages need `--collect-all`, `--copy-metadata`, `--hidden-import`?
- Runtime data files — which packages ship data files (YAML, pretrained weights, phoneme tables, shader libraries) that must be bundled?
- Native library paths — which packages look for data at system paths that won't exist in a frozen binary?
- Monkey-patches needed — `torch.load` `map_location`, float64→float32 casts, MPS skip, HF token bypass, etc.
- Sample rate — what does the engine output? (24kHz, 44.1kHz, 48kHz)
- Model download method — `from_pretrained()` with library-managed download, or manual `snapshot_download()` + `from_local()`?
This audit becomes your implementation plan for Phases 1, 4, and 5.
Phase 1: Backend Implementation
1.1 Create the Backend File
Create backend/backends/<engine>_backend.py (~200-300 lines) implementing the TTSBackend protocol:
```python
class YourBackend:
    """Must satisfy the TTSBackend protocol."""

    async def load_model(self, model_size: str = "default") -> None: ...
    async def create_voice_prompt(self, audio_path: str, reference_text: str,
                                  use_cache: bool = True) -> tuple[dict, bool]: ...
    async def combine_voice_prompts(self, audio_paths: list[str],
                                    ref_texts: list[str]) -> tuple[np.ndarray, str]: ...
    async def generate(self, text: str, voice_prompt: dict, language: str = "en",
                       seed: int | None = None,
                       instruct: str | None = None) -> tuple[np.ndarray, int]: ...
    def unload_model(self) -> None: ...
    def is_loaded(self) -> bool: ...
    def _get_model_path(self, model_size: str) -> str: ...
```
Key decisions per engine:
| Decision | Options | Examples |
|---|---|---|
| Voice prompt storage | Pre-computed tensors vs deferred file paths | Qwen stores tensor dicts; Chatterbox stores paths |
| Caching | Use voice prompt cache or skip it | LuxTTS caches with prefix; Chatterbox skips caching |
| Device selection | CUDA / MPS / CPU | Chatterbox forces CPU on macOS (MPS bugs) |
| Model download | Library handles it vs manual `snapshot_download` | Turbo uses manual download to bypass `token=True` bug |
| Sample rate | Engine-specific | LuxTTS outputs 48kHz, everything else is 24kHz |
1.2 Voice Prompt Patterns
Pattern A: Pre-computed tensors (Qwen, LuxTTS)

```python
encoded = model.encode_prompt(audio_path)
return encoded, False  # (prompt_dict, was_cached)
```

Pattern B: Deferred file paths (Chatterbox, MLX)

```python
return {"ref_audio": audio_path, "ref_text": reference_text}, False
```

Pattern C: Hybrid (possible for new engines)

```python
embedding = model.extract_speaker(audio_path)
return {"embedding": embedding, "ref_audio": audio_path}, False
```

If caching, prefix your cache keys:

```python
cache_key = "yourengine_" + get_cache_key(audio_path, reference_text)
```
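For illustration, a content-addressed cache key might be derived like this — a minimal sketch, not the actual `get_cache_key()` implementation (the real helper may also hash file bytes or mtime):

```python
import hashlib


def get_cache_key(audio_path: str, reference_text: str) -> str:
    # Hypothetical: hash path + reference text so the same inputs
    # always map to the same cached prompt
    h = hashlib.sha256()
    h.update(audio_path.encode("utf-8"))
    h.update(b"\x00")  # separator so ("ab", "c") != ("a", "bc")
    h.update(reference_text.encode("utf-8"))
    return h.hexdigest()[:16]


# Engine prefix keeps different backends' cache entries from colliding
cache_key = "yourengine_" + get_cache_key("/tmp/ref.wav", "Hello there.")
```

The prefix is the important part: two engines given the same reference audio must not read each other's cached prompts.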
1.3 Register the Engine
In backend/backends/__init__.py:
Add a ModelConfig entry:
```python
ModelConfig(
    model_name="your-engine",
    display_name="Your Engine",
    engine="your_engine",
    hf_repo_id="org/model-repo",
    size_mb=3200,
    needs_trim=False,  # set True if output needs trim_tts_output()
    languages=["en", "fr", "de"],
),
```
Add to TTS_ENGINES dict:
```python
TTS_ENGINES = {
    ...
    "your_engine": "Your Engine",
}
```
Add factory branch:
```python
elif engine == "your_engine":
    from .your_engine_backend import YourBackend
    backend = YourBackend()
```
1.4 Update Request Models
In backend/models.py:
- Add engine name to the `GenerationRequest.engine` regex pattern
- Add any new language codes to the language regex
Phase 2: Route and Service Integration
With the model config registry, route and service layers have zero per-engine dispatch points. All endpoints use registry helpers like get_model_config(), load_engine_model(), engine_needs_trim(), check_model_loaded(), etc.
You don't need to touch any route or service files unless your engine needs custom behavior in the generate pipeline.
Post-Processing
If your model produces trailing silence, set needs_trim=True on your ModelConfig. The generation service applies trim_tts_output() automatically.
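To show roughly what such a trim does — this is an illustrative sketch, not the actual `trim_tts_output()` implementation, and the threshold and padding values are made up:

```python
import numpy as np


def trim_trailing_silence(audio: np.ndarray, sample_rate: int,
                          threshold: float = 1e-3, pad_ms: int = 50) -> np.ndarray:
    """Drop trailing near-silence, keeping a short pad after the last loud sample."""
    # Indices where the signal exceeds the silence threshold
    loud = np.flatnonzero(np.abs(audio) > threshold)
    if loud.size == 0:
        return audio  # all silence — leave untouched
    pad = int(sample_rate * pad_ms / 1000)
    end = min(int(loud[-1]) + 1 + pad, audio.shape[0])
    return audio[:end]
```

Because the generation service applies the real trim automatically when `needs_trim=True`, backends never need to call anything like this themselves.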
Phase 3: Frontend Integration
3.1 TypeScript Types
In app/src/lib/api/types.ts:
- Add to the `engine` union type on `GenerationRequest`
3.2 Language Maps
In app/src/lib/constants/languages.ts:
- Add entry to the `ENGINE_LANGUAGES` record
- Add any new language codes to `ALL_LANGUAGES` if needed
3.3 Engine/Model Selector
In app/src/components/Generation/EngineModelSelector.tsx:
- Add entry to `ENGINE_OPTIONS` and `ENGINE_DESCRIPTIONS`
- Add to `ENGLISH_ONLY_ENGINES` if applicable
3.4 Form Hook
In app/src/lib/hooks/useGenerationForm.ts:
- Add to the Zod schema enum for `engine`
- Add engine-to-model-name mapping
- Update payload construction for engine-specific fields
Watch out for model naming inconsistencies. The HuggingFace repo name, the model size label, and the API model name don't always follow predictable patterns. For example, TADA's 3B model is named `tada-3b-ml` (not `tada-3b`), because it's a multilingual variant. Always check the actual repo names and build the frontend model name mapping from those, not from assumptions like `{engine}-{size}`.
3.5 Model Management
In app/src/components/ServerSettings/ModelManagement.tsx:
- Add description to the `MODEL_DESCRIPTIONS` record
- Add model name to the `voiceModels` filter condition
3.6 Non-Cloning Engines (Preset Voices)
If your engine uses pre-built voices instead of zero-shot cloning from reference audio (e.g. Kokoro), additional integration is needed:
Backend:
- In `kokoro_backend.py` (or your engine), define a `VOICES` list of `(voice_id, display_name, gender, language)` tuples
- `create_voice_prompt()` should return `{"voice_type": "preset", "preset_engine": "<engine>", "preset_voice_id": "<id>"}`
- `generate()` should read `voice_prompt.get("preset_voice_id")` to select the voice
- Add a `seed_preset_profiles("<engine>")` call in `backend/routes/models.py` after model download completes
- The `seed_preset_profiles()` function in `backend/services/profiles.py` creates DB profiles with `voice_type="preset"`
Frontend:
- The `EngineModelSelector` filters options based on `selectedProfile.voice_type`:
  - `"cloned"` profiles → only cloning engines shown (Kokoro hidden)
  - `"preset"` profiles → only the preset's engine shown
- Profile cards show the engine name as a badge for preset profiles
- When a preset profile is selected, the engine auto-switches
Profile schema fields for presets:
- `voice_type: "preset"` (vs `"cloned"` for traditional profiles)
- `preset_engine: "<engine>"` — which engine owns this voice
- `preset_voice_id: "<id>"` — the engine-specific voice identifier
For future "designed" voices (text description instead of audio, e.g. Qwen CustomVoice):
- Use `voice_type: "designed"` with a `design_prompt` field
- `create_voice_prompt_for_profile()` already returns the design prompt for this type
Phase 4: Dependencies
Use the dependency audit from Phase 0 to drive this phase. You should already know what packages are needed, which conflict, and which require special installation.
4.1 Python Dependencies
Add to backend/requirements.txt. There are three installation patterns, depending on what Phase 0 revealed:
Normal PyPI packages:

```text
some-model-package>=1.0.0
```
Pinned dependency conflicts (--no-deps) — If the model package pins old versions of torch/numpy/transformers, install with --no-deps and list sub-dependencies manually. This is the pattern used for chatterbox-tts:
```bash
# In justfile / CI setup:
pip install --no-deps chatterbox-tts
```

```text
# In requirements.txt — list each actual sub-dependency:
conformer>=0.3.2
diffusers>=0.31.0
omegaconf>=2.3.0
resemble-perth>=0.0.2
s3tokenizer>=0.1.6
```
To identify sub-deps: `pip show chatterbox-tts` → `Requires:` field, then cross-reference against the existing `requirements.txt` to avoid duplicates.
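The `Requires:` lookup can also be scripted with the standard library; a small helper (the function name is illustrative, and the extras filter is a simplification):

```python
from importlib.metadata import PackageNotFoundError, requires


def list_subdeps(package: str) -> list[str]:
    """Return a package's declared dependencies, skipping extras-only requirements."""
    try:
        reqs = requires(package) or []  # None when a package declares no deps
    except PackageNotFoundError:
        return []
    # Requirements guarded by 'extra == "..."' markers are optional — skip them
    return [r for r in reqs if "extra ==" not in r]
```

Running this against the model package in the throwaway venv gives you the raw list to cross-reference against `requirements.txt`.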
Non-PyPI packages — Some libraries only exist on GitHub or require custom indexes:
```text
# Git-only packages (no PyPI release)
linacodec @ git+https://github.com/ysharma3501/LinaCodec.git
Zipvoice @ git+https://github.com/ysharma3501/LuxTTS.git

# Custom package indexes (C extensions with platform-specific wheels)
--find-links https://k2-fsa.github.io/icefall/piper_phonemize.html
piper-phonemize>=1.2.0
```
4.2 Dependency Conflict Resolution
Check for conflicts with the existing stack before adding anything:
```bash
# Our current stack pins (approximate):
# Python 3.12+, torch>=2.10, transformers>=4.57, numpy>=1.26

# Test compatibility (quote the >= spec so the shell doesn't redirect)
pip install model-package torch==2.10 transformers==4.57.3 "numpy>=1.26"

# If it fails, check what the package pins:
pip show model-package | grep Requires
# Look at setup.py/pyproject.toml for version constraints
```
Known incompatible patterns in the wild:
- `torch==2.6.0` — many older packages pin this
- `numpy<1.26` — conflicts with Python 3.12+
- `transformers==4.46.3` — many packages pin old transformers
- `onnxruntime` pinned versions — often conflict with torch
4.3 Update Installation Scripts
Dependencies must be added in multiple places:
| File | What to add |
|---|---|
| `backend/requirements.txt` | Package and version constraint |
| `justfile` | `--no-deps` install line if needed (in `setup-python` and `setup-python-release` targets) |
| `.github/workflows/release.yml` | Same `--no-deps` line in CI build steps |
| `Dockerfile` | Same install commands for Docker builds |
Phase 5: PyInstaller Bundling (build_binary.py)
This is where most of the pain lives. The v0.2.3 release was entirely dedicated to fixing bundling issues — every new engine that shipped in v0.2.1 (LuxTTS, Chatterbox, Chatterbox Turbo) worked in dev but failed in production builds. Don't skip this phase.
5.1 Register Your Engine in build_binary.py
Every new engine needs entries in backend/build_binary.py. This file drives PyInstaller and is the single most common source of "works in dev, breaks in prod" bugs. You need to decide which PyInstaller directives your engine's dependencies require:
| Directive | What It Does | When You Need It |
|---|---|---|
| `--hidden-import <module>` | Includes a module PyInstaller can't detect via static analysis | Dynamic imports, lazy imports, plugin architectures |
| `--collect-all <package>` | Bundles source `.py` files, data files, AND native libraries | Packages that call `inspect.getsource()` at import time (e.g. `inflect` via typeguard's `@typechecked`), or that ship pretrained model files (e.g. `perth` ships `.pth.tar` + `hparams.yaml`) |
| `--collect-data <package>` | Bundles only data files (not source or native libs) | Packages with YAML configs, vocab files, etc. |
| `--collect-submodules <package>` | Bundles all submodules | Packages with deep module trees that PyInstaller misses |
| `--copy-metadata <package>` | Copies `importlib.metadata` info | Packages that call `importlib.metadata.version()` or `pkg_resources.get_distribution()` at runtime. Already required for: requests, transformers, huggingface-hub, tokenizers, safetensors, tqdm |
Example: adding hidden imports and collect-all for a new engine:
```python
# In build_binary.py, inside the args list:
"--hidden-import",
"backend.backends.your_engine_backend",
"--hidden-import",
"your_engine_package",
"--hidden-import",
"your_engine_package.inference",
"--collect-all",
"some_dependency_that_uses_inspect_getsource",
"--copy-metadata",
"some_dependency_that_checks_its_own_version",
```
5.2 Lessons from v0.2.3 — Real Failures and Their Fixes
These are actual production failures from shipping new engines. Every one of these passed python -m uvicorn in dev:
| Engine | Failure | Root Cause | Fix |
|---|---|---|---|
| LuxTTS | "could not get source code" on import | `inflect` uses typeguard's `@typechecked`, which calls `inspect.getsource()` — needs `.py` source files, not just bytecode | `--collect-all inflect` |
| LuxTTS | `espeak-ng-data` not found | `piper_phonemize` C library looks for data at `/usr/share/espeak-ng-data/`, which doesn't exist in the bundle | `--collect-all piper_phonemize` + set `ESPEAK_DATA_PATH` env var at runtime (see 5.3) |
| LuxTTS | `inspect.getsource` error in Vocos codec | `linacodec` and `zipvoice` use source introspection | `--collect-all linacodec` + `--collect-all zipvoice` |
| Chatterbox | `FileNotFoundError` for watermark model | `perth` ships pretrained model files (`hparams.yaml`, `.pth.tar`) that PyInstaller doesn't bundle by default | `--collect-all perth` |
| All engines | `importlib.metadata` failures | Frozen binary doesn't include package metadata for huggingface-hub, transformers, etc. | `--copy-metadata` for each affected package |
| All engines | Download progress bars stuck at 0% | `huggingface_hub` silently disables tqdm progress bars based on logger level in frozen builds — our progress tracker never receives byte updates | Force-enable tqdm's internal counter in `HFProgressTracker` |
| TADA | `inspect.getsource` error in DAC's `Snake1d` | `@torch.jit.script` calls `inspect.getsource()`, which fails without `.py` source files | Lightweight shim (`dac_shim.py`) reimplementing `Snake1d` without `@torch.jit.script`; fake `dac.*` modules registered in `sys.modules` |
| All engines | `NameError: name 'obj' is not defined` on macOS | Python 3.12.0 has a CPython bug that corrupts bytecode when PyInstaller rewrites code objects | Upgrade to Python 3.12.13+ |
| All engines | `resource_tracker` subprocess crash | `multiprocessing` in frozen binaries needs `freeze_support()` called before anything else | Added to `server.py` entry point |
5.3 Runtime Frozen-Build Handling (server.py)
Some fixes can't live in build_binary.py — they need runtime detection. The entry point backend/server.py handles these before any heavy imports:
```python
import multiprocessing
import os
import sys

# 1. freeze_support() — MUST be called before any multiprocessing use
multiprocessing.freeze_support()

# 2. Native data paths — redirect C libraries to bundled data
if getattr(sys, 'frozen', False):
    _meipass = getattr(sys, '_MEIPASS', os.path.dirname(sys.executable))
    _espeak_data = os.path.join(_meipass, 'piper_phonemize', 'espeak-ng-data')
    if os.path.isdir(_espeak_data):
        os.environ.setdefault('ESPEAK_DATA_PATH', _espeak_data)

# 3. stdout/stderr safety — PyInstaller --noconsole on Windows sets these to None
def _is_writable(stream) -> bool:
    try:
        return stream is not None and stream.writable()
    except Exception:
        return False

if not _is_writable(sys.stdout):
    sys.stdout = open(os.devnull, 'w')
```
If your engine's dependencies include native libraries that look for data at system paths (like espeak-ng does), you'll need to add a similar os.environ.setdefault() block here.
5.4 CUDA vs CPU Build Branching
build_binary.py produces two different binaries:
- `voicebox-server` (CPU) — excludes all `nvidia.*` packages to avoid bundling ~3 GB of CUDA DLLs
- `voicebox-server-cuda` — includes `torch.cuda` and `torch.backends.cudnn`
On Windows, if the build environment has CUDA torch installed but you're building the CPU binary, the script temporarily swaps to CPU-only torch and restores CUDA torch afterward. This prevents PyInstaller from accidentally bundling CUDA libraries into the CPU build.
New engine imports go in the common section (not the CUDA or MLX conditional blocks) unless your engine has platform-specific dependencies.
5.5 MLX Conditional Inclusion
Apple Silicon builds conditionally include MLX hidden imports and --collect-all mlx / --collect-all mlx_audio. If your engine has an MLX-specific backend variant, add its imports inside the if is_apple_silicon() and not cuda: block.
5.6 Testing Frozen Builds
You can't skip this. Models that work in python -m uvicorn will break in the PyInstaller binary. Getting the engines shipped in v0.2.1 working in production took three patch releases (v0.2.1 → v0.2.2 → v0.2.3).
- Build: `just build`
- Launch the binary directly (not via `python -m`)
- Test the full chain: download → load → generate → progress tracking
- Check stderr for the actual error (logs go to stderr for Tauri sidecar capture)
- Fix, rebuild, repeat
Common gotcha: testing only generation with a pre-cached model from your dev install. Always test with a clean model cache to verify downloads work too.
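One way to force a clean-cache test is to delete the model's entry from the HuggingFace hub cache before launching the binary. A sketch — the `models--{org}--{name}` directory layout is the hub's default, but verify against your actual `HF_HOME`:

```python
import os
import shutil
from pathlib import Path


def clear_hf_model_cache(repo_id: str) -> bool:
    """Remove one repo from the HF hub cache so the next run re-downloads it."""
    # Default hub cache root: $HF_HOME/hub, falling back to ~/.cache/huggingface/hub
    hub = Path(os.environ.get("HF_HOME", Path.home() / ".cache" / "huggingface")) / "hub"
    # Hub cache entries are named models--{org}--{name}
    target = hub / ("models--" + repo_id.replace("/", "--"))
    if target.is_dir():
        shutil.rmtree(target)
        return True
    return False
```

Run it for your engine's `hf_repo_id`, then launch the frozen binary and watch the download progress from scratch.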
Phase 6: Common Upstream Workarounds
torch.load device mismatch
```python
import torch

_original_torch_load = torch.load

def _patched_torch_load(*args, **kwargs):
    # Default checkpoints to CPU so CUDA-saved weights load on CPU-only builds
    kwargs.setdefault("map_location", "cpu")
    return _original_torch_load(*args, **kwargs)

torch.load = _patched_torch_load
```
Float64/Float32 dtype mismatch
```python
original_fn = SomeClass.some_method

def patched_fn(self, *args, **kwargs):
    result = original_fn(self, *args, **kwargs)
    return result.float()  # cast float64 → float32

SomeClass.some_method = patched_fn
```
HuggingFace token bug
```python
from huggingface_hub import snapshot_download

local_path = snapshot_download(repo_id=REPO, token=None)
model = ModelClass.from_local(local_path, device=device)
```
MPS tensor issues
Skip MPS entirely if operators aren't supported:
```python
def _get_device(self):
    if torch.cuda.is_available():
        return "cuda"
    return "cpu"  # Skip MPS
```
Gated HuggingFace repos as hardcoded config sources
Some models hardcode a gated HuggingFace repo as their tokenizer or config source (e.g., TADA hardcodes "meta-llama/Llama-3.2-1B" in both its AlignerConfig and TadaConfig). This silently fails without HF authentication.
Fix: Download from an ungated mirror and patch the config objects directly:
```python
from huggingface_hub import snapshot_download

# Download tokenizer from ungated mirror
UNGATED_TOKENIZER = "unsloth/Llama-3.2-1B"
tokenizer_path = snapshot_download(UNGATED_TOKENIZER, token=None)

# Patch the model config to use the local path instead of the gated repo
config = ModelConfig.from_pretrained(model_path)
config.tokenizer_name = tokenizer_path
model = ModelClass.from_pretrained(model_path, config=config)
```
Do NOT monkey-patch AutoTokenizer.from_pretrained — it's a classmethod, and replacing it corrupts the descriptor, which breaks other engines that use different tokenizers (e.g., Qwen uses a Qwen tokenizer via AutoTokenizer). Always patch at the config level, not the class method level.
torchaudio.load() requires torchcodec in 2.10+
As of torchaudio>=2.10, torchaudio.load() requires the torchcodec package for audio I/O. If your engine or backend code uses torchaudio.load(), replace it with soundfile:
```python
# Before (breaks without torchcodec):
import torchaudio
waveform, sr = torchaudio.load("audio.wav")

# After:
import soundfile as sf
import torch
data, sr = sf.read("audio.wav", dtype="float32")
waveform = torch.from_numpy(data).unsqueeze(0)
```
Note: torchaudio.functional.resample() and other pure-PyTorch math functions work fine without torchcodec — only the I/O functions are affected.
@torch.jit.script breaks in frozen builds
torch.jit.script calls inspect.getsource() to parse the decorated function's source code. In a PyInstaller binary, .py source files aren't available, so this crashes at import time.
Fix: Remove or avoid @torch.jit.script decorators. If the decorated function comes from an upstream dependency, write a shim that reimplements the function without the decorator (see "Toxic dependency chains" below).
Toxic dependency chains — the shim pattern
Sometimes a model library depends on a package with a massive, hostile transitive dependency tree, but only uses a tiny piece of it. When the dependency chain is unbuildable or would pull in dozens of unwanted packages, the right move is to write a lightweight shim.
Example: TADA depends on descript-audio-codec (DAC), which pulls in descript-audiotools -> onnx, tensorboard, protobuf, matplotlib, pystoi, etc. The onnx package fails to build from source on macOS. But TADA only uses Snake1d from DAC — a 7-line PyTorch module.
Solution: Create a shim at backend/utils/dac_shim.py that registers fake modules in sys.modules:
```python
import sys
import types

import torch
from torch import nn


def snake(x, alpha):
    """Snake activation — reimplemented without @torch.jit.script."""
    return x + (1.0 / (alpha + 1e-9)) * torch.sin(alpha * x).pow(2)


class Snake1d(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(1, channels, 1))

    def forward(self, x):
        return snake(x, self.alpha)


# Register fake dac.* modules so "from dac.nn.layers import Snake1d" works
_nn = types.ModuleType("dac.nn")
_layers = types.ModuleType("dac.nn.layers")
_layers.Snake1d = Snake1d
_nn.layers = _layers
for name, mod in [("dac", types.ModuleType("dac")),
                  ("dac.nn", _nn), ("dac.nn.layers", _layers)]:
    sys.modules[name] = mod
```
Key rules for shims:
- Import the shim before importing the model library (so it finds the fake modules first)
- Do NOT use `@torch.jit.script` in the shim (see above)
- Only reimplement what the model actually uses — check the import chain carefully
Candidate Engines
The docs/PROJECT_STATUS.md file is the canonical, living list of candidates under evaluation — including why some have been backlogged (e.g. VoxCPM, which is effectively CUDA-only upstream).
At a glance, current top candidates:
| Model | Tier | Size | Cross-platform? | Key Features |
|---|---|---|---|---|
| MOSS-TTS-Nano | 1 | 0.1 B | Yes (CPU realtime) | 48 kHz stereo, Apache 2.0, released 2026-04-13 |
| Voxtral TTS | 2 | 4 B | Likely | mistralai/Voxtral-4B-TTS-2603 — presets + cloning |
| VibeVoice | 2 | ~500 M | Yes | Podcast-style multi-speaker dialogue |
| Dia2 | 3 | TBD | TBD | Successor to the original Dia |
| Fish Audio S2 Pro | 3 | Medium | Yes | Word-level control via inline text |
Backlogged:
- VoxCPM (2B, Apache 2.0) — CUDA ≥12 required upstream; MPS broken in issues #232/#248; CPU path rejected by maintainers (#256). Keep watching for a PR that relaxes the device requirement.
Update PROJECT_STATUS.md when you pick one up or mark one as shipped/backlogged.
Implementation Checklist
Use this as a gate between phases. Do not proceed to the next phase until every item in the current phase is checked.
Phase 0: Dependency Research
- Cloned model library source into a temp directory
- Read `setup.py` / `pyproject.toml` — noted pinned dependency versions
- Traced all imports from the model class through to leaf dependencies
- Searched for `inspect.getsource`, `@typechecked`, `typeguard` in the full dependency tree
- Searched for `importlib.metadata`, `pkg_resources.get_distribution` in the dependency tree
- Searched for `Path(__file__).parent`, `os.path.dirname(__file__)`, hardcoded system paths
- Searched for `torch.load` calls missing `map_location`
- Searched for `torch.from_numpy` without `.float()` cast
- Searched for `token=True` or `token=os.getenv("HF_TOKEN")` in HuggingFace calls
- Searched for `@torch.jit.script` / `torch.jit.script()` (crashes in frozen builds)
- Searched for `torchaudio.load` / `torchaudio.save` (requires `torchcodec` in 2.10+)
- Searched for hardcoded gated HuggingFace repo names (e.g., `meta-llama/*`)
- Evaluated whether any dependency is used minimally enough to shim instead of install
- Tested model loading and generation on CPU in a throwaway venv
- Tested with a clean HuggingFace cache (no pre-downloaded models)
- Produced a written dependency audit documenting all findings
Phase 1: Backend Implementation
- Created `backend/backends/<engine>_backend.py` implementing the `TTSBackend` protocol
- Chose voice prompt pattern (pre-computed tensors vs deferred file paths)
- Implemented all monkey-patches identified in Phase 0
- Used `get_torch_device()` from `backends/base.py` for device selection
- Used `model_load_progress()` from `backends/base.py` for download/load tracking
- Tested: model downloads correctly
- Tested: model loads on CPU
- Tested: generation produces valid audio
- Tested: voice cloning from reference audio works
- Registered `ModelConfig` in `backends/__init__.py`
- Added to the `TTS_ENGINES` dict
- Added factory branch in `get_tts_backend_for_engine()`
- Updated engine regex in `backend/models.py`
Phase 2–3: Route, Service, and Frontend
- Confirmed zero changes needed in routes/services (or documented why custom behavior is needed)
- Added engine to TypeScript union type in `app/src/lib/api/types.ts`
- Added language map entry in `app/src/lib/constants/languages.ts`
- Added to `ENGINE_OPTIONS` and `ENGINE_DESCRIPTIONS` in `EngineModelSelector.tsx`
- Added to Zod schema and model-name mapping in `useGenerationForm.ts`
- Added description in `ModelManagement.tsx`
Phase 4: Dependencies
- Added packages to `backend/requirements.txt`
- If `--no-deps` needed: listed sub-dependencies explicitly
- If git-only packages: added `@ git+https://...` entries
- If custom index needed: added `--find-links` line
- Updated `justfile` setup targets
- Updated `.github/workflows/release.yml` build steps
- Updated `Dockerfile` if applicable
- Verified `pip install` succeeds in a clean venv with existing requirements
Phase 5: PyInstaller Bundling
- Added `--hidden-import` entries in `build_binary.py` for:
  - `backend.backends.<engine>_backend`
  - The model package and its key submodules
- Added `--collect-all` for any packages that:
  - Use `inspect.getsource()` / `@typechecked`
  - Ship pretrained model data files (`.pth.tar`, `.yaml`, etc.)
  - Ship native data files (phoneme tables, shader libraries, etc.)
- Added `--copy-metadata` for any packages that use `importlib.metadata`
- If engine has native data paths: added `os.environ.setdefault()` in `server.py`
- Built frozen binary with `just build`
- Tested in frozen binary with clean model cache (not pre-cached from dev):
  - Model download works with real-time progress
  - Model loading works
  - Generation produces valid audio
  - No errors in stderr logs
Phase 6: Final Verification
- Engine works in dev mode (`just dev`)
- Engine works in frozen binary (`just build` → run binary directly)
- Tested on target platform (macOS for MLX, Windows/Linux for CUDA)
- No regressions in existing engines