Voice Profiles

Overview

Voice profiles are the unit of "a saved voice" in Voicebox. As of 0.4 they support two flavors backed by the same profiles table:

  • Cloned profiles — store one or more reference audio samples; the cloning engine generates a voice embedding at use time
  • Preset profiles — store no audio; just a pointer to an engine-specific pre-built voice (e.g. Kokoro's am_adam, Qwen CustomVoice's Ryan)

The schema also reserves a third type, designed, for future text-described voices. No shipped engine currently uses it.

Architecture

The voice profile system consists of three main components:

Database Layer: SQLite tables store profile metadata, sample references (cloned), and engine + voice ID (preset).

File Storage: Audio samples are stored on disk in a structured directory format. Preset profiles have no on-disk audio.

Profile Module: backend/services/profiles.py provides the business logic for CRUD operations and dispatches to the appropriate engine based on voice_type.

Data Model

VoiceProfile Table

import uuid
from datetime import datetime

from sqlalchemy import Column, DateTime, String, Text

class VoiceProfile(Base):
    __tablename__ = "profiles"

    id = Column(String, primary_key=True, default=lambda: str(uuid.uuid4()))
    name = Column(String, unique=True, nullable=False)
    description = Column(Text)
    language = Column(String, default="en")
    avatar_path = Column(String, nullable=True)
    effects_chain = Column(Text, nullable=True)

    # Voice type system, added v0.3.x
    voice_type = Column(String, default="cloned")    # "cloned" | "preset" | "designed"
    preset_engine = Column(String, nullable=True)    # e.g. "kokoro"; only for preset
    preset_voice_id = Column(String, nullable=True)  # e.g. "am_adam"; only for preset
    design_prompt = Column(Text, nullable=True)      # text description; only for designed (reserved)
    default_engine = Column(String, nullable=True)   # auto-selected engine, locked for preset

    created_at = Column(DateTime, default=datetime.utcnow)
    updated_at = Column(DateTime, default=datetime.utcnow, onupdate=datetime.utcnow)

The voice_type column discriminates the three flavors:

voice_type   preset_engine   preset_voice_id    Samples in profile_samples
cloned       NULL            NULL               Required (≥1 row)
preset       engine name     voice ID string    None
designed     NULL            NULL               None (uses design_prompt)
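
The constraints in this table can be checked in one place before a profile is used. The following is a minimal sketch, not Voicebox's actual validation code: check_voice_type_fields and its argument names are hypothetical, and the requirement that designed profiles carry a design_prompt is inferred from the table rather than confirmed. The sample-count rule applies once a cloned profile is ready for generation, since a freshly created profile has no samples yet.

```python
from typing import Optional


def check_voice_type_fields(
    voice_type: str,
    preset_engine: Optional[str] = None,
    preset_voice_id: Optional[str] = None,
    design_prompt: Optional[str] = None,
    sample_count: int = 0,
) -> list:
    """Return a list of problems; an empty list means the row is consistent."""
    if voice_type not in ("cloned", "preset", "designed"):
        return [f"unknown voice_type: {voice_type!r}"]

    problems = []
    has_preset_fields = preset_engine is not None or preset_voice_id is not None
    if voice_type == "preset":
        if not preset_engine or not preset_voice_id:
            problems.append("preset profiles need preset_engine and preset_voice_id")
    elif has_preset_fields:
        problems.append(f"{voice_type} profiles must leave the preset columns NULL")

    if voice_type == "cloned":
        if sample_count < 1:
            problems.append("cloned profiles need at least one sample")
    elif sample_count > 0:
        problems.append(f"{voice_type} profiles must not have samples")

    # Assumption: a designed profile is only meaningful with a prompt.
    if voice_type == "designed" and not design_prompt:
        problems.append("designed profiles need a design_prompt")
    return problems
```

For example, check_voice_type_fields("preset", "kokoro", "am_adam") returns an empty list, while a preset profile with no voice ID returns one problem.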

The default_engine column is set automatically when the profile is created. For preset profiles it is locked to the source engine: if a different engine is selected at generation time, the profile is skipped and its card is greyed out. Clicking a greyed-out card switches the UI back to the profile's engine (see the floating generate box and profile grid).
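
The engine lock can be pictured as two small checks. This is a sketch under stated assumptions, not the UI's real logic: ProfileInfo, is_usable_with, and engine_after_click are hypothetical names, and cloned profiles are treated as unrestricted here, which is a simplification.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProfileInfo:
    # Hypothetical stand-in for the two profiles columns used below.
    voice_type: str
    default_engine: Optional[str] = None


def is_usable_with(profile: ProfileInfo, active_engine: str) -> bool:
    """Preset profiles only work with the engine they were created from."""
    if profile.voice_type == "preset":
        return profile.default_engine == active_engine
    # Assumption: non-preset profiles are not tied to one engine.
    return True


def engine_after_click(profile: ProfileInfo, active_engine: str) -> Optional[str]:
    """Engine the UI ends up on after the user clicks the profile's card."""
    if is_usable_with(profile, active_engine):
        return active_engine
    # Greyed-out preset card: switch back to the profile's own engine.
    return profile.default_engine
```

With a Kokoro preset and another engine active, is_usable_with returns False (the card is greyed out) and engine_after_click returns "kokoro".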

ProfileSample Table

import uuid

from sqlalchemy import Column, ForeignKey, String, Text

class ProfileSample(Base):
    __tablename__ = "profile_samples"

    id = Column(String, primary_key=True, default=lambda: str(uuid.uuid4()))
    profile_id = Column(String, ForeignKey("profiles.id"))
    audio_path = Column(String, nullable=False)
    reference_text = Column(Text, nullable=False)

This table is populated only for cloned profiles. Preset and designed profiles have zero rows in it.

File Structure

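Profiles are stored in the data directory. Each profile has its own directory named after the profile ID, and each sample is saved inside it as <sample_id>.wav.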
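
The on-disk layout can be read off the create_profile and add_profile_sample code under Core Functions: one directory per profile ID, one WAV file per sample ID. A minimal sketch of that path arithmetic follows. The helper names are hypothetical, and the root directory is passed in because its actual location is not stated here.

```python
from pathlib import Path


def profile_dir_for(profiles_dir: Path, profile_id: str) -> Path:
    """Directory that holds every sample of one profile."""
    return profiles_dir / profile_id


def sample_path_for(profiles_dir: Path, profile_id: str, sample_id: str) -> Path:
    """Location of a single stored sample, always saved as WAV."""
    return profile_dir_for(profiles_dir, profile_id) / f"{sample_id}.wav"
```

Preset profiles never call sample_path_for, since they have no on-disk audio.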

Core Functions

Creating a Profile

async def create_profile(data: VoiceProfileCreate, db: Session) -> VoiceProfileResponse:
    # 1. Create database record
    db_profile = DBVoiceProfile(
        id=str(uuid.uuid4()),
        name=data.name,
        description=data.description,
        language=data.language,
    )
    db.add(db_profile)
    db.commit()

    # 2. Create profile directory
    profile_dir = profiles_dir / db_profile.id
    profile_dir.mkdir(parents=True, exist_ok=True)

    return VoiceProfileResponse.model_validate(db_profile)

Adding Samples

When a sample is added, the audio is validated and copied to the profile directory:

async def add_profile_sample(
    profile_id: str,
    audio_path: str,
    reference_text: str,
    db: Session,
) -> ProfileSampleResponse:
    # 1. Validate audio (duration, format, quality)
    is_valid, error_msg = validate_reference_audio(audio_path)
    if not is_valid:
        raise ValueError(f"Invalid reference audio: {error_msg}")

    # 2. Copy to profile directory (same layout create_profile uses)
    profile_dir = profiles_dir / profile_id
    sample_id = str(uuid.uuid4())
    dest_path = profile_dir / f"{sample_id}.wav"
    audio, sr = load_audio(audio_path)
    save_audio(audio, str(dest_path), sr)

    # 3. Create database record
    db_sample = DBProfileSample(
        id=sample_id,
        profile_id=profile_id,
        audio_path=str(dest_path),
        reference_text=reference_text,
    )
    db.add(db_sample)
    db.commit()

    return ProfileSampleResponse.model_validate(db_sample)

Voice Prompt Creation

When generating speech, samples are combined into a voice prompt:

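import tempfile
from pathlib import Path

async def create_voice_prompt_for_profile(
    profile_id: str,
    db: Session,
) -> dict:
    samples = db.query(DBProfileSample).filter_by(profile_id=profile_id).all()
    if not samples:
        # Preset and designed profiles have no samples to build a prompt from
        raise ValueError(f"Profile {profile_id} has no samples")

    if len(samples) == 1:
        # Single sample - use directly
        sample = samples[0]
        voice_prompt, _ = await tts_model.create_voice_prompt(
            sample.audio_path,
            sample.reference_text,
        )
    else:
        # Multiple samples - combine them
        combined_audio, combined_text = await tts_model.combine_voice_prompts(
            [s.audio_path for s in samples],
            [s.reference_text for s in samples],
        )
        # create_voice_prompt takes a file path, so write the combined audio
        # out first. The temp location and 24 kHz rate are assumptions.
        combined_audio_path = str(Path(tempfile.gettempdir()) / f"combined_{profile_id}.wav")
        save_audio(combined_audio, combined_audio_path, 24000)
        voice_prompt, _ = await tts_model.create_voice_prompt(
            combined_audio_path,
            combined_text,
        )

    return voice_prompt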

Audio Validation

Reference audio is validated before being accepted:

  • Duration: 3-30 seconds recommended
  • Format: WAV, MP3, FLAC, OGG, M4A supported
  • Sample Rate: Engine-specific — the audio utility resamples to whatever the active engine expects (Whisper uses 16 kHz, most TTS engines use 24 kHz, LuxTTS outputs 48 kHz). Resampling happens on the fly; the stored sample retains its original rate.
  • Channels: Converted to mono if stereo
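
The duration and format checks above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the body of validate_reference_audio: check_reference_audio is a hypothetical name, it only measures duration for WAV files (compressed formats need an audio library), and it treats the recommended 3-30 second range as a hard limit, which the real validator may not do. It returns the same (is_valid, error_msg) shape that add_profile_sample unpacks.

```python
import wave
from pathlib import Path
from typing import Optional, Tuple

SUPPORTED_EXTENSIONS = {".wav", ".mp3", ".flac", ".ogg", ".m4a"}


def check_reference_audio(
    path: str, min_seconds: float = 3.0, max_seconds: float = 30.0
) -> Tuple[bool, Optional[str]]:
    """Return (is_valid, error_msg) for a candidate reference clip."""
    suffix = Path(path).suffix.lower()
    if suffix not in SUPPORTED_EXTENSIONS:
        return False, f"unsupported format: {suffix or 'none'}"
    if suffix != ".wav":
        # Duration of compressed formats is not checked in this sketch.
        return True, None

    with wave.open(path, "rb") as wav:
        duration = wav.getnframes() / wav.getframerate()
    if duration < min_seconds:
        return False, f"too short: {duration:.1f}s (minimum {min_seconds:.0f}s)"
    if duration > max_seconds:
        return False, f"too long: {duration:.1f}s (maximum {max_seconds:.0f}s)"
    return True, None
```

A caller would use it the same way the sample-adding code uses the real validator: unpack the tuple and raise ValueError with error_msg when is_valid is False.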

Export/Import

Profiles can be exported as ZIP archives for sharing and imported again through the export and import endpoints listed under API Endpoints.
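
As a rough picture of what such an archive could contain, the sketch below bundles profile metadata and sample audio with the standard zipfile module. The archive layout (profile.json at the root, audio under samples/) and both function names are assumptions made for illustration. The actual Voicebox export format is not described here.

```python
import json
import zipfile
from pathlib import Path


def export_profile_zip(profile: dict, sample_paths: list, dest_zip: Path) -> Path:
    """Write profile metadata and sample audio into one ZIP archive."""
    with zipfile.ZipFile(dest_zip, "w", zipfile.ZIP_DEFLATED) as archive:
        # Hypothetical layout: metadata at the root, audio under samples/.
        archive.writestr("profile.json", json.dumps(profile, indent=2))
        for sample_path in sample_paths:
            sample_path = Path(sample_path)
            archive.write(sample_path, arcname=f"samples/{sample_path.name}")
    return dest_zip


def read_profile_zip(src_zip: Path) -> dict:
    """Load the metadata back out of an exported archive."""
    with zipfile.ZipFile(src_zip) as archive:
        return json.loads(archive.read("profile.json"))
```

A preset profile would export with an empty sample list, since it stores only an engine name and voice ID.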

API Endpoints

Method   Endpoint                  Description
GET      /profiles                 List all profiles
POST     /profiles                 Create a profile
GET      /profiles/{id}            Get profile by ID
PUT      /profiles/{id}            Update profile
DELETE   /profiles/{id}            Delete profile
GET      /profiles/{id}/samples    Get profile samples
POST     /profiles/{id}/samples    Add sample to profile
PUT      /profiles/samples/{id}    Update sample text
DELETE   /profiles/samples/{id}    Delete sample
GET      /profiles/{id}/export     Export as ZIP
POST     /profiles/import          Import from ZIP
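
To show how a client might call the create endpoint, here is a minimal sketch that builds the request with the standard library. The JSON body is an assumption based on the fields create_profile reads (name, description, language), the base URL is whatever address the backend is served on, and build_create_profile_request is a hypothetical helper.

```python
import json
import urllib.request


def build_create_profile_request(
    base_url: str, name: str, language: str = "en", description: str = ""
) -> urllib.request.Request:
    """Build (but do not send) a POST /profiles request."""
    body = json.dumps({"name": name, "description": description, "language": language})
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/profiles",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen sends it to a running backend. The other endpoints in the table follow the same pattern with a different method and path.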

Best Practices

Sample Quality

  • Use clean audio with minimal background noise
  • Ensure the reference text exactly matches what is spoken
  • Multiple samples (3-5) improve voice cloning quality

Language Matching

  • Set the profile language to match the reference audio
  • Supported languages: en, zh, ja, ko, de, fr, ru, pt, es, it
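
A caller can guard against unsupported codes before creating a profile. The check below is a sketch: normalize_language is a hypothetical helper, and the tuple simply copies the list above.

```python
SUPPORTED_LANGUAGES = ("en", "zh", "ja", "ko", "de", "fr", "ru", "pt", "es", "it")


def normalize_language(code: str) -> str:
    """Lower-case a language code and reject anything outside the supported list."""
    normalized = code.strip().lower()
    if normalized not in SUPPORTED_LANGUAGES:
        raise ValueError(
            f"unsupported language {code!r}; expected one of {', '.join(SUPPORTED_LANGUAGES)}"
        )
    return normalized
```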

Naming Conventions

  • Use descriptive names that identify the voice
  • Avoid special characters that may cause filesystem issues
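
One way to flag risky names before saving is to look for characters that are invalid in file names on common platforms. The helper below is hypothetical and the character set is an assumption (the Windows reserved set plus control characters). Voicebox itself may accept or reject a different set.

```python
import re

# Characters that are invalid in Windows file names, plus control characters.
_UNSAFE = re.compile(r'[<>:"/\\|?*\x00-\x1f]')


def unsafe_characters(name: str) -> list:
    """Return the distinct filesystem-unsafe characters found in a profile name."""
    return sorted(set(_UNSAFE.findall(name)))
```

An empty result means the name is safe under this character set. Otherwise the returned characters can be shown to the user.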