Voice Profiles | Voicebox

Overview

Voice profiles are the unit of "a saved voice" in Voicebox. As of 0.4 they support two flavors backed by the same profiles table:

Cloned profiles — store one or more reference audio samples; the cloning engine generates a voice embedding at use time
Preset profiles — store no audio; just a pointer to an engine-specific pre-built voice (e.g. Kokoro's am_adam, Qwen CustomVoice's Ryan)

The schema also reserves a third type, designed, for future text-described voices; no shipped engine currently uses it.

Architecture

The voice profile system consists of three main components:

Database Layer: SQLite tables store profile metadata, sample references (cloned), and engine + voice ID (preset).

File Storage: Audio samples are stored on disk in a structured directory format. Preset profiles have no on-disk audio.

Profile Module: backend/services/profiles.py provides the business logic for CRUD operations and dispatches to the appropriate engine based on voice_type.
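The dispatch on voice_type described above can be pictured with a small lookup table. This is only a sketch: the handler functions, the dict-shaped profile, and the returned strings are hypothetical and are not taken from backend/services/profiles.py.

```python
from typing import Callable, Dict


def _from_samples(profile: dict) -> str:
    # Cloned profiles resolve their voice from stored reference samples.
    return f"clone from {len(profile['samples'])} sample(s)"


def _from_preset(profile: dict) -> str:
    # Preset profiles only point at an engine-specific built-in voice.
    return f"preset {profile['preset_engine']}:{profile['preset_voice_id']}"


# Hypothetical handler table keyed by voice_type.
HANDLERS: Dict[str, Callable[[dict], str]] = {
    "cloned": _from_samples,
    "preset": _from_preset,
}


def describe_voice_source(profile: dict) -> str:
    """Pick a handler by voice_type; unknown or reserved types are rejected."""
    handler = HANDLERS.get(profile["voice_type"])
    if handler is None:
        raise ValueError(f"unsupported voice_type: {profile['voice_type']!r}")
    return handler(profile)
```

In this sketch the reserved designed type has no handler, so it is rejected in the same way as an unknown value.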

Data Model

VoiceProfile Table

import uuid
from datetime import datetime

from sqlalchemy import Column, DateTime, String, Text

# Base is the project's SQLAlchemy declarative base (defined elsewhere)


class VoiceProfile(Base):
    __tablename__ = "profiles"

    id = Column(String, primary_key=True, default=lambda: str(uuid.uuid4()))
    name = Column(String, unique=True, nullable=False)
    description = Column(Text)
    language = Column(String, default="en")
    avatar_path = Column(String, nullable=True)
    effects_chain = Column(Text, nullable=True)

    # Voice type system (added v0.3.x)
    voice_type = Column(String, default="cloned")    # "cloned" | "preset" | "designed"
    preset_engine = Column(String, nullable=True)    # e.g. "kokoro"; only for preset
    preset_voice_id = Column(String, nullable=True)  # e.g. "am_adam"; only for preset
    design_prompt = Column(Text, nullable=True)      # text description; only for designed (reserved)
    default_engine = Column(String, nullable=True)   # auto-selected engine, locked for preset

    created_at = Column(DateTime, default=datetime.utcnow)
    updated_at = Column(DateTime, default=datetime.utcnow, onupdate=datetime.utcnow)

The voice_type column discriminates the three flavors:

`voice_type`	`preset_engine`	`preset_voice_id`	Samples in `profile_samples`
`cloned`	NULL	NULL	Required (≥1 row)
`preset`	engine name	voice ID string	None
`designed`	NULL	NULL	None (uses `design_prompt`)
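The rules in this table can be written as a consistency check. The sketch below is not part of Voicebox: ProfileFields and check_profile_invariants are hypothetical names, and sample_count stands in for the number of rows in profile_samples. A cloned profile that has just been created has no samples yet, so a check like this would only make sense at use time.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ProfileFields:
    """Hypothetical stand-in for the columns that depend on voice_type."""
    voice_type: str
    preset_engine: Optional[str] = None
    preset_voice_id: Optional[str] = None
    design_prompt: Optional[str] = None
    sample_count: int = 0


def check_profile_invariants(p: ProfileFields) -> List[str]:
    """Return the violated rules from the table (empty list if consistent)."""
    errors: List[str] = []
    has_preset_fields = p.preset_engine is not None or p.preset_voice_id is not None
    if p.voice_type == "cloned":
        if has_preset_fields:
            errors.append("cloned profiles must not set preset fields")
        if p.sample_count < 1:
            errors.append("cloned profiles need at least one sample")
    elif p.voice_type == "preset":
        if not p.preset_engine or not p.preset_voice_id:
            errors.append("preset profiles need preset_engine and preset_voice_id")
        if p.sample_count != 0:
            errors.append("preset profiles must not have samples")
    elif p.voice_type == "designed":
        if has_preset_fields:
            errors.append("designed profiles must not set preset fields")
        if p.sample_count != 0:
            errors.append("designed profiles must not have samples")
        if not p.design_prompt:
            errors.append("designed profiles need a design_prompt")
    else:
        errors.append(f"unknown voice_type: {p.voice_type!r}")
    return errors
```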

The default_engine column is set automatically when the profile is created. For preset profiles it is locked to the source engine: if a different engine is selected at generation time, the profile is skipped. In the UI the profile's card is greyed out, and clicking it switches the engine back automatically (see the floating generate box and profile grid).
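A minimal sketch of the engine lock for preset profiles follows. The function name and arguments are hypothetical, and the sketch makes a simplifying assumption: it treats every non-preset profile as usable and does not model which engines actually support cloning.

```python
from typing import Optional


def is_usable_with_engine(
    voice_type: str,
    preset_engine: Optional[str],
    active_engine: str,
) -> bool:
    """Preset profiles are only usable with the engine they were created from."""
    if voice_type == "preset":
        return preset_engine == active_engine
    # Assumption: other profile types are not restricted in this sketch.
    return True
```

A profile grid could use a check like this to decide which cards to grey out for the currently selected engine.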

ProfileSample Table

import uuid

from sqlalchemy import Column, ForeignKey, String, Text


class ProfileSample(Base):
    __tablename__ = "profile_samples"

    id = Column(String, primary_key=True, default=lambda: str(uuid.uuid4()))
    profile_id = Column(String, ForeignKey("profiles.id"))
    audio_path = Column(String, nullable=False)
    reference_text = Column(Text, nullable=False)

Only populated for cloned profiles. Preset and designed profiles have zero rows in this table.

File Structure

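Profiles are stored in the data directory. Each cloned profile has its own subdirectory named after the profile ID, and every sample is saved inside it as a WAV file named after the sample ID (see create_profile and add_profile_sample under Core Functions).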
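The path arithmetic used by the core functions (profiles_dir / profile ID, then sample ID plus .wav) can be sketched with pathlib. The helper names and the "profiles" subdirectory name are assumptions made for illustration; only the per-profile directory and the {sample_id}.wav file name come from the code in this document.

```python
from pathlib import Path


def profile_dir(data_dir: Path, profile_id: str) -> Path:
    """Directory holding one profile's samples (subdirectory name is assumed)."""
    return data_dir / "profiles" / profile_id


def sample_path(data_dir: Path, profile_id: str, sample_id: str) -> Path:
    """Location of one stored reference sample inside the profile directory."""
    return profile_dir(data_dir, profile_id) / f"{sample_id}.wav"
```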

Core Functions

Creating a Profile

async def create_profile(data: VoiceProfileCreate, db: Session) -> VoiceProfileResponse:
    # 1. Create database record
    db_profile = DBVoiceProfile(
        id=str(uuid.uuid4()),
        name=data.name,
        description=data.description,
        language=data.language,
    )
    db.add(db_profile)
    db.commit()

    # 2. Create profile directory
    profile_dir = profiles_dir / db_profile.id
    profile_dir.mkdir(parents=True, exist_ok=True)

    return VoiceProfileResponse.model_validate(db_profile)

Adding Samples

When a sample is added, the audio is validated and copied to the profile directory:

async def add_profile_sample(
    profile_id: str,
    audio_path: str,
    reference_text: str,
    db: Session,
) -> ProfileSampleResponse:
    # 1. Validate audio (duration, format, quality)
    is_valid, error_msg = validate_reference_audio(audio_path)
    if not is_valid:
        raise ValueError(f"Invalid reference audio: {error_msg}")

    # 2. Copy to profile directory (same layout as in create_profile)
    profile_dir = profiles_dir / profile_id
    sample_id = str(uuid.uuid4())
    dest_path = profile_dir / f"{sample_id}.wav"
    audio, sr = load_audio(audio_path)
    save_audio(audio, str(dest_path), sr)

    # 3. Create database record
    db_sample = DBProfileSample(
        id=sample_id,
        profile_id=profile_id,
        audio_path=str(dest_path),
        reference_text=reference_text,
    )
    db.add(db_sample)
    db.commit()

    return ProfileSampleResponse.model_validate(db_sample)

Voice Prompt Creation

When generating speech, samples are combined into a voice prompt:

async def create_voice_prompt_for_profile(
    profile_id: str,
    db: Session,
) -> dict:
    samples = db.query(DBProfileSample).filter_by(profile_id=profile_id).all()
    if not samples:
        # Preset and designed profiles have no samples, so there is nothing to build from
        raise ValueError(f"No samples found for profile {profile_id}")

    if len(samples) == 1:
        # Single sample - use directly
        sample = samples[0]
        voice_prompt, _ = await tts_model.create_voice_prompt(
            sample.audio_path,
            sample.reference_text,
        )
    else:
        # Multiple samples - combine them
        combined_audio, combined_text = await tts_model.combine_voice_prompts(
            [s.audio_path for s in samples],
            [s.reference_text for s in samples],
        )
        # combined_audio_path is the file that combined_audio is written to;
        # the write step is not shown here
        voice_prompt, _ = await tts_model.create_voice_prompt(
            combined_audio_path,
            combined_text,
        )

    return voice_prompt

Audio Validation

Reference audio is validated before being accepted:

Duration: 3-30 seconds recommended
Format: WAV, MP3, FLAC, OGG, M4A supported
Sample Rate: Engine-specific — the audio utility resamples to whatever the active engine expects (Whisper uses 16 kHz, most TTS engines use 24 kHz, LuxTTS outputs 48 kHz). Resampling happens on the fly; the stored sample retains its original rate.
Channels: Converted to mono if stereo
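The format and duration checks can be sketched with the standard library alone. This is not the project's validate_reference_audio: check_reference_wav is a hypothetical helper that can only measure WAV files, and it treats the recommended 3-30 second range as a hard limit, which is an assumption. It returns the same (is_valid, error_msg) shape that add_profile_sample expects.

```python
import wave
from pathlib import Path
from typing import Tuple

SUPPORTED_EXTENSIONS = {".wav", ".mp3", ".flac", ".ogg", ".m4a"}
MIN_SECONDS = 3.0
MAX_SECONDS = 30.0


def check_reference_wav(path: str) -> Tuple[bool, str]:
    """Check extension and duration; returns (is_valid, error_msg)."""
    ext = Path(path).suffix.lower()
    if ext not in SUPPORTED_EXTENSIONS:
        return False, f"unsupported format: {ext or '(none)'}"
    if ext != ".wav":
        # Decoding compressed formats needs a third-party library.
        return False, "this sketch can only measure WAV files"
    with wave.open(path, "rb") as wav:
        duration = wav.getnframes() / wav.getframerate()
    if not MIN_SECONDS <= duration <= MAX_SECONDS:
        return False, f"duration {duration:.1f}s is outside the 3-30s range"
    return True, ""
```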

Export/Import

Profiles can be exported as ZIP archives for sharing and imported again through the export and import endpoints listed under API Endpoints.
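As a rough illustration of what such an archive could contain, the sketch below bundles profile metadata and sample files with the standard zipfile module. The archive layout (profile.json plus a samples/ folder) and the function name are assumptions for illustration only; the actual Voicebox export format is not specified in this section.

```python
import json
import zipfile
from pathlib import Path
from typing import Iterable


def export_profile_zip(profile: dict, sample_paths: Iterable[Path], dest: Path) -> Path:
    """Write metadata and samples into one ZIP (layout is hypothetical)."""
    with zipfile.ZipFile(dest, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        # Metadata goes in as JSON so an importer can recreate the database row.
        zf.writestr("profile.json", json.dumps(profile, indent=2))
        for path in sample_paths:
            # Store each sample under samples/ using its original file name.
            zf.write(path, arcname=f"samples/{path.name}")
    return dest
```

A preset profile would simply pass an empty list of sample paths, since it has no audio on disk.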

API Endpoints

Method	Endpoint	Description
GET	`/profiles`	List all profiles
POST	`/profiles`	Create a profile
GET	`/profiles/{id}`	Get profile by ID
PUT	`/profiles/{id}`	Update profile
DELETE	`/profiles/{id}`	Delete profile
GET	`/profiles/{id}/samples`	Get profile samples
POST	`/profiles/{id}/samples`	Add sample to profile
PUT	`/profiles/samples/{id}`	Update sample text
DELETE	`/profiles/samples/{id}`	Delete sample
GET	`/profiles/{id}/export`	Export as ZIP
POST	`/profiles/import`	Import from ZIP
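To show how a client might call POST /profiles, the sketch below only builds the request object with urllib and does not send it. The base URL and port, and the assumption that the endpoint accepts a JSON body, are hypothetical; the name, description, and language fields mirror those read from VoiceProfileCreate in create_profile.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: adjust to where the backend runs


def build_create_profile_request(
    name: str,
    description: str = "",
    language: str = "en",
    base_url: str = BASE_URL,
) -> urllib.request.Request:
    """Build (but do not send) a JSON request for POST /profiles."""
    body = json.dumps(
        {"name": name, "description": description, "language": language}
    ).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/profiles",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen would send it once a backend is running at the assumed address.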

Best Practices

Sample Quality

Use clean audio with minimal background noise
Ensure the reference text exactly matches what is spoken
Multiple samples (3-5) improve voice cloning quality

Language Matching

Set the profile language to match the reference audio
Supported languages: en, zh, ja, ko, de, fr, ru, pt, es, it
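A client could guard against unsupported codes before creating a profile. The helper below is a hypothetical sketch built only from the language list above; it is not part of the Voicebox API.

```python
SUPPORTED_LANGUAGES = frozenset(
    {"en", "zh", "ja", "ko", "de", "fr", "ru", "pt", "es", "it"}
)


def normalize_language(code: str) -> str:
    """Lower-case and trim a language code, rejecting unsupported ones."""
    normalized = code.strip().lower()
    if normalized not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language: {code!r}")
    return normalized
```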

Naming Conventions

Use descriptive names that identify the voice
Avoid special characters that may cause filesystem issues
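One way to apply the last point is to flag characters that are commonly rejected in file names before saving a profile. The character set below is an assumption (the usual Windows-reserved characters plus control characters), and unsafe_characters is a hypothetical helper rather than something Voicebox provides.

```python
import re
from typing import List

# Assumed set: characters reserved on Windows file systems, plus control characters.
_UNSAFE = re.compile(r'[<>:"/\\|?*\x00-\x1f]')


def unsafe_characters(name: str) -> List[str]:
    """Return the problematic characters found in a profile name, in order."""
    return _UNSAFE.findall(name)
```

An empty result means the name passes this particular check; it says nothing about length limits or reserved file names.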