OpenAI API Compatibility
Status: Planned for v0.2.0
Issue: #10 OpenAI API compatibility
Overview
This feature exposes OpenAI-compatible endpoints from Voicebox, allowing any tool, library, or application that speaks the OpenAI Audio API to use Voicebox as a drop-in local replacement.
flowchart LR
subgraph clients [External Clients]
SDK[OpenAI SDK]
Curl[curl / HTTP]
Apps[Third-party Apps]
end
subgraph voicebox [Voicebox Server]
OpenAI["/v1/audio/* endpoints"]
TTS[TTSModel]
Whisper[WhisperModel]
Profiles[Voice Profiles]
end
SDK --> OpenAI
Curl --> OpenAI
Apps --> OpenAI
OpenAI --> TTS
OpenAI --> Whisper
OpenAI --> Profiles
Use Cases
- OpenAI SDK users: openai.audio.speech.create() works with Voicebox
- LLM frameworks: LangChain, AutoGen, etc. can use Voicebox for TTS
- Shell scripts: curl commands copy-pasted from OpenAI docs work
- Existing integrations: any tool expecting OpenAI's API works without code changes
Endpoints to Implement
1. POST /v1/audio/speech (TTS)
OpenAI spec: https://platform.openai.com/docs/api-reference/audio/createSpeech
Request:
{
  "model": "tts-1",
  "input": "Hello world!",
  "voice": "alloy",
  "response_format": "mp3",
  "speed": 1.0
}
Response: Audio file (mp3, wav, opus, aac, flac, pcm)
Voice Mapping Strategy:
- The voice parameter maps to Voicebox profile names (case-insensitive)
- If no match, use a configurable default profile
- Support special syntax: voice: "profile:uuid" for an explicit profile ID (example below)
2. POST /v1/audio/transcriptions (Whisper)
OpenAI spec: https://platform.openai.com/docs/api-reference/audio/createTranscription
Request: (multipart/form-data)
- file: audio file to transcribe
- model: "whisper-1"
- language: optional language hint
- response_format: json, text, srt, verbose_json, vtt
Response:
{
  "text": "Hello world!"
}
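For the non-JSON variants, the endpoint returns plain text instead of a JSON body; for instance, response_format=text yields the bare transcript, while response_format=srt yields standard subtitle blocks (timestamps below are illustrative):
1
00:00:00,000 --> 00:00:01,200
Hello world!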
Implementation Details
New File: backend/openai_compat.py
Create a dedicated module with an APIRouter for OpenAI-compatible endpoints:
from fastapi import APIRouter, Depends, UploadFile, File, Form, HTTPException
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
from sqlalchemy.orm import Session
from typing import Literal, Optional

# get_db: the backend's existing database session dependency (import path per project layout)

router = APIRouter(prefix="/v1/audio", tags=["OpenAI Compatible"])


class SpeechRequest(BaseModel):
    model: str = "tts-1"
    input: str
    voice: str = "alloy"
    response_format: Literal["mp3", "wav", "opus", "aac", "flac", "pcm"] = "mp3"
    speed: float = 1.0


@router.post("/speech")
async def create_speech(request: SpeechRequest, db: Session = Depends(get_db)):
    # 1. Map voice name to profile
    # 2. Generate audio using existing TTSModel
    # 3. Convert to requested format
    # 4. Return audio stream
    ...


@router.post("/transcriptions")
async def create_transcription(
    file: UploadFile = File(...),
    model: str = Form("whisper-1"),
    language: Optional[str] = Form(None),
    response_format: str = Form("json"),
):
    # 1. Save uploaded file
    # 2. Transcribe using existing WhisperModel
    # 3. Return in requested format
    ...
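As a sketch only, the four numbered steps in create_speech could compose the helpers proposed below; resolve_voice_for_openai and convert_audio_format are the helpers described in the following sections, while tts_model.generate(...) is a stand-in for the existing TTSModel interface and is an assumption, not the actual call signature.
from fastapi import Response

@router.post("/speech")
async def create_speech(request: SpeechRequest, db: Session = Depends(get_db)):
    # 1. Map voice name to profile (falls back per the resolution rules below)
    profile = await resolve_voice_for_openai(request.voice, db)
    if profile is None:
        raise HTTPException(status_code=400, detail=f"Unknown voice '{request.voice}'")

    # 2. Generate audio with the existing TTS model
    #    (tts_model.generate is a placeholder for the real TTSModel interface)
    audio, sample_rate = tts_model.generate(text=request.input, profile=profile, speed=request.speed)

    # 3. Convert to the requested container/codec
    audio_bytes = convert_audio_format(audio, sample_rate, request.response_format)

    # 4. Return the complete encoded audio (streaming is a later enhancement)
    media_type = "audio/mpeg" if request.response_format == "mp3" else f"audio/{request.response_format}"
    return Response(content=audio_bytes, media_type=media_type)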
Voice Profile Resolution
Add helper in backend/profiles.py:
async def resolve_voice_for_openai(voice: str, db: Session) -> Optional[VoiceProfile]:
    """
    Resolve OpenAI voice parameter to a Voicebox profile.

    Priority:
    1. Exact profile name match (case-insensitive)
    2. Profile ID match (if voice starts with "profile:")
    3. Default profile from config
    4. First available profile
    """
    ...
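A minimal sketch of that resolution order, assuming the VoiceProfile model exposes id and name columns and that config is importable here; the query style below is illustrative rather than the project's actual data-access API.
from sqlalchemy import func

async def resolve_voice_for_openai(voice: str, db: Session) -> Optional[VoiceProfile]:
    # 1. Exact profile name match, case-insensitive
    profile = db.query(VoiceProfile).filter(func.lower(VoiceProfile.name) == voice.lower()).first()
    if profile:
        return profile

    # 2. Explicit profile ID via the "profile:<id>" syntax
    if voice.startswith("profile:"):
        profile = db.query(VoiceProfile).filter(VoiceProfile.id == voice.split(":", 1)[1]).first()
        if profile:
            return profile

    # 3. Default profile from config, matched by ID or name, if one is set
    default = config.OPENAI_COMPAT_DEFAULT_VOICE
    if default:
        profile = (
            db.query(VoiceProfile)
            .filter(
                (VoiceProfile.id == str(default))
                | (func.lower(VoiceProfile.name) == str(default).lower())
            )
            .first()
        )
        if profile:
            return profile

    # 4. Last resort: the first available profile
    return db.query(VoiceProfile).first()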
Audio Format Conversion
Add conversion utilities in backend/utils/audio.py:
def convert_audio_format(
    audio: np.ndarray,
    sample_rate: int,
    target_format: str,  # mp3, wav, opus, aac, flac, pcm
) -> bytes:
    """Convert audio to target format using ffmpeg or pydub."""
    ...
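One possible implementation with pydub (one of the two libraries named under Dependencies); it assumes mono float samples in the -1.0..1.0 range, and some target formats may need mapping to the container names ffmpeg expects.
import io

import numpy as np
from pydub import AudioSegment


def convert_audio_format(audio: np.ndarray, sample_rate: int, target_format: str) -> bytes:
    """Convert mono float audio to the requested container/codec."""
    # Scale float samples (-1.0..1.0) to 16-bit signed PCM
    pcm = (np.clip(audio, -1.0, 1.0) * 32767).astype(np.int16).tobytes()

    if target_format == "pcm":
        # Raw 16-bit little-endian samples with no header
        return pcm

    segment = AudioSegment(data=pcm, sample_width=2, frame_rate=sample_rate, channels=1)
    buf = io.BytesIO()
    # pydub shells out to ffmpeg for non-wav formats; "aac" and "opus" require an
    # ffmpeg build with those encoders and may need mapping to ffmpeg container
    # names (e.g. "adts" for aac).
    segment.export(buf, format=target_format)
    return buf.getvalue()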
Configuration
Add to backend/config.py:
# OpenAI API Compatibility
OPENAI_COMPAT_ENABLED = True
OPENAI_COMPAT_DEFAULT_VOICE = None # Profile ID or name for default voice
OPENAI_COMPAT_REQUIRE_AUTH = False # Require API key validation
OPENAI_COMPAT_API_KEY = None # If set, validate against this
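When OPENAI_COMPAT_REQUIRE_AUTH is enabled, a small dependency could validate the Bearer token that OpenAI clients send in the Authorization header; this is a sketch, not existing code.
from fastapi import Header, HTTPException

async def verify_openai_api_key(authorization: str = Header(default="")) -> None:
    """Reject requests whose Bearer token does not match OPENAI_COMPAT_API_KEY."""
    if not config.OPENAI_COMPAT_REQUIRE_AUTH:
        return
    token = authorization.removeprefix("Bearer ").strip()
    if not config.OPENAI_COMPAT_API_KEY or token != config.OPENAI_COMPAT_API_KEY:
        raise HTTPException(status_code=401, detail="Invalid API key")
Attaching it via APIRouter(..., dependencies=[Depends(verify_openai_api_key)]) would enforce the check on every /v1/audio/* route.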
Integration with main.py
In backend/main.py, include the router:
from . import openai_compat
# Add OpenAI-compatible routes
if config.OPENAI_COMPAT_ENABLED:
    app.include_router(openai_compat.router)
Streaming Support (Future Enhancement)
Initial implementation returns complete audio. Streaming can be added later:
@router.post("/speech")
async def create_speech(request: SpeechRequest):
if request.stream:
return StreamingResponse(
generate_audio_chunks(request),
media_type=f"audio/{request.response_format}"
)
...
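A rough outline of such a chunk generator, assuming the TTS model can synthesize text piece by piece; the sentence splitting and the tts_model.generate call are illustrative only.
async def generate_audio_chunks(request: SpeechRequest):
    """Yield encoded audio chunks as each sentence is synthesized (illustrative only)."""
    for sentence in request.input.split(". "):
        if not sentence.strip():
            continue
        # Placeholder for the real TTSModel call; the actual chunking strategy is TBD
        audio, sample_rate = tts_model.generate(text=sentence)
        # Note: per-chunk encoding suits raw/mp3-like streams; container formats
        # such as wav or flac would need different handling.
        yield convert_audio_format(audio, sample_rate, request.response_format)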
Testing
Example usage after implementation:
# TTS with curl
curl http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model": "tts-1", "input": "Hello!", "voice": "MyProfile"}' \
  --output speech.mp3
# With OpenAI Python SDK
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
response = client.audio.speech.create(
    model="tts-1",
    voice="MyProfile",
    input="Hello world!"
)
response.stream_to_file("output.mp3")
# Transcription
curl http://localhost:8000/v1/audio/transcriptions \
  -F file=@audio.mp3 \
  -F model="whisper-1"
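The transcription endpoint works through the SDK as well, reusing the client configured above:
# Transcription with the OpenAI Python SDK
with open("audio.mp3", "rb") as f:
    transcript = client.audio.transcriptions.create(model="whisper-1", file=f)
print(transcript.text)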
Security Considerations
- Optional API key validation (for shared deployments)
- Rate limiting on endpoints
- Input length limits (same as the existing /generate endpoint)
Dependencies
- pydub or ffmpeg-python for audio format conversion (mp3, opus, etc.)
- No changes to existing TTS/Whisper model code