OpenAI API Compatibility
Status: Planned for v0.2.0
Issue: #10 OpenAI API compatibility
Overview
This feature exposes OpenAI-compatible endpoints from Voicebox, allowing any tool, library, or application that speaks the OpenAI Audio API to use Voicebox as a drop-in local replacement.
flowchart LR
subgraph clients [External Clients]
SDK[OpenAI SDK]
Curl[curl / HTTP]
Apps[Third-party Apps]
end
subgraph voicebox [Voicebox Server]
OpenAI["/v1/audio/* endpoints"]
TTS[TTSModel]
Whisper[WhisperModel]
Profiles[Voice Profiles]
end
SDK --> OpenAI
Curl --> OpenAI
Apps --> OpenAI
OpenAI --> TTS
OpenAI --> Whisper
OpenAI --> Profiles
Use Cases
- OpenAI SDK users: openai.audio.speech.create() works with Voicebox
- LLM frameworks: LangChain, AutoGen, etc. can use Voicebox for TTS
- Shell scripts: curl commands copy-pasted from OpenAI docs work
- Existing integrations: any tool expecting OpenAI's API works without code changes
Endpoints to Implement
1. POST /v1/audio/speech (TTS)
OpenAI spec: https://platform.openai.com/docs/api-reference/audio/createSpeech
Request:
{
  "model": "tts-1",
  "input": "Hello world!",
  "voice": "alloy",
  "response_format": "mp3",
  "speed": 1.0
}
Response: Audio file (mp3, wav, opus, aac, flac, pcm)
Voice Mapping Strategy:
- The voice parameter maps to Voicebox profile names (case-insensitive)
- If no match, use a configurable default profile
- Support special syntax: voice: "profile:uuid" for an explicit profile ID (example below)
2. POST /v1/audio/transcriptions (Whisper)
OpenAI spec: https://platform.openai.com/docs/api-reference/audio/createTranscription
Request: (multipart/form-data)
- file: audio file to transcribe
- model: "whisper-1"
- language: optional language hint
- response_format: json, text, srt, verbose_json, vtt
Response:
{
  "text": "Hello world!"
}
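For the non-JSON variants, the endpoint returns plain text instead of a JSON body; for instance, response_format=text yields the bare transcript, while response_format=srt yields standard subtitle blocks (timestamps below are illustrative):
1
00:00:00,000 --> 00:00:01,200
Hello world!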
Implementation Details
New File: backend/openai_compat.py
Create a dedicated module with an APIRouter for OpenAI-compatible endpoints:
from fastapi import APIRouter, Depends, UploadFile, File, Form, HTTPException
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
from sqlalchemy.orm import Session
from typing import Literal, Optional

# get_db: the backend's existing database session dependency (import path per project layout)

router = APIRouter(prefix="/v1/audio", tags=["OpenAI Compatible"])


class SpeechRequest(BaseModel):
    model: str = "tts-1"
    input: str
    voice: str = "alloy"
    response_format: Literal["mp3", "wav", "opus", "aac", "flac", "pcm"] = "mp3"
    speed: float = 1.0


@router.post("/speech")
async def create_speech(request: SpeechRequest, db: Session = Depends(get_db)):
    # 1. Map voice name to profile
    # 2. Generate audio using existing TTSModel
    # 3. Convert to requested format
    # 4. Return audio stream
    ...


@router.post("/transcriptions")
async def create_transcription(
    file: UploadFile = File(...),
    model: str = Form("whisper-1"),
    language: Optional[str] = Form(None),
    response_format: str = Form("json"),
):
    # 1. Save uploaded file
    # 2. Transcribe using existing WhisperModel
    # 3. Return in requested format
    ...
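As a sketch only, the four numbered steps in create_speech could compose the helpers proposed below; resolve_voice_for_openai and convert_audio_format are the helpers described in the following sections, while tts_model.generate(...) is a stand-in for the existing TTSModel interface and is an assumption, not the actual call signature.
from fastapi import Response

@router.post("/speech")
async def create_speech(request: SpeechRequest, db: Session = Depends(get_db)):
    # 1. Map voice name to profile (falls back per the resolution rules below)
    profile = await resolve_voice_for_openai(request.voice, db)
    if profile is None:
        raise HTTPException(status_code=400, detail=f"Unknown voice '{request.voice}'")

    # 2. Generate audio with the existing TTS model
    #    (tts_model.generate is a placeholder for the real TTSModel interface)
    audio, sample_rate = tts_model.generate(text=request.input, profile=profile, speed=request.speed)

    # 3. Convert to the requested container/codec
    audio_bytes = convert_audio_format(audio, sample_rate, request.response_format)

    # 4. Return the complete encoded audio (streaming is a later enhancement)
    media_type = "audio/mpeg" if request.response_format == "mp3" else f"audio/{request.response_format}"
    return Response(content=audio_bytes, media_type=media_type)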
Voice Profile Resolution
Add helper in backend/profiles.py:
async def resolve_voice_for_openai(voice: str, db: Session) -> Optional[VoiceProfile]:
    """
    Resolve OpenAI voice parameter to a Voicebox profile.

    Priority:
    1. Exact profile name match (case-insensitive)
    2. Profile ID match (if voice starts with "profile:")
    3. Default profile from config
    4. First available profile
    """
    ...
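A minimal sketch of that resolution order, assuming the VoiceProfile model exposes id and name columns and that config is importable here; the query style below is illustrative rather than the project's actual data-access API.
from sqlalchemy import func

async def resolve_voice_for_openai(voice: str, db: Session) -> Optional[VoiceProfile]:
    # 1. Exact profile name match, case-insensitive
    profile = db.query(VoiceProfile).filter(func.lower(VoiceProfile.name) == voice.lower()).first()
    if profile:
        return profile

    # 2. Explicit profile ID via the "profile:<id>" syntax
    if voice.startswith("profile:"):
        profile = db.query(VoiceProfile).filter(VoiceProfile.id == voice.split(":", 1)[1]).first()
        if profile:
            return profile

    # 3. Default profile from config, matched by ID or name, if one is set
    default = config.OPENAI_COMPAT_DEFAULT_VOICE
    if default:
        profile = (
            db.query(VoiceProfile)
            .filter(
                (VoiceProfile.id == str(default))
                | (func.lower(VoiceProfile.name) == str(default).lower())
            )
            .first()
        )
        if profile:
            return profile

    # 4. Last resort: the first available profile
    return db.query(VoiceProfile).first()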
Audio Format Conversion
Add conversion utilities in backend/utils/audio.py:
def convert_audio_format(
    audio: np.ndarray,
    sample_rate: int,
    target_format: str,  # mp3, wav, opus, aac, flac, pcm
) -> bytes:
    """Convert audio to target format using ffmpeg or pydub."""
    ...
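One possible implementation with pydub (one of the two libraries named under Dependencies); it assumes mono float samples in the -1.0..1.0 range, and some target formats may need mapping to the container names ffmpeg expects.
import io

import numpy as np
from pydub import AudioSegment


def convert_audio_format(audio: np.ndarray, sample_rate: int, target_format: str) -> bytes:
    """Convert mono float audio to the requested container/codec."""
    # Scale float samples (-1.0..1.0) to 16-bit signed PCM
    pcm = (np.clip(audio, -1.0, 1.0) * 32767).astype(np.int16).tobytes()

    if target_format == "pcm":
        # Raw 16-bit little-endian samples with no header
        return pcm

    segment = AudioSegment(data=pcm, sample_width=2, frame_rate=sample_rate, channels=1)
    buf = io.BytesIO()
    # pydub shells out to ffmpeg for non-wav formats; "aac" and "opus" require an
    # ffmpeg build with those encoders and may need mapping to ffmpeg container
    # names (e.g. "adts" for aac).
    segment.export(buf, format=target_format)
    return buf.getvalue()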
Configuration
Add to backend/config.py:
# OpenAI API Compatibility
OPENAI_COMPAT_ENABLED = True
OPENAI_COMPAT_DEFAULT_VOICE = None # Profile ID or name for default voice
OPENAI_COMPAT_REQUIRE_AUTH = False # Require API key validation
OPENAI_COMPAT_API_KEY = None # If set, validate against this
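When OPENAI_COMPAT_REQUIRE_AUTH is enabled, a small dependency could validate the Bearer token that OpenAI clients send in the Authorization header; this is a sketch, not existing code.
from fastapi import Header, HTTPException

async def verify_openai_api_key(authorization: str = Header(default="")) -> None:
    """Reject requests whose Bearer token does not match OPENAI_COMPAT_API_KEY."""
    if not config.OPENAI_COMPAT_REQUIRE_AUTH:
        return
    token = authorization.removeprefix("Bearer ").strip()
    if not config.OPENAI_COMPAT_API_KEY or token != config.OPENAI_COMPAT_API_KEY:
        raise HTTPException(status_code=401, detail="Invalid API key")
Attaching it via APIRouter(..., dependencies=[Depends(verify_openai_api_key)]) would enforce the check on every /v1/audio/* route.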
Integration with main.py
In backend/main.py, include the router:
from . import openai_compat
# Add OpenAI-compatible routes
if config.OPENAI_COMPAT_ENABLED:
    app.include_router(openai_compat.router)
Streaming Support (Future Enhancement)
Initial implementation returns complete audio. Streaming can be added later:
@router.post("/speech")
async def create_speech(request: SpeechRequest):
if request.stream:
return StreamingResponse(
generate_audio_chunks(request),
media_type=f"audio/{request.response_format}"
)
...
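A rough outline of such a chunk generator, assuming the TTS model can synthesize text piece by piece; the sentence splitting and the tts_model.generate call are illustrative only.
async def generate_audio_chunks(request: SpeechRequest):
    """Yield encoded audio chunks as each sentence is synthesized (illustrative only)."""
    for sentence in request.input.split(". "):
        if not sentence.strip():
            continue
        # Placeholder for the real TTSModel call; the actual chunking strategy is TBD
        audio, sample_rate = tts_model.generate(text=sentence)
        # Note: per-chunk encoding suits raw/mp3-like streams; container formats
        # such as wav or flac would need different handling.
        yield convert_audio_format(audio, sample_rate, request.response_format)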
Testing
Example usage after implementation:
# TTS with curl
curl http://localhost:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model": "tts-1", "input": "Hello!", "voice": "MyProfile"}' \
  --output speech.mp3
# With OpenAI Python SDK
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
response = client.audio.speech.create(
    model="tts-1",
    voice="MyProfile",
    input="Hello world!"
)
response.stream_to_file("output.mp3")
# Transcription
curl http://localhost:8000/v1/audio/transcriptions \
  -F file=@audio.mp3 \
  -F model="whisper-1"
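The transcription endpoint works through the SDK as well, reusing the client configured above:
# Transcription with the OpenAI Python SDK
with open("audio.mp3", "rb") as f:
    transcript = client.audio.transcriptions.create(model="whisper-1", file=f)
print(transcript.text)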
Security Considerations
- Optional API key validation (for shared deployments)
- Rate limiting on endpoints
- Input length limits (same as the existing /generate endpoint)
Dependencies
- pydub or ffmpeg-python for audio format conversion (mp3, opus, etc.)
- No changes to existing TTS/Whisper model code