BYO API keys
SIVO does not resell or mark up your token usage. Your Deepgram, ElevenLabs, OpenAI, Groq, and other provider keys are stored encrypted in our database (AES-256-GCM, per tenant) and are used to call the providers directly. Benefits:
- Pay the actual provider rate without SIVO markup.
- Your own usage quota and rate limits, not shared with other tenants.
- If you negotiate better enterprise rates, you benefit from them directly.
- Compliance: if your DPA mandates a specific AI provider or region (e.g. EU), you choose it.
STT — Speech to Text
Deepgram (Nova-2 / Nova-3)
- WebSocket streaming, latency <300ms for first partial.
- Best price/quality ratio for volume.
- Optional diarization (speaker separation) in the same stream.
- Languages: 30+ with great quality in es, en, pt, fr, de, it.
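As a rough illustration of how the options above map onto a streaming session, here is a minimal sketch that builds the connection URL and auth header for a tenant's own Deepgram key. The endpoint and the query parameter names (model, language, interim_results, diarize) follow Deepgram's public live-transcription API as commonly documented and should be checked against the current reference; the helper names build_deepgram_url and deepgram_headers are hypothetical and are not part of SIVO.

```python
from urllib.parse import urlencode

# Deepgram's public live-transcription endpoint (assumption: verify in current docs).
DEEPGRAM_LISTEN_URL = "wss://api.deepgram.com/v1/listen"


def build_deepgram_url(model: str = "nova-2", language: str = "es",
                       diarize: bool = False) -> str:
    """Build the WebSocket URL for a streaming session (hypothetical helper)."""
    params = {
        "model": model,
        "language": language,
        # Ask for partial transcripts so results arrive before the utterance ends.
        "interim_results": "true",
    }
    if diarize:
        # Speaker separation in the same stream.
        params["diarize"] = "true"
    return f"{DEEPGRAM_LISTEN_URL}?{urlencode(params)}"


def deepgram_headers(api_key: str) -> dict[str, str]:
    """Auth header carrying the tenant's own Deepgram key."""
    return {"Authorization": f"Token {api_key}"}
```

The URL and headers would then be passed to whichever WebSocket client you use; opening the connection is left out here on purpose.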
ElevenLabs Scribe v2 Realtime
- WebSocket streaming auth via xi-api-key header.
- Excellent in noisy environments and non-native accents.
- Models: scribe_v2_realtime (streaming) and scribe_v2 (batch).
- Latency slightly above Deepgram, superior quality on hard cases.
OpenAI Whisper
- For customers requiring self-managed model hosting for compliance — SIVO deploys it in your region (Enterprise).
- Top quality in minority languages.
- Higher latency (no native streaming) — recommended for post-call, not live.
LLM — reasoning
Any OpenAI-compatible endpoint works. Tested in production:
- OpenAI (GPT-4o, GPT-4o-mini, GPT-4.1) — TTFT 667-2400ms.
- Groq (Llama 3.x, Mixtral) — TTFT ~120ms, best for low latency.
- Cerebras (Llama 3.x) — competitive TTFT.
- Together.ai (open-source models) — model flexibility.
- Anthropic Claude — via OpenAI-compatible proxy.
Tech note: stream_options: {include_usage: true} is not supported by Groq. SIVO drops it automatically when it detects groq.com in the base URL.
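The behaviour in the tech note can be sketched roughly as follows. This is not SIVO's actual code: the function names are hypothetical, and matching on the parsed hostname (rather than a plain substring search over the URL) is an assumption made here to avoid false positives.

```python
from urllib.parse import urlparse


def supports_usage_in_stream(base_url: str) -> bool:
    """Return False for Groq endpoints, detected by hostname (assumed rule)."""
    host = urlparse(base_url).hostname or ""
    return not (host == "groq.com" or host.endswith(".groq.com"))


def build_chat_payload(base_url: str, model: str, messages: list[dict]) -> dict:
    """Build an OpenAI-compatible streaming chat request body."""
    payload = {
        "model": model,
        "messages": messages,
        "stream": True,
        # Ask the provider to report token usage in the final stream chunk.
        "stream_options": {"include_usage": True},
    }
    if not supports_usage_in_stream(base_url):
        # Drop the option for providers that do not accept it.
        payload.pop("stream_options")
    return payload
```

With a Groq base URL the returned payload has no stream_options key; with any other base URL it is kept as is.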
TTS — Text to Speech
ElevenLabs v2 (multilingual)
- WebSocket streaming (stream-input endpoint).
- 30+ languages with consistent accent.
- Does NOT support audio tags [laughs], [sighs].
- Lowest latency, recommended for production.
ElevenLabs v3
- HTTP streaming only (WS returns 403).
- Supports audio tags: the LLM can inject [laughs], [sighs], [whispers].
- Top quality, recommended for premium AI agents.
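Because only v3 supports audio tags, a pipeline that lets the LLM emit them may want to remove them before sending text to a model without tag support. The sketch below shows one way to do that. The regular expression and the function name prepare_tts_text are assumptions for illustration, and the pattern only matches simple bracketed words such as [laughs].

```python
import re

# Bracketed audio tags such as [laughs] or [sighs]; the pattern is an assumption.
AUDIO_TAG = re.compile(r"\[[a-z][a-z ]*\]", re.IGNORECASE)


def prepare_tts_text(text: str, model_supports_tags: bool) -> str:
    """Pass text through for tag-aware models, strip tags otherwise."""
    if model_supports_tags:
        return text
    stripped = AUDIO_TAG.sub("", text)
    # Collapse the double spaces left behind by removed tags.
    return re.sub(r"\s{2,}", " ", stripped).strip()
```

For example, prepare_tts_text("Well [whispers] it's a secret", model_supports_tags=False) returns "Well it's a secret", while the same call with model_supports_tags=True returns the text unchanged.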
OpenAI TTS
- Voices: alloy, echo, fable, onyx, nova, shimmer.
- Cheaper than ElevenLabs, respectable quality.
- Reasonable latency, no audio tags.
Recommended combos
For typical IVR + AI use cases:
- Best latency: Deepgram Nova-2 + Groq Llama 3 + ElevenLabs v2 → ~600ms end-to-end.
- Best quality: ElevenLabs Scribe v2 Realtime + OpenAI GPT-4o + ElevenLabs v3 → ~1.2s end-to-end.
- Best cost: Deepgram Nova-2 + Groq Llama 3 + OpenAI TTS → ~700ms end-to-end at minimum cost.
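The three combos can be written down as plain data, which makes it easy to filter them against a latency budget. This is only an illustrative sketch: the Combo class, the COMBOS table, and combos_within_budget are hypothetical names, and the latency figures are the approximate end-to-end values quoted above, not measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Combo:
    stt: str
    llm: str
    tts: str
    approx_latency_ms: int  # approximate end-to-end figure from the list above


COMBOS = {
    "latency": Combo("Deepgram Nova-2", "Groq Llama 3", "ElevenLabs v2", 600),
    "quality": Combo("ElevenLabs Scribe v2 Realtime", "OpenAI GPT-4o",
                     "ElevenLabs v3", 1200),
    "cost": Combo("Deepgram Nova-2", "Groq Llama 3", "OpenAI TTS", 700),
}


def combos_within_budget(max_latency_ms: int) -> list[str]:
    """Names of the combos whose approximate latency fits the budget, sorted."""
    return sorted(
        name for name, combo in COMBOS.items()
        if combo.approx_latency_ms <= max_latency_ms
    )
```

With a budget of 800 ms, combos_within_budget(800) returns ["cost", "latency"]; the quality combo only fits once the budget reaches about 1.2 s.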