BYO API keys

SIVO does not mark up your token costs. Your Deepgram, ElevenLabs, OpenAI, Groq, and other provider keys are stored encrypted in our database (AES-256-GCM, per tenant) and are used to call the providers directly. Benefits:

Pay the provider's actual rate, with no SIVO markup.
Get your own usage quota and rate limits, not shared with other tenants.
If you negotiate better enterprise rates, you benefit from them directly.
Compliance: if your DPA mandates a specific AI provider or region (e.g. EU), you choose it.

STT — Speech to Text

Deepgram (Nova-2 / Nova-3)

WebSocket streaming, with under 300 ms latency to the first partial transcript.
Best price/quality ratio at high volume.
Optional diarization (speaker separation) in the same stream.
Languages: 30+, with strong quality in es, en, pt, fr, de, it.
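
As a rough illustration of what a streaming session with diarization might look like on the wire, the sketch below builds a Deepgram live-transcription URL and auth header. The endpoint, the query parameter names (model, language, interim_results, diarize) and the Token auth scheme reflect Deepgram's public streaming API as commonly documented, so check the current reference before relying on them. The helper names are hypothetical and this is not SIVO's internal code.

```python
from urllib.parse import urlencode

# Assumed public Deepgram live-transcription endpoint; verify against current docs.
DEEPGRAM_WS = "wss://api.deepgram.com/v1/listen"


def deepgram_stream_url(model="nova-2", language="es", diarize=False):
    """Build the WebSocket URL for a live transcription session (hypothetical helper)."""
    params = {
        "model": model,
        "language": language,
        "interim_results": "true",  # ask for partial transcripts as they arrive
    }
    if diarize:
        params["diarize"] = "true"  # speaker separation in the same stream
    return f"{DEEPGRAM_WS}?{urlencode(params)}"


def deepgram_headers(api_key):
    """Auth header sent with the WebSocket handshake (assumed Token scheme)."""
    return {"Authorization": f"Token {api_key}"}
```

The returned URL and headers can be passed to any WebSocket client; the key is the tenant's own Deepgram key.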

ElevenLabs Scribe v2 Realtime

WebSocket streaming, authenticated via the xi-api-key header.
Excellent in noisy environments and non-native accents.
Models: scribe_v2_realtime (streaming) and scribe_v2 (batch).
Latency slightly higher than Deepgram, with better quality on hard cases.

OpenAI Whisper

For customers who require self-managed model hosting for compliance: SIVO deploys it in your region (Enterprise).
Top quality in minority languages.
Higher latency (no native streaming), so it is recommended for post-call processing rather than live calls.

LLM — reasoning

Any OpenAI-compatible endpoint works. Tested in production:

OpenAI (GPT-4o, GPT-4o-mini, GPT-4.1): time to first token (TTFT) 667-2400 ms.
Groq (Llama 3.x, Mixtral): TTFT ~120 ms, best for low latency.
Cerebras (Llama 3.x): competitive TTFT.
Together.ai (open-source models): model flexibility.
Anthropic Claude: via an OpenAI-compatible proxy.
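
To make "OpenAI-compatible" concrete, here is a minimal sketch of how one request builder can target different providers by swapping only the base URL and key. The two base URLs are the ones OpenAI and Groq publish for their chat completions APIs; the function name and the idea that SIVO builds requests this way are assumptions for illustration only.

```python
import json
import urllib.request

# Published base URLs for two of the providers above; verify before use.
BASE_URLS = {
    "openai": "https://api.openai.com/v1",
    "groq": "https://api.groq.com/openai/v1",
}


def chat_request(base_url, api_key, model, messages):
    """Build (but do not send) a streaming chat completions request (hypothetical helper)."""
    body = json.dumps(
        {"model": model, "messages": messages, "stream": True}
    ).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",  # the tenant's own provider key
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Only the base URL, key, and model name change between providers; the payload shape stays the same.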

Tech note: Groq does not support stream_options: {include_usage: true}, so SIVO drops it automatically when it detects groq.com in the base URL.
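
The behavior in the tech note could be sketched roughly as follows. This is an illustration, not SIVO's actual code: the function names are hypothetical, and the check matches on the hostname rather than a plain substring, which is a slightly stricter variant of "groq.com in the base URL".

```python
from urllib.parse import urlparse


def is_groq(base_url):
    """True when the base URL's host is groq.com or a subdomain (hypothetical check)."""
    host = urlparse(base_url).hostname or ""
    return host == "groq.com" or host.endswith(".groq.com")


def build_payload(base_url, model, messages):
    """Build a streaming chat payload, omitting stream_options for Groq."""
    payload = {
        "model": model,
        "messages": messages,
        "stream": True,
        "stream_options": {"include_usage": True},
    }
    if is_groq(base_url):
        # Per the note above, this field is dropped for Groq endpoints.
        payload.pop("stream_options")
    return payload
```

Matching on the hostname avoids false positives from unrelated URLs that merely contain the string groq.com in a path or query.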

TTS — Text to Speech

ElevenLabs v2 (multilingual)

WebSocket streaming (stream-input endpoint).
30+ languages with consistent accent.
Does NOT support audio tags such as [laughs] or [sighs].
Lowest latency, recommended for production.

ElevenLabs v3

HTTP streaming only (the WebSocket endpoint returns 403).
Supports audio tags: the LLM can inject [laughs], [sighs], or [whispers].
Top quality, recommended for premium AI agents.

OpenAI TTS

Voices: alloy, echo, fable, onyx, nova, shimmer.
Cheaper than ElevenLabs, respectable quality.
Reasonable latency, no audio tags.
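
Since only ElevenLabs v3 accepts audio tags, text produced by the LLM may need to be cleaned before it reaches the other engines. The sketch below shows one possible way to do that. The engine labels, the tag list (taken from the three examples above), and the function name are all assumptions for illustration; how SIVO handles this internally is not described here.

```python
import re

# Hypothetical engine labels mapped to audio-tag support, as described above.
SUPPORTS_AUDIO_TAGS = {
    "elevenlabs-v2": False,
    "elevenlabs-v3": True,
    "openai-tts": False,
}

# Only the three tags named in this section; a real tag list may be longer.
AUDIO_TAG = re.compile(r"\s*\[(?:laughs|sighs|whispers)\]", re.IGNORECASE)


def prepare_tts_text(text, engine):
    """Strip audio tags when the target engine would otherwise read them aloud."""
    if SUPPORTS_AUDIO_TAGS.get(engine, False):
        return text
    cleaned = AUDIO_TAG.sub("", text)
    # Collapse any leftover double spaces and trim the ends.
    return re.sub(r"\s{2,}", " ", cleaned).strip()
```

Unknown engines are treated as not supporting tags, which is the safer default for spoken output.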

Recommended combos

For typical IVR + AI use cases:

Best latency: Deepgram Nova-2 + Groq Llama 3 + ElevenLabs v2 → ~600ms end-to-end.
Best quality: ElevenLabs Scribe v2 Realtime + OpenAI GPT-4o + ElevenLabs v3 → ~1.2s end-to-end.
Best cost: Deepgram Nova-2 + Groq Llama 3 + OpenAI TTS → ~700ms end-to-end at minimum cost.
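
The three combos can also be expressed as data, which makes it easy to filter them by a latency budget. The figures below are copied from the list above; the structure and the function name are hypothetical and only meant as a starting point for your own configuration.

```python
# Recommended combos from the list above, with approximate end-to-end latency.
COMBOS = [
    {"goal": "latency", "stt": "Deepgram Nova-2", "llm": "Groq Llama 3",
     "tts": "ElevenLabs v2", "e2e_ms": 600},
    {"goal": "quality", "stt": "ElevenLabs Scribe v2 Realtime", "llm": "OpenAI GPT-4o",
     "tts": "ElevenLabs v3", "e2e_ms": 1200},
    {"goal": "cost", "stt": "Deepgram Nova-2", "llm": "Groq Llama 3",
     "tts": "OpenAI TTS", "e2e_ms": 700},
]


def combos_within(budget_ms):
    """Return the goals of all combos whose approximate latency fits the budget."""
    return [c["goal"] for c in COMBOS if c["e2e_ms"] <= budget_ms]
```

For example, a budget of 800 ms leaves the latency and cost combos, while the quality combo needs roughly 1.2 s.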

→ Full providers guide in docs
