SIVO
Integrations
Integrations

BYO STT, LLM and TTS — Deepgram, ElevenLabs, Whisper, OpenAI, Groq

Bring your own AI provider keys. SIVO orchestrates, you pay your provider.

BYO API keys

SIVO doesn't middle-man your token cost. Your Deepgram, ElevenLabs, OpenAI, Groq, etc. keys live encrypted (AES-256-GCM per tenant) in our DB and are used directly against the providers. Wins:

  • Pay the actual provider rate without SIVO markup.
  • Your usage quota and rate-limit (not shared with other tenants).
  • If you negotiate better enterprise rates, you reap them directly.
  • Compliance: if your DPA mandates a specific AI provider (e.g. EU region), you pick.

STT — Speech to Text

Deepgram (Nova-2 / Nova-3)

  • WebSocket streaming, latency <300ms for first partial.
  • Best price/quality ratio for volume.
  • Optional diarization (speaker separation) in the same stream.
  • Languages: 30+ with great quality in es, en, pt, fr, de, it.

ElevenLabs Scribe v2 Realtime

  • WebSocket streaming auth via xi-api-key header.
  • Excellent in noisy environments and non-native accents.
  • Models: scribe_v2_realtime (streaming) and scribe_v2 (batch).
  • Latency slightly above Deepgram, superior quality on hard cases.

OpenAI Whisper

  • For customers requiring self-managed model hosting for compliance — SIVO deploys it in your region (Enterprise).
  • Top quality in minority languages.
  • Higher latency (no native streaming) — recommended for post-call, not live.

LLM — reasoning

Any OpenAI-compatible endpoint works. Tested in production:

  • OpenAI (GPT-4o, GPT-4o-mini, GPT-4.1) — TTFT 667-2400ms.
  • Groq (Llama 3.x, Mixtral) — TTFT ~120ms, best for low latency.
  • Cerebras (Llama 3.x) — competitive TTFT.
  • Together.ai (open-source models) — model flexibility.
  • Anthropic Claude — via OpenAI-compatible proxy.

Tech note: stream_options: {include_usage: true} is not supported by Groq — SIVO drops it automatically when detecting groq.com in the base URL.

TTS — Text to Speech

ElevenLabs v2 (multilingual)

  • WebSocket streaming (stream-input endpoint).
  • 30+ languages with consistent accent.
  • Does NOT support audio tags [laughs], [sighs].
  • Lowest latency, recommended for production.

ElevenLabs v3

  • HTTP streaming only (WS returns 403).
  • Supports audio tags — the LLM can inject [laughs], [sighs], [whispers].
  • Top quality, recommended for premium AI agents.

OpenAI TTS

  • Voices: alloy, echo, fable, onyx, nova, shimmer.
  • Cheaper than ElevenLabs, respectable quality.
  • Reasonable latency, no audio tags.

Recommended combos

For typical IVR + AI use cases:

  • Best latency: Deepgram Nova-2 + Groq Llama 3 + ElevenLabs v2 → ~600ms end-to-end.
  • Best quality: ElevenLabs Scribe v2 Realtime + OpenAI GPT-4o + ElevenLabs v3 → ~1.2s end-to-end.
  • Best cost: Deepgram Nova-2 + Groq Llama 3 + OpenAI TTS → ~700ms end-to-end at minimum cost.

→ Full providers guide in docs

Your call center with AI superpowers, in minutes.

Start a 14-day free trial. No card. No lock-in.