SIVO
STT, LLM and TTS providers

Integrations

STT, LLM and TTS providers

How to connect SIVO with your preferred AI providers. BYO keys, your cost, your rate-limit.

Updated:
aisttllmttsintegrations

SIVO orchestrates three AI services per call (in AI agents and live transcription):

  • STT (Speech to Text) — converts audio into text.
  • LLM (language model) — reasons, decides, generates response.
  • TTS (Text to Speech) — converts the LLM response into voice.

You bring your own API keys. SIVO orchestrates, you pay your provider.

Why BYO

  • No SIVO markup on token cost.
  • Your dedicated quota and rate-limit (not shared with other tenants).
  • If you negotiate Enterprise rates with OpenAI/ElevenLabs, you keep them.
  • Compliance: if your DPA mandates a specific AI provider or region, you choose.

STT — Speech to Text

ProviderStreamingBest for
Deepgram Nova-2/Nova-3WebSocketBest cost/quality ratio. Recommended default.
ElevenLabs Scribe v2 RealtimeWebSocketNoisy environments, non-native voices.
OpenAI WhisperNo (batch)Post-call only. Minority languages.

Configure Deepgram

  1. Settings → Secrets → + STT Provider → Deepgram.
  2. Paste your API key.
  3. Pick model (nova-2-general recommended).
  4. Save.

Configure ElevenLabs

  1. Settings → Secrets → + STT Provider → ElevenLabs.
  2. Model: use scribe_v2_realtime (with _realtime suffix). scribe_v2 is batch and doesn’t work with streaming.
  3. Auth header: xi-api-key (SIVO sets it).

LLM — reasoning models

Any OpenAI-compatible endpoint works. Tested:

ProviderTTFT (first token)Recommendation
OpenAI GPT-4o667-2400msHigh quality, variable latency.
OpenAI GPT-4o-mini350-800msGood quality/latency/cost ratio.
Groq Llama 3.1 70B~120msBest latency. Default for voice.
Cerebras Llama 3.1 70B~150msAlternative to Groq, high throughput.
Together.aiVariableFor specific open-source models.
Anthropic Claude~500msVia OpenAI-compatible proxy.

Configure Groq

  1. Settings → Secrets → + LLM Provider → Groq.
  2. API key + model (llama-3.1-70b-versatile).
  3. SIVO detects groq.com in base URL and omits stream_options.include_usage automatically (Groq doesn’t support it).

Configure any OpenAI-compatible

  1. Settings → Secrets → + LLM Provider → Custom.
  2. Fill in:
    • Base URL (e.g. https://api.openai.com/v1, https://api.groq.com/openai/v1).
    • API key.
    • Default model.
  3. For Anthropic: use an OpenAI-compatible proxy (LiteLLM, OpenRouter).

TTS — Text to Speech

ProviderStreamingAudio tagsLatency
ElevenLabs v2 multilingualWebSocketLowest
ElevenLabs v3HTTP (no WS)[laughs], [sighs]Medium
OpenAI TTSStreamMedium

Configure ElevenLabs

  1. Settings → Secrets → + TTS Provider → ElevenLabs.
  2. Model:
    • eleven_multilingual_v2 — WebSocket, no audio tags, low latency. Default for voice.
    • eleven_v3 — HTTP only, with audio tags. Premium.
  3. Voice ID (pick from ElevenLabs library).
  4. language_code for accent consistency (es, en, etc.).

By typical use case:

Best latency (live voice)

  • STT: Deepgram Nova-2
  • LLM: Groq Llama 3.1 70B
  • TTS: ElevenLabs v2

Result: ~600ms end-to-end from silence to first bot audio.

Best quality (premium)

  • STT: ElevenLabs Scribe v2 Realtime
  • LLM: OpenAI GPT-4o
  • TTS: ElevenLabs v3 with audio tags

Result: ~1.2s end-to-end. Voice sounds more natural.

Best cost

  • STT: Deepgram Nova-2
  • LLM: Groq Llama 3.1 70B
  • TTS: OpenAI TTS

Result: ~700ms end-to-end at minimum cost (≈$0.05/min conversed).

Assign to AI agents

Once providers are configured, assign each one to an AI agent:

  1. AI Agents → your agent → Configuration.
  2. Select STT, LLM and TTS providers.
  3. Define system prompt, available functions and transfer nodes.

A single AI agent can have different configurations per environment (sandbox vs prod) for A/B testing.

Security

  • API keys encrypted with AES-256-GCM per tenant in DB.
  • Don’t leave SIVO’s backend — providers never see your customer identity.
  • Rotation: change the key in the panel and SIVO uses the new one on the next call (no restart).
  • If you revoke the key without replacing, calls with AI fail with provider_unavailable — the IVR flow can define an errorNodeId fallback.

Estimated costs

For 1 hour of continuous AI conversation with the low-latency combo:

StageApprox. cost
STT (Deepgram Nova-2)~$0.78
LLM (Groq Llama 3.1 70B)~$0.72
TTS (ElevenLabs v2)~$10.80
Total~$12.30/h conversed

Premium (GPT-4o + ElevenLabs v3) goes to ~$30-40/h. Minimum cost with OpenAI TTS drops to ~$5-7/h.

→ This is your cost with your provider. SIVO doesn’t bill on top.