Integrations
STT, LLM and TTS providers
How to connect SIVO with your preferred AI providers. BYO keys, your cost, your rate-limit.
SIVO orchestrates three AI services per call (in AI agents and live transcription):
- STT (Speech to Text) — converts audio into text.
- LLM (language model) — reasons, decides, generates response.
- TTS (Text to Speech) — converts the LLM response into voice.
You bring your own API keys. SIVO orchestrates, you pay your provider.
Why BYO
- No SIVO markup on token cost.
- Your dedicated quota and rate-limit (not shared with other tenants).
- If you negotiate Enterprise rates with OpenAI/ElevenLabs, you keep them.
- Compliance: if your DPA mandates a specific AI provider or region, you choose.
STT — Speech to Text
| Provider | Streaming | Best for |
|---|---|---|
| Deepgram Nova-2/Nova-3 | WebSocket | Best cost/quality ratio. Recommended default. |
| ElevenLabs Scribe v2 Realtime | WebSocket | Noisy environments, non-native voices. |
| OpenAI Whisper | No (batch) | Post-call only. Minority languages. |
Configure Deepgram
- Settings → Secrets → + STT Provider → Deepgram.
- Paste your API key.
- Pick model (
nova-2-generalrecommended). - Save.
Configure ElevenLabs
- Settings → Secrets → + STT Provider → ElevenLabs.
- Model: use
scribe_v2_realtime(with_realtimesuffix).scribe_v2is batch and doesn’t work with streaming. - Auth header:
xi-api-key(SIVO sets it).
LLM — reasoning models
Any OpenAI-compatible endpoint works. Tested:
| Provider | TTFT (first token) | Recommendation |
|---|---|---|
| OpenAI GPT-4o | 667-2400ms | High quality, variable latency. |
| OpenAI GPT-4o-mini | 350-800ms | Good quality/latency/cost ratio. |
| Groq Llama 3.1 70B | ~120ms | Best latency. Default for voice. |
| Cerebras Llama 3.1 70B | ~150ms | Alternative to Groq, high throughput. |
| Together.ai | Variable | For specific open-source models. |
| Anthropic Claude | ~500ms | Via OpenAI-compatible proxy. |
Configure Groq
- Settings → Secrets → + LLM Provider → Groq.
- API key + model (
llama-3.1-70b-versatile). - SIVO detects
groq.comin base URL and omitsstream_options.include_usageautomatically (Groq doesn’t support it).
Configure any OpenAI-compatible
- Settings → Secrets → + LLM Provider → Custom.
- Fill in:
- Base URL (e.g.
https://api.openai.com/v1,https://api.groq.com/openai/v1). - API key.
- Default model.
- Base URL (e.g.
- For Anthropic: use an OpenAI-compatible proxy (LiteLLM, OpenRouter).
TTS — Text to Speech
| Provider | Streaming | Audio tags | Latency |
|---|---|---|---|
| ElevenLabs v2 multilingual | WebSocket | ❌ | Lowest |
| ElevenLabs v3 | HTTP (no WS) | ✅ [laughs], [sighs] | Medium |
| OpenAI TTS | Stream | ❌ | Medium |
Configure ElevenLabs
- Settings → Secrets → + TTS Provider → ElevenLabs.
- Model:
eleven_multilingual_v2— WebSocket, no audio tags, low latency. Default for voice.eleven_v3— HTTP only, with audio tags. Premium.
- Voice ID (pick from ElevenLabs library).
language_codefor accent consistency (es,en, etc.).
Recommended combos
By typical use case:
Best latency (live voice)
- STT: Deepgram Nova-2
- LLM: Groq Llama 3.1 70B
- TTS: ElevenLabs v2
Result: ~600ms end-to-end from silence to first bot audio.
Best quality (premium)
- STT: ElevenLabs Scribe v2 Realtime
- LLM: OpenAI GPT-4o
- TTS: ElevenLabs v3 with audio tags
Result: ~1.2s end-to-end. Voice sounds more natural.
Best cost
- STT: Deepgram Nova-2
- LLM: Groq Llama 3.1 70B
- TTS: OpenAI TTS
Result: ~700ms end-to-end at minimum cost (≈$0.05/min conversed).
Assign to AI agents
Once providers are configured, assign each one to an AI agent:
- AI Agents → your agent → Configuration.
- Select STT, LLM and TTS providers.
- Define system prompt, available functions and transfer nodes.
A single AI agent can have different configurations per environment (sandbox vs prod) for A/B testing.
Security
- API keys encrypted with AES-256-GCM per tenant in DB.
- Don’t leave SIVO’s backend — providers never see your customer identity.
- Rotation: change the key in the panel and SIVO uses the new one on the next call (no restart).
- If you revoke the key without replacing, calls with AI fail with
provider_unavailable— the IVR flow can define anerrorNodeIdfallback.
Estimated costs
For 1 hour of continuous AI conversation with the low-latency combo:
| Stage | Approx. cost |
|---|---|
| STT (Deepgram Nova-2) | ~$0.78 |
| LLM (Groq Llama 3.1 70B) | ~$0.72 |
| TTS (ElevenLabs v2) | ~$10.80 |
| Total | ~$12.30/h conversed |
Premium (GPT-4o + ElevenLabs v3) goes to ~$30-40/h. Minimum cost with OpenAI TTS drops to ~$5-7/h.
→ This is your cost with your provider. SIVO doesn’t bill on top.