Integrations

STT, LLM and TTS providers

How to connect SIVO with your preferred AI providers. Bring your own (BYO) keys: your cost, your rate limit.

Updated: May 20, 2026


SIVO orchestrates three AI services per call (for AI agents and live transcription): speech-to-text (STT), a large language model (LLM), and text-to-speech (TTS).

You bring your own API keys. SIVO orchestrates, you pay your provider.

Why BYO

No SIVO markup on token cost.
Your own dedicated quota and rate limit (not shared with other tenants).
If you negotiate Enterprise rates with OpenAI/ElevenLabs, you keep them.
Compliance: if your DPA mandates a specific AI provider or region, you choose.

Provider	Streaming	Best for
Deepgram Nova-2/Nova-3	WebSocket	Best cost/quality ratio. Recommended default.
ElevenLabs Scribe v2 Realtime	WebSocket	Noisy environments, non-native voices.
OpenAI Whisper	No (batch)	Post-call only. Minority languages.

Settings → Secrets → + STT Provider → ElevenLabs.
Model: use scribe_v2_realtime (with _realtime suffix). scribe_v2 is batch and doesn’t work with streaming.
Auth header: xi-api-key (SIVO sets it).
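The model-name rule above is easy to get wrong, so a small pre-flight check can help before saving the provider. This is only an illustrative sketch: `check_stt_model` is a hypothetical helper, not part of SIVO, and it assumes that streaming-capable Scribe models are exactly those whose name ends in `_realtime`, as the step above describes.

```python
REALTIME_SUFFIX = "_realtime"


def check_stt_model(model: str) -> str:
    """Return the model name if it looks streaming-capable, else raise.

    Assumption: a name ending in "_realtime" marks the streaming variant;
    anything else (e.g. "scribe_v2") is treated as batch-only.
    """
    if not model.endswith(REALTIME_SUFFIX):
        raise ValueError(
            f"{model!r} is a batch model; use {model + REALTIME_SUFFIX!r} for streaming"
        )
    return model
```

Calling `check_stt_model("scribe_v2")` raises a `ValueError` that suggests `scribe_v2_realtime`.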

Any OpenAI-compatible endpoint works. Tested:

Provider	TTFT (first token)	Recommendation
OpenAI GPT-4o	667-2400ms	High quality, variable latency.
OpenAI GPT-4o-mini	350-800ms	Good quality/latency/cost ratio.
Groq Llama 3.1 70B	~120ms	Best latency. Default for voice.
Cerebras Llama 3.1 70B	~150ms	Alternative to Groq, high throughput.
Together.ai	Variable	For specific open-source models.
Anthropic Claude	~500ms	Via OpenAI-compatible proxy.

Settings → Secrets → + LLM Provider → Groq.
Enter the API key and model (llama-3.1-70b-versatile).
SIVO detects groq.com in the base URL and automatically omits stream_options.include_usage (Groq doesn’t support it).
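The Groq handling described above can be pictured roughly as follows. This is a sketch under assumptions, not SIVO's actual implementation: `build_chat_payload` is a hypothetical name, and matching on the URL hostname (rather than a plain substring search) is a choice made here for illustration.

```python
from urllib.parse import urlparse


def build_chat_payload(base_url: str, model: str, messages: list) -> dict:
    """Build a streaming chat-completions body for an OpenAI-compatible API.

    Assumption: a provider is treated as Groq when its hostname is
    groq.com or a subdomain of it.
    """
    payload = {"model": model, "messages": messages, "stream": True}
    host = urlparse(base_url).hostname or ""
    is_groq = host == "groq.com" or host.endswith(".groq.com")
    if not is_groq:
        # Ask the provider to report token usage at the end of the stream.
        payload["stream_options"] = {"include_usage": True}
    return payload
```

With `https://api.groq.com/openai/v1` the returned body has no `stream_options` key; with `https://api.openai.com/v1` it includes `{"include_usage": True}`.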

Settings → Secrets → + LLM Provider → Custom.
Fill in:
- Base URL (e.g. https://api.openai.com/v1, https://api.groq.com/openai/v1).
- API key.
- Default model.
For Anthropic: use an OpenAI-compatible proxy (LiteLLM, OpenRouter).
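To sanity-check a custom base URL before entering it, it can help to see what an OpenAI-compatible call against it looks like. The sketch below only builds the request and does not send it. It assumes the standard `/chat/completions` path and Bearer authentication used by OpenAI-compatible APIs; `chat_completions_request` is a hypothetical helper name.

```python
import json
import urllib.request


def chat_completions_request(base_url: str, api_key: str, model: str, messages: list):
    """Build (without sending) a streaming chat request for a custom endpoint."""
    # Tolerate a trailing slash in the configured base URL.
    url = base_url.rstrip("/") + "/chat/completions"
    body = json.dumps({"model": model, "messages": messages, "stream": True})
    return urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen` would send the call. The same function works for any of the base URLs listed above, which is what makes a single "Custom" provider form sufficient.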

Provider	Streaming	Audio tags	Latency
ElevenLabs v2 multilingual	WebSocket	❌	Lowest
ElevenLabs v3	HTTP (no WS)	✅ `[laughs]`, `[sighs]`	Medium
OpenAI TTS	Stream	❌	Medium

Settings → Secrets → + TTS Provider → ElevenLabs.
Model:
- eleven_multilingual_v2 — WebSocket, no audio tags, low latency. Default for voice.
- eleven_v3 — HTTP only, with audio tags. Premium.
Voice ID: pick one from the ElevenLabs voice library.
language_code: set it for accent consistency (es, en, etc.).
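The trade-off between the two models comes down to whether the agent's text uses audio tags. As a rough illustration only (in SIVO the model is chosen in the provider settings, not per utterance), the decision could be expressed like this. `pick_tts_model` and the tag pattern are assumptions for this sketch.

```python
import re

# Assumed tag shape: lowercase words in square brackets, e.g. [laughs], [sighs].
AUDIO_TAG = re.compile(r"\[[a-z][a-z ]*\]")


def pick_tts_model(text: str) -> str:
    """Pick eleven_v3 when the text carries audio tags, else the low-latency model."""
    if AUDIO_TAG.search(text):
        return "eleven_v3"  # HTTP only, supports audio tags
    return "eleven_multilingual_v2"  # WebSocket, lowest latency
```

For example, `pick_tts_model("[laughs] Sure, one moment.")` selects `eleven_v3`, while plain text stays on `eleven_multilingual_v2`.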

By typical use case:

Result: ~600ms end-to-end from silence to first bot audio.

Result: ~1.2s end-to-end. Voice sounds more natural.

Result: ~700ms end-to-end at minimum cost (≈$0.05 per minute of conversation).

Once providers are configured, assign each one to an AI agent:

A single AI agent can have different configurations per environment (sandbox vs prod) for A/B testing.

API keys are encrypted per tenant with AES-256-GCM in the database.
Keys never leave SIVO’s backend, and providers never see your customers’ identity.
Rotation: change the key in the panel and SIVO uses the new one on the next call (no restart).
If you revoke a key without replacing it, AI calls fail with provider_unavailable. The IVR flow can define an errorNodeId as a fallback.
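The fallback behaviour in the last point can be sketched as a small routing function. The node shape here is hypothetical: only `errorNodeId` and the `provider_unavailable` error code come from the text above, while `nextNodeId` and `next_node` are names invented for illustration.

```python
from typing import Optional


def next_node(node: dict, error: Optional[str] = None) -> Optional[str]:
    """Return the ID of the node to run next in an IVR flow.

    On provider_unavailable, route to the node's errorNodeId if one is
    defined; otherwise return None so the caller can end the call.
    """
    if error == "provider_unavailable":
        return node.get("errorNodeId")
    # "nextNodeId" is an assumed field name for the normal path.
    return node.get("nextNodeId")
```

A node such as `{"nextNodeId": "menu", "errorNodeId": "transfer_to_human"}` continues to `menu` normally and to `transfer_to_human` when the provider key has been revoked.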

For 1 hour of continuous AI conversation with the low-latency combo:

The premium combo (GPT-4o + ElevenLabs v3) runs ~$30-40/h. The minimum-cost combo with OpenAI TTS drops to ~$5-7/h.

→ This is your cost with your provider. SIVO doesn’t bill on top.