We use Deepgram for transcription (ASR/STT) because it is the most mature provider, offering the most stable API and the largest overall feature set (e.g. supported languages).
We use LiveKit for managed video streaming infrastructure, and LiveKit Agents as the basis for our conversational agents. WebRTC streaming is hard to get right, and LiveKit provides a developer-friendly managed solution for it.
We currently use ElevenLabs Flash 2.5 for voice cloning and real-time speech generation (TTS). It has ultra-low latency (~75 ms claimed, 250 ms measured; even faster than Cartesia) while still supporting multiple languages.
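A claimed and a measured TTS latency can differ simply because of where the clock starts and stops. The sketch below shows one common way to measure it: time from issuing the request until the first non-empty audio chunk arrives. This is an assumption about the method, since the notes above do not say how the 250 ms figure was obtained. `fake_tts_stream` is a hypothetical stand-in for a real streaming TTS client, and all function names here are illustrative.

```python
import time
from typing import Callable, Iterable, Iterator


def time_to_first_chunk(stream_factory: Callable[[], Iterable[bytes]]) -> float:
    """Seconds from starting the request until the first non-empty chunk arrives."""
    start = time.perf_counter()
    for chunk in stream_factory():
        if chunk:
            return time.perf_counter() - start
    raise RuntimeError("stream ended without producing audio")


def median_latency_ms(stream_factory: Callable[[], Iterable[bytes]], runs: int = 5) -> float:
    """Median time-to-first-chunk over several runs, in milliseconds."""
    samples = sorted(time_to_first_chunk(stream_factory) for _ in range(runs))
    return samples[len(samples) // 2] * 1000.0


def fake_tts_stream(delay_s: float = 0.05) -> Iterator[bytes]:
    # Hypothetical stand-in: simulates network + synthesis delay, then yields audio.
    # Replace with a call to the real streaming TTS client.
    time.sleep(delay_s)
    yield b"\x00" * 320
    yield b"\x00" * 320


if __name__ == "__main__":
    latency = median_latency_ms(lambda: fake_tts_stream(0.05))
    print(f"median time to first audio chunk: {latency:.0f} ms")
```

Measured this way, the number includes network round-trip and client overhead, which is one plausible reason a measurement from our own infrastructure comes out higher than a vendor's model-only figure.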