AI Infrastructure · Reviewed June 1, 2026

Ultravox

Real-time speech-native multimodal LLM. Ultravox understands audio directly, without a separate ASR stage, with ~150ms time-to-first-token (TTFT). Open weights, by Fixie AI.

Pricing: Freemium
Rating: 4.78 / 5 · 125 reviews
Last reviewed: June 1, 2026

Overview

Ultravox: Speech-Native Multimodal LLM

Ultravox is a fast multimodal LLM by Fixie AI that understands human speech directly — no separate Automatic Speech Recognition (ASR) stage. The direct audio-to-LLM coupling cuts out a pipeline step that traditional voice agents require, achieving ~150ms time-to-first-token (TTFT) for genuinely real-time conversation.
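TTFT is the delay between sending a request and receiving the first output token. A minimal sketch of how it could be measured against any streaming source is shown here. `measure_ttft` and `fake_model_stream` are hypothetical names for illustration, and the stub stands in for a real model rather than calling any Ultravox API.

```python
import time
from typing import Iterable, Iterator, Tuple


def measure_ttft(stream: Iterable[str]) -> Tuple[float, str]:
    """Return (seconds until the first token arrives, that first token)."""
    start = time.perf_counter()
    for token in stream:
        # Stop at the first token: everything before it counts as TTFT.
        return time.perf_counter() - start, token
    raise ValueError("stream produced no tokens")


def fake_model_stream(delay_s: float) -> Iterator[str]:
    """Hypothetical stand-in for a streaming model: waits, then yields tokens."""
    time.sleep(delay_s)  # simulated processing time before the first token
    yield "Hello"
    yield " there"


ttft, first = measure_ttft(fake_model_stream(0.05))
print(f"TTFT: {ttft * 1000:.0f} ms, first token: {first!r}")
```

Swapping the stub for a real token stream would give a comparable number for a given deployment, although network and audio capture time would need to be accounted for separately.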

The model is available as open weights on Hugging Face, alongside a managed Realtime platform at ultravox.ai for building voice-to-voice agents. It is used by developers who want a speech-native architecture rather than an ASR + LLM + TTS chain.

Key Features

  • Direct audio understanding with text output (no separate ASR pipeline step)
  • ~150ms time-to-first-token
  • Open weights on Hugging Face for self-hosting
  • Realtime managed platform for voice-to-voice agents
  • Multiple model sizes (1B/3B/8B parameters)

Ideal Use Case

Voice agent developers who care about latency above all and want to skip the ASR step; researchers exploring speech-native LLM architectures; teams building voice agents on partner inference platforms (Baseten, fal.ai).

Why Use Ultravox

Traditional voice agents run a three-stage pipeline: STT → LLM → TTS, with each stage adding latency and failure modes. Ultravox collapses STT + LLM into a single model that understands audio directly. The design is architecturally cleaner and lower-latency, and the open-weight release gives teams full control over deployment.
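The difference between the two designs can be sketched as a chain of stages. Every function below (`stt`, `llm`, `audio_llm`, `tts`, `run`) is a hypothetical stub written for illustration, not a real Ultravox or vendor API. The point is only that the speech-native chain has one fewer hop and never produces an intermediate transcript.

```python
from typing import Callable, List


# All four stages are hypothetical stubs, not real model or vendor calls.
def stt(audio: bytes) -> str:
    """Speech-to-text: audio in, transcript out."""
    return "transcript of " + str(len(audio)) + " audio bytes"


def llm(text: str) -> str:
    """Text-only LLM: transcript in, reply text out."""
    return "reply to: " + text


def audio_llm(audio: bytes) -> str:
    """Speech-native LLM: audio in, reply text out, no transcript in between."""
    return "reply to: " + str(len(audio)) + " audio bytes"


def tts(text: str) -> bytes:
    """Text-to-speech: reply text in, audio out (encoded text as a placeholder)."""
    return text.encode("utf-8")


def run(stages: List[Callable], payload):
    """Feed the payload through each stage in order."""
    for stage in stages:
        payload = stage(payload)
    return payload


cascaded = [stt, llm, tts]        # STT -> LLM -> TTS
speech_native = [audio_llm, tts]  # audio LLM -> TTS

audio_in = b"\x00" * 320  # placeholder audio buffer
cascaded_out = run(cascaded, audio_in)
native_out = run(speech_native, audio_in)
print(len(cascaded), "stages vs", len(speech_native), "stages")
```

Both chains still end in a TTS stage, since the model emits text rather than audio.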

FAQ

Q: Does Ultravox replace TTS too? A: Not yet. It understands audio directly but emits text, so TTS is still needed for the response. End-to-end voice-to-voice is planned for future versions.

Q: Is Ultravox open source? A: Yes. The model weights are published on Hugging Face under a permissive license.

Q: Who is Fixie AI? A: The team behind Ultravox; founded by ex-Google folks focused on agentic AI infrastructure.

tl;dr

Speech-native multimodal LLM. Audio in, text out, with no separate ASR stage; ~150ms TTFT; open weights. The architecturally clean voice agent option.


Listing Details

Rating: 4.78 across 125 verified reviews
Saved: 260 times by ToolDirectory readers
Pricing: Freemium (publisher-listed pricing model)
Listed: since 2026, continuously re-reviewed by editors
Category: AI Infrastructure (primary listing)

Verified by editors during the most recent review · ToolDirectory.AI
User Reviews

4.78 out of 5 · 125 ratings

5 stars: 108
4 stars: 10
3 stars: 4
2 stars: 2
1 star: 1
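
The headline rating can be checked against the star counts above. A short sketch of the arithmetic, assuming the rating is a plain weighted mean of the counts (the listing does not state its method):

```python
# Star counts copied from the breakdown above: {stars: number of ratings}.
counts = {5: 108, 4: 10, 3: 4, 2: 2, 1: 1}

total = sum(counts.values())
# Weighted mean: each star value weighted by how many ratings it received.
mean = sum(stars * n for stars, n in counts.items()) / total

print(total, round(mean, 2))
```

The counts sum to 125 and the mean is 597 / 125 = 4.776, which rounds to the listed 4.78.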