Modalities

Text-to-Speech

AI that converts written text into natural-sounding spoken audio — used for narration, accessibility, voice assistants, and content creation.

01 ——

In plain English

Text-to-speech (TTS) is AI that turns written text into spoken audio. The best modern TTS voices can be hard to tell apart from a human recording, and many systems let you control tone, accent, emotion, and pacing.
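One common way to control pacing and pitch is SSML (Speech Synthesis Markup Language), a W3C standard accepted by several cloud TTS services, including Google Cloud TTS and Azure Speech. The sketch below only builds the markup. The helper name `build_ssml` and its parameters are made up for illustration, and each provider supports its own subset of SSML, so check the provider's documentation before relying on a given tag or value.

```python
from xml.sax.saxutils import escape


def build_ssml(text: str, rate: str = "medium", pitch: str = "medium", pause_ms: int = 0) -> str:
    """Wrap plain text in SSML (hypothetical helper, not part of any TTS SDK).

    rate and pitch use SSML's named values, e.g. "slow", "medium", "fast"
    for rate and "low", "medium", "high" for pitch.
    """
    # Escape &, <, > so the text cannot break the XML markup.
    body = f'<prosody rate="{rate}" pitch="{pitch}">{escape(text)}</prosody>'
    if pause_ms > 0:
        # Optional pause after the spoken text.
        body += f'<break time="{pause_ms}ms"/>'
    return f"<speak>{body}</speak>"


ssml = build_ssml("Welcome back. Your order has shipped.", rate="slow", pause_ms=300)
print(ssml)
```

The resulting string would be sent to a TTS service in place of plain text, assuming that service accepts SSML input.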

Common uses:

Audiobooks and narration — narrate articles, books, courses
Accessibility — screen readers for blind and low-vision users
Voice agents — give chatbots a voice
Localisation — voice-over videos in dozens of languages
IVR / phone systems — replace the old robotic phone-tree voices

Leading TTS providers:

ElevenLabs — high-quality voice cloning and synthesis
OpenAI TTS, Google Cloud TTS, Azure Speech
Resemble, PlayHT, Murf — production-focused tools

Combined with speech-to-text and an LLM, TTS enables full voice-conversational AI products.
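The combination described above can be sketched as a three-stage pipeline: audio in, text, reply text, audio out. This is a minimal illustration, not any provider's API. The `VoicePipeline` class and the three stand-in functions are hypothetical; in a real product each stage would call a speech-to-text service, an LLM, and a TTS service respectively.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VoicePipeline:
    """Chains speech-to-text, an LLM, and text-to-speech (illustrative only)."""
    transcribe: Callable[[bytes], str]      # speech-to-text stage
    generate_reply: Callable[[str], str]    # LLM stage
    synthesize: Callable[[str], bytes]      # text-to-speech stage

    def respond(self, audio_in: bytes) -> bytes:
        user_text = self.transcribe(audio_in)
        reply_text = self.generate_reply(user_text)
        return self.synthesize(reply_text)


# Stand-in stages so the sketch runs without any external service.
# They treat the "audio" as UTF-8 bytes, which real audio is not.
def fake_transcribe(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_reply(text: str) -> str:
    return f"You said: {text}"


def fake_synthesize(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = VoicePipeline(fake_transcribe, fake_reply, fake_synthesize)
audio_out = pipeline.respond(b"hello")
print(audio_out)
```

Keeping each stage behind a plain function makes it possible to swap one provider for another without touching the rest of the pipeline.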

02 ——

Related terms

Speech-to-Text

AI that converts spoken audio into written text — the technology behind voice assistants, transcription tools, and meeting recorders.

Voice Cloning

AI that learns to mimic a specific person's voice from a short sample, then generates new speech in that voice from any text.

Conversational AI

AI systems designed for natural back-and-forth dialogue with users — covering chatbots, voice assistants, and AI agents.

Multi-modal

An AI model that can understand and work with multiple types of input — text, images, audio, or video — not just text.

Last reviewed June 2026
