Modalities

Speech-to-Text

AI that converts spoken audio into written text — the technology behind voice assistants, transcription tools, and meeting recorders.

01 ——

In plain English

Speech-to-text — also called Automatic Speech Recognition (ASR) — is AI that turns recorded or live audio of human speech into written text. It powers transcription services, voice assistants, captioning, and voice control.

Common applications:

  • Meeting transcripts — Otter, Fireflies, Granola
  • Voice assistants — Siri, Alexa, ChatGPT voice
  • Subtitles & captions — YouTube, Zoom, podcasts
  • Voice typing — dictation in docs, emails, code editors
  • Call analytics — sales call coaching, support QA

Modern ASR: OpenAI's Whisper, released as open source in 2022, reshaped the field by combining multilingual support with high accuracy. Many current transcription products build on Whisper or on a competing model or API such as AssemblyAI, Deepgram, or Google's Chirp.

Quality is now near-human for clean English audio; accents, multiple speakers, and noisy environments remain harder.
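Accuracy claims like this are usually expressed as word error rate (WER): the number of word substitutions, deletions, and insertions needed to turn the transcript into the reference text, divided by the number of words in the reference. The sketch below is a minimal illustration, not any vendor's scoring method. The function name `word_error_rate` and the simple lowercase-and-split normalization are assumptions made for this example; real evaluations typically also strip punctuation and normalize numbers.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER: word-level edit distance divided by reference length."""
    # Assumed normalization for this sketch: lowercase, split on whitespace.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from transcript
            insertion = curr[j - 1] + 1  # extra word in transcript
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Two of the five reference words are misrecognized.
    print(word_error_rate("turn on the kitchen lights",
                          "turn on the chicken light"))
```

A lower score is better, and 0.0 means the transcript matches the reference word for word. Because insertions count as errors, WER can exceed 1.0 on very poor transcripts.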

02 ——

Related terms

Vol. 4 · Issue 19 · Last reviewed 2026-05-30
