Speech-to-Text

AI that converts spoken audio into written text — the technology behind voice assistants, transcription tools, and meeting recorders.

01 ——

In plain English

Speech-to-text — also called Automatic Speech Recognition (ASR) — is AI that turns recorded or live audio of human speech into written text. It powers transcription services, voice assistants, captioning, and voice control.

Common applications:

Meeting transcripts — Otter, Fireflies, Granola
Voice assistants — Siri, Alexa, ChatGPT voice
Subtitles & captions — YouTube, Zoom, podcasts
Voice typing — dictation in docs, emails, code editors
Call analytics — sales call coaching, support QA

Modern ASR: OpenAI's Whisper reshaped the field when it was released in 2022 as an open-source, multilingual, highly accurate model. Many current transcription products build on Whisper or on a competing service such as AssemblyAI, Deepgram, or Google's Chirp.

On clean English audio, accuracy is now close to that of human transcribers; heavy accents, overlapping speakers, and noisy environments still cause more errors.
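
Accuracy claims like the one above are commonly expressed as word error rate (WER): the number of word substitutions, deletions, and insertions needed to turn the transcript into the correct text, divided by the number of words in the correct text. The sketch below is one minimal way to compute it, not how any particular product measures quality. The function name `word_error_rate` and the simple normalization (lowercasing and splitting on whitespace) are assumptions made for illustration; real evaluations usually also normalize punctuation and numbers.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return the word-level edit distance divided by the reference length.

    Hypothetical helper for illustration: lowercases both strings and
    splits on whitespace, with no punctuation handling.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from the transcript
            insertion = curr[j - 1] + 1  # extra word in the transcript
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    correct = "turn on the kitchen lights"
    transcript = "turn on the chicken light"
    print(word_error_rate(correct, transcript))  # 2 wrong words out of 5
```

Here two of the five reference words are wrong, so the WER is 0.4, or 40%. A lower number is better, and a transcript that matches the reference exactly scores 0.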

02 ——