Multi-modal

An AI model that can understand and work with multiple types of input, such as text, images, audio, or video, rather than text alone.

01 ——

In plain English

A multi-modal AI model can process more than one type of data. Instead of only reading text, it might also understand images, hear audio, or watch video — and combine all of those to generate a response.

Examples:

  • Text + image: Upload a photo of a broken pipe and ask "what's wrong here?"
  • Text + audio: Speak a question and get a spoken answer back
  • Text + video: Describe what's happening in a video clip
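
For readers who want to see what "more than one type of data" looks like in practice, here is a minimal sketch of the text + image example. It is not any vendor's API: the `Part` class, the helper functions, and the placeholder photo bytes are all hypothetical, chosen only to show a single request carrying two modalities side by side.

```python
import base64
from dataclasses import dataclass


@dataclass
class Part:
    """One piece of a request (hypothetical structure, not a real API)."""
    modality: str  # "text", "image", "audio", or "video"
    data: str      # plain text, or base64-encoded bytes for other modalities


def text_part(text: str) -> Part:
    return Part("text", text)


def image_part(raw: bytes) -> Part:
    # Binary data is base64-encoded so it can travel alongside text.
    return Part("image", base64.b64encode(raw).decode("ascii"))


def modalities(parts: list[Part]) -> set[str]:
    """Return the distinct modalities present in a request."""
    return {p.modality for p in parts}


def is_multi_modal(parts: list[Part]) -> bool:
    """A request is multi-modal when it mixes more than one modality."""
    return len(modalities(parts)) > 1


# Placeholder bytes standing in for a real photo of a broken pipe.
photo = b"\x89PNG placeholder bytes"
message = [image_part(photo), text_part("What's wrong here?")]

print(sorted(modalities(message)))
print(is_multi_modal(message))
```

A text-only request built the same way would contain a single modality, which is the difference a multi-modal model is designed to handle.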

Why it matters: Most real-world information isn't just text. Multi-modal models can work with screenshots, diagrams, voice messages, PDFs with charts, and more — making them far more useful for everyday tasks.

GPT-4o, Claude 3, and Gemini Ultra are all multi-modal models, although the set of supported input types differs from model to model.

Vol. 4 · Issue 19 · Last reviewed 2026-05-30
