Multi-modal
An AI model that can understand and work with multiple types of input — text, images, audio, or video — not just text.
In plain English
A multi-modal AI model can process more than one type of data. Instead of only reading text, it might also understand images, hear audio, or watch video — and combine all of those to generate a response.
Examples:
- Text + image: Upload a photo of a broken pipe and ask "what's wrong here?"
- Text + audio: Speak a question and get a spoken answer back
- Text + video: Describe what's happening in a video clip
Why it matters: Most real-world information isn't just text. Multi-modal models can work with screenshots, diagrams, voice messages, PDFs with charts, and more — making them far more useful for everyday tasks.
GPT-4o, Claude 3, and Gemini Ultra are all multi-modal models.