Modalities

Vision-Language Model

A multimodal model that processes both images and text — letting you ask questions about an image, generate captions, or reason over visual content.

01 ——

In plain English

A vision-language model (VLM) is a multimodal model trained to handle images and text together. You can show it an image and ask a question, give it a diagram and ask for code, or feed it a chart and ask what it shows. It's the architecture behind nearly every modern "look at this and tell me about it" feature.

What VLMs can do:

  • Visual question answering — "what's wrong with this CSS?" while looking at a screenshot
  • OCR + reasoning — read a document, answer questions about its content
  • Diagram understanding — interpret flowcharts, architecture diagrams
  • Image captioning and alt-text — at scale
  • Visual agents — see a screen, decide what to click (powers computer-use agents)

Notable VLMs (2026):

  • GPT-5 vision / GPT-4o vision (OpenAI)
  • Claude Sonnet / Opus 4.x (Anthropic, vision-capable across all tiers)
  • Gemini 2.5 (Google, deeply multimodal)
  • Llama 3.2 / 4 vision variants (Meta, open-weight)
  • Qwen-VL (Alibaba)

Where VLMs struggle: Fine detail (text in small images), counting many objects, and exact spatial reasoning (relative positions) are still weak spots.

02 ——

Related terms

Back to glossaryLast reviewed May 2026
Vol. 4 · Issue 19 · Last reviewed 2026-05-30

Sign up for our newsletter

Receive weekly updates so you can stay up-to-date with the world of AI