Modalities

Vision-Language Model

A multimodal model that processes both images and text — letting you ask questions about an image, generate captions, or reason over visual content.

01 ——

In plain English

A vision-language model (VLM) is a multimodal model trained to handle images and text together. You can show it an image and ask a question, give it a diagram and ask for code, or feed it a chart and ask what it shows. It's the architecture behind nearly every modern "look at this and tell me about it" feature.

What VLMs can do:

Visual question answering — "what's wrong with this CSS?" while looking at a screenshot
OCR + reasoning — read a document, answer questions about its content
Diagram understanding — interpret flowcharts, architecture diagrams
Image captioning and alt-text — at scale
Visual agents — see a screen, decide what to click (powers computer-use agents)

Notable VLMs (2026):

GPT-5 vision / GPT-4o vision (OpenAI)
Claude Sonnet / Opus 4.x (Anthropic, vision-capable across all tiers)
Gemini 2.5 (Google, deeply multimodal)
Llama 3.2 / 4 vision variants (Meta, open-weight)
Qwen-VL (Alibaba)

Where VLMs struggle: Fine detail (text in small images), counting many objects, and exact spatial reasoning (relative positions) are still weak spots.

02 ——