Vision-Language Model
A multimodal model that processes both images and text — letting you ask questions about an image, generate captions, or reason over visual content.
In plain English
A vision-language model (VLM) is a multimodal model trained to handle images and text together. You can show it an image and ask a question, give it a diagram and ask for code, or feed it a chart and ask what it shows. It's the architecture behind nearly every modern "look at this and tell me about it" feature.
What VLMs can do:
- Visual question answering — "what's wrong with this CSS?" while looking at a screenshot
- OCR + reasoning — read a document, answer questions about its content
- Diagram understanding — interpret flowcharts, architecture diagrams
- Image captioning and alt-text — at scale
- Visual agents — see a screen, decide what to click (powers computer-use agents)
Notable VLMs (2026):
- GPT-5 vision / GPT-4o vision (OpenAI)
- Claude Sonnet / Opus 4.x (Anthropic, vision-capable across all tiers)
- Gemini 2.5 (Google, deeply multimodal)
- Llama 3.2 / 4 vision variants (Meta, open-weight)
- Qwen-VL (Alibaba)
Where VLMs struggle: Fine detail (text in small images), counting many objects, and exact spatial reasoning (relative positions) are still weak spots.