Vision Language Model (VLM)
A Vision Language Model (VLM) is a type of Multimodal AI that can process and understand text and images jointly. It extends the capabilities of a traditional Large Language Model (LLM) to the visual domain, enabling tasks like Visual Question Answering (VQA) and image captioning.
Core Architecture
Most modern VLMs follow a modular “connect-the-dots” architecture (see the code sketch after this list):
- Vision Encoder: A specialized model (often CLIP or SigLIP) that converts an input image into a series of vector embeddings (mathematical representations of visual features).
- Projector: A small adapter layer (often a simple Multilayer Perceptron or MLP) that translates the “visual embeddings” into the same “word embedding space” used by the LLM. It acts as a bridge, teaching the LLM to “read” images as if they were foreign words.
- LLM Backbone: A standard pre-trained LLM (like Llama 3 or Vicuna) that takes the projected visual tokens together with the user's text prompt and generates a text response.
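The following is a minimal sketch of how these pieces fit together, written in PyTorch. All class names, dimensions, and tensor shapes here are illustrative assumptions, not taken from any specific model; the MLP projector mirrors the LLaVA-style design described above.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions, chosen only for illustration.
VISION_DIM = 1024   # width of the vision encoder's output embeddings
LLM_DIM = 4096      # width of the LLM's word-embedding space


class Projector(nn.Module):
    """Two-layer MLP that maps visual embeddings into the LLM's embedding space."""

    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, image_embeddings: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, vision_dim) -> (batch, num_patches, llm_dim)
        return self.net(image_embeddings)


def build_llm_input(image_embeddings: torch.Tensor,
                    text_token_embeddings: torch.Tensor,
                    projector: Projector) -> torch.Tensor:
    """Project the visual tokens and prepend them to the text token embeddings."""
    visual_tokens = projector(image_embeddings)
    # The LLM backbone then attends over [visual tokens | text tokens] as one sequence.
    return torch.cat([visual_tokens, text_token_embeddings], dim=1)


# Toy usage with random tensors standing in for a real vision encoder and tokenizer.
image_embeddings = torch.randn(1, 576, VISION_DIM)   # e.g. 24x24 patches from a ViT
text_embeddings = torch.randn(1, 32, LLM_DIM)        # embedded prompt tokens
projector = Projector(VISION_DIM, LLM_DIM)
llm_input = build_llm_input(image_embeddings, text_embeddings, projector)
print(llm_input.shape)  # torch.Size([1, 608, 4096])
```

In a real VLM the random tensors would come from a frozen or fine-tuned vision encoder and the LLM's own tokenizer/embedding layer; only the concatenation and projection step is shown here.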
Capabilities
- Visual Question Answering (VQA): “What is unusual about this image?”
- OCR (Optical Character Recognition): Reading text from documents, signs, or screenshots.
- Visual analysis: Explaining charts, diagrams, or complex scenes (a usage sketch follows this list).
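Below is a minimal usage sketch for a VQA-style query, assuming the Hugging Face transformers library and the open-source LLaVA checkpoint llava-hf/llava-1.5-7b-hf. The image path and prompt wording are placeholders, and the exact prompt template should be checked against the model card.

```python
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

# Model ID and prompt template are illustrative; verify against the model card.
model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id)

image = Image.open("example.jpg")  # placeholder: any local image file
# LLaVA-1.5 uses a chat-style prompt with an <image> placeholder token.
prompt = "USER: <image>\nWhat is unusual about this image? ASSISTANT:"

inputs = processor(images=image, text=prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```

The same pattern covers captioning or OCR-style queries by changing only the text of the prompt.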
Examples
- Proprietary: GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro.
- Open Source: LLaVA (Large Language and Vision Assistant), Qwen-VL, Pixtral.
