Vision Language Model (VLM)
A Vision Language Model (VLM) is a type of Multimodal AI that can process and understand text and images jointly. It extends the capabilities of a traditional Large Language Model (LLM) to the visual domain, enabling tasks like Visual Question Answering (VQA) and image captioning.
Core Architecture
Most modern VLMs follow a modular “connect-the-dots” architecture (see the code sketch after this list):
- Vision Encoder: A specialized model (often CLIP or SigLIP) that converts an input image into a series of vector embeddings (mathematical representations of visual features).
- Projector: A small adapter layer (often a simple Multilayer Perceptron or MLP) that translates the “visual embeddings” into the same “word embedding space” used by the LLM. It acts as a bridge, teaching the LLM to “read” images as if they were foreign words.
- LLM Backbone: A standard pre-trained LLM (like Llama 3 or Vicuna) that takes the projected visual tokens together with the user's text prompt and generates a text response.
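The following is a minimal sketch of how these pieces fit together, written in PyTorch. All class names, dimensions, and tensor shapes here are illustrative assumptions, not taken from any specific model; the MLP projector mirrors the LLaVA-style design described above.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions, chosen only for illustration.
VISION_DIM = 1024   # width of the vision encoder's output embeddings
LLM_DIM = 4096      # width of the LLM's word-embedding space


class Projector(nn.Module):
    """Two-layer MLP that maps visual embeddings into the LLM's embedding space."""

    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, image_embeddings: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, vision_dim) -> (batch, num_patches, llm_dim)
        return self.net(image_embeddings)


def build_llm_input(image_embeddings: torch.Tensor,
                    text_token_embeddings: torch.Tensor,
                    projector: Projector) -> torch.Tensor:
    """Project the visual tokens and prepend them to the text token embeddings."""
    visual_tokens = projector(image_embeddings)
    # The LLM backbone then attends over [visual tokens | text tokens] as one sequence.
    return torch.cat([visual_tokens, text_token_embeddings], dim=1)


# Toy usage with random tensors standing in for a real vision encoder and tokenizer.
image_embeddings = torch.randn(1, 576, VISION_DIM)   # e.g. 24x24 patches from a ViT
text_embeddings = torch.randn(1, 32, LLM_DIM)        # embedded prompt tokens
projector = Projector(VISION_DIM, LLM_DIM)
llm_input = build_llm_input(image_embeddings, text_embeddings, projector)
print(llm_input.shape)  # torch.Size([1, 608, 4096])
```

In a real VLM the random tensors would come from a frozen or fine-tuned vision encoder and the LLM's own tokenizer/embedding layer; only the concatenation and projection step is shown here.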
Capabilities
- Visual Question Answering (VQA): “What is unusual about this image?”
- OCR (Optical Character Recognition): Reading text from documents, signs, or screenshots.
- Visual analysis: Explaining charts, diagrams, or complex scenes (a usage sketch follows this list).
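Below is a minimal usage sketch for a VQA-style query, assuming the Hugging Face transformers library and the open-source LLaVA checkpoint llava-hf/llava-1.5-7b-hf. The image path and prompt wording are placeholders, and the exact prompt template should be checked against the model card.

```python
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

# Model ID and prompt template are illustrative; verify against the model card.
model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id)

image = Image.open("example.jpg")  # placeholder: any local image file
# LLaVA-1.5 uses a chat-style prompt with an <image> placeholder token.
prompt = "USER: <image>\nWhat is unusual about this image? ASSISTANT:"

inputs = processor(images=image, text=prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```

The same pattern covers captioning or OCR-style queries by changing only the text of the prompt.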
Examples
- Proprietary: GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro.
- Open Source: LLaVA (Large Language and Vision Assistant), Qwen-VL, Pixtral.
