Vision Language Model (VLM)

A Vision Language Model (VLM) is a type of Multimodal AI that can process and understand both text and images simultaneously. It extends the capabilities of a traditional Large Language Model to the visual domain, enabling tasks like Visual Question Answering (VQA) and image captioning.
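
As a concrete illustration, the sketch below shows Visual Question Answering with an off-the-shelf VLM through the Hugging Face transformers library; the LLaVA-1.5 checkpoint, the prompt template, and the image path are assumptions chosen for the example, not part of the definition above.

```python
# Minimal VQA sketch using a LLaVA-1.5 checkpoint via Hugging Face transformers.
# Checkpoint name, prompt template, and image path are illustrative assumptions.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # assumed public checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("example.jpg")  # any local image
prompt = "USER: <image>\nWhat is happening in this picture? ASSISTANT:"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```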

Core Architecture

Most modern VLMs follow a modular “connect-the-dots” architecture:

  1. Vision Encoder: A specialized model (often CLIP or SigLIP) that converts an input image into a series of vector embeddings (mathematical representations of visual features).
  2. Projector: A small adapter layer (often a simple Multilayer Perceptron or MLP) that translates the “visual embeddings” into the same “word embedding space” used by the LLM. It acts as a bridge, teaching the LLM to “read” images as if they were foreign words.
  3. LLM Backbone: A standard pre-trained LLM (like Llama 3 or Vicuna) that takes the projected visual tokens together with the user's text prompt and generates a text response (see the sketch after this list).
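
A minimal sketch of this three-stage pipeline in PyTorch is shown below; the embedding dimensions, the two-layer MLP projector, and the `embed_tokens` / `inputs_embeds` interface on the LLM are illustrative assumptions rather than any particular model's API.

```python
import torch
import torch.nn as nn

class Projector(nn.Module):
    """Maps vision-encoder embeddings into the LLM's word-embedding space.
    Dimensions are illustrative (e.g. a CLIP-sized encoder, a 7B-scale LLM)."""
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # A small two-layer MLP acting as the bridge between modalities.
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, visual_embeddings: torch.Tensor) -> torch.Tensor:
        return self.mlp(visual_embeddings)

def vlm_forward(image_pixels, text_token_ids, vision_encoder, projector, llm):
    """Conceptual forward pass: encode the image, project it into token space,
    prepend the visual tokens to the text embeddings, and let the LLM respond.
    `vision_encoder` and `llm` are hypothetical pre-trained modules."""
    # 1. Vision encoder: image -> patch embeddings, shape (batch, n_patches, vision_dim)
    visual_embeddings = vision_encoder(image_pixels)
    # 2. Projector: map into the LLM's embedding space, shape (batch, n_patches, llm_dim)
    visual_tokens = projector(visual_embeddings)
    # 3. LLM backbone: treat projected visual tokens like word embeddings
    text_embeddings = llm.embed_tokens(text_token_ids)            # (batch, n_text, llm_dim)
    inputs_embeds = torch.cat([visual_tokens, text_embeddings], dim=1)
    return llm(inputs_embeds=inputs_embeds)                       # next-token logits
```

In many VLMs built this way, the vision encoder is kept frozen and training focuses on the projector (and often the LLM), which keeps the modular approach comparatively cheap.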

Capabilities

Examples
