Skip to content

propardhu/UnderstandingVisionLLM

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

14 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

combine both text and image encoding pipelines into a unified architecture diagram for LLaMA 3.2-Vision-Instruct.


🧠 End-to-End Architecture: Text + Image Flow in LLaMA 3.2

                          +-------------------------+
                          |      [Text Prompt]      |
                          |  "What is your name?"   |
                          +-----------+-------------+
                                      ↓
                           Tokenizer → Token IDs
                                      ↓
                 +----------------------------------------+
                 | Token Embedding Table (Vocab × Dim)    |
                 | Example: (128K × 4096)                 |
                 +----------------------------------------+
                                      ↓
                 Add Rotary Positional Embeddings (to Q & K)
                                      ↓
        ┌──────────────────────────────────────────────────────────┐
        │  TEXT DECODER (40 layers: 32 Self-Attn + 8 Cross-Attn)   │
        │                                                          │
        │  LlamaDecoderLayer (Layer 0 → 39):                       │
        │    • Causal Self-Attention                               │
        │    • [Every few layers] Cross-Attention to image         │
        │    • MLP + RMSNorm + Residuals                           │
        └──────────────────────────────────────────────────────────┘
                                      ↓
                             Final RMSNorm Layer
                                      ↓
                      LM Head (Linear) → Vocabulary logits
                                      ↓
                           Sample / Generate Token

Meanwhile...

                          +------------------------+
                          |       [Image]          |
                          |    224 × 224 RGB       |
                          +-----------+------------+
                                      ↓
                     Conv2D(kernel=14, stride=14) → Patch Embedding
                     → Splits into 14×14 patches → 1280 dim each
                                      ↓
           Add Positional + Tile + Aspect-Ratio Embeddings
                                      ↓
         ┌────────────────────────────────────────────┐
         │       VISION ENCODER (32 Transformer layers)│
         └────────────────────────────────────────────┘
                                      ↓
         ┌────────────────────────────────────────────┐
         │   GLOBAL TRANSFORMER (8 layers, optional)  │
         └────────────────────────────────────────────┘
                                      ↓
                    Output: 6272+ Image Embedding Tokens
                                      ↓
                   Passed into Text Decoder via Cross-Attention

🧩 Final Unified Flow

                 [TEXT INPUT]                       [IMAGE INPUT]
             "What is your name?"                     (RGB image)
                     ↓                                     ↓
         Token Embeddings + Pos. Info        Patch Embedding + Tile Info
                     ↓                                     ↓
     ┌──────────────────────────────┐      ┌─────────────────────────────┐
     │   TEXT DECODER (40 layers)   │◄─────┤     VISION ENCODER (32+8)   │
     │  (Self + Cross Attention)    │      └─────────────────────────────┘
     └──────────────────────────────┘
                     ↓
           Final logits → Next Token Prediction

🧠 Why This Matters

  • The model separately encodes text and images, and fuses them inside the decoder.
  • Text tokens can attend to image embeddings via cross-attention layers at regular intervals (like layers 3, 8, 13…).
  • This enables multimodal reasoning like:

    “What’s happening in this image?” or “Describe the famous person.”


Would you like this turned into:

  • A visual diagram (.png or .svg)?
  • A slide-ready or Medium article section?

Or shall we explore an actual example from your prompt?

About

Running Llama3.2-11B-Vision

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors