combine both text and image encoding pipelines into a unified architecture diagram for LLaMA 3.2-Vision-Instruct.
+-------------------------+
| [Text Prompt] |
| "What is your name?" |
+-----------+-------------+
↓
Tokenizer → Token IDs
↓
+----------------------------------------+
| Token Embedding Table (Vocab × Dim) |
| Example: (128K × 4096) |
+----------------------------------------+
↓
Add Rotary Positional Embeddings (to Q & K)
↓
┌──────────────────────────────────────────────────────────┐
│ TEXT DECODER (40 layers: 32 Self-Attn + 8 Cross-Attn) │
│ │
│ LlamaDecoderLayer (Layer 0 → 39): │
│ • Causal Self-Attention │
│ • [Every few layers] Cross-Attention to image │
│ • MLP + RMSNorm + Residuals │
└──────────────────────────────────────────────────────────┘
↓
Final RMSNorm Layer
↓
LM Head (Linear) → Vocabulary logits
↓
Sample / Generate Token
Meanwhile...
+------------------------+
| [Image] |
| 224 × 224 RGB |
+-----------+------------+
↓
Conv2D(kernel=14, stride=14) → Patch Embedding
→ Splits into 14×14 patches → 1280 dim each
↓
Add Positional + Tile + Aspect-Ratio Embeddings
↓
┌────────────────────────────────────────────┐
│ VISION ENCODER (32 Transformer layers)│
└────────────────────────────────────────────┘
↓
┌────────────────────────────────────────────┐
│ GLOBAL TRANSFORMER (8 layers, optional) │
└────────────────────────────────────────────┘
↓
Output: 6272+ Image Embedding Tokens
↓
Passed into Text Decoder via Cross-Attention
[TEXT INPUT] [IMAGE INPUT]
"What is your name?" (RGB image)
↓ ↓
Token Embeddings + Pos. Info Patch Embedding + Tile Info
↓ ↓
┌──────────────────────────────┐ ┌─────────────────────────────┐
│ TEXT DECODER (40 layers) │◄─────┤ VISION ENCODER (32+8) │
│ (Self + Cross Attention) │ └─────────────────────────────┘
└──────────────────────────────┘
↓
Final logits → Next Token Prediction
- The model separately encodes text and images, and fuses them inside the decoder.
- Text tokens can attend to image embeddings via cross-attention layers at regular intervals (like layers 3, 8, 13…).
- This enables multimodal reasoning like:
“What’s happening in this image?” or “Describe the famous person.”
Would you like this turned into:
- A visual diagram (.png or .svg)?
- A slide-ready or Medium article section?
Or shall we explore an actual example from your prompt?