Cross-reference hyperlinks and Big Picture backward links across all chapters
- Add 195 cross-reference hyperlinks across 80 files (Module N, Section N.N, Chapter N)
- Add backward-facing links to 95 Big Picture boxes across 103 files
- All links use relative paths with consistent styling
- Idempotent scripts ensure no double-linking
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
<p>Notice how standardization brings all features to a comparable scale. Without this step, the "square footage" feature (values in the thousands) would dominate "bedrooms" (values from 1 to 5) during optimization, not because it is more important, but simply because it is numerically larger.</p>
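<p>As a minimal sketch of this step (not the chapter's own code; it assumes NumPy and made-up feature values), standardization is just a per-column shift and rescale:</p>
<pre><code>import numpy as np

# Hypothetical raw features: [square footage, bedrooms]
X = np.array([[2100.0, 3.0],
              [1400.0, 2.0],
              [3200.0, 5.0],
              [1750.0, 3.0]])

# Standardize each column: subtract its mean, divide by its standard deviation
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_scaled)  # each column now has mean 0 and standard deviation 1
</code></pre>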
<h2>2. Supervised Learning: Classification and Regression</h2>
<p>Supervised learning is the backbone of modern ML. The idea is straightforward: you give the model examples of inputs paired with correct outputs, and it learns the mapping between them. This is analogous to learning from a textbook that has an answer key. You study the problems, check your answers, and gradually improve.</p>
-<div class="diagram-caption">Figure 1: Gradient descent follows the slope downhill, step by step. The learning rate controls step size.</div>
+<div class="diagram-caption">Figure 0.1.1: Gradient descent follows the slope downhill, step by step. The learning rate controls step size.</div>
</div>
<h3>Variants of Gradient Descent</h3>
@@ -650,7 +649,6 @@ <h3>Variants of Gradient Descent</h3>
<p>The <strong>learning rate</strong> is the single most important hyperparameter in optimization. Too large, and the steps overshoot the minimum, causing the loss to diverge. Too small, and training takes forever (or gets stuck). Modern practice uses learning rate schedulers that start with a larger rate and decay it over time, combining fast early progress with fine-grained later convergence.</p>
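<p>A toy sketch of that interaction (illustrative only; plain Python with made-up numbers) showing gradient descent with a simple exponential decay schedule:</p>
<pre><code># Minimize f(w) = (w - 3)^2 with gradient descent and a decaying learning rate
w = 0.0
lr = 0.4       # initial learning rate
decay = 0.95   # shrink the rate a little after every step

for step in range(50):
    grad = 2 * (w - 3)   # derivative of (w - 3)^2
    w -= lr * grad       # step downhill, scaled by the current learning rate
    lr *= decay          # decay: large early steps, fine-grained later ones

print(w)  # close to the minimum at w = 3
</code></pre>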
</div>
<h2>4. Overfitting, Underfitting, and Regularization</h2>
-<div class="diagram-caption">Figure 2: As model complexity increases, bias decreases but variance increases. The optimal model minimizes total error.</div>
+<div class="diagram-caption">Figure 0.1.2: As model complexity increases, bias decreases but variance increases. The optimal model minimizes total error.</div>
</div>
<p>Consider this analogy. Imagine asking multiple artists to draw a portrait from memory after briefly seeing a photograph. A stick-figure artist (high bias) will always produce an oversimplified drawing, regardless of the photo. A hyper-detailed artist (high variance) might capture every pore in one sitting but produce a wildly different portrait each time because they are also capturing fleeting shadows and reflections. The best artist has enough skill to capture the essential likeness (low bias) while remaining consistent across attempts (low variance).</p>
<p>Modern deep learning complicates the classical bias-variance tradeoff. Very large neural networks (including LLMs) are so overparameterized that they can memorize the training set perfectly, yet they still generalize well. This phenomenon, sometimes called "benign overfitting" or the "double descent" curve, is an active area of research. The classical framework remains a valuable mental model, but reality is richer than the simple U-shaped curve suggests.</p>
</div>
<h2>6. Cross-Validation and Model Selection</h2>
<p>You have several candidate models, each with different hyperparameters (learning rate, regularization strength, model complexity). How do you choose the best one? You cannot use training performance because that rewards overfitting. You need a reliable estimate of generalization performance.</p>
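<p>One common answer is k-fold cross-validation: split the training data into k folds, train on k−1 of them, validate on the held-out fold, and average over all k rotations. The sketch below is illustrative only (it assumes scikit-learn and a synthetic dataset) and compares two hyperparameter settings by their average validation score:</p>
<pre><code>import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data, for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# Compare two regularization strengths with 5-fold cross-validation
for C in (0.01, 1.0):
    scores = cross_val_score(LogisticRegression(C=C), X, y, cv=5)
    print(C, scores.mean())  # pick the setting with the better average held-out score
</code></pre>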
<p>In the LLM world, cross-validation is less common because datasets are enormous and models are expensive to train. Instead, practitioners rely on large held-out evaluation sets, benchmark suites (like MMLU or HumanEval), and qualitative evaluation. But the principle is the same: always evaluate on data the model did not train on.</p>
</div>
<h2>7. Putting It All Together: The Full Pipeline</h2>
<p>Let us trace a complete example that ties every concept together. Suppose you are building a spam classifier for emails.</p>
@@ -917,7 +913,6 @@ <h2>7. Putting It All Together: The Full Pipeline</h2>
<p>This exact workflow scales to far more complex settings. When researchers train GPT-style models, they follow the same logical steps at a vastly larger scale: represent text as token sequences (features), define cross-entropy loss over next-token prediction, optimize with Adam (a sophisticated variant of SGD that adapts the learning rate per parameter; we will explain Adam in detail in <a href="section-0.3.html" style="color: var(--accent, #0f3460);">Section 0.3</a>), apply dropout and weight decay, and evaluate on held-out benchmarks. In Section 0.3, you will implement this entire workflow in PyTorch, the framework used for most modern LLM research.</p>
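<p>As a rough end-to-end sketch of the spam-classifier pipeline above (illustrative only; it assumes scikit-learn and a tiny made-up corpus rather than the chapter's own code):</p>
<pre><code>from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus; a real spam filter would use thousands of labeled emails
emails = ["win a free prize now", "meeting moved to 3pm",
          "claim your reward today", "lunch tomorrow?",
          "free money click here", "quarterly report attached"]
labels = [1, 0, 1, 0, 1, 0]   # 1 = spam, 0 = not spam

# Hold out part of the data so evaluation measures generalization, not memorization
X_train, X_test, y_train, y_test = train_test_split(
    emails, labels, test_size=0.33, random_state=0, stratify=labels)

# Features (TF-IDF) feeding a regularized linear classifier trained on a convex loss
model = make_pipeline(TfidfVectorizer(), LogisticRegression(C=1.0))
model.fit(X_train, y_train)

print(model.score(X_test, y_test))   # accuracy on unseen emails
</code></pre>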
<!-- Interactive Quiz -->
<div class="quiz">
<h3>✔ Check Your Understanding</h3>
@@ -963,7 +958,6 @@ <h3>✔ Check Your Understanding</h3>
<div class="callout-title">Big Picture: From Basic ML to Neural Networks</div>
<p>In <a href="section-0.1.html" style="color: var(--accent, #0f3460);">Section 0.1</a>, you learned how a model can learn from data using gradient descent and loss functions. Those ideas were powerful, but they were limited to finding simple patterns (linear boundaries, shallow decision trees). Deep learning changed everything by <em>stacking layers of simple functions</em> to learn extraordinarily complex representations. This single idea, composing simple transformations into deep hierarchies, is what lets a neural network translate languages, generate images, and power the conversational AI systems you will build in this book.</p>
</div>
@@ -512,7 +520,7 @@ <h3>1.1 The Perceptron: Your First Artificial Neuron</h3>
-<div class="diagram-caption">Figure 1: Anatomy of a single perceptron (artificial neuron). Each input is multiplied by a weight, the products are summed with a bias, and an activation function produces the output.</div>
+<div class="diagram-caption">Figure 0.2.1: Anatomy of a single perceptron (artificial neuron). Each input is multiplied by a weight, the products are summed with a bias, and an activation function produces the output.</div>
<li><strong>Output layer:</strong> Produces the final prediction (a class probability, a regression value, etc.).</li>
</ul>
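<p>A minimal numerical sketch of the single-neuron computation in Figure 0.2.1 (toy numbers, assuming NumPy; not the book's own code):</p>
<pre><code>import numpy as np

x = np.array([0.5, -1.2, 3.0])    # inputs
w = np.array([0.8, 0.1, -0.4])    # one weight per input
b = 0.2                           # bias

z = np.dot(w, x) + b              # weighted sum of inputs plus bias
output = 1 / (1 + np.exp(-z))     # sigmoid activation squashes z into (0, 1)

print(z, output)
</code></pre>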
-<div class="callout callout-insight">
+<div class="callout key-insight">
<div class="callout-title">Key Insight</div>
<p>The <strong>Universal Approximation Theorem</strong> tells us that an MLP with just one hidden layer and enough neurons can approximate <em>any</em> continuous function to arbitrary accuracy. In practice, though, <em>deeper</em> networks (more layers with fewer neurons each) tend to learn more efficiently than extremely wide, shallow ones. Depth lets the network build hierarchical features: edges compose into textures, textures into parts, parts into objects.</p>
<tr><td><strong>Softmax</strong></td><td>e<sup>z<sub>i</sub></sup> / Σe<sup>z<sub>j</sub></sup></td><td>(0, 1), sums to 1</td><td>Multi-class classification output layer.</td></tr>
</table>
-<div class="callout callout-warning">
+<div class="callout warning">
<div class="callout-title">Warning: The Dying ReLU Problem</div>
<p>If a neuron's weights cause its input to always be negative, ReLU outputs zero for every sample, and the gradient is also zero, so the neuron never updates again. It is "dead." Variants like <strong>Leaky ReLU</strong> (which outputs a small negative slope instead of zero) and <strong>GELU</strong> address this issue.</p>
</div>
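<p>A quick numerical illustration of these activations side by side (toy values, assuming NumPy):</p>
<pre><code>import numpy as np

z = np.array([-2.0, -0.5, 0.0, 1.5])         # pre-activation values

relu = np.maximum(0.0, z)                    # negatives become 0, and so do their gradients
leaky_relu = np.where(z > 0, z, 0.01 * z)    # keeps a small slope, so a "dead" neuron can recover
softmax = np.exp(z) / np.exp(z).sum()        # exponentiate, then normalize so the outputs sum to 1

print(relu)        # [0.  0.  0.  1.5]
print(leaky_relu)  # [-0.02  -0.005  0.  1.5]
print(softmax)     # four probabilities that sum to 1
</code></pre>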
@@ -645,7 +653,7 @@ <h3>2.1 A Concrete Numerical Example</h3>
</marker>
</defs>
</svg>
-<div class="diagram-caption">Figure 2: Backpropagation through a single neuron. The forward pass computes the loss (left to right), then gradients flow backward (right to left) via the chain rule.</div>
+<div class="diagram-caption">Figure 0.2.2: Backpropagation through a single neuron. The forward pass computes the loss (left to right), then gradients flow backward (right to left) via the chain rule.</div>
</div>
<p>Let us trace through this step by step:</p>
@@ -677,7 +685,7 @@ <h4>Debugging a Vanishing Gradient in Production</h4>
<p><strong>Lesson: When training stalls, check your gradient flow before changing your architecture. Activation functions, initialization, and normalization are the first levers to pull.</strong></p>
</div>
-<div class="callout callout-note">
+<div class="callout note">
<div class="callout-title">Note</div>
<p>In a real network with millions of parameters, this same process happens simultaneously for every weight. Modern frameworks like PyTorch compute all these gradients automatically using a technique called <strong>automatic differentiation</strong>, which builds a computational graph during the forward pass and traverses it in reverse during the backward pass.</p>
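<p>A minimal sketch of that idea (assuming PyTorch is installed; toy scalar values):</p>
<pre><code>import torch

w = torch.tensor(2.0, requires_grad=True)
x = torch.tensor(3.0)

loss = (w * x - 1.0) ** 2   # forward pass: each operation is recorded in a computational graph
loss.backward()             # backward pass: the graph is traversed in reverse via the chain rule

print(w.grad)               # dloss/dw = 2 * (w*x - 1) * x = 30.0
</code></pre>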
<tr><td><strong>Kaiming/He</strong></td><td>ReLU and variants</td><td>Scales weights by √(2/n<sub>in</sub>), accounting for ReLU zeroing out half the values.</td></tr>
</table>
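<p>As a one-line illustration of the Kaiming/He scheme from the table (assuming NumPy; the layer sizes are made up):</p>
<pre><code>import numpy as np

n_in, n_out = 512, 256
W = np.random.randn(n_out, n_in) * np.sqrt(2.0 / n_in)   # scale by sqrt(2 / n_in)
b = np.zeros(n_out)                                      # biases typically start at zero

print(W.std())   # roughly sqrt(2 / 512) = 0.0625
</code></pre>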
-<div class="callout callout-insight">
+<div class="callout key-insight">
<div class="callout-title">Key Insight</div>
<p>Batch normalization, dropout, and proper weight initialization are not optional extras. They are <em>essential infrastructure</em> for training deep networks reliably. Skipping any one of them often leads to unstable training, poor generalization, or both. Modern architectures like Transformers replace BatchNorm with LayerNorm, but the principle of normalizing intermediate representations remains universal.</p>
<p>After several convolutional and pooling layers, the output is flattened and passed through fully connected layers for the final prediction.</p>
-<div class="callout callout-note">
+<div class="callout note">
<div class="callout-title">Note</div>
<p>While this book focuses on language models and conversational AI (which primarily use Transformers), understanding CNNs remains valuable. Many multimodal AI systems combine vision encoders (CNNs or Vision Transformers) with language models. You will encounter this pattern when studying vision-language models in later modules.</p>
<div class="callout-title">Key Insight: The Three Safety Nets</div>
<p>Think of these three techniques as complementary safety nets. <strong>Gradient clipping</strong> prevents catastrophic updates on any single step. <strong>Learning rate scheduling</strong> ensures the optimization trajectory stays smooth over the full training run. <strong>Early stopping</strong> catches overfitting at the macro level by watching validation performance. Together, they make deep learning training far more reliable.</p>
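<p>A condensed sketch of how the three nets typically appear together in one training loop (illustrative only; it assumes PyTorch and uses a toy synthetic regression problem):</p>
<pre><code>import torch
from torch import nn

torch.manual_seed(0)
X = torch.randn(256, 10)
y = X @ torch.randn(10, 1) + 0.1 * torch.randn(256, 1)
X_train, y_train, X_val, y_val = X[:200], y[:200], X[200:], y[200:]

model = nn.Linear(10, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

best_val, patience, bad_epochs = float("inf"), 5, 0
for epoch in range(100):
    loss = nn.functional.mse_loss(model(X_train), y_train)
    optimizer.zero_grad()
    loss.backward()
    nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # net 1: clip catastrophic updates
    optimizer.step()
    scheduler.step()                                            # net 2: smooth learning-rate decay

    with torch.no_grad():
        val_loss = nn.functional.mse_loss(model(X_val), y_val).item()
    improved = best_val - val_loss > 1e-4
    if improved:
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs == patience:                              # net 3: stop when validation stalls
            break
</code></pre>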