Cross-reference hyperlinks and Big Picture backward links across all chapters
- Add 195 cross-reference hyperlinks across 80 files (Module N, Section N.N, Chapter N)
- Add backward-facing links to 95 Big Picture boxes across 103 files
- All links use relative paths with consistent styling
- Idempotent scripts ensure no double-linking
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
<p>Notice how standardization brings all features to a comparable scale. Without this step, the "square footage" feature (values in the thousands) would dominate "bedrooms" (values from 1 to 5) during optimization, not because it is more important, but simply because it is numerically larger.</p>
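<p>As a minimal sketch of this step (not the chapter's own code; it assumes NumPy and made-up feature values), standardization is just a per-column shift and rescale:</p>
<pre><code>import numpy as np

# Hypothetical raw features: [square footage, bedrooms]
X = np.array([[2100.0, 3.0],
              [1400.0, 2.0],
              [3200.0, 5.0],
              [1750.0, 3.0]])

# Standardize each column: subtract its mean, divide by its standard deviation
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_scaled)  # each column now has mean 0 and standard deviation 1
</code></pre>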
<h2>2. Supervised Learning: Classification and Regression</h2>
<p>Supervised learning is the backbone of modern ML. The idea is straightforward: you give the model examples of inputs paired with correct outputs, and it learns the mapping between them. This is analogous to learning from a textbook that has an answer key. You study the problems, check your answers, and gradually improve.</p>
-<div class="diagram-caption">Figure 1: Gradient descent follows the slope downhill, step by step. The learning rate controls step size.</div>
+<div class="diagram-caption">Figure 0.1.1: Gradient descent follows the slope downhill, step by step. The learning rate controls step size.</div>
</div>
<h3>Variants of Gradient Descent</h3>
@@ -650,7 +649,6 @@ <h3>Variants of Gradient Descent</h3>
<p>The <strong>learning rate</strong> is the single most important hyperparameter in optimization. Too large, and the steps overshoot the minimum, causing the loss to diverge. Too small, and training takes forever (or gets stuck). Modern practice uses learning rate schedulers that start with a larger rate and decay it over time, combining fast early progress with fine-grained later convergence.</p>
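<p>A toy sketch of that interaction (illustrative only; plain Python with made-up numbers) showing gradient descent with a simple exponential decay schedule:</p>
<pre><code># Minimize f(w) = (w - 3)^2 with gradient descent and a decaying learning rate
w = 0.0
lr = 0.4       # initial learning rate
decay = 0.95   # shrink the rate a little after every step

for step in range(50):
    grad = 2 * (w - 3)   # derivative of (w - 3)^2
    w -= lr * grad       # step downhill, scaled by the current learning rate
    lr *= decay          # decay: large early steps, fine-grained later ones

print(w)  # close to the minimum at w = 3
</code></pre>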
</div>
<h2>4. Overfitting, Underfitting, and Regularization</h2>
-<div class="diagram-caption">Figure 2: As model complexity increases, bias decreases but variance increases. The optimal model minimizes total error.</div>
+<div class="diagram-caption">Figure 0.1.2: As model complexity increases, bias decreases but variance increases. The optimal model minimizes total error.</div>
</div>
<p>Consider this analogy. Imagine asking multiple artists to draw a portrait from memory after briefly seeing a photograph. A stick-figure artist (high bias) will always produce an oversimplified drawing, regardless of the photo. A hyper-detailed artist (high variance) might capture every pore in one sitting but produce a wildly different portrait each time because they are also capturing fleeting shadows and reflections. The best artist has enough skill to capture the essential likeness (low bias) while remaining consistent across attempts (low variance).</p>
<p>Modern deep learning complicates the classical bias-variance tradeoff. Very large neural networks (including LLMs) are so overparameterized that they can memorize the training set perfectly, yet they still generalize well. This phenomenon, sometimes called "benign overfitting" or the "double descent" curve, is an active area of research. The classical framework remains a valuable mental model, but reality is richer than the simple U-shaped curve suggests.</p>
</div>
<h2>6. Cross-Validation and Model Selection</h2>
<p>You have several candidate models, each with different hyperparameters (learning rate, regularization strength, model complexity). How do you choose the best one? You cannot use training performance because that rewards overfitting. You need a reliable estimate of generalization performance.</p>
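<p>One common answer is k-fold cross-validation: split the training data into k folds, train on k−1 of them, validate on the held-out fold, and average over all k rotations. The sketch below is illustrative only (it assumes scikit-learn and a synthetic dataset) and compares two hyperparameter settings by their average validation score:</p>
<pre><code>import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data, for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# Compare two regularization strengths with 5-fold cross-validation
for C in (0.01, 1.0):
    scores = cross_val_score(LogisticRegression(C=C), X, y, cv=5)
    print(C, scores.mean())  # pick the setting with the better average held-out score
</code></pre>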
<p>In the LLM world, cross-validation is less common because datasets are enormous and models are expensive to train. Instead, practitioners rely on large held-out evaluation sets, benchmark suites (like MMLU or HumanEval), and qualitative evaluation. But the principle is the same: always evaluate on data the model did not train on.</p>
</div>
<h2>7. Putting It All Together: The Full Pipeline</h2>
<p>Let us trace a complete example that ties every concept together. Suppose you are building a spam classifier for emails.</p>
@@ -917,7 +913,6 @@ <h2>7. Putting It All Together: The Full Pipeline</h2>
<p>This exact workflow scales to far more complex settings. When researchers train GPT-style models, they follow the same logical steps at a vastly larger scale: represent text as token sequences (features), define cross-entropy loss over next-token prediction, optimize with Adam (a sophisticated variant of SGD that adapts the learning rate per parameter; we will explain Adam in detail in <a href="section-0.3.html" style="color: var(--accent, #0f3460);">Section 0.3</a>), apply dropout and weight decay, and evaluate on held-out benchmarks. In Section 0.3, you will implement this entire workflow in PyTorch, the framework used for most modern LLM research.</p>
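<p>As a rough end-to-end sketch of the spam-classifier pipeline above (illustrative only; it assumes scikit-learn and a tiny made-up corpus rather than the chapter's own code):</p>
<pre><code>from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus; a real spam filter would use thousands of labeled emails
emails = ["win a free prize now", "meeting moved to 3pm",
          "claim your reward today", "lunch tomorrow?",
          "free money click here", "quarterly report attached"]
labels = [1, 0, 1, 0, 1, 0]   # 1 = spam, 0 = not spam

# Hold out part of the data so evaluation measures generalization, not memorization
X_train, X_test, y_train, y_test = train_test_split(
    emails, labels, test_size=0.33, random_state=0, stratify=labels)

# Features (TF-IDF) feeding a regularized linear classifier trained on a convex loss
model = make_pipeline(TfidfVectorizer(), LogisticRegression(C=1.0))
model.fit(X_train, y_train)

print(model.score(X_test, y_test))   # accuracy on unseen emails
</code></pre>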
<!-- Interactive Quiz -->
<div class="quiz">
<h3>✔ Check Your Understanding</h3>
@@ -963,7 +958,6 @@ <h3>✔ Check Your Understanding</h3>
<div class="callout-title">Big Picture: From Basic ML to Neural Networks</div>
<p>In <a href="section-0.1.html" style="color: var(--accent, #0f3460);">Section 0.1</a>, you learned how a model can learn from data using gradient descent and loss functions. Those ideas were powerful, but they were limited to finding simple patterns (linear boundaries, shallow decision trees). Deep learning changed everything by <em>stacking layers of simple functions</em> to learn extraordinarily complex representations. This single idea, composing simple transformations into deep hierarchies, is what lets a neural network translate languages, generate images, and power the conversational AI systems you will build in this book.</p>
</div>
@@ -512,7 +520,7 @@ <h3>1.1 The Perceptron: Your First Artificial Neuron</h3>
-<div class="diagram-caption">Figure 1: Anatomy of a single perceptron (artificial neuron). Each input is multiplied by a weight, the products are summed with a bias, and an activation function produces the output.</div>
+<div class="diagram-caption">Figure 0.2.1: Anatomy of a single perceptron (artificial neuron). Each input is multiplied by a weight, the products are summed with a bias, and an activation function produces the output.</div>
<li><strong>Output layer:</strong> Produces the final prediction (a class probability, a regression value, etc.).</li>
</ul>
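<p>A minimal numerical sketch of the single-neuron computation in Figure 0.2.1 (toy numbers, assuming NumPy; not the book's own code):</p>
<pre><code>import numpy as np

x = np.array([0.5, -1.2, 3.0])    # inputs
w = np.array([0.8, 0.1, -0.4])    # one weight per input
b = 0.2                           # bias

z = np.dot(w, x) + b              # weighted sum of inputs plus bias
output = 1 / (1 + np.exp(-z))     # sigmoid activation squashes z into (0, 1)

print(z, output)
</code></pre>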
-<div class="callout callout-insight">
+<div class="callout key-insight">
<div class="callout-title">Key Insight</div>
<p>The <strong>Universal Approximation Theorem</strong> tells us that an MLP with just one hidden layer and enough neurons can approximate <em>any</em> continuous function to arbitrary accuracy. In practice, though, <em>deeper</em> networks (more layers with fewer neurons each) tend to learn more efficiently than extremely wide, shallow ones. Depth lets the network build hierarchical features: edges compose into textures, textures into parts, parts into objects.</p>
<tr><td><strong>Softmax</strong></td><td>e<sup>z<sub>i</sub></sup> / Σe<sup>z<sub>j</sub></sup></td><td>(0, 1), sums to 1</td><td>Multi-class classification output layer.</td></tr>
</table>
-<div class="callout callout-warning">
+<div class="callout warning">
<div class="callout-title">Warning: The Dying ReLU Problem</div>
<p>If a neuron's weights cause its input to always be negative, ReLU outputs zero for every sample, and the gradient is also zero, so the neuron never updates again. It is "dead." Variants like <strong>Leaky ReLU</strong> (which outputs a small negative slope instead of zero) and <strong>GELU</strong> address this issue.</p>
</div>
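<p>A quick numerical illustration of these activations side by side (toy values, assuming NumPy):</p>
<pre><code>import numpy as np

z = np.array([-2.0, -0.5, 0.0, 1.5])         # pre-activation values

relu = np.maximum(0.0, z)                    # negatives become 0, and so do their gradients
leaky_relu = np.where(z > 0, z, 0.01 * z)    # keeps a small slope, so a "dead" neuron can recover
softmax = np.exp(z) / np.exp(z).sum()        # exponentiate, then normalize so the outputs sum to 1

print(relu)        # [0.  0.  0.  1.5]
print(leaky_relu)  # [-0.02  -0.005  0.  1.5]
print(softmax)     # four probabilities that sum to 1
</code></pre>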
@@ -645,7 +653,7 @@ <h3>2.1 A Concrete Numerical Example</h3>
</marker>
</defs>
</svg>
-<div class="diagram-caption">Figure 2: Backpropagation through a single neuron. The forward pass computes the loss (left to right), then gradients flow backward (right to left) via the chain rule.</div>
+<div class="diagram-caption">Figure 0.2.2: Backpropagation through a single neuron. The forward pass computes the loss (left to right), then gradients flow backward (right to left) via the chain rule.</div>
</div>
<p>Let us trace through this step by step:</p>
@@ -677,7 +685,7 @@ <h4>Debugging a Vanishing Gradient in Production</h4>
<p><strong>Lesson: When training stalls, check your gradient flow before changing your architecture. Activation functions, initialization, and normalization are the first levers to pull.</strong></p>
</div>
-<div class="callout callout-note">
+<div class="callout note">
<div class="callout-title">Note</div>
<p>In a real network with millions of parameters, this same process happens simultaneously for every weight. Modern frameworks like PyTorch compute all these gradients automatically using a technique called <strong>automatic differentiation</strong>, which builds a computational graph during the forward pass and traverses it in reverse during the backward pass.</p>
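<p>A minimal sketch of that idea (assuming PyTorch is installed; toy scalar values):</p>
<pre><code>import torch

w = torch.tensor(2.0, requires_grad=True)
x = torch.tensor(3.0)

loss = (w * x - 1.0) ** 2   # forward pass: each operation is recorded in a computational graph
loss.backward()             # backward pass: the graph is traversed in reverse via the chain rule

print(w.grad)               # dloss/dw = 2 * (w*x - 1) * x = 30.0
</code></pre>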
<tr><td><strong>Kaiming/He</strong></td><td>ReLU and variants</td><td>Scales weights by √(2/n<sub>in</sub>), accounting for ReLU zeroing out half the values.</td></tr>
</table>
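<p>As a one-line illustration of the Kaiming/He scheme from the table (assuming NumPy; the layer sizes are made up):</p>
<pre><code>import numpy as np

n_in, n_out = 512, 256
W = np.random.randn(n_out, n_in) * np.sqrt(2.0 / n_in)   # scale by sqrt(2 / n_in)
b = np.zeros(n_out)                                      # biases typically start at zero

print(W.std())   # roughly sqrt(2 / 512) = 0.0625
</code></pre>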
-<div class="callout callout-insight">
+<div class="callout key-insight">
<div class="callout-title">Key Insight</div>
<p>Batch normalization, dropout, and proper weight initialization are not optional extras. They are <em>essential infrastructure</em> for training deep networks reliably. Skipping any one of them often leads to unstable training, poor generalization, or both. Modern architectures like Transformers replace BatchNorm with LayerNorm, but the principle of normalizing intermediate representations remains universal.</p>
<p>After several convolutional and pooling layers, the output is flattened and passed through fully connected layers for the final prediction.</p>
-<div class="callout callout-note">
+<div class="callout note">
<div class="callout-title">Note</div>
<p>While this book focuses on language models and conversational AI (which primarily use Transformers), understanding CNNs remains valuable. Many multimodal AI systems combine vision encoders (CNNs or Vision Transformers) with language models. You will encounter this pattern when studying vision-language models in later modules.</p>
<div class="callout-title">Key Insight: The Three Safety Nets</div>
<p>Think of these three techniques as complementary safety nets. <strong>Gradient clipping</strong> prevents catastrophic updates on any single step. <strong>Learning rate scheduling</strong> ensures the optimization trajectory stays smooth over the full training run. <strong>Early stopping</strong> catches overfitting at the macro level by watching validation performance. Together, they make deep learning training far more reliable.</p>
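<p>A condensed sketch of how the three nets typically appear together in one training loop (illustrative only; it assumes PyTorch and uses a toy synthetic regression problem):</p>
<pre><code>import torch
from torch import nn

torch.manual_seed(0)
X = torch.randn(256, 10)
y = X @ torch.randn(10, 1) + 0.1 * torch.randn(256, 1)
X_train, y_train, X_val, y_val = X[:200], y[:200], X[200:], y[200:]

model = nn.Linear(10, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

best_val, patience, bad_epochs = float("inf"), 5, 0
for epoch in range(100):
    loss = nn.functional.mse_loss(model(X_train), y_train)
    optimizer.zero_grad()
    loss.backward()
    nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # net 1: clip catastrophic updates
    optimizer.step()
    scheduler.step()                                            # net 2: smooth learning-rate decay

    with torch.no_grad():
        val_loss = nn.functional.mse_loss(model(X_val), y_val).item()
    improved = best_val - val_loss > 1e-4
    if improved:
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs == patience:                              # net 3: stop when validation stalls
            break
</code></pre>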