
Commit 6f08f8b

apartsin and claude committed
Cross-reference hyperlinks and Big Picture backward links across all chapters
- Add 195 cross-reference hyperlinks across 80 files (Module N, Section N.N, Chapter N)
- Add backward-facing links to 95 Big Picture boxes across 103 files
- All links use relative paths with consistent styling
- Idempotent scripts ensure no double-linking

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 1d050b4 commit 6f08f8b

131 files changed

Lines changed: 1604 additions & 1648 deletions


part-1-foundations/module-00-ml-pytorch-foundations/section-0.1.html

Lines changed: 12 additions & 18 deletions
@@ -102,8 +102,6 @@
 
 /* Navigation */
 
-
-
 /* Code blocks */
 pre {
 background: var(--code-bg);
@@ -164,30 +162,32 @@
 margin-bottom: 0.5rem;
 }
 .callout p:last-child { margin-bottom: 0; }
+.callout.big-picture { background: linear-gradient(135deg, #f3e5f5, #ede7f6); border: 1px solid #ce93d8; border-left: 5px solid #8e24aa; }
+.callout.big-picture .callout-title { color: #6a1b9a; }
+.callout.key-insight { background: linear-gradient(135deg, #e8f5e9, #f1f8e9); border: 1px solid #a5d6a7; border-left: 5px solid #43a047; }
+.callout.key-insight .callout-title { color: #2e7d32; }
+.callout.warning { background: linear-gradient(135deg, #fff8e1, #fff3e0); border: 1px solid #ffcc80; border-left: 5px solid #f57c00; }
+.callout.warning .callout-title { color: #e65100; }
+.callout.note { background: linear-gradient(135deg, #e3f2fd, #e8eaf6); border: 1px solid #90caf9; border-left: 5px solid #1976d2; }
+.callout.note .callout-title { color: #1565c0; }
+.callout.fun-note { background: linear-gradient(135deg, #fce4ec, #f3e5f5); border: 1px solid #f48fb1; border-left: 5px solid #e91e63; }
+.callout.fun-note .callout-title { color: #c2185b; }
 
-.callout.big-picture {
 background: var(--insight-bg);
 border-color: var(--insight-border);
 }
-.callout.big-picture .callout-title { color: var(--insight-border); }
 
-.callout.key-insight {
 background: var(--key-bg);
 border-color: var(--key-border);
 }
-.callout.key-insight .callout-title { color: var(--key-border); }
 
-.callout.note {
 background: var(--note-bg);
 border-color: var(--note-border);
 }
-.callout.note .callout-title { color: var(--note-border); }
 
-.callout.warning {
 background: var(--warn-bg);
 border-color: var(--warn-border);
 }
-.callout.warning .callout-title { color: var(--warn-border); }
 
 /* SVG diagrams */
 .diagram-container {
@@ -471,7 +471,6 @@ <h3>Common Feature Engineering Techniques</h3>
 
 <p>Notice how standardization brings all features to a comparable scale. Without this step, the "square footage" feature (values in the thousands) would dominate "bedrooms" (values from 1 to 5) during optimization, not because it is more important, but simply because it is numerically larger.</p>
 
-
 <h2>2. Supervised Learning: Classification and Regression</h2>
 
 <p>Supervised learning is the backbone of modern ML. The idea is straightforward: you give the model examples of inputs paired with correct outputs, and it learns the mapping between them. This is analogous to learning from a textbook that has an answer key. You study the problems, check your answers, and gradually improve.</p>
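
A minimal sketch of the standardization step described in that paragraph, assuming NumPy and illustrative housing values (not the chapter's dataset):

    import numpy as np

    # Hypothetical features: [square footage, bedrooms]
    X = np.array([[2100.0, 3.0], [1600.0, 2.0], [3200.0, 4.0], [900.0, 1.0]])

    # Standardize each column to zero mean and unit variance so that no
    # feature dominates the optimization purely by numeric scale.
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    print(X_std.mean(axis=0))  # ~0 per column
    print(X_std.std(axis=0))   # ~1 per column
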
@@ -600,7 +599,7 @@ <h3>Why Gradient Descent Works</h3>
 <text x="470" y="63" text-anchor="middle" font-family="Segoe UI, sans-serif" font-size="11" fill="#555">Step size = learning rate &times; gradient</text>
 <text x="470" y="80" text-anchor="middle" font-family="Segoe UI, sans-serif" font-size="10" fill="#888">Too big: overshoot. Too small: slow.</text>
 </svg>
-<div class="diagram-caption">Figure 1: Gradient descent follows the slope downhill, step by step. The learning rate controls step size.</div>
+<div class="diagram-caption">Figure 0.1.1: Gradient descent follows the slope downhill, step by step. The learning rate controls step size.</div>
 </div>
 
 <h3>Variants of Gradient Descent</h3>
@@ -650,7 +649,6 @@ <h3>Variants of Gradient Descent</h3>
 <p>The <strong>learning rate</strong> is the single most important hyperparameter in optimization. Too large, and the steps overshoot the minimum, causing the loss to diverge. Too small, and training takes forever (or gets stuck). Modern practice uses learning rate schedulers that start with a larger rate and decay it over time, combining fast early progress with fine-grained later convergence.</p>
 </div>
 
-
 <h2>4. Overfitting, Underfitting, and Regularization</h2>
 
 <figure class="illustration">
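
A toy illustration of the learning-rate tradeoff from that paragraph, on a one-dimensional loss (illustrative function and rate, not the chapter's code):

    # Gradient descent on f(w) = (w - 3)^2, whose minimum is at w = 3.
    def grad(w):
        return 2.0 * (w - 3.0)  # derivative of (w - 3)^2

    w, lr = 0.0, 0.1            # update rule: w <- w - lr * gradient
    for _ in range(50):
        w -= lr * grad(w)
    print(w)  # converges near 3.0; with lr = 1.1 the iterates diverge
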
@@ -796,7 +794,7 @@ <h3>Decomposing Prediction Error</h3>
 <rect x="410" y="55" width="100" height="24" rx="4" fill="#fef9e7" stroke="#e94560" stroke-width="1"/>
 <text x="460" y="71" text-anchor="middle" font-family="Segoe UI, sans-serif" font-size="11" fill="#e94560">Overfitting</text>
 </svg>
-<div class="diagram-caption">Figure 2: As model complexity increases, bias decreases but variance increases. The optimal model minimizes total error.</div>
+<div class="diagram-caption">Figure 0.1.2: As model complexity increases, bias decreases but variance increases. The optimal model minimizes total error.</div>
 </div>
 
 <p>Consider this analogy. Imagine asking multiple artists to draw a portrait from memory after briefly seeing a photograph. A stick-figure artist (high bias) will always produce an oversimplified drawing, regardless of the photo. A hyper-detailed artist (high variance) might capture every pore in one sitting but produce a wildly different portrait each time because they are also capturing fleeting shadows and reflections. The best artist has enough skill to capture the essential likeness (low bias) while remaining consistent across attempts (low variance).</p>
@@ -806,7 +804,6 @@ <h3>Decomposing Prediction Error</h3>
 <p>Modern deep learning complicates the classical bias-variance tradeoff. Very large neural networks (including LLMs) are so overparameterized that they can memorize the training set perfectly, yet they still generalize well. This phenomenon, sometimes called "benign overfitting" or the "double descent" curve, is an active area of research. The classical framework remains a valuable mental model, but reality is richer than the simple U-shaped curve suggests.</p>
 </div>
 
-
 <h2>6. Cross-Validation and Model Selection</h2>
 
 <p>You have several candidate models, each with different hyperparameters (learning rate, regularization strength, model complexity). How do you choose the best one? You cannot use training performance because that rewards overfitting. You need a reliable estimate of generalization performance.</p>
@@ -900,7 +897,6 @@ <h3>Model Selection Strategy</h3>
 <p>In the LLM world, cross-validation is less common because datasets are enormous and models are expensive to train. Instead, practitioners rely on large held-out evaluation sets, benchmark suites (like MMLU or HumanEval), and qualitative evaluation. But the principle is the same: always evaluate on data the model did not train on.</p>
 </div>
 
-
 <h2>7. Putting It All Together: The Full Pipeline</h2>
 
 <p>Let us trace a complete example that ties every concept together. Suppose you are building a spam classifier for emails.</p>
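
A minimal k-fold cross-validation sketch for the model-selection discussion above, assuming scikit-learn is available and using synthetic stand-in data:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    # Synthetic classification data; any (X, y) pair would work here.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)

    # 5-fold CV: train on four folds, validate on the held-out fold, rotate.
    scores = cross_val_score(LogisticRegression(), X, y, cv=5)
    print(scores.mean(), scores.std())
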
@@ -917,7 +913,6 @@ <h2>7. Putting It All Together: The Full Pipeline</h2>
 
 <p>This exact workflow scales to far more complex settings. When researchers train GPT-style models, they follow the same logical steps at a vastly larger scale: represent text as token sequences (features), define cross-entropy loss over next-token prediction, optimize with Adam (a sophisticated variant of SGD that adapts the learning rate per parameter; we will explain Adam in detail in <a href="section-0.3.html" style="color: var(--accent, #0f3460);">Section 0.3</a>), apply dropout and weight decay, and evaluate on held-out benchmarks. In Section 0.3, you will implement this entire workflow in PyTorch, the framework used for most modern LLM research.</p>
 
-
 <!-- Interactive Quiz -->
 <div class="quiz">
 <h3>&#10004; Check Your Understanding</h3>
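
A miniature version of that workflow in PyTorch (synthetic data and hypothetical layer sizes, purely to show the cross-entropy-plus-Adam loop, not the book's exact code):

    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    X = torch.randn(256, 20)
    y = (X[:, 0] > 0).long()          # labels derived from the first feature

    for epoch in range(20):
        opt.zero_grad()
        loss = loss_fn(model(X), y)   # cross-entropy over class logits
        loss.backward()
        opt.step()
    print(loss.item())
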
@@ -963,7 +958,6 @@ <h3>&#10004; Check Your Understanding</h3>
 </details>
 </div>
 
-
 <!-- Key Takeaways -->
 <div class="takeaways">
 <h2>Key Takeaways</h2>

part-1-foundations/module-00-ml-pytorch-foundations/section-0.2.html

Lines changed: 19 additions & 11 deletions
@@ -35,8 +35,6 @@
 
 /* Navigation */
 
-
-
 .nav-prev::before { content: ""; }
 .nav-next::after { content: ""; }
 
@@ -165,6 +163,16 @@
 }
 .callout p { font-size: 0.95rem; margin-bottom: 0.5rem; }
 .callout p:last-child { margin-bottom: 0; }
+.callout.big-picture { background: linear-gradient(135deg, #f3e5f5, #ede7f6); border: 1px solid #ce93d8; border-left: 5px solid #8e24aa; }
+.callout.big-picture .callout-title { color: #6a1b9a; }
+.callout.key-insight { background: linear-gradient(135deg, #e8f5e9, #f1f8e9); border: 1px solid #a5d6a7; border-left: 5px solid #43a047; }
+.callout.key-insight .callout-title { color: #2e7d32; }
+.callout.warning { background: linear-gradient(135deg, #fff8e1, #fff3e0); border: 1px solid #ffcc80; border-left: 5px solid #f57c00; }
+.callout.warning .callout-title { color: #e65100; }
+.callout.note { background: linear-gradient(135deg, #e3f2fd, #e8eaf6); border: 1px solid #90caf9; border-left: 5px solid #1976d2; }
+.callout.note .callout-title { color: #1565c0; }
+.callout.fun-note { background: linear-gradient(135deg, #fce4ec, #f3e5f5); border: 1px solid #f48fb1; border-left: 5px solid #e91e63; }
+.callout.fun-note .callout-title { color: #c2185b; }
 
 .callout-insight {
 background: #f0fdf4;
@@ -440,7 +448,7 @@ <h1>Deep Learning Essentials</h1>
 
 <main class="content">
 
-<div class="callout callout-bigpicture">
+<div class="callout big-picture">
 <div class="callout-title">Big Picture: From Basic ML to Neural Networks</div>
 <p>In <a href="section-0.1.html" style="color: var(--accent, #0f3460);">Section 0.1</a>, you learned how a model can learn from data using gradient descent and loss functions. Those ideas were powerful, but they were limited to finding simple patterns (linear boundaries, shallow decision trees). Deep learning changed everything by <em>stacking layers of simple functions</em> to learn extraordinarily complex representations. This single idea, composing simple transformations into deep hierarchies, is what lets a neural network translate languages, generate images, and power the conversational AI systems you will build in this book.</p>
 </div>
@@ -512,7 +520,7 @@ <h3>1.1 The Perceptron: Your First Artificial Neuron</h3>
 <text x="290" y="192" text-anchor="middle" font-family="sans-serif" font-size="11" fill="#888">Weighted Sum</text>
 <text x="556" y="180" text-anchor="middle" font-family="sans-serif" font-size="11" fill="#888">Output</text>
 </svg>
-<div class="diagram-caption">Figure 1: Anatomy of a single perceptron (artificial neuron). Each input is multiplied by a weight, the products are summed with a bias, and an activation function produces the output.</div>
+<div class="diagram-caption">Figure 0.2.1: Anatomy of a single perceptron (artificial neuron). Each input is multiplied by a weight, the products are summed with a bias, and an activation function produces the output.</div>
 </div>
 
 <h3>1.2 Multi-Layer Perceptrons: Stacking LEGO Bricks</h3>
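
The figure's weighted-sum-plus-activation pipeline in a few lines (illustrative weights and inputs):

    import numpy as np

    x = np.array([0.5, -1.0, 2.0])   # inputs
    w = np.array([0.4, 0.3, -0.2])   # one weight per input
    b = 0.1                          # bias

    z = w @ x + b                    # weighted sum
    output = max(0.0, z)             # ReLU activation
    print(z, output)
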
@@ -526,7 +534,7 @@ <h3>1.2 Multi-Layer Perceptrons: Stacking LEGO Bricks</h3>
 <li><strong>Output layer:</strong> Produces the final prediction (a class probability, a regression value, etc.).</li>
 </ul>
 
-<div class="callout callout-insight">
+<div class="callout key-insight">
 <div class="callout-title">Key Insight</div>
 <p>The <strong>Universal Approximation Theorem</strong> tells us that an MLP with just one hidden layer and enough neurons can approximate <em>any</em> continuous function to arbitrary accuracy. In practice, though, <em>deeper</em> networks (more layers with fewer neurons each) tend to learn more efficiently than extremely wide, shallow ones. Depth lets the network build hierarchical features: edges compose into textures, textures into parts, parts into objects.</p>
 </div>
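
A sketch of the deep-over-wide idea from that Key Insight, using PyTorch modules (layer sizes are illustrative):

    import torch.nn as nn

    # Two modest hidden layers rather than one enormous one; each layer
    # composes the features produced by the layer before it.
    mlp = nn.Sequential(
        nn.Linear(32, 64), nn.ReLU(),
        nn.Linear(64, 64), nn.ReLU(),
        nn.Linear(64, 10),           # e.g., logits for 10 classes
    )
    print(mlp)
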
@@ -546,7 +554,7 @@ <h3>1.3 Activation Functions</h3>
 <tr><td><strong>Softmax</strong></td><td>e<sup>z<sub>i</sub></sup> / &Sigma;e<sup>z<sub>j</sub></sup></td><td>(0, 1), sums to 1</td><td>Multi-class classification output layer.</td></tr>
 </table>
 
-<div class="callout callout-warning">
+<div class="callout warning">
 <div class="callout-title">Warning: The Dying ReLU Problem</div>
 <p>If a neuron's weights cause its input to always be negative, ReLU outputs zero for every sample, and the gradient is also zero, so the neuron never updates again. It is "dead." Variants like <strong>Leaky ReLU</strong> (which outputs a small negative slope instead of zero) and <strong>GELU</strong> address this issue.</p>
 </div>
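
The dying-ReLU effect in that warning is easy to observe directly; a minimal PyTorch check (values chosen so every pre-activation is negative):

    import torch
    import torch.nn.functional as F

    z = torch.tensor([-3.0, -1.5, -0.2], requires_grad=True)

    F.relu(z).sum().backward()
    print(z.grad)      # tensor([0., 0., 0.]): no gradient, no learning

    z.grad = None
    F.leaky_relu(z, negative_slope=0.01).sum().backward()
    print(z.grad)      # small nonzero slope keeps the neuron trainable
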
@@ -645,7 +653,7 @@ <h3>2.1 A Concrete Numerical Example</h3>
 </marker>
 </defs>
 </svg>
-<div class="diagram-caption">Figure 2: Backpropagation through a single neuron. The forward pass computes the loss (left to right), then gradients flow backward (right to left) via the chain rule.</div>
+<div class="diagram-caption">Figure 0.2.2: Backpropagation through a single neuron. The forward pass computes the loss (left to right), then gradients flow backward (right to left) via the chain rule.</div>
 </div>
 
 <p>Let us trace through this step by step:</p>
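
As a self-contained companion to that trace, here is a chain-rule pass through one ReLU neuron with a squared-error loss (our own illustrative values, not the figure's):

    # Forward: z = w*x + b, a = relu(z), loss = (a - target)^2
    x, w, b, target = 2.0, 0.5, 0.1, 2.0

    z = w * x + b                    # 1.1
    a = max(0.0, z)                  # 1.1 (ReLU is active)
    loss = (a - target) ** 2         # 0.81

    # Backward: multiply local derivatives along the path to w.
    dloss_da = 2 * (a - target)      # -1.8
    da_dz = 1.0 if z > 0 else 0.0    # 1.0
    dz_dw = x                        # 2.0
    print(loss, dloss_da * da_dz * dz_dw)   # 0.81 -3.6
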
@@ -677,7 +685,7 @@ <h4>Debugging a Vanishing Gradient in Production</h4>
 <p><strong>Lesson:</strong> <strong>When training stalls, check your gradient flow before changing your architecture. Activation functions, initialization, and normalization are the first levers to pull.</strong></p>
 </div>
 
-<div class="callout callout-note">
+<div class="callout note">
 <div class="callout-title">Note</div>
 <p>In a real network with millions of parameters, this same process happens simultaneously for every weight. Modern frameworks like PyTorch compute all these gradients automatically using a technique called <strong>automatic differentiation</strong>, which builds a computational graph during the forward pass and traverses it in reverse during the backward pass.</p>
 </div>
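
The same gradient, recovered automatically with PyTorch's autograd (continuing the illustrative numbers from the manual sketch above):

    import torch

    w = torch.tensor(0.5, requires_grad=True)
    b = torch.tensor(0.1, requires_grad=True)
    x, target = torch.tensor(2.0), torch.tensor(2.0)

    # The forward pass records a computational graph; backward() then
    # applies the chain rule through it in reverse.
    loss = (torch.relu(w * x + b) - target) ** 2
    loss.backward()
    print(w.grad, b.grad)   # -3.6 and -1.8, matching the manual trace
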
@@ -729,7 +737,7 @@ <h3>3.3 Weight Initialization</h3>
 <tr><td><strong>Kaiming/He</strong></td><td>ReLU and variants</td><td>Scales weights by &radic;(2/n<sub>in</sub>), accounting for ReLU zeroing out half the values.</td></tr>
 </table>
 
-<div class="callout callout-insight">
+<div class="callout key-insight">
 <div class="callout-title">Key Insight</div>
 <p>Batch normalization, dropout, and proper weight initialization are not optional extras. They are <em>essential infrastructure</em> for training deep networks reliably. Skipping any one of them often leads to unstable training, poor generalization, or both. Modern architectures like Transformers replace BatchNorm with LayerNorm, but the principle of normalizing intermediate representations remains universal.</p>
 </div>
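
The Kaiming/He row of that table, expressed with PyTorch's built-in initializers (layer size is illustrative):

    import torch.nn as nn

    layer = nn.Linear(256, 256)
    # Sample weights with std sqrt(2 / fan_in), compensating for ReLU
    # zeroing out roughly half of the activations.
    nn.init.kaiming_normal_(layer.weight, nonlinearity='relu')
    nn.init.zeros_(layer.bias)
    print(layer.weight.std())   # close to sqrt(2/256), about 0.088
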
@@ -794,7 +802,7 @@ <h2>4. Convolutional Neural Networks (CNNs) Overview</h2>
 
 <p>After several convolutional and pooling layers, the output is flattened and passed through fully connected layers for the final prediction.</p>
 
-<div class="callout callout-note">
+<div class="callout note">
 <div class="callout-title">Note</div>
 <p>While this book focuses on language models and conversational AI (which primarily use Transformers), understanding CNNs remains valuable. Many multimodal AI systems combine vision encoders (CNNs or Vision Transformers) with language models. You will encounter this pattern when studying vision-language models in later modules.</p>
 </div>
@@ -897,7 +905,7 @@ <h3>5.3 Gradient Clipping</h3>
 Early stopping at epoch 18
 Best validation loss: 1.0492</div>
 
-<div class="callout callout-insight">
+<div class="callout key-insight">
 <div class="callout-title">Key Insight: The Three Safety Nets</div>
 <p>Think of these three techniques as complementary safety nets. <strong>Gradient clipping</strong> prevents catastrophic updates on any single step. <strong>Learning rate scheduling</strong> ensures the optimization trajectory stays smooth over the full training run. <strong>Early stopping</strong> catches overfitting at the macro level by watching validation performance. Together, they make deep learning training far more reliable.</p>
 </div>
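
A sketch of all three safety nets wired into one PyTorch loop (synthetic data and illustrative hyperparameters, not the chapter's training script):

    import torch
    import torch.nn as nn

    model = nn.Linear(10, 1)
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    sched = torch.optim.lr_scheduler.StepLR(opt, step_size=10, gamma=0.5)
    loss_fn = nn.MSELoss()

    X, y = torch.randn(128, 10), torch.randn(128, 1)
    X_val, y_val = torch.randn(32, 10), torch.randn(32, 1)
    best, patience = float('inf'), 0

    for epoch in range(100):
        opt.zero_grad()
        loss_fn(model(X), y).backward()
        nn.utils.clip_grad_norm_(model.parameters(), 1.0)   # safety net 1
        opt.step()
        sched.step()                                        # safety net 2
        with torch.no_grad():
            val = loss_fn(model(X_val), y_val).item()
        if val < best - 1e-4:
            best, patience = val, 0
        else:
            patience += 1
            if patience >= 5:                               # safety net 3
                print(f"early stopping at epoch {epoch}")
                break
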
