|
342 | 342 | font-weight: 600; |
343 | 343 | } |
344 | 344 | .epigraph cite::before { content: "\2014\00a0"; } |
| 345 | + |
| 346 | + /* Practical Example boxes */ |
| 347 | + .callout.practical-example { |
| 348 | + background: linear-gradient(135deg, #fff8e1, #fff3e0); |
| 349 | + border: 1px solid #ffe0b2; |
| 350 | + border-left: 5px solid #ff9800; |
| 351 | + padding: 1.2rem 1.5rem; |
| 352 | + margin: 1.5rem 0; |
| 353 | + border-radius: 0 8px 8px 0; |
| 354 | + } |
| 355 | + .callout.practical-example h4 { color: #e65100; margin-top: 0; } |
| 356 | + .callout.practical-example p { margin-bottom: 0.4rem; font-size: 0.95rem; } |
| 357 | + |
| 358 | + /* Bibliography */ |
| 359 | + .bibliography { margin-top: 3rem; padding-top: 2rem; border-top: 2px solid #e0e0e0; } |
| 360 | + .bibliography h2 { font-size: 1.8rem; margin-bottom: 1.5rem; color: #1a1a2e; } |
| 361 | + .bibliography h3 { font-size: 1.2rem; margin: 1.5rem 0 0.8rem; color: #2c3e50; font-weight: 600; } |
| 362 | + .bib-list { padding-left: 1.5rem; margin: 0; } |
| 363 | + .bib-list li { margin-bottom: 1rem; line-height: 1.5; } |
| 364 | + .bib-entry { margin: 0; font-size: 0.95rem; } |
| 365 | + .bib-entry a { color: #2980b9; text-decoration: none; border-bottom: 1px dotted #2980b9; } |
| 366 | + .bib-entry a:hover { color: #e94560; border-bottom-color: #e94560; } |
| 367 | + .bib-annotation { margin: 0.2rem 0 0 0; font-size: 0.88rem; color: #666; font-style: italic; } |
345 | 368 | </style> |
346 | 369 | </head> |
347 | 370 | <body> |
@@ -496,6 +519,11 @@ <h3>Why Gradient Descent Works</h3> |
496 | 519 |
497 | 520 | <p>We could try guessing parameter values at random, but the space of possibilities is impossibly large. Instead, we use a beautiful insight from calculus: <strong>the gradient tells us which direction is uphill</strong>. If we walk in the opposite direction, we go downhill, reducing the loss.</p>
498 | 521 |
| 522 | + <figure class="illustration"> |
| 523 | + <img src="images/gradient-descent-landscape.png" alt="A hilly landscape illustrating gradient descent, showing a path from a high point down to the lowest valley" style="max-width: 100%; border-radius: 8px;"> |
| 524 | + <figcaption>Gradient descent navigates a loss landscape by following the steepest downhill direction at each step, seeking the lowest valley (minimum loss).</figcaption> |
| 525 | + </figure> |
| 526 | + |
499 | 527 | <p>Imagine you are blindfolded on a hilly landscape, and your goal is to find the lowest valley. You cannot see, but you can feel the slope under your feet. At each step, you feel which direction slopes downward most steeply and take a step that way. This is gradient descent.</p> |
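<p>To make the loop concrete, here is a minimal sketch in plain Python. The one-dimensional loss, starting point, and learning rate are toy choices for illustration, not a real model:</p>

<pre><code># Toy loss: L(w) = (w - 3)^2, minimized at w = 3.
def loss(w):
    return (w - 3.0) ** 2

def gradient(w):
    # dL/dw = 2 * (w - 3)
    return 2.0 * (w - 3.0)

w = -4.0                # arbitrary starting point
learning_rate = 0.1     # step size

for step in range(50):
    w = w - learning_rate * gradient(w)   # step in the downhill direction

print(w, loss(w))       # w is now close to 3 and the loss close to 0
</code></pre>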
500 | 528 |
501 | 529 | <div class="math-block"> |
@@ -597,6 +625,11 @@ <h3>Variants of Gradient Descent</h3> |
597 | 625 |
598 | 626 | <h2>4. Overfitting, Underfitting, and Regularization</h2> |
599 | 627 |
| 628 | + <figure class="illustration"> |
| 629 | + <img src="images/overfitting-vs-generalization.png" alt="Side-by-side comparison of an overfitting model that memorizes noise versus a well-generalized model that captures the true pattern" style="max-width: 100%; border-radius: 8px;"> |
| 630 | + <figcaption>Overfitting versus generalization: the model on the left has memorized every training point (including noise), while the model on the right captures the underlying pattern.</figcaption> |
| 631 | + </figure> |
| 632 | + |
600 | 633 | <p>Here is a scenario every ML practitioner encounters: your model achieves 99% accuracy on the training data, then you test it on new data and it drops to 60%. What happened? The model did not learn the underlying pattern; it <strong>memorized</strong> the training examples. This is called <strong>overfitting</strong>.</p> |
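<p>The gap only becomes visible if you hold out data the model never sees during training and compare the two scores. The sketch below is illustrative, assuming scikit-learn and a synthetic noisy dataset; an unconstrained decision tree will happily memorize its training set and show exactly this kind of gap:</p>

<pre><code>from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data with 20% label noise
X, y = make_classification(n_samples=2000, n_features=20, flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# No depth limit: the tree can grow until it fits every training point
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

print("train accuracy:", model.score(X_train, y_train))   # near 1.0
print("test accuracy: ", model.score(X_test, y_test))     # substantially lower
</code></pre>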
601 | 634 |
602 | 635 | <h3>The Two Failure Modes</h3> |
@@ -671,6 +704,17 @@ <h4>Dropout</h4> |
671 | 704 |
672 | 705 | <p>The degree-2 polynomial generalizes well because its complexity matches the true underlying pattern. The degree-9 polynomial memorized the training data (including its noise) and produces an absurd prediction for an unseen input. This is overfitting in its purest form.</p> |
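<p>A compact NumPy sketch reproduces this kind of comparison; the quadratic ground truth, noise level, and sample size here are assumptions chosen for illustration rather than the exact setup above:</p>

<pre><code>import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 10)
y = 2.0 * x**2 + 0.5 * x + rng.normal(scale=0.1, size=x.size)   # quadratic pattern plus noise

fit2 = np.polynomial.Polynomial.fit(x, y, deg=2)   # complexity matches the true pattern
fit9 = np.polynomial.Polynomial.fit(x, y, deg=9)   # enough freedom to memorize the noise

x_new = 1.5   # an unseen input outside the training range; true value is 5.25
print("degree 2 prediction:", fit2(x_new))   # close to the true value
print("degree 9 prediction:", fit9(x_new))   # typically far off
</code></pre>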
673 | 706 |
| 707 | +<div class="callout practical-example"> |
| 708 | + <h4>Practical Example: Regularization Saves a Recommendation Engine</h4> |
| 709 | + <p><strong>Who:</strong> ML engineer at a mid-size e-commerce company (12M monthly active users)</p> |
| 710 | + <p><strong>Situation:</strong> Building a product recommendation model to predict click-through rates from user browsing history features.</p> |
| 711 | + <p><strong>Problem:</strong> The model achieved 0.92 AUC on training data but only 0.61 AUC on the production holdout set, a textbook overfitting gap of 0.31.</p> |
| 712 | + <p><strong>Dilemma:</strong> The team debated whether to collect more data (expensive, 3 months of logging), simplify the model (risk losing signal), or add regularization (quick but might hurt training performance).</p> |
| 713 | + <p><strong>Decision:</strong> They applied L2 regularization (weight decay of 0.01) and dropout (rate 0.3) to their neural network, keeping the same architecture.</p> |
| 714 | + <p><strong>How:</strong> Added <code>weight_decay=0.01</code> to the Adam optimizer and inserted <code>nn.Dropout(0.3)</code> after each hidden layer. Training took the same wall-clock time.</p> |
| 715 | + <p><strong>Result:</strong> Training AUC dropped to 0.84, but holdout AUC jumped to 0.79. The overfitting gap shrank from 0.31 to 0.05, and click-through rate in production A/B testing increased by 14%.</p> |
| 716 | + <p><strong>Lesson:</strong> <strong>Regularization is often the fastest and cheapest way to close a train/test performance gap. Try it before collecting more data or redesigning the architecture.</strong></p> |
| 717 | +</div> |
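<p>In code, the two changes from the example amount to something like the sketch below; the layer sizes, depth, and learning rate are placeholders rather than the team's actual architecture:</p>

<pre><code>import torch.nn as nn
import torch.optim as optim

model = nn.Sequential(
    nn.Linear(128, 256), nn.ReLU(), nn.Dropout(0.3),   # dropout after each hidden layer
    nn.Linear(256, 64),  nn.ReLU(), nn.Dropout(0.3),
    nn.Linear(64, 1),    nn.Sigmoid(),                  # predicted click-through probability
)

# L2 regularization applied through the optimizer's weight_decay argument
optimizer = optim.Adam(model.parameters(), lr=1e-3, weight_decay=0.01)
</code></pre>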
674 | 718 |
675 | 719 | <h2>5. Bias-Variance Tradeoff and Generalization Theory</h2> |
676 | 720 |
@@ -905,6 +949,28 @@ <h2>Key Takeaways</h2> |
905 | 949 | </ol> |
906 | 950 | </div> |
907 | 951 |
| 952 | +<section class="bibliography"> |
| 953 | + <h2>Bibliography</h2> |
| 954 | + <h3>Foundational Textbooks</h3> |
| 955 | + <ol class="bib-list"> |
| 956 | + <li><p class="bib-entry">Bishop, C. M. (2006). <em>Pattern Recognition and Machine Learning</em>. Springer. <a href="https://www.microsoft.com/en-us/research/publication/pattern-recognition-machine-learning/">Microsoft Research</a></p> |
| 957 | + <p class="bib-annotation">The classic ML textbook covering supervised learning, generalization, and regularization with mathematical rigor.</p></li> |
| 958 | + <li><p class="bib-entry">Hastie, T., Tibshirani, R., & Friedman, J. (2009). <em>The Elements of Statistical Learning</em> (2nd ed.). Springer. <a href="https://hastie.su.domains/ElemStatLearn/">Free PDF</a></p> |
| 959 | + <p class="bib-annotation">Comprehensive treatment of the bias-variance tradeoff, cross-validation, and regularization methods.</p></li> |
| 960 | + </ol> |
| 961 | + <h3>Key Papers and Resources</h3> |
| 962 | + <ol class="bib-list"> |
| 963 | + <li><p class="bib-entry">Robbins, H. & Monro, S. (1951). "A Stochastic Approximation Method." <em>Annals of Mathematical Statistics</em>, 22(3), 400-407. <a href="https://doi.org/10.1214/aoms/1177729586">doi:10.1214/aoms</a></p> |
| 964 | + <p class="bib-annotation">The foundational paper on stochastic gradient descent, still the basis of all modern neural network training.</p></li> |
| 965 | + <li><p class="bib-entry">Kingma, D. P. & Ba, J. (2015). "Adam: A Method for Stochastic Optimization." <a href="https://arxiv.org/abs/1412.6980">arXiv:1412.6980</a></p> |
| 966 | + <p class="bib-annotation">Introduced the Adam optimizer, the default choice for most deep learning applications.</p></li> |
| 967 | +    <li><p class="bib-entry">Tibshirani, R. (1996). "Regression Shrinkage and Selection via the Lasso." <em>Journal of the Royal Statistical Society, Series B</em>, 58(1), 267-288. <a href="https://doi.org/10.1111/j.2517-6161.1996.tb02080.x">doi:10.1111/j.2517-6161</a></p>
| 968 | + <p class="bib-annotation">The original L1 regularization (Lasso) paper, foundational for understanding sparse feature selection.</p></li> |
| 969 | +    <li><p class="bib-entry">Ng, A. (2018). <em>Machine Learning Yearning</em>. <a href="https://github.com/ajaymache/machine-learning-yearning">GitHub</a></p>
| 970 | + <p class="bib-annotation">Practical advice on structuring ML projects, debugging models, and understanding train/test splits.</p></li> |
| 971 | + </ol> |
| 972 | +</section> |
| 973 | + |
908 | 974 | </main> |
909 | 975 |
910 | 976 | <nav class="chapter-nav chapter-nav-bottom"> |