
Commit 8fc7bde

jeremymanning and claude committed
Split slides 16-17 into 4 slides: scaling wall, continual learning, local revolution, multimodal
- New continual learning slide: static models, catastrophic forgetting (Shi et al., 2025), Nested Learning paradigm (Behrouz et al., NeurIPS 2025), Hope architecture
- New multimodal slide: Gemini 2.5 Pro (6hr video), Apollo (CVPR 2025), Tarsier2, open challenges (Zhou et al., CVPR 2025)
- Local revolution slide now standalone with just the edge computing table

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 071da4c commit 8fc7bde

3 files changed

Lines changed: 105 additions & 22 deletions


slides/week9/lecture26.html

Lines changed: 57 additions & 14 deletions
@@ -1055,20 +1055,34 @@ <h1 id="whats-next-for-llms-the-scaling-wall">What's next for LLMs? The scaling
 <div class="warning-box" data-title="Three converging limits">
 <ol>
 <li><strong>Running out of data</strong> — Most high-quality human text has already been used. Training on AI-generated data causes <a href="https://www.nature.com/articles/s41586-024-07566-y"><strong>model collapse</strong></a>: tails of the distribution vanish irreversibly (<a href="https://www.nature.com/articles/s41586-024-07566-y">Shumailov et al., 2024</a>)</li>
-<li><strong>Energy consumption</strong> — Data centers consumed ~415 TWh in 2024 (~1.5% of global electricity); projected to reach <strong>945 TWh by 2030</strong> (~3%) (<a href="https://www.iea.org/reports/energy-and-ai/energy-demand-from-ai">IEA, 2025</a>)</li>
+<li><strong>Energy consumption</strong> — Data centers consumed ~415 TWh in 2024 (~1.5% of global electricity); projected to reach <strong>945 TWh by 2030</strong> (~3%; <a href="https://www.iea.org/reports/energy-and-ai/energy-demand-from-ai">IEA, 2025</a>)</li>
 <li><strong>Diminishing returns</strong> — Each order-of-magnitude increase in compute yields smaller performance gains. The era of &quot;just make it bigger&quot; may be ending</li>
 </ol>
 </div>
 <div class="tip-box" data-title="The response: make models smaller and smarter">
 <ul>
-<li><a href="https://arxiv.org/abs/2106.09685"><strong>LoRA</strong></a> (<a href="https://arxiv.org/abs/2106.09685">Hu et al., 2021</a>): Fine-tune with 10,000× fewer parameters by injecting low-rank adapters</li>
+<li><strong>LoRA</strong> (<a href="https://arxiv.org/abs/2106.09685">Hu et al., 2021</a>): Fine-tune with 10,000× fewer parameters by injecting low-rank adapters</li>
 <li><strong>Mixture of Experts</strong> (MoE): Activate only a fraction of parameters per token (see <a href="../week9/lecture25.html">Lecture 25</a>)</li>
 <li><strong>Distillation</strong>: Train small models to mimic large ones (see <a href="../week7/lecture19.html">Lecture 19</a>)</li>
-<li><a href="https://research.google/blog/introducing-nested-learning-a-new-ml-paradigm-for-continual-learning/"><strong>Nested Learning</strong></a> (<a href="https://arxiv.org/abs/2512.24695">Google, NeurIPS 2025</a>): Models that learn continuously without catastrophic forgetting</li>
 </ul>
 </div>
 </section>
-</foreignObject></svg><svg data-marpit-svg="" viewBox="0 0 1280 720"><foreignObject width="1280" height="720"><section id="17" data-class="scale-75" data-theme="cdl-theme" lang="en-US" class="scale-75" style="--class:scale-75;--theme:cdl-theme;" data-transition-back="{&quot;name&quot;:&quot;fade&quot;,&quot;duration&quot;:&quot;0.25s&quot;,&quot;builtinFallback&quot;:true}" data-transition="{&quot;name&quot;:&quot;fade&quot;,&quot;duration&quot;:&quot;0.25s&quot;,&quot;builtinFallback&quot;:true}">
+</foreignObject></svg><svg data-marpit-svg="" viewBox="0 0 1280 720"><foreignObject width="1280" height="720"><section id="17" data-class="scale-80" data-theme="cdl-theme" lang="en-US" class="scale-80" style="--class:scale-80;--theme:cdl-theme;" data-transition-back="{&quot;name&quot;:&quot;fade&quot;,&quot;duration&quot;:&quot;0.25s&quot;,&quot;builtinFallback&quot;:true}" data-transition="{&quot;name&quot;:&quot;fade&quot;,&quot;duration&quot;:&quot;0.25s&quot;,&quot;builtinFallback&quot;:true}">
+<h1 id="whats-next-for-llms-continual-learning">What's next for LLMs? Continual learning</h1>
+<div class="definition-box" data-title="The problem: static models in a changing world">
+<p>Transformer-based LLMs are <strong>static</strong> after training completes. They cannot learn from new experiences, update their knowledge, or adapt to new domains without retraining — a process that costs millions of dollars and takes weeks or months. When fine-tuned on new data, they suffer <strong>catastrophic forgetting</strong>: previously learned capabilities degrade or vanish entirely (<a href="https://arxiv.org/abs/2404.16789">Shi et al., 2025</a>).</p>
+</div>
+<div class="note-box" data-title="Nested Learning: a new paradigm">
+<p><a href="https://research.google/blog/introducing-nested-learning-a-new-ml-paradigm-for-continual-learning/"><strong>Nested Learning</strong></a> (<a href="https://arxiv.org/abs/2512.24695">Behrouz et al., 2025, <em>NeurIPS</em></a>) reimagines model architecture as a set of <strong>nested optimization problems</strong>, each with its own &quot;context flow.&quot; Key insights:</p>
+<ul>
+<li>Gradient-based optimizers (Adam, SGD) are actually <strong>associative memory modules</strong> that compress gradient information</li>
+<li>A <strong>self-modifying learning module</strong> learns its own update algorithm</li>
+<li>A <strong>continuum memory system</strong> generalizes traditional long/short-term memory</li>
+</ul>
+<p>Their proof-of-concept architecture (<strong>Hope</strong>) outperforms standard transformers on language modeling while supporting continual learning without catastrophic forgetting.</p>
+</div>
+</section>
+</foreignObject></svg><svg data-marpit-svg="" viewBox="0 0 1280 720"><foreignObject width="1280" height="720"><section id="18" data-class="scale-75" data-theme="cdl-theme" lang="en-US" class="scale-75" style="--class:scale-75;--theme:cdl-theme;" data-transition-back="{&quot;name&quot;:&quot;fade&quot;,&quot;duration&quot;:&quot;0.25s&quot;,&quot;builtinFallback&quot;:true}" data-transition="{&quot;name&quot;:&quot;fade&quot;,&quot;duration&quot;:&quot;0.25s&quot;,&quot;builtinFallback&quot;:true}">
 <h1 id="whats-next-for-llms-the-local-revolution">What's next for LLMs? The local revolution</h1>
 <div class="note-box" data-title="The shift from cloud to edge">
 <table>
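
An aside on the LoRA bullet in the tip-box above: the low-rank adapter idea from Hu et al. (2021) is compact enough to sketch in a few lines. This is a minimal illustration, not the API of any adapter library; the class name, rank, and scaling defaults are our own choices.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Freeze a pretrained linear layer; learn only a rank-r update.

    Sketch of Hu et al. (2021): the effective weight is
    W + (alpha / r) * B @ A, with W frozen and A, B tiny.
    """

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():   # freeze the pretrained weights
            p.requires_grad = False
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

base = nn.Linear(4096, 4096)           # ~16.8M frozen parameters
layer = LoRALinear(base, r=8)          # adds only 2 * 8 * 4096 = 65,536 trainable ones
out = layer(torch.randn(2, 4096))      # drop-in replacement in the forward pass
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"trainable params: {trainable:,}")  # 65,536, roughly 256x fewer for this layer
```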
@@ -1098,16 +1112,45 @@ <h1 id="whats-next-for-llms-the-local-revolution">What's next for LLMs? The loca
 </tbody>
 </table>
 </div>
-<div class="important-box" data-title="Multimodal models: beyond text">
-<p>The frontier is rapidly moving beyond language. Models now process <strong>text + images + audio + video</strong> natively:</p>
-<ul>
-<li><a href="https://blog.google/technology/google-deepmind/gemini-model-thinking-updates-march-2025/">Gemini 2.5 Pro</a>: 2M-token context can ingest 2 hours of video</li>
-<li>Open models like <a href="https://arxiv.org/abs/2501.01909">Tarsier2</a> outperform GPT-4o on video understanding benchmarks</li>
-<li>Real-time multimodal interaction (voice, vision, screen sharing) is becoming standard</li>
-</ul>
+</section>
+</foreignObject></svg><svg data-marpit-svg="" viewBox="0 0 1280 720"><foreignObject width="1280" height="720"><section id="19" data-class="scale-70" data-theme="cdl-theme" lang="en-US" class="scale-70" style="--class:scale-70;--theme:cdl-theme;" data-transition-back="{&quot;name&quot;:&quot;fade&quot;,&quot;duration&quot;:&quot;0.25s&quot;,&quot;builtinFallback&quot;:true}" data-transition="{&quot;name&quot;:&quot;fade&quot;,&quot;duration&quot;:&quot;0.25s&quot;,&quot;builtinFallback&quot;:true}">
+<h1 id="whats-next-for-llms-multimodal-models">What's next for LLMs? Multimodal models</h1>
+<div class="definition-box" data-title="Beyond text: the multimodal frontier">
+<p>Users increasingly expect models to handle <strong>text + images + audio + video</strong> natively. Multimodal is quickly becoming the default, not the exception.</p>
+</div>
+<div class="note-box" data-title="Current capabilities">
+<table>
+<thead>
+<tr>
+<th>Model</th>
+<th>Capability</th>
+<th>Limitation</th>
+</tr>
+</thead>
+<tbody>
+<tr>
+<td><a href="https://developers.googleblog.com/en/gemini-2-5-video-understanding/"><strong>Gemini 2.5 Pro</strong></a></td>
+<td>2M-token context; up to <strong>6 hours</strong> of video; 84.8% on VideoMME (<a href="https://developers.googleblog.com/en/gemini-2-5-video-understanding/">Google, 2025</a>)</td>
+<td>Closed-source; API-only</td>
+</tr>
+<tr>
+<td><a href="https://arxiv.org/abs/2412.10360"><strong>Apollo</strong></a></td>
+<td>Open family (3B–7B); Apollo-3B outperforms most 7B models on video tasks (<a href="https://arxiv.org/abs/2412.10360">Zohar et al., CVPR 2025</a>)</td>
+<td>Limited to shorter clips</td>
+</tr>
+<tr>
+<td><a href="https://arxiv.org/abs/2501.01909"><strong>Tarsier2</strong></a></td>
+<td>Outperforms GPT-4o on video description benchmarks</td>
+<td>Narrow task focus</td>
+</tr>
+</tbody>
+</table>
+</div>
+<div class="important-box" data-title="Open challenges">
+<p>Long-form video remains hard: <strong>token redundancy</strong> inflates compute, <strong>context windows</strong> fragment temporal coherence, and <strong>cross-modal reasoning</strong> across hours of content is still unreliable (<a href="https://openaccess.thecvf.com/content/CVPR2025/html/Zhou_MLVU_Benchmarking_Multi-task_Long_Video_Understanding_CVPR_2025_paper.html">Zhou et al., 2025, <em>CVPR</em></a>). Expect rapid progress over the next 1–2 years as architectures mature.</p>
 </div>
 </section>
-</foreignObject></svg><svg data-marpit-svg="" viewBox="0 0 1280 720"><foreignObject width="1280" height="720"><section id="18" data-theme="cdl-theme" lang="en-US" style="--theme:cdl-theme;" data-transition-back="{&quot;name&quot;:&quot;fade&quot;,&quot;duration&quot;:&quot;0.25s&quot;,&quot;builtinFallback&quot;:true}" data-transition="{&quot;name&quot;:&quot;fade&quot;,&quot;duration&quot;:&quot;0.25s&quot;,&quot;builtinFallback&quot;:true}">
+</foreignObject></svg><svg data-marpit-svg="" viewBox="0 0 1280 720"><foreignObject width="1280" height="720"><section id="20" data-theme="cdl-theme" lang="en-US" style="--theme:cdl-theme;" data-transition-back="{&quot;name&quot;:&quot;fade&quot;,&quot;duration&quot;:&quot;0.25s&quot;,&quot;builtinFallback&quot;:true}" data-transition="{&quot;name&quot;:&quot;fade&quot;,&quot;duration&quot;:&quot;0.25s&quot;,&quot;builtinFallback&quot;:true}">
 <h1 id="the-question-that-matters">The question that matters</h1>
 <div class="important-box" data-title="Technology is not neutral">
 <p>Every design decision embeds values. Every system reflects choices. Every deployment affects real people.</p>
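
A back-of-envelope check on the video numbers in the new multimodal table above. The per-frame token cost is an assumption on our part (figures around 258 tokens per sampled frame at 1 fps appear in Google's Gemini documentation), so treat this as illustrative rather than a spec for any model in the table:

```python
# Rough arithmetic on long-video token budgets. TOKENS_PER_FRAME is an
# assumed figure of the order reported in Gemini's docs, not a spec.
TOKENS_PER_FRAME = 258
SAMPLED_FPS = 1          # frames per second actually fed to the model

def video_tokens(hours: float) -> int:
    return int(hours * 3600 * SAMPLED_FPS * TOKENS_PER_FRAME)

for hours in (0.5, 2, 6):
    print(f"{hours:>4} h of video = {video_tokens(hours):>11,} tokens")
# 0.5 h = 464,400; 2 h = 1,857,600; 6 h = 5,572,800. A 2M-token window
# covers about 2 hours at this rate, so reaching 6 hours requires lower
# sampling rates or token merging, which is exactly the redundancy
# problem named in the "Open challenges" box.
```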
@@ -1122,7 +1165,7 @@ <h1 id="the-question-that-matters">The question that matters</h1>
 <p><strong>You</strong> will be the generation that shapes how this technology is used. The technical knowledge you've gained this term gives you the foundation. The ethical questions don't have answer keys.</p>
 </div>
 </section>
-</foreignObject></svg><svg data-marpit-svg="" viewBox="0 0 1280 720"><foreignObject width="1280" height="720"><section id="19" data-class="scale-85" data-theme="cdl-theme" lang="en-US" class="scale-85" style="--class:scale-85;--theme:cdl-theme;" data-transition-back="{&quot;name&quot;:&quot;fade&quot;,&quot;duration&quot;:&quot;0.25s&quot;,&quot;builtinFallback&quot;:true}" data-transition="{&quot;name&quot;:&quot;fade&quot;,&quot;duration&quot;:&quot;0.25s&quot;,&quot;builtinFallback&quot;:true}">
+</foreignObject></svg><svg data-marpit-svg="" viewBox="0 0 1280 720"><foreignObject width="1280" height="720"><section id="21" data-class="scale-85" data-theme="cdl-theme" lang="en-US" class="scale-85" style="--class:scale-85;--theme:cdl-theme;" data-transition-back="{&quot;name&quot;:&quot;fade&quot;,&quot;duration&quot;:&quot;0.25s&quot;,&quot;builtinFallback&quot;:true}" data-transition="{&quot;name&quot;:&quot;fade&quot;,&quot;duration&quot;:&quot;0.25s&quot;,&quot;builtinFallback&quot;:true}">
 <h1 id="further-reading">Further reading</h1>
 <div class="note-box" data-title="Further reading">
 <p><a href="https://arxiv.org/abs/2412.14093"><strong>Greenblatt et al. (2024, <em>arXiv</em>)</strong></a> &quot;Alignment Faking in Large Language Models&quot; — Models that strategically pretend to be aligned.</p>
@@ -1133,7 +1176,7 @@ <h1 id="further-reading">Further reading</h1>
 <p><a href="https://hbr.org/2026/01/companies-are-laying-off-workers-because-of-ais-potential-not-its-performance"><strong>Harvard Business Review (2026)</strong></a> &quot;Companies Are Laying Off Workers Because of AI's Potential, Not Its Performance.&quot;</p>
 </div>
 </section>
-</foreignObject></svg><svg data-marpit-svg="" viewBox="0 0 1280 720"><foreignObject width="1280" height="720"><section id="20" data-theme="cdl-theme" lang="en-US" style="--theme:cdl-theme;" data-transition-back="{&quot;name&quot;:&quot;fade&quot;,&quot;duration&quot;:&quot;0.25s&quot;,&quot;builtinFallback&quot;:true}" data-transition="{&quot;name&quot;:&quot;fade&quot;,&quot;duration&quot;:&quot;0.25s&quot;,&quot;builtinFallback&quot;:true}">
+</foreignObject></svg><svg data-marpit-svg="" viewBox="0 0 1280 720"><foreignObject width="1280" height="720"><section id="22" data-theme="cdl-theme" lang="en-US" style="--theme:cdl-theme;" data-transition-back="{&quot;name&quot;:&quot;fade&quot;,&quot;duration&quot;:&quot;0.25s&quot;,&quot;builtinFallback&quot;:true}" data-transition="{&quot;name&quot;:&quot;fade&quot;,&quot;duration&quot;:&quot;0.25s&quot;,&quot;builtinFallback&quot;:true}">
 <h1 id="questions">Questions?</h1>
 <div class="emoji-figure">
 <div class="emoji-col">

slides/week9/lecture26.md

Lines changed: 48 additions & 8 deletions
@@ -352,17 +352,39 @@ A survey of AI researchers ([Grace et al., 2025](https://arxiv.org/abs/2502.1487
 <div class="warning-box" data-title="Three converging limits">
 
 1. **Running out of data** — Most high-quality human text has already been used. Training on AI-generated data causes [**model collapse**](https://www.nature.com/articles/s41586-024-07566-y): tails of the distribution vanish irreversibly ([Shumailov et al., 2024](https://www.nature.com/articles/s41586-024-07566-y))
-2. **Energy consumption** — Data centers consumed ~415 TWh in 2024 (~1.5% of global electricity); projected to reach **945 TWh by 2030** (~3%) ([IEA, 2025](https://www.iea.org/reports/energy-and-ai/energy-demand-from-ai))
+2. **Energy consumption** — Data centers consumed ~415 TWh in 2024 (~1.5% of global electricity); projected to reach **945 TWh by 2030** (~3%; [IEA, 2025](https://www.iea.org/reports/energy-and-ai/energy-demand-from-ai))
 3. **Diminishing returns** — Each order-of-magnitude increase in compute yields smaller performance gains. The era of "just make it bigger" may be ending
 
 </div>
 
 <div class="tip-box" data-title="The response: make models smaller and smarter">
 
-- [**LoRA**](https://arxiv.org/abs/2106.09685) ([Hu et al., 2021](https://arxiv.org/abs/2106.09685)): Fine-tune with 10,000× fewer parameters by injecting low-rank adapters
+- **LoRA** ([Hu et al., 2021](https://arxiv.org/abs/2106.09685)): Fine-tune with 10,000× fewer parameters by injecting low-rank adapters
 - **Mixture of Experts** (MoE): Activate only a fraction of parameters per token (see [Lecture 25](../week9/lecture25.html))
 - **Distillation**: Train small models to mimic large ones (see [Lecture 19](../week7/lecture19.html))
-- [**Nested Learning**](https://research.google/blog/introducing-nested-learning-a-new-ml-paradigm-for-continual-learning/) ([Google, NeurIPS 2025](https://arxiv.org/abs/2512.24695)): Models that learn continuously without catastrophic forgetting
+
+</div>
+
+---
+<!-- _class: scale-80 -->
+
+# What's next for LLMs? Continual learning
+
+<div class="definition-box" data-title="The problem: static models in a changing world">
+
+Transformer-based LLMs are **static** after training completes. They cannot learn from new experiences, update their knowledge, or adapt to new domains without retraining — a process that costs millions of dollars and takes weeks or months. When fine-tuned on new data, they suffer **catastrophic forgetting**: previously learned capabilities degrade or vanish entirely ([Shi et al., 2025](https://arxiv.org/abs/2404.16789)).
+
+</div>
+
+<div class="note-box" data-title="Nested Learning: a new paradigm">
+
+[**Nested Learning**](https://research.google/blog/introducing-nested-learning-a-new-ml-paradigm-for-continual-learning/) ([Behrouz et al., 2025, *NeurIPS*](https://arxiv.org/abs/2512.24695)) reimagines model architecture as a set of **nested optimization problems**, each with its own "context flow." Key insights:
+
+- Gradient-based optimizers (Adam, SGD) are actually **associative memory modules** that compress gradient information
+- A **self-modifying learning module** learns its own update algorithm
+- A **continuum memory system** generalizes traditional long/short-term memory
+
+Their proof-of-concept architecture (**Hope**) outperforms standard transformers on language modeling while supporting continual learning without catastrophic forgetting.
 
 </div>
 
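
The catastrophic forgetting claim in the new continual-learning slide is easy to reproduce at toy scale. Here is a hedged sketch (our own construction, not a method from Shi et al. or Behrouz et al.): fit a small network on task A, fine-tune it on a conflicting task B with no replay of A, and accuracy on A falls to roughly chance.

```python
import torch
import torch.nn as nn

# Minimal catastrophic-forgetting demo: sequential training, no replay.
torch.manual_seed(0)

def make_task(rule, n=2000):
    X = torch.randn(n, 2)
    return X, rule(X).long()

task_a = make_task(lambda X: X[:, 0] > 0)   # label by sign of first feature
task_b = make_task(lambda X: X[:, 1] > 0)   # conflicting rule: sign of second

model = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

def train(X, y, steps=300):
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(model(X), y).backward()
        opt.step()

def accuracy(X, y):
    return (model(X).argmax(dim=1) == y).float().mean().item()

train(*task_a)
print(f"task A after training on A: {accuracy(*task_a):.2f}")  # ~1.00
train(*task_b)   # fine-tune on B only; nothing preserves A
print(f"task A after training on B: {accuracy(*task_a):.2f}")  # typically near 0.5
print(f"task B after training on B: {accuracy(*task_b):.2f}")  # ~1.00
```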
@@ -382,12 +404,30 @@ A survey of AI researchers ([Grace et al., 2025](https://arxiv.org/abs/2502.1487
 
 </div>
 
-<div class="important-box" data-title="Multimodal models: beyond text">
+---
+<!-- _class: scale-70 -->
+
+# What's next for LLMs? Multimodal models
+
+<div class="definition-box" data-title="Beyond text: the multimodal frontier">
+
+Users increasingly expect models to handle **text + images + audio + video** natively. Multimodal is quickly becoming the default, not the exception.
+
+</div>
+
+<div class="note-box" data-title="Current capabilities">
+
+| Model | Capability | Limitation |
+|-|-|-|
+| [**Gemini 2.5 Pro**](https://developers.googleblog.com/en/gemini-2-5-video-understanding/) | 2M-token context; up to **6 hours** of video; 84.8% on VideoMME ([Google, 2025](https://developers.googleblog.com/en/gemini-2-5-video-understanding/)) | Closed-source; API-only |
+| [**Apollo**](https://arxiv.org/abs/2412.10360) | Open family (3B–7B); Apollo-3B outperforms most 7B models on video tasks ([Zohar et al., CVPR 2025](https://arxiv.org/abs/2412.10360)) | Limited to shorter clips |
+| [**Tarsier2**](https://arxiv.org/abs/2501.01909) | Outperforms GPT-4o on video description benchmarks | Narrow task focus |
+
+</div>
+
+<div class="important-box" data-title="Open challenges">
 
-The frontier is rapidly moving beyond language. Models now process **text + images + audio + video** natively:
-- [Gemini 2.5 Pro](https://blog.google/technology/google-deepmind/gemini-model-thinking-updates-march-2025/): 2M-token context can ingest 2 hours of video
-- Open models like [Tarsier2](https://arxiv.org/abs/2501.01909) outperform GPT-4o on video understanding benchmarks
-- Real-time multimodal interaction (voice, vision, screen sharing) is becoming standard
+Long-form video remains hard: **token redundancy** inflates compute, **context windows** fragment temporal coherence, and **cross-modal reasoning** across hours of content is still unreliable ([Zhou et al., 2025, *CVPR*](https://openaccess.thecvf.com/content/CVPR2025/html/Zhou_MLVU_Benchmarking_Multi-task_Long_Video_Understanding_CVPR_2025_paper.html)). Expect rapid progress over the next 1–2 years as architectures mature.
 
 </div>
 
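
One common workaround for the "context windows fragment temporal coherence" problem named in the open-challenges box is chunked inference with a carried summary. A generic sketch, where `describe_clip` is a hypothetical stand-in for whatever video-capable model is available (this is not an API of Gemini, Apollo, or Tarsier2):

```python
from typing import Callable, List

def summarize_video(
    clip_paths: List[str],                     # pre-split, overlapping clips
    describe_clip: Callable[[str, str], str],  # (clip, prior_context) -> text
) -> str:
    """Process a long video chunk by chunk, carrying a running summary.

    Conditioning each clip on the summary so far means events that span
    chunk boundaries are not described in isolation. Sketch only.
    """
    summary = ""
    for path in clip_paths:
        summary = describe_clip(path, summary)
    return summary

# Toy usage with a stub in place of a real model:
clips = ["clip_000.mp4", "clip_001.mp4", "clip_002.mp4"]
stub = lambda clip, ctx: (ctx + f" [{clip}: ...]").strip()
print(summarize_video(clips, stub))
```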

slides/week9/lecture26.pdf

86.2 KB
Binary file not shown.
