Split slides 16-17 into 4 slides: scaling wall, continual learning, local revolution, multimodal
- New continual learning slide: static models, catastrophic forgetting (Shi et al., 2025),
Nested Learning paradigm (Behrouz et al., NeurIPS 2025), Hope architecture
- New multimodal slide: Gemini 2.5 Pro (6hr video), Apollo (CVPR 2025),
Tarsier2, open challenges (Zhou et al., CVPR 2025)
- Local revolution slide now standalone with just the edge computing table
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
 <li><strong>Running out of data</strong> — Most high-quality human text has already been used. Training on AI-generated data causes <a href="https://www.nature.com/articles/s41586-024-07566-y"><strong>model collapse</strong></a>: tails of the distribution vanish irreversibly (<a href="https://www.nature.com/articles/s41586-024-07566-y">Shumailov et al., 2024</a>)</li>
-<li><strong>Energy consumption</strong> — Data centers consumed ~415 TWh in 2024 (~1.5% of global electricity); projected to reach <strong>945 TWh by 2030</strong> (~3%) (<a href="https://www.iea.org/reports/energy-and-ai/energy-demand-from-ai">IEA, 2025</a>)</li>
+<li><strong>Energy consumption</strong> — Data centers consumed ~415 TWh in 2024 (~1.5% of global electricity); projected to reach <strong>945 TWh by 2030</strong> (~3%; <a href="https://www.iea.org/reports/energy-and-ai/energy-demand-from-ai">IEA, 2025</a>)</li>
 <li><strong>Diminishing returns</strong> — Each order-of-magnitude increase in compute yields smaller performance gains. The era of "just make it bigger" may be ending</li>
 </ol>
 </div>
 <div class="tip-box" data-title="The response: make models smaller and smarter">
 <ul>
-<li><a href="https://arxiv.org/abs/2106.09685"><strong>LoRA</strong></a> (<a href="https://arxiv.org/abs/2106.09685">Hu et al., 2021</a>): Fine-tune with 10,000× fewer parameters by injecting low-rank adapters</li>
+<li><strong>LoRA</strong> (<a href="https://arxiv.org/abs/2106.09685">Hu et al., 2021</a>): Fine-tune with 10,000× fewer parameters by injecting low-rank adapters</li>
 <li><strong>Mixture of Experts</strong> (MoE): Activate only a fraction of parameters per token (see <a href="../week9/lecture25.html">Lecture 25</a>)</li>
 <li><strong>Distillation</strong>: Train small models to mimic large ones (see <a href="../week7/lecture19.html">Lecture 19</a>)</li>
-<li><a href="https://research.google/blog/introducing-nested-learning-a-new-ml-paradigm-for-continual-learning/"><strong>Nested Learning</strong></a> (<a href="https://arxiv.org/abs/2512.24695">Google, NeurIPS 2025</a>): Models that learn continuously without catastrophic forgetting</li>
+<h1 id="whats-next-for-llms-continual-learning">What's next for LLMs? Continual learning</h1>
+<div class="definition-box" data-title="The problem: static models in a changing world">
+<p>Transformer-based LLMs are <strong>static</strong> after training completes. They cannot learn from new experiences, update their knowledge, or adapt to new domains without retraining — a process that costs millions of dollars and takes weeks or months. When fine-tuned on new data, they suffer <strong>catastrophic forgetting</strong>: previously learned capabilities degrade or vanish entirely (<a href="https://arxiv.org/abs/2404.16789">Shi et al., 2025</a>).</p>
+</div>
+<div class="note-box" data-title="Nested Learning: a new paradigm">
+<p><a href="https://research.google/blog/introducing-nested-learning-a-new-ml-paradigm-for-continual-learning/"><strong>Nested Learning</strong></a> (<a href="https://arxiv.org/abs/2512.24695">Behrouz et al., 2025, <em>NeurIPS</em></a>) reimagines model architecture as a set of <strong>nested optimization problems</strong>, each with its own "context flow." Key insights:</p>
+<ul>
+<li>Gradient-based optimizers (Adam, SGD) are actually <strong>associative memory modules</strong> that compress gradient information</li>
+<li>A <strong>self-modifying learning module</strong> learns its own update algorithm</li>
+<li>A <strong>continuum memory system</strong> generalizes traditional long/short-term memory</li>
+</ul>
+<p>Their proof-of-concept architecture (<strong>Hope</strong>) outperforms standard transformers on language modeling while supporting continual learning without catastrophic forgetting.</p>
+<h1 id="whats-next-for-llms-multimodal-models">What's next for LLMs? Multimodal models</h1>
+<div class="definition-box" data-title="Beyond text: the multimodal frontier">
+<p>Users increasingly expect models to handle <strong>text + images + audio + video</strong> natively. Multimodal is quickly becoming the default, not the exception.</p>
+<td>2M-token context; up to <strong>6 hours</strong> of video; 84.8% on VideoMME (<a href="https://developers.googleblog.com/en/gemini-2-5-video-understanding/">Google, 2025</a>)</td>
+<td>Open family (3B–7B); Apollo-3B outperforms most 7B models on video tasks (<a href="https://arxiv.org/abs/2412.10360">Zohar et al., CVPR 2025</a>)</td>
+<p>Long-form video remains hard: <strong>token redundancy</strong> inflates compute, <strong>context windows</strong> fragment temporal coherence, and <strong>cross-modal reasoning</strong> across hours of content is still unreliable (<a href="https://openaccess.thecvf.com/content/CVPR2025/html/Zhou_MLVU_Benchmarking_Multi-task_Long_Video_Understanding_CVPR_2025_paper.html">Zhou et al., 2025, <em>CVPR</em></a>). Expect rapid progress over the next 1–2 years as architectures mature.</p>
 <h1 id="the-question-that-matters">The question that matters</h1>
 <div class="important-box" data-title="Technology is not neutral">
 <p>Every design decision embeds values. Every system reflects choices. Every deployment affects real people.</p>
@@ -1122,7 +1165,7 @@ <h1 id="the-question-that-matters">The question that matters</h1>
 <p><strong>You</strong> will be the generation that shapes how this technology is used. The technical knowledge you've gained this term gives you the foundation. The ethical questions don't have answer keys.</p>
 <p><a href="https://arxiv.org/abs/2412.14093"><strong>Greenblatt et al. (2024, <em>arXiv</em>)</strong></a> "Alignment Faking in Large Language Models" — Models that strategically pretend to be aligned.</p>
 <p><a href="https://hbr.org/2026/01/companies-are-laying-off-workers-because-of-ais-potential-not-its-performance"><strong>Harvard Business Review (2026)</strong></a> "Companies Are Laying Off Workers Because of AI's Potential, Not Its Performance."</p>
 1. **Running out of data** — Most high-quality human text has already been used. Training on AI-generated data causes [**model collapse**](https://www.nature.com/articles/s41586-024-07566-y): tails of the distribution vanish irreversibly ([Shumailov et al., 2024](https://www.nature.com/articles/s41586-024-07566-y))
-2. **Energy consumption** — Data centers consumed ~415 TWh in 2024 (~1.5% of global electricity); projected to reach **945 TWh by 2030** (~3%) ([IEA, 2025](https://www.iea.org/reports/energy-and-ai/energy-demand-from-ai))
+2. **Energy consumption** — Data centers consumed ~415 TWh in 2024 (~1.5% of global electricity); projected to reach **945 TWh by 2030** (~3%; [IEA, 2025](https://www.iea.org/reports/energy-and-ai/energy-demand-from-ai))
 3. **Diminishing returns** — Each order-of-magnitude increase in compute yields smaller performance gains. The era of "just make it bigger" may be ending
 </div>
 <div class="tip-box" data-title="The response: make models smaller and smarter">
-- [**LoRA**](https://arxiv.org/abs/2106.09685) ([Hu et al., 2021](https://arxiv.org/abs/2106.09685)): Fine-tune with 10,000× fewer parameters by injecting low-rank adapters
+- **LoRA** ([Hu et al., 2021](https://arxiv.org/abs/2106.09685)): Fine-tune with 10,000× fewer parameters by injecting low-rank adapters
 - **Mixture of Experts** (MoE): Activate only a fraction of parameters per token (see [Lecture 25](../week9/lecture25.html))
 - **Distillation**: Train small models to mimic large ones (see [Lecture 19](../week7/lecture19.html))
-- [**Nested Learning**](https://research.google/blog/introducing-nested-learning-a-new-ml-paradigm-for-continual-learning/) ([Google, NeurIPS 2025](https://arxiv.org/abs/2512.24695)): Models that learn continuously without catastrophic forgetting
+</div>
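To make the LoRA bullet in the list above concrete, here is a minimal, hypothetical PyTorch sketch (not part of the commit or the slides): the pretrained weight matrix stays frozen and only two small low-rank matrices are trained, which is where the parameter savings come from. The class name `LoRALinear`, the rank, and the scaling factor are illustrative assumptions, not the paper's reference implementation.

```python
# Minimal LoRA-style adapter sketch (illustrative only; names are hypothetical).
# Idea from Hu et al., 2021: keep the pretrained W frozen, learn a low-rank update B @ A.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():   # freeze the pretrained weights
            p.requires_grad = False
        # Low-rank factors: A is small Gaussian, B starts at zero so the
        # adapter initially leaves the base model's behavior unchanged.
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = base(x) + scale * x A^T B^T ; only A and B receive gradients
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T

# Example: a 4096x4096 projection has ~16.8M frozen weights; a rank-8 adapter
# adds only 2 * 8 * 4096 = 65,536 trainable parameters for this single layer.
layer = LoRALinear(nn.Linear(4096, 4096), rank=8)
out = layer(torch.randn(2, 4096))
```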
+---
+<!-- _class: scale-80 -->
+# What's next for LLMs? Continual learning
+<div class="definition-box" data-title="The problem: static models in a changing world">
+Transformer-based LLMs are **static** after training completes. They cannot learn from new experiences, update their knowledge, or adapt to new domains without retraining — a process that costs millions of dollars and takes weeks or months. When fine-tuned on new data, they suffer **catastrophic forgetting**: previously learned capabilities degrade or vanish entirely ([Shi et al., 2025](https://arxiv.org/abs/2404.16789)).
+</div>
+<div class="note-box" data-title="Nested Learning: a new paradigm">
+[**Nested Learning**](https://research.google/blog/introducing-nested-learning-a-new-ml-paradigm-for-continual-learning/) ([Behrouz et al., 2025, *NeurIPS*](https://arxiv.org/abs/2512.24695)) reimagines model architecture as a set of **nested optimization problems**, each with its own "context flow." Key insights:
+- Gradient-based optimizers (Adam, SGD) are actually **associative memory modules** that compress gradient information
+- A **self-modifying learning module** learns its own update algorithm
+- A **continuum memory system** generalizes traditional long/short-term memory
+Their proof-of-concept architecture (**Hope**) outperforms standard transformers on language modeling while supporting continual learning without catastrophic forgetting.
 </div>
@@ -382,12 +404,30 @@ A survey of AI researchers ([Grace et al., 2025](https://arxiv.org/abs/2502.1487
+| [**Gemini 2.5 Pro**](https://developers.googleblog.com/en/gemini-2-5-video-understanding/) | 2M-token context; up to **6 hours** of video; 84.8% on VideoMME ([Google, 2025](https://developers.googleblog.com/en/gemini-2-5-video-understanding/)) | Closed-source; API-only |
+| [**Apollo**](https://arxiv.org/abs/2412.10360) | Open family (3B–7B); Apollo-3B outperforms most 7B models on video tasks ([Zohar et al., CVPR 2025](https://arxiv.org/abs/2412.10360)) | Limited to shorter clips |
+| [**Tarsier2**](https://arxiv.org/abs/2501.01909) | Outperforms GPT-4o on video description benchmarks | Narrow task focus |
-The frontier is rapidly moving beyond language. Models now process **text + images + audio + video** natively:
-- [Gemini 2.5 Pro](https://blog.google/technology/google-deepmind/gemini-model-thinking-updates-march-2025/): 2M-token context can ingest 2 hours of video
-- Open models like [Tarsier2](https://arxiv.org/abs/2501.01909) outperform GPT-4o on video understanding benchmarks
-- Real-time multimodal interaction (voice, vision, screen sharing) is becoming standard
+Long-form video remains hard: **token redundancy** inflates compute, **context windows** fragment temporal coherence, and **cross-modal reasoning** across hours of content is still unreliable ([Zhou et al., 2025, *CVPR*](https://openaccess.thecvf.com/content/CVPR2025/html/Zhou_MLVU_Benchmarking_Multi-task_Long_Video_Understanding_CVPR_2025_paper.html)). Expect rapid progress over the next 1–2 years as architectures mature.