Commit 114eda2

jeremymanning and claude committed
Fix reward tampering date (2025→2024) and update Stochastic Parrots link to canonical ACM URL
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 5ecb20f commit 114eda2

3 files changed: 6 additions & 6 deletions

slides/week9/lecture26.html

Lines changed: 3 additions & 3 deletions
@@ -741,7 +741,7 @@ <h1 id="alignment-faking">Alignment faking</h1>
 </foreignObject></svg><svg data-marpit-svg="" viewBox="0 0 1280 720"><foreignObject width="1280" height="720"><section id="5" data-class="scale-80" data-theme="cdl-theme" lang="C" class="scale-80" style="--class:scale-80;--theme:cdl-theme;" data-transition-back="{&quot;name&quot;:&quot;fade&quot;,&quot;duration&quot;:&quot;0.25s&quot;,&quot;builtinFallback&quot;:true}" data-transition="{&quot;name&quot;:&quot;fade&quot;,&quot;duration&quot;:&quot;0.25s&quot;,&quot;builtinFallback&quot;:true}">
 <h1 id="reward-tampering">Reward tampering</h1>
 <div class="definition-box" data-title="What happened">
-<p><a href="https://www.anthropic.com/research/reward-tampering">Anthropic's reward tampering research</a> (<a href="https://www.anthropic.com/research/reward-tampering">Denison et al., 2025</a>) revealed that training models to be sycophantic (agreeable) produced unexpected <strong>emergent dangerous behaviors</strong>:</p>
+<p><a href="https://www.anthropic.com/research/reward-tampering">Anthropic's reward tampering research</a> (<a href="https://www.anthropic.com/research/reward-tampering">Denison et al., 2024</a>) revealed that training models to be sycophantic (agreeable) produced unexpected <strong>emergent dangerous behaviors</strong>:</p>
 </div>
 <div class="warning-box" data-title="What emerged without explicit training">
 <p>Models trained to agree with users spontaneously learned to:</p>
@@ -1110,10 +1110,10 @@ <h1 id="the-question-that-matters">The question that matters</h1>
 <h1 id="further-reading">Further reading</h1>
 <div class="note-box" data-title="Further reading">
 <p><a href="https://arxiv.org/abs/2412.14093"><strong>Greenblatt et al. (2024, <em>arXiv</em>)</strong></a> &quot;Alignment Faking in Large Language Models&quot; — Models that strategically pretend to be aligned.</p>
-<p><a href="https://www.anthropic.com/research/reward-tampering"><strong>Anthropic (2025)</strong></a> &quot;Reward Tampering&quot; — Emergent deceptive behaviors from sycophancy training.</p>
+<p><a href="https://www.anthropic.com/research/reward-tampering"><strong>Anthropic (2024)</strong></a> &quot;Reward Tampering&quot; — Emergent deceptive behaviors from sycophancy training.</p>
 <p><a href="https://transformer-circuits.pub/2025/attribution-graphs/methods.html"><strong>Anthropic (2025)</strong></a> &quot;Circuit Tracing&quot; — Seeing inside the black box with attribution graphs.</p>
 <p><a href="https://internationalaisafetyreport.org/publication/international-ai-safety-report-2026"><strong>International AI Safety Report (2026)</strong></a> — Global scientific consensus on AI safety.</p>
-<p><a href="https://faculty.washington.edu/ebender/papers/Stochastic_Parrots.pdf"><strong>Bender et al. (2021, <em>FAccT</em>)</strong></a> &quot;On the Dangers of Stochastic Parrots&quot; — Environmental and social costs of large language models.</p>
+<p><a href="https://dl.acm.org/doi/10.1145/3442188.3445922"><strong>Bender et al. (2021, <em>FAccT</em>)</strong></a> &quot;On the Dangers of Stochastic Parrots&quot; — Environmental and social costs of large language models.</p>
 <p><a href="https://hbr.org/2026/01/companies-are-laying-off-workers-because-of-ais-potential-not-its-performance"><strong>Harvard Business Review (2026)</strong></a> &quot;Companies Are Laying Off Workers Because of AI's Potential, Not Its Performance.&quot;</p>
 </div>
 </section>

slides/week9/lecture26.md

Lines changed: 3 additions & 3 deletions
@@ -70,7 +70,7 @@ The model *reasoned* about its own training process and chose a deceptive strate
 
 <div class="definition-box" data-title="What happened">
 
-[Anthropic's reward tampering research](https://www.anthropic.com/research/reward-tampering) ([Denison et al., 2025](https://www.anthropic.com/research/reward-tampering)) revealed that training models to be sycophantic (agreeable) produced unexpected **emergent dangerous behaviors**:
+[Anthropic's reward tampering research](https://www.anthropic.com/research/reward-tampering) ([Denison et al., 2024](https://www.anthropic.com/research/reward-tampering)) revealed that training models to be sycophantic (agreeable) produced unexpected **emergent dangerous behaviors**:
 
 </div>
 
@@ -415,13 +415,13 @@ You are graduating into a world where:
 
 [**Greenblatt et al. (2024, *arXiv*)**](https://arxiv.org/abs/2412.14093) "Alignment Faking in Large Language Models" — Models that strategically pretend to be aligned.
 
-[**Anthropic (2025)**](https://www.anthropic.com/research/reward-tampering) "Reward Tampering" — Emergent deceptive behaviors from sycophancy training.
+[**Anthropic (2024)**](https://www.anthropic.com/research/reward-tampering) "Reward Tampering" — Emergent deceptive behaviors from sycophancy training.
 
 [**Anthropic (2025)**](https://transformer-circuits.pub/2025/attribution-graphs/methods.html) "Circuit Tracing" — Seeing inside the black box with attribution graphs.
 
 [**International AI Safety Report (2026)**](https://internationalaisafetyreport.org/publication/international-ai-safety-report-2026) — Global scientific consensus on AI safety.
 
-[**Bender et al. (2021, *FAccT*)**](https://faculty.washington.edu/ebender/papers/Stochastic_Parrots.pdf) "On the Dangers of Stochastic Parrots" — Environmental and social costs of large language models.
+[**Bender et al. (2021, *FAccT*)**](https://dl.acm.org/doi/10.1145/3442188.3445922) "On the Dangers of Stochastic Parrots" — Environmental and social costs of large language models.
 
 [**Harvard Business Review (2026)**](https://hbr.org/2026/01/companies-are-laying-off-workers-because-of-ais-potential-not-its-performance) "Companies Are Laying Off Workers Because of AI's Potential, Not Its Performance."
 

slides/week9/lecture26.pdf

-10 Bytes (binary file not shown)
