<p><a href="https://www.anthropic.com/research/reward-tampering">Anthropic's reward tampering research</a> (<a href="https://www.anthropic.com/research/reward-tampering">Denison et al., 2024</a>) revealed that training models to be sycophantic (agreeable) produced unexpected <strong>emergent dangerous behaviors</strong>:</p>
</div>
<div class="warning-box" data-title="What emerged without explicit training">
<p>Models trained to agree with users spontaneously learned to:</p>
<h1 id="the-question-that-matters">The question that matters</h1>
<p><a href="https://arxiv.org/abs/2412.14093"><strong>Greenblatt et al. (2024, <em>arXiv</em>)</strong></a> "Alignment Faking in Large Language Models" — Models that strategically pretend to be aligned.</p>
<p><a href="https://transformer-circuits.pub/2025/attribution-graphs/methods.html"><strong>Anthropic (2025)</strong></a> "Circuit Tracing" — Seeing inside the black box with attribution graphs.</p>
<p><a href="https://internationalaisafetyreport.org/publication/international-ai-safety-report-2026"><strong>International AI Safety Report (2026)</strong></a> — Global scientific consensus on AI safety.</p>
<p><a href="https://dl.acm.org/doi/10.1145/3442188.3445922"><strong>Bender et al. (2021, <em>FAccT</em>)</strong></a> "On the Dangers of Stochastic Parrots" — Environmental and social costs of large language models.</p>
<p><a href="https://hbr.org/2026/01/companies-are-laying-off-workers-because-of-ais-potential-not-its-performance"><strong>Harvard Business Review (2026)</strong></a> "Companies Are Laying Off Workers Because of AI's Potential, Not Its Performance."</p>
[Anthropic's reward tampering research](https://www.anthropic.com/research/reward-tampering) ([Denison et al., 2024](https://www.anthropic.com/research/reward-tampering)) revealed that training models to be sycophantic (agreeable) produced unexpected **emergent dangerous behaviors**:
</div>
[**Greenblatt et al. (2024, *arXiv*)**](https://arxiv.org/abs/2412.14093) "Alignment Faking in Large Language Models" — Models that strategically pretend to be aligned.
[**Anthropic (2025)**](https://transformer-circuits.pub/2025/attribution-graphs/methods.html) "Circuit Tracing" — Seeing inside the black box with attribution graphs.
[**International AI Safety Report (2026)**](https://internationalaisafetyreport.org/publication/international-ai-safety-report-2026) — Global scientific consensus on AI safety.
[**Bender et al. (2021, *FAccT*)**](https://dl.acm.org/doi/10.1145/3442188.3445922) "On the Dangers of Stochastic Parrots" — Environmental and social costs of large language models.
[**Harvard Business Review (2026)**](https://hbr.org/2026/01/companies-are-laying-off-workers-because-of-ais-potential-not-its-performance) "Companies Are Laying Off Workers Because of AI's Potential, Not Its Performance."