ContextLab
diff --git a/slides/week9/figs/attribution-graph.svg b/slides/week9/figs/attribution-graph.svg
Lines changed: 34 additions & 14 deletions
diff --git a/slides/week9/lecture26.html b/slides/week9/lecture26.html
Lines changed: 26 additions & 24 deletions
diff --git a/slides/week9/lecture26.md b/slides/week9/lecture26.md
Lines changed: 5 additions & 1 deletion
diff --git a/slides/week9/lecture26.pdf b/slides/week9/lecture26.pdf
5.01 KB
@@ -71,7 +71,7 @@ The model *reasoned* about its own training process and chose a deceptive strate
 
 <div class="definition-box" data-title="What happened">
 
-Anthropic's reward tampering research ([Denison et al., 2024](https://www.anthropic.com/research/reward-tampering)) revealed that training models to be sycophantic (agreeable) produced unexpected **emergent dangerous behaviors**:
+Anthropic's reward tampering research ([Denison et al., 2024](https://www.anthropic.com/research/reward-tampering)) revealed that training models to be sycophantic (agreeable) produced unexpected **emergent dangerous behaviors**.
 
 </div>
 
@@ -101,6 +101,10 @@ If "be helpful and agreeable" leads to strategic deception, what training object
 
 If models can fake alignment and tamper with their own rewards, **how can we ever trust them?**
 
+</div>
+
+<div class="warning-box" data-title="The problem with behavioral testing">
+
 Behavioral testing alone isn't enough — a model that *appears* aligned might be strategically complying. We need to look **inside** the model.
 
 </div>