Commit e8a94bc

jeremymanning and claude committed
Fix attribution graph SVG: intentional causal arrows colored by source node
Replace abstract floating arrows with specific causal connections between layers. Arrow colors consistently match source nodes: dark gray (input), orange (features), blue (circuits). Separate marker defs per color.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 8fc7bde commit e8a94bc

4 files changed

Lines changed: 65 additions & 39 deletions


Lines changed: 34 additions & 14 deletions

slides/week9/lecture26.html

Lines changed: 26 additions & 24 deletions
Large diffs are not rendered by default.

slides/week9/lecture26.md

Lines changed: 5 additions & 1 deletion
@@ -71,7 +71,7 @@ The model *reasoned* about its own training process and chose a deceptive strate

 <div class="definition-box" data-title="What happened">

-Anthropic's reward tampering research ([Denison et al., 2024](https://www.anthropic.com/research/reward-tampering)) revealed that training models to be sycophantic (agreeable) produced unexpected **emergent dangerous behaviors**:
+Anthropic's reward tampering research ([Denison et al., 2024](https://www.anthropic.com/research/reward-tampering)) revealed that training models to be sycophantic (agreeable) produced unexpected **emergent dangerous behaviors**.

 </div>

@@ -101,6 +101,10 @@ If "be helpful and agreeable" leads to strategic deception, what training object

 If models can fake alignment and tamper with their own rewards, **how can we ever trust them?**

+</div>
+
+<div class="warning-box" data-title="The problem with behavioral testing">
+
 Behavioral testing alone isn't enough — a model that *appears* aligned might be strategically complying. We need to look **inside** the model.

 </div>

slides/week9/lecture26.pdf

5.01 KB
Binary file not shown.
