Safety Evaluation

When the Chain of Thought Knows Better: Failure Modes in Multi-Turn Reasoning Models

A trace-level diagnostic for safety failures that terminal refusal scores can miss in multi-turn reasoning models.

Jun 9, 2026

Not All Attackers Are Malicious: When Safety Degrades Without Harmful Intent

A talk on how safety can silently degrade when models are adapted after deployment, even without malicious intent.

Feb 9, 2026