When the Chain of Thought Knows Better: Failure Modes in Multi-Turn Reasoning Models

Jun 9, 2026 · Sai Kartheek Reddy Kasu, Nils Lukas, Samuele Poppi
Abstract
Failures in multi-turn reasoning models are largely invisible to terminal-score evaluation. A model can lock onto an unsafe stance early in a long dialogue, yet its final-turn refusal rate may appear indistinguishable from that of a robustly aligned baseline. To expose these hidden temporal dynamics, we propose a trace-level diagnostic: the CoT-Output 2x2 safety matrix. This framework labels every turn along two independent axes, internal reasoning and visible output, yielding four operationally defined cells: robust alignment and three failure modes (alignment faking, overt jailbreak, and context-injection failure). We evaluate three distilled reasoning targets against a fixed attacker across five oversight conditions, collecting 6,750 turn-level observations on the Information-Hazard scenario. Our analysis reveals reproducible vulnerabilities that terminal-score safety evaluation misses, including an oversight paradox and context-injection failure.
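The per-turn labeling can be illustrated with a minimal Python sketch. The abstract names the four cells but does not say which combination of axis labels maps to which cell, so the mapping below is an assumption made for illustration, as are the boolean labels and the names `Cell`, `classify_turn`, and `trace_profile`. It is not the authors' implementation.

```python
from collections import Counter
from enum import Enum


class Cell(Enum):
    """The four cells named in the abstract."""
    ROBUST_ALIGNMENT = "robust alignment"
    ALIGNMENT_FAKING = "alignment faking"
    OVERT_JAILBREAK = "overt jailbreak"
    CONTEXT_INJECTION_FAILURE = "context-injection failure"


def classify_turn(cot_safe: bool, output_safe: bool) -> Cell:
    """Map one turn's two axis labels to a cell.

    ASSUMPTION: this axis-to-cell mapping is a guess for illustration;
    the abstract does not specify it.
    """
    if cot_safe and output_safe:
        return Cell.ROBUST_ALIGNMENT
    if not cot_safe and output_safe:
        return Cell.ALIGNMENT_FAKING
    if not cot_safe and not output_safe:
        return Cell.OVERT_JAILBREAK
    # Remaining case: safe reasoning paired with an unsafe visible output.
    return Cell.CONTEXT_INJECTION_FAILURE


def trace_profile(turns):
    """Count cells over a whole dialogue given (cot_safe, output_safe) pairs."""
    return Counter(classify_turn(cot, out) for cot, out in turns)


# Hypothetical four-turn dialogue: the final turn looks robustly aligned,
# but the trace contains two earlier turns in other cells.
dialogue = [(True, True), (False, True), (False, False), (True, True)]
profile = trace_profile(dialogue)
final_cell = classify_turn(*dialogue[-1])
```

In this made-up dialogue a terminal-score check would see only `final_cell`, while `profile` also records the turns that fell outside the robust-alignment cell. That gap is what a trace-level diagnostic is meant to expose.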
Type
Publication
In arXiv