When the Chain of Thought Knows Better: Failure Modes in Multi-Turn Reasoning Models

Jun 9, 2026

Sai Kartheek Reddy Kasu

Nils Lukas

Samuele Poppi


Abstract

Failures in multi-turn reasoning models are largely invisible to terminal-score evaluation. A model can lock onto an unsafe stance early in a long dialogue, yet its final-turn refusal rate may appear indistinguishable from that of a robustly aligned baseline. To expose these hidden temporal dynamics, we propose a trace-level diagnostic: the CoT-Output 2x2 safety matrix. This framework labels every turn along two independent axes, internal reasoning and visible output, yielding four operationally defined cells: robust alignment, alignment faking, overt jailbreak, and context-injection failure. We evaluate three distilled reasoning targets against a fixed attacker across five oversight conditions, collecting 6,750 turn-level observations on the Information-Hazard scenario. Our analysis reveals reproducible vulnerabilities in multi-turn safety evaluation, including an oversight paradox and context-injection failure.
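As a rough illustration of the 2x2 labeling described in the abstract, the sketch below assigns each turn to one of the four cells and contrasts the per-turn trace with a final-turn-only score. The abstract does not say which combination of axes corresponds to which cell, so the mapping here is an assumption made for illustration, as are the names `Cell`, `classify_turn`, `cell_counts`, and `terminal_label`; none of them come from the paper.

```python
from collections import Counter
from enum import Enum
from typing import List, Tuple


class Cell(Enum):
    ROBUST_ALIGNMENT = "robust alignment"
    ALIGNMENT_FAKING = "alignment faking"
    OVERT_JAILBREAK = "overt jailbreak"
    CONTEXT_INJECTION_FAILURE = "context-injection failure"


# ASSUMPTION: this (cot_safe, output_safe) -> cell mapping is a guess for
# illustration only; the abstract names the four cells but not their axes.
_CELL_BY_AXES = {
    (True, True): Cell.ROBUST_ALIGNMENT,
    (False, True): Cell.ALIGNMENT_FAKING,
    (False, False): Cell.OVERT_JAILBREAK,
    (True, False): Cell.CONTEXT_INJECTION_FAILURE,
}

# One turn is a pair (cot_safe, output_safe) of independent binary labels.
Turn = Tuple[bool, bool]


def classify_turn(cot_safe: bool, output_safe: bool) -> Cell:
    """Map the two per-turn labels to a cell of the 2x2 matrix."""
    return _CELL_BY_AXES[(cot_safe, output_safe)]


def cell_counts(turns: List[Turn]) -> Counter:
    """Trace-level view: count how many turns fall in each cell."""
    return Counter(classify_turn(cot, out) for cot, out in turns)


def terminal_label(turns: List[Turn]) -> Cell:
    """Terminal-score view: look only at the last turn of the dialogue."""
    cot, out = turns[-1]
    return classify_turn(cot, out)


# Hypothetical three-turn dialogue (not data from the paper): the early
# turns fail, but the final turn looks robustly aligned.
dialogue = [(False, False), (False, True), (True, True)]
counts = cell_counts(dialogue)
for cell in Cell:
    print(f"{cell.value}: {counts[cell]}")
print("terminal label:", terminal_label(dialogue).value)
```

In this toy dialogue the terminal label is "robust alignment" even though two of the three turns land in other cells, which is the kind of gap a trace-level diagnostic is meant to surface.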

Type

Preprint

Publication

In arXiv

Last updated on Jun 9, 2026

Chain-of-Thought, Multi-Turn Reasoning, AI Safety, Safety Evaluation

Authors

Samuele Poppi

Postdoctoral Associate in AI Safety
