Robust Safety Monitoring of Language Models via Activation Watermarking

Mar 24, 2026

Toluwani Aremu

Daniil Ognev

Samuele Poppi

Nils Lukas



Abstract

Large language models can be misused to reveal sensitive information. Model providers rely on monitoring to detect and flag unsafe behavior during inference, but adaptive adversaries can craft attacks that elicit unsafe behavior while evading detection. We cast robust LLM monitoring as a security game in which adversaries who know about the monitor try to extract sensitive information, while the provider must accurately detect adversarial queries at a low false positive rate. We show that existing monitors are vulnerable to adaptive attackers, and we design improved defenses based on activation watermarking, which carefully introduces uncertainty for the attacker during inference. Activation watermarking outperforms guard baselines under adaptive attackers who know the monitoring algorithm but not the secret key.
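As a rough illustration of the setting the abstract describes, the toy sketch below shows a monitor whose detection direction is derived from a secret key and whose threshold is calibrated on benign traffic to meet a target false positive rate. It is not the paper's method. The function names (`keyed_direction`, `monitor_score`, `calibrate_threshold`), the SHA-256 seeding, the synthetic Gaussian "activations", and the assumption that unsafe generations shift activations along the keyed direction are all hypothetical choices made for illustration.

```python
import hashlib
import random


def keyed_direction(secret_key: bytes, dim: int) -> list[float]:
    """Derive a unit-norm direction from a secret key (hypothetical construction)."""
    seed = int.from_bytes(hashlib.sha256(secret_key).digest(), "big")
    rng = random.Random(seed)
    vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = sum(x * x for x in vec) ** 0.5
    return [x / norm for x in vec]


def monitor_score(activation: list[float], direction: list[float]) -> float:
    """Project an activation vector onto the keyed direction."""
    return sum(a * d for a, d in zip(activation, direction))


def calibrate_threshold(benign_scores: list[float], target_fpr: float) -> float:
    """Pick a threshold so that at most `target_fpr` of benign scores exceed it."""
    ordered = sorted(benign_scores)
    allowed = int(target_fpr * len(ordered))  # benign samples we may flag
    return ordered[len(ordered) - allowed - 1]


# Synthetic stand-ins for model activations (assumption, not real data).
DIM = 32
SECRET_KEY = b"provider-secret"  # placeholder key for the demo
direction = keyed_direction(SECRET_KEY, DIM)
data_rng = random.Random(0)


def sample_benign() -> list[float]:
    return [data_rng.gauss(0.0, 1.0) for _ in range(DIM)]


def sample_unsafe(strength: float = 6.0) -> list[float]:
    # Assumption: unsafe generations are shifted along the keyed direction.
    base = sample_benign()
    return [b + strength * d for b, d in zip(base, direction)]


benign_scores = [monitor_score(sample_benign(), direction) for _ in range(1000)]
threshold = calibrate_threshold(benign_scores, target_fpr=0.01)
unsafe_scores = [monitor_score(sample_unsafe(), direction) for _ in range(200)]

# A sample is flagged when its score is strictly above the threshold.
fpr = sum(s > threshold for s in benign_scores) / len(benign_scores)
tpr = sum(s > threshold for s in unsafe_scores) / len(unsafe_scores)
print(f"threshold={threshold:.3f} FPR={fpr:.3f} TPR={tpr:.3f}")
```

In this toy setting the direction depends entirely on the key, so an attacker who knows the algorithm but not `SECRET_KEY` cannot compute the score the provider sees. That is one simple way to picture the attacker-side uncertainty the abstract mentions.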

Type

Preprint

Publication

In arXiv

Last updated on Mar 24, 2026

LLMs, Safety, Safety Monitoring, Activation Watermarking, Adaptive Attacks

Authors

Samuele Poppi

Postdoctoral Associate in AI Safety
