A monitoring framework for LLM safety that uses activation watermarking to stay robust against adaptive attackers.
Mar 24, 2026