Research Compass

I study AI systems at the point where deployment changes them: fine-tuning, adaptation, misuse, new languages, new modalities, and new incentives.

My work is driven by a simple concern: alignment is not a one-time property. A model can look safe before release and become brittle after ordinary updates, downstream customization, or adversarial pressure. I am interested in the mechanisms behind that shift, and in evaluations that make those failures visible before they matter.

I like research that connects control with understanding. Benchmarks are useful when they expose a real deployment failure mode; interpretability is useful when it changes what we can predict or prevent; defenses are useful when they survive contact with adaptation.

What Guides Me
  • Safety must survive adaptation. Models are rarely frozen after release, so control mechanisms should be evaluated under benign and adversarial change.
  • Understanding should be actionable. Mechanistic insight matters most when it helps diagnose, predict, or intervene on model behavior.
  • Evidence beats surface behavior. Final answers, refusal rates, and aggregate scores can hide important failures; useful evaluations inspect traces, representations, and distributions.
  • Trustworthy AI is a systems problem. Safety, privacy, authenticity, and unlearning interact across training data, model internals, deployment interfaces, and human oversight.

Current Threads

AI Control Across Deployment

Methods and evaluations for keeping language, vision-language, and generative models aligned before release and robust after fine-tuning, adaptation, or policy shifts.

The Mechanics of Model Change

Interpreting how fine-tuning changes models, spanning multilingual LLM safety, gravitational interpretations of fine-tuning reversion, and future mechanistic studies.

Evidence, Forgetting & Forensics

Tools to explain decisions, remove learned information, and assess synthetic media, bringing XAI, unlearning, and deepfake robustness under one reliability lens.