I study AI systems at the point where deployment changes them: fine-tuning, adaptation, misuse, new languages, new modalities, and new incentives.
My work is driven by a simple concern: alignment is not a one-time property. A model can look safe before release and become brittle after ordinary updates, downstream customization, or adversarial pressure. I am interested in the mechanisms behind that shift, and in evaluations that make those failures visible before they matter.
I like research that connects control with understanding. Benchmarks are useful when they expose a real deployment failure mode; interpretability is useful when it changes what we can predict or prevent; defenses are useful when they survive contact with adaptation.
Methods and evaluations for keeping language, vision-language, and generative models aligned before release and robust after fine-tuning, adaptation, or policy shifts.
Interpreting how fine-tuning changes models, from multilingual LLM safety to gravitational interpretations of fine-tuning reversion, with mechanistic studies to follow.
Tools to explain decisions, remove learned information, and assess synthetic media, connecting explainable AI (XAI), unlearning, and deepfake robustness under one reliability lens.