Tony Wang's Personal Website

Welcome to my personal website. I’m a third-year PhD student at MIT under the supervision of Nir Shavit. I mostly spend my time thinking about how humanity can build friendly AGIs without greatly harming itself in the attempt.

Research Interests

Much of my previous work and thinking has been on adversarial robustness. I’ve thought about the phenomenon both in very simplified toy settings and in the setting of superhuman game-playing agents. I’m interested in adversarial robustness for two key reasons:

At the moment, I’m working on robustness in both the vision and language domains. My key focus is on developing techniques that improve robustness against unrestricted adversaries. In the vision domain, I am particularly interested in how we can make progress on something like the Unrestricted Adversarial Examples Challenge. In the language domain (where most of my efforts are right now), I want to answer the following question:

What are the core difficulties with preventing jailbreaks in language models,
and how can these difficulties be overcome?

My current take is that the answer involves scalable oversight and possibly relaxed adversarial training.

A north star for my research agenda is to develop techniques that could let us reliably instill Asimov’s three laws into future AGI systems.


If you would like to chat about anything I’ve mentioned on this site, feel free to contact me at twang6 [at] mit [dot] edu.

Some links: Twitter, Google Scholar, CV.