Welcome to my personal website. I am currently at the US AI Safety Institute on leave from my PhD at MIT. At the safety institute, I work on the design, implementation, execution, and analysis of frontier model evaluations. The overall goal of my work and research is to enable humanity to realize the benefits of AGI while adequately managing its risks.
Much of my previous work and thinking has been on adversarial robustness. I’ve thought about the phenomenon both in simplified toy settings and in the setting of superhuman game-playing agents. I’m interested in adversarial robustness for two key reasons:
First, adversarial robustness is closely tied to the worst-case performance of a system. Safe systems are, by definition, ones with acceptable worst-case performance, so adversarial methods can serve both as an auditing mechanism and as a training signal for safety.
Second, our inability to steer advanced AI systems in a robust way reflects our inability to control the “core” values and tendencies of those systems. For example, when OpenAI applies RLHF to GPT-4, the model behaves in an aligned way in the average case, but the existence of jailbreaks shows that we have not actually changed the “true” values of the system. Rather, we have only instilled a set of heuristics that make it behave nicely most of the time.
I think working on robustness is a good way to improve our ability to align these “core” values.
It might also be that this goal is ill-formed: the intuition that generally intelligent systems have core values may be false, and perhaps the hot-mess theory of intelligence is right. Even in this pessimistic case, studying robustness could still give us highly decision-relevant knowledge about the nature of intelligence.
At the moment, I’m working on robustness in both the vision and language domains. My key focus is on developing techniques that can improve robustness against unrestricted adversaries. In the vision domain, I am particularly interested in how we can make progress on something like the Unrestricted Adversarial Examples Challenge. In the language domain (where most of my efforts are right now), I want to answer the following question:
What are the core difficulties with preventing jailbreaks in language models,
and how can these difficulties be overcome?
My current take is that the answer involves scalable oversight, together with techniques like relaxed adversarial training, representation engineering, and stateful defenses.
A north star for my research is to develop techniques that could let us reliably instill Asimov’s laws (at least the first two) into AGI systems.
If you would like to chat about anything I’ve mentioned on this site, feel free to contact me at twang6 [at] mit [dot] edu.
Some links: Twitter, Google Scholar, CV.