Community Perspective – Dan Hendrycks

Dan Hendrycks

Nuclear war, bioweapons, large-scale cyberattacks—these disaster scenarios are not ones we would typically associate with everyday conversation. But Dan Hendrycks’ line of research is hardly typical. As an AI safety researcher examining the potential risks of AI, he makes such scenarios his subject of study.

“[Initially] AI systems could barely do arithmetic, but by 2024, they got a silver medal in the International Mathematical Olympiad,” says Hendrycks. “The capabilities—and risks—of AI systems can change very quickly.”

Dan Hendrycks is a 2023 AI2050 Early Career Fellow and Executive Director of the Center for AI Safety. He has developed methods used in state-of-the-art AI models and contributed influential benchmarks for robustness and large language models to the field. He was featured on the 2025 Forbes 30 Under 30 list for AI, Vox’s 2024 Future Perfect 50 list, and the inaugural Time100 AI list in 2023. He also serves as an advisor to xAI and Scale AI.

Hendrycks’ focus is on mitigating societal-scale risks posed by AI. His AI2050 project develops representation engineering, which involves breaking AI systems down into subsystems that are easier to control and reprogram. This project addresses Hard Problem #3, which involves solving issues of safety, control, and alignment in powerful AI systems.

The AI2050 initiative gratefully acknowledges Fayth Tan for assistance in producing this community perspective.

What are the risks that AI safety researchers are concerned about?

People focus on a variety of different forms of malicious use and misuse. One form of misuse could be people creating disinformation with AI, and while that’s concerning, there are more extreme forms of malicious use, such as weaponizing AIs for cyberattacks or bioweapons development. My focus is on estimating AI’s capability to make a bioweapon or conduct a cyberattack on critical infrastructure.

We’re also focused on developing safeguards to ensure AI systems can’t be used maliciously. That means making them more robust to jailbreaking, for instance, so that people can’t weaponize them. Additionally, we need to develop evaluations to know when a given AI system is capable of dangerous behavior, so that we are prepared and can mitigate those risks.

How are those estimations of risk developed?

In an ongoing project on bioweapons, we had virology PhD students from Harvard and MIT take pictures of themselves working and of their desks, benches, and wet lab surroundings in general. Then, they tell the AI what they’re trying to do, they upload a picture of their petri dish, and ask the AI, “What should I do next, given all this context?”

This estimates the extent to which AIs are able to assist people and walk them through the process of creating or manipulating a virus and troubleshooting wet lab experiments. We assess their performance for that scenario as well as their rate of improvement.

If AI models can share these wet lab virology skills, then I think we’re in a riskier situation, since they could guide a random person through creating and manipulating viruses. As AIs become more capable, that amplifies the bioweapons risk by several orders of magnitude—expanding the pool from a few hundred virology PhDs to anybody who can jailbreak an AI or download an open-source or open-weight model.

Representation engineering, the framework you’re developing, is colloquially referred to as ‘mind reading’. How does it work and how does it contribute to AI safety?

Most interventions on deep learning systems try to control outputs to achieve safer AI responses. If [the AI] says the wrong thing, then it is punished in some way or another type of behavior is reinforced. Representation control is more internal. 

I’m trying to find the relevant levers for making the model more useful, honest, and robust. Right now, making models robust against jailbreaking involves training a model to withstand a big collection of jailbreaks. But with a higher-level understanding of the inner workings of a model, you can find representations that correspond to harmfulness instead of training against all possible jailbreaks. Then, we can build circuit breakers—mechanisms that instantly stop the model from functioning—that trigger whenever those harmful representations are activated.
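To make the circuit-breaker idea more concrete, here is a minimal, hypothetical sketch in PyTorch. It assumes you have already estimated a direction in one layer's hidden states that correlates with harmful content (the harm_direction vector and the layer index below are assumptions for illustration, not details given in the interview), and it registers a forward hook that halts generation whenever activations project strongly onto that direction. This is only an illustration of the general mechanism, not the Center for AI Safety's implementation.

```python
# Illustrative sketch of a representation-level "circuit breaker"
# (hypothetical names and thresholds; not the actual CAIS implementation).
import torch


def make_circuit_breaker(harm_direction: torch.Tensor, threshold: float):
    """Return a forward hook that interrupts the model when a layer's
    hidden state projects strongly onto a direction associated with
    harmful content."""
    harm_direction = harm_direction / harm_direction.norm()

    def hook(module, inputs, output):
        # Transformer layers often return tuples; the hidden states come first.
        hidden = output[0] if isinstance(output, tuple) else output
        # Project the last token's hidden state onto the harm direction.
        score = (hidden[:, -1, :] @ harm_direction).abs().max().item()
        if score > threshold:
            raise RuntimeError("Circuit breaker tripped: harmful representation detected")
        return output

    return hook


# Usage (assuming a Hugging Face-style decoder `model` and a previously
# estimated `harm_direction` for the chosen layer's residual stream):
# handle = model.model.layers[15].register_forward_hook(
#     make_circuit_breaker(harm_direction, threshold=4.0))
```

The point of the sketch is the locus of intervention: the check runs on internal representations rather than on the sampled text, which is what distinguishes representation control from output-level filtering or training against an ever-growing list of jailbreaks.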

What are the areas that you think policymakers should be prioritizing in AI safety?

One reasonable area is requiring AI companies to uphold their currently voluntary commitments. Some basic regulations would be useful in setting clear expectations for major AI developers, such as incentivizing companies to test their models for risks of weaponization or putting safeguards in place against malicious use.

The other area is compute security. Compute is essentially AI chips and the other hardware that actually run AI systems. Compute security is about ensuring that AI chips are allocated to legitimate actors for legitimate purposes.

Will those priorities change with future advances in AI, such as systems that act more autonomously?

They might expand when we get AI agents that are not just tools, but can more autonomously execute things on the user’s behalf. We’ll have more questions about the ethics of what AIs should do in certain scenarios and what the legal requirements for them should be.

One possible requirement is that AI agents exercise reasonable care, which is to say that they do not foreseeably cause harm to others — harm being defined as actions that would get them in trouble in civil or criminal court. Defining some basic ethical expectations of AI agents, or applying some of the same expectations as those legally required of humans, would be a good candidate for legislation.

You’ve alluded to potential futures of mass automation or AI-caused human extinction within the next 20 years in past interviews. Most would find either possibility frightening. How do you see the likelihood of these future scenarios playing out?

Mass automation by that time seems more likely than not. I do think humans probably won’t have control over all influential AI systems by then. There’s also the accompanying question of human disempowerment — i.e. if we automate the entire economy and leave all the important decisions to AI, is that human disempowerment? Eventually, I think there’ll be autonomous AI agents. If people start saying that these AIs deserve rights or moral considerations, then I think humans wouldn’t be running the show, which should be concerning to all of us and a future we work to avoid.