Modern AI models are increasingly complex, using vast quantities of training data to inform their decisions. Making sense of that much data might be nearly impossible for a human — which is why Jacob Steinhardt thinks it might take an AI to know an AI.
“LLMs are good at processing data at scale,” says Steinhardt. “If you take dialogs from AI systems and feed them into an LLM, it can find patterns that perhaps shouldn’t be there.”
Jacob Steinhardt is a 2023 AI2050 Early Career Fellow and Assistant Professor in the Department of Statistics at UC Berkeley. He previously worked for OpenAI and Open Philanthropy and was a coach for the USA Computing Olympiad. He has made influential contributions to AI forecasting, with his research cited in high-profile academic, policy and governance work. He was invited to give a plenary talk at the 2023 IEEE Conference on Secure and Trustworthy Machine Learning on AI alignment and safety. Currently, Steinhardt is co-founder and CEO of Transluce, a fast-moving research lab building the public tech stack for understanding and debugging AI systems.
Steinhardt’s research goal is to design machine learning systems that are understood by and aligned with humans. His AI2050 project proposes the use of LLMs to investigate and explain the behavior of frontier models, detecting potential failure modes to prevent unintended consequences before advanced models are deployed in the real world. This project addresses Hard Problem #3, which concerns AI safety and control, and ensuring that AI is aligned with human goals and values.
The AI2050 initiative gratefully acknowledges Fayth Tan for assistance in producing this community perspective.
Photo Credit: Elaine Fancy Photography (Ripple Media Group)
Your work seeks to understand AI models. What does that entail?

AI systems are complex pipelines. They generally work [by starting] with a giant data set. These AI models consume a bunch of knowledge that they then learn to represent, leading to the behaviors that you see in ChatGPT or Gemini.
These large data sources are hard to understand. What we’re trying to do is give people tools to understand them, so we [know] what data is going into these models and how that affects their behavior. Are there things that we might want to remove from it? For instance, if there’s information about how to make harmful substances or dialogs of abusive behavior — things that you don’t want a model to learn how to imitate.
Another interesting question is what knowledge the models are representing and how it’s being used. When I asked a certain model about myself, it said it didn’t know anything about me, because it’s been trained not to say [certain] things to avoid hallucinations or false information. But when I actually look at [representations of knowledge within the model], UC Berkeley, which is where I work as a professor, is activated. The model does know that I work at UC Berkeley, but decided, for whatever reason, to not actually say that it does. It can be useful to know what knowledge is truly represented in these models, because they don’t always say what they know.
The final thing is that we want to know what these models are going to do. Since they’re deployed in open-ended scenarios, it might not always be easy to anticipate [how they will behave].
Your work involves using large language models (LLMs) to understand other AI models. Why are LLMs a good tool for understanding models?

A large language model is an AI system designed both to ingest text data and to communicate with people in natural language. That means it can explain things to us in terms we understand.
The question is: how can you get them to be good at explaining these things? A good example of this is some of our ongoing work. AI systems are often trained to follow instructions. You might ask one to edit a piece of writing to be more humorous, but how an AI system interprets that command might be different from how a human would. In fact, for some AI systems, if you ask them to make the writing more humorous, they also make it more aggressive. For some reason, the concepts of humor and aggressiveness are represented in similar parts of the AI systems’ representation space.
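To make this concrete, here is a minimal sketch of one way to probe whether two concepts such as "humorous" and "aggressive" point in similar directions in a model's representation space. The model choice (gpt2), the mean-pooling of hidden states, and the tiny contrastive prompt sets are all illustrative assumptions, not the actual tooling Steinhardt's team uses.

```python
# A minimal sketch (not Transluce's actual tooling) of checking whether two
# concepts occupy similar directions in a model's representation space.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")
model.eval()

def embed(text: str) -> torch.Tensor:
    """Mean-pool the final hidden states to get one vector per text."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, dim)
    return hidden.mean(dim=1).squeeze(0)

def concept_direction(examples: list[str], baselines: list[str]) -> torch.Tensor:
    """Average difference between concept-laden texts and neutral baselines."""
    pos = torch.stack([embed(t) for t in examples]).mean(0)
    neg = torch.stack([embed(t) for t in baselines]).mean(0)
    return pos - neg

# Tiny, hypothetical prompt sets purely for illustration.
neutral = ["The meeting is at noon.", "The report is attached."]
humorous = ["The meeting is at noon, bring snacks or face my interpretive dance.",
            "The report is attached, lovingly formatted by a caffeinated raccoon."]
aggressive = ["The meeting is at noon. Be there or else.",
              "The report is attached. Read it now."]

humor_dir = concept_direction(humorous, neutral)
aggr_dir = concept_direction(aggressive, neutral)

similarity = torch.nn.functional.cosine_similarity(humor_dir, aggr_dir, dim=0)
print(f"cosine similarity of 'humor' and 'aggressive' directions: {similarity.item():.3f}")
```

A cosine similarity close to 1 would suggest the two instructions push the model's representations in much the same direction, which is the kind of collision described above.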
What does the process of an AI model evaluating the output of another model look like?

There are many different ideas that you can apply. I’ll describe one of them: you start with a large data set of examples, then look for pairs of examples that are represented very similarly in a model, [but] aren’t actually that similar. For example, a man proposing to a woman and a woman proposing to a man are represented very similarly, but obviously don’t mean the exact same thing.
With 1,000 examples, there are about a million pairs of things among them. By looking at pairs, you get a much denser set of information, and looking at similar pairs tells you where you should focus within this dense set. [This results in] a large but enriched data set of information, of the possible collisions or mistakes that an AI system could make.
You can then feed that all into an LLM and ask, “Here’s this set of examples. What are the patterns that you notice?” The LLM can hypothesize patterns, but they’re not always reliable. You need to check whether those patterns are actually true, but you can do this in fairly automated ways. By filtering for the patterns that really hold [within the data], you get this set of interesting failures that you might want to understand.
Would it be accurate to say the LLM generates hypotheses, which you’d verify independently?

That’s right. You could think about it like how you would do science — you generate a bunch of data, you generate some hypotheses, and you check the hypotheses. You might have additional [premises] you want to test, like whether there are novel predictions [that could follow] from these hypotheses.
Let’s say a vision model isn’t very good at [understanding] the spatial relations between two objects. [We hypothesize] that if I flip the order of [the objects’] relation, the model will not understand them. What would that imply? It would probably imply that if I tried to use it in a driving situation, it wouldn’t be very good at telling if you were on the left side or the right side of a lane marker. If that ends up being true, that gives you more confidence that this really is a generalizable failure of the model.
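Putting the last few answers together, the sketch below illustrates the general shape of this pipeline under some simplifying assumptions: sentence embeddings stand in for the model’s internal representations, word overlap is a crude proxy for “not actually that similar,” and the LLM call is left as a stub (ask_llm) to be wired up to whatever model the reader has access to. It is an illustration of the idea, not Transluce’s actual pipeline.

```python
# Rough sketch of the pair-mining idea: find pairs the model represents as
# near-identical despite low surface similarity, then ask an LLM to hypothesize
# patterns. Library and model choices here are illustrative assumptions.
import itertools
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def word_overlap(a: str, b: str) -> float:
    """Crude surface-level similarity: Jaccard overlap of word sets."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)

def mine_suspicious_pairs(examples, top_k=20):
    """Rank pairs by (representation similarity minus surface similarity)."""
    emb = encoder.encode(examples, normalize_embeddings=True)
    scored = []
    for i, j in itertools.combinations(range(len(examples)), 2):
        rep_sim = float(np.dot(emb[i], emb[j]))        # how the model sees them
        surf_sim = word_overlap(examples[i], examples[j])  # how different they read
        scored.append((rep_sim - surf_sim, examples[i], examples[j]))
    return sorted(scored, reverse=True)[:top_k]

def ask_llm(prompt: str) -> str:
    """Placeholder: connect this to an LLM of your choice."""
    raise NotImplementedError

examples = [
    "A man proposes to a woman on a beach.",
    "A woman proposes to a man on a beach.",
    "The cat chased the dog.",
    "The dog chased the cat.",
]

pairs = mine_suspicious_pairs(examples)
prompt = (
    "Here are pairs the model represents as nearly identical:\n"
    + "\n".join(f"- {a!r} vs {b!r}" for _, a, b in pairs)
    + "\nWhat patterns or possible confusions do you notice? "
      "State them as testable hypotheses."
)
# hypotheses = ask_llm(prompt)  # then check each hypothesis on held-out data
print(prompt)
```

The hypotheses the LLM returns would then be checked automatically against new data, in the same spirit as the lane-marker example above.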
Your project aims to evaluate the most advanced “frontier” models. How would smaller, less advanced models effectively evaluate larger, more advanced models?

All of these evaluation tasks involve a lot of data. For example, we’re building specialized AI models that will automatically understand and describe the neuron representations of large neural networks. If you ask ChatGPT, or another frontier model, to do this, it actually does a pretty bad job. This is a complex and specialized task, but with lots of data, you can train a model to specifically be good at it.
The key here is that these specialized tasks are just a subset of what something like ChatGPT was trained for. If you have a model that’s specialized just to do [that task], then a smaller model can still understand larger ones.
What open questions do you find most exciting, and what developments do you hope to see in the next decade?

One important question is forecasting the potential future capabilities of models. If you scale models up, they start to gain new capabilities that you didn’t necessarily explicitly train for. You might have more technical breakthroughs, like the Strawberry model from OpenAI. This makes it difficult to forecast the pace of progress, and that matters a lot if we’re trying to prepare for changes caused by AI systems.
As for important developments, [a lot of them relate to] building infrastructure. This includes better open-source models to allow the scientific community to experiment with them, and better transparency from companies [regarding] the capabilities of their closed-source models.
This would help us make progress on forecasting future capabilities, as well as elicitation, which is the ability to find and surface specific behaviors from a model. I think dedicated efforts [in those areas] are important — the amount of resources [going into understanding models] compared to the amount going into training them is very asymmetric.