Community Perspective – Hima Lakkaraju

Hima Lakkaraju

If you’ve ever made a mistake, chances are, you’ve been asked to explain yourself. While people can usually articulate how and why their decisions led to something going wrong, Hima Lakkaraju’s research shows that asking large language models to do the same is far less straightforward—even as they facilitate high-stakes decisions in healthcare, finance, and law.

“If a model is a black box, it’s not explaining itself,” says Lakkaraju. “How can a human decision-maker translate that into a decision that they can stand behind?” 

Himabindu (Hima) Lakkaraju is a 2023 AI2050 Early Career Fellow and Assistant Professor at Harvard University. Her research has been recognized with an NSF CAREER award and multiple best paper awards at top-tier ML conferences. In 2020, she co-founded the Trustworthy ML Initiative to foster a research community around trustworthy machine learning.

Lakkaraju’s research improves the interpretability, fairness, privacy, robustness, and reasoning capabilities of machine learning models. Her AI2050 project evaluates explanations from large language models, developing strategies to guarantee the reliability of those explanations and align them with regulatory standards. This project addresses Hard Problem #2, which concerns overcoming AI shortcomings, especially those with consequences for public trust and safety.

The AI2050 initiative gratefully acknowledges Fayth Tan for assistance in producing this community perspective.

What does the “right to an explanation” mean in the context of AI?

My work is motivated by regulation—whether it’s the General Data Protection Regulation or the AI Act in the EU, or the Blueprint for an AI Bill of Rights in the US. All these different regulatory frameworks mention, in some form, the “right to explanation.” They all emphasize the importance of having explainability or transparency in different settings.

But we don’t have the required technical research or tools that [can achieve] the goals in these regulatory frameworks. My research seeks to understand the gaps between AI research and regulatory guidelines, then tries to bridge those gaps from the technical side.

Why is explainability important?

Whenever we touch high-stakes decision making using any kind of model, whether it’s disease diagnosis or who should get loans, the onus of a decision still lies with a human. It can’t be completely automated because there are accountability issues. If your treatment goes wrong, a doctor or a hospital needs to be held accountable. If you automate everything, then that accountability is all up in the air. 

This is where explainability becomes a critical factor. Whether they’re a doctor or loan officer, people don’t just need to know the predictions made by these models, but also the factors that were used to determine that output. Only then can they decide if they agree with a tool or if they should make their own determination.

The Wall Street Journal asked readers to weigh in on whether disclosing the involvement of AI in decision-making should be necessary. Two of the reasons given for why disclosure isn’t necessary were that, one, you’re not always privy to important decisions made by humans, and two, the quality of the decision matters more than how the decision was made. What are your thoughts on these rationales?

To the first point—humans make imperfect decisions all the time, but we don’t ask them why, so why are we asking machines? We’re talking across different contexts. The level of transparency in college admissions isn’t as high as when a patient sues a hospital over a medical procedure that went wrong. There’s a lot of post hoc analysis. We don’t need explanations everywhere, but there are specific scenarios where they’re really needed.

Also, a single bad decision-maker can only make so many bad decisions, but the point of an automated tool is to deploy them at scale. It’s not just one person at one university, or at one hospital, it [could be] standardized statewide or even nationwide. 

To the second point that the quality of the decision is more important than knowing what factors were used to make it—this touches upon the trade-offs between the quality of the outputs versus the ability to explain the tool. The real utility of explainability is in scenarios where we believe that the tools are never going to be 100% accurate. They’re going to be good enough, but there are a significant number of cases where they may not be accurate. In those cases, we want the human decision-maker to override the tool’s determination. That’s where we need explainability.

You alluded to the gap between policy standards and the technical ability to actually execute those standards. Could you speak to that more?

At a high level, these guidelines convey that whenever an individual is impacted by an automated system, they have the right to know why or what caused that determination to be made.

Initially, simple models were more understandable to humans. But in the last 10 years, we’ve gotten complex models with billions of parameters. The more complex you make something, the harder it is to explain. With regard to regulations, answering the question of why a determination was made becomes harder and harder.

Large language models specifically are deceptively explainable: if you ask a model why it came up with something, it’ll give you an answer. For example, if you show doctors explanations from models, they are convinced. They say, “Wow, that’s exactly what I’d use to make this determination.” But models can tell you one thing and do something else computationally. The model gives an answer that makes the doctor think it knows what it’s doing, but when you change the values of certain features, those features don’t impact the output, despite the fact that the model explicitly said it was using them. On one level, it’s not surprising, but it also seems deceptive.
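A concrete way to picture that mismatch is a simple perturbation check: take the features a model claims it relied on, vary them, and see whether the prediction actually moves. The sketch below is purely illustrative and is not Lakkaraju’s evaluation method; the toy predict function, feature names, and patient record are all invented for this example.

```python
import random

def predict(patient):
    # Toy stand-in for a black-box model: it secretly uses only
    # `biomarker_b`, regardless of what its "explanation" claims.
    return 1 if patient["biomarker_b"] > 0.5 else 0

# Features the model *says* it used when asked for an explanation.
claimed_features = ["age", "biomarker_a"]

patient = {"age": 54, "biomarker_a": 0.9, "biomarker_b": 0.7}
baseline = predict(patient)

# Perturb each claimed feature; if the output never changes, the stated
# explanation is unfaithful to what the model actually computes.
for feature in claimed_features:
    flips = 0
    for _ in range(100):
        perturbed = dict(patient)
        perturbed[feature] = random.uniform(0, 100) if feature == "age" else random.random()
        if predict(perturbed) != baseline:
            flips += 1
    print(f"{feature}: prediction changed in {flips}/100 random perturbations")
```

Run as written, both claimed features produce zero flips, which is exactly the kind of gap between stated explanation and actual behavior described above.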

Why is that observation not surprising to you?

A lot of these models are basically built with human preference data. Companies building these models literally put thousands of labelers to task, and asked them what answers they would give in a specific situation. These models are also good at looking at documents on the internet, medical literature, and saying, “A doctor would look at these three features when making this determination.” 

The underlying mechanism is called reinforcement learning from human feedback. These models always want to keep a human engaged. They want a human to keep them in the loop and keep the interaction going. To get a little philosophical, they are not optimized for telling the truth. They’re optimized for telling someone something that keeps them hooked.
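For readers who want to see what “optimized for preference, not truth” looks like in the objective itself, here is a minimal sketch of the pairwise (Bradley-Terry-style) loss commonly used to train RLHF reward models. The scores are made up for illustration and nothing here describes any particular company’s system; the point is only that the loss rewards matching the labeler’s preference and contains no term for factual accuracy.

```python
import math

def preference_loss(score_chosen, score_rejected):
    # Probability that the labeler-preferred answer outscores the other,
    # under a Bradley-Terry model of pairwise preferences.
    p_chosen = 1.0 / (1.0 + math.exp(-(score_chosen - score_rejected)))
    # Negative log-likelihood of the recorded human preference.
    return -math.log(p_chosen)

# If the reward model scores the preferred (perhaps merely engaging) answer
# higher, the loss is small; if it scores the rejected answer higher, the
# loss is large. Truthfulness never enters the objective directly.
print(preference_loss(score_chosen=2.1, score_rejected=0.3))  # ~0.15
print(preference_loss(score_chosen=0.3, score_rejected=2.1))  # ~1.95
```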

What changes would you like to see on both technical and regulatory fronts to ensure right-to-explanation policies are possible to uphold?

The first thing is more conversations between people from both sides. Policymakers trying to come up with regulatory guidelines and AI researchers who work on auditing models are disconnected. We need more dialogue between these two communities so they can sit down together and see what’s feasible.

Also, [we need] more incentives for doing that kind of work. As an academic, there’s always the question of whether I should have just written another machine learning paper since it’d count more for my career than a meeting with policymakers, even if it led to regulatory change. If we addressed those two factors, we’d move toward better solutions.