Community Perspective - Amanda Coston

Amanda Coston

Predictive AI models are responsible for consequential decisions. They approve consumer loans, make treatment recommendations in healthcare, and even guide public policy. For Amanda Coston, this raises an important question: how do we know an AI model is actually doing what it’s supposed to? 

Coston develops evaluations of model validity: methods for determining whether a model is suitable for the specific context in which it is used.

“Context is key,” says Coston. “Things change if you deploy a model in the future, or in a different setting. How can we design evaluations of algorithms to more accurately reflect all these sources of uncertainty?”

Amanda Coston is a 2023 AI2050 Early Career Fellow and Assistant Professor of Statistics at UC Berkeley. Her scholarship has been recognized by a Meta Research PhD Fellowship in 2022, a K&L Gates Presidential Fellowship in Ethics and Computational Technologies in 2020, and a Tata Consultancy Services Presidential Fellowship in 2018. Her work has been featured in The Wall Street Journal and VentureBeat and has garnered multiple best paper awards.

Coston’s work addresses data problems that impact algorithmic decision making, with a particular focus on issues that disproportionately affect marginalized groups. Her AI2050 project develops methods of evaluating validity for predictive AI used in societally high-stakes decision making. This work supports Hard Problem #2, solving key challenges or shortcomings of AI to build public trust in AI systems.

The AI2050 initiative gratefully acknowledges Fayth Tan for assistance in producing this community perspective.

You've described your research as understanding whether “AI is doing what it's supposed to do”. Why is that more complicated than it sounds?

It seems like a low bar, but in practice, algorithms — even simple predictive ones — often act differently from the way people intended them to. This misalignment can cause downstream problems with fairness, or unreliable and unintended behavior, sometimes with serious consequences. A common example is when algorithms effectively discriminate against certain people in a financial or healthcare setting.

Why does this happen, and what kind of consequences arise from this discrepancy?

It can be traced to early problem formulation issues. In a data-rich world, it’s tempting to use the information you have at hand as predictive features for certain outcomes. But this process can create an algorithm that isn’t truly predictive of the outcomes we care about.

Previously, I studied algorithms in consumer lending that were used to predict whether or not someone would default on a loan. When people apply for a loan, you collect information about the applicant and the size of the loan, and later you observe whether the applicant repaid or defaulted. Predictive features for repayment-outcome models were derived from that information, and models that appeared to perform well were deployed in the real world.

Crucially, the problem wasn’t with the model itself, but with how the model was built. Consumer lending datasets only contain repayment outcomes for approved applicants; people who were rejected for the loan in the first place are excluded entirely. As a result, the data used to train the model isn’t fully reflective of the real-world context it’s being used in.

In data-first approaches, data is often divorced from the setting in which it will actually be used. This causes us to overlook issues with the data itself, such as training datasets that are not fully representative of the algorithm’s target population.
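A minimal synthetic sketch of this selection problem (not drawn from Coston’s studies; the variables, such as an unrecorded “reliability” signal seen only by past underwriters, are invented for illustration): a default model fit only on approved applicants can look reasonable on that subset while understating risk for the full applicant pool it would actually score.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 50_000

# Two synthetic drivers of default: "income" is recorded in the dataset,
# while "reliability" stands in for information past underwriters saw but
# that never made it into the data.
income = rng.normal(size=n)
reliability = rng.normal(size=n)
p_default = 1 / (1 + np.exp(income + 2 * reliability))  # lower values -> higher risk
default = rng.binomial(1, p_default)

# Historical approvals screened on both signals, so repayment outcomes are
# only observed for applicants who were already judged to be low risk.
approved = (income + 2 * reliability + rng.normal(size=n)) > 0

# Train on the approved-only outcomes, using the recorded feature alone.
X = income.reshape(-1, 1)
model = LogisticRegression().fit(X[approved], default[approved])
pred = model.predict_proba(X)[:, 1]

print(f"approved subset: predicted {pred[approved].mean():.3f}, actual {default[approved].mean():.3f}")
print(f"full pool:       predicted {pred.mean():.3f}, actual {default.mean():.3f}")
# The model looks roughly calibrated on the approved applicants it was
# trained on, but understates default risk for the full applicant pool.
```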

Are there any other areas in which issues of validity can arise when building predictive models?

Sometimes, we use surrogate or proxy outcomes in public policy settings where we’re interested in future social outcomes. Some of my research was in child welfare, evaluating predictive models used to help assess whether or not a child is at risk of harm or neglect. A common proxy outcome for risk of harm is re-referral to a child welfare hotline. The underlying assumption is that children who are re-referred are more likely to experience harm, but that isn’t the case. Re-referrals are more likely to occur, regardless of underlying risk, for families of color and families of lower socioeconomic standing. In this case, re-referral is what we call a “noisy proxy”: a metric that is inherently flawed because it does not measure what it was intended to.
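To make the “noisy proxy” issue concrete, here is a small synthetic sketch with abstract groups (labeled A and B) and invented rates, not drawn from any real child welfare data: when the recorded label is inflated for one group regardless of underlying risk, a model trained on that label scores two equally at-risk individuals very differently.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 50_000

# The true outcome of interest ("harm") depends only on an underlying risk
# score, but the recorded proxy label ("re-referral") is also inflated for
# one group (here just labeled group B) regardless of that risk.
group_b = rng.binomial(1, 0.5, n)
risk = rng.normal(size=n)
p_harm = 1 / (1 + np.exp(-risk))                             # depends on risk only
p_rereferral = 1 / (1 + np.exp(-(risk - 1 + 2 * group_b)))   # inflated for group B
rereferral = rng.binomial(1, p_rereferral)

# Train on the proxy label, with group membership (or a correlate of it)
# available as a feature, as is common in administrative data.
X = np.column_stack([risk, group_b])
proxy_model = LogisticRegression().fit(X, rereferral)

# Two individuals with identical underlying risk, one from each group.
same_risk = np.array([[0.0, 0.0], [0.0, 1.0]])
print(proxy_model.predict_proba(same_risk)[:, 1].round(3))
# The proxy-trained model scores the group-B individual far higher even
# though their true probability of harm is the same.
```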

There’s also the problem of missing data. If you train an algorithm on historical data, you are only training it on observed outcomes. The decisions that led to those outcomes are not necessarily taken into account. To take an example from a clinical decision support algorithm in healthcare: there was a seemingly paradoxical observation in historical hospital data that asthmatic patients appeared to be at lower risk of developing pneumonia. This obviously does not match what we know physiologically, as asthmatic people should be at higher risk.

What the outcomes in the historical data did not capture, however, was that asthmatic people were getting better care. If a clinical algorithm were trained on those outcomes, it might determine that asthmatic people are low risk and don’t need intensive care, which would cause their future outcomes to be poor.
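A toy simulation in the spirit of this well-known asthma and pneumonia example (synthetic numbers, not the original study data) shows how the paradox arises when the treatment decision is missing from the data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 100_000

# Asthma raises baseline risk, but asthmatic patients historically received
# intensive care far more often, and that care sharply reduces bad outcomes.
asthma = rng.binomial(1, 0.15, n)
intensive_care = rng.binomial(1, np.where(asthma == 1, 0.9, 0.1))
logit = -2 + 1.5 * asthma - 3.0 * intensive_care
bad_outcome = rng.binomial(1, 1 / (1 + np.exp(-logit)))

# A model trained on historical outcomes with the treatment decision left
# out of the features, as it typically is.
model = LogisticRegression().fit(asthma.reshape(-1, 1), bad_outcome)
pred = model.predict_proba(np.array([[0.0], [1.0]]))[:, 1]
print(f"predicted risk without asthma: {pred[0]:.3f}")
print(f"predicted risk with asthma:    {pred[1]:.3f}")
# The paradox reappears: asthma looks protective because the recorded
# outcomes already reflect the more aggressive care asthmatic patients got.
```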

It sounds like true solutions for model validity might not be as straightforward as a single quantitative metric. How can solutions be presented in an actionable way?

I’m thinking about a protocol similar to those used by Institutional Review Boards, the regulatory process for human subjects research. It would guide practitioners through common validity problems, asking them to consider the cause-and-effect relationships between an outcome and the various factors that influence it.

I’d also link to available resources, such as methodological fixes or statistical tools, to address those problems. Practitioners might also identify no-go issues—issues that, if not addressed satisfactorily, would mean the algorithm isn’t worth investing in because it will not function as intended. An extreme example is algorithms that use people’s faces to purportedly predict criminality. Physiognomy and phrenology have been extensively debunked; there is no possible evidence you could cite to support the relevance of faces to criminality. Thus, the model can never be valid, because any correlations will be spurious.

Lastly, algorithms are often developed behind closed doors. I believe that members of impacted communities should be given the information they need to weigh in meaningfully—for example, if they notice representation problems or biases in the data that should be flagged as no-go issues but aren’t. The decisions being made by these algorithms are consequential, and I’m hoping to work on a tool that opens up more avenues for members of the community to actively participate in conversations about these issues.