Medical tools enabled with artificial intelligence (AI) can greatly improve health care by helping with problem-solving, automating tasks, and identifying patterns. AI can aid in diagnoses, predict patients’ risk of developing certain illnesses, and help determine the communities most in need of preventative care. But AI products can also replicate—and worse, exacerbate—existing biases and disparities in patients’ care and health outcomes. Alleviating these biases requires first understanding where they come from.
Christina Silcox, Ph.D., research director for digital health at the Duke-Margolis Center for Health Policy, and her team identified four sources of AI bias and ways to mitigate them in a recent report funded by The Pew Charitable Trusts. She spoke with Pew about the findings.
This interview with Silcox has been edited for clarity and length.
How does bias in AI affect health care?
AI tools do not have to be biased. But if we're not paying careful attention, we teach these tools to reflect existing biases in humans and health care, which leads to continued or worsened health inequities. It’s particularly concerning if this happens in groups that have historically gotten poor care, such as those who have been affected by structural racism and institutionalized disparities like lack of access to affordable health insurance and care. For example, when machine-learning-based tools are trained on real-world health data that reflects inequitable care, we run the risk of perpetuating and even scaling up those inequities. Those tools may end up leading to slower diagnoses or incomplete treatments, decreased access, or reduced overall quality of care—and that’s the last thing we want to do.
Your research identified four ways bias can arise in AI. One is inequitable framing—what is that and how does that lead to bias?
Inequitable framing means that a tool is programmed to solve for the wrong question and, as a result, ends up addressing the wrong issue. An example is a “no-show” algorithm, which can predict which patients may not keep their appointments. Based on those predictions, many clinics and hospitals double-book certain patients to try to minimize lost revenue.
The problem is that Black, Latino, and American Indian or Alaska Native patients disproportionately experience systemic barriers to care, such as lack of reliable transportation and limited access to paid sick leave or affordable health insurance, that may prevent them from getting to the doctor’s office. Double-booking those appointments only exacerbates those problems, because when both patients do show up, they are either not seen promptly or are rushed through their appointments.
Asking an AI tool “who won’t show up?” and double-booking only solves a financial problem. A better approach is to design tools that can predict “which supportive measure is most likely to help this patient attend their appointment? Who needs transportation? Who would benefit from a reminder call? Who should be booked for a video consult rather than in person?” That framing solves both the financial problem and the health problem.
Other areas of bias you identified are unrepresentative training data and biased training data. What’s the difference between them and how do they lead to bias?
Unrepresentative data means that the data collected is inconsistent with the locations and populations on which the tool will be used. For example, a tool might be trained with data from Northeastern urban academic medical centers or large research institutes and then brought to a Southwestern rural community clinic and used on the population there. Workflows, access to certain tests and treatments, data collection, patient demographics, and the likelihood of certain diseases all vary between different places and hospital types, so a tool trained on one set of patient data may not perform well when the data or patient populations change significantly.
But even if you have data that is representative, there may be bias within the data. We've seen this with AI tools that used data from pulse oximeters, the finger clip-like devices that measure the oxygen saturation level of your blood. These tools were used to help guide triage and therapy decisions for COVID-19 patients during the pandemic. But it turns out the pulse oximetry sensors don't work as well for darker-skinned people, people who have smaller fingers, or people who have thicker skin. That data is less accurate for certain groups of people than it is for others. And it’s fed into an algorithm that potentially leads doctors to make inadequate treatment decisions for those patients.
The final area of bias you identified was data selection and curation. How does that affect AI?
AI developers have to select the data used to train their programs, and their choices can have serious consequences. For example, a few years ago, health systems used a tool to predict which patients were at risk of severe illness so they could better allocate services meant to prevent those illnesses. Researchers found that when developing the tool, the AI developers had chosen to use data on patients’ health care costs as a proxy for the severity of their illnesses. The problem with that is, Black patients historically have less access to health care—and therefore lower costs—even when they’re very ill. Costs are a racially biased proxy for severity of illness. So the tool underestimated the likelihood of serious health conditions for Black patients.
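The proxy problem described above can be illustrated with a small sketch. It is not the tool from the study. The patient records, group labels "A" and "B", and dollar figures are all invented for illustration, and the only assumption built in is the one stated in the interview: at the same level of illness, one group generates lower costs.

```python
from dataclasses import dataclass


@dataclass
class Patient:
    group: str          # hypothetical group label
    conditions: int     # hypothetical stand-in for severity of illness
    annual_cost: float  # hypothetical recorded health care cost


# Invented data: both groups have identical illness levels, but group B
# generates lower costs at every level (less access to care).
patients = (
    [Patient("A", n, 1000.0 * n) for n in range(1, 6)]
    + [Patient("B", n, 550.0 * n) for n in range(1, 6)]
)


def top_k(rows, key, k):
    """Return the k patients ranked highest by the given key."""
    return sorted(rows, key=key, reverse=True)[:k]


def share_of_group(rows, group):
    """Fraction of the selected patients who belong to the group."""
    return sum(p.group == group for p in rows) / len(rows)


# Rank by the cost proxy versus ranking by illness itself.
by_cost = top_k(patients, key=lambda p: p.annual_cost, k=4)
by_need = top_k(patients, key=lambda p: p.conditions, k=4)

share_b_by_cost = share_of_group(by_cost, "B")
share_b_by_need = share_of_group(by_need, "B")

print(f"Group B share when ranking by cost: {share_b_by_cost:.2f}")
print(f"Group B share when ranking by need: {share_b_by_need:.2f}")
```

In this toy data the two groups are equally ill, yet ranking by cost selects far fewer group B patients than ranking by the number of conditions does.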
Decisions in data curation can also cause bias. Health care data is complicated and messy. There's an urge to clean it up and use only the most complete data, or data that has little to no missing information. But people who tend to have missing information in their medical records are those who have less access to care, those who have lower socioeconomic statuses, and those less likely to have insurance. Removing these individuals from the training data set may reduce the performance of the tool for these populations.
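A minimal sketch of how keeping only complete records can shift who is represented in a training set. The records, field names, and values below are hypothetical, and the sketch assumes, as described above, that missing values are more common among patients with less access to care (here, the uninsured).

```python
from collections import Counter

# Hypothetical toy records; None marks a missing value.
records = [
    {"insured": True, "blood_pressure": 120, "a1c": 5.4},
    {"insured": True, "blood_pressure": 135, "a1c": 6.1},
    {"insured": True, "blood_pressure": 128, "a1c": None},
    {"insured": True, "blood_pressure": 118, "a1c": 5.0},
    {"insured": False, "blood_pressure": None, "a1c": 7.2},
    {"insured": False, "blood_pressure": 142, "a1c": None},
    {"insured": False, "blood_pressure": 150, "a1c": 8.0},
    {"insured": False, "blood_pressure": None, "a1c": None},
]


def is_complete(record):
    """True when no field in the record is missing."""
    return all(value is not None for value in record.values())


def uninsured_share(rows):
    """Fraction of rows that belong to uninsured patients."""
    counts = Counter(row["insured"] for row in rows)
    return counts[False] / len(rows)


# "Cleaning" step: keep only records with no missing values.
complete_cases = [r for r in records if is_complete(r)]

share_before = uninsured_share(records)
share_after = uninsured_share(complete_cases)

print(f"Uninsured share before filtering: {share_before:.2f}")
print(f"Uninsured share after filtering:  {share_after:.2f}")
```

Comparing group shares before and after a cleaning step like this is one simple way to notice when curation is removing a population the tool will later be used on.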
What can be done to help mitigate these biases?
AI developers have the responsibility to create teams with diverse expertise and with a deep understanding of the problem being solved, the data being used, and the differences that can occur across various subgroups. Purchasers of these tools also have an enormous responsibility to test them within their own subpopulations and to demand developers use emerging good machine learning practices—standards and practices that help promote safety and effectiveness—in the creation of those products.
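One way a purchaser might begin testing a tool within its own subpopulations is to compute the same performance measure separately for each subgroup rather than only overall. The sketch below uses sensitivity (the share of truly ill patients the tool flags). The subgroup names and the labeled results are hypothetical, and a real evaluation would need far more data and more than one metric.

```python
from collections import defaultdict


def sensitivity_by_group(rows):
    """Compute sensitivity per subgroup.

    rows: iterable of (group, actually_ill, flagged_by_tool) tuples.
    Groups with no truly ill patients are left out of the result.
    """
    true_positives = defaultdict(int)
    actual_positives = defaultdict(int)
    for group, actually_ill, flagged in rows:
        if actually_ill:
            actual_positives[group] += 1
            if flagged:
                true_positives[group] += 1
    return {g: true_positives[g] / actual_positives[g] for g in actual_positives}


# Hypothetical local validation results: (subgroup, actually ill, flagged).
results = [
    ("urban", True, True), ("urban", True, True),
    ("urban", True, True), ("urban", True, False),
    ("urban", False, False), ("urban", False, False),
    ("rural", True, True), ("rural", True, True),
    ("rural", True, False), ("rural", True, False),
    ("rural", False, False), ("rural", False, True),
]

per_group = sensitivity_by_group(results)
overall = sensitivity_by_group(("all", ill, flagged) for _, ill, flagged in results)

print("Overall sensitivity:", overall["all"])
for group, value in sorted(per_group.items()):
    print(f"Sensitivity for {group}: {value}")
```

The overall figure in this toy data hides a gap between the two subgroups, which is the kind of difference subgroup testing is meant to surface.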
FDA has the authority to regulate some AI products, specifically software that is considered a medical device. The agency has a responsibility to ensure those devices perform well across subgroups. It should also require clear and accessible labeling of the products indicating the populations they’re intended for, and it should work to build systems that can monitor for biased performance of medical products.
Finally, people who originate data—people who put data into electronic health records, people who put data into claims, people who build wellness algorithms and other tools that collect consumer data—have a responsibility to make sure that data is as unbiased and equitable as possible, particularly if they plan to share it for AI tool development or for other purposes. That means increasing standardization and common definitions in health data, reducing bias in subjective medical descriptions, and annotating where data may differ across populations.
What can be done if a tool shows biased performance after development?
The best option here is to determine the cause of the bias, go back to development, and retrain the model. The other option is to clarify within the product’s use instructions and in its marketing and training materials that the tool is only intended for use in certain populations. Through these remedies, we can ensure that AI tools aren’t replicating or worsening the current state of health care, but rather helping us create a more just and equitable system.