Artificial Intelligence Can Improve Health Care—but Not Without Human Oversight
Study of sepsis detection software underscores need for guidance on implementation and monitoring
Every year 1.7 million adults in the United States develop sepsis, a severe immune response to infection that kills about 270,000 people. Detecting the disease early can mean the difference between life and death.
One of the largest U.S. developers of electronic health record (EHR) software, Epic Systems, offers a tool called the Epic Early Detection of Sepsis model that uses artificial intelligence (AI)—software that mimics human problem-solving—to help physicians diagnose and treat sepsis sooner. But a recent study published in JAMA Internal Medicine found that the tool performed poorly at identifying sepsis. The results illustrate the current reality for AI health care products in general and highlight the need for close attention to how they function in real-world health care settings.
Although 170 hospitals and care providers have implemented Epic’s sepsis tool since 2017, some experts remain unsure how well the product works. As with many other AI tools, it did not have to undergo Food and Drug Administration (FDA) review before being put into use, and there is no formal system in place to monitor its safety or performance across different sites. That means there is no required central reporting if a patient does not receive appropriate care because of faulty AI.
Researchers from the University of Michigan in Ann Arbor assessed Epic’s sepsis tool’s performance after their institution’s hospital, Michigan Medicine, began using it. In JAMA Internal Medicine, they wrote that such an analysis was needed because “only limited information is publicly available about the model’s performance, and no independent validations ha[d] been published to date.”
Their findings showed that the AI sepsis model identified just 7% of patients with sepsis who had not received timely antibiotic treatment. The tool did not detect the condition in 67% of those who developed it, yet generated alerts on thousands of patients who did not.
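Figures like these are usually expressed as sensitivity (the share of true cases a model flags) and positive predictive value (the share of alerts that turn out to be correct). The minimal sketch below uses made-up counts, not the study’s actual data, to show how those two metrics are computed and why a model can simultaneously miss most cases and still flood clinicians with alerts.

```python
# Hypothetical confusion-matrix counts for an alerting model; these numbers
# are illustrative only and are NOT the counts reported in the Michigan study.
true_positives = 330     # sepsis patients the model flagged
false_negatives = 670    # sepsis patients the model missed
false_positives = 6_000  # alerts fired on patients without sepsis

# Sensitivity (recall): share of sepsis patients the model catches.
sensitivity = true_positives / (true_positives + false_negatives)

# Positive predictive value (PPV): share of alerts that reflect real sepsis.
ppv = true_positives / (true_positives + false_positives)

print(f"Sensitivity: {sensitivity:.0%}")  # ~33%, i.e., ~67% of cases missed
print(f"PPV:         {ppv:.0%}")          # a low PPV drives alert fatigue
```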
Epic, however, criticized the findings in news coverage and in correspondence with Pew, noting that the researchers did not calibrate the model for their specific patient population and data, defined sepsis differently than Epic’s model, and failed to acknowledge that two studies assessing the product’s performance had been published.
“Epic predictive models are developed, validated, and continually improved in collaboration with health systems, data scientists, and clinicians across a variety of institutions and locations,” the company stated. “This process determines if a model can be used effectively at different organizations. Over 2,300 hospitals and 48,000 clinics have access and transparency into Epic’s models and supporting documentation.”
The Michigan study and the response to it reflect a broader challenge with AI software products: How they are implemented within a clinical setting matters just as much as how they are developed. Adapting such tools to new environments can prove difficult when patient populations, staffing, and standards for diagnosing diseases and providing care differ markedly from those on which the products were built.
Before using any AI software, hospital officials must tailor it to their clinical environment and then validate and test the program to ensure it works. Once the product is in use, staff must monitor it on an ongoing basis to ensure safety and accuracy. These processes require significant investment and regular attention; it can take years to fine-tune the program.
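What that local validation might look like in practice is sketched below, under assumptions: a hospital assembles a retrospective cohort pairing the model’s risk scores with chart-confirmed outcomes, then checks discrimination and the planned alert threshold before go-live. The file name, column names, and threshold are placeholders invented for illustration, not part of any vendor’s product.

```python
# Minimal sketch of a local, retrospective validation before deployment.
# Assumes a CSV with one row per encounter containing a model risk score and
# a chart-reviewed outcome label. Names and threshold are hypothetical.
import pandas as pd
from sklearn.metrics import roc_auc_score, precision_score, recall_score

cohort = pd.read_csv("local_validation_cohort.csv")  # hypothetical extract
scores = cohort["model_risk_score"]
labels = cohort["sepsis_confirmed"]                  # 1 = sepsis, 0 = no sepsis

# Discrimination: does the model rank sepsis patients above non-sepsis patients?
print("AUROC:", roc_auc_score(labels, scores))

# Operating point: performance at the alert threshold the site plans to use.
threshold = 0.6                                      # placeholder, tuned locally
alerts = (scores >= threshold).astype(int)
print("Sensitivity at threshold:", recall_score(labels, alerts))
print("PPV at threshold:", precision_score(labels, alerts))
print("Alerts per 1,000 encounters:", 1000 * alerts.mean())
```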
AI systems must be evaluated and monitored routinely, given the tendency for their algorithms—the formulas at the heart of the tools—to be biased in sometimes unexpected ways. For example, in a landmark study published in 2019, scientists found that an AI tool used widely to help hospitals allocate resources dramatically underestimated the health care needs of Black patients. Because its algorithm used health care costs as a proxy for assessing patients’ actual health, the software perpetuated bias against Black people, who tend to spend less on health care—not because of differences in overall health, but because systemic inequities result in less access to health care and treatments.
Revealing potential geographic biases, a 2020 analysis found that, out of 74 studies used to develop image-based diagnostic AI systems, most relied on data from only California, New York, and Massachusetts; 34 states were left out of the studies entirely. If AI is built on data exclusively from largely metropolitan states, it may not perform as well in more rural states, where patients’ lifestyles, incidence of disease, and access to diagnostics and treatments can differ substantially.
Simply transferring AI from one context to another without reviewing for potential population, resource, or systemic differences can introduce bias. For example, an algorithm designed to scan electronic health records and identify lung cancer patients with certain tumor mutations performed well in Washington state but significantly less so in Kentucky, because the records used different terminology to catalog types of cancer. Additionally, AI trained in settings with highly skilled doctors and advanced equipment may make recommendations that are inappropriate for hospitals with fewer resources.
Unfortunately, few independent resources are available to help hospitals and health systems navigate the AI terrain. To help them, professional medical societies could develop guidance for validating and monitoring AI tools related to their specialties. For example, the American College of Radiology’s Data Science Institute has a series of white papers intended to help users decide whether, when, and how to use these products. Standards development organizations, such as the National Institute of Standards and Technology, also can establish benchmarks and other metrics against which AI products can be assessed.
Researchers have also suggested implementing standards and routine methods for postmarket surveillance to ensure systems’ effectiveness and equity, similar to how drugs are monitored once they are on the market. This is important for adaptive algorithms, which continue to learn from new data, as well as for nonadaptive ones. The latter are vulnerable to concept drift, in which changes in patient populations, clinical practice, or data collection cause a fixed algorithm’s performance to degrade over time.
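One way such surveillance could be operationalized is sketched below, again under assumptions: a running log of timestamps, model scores, and adjudicated outcomes (column names and the performance floor are invented here), with performance recomputed over rolling time windows so that gradual degradation gets flagged for review.

```python
# Hedged sketch of postmarket performance monitoring for a deployed model.
# Assumes a log with a timestamp, the model's risk score, and the eventual
# adjudicated outcome; names and thresholds are illustrative, not a standard.
import pandas as pd
from sklearn.metrics import roc_auc_score

log = pd.read_csv("prediction_log.csv", parse_dates=["timestamp"])

AUROC_FLOOR = 0.70  # placeholder performance floor agreed on locally

# Recompute discrimination month by month to surface gradual degradation.
for month, window in log.groupby(log["timestamp"].dt.to_period("M")):
    if window["outcome"].nunique() < 2:
        continue  # AUROC is undefined when a window contains only one class
    auroc = roc_auc_score(window["outcome"], window["risk_score"])
    flag = "  <-- review for drift" if auroc < AUROC_FLOOR else ""
    print(f"{month}: AUROC={auroc:.2f}{flag}")
```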
With AI still so new to health care, there are many more questions than answers: Without a uniform gold standard that is consistent from hospital to hospital, how should health care providers calibrate and validate AI to reflect their patients’ specific needs? How should they monitor AI products in use and where should they report problems, including adverse events? What standards should AI developers use to validate their own products, especially those that FDA does not review or approve, and how can they assure users that their algorithms are accurate and free of bias? What else can regulators—predominantly FDA and the Federal Trade Commission—do to ensure these products are safe and effective?
As stakeholders wrestle with these questions, it’s critical for health care providers to recognize not just the unique value that AI tools can provide, but also the unique challenges of implementing them.
Liz Richardson leads The Pew Charitable Trusts’ health care products project.