https://immattersacp.org/weekly/archives/2023/10/10/4.htm

AI predictive models shown to be potentially unreliable over time in clinical settings

A simulation study based on ICU data found that using artificial intelligence (AI) to try to improve predictions of acute kidney injury and mortality could actually impair existing models' performance.


Artificial intelligence (AI) predictive models that use electronic health record (EHR) data in the ICU may degrade their own performance and that of other models over time, a simulation study found.

Because many predictive models use information about patient populations and practice patterns to inform their predictions, changes in any of these aspects of care, a phenomenon known as “data set shift,” can degrade performance, researchers said. Data set shift altered sepsis alert patterns at hospitals during the first wave of COVID-19, leading one hospital to temporarily disable its sepsis alerts to cope with the changes, they noted.

The researchers chose a common scenario of AI implementation, predicting the risk for death or acute kidney injury (AKI) in the first five days after an ICU admission, and simulated associated changes in model performance using data from 130,000 critical care admissions. In scenario 1, they simulated the implementation and retraining of a mortality prediction model; in scenario 2, the implementation of an AKI prediction model followed by the creation of a new mortality prediction model; and in scenario 3, the simultaneous implementation of both an AKI and a mortality prediction model. The study was published Oct. 3 in Annals of Internal Medicine.

The authors found that the model in scenario 1 lost 9% to 39% of its specificity after one retraining. The mortality model in scenario 2 lost 8% to 15% of its specificity after the AKI model had been in use. In scenario 3, the models each reduced the effective accuracy of the other by 1% to 28%. In each scenario, models trained on data collected after a prediction model was already in use performed worse than the original model.
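To build intuition for the feedback loop behind these numbers, the following toy simulation is a minimal sketch loosely modeled on scenario 1, not the study's code; the single severity feature, treatment effect, and 90% sensitivity target are all hypothetical choices. A risk threshold is fit on pre-deployment data, clinicians then intervene on flagged patients, and a threshold refit on that post-deployment data must drop to keep catching 90% of the remaining events, so it flags more patients and loses specificity.

import numpy as np

rng = np.random.default_rng(42)
N = 200_000

def outcomes(x, flagged=None):
    # Binary outcomes driven by one severity-like feature x; if a deployed model
    # has flagged patients, assume clinicians intervene and lower their risk.
    logit = -2.0 + 1.5 * x
    if flagged is not None:
        logit = np.where(flagged, logit - 2.0, logit)  # hypothetical treatment effect
    return rng.random(x.shape[0]) < 1.0 / (1.0 + np.exp(-logit))

def fit_cutoff(x, y, sensitivity=0.90):
    # Stand-in for a trained risk model thresholded to a target sensitivity:
    # flag anyone whose severity exceeds the 10th percentile of observed events.
    return np.quantile(x[y], 1.0 - sensitivity)

def specificity(x, y, cutoff):
    # Share of non-events correctly left unflagged.
    return np.mean(x[~y] <= cutoff)

# 1) Train the original model on pre-deployment data.
x_pre = rng.normal(size=N)
cut_original = fit_cutoff(x_pre, outcomes(x_pre))

# 2) Deploy it: flagged patients are treated, so the retraining data no longer
#    reflect untreated risk (data set shift induced by the model itself).
x_post = rng.normal(size=N)
y_post = outcomes(x_post, flagged=x_post > cut_original)
cut_retrained = fit_cutoff(x_post, y_post)

# 3) Compare both thresholds on the same held-out post-deployment cohort:
#    the retrained cutoff is lower, so it flags more patients and loses specificity.
x_test = rng.normal(size=N)
y_test = outcomes(x_test, flagged=x_test > cut_original)
print(f"original  model specificity: {specificity(x_test, y_test, cut_original):.3f}")
print(f"retrained model specificity: {specificity(x_test, y_test, cut_retrained):.3f}")

The sketch reduces the "model" to a single threshold so the mechanism is visible without machine learning machinery; a more complex model trained on the same contaminated labels would face the same problem.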

Based on the results, model developers should simulate each model's updating strategy at each site where a model is to be implemented, the study authors said. They also recommended measures to track how and when predictions influence clinical decision making, because most suggested mitigation strategies rely on this information being available. They noted that EHR data collected after predictive models begin shaping care may be unsuitable for training new models.

An accompanying editorial observed that the drift seen in the studied models also appears in AI used in other contexts, including popular large language models (LLMs) such as ChatGPT. These models can collapse when recursively trained on their own output, the editorialists noted.

“The growing popularity of more complex deep-learning approaches and LLMs will make it increasingly difficult to monitor models and evaluate why they fail. In a world drawn to the shiniest new tools, competing for limited resources to monitor and respond to drift in deployed models should remain a priority,” they wrote.