- A deep learning model trained to identify pneumonia on a contained sample of medical images was unable to achieve the same level of accuracy when let loose on data from external healthcare systems, according to a new study in PLOS Medicine.
Researchers at the Icahn School of Medicine at Mount Sinai trained a convolutional neural network (CNN), a form of deep learning, to flag evidence of pneumonia in chest x-rays.
In three out of five comparison tests using data from the National Institutes of Health (NIH), Mount Sinai Hospital (MSH), and Indiana University Hospital (IUH), the CNN exhibited significantly lower performance when analyzing data from institutions outside of its own network than it did when examining images from its home health system.
The findings indicate that artificial intelligence developers cannot assume that success in limited, controlled training environments will translate into reliable performance once their models are exposed to data in the wild.
“Our findings should give pause to those considering rapid deployment of artificial intelligence platforms without rigorously assessing their performance in real-world clinical settings reflective of where they are being deployed,” said senior author Eric Oermann, MD, Instructor in Neurosurgery at the Icahn School of Medicine at Mount Sinai.
“Deep learning models trained to perform medical diagnosis can generalize well, but this cannot be taken for granted since patient populations and imaging techniques differ significantly across institutions.”
The research team used 158,000 frontal and lateral chest x-rays from across the three institutions to support the study.
Pneumonia is a high-value use case, the authors said, due to its clinical significance, common occurrence, and the time-saving potential of using deep learning to automate first-layer triage for radiologists.
Using an existing deep learning model as a framework, the team trained the CNN to predict the presence of nine different diagnoses (cardiomegaly, emphysema, effusion, hernia, nodule, atelectasis, pneumonia, edema, and consolidation) based on combined data from the three institutions.
“We were interested only in the prediction of pneumonia and included other diagnoses to improve overall model training and performance,” the team explained.
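The setup the team describes is a standard multi-label arrangement: one network emits an independent probability for each of the nine diagnoses, and training sums a binary cross-entropy loss over the labels. The sketch below is illustrative only; the tiny logit list stands in for the CNN's real outputs, and the numbers are invented.

```python
# Illustrative sketch of a multi-label diagnosis head (assumed setup,
# not the study's actual code): one independent sigmoid probability per
# diagnosis, trained with summed binary cross-entropy.
import math

DIAGNOSES = ["cardiomegaly", "emphysema", "effusion", "hernia", "nodule",
             "atelectasis", "pneumonia", "edema", "consolidation"]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(logits):
    """Map one logit per diagnosis to independent probabilities."""
    return {d: sigmoid(z) for d, z in zip(DIAGNOSES, logits)}

def multilabel_bce(probs, targets):
    """Sum of per-label binary cross-entropies over the nine diagnoses."""
    return -sum(t * math.log(probs[d]) + (1 - t) * math.log(1 - probs[d])
                for d, t in targets.items())

# Invented logits for one x-ray; only the pneumonia output would be
# read off at deployment, per the authors.
probs = predict([0.2, -1.3, 0.5, -2.0, -0.7, 0.1, 1.1, -0.4, 0.3])
targets = {d: 0 for d in DIAGNOSES}
targets["pneumonia"] = 1
print(round(probs["pneumonia"], 3), round(multilabel_bce(probs, targets), 3))
```

Because each label gets its own sigmoid rather than a shared softmax, one image can carry several findings at once, which matches how chest x-ray labels co-occur.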
The model was also tasked with differentiating the origin site of each x-ray, as well as the specific department that created the image. The CNN was able to distinguish between radiographs produced in the inpatient wards and emergency departments of each facility with 99 percent accuracy for NIH and MSH, and 95 percent accuracy for IUH.
The IUH data required significant manual curation at the beginning of the project, the team noted, due to inconsistencies with labeling and other data integrity issues.
The tool did not do quite as well when making predictions on data from external sources. The CNN produced an area under the curve (AUC) of 0.802 on an internal test of data from Mount Sinai. When performing the same task on data from NIH, the AUC was “significantly worse” at 0.717, the study stated, and only slightly better at 0.756 for data from Indiana University Hospital.
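Since the study's headline numbers are AUCs, a quick sketch of the metric may help: AUC is the probability that a randomly chosen positive case receives a higher model score than a randomly chosen negative case, so 0.5 is chance and 1.0 is perfect ranking. The toy labels and scores below are illustrative, not from the study.

```python
# Minimal AUC computation by pairwise comparison (a sketch of the
# metric reported in the study; data below is invented).

def auc(labels, scores):
    """P(random positive outscores random negative); ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: model scores for 4 pneumonia-positive and 4 negative x-rays.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1]
print(auc(labels, scores))  # prints 0.875
```

On this reading, the drop from 0.802 internally to 0.717 at NIH means the external model was noticeably worse at ranking pneumonia cases above non-cases.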
One version of the model, which was trained on joint data from Mount Sinai and the NIH, performed better when assessing those datasets internally than the version trained only on one institution’s data.
Interestingly, the ability to pinpoint the origin of the image helped the algorithm “cheat” its way to more accurate assessments, the study noted. The algorithm used the average rates of pneumonia at each institution to gauge whether or not the image was likely to contain evidence of the condition.
However, that meant that the algorithm was not working purely on the clinical characteristics of each image, which slightly alters the utility of the tool for clinical decision support or computer-assisted diagnosis.
“By engineering cohorts of varying prevalence, we demonstrated that the more predictive a hospital system was of pneumonia, the more it was exploited to make predictions, which led to poor generalization on external datasets,” the authors wrote. “We noted that metallic tokens indicating laterality often appeared in radiographs in a site-specific way, which made site identification trivial.”
“However, CNNs did not require this indicator: most image subregions contained features indicative of a radiograph’s origin.”
The images were produced on hardware from different manufacturers and stored in different formats.
If the emergency department used one brand of imaging machine, but the inpatient radiology department used another, the model could easily detect those fundamental differences and correlate them to the known overall higher prevalence of pneumonia in the inpatient setting instead of using clinical factors to make a determination.
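The shortcut can be illustrated with simulated data (this is a hedged toy model, not the study's experiment): a "classifier" that scores every x-ray by the pneumonia prevalence of its detected origin site beats chance on a pooled internal test, yet carries no signal at a new site where that fingerprint is unrecognized.

```python
# Simulated demonstration of the prevalence-exploitation confound the
# authors describe. All sites, prevalences, and counts are invented.
import random

random.seed(0)

def make_cases(site, prevalence, n):
    """Simulated (site, has_pneumonia) pairs for one hospital."""
    return [(site, 1 if random.random() < prevalence else 0)
            for _ in range(n)]

def auc(labels, scores):
    """P(random positive outscores random negative); ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Internal test pools a high-prevalence inpatient site with a
# low-prevalence ED site; the "model" outputs only the site prevalence,
# ignoring image content entirely.
internal = make_cases("inpatient", 0.40, 500) + make_cases("ed", 0.10, 500)
site_prevalence = {"inpatient": 0.40, "ed": 0.10}
internal_auc = auc([y for _, y in internal],
                   [site_prevalence[s] for s, _ in internal])

# At an external site the hardware fingerprint is unrecognized, so the
# model emits one constant guess and discrimination collapses to chance.
external = make_cases("new_site", 0.25, 400)
external_auc = auc([y for _, y in external], [0.25] * len(external))

print(round(internal_auc, 2), external_auc)  # external AUC is exactly 0.5
```

The internal AUC lands well above 0.5 without the model ever looking at pathology, which is exactly why an externally tested model can crater even when its internal numbers look respectable.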
“These results suggest that CNNs could rely on subtle differences in acquisition protocol, image processing, or distribution pipeline (e.g., image compression) and overlook pathology,” said the team.
Using the model in this manner would likely decrease the accuracy of the algorithm when exposed to datasets from additional care sites, the authors suggested.
“Even the development of customized deep learning models that are trained, tuned, and tested with the intent of deploying at a single site is not necessarily a solution that can control for potential confounding variables,” the team said.
More work will be required to identify and control for the unknown number of variables that such models may encounter, said first author John Zech, a medical student at the Icahn School of Medicine at Mount Sinai.
“If CNN systems are to be used for medical diagnosis, they must be tailored to carefully consider clinical questions, tested for a variety of real-world scenarios, and carefully assessed to determine how they impact accurate diagnosis.”