EHR Data Boosts Machine Learning Algorithms for Chronic Disease

By Erin McNemar, MPA

July 15, 2021 - Using machine learning algorithms, researchers examined whether a large-scale lung cancer cohort built from electronic health record (EHR) data could be effective for studying patient prognosis and estimating survival. The cohort study was recently published in JAMA.

Across the world, lung cancer is one of the most commonly diagnosed cancers and the leading cause of cancer-related deaths. In the United States, the current five-year survival rate is around 20.6 percent. However, outcomes for patients with lung cancer vary based on a range of clinical factors.

“A large cohort with adequate clinical information is necessary to identify stable and reliable prognostic variables and the factors associated with improved survival outcomes,” the authors wrote in the study.

As EHR data becomes more accessible, it gives researchers a timely and low-cost alternative to the traditional cohort study. Because EHR data is coded in various ways, implementing machine learning algorithms was an important step in allowing researchers to compare information accurately.

“Our primary goal was to build a large and reliable lung cancer EHR cohort that could be used for studying lung cancer progression with a set of generalizable approaches. To this end, we combined structured data and unstructured data to identify patients with lung cancer and extract clinical variables. We evaluated the completeness and accuracy of the extracted data,” the authors wrote.

“To further illustrate the application of EHR cohort data, we developed and validated a prognostic model to predict 1-year to 5-year overall survival (OS) among individuals with non–small cell lung cancer (NSCLC),” the study authors continued.

In the cohort study, patients with lung cancer were identified from 76,643 individuals with at least one lung cancer diagnostic code deposited in an EHR in the Mass General Brigham health care system from July 1988 to October 2018.

A machine learning algorithm identified patients and extracted clinical information from structured and unstructured data, using natural language processing tools for the unstructured portion. Researchers then examined the completeness and accuracy of the extracted data by comparing it with the Boston Lung Cancer Study and with standard EHR review results.
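To make the idea of combining structured and unstructured data concrete, here is a minimal sketch of a rule-based phenotyping step. It is not the study's algorithm: the `PatientRecord` type, the keyword patterns, the negation list, and the `min_codes` threshold are all hypothetical, and simple regular expressions stand in for real natural language processing tools. The code prefixes (ICD-10 C34 and ICD-9 162, both covering malignant neoplasms of the bronchus and lung) are used for illustration only.

```python
import re
from dataclasses import dataclass, field
from typing import List

# Illustrative diagnostic code prefixes (assumption, not taken from the study).
LUNG_CANCER_CODE_PREFIXES = ("C34", "162")

# Hypothetical keyword patterns standing in for a real NLP pipeline.
MENTION = re.compile(r"\b(lung (cancer|carcinoma|adenocarcinoma)|nsclc)\b", re.IGNORECASE)
NEGATION = re.compile(r"\b(no evidence of|negative for|rule out|denies)\b", re.IGNORECASE)


@dataclass
class PatientRecord:
    """Hypothetical container for one patient's structured and free-text data."""
    patient_id: str
    diagnosis_codes: List[str] = field(default_factory=list)  # structured data
    notes: List[str] = field(default_factory=list)            # unstructured data


def count_lung_cancer_codes(record: PatientRecord) -> int:
    """Count structured diagnosis codes that start with a lung cancer prefix."""
    return sum(code.startswith(LUNG_CANCER_CODE_PREFIXES) for code in record.diagnosis_codes)


def count_affirmed_mentions(record: PatientRecord) -> int:
    """Count sentences that mention lung cancer without a negation cue."""
    count = 0
    for note in record.notes:
        for sentence in re.split(r"[.\n]+", note):
            if MENTION.search(sentence) and not NEGATION.search(sentence):
                count += 1
    return count


def is_probable_lung_cancer(record: PatientRecord, min_codes: int = 2) -> bool:
    """Flag a patient when codes alone, or one code plus note text, agree."""
    n_codes = count_lung_cancer_codes(record)
    n_mentions = count_affirmed_mentions(record)
    return n_codes >= min_codes or (n_codes >= 1 and n_mentions >= 1)


if __name__ == "__main__":
    confirmed = PatientRecord("p1", ["C34.1", "J18.9"], ["Biopsy confirms lung adenocarcinoma."])
    ruled_out = PatientRecord("p2", ["C34.9"], ["No evidence of lung cancer on CT."])
    print(is_probable_lung_cancer(confirmed), is_probable_lung_cancer(ruled_out))
```

The point of the sketch is that a single diagnostic code is weak evidence on its own, which is why a patient with one code and a negated note is not flagged, while a code backed by an affirmed mention in the notes is.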

Additionally, a prognostic model for non-small cell lung cancer (NSCLC) overall survival was created for clinical application.
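The article does not describe how the prognostic model works, so the following is only a sketch of the simplest way to estimate overall survival at fixed time points from cohort follow-up data: a Kaplan-Meier estimator. It is not the authors' model, and the follow-up times and event flags in the demo are made up for illustration.

```python
from typing import List, Sequence, Tuple


def kaplan_meier(durations: Sequence[float], events: Sequence[int]) -> List[Tuple[float, float]]:
    """Return (time, survival probability) pairs at each distinct death time.

    durations: follow-up time per patient (years in this sketch).
    events: 1 if the patient died at that time, 0 if follow-up was censored.
    """
    data = sorted(zip(durations, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all patients whose follow-up ends at the same time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


def survival_at(curve: List[Tuple[float, float]], t: float) -> float:
    """Read the step function: the survival estimate in effect at time t."""
    estimate = 1.0
    for time, prob in curve:
        if time > t:
            break
        estimate = prob
    return estimate


if __name__ == "__main__":
    # Made-up follow-up data for eight hypothetical patients.
    years = [0.5, 1.2, 2.0, 2.0, 3.5, 4.0, 5.5, 6.0]
    died = [1, 1, 1, 0, 1, 0, 1, 0]
    curve = kaplan_meier(years, died)
    for horizon in range(1, 6):
        print(f"{horizon}-year OS estimate: {survival_at(curve, horizon):.3f}")
```

A prognostic model of the kind the authors describe goes further by adjusting for clinical variables, but the quantity being predicted, the probability of surviving past one to five years, is the same one this estimator reads off the curve.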

Of the 76,643 patients with at least one lung cancer diagnostic code, 42,069 were identified as having lung cancer. The algorithm achieved a positive predictive value of 94.4 percent. The study cohort was made up of 35,375 patients after excluding those with a history of lung cancer or fewer than 14 days of follow-up after the initial diagnosis.
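For readers less familiar with the metric, positive predictive value is the share of patients flagged by the algorithm who truly have the disease. The sketch below shows the calculation along with the cohort arithmetic implied by the figures reported above. The chart-review counts passed to `positive_predictive_value` are hypothetical, chosen only to reproduce a 94.4 percent result, since the article does not report the underlying validation counts.

```python
def positive_predictive_value(true_positives: int, false_positives: int) -> float:
    """PPV = TP / (TP + FP): the fraction of flagged patients who are true cases."""
    flagged = true_positives + false_positives
    if flagged == 0:
        raise ValueError("PPV is undefined when no patients are flagged")
    return true_positives / flagged


# Figures reported in the article.
PATIENTS_WITH_CODE = 76_643   # at least one lung cancer diagnostic code
IDENTIFIED_CASES = 42_069     # identified by the algorithm as having lung cancer
STUDY_COHORT = 35_375         # remaining after exclusions

# Quantities derived from those figures.
excluded = IDENTIFIED_CASES - STUDY_COHORT
share_identified = IDENTIFIED_CASES / PATIENTS_WITH_CODE
share_retained = STUDY_COHORT / IDENTIFIED_CASES

if __name__ == "__main__":
    # Hypothetical validation sample: 500 flagged charts, 472 confirmed on review.
    print(f"PPV: {positive_predictive_value(472, 28):.1%}")
    print(f"Excluded after identification: {excluded:,}")
    print(f"Share of coded patients identified as cases: {share_identified:.1%}")
    print(f"Share of identified cases kept in the cohort: {share_retained:.1%}")
```

Read this way, the reported numbers say that roughly 55 percent of patients carrying a lung cancer code were identified as cases, and that 6,694 of those cases were dropped by the exclusion criteria.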

“We assembled a large lung cancer cohort from EHRs using a phenotyping algorithm and extraction strategies combining structured and unstructured data. Our findings suggest that a prognostic model based on EHR cohort may be used conveniently to facilitate prediction of NSCLC survival,” the authors concluded.
