Improving Disease Prediction with Big Data Analytics

Big Data holds great promise to change healthcare for the better. But its potential will not be reached until healthcare providers improve the efficiency with which data is shared and the accuracy with which it is interpreted.

The Second IEEE/ACM Conference on Connected Health: Applications, Systems and Engineering Technologies will bring experts from academia, business and government together to share information and help accelerate health care’s transformation. The international conference takes place in Philadelphia this week, July 17-19.

Mooi Choo Chuah, professor of computer science and engineering, is serving as the conference’s technical co-chair, along with Insup Lee of the University of Pennsylvania. Chuah conducts research in next-generation wireless network architecture design, network and Smart Grid security, and mobile and cloud computing. She has recently begun to investigate healthcare data mining.

Chuah, the co-director of the technical program committee planning the conference’s content, will present a paper on Tuesday, July 18, titled “Incentivizing High Quality Crowdsourcing Clinical Data for Disease Prediction.”

Her group’s recent research offers two contributions, says Chuah. The first, an approach she developed with her graduate student, Qinghan Xue, uses a large dataset to demonstrate an improved disease prediction model that combines data cleaning and careful feature selection with effective machine learning techniques.

Chuah utilized a dataset made public by Prize4Life, which helped develop the Pooled Resource Open-Access ALS Clinical Trials (PRO-ACT) database, the largest database of clinical data from Amyotrophic Lateral Sclerosis (ALS) patients ever created. In 2012, Prize4Life held a crowdsourced competition to develop a method that accurately predicts ALS disease outcomes based on PRO-ACT’s dataset.

ALS is a progressive degenerative nerve disease also known as Lou Gehrig’s Disease. Teams competing in the Prize4Life contest sought to predict whether each patient’s disease would progress slowly, at an average pace or rapidly. Prize4Life also asked researchers to predict how long ALS patients would survive from the date of diagnosis.

Like the teams in the Prize4Life competition, Chuah used the PRO-ACT database (which contains more than 10,700 records with 6,318 features) to predict which patients would fall into the three clusters of progression: slow, average or fast.

The challenge, says Chuah, was that the dataset was “very noisy.”

“For example, some data were missing,” says Chuah. “Some data were non-numeric—and, as you know, computers like numeric values.”

Chuah’s model cleaned up the data and improved the accuracy rate in predicting the rate of patients’ ALS progression. Her method outperformed the winning team’s, reaching 58.3 percent accuracy compared with 40.5 percent, while requiring fewer features and relying on higher-quality data.
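
To make the two kinds of noise Chuah mentions concrete, here is a minimal Python sketch of one common way to handle them: filling missing numeric values with the column median and mapping non-numeric values to integer codes. It is an illustration only, not the method used in the paper. The function name `clean_records` and the field names `age` and `onset_site` are hypothetical and are not taken from the PRO-ACT schema.

```python
from statistics import median


def clean_records(records, numeric_fields, categorical_fields):
    """Return cleaned copies of records (illustrative, not the paper's method)."""
    # Median of the observed (non-missing) values for each numeric field.
    medians = {}
    for field in numeric_fields:
        observed = [r[field] for r in records if r.get(field) is not None]
        medians[field] = median(observed) if observed else 0.0

    # One integer code per distinct category, sorted so codes are deterministic.
    codes = {}
    for field in categorical_fields:
        values = sorted({r[field] for r in records if r.get(field) is not None})
        codes[field] = {value: i for i, value in enumerate(values)}

    cleaned = []
    for r in records:
        row = {}
        for field in numeric_fields:
            value = r.get(field)
            # Missing numeric values are replaced by the field's median.
            row[field] = medians[field] if value is None else float(value)
        for field in categorical_fields:
            # Missing categories get the sentinel code -1.
            row[field] = codes[field].get(r.get(field), -1)
        cleaned.append(row)
    return cleaned


# Hypothetical toy records; field names are assumptions for illustration.
records = [
    {"age": 61, "onset_site": "limb"},
    {"age": None, "onset_site": "bulbar"},
    {"age": 55, "onset_site": None},
]
cleaned = clean_records(records, ["age"], ["onset_site"])
```

After this step every field is numeric, so the rows can be passed to a feature selection step and a classifier. The article does not say which imputation or encoding choices Chuah's group made.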

“We were able to predict where a patient would fall on the disease progression spectrum faster and with more accuracy,” says Chuah. “This has implications for improved health outcomes and also for cost-saving—as a physician might see a patient with a faster-progressing disease more frequently, but less frequently for slow-progressing patients.”

The paper’s second contribution presents a solution to one of the major challenges of healthcare: the fact that no single hospital or health care system has enough of its own data for useful predictive disease analysis.

“Hospitals and other health care systems collect troves of data,” says Chuah. “However, each has a limited number of patients experiencing a particular disease—such as ALS or diabetes, for example. We have designed an incentive method to encourage hospitals to share data so that better prediction models can be created.”

The algorithm that Chuah and her team developed is designed to provide a “reward function” for each health care provider, identifying the cost per patient to participate in a crowdsourced database. An individual hospital could use the incentive model to evaluate whether to participate. The model provides a “reward” for offering truthful, high-quality data.
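
The article does not give the form of the reward function, so the following Python sketch is only a toy stand-in that shows how a hospital might weigh a quality-dependent reward against a per-patient cost. Every name and formula in it (`expected_net_reward`, `should_participate`, the linear payment rule, the quality score between 0 and 1) is an assumption made for illustration and should not be read as the paper's algorithm.

```python
def expected_net_reward(n_patients, quality, payment_per_record, cost_per_patient):
    """Toy net reward: quality-scaled payment minus participation cost.

    ASSUMPTION: payment grows linearly with a quality score in [0, 1];
    the actual reward function in the paper is not described in the article.
    """
    if not 0.0 <= quality <= 1.0:
        raise ValueError("quality must be between 0 and 1")
    reward = payment_per_record * quality * n_patients
    cost = cost_per_patient * n_patients
    return reward - cost


def should_participate(n_patients, quality, payment_per_record, cost_per_patient):
    """A hospital joins only if its toy net reward is positive."""
    return expected_net_reward(
        n_patients, quality, payment_per_record, cost_per_patient
    ) > 0


# Hypothetical numbers: 100 patients, 6.0 cost per patient, 10.0 paid per record.
high_quality = should_participate(100, 0.9, 10.0, 6.0)
low_quality = should_participate(100, 0.5, 10.0, 6.0)
```

In this toy setup the same hospital comes out ahead when it contributes high-quality data and loses money when its data quality is low, which is the general kind of incentive the article describes.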

Chuah believes that both elements of her latest research could improve the accuracy and usefulness of predictive disease models and, most importantly, patient health outcomes.

“In my work,” she says, “I’m always looking to solve problems that I know will have some kind of positive social impact.”

Story by Lori Friedman