Finding and Treating Sufferers of Rare Diseases: Big Data Techniques at Work

In this special guest feature, Dr. Ritu Chadha, IEEE Senior Member and Executive Director at Vencore Labs, explains one commercial application of machine learning and information theoretics, among other techniques, to big data to reduce human suffering. Vencore Labs is a wholly owned subsidiary of Vencore Inc. Its health analytics team has made great strides in bringing a practical solution to the pharmaceutical industry, based on the groundbreaking research and analytical techniques developed by Ritu and the team at Vencore Labs. Vencore will be launching a new health analytics venture late in 2016 to bring this solution to market. These topic areas, and others, will be explored at IEEE Future Directions’ flagship Technology Time Machine conference in San Diego, CA on 20-21 October 2016.

Big data and analytical techniques have long been applied to healthcare challenges. Let’s look at a specific use case: improving the diagnosis of people with rare diseases so that they get proper treatment. This example will illustrate the commercial application of machine learning and information theoretics, among other techniques, to big data to reduce human suffering.

The incidence of hereditary angioedema, or HAE, is not known with precision but is estimated at approximately 1 in 10,000–50,000 people in North America. As the term “edema” indicates, symptoms include swelling of different parts of the body, which often leads to an erroneous diagnosis of allergies.

HAE drives an estimated 15,000–30,000 emergency room visits per year and eventually results in a 15 percent to 33 percent mortality rate. Much of this suffering could be alleviated and undoubtedly some premature deaths prevented with proper diagnosis and treatment.

Two hurdles need to be addressed: raising awareness among physicians that common symptoms such as swelling may appear to be allergic reactions but can mask more serious, rare disorders, and ensuring that the so-called “orphan” drugs that treat rare disorders reach the sufferers who need them.

The nature of “orphan” drugs helps illustrate the challenge here. Typically, there is no profit motive to drive development and production of drugs to treat rare diseases, as the markets are extremely small. So in the United States, for instance, the U.S. Food and Drug Administration (FDA) offers incentives to pharmaceutical companies to develop and produce them.

A problem remains: finding the patients who need these drugs. That’s the role my company plays. We bring big data, machine learning and information theoretics into play to create a solution for our clients in the pharmaceutical industry.

An analytical approach

To identify possible sufferers of rare diseases such as HAE, we turn to anonymized insurance claims data. (Of course, both ethics and HIPAA – the Health Insurance Portability and Accountability Act of 1996 – require anonymization.) For North America alone, as you can imagine, this means extremely large volumes of data.

We use big data techniques and technologies to analyze this anonymized data in a reasonable timeframe. We also use information-theoretic techniques to distill this information into features that are relevant and informative. Then we apply “supervised” machine learning to build models that classify a patient as either having the disease or not.

When I use the term “supervised machine learning,” I mean that the data points that inform the model are labeled as either positive or negative. The “supervision” comes from those labels: in our case, a diagnosis code that a doctor supplied.
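To make the idea of labeling concrete, here is a minimal sketch of turning claim records into positive and negative labels. It is only an illustration, not the team's actual pipeline: the record layout, the patient tokens, and every code value (including `TARGET_DIAGNOSIS_CODE`) are hypothetical placeholders.

```python
from collections import defaultdict

# Hypothetical placeholder; a real system would use the actual diagnosis code(s).
TARGET_DIAGNOSIS_CODE = "DX-TARGET"

# Toy anonymized claims as (patient_token, diagnosis_code). All values are made up.
claims = [
    ("p1", "DX-ALLERGY"),
    ("p1", "DX-TARGET"),
    ("p2", "DX-ALLERGY"),
    ("p3", "DX-FLU"),
    ("p3", "DX-SWELLING"),
]


def label_patients(claims, target_code):
    """Return {patient_token: label}, where label is 1 if the target
    diagnosis code appears in any of the patient's claims, else 0."""
    codes_by_patient = defaultdict(set)
    for patient, code in claims:
        codes_by_patient[patient].add(code)
    return {
        patient: int(target_code in codes)
        for patient, codes in codes_by_patient.items()
    }


labels = label_patients(claims, TARGET_DIAGNOSIS_CODE)
print(labels)  # {'p1': 1, 'p2': 0, 'p3': 0}
```

In this toy data only patient `p1` carries the target code, so `p1` becomes the single positive example and the others become negatives.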

(“Unlabeled” data lacks any indication of which data points are positive and which are negative. Learning from such data is called “unsupervised,” and it is more difficult to work with.)

Enter: machine learning

The initial sifting process involves ignoring most of the information in each claim record – there’s a lot of completely irrelevant data. We need to focus only on relevant information. For example, if we found that everyone who has been correctly diagnosed with HAE has had the flu at some point in their life, that’s a correlation. But, unfortunately, nearly everyone has had the flu at some point, so that fact does nothing to distinguish HAE patients from anyone else. We must find and characterize the information that is relevant to a correct diagnosis.

So we have to look at what differentiates correctly diagnosed patients from the rest of the population. Information-theoretic techniques are applied to find the right features.

Information theory sifts through the large number of features which characterize every data point. The technique we use is called mutual information, which is a way of analyzing what we call the “information-theoretic content” of data. This yields a quantification of how relevant a certain piece of information is to reaching a specific conclusion.
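As a rough illustration of how mutual information quantifies relevance, the sketch below computes I(X;Y), in bits, between a binary feature and a binary diagnosis label using the standard definition. The toy data and the feature names (`had_flu`, `recurrent_swelling`) are invented for this example and are not drawn from any real claims.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information I(X;Y) in bits between two equal-length
    sequences of discrete values, estimated from observed frequencies."""
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in count_xy.items():
        # p(x,y) * log2( p(x,y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (count_x[x] * count_y[y]))
    return mi


# Toy data for 8 hypothetical patients (1 = yes, 0 = no).
diagnosed          = [1, 1, 1, 1, 0, 0, 0, 0]
had_flu            = [1, 1, 1, 1, 1, 1, 1, 1]  # true for everyone
recurrent_swelling = [1, 1, 1, 0, 0, 0, 0, 1]  # more common among the diagnosed

print(mutual_information(had_flu, diagnosed))             # 0.0
print(round(mutual_information(recurrent_swelling, diagnosed), 3))
```

A feature that is true for everyone, like the flu example, carries zero bits of information about the diagnosis, while a feature that is unevenly distributed between the two groups scores above zero.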

The features with the highest information content are then fed into a machine-learning algorithm, which informs a model that classifies a patient as either having the disease or not having the disease. Of course, the answer cannot be 100 percent accurate, so we assign a level of certainty to it.
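The sketch below strings these steps together on toy data: rank features by mutual information, keep the top few, and train a classifier that returns a probability rather than a bare yes or no. The article does not say which learning algorithm is used; a Bernoulli naive Bayes model is swapped in here purely because it is short, and all data and column meanings are hypothetical.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information I(X;Y) in bits between two discrete sequences."""
    n = len(xs)
    cx, cy, cxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log2(c * n / (cx[x] * cy[y]))
        for (x, y), c in cxy.items()
    )


def top_k_features(rows, labels, k):
    """Indices of the k feature columns with the highest mutual information."""
    scores = [
        mutual_information([row[j] for row in rows], labels)
        for j in range(len(rows[0]))
    ]
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]


def train_naive_bayes(rows, labels, features):
    """Bernoulli naive Bayes with add-one smoothing on the chosen features.
    Assumes both classes (0 and 1) appear at least once in `labels`."""
    model = {}
    for cls in (0, 1):
        cls_rows = [r for r, y in zip(rows, labels) if y == cls]
        prior = len(cls_rows) / len(rows)
        p_feature = {
            j: (sum(r[j] for r in cls_rows) + 1) / (len(cls_rows) + 2)
            for j in features
        }
        model[cls] = (prior, p_feature)
    return model


def probability_positive(model, row):
    """Estimated probability that `row` belongs to the positive class."""
    log_score = {}
    for cls, (prior, p_feature) in model.items():
        score = math.log(prior)
        for j, p in p_feature.items():
            score += math.log(p if row[j] else 1 - p)
        log_score[cls] = score
    top = max(log_score.values())
    weights = {cls: math.exp(s - top) for cls, s in log_score.items()}
    return weights[1] / (weights[0] + weights[1])


# Toy data: hypothetical binary columns (0: had flu, 1 and 2: made-up symptoms).
rows = [
    [1, 1, 1], [1, 1, 1], [1, 1, 0], [1, 0, 1],  # diagnosed
    [1, 0, 0], [1, 0, 0], [1, 0, 1], [1, 1, 0],  # not diagnosed
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

selected = top_k_features(rows, labels, k=2)
model = train_naive_bayes(rows, labels, selected)
print(sorted(selected))                                    # [1, 2]
print(round(probability_positive(model, [1, 1, 1]), 2))    # 0.8
```

Column 0 is dropped because it is identical for every patient. The returned probability plays the role of the "level of certainty" attached to each classification.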

Analytics and outcomes

We deliver this result to our client, the pharmaceutical company that produces the relevant orphan drug, and this leads them to the doctors involved in the diagnosis. As mentioned, actual patients are never identified, both for ethical reasons and to comply with HIPAA requirements on data privacy. But the results put the pharmaceutical company in a position to send its sales representatives to those doctors to provide educational information about the rare disease in question and the available treatments.

In the case of HAE, which is a chronic condition, this could mean providing the correct diagnosis and treatment over the course of a patient’s lifetime, reducing suffering, emergency room visits and preventable deaths.

This example should illustrate how big data and advanced analytical techniques contribute in a real-world scenario to alleviate human suffering.

