New Streamlined Statistical Method Provides Improved Pattern Detection and Risk Prediction for Disease

The novel regression algorithm, CALF, outperforms the current gold standard, LASSO, in statistical tests

Researchers from the Renaissance Computing Institute (RENCI) at UNC-Chapel Hill, Perspectrix, the UNC School of Medicine, and the WVU Rockefeller Neuroscience Institute have collaborated to develop a new method for finding patterns in data that arguably outperforms a generally accepted “gold standard.”

Attempting to find patterns in data is central to all research, and it is particularly important in the medical field’s use of biological samples to predict a patient’s risk for disease formation and progression. Today, researchers can utilize advanced medical technology to produce an ocean of data about one person from various biological samples such as blood, DNA, and saliva, with the goal of identifying particular markers that can be informative about a person’s current health and future outlook. However, this advanced data collection and processing has outpaced current statistical methods for identifying these important patterns and relationships, and this is particularly true for the field of psychiatry. For instance, researchers have yet to fully understand and predict the progression of schizophrenia.

The goal is to build a weighted sum of test values, ideally using only a few of the many available tests, that differentiates cases from controls or correlates with some other outcome such as age of onset. The popular and powerful method known as LASSO, or “least absolute shrinkage and selection operator,” attempts to find real-number weights so that the sums are close to 1 for cases and 0 for controls. However, the utility of any such approximation to clinicians is typically measured by metrics such as the Student’s t-test p-value, AUC (area under the ROC curve), or Pearson correlation. The new method, by contrast, seeks to optimize the chosen metric directly.
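To make the idea of scoring a weighted sum by a metric concrete, here is a minimal Python sketch that computes the AUC of a sum built with coarse weights. The data values, the weights, and the function names `auc` and `weighted_sum` are hypothetical and chosen for illustration only; they are not taken from the paper or from the CALF packages.

```python
def auc(case_scores, control_scores):
    """Fraction of (case, control) pairs in which the case scores higher.

    Ties count as half a win. This pairwise form is equivalent to the
    area under the ROC curve.
    """
    wins = 0.0
    for c in case_scores:
        for k in control_scores:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


def weighted_sum(values, weights):
    """Sum of test values, each multiplied by its weight."""
    return sum(w * v for w, v in zip(weights, values))


# Hypothetical data: each row is one subject, each column one test value.
cases = [[2.1, 0.4, 1.0], [1.8, 0.9, 0.7], [2.5, 0.3, 1.2]]
controls = [[1.0, 1.1, 0.9], [1.2, 0.8, 1.1], [2.0, 1.0, 1.0]]

# Assumed coarse weights: add test 0, subtract test 1, ignore test 2.
weights = [1, -1, 0]

case_scores = [weighted_sum(row, weights) for row in cases]
control_scores = [weighted_sum(row, weights) for row in controls]

print(auc(case_scores, control_scores))
```

An AUC of 1.0 would mean every case outscores every control, while 0.5 is what random scores would be expected to give.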

This new method, CALF, is described in the Scientific Reports paper, “A greedy regression algorithm with coarse weights offers novel advantages,” published on March 31, 2022. In terms of the standard metrics, application of CALF to five quite different examples from psychiatric and neurological studies consistently outperformed the gold standard, LASSO regression, and other methods.

“Frisky CALF outruns LASSO in the five examples we outlined in the paper,” said RENCI scientist and lead author Clark Jeffries, PhD. “The metric values using CALF are superior to those of LASSO when the researcher seeks a small number of collectively informative predictors—five chosen from hundreds, for example. Interrogating the biochemistry or other relationships among the five can then suggest causality.”

The key distinction of CALF is its simplicity. For a given level of metric performance, it typically utilizes only a fraction of the predictors required by the LASSO method. CALF progresses in a ‘greedy’ fashion, meaning it searches through the data and accepts the immediate next best predictor until the algorithm has optimized model performance. According to Jeffries, the research team originally aimed to develop a baseline for the simplest regression model and, in doing so, discovered that this streamlined model extracts statistically significant results from data where LASSO fails to do so.
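The greedy idea can be sketched in a few lines of Python. This is an illustrative sketch only, not the implementation in the CALF R or Python packages: the function names, the toy data, the choice of Pearson correlation as the metric, and the stopping rule (stop when no remaining predictor improves the metric, or when a limit is reached) are all assumptions made for the example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def greedy_coarse_fit(rows, outcome, metric, max_predictors):
    """Greedily assign weights of +1 or -1 to predictors, one per step.

    Each step tries every unused predictor with both signs and keeps the
    single choice that raises the metric the most. Unused predictors keep
    weight 0. Stops early when no choice improves the metric.
    """
    n_features = len(rows[0])
    weights = [0] * n_features
    scores = [0.0] * len(rows)
    best = float("-inf")
    for _ in range(max_predictors):
        candidate = None
        for j in range(n_features):
            if weights[j] != 0:
                continue  # predictor already in the model
            for w in (1, -1):
                trial = [s + w * row[j] for s, row in zip(scores, rows)]
                value = metric(trial, outcome)
                if value > best:
                    best, candidate = value, (j, w, trial)
        if candidate is None:
            break  # no remaining predictor improves the metric
        j, w, scores = candidate
        weights[j] = w
    return weights, best


# Hypothetical toy data: six subjects, four predictors per subject.
rows = [
    [1, 4, 2, 1],
    [2, 4, 1, 2],
    [3, 3, 2, 1],
    [4, 5, 1, 1],
    [5, 4, 2, 2],
    [6, 4, 1, 1],
]
# Hypothetical continuous outcome, constructed as predictor 0 minus predictor 2.
outcome = [-1, 1, 1, 3, 3, 5]

weights, best = greedy_coarse_fit(rows, outcome, pearson, max_predictors=3)
print(weights, best)
```

In this toy data the search allows three predictors but stops after two, because no third predictor raises the correlation further. A metric that should be minimized, such as a p-value, could be handled by passing its negative.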

The geometry underlying a new data analysis method. Shown are three of the 26 nonzero weight vectors in three dimensions, which are all combinations of +1, -1, and 0. In higher dimensions, the number of choices becomes astronomical.
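The count of 26 follows from simple arithmetic: each of three coordinates takes one of three values, giving 3^3 = 27 vectors, minus the all-zero vector. The short Python sketch below, with a hypothetical helper name, enumerates these vectors and shows how quickly the count grows, which is one reason an exhaustive search is impractical for hundreds of predictors.

```python
from itertools import product


def coarse_weight_vectors(n):
    """All length-n vectors with entries in {+1, -1, 0}, excluding the all-zero vector."""
    return [v for v in product((1, -1, 0), repeat=n) if any(v)]


print(len(coarse_weight_vectors(3)))  # 26

# The count is 3**n - 1, so enumeration is only feasible for small n.
for n in (3, 10, 100):
    print(n, 3**n - 1)
```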

CALF’s improvements on existing predictive models have significant implications. The method has the potential to unearth statistically significant patterns in data that may otherwise go undetected. Unlike conventional methods, CALF returns the same solution when the input data are altered slightly. It also tolerates duplicated or nearly duplicated input data, whereas LASSO solutions can become unstable.

“It’s likely that there are existing data sets out there that failed to show more than a trend with routine analyses and could show classification significance with CALF,” said Diana Perkins, MD, MPH, a psychiatrist at UNC-Chapel Hill. “This could be groundbreaking for the field of psychiatry in improving prediction of patients’ risk for psychosis and other mental illnesses, allowing earlier intervention and overall improved outcomes.”

“While testing examples with all possible parameters in all possible conventional models is impossible, our best practices showed that CALF can find statistically significant patterns in data that otherwise fail interpretation,” added RENCI scientist Jeffrey Tilson, PhD. “We encourage fellow researchers to test out CALF on their own data sets.”

Darius Bost, a PhD student at UNC-Chapel Hill and Graduate Research Assistant at RENCI, noted, “Using data from the Alzheimer’s Disease Neuroimaging Initiative, CALF was able to determine a small set of DNA markers that are highly correlated with the age of onset of Alzheimer’s, indicating great potential for CALF as a simple and reliable research tool.”

Current versions of CALF in R and Python were developed by John R. Ford at Perspectrix. These resources are open source and available at the links below:

R version: https://cran.r-project.org/web/packages/CALF/index.html

PyPI Python version: https://pypi.org/project/calfpy/

Python 3.x version: https://github.com/jorufo/CALF_Python

Sample data for reproducing all the stated results are available via GitHub as an unrestricted supporting resource at https://github.com/jorufo/CALF_SupportingResources.
