Data Science 101: Is Logistic Regression Dead?

In this special guest feature for our Data Science 101 channel, Smita Adhikary of Big Data Analytics Hires shares her thoughts about how the data science community has changed over the years, with many useful tips for those just entering the field. Smita Adhikary is a Managing Consultant at Big Data Analytics Hires, a talent search and recruiting firm focused primarily on Data Science and Decision Science professionals. Having started her career as a ‘quant’ more than a decade ago building scorecards and statistical models for banks and credit card companies, and having spent many years in management consulting, she has witnessed from very close quarters the transformation brought about by the advent of “Big Data” in the skill-sets desired in ‘quants’. Like most ‘quants’ she holds a Master’s in Economics, and like a lot of management consultants, an MBA from Kellogg School of Management.

Once upon a time, when I started my career in Banking, we used to build scorecards. We were the much revered econometricians using the past to predict the future. We would ferociously debate the merits of logit vs. probit over steaming cups of coffee in office corridors; sometimes we would even indulge in critiquing Heckman’s two-stage estimation when discussing payment projection models for Collections. We were given the luxury of an observation period from history in which the customers’ profiles were assumed to be frozen, and the dependent variable came from a non-overlapping performance window following the observation period. The rules of engagement were set, and neither the lender (i.e., the merchant) nor the customer could interact in the interim to change the outcomes of the game. The only player determining the outcome of the game (e.g., pay/default) was the customer, once s/he was approved by the lender. In order to build these scores, we would patiently wait for 6-12 months from the time the customers were acquired to gather a sufficient length of performance. This information would then be used to predict the likelihood of ‘good’ and ‘bad’ based on the customers’ profiles from the time of acquisition.
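For readers who have never built a scorecard, the observation/performance split can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the account records, the `income` field, the 365-day window, and the `label_account` helper are all hypothetical, and a real scorecard would use far richer profiles and business-specific definitions of ‘bad’.

```python
from datetime import date, timedelta

# Hypothetical records: the profile is frozen at acquisition, and
# first_default is None when the customer never defaulted.
accounts = [
    {"id": "A", "acquired": date(2014, 1, 15), "income": 52000, "first_default": None},
    {"id": "B", "acquired": date(2014, 2, 1), "income": 31000, "first_default": date(2014, 9, 10)},
    {"id": "C", "acquired": date(2014, 3, 20), "income": 45000, "first_default": date(2015, 8, 5)},
]

PERFORMANCE_DAYS = 365  # assumed 12-month performance window


def label_account(account, window_days=PERFORMANCE_DAYS):
    """Return 1 ('bad') if the first default falls inside the performance
    window that follows acquisition, otherwise 0 ('good')."""
    window_end = account["acquired"] + timedelta(days=window_days)
    default = account["first_default"]
    is_bad = default is not None and account["acquired"] < default <= window_end
    return int(is_bad)


# Training rows pair the frozen acquisition-time profile with the label
# observed later in the non-overlapping performance window.
training_rows = [
    {"id": a["id"], "income": a["income"], "bad": label_account(a)}
    for a in accounts
]
```

Account C defaults only after its window has closed, so under this assumed definition it is still labelled ‘good’; that is the sense in which the outcome is fixed by the window rather than by anything that happens later.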

Life was good. But then something happened – Big Data happened! It completely changed how the merchant and the customer would interact. Forever.

Cut to the digital era: the customers and the merchant now interact in a dynamic setting where nothing is frozen anymore. There is no so-called ‘performance period’. The customers are free to navigate the merchant website any which way they fancy. Also, the merchant can display content and place offers dynamically based on how a given customer interacts with the website. To make matters more complicated, purchase decisions are not necessarily made on the first visit itself. Internet-savvy customers now have all the information at their fingertips to land themselves the best deal. They typically go through the AIDA (Attention-Interest-Desire-Action) journey when contemplating a purchase. In this scenario, the customer’s site navigation on the day of the purchase is mere execution of a decision that was made even before the customer landed on the site: the customer has been on the site before; the customer is aware of what is on offer; the customer knows exactly how to get to the page on the site where they can choose the product they desire. In fact, the pages visited on the day of the purchase are often not causal to the purchase, simply correlated with it.

So why am I subjecting you to this drivel?

The point I am simply trying to make is that in the new world the focus has dramatically shifted from prediction to classification. The selling and buying is now all happening in a real-time environment where the two players are interacting with each other, and repeatedly. The merchant has the leverage to influence the customer’s behavior through customized offers based on behavioral segmentation and contextual targeting. In essence, the dependent variable here is really immaterial. All the merchant wants to understand is ‘who the customer is’, and that will determine what offer to place. On a more technical note, since the customers are now visiting the merchant’s site several times, the independence of each record ceases to exist – a mortal blow to the much beloved logistic regression. And then fall Caesar? Enter Machine Learning.
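A rough way to see why repeated visits matter is the standard design-effect approximation (Kish), which shrinks the nominal record count according to how strongly records from the same customer resemble each other. The sketch below is only a back-of-the-envelope illustration: it assumes every customer has the same number of visits, the correlation value is made up, and the function name is hypothetical.

```python
def effective_sample_size(n_records, visits_per_customer, icc):
    """Approximate number of independent records.

    Uses the design effect 1 + (m - 1) * icc, where m is the (assumed
    equal) number of visits per customer and icc is the intra-customer
    correlation of the outcome.
    """
    design_effect = 1.0 + (visits_per_customer - 1) * icc
    return n_records / design_effect


# 10,000 visit records, 5 visits per customer, assumed correlation of 0.3
n_eff = effective_sample_size(10000, 5, 0.3)
```

Under these assumed numbers, the 10,000 visit records carry roughly the information of about 4,500 independent ones, so standard errors from a model that treats each visit as independent would be too optimistic.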

All of a sudden the whole world of analytics is now talking about Support Vector Machines, Random Forests, Bagged Regressions et al. – everything is about classification; everything is about adaptively learning and self-evolving algorithms that augment the understanding of the customer with every successive digital footprint.
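To make ‘learning with every successive digital footprint’ concrete, here is a minimal sketch of a model that updates after each observation. For brevity it is a logistic regression fit by stochastic gradient descent rather than an SVM or a random forest, so it illustrates incremental updating, not those specific algorithms. The class name, learning rate, and toy event stream are all assumptions made for the example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class OnlineLogistic:
    """Logistic regression updated one observation at a time (SGD)."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict_proba(self, x):
        z = self.bias + sum(w * xi for w, xi in zip(self.weights, x))
        return sigmoid(z)

    def update(self, x, y):
        """Take one gradient step on the log-loss for a single (x, y) pair."""
        error = self.predict_proba(x) - y
        self.weights = [w - self.learning_rate * error * xi
                        for w, xi in zip(self.weights, x)]
        self.bias -= self.learning_rate * error


# Toy stream: the single feature is 1.0 if the visitor viewed an offer
# page, and the label is 1 if that visit ended in a purchase.
model = OnlineLogistic(n_features=1)
stream = [([1.0], 1), ([0.0], 0)] * 200
for features, purchased in stream:
    model.update(features, purchased)  # the model changes after every event
```

There is no refit over a frozen history here: each incoming event nudges the coefficients, which is the contrast with the batch-estimated scorecards described earlier.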

This makes sense. This is all good. But let us spare a moment to think about what this has entailed for the analytics job market.

Now we have a clear demarcation between the “Predictive Modelers” and the “Data Scientists”. The former have been more or less relegated to traditional banking, insurance and telecom companies, where static scores and optimization-based solutions are still pursued (but not sure for how much longer!). The latter are the sought-after (just like we were eons ago) whiz-kids ruling the roost at the cool tech companies, and presumably changing the world. This paradigm shift has drastically changed the ‘skills’ requirement in job descriptions: when screening candidates, employers are now specifically looking for ‘Python, R and machine learning’, as against ‘SAS, regression, optimization’ in the days of yore. The seriousness of their intent is cemented by the fact that they are willing to dish out startling salaries for the new-age skills (typically 40% or more above those of econometricians with comparable levels of education and experience). Believe you me, the employers are not kidding around: if you’ve got the chops, they will pay. What I find most amazing in the current context is the fact that the ‘gold standard’ of analytic excellence, as far as perception goes, has now become much smaller companies: the new-age cool tech startups. Nobody pays much attention anymore if you have a behemoth like Bank of America or a Chase or an Oracle or even a McKinsey on your resume.

The discourse above surely seems to paint a rather bleak picture for the Predictive Modelers out there. In reality, what I have realized, having lately devoted a fair amount of time and research to the topic and having helped people through it as they consider career changes, is that there is a lot of help available on the world wide web (and for FREE!) if one makes the commitment to learn. A seasoned econometrician can quickly become an expert on machine learning by simply enrolling in the fantastic courses offered at Coursera by professors from Johns Hopkins, Stanford and others. The legendary professors Dr. Hastie and Dr. Tibshirani have even made their classic book on the subject available for free download. Just think: Elon Musk launched a company that now builds rockets at 1/5th the cost at which NASA builds them (never mind the recent explosion) … but where did he start? By reading books on rocket science! And here you have a chance to learn from the best. So, rather than feeling depressed, I implore you to make this monumental next move in your life. It’s time to turn a new page in our careers. We will talk on the other side!

 

Sign up for the free insideBIGDATA newsletter.

Comments

  1. Changes in the analytics industry are well articulated (better than in many of those papers / Quora posts on “Machine Learning” vs. “Traditional Statistics”, etc.). However, I am not sure that the majority of problems in the “new world of digital” fall in the classification domain. Logistic regression is passé because most “Big Data” production systems (though some can) cannot implement the algorithm at real-time speeds, and customer-level predictions are largely not required in this environment.

    • Dear Amit,
      Thank you for your rather astute observations. You are absolutely correct. Not all problems in digital are classification problems. Take, for example, the case of attribution. If you consider it at a visitor level, it is a classification problem where the outcome variable is a binary (0=not-convert/1=convert) attached to each visitor. If, however, you consider attribution rolled up to the channel/feature level, we have a continuous dependent variable (total number of conversions at every time instance). In either case, logistic regression is not the most effective technique. In the former, we have a ‘time’ element, making it essentially a panel (= mixed effects, plus non-independence at every visitor level), while in the latter the total data points are often so few (businesses typically have 2-3 years of weekly data) that one has to resort to sample-augmentation-based techniques such as bagging and random forests. Please stay tuned for our next publication, which addresses this issue :) And your comment on the ‘real-time’ nature of problems making logistic regression passé is spot-on.

  2. Hi Smita,
    Superb article; the big data area is well articulated.
    Machine learning and Bayesian classification models are primarily focused on domains like ecommerce as of now.
    I know for sure that banks still traditionally use logistic regression for their modelling, for behavioural scorecards, risk, etc.
    Banks are slowly trying to look into unstructured data, which is handled more with Python and Perl, for consumer sentiment and how this can help in revenue generation or enhancement, or a complete revolution.

    I’m building a similar platform for Customer Target Model

    Thanks
    Mahendra

    • Thank you so much for liking the article and for resonating with our thoughts. Seems like you have made the transition to the ‘other side’…congratulations!

  3. Abhishek says:

    Here in this article, I beg to differ. I don’t think logistic regression is dead or irrelevant… you are simply comparing it against a bigger business problem these days, which is classification, hence new techniques are required for this purpose. That the problem focus has changed doesn’t mean the older technique is obsolete… it is just not getting that high a priority from the business.
    For a real data scientist no technique ever dies… because maths is universal truth… we figure out smarter and more powerful variants of existing ones… I remember developing a logistic model with adaptive variables which worked pretty well in a real-time scenario.

    • Hi Abhishek,
      I think you answered your own question. In the article we are trying to point out why the job market is paying a significant premium to machine learning specialists over traditional statisticians/econometricians. The math or technique is not dead… but the demand for those techniques is definitely on a steady decline… and some of the reasons for that are cited in the article.

  4. David Jones says:

    What’s the reason for not running classification models, machine learning models etc in SAS ?? They’ve spent 40 years in developing those .. sounds like a rather bizarre statement saying their days are over

    • Hi David,
      The article aims to point out the shift in employer preferences for desirable skill-sets in the US market and the reasons for this shift. A lot of this shift stems from the adoption of open-source technologies. Even a giant like Walmart is moving away from SAS and into R. And there is no denying the fact that these newer skill-sets command a significant premium in the market compared to traditional SAS skills, quite like SAS users used to about a decade ago.

  5. @David, I have been a SAS devotee for the last 15 years – pretty much built my career with SAS as a companion. These days even I am *having* to use R in connection with machine learning because SAS simply does not have any structured module for such things as random forests, nothing that will compare with the neat R packages.
    Also, one of the things that is really hurting SAS over the last few years is the inordinately high price tag. Compare this with R which is free.

  6. Smita – very bright article, and important market observations.
    It would be even more interesting if you shared your thoughts about the reasons for such a shift in employers’ preferences.
    In my opinion, predictive modeling algorithms have limited ability to build accurate enough predictive models for scenarios where we need to predict human decisions (upsell, cross-sell, churn, etc.). This is the reason why employers look for alternative solutions.
    Please, let me know if you agree with my point or not.

    • Dear Vladimir,
      Thanks for your comment. I would be a little skeptical about saying that predictive modeling cannot effectively address the areas you have highlighted. Case in point: for years, compound Poisson models have been doing an excellent job predicting survival probabilities in churn models. In my humble opinion, the key issue that we are faced with in the age of big data emanates from the fact that analytics is now expected to be ‘real time’. In such a scenario, traditional econometric methods can often be less effective simply because they are based on ‘closed form’ estimates as against ‘learning’ and ‘evolving’ incrementally. This is why techniques such as state space models, random forests and SVM have become so popular these days. This is a contentious debate and has the potential to create misunderstanding in forums like this :)

    • Dear Smita,
      Thank you for your response. My colleagues and I spent significant effort trying to build accurate enough SVM, Random Forest, etc. models, but our best models had a 30% classification error, and it seems to me to be a natural limitation for such data because we are trying to predict people’s decisions. What error level would you consider acceptable for real-time use in this case?

    • Dear Vladimir,
      It is quite evident that you are in the thick of things. In my experience, 70% accuracy is not bad at all when you are modeling – as you have aptly said – “human behavior”. If I may share a few tricks we have tried which have helped improve model fit: [a] run random forest first to reduce dimensionality and shortlist key drivers, followed by SVM for final model estimation; [b] apply latent variable modeling or hidden Markov modeling to investigate whether the unexplained variation is due to an unobserved phenomenon attributable to customer psyche/exogenous factors; [c] if the error rate is caused by a high percentage of missing values in one or more factors, consider dynamic linear models (DLM) or a state-space formulation. Hope this resonates.

    • Dear Smita,

      Thank you for your response; this is valuable information for me. The tricks sound promising, and I will try some of them.

  7. Archit Bhasin says:

    Hi Smita,
    Thanks for your article. I’m a beginner in credit risk modelling (just 2 years of experience). I have been pushing my bank to use something like a social score of a customer in an application scorecard, but they are reluctant. Are the banks not ready for the transition yet? And if predictive modelers don’t get to exercise the new learnings that you have mentioned, should they look beyond Risk for career options?

    • Dear Archit,
      Thank you so much for the comment. You are absolutely right in your observation: the banking industry has been the slowest to adapt to this trend. Using something like a social score in acquisition scorecards can have multiple ramifications from a regulatory and compliance standpoint with regard to fair credit laws. Hence the reluctance. However, classification models for cross-sell and upsell of credit cards and other banking products are becoming more and more the norm. So ‘traditional’ credit risk may not yet be ready for these, but the same is not true for all of banking.

  8. Vijay Gupta says:

    “Bagged Regressions” are classification? A Logistic is a predictive and classification tool?

    “On a more technical note, since the customers are now visiting the merchant’s site several times the independence of each record ceases to exist – a mortal blow to the much beloved logistic regression. ” (BTW, can not each visit be a separate independent encounter even if it is by the same person? )

    Is SVM not used for the same thing as a Logistic in many areas (though it does not produce a score)? Or Random Forest?

    The business pains that can be helped by classification have increased and the quality of classification algorithms has improved.

  9. Charles Richard says:

    Just thought I’d mention this. I used to be a process control engineer in the chemical industry. We used an algorithm called DMC (Dynamic Matrix Control) to build controllers for whole chemical process units with many dependent and independent variables. Predictive models that are dynamic – time is a variable – were used. There was an adaptive element to correct for process changes after the models were derived – the models were developed from step testing in the unit operating in its normal range. The adaptations were usually small; if they got to be large, then a new model was needed, which was rare. This may be something sort of midway between the two “sides” discussed in this article.

  10. Girish Bakshi says:

    Hi Smita, Very interesting and insightful article (I realize it was written a while back), and spot on about the increasing importance of classification/clustering-based approaches in the online, data-rich world to segment and offer. Choice models are useful in the planning phase…
