
Predictive Modeling and Production Deployment

This article is the sixth and last in an editorial series that reviews how predictive analytics helps your organization predict with confidence what will happen next so that you can make smarter decisions and improve business outcomes.

Comprehensive data access capabilities coupled with effective exploratory data analysis are important prerequisites for predictive analytics, as mentioned in last week’s article.

Predictive Modeling

Using predictive analytics involves understanding and preparing the data, defining the predictive model, and following the predictive process. Predictive models can assume many shapes and sizes, depending on their complexity and the application for which they are designed. The first step is to understand what questions you are trying to answer for your organization. The level of detail and complexity of your questions will increase as you become more comfortable with the analytic process. The most important steps in the predictive analytics process are as follows:

  • Define the project outcomes and deliverables, state the scope of the effort, establish business objectives, and identify the data sets to be used.
  • Undertake data collection and data understanding.
  • Perform data munging – the process of inspecting, cleaning, and transforming the data.
  • Utilize exploratory data analysis (EDA) – use graphical techniques to discover useful information and arrive at conclusions. Apply standard statistical techniques to validate assumptions and test hypotheses.
  • Apply modeling principles to automatically create accurate predictive models about the future.
  • Evaluate the model to verify the robustness of the chosen model and make mid-course corrections. Test models on existing data and apply predictions to new data.
  • Select a deployment option that opens up the analytical results to everyday decision making and automates decisions based on the modeling.

Each of the above steps can be considered iterative and may be revisited as needed. Note that the data munging step is often very time-consuming, depending on the cleanliness of the incoming data, and can take up to 70% of the overall project timeline.
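
The munging, modeling, and evaluation steps above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the synthetic data, the field names "x" and "y", the helper names, the 80/20 train/holdout split, and the choice of a single-variable linear regression. It is not taken from the guide and is not a production workflow.

```python
import random
import statistics


def munge(rows):
    """Data munging: keep only records whose 'x' and 'y' fields parse as numbers."""
    clean = []
    for row in rows:
        try:
            clean.append((float(row["x"]), float(row["y"])))
        except (KeyError, TypeError, ValueError):
            continue  # skip records with missing or malformed fields
    return clean


def fit_line(pairs):
    """Modeling: ordinary least squares fit of y = intercept + slope * x."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def rmse(model, pairs):
    """Evaluation: root mean squared error of the model on the given pairs."""
    intercept, slope = model
    squared = [(y - (intercept + slope * x)) ** 2 for x, y in pairs]
    return (sum(squared) / len(squared)) ** 0.5


# Hypothetical raw input: 100 noisy records plus three malformed ones.
rng = random.Random(0)
raw = [{"x": str(i), "y": str(3.0 + 2.0 * i + rng.gauss(0, 0.5))} for i in range(100)]
raw += [{"x": "n/a", "y": "7"}, {"x": "5", "y": None}, {"y": "1"}]

data = munge(raw)
rng.shuffle(data)
split = int(0.8 * len(data))          # assumed 80/20 split
train, holdout = data[:split], data[split:]

model = fit_line(train)               # fit on existing data
holdout_error = rmse(model, holdout)  # check robustness on unseen data

print(f"kept {len(data)} of {len(raw)} records")
print(f"intercept={model[0]:.2f} slope={model[1]:.2f} holdout RMSE={holdout_error:.2f}")
```

Measuring error on a holdout set the model never saw during fitting is one common way to check robustness before the model is applied to new data. If the holdout error is unacceptable, the iterative nature of the process means returning to the munging or modeling step.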

Characteristics of the data can often help you determine which predictive modeling techniques best meet your needs. Here are a number of points to consider when determining which technique to use based on your data and the problem you wish to solve.

  • When the data is grouped by observations, tools such as cluster analysis, association rules, and k-nearest neighbors usually provide the best results.
  • Use classification to separate the data into classes based on the response variable – both binary classes like True or False, as well as multi-class situations.
  • Use single, multiple and polynomial regression when attempting to make a prediction rather than a classification.
  • In poor quality or limited data situations, A/B testing is appropriate. A/B tests are statistical experiments that help you decide whether a change is actually making a significant impact on your product.
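
As an illustration of the A/B testing point above, here is a minimal Python sketch of one common way to judge whether a change made a significant difference: a two-sided two-proportion z-test. The function name, the visitor and conversion counts, and the 5% significance threshold are all hypothetical choices for this example, not figures from the guide.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates.

    conv_a / n_a: conversions and visitors for variant A (control)
    conv_b / n_b: conversions and visitors for variant B (change)
    Returns (z statistic, two-sided p-value).
    """
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled rate under the null hypothesis that A and B convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical experiment: 5,000 visitors per variant.
z, p_value = two_proportion_z_test(conv_a=200, n_a=5000, conv_b=260, n_b=5000)

ALPHA = 0.05  # assumed significance threshold
verdict = "significant" if p_value < ALPHA else "not significant"
print(f"z={z:.2f} p={p_value:.4f} -> {verdict} at alpha={ALPHA}")
```

The normal approximation used here assumes reasonably large samples in both variants. With very small counts, an exact test is usually the safer choice.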

Production Deployment

The final step in the predictive analytics project timeline is to determine how best to deploy the solution to a production environment. Of primary concern is using open source R on larger data sets where performance is important. The open source R engine was not built for enterprise usage. Deploying open source R can be problematic for the following reasons:

  • Poor memory management – R does not reclaim memory well, so memory use can grow quickly, leading to out-of-memory crashes as well as non-linear performance due to more frequent garbage collection and increased swapping.
  • Risk of deploying open source with GPL license – software vendors are forbidden to embed or redistribute open source R as a part of any commercial closed-source software.

In order to avoid these issues, analysts often will opt to convert their working R solution to a different programming environment like C++ or Python. This path, however, is far from optimal since it requires recoding and significant retesting.

Best practice would be to use a commercial, enterprise-grade R solution, like TIBCO Software’s Enterprise Runtime for R (TERR), to resolve the above limitations and to yield a robust production environment. Because many corporations already have legacy predictive models in house, it is also recommended that you ensure your analytics platform supports TERR, open source R, S+, MATLAB and SAS models, in order to take advantage of an ecosystem of predictive analytics.

The complete insideBIGDATA Guide to Predictive Analytics is available for download in PDF from the insideBIGDATA White Paper Library, courtesy of TIBCO Software.

Comments

  1. The easiest and fastest way to deploy sophisticated predictive analytic models is to export the model to PMML (Predictive Model Markup Language), a mature, well-supported open industry standard, and then use products like those from Zementis, which execute, optimize and scale PMML for both batch and real-time systems.

    This enables immediate deployment of predictive models without custom coding and provides a “write once, run anywhere” capability.
