Embarking on a predictive analytics project involves many decisions: what business problem to address, what data to use, who to have on your team, and so on. One important technology decision is which predictive modelling platform to use. A chart from our recent survey of data analytics practitioners, for example, shows the range of tools and technologies they use. There are two main options: a GUI-based tool or a programming language. In this blog post we look at the popular options in each category and explore some of the advantages and disadvantages of each.
Well-designed application-based, or point-and-click, tools make it quick and easy to develop and evaluate models and to perform associated data manipulation tasks. Using one of these tools, it is possible to train, evaluate and deploy an advanced analytics model in less than an hour! Important GUI-based solutions for building predictive data analytics models include:
- IBM SPSS Modeler (www.ibm.com/analytics/us/en/technology/spss/)
- KNIME Analytics Platform (www.knime.org)
- RapidMiner Studio (www.rapidminer.com)
- SAS Enterprise Miner (www.sas.com)
- Weka (www.cs.waikato.ac.nz/ml/weka)
- Oracle Data Miner (www.oracle.com/technetwork/database/options/odm/dataminerworkflow-168677.html)
- MS Cortana Intelligence Suite (www.microsoft.com/en-us/server-cloud/cortana-intelligence-suite/)
- Dell Statistica (software.dell.com/products/statistica/)
Most of these tools offer reasonably similar functionality in which a user builds a model by connecting together different modelling elements in a process flow. These elements typically include sampling, data manipulation, model training, and evaluation. Key differentiators amongst these offerings include:
- Ease of use of their interfaces
- The breadth of modelling approaches that they offer
- The number of intelligent defaults built into the tool
- The ability of tools to handle datasets of significant size
- Ease of integration with other enterprise-wide tools
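For comparison, the kind of process flow described above (sampling, data manipulation, model training, evaluation) can also be written out in code. The following is only a sketch using scikit-learn: the synthetic dataset, the choice of a scaler and logistic regression, and the five-fold cross-validation are all illustrative assumptions, not a description of what any of the listed tools does internally.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Sampling: a synthetic dataset stands in for rows sampled from real data.
X, y = make_classification(n_samples=400, n_features=6, random_state=0)

# Data manipulation and model training, connected as steps in one flow.
flow = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# Evaluation: five-fold cross-validated accuracy of the whole flow.
scores = cross_val_score(flow, X, y, cv=5, scoring="accuracy")
print(f"Mean accuracy: {scores.mean():.2f}")
```

Each named step in the pipeline plays roughly the role of one element that a GUI user would drag into a process flow and connect to the next.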
The key advantages of using a GUI-based tool are:
- Ease of use – build models in hours rather than days
- Reliable, well tested implementations
- Complete end-to-end infrastructural support
The disadvantages of using a GUI-based tool include:
- Costs can be significant
- The tool may not support all desired functionality
- A tool can lock you into an ecosystem
- Tools often don’t support the very latest approaches
The two main competing programming languages for predictive data analytics are R (www.r-project.org) and Python (www.python.org). Building predictive data analytics models using a language like R or Python is not especially difficult. For example, in R we might do something like:
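# Load the training data and fit a logistic regression model
creditscoring.train <- read.csv("creditScoringTrain.csv")
glm.mod <- glm(Outcome ~ Amount + Salary + Age + LoanSalaryRatio,
               family = binomial(link = "logit"), data = creditscoring.train)
# Load the test data and predict the probability of the outcome for each row
creditscoring.test <- read.csv("creditScoringTest.csv")
predicted.values <- predict(glm.mod, creditscoring.test, type = "response")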
Or in Python we might use something like:
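import pandas as pd
from sklearn import tree
# Load the training features and labels (assumed to be a single column), then fit a decision tree
X_train = pd.read_csv('creditScoringTrain.csv')
Y_train = pd.read_csv('creditScoringLabels.csv')
my_tree = tree.DecisionTreeClassifier(criterion='entropy')
my_tree.fit(X_train, Y_train.values.ravel())
# Load the test data and predict a label for each row
X_test = pd.read_csv('creditScoringTest.csv')
y_pred = my_tree.predict(X_test)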
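The snippet above depends on CSV files that are not included with this post. As a rough, self-contained sketch of the same workflow, the version below generates synthetic data in place of those files and adds a hold-out accuracy check. The column names are borrowed from the R example; the `make_credit_data` helper, its data-generating rule, and the 70/30 split are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd
from sklearn import tree
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def make_credit_data(n_rows=500, seed=0):
    """Build a synthetic stand-in for the credit scoring data (hypothetical)."""
    rng = np.random.default_rng(seed)
    data = pd.DataFrame({
        "Amount": rng.uniform(1_000, 50_000, n_rows),
        "Salary": rng.uniform(20_000, 120_000, n_rows),
        "Age": rng.integers(18, 70, n_rows),
    })
    data["LoanSalaryRatio"] = data["Amount"] / data["Salary"]
    # Assumed rule: loans that are large relative to salary tend to go bad.
    noise = rng.normal(0, 0.05, n_rows)
    data["Outcome"] = (data["LoanSalaryRatio"] + noise > 0.4).astype(int)
    return data


data = make_credit_data()
X = data.drop(columns="Outcome")
y = data["Outcome"]

# Hold back 30% of the rows to evaluate the model on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Train a decision tree using entropy as the split criterion.
my_tree = tree.DecisionTreeClassifier(criterion="entropy", random_state=0)
my_tree.fit(X_train, y_train)

# Predict labels for the held-out rows and measure accuracy.
y_pred = my_tree.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Hold-out accuracy: {accuracy:.2f}")
```

With real data, the call to `make_credit_data` would be replaced by `pd.read_csv` calls like those shown earlier.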
The advantages of using a programming language for advanced analytics are:
- Huge flexibility – anything you can imagine can be implemented.
- The newest advanced analytics techniques will usually become available in programming languages long before they will be implemented in application-based solutions.
- Low cost of setup
The disadvantages, however, of using a programming language for advanced analytics are:
- Programming is a skill that takes time and effort to learn.
- Experienced staff can be hard to find
- Very little of the infrastructural support present in application-based solutions is available, which puts an extra burden on developers to implement this support themselves.
- In highly regulated environments, using bespoke solutions written by in-house developers can be problematic if an audit is required.
Key tools, libraries and resources for Python programming:
- IDE: IPython notebooks (www.anaconda.org)
- Data Manipulation: pandas (pandas.pydata.org)
- Visualisation: Matplotlib (www.matplotlib.org)
- Modelling: scikit-learn (www.scikit-learn.org)
- Web: Learn Python the Hard Way (www.learnpythonthehardway.org/book/)
- Book: Programming Collective Intelligence (shop.oreilly.com/product/9780596529321.do)
Choosing a Predictive Analytics Platform
Choosing a predictive analytics platform involves balancing the pros and cons of the different options. If you have good programmers on your team, then R and Python are a great, cost-effective way to get started. In fact, writing code in these languages for predictive modelling is not terribly complicated, so even novice programmers can get up and running quite easily.
Of course, getting well trained in how to use tools or programming languages is a great way to evaluate them. At The Analytics Store we have courses in many of the most important tools, and the R and Python programming languages. For example: