practical machine learning

Practical Machine Learning: discerning differences and selecting the best approach Lynn Langit Reviewed by Mark Tabladillo

TABLE OF CONTENTS

Executive summary
Introduction
Concepts
Process and Practicalities
Accessible to Data Scientists & Business Users
Accessible to Developers & BI/DW Professionals
Key Takeaways
References and Resources
Table of Abbreviations
About Lynn Langit
About Mark Tabladillo


Executive summary

The formal definition of Machine Learning (ML) is this: the ability of computing systems to gain knowledge from experience. Practical ML enables your organization to answer business questions more effectively because of that experience. Machine Learning solutions consist of models built from your input data, which combine that data with statistical and data mining algorithms.

Until relatively recently, applied ML (as contrasted with ML for research) was simply too specialized, difficult and expensive to see broad adoption outside of the academic community and a few commercial domains (finance, ad serving). However, improvements in languages and libraries, as well as new commercial offerings (including cloud-only products), have greatly increased the practicality of implementing ML applications. Demand has also been fueled by Big Data: more data encourages more powerful methods of processing to gain understanding from that data.

This report will discuss technologies and implementation approaches for creating enterprise data solutions that include one or more machine learning components. The report will also detail the tradeoffs of each solution and determine which approach best fits organizational needs.

Introduction

The term ‘Predictive Analytics’ is used somewhat interchangeably with Machine Learning. The central idea is that Machine Learning enables the creation of important business insights by analyzing some set of input data with one or more data mining or statistical algorithms.

Where Machine Learning is used

In some sectors, particularly academic research, statistical analysis and data mining have been standard analytical techniques for years. These sectors tend to use open source languages, tools and libraries. Academics commonly use specialty languages such as R, or Python libraries (SciPy/NumPy/Pandas), rather than enterprise languages such as Java, for their ML research projects. Researchers also tend to work with wide (many attributes) and shallow (relatively small sample sizes) datasets. This academic dataset size is significant because many of the commonly used tools, such as R Studio or even Weka, are designed for small (albeit rich) datasets. They are limited to working with datasets that fit in the memory of an analyst’s desktop computer, rather than drawing on server or even cloud-scale processing power.

In a few commercial sectors, such as finance (for example, credit scoring) and security (for example, email spam detection), use of ML (via data mining) is not a new approach. In these areas, highly specialized tools and specially trained professionals have supported these types of solutions. These vertical-specific ML solution development cycles cost hundreds of thousands or even millions of dollars to implement. These costs include software licenses, powerful hardware, proprietary development and management tools and consulting fees. These types of projects have also commonly taken months or even years to implement.

However, the ML market landscape is rapidly changing with the availability of Big Data/cloud storage, processing and data pipelines. These new services enable faster and cheaper data collection, storage and processing. The growth of IoT (mostly sensor) data is also increasing the volume of data available for analysis. These market changes are making the overall ‘entry point’ for ML projects less risky, i.e. cheaper and faster. Another driver of adoption is the effort that commercial vendors are putting into creating usable ML tooling, most of which runs on that particular vendor’s cloud infrastructure (such as IBM Watson on Bluemix, Microsoft Azure ML on Azure or Amazon ML). ML projects are increasingly seen as a realistic possibility given the larger market landscape. Simply put, more data means a need for more powerful methods of deriving meaning from increasingly large and complex datasets. Enter the democratization of Machine Learning.

Challenges to Adoption

Although tools are reducing the complexity of applying the power of statistical and data mining techniques to increasingly large data sets, the enterprise market is in the early stages of ML adoption. One of the key blockers is complexity -- creating useful predictive analytics or ML differs substantially from the more traditional business analytics.

Because the market for technical professionals skilled in applied statistics and data mining has traditionally been small, we are faced with a lack of trained, working professionals who can produce useful results in this area. Specifically, we lack those who have experience performing the tasks needed in the enterprise ML solution lifecycle: cleaning and grooming the input data, selecting appropriate techniques and algorithms, building and evaluating models, and supporting the move of the resulting work to production.

Vendors are stepping in to reduce this gap. Several major commercial vendors have launched general-purpose machine learning suites this year. As mentioned, the majority of these new offerings are cloud-based. Some solutions let you train, test and deploy either in the cloud or on premises, while other solutions, such as BigML, are cloud-only.


Concepts

Taxonomies and terms for Machine Learning solutions have important and nuanced differences in meaning. Understanding them properly is key to differentiating the products and solutions available in the ML space. We’ll start by providing definitions of associated technologies.

What is the difference between business analytics and predictive analytics?

Business Analytics is defined as finding answers to business questions by querying data and producing a definite result or result set. For example: “What are the top five items that are found in a shopping basket for a 38-year-old man from California who is shopping on a Saturday at 5pm at a major grocery chain?” The answer to this question (via a query to source data) produces a deterministic result set, usually shown as a report or a dashboard. For many organizations, this is the only type of analytics that they have available. Stated differently, business analytics are used to analyze “what has happened” for past events.

Predictive Analytics is defined as finding answers to business questions by applying one or more probabilistic algorithms to some set of input data and producing one or more probabilistic results. For example: “Consider the items which appear together in the shopping baskets of all 38-year-old men from California who are shopping on a Saturday at 5pm at any of the major grocery chain stores for which we have data, and predict how many of a given item from this set the stores should have on hand to ensure proper supply for this type of customer.” In this case, the type of algorithm is regression, because it is used to predict a future value or set of values. To get a result, one or more regression algorithms (for example, linear regression) are applied to the source data. Because the results are probabilistic, i.e. a percentage or score of likelihood of a result, it is common to apply more than one algorithm and then to evaluate the quality of each result. This process is called ‘evaluating the model.’ The best result from the models is selected and is presented either via statistical output (probability) or via a customized visualization. Stated differently, predictive analytics are used to analyze “what will happen” for potential or future events. The graphic below illustrates and contrasts sample results in business and predictive analytics.

Figure 1 - Two Types of Analytics
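To make the contrast concrete, here is a minimal sketch in Python. The sales records, item names and function names are hypothetical and chosen only for illustration. `top_items` answers a "what has happened" question with a deterministic query, while `forecast_units` fits a simple least-squares trend line (one possible regression technique, not necessarily what any given product uses) to estimate a future value.

```python
from collections import Counter

# Hypothetical transaction records: (week_number, item, units_sold).
SALES = [
    (1, "milk", 10), (1, "bread", 7),
    (2, "milk", 12), (2, "bread", 6),
    (3, "milk", 14), (3, "bread", 8),
    (4, "milk", 16), (4, "bread", 7),
]


def top_items(sales, n):
    """Business analytics: a deterministic query over past data."""
    totals = Counter()
    for _week, item, units in sales:
        totals[item] += units
    return totals.most_common(n)


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def forecast_units(sales, item, future_week):
    """Predictive analytics: estimate a future value from a fitted trend."""
    weeks = [w for w, i, _ in sales if i == item]
    units = [u for _, i, u in sales if i == item]
    slope, intercept = fit_line(weeks, units)
    return slope * future_week + intercept
```

Calling `top_items(SALES, 1)` always returns the same answer for the same data, whereas `forecast_units(SALES, "milk", 5)` returns an estimate whose quality still has to be evaluated.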

What is the difference between data mining and predictive analytics?

Data Mining encompasses a broader set of tasks than those included in predictive analytics. In addition to regression algorithms, data mining also includes other types of predictive analysis. Specifically, finding groupings in the source data by matching new data to existing labeled (or categorized) data is called classification. Classification algorithms are characterized as ‘supervised’ because, in addition to the algorithm, an authoritative set of labeled data is used to process the input data. For example: “In a set of data there are pictures or drawings of objects that we’ve identified and labeled as particular animals, i.e. ‘this is a picture of a dog and that is a picture of a cat.’” A classification task is to evaluate the likelihood of a new picture being a dog or a cat based on pattern matching against the set of known states. An example of a classification algorithm is decision trees. Note that regression is also ‘supervised’, because a data set with ‘known values’ is used in conjunction with the regression algorithm when evaluating the probability of a result using new input data.
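As an illustration of supervised classification, the sketch below trains the simplest possible decision tree, a single-split "stump", on labeled examples and then labels new inputs. The weight values, the labels and the function names are hypothetical, and the code assumes at least two distinct feature values in the training data. Production decision tree libraries build many splits and use more refined criteria.

```python
# Hypothetical labeled training data: (weight_kg, label).
TRAINING = [
    (3.5, "cat"), (4.2, "cat"), (5.0, "cat"),
    (9.0, "dog"), (12.5, "dog"), (20.0, "dog"),
]


def train_stump(examples):
    """Find the single threshold that misclassifies the fewest examples.

    Returns (threshold, label_at_or_below, label_above, training_errors).
    """
    values = sorted({x for x, _ in examples})
    labels = sorted({y for _, y in examples})
    best = None
    # Candidate thresholds are midpoints between adjacent distinct values.
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2
        for below in labels:
            for above in labels:
                errors = sum(
                    1 for x, y in examples
                    if (below if x <= threshold else above) != y
                )
                if best is None or errors < best[3]:
                    best = (threshold, below, above, errors)
    return best


def classify(stump, x):
    """Label a new, unlabeled value using the trained stump."""
    threshold, below, above, _errors = stump
    return below if x <= threshold else above
```

The labeled examples play the role of the "authoritative set of data": without them, `train_stump` would have nothing to measure its errors against.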

Discovering natural groupings in source data for which there are no known states or labels is called clustering. Since there are no known states when clustering algorithms are used, this type of machine learning is called ‘unsupervised’. An example of this technique is ‘here are some pictures, group them into subsets based on characteristics (or labels) that are discovered during the process of running the algorithm.’ As with the other types of ML, when implementing clustering it is common to try multiple clustering algorithms (k-means is one example), then to evaluate the model results and finally to select the top performing algorithm and model for the particular business problem.
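A minimal k-means sketch shows what "unsupervised" means in practice: no labels are supplied, only points. The data and function names are hypothetical. For reproducibility this sketch seeds the centroids with the first k points, which is an assumption made here for simplicity; real implementations usually use random or smarter initialization.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=100):
    """Group points into k clusters; returns (centroids, labels)."""
    centroids = [tuple(p) for p in points[:k]]  # simplistic deterministic seeding
    labels = [None] * len(points)
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, labels


# Hypothetical unlabeled observations with two visible groups.
POINTS = [(1, 1), (1.5, 2), (1, 1.5), (8, 8), (9, 9), (8.5, 9.5)]
```

Running `kmeans(POINTS, 2)` returns a cluster index per point; what each cluster means in business terms is still left to the analyst.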

What is the difference between predictive analytics and machine learning?

Machine Learning is evolving to support the increasing volumes, varieties and velocities of Big Data projects, rather than the smaller, simpler datasets that typified data mining projects, particularly in academia. Another way to understand ML is as the next generation of data mining. Machine learning is a superset of predictive analytics because it involves more than the application of one or more predictive analytic techniques (and associated algorithms) to sets of input data. Another consideration is the current push toward commercial ‘productization’ of machine learning applications. Although data mining and statistical analysis have been widely used in particular domains, their broadest application, academic research, is implemented quite differently from commercial applications.

Specifically there are many steps in data preparation for predictive analytics (or ML) projects that are different from data preparation common for business analytics projects. Steps to prepare input data for predictive analytics include such tasks as the following:

• Evaluating data types and detecting or creating labels (for classification)

• Evaluating number / ratio of null values


• Evaluating quality/ usefulness of input data based on statistical analysis (mean, mode, etc…)

• Removing outlier values (exceptions)

• Creating groupings (called ‘bucketing’)
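Several of the preparation tasks above can be sketched with the Python standard library alone. The sample column, the z-score cutoff of 2.0 and the bucket edges are illustrative assumptions, not recommendations; appropriate thresholds depend on the data and the business question.

```python
import statistics


def null_ratio(values):
    """Fraction of entries that are missing (None)."""
    return sum(v is None for v in values) / len(values)


def summarize(values):
    """Basic statistics over the non-missing entries."""
    present = [v for v in values if v is not None]
    return {
        "mean": statistics.mean(present),
        "median": statistics.median(present),
        "modes": statistics.multimode(present),
    }


def remove_outliers(values, z_max=2.0):
    """Drop missing entries and values more than z_max standard deviations from the mean."""
    present = [v for v in values if v is not None]
    mean = statistics.mean(present)
    stdev = statistics.pstdev(present)
    if stdev == 0:
        return present
    return [v for v in present if abs(v - mean) / stdev <= z_max]


def bucket_index(value, edges):
    """Bucketing: index of the range a value falls into, given ascending edges."""
    return sum(value >= edge for edge in edges)


# Hypothetical 'age' column containing nulls and one data-entry error.
AGES = [34, 36, None, 38, 35, 37, 250, None, 36, 34]
```

Here `remove_outliers(AGES)` discards the implausible value 250, which would otherwise dominate the mean.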

Commercial tools provide data visualizers, which assist with data quality assessment at this stage and also facilitate easy modification of the input data. After the data preparation tasks have been completed, there is a three-step process to implement a machine learning solution or model. It is quite common for the model process to be iterative (because the outputs are probabilistic) during the model creation phase. Iterations often include returning to the data preparation phase, because adjusting the quality of the input data impacts outputs. The need for iteration over increasingly large data sets marries nicely with the scalability of cloud-based ML solutions.

These steps include the following:

• Input Data

o Ingest – in this step you ingest source data. Common ingest methods are file-based and database-based. Increasingly, accepting streaming input is a requirement.

o Evaluate & Clean – in this step you review the input data (often done using statistical analysis) and tune that data, so as to be prepared for inclusion in one or more ML models

• Model

o Select ML Algorithm and Initialize Model(s) – in this step you match the business question and input data to an ML technique (regression, classification or clustering) and one or more algorithms from within that technique (such as linear regression, decision trees or k-means clustering) to evaluate the possibility of building a useful model with this information


o Train Model(s) – in this step you create the model and load it with data, then process the model and view the output

o Score Model(s) – in this step you evaluate the effectiveness of model results vs. the ‘random guess’ line to understand the potential use of the model(s) for future predictions, classifications and clustering tasks

• Predict

o Perform Prediction – in this step you evaluate new data against the model in order to predict the likelihood of selected results.

These steps are often performed iteratively, as model scoring differentiates between multiple models. You may decide to repeat some or all of the cycle with slightly different input data, different algorithms, different algorithm parameters and so on, in order to produce one or more ‘useful’ models. Wizards and visualization tools found in ML products speed up these iterative cycles.
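The train, score and predict cycle described above can be sketched as follows. Everything here is an illustrative assumption: the synthetic rows, the hold-out rule (every fourth row), the choice of mean squared error as the score, and the use of a "predict the mean" model as a simple stand-in for the ‘random guess’ line.

```python
# Hypothetical input rows (x, y): a linear trend plus a small fixed perturbation.
NOISE = [0.5, -0.5, 0.2, -0.2]
ROWS = [(x, 3 * x + 2 + NOISE[x % 4]) for x in range(12)]


def holdout_split(rows, every=4):
    """Hold out every n-th row for scoring; train on the rest."""
    test = rows[::every]
    train = [r for i, r in enumerate(rows) if i % every != 0]
    return train, test


def train_mean_model(train):
    """Baseline model: always predict the mean of the training targets."""
    mean_y = sum(y for _, y in train) / len(train)
    return lambda x: mean_y


def train_linear_model(train):
    """Least-squares line fitted to the training rows."""
    xs = [x for x, _ in train]
    ys = [y for _, y in train]
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def mse(model, rows):
    """Score: mean squared error on held-out rows (lower is better)."""
    return sum((model(x) - y) ** 2 for x, y in rows) / len(rows)


def run_experiment(rows):
    """Train every candidate, score each on held-out data, return the best."""
    train, test = holdout_split(rows)
    candidates = {"baseline_mean": train_mean_model, "linear": train_linear_model}
    models = {name: trainer(train) for name, trainer in candidates.items()}
    scores = {name: mse(model, test) for name, model in models.items()}
    best_name = min(scores, key=scores.get)
    return best_name, models[best_name], scores
```

An iteration in this sketch means changing `ROWS`, the candidate list or a parameter and calling `run_experiment` again; the returned model is then used on new inputs, for example `model(20)`.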

Shown below is an open source project for RStudio called Shiny. Shiny is used by many R developers because it allows them to quickly and easily visualize (and query) models they created in the R programming language. Note the use of input parameters via slider bars and text boxes. These controls allow the ML developer to ‘try out’ different values in evaluating the usefulness of their model. Lightweight visualization tools for rapid iteration are particularly valuable for ML scenarios.


Figure 2 - Visualization of R results using Shiny

Is data science the same thing as machine learning?

Data science is a superset of Machine Learning: in addition to all of the tasks described in the previous section, data science also includes hypothesis formation or, more simply, ‘asking the right question(s).’ Data science, as shown in the graphic, involves domain expertise, healthy curiosity, scientific thinking and an understanding of math, statistics, algorithms, data input sets and visualization. Increasingly, a team of people in the enterprise is responsible for data science projects, because the needed skill sets are simply not found in any one or two people. These teams also benefit from using enterprise-grade tools, which facilitate communication and other enterprise needs, such as security and source control.

Figure 3 - Skills needed for Data Science

What is Artificial Intelligence and how does it relate to machine learning?

An AI (Artificial Intelligence) solution contains one or more intelligent agents. AI intelligent agents automate tasks that would normally require a highly trained person. Examples of this type of task are speech recognition and translation. An AI system is one that responds to complex problems in a human-like way. A well-known recent AI success is the celebrated win of the IBM Watson AI system against two top human players in the TV trivia game show Jeopardy!


In some ways, AI has more to do with process automation than learning because AI systems ingest vast amounts of source data and perform iterative ML processes, often over a period of years. In practice AI includes a number of ML components, so that the system and its processes can be increasingly optimized or can learn over time. You can see commercial application of AI systems in domains as disparate as medical diagnostics, self-driving cars, face and speech recognition and bank fraud detection.

What is Deep Learning and how does it relate to machine learning?

Deep Learning is a relatively new aspect of Machine Learning. It is a set of ML algorithms that attempt to model high-level abstractions in data by using multiple non-linear transformations. Deep Learning research focuses on improving the efficiency of unsupervised or semi-supervised feature learning algorithms. It is based on research in neuroscience, such as neural coding. The algorithms are deep neural networks, and problem sets include computer vision, natural language processing and speech recognition. Deep Learning has also been called the new definition of the ‘neural networks’ data mining algorithm.
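The phrase "multiple non-linear transformations" can be illustrated with a tiny two-layer network. In this sketch the weights are hand-picked rather than learned, which is an assumption made purely to keep the example short; actual Deep Learning systems learn their weights from data and stack many more layers. The example computes XOR, a function that no single linear transformation of the inputs can represent.

```python
def relu(vector):
    """Non-linear activation: negative values become zero."""
    return [max(0.0, v) for v in vector]


def dense(inputs, weights, biases):
    """One linear layer: a weighted sum per output unit plus a bias."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]


# Hand-picked (not learned) weights for a 2-input, 2-hidden-unit, 1-output network.
W1 = [[1.0, 1.0], [1.0, 1.0]]
B1 = [0.0, -1.0]
W2 = [[1.0, -2.0]]
B2 = [0.0]


def forward(x):
    """Linear layer, non-linearity, then a second linear layer."""
    hidden = relu(dense(x, W1, B1))
    return dense(hidden, W2, B2)[0]
```

Removing the `relu` call collapses the two layers into one linear function, and the network can no longer produce XOR; the non-linearity between layers is what gives depth its expressive power.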

Advances in hardware, particularly in GPU computational capabilities, have facilitated the use of Deep Learning, as they have enabled model-processing times to shrink from weeks or days to a more practical level, i.e. minutes. However, given the computational intensity, processing-time requirements still limit the widespread application of Deep Learning algorithms.

Deep Learning is sometimes loosely described as ‘strong AI’ because of its potential to disrupt a large number of processes. Major software companies are investing millions of dollars in research on improving the usability of Deep Learning in their own core products (such as their voice recognition systems Google Now, Microsoft Cortana and Apple Siri, among other products). Although the potential of Deep Learning is exciting, the reality is that, due to the time, cost, complexity and skills needed, the broad application of its results is still limited to experimental and (mostly) research projects at a small subset of companies, such as Google, IBM and Microsoft.


What is the importance of real-time analytics?

Broader adoption of technologies such as in-memory databases and streaming for Hadoop (Spark Streaming, Storm and Samza), along with new types of data providers, e.g. IoT data input devices, is increasing the demand for real-time analytics as a category. In addition, the creation of cloud-based data pipeline libraries and products enables the creation of more complex conduits for incoming data, including through multiple processing pipelines. Along with these advances in real-time Big Data technologies in general comes demand for products that enable rapid creation of solutions that also include real-time predictive analytics. Major software vendors are creating consumer products and services, such as adaptive voice input (Google Now, Microsoft Cortana and Apple Siri), that use real-time predictive analytics. These types of applications are igniting consumer imagination and fueling demand in general.


Process and Practicalities

Let’s take a deeper look at the processes involved in creating commercial machine learning solutions. We do so because, as mentioned, the process for creating useful commercial predictive analytics is quite different from that of creating business analytics. Digging into the detailed processes involved will help in our understanding of the usability of the libraries, tools and products currently available.

Business data projects are driven by the need to gain more or better business insights. Given that, what are the types of use cases that machine learning solutions can address? Remembering the core functionality of ML, i.e. predicting one or more discrete, future values, classifying or labeling new data into known groups and/or detecting natural groups in new data, here is a short list of some types of common use cases:

• Facilities & Manufacturing -- Smart Buildings, Predictive Maintenance

• Sales & Marketing -- Demand Forecasting, Churn Analysis, Targeted Advertising

• Biomedical -- Life Science Research, Healthcare Outcomes (patient re-admission rates)

• Security -- Fraud Detection, Network Intrusion Detection

• Logistics -- Routing

As mentioned, creating an end-to-end machine learning solution involves a number of considerations. Before the advent of cloud-based data storage, pipelines and machine learning model tooling, the costs involved in creating what were then called data mining solutions blocked many enterprises. These costs included high hardware and software license fees: often well over $100k, and up to $1 million simply to start what was often a multi-year project was not unheard of. Additionally, the costs of re-training staff or hiring specialty consultants to implement the data mining projects added to project cost and complexity. Prior to cloud-based data storage and cloud-based data pipeline products, the costs of unearthing enterprise data from the various (and often proprietary) on-premises data silos were a further adoption blocker. Yet another blocker to implementing traditional data mining was that the domain of the business analyst (or, in some cases, statistician) was wholly separated from that of the developers who would be charged with creating application interfaces for the results of the analysts’ data mining work.

Cloud storage combined with new types of Big Data storage has driven overall enterprise data volumes up dramatically. Increasingly large and complex data sets are becoming progressively more difficult to analyze in a meaningful way for the enterprise. Driven by particular sectors, such as the ML analysis of massive amounts of behavioral data collected in social gaming (Angry Birds, Halo, etc…), the enterprise appetite for getting started with ML projects has increased sharply over the last 12 months.

Although the landscape is improving due to the release of improved open source libraries and tools as well as new commercial tools, ML projects are a new type of analytics for most enterprises. Given that, for traditional enterprises the newly released cloud-based ML tools and services, such as Azure ML, IBM Watson, Predixion Software, AWS ML, BigML and others, are a welcome complement to the existing (mostly open source) languages, libraries and tools.

Another new item in the emerging ecosystem of enterprise tools and products designed to support enterprise ML projects is the emergence of commercial data markets. IBM, Microsoft and Predixion Software all include the ability to directly ‘publish’ the results of one or more useful ML experiments into their cloud-based repository or marketplace. Technically, most enable the ML experiment to be published as a REST-based web service endpoint.
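To show roughly what "published as a REST-based web service endpoint" means, here is a minimal sketch using only the Python standard library. The `/predict` path, the `week` input field and the linear `MODEL` coefficients are all hypothetical; the vendor products named above generate their own endpoints, request schemas and authentication, which this sketch does not attempt to reproduce.

```python
import io
import json

# Hypothetical trained model: coefficients of a linear scoring function.
MODEL = {"intercept": 8.0, "slope": 2.0}


def predict(week):
    """Apply the (assumed) trained model to one input value."""
    return MODEL["intercept"] + MODEL["slope"] * week


def json_response(start_response, status, payload):
    """Send a JSON body with the given HTTP status line."""
    body = json.dumps(payload).encode("utf-8")
    start_response(status, [("Content-Type", "application/json"),
                            ("Content-Length", str(len(body)))])
    return [body]


def app(environ, start_response):
    """Minimal WSGI application exposing POST /predict."""
    if environ.get("REQUEST_METHOD") != "POST" or environ.get("PATH_INFO") != "/predict":
        return json_response(start_response, "404 Not Found", {"error": "not found"})
    try:
        length = int(environ.get("CONTENT_LENGTH") or 0)
        payload = json.loads(environ["wsgi.input"].read(length) or b"{}")
        week = float(payload["week"])
    except (KeyError, TypeError, ValueError):
        return json_response(start_response, "400 Bad Request",
                             {"error": "expected a JSON body with a numeric 'week'"})
    return json_response(start_response, "200 OK", {"prediction": predict(week)})


def call_app(method, path, payload):
    """Invoke the app in-process (no network); returns (status, parsed JSON)."""
    raw = json.dumps(payload).encode("utf-8")
    environ = {
        "REQUEST_METHOD": method,
        "PATH_INFO": path,
        "CONTENT_LENGTH": str(len(raw)),
        "wsgi.input": io.BytesIO(raw),
    }
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    chunks = app(environ, start_response)
    return captured["status"], json.loads(b"".join(chunks))
```

For local experimentation the same `app` could be served with `wsgiref.simple_server.make_server("", 8000, app).serve_forever()`; a production deployment would add authentication, input validation and monitoring.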

Interestingly, cloud vendors are leveraging integration with their own cloud services. For example, Amazon ML includes the ability to enable real-time ML via a one-button click as shown in the screenshot below. This real-time capability is integrated with AWS S3 storage. AWS ML integrates with S3, RDS or Redshift at this time.


Figure 4 - Amazon ML Model Usage Options

This functionality not only facilitates quick and easy deployment to production of commercial ML services, but also has the interesting implication of providing the enterprise a commercial platform from which they can monetize the results of their ML experiments by making those results available as a commercial offering.


Shown below is a chart that lists many of the major offerings – either commercial or open source.

| Phase | Azure | AWS | Google | Commercial | Open Source |
| --- | --- | --- | --- | --- | --- |
| Ingest | Stream Insight | Kinesis | Big Query | Data Torrent | Flume |
| Pipeline | Data Pipeline | Data Pipeline | Data Pipeline | Data Torrent | Kafka |
| Storage | BLOB, Document DB, SQLAzure, HDInsight | S3, Dynamo DB, RDS – SQL, Redshift, EMR | BLOB, H/R Datastore, MySQL, Hadoop on GCE | SAS | NoSQL, Hadoop |
| Create Predictive Models | Azure ML, Revolution Analytics for R Language | AWS ML | Prediction API | SAS, IBM Watson, Predixion Software, BigML, Matlab, Mathematica, PredictionIO… | R, Mahout, Python Pandas, Weka |
| Predictive Results Publication and/or Visualization | Excel, Power BI Gateway, PowerView, Azure Data Market | AWS Lambdas, Partners | Google Charts | BigML, Dato, Predixion Marketplace, Tableau, Wolfram Language | D3 |

In some verticals, such as biomedical, it is common to have some form of academic data mining or statistics work (data sets and / or data mining models) to use as a basis for creating commercial machine learning solutions. One example is when you are turning that academic research into commercial biomedical products. Given that, we’ll list data mining languages, libraries and tools, which are commonly used in academic research. Also, it has been the case that traditional statistical tools and languages, i.e. Matlab, Mathematica, have high adoption in the research sector.


ML Academic Languages, Tools and Libraries

Some of these are open source, and most have free versions for academic research. Shown below is a chart that summarizes many of these items. We have included the communities category because academic data science communities are at the front edge of work on improving open source tools and libraries, and they bear watching when you are assessing the state of ML tools and products.

| Category | Objects | Notes |
| --- | --- | --- |
| Languages | R Language | Stats Language |
| Languages | SciPy/NumPy/Pandas | Python Libraries for ML |
| Languages | Matlab | Stats Language |
| Languages | Mathematica | Stats Language |
| Languages | Julia | Scalable Stats Language |
| Languages | Mahout | ML for Hadoop |
| Languages | Weka | Research Stats Language |
| Tools | R Studio | IDE for R |
| Tools | Shiny for R | Visualization for R |
| Tools | Weka Studio | IDE for Weka |
| Tools | PyCharm | IDE for Python |
| Tools | Sublime | IDE for Python and more |
| Communities | KDNuggets | Website |
| Communities | Kaggle | Competition |
| Communities | DataKind | Community |
| Communities | Open Gov/Open Data | Community |
| Communities | Code for America | Community |


Accessible to Data Scientists & Business Users

A key question around the practicality of ML solutions for the enterprise is this: who exactly will develop the ML solutions in the enterprise? Given the diverse set of skills needed to successfully implement any type of data science solution, much less the smaller (and even more complex) subset around ML, the first part of the answer is the most critical: ML projects are best implemented by a team of skilled professionals. Our answer to the common question “Do I just need to hire a statistician to implement an ML project?” is an unqualified “No!” Commercial ML differs substantially from ML for academic research. While the image of the lone scientist, toiling away in his/her lab and carefully analyzing the results via complex statistical calculations, is the heritage of ML, this image bears little relationship to the practicalities of implementing ML in the enterprise.

While there is definitely a place for a dedicated statistician on an enterprise ML team, this is no longer a requirement for all ML projects. That being said, ML tools complement (but do not substitute for) statistical and data mining domain expertise. What has changed with the advent of these tools is the ability of your key team members to work with others (business analysts, decision makers, developers, DevOps and so on), because the tools use common interfaces and well-designed dataflow visualizations. Most tools are also cloud-based, which means zero install and configuration and quick environment start-up time. Additionally, commercial tools are designed to scale storage and processing via cloud capacity, enabling faster movement from small dataset experiments to full-scale production deployments. Cloud-based tools are particularly well suited for building quick proof-of-concept projects for the enterprise.

Given the democratization of tooling, you may be wondering whether this new tooling is sophisticated enough for classically trained data scientists and academics to make full use of their complete skill sets. The answer is a conditional yes: some, but not all, commercial products, such as Azure ML, integrate with commonly used statistical languages (R Language and Python libraries) and allow re-use of scripts created in these languages.


Additionally, it’s important for researchers to have visibility into algorithms and algorithm parameters. This is important for reproducibility of published experiment results. Shown below is an Azure ML model, which uses two-class support vector machines to perform classification (of Tweets in this sample). Also of note is the ability to use R Language scripts in an ML workflow:

Figure 5 - Azure ML Experiment


Model evaluation is a key component of an ML experiment. Here is sample output from the Azure ML model evaluation visualization. You’ll note that both score information (table) and graphical output are included in the visualization:

Figure 6 - Azure ML Model Evaluation Output


For comparison, shown below is output from a sample Amazon ML model evaluation:

Figure 7 - Amazon ML Model Evaluation Visualization


Accessible to Developers & BI/DW Professionals

An interesting and somewhat unexpected aspect of ML enterprise projects is that having one or more Big Data repositories is in no way a requirement for undertaking this type of project. Due to the origins of ML, i.e. academic research using statistics and data mining, some of the most useful ML projects are, in fact, based on application of these techniques to line-of-business (LOB) data. You can think of it as being able to ask different kinds of questions of your current data. Understanding when to use ML (and when not to) relates directly to the definitions of business and predictive analytics. Simply put, use ML when you want to ask business questions that will result in probabilistic answers.

The ability to ask predictive questions of LOB data often yields useful results. For example, it has been quite common to begin ML projects in sales and marketing departments, using CRM data as source for ML experiments that involve answering business questions like ‘what are the characteristics of the customers who produce the most revenue?’ (Clustering) and ‘what type of cross-sell opportunities can we introduce on our website based on known customer purchase patterns?’ (Classification).

Another common ‘entry point’ for ML solutions in the enterprise is in using IT (log) data. Regulatory (access auditing) and compliance requirements – and also general security concerns, drive ML experiments such as ‘at what day / time can I expect that network bandwidth usage will spike to a particular level (value) for a particular segment of my corporate users?’ (Regression).


In general, the enterprise can find value in appropriately applying predictive analytics via ML solutions to a broad spectrum of domains. In addition to sales and marketing or DevOps, enterprises can apply ML to other scenarios for which probabilistic analysis would yield useful results. For example, questions such as these can now be addressed:

• Which employee attributes correlate most closely with the highest revenue production of that employee’s team?

• At what future point in time (value) do our customers in a certain segment (e.g. demographic, geographic…) tend to make a subsequent purchase?

• What groups (trial or free items) of our public resources (website, GitHub, YouTube…) tend to be used by browsers who become our customers?
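As an illustration of the first question in the list above, the sketch below ranks attributes by the strength of their Pearson correlation with team revenue. The attribute names and values are hypothetical, and Pearson correlation is only one possible measure, chosen here for simplicity.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either series is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-employee attributes and the revenue of each employee's team.
attributes = {
    "years_experience": [1, 3, 5, 7, 9],
    "training_hours": [40, 10, 30, 20, 25],
}
team_revenue = [110, 130, 150, 170, 190]

# Rank attributes by absolute correlation, strongest first.
ranked = sorted(
    attributes,
    key=lambda name: abs(pearson(attributes[name], team_revenue)),
    reverse=True,
)
for name in ranked:
    print(name, round(pearson(attributes[name], team_revenue), 3))
```

A strong correlation flags an attribute worth investigating; it does not by itself show that the attribute drives revenue.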

As mentioned, integrated tooling provided by commercial vendors enables simpler deployment and embedding of ML model results into enterprise applications via their ‘publish as a web service’ functionality. Given that relatively few enterprise application developers have familiarity with, much less expertise in, ML languages, tools and libraries, using commercial ML tools that include ‘click to publish’ functionality significantly speeds up time to market.
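From the application developer's side, consuming a published model usually amounts to an HTTP call. The sketch below only builds such a request; the endpoint URL, the bearer-token header, the `inputs` payload shape and the feature names are all assumptions for illustration, since each vendor defines its own request schema and authentication scheme.

```python
import json
import urllib.request


def build_scoring_request(endpoint_url, api_key, features):
    """Build (but do not send) a JSON POST request for a published scoring service."""
    # Hypothetical payload shape; consult the vendor's documentation for the real one.
    body = json.dumps({"inputs": [features]}).encode("utf-8")
    return urllib.request.Request(
        endpoint_url,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
        },
        method="POST",
    )


# Placeholder endpoint and key; neither refers to a real service.
request = build_scoring_request(
    "https://example.invalid/score",
    "HYPOTHETICAL_KEY",
    {"tenure_months": 18, "plan": "pro"},
)
print(request.get_method(), request.full_url)

# To call a real service, send the request and decode the JSON response:
# with urllib.request.urlopen(request) as response:
#     prediction = json.load(response)
```

Keeping request construction separate from sending makes the integration easy to unit test without network access.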

Another advantage of using commercial ML tools for the enterprise is the built-in connectors to disparate incoming data sources. Given that it is increasingly common to use a broad variety of data sources as ML ingest sources, the availability of pre-built connectors once again speeds development cycles. It is common to include connectors for LOB data, i.e. RDBMS systems (both on-premises and cloud-based), as well as for some of the newer NoSQL databases, Hadoop, and one or more types of incoming data stream.


Also useful are the quick statistical snapshots that most commercial ML tools provide of datasets in your ML project. For example, the AWS ML dataset console view includes the visualization shown below:

Figure 8 - AWS ML Datasources Attribute Information

The AWS viewer allows the ML team to ‘see’ not only the attribute names, but also the correlations, the uniqueness of the data and the most and least frequent categories. It also includes an inline ‘Preview’ visualization of the uniqueness of the data.
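For teams that want a similar snapshot outside a commercial console, the following minimal sketch computes a few of the same per-attribute statistics (missing values, distinct values, most and least frequent categories). The rows and attribute names are hypothetical, and this is not how the AWS viewer itself is implemented.

```python
from collections import Counter


def summarize_attribute(values):
    """Summarize one categorical attribute; None is treated as a missing value."""
    present = [v for v in values if v is not None]
    counts = Counter(present)
    ranked = counts.most_common()  # sorted from most to least frequent
    return {
        "missing": len(values) - len(present),
        "distinct": len(counts),
        "most_frequent": ranked[0][0] if ranked else None,
        "least_frequent": ranked[-1][0] if ranked else None,
    }


# Hypothetical dataset rows.
rows = [
    {"region": "west", "plan": "pro"},
    {"region": "west", "plan": "free"},
    {"region": "west", "plan": "free"},
    {"region": "east", "plan": "free"},
    {"region": "east", "plan": "pro"},
    {"region": "north", "plan": "free"},
    {"region": None, "plan": "free"},
]

snapshot = {
    name: summarize_attribute([row[name] for row in rows])
    for name in ("region", "plan")
}
for name, stats in snapshot.items():
    print(name, stats)
```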

As mentioned, integrated commercial ML tooling that includes ‘one-click’ deployment capabilities increases usability for developers and BI professionals. Additionally, offerings that essentially advertise published ML web services, such as Microsoft Azure Data Market, provide additional discoverability and usability. Commerce opportunities for published services are also emerging. An example is shown below.


Figure 9 - Azure Machine Learning Test Harness


Visualization of results is another element of ML solution usability. To that end, we’ve included a sample from IBM Watson Analytics. This service includes flexible visualizations at all phases of the ML process (e.g. data discovery, modeling, etc.). An example is shown below.

Figure 10 - IBM Watson ML Visualization


Our last example of model visualization is from the commercial cloud-based vendor BigML and is shown below. Also interesting is how vendors such as BigML foster community by providing a platform for their users to get more value from their ML models. You’ll note that BigML allows users to upload, share, rate and also sell models for use by others in their own ML scenarios.

Figure 11 - BigML Model Visualization


Key Takeaways

Incorporating the results of machine learning experiments into production data solutions adds significant complexity to the overall projects. Given this, a solid understanding of technology choices around machine learning solutions is essential for designing and delivering solutions that provide business value to the organization.

• Use commercial machine learning products when team members new to machine learning processes are creating your solution. Due to fundamental differences at every stage in the data pipeline, i.e. data preparation, hypothesis formation, algorithm selection, model training and evaluation, ML projects introduce a set of complex processes into the enterprise. If your data paradigm consists of an OLTP store alone, you would be best served by leveraging commercial ML development suites, rather than attempting to cobble together solutions based on tools and libraries that were built primarily for statisticians.

• Select tools or coding libraries that handle data ingest and processing at the speed and scale required by the types of machine learning methods your business problems call for. Enterprises will benefit from leveraging cloud storage and processing of Big Data workloads as sources for ML solutions because their data volumes are generally significantly larger than those of academic research. Also, in-memory streams are increasingly relevant, particularly with the advent of more and more IoT scenarios.

• Teams that have already implemented pure open source data solutions are most capable of adding pure open source machine learning solutions. Domains where data mining and/or statistics may already have been in use, such as academic research, will have more success using open source tools and libraries, so long as their input data does not overrun the capabilities of those tools.

• Plan for and test your model deployment topology to ensure ML experiments deliver production business value. Given the common challenges around deployment of ML models, commercial vendors are incorporating one-click deployment functionality in their ML studio environments; such functionality enables faster time to market for production solutions. Also consider the vendor path to implementing streaming or near-real-time ML solutions if that is part of your requirements.

• Select tools or plan for coding appropriate types of visualization solutions. ML outputs are unfamiliar to many business users. Standard reports and dashboards have not been designed to display ML results in a meaningful way. Selecting ML vendors that integrate results easily into other commercial solutions or common libraries results in broader usability for ML solutions.


References and Resources

This section lists the references and resources referred to in this article.

Data Science graphic -- http://civicscience.com/data-science-a-visual-guide/
Shiny for R-Studio -- http://shiny.rstudio.com/gallery/movie-explorer.html
Deep Learning and the Hololens -- https://technoptimist.wordpress.com/2015/01/25/deep-learning-and-the-hololens
Collection of papers on how IBM Watson works -- http://www.andrew.cmu.edu/user/ooo/watson/
What is AI? -- http://www.techopedia.com/definition/190/artificial-intelligence-ai
How Google is Teaching Computers to See -- https://gigaom.com/2012/06/25/how-google-is-teaching-computers-to-see/
Need Deep Learning? Here are 4 Lessons from Google -- https://gigaom.com/2015/01/29/new-to-deep-learning-here-are-4-easy-lessons-from-google/
Getting started with AWS ML -- http://docs.aws.amazon.com/machine-learning/latest/dg/tutorial.html
AzureML on Windows Azure DataMarket / Binary Classifier Sample -- https://datamarket.azure.com/dataset/aml_labs/log_regression
BigML Sample Model -- https://bigml.com/user/ashikiar/gallery/model/53b2f21ec8db635905000d33
Kaggle Community -- https://www.kaggle.com/
DataKind Community -- http://www.datakind.org/


Table of Abbreviations

Abbreviation    Full Term
AI              Artificial Intelligence
AWS             Amazon Web Services
BI              Business Intelligence
CRM             Customer Relationship Management
DW              Data Warehouse
GPU             Graphics Processing Unit
IoT             Internet of Things
LOB             Line of Business
ML              Machine Learning
NoSQL           No SQL
OLAP            Online analytical processing
OLTP            Online transactional processing
POC             Proof-of-concept
RDBMS           Relational Database Management System


About Lynn Langit

Lynn Langit is a Big Data and Cloud Architect who has been working with database solutions for more than 15 years. Over the past 4 years, she’s been working as an independent architect using these technologies, mostly in the biotech, education, manufacturing and facilities verticals. Lynn has done POCs and has helped teams build solutions on the AWS, Azure, Google and Rackspace Clouds. She has done work with SQL Server, MySQL, AWS Redshift, AWS MapReduce, Cloudera Hadoop, MongoDB, Neo4j, Aerospike and many other database systems. In addition to building solutions, Lynn also partners with all major cloud vendors, providing early technical feedback into their Big Data and Cloud offerings. She is an AWS Community Hero, Google Developer Expert (Cloud), Microsoft MVP (SQL Server) and a MongoDB Master. Lynn is also a Cloudera certified instructor (for MapReduce Programming).

Prior to re-entering the consulting world 3 years ago, Lynn spent over 10 years as a Microsoft Certified instructor and a Microsoft vendor, and then 4 years as a Microsoft employee. She’s published 3 books on SQL Server Business Intelligence and has most recently worked with the SQL Azure team at Microsoft. She continues to write and screencast, and hosts a BigData channel on YouTube (http://www.youtube.com/SoCalDevGal) with over 150 different technical videos on Cloud and BigData topics. Lynn is also a committer on several open source projects (http://github.com/lynnlangit).

About Mark Tabladillo

Mark Tabladillo is a Senior Data Scientist at midtown Atlanta's Predictix/LogicBlox. He has used and promoted Microsoft Azure Machine Learning, Microsoft SQL Server Data Mining, Microsoft BI Stack, Power BI, SAS, SPSS, R, and Julia. He is a SQL Server MVP and has a research doctorate (PhD) from Georgia Tech. He is chapter leader for the PASS Data Science Virtual Chapter, which has periodic live meetings and its own YouTube channel.
