Practical Machine Learning: discerning differences and selecting the best approach

Lynn Langit
Reviewed by Mark Tabladillo


Post on 16-Jul-2015


TABLE OF CONTENTS

Executive summary
Introduction
Concepts
Process and Practicalities
Accessible to Data Scientists & Business Users
Accessible to Developers & BI/DW Professionals
Key Takeaways
References and Resources
Table of Abbreviations
About Lynn Langit
About Mark Tabladillo

 

 

Executive summary

The formal definition of Machine Learning is this: the ability of computing systems to gain knowledge from experience. Practical ML enables your organization to answer business questions more effectively because of that experience. Machine Learning solutions consist of your input data built into models, which combine that data with statistical and data mining algorithms.

Until relatively recently, applied ML (as contrasted with ML for research) was simply too specialized, difficult and expensive to see broad adoption outside of the academic community and a few commercial domains (finance, ad serving). However, improvements in languages and libraries, as well as new commercial offerings (including cloud-only products), have greatly increased the practicality of implementing ML applications. Demand has also been fueled by Big Data: more data encourages more powerful methods of processing to gain understanding from that data.

This report will discuss technologies and implementation approaches for creating enterprise data solutions that include one or more machine learning components. The report will also detail the tradeoffs of each solution and determine which approach best fits organizational needs.

Introduction

The term ‘Predictive Analytics’ is used somewhat interchangeably with Machine Learning. The central idea is that Machine Learning enables the creation of important business insights by analyzing some set of input data with one or more data mining or statistical algorithms.

Where Machine Learning is used

In some sectors, particularly academic research, statistical analysis and data mining have been standard analytical techniques for years. These sectors tend to use open source languages, tools and libraries. Academics commonly use specialty coding languages such as R, or Python libraries (SciPy/NumPy/Pandas), rather than enterprise languages such as Java, for their ML research projects. Researchers also tend to work with wide (many attributes) and shallow (relatively small sample sizes) datasets. This academic dataset size is significant because many of the commonly used tools, such as RStudio or even Weka, are designed for small (albeit rich) datasets. They are limited to working with datasets that fit in the memory of an analyst’s desktop computer, rather than drawing on server or even cloud-scale processing power.

In a few commercial sectors, such as finance (for example, credit scoring) and security (for example, email spam detection), the use of ML (via data mining) is not a new approach. In these areas, highly specialized tools and specially trained professionals have supported these types of solutions. These vertical-specific ML solutions cost hundreds of thousands or even millions of dollars to implement. These costs include software licenses, powerful hardware, proprietary development and management tools, and consulting fees. These types of projects have also commonly taken months or even years to implement.

However, the ML market landscape is rapidly changing with the availability of Big Data/cloud storage, processing and data pipelines. These new services enable faster and cheaper data collection, storage and processing. The growth of IoT (mostly sensor) data is also increasing the volume of data available for analysis. These market changes are making the overall ‘entry point’ for ML projects less risky, i.e. cheaper and faster. Another driver of adoption is the effort that commercial vendors are putting into creating usable ML tooling, most of which runs on that particular vendor’s cloud infrastructure (such as IBM Watson on Bluemix, Microsoft Azure ML on Azure or Amazon ML). Given the larger market landscape, ML projects are increasingly seen as a realistic possibility. Simply put, more data means a need for more powerful methods of deriving meaning from increasingly large and complex datasets. Enter the democratization of Machine Learning.

Challenges to Adoption

Although tools are reducing the complexity of applying statistical and data mining techniques to increasingly large data sets, the enterprise market is in the early stages of ML adoption. One of the key blockers is complexity: creating useful predictive analytics or ML differs substantially from the more traditional business analytics.

Because the market for technical professionals skilled in applied statistics and data mining has traditionally been small, we are faced with a lack of trained, working professionals who can produce useful results in this area. Specifically, we lack those who have experience performing the tasks needed in the enterprise ML solution lifecycle, such as cleaning and grooming the input data, selecting appropriate techniques and algorithms, building and evaluating models, and supporting the move of their work to production.

Vendors are stepping in to reduce this gap. Several major commercial vendors have launched general-purpose machine learning suites this year. As mentioned, the majority of these new offerings are cloud-based. Some solutions offer you the ability to train, test and deploy either in the cloud or on premises, while other solutions, such as BigML, are cloud-only.

 

Concepts

Taxonomies and terms for Machine Learning solutions have important and nuanced differences in meaning; proper understanding is key to differentiating the products and solutions available in the ML space. We’ll start by providing definitions of associated technologies.

What is the difference between business analytics and predictive analytics?

Business Analytics is defined as finding answers to business questions by querying data and producing a definite result or result set. For example: “What are the top five items that are found in a shopping basket for a 38 year old man from California who is shopping on a Saturday at 5pm at a major grocery chain?” The answer to this question (via a query to source data) is a deterministic result set, usually shown as a report or a dashboard. For many enterprises, this is the only type of analytics that they have available. Stated differently, business analytics are used to analyze “what has happened” for past events.
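The deterministic nature of a business analytics query can be sketched in a few lines of Python. This is a minimal illustration, not taken from any product: the `transactions` records, their field layout and the `top_items` helper are all hypothetical. The point is that the same query over the same data always returns the same result set.

```python
from collections import Counter

# Hypothetical transaction records: (customer_age, state, weekday, hour, item).
transactions = [
    (38, "CA", "Sat", 17, "milk"),
    (38, "CA", "Sat", 17, "bread"),
    (38, "CA", "Sat", 17, "milk"),
    (38, "CA", "Sat", 17, "eggs"),
    (52, "NY", "Mon", 9, "coffee"),
    (38, "CA", "Sat", 17, "bread"),
    (38, "CA", "Sat", 17, "milk"),
]


def top_items(rows, age, state, weekday, hour, n=5):
    """Count items in the matching baskets and return the n most frequent."""
    counts = Counter(
        item
        for (row_age, row_state, row_day, row_hour, item) in rows
        if (row_age, row_state, row_day, row_hour) == (age, state, weekday, hour)
    )
    return counts.most_common(n)


# A plain query: no model, no probabilities, just a count over past events.
result = top_items(transactions, 38, "CA", "Sat", 17)
print(result)
```

In practice this query would typically be expressed in SQL against a data warehouse; the Python form is used here only to keep the examples in one language.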

Predictive Analytics is defined as finding answers to business questions by applying one or more probabilistic algorithms to some set of input data and producing one or more probabilistic results. For example: “Consider the items which appear together in the shopping baskets of all 38 year old men from California who are shopping on a Saturday at 5pm at any of the major grocery chain stores for which we have data and predict how many of a given item from this set the stores should have on hand to ensure proper supply for this type of customer.” In this case, the type of algorithm is regression because it is used to predict a future value or set of values. To get a result, one or more regression algorithms (for example, linear regression) are applied to the source data. Because the results are probabilistic, i.e. a percentage or score of likelihood of a result, it is common to use more than one algorithm and then to evaluate the quality of each result. This process is called ‘evaluating the model.’ The best result from the models is selected and is presented either via statistical output (probability) or via a customized visualization. Stated differently, predictive analytics are used to analyze “what will happen” for potential or future events. The graphic below illustrates and contrasts sample results in business and predictive analytics.

Figure 1 - Two Types of Analytics
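To make the regression idea concrete, here is a minimal sketch of linear regression with a single input, fitted by ordinary least squares. The data (`shoppers`, `units_sold`) is invented for illustration, and the helper names are our own; a real project would more likely use a statistics library or one of the ML products discussed in this report.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one input: y is modeled as slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical history: shoppers of this customer type per Saturday,
# and units of one item sold on that day.
shoppers = [10, 20, 30, 40, 50]
units_sold = [12, 19, 33, 41, 48]

slope, intercept = fit_line(shoppers, units_sold)


def predict_units(expected_shoppers):
    """Predict units needed for a future day from the fitted line."""
    return slope * expected_shoppers + intercept


def r_squared(xs, ys):
    """Share of the variance in ys explained by the fitted line (1.0 is a perfect fit)."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - predict_units(x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


fit_quality = r_squared(shoppers, units_sold)
print(predict_units(60), fit_quality)
```

The prediction is an estimate with an associated quality score rather than a definite answer, which is the key contrast with the business analytics query.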

What is the difference between data mining and predictive analytics?

Data Mining encompasses a broader set of tasks than those included in predictive analytics. In addition to regression algorithms, data mining also includes other types of predictive analysis. Specifically, finding groupings in the source data by matching new data to existing labeled (or categorized) data is called classification. Classification algorithms are characterized as ‘supervised’ because, in addition to the algorithm, an authoritative set of labeled data is used to process the input data. For example: “In a set of data there are examples of pictures or drawings of objects that we’ve identified and labeled as particular animals, i.e. ‘this is a picture of a dog and that is a picture of a cat.’” A classification task is to evaluate the likelihood of a new picture being a dog or a cat based on pattern matching to the set of known states. An example of a classification algorithm is decision trees. Of note is that regression is also ‘supervised’ because a data set with ‘known values’ is used in conjunction with the application of the regression algorithm when evaluating the probability of a result using new input data.
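A supervised classifier can be sketched with a decision stump, which is the simplest possible decision tree (a single split on one feature). The example below is an illustration under stated assumptions: instead of pictures it uses one invented numeric feature (`weights_kg`), and the `train_stump` and `classify` helpers are hypothetical names, not part of any library.

```python
def train_stump(values, labels):
    """Learn a one-split decision tree on a single numeric feature.

    Tries every midpoint between consecutive distinct values as a threshold and
    keeps the split that misclassifies the fewest labeled training examples.
    Returns (threshold, label_at_or_below, label_above).
    """
    classes = sorted(set(labels))
    if len(classes) != 2:
        raise ValueError("this sketch handles exactly two classes")
    ordered = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(ordered, ordered[1:])]
    best = None
    for threshold in candidates:
        for below, above in (tuple(classes), tuple(reversed(classes))):
            errors = sum(
                (below if value <= threshold else above) != label
                for value, label in zip(values, labels)
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, below, above)
    if best is None:
        raise ValueError("need at least two distinct feature values")
    _, threshold, below, above = best
    return threshold, below, above


def classify(stump, value):
    """Label a new, unseen value using the learned split."""
    threshold, below, above = stump
    return below if value <= threshold else above


# Hypothetical labeled ('known state') training data.
weights_kg = [3.5, 4.0, 4.8, 5.2, 9.0, 14.0, 22.0, 30.0]
animals = ["cat", "cat", "cat", "cat", "dog", "dog", "dog", "dog"]

stump = train_stump(weights_kg, animals)
print(stump, classify(stump, 4.1), classify(stump, 18.0))
```

The labeled training set is what makes this ‘supervised’: without the known labels there would be nothing to score candidate splits against.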

Discovering natural groupings in source data for which there are no known states or labels is called clustering. Since there are no known states when clustering algorithms are used, this type of machine learning is called ‘unsupervised’. An example of this technique is ‘here are some pictures, group them into subsets based on characteristics (or labels) that are discovered during the process of running the algorithm.’ As with the other types of ML, when implementing clustering it is common to use multiple clustering algorithms (k-means is one example), then to evaluate the model results, and finally to select the top performing algorithm and model for the particular business problem.
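As an illustration of unsupervised learning, the following is a small sketch of k-means (Lloyd's algorithm) on two-dimensional points. The customer data is invented, and seeding the centroids with the first k points is a simplifying assumption made so the run is repeatable; production implementations usually use random or smarter initialization.

```python
def kmeans(points, k, iterations=20):
    """Lloyd's algorithm on 2-D points; the first k points seed the centroids."""
    centroids = [points[i] for i in range(k)]
    assignments = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(
                range(k),
                key=lambda c: (p[0] - centroids[c][0]) ** 2
                + (p[1] - centroids[c][1]) ** 2,
            )
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                new_centroids.append(
                    (
                        sum(m[0] for m in members) / len(members),
                        sum(m[1] for m in members) / len(members),
                    )
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # converged: nothing moved
        centroids = new_centroids
    return centroids, assignments


# Hypothetical unlabeled customers: (annual spend in $k, store visits per month).
customers = [(1.0, 2.0), (9.0, 8.0), (1.5, 1.8), (8.5, 9.0), (1.2, 2.2), (9.2, 8.4)]

centroids, assignments = kmeans(customers, k=2)
print(centroids, assignments)
```

No labels are supplied; the two groups emerge from the data, and it is up to the analyst to interpret what each discovered cluster means.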

What is the difference between predictive analytics and machine learning?

Machine Learning is evolving to support the increasing volumes, varieties and velocities of Big Data projects, rather than the smaller, simpler datasets that typified data mining projects, particularly in academia. Another way to understand ML is as the next generation of data mining. Machine learning is a superset of predictive analytics because it involves more than the application of one or more predictive analytic techniques (and associated algorithms) to sets of input data. Another consideration is the current push toward commercial ‘productization’ of machine learning applications. Although data mining and statistical analysis have been widely used in particular domains, the broadest application, academic research, is implemented quite differently from commercial applications.

Specifically there are many steps in data preparation for predictive analytics (or ML) projects that are different from data preparation common for business analytics projects. Steps to prepare input data for predictive analytics include such tasks as the following:

• Evaluating data types and detecting or creating labels (for classification)

• Evaluating number / ratio of null values

 

• Evaluating quality / usefulness of input data based on statistical analysis (mean, mode, etc…)

• Removing outlier values (exceptions)

• Creating groupings (called ‘bucketing’)
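The preparation tasks listed above can be sketched with the Python standard library. Everything here is illustrative: the `ages` column, the plausible-range rule used for outliers and the bucket edges are assumptions chosen for the example, not recommendations.

```python
from statistics import mean, mode

# Hypothetical raw column: customer ages with missing values (None) and one outlier.
ages = [34, 38, None, 41, 38, 29, None, 38, 45, 230]


def null_ratio(values):
    """Fraction of entries that are missing."""
    return sum(v is None for v in values) / len(values)


def remove_outliers(values, low, high):
    """Keep only values inside a plausible range [low, high] (an assumed rule)."""
    return [v for v in values if low <= v <= high]


def bucket(value, edges, labels):
    """Return the label of the first bucket whose upper edge exceeds the value."""
    for edge, label in zip(edges, labels):
        if value < edge:
            return label
    return labels[-1]


missing = null_ratio(ages)                      # evaluate the ratio of null values
present = [v for v in ages if v is not None]    # drop the nulls
cleaned = remove_outliers(present, 0, 120)      # remove exceptions such as 230
summary = {"mean": mean(cleaned), "mode": mode(cleaned)}  # basic statistics

edges = [30, 40, 50]
labels = ["under 30", "30-39", "40-49", "50+"]
bucketed = [bucket(v, edges, labels) for v in cleaned]    # create groupings

print(missing, summary, bucketed)
```

Each step changes what the model will later see, which is why iterations often return to this phase.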

Commercial tools provide data visualizers, which assist with data quality assessment at this stage and also facilitate easy modification of the input data. After the data preparation tasks have been completed, there is a three-step process to implement a machine learning solution or model. It is quite common for this process to be iterative (because the outputs are probabilistic) during the model creation phase. Iterations often include returning to the data preparation phase, because adjusting the quality of the input data impacts outputs. The need for iteration over increasingly large data sets marries nicely with the scalability of cloud-based ML solutions.

These steps include the following:

• Input Data

o Ingest – in this step you ingest source data. Common ingest methods are file-based and database-based; increasingly, accepting streaming input is also a requirement.

o Evaluate & Clean – in this step you review the input data (often done using statistical analysis) and tune that data, so as to be prepared for inclusion in one or more ML models

• Model

o Select ML Algorithm and Initialize Model(s) – in this step you match the business question and input data to an ML technique (regression, classification or clustering) and one or more algorithms from within that technique (such as linear regression, decision trees or k-means clustering) to evaluate the possibility of building a useful model with this information

 

o Train Model(s) – in this step you create the model and load it with data; you then process the model and view the output

o Score Model(s) – in this step you evaluate the effectiveness of model results vs. the ‘random guess’ line to understand the potential use of the model(s) for future predictions, classifications and clustering tasks

• Predict

o Perform Prediction – in this step you evaluate new data against the model in order to predict the likelihood of selected results.

These steps are often performed iteratively, as model scoring differentiates between multiple models. You may decide to repeat some or all of the cycle with slightly different input data, different algorithms, different algorithm parameters, etc… in order to produce one or more ‘useful’ models. Wizards and visualization tools found in ML products speed up these iterative cycles.
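The ingest, clean, train, score and predict steps described above can be strung together in a short sketch. The data, the column names and the nearest-class-mean model are all assumptions made for illustration. As a simple stand-in for the ‘random guess’ line, the score step compares the model against a baseline that always guesses the most common training label.

```python
import csv
import io

# Hypothetical source file contents (file-based ingest).
RAW_CSV = """monthly_visits,churned
12,no
1,yes
9,no
,no
2,yes
11,no
3,yes
10,no
0,yes
8,no
2,yes
13,no
"""


def ingest(text):
    """Ingest: read CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(rows):
    """Evaluate & clean: drop rows with a missing feature and convert types."""
    return [
        (float(r["monthly_visits"]), r["churned"])
        for r in rows
        if r["monthly_visits"] != ""
    ]


def train(examples):
    """Train: a nearest-class-mean model stores the mean feature value per label."""
    sums = {}
    for x, label in examples:
        total, count = sums.get(label, (0.0, 0))
        sums[label] = (total + x, count + 1)
    return {label: total / count for label, (total, count) in sums.items()}


def predict(model, x):
    """Predict: choose the label whose stored mean is closest to x."""
    return min(model, key=lambda label: abs(x - model[label]))


def accuracy(model, examples):
    """Score: share of held-out examples the model labels correctly."""
    return sum(predict(model, x) == label for x, label in examples) / len(examples)


def baseline_accuracy(train_examples, test_examples):
    """Score of always guessing the most common training label."""
    labels = [label for _, label in train_examples]
    majority = max(set(labels), key=labels.count)
    return sum(label == majority for _, label in test_examples) / len(test_examples)


examples = clean(ingest(RAW_CSV))
train_set, test_set = examples[:7], examples[7:]  # hold out the last rows for scoring

model = train(train_set)
model_accuracy = accuracy(model, test_set)
base_accuracy = baseline_accuracy(train_set, test_set)
print(model, model_accuracy, base_accuracy, predict(model, 1.0))
```

If the model did not beat the baseline, the next iteration would change the input data, the algorithm or its parameters and repeat the cycle.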

Shown below is Shiny, an open source project from RStudio. Shiny is used by many R developers because it allows them to quickly and easily visualize (and query) models they created in the R programming language. Note the use of input parameters via slider bars and text boxes. These controls allow the ML developer to ‘try out’ different values in evaluating the usefulness of their model. Lightweight visualization tools for rapid iteration are particularly valuable for ML scenarios.

 

Figure 2 - Visualization of R results using Shiny

Is data science the same thing as machine learning?

Data science is a superset of Machine Learning: in addition to all of the tasks described in the previous section, data science also includes hypothesis formation or, more simply, ‘asking the right question(s).’ Data science, as shown in the graphic, involves domain expertise, healthy curiosity, scientific thinking, and an understanding of math, statistics, algorithms, data input sets and visualization. Increasingly, a team of people in the enterprise is responsible for data science projects, because the skill sets needed are simply not found in any one or two people. These teams also benefit from using enterprise-grade tools, which facilitate communication and other enterprise needs, such as security and source control.

Figure 3 - Skills needed for Data Science

What is Artificial Intelligence and how does it relate to machine learning?

An AI (Artificial Intelligence) solution contains one or more intelligent agents. AI intelligent agents automate tasks that would normally require a highly trained person to do. Examples of this type of task are speech recognition and translation. An AI system is one that responds to complex problems in a human-like way. A well-known AI success of late is the celebrated win of the IBM Watson AI system against two top human players in the TV trivia game show Jeopardy!.

 

In some ways, AI has more to do with process automation than learning because AI systems ingest vast amounts of source data and perform iterative ML processes, often over a period of years. In practice AI includes a number of ML components, so that the system and its processes can be increasingly optimized or can learn over time. You can see commercial application of AI systems in domains as disparate as medical diagnostics, self-driving cars, face and speech recognition and bank fraud detection.

What is Deep Learning and how does it relate to machine learning?

Deep Learning is a relatively new aspect of Machine Learning. It’s a set of algorithms in ML that attempt to model high-level abstractions in data by using multiple non-linear transformations. Deep Learning focuses on improving the efficiency of unsupervised or semi-supervised feature learning algorithms. It’s based on research in human neuroscience, such as human neural coding. Its algorithms are deep neural networks, and its problem sets include computer vision, natural language processing and speech recognition. Deep Learning has also been called the new definition of the ‘neural networks’ data-mining algorithm.

Advances in hardware, particularly around GPU computational capabilities, have facilitated the use of Deep Learning by enabling model-processing times to shrink from weeks or days to a more practical level, i.e. minutes. However, given the computational intensity, it is still the case that computational (processing time) requirements limit the widespread application of Deep Learning algorithms.

Deep Learning is sometimes called ‘strong AI’ because of its potential to disrupt a large number of processes. Major software companies are investing millions of dollars in research to improve the usability of Deep Learning in their own core products (such as their voice recognition systems: Google Now, Microsoft Cortana, Apple Siri and other products). Although the potential of Deep Learning is exciting, the reality is that, due to the time, cost, complexity and skills needed, broad application of its results is still limited to experimental and (mostly) research projects at a small subset of companies, such as Google, IBM and Microsoft.

 

What is the importance of real-time analytics?

Broader adoption of technologies such as in-memory databases and streaming Hadoop (Spark Streaming, Storm and Samza), along with new types of data providers, e.g. IoT data input devices, is increasing the demand for real-time analytics as a category. In addition, the creation of cloud-based data pipeline libraries and products enables more complex conduits for incoming data, including through multiple processing pipelines. Along with these advances in real-time Big Data technologies in general comes demand for products that enable rapid creation of solutions that also include real-time predictive analytics. Major software vendors are creating consumer products and services, such as adaptive voice input (Google Now, Microsoft Cortana and Apple Siri), that use real-time predictive analytics. These types of applications are igniting consumer imagination and fueling demand in general.

 

 

Process and Practicalities

Let’s take a deeper look at the processes involved in creating commercial machine learning solutions. We are doing so because, as mentioned, the process for creating useful commercial predictive analytics is quite different from that of creating business analytics. Digging into the detailed processes involved will help our understanding of the usability of the libraries, tools and products currently available.

Business data projects are driven by the need to gain more or better business insights. Given that, what are the types of use cases that machine learning solutions can address? Remembering the core functionality of ML, i.e. predicting one or more discrete, future values, classifying or labeling new data into known groups and/or detecting natural groups in new data, here is a short list of some types of common use cases:

• Facilities & Manufacturing -- Smart Buildings, Predictive Maintenance

• Sales & Marketing -- Demand Forecasting, Churn Analysis, Target Advertising

• Biomedical -- Life Science Research, Healthcare outcomes (patient re-admission rates)

• Security -- Fraud Detection, Network Intrusion Detection

• Logistics -- Routing

As mentioned, the steps involved in creating an end-to-end machine learning solution include a number of considerations. Before the advent of cloud-based data storage, pipelines and machine learning model tooling, the costs involved in creating what were then called data mining solutions blocked many enterprises. These costs included high hardware and software license fees: often well over $100k, and up to $1 million simply to start what was often a multi-year project was not unheard of. Additionally, the costs of re-training staff or hiring specialty consultants to implement the data mining projects added to project costs and complexity. Prior to cloud-based data storage and cloud-based data pipeline products, the costs associated with unearthing enterprise data from the various (and often proprietary) on-premises data silos added to the adoption blockers. Yet another blocker to implementing traditional data mining was that business analysts (or, in some cases, statisticians) were wholly separated from the developers who would be charged with creating application interfaces for the results of the data mining work produced by those analysts.

Cloud storage combined with new types of Big Data storage has driven overall enterprise data volumes up dramatically. Increasingly large and complex data sets are becoming progressively more difficult to analyze in a meaningful way for the enterprise. Driven by particular sectors, such as the ML analysis of massive amounts of behavioral data collected in social gaming (Angry Birds, Halo, etc…), the enterprise appetite for getting started with ML projects has increased sharply over the last 12 months.

Although the landscape is improving due to the release of improved open source libraries and tools, as well as new commercial tools, for most enterprises ML projects are a new type of analytics. Given that, for traditional enterprises, the newly released set of cloud-based ML tools and services, such as Azure ML, IBM Watson, Predixion Software, AWS ML, BigML and others, is a welcome complement to the existing (mostly open source) languages, libraries and tools.

Another new item in the ecosystem of tools and products designed to support enterprise ML projects is the emergence of commercial data markets. IBM, Microsoft and Predixion Software all include the ability to directly ‘publish’ the results of one or more useful ML experiments into their cloud-based repository or marketplace. Technically, most enable the ML experiment to be published as a REST-based web service endpoint.
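To show what ‘published as a REST-based web service endpoint’ can mean in practice, here is a vendor-neutral sketch using only the Python standard library. It is not how any of the products above implement publishing. The `MODEL` coefficients, the `/predict` path and the `expected_shoppers` field are hypothetical, and the `call` helper invokes the application in-process so the example runs without opening a network port.

```python
import io
import json

# Hypothetical trained model: coefficients of a linear scoring function.
MODEL = {"slope": 0.94, "intercept": 2.4}


def score(features):
    """Apply the stored model to one JSON-decoded request payload."""
    return MODEL["slope"] * features["expected_shoppers"] + MODEL["intercept"]


def _json_response(start_response, status, body):
    start_response(status, [("Content-Type", "application/json")])
    return [json.dumps(body).encode("utf-8")]


def app(environ, start_response):
    """Minimal WSGI application exposing POST /predict with a JSON body."""
    if environ.get("REQUEST_METHOD") != "POST" or environ.get("PATH_INFO") != "/predict":
        return _json_response(start_response, "404 Not Found", {"error": "not found"})
    length = int(environ.get("CONTENT_LENGTH") or 0)
    try:
        payload = json.loads(environ["wsgi.input"].read(length))
        prediction = score(payload)
    except (ValueError, KeyError, TypeError):
        return _json_response(start_response, "400 Bad Request", {"error": "bad input"})
    return _json_response(start_response, "200 OK", {"prediction": prediction})


def call(body, method="POST", path="/predict"):
    """Invoke the app in-process, the way a WSGI server would for one request."""
    raw = json.dumps(body).encode("utf-8")
    environ = {
        "REQUEST_METHOD": method,
        "PATH_INFO": path,
        "CONTENT_LENGTH": str(len(raw)),
        "wsgi.input": io.BytesIO(raw),
    }
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    chunks = app(environ, start_response)
    return captured["status"], json.loads(b"".join(chunks))


status, body = call({"expected_shoppers": 30})
print(status, body)
```

To serve `app` over HTTP for local experimentation, it could be passed to `wsgiref.simple_server.make_server` from the standard library; a production deployment would add authentication, input validation and monitoring, which the commercial marketplaces provide as part of their service.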

Interestingly, cloud vendors are leveraging integration with their own cloud services. For example, Amazon ML includes the ability to enable real-time ML via a one-button click, as shown in the screenshot below. At this time, AWS ML integrates with S3, RDS or Redshift as data sources.

 

Figure 4 - Amazon ML Model Usage Options

This functionality not only facilitates quick and easy deployment to production of commercial ML services, but also has the interesting implication of providing the enterprise a commercial platform from which they can monetize the results of their ML experiments by making those results available as a commercial offering.

 

Shown below is a chart that lists many of the major offerings – either commercial or open source.

| Phase | Azure | AWS | Google | Commercial | Open Source |
|---|---|---|---|---|---|
| Ingest | Stream Insight | Kinesis | Big Query | Data Torrent | Flume |
| Pipeline | Data Pipeline | Data Pipeline | Data Pipeline | Data Torrent | Kafka |
| Storage | BLOB, Document DB, SQLAzure, HDInsight | S3, Dynamo DB, RDS – SQL, Redshift, EMR | BLOB, H/R Datastore, MySQL, Hadoop on GCE | SAS | NoSQL, Hadoop |
| Create Predictive Models | Azure ML, Revolution Analytics for R Language | AWS ML | Prediction API | SAS, IBM Watson, Predixion Software, BigML, Matlab, Mathematica, PredictionIO… | R, Mahout, Python Pandas, Weka |
| Predictive Results Publication and/or Visualization | Excel, Power BI Gateway, PowerView, Azure Data Market | AWS Lambdas, Partners | Google Charts | BigML, Dato, Predixion Marketplace, Tableau, Wolfram Language | D3 |

 

In some verticals, such as biomedical, it is common to have some form of academic data mining or statistics work (data sets and/or data mining models) to use as a basis for creating commercial machine learning solutions. One example is turning academic research into commercial biomedical products. Given that, we’ll list the data mining languages, libraries and tools that are commonly used in academic research. Traditional statistical tools and languages, e.g. Matlab and Mathematica, also have high adoption in the research sector.

 

ML Academic Languages, Tools and Libraries – some are open source and most have free versions for academic research. Shown below is a chart that summarizes many of these items. We have included the Communities category because academic data science communities are at the leading edge of work on improving open source tools and libraries, and bear watching when you are assessing the state of ML tools and products.

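| Category | Objects | Notes |
|---|---|---|
| Languages | R Language | Stats Language |
| Languages | SciPy/NumPy/Pandas | Python Libraries for ML |
| Languages | Matlab | Stats Language |
| Languages | Mathematica | Stats Language |
| Languages | Julia | Scalable Stats Language |
| Languages | Mahout | ML for Hadoop |
| Languages | Weka | Research Stats Language |
| Tools | R Studio | IDE for R |
| Tools | Shiny for R | Visualization for R |
| Tools | Weka Studio | IDE for Weka |
| Tools | PyCharm | IDE for Python |
| Tools | Sublime | IDE for Python and more |
| Communities | KDNuggets | Website |
| Communities | Kaggle | Competition |
| Communities | DataKind | Community |
| Communities | Open Gov/Open Data | Community |
| Communities | Code for America | Community |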

 

Accessible to Data Scientists & Business Users

A key question around the practicality of ML solutions for the enterprise is this: Who exactly will develop the ML solutions in the enterprise? Given the diverse set of skills needed to successfully implement any type of data science solution, much less the smaller (and even more complex) subset around ML, the first part of the answer is the most critical: a team of skilled professionals best implements ML projects. Our answer to the common question “Do I just need to hire a statistician to implement an ML project?” is an unqualified “No!” Commercial ML differs substantially from ML for academic research. While the image of the lone scientist, toiling away in his/her lab and carefully analyzing the results via complex statistical calculations, is the heritage of ML, this image bears little relationship to the practicalities of implementing ML in the enterprise.

While there is definitely a place for a dedicated statistician on an enterprise ML team, this is no longer a requirement for all ML projects. That being said, ML tools complement (but do not substitute for) statistical and data mining domain expertise. What has changed with the advent of these tools is the ability for your key team members to work with others (business analysts, decision makers, developers, DevOps, etc…) because the tools use common interfaces and well-designed dataflow visualizations. Most tools are also cloud-based, which means zero installation and configuration and a quick environment start-up time. Additionally, commercial tools are designed to scale storage and processing via cloud capacity, enabling faster movement from small dataset experiments to full-scale production deployments. Cloud-based tools are particularly well suited for building quick proof-of-concept projects for the enterprise.

Given the democratization of tooling, you may be wondering whether this new tooling is sophisticated enough for classically trained data scientists and academics to make full use of their complete skill sets. The answer is a conditional yes: some, but not all, commercial products, such as Azure ML, integrate with commonly used statistical languages (R Language and Python libraries) and allow re-use of scripts created in these languages.

 

Additionally, it’s important for researchers to have visibility into algorithms and algorithm parameters, because this supports reproducibility of published experiment results. Shown below is an Azure ML model that uses a two-class support vector machine to perform classification (of Tweets in this sample). Also of note is the ability to use R Language scripts in a ML workflow:

Figure 5 - Azure ML Experiment

 

 

Model evaluation is a key component of a ML Experiment. Here is sample output from the Azure ML model evaluation visualization. Note that both score information (a table) and graphical output are included in the visualization:

Figure 6 - Azure ML Model Evaluation Output
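
To make the score information in an evaluation like this more concrete, here is a minimal Python sketch that computes a confusion matrix and a few common summary metrics for a two-class model. The labels, scores, the 0.5 threshold and the function name `evaluate_binary` are all assumptions made for illustration; the sketch is not taken from Azure ML or any other vendor product.

```python
def evaluate_binary(labels, scores, threshold=0.5):
    """Return confusion-matrix counts and summary metrics for a two-class model.

    labels: true classes (1 = positive, 0 = negative)
    scores: model scores in [0, 1]; a score >= threshold predicts the positive class
    """
    tp = fp = tn = fn = 0
    for label, score in zip(labels, scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        else:
            fn += 1

    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "tp": tp, "fp": fp, "tn": tn, "fn": fn,
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical labels and scores, invented for this example.
labels = [1, 0, 1, 1, 0, 0, 1, 0]
scores = [0.9, 0.2, 0.6, 0.4, 0.7, 0.1, 0.8, 0.3]
report = evaluate_binary(labels, scores)
print(report)
```

Changing the threshold trades precision against recall, which is why evaluation views typically pair the score table with a curve plotted across thresholds.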

 

For comparison, shown below is output from a sample Amazon ML model evaluation:

Figure 7 - Amazon ML Model Evaluation Visualization

 

 

Accessible to Developers & BI/DW Professionals

An interesting and somewhat unexpected aspect of ML enterprise projects is that having one or more Big Data repositories is in no way a requirement for undertaking this type of project. Due to the origins of ML, i.e. academic research using statistics and data mining, some of the most useful ML projects are, in fact, based on application of these techniques to LOB data. You can think of it as being able to ask different kinds of questions of your current data. Understanding when to use ML (and when not to) relates directly to the definitions of business and predictive analytics. Simply put, use ML when you want to ask business questions that will result in probabilistic answers.

The ability to ask predictive questions of LOB data often yields useful results. For example, it has been quite common to begin ML projects in sales and marketing departments, using CRM data as source for ML experiments that involve answering business questions like ‘what are the characteristics of the customers who produce the most revenue?’ (Clustering) and ‘what type of cross-sell opportunities can we introduce on our website based on known customer purchase patterns?’ (Classification).
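
As a rough sketch of what a clustering question looks like in code, the example below groups a handful of customers with a bare-bones k-means (Lloyd's algorithm) written in plain Python. The two customer features (annual revenue in thousands and orders per year), the data values and the starting centroids are hypothetical; a real project would use the algorithms built into a commercial tool or an established library and would normally scale the features first.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Tiny k-means sketch: returns (centroids, assignments).

    points: list of equal-length numeric tuples
    centroids: initial centroid guesses (same dimensionality as points)
    """
    centroids = [tuple(c) for c in centroids]
    assignments = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid in place
        if new_centroids == centroids:
            break  # converged: centroids no longer move
        centroids = new_centroids
    return centroids, assignments


# Hypothetical CRM extract: (annual revenue in $K, orders per year).
customers = [(5, 2), (6, 3), (7, 2), (50, 20), (55, 22), (60, 25)]
centroids, assignments = kmeans(customers, centroids=[customers[0], customers[-1]])
print(assignments)  # cluster index for each customer
print(centroids)    # the "typical" customer of each cluster
```

The resulting centroids describe the typical customer in each group, which is the kind of answer the revenue-characteristics question is after.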

Another common ‘entry point’ for ML solutions in the enterprise is IT (log) data. Regulatory (access auditing) and compliance requirements, as well as general security concerns, drive ML experiments such as ‘on what day and at what time can I expect network bandwidth usage to spike to a particular level (value) for a particular segment of my corporate users?’ (Regression).
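
A regression question of this kind can be sketched with ordinary least squares on a single input. The example below fits a straight line to hourly bandwidth readings and then solves for the hour at which the fitted line reaches a chosen level. The readings, the 400 Mbps level and the function name `fit_line` are assumptions for illustration only; real log data is rarely this linear and would usually call for a richer model.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical log extract: hour of day and bandwidth usage in Mbps.
hours = [8, 9, 10, 11, 12]
mbps = [100, 150, 200, 250, 300]

slope, intercept = fit_line(hours, mbps)

# Solve slope * hour + intercept = level for the hour of interest.
level = 400
hour_at_level = (level - intercept) / slope
print(slope, intercept, hour_at_level)
```

Note that solving for the hour extrapolates beyond the observed data, so the answer is an estimate to be validated rather than a guarantee.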

 

In general, the enterprise can find value in appropriately applying predictive analytics via ML solutions to a broad spectrum of domains. In addition to sales and marketing or DevOps, enterprises can apply ML to other scenarios for which probabilistic analysis would yield useful results. For example, questions such as these can now be addressed:

• Which employee attributes correlate most closely with the highest revenue production by that employee’s team?

• At what future point (value) in time do our customers in a certain segment (e.g. demographic, geographic…) tend to make a subsequent purchase?

• What groups (trial or free items) of our public resources (website, GitHub, YouTube…) tend to be used by browsers who become our customers?
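
The first of these questions can be explored with a simple correlation ranking. The sketch below computes the Pearson correlation between each employee attribute and team revenue, then orders the attributes by the strength of that relationship. The attribute names and values are hypothetical, and a strong correlation only flags a candidate for further study; it does not show that the attribute causes higher revenue.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical data: one value per employee, invented for this example.
team_revenue = [10, 20, 30, 40, 50]
attributes = {
    "years_experience": [1, 2, 3, 4, 5],
    "training_hours": [5, 3, 6, 2, 4],
    "commute_minutes": [50, 45, 30, 25, 10],
}

# Rank attributes by the absolute strength of their correlation with revenue.
ranked = sorted(
    attributes,
    key=lambda name: abs(pearson(attributes[name], team_revenue)),
    reverse=True,
)
for name in ranked:
    print(name, round(pearson(attributes[name], team_revenue), 3))
```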

As mentioned, integrated tooling provided by commercial vendors enables simpler deployment and embedding of ML model results into enterprise applications via their ‘publish as a web service’ functionality. Given that relatively few enterprise application developers have familiarity with, much less expertise in, ML languages, tools and libraries, using commercial ML tools that include ‘click to publish’ functionality significantly speeds up time to market.
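
From the application developer's side, consuming a published model usually comes down to an authenticated HTTP POST. The sketch below only builds such a request with Python's standard library. The endpoint URL, the bearer-token header and the `{"inputs": [...]}` payload shape are placeholders invented for illustration; each vendor documents its own URL format, authentication scheme and request schema, and those should be used instead.

```python
import json
import urllib.request


def build_scoring_request(url, api_key, features):
    """Build (but do not send) a POST request for a hypothetical scoring endpoint."""
    body = json.dumps({"inputs": [features]}).encode("utf-8")  # assumed payload shape
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + api_key,  # assumed auth scheme
        },
        method="POST",
    )


# Placeholder values: replace with the details your vendor provides.
request = build_scoring_request(
    "https://example.com/score",
    "YOUR_API_KEY",
    {"tweet_text": "great product"},
)

# Sending the request would look like this (left commented out because the
# endpoint above is not a real service):
# with urllib.request.urlopen(request) as response:
#     prediction = json.load(response)
```

Keeping request construction separate from sending makes the integration easy to unit test without calling the live service.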

Another advantage of using commercial ML tools for the enterprise is the built-in connectors to disparate incoming data sources. Given that it is increasingly common to use a broad variety of data sources as ML ingest sources, the availability of pre-built connectors once again speeds development cycles. It is common to include connectors for LOB data, i.e. RDBMS systems (both on-premises and cloud-based), as well as for some of the newer NoSQL databases, Hadoop and one or more types of incoming data stream.

 

Also useful are the quick statistical snapshots that most commercial ML tools provide of datasets in your ML project. For example, the AWS ML dataset console view includes the visualization shown below:

Figure 8 - AWS ML Datasources Attribute Information

The AWS viewer allows the ML team to ‘see’ not only the attribute names, but also the correlations, uniqueness of data and most/least frequent categories. It also includes an inline ‘Preview’ visualization of the uniqueness of the data.
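
For teams that want a similar snapshot outside a vendor console, the following Python sketch profiles a single categorical attribute: row count, missing values, number of distinct values and the most and least frequent categories. It is an illustration of the idea only, not the calculation AWS performs, and the `segment` values and the function name `summarize_attribute` are hypothetical.

```python
from collections import Counter


def summarize_attribute(values):
    """Quick profile of one categorical attribute; None marks a missing value."""
    present = [v for v in values if v is not None]
    counts = Counter(present)
    ranked = counts.most_common()  # (value, count) pairs, most frequent first
    return {
        "count": len(values),
        "missing": len(values) - len(present),
        "distinct": len(counts),
        "most_frequent": ranked[0] if ranked else None,
        "least_frequent": ranked[-1] if ranked else None,
    }


# Hypothetical 'segment' column from a customer dataset.
segment = ["retail", "wholesale", "retail", None, "online", "retail", "wholesale"]
summary = summarize_attribute(segment)
print(summary)
```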

As mentioned, integrated commercial ML tooling that includes ‘one-click’ deployment capabilities increases usability for developers and BI professionals. Additionally, offerings that essentially advertise published ML web services, such as Microsoft Azure Data Market, provide additional discoverability and usability. Commerce opportunities for published services are also emerging. An example is shown below.

 

Figure 9 - Azure Machine Learning Test Harness

 

Visualization of results is another element of ML solution usability. To that end, we’ve included a sample from IBM Watson Analytics. This service includes flexible visualizations at all phases of the ML process (i.e. data discovery, modeling, etc.). An example is shown below.

 Figure 10 - IBM Watson ML Visualization

   

 

Our last example of model visualization is from the commercial cloud-based vendor BigML and is shown below. Also interesting is how vendors such as BigML foster community by providing a platform for their users to get more value from their ML models. You’ll note that BigML allows users to upload, share, rate and also sell models for use by others in their own ML scenarios.

 Figure 11 - BigML Model Visualization

 

 

Key Takeaways

Incorporating the results of machine learning experiments into production data solutions adds significant complexity to the overall project. Given this, a solid understanding of technology choices around machine learning solutions is essential for designing and delivering solutions that provide business value to the organization.

• Use commercial machine learning products when team members new to machine learning processes are creating your solution. Due to fundamental differences at every stage in the data pipeline, i.e. data preparation, hypothesis formation, algorithm selection, model training and evaluation, ML projects introduce a set of complex processes into the enterprise. If your data paradigm consists of an OLTP store alone, you would be best served by leveraging commercial ML development suites, rather than attempting to cobble together solutions based on tools and libraries that were built primarily for statisticians.

• Select tools or coding libraries that perform at the speed and scale of data ingest and processing required by the types of machine learning methods your business problems call for. Enterprises will benefit from leveraging cloud storage and processing of Big Data workloads as sources for ML solutions because their data volumes are generally significantly larger than those of academic research. Also, in-memory streams are increasingly relevant, particularly with the advent of more and more IoT scenarios.

• Teams that have already implemented pure open source data solutions are most capable of adding pure open source machine learning solutions. Domains where data mining and/or statistics may have already been in use, such as academic research, will have more success using open source tools and libraries, so long as their input data does not overrun the capabilities of those tools.

• Plan for and test your model deployment topology to ensure ML experiments deliver production business value. Given the common challenges around deployment of ML models, commercial vendors are incorporating one-click deployment functionality in their ML studio environments; such functionality enables faster time to market for production solutions. Also consider the vendor path to implementing streaming or near-real-time ML solutions if that is part of your requirements.

• Select tools or plan for coding appropriate types of visualization solutions. ML outputs are unfamiliar to many business users. Standard reports and dashboards have not been designed to display ML results in a meaningful way. Selecting ML vendors that integrate results easily into other commercial solutions or common libraries results in broader usability for ML solutions.

 

 

References and Resources

This section lists the references and resources referred to in this article.

Data Science graphic -- http://civicscience.com/data-science-a-visual-guide/
Shiny for R-Studio -- http://shiny.rstudio.com/gallery/movie-explorer.html
Deep Learning and the Hololens -- https://technoptimist.wordpress.com/2015/01/25/deep-learning-and-the-hololens
Collection of papers on how IBM Watson works -- http://www.andrew.cmu.edu/user/ooo/watson/
What is AI? -- http://www.techopedia.com/definition/190/artificial-intelligence-ai
How Google is Teaching Computers to See -- https://gigaom.com/2012/06/25/how-google-is-teaching-computers-to-see/
Need Deep Learning? Here are 4 Lessons from Google -- https://gigaom.com/2015/01/29/new-to-deep-learning-here-are-4-easy-lessons-from-google/
Getting started with AWS ML -- http://docs.aws.amazon.com/machine-learning/latest/dg/tutorial.html
AzureML on Windows Azure DataMarket / Binary Classifier Sample -- https://datamarket.azure.com/dataset/aml_labs/log_regression
BigML Sample Model -- https://bigml.com/user/ashikiar/gallery/model/53b2f21ec8db635905000d33
Kaggle Community -- https://www.kaggle.com/
DataKind Community -- http://www.datakind.org/

 

Table of Abbreviations

Abbreviation   Full Term
AI             Artificial Intelligence
AWS            Amazon Web Services
BI             Business Intelligence
CRM            Customer Relationship Management
DW             Data Warehouse
GPU            Graphics Processing Unit
IoT            Internet of Things
LOB            Line of Business
ML             Machine Learning
NoSQL          No SQL
OLAP           Online Analytical Processing
OLTP           Online Transactional Processing
POC            Proof-of-concept
RDBMS          Relational Database Management System

 

 

About Lynn Langit

Lynn Langit is a Big Data and Cloud Architect who has been working with database solutions for more than 15 years. Over the past 4 years, she’s been working as an independent architect using these technologies, mostly in the biotech, education, manufacturing and facilities verticals. Lynn has done POCs and has helped teams build solutions on the AWS, Azure, Google and Rackspace Clouds. She has done work with SQL Server, MySQL, AWS Redshift, AWS MapReduce, Cloudera Hadoop, MongoDB, Neo4j, Aerospike and many other database systems. In addition to building solutions, Lynn also partners with all major cloud vendors, providing early technical feedback into their Big Data and Cloud offerings. She is an AWS Community Hero, Google Developer Expert (Cloud), Microsoft MVP (SQL Server) and a MongoDB Master. Lynn is also a Cloudera certified instructor (for MapReduce Programming).

Prior to re-entering the consulting world 3 years ago, Lynn spent over 10 years as a Microsoft Certified instructor and a Microsoft vendor, and then 4 years as a Microsoft employee. She’s published 3 books on SQL Server Business Intelligence and has most recently worked with the SQL Azure team at Microsoft. She continues to write and screencast and hosts a BigData channel on YouTube (http://www.youtube.com/SoCalDevGal) with over 150 different technical videos on Cloud and BigData topics. Lynn is also a committer on several open source projects (http://github.com/lynnlangit).

About Mark Tabladillo

Mark Tabladillo is a Senior Data Scientist at midtown Atlanta's Predictix/LogicBlox. He has used and promoted Microsoft Azure Machine Learning, Microsoft SQL Server Data Mining, Microsoft BI Stack, Power BI, SAS, SPSS, R, and Julia. He is a SQL Server MVP and has a research doctorate (PhD) from Georgia Tech. He is chapter leader for the PASS Data Science Virtual Chapter, which has periodic live meetings and its own YouTube channel.