Ignite-ML (A Distributed Machine Learning Library for Apache Ignite)

Corey Pentasuglia
Masters Research
Submitted as partial requirement for the M.S. Degree, SUNY Polytechnic Institute
May 15, 2016




Contents

Introduction
    Overview
    Distributed Environments
    Data Locality
Existing Frameworks
    MLBase
    H2O
Project
    Apache Ignite
    OLTP vs. OLAP
    Apache Ignite vs. Apache Spark
    Ignite-ML
    Ignite-ML Architecture
    Using Ignite-ML
    Datasets
    Caching
    Scalability
    Performance
    TDML
    Conclusion
Citations

Introduction

Overview

The topic of DML (Distributed Machine Learning) is an increasingly researched one. DML is the study of executing machine learning algorithms in parallel on a distributed computer system, and it appears to be a relatively new topic in the Computer Science community. This report explains some of the current technologies for performing DML, and describes work being done to execute DML algorithms on an OLTP (Online Transaction Processing) information system. It also introduces the idea of Transactional Distributed Machine Learning (TDML). The intent of this work is not to perform a performance study, but rather to explore the ability to apply common machine learning principles to transacted data in an OLTP/OLAP hybrid information system.

Distributed Environments

When considering running any code in a distributed fashion, it is important to consider actual needs. What does this statement mean? Simply put, if an application does not truly process very large amounts of data, it likely does not need to be distributed. Of course, no blanket rule can dictate whether or not to run distributed; it is a judgement call. An application that operates on a sparse amount of data, or does not have much processing to perform, should probably not run distributed, since the overhead will likely outweigh the benefit. That does not mean it cannot still run in parallel on a single multi-core machine.

The point of all of this is the importance of scalability. Applications that perform DML should be designed with scalability in mind: ideally, an application that runs a machine learning algorithm should be able to run on a single node and scale up to N nodes.

Data Locality

Another important principle is the notion of data locality. The principle of locality can speed up distributed computation drastically. Ideally, computation performed on some data should occur on the instance where that data lives; sending the data around for computation only adds overhead to an application.
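The principle can be illustrated with a toy sketch. All class and method names below are invented for illustration (in real Apache Ignite, affinity colocation and `IgniteCompute.affinityRun` serve this purpose): each key is hashed to an owning node, and the computation is routed to that node instead of moving the data.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Toy grid: data is partitioned across "nodes" by key hash, and a
// computation on a key runs where that key's data lives.
class LocalityGrid {
    private final Map<Integer, Map<String, Double>> nodes = new HashMap<>();
    private final int nodeCount;

    LocalityGrid(int nodeCount) {
        this.nodeCount = nodeCount;
        for (int i = 0; i < nodeCount; i++) nodes.put(i, new HashMap<>());
    }

    private int owner(String key) { return Math.floorMod(key.hashCode(), nodeCount); }

    void put(String key, double value) { nodes.get(owner(key)).put(key, value); }

    // Only the small closure travels to the owning node, never the data.
    double computeAt(String key, Function<Double, Double> fn) {
        return fn.apply(nodes.get(owner(key)).get(key));
    }
}
```

The design point is that `computeAt` ships a function to the data's owner rather than pulling the value across the network, which is exactly the overhead data locality avoids.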

Existing Frameworks

How does all of this apply to DML? If one is evaluating one of the many technologies available, or even researching the creation of a new distributed machine learning framework, one must consider which principles allow for efficiency.

There are currently many ongoing projects in the DML domain, including MLBase and H2O. Research is also being performed utilizing Apache Ignite and Weka (a Java-based machine learning library) to perform distributed clustering. Let us explore some ideas from these frameworks.

MLBase

The MLBase framework is built over the Apache Spark ML implementation. Apache Spark is a general-purpose cluster computing framework that provides many high-level APIs for commonly used languages such as Java, Scala, Python, and R. The primary library for performing DML in Apache Spark is MLlib. MLlib aims to make DML easy and scalable, goals that are likely common to all DML frameworks. MLlib provides functionality for classification, regression, clustering, collaborative filtering, and dimensionality reduction.

FIGURE 1 ARCHITECTURE OF MLBASE [2]

The MLBase [2] project is the culmination of the technologies shown in Figure 1. The entire framework is built over Apache Spark for the general distributed computing aspects. MLlib is a library built on top of the Apache Spark framework that provides the ML (Machine Learning) algorithms. MLI provides a high-level ML abstraction for easily performing ML algorithms. ML Optimizer aims to automate ML pipeline construction and is currently still under development.

H2O

H2O [1] is a framework that claims to make math and predictive analytics easier to perform in a business environment. H2O supports a nice set of languages, including R, Java, Scala, and Python. Many of these APIs are very convenient for data scientists who have no desire to write in a language like Java to perform analysis. The H2O framework is really targeted toward predictive analytics, and many large companies have already integrated with H2O. Similar to some of the other current frameworks, H2O seems focused on data analysis. H2O also provides integration with the Spark framework, an integration it calls Sparkling Water.


FIGURE 2 H2O SPARKLING WATER [1]

Sparkling Water, as depicted in Figure 2 [1], is the integration of the H2O framework over the Apache Spark distributed environment.

The technologies described above appear to be excellent, well-adapted platforms for performing distributed machine learning in an analytical environment. One intention of this research is to explore distributed machine learning in a transactional sense. Research showed that most current technologies focus on analytics over “Big Data”, which leads to the topic of this writing: how can we adapt distributed machine learning to “Fast Data”? More is explained below.

Project

Apache Ignite

Ignite-ML is a library that runs on top of the Apache Ignite framework. Let us first explore what the Apache Ignite In-Memory Data Fabric is.


FIGURE 3 APACHE IGNITE ARCHITECTURE [6]

Apache Ignite [6] provides an environment for handling not only the storage of large amounts of data, but also its processing. More recently this has become the concept of “Fast Data”, which is really a complement to “Big Data”. The notion of “Fast Data” means that an information system can handle incoming transactions with speed and perform meaningful calculations on data per transaction. Apache Ignite has many features that facilitate quick handling of data. It is not necessary to explore every component shown in Figure 3 [6], but the Data Grid, Hadoop Acceleration, and Compute Grid really lend themselves to DML.

OLTP vs. OLAP

Online Transaction Processing (OLTP) and Online Analytical Processing (OLAP) are two types of technologies that relate to the handling and processing of data.

FIGURE 4 OLTP VS OLAP [3]

Figure 4 [3] depicts the separation of OLTP and OLAP. Apache Ignite is considered a hybrid of these two technologies, and many consider such a hybrid very powerful. To relate back to what was mentioned above, the OLAP portion of such an information system corresponds more to the notion of big data, whereas the OLTP portion relates more to the newly mentioned fast data. Apache Spark is an example of a framework considered an OLAP technology.

FIGURE 5 OLTP VS OLAP FEATURES [3]

Figure 5 [3] above points out the major differences between OLTP and OLAP information systems. The table can be used as a reference for comparing the technologies.

Apache Ignite vs. Apache Spark

It seems pretty clear that many of the current technologies focus on the analytical aspect of DML. Viewing the current state of DML technologies, a practitioner might not find much to work with. Following the ideals of this project, what if one could perform classification on a per-transaction basis? That is, a dataset has already been used to train a supervised learning algorithm, and as data is transacted, classification is performed live, prior to storage.

Comparing Apache Ignite and Apache Spark is a bit like comparing an apple and a bag of apples (notice: not oranges). Ignite encompasses many of the OLAP principles contained within Apache Spark; in fact, they are considered sister projects. However, Apache Ignite also has many OLTP capabilities for processing large amounts of transacted data. Let us take a look at a comparison slide [4] created by Dmitriy Setrakyan of GridGain Systems, one of the Apache Ignite creators.

FIGURE 6 SPARK VS IGNITE GRIDGAIN PRESENTATION [4]

Figure 6 [4] depicts some of the differences between Ignite and Spark, and in some ways it was the inspiration behind this project. One may notice that Spark has a machine learning library (MLlib), whereas Apache Ignite has none. The focus of Spark is also geared more toward data science. Apache Ignite has some clear advantages over Spark, including in-memory indexing, real streaming, the use of memory as primary storage, off-heap memory (which avoids garbage collection pauses), and no need for deserialization. Spark utilizes Resilient Distributed Datasets, which are created ahead of time and are immutable; data processed with Apache Ignite lives in mutable datasets.


FIGURE 7 SPARK VS IGNITE PRESENTATION SLIDE

Figure 7 above also shows some of the major differences between Apache Ignite and Apache Spark. As shown, Ignite provides convenient integration with legacy MapReduce code because of its fully compatible Hadoop APIs.

One area where Spark has a clear advantage is language support. Spark supports many languages, including Java, Scala, Python, and R; Apache Ignite only supports Java and Scala. An API for utilizing Ignite-ML from R would be very welcome and may come in time, since many data scientists have stated that they prefer a language like R. Java is the intended language for Ignite-ML, and it is a good, practical language to work with. Ideally, people writing distributed applications could utilize the many features of Ignite and take advantage of the distributed machine learning that Ignite-ML provides.

Ignite-ML

The Ignite-ML library started with the simple notion of performing ML algorithms on the Apache Ignite framework. The original project utilized only Ignite library code and JavaML (the Java Machine Learning Library), and simply used JavaML to perform the KNN algorithm in a distributed fashion. It was soon realized that an extensible library would serve others better, and this proof of concept developed into the Ignite-ML library. Ignite-ML now utilizes the Weka library for its machine learning algorithms, since Weka was determined to be a far better library for performing machine learning in Java. Additional machine learning algorithms can be plugged into Ignite-ML to be performed on the grid. The project now consists of requests, responses, handlers, and exceptions. All of these classes facilitate a library that is extensible and has a clear API.

Ignite-ML Architecture

FIGURE 8 IGNITE-ML ARCHITECTURE

Figure 8 depicts the architecture of the Ignite-ML library. The library has been designed to be extensible, and it already contains default implementations of many machine learning algorithms (integrated via Weka). However, custom implementations of algorithms can be integrated into the library by registering three things: a handler (the algorithm code), a request, and a response. These three simple objects allow users to add additional machine learning algorithms to be executed on the compute grid.
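The registration-and-dispatch idea can be sketched in plain Java. The class and method names here are illustrative stand-ins, not the actual Ignite-ML API: handlers are registered against a request class, and the executor routes each incoming request to its registered handler.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of an extensible executor: each handler is keyed by
// the request class it serves, so new algorithms plug in via register().
class SketchExecutor {
    private final Map<Class<?>, Function<Object, Object>> handlers = new HashMap<>();

    <Q, R> void register(Class<Q> requestType, Function<Q, R> handler) {
        handlers.put(requestType, q -> handler.apply(requestType.cast(q)));
    }

    @SuppressWarnings("unchecked")
    <R> R handleRequest(Object request) {
        Function<Object, Object> h = handlers.get(request.getClass());
        if (h == null)
            throw new IllegalArgumentException("no handler for " + request.getClass());
        // A real executor would ship this to the compute grid; the sketch
        // simply runs the handler locally.
        return (R) h.apply(request);
    }
}
```

Dispatching on the request's class is what lets the library stay closed to modification but open to extension: registering a new request/handler pair requires no change to the executor itself.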


FIGURE 9 IGNITE-ML HANDLER CODE

An example of the code that adds a handler is shown in Figure 9. A handler is created that implements the IgniteMLHandler interface. The handler has two type parameters: the classes that implement IgniteMLRequest and IgniteMLResponse, respectively. These type parameters let the executor know which request and response classes are registered for a handler.
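The shape of such a handler can be sketched as follows. The interface and type names mirror the report's description, but the exact signatures and the example KNN classes are assumptions, not the real Ignite-ML API:

```java
import java.io.Serializable;

// Marker interfaces for requests and responses (assumed shapes).
interface IgniteMLRequest extends Serializable {}
interface IgniteMLResponse extends Serializable {}

// A handler is parameterized on its request and response types, so the
// executor knows which classes are registered for it.
interface IgniteMLHandler<Q extends IgniteMLRequest, R extends IgniteMLResponse> {
    R handle(Q request);
}

// Hypothetical KNN handler bound to its own request/response pair.
class KnnRequest implements IgniteMLRequest {
    final double[][] data;
    KnnRequest(double[][] data) { this.data = data; }
}

class KnnResponse implements IgniteMLResponse {
    final int[] classes;
    KnnResponse(int[] classes) { this.classes = classes; }
}

class KnnHandler implements IgniteMLHandler<KnnRequest, KnnResponse> {
    public KnnResponse handle(KnnRequest request) {
        // The real handler would delegate to Weka's KNN implementation;
        // this stub assigns class 0 to every row.
        return new KnnResponse(new int[request.data.length]);
    }
}
```

The request and response types extend Serializable because, on a real compute grid, both must travel between nodes.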

The Ignite-ML library can be utilized in many ways: it can be integrated into an existing application, or one might create a small application to perform calculations. Ideally, users of the Ignite-ML library will either adopt the Apache Ignite fabric, or already be utilizing it when integrating with Ignite-ML.

Using Ignite-ML

The Ignite-ML library can be integrated into existing frameworks, or developed as part of a new application. The source code contains example applications that can be used standalone. The applications are bundled into an executable jar file that can be run using the java -jar command. The example applications are bundled under the source at ignite-ml/examples; as an example, one can run ignite-ml/examples/ml-knn-app. The KNN app requires a training dataset and the dataset to be classified. Here is an example:

Please provide the path to your training dataset: C:\iris.data
Please provide the path to your actual (non-training) data: C:\iris.data

This example will be used to demonstrate the Ignite-ML library throughout the remainder of this report. All examples and output within the rest of this report will result from the execution of the ml-knn-app.


In order to integrate Ignite-ML into existing applications, one should start by getting an Ignite executor instance:

IgniteExecutor executor = IgniteExecutorFactory.createInstance(trainingData);

The IgniteExecutorFactory class takes the trainingData as a parameter, which is used to train any of the handlers that are registered. The implementer can then create a request of type IgniteMLRequest (a Java interface). Each request may take relevant parameters; the IgniteKnnRequest takes the dataset to be classified. The request can then be handled by passing it to the executor:

IgniteKnnResponse response = executor.handleRequest(request);

The request will then be run on the Ignite compute grid, and the response will be returned from the handleRequest call. The IgniteKnnResponse contains the classes for the dataset passed in with the request.

These calls can be used to perform KNN classification within an application, or a developer can register and utilize their own ML algorithms by defining a handler, request, and response that derive from the respective types within the Ignite-ML library.

Datasets

Two datasets are primarily used for testing Ignite-ML: the iris dataset [7] and the ph (poker hand) dataset [7]. The iris dataset is a smaller set of about 150 entries, while the ph dataset contains over a million entries; for the purpose of Ignite-ML testing, the ph data has been shortened to 10k entries.

Caching

As seen in some of the content above, caching is really the underlying foundation of Apache Ignite. The Ignite-ML library takes advantage of this by utilizing both the Compute Grid and Data Grid components. For instance, a classifier can be trained from a training dataset and loaded into the cache to be shared among all of the nodes. Incoming transactional data can also be added to the shared cache. Apache Ignite additionally has a notion of per-node shared state, in which nodes can cache information for the next iteration of processing.
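The train-once, share-everywhere pattern can be sketched with a plain concurrent map standing in for the Data Grid (in real Apache Ignite one would put the classifier into an IgniteCache obtained via ignite.getOrCreateCache(...)). The "model" below is just per-feature means, a toy stand-in for a trained Weka classifier:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Stand-in for the data grid: train once, publish the classifier under a
// well-known key, and let every worker pull the shared copy.
class ClassifierCache {
    private static final Map<String, double[]> CACHE = new ConcurrentHashMap<>();

    // "Training" here just computes per-feature means as a toy model.
    static void trainAndPublish(String key, double[][] trainingData) {
        int cols = trainingData[0].length;
        double[] means = new double[cols];
        for (double[] row : trainingData)
            for (int c = 0; c < cols; c++) means[c] += row[c] / trainingData.length;
        CACHE.put(key, means);
    }

    static double[] model(String key) { return CACHE.get(key); }
}
```

The point of the pattern is that training happens once, while every node classifying incoming data reads the same cached model rather than retraining locally.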

Scalability

The Ignite-ML library has proven to scale well thus far. Through performance studies, it has been determined that Ignite-ML can be integrated into an application running on anything from a laptop all the way up to a compute cluster. Of course, the amount of resources available will determine the speed at which an application executes.

Performance

The performance of Ignite-ML was measured by running the ml-knn-app on both a single laptop and a cluster grid of 4 machines; significant performance increases were observed when the application executed on the grid. The datasets mentioned above were utilized to test performance. The iris flower data [7], consisting of 150 items and 5 fields, was executed first on the laptop (4 logical processors) and then on the 4-node cluster. The performance results can be seen below.

FIGURE 10 IRIS LAPTOP PERFORMANCE

The results depicted in Figure 10 show the performance of the iris dataset classification on a 4 core laptop. The classification took roughly 11 seconds.

FIGURE 11 IRIS COMPUTE GRID PERFORMANCE

The results depicted in Figure 11 show the performance of the iris dataset classification on a 4 node compute grid. The classification took roughly 3 seconds.

It seems clear that the performance is significantly better when executing on the compute grid. These results are encouraging considering the iris dataset is not considered a large set.

The next dataset used for performance testing was the ph dataset [7] mentioned above, scaled down to 10k items (originally over 1 million), with 10 fields per item. Performance results from the runs with this dataset can be seen below.


FIGURE 12 PH 10K PERFORMANCE LAPTOP

The results depicted in Figure 12 show the performance of the ph 10k dataset classification on a 4 core laptop. The classification took roughly 26 seconds.

FIGURE 13 PH 10K PERFORMANCE CLUSTER

The results depicted in Figure 13 show the performance of the ph 10k dataset classification on a 4 node compute grid. The classification took roughly 5 seconds.

Similar to the conclusions above, these performance results confirm that the application runs significantly faster on the Ignite cluster, and indicate that Ignite-ML is scalable and performs well.
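As a quick sanity check of the figures above, speedup is single-node time divided by cluster time: roughly 11/3 ≈ 3.7x for iris and 26/5 = 5.2x for ph 10k on the 4-node grid. The helper below just encodes that arithmetic:

```java
// Speedup and parallel efficiency from the rough timings quoted above.
class Speedup {
    // Ratio of single-node time to cluster time.
    static double speedup(double singleNodeSeconds, double clusterSeconds) {
        return singleNodeSeconds / clusterSeconds;
    }

    // Speedup normalized by node count; > 1.0 means superlinear.
    static double efficiency(double speedup, int nodes) {
        return speedup / nodes;
    }
}
```

Note that 5.2x on 4 nodes is superlinear, which can happen when per-node working sets shrink enough to change memory or GC behavior; the report does not analyze this, so it is left as an observation.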

TDML

Transactional Distributed Machine Learning (TDML) is a concept coined by this research. The idea is to extend the current usage of distributed machine learning to apply ML concepts per transaction, as well as on already-stored datasets. Of course, concerns such as normalization need to be considered; this will be addressed later, and for now Ignite-ML assumes that datasets provided to the library are already normalized. Performing something like classification on incoming data transactions works well for supervised learning; it has been acknowledged that unsupervised learning techniques will still likely need to be performed on the OLAP portion of the system. This sparks ideas for new ways all of these concepts can work together.
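The classify-per-transaction idea can be sketched as a thin write path. All names here are invented for illustration; in Ignite-ML the store would be the Data Grid and the model a Weka classifier shared via the cache. The already-trained model labels each incoming record before it is persisted:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Sketch of TDML: an already-trained model classifies each incoming
// transaction live, and the record is stored together with its label.
class TransactionalClassifier<T> {
    private final Function<T, String> model;            // trained classifier
    private final List<String> store = new ArrayList<>(); // stand-in for the cache

    TransactionalClassifier(Function<T, String> model) { this.model = model; }

    // Classify first, then persist record + label in one step.
    String onTransaction(T record) {
        String label = model.apply(record);
        store.add(record + "=" + label);
        return label;
    }

    int stored() { return store.size(); }
}
```

The key property is that classification sits on the transaction path itself, so every stored record already carries its label, rather than waiting for a later OLAP-style batch pass.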

Conclusion

Transactional distributed machine learning is still an active area of research for the author, and the Ignite-ML library is still being developed; these tasks will likely be ongoing. The hope of this research is to explore a new area of distributed machine learning, and that readers become interested in the project and even contribute to it. The author has been in contact with developers, including the creator of the Apache Ignite Fabric, and hopes to collaborate on new ideas.

Citations

[1] "AI for Business." H2O.ai (0xData). http://www.h2o.ai/, n.d. Web. 05 May 2016.

[2] "MLbase." MLbase. http://www.mlbase.org/, n.d. Web. 05 May 2016.

[3] "OLTP vs. OLAP." http://datawarehouse4u.info/OLTP-vs-OLAP.html, n.d. Web. 05 May 2016.

[4] Setrakyan, Dmitriy. "Apache Ignite (Incubating) - In-Memory Data Fabric." GOTO Chicago 2015. http://gotocon.com/dl/goto-chicago-2015. Web.

[5] "Weka 3: Data Mining Software in Java." Weka 3. http://www.cs.waikato.ac.nz/ml/weka/, n.d. Web. 05 May 2016.

[6] "Apache Ignite." https://ignite.apache.org/index.html, n.d. Web. 12 May 2016.

[7] "UCI Machine Learning Repository." http://archive.ics.uci.edu/ml/, n.d. Web. 14 May 2016.
