
A Cloud Computing Framework with Machine Learning Algorithms for Industrial Applications

Brian Xu, D. Mylaraswamy, and P. Dietrich

Honeywell Aerospace, Golden Valley, MN, USA

Abstract. In this paper, a novel cloud computing framework with machine learning (ML) algorithms is presented for aerospace applications such as condition-based maintenance (CBM), anomaly detection, prediction of the onset of part failures, and reduction of total lifecycle costs. This cloud framework has been developed using MapReduce, HBase, and Hadoop Distributed File System (HDFS) technologies on a Hadoop cluster of OpenSUSE Linux machines. Its ML algorithms are based on the Mahout ML library, and its web portal is built using JBoss and the JDK. Importantly, the big data from various Honeywell data sources are managed in our HBase tables and analyzed by various ML algorithms. Users can access this cloud-based analytic toolset through web browsers anytime and anywhere. Further analytic results from this framework will be published later.

Keywords: Cloud Computing, Big Data Analytics, Machine Learning, Algorithms, MapReduce, CBM.

    1 Introduction

To deal with data overabundance and information overload, big data analytics and cloud computing technologies are being used by the world's top IT and software companies, such as Google, IBM, and Microsoft [1]. These technologies are now being adopted by other industries as well. In this paper, we present our prototype of a cloud framework with machine learning (ML) algorithms for aerospace applications, specifically Condition Based Maintenance (CBM), monitoring, diagnostics, and product reliability and performance. At Honeywell, big data (high in volume, velocity, and variety) are collected and streamed from thousands of aircraft (operational datasets, maintenance data, etc.), test cells (hundreds of sensors and measurements), and repair shops (records of electronics, avionics, and mechanical repairs). For instance, one test cell can generate 300 MB of test data per engine daily. Our approach is to combine the strengths of cloud computing and machine learning technologies in order to effectively analyze these big data and develop capabilities for predictive analysis, actionable information, better CBM, and decision making.

Technically, by combining and leveraging cloud computing and ML technologies, our major goals include (but are not limited to): (1) detecting anomalies in parts, components, and systems; (2) predicting the onset of part failures (e.g., components, LRUs, etc.) to maximize asset usage and availability and minimize downtime; (3) sustaining better and more effective CBM policies; and (4) reducing the total lifecycle costs of our aerospace assets and networks. Our primary task is to realize these goals by analyzing the big data and transforming information into knowledge. In our CBM applications, after developing our Hadoop cluster by leveraging the Apache ecosystem [2], we have focused on analyzing and mining our data sources using open-source ML algorithms, including the Mahout library [3], and on developing our own ML algorithms in R and Matlab.

This paper is organized as follows: Section 2 describes our cloud-based ML framework and its components. Section 3 details how Apache Hadoop, MapReduce, and HBase are used to build the cloud computing infrastructure on which our machine learning framework runs. Mahout ML algorithms are briefly introduced in Section 4. Our conclusions are presented in Section 5.

2 Architecture and Components of the Cloud-Based ML Framework

Our specific tasks are to find valuable insights, patterns, and trends in big data (large volume, velocity, and variety) that can lead to actionable information, decision making, prediction, and situation awareness and understanding. To complete these tasks, we have developed a cloud framework with machine learning technologies for cyber-learning, leveraging ML algorithms (SVM, random forests, PCA, k-means, etc.), knowledge mining, and knowledge-intensive problem solving.

We built our cloud-based ML framework by developing a Cloud Controller, Cluster Controllers, and Node Controllers on our Hadoop cluster of Linux machines. We used the Eucalyptus cloud tool [4] to develop the primary software framework. The framework architecture and key components are shown in Figure 1.

As shown in Figure 1, we implemented HBase, a scalable, distributed database that supports real-time access to large data repositories such as Oracle, MySQL, etc. Currently, we have five major HBase tables (more big tables can be created as needed):

1. CBM_use: This table manages user credentials and access privileges.

2. Field_reports: This table contains data from operating assets installed on various aircraft and operating vehicles.

3. ListOfValues: This table contains variables (typically vehicle-installed sensors) and sampled historical data. Each data set has a unique timestamp associated with it.

4. Repair_reports: This table contains data collected during the repair of a component. Typically the data include removal data, field observations (free text), parts replaced/repaired, and shop observations (free text).

5. Testcell_reports: This table contains data from laboratory acceptance and qualification testing. Most of the components we track undergo an acceptance test before they are shipped back to the field.
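To make the table layout concrete, the following is a minimal sketch of how one such table could be created with the HBase 0.92 Java client. The table name and the asset and event column families follow the listings later in this paper; the shop family and the class name are hypothetical illustrations rather than our production schema.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class CreateRepairReports {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    // One table per report type; column families group related columns.
    HTableDescriptor desc = new HTableDescriptor("Repair_reports");
    desc.addFamily(new HColumnDescriptor("asset"));  // e.g., asset:model, asset:serial
    desc.addFamily(new HColumnDescriptor("event"));  // e.g., removal events and timestamps
    desc.addFamily(new HColumnDescriptor("shop"));   // free-text shop observations (hypothetical)
    admin.createTable(desc);
  }
}

Because HBase rows are sparse, optional fields such as free-text shop observations simply occupy no space in rows that lack them.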

In general, our use of HBase relies on two technical components: (a) convenient base classes that support Hadoop MapReduce jobs and functions over HBase tables; and (b) query predicate push-down via server-side Scan and Get filters, which select only the data relevant to our tracking and management systems.
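As a hedged illustration of this predicate push-down (not code from our framework), the sketch below uses the standard HBase 0.92 client Scan API with a SingleColumnValueFilter, so filtering happens on the region servers and only matching rows cross the network; the table and column names echo the listings in this paper, while the query itself is hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class FilteredScan {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "Field_reports");
    // Server-side predicate: return only rows whose asset:engineModel is 331-350.
    Scan scan = new Scan();
    scan.setFilter(new SingleColumnValueFilter(
        Bytes.toBytes("asset"), Bytes.toBytes("engineModel"),
        CompareOp.EQUAL, Bytes.toBytes("331-350")));
    ResultScanner scanner = table.getScanner(scan);
    for (Result row : scanner) {
      System.out.println(row);
    }
    scanner.close();
    table.close();
  }
}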

As seen in Figure 1, HBase tables can work with relational databases such as SQL Server or MySQL and achieve high throughput in processing and analyzing big data. Listing 1 shows an example of the code used by our HBase to load data from our SQL Server databases, e.g., the Honeywell Predictive Trend Monitoring and Diagnostics (PTMD) database, among others.

Figure 1. Architecture and Components of our Cloud-based ML Framework.

Listing 1. A code segment for our HBase to load data from the SQL Server.

HADOOP_CLASSPATH=`/opt/hbase-0.92.1-security/bin/hbase classpath` \
${HADOOP_HOME}/bin/hadoop jar /opt/hbase-0.92.1-security/hbase-0.92.1-security.jar \
  importtsv \
  -Dimporttsv.columns=HBASE_ROW_KEY,asset:model,asset:serialnumber,test:device,test:type,test:objective,test:operator,event:time,algorithm:name,algorithm:date \
  Device_test \
  hdfs://RTanalytic/hadoop/scrap/DeviceSQL4HBase.Tag.txt

Our cloud-based ML framework works with our existing SQL databases and analytic tools, as seen in Figure 2. Major existing data sources include streaming datasets from test cells for aircraft engines and Auxiliary Power Units (APUs), from asset (e.g., electronic and mechanical parts) repair shops, and from aircraft fleets. We also have SQL Server instances, MySQL databases, and analytic tools (Matlab, a proprietary toolbox, etc.). The newly developed HBase tables are populated by ETL, and the selected datasets from the existing RDBMSs and HBase provide the data column families for the Mahout ML tools to analyze.
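As one hedged example of such an ETL step, the sketch below pulls rows from a SQL Server table over plain JDBC and writes the tab-separated file that the importtsv command in Listing 1 expects; the JDBC URL, credentials, and column names are hypothetical placeholders, not the actual PTMD schema.

import java.io.PrintWriter;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SqlToTsv {
  public static void main(String[] args) throws Exception {
    // Hypothetical connection details; the real ones belong to the PTMD server.
    Connection con = DriverManager.getConnection(
        "jdbc:sqlserver://ptmd-host:1433;databaseName=PTMD", "user", "password");
    Statement stmt = con.createStatement();
    ResultSet rs = stmt.executeQuery(
        "SELECT test_id, model, serialnumber, device FROM device_test");
    PrintWriter out = new PrintWriter("DeviceSQL4HBase.Tag.txt");
    while (rs.next()) {
      // One tab-separated line per row; the first field becomes the HBase row key.
      out.println(rs.getString("test_id") + "\t" + rs.getString("model")
          + "\t" + rs.getString("serialnumber") + "\t" + rs.getString("device"));
    }
    out.close();
    stmt.close();
    con.close();
  }
}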

3 Cloud-Based ML Framework Built Using the Apache Ecosystem

Our commodity computers were virtualized using the Xen hypervisor (www.xen.org/) as the virtualization platform. The OpenSUSE Linux operating system was installed on these virtualized computers. Our Apache Hadoop [2] software framework was then installed on these virtual machines, as seen in Figure 3, to support data-intensive distributed applications in the aerospace industry.

Our first Hadoop cluster consists of three nodes, one of which was designated as the NameNode and JobTracker. The other two machines act as both DataNodes and TaskTrackers. A distributed filesystem was configured and formatted across the three nodes.

MapReduce is the core Hadoop technology for easily writing applications that process vast amounts of data in parallel on Hadoop clusters. Our Hadoop cluster consists of a single master JobTracker and one slave TaskTracker per cluster node. The master node is responsible for scheduling a job's component tasks on the slave nodes, monitoring them, and re-executing failed tasks. The slave machines execute the tasks as directed by the master.

Figure 2. The Cloud-based ML Framework working with our existing data sources and analytic tools.

Figure 3. Screenshot of the Apache Hadoop Framework.

Figure 4 (a). A Screenshot of MapReduce running on our Hadoop Cluster.

Figure 4 (b). A Screenshot of our MapReduce working with HDFS.

Specifically, our MapReduce jobs run on the Hadoop cluster and can be submitted from the command line, as shown in the examples in Figures 4 (a) and (b). The MapReduce job searches for a string in an input set and writes its results to an output set. Each job can be split into parallel tasks working on independent chunks of data across the nodes.
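The search job itself is not listed in this paper, so the following is a minimal map-only sketch of how such a string search could be written against the Hadoop MapReduce API; the class name, the search.string configuration key, and the argument order are illustrative assumptions.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class StringSearch {
  // Each map task scans its own input split and emits the matching lines.
  public static class SearchMapper
      extends Mapper<LongWritable, Text, Text, NullWritable> {
    private String needle;

    @Override
    protected void setup(Context context) {
      needle = context.getConfiguration().get("search.string");
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      if (line.toString().contains(needle)) {
        context.write(line, NullWritable.get());
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("search.string", args[2]);   // the string to search for
    Job job = new Job(conf, "string-search");
    job.setJarByClass(StringSearch.class);
    job.setMapperClass(SearchMapper.class);
    job.setNumReduceTasks(0);             // map-only: matches go straight to HDFS
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(NullWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input set
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output set
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}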

In addition, our HBase supports massively parallelized processing via MapReduce, using HBase as both source and sink. The Hadoop Distributed File System (HDFS) is a distributed file system well suited to the storage of large files; it also handles failover and replicates blocks. Our HBase is built on top of HDFS and provides fast record lookups and updates for large tables (see Figure 1). Finally, the Eucalyptus cloud development tool [4] was used to build an Amazon Web Services-compatible cloud by leveraging our existing virtualized infrastructure.
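To illustrate the HBase-as-source pattern mentioned above, this hedged sketch scans the Device_test table from Listing 1 and counts rows per asset:model value using TableMapReduceUtil from the HBase 0.92 MapReduce package; the job name, class names, and output path are illustrative assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class ModelCount {
  // Emits (model, 1) for every row streamed out of the HBase table.
  static class ModelMapper extends TableMapper<Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(ImmutableBytesWritable key, Result row, Context context)
        throws java.io.IOException, InterruptedException {
      byte[] model = row.getValue(Bytes.toBytes("asset"), Bytes.toBytes("model"));
      if (model != null) {
        context.write(new Text(Bytes.toString(model)), ONE);
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "model-count");
    job.setJarByClass(ModelCount.class);
    Scan scan = new Scan();
    scan.addColumn(Bytes.toBytes("asset"), Bytes.toBytes("model"));
    // Wire the HBase table in as the job's input source.
    TableMapReduceUtil.initTableMapperJob("Device_test", scan,
        ModelMapper.class, Text.class, IntWritable.class, job);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileOutputFormat.setOutputPath(job, new Path("/tmp/model-count"));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Writing results back into HBase (the sink side of the pattern) works symmetrically through TableMapReduceUtil.initTableReducerJob.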

Several key Apache HBase tables have been developed. HBase provides the column families of data for the Mahout ML algorithms to analyze. Sample results from our HBase tables are shown in Listings 2 and 3, respectively.

Our cloud-based ML framework is integrated and used for our CBM analytics. The workflow of this framework in our CBM analytics applications is shown in Figure 5. The CBM workflow can be broadly described as a set of application and web servers. The application servers are intended to serve Matlab users, while the web service is used primarily to display summary conclusions made by the Matlab-based analytics. Both of these servers are implemented within the Hadoop cluster and use a metadata index file to retrieve the necessary data from our Hadoop cloud. The application servers also allow remote method invocations, so that analytic calculations can be done within Matlab, making full use of its mathematical and statistical libraries.

Listing 2. Partial results of scanning the RMSS_report HBase table.

37 column=event:removedEngine, timestamp=1340260441203, value=0
...
9 column=asset:aircraftSerial, timestamp=1340260441203, value=F900EX
9 column=asset:engineModel, timestamp=1340260441203, value=331-350
9 column=asset:engineSerial, timestamp=1340260441203, value=R0122B

Listing 3. Partial results of scanning the PTMD_report HBase table.

98 column=asset:airline, timestamp=1340261260821, value=Singapore Airlines
98 column=asset:airlinecode, timestamp=1340261260821, value=SNG
98 column=asset:mission, timestamp=1340261260821, value=Flight
98 column=asset:model, timestamp=1340261260821, value=331-350
98 column=asset:serial, timestamp=1340261260821,

    Figure 5. Workflow of the Cloud-based ML Framework for CBM Analytics.

4 Mahout Machine Learning Algorithms

The Mahout library has been integrated into our cloud-based ML framework shown in Figure 1. We have integrated and tested the Mahout ML algorithms; see Figures 6 (a) and (b). The most important applications of the Mahout ML algorithms include classification, clustering, and recommendation, along with pattern mining, regression, dimension reduction, and evolutionary algorithms. Having been used to analyze over 100 million data examples successfully [5], Mahout has become an increasingly important ML tool to use with Hadoop and MapReduce. We therefore decided to integrate Mahout ML algorithms into our cloud-based ML framework for real-world applications such as anomaly detection and fault mode prediction.

The Mahout machine learning algorithms are written using the MapReduce paradigm:

(a) supervised learning algorithms, including neural networks, support vector machines (SVMs), Naive Bayes classifiers, decision trees, random forests, and logistic regression;

(b) unsupervised learning algorithms, including k-means, hierarchical clustering, self-organizing maps, fuzzy k-means, Dirichlet clustering, PCA, ICA, expectation-maximization, and mean-shift.

One code sample of Naive Bayes in Mahout is shown in Listing 4.

We found that the Mahout ML algorithms can be used for better classification when working with ZooKeeper. More specifically, when serializing stochastic gradient descent (SGD) models for deployment as classifiers, it is usually best to serialize only the best-performing sub-model from an Adaptive Logistic Regression. The resulting serialized file will be about 100 times smaller [5] than one produced by serializing the entire ensemble of models contained in the Adaptive Logistic Regression object (a sketch of this pattern follows below).

The Mahout ML algorithms integrated into our cloud-based ML framework include Support Vector Machines, Naive Bayes classifiers, k-means, hierarchical clustering, self-organizing maps, fuzzy k-means, Dirichlet clustering, and mean-shift. All of these algorithms are written in the MapReduce paradigm.
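A minimal sketch of that serialization pattern, assuming the Mahout SGD API (AdaptiveLogisticRegression, CrossFoldLearner, ModelSerializer) and hypothetical feature sizes, labels, and paths; the real training loop is elided:

import org.apache.mahout.classifier.sgd.AdaptiveLogisticRegression;
import org.apache.mahout.classifier.sgd.L1;
import org.apache.mahout.classifier.sgd.ModelSerializer;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

public class SerializeBestModel {
  public static void main(String[] args) throws Exception {
    int numCategories = 2;    // hypothetical: healthy vs. faulty
    int numFeatures = 1000;   // hypothetical feature-vector size
    AdaptiveLogisticRegression learner =
        new AdaptiveLogisticRegression(numCategories, numFeatures, new L1());
    // Training loop elided: call learner.train(label, features) per example.
    Vector features = new RandomAccessSparseVector(numFeatures);
    learner.train(0, features);
    learner.close();
    // Serialize only the best-performing sub-model, not the whole ensemble.
    ModelSerializer.writeBinary("/tmp/best-model.bin",
        learner.getBest().getPayload().getLearner().getModels().get(0));
  }
}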

Listing 4. A code snippet of the Naive Bayes classifier in Mahout.

// Imports (org.apache.hadoop.fs.Path and the legacy
// org.apache.mahout.classifier.bayes classes) are omitted for brevity.
public class Starter {
  public static void main(final String[] args) {
    final BayesParameters params = new BayesParameters();
    params.set("classifierType", "bayes");
    try {
      // Train a Naive Bayes model from the input directory.
      Path input = new Path("/tmp/input");
      TrainClassifier.trainNaiveBayes(input, "/tmp/output", params);
      // Load the trained model and build the classifier context.
      Algorithm algorithm = new BayesAlgorithm();
      Datastore datastore = new InMemoryBayesDatastore(params);
      ClassifierContext classifier =
          new ClassifierContext(algorithm, datastore);
      classifier.initialize();
    } catch (Exception e) {
      e.printStackTrace();
    }
  }
}

Experiments: Specifically, we tested the Mahout algorithms against our data sources, such as the PTMD datasets, test cells, and others. One of our experiments was to test Mahout k-means on the PTMD data; details are summarized below. One of the models we want to build is a trend model that describes how the EGT_MARGIN changes as a function of the APU age (e.g., flight.AHRS, etc.), using the PTMD data. We believe that the flight.TAT is a random perturbation to this model.

Our initial steps include:

(1) Mahout vectors are converted into a Hadoop sequence file, using the seqdirectory tool to convert text files into a Hadoop sequence file for Mahout to run on;

(2) feature vectors are created with three dimensions; and

(3) RandomAccessSparseVector is used for vectorizing... (see the sketch below).
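The vectorization in steps (2) and (3) could look like the following sketch, which builds three-dimensional RandomAccessSparseVector instances and writes them to a Hadoop sequence file as input for Mahout's k-means driver; the sample values, row keys, and paths here are hypothetical placeholders, not PTMD data.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class PtmdVectorizer {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path out = new Path("/tmp/ptmd-vectors/part-00000");
    SequenceFile.Writer writer =
        new SequenceFile.Writer(fs, conf, out, Text.class, VectorWritable.class);
    // Hypothetical samples over the three dimensions named in the text:
    // EGT_MARGIN, flight.AHRS (a proxy for APU age), and flight.TAT.
    double[][] samples = { {12.4, 3050.0, 18.2}, {11.9, 3122.0, 17.5} };
    for (int i = 0; i < samples.length; i++) {
      Vector v = new RandomAccessSparseVector(3);
      v.assign(samples[i]);
      writer.append(new Text("apu-" + i), new VectorWritable(v));
    }
    writer.close();
    // The resulting sequence file becomes the --input of Mahout's kmeans job.
  }
}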
