A Cloud Computing Framework with Machine Learning
Algorithms for Industrial Applications
Brian Xu, D. Mylaraswamy, and P. Dietrich
Honeywell Aerospace, Golden Valley, MN, USA
Abstract. In this paper, a novel cloud computing
framework with machine learning (ML) algorithms is
presented for aerospace applications such as condition-based
maintenance, anomaly detection, prediction of the
onset of part failures, and reduction of total lifecycle costs.
This cloud framework has been developed using
MapReduce, HBase, and Hadoop Distributed File System
(HDFS) technologies on a Hadoop cluster of OpenSUSE
Linux machines. Its ML algorithms are based on the Mahout
ML library, and its web portal is built using JBoss and
the JDK. Importantly, the big data from various Honeywell
data sources are managed by our HBase and analyzed by
various ML algorithms. Users can access this cloud-based
analytic toolset through web browsers anytime and
anywhere. Additional analytic results obtained with this
framework will be published later.
Keywords: Cloud Computing, Big Data Analytics,
Machine Learning, Algorithms, MapReduce, CBM.
1 Introduction
To deal with data overabundance and information
overload, big data analytics and cloud computing
technologies are being used by the world's top IT and
software companies, such as Google, IBM, and Microsoft
[1]. Currently these technologies are being adopted by
other industries as well. In this paper, we present our
prototype cloud framework with machine learning (ML)
algorithms for aerospace applications, specifically
Condition Based Maintenance (CBM), monitoring,
diagnostics, and product reliability and performance. At
Honeywell, big data (high volume, velocity, and variety)
are collected and streamed from thousands of
aircraft (operational datasets, maintenance data, etc.), test
cells (hundreds of sensors and measurements), and
repair shops (records of electronics, avionics, and mechanical
repairs). For instance, one test cell can generate 300
MB of test data per engine daily. Our approach is to
combine the strengths and synergies of cloud
computing and machine learning technologies in order
to effectively analyze the big data and develop
capabilities for predictive analysis, actionable information,
better CBM, and decision making.
Technically, by combining and leveraging cloud
computing and ML technologies, our major goals
include (but are not limited to): (1) detecting anomalies in
parts, components, and systems; (2) predicting the onset of
part failures (e.g., components, LRUs, etc.) to
maximize asset usage and availability and minimize
downtime; (3) sustaining better and more effective CBM
policies; and (4) reducing the total lifecycle costs of our
aerospace assets and networks. Our primary tasks are to
realize these goals by analyzing the big data and
transforming information into knowledge. In our CBM
applications, after developing our Hadoop cluster by
leveraging the Apache ecosystem [2], we have focused on
analyzing and mining our data sources using open
source ML algorithms, including the Mahout library [3],
and on developing our own ML algorithms in R and Matlab.
This paper is organized as follows: Section 2 describes
our cloud-based ML framework and components. Apache
Hadoop, MapReduce, and HBase are used to develop an
effective cloud computing infrastructure on which our
machine learning framework is built, and the technical
details are described in Section 3. Mahout ML algorithms
are briefly introduced in Section 4. Our conclusions are
presented in Section 5.
2 Architecture and components of
Cloud-based ML Framework
Our specific tasks are to find valuable insights, patterns,
and trends in big data (high volume, velocity, and variety)
that can lead to actionable information, decision making,
prediction, and situation awareness and understanding. To
complete these technical tasks, we have developed a cloud
framework with machine learning technologies for cyber-
learning, leveraging machine learning algorithms (SVM,
random forests, PCA, k-means, etc.), knowledge mining,
and knowledge-intensive problem solving.
We built our cloud-based ML framework by
developing a Cloud Controller, Cluster Controllers, and
Node Controllers on our Hadoop cluster of Linux machines.
We used the Eucalyptus cloud tool [4] to develop our primary
software framework. The framework architecture and key
components are shown in Figure 1.
As shown in Figure 1, we implemented HBase, a scalable,
distributed database that supports real-time access to large
data and works alongside relational repositories such as
Oracle and MySQL. Currently, we have five major HBase
tables (more big tables can be created as needed):
1. CBM_use: This table manages user credentials and
access privileges.
2. Field_reports: This table contains data from
operating assets installed on various aircraft and
operating vehicles.
3. ListOfValues: This table contains variables
(typically vehicle installed sensors) and sampled
historical data. Each data set has a unique
timestamp associated with it.
4. Repair_reports: This table contains data collected
during the repair of a component. Typical data
include removal data, field observations (free text),
parts replaced/repaired, and shop observations (free
text).
5. Testcell_reports: This table contains data from the
laboratory acceptance and qualification testing.
Most of the components we track undergo an
acceptance test before they are shipped back to the
field.
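All five tables above share HBase's logical data model: a row key maps to column families whose qualified columns hold timestamped values, with the newest version returned by default. A minimal plain-Java sketch of that model (an illustration only, not the HBase client API; the second engineModel value is hypothetical):

```java
import java.util.Map;
import java.util.TreeMap;

// Minimal sketch of HBase's logical data model:
// row key -> "family:qualifier" -> timestamp -> value.
// Illustration only; real access goes through the HBase client API.
public class HBaseModelSketch {
    private final Map<String, Map<String, TreeMap<Long, String>>> rows = new TreeMap<>();

    public void put(String rowKey, String column, long timestamp, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>())
            .computeIfAbsent(column, k -> new TreeMap<>())
            .put(timestamp, value);
    }

    // Latest-version lookup, as in an HBase Get with a single version.
    public String get(String rowKey, String column) {
        Map<String, TreeMap<Long, String>> row = rows.get(rowKey);
        if (row == null || !row.containsKey(column)) return null;
        return row.get(column).lastEntry().getValue();  // highest timestamp wins
    }

    public static void main(String[] args) {
        HBaseModelSketch table = new HBaseModelSketch();
        // Column names mirror the Listing 2 sample; the newer value is hypothetical.
        table.put("9", "asset:engineModel", 1340260441203L, "331-350");
        table.put("9", "asset:engineModel", 1340260441300L, "331-350A");
        System.out.println(table.get("9", "asset:engineModel"));  // newest version
    }
}
```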
In general, our HBase usage has two technical components: (a)
convenient base classes that support Hadoop
MapReduce jobs and functions over HBase tables; and (b)
query predicate push-down via server-side Scan and
Get filters, which select only the data relevant to track
management systems.
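The benefit of predicate push-down in (b) is that the filter runs where the data lives, so only matching rows travel back to the client. A hypothetical sketch of that idea (plain Java, not the HBase Filter API):

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Illustration of predicate push-down: the scan applies the filter in
// place on the region's rows, so only matches are returned to the client.
// All names here are hypothetical, not the HBase Filter API.
public class PushDownSketch {
    public record Row(String key, String engineModel) {}

    // "Server side": scan with the predicate applied during iteration.
    public static List<Row> scanWithFilter(List<Row> regionRows, Predicate<Row> filter) {
        return regionRows.stream().filter(filter).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Row> region = List.of(
            new Row("9", "331-350"),
            new Row("37", "331-500"));
        // Only rows for one engine model cross the network.
        List<Row> result = scanWithFilter(region, r -> r.engineModel().equals("331-350"));
        System.out.println(result.size());  // 1
    }
}
```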
As seen in Figure 1, HBase tables can work with
relational databases such as SQL Server or MySQL and
achieve high speed in processing and analyzing
the big data. Listing 1 shows example code for our
HBase to get data from our SQL Server,
e.g., the Honeywell Predictive Trend Monitoring and
Diagnostics (PTMD) database, and others.
Figure 1.
Architecture and Components of our Cloud-based ML Framework.
Listing 1. A Code Segment for our HBase to get data from
the SQL Server.
… …
HADOOP_CLASSPATH=`/opt/hbase-0.92.1-security/bin/hbase classpath`
${HADOOP_HOME}/bin/hadoop jar /opt/hbase-0.92.1-security/hbase-0.92.1-security.jar importtsv \
  -Dimporttsv.columns=HBASE_ROW_KEY,asset:model,asset:serialnumber,test:device,test:type,test:objective,test:operator,…,event:time,algorithm:name,algorithm:date \
  'Device_test' \
  hdfs://RTanalytic…/hadoop/scrap/DeviceSQL4HBase.Tag.txt
… …
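Conceptually, importtsv maps the i-th tab-separated field of each input line to the i-th name in the column specification, with the field marked HBASE_ROW_KEY becoming the row key. A simplified plain-Java sketch of that mapping (illustrative only, not the importtsv implementation; the sample values mirror Listing 2):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the importtsv column mapping: field i of a TSV line is
// assigned to name i of the column spec. Illustration only.
public class ImportTsvSketch {
    public static Map<String, String> mapLine(String[] columnSpec, String line) {
        String[] fields = line.split("\t");
        Map<String, String> cells = new LinkedHashMap<>();
        for (int i = 0; i < columnSpec.length && i < fields.length; i++) {
            cells.put(columnSpec[i], fields[i]);  // includes the HBASE_ROW_KEY entry
        }
        return cells;
    }

    public static void main(String[] args) {
        // Column spec abbreviated from Listing 1.
        String[] spec = {"HBASE_ROW_KEY", "asset:model", "asset:serialnumber"};
        Map<String, String> cells = mapLine(spec, "9\t331-350\tR0122B");
        System.out.println(cells.get("asset:model"));  // 331-350
    }
}
```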
Our cloud-based ML framework works with our existing
SQL databases and analytic tools, as seen in Figure 2.
Major existing data sources include streamed datasets from
test cells of aircraft engines and Auxiliary Power Units
(APUs), asset repair shops (e.g., electronic parts,
mechanical parts), and aircraft fleets. We have our SQL
servers, MySQL databases, and analytic tools (Matlab,
a proprietary toolbox, etc.). Newly developed HBase
tables are populated by ETL, and the selected datasets
from the existing RDBMSs and the HBase provide the data
column families for the Mahout ML tools to analyze.
3 Cloud-Based ML Framework
Built Using Apache Ecosystem
Our commodity computers were virtualized using the
Xen Hypervisor (www.xen.org/) as a virtualization
platform. The OpenSUSE Linux operating system was
installed on these virtualized computers. Our Apache
Hadoop [2] software framework is installed on these
virtual machines, as seen in Figure 3, to support data-
intensive distributed applications in the aerospace industry.
Our first Hadoop cluster consists of three nodes, one of
which was designated as the NameNode and JobTracker.
The other two machines acted as both DataNodes and
TaskTrackers. A distributed filesystem was configured
and formatted across the three nodes.
MapReduce is the core Hadoop technology for
easily writing applications that process vast amounts of
data in parallel on Hadoop clusters. Our Hadoop cluster
consists of a single master JobTracker and one slave
TaskTracker per cluster node. The master node is
responsible for scheduling the jobs' component tasks on
the slave nodes, monitoring them, and re-executing
failed tasks. The slave computers execute the tasks as
directed by the master.
Figure 2.
The Cloud-based ML Framework working with our existing data sources and analytic tools.
Figure 3.
Screenshot of the Apache Hadoop Framework.
Figure 4 (a).
A Screenshot of the MapReduce running on our Hadoop Cluster.
Figure 4 (b).
A Screenshot of our MapReduce working with HDFS.
Specifically, our MapReduce jobs run on the Hadoop cluster
and can be submitted from the command line, as
shown in the examples in Figures 4 (a) and (b). The
MapReduce job searches for a string in an input set and
writes its results to an output set. Each job can be split
into parallel tasks working on independent chunks of
data across the nodes.
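The behavior just described, each chunk searched independently and the per-chunk results then merged, can be sketched in plain Java (an illustration of the MapReduce pattern only, not the Hadoop API; the sample lines are hypothetical):

```java
import java.util.Arrays;
import java.util.List;

// MapReduce-style string search: the "map" phase counts matching lines in
// each independent chunk in parallel; the "reduce" phase sums the partial
// counts. Plain-Java illustration of the pattern, not the Hadoop API.
public class GrepSketch {
    public static long countMatches(List<String> chunks, String needle) {
        return chunks.parallelStream()                        // map tasks over chunks
            .mapToLong(chunk -> Arrays.stream(chunk.split("\\R"))
                .filter(line -> line.contains(needle))
                .count())
            .sum();                                           // reduce: sum partials
    }

    public static void main(String[] args) {
        List<String> chunks = List.of(
            "engine start ok\nEGT margin low",
            "EGT margin nominal\nshutdown ok");
        System.out.println(countMatches(chunks, "EGT"));  // 2
    }
}
```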
In addition, our HBase supports massively parallel
processing via MapReduce, using HBase as both
source and sink. The Hadoop Distributed File System (HDFS)
is a distributed file system that is well suited to the
storage of large files. HDFS also handles failover
and replicates blocks. Our HBase is built on top of
HDFS and provides fast record lookups and updates for
large tables (see Figure 1). In addition, the Eucalyptus cloud
development tool [4] was used to build an Amazon
Web Services compatible cloud by leveraging our
existing virtualized infrastructure.
Several key Apache HBase tables have been developed. The
HBase provides column families of the data for the Mahout
ML algorithms to analyze. Sample results from our HBase
tables are shown in Listings 2 and 3, respectively.
Our cloud-based ML framework is integrated and used
for our CBM analytics. The workflow of this framework
used in our CBM analytics applications is shown in
Figure 5. The CBM workflow can be broadly described
as a set of application servers and web servers. The application
servers are intended to serve Matlab users, while the
web service is used primarily to display summary
conclusions made by Matlab-based analytics. Both of these
servers are implemented within the Hadoop cluster and
use a metadata index file to retrieve the necessary data
from our Hadoop cloud. The application servers also allow
remote method invocations so that analytic calculations
can be done within Matlab, making full use of its
mathematical and statistical libraries.
Listing 2. Partial results of 'scan RMSS_report'
HBase table.
… …
37 column=event:removedEngine,
timestamp=1340260441203, value=0
...
9 column=asset:aircraftSerial,
timestamp=1340260441203, value=F900EX
9 column=asset:engineModel,
timestamp=1340260441203, value=331-350
9 column=asset:engineSerial,
timestamp=1340260441203, value=R0122B … …
Listing 3. Partial results of 'scan PTMD_report'
HBase table.
… …
98 column=asset:airline, timestamp=1340261260821,
value=Singapore Airlines
98 column=asset:airlinecode,
timestamp=1340261260821, value=SNG
98 column=asset:mission, timestamp=1340261260821,
value=Flight
98 column=asset:model, timestamp=1340261260821,
value=331-350
98 column=asset:serial, timestamp=1340261260821, … …
Figure 5. Workflow of the Cloud-based ML Framework for CBM Analytics.
4 Mahout Machine Learning
Algorithms
The Mahout library has been integrated into our cloud-based
ML framework shown in Figure 1. We have integrated and tested
the Mahout ML algorithms; see Figures 6 (a) and (b).
The most important applications of the Mahout ML
algorithms include classification, clustering, and
recommendation, along with pattern mining, regression,
dimension reduction, and evolutionary algorithms. Having
been used to successfully analyze over 100 million data
examples [5], Mahout has become an increasingly important
ML tool to use with Hadoop and MapReduce. We therefore
decided to integrate Mahout ML algorithms into our
cloud-based ML framework for real-world applications
such as anomaly detection and fault mode prediction.
The Mahout machine learning algorithms are written
using the MapReduce paradigm:
(a) supervised learning algorithms including neural
networks, support vector machines (SVMs),
Naive Bayesian classifiers, decision trees,
random forests, and logistic regression;
(b) unsupervised learning algorithms including k-
means, hierarchical clustering, self-organizing
maps, fuzzy k-means, Dirichlet clustering, PCA, ICA,
expectation-maximization, and mean-shift.
One code sample of Naive Bayes in Mahout is shown in
Listing 4.
We found that Mahout ML algorithms can be used for
better classification by working with ZooKeeper. More
specifically, when serializing stochastic gradient descent
ML models for deployment as classifiers, it is usually
best to serialize only the best-performing sub-model from
Adaptive Logistic Regression. The resulting serialized
file will be about 100 times smaller [5] than would result from
serializing the entire ensemble of models contained in
the Adaptive Logistic Regression object. The Mahout ML
algorithms integrated into our cloud-based ML
framework include support vector machines,
Naive Bayes classifiers, k-means, hierarchical clustering,
self-organizing maps, fuzzy k-means, Dirichlet clustering, and
mean-shift. All of these algorithms are written in the Map-
Reduce paradigm.
Listing 4. A code snippet of the Naive Bayes
classifier in Mahout.
public class Starter {
    public static void main( final String[] args ) {
        final BayesParameters params = new BayesParameters();
        … …
        params.set( "classifierType", "bayes" );
        … …
        try {
            Path input = new Path( "/tmp/input" );
            TrainClassifier.trainNaiveBayes( input, "/tmp/output", params );
            Algorithm algorithm = new BayesAlgorithm();
            Datastore datastore = new InMemoryBayesDatastore( params );
            ClassifierContext classifier =
                new ClassifierContext( algorithm, datastore );
            … …
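Independent of the Mahout API shown in Listing 4, the core of Naive Bayes, counting token frequencies per class during training and classifying by the largest smoothed log-likelihood, can be sketched in plain Java. This is an illustrative toy with hypothetical feature tokens, not our production classifier; Mahout distributes the same counting via MapReduce:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy multinomial Naive Bayes with Laplace smoothing. Illustration of the
// algorithm only; Mahout's implementation distributes the counting.
public class NaiveBayesSketch {
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();
    private final Map<String, Integer> classTotals = new HashMap<>();
    private final Set<String> vocab = new HashSet<>();

    public void train(String label, List<String> tokens) {
        Map<String, Integer> c = counts.computeIfAbsent(label, k -> new HashMap<>());
        for (String t : tokens) {
            c.merge(t, 1, Integer::sum);        // per-class token count
            classTotals.merge(label, 1, Integer::sum);
            vocab.add(t);
        }
    }

    public String classify(List<String> tokens) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : counts.keySet()) {
            double score = 0.0;  // log P(tokens | label); uniform prior omitted
            int total = classTotals.get(label);
            for (String t : tokens) {
                int n = counts.get(label).getOrDefault(t, 0);
                score += Math.log((n + 1.0) / (total + vocab.size()));  // Laplace
            }
            if (score > bestScore) { bestScore = score; best = label; }
        }
        return best;
    }

    public static void main(String[] args) {
        NaiveBayesSketch nb = new NaiveBayesSketch();
        // Hypothetical condition-monitoring tokens, for illustration only.
        nb.train("fault",  List.of("egt", "high", "vibration"));
        nb.train("normal", List.of("egt", "nominal", "steady"));
        System.out.println(nb.classify(List.of("vibration", "high")));  // fault
    }
}
```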
Experiments: Specifically, we tested the Mahout algorithms
against our data sources, such as the PTMD datasets, test
cells, and others. One of our experiments was to test
Mahout k-means on the PTMD data; details are briefly
described below. One of the models we want to build is a
trend model that describes how EGT_MARGIN
changes as a function of APU age (e.g., flight.AHRS,
etc.), using the PTMD data.
We believe that flight.TAT is a random perturbation
to this model.
Our initial steps include:
(1) the input text files are converted into a Hadoop
sequence file of Mahout Vectors, using the
seqdirectory tool, so that Mahout can run on them;
(2) feature vectors are created with three dimensions;
(3) RandomAccessSparseVector is used for vectorizing
the data; and
(4) the resulting sequence file is saved in Hadoop.
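Step (2) above builds a three-dimensional feature vector in which each dimension is the average of one parameter's array of sampled doubles. A plain-Java sketch of that step (illustrative only; the sample values are hypothetical, and Mahout's RandomAccessSparseVector plays this role in the real pipeline):

```java
import java.util.Arrays;

// Build a 3-dimensional feature vector by averaging each parameter's
// array of sampled doubles, as in step (2). Illustrative sketch only.
public class FeatureVectorSketch {
    public static double[] toFeatureVector(double[]... parameterSamples) {
        double[] v = new double[parameterSamples.length];
        for (int i = 0; i < parameterSamples.length; i++) {
            v[i] = Arrays.stream(parameterSamples[i]).average().orElse(0.0);
        }
        return v;
    }

    public static void main(String[] args) {
        // Hypothetical samples for three engine parameters.
        double[] vec = toFeatureVector(
            new double[]{400.0, 420.0},
            new double[]{610.0, 590.0},
            new double[]{25.0, 35.0});
        System.out.println(Arrays.toString(vec));  // [410.0, 600.0, 30.0]
    }
}
```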
Figure 6 (a). Mahout algorithms are installed in our Framework.
Figure 6 (b). Mahout algorithms are tested in our Framework.
Technically, we
(a) got 17 data files from our HDFS;
(b) vectorized the data (so that Mahout can understand
it): we parsed the values of astart.EGTP, mesone.EGTA,
and algout.EGT_MARGIN from the data files;
took the average of the values, since each of these
parameters has an array of double values;
created feature vectors with 3 dimensions, using
RandomAccessSparseVector for the vectorization;
and saved the result as a Hadoop-readable sequence file;
(c) identified initial clusters: k-means requires guessing
initial centroids, and since the dataset has
17 files, we guessed just two centroids; and
(d) ran the k-means algorithm with a convergence
delta of 0.0001, 10 iterations, and a
cluster classification threshold of 0.5.
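The run in (c) and (d), two guessed centroids iterated until the centroid shift falls below the convergence delta, can be sketched in plain Java. This is a toy one-dimensional version with hypothetical values; Mahout distributes each iteration as a MapReduce job:

```java
import java.util.Arrays;

// Toy 1-D k-means with k = 2: assign each point to its nearest centroid,
// recompute centroids, and stop when the largest centroid movement falls
// below the convergence delta or the iteration budget runs out.
public class KMeansSketch {
    public static double[] run(double[] points, double[] centroids,
                               double delta, int maxIterations) {
        double[] c = centroids.clone();
        for (int iter = 0; iter < maxIterations; iter++) {
            double[] sum = new double[c.length];
            int[] count = new int[c.length];
            for (double p : points) {              // assignment step ("map")
                int best = 0;
                for (int j = 1; j < c.length; j++) {
                    if (Math.abs(p - c[j]) < Math.abs(p - c[best])) best = j;
                }
                sum[best] += p;
                count[best]++;
            }
            double shift = 0.0;                    // update step ("reduce")
            for (int j = 0; j < c.length; j++) {
                if (count[j] == 0) continue;       // keep empty centroid in place
                double updated = sum[j] / count[j];
                shift = Math.max(shift, Math.abs(updated - c[j]));
                c[j] = updated;
            }
            if (shift < delta) break;              // converged
        }
        return c;
    }

    public static void main(String[] args) {
        // Hypothetical per-file EGT_MARGIN averages; same delta and
        // iteration settings as in (d).
        double[] points = {24.0, 26.0, 25.0, 60.0, 62.0};
        double[] result = run(points, new double[]{20.0, 70.0}, 0.0001, 10);
        System.out.println(Arrays.toString(result));  // [25.0, 61.0]
    }
}
```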
We got two clusters as a result: 5 files were correctly
assigned to cluster "1" and the remaining 12 to cluster
"0". Our future work and tasks will include the
following areas:
(1) Classification: predicting a category, e.g.,
discrete, finite values with no ordering;
(2) Regression: predicting a numeric quantity, e.g.,
continuous, infinite values with ordering.
With our data sources (e.g. PTMD, etc.), we will
evaluate and select the best of the ML algorithms,
including Linear Regression, Logistic Regression, Linear
and Logistic Regression with regularization, Neural
Networks, Support Vector Machine, Naive Bayes,
Nearest Neighbor, Decision Tree, Random Forest, and
Gradient Boosted Trees. In addition, we have also tested
well-known open source ML tools such as RapidMiner
and Weka. Although RapidMiner and Weka are not
designed for cloud computing and big data analytics, we
tested them in order to evaluate and select the best
possible ML algorithms for our CBM applications.
5 Conclusions
In this paper, our cloud-based ML framework was
developed using Apache Hadoop, MapReduce, HBase,
and other components of the Apache ecosystem. Mahout
machine learning algorithms are integrated with this
framework in order to analyze real-world data sources from
Honeywell engines, auxiliary power units (APUs), line
replaceable units (LRUs), and many other products.
We found that cloud computing tools (Apache Hadoop and
MapReduce, together with cloud software such as OpenStack
or Eucalyptus) can be used to develop cloud-based analytics
tools for various industries, including aerospace and
manufacturing vendors, in order to provide big data analytics
capabilities. They can complement traditional RDBMS and
analysis technologies, which remain well suited to many
commercial applications (e.g., banking, retail).
Our future work will focus on evaluating and improving
more ML algorithms for specific tasks and datasets. Much
more data will be brought into our cloud and analyzed
more effectively and efficiently by our big data analytics
tools using our best machine learning algorithms.
6 References
1. Cloud MapReduce, http://code.google.com/p/cloudmapreduce/
2. Apache Ecosystem, http://apache.org/
3. Apache Mahout, http://mahout.apache.org/
4. Eucalyptus Cloud Software, www.eucalyptus.com/
5. Sean Owen et al., "Mahout in Action", Manning Publications Co., 2012.