
A Cloud Computing Framework with Machine Learning Algorithms for Industrial Applications

Brian Xu, D. Mylaraswamy, and P. Dietrich

Honeywell Aerospace, Golden Valley, MN, USA

Abstract. In this paper, a novel cloud computing framework with machine learning (ML) algorithms is presented for aerospace applications such as condition-based maintenance (CBM), anomaly detection, prediction of the onset of part failures, and reduction of total lifecycle costs. This cloud framework has been developed using MapReduce, HBase, and Hadoop Distributed File System (HDFS) technologies on a Hadoop cluster of OpenSUSE Linux machines. Its ML algorithms are based on the Mahout ML library, and its web portal is built using JBoss and the JDK. Importantly, the big data from various Honeywell data sources are managed in our HBase tables and analyzed by various ML algorithms. Users can access this cloud-based analytic toolset through web browsers anytime and anywhere. Further analytic results from this framework will be published later.

Keywords: Cloud Computing, Big Data Analytics, Machine Learning, Algorithms, MapReduce, CBM.

    1 Introduction

To deal with data overabundance and information overload, big data analytics and cloud computing technologies are being used by the world's top IT and software companies, such as Google, IBM, and Microsoft [1]. These technologies are now being adopted by other industries as well. In this paper, we present our prototype of a cloud framework with machine learning (ML) algorithms for aerospace applications, specifically Condition Based Maintenance (CBM), monitoring, diagnostics, and product reliability and performance. At Honeywell, big data (high in volume, velocity, and variety) are collected and streamed from thousands of aircraft (operational datasets, maintenance data, etc.), test cells (hundreds of sensors and measurements), and repair shops (records of electronics, avionics, and mechanical repairs). For instance, one test cell can generate 300 MB of test data per engine daily. Our approach is to combine the strengths of cloud computing and machine learning technologies in order to effectively analyze these big data and develop capabilities for predictive analysis, actionable information, better CBM, and decision making.

Technically, by combining and leveraging cloud computing and ML technologies, our major goals include (but are not limited to): (1) detecting anomalies in parts, components, and systems; (2) predicting the onset of part failures (e.g., components, LRUs, etc.) to maximize asset usage and availability and minimize downtime; (3) sustaining better and more effective CBM policies; and (4) reducing the total lifecycle costs of our aerospace assets and networks. Our primary task is to realize these goals by analyzing the big data and transforming information into knowledge. In our CBM applications, after developing our Hadoop cluster by leveraging the Apache ecosystem [2], we have focused on analyzing and mining our data sources using open-source ML algorithms, including the Mahout library [3], and on developing our own ML algorithms in R and Matlab.

This paper is organized as follows: Section 2 describes our cloud-based ML framework and its components. Section 3 details how Apache Hadoop, MapReduce, and HBase are used to build the cloud computing infrastructure on which our machine learning framework runs. Mahout ML algorithms are briefly introduced in Section 4. Our conclusions are presented in Section 5.

2 Architecture and Components of the Cloud-Based ML Framework

Our specific tasks are to find valuable insights, patterns, and trends in big data (large volume, velocity, and variety) that can lead to actionable information, decision making, prediction, and situation awareness and understanding. To complete these tasks, we have developed a cloud framework with machine learning technologies for cyber-learning, leveraging ML algorithms (SVM, random forests, PCA, k-means, etc.), knowledge mining, and knowledge-intensive problem solving.

We built our cloud-based ML framework by developing a Cloud Controller, Cluster Controllers, and Node Controllers on our Hadoop cluster of Linux machines. We used the Eucalyptus cloud tool [4] to develop the primary software framework. The framework architecture and key components are shown in Figure 1.

As shown in Figure 1, we implemented HBase, a scalable, distributed database that supports real-time access to large data repositories such as Oracle, MySQL, etc. Currently, we have five major HBase tables (more big tables can be created as needed):

1. CBM_use: This table manages user credentials and access privileges.

2. Field_reports: This table contains data from operating assets installed on various aircraft and operating vehicles.

3. ListOfValues: This table contains variables (typically vehicle-installed sensors) and sampled historical data. Each data set has a unique timestamp associated with it.

4. Repair_reports: This table contains data collected during the repair of a component. Typically the data include removal data, field observations (free text), parts replaced/repaired, and shop observations (free text).

5. Testcell_reports: This table contains data from laboratory acceptance and qualification testing. Most of the components we track undergo an acceptance test before they are shipped back to the field.
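To make the table layout concrete, the following is a minimal sketch of how one such table could be created with the HBase 0.92 Java client. The table name and the asset and event column families follow the listings later in this paper; the shop family and the class name are hypothetical illustrations rather than our production schema.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class CreateRepairReports {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    // One table per report type; column families group related columns.
    HTableDescriptor desc = new HTableDescriptor("Repair_reports");
    desc.addFamily(new HColumnDescriptor("asset"));  // e.g., asset:model, asset:serial
    desc.addFamily(new HColumnDescriptor("event"));  // e.g., removal events and timestamps
    desc.addFamily(new HColumnDescriptor("shop"));   // free-text shop observations (hypothetical)
    admin.createTable(desc);
  }
}

Because HBase rows are sparse, optional fields such as free-text shop observations simply occupy no space in rows that lack them.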

In general, our use of HBase relies on two technical components: (a) convenient base classes that support Hadoop MapReduce jobs and functions over HBase tables; and (b) query predicate push-down via server-side Scan and Get filters, which select only the data relevant to our tracking and management systems.
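As a hedged illustration of this predicate push-down (not code from our framework), the sketch below uses the standard HBase 0.92 client Scan API with a SingleColumnValueFilter, so filtering happens on the region servers and only matching rows cross the network; the table and column names echo the listings in this paper, while the query itself is hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class FilteredScan {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "Field_reports");
    // Server-side predicate: return only rows whose asset:engineModel is 331-350.
    Scan scan = new Scan();
    scan.setFilter(new SingleColumnValueFilter(
        Bytes.toBytes("asset"), Bytes.toBytes("engineModel"),
        CompareOp.EQUAL, Bytes.toBytes("331-350")));
    ResultScanner scanner = table.getScanner(scan);
    for (Result row : scanner) {
      System.out.println(row);
    }
    scanner.close();
    table.close();
  }
}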

As seen in Figure 1, HBase tables can work with relational databases such as SQL Server or MySQL and achieve high throughput in processing and analyzing big data. Listing 1 shows an example of the code used by our HBase to load data from our SQL Server databases, e.g., the Honeywell Predictive Trend Monitoring and Diagnostics (PTMD) database, among others.

Figure 1. Architecture and Components of our Cloud-based ML Framework.

Listing 1. A code segment for our HBase to load data from the SQL Server.

HADOOP_CLASSPATH=`/opt/hbase-0.92.1-security/bin/hbase classpath` \
${HADOOP_HOME}/bin/hadoop jar /opt/hbase-0.92.1-security/hbase-0.92.1-security.jar \
  importtsv \
  -Dimporttsv.columns=HBASE_ROW_KEY,asset:model,asset:serialnumber,test:device,test:type,test:objective,test:operator,event:time,algorithm:name,algorithm:date \
  Device_test \
  hdfs://RTanalytic/hadoop/scrap/DeviceSQL4HBase.Tag.txt

Our cloud-based ML framework works with our existing SQL databases and analytic tools, as seen in Figure 2. Major existing data sources include streaming datasets from test cells for aircraft engines and Auxiliary Power Units (APUs), from asset (e.g., electronic and mechanical parts) repair shops, and from aircraft fleets. We also have SQL Server instances, MySQL databases, and analytic tools (Matlab, a proprietary toolbox, etc.). The newly developed HBase tables are populated by ETL, and the selected datasets from the existing RDBMSs and HBase provide the data column families for the Mahout ML tools to analyze.
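As one hedged example of such an ETL step, the sketch below pulls rows from a SQL Server table over plain JDBC and writes the tab-separated file that the importtsv command in Listing 1 expects; the JDBC URL, credentials, and column names are hypothetical placeholders, not the actual PTMD schema.

import java.io.PrintWriter;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SqlToTsv {
  public static void main(String[] args) throws Exception {
    // Hypothetical connection details; the real ones belong to the PTMD server.
    Connection con = DriverManager.getConnection(
        "jdbc:sqlserver://ptmd-host:1433;databaseName=PTMD", "user", "password");
    Statement stmt = con.createStatement();
    ResultSet rs = stmt.executeQuery(
        "SELECT test_id, model, serialnumber, device FROM device_test");
    PrintWriter out = new PrintWriter("DeviceSQL4HBase.Tag.txt");
    while (rs.next()) {
      // One tab-separated line per row; the first field becomes the HBase row key.
      out.println(rs.getString("test_id") + "\t" + rs.getString("model")
          + "\t" + rs.getString("serialnumber") + "\t" + rs.getString("device"));
    }
    out.close();
    stmt.close();
    con.close();
  }
}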

3 Cloud-Based ML Framework Built Using the Apache Ecosystem

Our commodity computers were virtualized using the Xen hypervisor (www.xen.org/) as the virtualization platform. The OpenSUSE Linux operating system was installed on these virtualized computers. Our Apache Hadoop [2] software framework was then installed on these virtual machines, as seen in Figure 3, to support data-intensive distributed applications in the aerospace industry.

Our first Hadoop cluster consists of three nodes, one of which was designated as the NameNode and JobTracker. The other two machines act as both DataNodes and TaskTrackers. A distributed filesystem was configured and formatted across the three nodes.

MapReduce is the core Hadoop technology for easily writing applications that process vast amounts of data in parallel on Hadoop clusters. Our Hadoop cluster consists of a single master JobTracker and one slave TaskTracker per cluster node. The master node is responsible for scheduling a job's component tasks on the slave nodes, monitoring them, and re-executing failed tasks. The slave machines execute the tasks as directed by the master.

Figure 2. The Cloud-based ML Framework working with our existing data sources and analytic tools.

Figure 3. Screenshot of the Apache Hadoop Framework.

Figure 4 (a). A Screenshot of MapReduce running on our Hadoop Cluster.

Figure 4 (b). A Screenshot of our MapReduce working with HDFS.

Specifically, our MapReduce jobs run on the Hadoop cluster and can be submitted from the command line, as shown in the examples in Figures 4 (a) and (b). The MapReduce job searches for a string in an input set and writes its results to an output set. Each job can be split into parallel tasks working on independent chunks of data across the nodes.
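The search job itself is not listed in this paper, so the following is a minimal map-only sketch of how such a string search could be written against the Hadoop MapReduce API; the class name, the search.string configuration key, and the argument order are illustrative assumptions.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class StringSearch {
  // Each map task scans its own input split and emits the matching lines.
  public static class SearchMapper
      extends Mapper<LongWritable, Text, Text, NullWritable> {
    private String needle;

    @Override
    protected void setup(Context context) {
      needle = context.getConfiguration().get("search.string");
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      if (line.toString().contains(needle)) {
        context.write(line, NullWritable.get());
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("search.string", args[2]);   // the string to search for
    Job job = new Job(conf, "string-search");
    job.setJarByClass(StringSearch.class);
    job.setMapperClass(SearchMapper.class);
    job.setNumReduceTasks(0);             // map-only: matches go straight to HDFS
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(NullWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input set
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output set
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}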

In addition, our HBase supports massively parallelized processing via MapReduce, using HBase as both source and sink. The Hadoop Distributed File System (HDFS) is a distributed file system well suited to the storage of large files; it also handles failover and replicates blocks. Our HBase is built on top of HDFS and provides fast record lookups and updates for large tables (see Figure 1). Finally, the Eucalyptus cloud development tool [4] was used to build an Amazon Web Services-compatible cloud by leveraging our existing virtualized infrastructure.
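To illustrate the HBase-as-source pattern mentioned above, this hedged sketch scans the Device_test table from Listing 1 and counts rows per asset:model value using TableMapReduceUtil from the HBase 0.92 MapReduce package; the job name, class names, and output path are illustrative assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class ModelCount {
  // Emits (model, 1) for every row streamed out of the HBase table.
  static class ModelMapper extends TableMapper<Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(ImmutableBytesWritable key, Result row, Context context)
        throws java.io.IOException, InterruptedException {
      byte[] model = row.getValue(Bytes.toBytes("asset"), Bytes.toBytes("model"));
      if (model != null) {
        context.write(new Text(Bytes.toString(model)), ONE);
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "model-count");
    job.setJarByClass(ModelCount.class);
    Scan scan = new Scan();
    scan.addColumn(Bytes.toBytes("asset"), Bytes.toBytes("model"));
    // Wire the HBase table in as the job's input source.
    TableMapReduceUtil.initTableMapperJob("Device_test", scan,
        ModelMapper.class, Text.class, IntWritable.class, job);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileOutputFormat.setOutputPath(job, new Path("/tmp/model-count"));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Writing results back into HBase (the sink side of the pattern) works symmetrically through TableMapReduceUtil.initTableReducerJob.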

Several key Apache HBase tables have been developed. HBase provides the column families of data for the Mahout ML algorithms to analyze. Sample results from our HBase tables are shown in Listings 2 and 3, respectively.

Our cloud-based ML framework is integrated and used for our CBM analytics. The workflow of this framework in our CBM analytics applications is shown in Figure 5. The CBM workflow can be broadly described as a set of application and web servers. The application servers are intended to serve Matlab users, while the web service is used primarily to display summary conclusions made by the Matlab-based analytics. Both of these servers are implemented within the Hadoop cluster and use a metadata index file to retrieve the necessary data from our Hadoop cloud. The application servers also allow remote method invocations, so that analytic calculations can be done within Matlab, making full use of its mathematical and statistical libraries.

Listing 2. Partial results of scanning the RMSS_report HBase table.

37 column=event:removedEngine, timestamp=1340260441203, value=0
...
9 column=asset:aircraftSerial, timestamp=1340260441203, value=F900EX
9 column=asset:engineModel, timestamp=1340260441203, value=331-350
9 column=asset:engineSerial, timestamp=1340260441203, value=R0122B

Listing 3. Partial results of scanning the PTMD_report HBase table.

98 column=asset:airline, timestamp=1340261260821, value=Singapore Airlines
98 column=asset:airlinecode, timestamp=1340261260821, value=SNG
98 column=asset:mission, timestamp=1340261260821, value=Flight
98 column=asset:model, timestamp=1340261260821, value=331-350
98 column=asset:serial, timestamp=1340261260821,

    Figure 5. Workflow of the Cloud-based ML Framework for CBM Analytics.

4 Mahout Machine Learning Algorithms

The Mahout library has been integrated into our cloud-based ML framework shown in Figure 1. We have integrated and tested the Mahout ML algorithms; see Figures 6 (a) and (b). The most important applications of the Mahout ML algorithms include classification, clustering, and recommendation, along with pattern mining, regression, dimension reduction, and evolutionary algorithms. Having been used to analyze over 100 million data examples successfully [5], Mahout has become an increasingly important ML tool to use with Hadoop and MapReduce. We therefore decided to integrate Mahout ML algorithms into our cloud-based ML framework for real-world applications such as anomaly detection and fault mode prediction.

The Mahout machine learning algorithms are written using the MapReduce paradigm:

(a) supervised learning algorithms, including neural networks, support vector machines (SVMs), Naive Bayes classifiers, decision trees, random forests, and logistic regression;

(b) unsupervised learning algorithms, including k-means, hierarchical clustering, self-organizing maps, fuzzy k-means, Dirichlet clustering, PCA, ICA, expectation-maximization, and mean-shift.

One code sample of Naive Bayes in Mahout is shown in Listing 4.

We found that the Mahout ML algorithms can be used for better classification when working with ZooKeeper. More specifically, when serializing stochastic gradient descent (SGD) models for deployment as classifiers, it is usually best to serialize only the best-performing sub-model from an Adaptive Logistic Regression. The resulting serialized file will be about 100 times smaller [5] than one produced by serializing the entire ensemble of models contained in the Adaptive Logistic Regression object (a sketch of this pattern follows below).

The Mahout ML algorithms integrated into our cloud-based ML framework include Support Vector Machines, Naive Bayes classifiers, k-means, hierarchical clustering, self-organizing maps, fuzzy k-means, Dirichlet clustering, and mean-shift. All of these algorithms are written in the MapReduce paradigm.
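A minimal sketch of that serialization pattern, assuming the Mahout SGD API (AdaptiveLogisticRegression, CrossFoldLearner, ModelSerializer) and hypothetical feature sizes, labels, and paths; the real training loop is elided:

import org.apache.mahout.classifier.sgd.AdaptiveLogisticRegression;
import org.apache.mahout.classifier.sgd.L1;
import org.apache.mahout.classifier.sgd.ModelSerializer;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

public class SerializeBestModel {
  public static void main(String[] args) throws Exception {
    int numCategories = 2;    // hypothetical: healthy vs. faulty
    int numFeatures = 1000;   // hypothetical feature-vector size
    AdaptiveLogisticRegression learner =
        new AdaptiveLogisticRegression(numCategories, numFeatures, new L1());
    // Training loop elided: call learner.train(label, features) per example.
    Vector features = new RandomAccessSparseVector(numFeatures);
    learner.train(0, features);
    learner.close();
    // Serialize only the best-performing sub-model, not the whole ensemble.
    ModelSerializer.writeBinary("/tmp/best-model.bin",
        learner.getBest().getPayload().getLearner().getModels().get(0));
  }
}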

Listing 4. A code snippet of the Naive Bayes classifier in Mahout.

// Imports (org.apache.hadoop.fs.Path and the legacy
// org.apache.mahout.classifier.bayes classes) are omitted for brevity.
public class Starter {
  public static void main(final String[] args) {
    final BayesParameters params = new BayesParameters();
    params.set("classifierType", "bayes");
    try {
      // Train a Naive Bayes model from the input directory.
      Path input = new Path("/tmp/input");
      TrainClassifier.trainNaiveBayes(input, "/tmp/output", params);
      // Load the trained model and build the classifier context.
      Algorithm algorithm = new BayesAlgorithm();
      Datastore datastore = new InMemoryBayesDatastore(params);
      ClassifierContext classifier =
          new ClassifierContext(algorithm, datastore);
      classifier.initialize();
    } catch (Exception e) {
      e.printStackTrace();
    }
  }
}

Experiments: Specifically, we tested the Mahout algorithms against our data sources, such as the PTMD datasets, test cells, and others. One of our experiments was to test Mahout k-means on the PTMD data; details are summarized below. One of the models we want to build is a trend model that describes how the EGT_MARGIN changes as a function of the APU age (e.g., flight.AHRS, etc.), using the PTMD data. We believe that the flight.TAT is a random perturbation to this model.

Our initial steps include:

(1) Mahout vectors are converted into a Hadoop sequence file, using the seqdirectory tool to convert text files into a Hadoop sequence file for Mahout to run on;

(2) feature vectors are created with three dimensions; and

(3) RandomAccessSparseVector is used for vectorizing... (see the sketch below).
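The vectorization in steps (2) and (3) could look like the following sketch, which builds three-dimensional RandomAccessSparseVector instances and writes them to a Hadoop sequence file as input for Mahout's k-means driver; the sample values, row keys, and paths here are hypothetical placeholders, not PTMD data.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class PtmdVectorizer {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path out = new Path("/tmp/ptmd-vectors/part-00000");
    SequenceFile.Writer writer =
        new SequenceFile.Writer(fs, conf, out, Text.class, VectorWritable.class);
    // Hypothetical samples over the three dimensions named in the text:
    // EGT_MARGIN, flight.AHRS (a proxy for APU age), and flight.TAT.
    double[][] samples = { {12.4, 3050.0, 18.2}, {11.9, 3122.0, 17.5} };
    for (int i = 0; i < samples.length; i++) {
      Vector v = new RandomAccessSparseVector(3);
      v.assign(samples[i]);
      writer.append(new Text("apu-" + i), new VectorWritable(v));
    }
    writer.close();
    // The resulting sequence file becomes the --input of Mahout's kmeans job.
  }
}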
