RM World 2014: A User Interface for Big Data with RapidMiner
![Page 1: RM World 2014: A user interface for big data with RapidMiner](https://reader033.vdocuments.net/reader033/viewer/2022061218/54811a0f5906b5d76c8b463e/html5/thumbnails/1.jpg)
A User Interface For Big Data With RapidMiner
Marcelo Beckmann
Nelson F. F. Ebecken
Beatriz S. L. Pires de Lima
Myrian Christina de Aragão Costa
![Page 2: RM World 2014: A user interface for big data with RapidMiner](https://reader033.vdocuments.net/reader033/viewer/2022061218/54811a0f5906b5d76c8b463e/html5/thumbnails/2.jpg)
Agenda
Introduction
Previous Work
Motivations
Architecture
Operators
Mouse Runner
Experiments
Conclusion
![Page 3: RM World 2014: A user interface for big data with RapidMiner](https://reader033.vdocuments.net/reader033/viewer/2022061218/54811a0f5906b5d76c8b463e/html5/thumbnails/3.jpg)
Introduction
Since 2012, 2.5 exabytes of data have been created every day, and this volume is still growing;
How can useful information be extracted from this daily mountain of data?
The MapReduce paradigm and its related frameworks answered the question;
Google, Yahoo, Netflix, Amazon, YouTube, Facebook, and Apple offer good examples of successful big data projects;
The Hadoop environment is the result of the great effort made by open source initiatives since 2004;
![Page 4: RM World 2014: A user interface for big data with RapidMiner](https://reader033.vdocuments.net/reader033/viewer/2022061218/54811a0f5906b5d76c8b463e/html5/thumbnails/4.jpg)
Introduction
Despite the great progress made in the backend engines, there is a lack of user interfaces;
Nowadays, most of the work to configure and run Hadoop components is done through scripting;
This work aims to help improve this scenario.
![Page 5: RM World 2014: A user interface for big data with RapidMiner](https://reader033.vdocuments.net/reader033/viewer/2022061218/54811a0f5906b5d76c8b463e/html5/thumbnails/5.jpg)
Previous Work
Since the advent of MapReduce, research and development have focused mostly on backend engines;
In recent years, several initiatives have started to make the Hadoop environment more user friendly;
Companies like Cloudera, Pentaho, Talend, and Hortonworks have made huge contributions to tool usability in the Hadoop environment, especially in execution control, ETL, and databases;
Radoop (*) made significant contributions by integrating Mahout with RapidMiner in a proprietary solution.
* Radoop was acquired by RapidMiner in July 2014
![Page 6: RM World 2014: A user interface for big data with RapidMiner](https://reader033.vdocuments.net/reader033/viewer/2022061218/54811a0f5906b5d76c8b463e/html5/thumbnails/6.jpg)
Motivations
The Hadoop environment, and especially the Mahout engine, still lacks an open source UI integration;
In terms of Java coding, starting jobs, making remote API calls, and retrieving results from the Hadoop environment are too complex. An encapsulation is needed to simplify these activities;
There are integration and connectivity problems in heterogeneous environments and complex network infrastructures.
![Page 7: RM World 2014: A user interface for big data with RapidMiner](https://reader033.vdocuments.net/reader033/viewer/2022061218/54811a0f5906b5d76c8b463e/html5/thumbnails/7.jpg)
Architecture
Our research
![Page 8: RM World 2014: A user interface for big data with RapidMiner](https://reader033.vdocuments.net/reader033/viewer/2022061218/54811a0f5906b5d76c8b463e/html5/thumbnails/8.jpg)
Architecture
Big Data Extension
RapidMiner is easy to extend;
A RapidMiner extension with 14 operators was created;
Big data operators can be mixed with existing RapidMiner operators in order to run jobs and analyze results;
Integrated with Hadoop, HDFS, Hive, and Mahout;
Open source.
Mouse Runner
Provides an extra layer for remote calls and activation;
Reduces the coupling between the presentation tier and business services;
Starts jobs and retrieves results from the Hadoop-related components.
![Page 9: RM World 2014: A user interface for big data with RapidMiner](https://reader033.vdocuments.net/reader033/viewer/2022061218/54811a0f5906b5d76c8b463e/html5/thumbnails/9.jpg)
Operators
![Page 10: RM World 2014: A user interface for big data with RapidMiner](https://reader033.vdocuments.net/reader033/viewer/2022061218/54811a0f5906b5d76c8b463e/html5/thumbnails/10.jpg)
Operators
Masters node – Contains all the configuration necessary to connect the operators to a Hadoop environment;
IO Operators – Execute operations in HDFS and the Hive database;
Read Hive Database – Executes queries in the Hive database and returns an ExampleSet with samples that points to a file in HDFS. Other Big Data operators refer to this file, not to the samples;
Clustering – Clustering algorithms from Mahout;
Transformation – Performs transformations in the Hive database;
Utility – Runs scripts through an SSH connection and kills jobs.
![Page 11: RM World 2014: A user interface for big data with RapidMiner](https://reader033.vdocuments.net/reader033/viewer/2022061218/54811a0f5906b5d76c8b463e/html5/thumbnails/11.jpg)
Mouse Runner
Mouse Runner simplifies calls to Hadoop components
KMeansRunner runner = new KMeansRunner();
runner.setHost("192.168.13.131");
runner.setHdfsPort("9000");
runner.setMapredPort("9001");
runner.setInputPath("/user/hadoop-users/testdata");
runner.setOutputPath("/user/hadoop-user/output");
runner.setK(5);
runner.setMaxRuns(10);
ClusterResult result = runner.run();
![Page 12: RM World 2014: A user interface for big data with RapidMiner](https://reader033.vdocuments.net/reader033/viewer/2022061218/54811a0f5906b5d76c8b463e/html5/thumbnails/12.jpg)
Mouse Runner
Ports to open:
9000, 9001, 50070, 50075, 50090, 50105, 50030, 50060, 8020, 50010, 50020, 50100, 10000, ...
Integration among heterogeneous OS and networks
![Page 13: RM World 2014: A user interface for big data with RapidMiner](https://reader033.vdocuments.net/reader033/viewer/2022061218/54811a0f5906b5d76c8b463e/html5/thumbnails/13.jpg)
Mouse Runner
Ports to open:
9999, 10000
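Before a job is started, it can help to confirm that the reduced set of ports is actually reachable from the client machine. The following is a minimal sketch using only the standard `java.net` API; the class name `PortCheck`, the helper `isReachable`, and the 2-second timeout are illustrative assumptions and are not part of Mouse Runner. The default host is the address used in the earlier `KMeansRunner` example.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper, not part of the Big Data extension or Mouse Runner.
final class PortCheck {
    private PortCheck() {}

    /** Returns true if a TCP connection to host:port succeeds within the timeout. */
    static boolean isReachable(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Refused, timed out, or unresolved host: treat all as "not reachable".
            return false;
        }
    }

    public static void main(String[] args) {
        // Host taken from the slide's example; pass another host as the first argument.
        String host = args.length > 0 ? args[0] : "192.168.13.131";
        int[] ports = {9999, 10000}; // the two ports listed on the slide
        for (int port : ports) {
            String state = isReachable(host, port, 2000) ? "open" : "unreachable";
            System.out.println(host + ":" + port + " is " + state);
        }
    }
}
```

A check like this only shows that a TCP connection can be established; it says nothing about which service is listening on the port.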
![Page 14: RM World 2014: A user interface for big data with RapidMiner](https://reader033.vdocuments.net/reader033/viewer/2022061218/54811a0f5906b5d76c8b463e/html5/thumbnails/14.jpg)
Experiments
• K-means clustering comparison between RapidMiner and Mahout using the Davies–Bouldin index;
• Davies–Bouldin index (DBI): an internal evaluation method that measures the quality of clusters; the lower the DBI, the better the cluster quality;
• The aim is to validate the integration with Mahout, using RapidMiner K-Means as the baseline;
• Datasets: Synthetic Control, Covertype, and Household from the UCI Machine Learning Repository;
• The results obtained in terms of the Davies–Bouldin index were very similar;
• RapidMiner responded instantly on the smaller dataset;
• Mahout scaled better on the bigger datasets.
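To make the metric concrete, here is a minimal sketch of the textbook Davies–Bouldin index: for each cluster, take the worst ratio of summed within-cluster scatter to centroid distance against any other cluster, then average over clusters. It assumes Euclidean distance, labels in the range 0 to k-1, at least two clusters, and no empty clusters. The class `DaviesBouldin` is illustrative only; the RapidMiner and Mahout implementations compared in the slides may differ in detail.

```java
// Illustrative sketch of the standard Davies-Bouldin index; not the code used in the experiments.
final class DaviesBouldin {
    private DaviesBouldin() {}

    /** Euclidean distance between two vectors of equal length. */
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    /** Computes the index for points labelled 0..k-1 (assumes k >= 2, no empty cluster). */
    static double index(double[][] points, int[] labels, int k) {
        int dims = points[0].length;
        double[][] centroids = new double[k][dims];
        int[] counts = new int[k];

        // Centroid of each cluster = mean of its points.
        for (int i = 0; i < points.length; i++) {
            counts[labels[i]]++;
            for (int d = 0; d < dims; d++) {
                centroids[labels[i]][d] += points[i][d];
            }
        }
        for (int c = 0; c < k; c++) {
            for (int d = 0; d < dims; d++) {
                centroids[c][d] /= counts[c];
            }
        }

        // Scatter of each cluster = mean distance of its points to the centroid.
        double[] scatter = new double[k];
        for (int i = 0; i < points.length; i++) {
            scatter[labels[i]] += distance(points[i], centroids[labels[i]]);
        }
        for (int c = 0; c < k; c++) {
            scatter[c] /= counts[c];
        }

        // For each cluster, keep the worst (largest) similarity ratio to any other cluster.
        double total = 0.0;
        for (int i = 0; i < k; i++) {
            double worst = 0.0;
            for (int j = 0; j < k; j++) {
                if (i == j) continue;
                double ratio = (scatter[i] + scatter[j]) / distance(centroids[i], centroids[j]);
                worst = Math.max(worst, ratio);
            }
            total += worst;
        }
        return total / k;
    }

    public static void main(String[] args) {
        // Two tight, well-separated clusters should give a small index.
        double[][] points = {{0, 0}, {0, 2}, {10, 0}, {10, 2}};
        int[] labels = {0, 0, 1, 1};
        System.out.println(index(points, labels, 2));
    }
}
```

In the toy data, each cluster has a scatter of 1 and the centroids are 10 apart, so the ratio for both clusters is 2/10. Moving the clusters closer together or spreading their points out raises the index, which is why lower values indicate better clustering.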
![Page 15: RM World 2014: A user interface for big data with RapidMiner](https://reader033.vdocuments.net/reader033/viewer/2022061218/54811a0f5906b5d76c8b463e/html5/thumbnails/15.jpg)
Experiments
![Page 16: RM World 2014: A user interface for big data with RapidMiner](https://reader033.vdocuments.net/reader033/viewer/2022061218/54811a0f5906b5d76c8b463e/html5/thumbnails/16.jpg)
Experiments
![Page 17: RM World 2014: A user interface for big data with RapidMiner](https://reader033.vdocuments.net/reader033/viewer/2022061218/54811a0f5906b5d76c8b463e/html5/thumbnails/17.jpg)
Conclusion
• An open source extension for RapidMiner called “Big Data” was created;
• This extension integrates RapidMiner with Hadoop, HDFS, Hive, and Mahout;
• It initially includes 14 operators;
• A component called “Mouse Runner” was created, which provides remote activation facilities and a simplified API for activation and result retrieval for Hadoop-related components;
• A comparison between the K-Means operators from RapidMiner and Mahout showed similar results in terms of the Davies–Bouldin index;
• Mahout scaled better on the bigger datasets. RapidMiner responded instantly on the smaller dataset.