big data mining technologies final

Big Data Mining Technologies
By Kushagra Trivedi

Contents

• Introduction to big data and big data mining

• Apache Hadoop for big data mining

• Apache S4 for big data mining

• Apache Mahout for machine learning

• Some other tools of machine learning and data mining

• Comparison of big data mining technologies

• Conclusion

• References

Introduction To Big Data And Big Data Mining

• Data of large volume and greater complexity

• Definition of big data

• Sources of data expansion

• Definition of data mining

• Why data mining is necessary

• Some of the technologies used for data mining

Apache Hadoop

• Data-intensive distributed architecture

• Centralized server vs. distributed server

• MapReduce and Hadoop Distributed File System (HDFS)

• HDFS divides data into blocks and distributes them among DataNodes

• Supports writing applications that rapidly process large amounts of data in parallel on large clusters of compute nodes

• Applications: Yahoo, Facebook and other Fortune 50 companies use Apache Hadoop

Apache Hadoop (continued)

Hadoop Distributed File System

[Figure: HDFS layout with one NameNode and four DataNodes (DataNode 1 to DataNode 4); data blocks b1, b2 and b3 are each replicated across several DataNodes]

Apache Hadoop (continued)

• The NameNode maintains all metadata about the DataNodes

• DataNodes contain the actual data blocks

• HDFS distributes and replicates data blocks among DataNodes

• A client query first goes to the NameNode, which locates the actual data by looking up the metadata
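The split between metadata and block storage can be illustrated with a small in-memory sketch. This is not the HDFS API: the `NameNode` class, the `write_file` and `read_file` helpers, the round-robin placement rule and the replication factor of 2 are all assumptions made for illustration.

```python
class NameNode:
    """Holds metadata only: which DataNodes store each block of each file."""

    def __init__(self, datanode_names, replication=2):
        self.datanode_names = list(datanode_names)
        self.replication = replication
        self.block_map = {}  # filename -> list of (block_id, [holder names])

    def allocate(self, filename, num_blocks):
        # Hypothetical placement rule: replica r of block i goes to node (i + r) mod n.
        n = len(self.datanode_names)
        entries = []
        for i in range(num_blocks):
            holders = [self.datanode_names[(i + r) % n] for r in range(self.replication)]
            entries.append((f"{filename}#b{i + 1}", holders))
        self.block_map[filename] = entries
        return entries

    def locate(self, filename):
        return self.block_map[filename]


def write_file(namenode, datanodes, filename, data, block_size):
    """Split data into blocks and store each block on the DataNodes chosen by the NameNode."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    placement = namenode.allocate(filename, len(blocks))
    for (block_id, holders), block in zip(placement, blocks):
        for name in holders:
            datanodes[name][block_id] = block


def read_file(namenode, datanodes, filename, failed=()):
    """Ask the NameNode where the blocks live, then fetch each from a live replica."""
    parts = []
    for block_id, holders in namenode.locate(filename):
        live = next(name for name in holders if name not in failed)
        parts.append(datanodes[live][block_id])
    return "".join(parts)


# Four DataNodes, each modelled as a dict of block_id -> block contents.
datanodes = {f"DataNode {i}": {} for i in range(1, 5)}
namenode = NameNode(datanodes, replication=2)
write_file(namenode, datanodes, "log.txt", "abcdefghij", block_size=4)
```

Because every block has two holders in this sketch, `read_file(namenode, datanodes, "log.txt", failed=("DataNode 2",))` can still reassemble the file from the remaining replicas.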

Apache Hadoop (continued)

MapReduce Algorithm

Figure 2 MapReduce distribution [2]

[Figure: word-count example. Four MAP tasks each split one input sentence about Austin Powers into words and emit a count of 1 per word; GROUP steps collect the counts for identical words; REDUCE steps sum them into per-word totals]

Apache Hadoop (continued)

• Uses two functions: map and reduce

• Data is fed into the map function to produce intermediate key-value pairs

• The intermediate result is then given to the reduce function to produce the final result

• TaskTracker: performs the work assigned by the JobTracker

• JobTracker: if a TaskTracker fails, reassigns its tasks to another TaskTracker
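The map, group and reduce stages can be sketched on a single machine as a word count. This is a minimal illustration, not Hadoop code: the function names are hypothetical and the input sentences are samples modelled on the figure.

```python
from collections import defaultdict


def map_words(line):
    """Map: emit an intermediate (word, 1) pair for every word in the line."""
    return [(word.lower(), 1) for word in line.split()]


def group_by_key(pairs):
    """Group: collect all values that share the same intermediate key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(word, counts):
    """Reduce: sum the values for one key to produce the final result."""
    return word, sum(counts)


def word_count(lines):
    intermediate = [pair for line in lines for pair in map_words(line)]
    grouped = group_by_key(intermediate)
    return dict(reduce_counts(word, counts) for word, counts in grouped.items())


lines = [
    "Austin Powers defeated the league of evil villains",
    "The league of evil villains Austin Powers defeated",
]
counts = word_count(lines)
```

In a real cluster each call to the map function would run on a different node and the grouping step would move data across the network; here all three stages run in one process so the data flow is easy to follow.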

Apache S4

• S4 stands for Simple Scalable Streaming System

• Draws on the MapReduce and Actor models for computation

• Data processing is done through processing elements

• The S4 framework routes events and creates processing elements as needed

• Applications: Yahoo, LinkedIn, A9 and Quantbench are among the companies that use Apache S4 for big data mining

Apache S4 (continued)

Figure 3 S4 word count sample [6]

Apache S4 (continued)

• Processing elements are the basic computational units

• A processing element only handles events that carry the key it was created for

• A keyless processing element is a special case: it is created to accept any type of input

• Processing nodes are the logical hosts of processing elements

• S4 routes events to processing nodes based on a hash of the keyed attributes in those events
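Keyed processing elements and hash-based routing can be sketched as follows. This is not the S4 API: the `WordCountPE` and `ProcessingNode` classes, the `route` function, the event layout and the use of CRC32 as the hash are assumptions chosen for illustration.

```python
import zlib


class WordCountPE:
    """Keyed processing element: handles only events carrying the key it was created for."""

    def __init__(self, key):
        self.key = key
        self.count = 0

    def process(self, event):
        if event["word"] != self.key:
            raise ValueError("event routed to the wrong processing element")
        self.count += 1


class ProcessingNode:
    """Logical host that creates processing elements on demand, one per key."""

    def __init__(self):
        self.elements = {}

    def dispatch(self, event):
        key = event["word"]
        if key not in self.elements:
            self.elements[key] = WordCountPE(key)
        self.elements[key].process(event)


def route(event, nodes):
    """Pick a node from a stable hash of the keyed attribute (CRC32 is an assumption)."""
    index = zlib.crc32(event["word"].encode("utf-8")) % len(nodes)
    return nodes[index]


nodes = [ProcessingNode() for _ in range(3)]
for word in "the league of evil villains the league".split():
    event = {"word": word}
    route(event, nodes).dispatch(event)

# Collect the per-key counts held by the processing elements on all nodes.
totals = {key: pe.count for node in nodes for key, pe in node.elements.items()}
```

Because the hash of a key is stable, every event for the same word reaches the same node and therefore the same processing element, so each count lives in exactly one place.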

Apache Mahout

• Open-source project of the Apache Software Foundation that allows programmers to write machine learning algorithms

• Covers three families of algorithms: clustering, classification and collaborative filtering

• Includes several distributed clustering algorithms such as k-Means, Fuzzy k-Means, Dirichlet, Mean-Shift and Canopy

• Applications: recommending products you might want to buy, people you might want to connect with, potential life partners and songs you might like

Apache Mahout (continued)

1) Building a recommendation engine

• Currently provides the “Taste” library for building a recommendation engine

• The library supports user-based and item-based recommendations

• Five primary components: DataModel, UserSimilarity, ItemSimilarity, Recommender, UserNeighborhood

• Users can develop applications that give online and offline recommendations using these components
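How a user-based recommender fits together can be sketched in plain Python. This is not the Taste API: the function names only echo the component names on the slide, and the sample preferences, the cosine similarity measure and the weighted-average scoring are assumptions made for illustration.

```python
from math import sqrt

# DataModel stand-in: user -> {item: preference}. Sample values are made up.
data_model = {
    "alice": {"A": 5.0, "B": 3.0, "C": 4.0},
    "bob": {"A": 5.0, "B": 3.0, "C": 4.0, "D": 5.0},
    "carol": {"A": 1.0, "B": 5.0, "E": 4.0},
}


def user_similarity(model, u, v):
    """UserSimilarity stand-in: cosine similarity over items both users rated."""
    shared = set(model[u]) & set(model[v])
    if not shared:
        return 0.0
    dot = sum(model[u][i] * model[v][i] for i in shared)
    norm_u = sqrt(sum(model[u][i] ** 2 for i in shared))
    norm_v = sqrt(sum(model[v][i] ** 2 for i in shared))
    return dot / (norm_u * norm_v)


def user_neighborhood(model, user, n):
    """UserNeighborhood stand-in: the n users most similar to the given user."""
    others = [v for v in model if v != user]
    return sorted(others, key=lambda v: user_similarity(model, user, v), reverse=True)[:n]


def recommend(model, user, n_neighbors=2, how_many=1):
    """Recommender stand-in: rank unseen items by similarity-weighted average preference."""
    scores, weights = {}, {}
    for neighbor in user_neighborhood(model, user, n_neighbors):
        sim = user_similarity(model, user, neighbor)
        if sim <= 0:
            continue  # neighbors with no overlap contribute nothing
        for item, pref in model[neighbor].items():
            if item in model[user]:
                continue  # only recommend items the user has not rated yet
            scores[item] = scores.get(item, 0.0) + sim * pref
            weights[item] = weights.get(item, 0.0) + sim
    ranked = sorted(scores, key=lambda i: scores[i] / weights[i], reverse=True)
    return ranked[:how_many]


top_for_alice = recommend(data_model, "alice")
```

In this sample, bob rates the shared items exactly as alice does, so he is her nearest neighbor and his highly rated item "D" is ranked first.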

Apache Mahout (continued)

2) Clustering with Apache Mahout

• Clustering algorithms are written using MapReduce

• Canopy, k-Means, Mean-Shift and Dirichlet are the available clustering algorithms

• Select the data and convert it into a numerical representation

• Select any one of the algorithms above

• Evaluate the result
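The three steps above (numeric data, choice of algorithm, evaluation) can be sketched with a small single-machine k-means. This is not Mahout's MapReduce implementation: the function names, the sample points, the choice of the first k points as starting centroids and the sum of squared errors as the evaluation measure are assumptions.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, k, iterations=10):
    """Plain k-means; starts from the first k points so the run is deterministic."""
    centroids = [tuple(p) for p in points[:k]]
    assignment = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return centroids, assignment


def within_cluster_sse(points, centroids, assignment):
    """Evaluation: total squared distance from each point to its centroid (lower is tighter)."""
    return sum(squared_distance(p, centroids[a]) for p, a in zip(points, assignment))


# Step 1: data already converted to numeric vectors (made-up sample).
points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (8.0, 9.0), (2.0, 1.0), (9.0, 8.0)]
# Step 2: run the chosen algorithm.
centroids, assignment = k_means(points, k=2)
# Step 3: evaluate the result.
sse = within_cluster_sse(points, centroids, assignment)
```

Running the same evaluation for several values of k, or for a different algorithm, gives a simple basis for comparing results.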

Apache Mahout (continued)

3) Categorizing content with Mahout

• Two approaches to categorizing: the Naïve Bayes classifier and the complementary Naïve Bayes classifier

• The first part of the Naïve Bayes process keeps track of the words associated with a particular document and category

• The second part uses that information to predict the category of new content

• Complementary Naïve Bayes is similar to the Naïve Bayes approach and keeps its simplicity
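The two parts of the Naïve Bayes process can be sketched as a training function that counts words per category and a classification function that uses those counts. This is not Mahout's implementation: the function names, the tiny training set and the add-one smoothing are assumptions for illustration.

```python
from collections import Counter, defaultdict
from math import log


def train(labeled_docs):
    """Part one: track how often each word appears in each category."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for text, category in labeled_docs:
        doc_counts[category] += 1
        word_counts[category].update(text.lower().split())
    return word_counts, doc_counts


def classify(text, word_counts, doc_counts):
    """Part two: pick the category with the highest log-probability for the text."""
    vocabulary = {word for counts in word_counts.values() for word in counts}
    total_docs = sum(doc_counts.values())
    best_category, best_score = None, float("-inf")
    for category in doc_counts:
        total_words = sum(word_counts[category].values())
        score = log(doc_counts[category] / total_docs)  # category prior
        for word in text.lower().split():
            # Add-one smoothing so unseen words do not zero out the probability.
            seen = word_counts[category][word]
            score += log((seen + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category


# Made-up training documents with their categories.
training = [
    ("cheap pills buy now", "spam"),
    ("buy cheap watches", "spam"),
    ("meeting agenda for monday", "ham"),
    ("project meeting notes", "ham"),
]
word_counts, doc_counts = train(training)
label = classify("cheap watches now", word_counts, doc_counts)
```

Working in log space avoids numeric underflow when many small word probabilities are multiplied together.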

Some Other Tools of Machine Learning and Data Mining

• Big data R is used for high-performance statistical computing on big data

• Massive Online Analysis (MOA) is a machine learning framework used for data stream mining

• MOA supports classification, regression, clustering, frequent itemset mining and frequent graph mining

• Vowpal Wabbit is able to handle terabytes of data

• Through parallel learning, Vowpal Wabbit can exceed the throughput of a single machine's network interface

• PEGASUS is a big-graph mining tool that finds patterns and anomalies in massive graphs

• GraphLab is a high-level parallel data mining system built without using MapReduce

Comparison

• Apache Hadoop is used for batch processing

• Data is divided into large blocks, which makes it easier to handle

• Segmentation adds extra overhead

• Apache S4 is used for streaming data

• No segmentation of data is needed

• Nodes cannot be added to or removed from a running cluster

• Apache Mahout is used for writing machine learning algorithms

• It has an active community, and documentation and examples are provided

Conclusion

• Big data is a crucial concern because data volumes will keep growing

• Different techniques are needed for mining this big data

• Apache Mahout gives recommendations to users based on their past behavior

• Hadoop is used for data mining with MapReduce and HDFS

• Apache S4 is used for mining streams of data

• Each technique has its own significance for different types of companies

References

[1] Ramesh Natarajan. Apache Hadoop Fundamentals – HDFS and MapReduce Explained with a Diagram. January 4, 2012.

[2] Guruzon.com. Pros and Cons of Hadoop. June 1, 2013.

[3] HDFS: Facebook has the world's largest Hadoop cluster!

[4] S4 distributed stream computing platform: overview.

[5] Aleksandar Bradic (Sr. Director, Engineering and R&D). S4 distributed stream computing platform.

[6] William Zhou. Streaming Big Data. William Zhou's Blog, September 24, 2012.

[7] Grant Ingersoll. Introducing Apache Mahout: Scalable, commercial-friendly machine learning for building intelligent applications. September 8, 2009.

[8] Grant Ingersoll. Introduction to scalable machine learning with Apache Mahout. September 15, 2010.

[9] A. Bifet, G. Holmes, R. Kirkby, and B. Pfahringer. MOA: Massive Online Analysis.

[10] J. Langford. Vowpal Wabbit. 2011.

[11] U. Kang, D. H. Chau, and C. Faloutsos. PEGASUS: Mining Billion-Scale Graphs in the Cloud. 2012.

[12] R. Smolan and J. Erwitt. The Human Face of Big Data. Sterling Publishing Company Incorporated, 2012.

Any Queries?
