Detailed Presentation on Big Data Hadoop + Hadoop Project: Near Duplicate Detection (Summer Training)

Upload: ashok-royal

Post on 30-Jun-2015


DESCRIPTION

Big Data Hadoop, its components, and a Hadoop project are described in detail. Visit http://hadoop-beginners.blogspot.com for Hadoop tutorials. Thanks for the visit. :)

TRANSCRIPT

Page 1: Detailed Presentation on Big Data Hadoop + Hadoop Project: Near Duplicate Detection (Summer Training)

POORNIMA INSTITUTE OF ENGINEERING & TECHNOLOGY, JAIPUR

(DEPARTMENT OF COMPUTER ENGINEERING)

Big Data Hadoop

Presented By: Ashok Royal

Guided By: Dr. E.S. Pilli

Page 2:

©Ashok Royal

Topics

1. Organization Profile

2. Schedule

3. Training Description

4. Project Description

5. Learning

6. Snapshots

7. Future Scope

8. Conclusion

9. References

9 September 2014

Page 3:

Organization Profile

Name - Malviya National Institute of Technology (MNIT), Jaipur.

MNIT, Jaipur is one of the 30 National Institutes of Technology in India.

MNIT was established in 1963, inspired by Pt. Madan Mohan Malviya.

The institute's director is I. K. Bhat and the chairman of the Board of Governors is Dr. K. K. Aggarwal.

Page 4:

Organization Contacts

Address: Jawaharlal Nehru Marg, Jhalana, Malviya Nagar, Jaipur, Rajasthan 302017

Phone: 0141 271 3201

Email: [email protected]

Website: www.mnit.ac.in

Page 5:

Schedule

Our training at MNIT was broadly divided into three phases:

Case study of Hadoop and related papers (first 30 days).

Hadoop cluster setup (first 30 days):
Single node
Multi node

Implementation of Near Duplicate Detection using Hadoop MapReduce (last 15 days).

Page 6:

Training Coordinator

Name - Dr. E. S. Pilli

Assistant Professor at MNIT, Jaipur. He holds a Ph.D. (CSE) from IIT Roorkee. He is a very supportive person and is currently guiding many M.Tech and Ph.D. students in Cloud Computing, Big Data & Botnets.

Email: [email protected]

Page 7:

What is Big Data?

Lots of data (terabytes or petabytes).

Big Data is a term used for collections of data sets so large and complex that they become difficult to process using existing traditional data-processing applications.

Big Data refers to large data sets that are challenging to store, search, share, visualize and analyze.

Page 8:

Various forms of Data

Data comes mainly in three forms:

Structured data
Unstructured data
Semi-structured data

Page 9:

Why is Data so BIG?

20 terabytes of photos uploaded to Facebook each month.

330 terabytes of data produced by the Large Hadron Collider each week.

530 terabytes: all the videos on YouTube.

1 petabyte of data processed by Google's servers every 72 minutes.

Page 10:

Data growth


Page 11:

What is Hadoop?

It is an open-source software library written in Java.

The Hadoop software library is a framework that allows the distributed processing of large data sets (Big Data) across clusters of computers using simple programming models.

Page 12:

Modules of Hadoop

Hadoop Common

Hadoop Distributed File System (HDFS)

Hadoop MapReduce

Page 13:

Hadoop Common

It provides access to the file systems supported by Hadoop.

The Hadoop Common package contains the necessary JAR files and scripts needed to start Hadoop.

The package also provides source code, documentation, and a contribution section which includes projects from the Hadoop community.

Page 14:

HDFS

Hadoop uses HDFS, a distributed file system based on GFS (the Google File System), as its shared file system.

The HDFS architecture divides files into large chunks (blocks) distributed across data servers.

It has a NameNode and DataNodes.

Page 15:

Main components of HDFS

NameNode:
Master of the system.
Maintains and manages the blocks which are present on the DataNodes.

DataNodes:
Slaves which are deployed on each machine and provide the actual storage.
Responsible for serving read and write requests from the clients.

Page 16:

Main components of HDFS (cont.)

Secondary NameNode:
Used as a checkpoint.
Connects to the NameNode every hour*.
Keeps a backup of the NameNode metadata.
The saved metadata can be used to rebuild a failed NameNode.

Page 17:

MapReduce

The Hadoop MapReduce framework harnesses a cluster of machines and executes user-defined MapReduce jobs across the nodes in the cluster.

A MapReduce computation has two phases:
A map phase, and
A reduce phase.

Page 18:

HDFS and MapReduce Layers


Page 19:

Hadoop Server Roles

JobTrackerMapReduce job

submitted by client computer

Master node

TaskTracker

Slave node

Task instance

TaskTracker

Slave node

Task instance

TaskTracker

Slave node

Task instance

9 September 2014

Page 20:

Hadoop Architecture


Page 21:

Hadoop Streaming

It allows creating and running MapReduce jobs with any executable or script as the mapper and/or reducer.

HDFS is basically designed to process large files on commodity clusters at high speed.

Write-once, read-many approach: after huge data has been placed, we tend to use the data, not modify it.

The time to read the whole data set is more important.
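As a sketch of what "any executable or script" means in practice, the following Python pair mimics a streaming word-count job. This is my own illustration, not the project's code: a streaming mapper reads raw lines and writes tab-separated key/value records, a reducer receives those records sorted by key, and `run_pipeline` here is an in-process stand-in for Hadoop's shuffle-and-sort between the two stages.

```python
from itertools import groupby

def mapper(lines):
    """Map stage: emit one 'word<TAB>1' record per word, as a streaming
    mapper would write to stdout."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"

def reducer(records):
    """Reduce stage: records arrive sorted by key; sum the counts per word."""
    parsed = (r.split("\t") for r in records)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield word, sum(int(count) for _, count in group)

def run_pipeline(lines):
    """Stand-in for Hadoop's shuffle/sort step between map and reduce."""
    return dict(reducer(sorted(mapper(lines))))

print(run_pipeline(["to be or not to be"]))
# {'be': 2, 'not': 1, 'or': 1, 'to': 2}
```

In a real streaming job the mapper and reducer would be separate scripts reading stdin and writing stdout, and Hadoop itself would perform the sort.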

Page 22:

Hadoop Workflow

1. Load data into the cluster (HDFS writes)
2. Analyze the data (MapReduce)
3. Store results in the cluster (HDFS writes)
4. Read the results from the cluster (HDFS reads)
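The four steps above can be sketched with the standard HDFS and Streaming command-line tools. The paths, file names, and streaming jar location below are illustrative assumptions, not taken from the presentation:

```shell
# 1. Load data into the cluster (HDFS writes)
hadoop fs -mkdir -p /user/demo/input
hadoop fs -put local-docs/*.txt /user/demo/input

# 2. Analyze the data (MapReduce); 3. results are stored back in HDFS
hadoop jar "$HADOOP_HOME"/share/hadoop/tools/lib/hadoop-streaming-*.jar \
  -files mapper.py,reducer.py \
  -input /user/demo/input \
  -output /user/demo/output \
  -mapper mapper.py -reducer reducer.py

# 4. Read the results from the cluster (HDFS reads)
hadoop fs -cat /user/demo/output/part-*
```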

Page 23:

Word count Example


Page 24:

Prominent users of Hadoop

Amazon - 100 nodes

Facebook - two clusters of 8000 and 3000 nodes

Adobe - 80 nodes

eBay - 532 nodes

Yahoo - a cluster of about 4500 nodes

IIIT Hyderabad - 30 nodes

IBM, Microsoft and many more companies are also using Hadoop.

Page 25:

Near duplicates = pairs of objects with high similarity.

Similarity is measured quantitatively via a similarity function.

Given a collection of records, the similarity join problem is to find all pairs of records <x, y> such that sim(x, y) >= t.

Tokenize:

Each record is a set of tokens from a finite universe.

Suppose each record is a single text document:

x = "yes as soon as possible"
y = "as soon as possible please"

x = {A, B, C, D, E}
y = {B, C, D, E, F}
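A minimal sketch of this tokenization step (my own illustration, not the project's code): repeated words are kept distinct by tagging each with its occurrence number, which is why both five-word sentences above become five-token sets, matching the slide's {A, B, C, D, E} and {B, C, D, E, F}.

```python
def tokenize(doc):
    """Turn a document into a set of tokens. Repeated words stay distinct
    by appending an occurrence number, so 'as', 'as' becomes 'as#1', 'as#2'."""
    seen = {}
    tokens = set()
    for word in doc.split():
        seen[word] = seen.get(word, 0) + 1
        tokens.add(f"{word}#{seen[word]}")
    return tokens

x = tokenize("yes as soon as possible")
y = tokenize("as soon as possible please")
print(len(x), len(y), len(x & y), len(x | y))  # 5 5 4 6
```

With this representation, the intersection of the two example documents has 4 tokens and the union has 6, which is exactly what the Jaccard example on the next slides computes.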

Page 26:

Project Description

Project Name - Near Duplicate Detection

Comparative analysis of the millions of documents that exist on the network to find similar documents based on a predefined threshold value.

Near duplicate detection is widely used in web crawls and many other data mining tasks.

The near duplicates are not considered "exact duplicates", but are files with minute differences.

Page 27:

Application in Search Engine


Page 28:

Application in Search Engine (cont.)

The web documents with a similarity score greater than a predefined threshold are considered near duplicates.

These near-duplicate pages are not added to the search engine's repository.

This reduces the storage cost of search engines.

It improves the quality of the search index.

Page 29:

Similarity Function

For calculating the similarity between two documents we have used the Jaccard function.

Jaccard similarity function:

J(x, y) = |x ∩ y| / |x ∪ y| >= t

Example:
x = {A, B, C, D, E}, y = {B, C, D, E, F}
J(x, y) = 4/6 = 0.67
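The Jaccard computation and the threshold test can be sketched in a few lines of Python (an illustration; the function names are mine, not the project's):

```python
def jaccard(x, y):
    """Jaccard similarity: J(x, y) = |x intersection y| / |x union y|."""
    return len(x & y) / len(x | y) if (x or y) else 1.0

def near_duplicates(x, y, t):
    """Near-duplicate test: similarity at or above the threshold t."""
    return jaccard(x, y) >= t

x = {"A", "B", "C", "D", "E"}
y = {"B", "C", "D", "E", "F"}
print(round(jaccard(x, y), 2))      # 0.67, as in the slide's example
print(near_duplicates(x, y, 0.6))   # True
```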

Page 30:

Steps to Detect Near Duplicates

Page 31:

Snapshots - HDFS


Page 32:

Snapshot - MapReduce Processing

Page 33:

Conclusion

Training in Big Data helped us learn what the current trend in the IT industry is and how technology is becoming more fruitful for human development.

Big Data is the future. A lot of research is currently going on in this field. As data is increasing at an ever faster rate, there is a huge need for tools and technologies which can handle it.

Page 34:

Conclusion (cont.)

Hadoop is an emerging framework used by most big firms like Facebook, Microsoft, IBM, Yahoo, Amazon and lots of others.

Our experience at MNIT was absolutely awesome, as it gave us the platform and support for our tasks and case study.

Page 35:

Big Data and Big Data solutions are among the burning issues in the present IT industry, so working on them will surely make us more useful in that area.

Page 36:

The proposed method addresses the difficulties of information retrieval from the web.

The approach detects near-duplicate web pages efficiently based on the keywords extracted from the web pages.

It reduces the memory space required for web repositories.

Near duplicate detection increases search engine quality.

Page 37:

References

J. G. Conrad, X. S. Guo, and C. P. Schriber. Online duplicate document detection: signature reliability in a dynamic retrieval environment. In CIKM, 2003.

Qinsheng Du, Wei Liu, Guolin Li and Yonglin Tang. Near Duplicate Detection Using Map-Reduce. In 2012 2nd International Conference on Computer Science and Network Technology.

Page 38:

References (cont.)

Chuan Xiao, Wei Wang, Xuemin Lin, Jeffrey Xu Yu, Guoren Wang. Efficient Similarity Joins for Near-Duplicate Detection.

Apache Hadoop. http://hadoop.apache.org.

hadoop-beginners.blogspot.com

Page 39:

NameNode and DataNode

The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes.

The DataNodes are responsible for serving read and write requests from the file system's clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the NameNode.

Page 40:

MapReduce Paradigm

Map phase: Once divided, the data sets are assigned to task trackers to perform the map phase. A functional operation is performed over the data, emitting the mapped key and value pairs as the output of the map phase (i.e., data processing).

Reduce phase: The master node then collects the answers to all the subproblems and combines them in some way to form the output: the answer to the problem it was originally trying to solve (i.e., data collection and digesting).

Page 41:

Any Query?

Page 42:

Thank you
