big data and hadoop

Big data and Hadoop: Learn how Hadoop deals with the problems associated with Big data analysis. Srikanth M V

Upload: sri-kanth

Post on 06-May-2015


Category:

Technology



DESCRIPTION

An introduction to Big data, the problems associated with storing and analyzing it, and how Hadoop solves those problems with its HDFS and MapReduce frameworks. Includes a short introduction to HDInsight, Hadoop on Windows Azure.

TRANSCRIPT

Page 1: Big data and hadoop

Big data and Hadoop

Learn how Hadoop deals with the problems associated with Big data analysis

Srikanth M V

Page 2: Big data and hadoop

There are 30 billion pieces of content shared on Facebook every day.

Page 3: Big data and hadoop

Wal-Mart handles more than 1 million customer transactions an hour.

Page 4: Big data and hadoop

More than 5 billion people are calling, texting, tweeting and browsing websites using smartphones.

Page 5: Big data and hadoop
Page 6: Big data and hadoop

The 3-Vs of Big Data

• Volume: Gigabytes, Terabytes, Petabytes or Zettabytes…

• Velocity: The rate at which data flows into an organization

• Variety: Structured and Unstructured

Page 7: Big data and hadoop

So What is Big Data?

• Big data is large and complex data sets collected from various sources like Sensors, Social Media, Satellite images, Audio, Video, RFID etc.

• Big data is data that exceeds the processing capacity of conventional database systems.

• How ‘Big’ is big? GB, TB, PB, ZB? No. Data is Big when it exceeds the organization’s capacity to handle, store and analyze it.

Page 8: Big data and hadoop

Problem: Storing and Analyzing “Big Data”

Page 9: Big data and hadoop

Solution*: Move compute to data

*One among many, but Hadoop is flexible, simple and reliable.

Hey there, I’m Hadoop and I can do that for you..
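"Move compute to data" can be illustrated with a small sketch: instead of shipping a block over the network to wherever a task runs, the task is assigned to a node that already stores that block. The block and node names below are hypothetical, and the least-loaded tie-breaking rule is an assumption made for illustration, not Hadoop's actual scheduler.

```python
def assign_tasks(block_locations):
    """Assign each block's task to the least-loaded node that already holds the block.

    block_locations maps a block name to the list of nodes storing a copy of it.
    This is a toy illustration of data locality, not Hadoop's real scheduling logic.
    """
    load = {}         # number of tasks assigned to each node so far
    assignment = {}   # block name -> node chosen to process it
    for block, nodes in block_locations.items():
        # Only nodes that already store the block are candidates,
        # so no block data has to move across the network.
        chosen = min(nodes, key=lambda node: load.get(node, 0))
        assignment[block] = chosen
        load[chosen] = load.get(chosen, 0) + 1
    return assignment


# Hypothetical cluster layout: three blocks, each stored on three of four nodes.
locations = {
    "B1": ["node1", "node2", "node3"],
    "B2": ["node2", "node3", "node4"],
    "B3": ["node3", "node4", "node1"],
}
tasks = assign_tasks(locations)
```

Every task in `tasks` ends up on a node listed in `locations` for its block, which is the property the slide's slogan describes.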

Page 10: Big data and hadoop

• Created: 2005
• Creators: Doug Cutting and Mike Cafarella
• Contributors: Apache, Yahoo, Google
• Language: Java

Page 11: Big data and hadoop

How Hadoop deals with “Big data”

• Primary Components
  • HDFS – Hadoop Distributed File System
  • MapReduce
  • Hadoop YARN – Job Scheduling and Resource Management
  • Hadoop Common – Access to file system

Page 12: Big data and hadoop

HDFS

• Distributed, scalable, reliable and portable file system.

• A Hadoop cluster is a set of Data Nodes and a Name Node.

• The client divides the data to be processed into blocks.

• Each block of data is replicated on 3 nodes*.

• More nodes, more efficiency.

• Robust: relies on software instead of hardware.

[Figure: HDFS Cluster. Servers labelled Master (Name Node) and Slaves (Data Nodes); Somefile.txt is split into blocks B1, B2 and B3, with copies of each block stored on several Data Nodes.]
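The block splitting and three-way replication described on this slide can be sketched in a few lines. This is a toy model under stated assumptions: the 4-byte block size, the node names and the round-robin placement rule are all invented for illustration; real HDFS uses much larger blocks and its own placement policy.

```python
BLOCK_SIZE = 4    # bytes; deliberately tiny so the example is easy to follow
REPLICATION = 3   # copies per block, as stated on the slide


def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Cut the data into consecutive fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, data_nodes, replication=REPLICATION):
    """Pick `replication` distinct Data Nodes for every block, round-robin.

    Returns a dict such as {"B1": ["node1", "node2", "node3"], ...}.
    The round-robin rule is an assumption, not the real HDFS policy.
    """
    if replication > len(data_nodes):
        raise ValueError("need at least as many Data Nodes as replicas")
    placement = {}
    for b in range(num_blocks):
        placement[f"B{b + 1}"] = [
            data_nodes[(b + r) % len(data_nodes)] for r in range(replication)
        ]
    return placement


content = b"hello hadoop"                       # stands in for Somefile.txt
blocks = split_into_blocks(content)             # three 4-byte blocks
nodes = ["node1", "node2", "node3", "node4"]    # hypothetical Data Nodes
placement = place_replicas(len(blocks), nodes)
```

Because every block lives on three different nodes, losing any single node still leaves two copies of each block, which is why the slide calls the design robust without special hardware.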

Page 13: Big data and hadoop

Map Reduce

• Divide and Conquer

• Parallel Computing

• Map(): Performs sorting and filtering

• Reduce(): Performs a summary operation

• Each node has a Task Tracker, which communicates with the Job Tracker.

• The output files will be available as local files on the client.
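The map and reduce roles listed above are easiest to see in a word count. The sketch below runs in a single Python process and only imitates the three stages (map, grouping by key, reduce); it does not use Hadoop's Java API, and the function names are chosen for illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all emitted values by key, as happens between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: summarize all values for one key, here by summing the counts."""
    return key, sum(values)


def word_count(lines):
    """Run the three stages in sequence over a list of input lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in sorted(grouped.items()))


counts = word_count(["big data", "Big Hadoop"])
```

On a real cluster each call to `map_phase` could run on a different node against its own block of the input, which is where the "divide and conquer" and "parallel computing" bullets come from.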

Page 14: Big data and hadoop

Hadoop Architecture

Page 15: Big data and hadoop

Hadoop Secondary Components

• Ambari
  – Web tool for provisioning, managing and monitoring clusters

• HBase
  – Scalable distributed database that supports structured data for large tables

• ZooKeeper
  – A high-performance coordination service for distributed applications

• Pig
  – A high-level data flow language and execution framework for parallel computation

• Hive
  – A data warehouse infrastructure that provides data summarization and ad hoc querying

• Cassandra
  – A scalable multi-master database with no single point of failure

• Chukwa
  – A data collection system for managing large distributed systems

• Lucene and Solr
  – Search engines, currently not part of Hadoop

Page 16: Big data and hadoop

Real World Example of Big data Analytics using Hadoop

[Figure: numbered data-flow diagram (labels 1 to 7) showing a MySQL database and other users.]

1. Users interact with Facebook using data in textual, image and video formats.
2. Facebook transfers the core data to a MySQL database.
3. MySQL data is replicated to Hadoop clusters.
4. Data is processed using Hadoop MapReduce functions.
5. The results are transferred back to MySQL.
6. Facebook uses the data to create recommendations for you based on your interests.

Page 17: Big data and hadoop

Why should an Enterprise move to Big Data Analytics?

• Enterprises will be able to harness relevant data and use it to make the best decisions
  – Increasing the redemption rate
  – Determining optimum prices
  – Calculating risks in a minute, and understanding future possibilities to mitigate risk
  – Enabling new products
  – Identifying patterns that help identify trends in business

The key lies in collecting quality data, not quantity.

Page 18: Big data and hadoop

What is in it for us?

Page 19: Big data and hadoop

Hadoop on Cloud

• Provision scalable storage for storing Big data as Blobs – PaaS
• Provision Linux VMs on Cloud – IaaS
• Language support for JS and C#
• Business Intelligence – Connect MS Excel to Hadoop Hive
• Remote access to Hadoop jobs via REST API, WebHCat REST API
• Easy-to-access Management Portal for monitoring Hadoop jobs
• .NET SDK to execute Hive jobs on HDInsight
• … and more
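As a rough idea of what remote access through the WebHCat REST API looks like, the sketch below builds, but deliberately does not send, a request that would submit a Hive query. The host, user name, table name and status directory are hypothetical placeholders; a real HDInsight cluster also requires authentication, which is left out here.

```python
import urllib.parse
import urllib.request


def build_hive_request(base_url, user, query, status_dir):
    """Build (without sending) a POST request for WebHCat's Hive endpoint.

    All argument values are supplied by the caller; the ones used below
    are placeholders, not details taken from the slides.
    """
    body = urllib.parse.urlencode({
        "user.name": user,        # user the job runs as
        "execute": query,         # Hive query text to run
        "statusdir": status_dir,  # directory where job status and output are written
    }).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/templeton/v1/hive",
        data=body,
        method="POST",
    )


request = build_hive_request(
    "http://localhost:50111",                    # hypothetical WebHCat address
    "hypothetical_user",
    "SELECT COUNT(*) FROM hypothetical_table;",
    "/tmp/hive-status",
)
```

Sending the request (for example with `urllib.request.urlopen(request)`) only makes sense against a running cluster, so the sketch stops at constructing it.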

Page 20: Big data and hadoop

Thank you

• Questions?

[email protected]

Vishwanathsrikanth.wordpress.com