Hadoop Stack for Big Data

Dr. Rajiv Misra
Dept. of Computer Science & Engg.
Indian Institute of Technology Patna
[email protected]

Big Data Computing: Big Data Hadoop Stack


Vu Pham


Preface

Content of this Lecture:

In this lecture, we will provide insight into Hadoop technologies, and the opportunities and challenges they present for Big Data.

We will also look into the Hadoop stack, and the applications and technologies associated with Big Data solutions.


Hadoop Beginnings


What is Hadoop ?

Apache Hadoop is an open source software framework for storage and large-scale processing of data-sets on clusters of commodity hardware.


Hadoop was created by Doug Cutting and Mike Cafarella in 2005.

It was originally developed to support distribution of the Nutch Search Engine Project.

Doug, who was working at Yahoo at the time and is now chief architect at Cloudera, named the project after his son's toy elephant, Hadoop.


Moving Computation to Data


Hadoop started out as a simple batch processing framework.

The idea behind Hadoop is that instead of moving data to computation, we move computation to data.
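The map-then-reduce flow at the heart of this model can be sketched in plain Python (a toy word count, not the Hadoop API; all function names here are our own):

```python
from collections import defaultdict

# Toy illustration of the MapReduce model: map emits (key, value) pairs,
# the framework groups them by key (the shuffle), and reduce aggregates
# each group. In Hadoop, map tasks run on the nodes holding the data.

def map_phase(records):
    # Emit (word, 1) for every word in every input record.
    for record in records:
        for word in record.split():
            yield word.lower(), 1

def shuffle(pairs):
    # Group intermediate values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Sum the counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["Hadoop moves computation to data",
         "data locality makes Hadoop scale"]
word_counts = reduce_phase(shuffle(map_phase(lines)))
```

In a real cluster the map and reduce calls run as separate tasks on many machines; only the grouping logic connects them.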


Scalability


Scalability is at the core of a Hadoop system.

We have cheap computing storage.

We can distribute and scale across it very easily in a very cost-effective manner.


Reliability

Handles Hardware Failures Automatically!


If we think about an individual machine, a rack of machines, a large cluster, or a supercomputer, they all fail at some point in time, or some of their components will fail. These failures are so common that we have to account for them ahead of time.

All of these failures are handled within the Hadoop framework. The Apache Hadoop MapReduce and HDFS components were originally derived from Google's MapReduce and the Google File System. Another very interesting thing that Hadoop brings is a new approach to data.


New Approach to Data: Keep all data


The new approach is that we can keep all the data that we have, and we can take that data and analyze it in new and interesting ways, in what is called a schema-on-read style.

This allows new kinds of analysis. We can bring more data into simple algorithms; it has been shown that with more granularity you can often achieve better results than by taking a small amount of data and running really complex analytics on it.
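The schema-on-read idea can be sketched as follows (a toy example of our own; the raw data and field names are invented): the bytes are stored exactly as ingested, and a schema is projected onto them only at query time, so the same data can later be read under a different schema.

```python
import csv
import io

# Schema-on-read sketch: raw records are kept as-is, and structure is
# applied when reading, not when writing.

RAW = "2024-01-01,alice,42\n2024-01-02,bob,17\n"  # stored exactly as ingested

def read_with_schema(raw, fields):
    # Project a schema onto the raw text at read time.
    for row in csv.reader(io.StringIO(raw)):
        yield dict(zip(fields, row))

records = list(read_with_schema(RAW, ["date", "user", "score"]))
```

Reading the same RAW string with a different field list would yield differently shaped records without rewriting any stored data.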


Apache Hadoop Framework & its Basic Modules


Hadoop Common: It contains libraries and utilities needed by other Hadoop modules.

Hadoop Distributed File System (HDFS): It is a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the entire cluster.

Hadoop YARN: It is a resource-management platform responsible for managing compute resources in the cluster and using them to schedule users and applications.

Hadoop MapReduce: It is a programming model that scales data processing across a lot of different processes.

Apache Framework Basic Modules


High Level Architecture of Hadoop


The two major pieces of Hadoop are the Hadoop Distributed File System and MapReduce, a parallel processing framework that will map and reduce data. These are both open source and inspired by technologies developed at Google.

If we talk about this high-level infrastructure, we start talking about things like TaskTrackers and JobTrackers, and NameNodes and DataNodes.


HDFS: Hadoop Distributed File System


HDFS is a distributed, scalable, and portable file system written in Java for the Hadoop framework.

A Hadoop instance typically has a single name node and a cluster of data nodes that form the HDFS cluster.

HDFS stores large files, typically in the range of gigabytes to terabytes, and now petabytes, across multiple machines. It achieves reliability by replicating the data across multiple hosts, and therefore does not require RAID storage on hosts.
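The block-and-replica scheme can be sketched in a few lines (a toy model of our own; the block size, node names, and round-robin placement are illustrative, not HDFS's actual placement policy):

```python
# Toy sketch of HDFS-style block replication: a file is split into
# fixed-size blocks, and each block is stored on `replication` distinct
# data nodes, so losing any single node loses no data.

BLOCK_SIZE = 4            # real HDFS defaults to 128 MB; tiny here on purpose
DATA_NODES = ["dn1", "dn2", "dn3", "dn4"]

def place_blocks(data, replication=3):
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    placement = {}
    for i, block in enumerate(blocks):
        # Round-robin placement onto `replication` distinct nodes.
        nodes = [DATA_NODES[(i + r) % len(DATA_NODES)] for r in range(replication)]
        placement[i] = {"block": block, "nodes": nodes}
    return placement

placement = place_blocks(b"hello world!")
```

With 12 bytes and 4-byte blocks this yields three blocks, each living on three distinct nodes; the name node's job is essentially to remember a map like `placement`.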



MapReduce Engine


The typical MapReduce engine consists of a JobTracker, to which client applications can submit MapReduce jobs. The JobTracker pushes work out to all the available TaskTrackers in the cluster, striving to keep the work as close to the data as possible, and as balanced as possible.
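The locality preference just described can be sketched like this (a toy scheduler of our own, not the JobTracker's real algorithm; all names are invented): given where each block's replicas live, prefer a tracker that already holds the data, and fall back to any free tracker otherwise.

```python
# Toy sketch of data-locality-aware task assignment.

def schedule(tasks, block_locations, free_trackers):
    # tasks: {task: block it reads}; block_locations: {block: [nodes]};
    # free_trackers: set of nodes with an open task slot.
    assignment = {}
    for task, block in tasks.items():
        local = [n for n in block_locations[block] if n in free_trackers]
        chosen = local[0] if local else next(iter(free_trackers))
        assignment[task] = chosen
        free_trackers.discard(chosen)   # that slot is now taken
    return assignment

assignment = schedule(
    {"t1": "b1", "t2": "b2"},
    {"b1": ["dn1", "dn2"], "b2": ["dn3"]},
    {"dn2", "dn3", "dn4"},
)
```

Here t1 lands on dn2 (a node holding b1) and t2 on dn3 (holding b2), so no task has to pull its block over the network.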


Apache Hadoop NextGen MapReduce (YARN)




Hadoop 1.0 vs. Hadoop 2.0


Hadoop 2.0 provides a more general processing platform that is not constrained to map-and-reduce kinds of processes.

The fundamental idea behind MapReduce 2.0 is to split up the two major functionalities of the JobTracker, resource management and job scheduling/monitoring, into two separate units. The idea is to have a global ResourceManager and a per-application ApplicationMaster.
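That split can be sketched with two toy classes (our own illustration, not the YARN API): the global ResourceManager only hands out containers and tracks nothing per-job, while each application's own ApplicationMaster drives that job's tasks.

```python
# Toy sketch of YARN's ResourceManager / ApplicationMaster split.

class ResourceManager:
    """Global, pure scheduler: grants containers, knows nothing about jobs."""
    def __init__(self, total_containers):
        self.available = total_containers

    def allocate(self, n):
        granted = min(n, self.available)
        self.available -= granted
        return granted

class ApplicationMaster:
    """Per-application: requests containers and tracks its own tasks."""
    def __init__(self, rm, tasks):
        self.rm, self.tasks, self.done = rm, tasks, 0

    def run(self):
        while self.done < self.tasks:
            granted = self.rm.allocate(self.tasks - self.done)
            if granted == 0:
                break               # cluster full; wait in a real system
            self.done += granted    # run this job's tasks in the containers
        return self.done

rm = ResourceManager(total_containers=10)
finished = ApplicationMaster(rm, tasks=4).run()
```

Because per-job bookkeeping lives in each ApplicationMaster, the ResourceManager stays small enough to schedule very large clusters, which is the scalability argument made above.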


YARN enhances the power of the Hadoop compute cluster, without being limited by the MapReduce kind of framework.

Its scalability is great. The processing power in data centers continues to grow quickly, and because the YARN ResourceManager focuses exclusively on scheduling, it can manage those very large clusters quite quickly and easily.

YARN is completely compatible with MapReduce. Existing MapReduce application end users can run on top of YARN without disrupting any of their existing processes.

It provides improved cluster utilization as well. The ResourceManager is a pure scheduler, so it optimizes cluster utilization according to criteria such as capacity guarantees, fairness, and different SLAs, or service-level agreements.

What is Yarn ?


Key benefits: scalability, MapReduce compatibility, and improved cluster utilization.


YARN supports workflows other than just MapReduce.

We can now bring in additional programming models, such as graph processing or iterative modeling, and it is now possible to process the data in HBase. This is especially useful when we talk about machine learning applications.

YARN allows multiple access engines, either open source or proprietary, to use Hadoop as a common standard for batch or interactive processing, and even real-time engines that can simultaneously access a lot of different data. So you can put streaming kinds of applications on top of YARN inside a Hadoop architecture, and seamlessly work and communicate between these environments.

What is Yarn ?


Key features: fairness, support for other workloads (iterative modeling, machine learning), and multiple access engines.


The Hadoop “Zoo”


Apache Hadoop Ecosystem


Original Google Stack


Google had their original MapReduce, and they were storing and processing large amounts of data.

They wanted to be able to access that data in a SQL-like language, so they built a SQL gateway to ingest data into the MapReduce cluster and be able to query some of that data as well.


Original Google Stack


Then they realized they needed a high-level, specific language to access MapReduce in the cluster and submit some of those jobs. So Sawzall came along.

Then Evenflow came along and allowed them to chain together complex workflows and coordinate events and services across this kind of framework, or the specific cluster they had at the time.


Original Google Stack


Then Dremel came along. Dremel provided columnar storage and a metadata manager that allowed them to manage the data and process very large amounts of unstructured data.

Then Chubby came along as a coordination system that would manage all of the products as one unit, one ecosystem that could process all these large amounts of structured data seamlessly.


Facebook’s Version of the Stack


Yahoo Version of the Stack


LinkedIn’s Version of the Stack


Cloudera’s Version of the Stack


Hadoop Ecosystem Major Components


A tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.

Apache Sqoop


HBase is a key component of the Hadoop stack, as its design caters to applications that require really fast random access to significant data sets.

Column-oriented database management system

Key-value store

Based on Google Big Table

Can hold extremely large data

Dynamic data model

Not a Relational DBMS
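The data model just listed can be sketched as nested dictionaries (a toy of our own; the row keys, column families, and qualifiers are invented): each row key maps to column families, each holding arbitrary qualifier-to-value pairs, which is what makes the schema dynamic.

```python
# Toy sketch of an HBase-style key-value, column-oriented data model.

table = {}  # row_key -> {column_family: {qualifier: value}}

def put(row, family, qualifier, value):
    table.setdefault(row, {}).setdefault(family, {})[qualifier] = value

def get(row, family, qualifier):
    # Fast random access by row key: the access pattern HBase is built for.
    return table.get(row, {}).get(family, {}).get(qualifier)

put("user#42", "info", "name", "alice")
put("user#42", "metrics", "visits", 7)   # a new column, added on the fly
```

Unlike a relational table, no column set is declared up front; any row can carry any qualifiers under its families.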

HBase


High level programming on top of Hadoop MapReduce

The language: Pig Latin

Data analysis problems as data flows

Originally developed at Yahoo in 2006

PIG


PIG for ETL


A good example of PIG applications is the ETL transaction model, which describes how a process will extract data from a source, transform it according to a rule set that we specify, and then load it into a data store.

PIG can ingest data from files, streams, or any other sources using UDFs: user-defined functions that we can write ourselves.

When it has all the data, it can perform select, iterate, and other kinds of transformations.
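The extract-transform-load flow described above can be sketched as a Python dataflow (the records, field names, and filter rule are invented for illustration; real Pig would express this in Pig Latin over HDFS data):

```python
# Toy ETL dataflow: extract -> transform (apply a rule set) -> load.

raw = ["alice,30,IN", "bob,-1,US", "carol,25,IN"]        # extract
parsed = [line.split(",") for line in raw]
valid = [(name, int(age), cc) for name, age, cc in parsed
         if int(age) >= 0]                               # transform: drop bad ages
store = {}                                               # load: group by country
for name, age, cc in valid:
    store.setdefault(cc, []).append((name, age))
```

Each step consumes the previous step's relation and produces a new one, which is exactly how Pig Latin scripts read: a chain of named dataflow transformations.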


Data warehouse software that facilitates querying and managing large datasets residing in distributed storage.

A SQL-like language!

Facilitates querying and managing large datasets in HDFS.

Provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL.
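Hive's idea of projecting structure onto stored data and querying it with SQL can be sketched as follows (our own toy, with SQLite standing in for the HiveQL engine; the file contents and column names are made up):

```python
import sqlite3

# Sketch of the Hive idea: plain delimited text gets a table structure
# projected onto it, and is then queried with SQL.

raw_file = "2024-01-01,click,3\n2024-01-01,view,10\n2024-01-02,click,5\n"

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE events (day TEXT, kind TEXT, n INTEGER)")
for line in raw_file.splitlines():
    day, kind, n = line.split(",")
    db.execute("INSERT INTO events VALUES (?, ?, ?)", (day, kind, int(n)))

clicks = db.execute(
    "SELECT SUM(n) FROM events WHERE kind = 'click'").fetchone()[0]
```

In Hive the query would instead be compiled into MapReduce (or later, other engine) jobs over files in HDFS, but the user-facing model is the same: structure at query time, SQL on top.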

Apache Hive


Workflow scheduler system to manage Apache Hadoop jobs

Oozie Coordinator jobs!

Supports MapReduce, Pig, Apache Hive, and Sqoop, etc.
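A workflow scheduler of this kind essentially runs actions in dependency order; a minimal sketch (the action names are invented, and real Oozie defines the DAG in XML rather than Python):

```python
# Toy sketch of dependency-ordered workflow execution.

def run_workflow(deps):
    # deps: action -> list of actions that must finish first.
    finished, order = set(), []
    def run(action):
        if action in finished:
            return
        for d in deps[action]:
            run(d)                      # run prerequisites first
        finished.add(action)
        order.append(action)
    for action in deps:
        run(action)
    return order

order = run_workflow({
    "sqoop-import": [],
    "pig-clean": ["sqoop-import"],
    "hive-load": ["pig-clean"],
})
```

Here a hypothetical Sqoop import must finish before a Pig cleaning step, which must finish before a Hive load: the kind of MapReduce/Pig/Hive/Sqoop chaining the bullet above describes.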

Oozie Workflow


ZooKeeper provides operational services for a Hadoop cluster.

It is a centralized service for maintaining configuration information, providing naming services, providing distributed synchronization, and providing group services.
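The "centralized configuration and naming" role can be sketched with ZooKeeper's tree-of-znodes data model (a toy of our own; the paths and values are illustrative, and the real service adds watches, ephemeral nodes, and replication):

```python
# Toy sketch of ZooKeeper's data model: a small tree of named "znodes"
# gives the whole cluster one shared place for configuration and naming.

znodes = {}

def create(path, value):
    znodes[path] = value

def get_znode(path):
    return znodes.get(path)

create("/config/replication", "3")          # shared configuration
create("/services/namenode", "nn1:8020")    # naming-service entry

# Every client that reads the same path sees the same answer.
replication = get_znode("/config/replication")
```

The point is the single source of truth: instead of each node keeping its own copy of cluster settings, everyone reads and writes the same paths.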

ZooKeeper


Flume

A distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.

It has a simple and very flexible architecture based on streaming data flows. It is quite robust and fault tolerant, and it is really tunable to enhance the reliability mechanisms, failover, recovery, and all the other mechanisms that keep the cluster safe and reliable.

It uses a simple, extensible data model that allows us to apply all kinds of online analytic applications.
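Flume's streaming data flow can be sketched in miniature (the component names mirror Flume's source/channel/sink concepts, but the code is our own toy, not the Flume API): a source pushes events into a buffering channel, and a sink drains them to storage.

```python
from collections import deque

# Toy sketch of a Flume-style flow: source -> channel -> sink.

channel = deque()          # buffers events between source and sink
log_store = []             # stands in for the HDFS destination

def source(events):
    # Ingest events (e.g. log lines) into the channel.
    for event in events:
        channel.append(event)

def sink():
    # Drain the channel; the buffer is what lets the sink lag the source.
    while channel:
        log_store.append(channel.popleft())

source(["GET /index", "POST /login"])
sink()
```

The channel in the middle is what gives the real system its reliability knobs: it can be memory-backed for speed or file-backed so events survive a crash between source and sink.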


Additional Cloudera Hadoop Components: Impala


Impala was designed specifically at Cloudera, and it is a query engine that runs on top of Apache Hadoop. The project was officially announced at the end of 2012, and became a publicly available, open source distribution.

Impala brings scalable parallel database technology to Hadoop and allows users to submit low-latency queries to the data stored within HDFS or HBase without requiring a lot of data movement and manipulation.

Impala is integrated with Hadoop, and it works within the same ecosystem, with the same file formats, metadata, security and reliability resources, and management workflows.

It brings that scalable parallel database technology on top of Hadoop, and actually allows us to submit SQL-like queries at much faster speeds, with a lot less latency.

Impala


Additional Cloudera Hadoop Components: Spark, The New Paradigm


Apache Spark is a fast and general engine for large-scale data processing.

Spark is a scalable data analytics platform that incorporates primitives for in-memory computing, and therefore offers performance advantages over the traditional Hadoop cluster storage system approach. It is implemented in, and supports, the Scala language, and provides a unique environment for data processing.

Spark is really great for more complex kinds of analytics, and it is great at supporting machine learning libraries.

It is yet another open source computing framework; it was originally developed at the AMPLab at the University of California, Berkeley, and was later donated to the Apache Software Foundation, where it remains today.

Spark


In contrast to Hadoop's two-stage, disk-based MapReduce paradigm, Spark's multi-stage in-memory primitives provide performance up to 100 times faster for certain applications.

It allows user programs to load data into a cluster's memory and query it repeatedly.

Spark is really well suited for machine learning kinds of applications, which oftentimes involve iterative, in-memory computation.

Spark requires a cluster manager and a distributed storage system. For cluster management, Spark supports standalone native Spark clusters, or you can run Spark on top of Hadoop YARN, or via Apache Mesos.

For distributed storage, Spark can interface with a variety of storage systems, including HDFS and Amazon S3.
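The load-once, query-repeatedly benefit can be sketched like this (toy code of our own, not the Spark API): one read into memory pays for any number of iterative passes, whereas chained disk-based MapReduce jobs would re-read from storage each time.

```python
# Toy sketch of in-memory iteration: count how often storage is touched.

reads_from_storage = 0

def load_from_storage():
    global reads_from_storage
    reads_from_storage += 1          # each call models a full disk/HDFS scan
    return list(range(10))

cached = load_from_storage()         # one load into cluster memory

# Three iterative queries over the cached data cost no further reads.
total = sum(cached)
evens = [x for x in cached if x % 2 == 0]
maximum = max(cached)
```

An iterative machine learning loop making dozens of passes is exactly the workload where keeping `cached` in memory, as Spark does with its resilient datasets, beats re-scanning storage per pass.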

Spark Benefits


Conclusion

In this lecture, we have discussed the specific components and basic processes of the Hadoop architecture, software stack, and execution environment.
