Introduction to Hadoop HDFS and Ecosystems
ANSHUL MITTAL
Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data
Topics
The goal of this presentation is to give you a basic understanding of Hadoop's core components (HDFS and MapReduce)
In this presentation, I will try to answer the following questions:
◦ Which factors led to the era of Big Data
◦ What is Hadoop and what are its significant features
◦ How it offers reliable storage for massive amounts of data with HDFS
◦ How it supports large scale data processing through MapReduce
◦ How ‘Hadoop Ecosystem’ tools can boost an analyst’s productivity
Topics
The Motivation For Hadoop
Hadoop Overview
HDFS
MapReduce
The Hadoop Ecosystem
Cost of Data Storage
Source: http://www.mkomo.com/cost-per-gigabyte-update
Traditional vehicles for storing big data are prohibitively expensive
We must retain and process large amounts of data to extract its value
Unfortunately, traditional vehicles for storing this data have become prohibitively expensive
◦ In dollar costs, $ per GB
◦ In computational costs
◦ The need for ETL jobs to summarize the data for warehousing
The Traditional Workaround
How has industry typically dealt with these problems?
◦ Perform an ETL (Extract, Transform, Load) on the data to summarize and denormalize the data, before archiving the result in a data warehouse
◦ Discarding the details
◦ Run queries against the summary data
Unfortunately, this process results in loss of detail
◦ But there could be real nuggets in the lost detail
◦ Value lost with data summarization or deletion
◦ Think of the opportunities lost when we summarize 10000 rows of data into a single record
◦ We may not learn much from a single tweet, but we can learn a lot from 10000 tweets
◦ More data == Deeper understanding
◦ “There’s no data like more data” (Moore 2001)
◦ “It’s not who has the best algorithms that wins. It’s who has the most data”
We Need A System That Scales
We’re generating too much data to process with traditional tools
Two key problems to address
◦ How can we reliably store large amounts of data at a reasonable cost?
◦ How can we analyze all the data we have stored?
What is Hadoop?
‘Apache Hadoop is an open-source software framework written in Java for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware.’
‘A free, open source software framework that supports data-intensive distributed applications.’
- Cloudera
‘Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment. It is part of the Apache project sponsored by the Apache Software Foundation.’
- TechTarget
What is Hadoop?
Hadoop is a free software package that provides tools that enable data discovery, question answering, and rich analytics based on very large volumes of information.
All this in fractions of the time required by traditional databases.
What Is Apache Hadoop?
A system for providing scalable, economical data storage and processing
◦ Distributed and fault tolerant
◦ Harnesses the power of industry standard hardware
‘Core’ Hadoop consists of two main components
◦ Storage: The Hadoop Distributed File System (HDFS)
◦ A framework for distributing data across a cluster in ways that are scalable and fast.
◦ Processing: MapReduce
◦ A framework for processing data in ways that are reliable and fast
◦ Plus the infrastructure needed to make them work, including
◦ Filesystem and administration utilities
◦ Job scheduling and monitoring
Where Did Hadoop Come From?
Started with research that originated at Google back in 2003
◦ Google's objective was to index the entire World Wide Web
◦ Google had reached the limits of scalability of RDBMS technology
◦ This research led to a new approach to store and analyze large quantities of data
◦ The Google File System (Ghemawat et al. 2003)
◦ MapReduce: Simplified Data Processing on Large Clusters (Dean and Ghemawat 2008)
A developer by the name of Doug Cutting (at Yahoo!) was wrestling with many of the same problems in the implementation of his own open-source search engine, Nutch
◦ He started an open-source project based on Google’s research and created Hadoop in 2005.
◦ Hadoop was named after his son’s toy elephant.
Scalability
Hadoop is a distributed system
◦ A collection of servers running Hadoop software is called a cluster
Individual servers within a cluster are called nodes
◦ Typically standard rack-mount servers running Linux
◦ Each node both stores and processes data
Add more nodes to the cluster to increase scalability
◦ A cluster may contain up to several thousand nodes
◦ Facebook and Yahoo are each running clusters in excess of 4400 nodes
◦ Scalability is linear
◦ And it's accomplished on standard hardware, greatly reducing the storage costs
Reliability / Availability
Reliability and availability are traditionally expensive to provide
◦ Requiring expensive hardware, redundancy, auxiliary resources, and training
With Hadoop, providing reliability and availability is much less expensive
◦ If a node fails, just replace it
◦ At Google, servers are not bolted into racks; they are attached to the racks using velcro, in anticipation of the need to replace them quickly
◦ If a task fails, it is automatically run somewhere else
◦ No data is lost
◦ No human intervention is required
In other words, reliability / availability are provided by the framework itself
Fault Tolerance
Adding nodes increases chances that any one of them will fail
◦ Hadoop builds redundancy into the system and handles failures automatically
Files loaded into HDFS are replicated across nodes in the cluster
◦ If a node fails, its data is re-replicated using one of the other copies
Data processing jobs are broken into individual tasks
◦ Each task takes a small amount of data as input
◦ Thousands of tasks (or more) often run in parallel
◦ If a node fails during processing, its tasks are rescheduled elsewhere
Routine failures are handled automatically without any loss of data
How are reliability, availability, and fault tolerance provided?
A file in Hadoop is broken down into blocks which are:
◦ Typically 64MB each in older releases, 128MB by default in Hadoop 2 and later (a typical Windows block size is 4KB)
◦ Distributed across the cluster
◦ Replicated across multiple nodes of the cluster (Default: 3)
Consequences of this architecture:
◦ If a machine fails (even an entire rack) no data is lost
◦ Tasks can run elsewhere, where the data resides
◦ If a task fails, it can be dispatched to one of the nodes containing redundant copies of the same data block
◦ When a failed node comes back online, it can automatically be reintegrated with the cluster
HDFS: Hadoop Distributed File System
Provides inexpensive and reliable storage for massive amounts of data
◦ Optimized for a relatively small number of large files
◦ Each file is likely to exceed 100 MB; multi-gigabyte files are common
◦ Stores files in a hierarchical directory structure
◦ e.g. /sales/reports/asia.txt
◦ Cannot modify files once written
◦ Need to make changes? Remove and recreate
◦ Use Hadoop-specific utilities to access HDFS
HDFS and Unix File System
In some ways, HDFS is similar to a UNIX filesystem
◦ Hierarchical, with UNIX-style paths (e.g. /sales/reports/asia.txt)
◦ UNIX-style file ownership and permissions
There are also some major deviations from UNIX
◦ No concept of a current directory
◦ Cannot modify files once written
◦ You can delete them and recreate them, but you can't modify them
◦ Must use Hadoop-specific utilities or custom code to access HDFS
HDFS Architecture (1 Of 3)
Hadoop has a master/slave architecture
HDFS master daemon: NameNode
◦ Manages namespace (file to block mappings) and metadata (block to machine mappings)
◦ Monitors slave nodes
HDFS slave daemon: DataNode
◦ Reads and writes the actual data
HDFS Architecture (2 Of 3)
Example:
◦ The NameNode holds metadata for the two files
◦ Foo.txt (300MB) and Bar.txt (200MB)
◦ Assume HDFS is configured for 128MB blocks
◦ The DataNodes hold the actual blocks
◦ Each block is at most 128MB in size; the last block of a file holds only the remaining data
◦ Each block is replicated three times on the cluster
◦ Block reports are periodically sent to the NameNode
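The arithmetic behind this example can be sketched in a few lines of Python. This is only an illustration of how a file size maps to block sizes and replica counts, not HDFS code; the function names `split_into_blocks` and `replica_count` are hypothetical, and the sketch assumes the final block of a file holds only the leftover bytes.

```python
MB = 1024 * 1024


def split_into_blocks(file_size, block_size=128 * MB):
    """Return the sizes (in bytes) of the blocks a file would occupy."""
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block holds only the leftover bytes
    return sizes


def replica_count(file_size, block_size=128 * MB, replication=3):
    """Total number of block replicas stored across the cluster for one file."""
    return len(split_into_blocks(file_size, block_size)) * replication


foo_blocks = split_into_blocks(300 * MB)  # Foo.txt from the example
bar_blocks = split_into_blocks(200 * MB)  # Bar.txt from the example
print([size // MB for size in foo_blocks])  # [128, 128, 44]
print([size // MB for size in bar_blocks])  # [128, 72]
print(replica_count(300 * MB) + replica_count(200 * MB))  # 15
```

Under these assumptions the two files occupy five blocks in total, and with three-way replication the DataNodes hold fifteen block replicas between them.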
HDFS Architecture (3 Of 3)
Optimal for handling millions of large files, rather than billions of small files, because:
◦ In pursuit of responsiveness, the NameNode stores all of its file/block information in memory
◦ Too many files will cause the NameNode to run out of memory
◦ Too many blocks (if the blocks are small) will also cause the NameNode to run out of memory
◦ Processing each block requires its own Java Virtual Machine (JVM) and (if you have too many blocks) you begin to see the limits of HDFS scalability
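A rough back-of-the-envelope sketch shows why many small files strain the NameNode. The per-object cost of about 150 bytes is a commonly quoted rule of thumb and is an assumption here, not a figure from these slides; the function name `namenode_bytes` is hypothetical, and directories are ignored for simplicity.

```python
MB = 1024 * 1024
TB = 1024 * 1024 * MB

BYTES_PER_OBJECT = 150  # assumed rule of thumb, not taken from the slides


def namenode_bytes(total_data, file_size, block_size=128 * MB):
    """Estimate NameNode metadata size for total_data stored as equal-sized files."""
    num_files = total_data // file_size
    blocks_per_file = -(-file_size // block_size)  # ceiling division
    objects = num_files * (1 + blocks_per_file)    # one file entry plus its blocks
    return objects * BYTES_PER_OBJECT


large = namenode_bytes(1 * TB, 128 * MB)  # 1 TB stored as 128MB files
small = namenode_bytes(1 * TB, 1 * MB)    # the same 1 TB stored as 1MB files
print(large, small, small // large)
```

With these assumed numbers, storing the same terabyte as 1MB files needs 128 times as much NameNode metadata as storing it as 128MB files, which is the effect the bullets above describe.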
Accessing HDFS Via The Command Line
HDFS is not a general purpose file system
◦ HDFS files cannot be accessed through the host OS
◦ End users typically access HDFS via the hadoop fs command
Example:
◦ Display the contents of the /user/fred/sales.txt file
    hadoop fs -cat /user/fred/sales.txt
◦ Create a directory (below the root) called reports
    hadoop fs -mkdir /reports
Copying Local Data To And From HDFS
Remember that HDFS is separated from your local filesystem
◦ Use hadoop fs -put to copy local files to HDFS
◦ Use hadoop fs -get to copy HDFS files to local files
More Command Examples
Copy file input.txt from the local disk to the user’s home directory in HDFS
    hadoop fs -put input.txt input.txt
◦ This will copy the file to /user/username/input.txt
Get a directory listing of the HDFS root directory
    hadoop fs -ls /
Delete the file /reports/sales.txt
    hadoop fs -rm /reports/sales.txt
Other command options emulating POSIX commands are available
Introducing MapReduce
Now that we have described how Hadoop stores data, let's turn our attention to how it processes data
We typically process data in Hadoop using MapReduce
MapReduce is not a language; it’s a programming model
MapReduce consists of two functions: map and reduce
Understanding Map and Reduce
The map function always runs first
◦ Typically used to “break down”
◦ Filter, transform, or parse data, e.g. parse the stock symbol, price and time from a data feed
◦ The output from the map function (eventually) becomes the input to the reduce function
The reduce function
◦ Typically used to aggregate data from the map function
◦ e.g. Compute the average hourly price of the stock
◦ Not always needed and therefore optional
◦ You can run something called a “map-only” job
Between these two tasks there is typically a hidden phase known as the “Shuffle and Sort”
◦ Which organizes map output for delivery to the reducer
Each individual piece is simple, but collectively they are quite powerful
◦ Analogous to a pipe / filter in Unix
MapReduce Example
MapReduce code for Hadoop is typically written in Java
◦ But it is possible to use nearly any language with Hadoop Streaming
The following slides will explain an entire MapReduce job
◦ Input: Text file containing order ID, employee name, and sale amount
◦ Output: Sum of all sales per employee
Explanation Of The Map Function
Hadoop splits a job into many individual map tasks
◦ The number of map tasks is determined by the amount of input data
◦ Each map task receives a portion of the overall job input to process in parallel
◦ Mappers process one input record at a time
◦ For each input record, they emit zero or more records as output
In this case, 5 map tasks each parse a subset of the input records
◦ And then emit the name and sales fields for each record as output
Mapper does a simple task of extracting 2 fields out of 3
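A map function of this kind can be sketched in plain Python. This is a minimal illustration of the programming model rather than Hadoop API code: the function name `map_sale`, the tab-separated record layout, and the sample names and amounts are all assumptions made for the example.

```python
def map_sale(line):
    """Parse one input record and yield (employee_name, sale_amount) pairs.

    Assumes tab-separated fields: order ID, employee name, sale amount.
    """
    fields = line.strip().split("\t")
    if len(fields) != 3:
        return  # emit zero records for a malformed line
    _order_id, name, amount = fields
    yield name, int(amount)


# Hypothetical sample input; the third line is deliberately malformed.
records = ["0001\tAlice\t3400", "0002\tBob\t2500", "bad line"]
pairs = [pair for line in records for pair in map_sale(line)]
print(pairs)  # [('Alice', 3400), ('Bob', 2500)]
```

Writing the mapper as a generator mirrors the point above that each input record may produce zero or more output records.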
Shuffle and Sort
Hadoop automatically sorts and merges output from all map tasks
◦ Sorting records by key (name, in this example)
◦ This (transparent) intermediate process is known as the “Shuffle and Sort”
◦ The result is supplied to reduce tasks
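What the framework does in this phase can be imitated in a few lines of Python. The sketch below is an in-memory stand-in for illustration only (Hadoop performs this step across the cluster); the function name `shuffle_and_sort` and the sample data are assumptions.

```python
from itertools import groupby
from operator import itemgetter


def shuffle_and_sort(map_outputs):
    """Merge the output of several map tasks, sort it by key, and group values per key."""
    merged = [pair for task_output in map_outputs for pair in task_output]
    merged.sort(key=itemgetter(0))  # sort records by key (the employee name)
    return [
        (key, [value for _, value in group])
        for key, group in groupby(merged, key=itemgetter(0))
    ]


# Hypothetical output from two map tasks
map_outputs = [
    [("Bob", 2500), ("Alice", 3400)],
    [("Alice", 1200), ("Carol", 800)],
]
print(shuffle_and_sort(map_outputs))
# [('Alice', [3400, 1200]), ('Bob', [2500]), ('Carol', [800])]
```

After this step every value for a given name sits together, which is what lets a single reducer see all records for its key.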
Explanation Of Reduce Function
Reducer input comes from the shuffle and sort process
◦ As with map, the reduce function receives one record at a time
◦ A given reducer receives all records for a given key
◦ For each input record, reduce can emit zero or more output records
This reduce function sums the total sales per person
◦ And emits the employee name (key) and total sales (value) as output to a file
◦ Each reducer output file is written to a job-specific directory in HDFS
Reducer sums up things
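The summing reducer can be sketched the same way. This is an illustrative Python function, not the Hadoop reducer API; the name `reduce_sales` and the grouped sample input are assumptions, and the output is printed here rather than written to HDFS.

```python
def reduce_sales(name, amounts):
    """Receive one key with all of its values and yield (name, total_sales)."""
    yield name, sum(amounts)


# Hypothetical grouped input, as a shuffle-and-sort step might deliver it
grouped = [("Alice", [3400, 1200]), ("Bob", [2500])]
totals = [
    result
    for name, amounts in grouped
    for result in reduce_sales(name, amounts)
]
print(totals)  # [('Alice', 4600), ('Bob', 2500)]
```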
Putting It All Together
Here’s the data flow for the entire MapReduce job
[Figure: data flow through the Map, Shuffle & sort, and Reduce phases]
Benefits of MapReduce
Simplicity (via fault tolerance)
◦ Particularly when compared with other distributed programming models
Flexibility
◦ Offers more analytic capabilities and works with more data types than platforms like SQL
Scalability
◦ Because it works with
◦ Small quantities of data at a time
◦ Running in parallel across a cluster
◦ Sharing nothing among the participating nodes
Hadoop Resource Management: YARN
Working with YARN
Developers:
◦ Submit jobs (applications) to run on the YARN cluster
◦ Monitor and manage jobs
Hadoop includes three major YARN tools for developers
◦ The Hue Job Browser
◦ The YARN Web UI
◦ The YARN command line
Data Ingestion
Data Processing
Data Analysis and Exploration
Other Hadoop Ecosystem Tools
The Hadoop Ecosystem
Hadoop is a framework for distributed storage and processing
Core Hadoop includes HDFS for storage and YARN for cluster resource management
The Hadoop ecosystem includes many components for
◦ Data ingestion and database integration: Sqoop, Flume, Kafka
◦ Storage: HDFS, HBase
◦ Data processing: Spark, MapReduce, Pig
◦ SQL support: Hive, Impala, Spark SQL
◦ Other:
◦ Workflow: Oozie
◦ Machine learning: Mahout
Recap: Data Center Integration (1 Of 2)
Recap: Data Center Integration (2 Of 2)
Hadoop is not necessarily intended to replace existing parts of a data center
◦ But its capabilities do allow you to offload some of the functionality currently being provided (more expensively) by other technologies
◦ So while Hadoop will not replace the use of an RDBMS to retrieve data as part of a transaction, it could be used to store far more detail about the transaction than you would in an ordinary RDBMS (due to cost limitations)
Thank you!