hadoop 2.0 introduction – with hdp for windows...2015/05/14 · agenda • what is big data –...
![Page 1: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/1.jpg)
Seele Lin
Hadoop 2.0 Introduction – with HDP for Windows
![Page 2: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/2.jpg)
Who am I
• Speaker: – 林彥辰 – A.K.A. Seele Lin – Mail: [email protected]
• Experience – 2010~Present
– 2013~2014 • Trainer for Hortonworks Certified Training courses
– HCAHD (Hortonworks Certified Apache Hadoop Developer) – HCAHA (Hortonworks Certified Apache Hadoop Administrator)
![Page 3: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/3.jpg)
Agenda
• What is Big Data – The Need for Hadoop
• Hadoop Introduction – What is Hadoop 2.0
• Hadoop Architecture Fundamentals – What is HDFS – What is MapReduce – What is YARN – Hadoop eco-systems
• HDP for Windows – What is HDP – How to install HDP on Windows – The advantages of HDP – What’s Next
• Conclusion • Q&A
![Page 4: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/4.jpg)
What is Big Data
![Page 5: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/5.jpg)
What is Big Data?
1. In what timeframe do we now create the same amount of information that we created from the dawn of civilization until 2003? Answer: 2 days
2. 90% of the world’s data was created in the last (how many years)? Answer: 2 years
3. And this data is from a 2010 report!
Sources: http://www.itbusinessedge.com/cm/blogs/lawson/just-the-stats-big-numbers-about-big-data/?cs=48051 http://techcrunch.com/2010/08/04/schmidt-data/
![Page 6: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/6.jpg)
How large can it be?
• 1ZB = 1000 EB = 1,000,000 PB = 1,000,000,000 TB
![Page 7: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/7.jpg)
Every minute…
• http://whathappensontheinternetin60seconds.com/
![Page 8: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/8.jpg)
The definition?
A set of files A database A single file
“Big Data is like teenage sex: Everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it too.”── Dan Ariely.
![Page 9: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/9.jpg)
Big Data Includes All Types of Data
Structured • Pre-defined schema • Relational database system
Semi-structured • Inconsistent structure • Cannot be stored in rows in a single table • Logs, tweets • Often has nested structure
Unstructured • Irregular structure, or parts of it lack structure • Pictures • Video
Time-sensitive Immutable
![Page 10: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/10.jpg)
© Hortonworks Inc. 2013
6 Key Hadoop DATA TYPES
1. Sentiment How your customers feel
2. Clickstream Website visitors’ data
3. Sensor/Machine Data from remote sensors and machines
4. Geographic Location-based data
5. Server Logs
6. Text Millions of web pages, emails, and documents
Value
![Page 11: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/11.jpg)
4 V’s of Big Data
http://www.datasciencecentral.com/profiles/blogs/data-veracity
![Page 12: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/12.jpg)
Next Product to Buy (NPTB)
Business Problem
• Telecom product portfolios are complex
• There are many cross-sell opportunities to the installed base
• Sales associates use in-person conversations to guess about NPTB recommendations, with little supporting data
Solution
• Hadoop gives telcos the ability to make confident NPTB recommendations, based on data from all their customers
• Confident NPTB recommendations empower sales associates and improve their interactions with customers
• Use the HDP data lake to reduce sales friction and create an NPTB advantage like Amazon’s advantage in eCommerce
![Page 13: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/13.jpg)
Use case – Walmart prediction
Revenue ?
Friday
Beer Diapers
![Page 14: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/14.jpg)
Localized, Personalized Promotions
Business Problem
• Telcos can geo-locate their mobile subscribers
• They could create localized and personalized promotions
• This requires connections with both deep historical data and real-time streaming data
• Those connections have been expensive and complicated
Solution
• Hadoop brings the data together to inexpensively localize and personalize promotions delivered to mobile devices
• Notify subscribers about local attractions, events and sales that align with their preferences and location
• Telcos can sell these promotional services to retailers
![Page 15: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/15.jpg)
360° View of the Customer
Business Problem
• Retailers interact with customers across multiple channels
• Customer interaction and purchase data is often siloed
• Few retailers can correlate customer purchases with marketing campaigns and online browsing behavior
• Merging data in relational databases is expensive
Solution
• Hadoop gives retailers a 360° view of customer behavior
• Store data longer & track phases of the customer lifecycle
• Gain competitive advantage: increase sales, reduce supply chain expenses and retain the best customers
![Page 16: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/16.jpg)
Use case – Target case
• Target mined its customer data and sent coupons to shoppers who had a high “pregnancy prediction” score.
• One angry father stormed into a Target to yell at them for sending his daughter coupons for baby clothes and cribs.
• Guess what, she was pregnant, and hadn’t told her father yet.
![Page 17: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/17.jpg)
Changes in Analyzing Data
Big data is fundamentally changing the way we analyze information.
– Ability to analyze vast amounts of data rather than evaluating sample sets.
– Historically we have had to look at causes. Now we can look at patterns and correlations in data that give us a much better perspective.
![Page 18: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/18.jpg)
Recent cases 1:
• http://www.ibtimes.co.uk/global-smartphone-data-traffic-increase-eightfold-17-exabyte-2020-1475571
![Page 19: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/19.jpg)
http://tech.naver.jp/blog/?p=2412
Recent cases 1: Practice on LINE
![Page 20: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/20.jpg)
Recent cases 2: in Taiwan
– Media analysis of the 2014 Taipei City mayoral election • IMHO, 黑貘來說 http://gene.speaking.tw/2014/11/blog-post_28.html • 破解社群與APP行銷 (Decoding Social Media and App Marketing)
http://taiwansmm.wordpress.com/2014/11/26/⾏行銷絕不等於買廣告-2014年台北市⻑⾧長選舉柯⽂文哲與連/
![Page 21: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/21.jpg)
Scale up or Scale out?
![Page 22: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/22.jpg)
Guess what…
• Traditionally, computation has been processor-bound
• For decades, the primary push was to increase the computing power of a single machine – Faster processor, more RAM
• Distributed systems evolved to allow developers to use multiple machines for a single job – At compute time, data is copied to the compute nodes
![Page 23: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/23.jpg)
Scaling with a traditional database
• Scaling with a queue
• Sharding the database
• Fault-tolerance issues
• corruption issue
• problems
![Page 24: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/24.jpg)
NOSQL
• Not Only SQL
• Provides a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases.
• Motivations for this approach include simplicity of design, horizontal scaling and finer control over availability.
• Column: Accumulo, Cassandra, Druid, HBase, Vertica
• Document: Clusterpoint, Apache CouchDB, Couchbase, MarkLogic, MongoDB
• Key-value: Dynamo, FoundationDB, MemcacheDB, Redis, Riak, FairCom c-treeACE, Aerospike
• Graph: Allegro, Neo4J, InfiniteGraph, OrientDB, Virtuoso, Stardog
![Page 25: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/25.jpg)
First principles(1/2)
• "At the most fundamental level, what does a data system do?"
• What a data system does: "A data system answers questions based on information that was acquired in the past".
• "What is this person's name?”
• "How many friends does this person have?”
• A bank account web page answers questions like "What is my current balance?”
• "What transactions have occurred on my account recently?"
![Page 26: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/26.jpg)
First principles(2/2)
• Data is often used interchangeably with the word "information".
• You answer questions on your data by running functions that take data as input
• The most general-purpose data system can answer questions by running functions that take in the entire dataset as input. In fact, any query can be answered by running a function on the complete dataset.
![Page 27: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/27.jpg)
Desired Properties of a Big Data System
• Robust and fault-tolerant • Low latency reads and updates
• Scalable
• General
• Extensible
• Allow ad hoc queries
• Minimal maintenance
![Page 28: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/28.jpg)
The Lambda Architecture
• There is no single tool that provides a complete solution.
• You have to use a variety of tools and techniques to build a complete Big Data system.
• The Lambda Architecture solves the problem of computing arbitrary functions on arbitrary data in realtime by decomposing the problem into three layers:
• the batch layer, the serving layer, and the speed layer
![Page 29: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/29.jpg)
The Lambda Architecture model
![Page 30: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/30.jpg)
Batch Layer 1
• The batch layer stores the master copy of the dataset and precomputes batch views on that master dataset.
• The master dataset can be thought of as a very large list of records.
• The batch layer needs to be able to do two things: – store an immutable, constantly growing master dataset, – and compute arbitrary functions on that dataset.
• If you're going to precompute views on a dataset, you need to be able to do so for any view and any dataset.
• There's a class of systems called "batch processing systems" that are built to do exactly what the batch layer requires
![Page 31: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/31.jpg)
Batch Layer 2
![Page 32: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/32.jpg)
What is Batch View
• Everything starts from the "query = function(all data)" equation.
• You could literally run your query functions on the fly on the complete dataset to get the results.
• However, doing so would take a huge amount of resources and would be unreasonably expensive.
• Instead of computing the query on the fly, you read the results from the precomputed view.
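The contrast between running the query on the fly and reading from a precomputed view can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the page-view records and the function names (`pageviews_on_the_fly`, `build_batch_view`, `pageviews_from_view`) are hypothetical and not part of any Hadoop API.

```python
from collections import Counter

# Hypothetical master dataset: an append-only list of page-view records.
master_dataset = [
    {"url": "/home", "user": "alice"},
    {"url": "/home", "user": "bob"},
    {"url": "/about", "user": "alice"},
]

def pageviews_on_the_fly(all_data, url):
    """query = function(all data): scans every record for each query."""
    return sum(1 for record in all_data if record["url"] == url)

def build_batch_view(all_data):
    """Precompute the answer for every URL in one pass over the dataset."""
    return Counter(record["url"] for record in all_data)

def pageviews_from_view(view, url):
    """Answer the same query with a lookup instead of a full scan."""
    return view.get(url, 0)

batch_view = build_batch_view(master_dataset)
```

Both functions return the same answer; the difference is that the scan cost is paid once when the view is built rather than on every query.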
![Page 33: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/33.jpg)
How we get Batch View
![Page 34: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/34.jpg)
Serving Layer 1
![Page 35: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/35.jpg)
Serving Layer 2
• The batch layer emits batch views as the result of its functions.
• The next step is to load the views somewhere so that they can be queried.
• The serving layer indexes the batch view and loads it up so it can be efficiently queried to get particular values out of the view.
![Page 36: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/36.jpg)
Batch and serving layers satisfy almost all properties
• Robust and fault tolerant
• Scalable
• General
• Extensible
• Allows ad hoc queries
• Minimal maintenance
![Page 37: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/37.jpg)
Speed Layer 1
![Page 38: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/38.jpg)
Speed Layer 2
• The speed layer is similar to the batch layer in that it produces views based on the data it receives.
• One big difference is that in order to achieve the fastest latencies possible, the speed layer doesn't look at all the new data at once.
• It updates the realtime views as it receives new data instead of recomputing them like the batch layer does.
• "Incremental updates" vs. "recomputation updates".
• Page view example
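A page-view counter is a convenient way to see the two update styles side by side. The sketch below is illustrative only; the function names and the sample URLs are assumptions, and a real speed layer would keep its view in a datastore rather than an in-memory `Counter`.

```python
from collections import Counter

def recompute_view(all_pageviews):
    """Recomputation update: rebuild the whole view from every record."""
    return Counter(all_pageviews)

def update_view_incrementally(realtime_view, new_pageview):
    """Incremental update: touch only the entry affected by the new record."""
    realtime_view[new_pageview] += 1
    return realtime_view

history = ["/home", "/about", "/home"]

# Batch style: a new page view arrives, so everything is counted again.
batch_style = recompute_view(history + ["/home"])

# Speed-layer style: the existing view is adjusted in place.
realtime_view = Counter(history)
incremental_style = update_view_incrementally(realtime_view, "/home")
```

The two results are identical, but the incremental version does constant work per new record while the recomputation version rereads the full history.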
![Page 39: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/39.jpg)
Speed Layer 3
• Complexity isolation: complexity is pushed into a layer whose results are only temporary
• The last piece of the Lambda Architecture is merging the results from the batch and realtime views
![Page 40: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/40.jpg)
Summary of the Lambda Architecture
![Page 41: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/41.jpg)
Summary of the Lambda Architecture
• All new data is sent to both the batch layer and the speed layer
• The master dataset is an immutable, append-only set of data
• The batch layer pre-computes query functions from scratch
• The serving layer indexes the batch views produced by the batch layer and makes it possible to get particular values out of a batch view very quickly
• The speed layer compensates for the high latency of updates to the serving layer.
• Queries are resolved by getting results from both the batch and realtime views and merging them together
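As a rough sketch of that last step, a query can read from both views and combine the results. The example assumes a page-view count, where merging is simple addition; other queries would need their own merge function. The view contents and the name `resolve_query` are hypothetical.

```python
def resolve_query(batch_view, realtime_view, url):
    """Merge the batch result with the realtime result for one URL.

    Assumes the realtime view only holds page views that arrived
    after the last batch run, so the two counts never overlap.
    """
    return batch_view.get(url, 0) + realtime_view.get(url, 0)

# Hypothetical views: the batch view lags behind, the realtime view fills the gap.
batch_view = {"/home": 1000, "/about": 250}
realtime_view = {"/home": 7, "/pricing": 2}

home_total = resolve_query(batch_view, realtime_view, "/home")
```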
![Page 42: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/42.jpg)
The Need for Hadoop
• Store and use all types of data • Process all the data
• Scalability
• Commodity hardware
Traditional Database
SCALE (storage & processing)
Hadoop Platform
NoSQL MPP Analytics EDW
![Page 43: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/43.jpg)
Hadoop as a Data Factory
• A role Hadoop can play in an enterprise data platform is that of a data factory
Structured, semi-structured and raw data
Business value
Hadoop
![Page 44: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/44.jpg)
Hadoop as a Data Lake
• A larger more general role Hadoop can play in an enterprise data platform is that of a data lake.
![Page 45: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/45.jpg)
Integrating Hadoop
WEB LOGS, CLICK STREAMS
MACHINE GENERATED
OLTP
EDW Data Marts
Hadoop
Big Data Data Analysis
Social Media
Messaging
Applications & Spreadsheets
Visualization & Intelligence
ODBC
ODBC Access for Popular BI Tools Tools
Staging Area
Connectors
![Page 46: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/46.jpg)
Hadoop Introduction
![Page 47: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/47.jpg)
• Apache Hadoop project – inspired by Google's MapReduce and Google File System papers.
• Open-source, flexible and available architecture for large-scale computation and data processing on a network of commodity hardware
• Yahoo has been the largest contributor to the project and uses Hadoop extensively in its Web search and Ad business.
Hadoop Creator: Doug Cutting
![Page 48: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/48.jpg)
Hadoop Concepts
• Distribute the data as it is initially stored in the system • Moving Computation is Cheaper than Moving Data
• Individual nodes can work on data local to those nodes
• Users can focus on developing applications.
![Page 49: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/49.jpg)
Relational Databases vs. Hadoop
| | Relational | Hadoop |
|---|---|---|
| schema | Required on write | Required on read |
| speed | Reads are fast | Writes are fast |
| governance | Standards and structured | Loosely structured |
| processing | Limited, no data processing | Processing coupled with data |
| data types | Structured | Multi and unstructured |
| best fit use | Interactive OLAP Analytics; Complex ACID Transactions; Operational Data Store | Data Discovery; Processing unstructured data; Massive Storage/Processing |
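The "schema required on read" idea from the comparison can be illustrated with a small Python sketch. It is not Hadoop code: the raw JSON lines, the field names, and `read_with_schema` are assumptions chosen for illustration. The point is that records are stored as-is and a structure is imposed only when they are read.

```python
import json

# Raw, loosely structured records stored as-is (no schema enforced on write).
raw_lines = [
    '{"user": "alice", "amount": "12.50", "ts": "2015-05-14"}',
    '{"user": "bob", "amount": "3"}',
    '{"user": "carol", "note": "no amount field"}',
]

def read_with_schema(lines):
    """Apply a schema only when reading: pick fields and coerce types."""
    for line in lines:
        record = json.loads(line)
        yield {
            "user": str(record.get("user", "")),
            "amount": float(record.get("amount", 0.0)),
        }

rows = list(read_with_schema(raw_lines))
total = sum(row["amount"] for row in rows)
```

A relational database would instead reject or reshape the second and third records at write time, because they do not match the table definition.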
![Page 50: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/50.jpg)
Different behaviors between RDBMS and Hadoop
– RDBMS
– Hadoop
Application Schema RDBMS SQL
Application Hadoop Schema MapReduce
![Page 51: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/51.jpg)
Why do we use Hadoop, not an RDBMS?
• Limitations of an RDBMS
– Capacity • 100GB~100TB
– Speed
– Cost • The price of high-end devices rises faster than linearly with their capacity • Software costs for technical support and license fees
– Too complex
• A distributed file system (DFS) is more likely to fit our needs
– A DFS usually provides backup and fault-tolerance mechanisms
– Cheaper than an RDBMS once the data is really huge
![Page 52: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/52.jpg)
What is Hadoop 2.0?
• The Apache Hadoop 2.0 project consists of the following modules:
– Hadoop Common: the utilities that provide support for the other Hadoop modules.
– HDFS: the Hadoop Distributed File System
– YARN: a framework for job scheduling and cluster resource management.
– MapReduce: for processing large data sets in a scalable and parallel fashion.
![Page 53: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/53.jpg)
Difference between Hadoop 1.0 and 2.0
![Page 54: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/54.jpg)
What is YARN
• Yet Another Resource Negotiator
• Jira ticket (MAPREDUCE-279) raised in January 2008 by Hortonworks co-founder Arun Murthy.
• YARN is the result of 5 years of subsequent development in the open community.
• YARN has been tested by Yahoo! since September 2012 and has been in production across 30,000 nodes and 325PB of data since January 2013.
• More recently, other enterprises such as Microsoft, eBay, Twitter, XING and Spotify have adopted a YARN-based architecture.
• Apache Hadoop YARN wins Best Paper award at SoCC 2013! - Hortonworks
• http://hortonworks.com/blog/apache-hadoop-yarn-wins-best-paper-award-at-socc-2013/
![Page 55: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/55.jpg)
YARN: Taking Hadoop Beyond Batch
• With YARN, applications run natively in Hadoop (instead of on Hadoop)
![Page 56: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/56.jpg)
HDFS Federation
Hadoop Hadoop 2.0
http://hortonworks.com/blog/an-introduction-to-hdfs-federation/
/home/ /app/Hive /app/HBase
![Page 57: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/57.jpg)
HDFS High Availability (HA)
• Secondary Name Node is not Name Node • http://www.youtube.com/watch?v=hEqQMLSXQlY
![Page 58: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/58.jpg)
HDFS High Availability (HA)
https://issues.apache.org/jira/browse/HDFS-1623
![Page 59: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/59.jpg)
Hadoop Architecture Fundamentals
![Page 60: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/60.jpg)
What is HDFS
• Shared multi-petabyte file system for an entire cluster
• Managed by a single NameNode
• Multiple DataNodes
[Figure: one NameNode above three DataNodes]
![Page 61: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/61.jpg)
The Components of HDFS
• NameNode
  – The “master” node of HDFS
  – Determines and maintains how the chunks of data are distributed across the DataNodes
• DataNode
  – Stores the chunks of data, and is responsible for replicating the chunks across other DataNodes
![Page 62: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/62.jpg)
Concept: What is NameNode
• NameNode holds the metadata for the files
  – One HDFS cluster has only one set of metadata
  – NameNode is a single point of failure
• Only one NameNode per HDFS cluster
  – One HDFS cluster has only one namespace and one root directory
• Metadata is kept in the NameNode’s RAM so that it can be queried faster
  – 1 GB of RAM can hold the mapping metadata for roughly 1,000,000 blocks
    • If the block size is 64 MB, that metadata maps to about 64 TB of actual data
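The arithmetic behind the 64 TB figure can be checked with a small sketch. The class and method names are hypothetical, the 1,000,000 blocks per GB ratio is the slide's rule of thumb rather than a measured value, and the sketch assumes every block is completely full.

```java
// Back-of-the-envelope capacity estimate based on the rule of thumb above.
class NameNodeCapacity {
    // Assumption taken from the slide: about 1,000,000 blocks of metadata per 1 GB of RAM.
    static final long BLOCKS_PER_GB_RAM = 1_000_000L;

    /** Approximate data volume (decimal TB) that the given NameNode RAM can address. */
    static long addressableTerabytes(long ramGb, long blockSizeMb) {
        long blocks = ramGb * BLOCKS_PER_GB_RAM;
        long totalMb = blocks * blockSizeMb;   // assumes every block is full
        return totalMb / 1_000_000L;           // 1 TB = 1,000,000 MB in decimal units
    }

    public static void main(String[] args) {
        System.out.println(addressableTerabytes(1, 64));  // 64
        System.out.println(addressableTerabytes(1, 128)); // 128
    }
}
```

Doubling the block size doubles the data one NameNode can address with the same RAM, which is one reason larger block sizes are attractive on big clusters.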
![Page 63: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/63.jpg)
More on the Metadata
• NameNode uses two important local files to save the metadata information:
• fsimage
  – fsimage saves the file directory tree information
  – fsimage saves the mapping between files and blocks
• edits
  – edits saves the file system journal
  – When a client tries to create or move a file, the operation is first recorded in edits. If the operation succeeds, the data in RAM is changed afterwards.
  – fsimage WILL NOT be changed instantly.
![Page 64: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/64.jpg)
The NameNode
1. When the NameNode starts, it reads the fsimage and edits files.
2. The transactions in edits are merged with fsimage, and edits is emptied.
3. A client application creates a new file in HDFS.
4. The NameNode logs that transaction in the edits file.
[Figure: the NameNode with its fsimage and edits files, and a client writing Data]
![Page 65: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/65.jpg)
fsimage (on disk)

| File Name | Replicas | Block Sequence | Others |
|---|---|---|---|
| /data/part-0 | 3 | B1, B2, B3 | user, group, ... |
| /data/part-1 | 3 | B4, B5 | user, group, ... |

edits (on disk)

| OP Code | Operands |
|---|---|
| OP_SET_REPLICATION | "/data/part-0", 2 |
| OP_SET_OWNER | "/data/part-1", "foo", "bar" |

Memory (fsimage with the edits applied)

| File Name | Replicas | Block Sequence | Others |
|---|---|---|---|
| /data/part-0 | 2 | B1, B2, B3 | user, group, ... |
| /data/part-1 | 3 | B4, B5 | foo, bar, ... |
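How the in-memory state follows from fsimage plus edits can be sketched as a tiny replay loop. This is an illustration only: `EditLogReplay`, `FileMeta` and the string-based operands are hypothetical simplifications, not the NameNode's real data structures or on-disk format.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class EditLogReplay {
    /** Simplified in-memory metadata for one file (hypothetical). */
    static class FileMeta {
        int replicas;
        String owner;
        String group;

        FileMeta(int replicas, String owner, String group) {
            this.replicas = replicas;
            this.owner = owner;
            this.group = group;
        }
    }

    /** Applies one edits record to the in-memory namespace. */
    static void apply(Map<String, FileMeta> namespace, String opCode, String... operands) {
        FileMeta meta = namespace.get(operands[0]);
        if (meta == null) {
            throw new IllegalArgumentException("unknown path: " + operands[0]);
        }
        switch (opCode) {
            case "OP_SET_REPLICATION":
                meta.replicas = Integer.parseInt(operands[1]);
                break;
            case "OP_SET_OWNER":
                meta.owner = operands[1];
                meta.group = operands[2];
                break;
            default:
                throw new IllegalArgumentException("unsupported op: " + opCode);
        }
    }

    /** Loads the two fsimage rows from the example, then replays the two edits records. */
    static Map<String, FileMeta> loadExample() {
        Map<String, FileMeta> ns = new LinkedHashMap<String, FileMeta>();
        ns.put("/data/part-0", new FileMeta(3, "user", "group"));
        ns.put("/data/part-1", new FileMeta(3, "user", "group"));
        apply(ns, "OP_SET_REPLICATION", "/data/part-0", "2");
        apply(ns, "OP_SET_OWNER", "/data/part-1", "foo", "bar");
        return ns;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, FileMeta> e : loadExample().entrySet()) {
            FileMeta m = e.getValue();
            System.out.println(e.getKey() + " replicas=" + m.replicas + " owner=" + m.owner + " group=" + m.group);
        }
    }
}
```

The fsimage rows themselves are never modified here; only the in-memory copy changes, which mirrors the point that fsimage is not updated instantly.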
![Page 66: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/66.jpg)
Concept: What is DataNode
• DataNodes hold the actual blocks
  – Each block is 64 MB or 128 MB in size
  – Each block is replicated three times on the cluster by default
• DataNodes communicate with the NameNode through heartbeats
![Page 67: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/67.jpg)
Block backup and replication
• Each block is replicated multiple times
  – Default number of replicas = 3
  – Clients can modify the configuration
• Every replica of a block has the same ID
  – The system has no need to record which blocks are the same
• Replica placement can follow rack awareness
  – The first replica goes on one rack
  – The other two replicas go on another rack, on different machines
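The rack-aware rule above can be sketched as follows. This is a deliberately simplified, deterministic illustration: the class name, rack names and node names are hypothetical, and real HDFS also takes the writer's location, node load and randomness into account.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class RackAwarePlacement {
    /**
     * Picks three targets: one node on the first rack, then two different
     * nodes on the second rack. The map's iteration order decides which
     * racks and nodes are chosen.
     */
    static List<String> chooseTargets(Map<String, List<String>> racks) {
        List<String> rackNames = new ArrayList<String>(racks.keySet());
        if (rackNames.size() < 2) {
            throw new IllegalArgumentException("need at least two racks");
        }
        List<String> first = racks.get(rackNames.get(0));
        List<String> second = racks.get(rackNames.get(1));
        if (first.isEmpty() || second.size() < 2) {
            throw new IllegalArgumentException("need 1 node on the first rack and 2 on the second");
        }
        List<String> targets = new ArrayList<String>();
        targets.add(first.get(0));   // first replica on one rack
        targets.add(second.get(0));  // second replica on another rack
        targets.add(second.get(1));  // third replica on that rack, different machine
        return targets;
    }
}
```

With racks `rack1 = [dn1, dn2]` and `rack2 = [dn3, dn4]` (insertion-ordered), the sketch returns `dn1`, `dn3`, `dn4`: losing either rack still leaves at least one replica.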
![Page 68: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/68.jpg)
The DataNodes
[Figure: DataNodes 1 to 4 exchange messages with the NameNode; block 123 is highlighted]
• DataNode: “I’m still alive! This is my latest Blockreport.”
• DataNode: “I’m here! Here is my latest Blockreport.”
• NameNode: “Replicate block 123 to DataNode 1.”
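The exchange on this slide can be sketched as a small bookkeeping routine: count the replicas seen in the latest block reports and issue a replicate command for every block that falls short. `ReplicationMonitor` and its method are hypothetical names, and the target choice (first node in report order that lacks the block) is an assumption made to keep the example deterministic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

class ReplicationMonitor {
    /**
     * reports maps a DataNode name to the block IDs in its latest block report.
     * Returns one command per missing replica of an under-replicated block.
     */
    static List<String> replicationCommands(Map<String, Set<Integer>> reports, int targetReplicas) {
        // Count live replicas per block across all reports.
        Map<Integer, Integer> counts = new TreeMap<Integer, Integer>();
        for (Set<Integer> blocks : reports.values()) {
            for (Integer block : blocks) {
                counts.merge(block, 1, Integer::sum);
            }
        }
        List<String> commands = new ArrayList<String>();
        for (Map.Entry<Integer, Integer> e : counts.entrySet()) {
            int missing = targetReplicas - e.getValue();
            for (Map.Entry<String, Set<Integer>> node : reports.entrySet()) {
                if (missing <= 0) {
                    break;
                }
                if (!node.getValue().contains(e.getKey())) {
                    commands.add("Replicate block " + e.getKey() + " to " + node.getKey() + ".");
                    missing--;
                }
            }
        }
        return commands;
    }
}
```

If DataNode 2 and DataNode 3 report block 123 while DataNode 1 and DataNode 4 report nothing, a target of 3 replicas yields the single command from the slide: "Replicate block 123 to DataNode 1."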
![Page 69: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/69.jpg)
© Hortonworks Inc. 2013
1. Client sends a request to the NameNode to add a file to HDFS
2. NameNode tells client how and where to distribute the blocks
3. Client breaks the data into blocks and distributes the blocks to the DataNodes
4. The DataNodes replicate the blocks (as instructed by the NameNode)
[Figure: a client with Data, the NameNode, and DataNodes 1 to 3]
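The step where the client breaks the data into blocks can be illustrated with a short sketch. `BlockSplitter` is a hypothetical helper, not part of the HDFS client API; sizes are in whatever unit the caller chooses (MB in the usage note).

```java
import java.util.ArrayList;
import java.util.List;

class BlockSplitter {
    /** Returns the length of each block for a file: full blocks first, then the remainder. */
    static List<Long> blockLengths(long fileSize, long blockSize) {
        if (fileSize < 0 || blockSize <= 0) {
            throw new IllegalArgumentException("fileSize must be >= 0 and blockSize > 0");
        }
        List<Long> lengths = new ArrayList<Long>();
        long remaining = fileSize;
        while (remaining > 0) {
            long len = Math.min(blockSize, remaining);
            lengths.add(len);
            remaining -= len;
        }
        return lengths;
    }

    public static void main(String[] args) {
        System.out.println(blockLengths(200, 64)); // [64, 64, 64, 8]
    }
}
```

A 200 MB file with 64 MB blocks becomes three full blocks and one 8 MB block. With a replication factor of 3, the cluster stores 600 MB for that file.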
![Page 70: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/70.jpg)
What is MapReduce
• Two Functions
  – Mapper
    • Since we are processing a huge amount of data, it’s natural to split the input data
    • The Mapper reads data in the form of key/value pairs
    • M(K1, V1) → list(K2, V2)
  – Reducer
    • Since the input data are split, we need another phase to aggregate the results of each split
    • R(K2, list(V2)) → list(K3, V3)
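The two signatures can be written down as plain Java interfaces with a single-process driver. This is a teaching sketch only: `MiniMapReduce`, its `Mapper` and `Reducer` interfaces and the `run` method are hypothetical and are not the Hadoop API; a `TreeMap` stands in for the distributed shuffle and sort.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class MiniMapReduce {
    /** M(K1, V1) -> list(K2, V2) */
    interface Mapper<K1, V1, K2, V2> {
        List<Map.Entry<K2, V2>> map(K1 key, V1 value);
    }

    /** R(K2, list(V2)) -> list(K3, V3) */
    interface Reducer<K2, V2, K3, V3> {
        List<Map.Entry<K3, V3>> reduce(K2 key, List<V2> values);
    }

    /** Runs map, groups by intermediate key (sorted), then runs reduce once per key. */
    static <K1, V1, K2 extends Comparable<K2>, V2, K3, V3> List<Map.Entry<K3, V3>> run(
            List<Map.Entry<K1, V1>> input,
            Mapper<K1, V1, K2, V2> mapper,
            Reducer<K2, V2, K3, V3> reducer) {
        Map<K2, List<V2>> groups = new TreeMap<K2, List<V2>>();
        for (Map.Entry<K1, V1> record : input) {
            for (Map.Entry<K2, V2> kv : mapper.map(record.getKey(), record.getValue())) {
                groups.computeIfAbsent(kv.getKey(), k -> new ArrayList<V2>()).add(kv.getValue());
            }
        }
        List<Map.Entry<K3, V3>> output = new ArrayList<Map.Entry<K3, V3>>();
        for (Map.Entry<K2, List<V2>> group : groups.entrySet()) {
            output.addAll(reducer.reduce(group.getKey(), group.getValue()));
        }
        return output;
    }

    /** Example job: counts words, using the line index as the input key. */
    static List<Map.Entry<String, Integer>> wordCount(List<String> lines) {
        List<Map.Entry<Integer, String>> input = new ArrayList<Map.Entry<Integer, String>>();
        for (int i = 0; i < lines.size(); i++) {
            input.add(new AbstractMap.SimpleEntry<Integer, String>(i, lines.get(i)));
        }
        Mapper<Integer, String, String, Integer> mapper = (index, line) -> {
            List<Map.Entry<String, Integer>> out = new ArrayList<Map.Entry<String, Integer>>();
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    out.add(new AbstractMap.SimpleEntry<String, Integer>(word, 1));
                }
            }
            return out;
        };
        Reducer<String, Integer, String, Integer> reducer = (word, ones) -> {
            int sum = 0;
            for (int one : ones) {
                sum += one;
            }
            List<Map.Entry<String, Integer>> out = new ArrayList<Map.Entry<String, Integer>>();
            out.add(new AbstractMap.SimpleEntry<String, Integer>(word, sum));
            return out;
        };
        return run(input, mapper, reducer);
    }
}
```

Because the driver only depends on the two interfaces, any job that fits M and R can reuse `run` unchanged; only the mapper and reducer lambdas differ.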
![Page 71: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/71.jpg)
Hadoop 1.0 Basic Core Architecture
[Figure: Hadoop consists of MapReduce (Mapper → Shuffle/Sort → Reducer) on top of the Hadoop Distributed File System (HDFS)]
![Page 72: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/72.jpg)
Words to Websites - Simplified
• From words provide locations
• Provides what to display for a search
  – Note: Page rank determines the order
• For example – to find URLs with books on them

Map input, K, V = <url, keyword>
• www.eslite.com: books calendars
• www.yahoo.com: sports finance email celebrity
• www.amazon.com: shoes books toolkits
• www.google.com: finance email search
• www.microsoft.com: operating-system productivity system

Reduce output, K, V = <keyword, url>
• books: www.eslite.com www.amazon.com
• email: www.google.com www.yahoo.com www.facebook.com
• finance: www.yahoo.com www.google.com
• groceries: www.costco.com www.wellcome.com
• toolkits: www.costco.com www.amazon.com
![Page 73: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/73.jpg)
Data Model
• MapReduce works on <key, value> pairs
• (Key input, Value input) → Map → (Key intermediate, Value intermediate) → Reduce → (Key output, Value output)
• Example
  – Map input: (www.eslite.com, books calendars)
  – Map output: (books, www.eslite.com)
  – Reduce input: (books, www.eslite.com) plus the results of the other maps
  – Reduce output: (books, www.eslite.com www.amazon.com)
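The url-to-keyword example can be traced with a small in-memory sketch. `InvertedIndex` and its methods are hypothetical and run in a single process; a real job would express the same two steps as a Hadoop Mapper and Reducer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class InvertedIndex {
    /** Map step: (url, "kw1 kw2 ...") -> list of (keyword, url) pairs. */
    static List<String[]> map(String url, String keywords) {
        List<String[]> pairs = new ArrayList<String[]>();
        for (String keyword : keywords.trim().split("\\s+")) {
            if (!keyword.isEmpty()) {
                pairs.add(new String[] {keyword, url});
            }
        }
        return pairs;
    }

    /** Reduce step: (keyword, list of urls) -> "url1 url2 ...". */
    static String reduce(String keyword, List<String> urls) {
        return String.join(" ", urls);
    }

    /** Runs map over every page, groups by keyword, then reduces each group. */
    static Map<String, String> build(Map<String, String> pages) {
        Map<String, List<String>> grouped = new TreeMap<String, List<String>>();
        for (Map.Entry<String, String> page : pages.entrySet()) {
            for (String[] kv : map(page.getKey(), page.getValue())) {
                grouped.computeIfAbsent(kv[0], k -> new ArrayList<String>()).add(kv[1]);
            }
        }
        Map<String, String> index = new TreeMap<String, String>();
        for (Map.Entry<String, List<String>> group : grouped.entrySet()) {
            index.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return index;
    }
}
```

Feeding in `www.eslite.com → "books calendars"` and `www.amazon.com → "shoes books toolkits"` (in that order) gives `books → "www.eslite.com www.amazon.com"`, the same pair shown in the data model above.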
![Page 74: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/74.jpg)
The M/R concept
[Figure: one Job Tracker and four worker nodes. Each worker node runs a Task Tracker with map (M) and reduce (R) tasks, and sends Heartbeat and Task Report messages to the Job Tracker.]
![Page 75: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/75.jpg)
Map -> Shuffle -> Reduce
[Figure: Mappers A, B and C run on Task Trackers A, B and C and each sorts its own output. Reducer 0 on Task Tracker D fetches the sorted outputs A, B and C and merges them.]
![Page 76: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/76.jpg)
Map -> Shuffle -> Reduce
[Figure: Mappers A, B and C each partition and sort their output into two partitions (A0/A1, B0/B1, C0/C1). Reducer 0 fetches and merges A0, B0 and C0; Reducer 1 fetches and merges A1, B1 and C1.]
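The "Partition + Sort" and "Merge" steps can be sketched as below. The `partition` formula is the one used by Hadoop's default `HashPartitioner`; everything else (`ShufflePartitioner`, plain string keys, sorting in memory instead of a streaming merge) is a hypothetical simplification.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class ShufflePartitioner {
    /** Chooses the reducer for a key: non-negative hash modulo the number of reducers. */
    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    /** Mapper side: splits one mapper's keys into numReducers sorted partitions (e.g. A0, A1). */
    static List<List<String>> partitionAndSort(List<String> keys, int numReducers) {
        List<List<String>> partitions = new ArrayList<List<String>>();
        for (int i = 0; i < numReducers; i++) {
            partitions.add(new ArrayList<String>());
        }
        for (String key : keys) {
            partitions.get(partition(key, numReducers)).add(key);
        }
        for (List<String> p : partitions) {
            Collections.sort(p);
        }
        return partitions;
    }

    /** Reducer side: combines the fetched partitions (e.g. A0, B0, C0) into one sorted run. */
    static List<String> merge(List<List<String>> fetched) {
        List<String> merged = new ArrayList<String>();
        for (List<String> run : fetched) {
            merged.addAll(run);
        }
        Collections.sort(merged); // a real reducer merges already-sorted runs instead of re-sorting
        return merged;
    }
}
```

Because the partition depends only on the key, every mapper sends a given key to the same reducer, which is what lets each reducer see all values for its keys.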
![Page 77: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/77.jpg)
Word Count Example
• Map input – Key: offset, Value: line
• Map output – Key: word, Value: count
• Reduce output – Key: word, Value: sum of count
• Sample input (offset:line)
  – 0:The cat sat on the mat
  – 22:The aardvark sat on the sofa
![Page 78: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/78.jpg)
What is YARN?
• YARN is a re-architecture of Hadoop that allows multiple applications to run on the same platform
![Page 79: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/79.jpg)
Why YARN
• Support non-MapReduce workloads
  – Reducing the need to move data between Hadoop HDFS and other storage systems
• Improve scalability
  – 2009: 8 cores, 16 GB of RAM, 4x1 TB disk
  – 2012: 16+ cores, 48-96 GB of RAM, 12x2 TB or 12x3 TB of disk
  – Scale to production deployments of ~5000 nodes of hardware of 2009 vintage
• Cluster utilization
  – JobTracker views the cluster as composed of nodes (managed by individual TaskTrackers) with distinct map slots and reduce slots
• Customer agility
![Page 80: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/80.jpg)
How YARN Works
• YARN’s original purpose was to split up the two major responsibilities of the JobTracker/TaskTracker into separate entities:
• a global ResourceManager
• a per-application ApplicationMaster
• a per-node slave NodeManager
• a per-application Container running on a NodeManager
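The division of labour between these entities can be pictured with a toy model: a global resource manager tracks free memory on each node and grants containers to applications on request. `ToyResourceManager` and its methods are entirely hypothetical and do not use or resemble the real YARN client API; first-fit allocation is an assumption made for simplicity.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ToyResourceManager {
    // Free memory (MB) per NodeManager; insertion order is the allocation preference.
    private final Map<String, Integer> freeMb = new LinkedHashMap<String, Integer>();
    private final List<String> grants = new ArrayList<String>();

    /** A NodeManager announces itself and its capacity. */
    void registerNode(String nodeManager, int memoryMb) {
        freeMb.put(nodeManager, memoryMb);
    }

    /**
     * An application asks for one container. Returns the node that hosts it,
     * or null when no node has enough free memory.
     */
    String allocate(String applicationId, int memoryMb) {
        for (Map.Entry<String, Integer> node : freeMb.entrySet()) {
            if (node.getValue() >= memoryMb) {
                node.setValue(node.getValue() - memoryMb);
                grants.add(applicationId + " -> " + node.getKey() + " (" + memoryMb + " MB)");
                return node.getKey();
            }
        }
        return null;
    }

    /** All containers granted so far, in order. */
    List<String> grants() {
        return grants;
    }
}
```

Unlike fixed map and reduce slots, a container here is just an amount of memory on a node, so any kind of application can ask for it.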
![Page 81: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/81.jpg)
MapReduce v1
![Page 82: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/82.jpg)
YARN
![Page 83: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/83.jpg)
The Hadoop 1.x and 2 Ecosystem
Hadoop
![Page 84: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/84.jpg)
The Path to ROI
1. Put the data into HDFS in its raw format
2. Use Pig to explore and transform
3. Data Analysts use Hive to query the data
4. Data Scientists use MapReduce, R and Mahout to mine the data
• Answers to questions = $$
• Hidden gems = $$
[Figure: Raw Data and Structured Data stored in the Hadoop Distributed File System]
![Page 85: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/85.jpg)
Flume & Sqoop
![Page 86: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/86.jpg)
Flume / Sqoop – Data Integration Framework
What’s the problem with data collection?
• Data collection is currently a priori and ad hoc
• A priori – decide what you want to collect ahead of time
• Ad hoc – each kind of data source goes through its own collection path
![Page 87: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/87.jpg)
What is Flume (and how can it help?)
• A distributed data collection service
• Efficiently collects, aggregates, and moves large amounts of data
• Fault tolerant, with many failover and recovery mechanisms
• One-stop solution for data collection of all formats
![Page 88: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/88.jpg)
Flume: High-Level Overview
• Logical Node
• Source
• Sink
![Page 89: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/89.jpg)
An example flow
![Page 90: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/90.jpg)
Sqoop
• Easy, parallel database import/export
• You want to…
  – Import data from an RDBMS into HDFS
  – Export data from HDFS back into the RDBMS
![Page 91: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/91.jpg)
Sqoop - import process
![Page 92: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/92.jpg)
Sqoop - export process
• Exports are performed in parallel using MapReduce
![Page 93: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/93.jpg)
Why Sqoop
• JDBC-based implementation
  – Works with many popular database vendors
• Auto-generation of tedious user-side code
  – Write MapReduce applications to work with your data, faster
• Integration with Hive
  – Allows you to stay in a SQL-based environment
![Page 94: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/94.jpg)
Pig & Hive
![Page 95: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/95.jpg)
Why Hive and Pig?
• Although MapReduce is very powerful, it can also be complex to master
• Many organizations have business or data analysts who are skilled at writing SQL queries, but not at writing Java code
• Many organizations have programmers who are skilled at writing code in scripting languages
• Hive and Pig are two projects which evolved separately to help such people analyze huge amounts of data via MapReduce
  – Hive was initially developed at Facebook, Pig at Yahoo!
![Page 96: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/96.jpg)
Pig
• An engine for executing programs on top of Hadoop
• A high-level scripting language (Pig Latin)
• Processes data one step at a time
• Makes MapReduce programs simple to write
• Easy to understand
• Easy to debug

    A = load 'a.txt' as (id, name, age, ...);
    B = load 'b.txt' as (id, address, ...);
    C = JOIN A BY id, B BY id;
    STORE C into 'c.txt';

– Initiated by Yahoo!
![Page 97: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/97.jpg)
Hive – Developed by Facebook
• What is Hive?
  – An SQL-like interface to Hadoop
  – Treat your Big Data as tables
• Data Warehouse infrastructure, which provides
  – Data summarization
  – Ad hoc querying on top of Hadoop
• MapReduce for execution
• Maintains metadata information about your Big Data stored on HDFS
• Hive Query Language

    SELECT storeid, COUNT(*) FROM purchases WHERE price > 100 GROUP BY storeid;
![Page 98: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/98.jpg)
WordCount Example
• Input
  – Hello World Bye World
  – Hello Hadoop Goodbye Hadoop
• For the given sample input the map emits
  – < Hello, 1> < World, 1> < Bye, 1> < World, 1>
  – < Hello, 1> < Hadoop, 1> < Goodbye, 1> < Hadoop, 1>
• The reduce just sums up the values
  – < Bye, 1> < Goodbye, 1> < Hadoop, 2> < Hello, 2> < World, 2>
![Page 99: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/99.jpg)
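WordCount Example In MapReduce

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    public class WordCount {

        // Mapper: emits (word, 1) for every token of the input line.
        public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
            private final static IntWritable one = new IntWritable(1);
            private Text word = new Text();

            public void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                String line = value.toString();
                StringTokenizer tokenizer = new StringTokenizer(line);
                while (tokenizer.hasMoreTokens()) {
                    word.set(tokenizer.nextToken());
                    context.write(word, one);
                }
            }
        }

        // Reducer: sums the counts collected for each word.
        public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        // Driver: args[0] is the input path, args[1] is the output path.
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "wordcount");
            job.setJarByClass(WordCount.class); // lets the cluster locate the job jar
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            job.setMapperClass(Map.class);
            job.setReducerClass(Reduce.class);
            job.setInputFormatClass(TextInputFormat.class);
            job.setOutputFormatClass(TextOutputFormat.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            job.waitForCompletion(true);
        }
    }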
![Page 100: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/100.jpg)
WordCount Example By Pig
    A = LOAD 'wordcount/input' USING PigStorage() AS (token:chararray);
    B = GROUP A BY token;
    C = FOREACH B GENERATE group, COUNT(A) AS count;
    DUMP C;
![Page 101: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/101.jpg)
WordCount Example By Hive
    CREATE TABLE wordcount (token STRING);
    LOAD DATA LOCAL INPATH 'wordcount/input' OVERWRITE INTO TABLE wordcount;
    SELECT token, count(*) FROM wordcount GROUP BY token;
![Page 102: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/102.jpg)
Hive vs. Pig
| | Hive | Pig |
|---|---|---|
| Language | HiveQL (SQL-like) | Pig Latin, a scripting language |
| Schema | Table definitions that are stored in a metastore | A schema is optionally defined at runtime |
| Programmatic Access | JDBC, ODBC | PigServer |
![Page 103: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/103.jpg)
HCatalog in the Ecosystem
Java MapReduce
HCatalog
HDFS HBase ???
![Page 104: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/104.jpg)
Oozie
![Page 105: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/105.jpg)
What is Oozie?
• A Java Web Application
• Oozie is a workflow scheduler for Hadoop
• Crond for Hadoop
[Figure: a workflow graph of Job 1 through Job 5]
![Page 106: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/106.jpg)
How it is triggered
• Time
  – Execute your workflow every 15 minutes
  – Timeline: 00:15, 00:30, 00:45, 01:00
• Event
  – Materialize your workflow every hour, but only run it when the input data is ready
  – Timeline: 01:00, 02:00, 03:00, 04:00, each gated on the check “Hadoop: Input Data Exists?”
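The difference between the two trigger types can be sketched in a few lines. This is not Oozie code (Oozie coordinators are declared in XML); `TriggerSketch`, its methods and the minute-offset representation are hypothetical and only model the scheduling decision.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class TriggerSketch {
    /** Time trigger: minute offsets at which the workflow runs, every periodMin up to untilMin. */
    static List<Integer> timeTriggered(int periodMin, int untilMin) {
        if (periodMin <= 0) {
            throw new IllegalArgumentException("periodMin must be positive");
        }
        List<Integer> runs = new ArrayList<Integer>();
        for (int t = periodMin; t <= untilMin; t += periodMin) {
            runs.add(t);
        }
        return runs;
    }

    /**
     * Event trigger: the workflow is materialized at every nominal time,
     * but it only runs for the times whose input data is ready.
     */
    static List<Integer> dataTriggered(List<Integer> nominalTimes, Set<Integer> inputReadyFor) {
        List<Integer> runs = new ArrayList<Integer>();
        for (Integer t : nominalTimes) {
            if (inputReadyFor.contains(t)) {
                runs.add(t);
            }
        }
        return runs;
    }
}
```

`timeTriggered(15, 60)` gives the offsets 15, 30, 45 and 60, matching the 00:15 to 01:00 timeline. With hourly nominal times 60, 120, 180 and 240 and input ready only for 60 and 180, `dataTriggered` runs the workflow twice.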
![Page 107: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/107.jpg)
Defining an Oozie Workflow
[Figure: a workflow graph running from Start to End through Action nodes and Control Flow nodes]
![Page 108: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/108.jpg)
HDP on Windows
![Page 109: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/109.jpg)
General Planning Considerations
• Run on a single node?
  – For testing and performing simple operations
  – Not suitable for big data
• Start from a small cluster
  – Maybe 4 or 6 nodes
  – As the data grows, add more nodes
• Expand when necessary
  – When storage is not enough
  – To improve computing capability
![Page 110: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/110.jpg)
Traditional Operating System Selection
• RedHat Enterprise Linux
• CentOS
• Ubuntu Server
• SuSE Enterprise Linux
![Page 111: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/111.jpg)
Topology
• Master Node
  – Active NameNode
  – ResourceManager
  – Secondary NameNode (or Standby NameNode)
• Slave Node
  – DataNode
  – NodeManager
![Page 112: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/112.jpg)
Network Topology
[Figure: switches connect the cluster to the outside world ("World") and to one switch per rack (Rack 1, Rack 2, Rack 3, … Rack n). Each rack holds four DN + NM nodes; the Namenode, RM and SNN also appear in the diagram.]
![Page 113: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/113.jpg)
Hadoop Distribution
• Apache
• Cloudera
• MapR
• Hortonworks
• Amazon EMR
  – Apache Hadoop
  – MapR
• Greenplum
• IBM
![Page 114: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/114.jpg)
Get all your Hadoop packages and …
• Make sure the packages are compatible with each other!
![Page 115: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/115.jpg)
Who is Hortonworks?
Upstream Community Projects / Downstream Enterprise Product
Hortonworks Data Platform
Design & Develop
Distribute
Integrate & Test
Package & Certify
Apache Pig
Apache Hadoop
Test & Patch
Design & Develop
Release
No Lock-in: Integrated, tested & certified distribution lowers risk by ensuring close alignment with Apache projects
Virtuous cycle: development and fixes are done upstream, and stable project releases flow downstream
Apache Hive
Apache HBase
Other Apache Projects
Apache HCatalog
Apache Ambari
![Page 116: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/116.jpg)
What is HDP
• Enterprise Hadoop
  – The Hortonworks Data Platform (HDP)
  – The ONLY 100% open source and complete distribution
  – Enterprise grade, proven and tested at scale
  – Ecosystem endorsed to ensure interoperability
[Figure: HORTONWORKS DATA PLATFORM (HDP) stack]
• HADOOP CORE: HDFS, YARN, WEBHDFS, MAP REDUCE
• PLATFORM SERVICES: Enterprise Readiness (HA, DR, Snapshots, Security, …)
• DATA SERVICES: HCATALOG, HIVE, PIG, HBASE, SQOOP, FLUME
• OPERATIONAL SERVICES (Manage & Operate at Scale): OOZIE, AMBARI
• Runs on: OS, Cloud, VM, Appliance
![Page 117: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/117.jpg)
The management for HDP
![Page 118: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/118.jpg)
What is HDP for Windows
• HDP for Windows significantly expands the ecosystem for the next generation big data platform. This means that the Microsoft partners and tools you already rely on can help you with your Big Data initiatives.
• HDP for Windows is the Microsoft recommended way to deploy Hadoop on Windows Server environments.
• Support
  – Windows Server 2008
  – Windows Server 2012
![Page 119: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/119.jpg)
Choose your HDP
![Page 120: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/120.jpg)
HDP hardware recommendations
| Machine Type | Workload Pattern / Cluster Type | Storage | Processor (# of Cores) | Memory (GB) | Network |
|---|---|---|---|---|---|
| Slave Nodes | Balanced workload | Twelve 2-3 TB disks | 8 | 128-256 | 1 GB onboard, 2x10 GBE mezzanine/external |
| Slave Nodes | Compute-intensive workload | Twelve 1-2 TB disks | 10 | 128-256 | 1 GB onboard, 2x10 GBE mezzanine/external |
| Slave Nodes | Storage-heavy workload | Twelve 4+ TB disks | 8 | 128-256 | 1 GB onboard, 2x10 GBE mezzanine/external |
| NameNode | Balanced workload | Four or more 2-3 TB RAID 10 with spares | 8 | 128-256 | 1 GB onboard, 2x10 GBE mezzanine/external |
| ResourceManager | Balanced workload | Four or more 2-3 TB RAID 10 with spares | 8 | 128-256 | 1 GB onboard, 2x10 GBE mezzanine/external |
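As a rough worked example of what the storage column implies, the sketch below multiplies disk count by disk size and divides by an HDFS replication factor of 3. The replication factor is the HDFS default and is assumed here, since the table does not state it. The function name is hypothetical, and the calculation ignores space reserved for the OS, logs, and intermediate data.

```python
def node_capacity_tb(disks, tb_per_disk_range, replication=3):
    """Return ((raw_low, raw_high), (usable_low, usable_high)) in TB for one slave node."""
    low, high = tb_per_disk_range
    raw = (disks * low, disks * high)
    # Each HDFS block is stored `replication` times, so unique data is raw / replication.
    usable = (raw[0] / replication, raw[1] / replication)
    return raw, usable


# Balanced-workload slave node from the table: twelve 2-3 TB disks.
raw, usable = node_capacity_tb(12, (2, 3))
print(f"raw: {raw[0]}-{raw[1]} TB per node")
print(f"unique data at 3x replication: {usable[0]:.0f}-{usable[1]:.0f} TB per node")
```

Multiplying the per-node figure by the number of slave nodes gives a first estimate of cluster capacity; real sizing would also leave headroom for growth.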
![Page 121: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/121.jpg)
The installation of HDP for Windows 1
![Page 122: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/122.jpg)
The installation of HDP for Windows 2
![Page 123: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/123.jpg)
The installation of HDP for Windows 3
• #Log directory
• HDP_LOG_DIR=d:\hadoop\logs
• #Data directory
• HDP_DATA_DIR=d:\hdp\data
• #Hosts
• NAMENODE_HOST=NAMENODE_MASTER.acme.com
• SECONDARY_NAMENODE_HOST=SECONDARY_NAMENODE_MASTER.acme.com
• RESOURCEMANAGER_HOST=RESOURCEMANAGER_MASTER.acme.com
• HIVE_SERVER_HOST=HIVE_SERVER_MASTER.acme.com
• OOZIE_SERVER_HOST=OOZIE_SERVER_MASTER.acme.com
• WEBHCAT_HOST=WEBHCAT_MASTER.acme.com
• FLUME_HOSTS=FLUME_SERVICE1.acme.com,FLUME_SERVICE2.acme.com,FLUME_SERVICE3.acme.com
• HBASE_MASTER=HBASE_MASTER.acme.com
• HBASE_REGIONSERVERS=slave1.acme.com, slave2.acme.com, slave3.acme.com
• ZOOKEEPER_HOSTS=slave1.acme.com, slave2.acme.com, slave3.acme.com
• SLAVE_HOSTS=slave1.acme.com, slave2.acme.com, slave3.acme.com
![Page 124: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/124.jpg)
The installation of HDP for Windows 4
• #Database host
• DB_FLAVOR=derby
• DB_HOSTNAME=DB_myHostName
• #Hive properties
• HIVE_DB_NAME=hive
• HIVE_DB_USERNAME=hive
• HIVE_DB_PASSWORD=hive
• #Oozie properties
• OOZIE_DB_NAME=oozie
• OOZIE_DB_USERNAME=oozie
• OOZIE_DB_PASSWORD=oozie
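Because the installer settings on these two slides are plain `KEY=VALUE` lines with `#` comments, they are easy to sanity-check before running an install. The following is a minimal sketch, not the HDP installer's own parser: the function names `parse_cluster_properties` and `host_list` are hypothetical, and the sample text is a shortened copy of the values shown above.

```python
def parse_cluster_properties(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    props = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        if "=" not in line:
            # Catches malformed entries such as a host name with no key.
            raise ValueError(f"expected KEY=VALUE, got: {line!r}")
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def host_list(props, key):
    """Split a comma-separated host property into a clean list."""
    return [h.strip() for h in props.get(key, "").split(",") if h.strip()]


SAMPLE = r"""
#Log directory
HDP_LOG_DIR=d:\hadoop\logs
#Hosts
NAMENODE_HOST=NAMENODE_MASTER.acme.com
SLAVE_HOSTS=slave1.acme.com, slave2.acme.com, slave3.acme.com
#Database host
DB_FLAVOR=derby
"""

props = parse_cluster_properties(SAMPLE)
print(props["NAMENODE_HOST"])
print(host_list(props, "SLAVE_HOSTS"))
```

Failing fast on a line without `=` is a deliberate choice here: a silently skipped host entry is much harder to diagnose once services are partly installed.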
![Page 125: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/125.jpg)
What is HDP for Windows (Cont.)
![Page 126: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/126.jpg)
The management of HDP for Windows
• The Ambari SCOM integration is made possible by the pluggable nature of Ambari.
![Page 127: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/127.jpg)
The management of HDP for Windows (Cont.)
![Page 128: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/128.jpg)
The Advantages of HDP for Windows
• Hadoop on Windows Made Easy – With HDP for Windows, Hadoop is simple to both install and manage. It demystifies the Hadoop distribution so you don’t need to choose and test the right combination of Hadoop projects to deploy.
• Clean and Easy Management – Apache Ambari, the open source choice for managing a Hadoop cluster, integrates with and extends Microsoft System Center so that IT operators can manage their Hadoop clusters side-by-side with their databases, applications and other IT assets on a single screen.
• Secure, Reliable, Enterprise-Ready Hadoop – Offering the most reliable, innovative and trusted distribution available, Microsoft and Hortonworks together deliver tighter security through integration with Windows Server Active Directory, and ease of management through System Center integration.
![Page 129: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/129.jpg)
The Data Integration
• The Hive ODBC Driver
[Diagram: BI Tools, Analytics, Reporting, Hive ODBC Driver]
![Page 130: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/130.jpg)
Using Hive with Excel
• Using the Hive ODBC Driver, your Excel spreadsheets can query data stored in Hadoop
![Page 131: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/131.jpg)
Querying Hive from Excel
![Page 132: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/132.jpg)
Querying Hive from Excel (Cont.)
![Page 133: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/133.jpg)
Combine model using Power View in Excel
![Page 134: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/134.jpg)
What's Next?
• Spark
• Ambari
• Ranger
• Falcon
![Page 135: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/135.jpg)
Why Spark
• MapReduce is too slow
• Spark aims to make data analytics fast, both fast to run and fast to write.
• Useful when the workload calls for iterative algorithms
![Page 136: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/136.jpg)
What is Spark
• In-memory distributed computing framework
• Created by UC Berkeley AMP Lab in 2010
• Targets problems that Hadoop MR is bad at – Iterative algorithms (machine learning) – Interactive data mining
• More general purpose than Hadoop MR
• Active contributions from ~15 companies
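To make the "in-memory" and "iterative algorithm" points concrete, here is a toy sketch in plain Python. It does not use Spark's or Hadoop's APIs: `CountingSource` is a hypothetical stand-in for a dataset on HDFS that counts full reads, and the loop is a simple gradient descent toward the mean of the data. The uncached version reloads the input on every iteration, the way a chain of MapReduce jobs would, while the cached version loads once and reuses the in-memory copy.

```python
class CountingSource:
    """Stand-in for a dataset on disk; counts how many times it is read in full."""

    def __init__(self, records):
        self._records = list(records)
        self.reads = 0

    def load(self):
        self.reads += 1
        return list(self._records)


def step(w, data, lr):
    # One gradient-descent step minimising the mean squared distance to the data.
    grad = sum(2 * (w - x) for x in data) / len(data)
    return w - lr * grad


def iterate_without_cache(source, steps, lr=0.1):
    w = 0.0
    for _ in range(steps):
        data = source.load()      # re-read the input on every iteration
        w = step(w, data, lr)
    return w


def iterate_with_cache(source, steps, lr=0.1):
    data = source.load()          # read once, keep the working set in memory
    w = 0.0
    for _ in range(steps):
        w = step(w, data, lr)
    return w


uncached, cached = CountingSource([1.0, 2.0, 3.0]), CountingSource([1.0, 2.0, 3.0])
print(iterate_without_cache(uncached, 50), "reads:", uncached.reads)
print(iterate_with_cache(cached, 50), "reads:", cached.reads)
```

Both versions compute the same result; only the number of full passes over storage differs, which is the cost that grows with every extra iteration.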
![Page 137: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/137.jpg)
What Is Different between Hadoop and Spark
[Diagram, Spark: Data Source, Data Source 2, Map(), Join(), Cache(), Transform]
[Diagram, Hadoop: HDFS, Map, Reduce, Map, Reduce]
http://spark.incubator.apache.org
![Page 138: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/138.jpg)
What is Ambari
• Provision a Hadoop Cluster – Ambari provides a step-by-step wizard for installing Hadoop services across any number of hosts. – Ambari handles configuration of Hadoop services for the cluster.
• Manage a Hadoop Cluster – Ambari provides central management for starting, stopping, and reconfiguring Hadoop services across the entire cluster.
• Monitor a Hadoop Cluster – Ambari provides a dashboard for monitoring health and status of the Hadoop cluster.
![Page 139: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/139.jpg)
Ambari installation Wizard
![Page 140: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/140.jpg)
Ambari central dashboard
![Page 141: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/141.jpg)
Conclusion
![Page 142: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/142.jpg)
Recap - Lifecycle of a YARN Application
[Diagram: Client, Resource Manager, Node Managers, Application Master, Containers]
Container
• Basic unit of allocation – Ex. Container A = 2GB, 1CPU
• Fine-grained resource allocation
• Replaces the fixed map/reduce slots
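The container idea can be illustrated with a toy allocator. This is a sketch under stated assumptions, not YARN's actual scheduler: the `NodeManager` class and `allocate` function are hypothetical, placement is plain first-fit, and real schedulers add queues, priorities, and locality. What it does show is that a request names the memory and CPU it needs (such as 2 GB and 1 CPU) instead of occupying a fixed map or reduce slot.

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    name: str
    free_mem_gb: int
    free_vcores: int
    containers: list = field(default_factory=list)

    def can_fit(self, mem_gb, vcores):
        return self.free_mem_gb >= mem_gb and self.free_vcores >= vcores

    def launch(self, container_id, mem_gb, vcores):
        # Reserve exactly what the container asked for.
        self.free_mem_gb -= mem_gb
        self.free_vcores -= vcores
        self.containers.append(container_id)


def allocate(nodes, container_id, mem_gb, vcores):
    """First-fit placement: return the chosen node's name, or None if nothing fits."""
    for node in nodes:
        if node.can_fit(mem_gb, vcores):
            node.launch(container_id, mem_gb, vcores)
            return node.name
    return None


nodes = [
    NodeManager("nm1", free_mem_gb=4, free_vcores=2),
    NodeManager("nm2", free_mem_gb=8, free_vcores=4),
]

print(allocate(nodes, "A", mem_gb=2, vcores=1))  # the slide's example container
print(allocate(nodes, "B", mem_gb=6, vcores=2))
print(allocate(nodes, "C", mem_gb=4, vcores=2))
```

Because each container is sized individually, a node can host a mix of small and large containers until either memory or cores run out, which is the fine-grained allocation the slide refers to.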
![Page 143: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/143.jpg)
Hadoop 2.0 Eco-systems
![Page 144: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/144.jpg)
Q&A
![Page 145: Hadoop 2.0 Introduction – with HDP for Windows...2015/05/14 · Agenda • What is Big Data – The Need for Hadoop • Hadoop Introduction – What is Hadoop 2.0 • Hadoop Architecture](https://reader033.vdocuments.net/reader033/viewer/2022042306/5ed2be993df6dd07425d7c04/html5/thumbnails/145.jpg)
Questions?