Whirlwind tour of Hadoop
Inspired by Google's GFS
Clusters from 1 to 10,000 systems
Batch processing
High throughput
Partitionable problems
Fault tolerance in storage
Fault tolerance in processing
HDFS
Distributed File System (HDFS)
Centralized "inode" store: the NameNode
DataNodes provide the storage (from 1 to 10,000 of them)
Replication is set per file
Block size is normally large: 128 MB
Optimized for throughput (see the sketch below)
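A minimal sketch of what "replication set per file" and the large block size look like from the Java FileSystem API (the path, replication factor, and buffer size here are only illustrative):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);

    Path path = new Path("/user/root/logs/sample.log");  // hypothetical path
    short replication = 3;                      // copies kept on different DataNodes
    long blockSize = 128L * 1024 * 1024;        // the "normally large" 128 MB block

    // Replication and block size are chosen per file when it is created.
    try (FSDataOutputStream out = fs.create(path, true, 4096, replication, blockSize)) {
      out.writeBytes("jan 10 2009:.........:200:/index.htm\n");
    }
  }
}
```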
Map Reduce
Jobs are typically long running
Move the processing, not the data
Algorithm uses map() and optionally reduce()
Source input is split
Splits are processed simultaneously
Failed tasks are retried
Things you do not have
Locking
Blocking
Mutexes
wait/notify
POSIX file semantics
Problem: Large Scale Log Processing
Challenges:
Large volumes of data
Store it indefinitely
Process large data sets on multiple systems
Data and indexes for short periods (one week) do NOT fit in main memory
Log Format
Getting Files into HDFS
Hadoop Shell
Hadoop has a shell: ls, put, get, cat, ...
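For example, the commands named above look like this (the paths are illustrative):

```
hadoop fs -ls /user/root
hadoop fs -put access.log /user/root/logs/access.log
hadoop fs -cat /user/root/logs/access.log
hadoop fs -get /user/root/logs/access.log ./access.copy.log
```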
Sample Program: Group and Count
Source data looks like:
jan 10 2009:.........:200:/index.htm
jan 10 2009:.........:200:/igloo.htm
jan 10 2009:.........:200:/ed.htm
In case you're a Math 'type'
(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)

map(k1, v1) -> list(k2, v2)
reduce(k2, list(v2)) -> list(v3)
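Plugging the group-and-count example into that notation (a worked instance using the sample lines above; with text input the map key is just the line's byte offset):

map(offset, "jan 10 2009:.........:200:/index.htm") -> list(("/index.htm", 1))
reduce("/index.htm", list(1, 1)) -> list(2)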
Calculate Splits
Input does not have to be a file to be “splittable”
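As a toy illustration of what "calculate splits" means, here is a sketch (not Hadoop code) that carves an input of a given size into independent byte ranges that mappers could process in parallel; real InputFormats do this for files, database tables, and other sources:

```java
import java.util.ArrayList;
import java.util.List;

public final class SplitSketch {
  // A byte range [start, start + length) that one mapper would process.
  static final class ByteRange {
    final long start;
    final long length;
    ByteRange(long start, long length) { this.start = start; this.length = length; }
  }

  // Carve totalBytes of input into splits of at most splitSize bytes each.
  static List<ByteRange> calculateSplits(long totalBytes, long splitSize) {
    List<ByteRange> splits = new ArrayList<>();
    for (long start = 0; start < totalBytes; start += splitSize) {
      splits.add(new ByteRange(start, Math.min(splitSize, totalBytes - start)));
    }
    return splits;
  }
}
```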
Splits → Mappers
Mappers run the map() process on each split
Mapper runs map function
The map function is called once for each row/line (in this case).
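A minimal sketch of such a map function with the Java MapReduce API, assuming the colon-delimited log lines shown earlier with the requested URL as the last field (the class name and field position are illustrative):

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class UrlCountMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final IntWritable ONE = new IntWritable(1);
  private final Text url = new Text();

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    // "jan 10 2009:.........:200:/index.htm" -> emit ("/index.htm", 1)
    String[] fields = line.toString().split(":");
    url.set(fields[fields.length - 1]);
    context.write(url, ONE);
  }
}
```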
Shuffle sort, NOT the hottest new dance move
Data is shuffled to a reducer based on key.hash()
Data is sorted by key within each reducer
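The default choice of reducer works roughly like this (a sketch of the idea behind Hadoop's HashPartitioner, not the class itself):

```java
public final class ShuffleSketch {
  // Mask off the sign bit of the key's hash, then take it modulo the
  // number of reducers: every occurrence of a key lands on the same reducer.
  static int partitionFor(Object key, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}
```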
Reduce
One reducer gets data that looks like:
"index.htm", <1, 1>
"ed.htm", <1>
(igloo.htm went to another reducer)
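A matching reduce function just sums the 1s for each URL (a sketch paired with the hypothetical mapper above):

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class UrlCountReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  protected void reduce(Text url, Iterable<IntWritable> counts, Context context)
      throws IOException, InterruptedException {
    // ("index.htm", <1, 1>) -> ("index.htm", 2)
    int sum = 0;
    for (IntWritable count : counts) {
      sum += count.get();
    }
    context.write(url, new IntWritable(sum));
  }
}
```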
Reduce Phase
Scalable
Tune mappers and reducers per job (see the driver sketch below)
More data? Get more systems
Lose a DataNode: its files replicate elsewhere
Lose a TaskTracker: its running tasks will restart
Lose the NameNode or JobTracker: call bigsaad
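A hedged sketch of a driver that wires the hypothetical mapper and reducer together and tunes the reducer count per job (class names, paths, and the reducer count are illustrative; exact API details vary by Hadoop version):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class UrlCountJob {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "url-group-and-count");
    job.setJarByClass(UrlCountJob.class);

    job.setMapperClass(UrlCountMapper.class);
    job.setReducerClass(UrlCountReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    // Text input: roughly one split per HDFS block, one record per line.
    job.setInputFormatClass(TextInputFormat.class);

    // Mapper count follows the number of splits; reducers are tuned per job.
    job.setNumReduceTasks(4);

    FileInputFormat.addInputPath(job, new Path("/user/root/logs"));
    FileOutputFormat.setOutputPath(job, new Path("/user/root/url-counts"));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```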
Hive
Translates queries into map/reduce jobs
SQL-like language atop Hadoop
Like the MySQL CSV engine, on Hadoop, on 'roids
Much easier and more reusable than coding your own joins, max(), where
CLI, web interface
Hive has no “Hive Format”
Works with most input
Configurable delimiters (example below)
Types like string and long, plus map, struct, and array
Configurable SerDe
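As a sketch of what a configurable delimiter looks like in HiveQL (the table and column names here are only illustrative, not the deck's table):

```sql
CREATE TABLE web_data_delimited (
  sdate STRING,
  url STRING,
  bytessent INT
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ':'
STORED AS TEXTFILE;
```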
Create & Load
hive> CREATE TABLE web_data (
        sdate STRING, stime STRING, envvar STRING, remotelogname STRING,
        servername STRING, localip STRING, literaldash STRING, method STRING,
        url STRING, querystring STRING, status STRING, litteralzero STRING,
        bytessent INT, header STRING, timetoserver INT, useragent STRING,
        cookie STRING, referer STRING);
hive> LOAD DATA INPATH '/user/root/phpbb/phpbb.extended.log.2009-12-19-19' INTO TABLE web_data;
Query
hive> SELECT url, count(1) FROM web_data GROUP BY url;
Hive translates the query into one or more map/reduce programs.
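Hive can show the plan it generates for a query, for example:

```
hive> EXPLAIN SELECT url, count(1) FROM web_data GROUP BY url;
```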
The End...
Questions???