hadoop: data processing by minions abcd-gis august 2015 presentation dave strohschein, harvard...
![Page 1: Hadoop: Data Processing by Minions ABCD-GIS August 2015 Presentation Dave Strohschein, Harvard Center for Geographic Analysis dstrohschein@cga.harvard.edu](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649e6b5503460f94b68de7/html5/thumbnails/1.jpg)
Hadoop: Data Processing by Minions
ABCD-GIS August 2015 Presentation
Dave Strohschein, Harvard Center for Geographic Analysis
![Page 2: Hadoop: Data Processing by Minions ABCD-GIS August 2015 Presentation Dave Strohschein, Harvard Center for Geographic Analysis dstrohschein@cga.harvard.edu](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649e6b5503460f94b68de7/html5/thumbnails/2.jpg)
Today’s Talk
Why use Hadoop?
What is Hadoop?
How does Hadoop work?
How are we using Hadoop?
Issues encountered
A broader view – future directions
![Page 3: Hadoop: Data Processing by Minions ABCD-GIS August 2015 Presentation Dave Strohschein, Harvard Center for Geographic Analysis dstrohschein@cga.harvard.edu](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649e6b5503460f94b68de7/html5/thumbnails/3.jpg)
Background
“…WorldMap will be extended to be capable of gathering interactive map information from hundreds of other servers around the world and making this map layer information searchable together with the WorldMap layer information.”
http://worldmap.harvard.edu/
![Page 4: Hadoop: Data Processing by Minions ABCD-GIS August 2015 Presentation Dave Strohschein, Harvard Center for Geographic Analysis dstrohschein@cga.harvard.edu](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649e6b5503460f94b68de7/html5/thumbnails/4.jpg)
Orientation / Motivation
gathering interactive map information from hundreds of other servers around the world
KML
Shapefiles
![Page 5: Hadoop: Data Processing by Minions ABCD-GIS August 2015 Presentation Dave Strohschein, Harvard Center for Geographic Analysis dstrohschein@cga.harvard.edu](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649e6b5503460f94b68de7/html5/thumbnails/5.jpg)
Overall Process
Billions of webpages
Hundreds of terabytes of compressed HTML text data
![Page 6: Hadoop: Data Processing by Minions ABCD-GIS August 2015 Presentation Dave Strohschein, Harvard Center for Geographic Analysis dstrohschein@cga.harvard.edu](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649e6b5503460f94b68de7/html5/thumbnails/6.jpg)
We build and maintain an open repository of web crawl data that can be accessed and analyzed by anyone.
![Page 7: Hadoop: Data Processing by Minions ABCD-GIS August 2015 Presentation Dave Strohschein, Harvard Center for Geographic Analysis dstrohschein@cga.harvard.edu](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649e6b5503460f94b68de7/html5/thumbnails/7.jpg)
![Page 8: Hadoop: Data Processing by Minions ABCD-GIS August 2015 Presentation Dave Strohschein, Harvard Center for Geographic Analysis dstrohschein@cga.harvard.edu](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649e6b5503460f94b68de7/html5/thumbnails/8.jpg)
Process the Data
Hundreds of terabytes of compressed HTML text data
Thousands of CPU hours
Months of processing!
![Page 9: Hadoop: Data Processing by Minions ABCD-GIS August 2015 Presentation Dave Strohschein, Harvard Center for Geographic Analysis dstrohschein@cga.harvard.edu](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649e6b5503460f94b68de7/html5/thumbnails/9.jpg)
Common Crawl Frequency
• [ARC] s3://aws-publicdatasets/common-crawl/crawl-001/ - Crawl #1 (2008/2009)
• [ARC] s3://aws-publicdatasets/common-crawl/crawl-002/ - Crawl #2 (2009/2010)
• [ARC] s3://aws-publicdatasets/common-crawl/parse-output/ - Crawl #3 (2012)
• [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2013-20/ - Summer 2013
• [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2013-48/ - Winter 2013
• [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2014-10/ - March 2014
• [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2014-15/ - April 2014
• [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2014-23/ - July 2014
• [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2014-35/ - August 2014
• [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2014-41/ - September 2014
• [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2014-42/ - October 2014
• [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2014-49/ - November 2014
• [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2014-52/ - December 2014
• [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2015-06/ - January 2015
• [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2015-11/ - February 2015
• [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2015-14/ - March 2015
• [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2015-18/ - April 2015
• [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2015-22/ - May 2015
• [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2015-27/ - June 2015
• [WARC] s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2015-32/ - July 2015
![Page 10: Hadoop: Data Processing by Minions ABCD-GIS August 2015 Presentation Dave Strohschein, Harvard Center for Geographic Analysis dstrohschein@cga.harvard.edu](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649e6b5503460f94b68de7/html5/thumbnails/10.jpg)
Master
Slaves
![Page 11: Hadoop: Data Processing by Minions ABCD-GIS August 2015 Presentation Dave Strohschein, Harvard Center for Geographic Analysis dstrohschein@cga.harvard.edu](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649e6b5503460f94b68de7/html5/thumbnails/11.jpg)
Master
Slaves
![Page 12: Hadoop: Data Processing by Minions ABCD-GIS August 2015 Presentation Dave Strohschein, Harvard Center for Geographic Analysis dstrohschein@cga.harvard.edu](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649e6b5503460f94b68de7/html5/thumbnails/12.jpg)
Process the data
Hours
Thousands of CPU hours
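The step from thousands of CPU hours to hours of wall-clock time is, to a first approximation, division by the number of cores working in parallel. A rough sketch in Java follows; the figures (2,000 CPU hours, 50 workers with 4 cores each) are hypothetical and not taken from the talk, and the calculation assumes ideal linear scaling, which a real cluster will not reach.

```java
// Back-of-the-envelope estimate only: assumes perfectly linear scaling
// and ignores scheduling, data transfer, and straggler overhead.
class ClusterTimeSketch {

    // Wall-clock hours if the CPU hours are spread evenly over all cores.
    static double wallClockHours(double cpuHours, int nodes, int coresPerNode) {
        return cpuHours / ((double) nodes * coresPerNode);
    }

    public static void main(String[] args) {
        // Hypothetical figures, not measurements from the talk.
        System.out.println(wallClockHours(2000, 50, 4)); // prints 10.0
    }
}
```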
![Page 13: Hadoop: Data Processing by Minions ABCD-GIS August 2015 Presentation Dave Strohschein, Harvard Center for Geographic Analysis dstrohschein@cga.harvard.edu](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649e6b5503460f94b68de7/html5/thumbnails/13.jpg)
Master
Slaves
• Scalability
• Fault Tolerance
• Resource Sharing
![Page 14: Hadoop: Data Processing by Minions ABCD-GIS August 2015 Presentation Dave Strohschein, Harvard Center for Geographic Analysis dstrohschein@cga.harvard.edu](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649e6b5503460f94b68de7/html5/thumbnails/14.jpg)
Hadoop 1.0 Framework
Hadoop Distributed File System - HDFS
MapReduce - MR
![Page 15: Hadoop: Data Processing by Minions ABCD-GIS August 2015 Presentation Dave Strohschein, Harvard Center for Geographic Analysis dstrohschein@cga.harvard.edu](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649e6b5503460f94b68de7/html5/thumbnails/15.jpg)
MapReduce Implementation
Key : Value
or
K : V
K1 : V1
KO : VO
![Page 16: Hadoop: Data Processing by Minions ABCD-GIS August 2015 Presentation Dave Strohschein, Harvard Center for Geographic Analysis dstrohschein@cga.harvard.edu](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649e6b5503460f94b68de7/html5/thumbnails/16.jpg)
MapReduce Flow
(K ,V)
(K ,V)
(K ,[V]) (K ,V)
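The flow above, (K, V) from the mappers, grouped into (K, [V]) by the shuffle, and collapsed back to (K, V) by the reducers, can be imitated in a single process. The sketch below is plain Java with no Hadoop classes; the class and method names are illustrative assumptions, and word counting stands in for whatever the real job computes.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory imitation of map -> shuffle -> reduce; not Hadoop's API.
class MapReduceFlowSketch {

    // Map: one input record becomes zero or more (K, V) pairs, here (word, 1).
    static List<Map.Entry<String, Long>> map(String line) {
        List<Map.Entry<String, Long>> pairs = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new SimpleEntry<>(word.toLowerCase(), 1L));
            }
        }
        return pairs;
    }

    // Shuffle: group all values that share a key, giving (K, [V]).
    static Map<String, List<Long>> shuffle(List<Map.Entry<String, Long>> pairs) {
        Map<String, List<Long>> grouped = new TreeMap<>();
        for (Map.Entry<String, Long> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce: collapse each (K, [V]) into one (K, V) by summing the values.
    static Map<String, Long> reduce(Map<String, List<Long>> grouped) {
        Map<String, Long> totals = new TreeMap<>();
        for (Map.Entry<String, List<Long>> entry : grouped.entrySet()) {
            long sum = 0;
            for (long value : entry.getValue()) {
                sum += value;
            }
            totals.put(entry.getKey(), sum);
        }
        return totals;
    }

    // Runs the three phases in sequence over a list of input lines.
    static Map<String, Long> run(List<String> lines) {
        List<Map.Entry<String, Long>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        return reduce(shuffle(pairs));
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("kml shp kml", "shp kml"))); // prints {kml=3, shp=2}
    }
}
```

On a cluster the map calls run on many machines at once and the shuffle moves data across the network; only the shape of the data at each stage carries over from this sketch.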
![Page 17: Hadoop: Data Processing by Minions ABCD-GIS August 2015 Presentation Dave Strohschein, Harvard Center for Geographic Analysis dstrohschein@cga.harvard.edu](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649e6b5503460f94b68de7/html5/thumbnails/17.jpg)
Hadoop HDFS
![Page 18: Hadoop: Data Processing by Minions ABCD-GIS August 2015 Presentation Dave Strohschein, Harvard Center for Geographic Analysis dstrohschein@cga.harvard.edu](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649e6b5503460f94b68de7/html5/thumbnails/18.jpg)
![Page 19: Hadoop: Data Processing by Minions ABCD-GIS August 2015 Presentation Dave Strohschein, Harvard Center for Geographic Analysis dstrohschein@cga.harvard.edu](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649e6b5503460f94b68de7/html5/thumbnails/19.jpg)
Hadoop 1.0 Issues
Scalability – Job Tracker does it
Job Tracker – single point of failure
Resource Utilization – Map & Reduce slots
Designed for MapReduce Applications
![Page 20: Hadoop: Data Processing by Minions ABCD-GIS August 2015 Presentation Dave Strohschein, Harvard Center for Geographic Analysis dstrohschein@cga.harvard.edu](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649e6b5503460f94b68de7/html5/thumbnails/20.jpg)
Hadoop Evolution
Yet Another Resource Negotiator - YARN
![Page 21: Hadoop: Data Processing by Minions ABCD-GIS August 2015 Presentation Dave Strohschein, Harvard Center for Geographic Analysis dstrohschein@cga.harvard.edu](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649e6b5503460f94b68de7/html5/thumbnails/21.jpg)
Hadoop 2.0 Framework
![Page 22: Hadoop: Data Processing by Minions ABCD-GIS August 2015 Presentation Dave Strohschein, Harvard Center for Geographic Analysis dstrohschein@cga.harvard.edu](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649e6b5503460f94b68de7/html5/thumbnails/22.jpg)
Hadoop Environments
Cloud
Local Cluster
‘Virtual’
![Page 23: Hadoop: Data Processing by Minions ABCD-GIS August 2015 Presentation Dave Strohschein, Harvard Center for Geographic Analysis dstrohschein@cga.harvard.edu](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649e6b5503460f94b68de7/html5/thumbnails/23.jpg)
A Commodity Server
2009 – 8 cores, 16GB of RAM, 4x1TB disk
2012 – 16+ cores, 48-96GB of RAM, 12x2TB or 12x3TB of disk.
http://hortonworks.com/blog/apache-hadoop-yarn-background-and-an-overview/
![Page 24: Hadoop: Data Processing by Minions ABCD-GIS August 2015 Presentation Dave Strohschein, Harvard Center for Geographic Analysis dstrohschein@cga.harvard.edu](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649e6b5503460f94b68de7/html5/thumbnails/24.jpg)
Amazon Web Services
![Page 25: Hadoop: Data Processing by Minions ABCD-GIS August 2015 Presentation Dave Strohschein, Harvard Center for Geographic Analysis dstrohschein@cga.harvard.edu](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649e6b5503460f94b68de7/html5/thumbnails/25.jpg)
![Page 26: Hadoop: Data Processing by Minions ABCD-GIS August 2015 Presentation Dave Strohschein, Harvard Center for Geographic Analysis dstrohschein@cga.harvard.edu](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649e6b5503460f94b68de7/html5/thumbnails/26.jpg)
Hadoop on AWS EMR
Elastic Compute Cloud (EC2)
Elastic MapReduce (EMR)
Amazon Web Services (AWS)
![Page 27: Hadoop: Data Processing by Minions ABCD-GIS August 2015 Presentation Dave Strohschein, Harvard Center for Geographic Analysis dstrohschein@cga.harvard.edu](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649e6b5503460f94b68de7/html5/thumbnails/27.jpg)
Implementing Hadoop at CGA
AWS account – FREE 750 hrs/month
t1.micro (Hadoop 1.0)
Smallest Amazon EC2 Instance
Good for learning basics
Can’t execute Hadoop 2 – needed for libraries
t1.micro m1.medium Hadoop 2 Clusters
Develop on local machine
Create specific test WARCs
Process on cluster m1.medium r3.xlarge
![Page 28: Hadoop: Data Processing by Minions ABCD-GIS August 2015 Presentation Dave Strohschein, Harvard Center for Geographic Analysis dstrohschein@cga.harvard.edu](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649e6b5503460f94b68de7/html5/thumbnails/28.jpg)
CommonCrawl Processing on AWS
• Local algorithm development
• Upload application (jar file) to S3
• Ruby command-line-interface for EC2/EMR initialization
![Page 29: Hadoop: Data Processing by Minions ABCD-GIS August 2015 Presentation Dave Strohschein, Harvard Center for Geographic Analysis dstrohschein@cga.harvard.edu](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649e6b5503460f94b68de7/html5/thumbnails/29.jpg)
Implementing Hadoop at CGA
WARCTagCounter.java
• Hadoop ‘configuration’
• Input data information
• Mapper selection
• Reducer selection – simple summer
TagCounterMap.java
• Mapper functionality
• Extends the Mapper class
• Mapper<Text, ArchiveReader, Text, LongWritable>
K1 : V1
K2 : V2
![Page 30: Hadoop: Data Processing by Minions ABCD-GIS August 2015 Presentation Dave Strohschein, Harvard Center for Geographic Analysis dstrohschein@cga.harvard.edu](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649e6b5503460f94b68de7/html5/thumbnails/30.jpg)
WARC/1.0
WARC-Type: response
WARC-Date: 2014-08-02T09:52:13Z
WARC-Record-ID: <urn:uuid:ffbfb0c0-6456-42b0-af03-3867be6fc09f>
Content-Length: 43428
Content-Type: application/http; msgtype=response
WARC-Warcinfo-ID: <urn:uuid:3169ca8e-39a6-42e9-a4e3-9f001f067bdf>
WARC-Concurrent-To: <urn:uuid:d99f2a24-158a-4c77-bb0a-3cccd40aad56>
WARC-IP-Address: 212.58.244.61
WARC-Target-URI: http://news.bbc.co.uk/2/hi/africa/3414345.stm
WARC-Payload-Digest: sha1:M63W6MNGFDWXDSLTHF7GWUPCJUH4JK3J
WARC-Block-Digest: sha1:YHKQUSBOS4CLYFEKQDVGJ457OAPD6IJO
WARC-Truncated: length

HTTP/1.1 200 OK
Server: Apache
Vary: X-CDN
Cache-Control: max-age=0
Content-Type: text/html
Date: Sat, 02 Aug 2014 09:52:13 GMT
Expires: Sat, 02 Aug 2014 09:52:13 GMT
Connection: close
Set-Cookie: BBC-UID=......
Set-Cookie: BBC-UID=......

<!doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/.....>
<html><head><title>BBC NEWS | Africa | Namibia braces for Nujoma exit</title>
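A WARC record like the one above starts with a version line, then named header fields, then a blank line before the captured HTTP response. As a minimal sketch of that layout, the plain-Java snippet below reads the named fields into a map. The class and method names are illustrative assumptions, and the job described in this talk reads records through an ArchiveReader rather than hand-rolled parsing.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative parser for the header block of a single WARC record.
class WarcHeaderSketch {

    // Returns the "Name: value" fields that precede the first blank line.
    static Map<String, String> parseHeaders(String record) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String line : record.split("\r?\n")) {
            if (line.isEmpty()) {
                break; // a blank line ends the WARC header block
            }
            int colon = line.indexOf(':');
            if (colon <= 0) {
                continue; // the version line "WARC/1.0" has no field name
            }
            fields.put(line.substring(0, colon).trim(), line.substring(colon + 1).trim());
        }
        return fields;
    }

    public static void main(String[] args) {
        // Abbreviated copy of the record shown on the slide.
        String record = String.join("\r\n",
                "WARC/1.0",
                "WARC-Type: response",
                "WARC-Date: 2014-08-02T09:52:13Z",
                "WARC-Target-URI: http://news.bbc.co.uk/2/hi/africa/3414345.stm",
                "Content-Length: 43428",
                "",
                "HTTP/1.1 200 OK");
        Map<String, String> fields = parseHeaders(record);
        System.out.println(fields.get("WARC-Type") + " " + fields.get("WARC-Target-URI"));
    }
}
```

Splitting on the first colon only matters here: values such as WARC-Date and WARC-Target-URI contain colons of their own.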
![Page 31: Hadoop: Data Processing by Minions ABCD-GIS August 2015 Presentation Dave Strohschein, Harvard Center for Geographic Analysis dstrohschein@cga.harvard.edu](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649e6b5503460f94b68de7/html5/thumbnails/31.jpg)
Signature Detection
<!DOCTYPE html>
<p>… <a href="http://maps.vcgi.org/arcgis/rest/services/ ...
</a> </p>
![Page 32: Hadoop: Data Processing by Minions ABCD-GIS August 2015 Presentation Dave Strohschein, Harvard Center for Geographic Analysis dstrohschein@cga.harvard.edu](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649e6b5503460f94b68de7/html5/thumbnails/32.jpg)
Signatures
“http(s)://…/arcgis/rest/services”
“http(s)://…/arcgiscache”
“http(s)://…?request=getcapabilities”
“http(s)://… .kml” or “http(s)://… .kmz”
(“shape” || “shp”) && “.zip”
“http(s)://… "${z}/${x}/${y}" || "${z}/${y}/${x}" || "$[z]/$[x]/$[y]" || "$[z]/$[y]/$[x]" || "{z}/{x}/{y}" || "{z}/{y}/{x}" || "[z]/[x]/[y]" || "[z]/[y]/[x]"
“http(s)://… request=getmap”
“http(s)://… .jp2”
“http(s)://… .ecw”
“http(s)://… .sid”
“http(s)://… .tfw”
“http(s)://… .gpx”
“http(s)://… .geojson”
“http(s)://… .gdb”
“http(s)://…thredds…”
“http(s)://…opendap…”
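A few of the signatures above can be written as regular expressions plus one substring test. The sketch below is one possible reading, not the code used in the project: the signature labels, the class name, and the exact patterns are assumptions made for illustration, and only a handful of the listed signatures are covered.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Pattern;

// Illustrative signature matcher; labels and patterns are assumptions.
class SignatureSketch {

    static final Map<String, Pattern> URL_SIGNATURES = new LinkedHashMap<>();
    static {
        int ci = Pattern.CASE_INSENSITIVE;
        URL_SIGNATURES.put("arcgis-rest", Pattern.compile("^https?://.*/arcgis/rest/services", ci));
        URL_SIGNATURES.put("getcapabilities", Pattern.compile("^https?://.*\\?request=getcapabilities", ci));
        URL_SIGNATURES.put("kml-kmz", Pattern.compile("^https?://.*\\.km[lz]$", ci));
        URL_SIGNATURES.put("geojson", Pattern.compile("^https?://.*\\.geojson$", ci));
    }

    // Returns the label of every signature the link matches, in declaration order.
    static List<String> match(String link) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, Pattern> signature : URL_SIGNATURES.entrySet()) {
            if (signature.getValue().matcher(link).find()) {
                hits.add(signature.getKey());
            }
        }
        // ("shape" || "shp") && ".zip": both parts must appear somewhere in the link.
        String lower = link.toLowerCase(Locale.ROOT);
        if (lower.contains(".zip") && (lower.contains("shape") || lower.contains("shp"))) {
            hits.add("shapefile-zip");
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(match("http://cinematreasures.org/theaters/10911.kml")); // prints [kml-kmz]
    }
}
```

Matching is case-insensitive because service URLs in the wild mix forms such as GetCapabilities and getcapabilities; the example.org addresses used when trying this out are placeholders.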
![Page 33: Hadoop: Data Processing by Minions ABCD-GIS August 2015 Presentation Dave Strohschein, Harvard Center for Geographic Analysis dstrohschein@cga.harvard.edu](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649e6b5503460f94b68de7/html5/thumbnails/33.jpg)
Reducer Output
http://cinematreasures.org/theaters/10911.kml 1
http://cinematreasures.org/theaters/10911/map|||http://cinematreasures.org/theaters/10911.kml -1
Signature match
Signature match
URI (base of URL)
![Page 34: Hadoop: Data Processing by Minions ABCD-GIS August 2015 Presentation Dave Strohschein, Harvard Center for Geographic Analysis dstrohschein@cga.harvard.edu](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649e6b5503460f94b68de7/html5/thumbnails/34.jpg)
Results
It worked!
Pre-built parsers vs. ‘homebrew’
Jsoup parser: inconsistent processing times
RegEx parser: much more consistent results
A wide array of geo services vis-à-vis signature choice
![Page 35: Hadoop: Data Processing by Minions ABCD-GIS August 2015 Presentation Dave Strohschein, Harvard Center for Geographic Analysis dstrohschein@cga.harvard.edu](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649e6b5503460f94b68de7/html5/thumbnails/35.jpg)
Issues Implementing Hadoop
Hadoop learning curve
Native Java application
Tutorial information exists
Hadoop on AWS: S3, EMR, terminology, billing / cluster size
Optimizing cluster: Instance type, CPU, Memory, etc.
A wide array of geo services vis-à-vis signature choice
What’s out there and what’s its signature
![Page 36: Hadoop: Data Processing by Minions ABCD-GIS August 2015 Presentation Dave Strohschein, Harvard Center for Geographic Analysis dstrohschein@cga.harvard.edu](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649e6b5503460f94b68de7/html5/thumbnails/36.jpg)
![Page 37: Hadoop: Data Processing by Minions ABCD-GIS August 2015 Presentation Dave Strohschein, Harvard Center for Geographic Analysis dstrohschein@cga.harvard.edu](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649e6b5503460f94b68de7/html5/thumbnails/37.jpg)
Future Directions
SpatialHadoop: A MapReduce Framework for Spatial Data
GIS Tools for Hadoop
Processing GeoTweets
![Page 38: Hadoop: Data Processing by Minions ABCD-GIS August 2015 Presentation Dave Strohschein, Harvard Center for Geographic Analysis dstrohschein@cga.harvard.edu](https://reader036.vdocuments.net/reader036/viewer/2022062516/56649e6b5503460f94b68de7/html5/thumbnails/38.jpg)
Backup