Eleni Stroulia - Migrating to the Hadoop Ecosystem.ppt
Migrating to the Hadoop Ecosystem: An experience report
Eleni Stroulia, Professor, NSERC/AITF (w. IBM support) IRC on "Service Systems Management"
Computing Science University of Alberta
http://ssrg.cs.ualberta.ca/
4/24/12 Eleni Stroulia, CS, UoA (Analytics, Big Data, and the Cloud)
Outline
• Background – Why?
• PaaS with "the Hadoop Ecosystem": HDFS, Hadoop, and HBase – What?
• The TAPoR Migration – How?
• Closing Remarks
WHY?
Big Data… Cheap Hardware…
• Data is growing at an unprecedented rate
– More people use the web and publish data
• Internet usage around the world: 360 million users in 2000; 2 billion in 2011 (about 1/3 of the Earth's population)
• Facebook, in 2009, was uploading 60 TB of images every week
– Things are on the Internet
• A jet engine produces 10 TB of data every 30 flight minutes
• Commodity hardware is cheap
• Owning and maintaining hardware is expensive
Internet World Usage
• 2000: 360 million users
• 2011: 2 billion (1/3 of the Earth's population)
• Source: http://www.internetworldstats.com/stats.htm
WHAT?
Cloud Infrastructure: IaaS
• Providers offer on-demand virtual computation, memory, and network resources
• Users install operating-system images and application software on the machines
• Computing is billed as a utility (pay per use)
IaaS: Infrastructure as a Service
Platform Cloud: PaaS
• Providers deliver a solution stack (on top of the infrastructure) – i.e., operating system, programming-language environment, database, web server
• Users develop, run and maintain their applications on this platform
• Some platforms are “elastic”, i.e., adapt the underlying resources based on application demands
PaaS: Platform as a Service
Software Cloud: SaaS
• Providers install and operate application software in the cloud
• Users use cloud clients to access the software
• These applications are elastic
• Work is distributed by load balancers
• Applications can be multitenant (a machine may serve more than one user organization)
• SaaS pricing is typically a flat fee per user (monthly or yearly)
SaaS: Software as a Service
Google's Solution: Scalability through Virtualization
• Key observation: Many computations are data parallel
• Solution Elements:
Google → Apache:
1. MapReduce → Hadoop MapReduce
2. GFS → HDFS
3. BigTable → HBase
MapReduce/Hadoop
• Inspired by functional programming: Input → Map() → Copy/Sort → Reduce() → Output
• The platform takes care of:
– RPC
– job scheduling
– data locality
– fault tolerance
1. The program uses the MapReduce library to split the input files into M pieces.
2. It starts master and worker nodes. The master assigns each idle worker one of the M map tasks or R reduce tasks.
3. A worker assigned a map task reads the contents of the corresponding input split, parses key/value pairs, and passes each pair to the user-defined map function. The intermediate key/value pairs produced by the map function are buffered in memory.
4. Periodically, the buffered pairs are written to local disk, partitioned into R regions by the partitioning function.
5. The master notifies a reduce worker about these locations; the reduce worker uses RPC to read the buffered data from the map workers' local disks and sorts it by the intermediate keys.
6. The reduce worker iterates over the sorted intermediate data; for each unique intermediate key, it passes the key and the intermediate values to the user's reduce function. The output of the reduce function is appended to a final output file for this reduce partition.
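As a concrete illustration, the six steps can be simulated in-process with a plain-Python word count (a sketch only, not the Hadoop API): map emits intermediate (word, 1) pairs, a grouping step stands in for the partition/sort phase, and reduce sums the values per key.

```python
from collections import defaultdict

def map_fn(doc_id, text):
    # Step 3: parse the input split and emit intermediate key/value pairs.
    for word in text.split():
        yield word, 1

def reduce_fn(word, values):
    # Step 6: combine all intermediate values for one key.
    return word, sum(values)

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Steps 4-5 collapsed: group the buffered map output by intermediate key
    # and sort, standing in for the partition/shuffle the framework performs.
    intermediate = defaultdict(list)
    for doc_id, text in inputs.items():
        for key, value in map_fn(doc_id, text):
            intermediate[key].append(value)
    return dict(reduce_fn(k, intermediate[k]) for k in sorted(intermediate))

counts = run_mapreduce({"doc1": "foo bar foo", "doc2": "bar"}, map_fn, reduce_fn)
# counts == {"bar": 2, "foo": 2}
```

In real Hadoop the same two user-defined functions are supplied as Mapper and Reducer classes, and the framework handles the distribution.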
GFS/HDFS
• Distributed file system
• Fault tolerance by replication
• Sequential reads of large data
• Random reads of small data (a few KBs)
• Write once; read multiple times
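A minimal sketch (plain Python, with hypothetical function names) of the first two bullets: a file stored as fixed-size blocks, each block replicated on several nodes for fault tolerance.

```python
def split_into_blocks(data, block_size):
    # An HDFS file is stored as a sequence of large fixed-size blocks;
    # a tiny block size is used here purely for illustration.
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(num_blocks, nodes, replication=3):
    # Fault tolerance by replication: each block lives on `replication`
    # distinct nodes. Round-robin placement stands in for the real
    # rack-aware placement policy.
    return {b: [nodes[(b + r) % len(nodes)] for r in range(replication)]
            for b in range(num_blocks)}

blocks = split_into_blocks(b"0123456789", block_size=4)      # 3 blocks
plan = place_replicas(len(blocks), ["n1", "n2", "n3", "n4"])
# losing any two nodes still leaves one replica of every block
```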
BigTable/HBase
• A distributed, 3-D table data structure – time as the third dimension (versioning)
• Rows sorted based on a primary key
• Supports updates, random reads, and real-time querying
HBase Tables
• Sorted by RowKey
• A table has one or more "column families"
• A column family is:
– a group of column qualifiers (defined at run time)
– stored as one file in HDFS
• Sparse tables are supported
• Timestamp: the 3rd dimension
• A cell is identified by Table:Rowkey:CF:CQ:timestamp
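The cell-addressing scheme can be modeled as a toy Python class (an illustration only, not the HBase client API): a sparse map from (rowkey, column family, column qualifier, timestamp) to values, where reads return the newest version.

```python
class SparseTable:
    # Toy model of an HBase table: a sparse map from the four-part cell
    # address (rowkey, column family, column qualifier, timestamp) to a
    # value. Absent cells are simply not stored, so sparse tables are cheap.
    def __init__(self):
        self.cells = {}

    def put(self, rowkey, cf, cq, value, ts):
        self.cells[(rowkey, cf, cq, ts)] = value

    def get(self, rowkey, cf, cq):
        # Reads return the newest version: timestamp is the third dimension.
        versions = [(ts, v) for (rk, f, q, ts), v in self.cells.items()
                    if (rk, f, q) == (rowkey, cf, cq)]
        return max(versions)[1] if versions else None

t = SparseTable()
t.put("doc1", "bl", "foo", "3123,4223", ts=1)
t.put("doc1", "bl", "foo", "3123,4223,9001", ts=2)  # a newer version
latest = t.get("doc1", "bl", "foo")  # the ts=2 value wins
```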
HOW?
TAPoR
Three Migration Stories
• Migrating to IaaS
1. No architectural changes; deploy the software (with a load balancer) to multiple machines (on Amazon EC2) ✗
Improves latency BUT does not address the scalability problem
• Migrating to PaaS: using Hadoop, create indices, and
2. store them on HDFS
3. store them in HBase ✓
Migrating to PaaS
Indices on HDFS
• An index has, for each word, a count of its occurrences in the collection, a list of the files the word appears in, and the byte locations within each of those files
• We need to keep key-value pairs sorted by source file
• Map: each word is emitted as a key, with its byte location and the corresponding document ID as values
• Reduce: the indices for each word are combined into a collective index, sorted alphabetically
• A separate index is sorted by word frequency (to support the top-k words operation)
Example index entries:
foo (#6): docs doc1, doc4, doc12; byte locations doc1: 3123, 4223; doc4: …
bar (#234): docs doc1, doc4, doc12, …; byte locations doc1: 3123, 4223, …; doc4: …
foo2 (#199)
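The Map and Reduce steps described above can be sketched in plain Python (an in-process simulation, not the Hadoop job that TAPoR actually ran):

```python
from collections import defaultdict

def map_index(doc_id, text):
    # Map: emit each word as a key, with the document ID and the word's
    # byte offset within that document as the value.
    offset = 0
    for word in text.split():
        yield word, (doc_id, offset)
        offset += len(word) + 1  # +1 for the separating space

def reduce_index(word, postings):
    # Reduce: combine the postings for one word into a collective entry:
    # occurrence count, the files it appears in, per-file byte locations.
    locations = defaultdict(list)
    for doc_id, off in postings:
        locations[doc_id].append(off)
    return {"count": sum(len(v) for v in locations.values()),
            "docs": sorted(locations),
            "locations": dict(locations)}

def build_index(docs):
    intermediate = defaultdict(list)
    for doc_id, text in docs.items():
        for word, posting in map_index(doc_id, text):
            intermediate[word].append(posting)
    # Sorted alphabetically, as the Reduce step above specifies.
    return {w: reduce_index(w, intermediate[w]) for w in sorted(intermediate)}

index = build_index({"doc1": "foo bar foo", "doc4": "foo"})
# A frequency-sorted view supports the top-k words operation.
top_k = sorted(index, key=lambda w: -index[w]["count"])
```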
Indices on HBase
• The row key is the document ID
• Two column families, "bl" and "spl" ("byte location" and "special keywords")
• Example: the word "foo" occurred twice in Document 1, at byte offsets 3123 and 4223
• The top-K words are stored in the "spl" column family
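A hypothetical sketch of one such row, as a plain Python structure, following the layout the slide describes (the "top1" qualifier name is an assumption for illustration):

```python
# One row for Document 1, following the layout described on the slide.
# The family names are from the slide; the values are the slide's
# example ("foo" at byte offsets 3123 and 4223 in Document 1).
row = {
    "rowkey": "doc1",            # row key = document ID
    "bl": {                      # "byte location" column family
        "foo": [3123, 4223],     # one column qualifier per word
    },
    "spl": {                     # "special keywords" column family
        "top1": "foo",           # precomputed top-K words (hypothetical qualifier)
    },
}

def byte_locations(row, word):
    # A lookup is a single-cell read: Table : rowkey : "bl" : word
    return row["bl"].get(word, [])
```

Because the index is precomputed, answering "where does 'foo' occur in doc1?" becomes one random read rather than a scan of the collection.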
Results
In Conclusion
• Infrastructure and computation must scale to “Big Data”
• Migration must become more systematic
• Migration to IaaS is simpler but less effective than migration to PaaS
• Migration to PaaS usually requires rearchitecting: data preprocessing and indexing, and reimplementing features to rely on the pre-computed indices
• The cost-effectiveness question is application-specific
Thank You!
• Eleni Stroulia
• Professor, NSERC/AITF (w. IBM support) IRC on "Service Systems Management"
• Computing Science • University of Alberta • http://ssrg.cs.ualberta.ca/
• Member of the SAVI Strategic Research Network – http://savinetwork.ca/