Eleni Stroulia - Migrating to the Hadoop Ecosystem.ppt
Migrating to the Hadoop Ecosystem: An experience report
Eleni Stroulia, Professor, NSERC/AITF (w. IBM support) IRC on "Service Systems Management"
Computing Science University of Alberta
http://ssrg.cs.ualberta.ca/
4/24/12 Eleni Stroulia, CS, UoA (Analytics, Big Data, and the Cloud)
Outline
• Background – Why?
• PaaS with "the Hadoop Ecosystem": HDFS, Hadoop, and HBase – What?
• The TAPoR Migration – How?
• Closing Remarks
WHY?
Big Data… Cheap Hardware…
• Data is growing at an unprecedented rate
– More people use the web and publish data
• Internet usage around the world: 360 million users in 2000; 2 billion in 2011 (about 1/3 of the Earth's population)
• Facebook, in 2009, was uploading 60 TB of images every week
– Things are on the Internet
• A jet engine produces 10 TB of data every 30 flight minutes
• Commodity hardware is cheap
• Owning and maintaining hardware is expensive
Internet World Usage
• 2000: 360 million users
• 2011: 2 billion (1/3 of the Earth's population)
• Source: http://www.internetworldstats.com/stats.htm
WHAT?
Cloud Infrastructure: IaaS
• Providers offer on-demand virtual computation, memory, and network resources
• Users install operating-system images and application software on the machines
• Computing is billed as a utility (pay per use)
IaaS: Infrastructure as a Service
Platform Cloud: PaaS
• Providers deliver a solution stack (on top of the infrastructure) – i.e., operating system, programming-language environment, database, web server
• Users develop, run and maintain their applications on this platform
• Some platforms are “elastic”, i.e., adapt the underlying resources based on application demands
PaaS: Platform as a Service
Software Cloud: SaaS
• Providers install and operate application software in the cloud
• Users use cloud clients to access the software
• These applications are elastic
• Work is distributed by load balancers
• Applications can be multitenant (a machine may serve more than one user organization)
• SaaS pricing is typically a flat fee per user (monthly or yearly)
SaaS: Software as a Service
Google's Solution: Scalability through Virtualization
• Key observation: Many computations are data parallel
• Solution Elements:
Google → Apache:
1. MapReduce → Hadoop MapReduce
2. GFS → HDFS
3. BigTable → HBase
MapReduce/Hadoop
• Inspired by functional programming: Input → Map() → Copy/Sort → Reduce() → Output
• The platform takes care of:
– RPC
– job scheduling
– data locality
– fault tolerance
1. The program uses the MapReduce library to split the input files into M pieces.
2. It starts master and worker nodes. The master assigns each idle worker one of the M map tasks or R reduce tasks.
3. A worker assigned a map task reads the contents of the corresponding input split, parses key/value pairs, and passes each pair to the user-defined map function. The intermediate key/value pairs produced by the map function are buffered in memory.
4. Periodically, the buffered pairs are written to local disk, partitioned into R regions by the partitioning function.
5. The master notifies a reduce worker about these locations; the reduce worker uses RPC to read the buffered data from the map workers' local disks and sorts it by the intermediate keys.
6. The reduce worker iterates over the sorted intermediate data; for each unique intermediate key, it passes the key and the intermediate values to the user's reduce function. The output of the reduce function is appended to a final output file for this reduce partition.
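As a concrete illustration, the six steps can be simulated in-process with a plain-Python word count (a sketch only, not the Hadoop API): map emits intermediate (word, 1) pairs, a grouping step stands in for the partition/sort phase, and reduce sums the values per key.

```python
from collections import defaultdict

def map_fn(doc_id, text):
    # Step 3: parse the input split and emit intermediate key/value pairs.
    for word in text.split():
        yield word, 1

def reduce_fn(word, values):
    # Step 6: combine all intermediate values for one key.
    return word, sum(values)

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Steps 4-5 collapsed: group the buffered map output by intermediate key
    # and sort, standing in for the partition/shuffle the framework performs.
    intermediate = defaultdict(list)
    for doc_id, text in inputs.items():
        for key, value in map_fn(doc_id, text):
            intermediate[key].append(value)
    return dict(reduce_fn(k, intermediate[k]) for k in sorted(intermediate))

counts = run_mapreduce({"doc1": "foo bar foo", "doc2": "bar"}, map_fn, reduce_fn)
# counts == {"bar": 2, "foo": 2}
```

In real Hadoop the same two user-defined functions are supplied as Mapper and Reducer classes, and the framework handles the distribution.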
GFS/HDFS
• Distributed file system
• Fault tolerance by replication
• Sequential reads of large data
• Random reads of small data (a few KBs)
• Write once; read multiple times
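A minimal sketch (plain Python, with hypothetical function names) of the first two bullets: a file stored as fixed-size blocks, each block replicated on several nodes for fault tolerance.

```python
def split_into_blocks(data, block_size):
    # An HDFS file is stored as a sequence of large fixed-size blocks;
    # a tiny block size is used here purely for illustration.
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(num_blocks, nodes, replication=3):
    # Fault tolerance by replication: each block lives on `replication`
    # distinct nodes. Round-robin placement stands in for the real
    # rack-aware placement policy.
    return {b: [nodes[(b + r) % len(nodes)] for r in range(replication)]
            for b in range(num_blocks)}

blocks = split_into_blocks(b"0123456789", block_size=4)      # 3 blocks
plan = place_replicas(len(blocks), ["n1", "n2", "n3", "n4"])
# losing any two nodes still leaves one replica of every block
```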
BigTable/HBase
• A distributed, 3-D table data structure – time as the third dimension (versioning)
• Rows sorted based on a primary key
• Supports updates, random reads, and real-time querying
HBase Tables
• Sorted by RowKey
• A table has one or more "column families"
• A column family is:
– a group of column qualifiers (defined at run time)
– stored as one file in HDFS
• Sparse tables are supported
• Timestamp: the 3rd dimension
• A cell is identified by Table:Rowkey:CF:CQ:timestamp
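The cell-addressing scheme can be modeled as a toy Python class (an illustration only, not the HBase client API): a sparse map from (rowkey, column family, column qualifier, timestamp) to values, where reads return the newest version.

```python
class SparseTable:
    # Toy model of an HBase table: a sparse map from the four-part cell
    # address (rowkey, column family, column qualifier, timestamp) to a
    # value. Absent cells are simply not stored, so sparse tables are cheap.
    def __init__(self):
        self.cells = {}

    def put(self, rowkey, cf, cq, value, ts):
        self.cells[(rowkey, cf, cq, ts)] = value

    def get(self, rowkey, cf, cq):
        # Reads return the newest version: timestamp is the third dimension.
        versions = [(ts, v) for (rk, f, q, ts), v in self.cells.items()
                    if (rk, f, q) == (rowkey, cf, cq)]
        return max(versions)[1] if versions else None

t = SparseTable()
t.put("doc1", "bl", "foo", "3123,4223", ts=1)
t.put("doc1", "bl", "foo", "3123,4223,9001", ts=2)  # a newer version
latest = t.get("doc1", "bl", "foo")  # the ts=2 value wins
```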
HOW?
TAPoR
Three Migration Stories
• Migrating to IaaS
1. No architectural changes; deploy the software (with a load balancer) to multiple machines (on Amazon EC2) ✗
Improves latency BUT does not address the scalability problem
• Migrating to PaaS: using Hadoop, create indices, and
2. store them on HDFS
3. store them in HBase ✓
Migrating to PaaS
Indices on HDFS
• An index has, for each word, a count of its occurrences in the collection, a list of the files the word appears in, and the byte locations within each of those files
• We need to keep key-value pairs sorted by source file
• Map: each word is emitted as a key, with its byte location and the corresponding document ID as values
• Reduce: the indices for each word are combined into a collective index, sorted alphabetically
• A separate index is sorted by word frequency (to support the top-k words operation)
Example index entries:
foo (#6): docs doc1, doc4, doc12; byte locations doc1: 3123, 4223; doc4: …
bar (#234): docs doc1, doc4, doc12, …; byte locations doc1: 3123, 4223, …; doc4: …
foo2 (#199)
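The Map and Reduce steps described above can be sketched in plain Python (an in-process simulation, not the Hadoop job that TAPoR actually ran):

```python
from collections import defaultdict

def map_index(doc_id, text):
    # Map: emit each word as a key, with the document ID and the word's
    # byte offset within that document as the value.
    offset = 0
    for word in text.split():
        yield word, (doc_id, offset)
        offset += len(word) + 1  # +1 for the separating space

def reduce_index(word, postings):
    # Reduce: combine the postings for one word into a collective entry:
    # occurrence count, the files it appears in, per-file byte locations.
    locations = defaultdict(list)
    for doc_id, off in postings:
        locations[doc_id].append(off)
    return {"count": sum(len(v) for v in locations.values()),
            "docs": sorted(locations),
            "locations": dict(locations)}

def build_index(docs):
    intermediate = defaultdict(list)
    for doc_id, text in docs.items():
        for word, posting in map_index(doc_id, text):
            intermediate[word].append(posting)
    # Sorted alphabetically, as the Reduce step above specifies.
    return {w: reduce_index(w, intermediate[w]) for w in sorted(intermediate)}

index = build_index({"doc1": "foo bar foo", "doc4": "foo"})
# A frequency-sorted view supports the top-k words operation.
top_k = sorted(index, key=lambda w: -index[w]["count"])
```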
Indices on HBase
• The row key is the document ID
• Two column families, "bl" and "spl" ("byte location" and "special keywords")
• Example: the word "foo" occurred twice in Document 1, at byte offsets 3123 and 4223
• The top-K words are stored in the "spl" column family
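A hypothetical sketch of one such row, as a plain Python structure, following the layout the slide describes (the "top1" qualifier name is an assumption for illustration):

```python
# One row for Document 1, following the layout described on the slide.
# The family names are from the slide; the values are the slide's
# example ("foo" at byte offsets 3123 and 4223 in Document 1).
row = {
    "rowkey": "doc1",            # row key = document ID
    "bl": {                      # "byte location" column family
        "foo": [3123, 4223],     # one column qualifier per word
    },
    "spl": {                     # "special keywords" column family
        "top1": "foo",           # precomputed top-K words (hypothetical qualifier)
    },
}

def byte_locations(row, word):
    # A lookup is a single-cell read: Table : rowkey : "bl" : word
    return row["bl"].get(word, [])
```

Because the index is precomputed, answering "where does 'foo' occur in doc1?" becomes one random read rather than a scan of the collection.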
Results
In Conclusion
• Infrastructure and computation must scale to “Big Data”
• Migration must become more systematic
• Migration to IaaS is simpler but less effective than migration to PaaS
• Migration to PaaS usually requires rearchitecting: data preprocessing and indexing, and reimplementing features to rely on the pre-computed indices
• The cost-effectiveness question is application-specific
Thank You!
• Eleni Stroulia
• Professor, NSERC/AITF (w. IBM support) IRC on "Service Systems Management"
• Computing Science • University of Alberta • http://ssrg.cs.ualberta.ca/
• Member of the SAVI Strategic Research Network – http://savinetwork.ca/