Eleni Stroulia - Migrating to the Hadoop Ecosystem.ppt



TRANSCRIPT

Page 1: Eleni Stroulia - Migrating to the Hadoop Ecosystem.ppt


Migrating to the Hadoop Ecosystem: An experience report

Eleni Stroulia
Professor, NSERC/AITF (w. IBM support) IRC on "Service Systems Management"

Computing Science, University of Alberta

http://ssrg.cs.ualberta.ca/

4/24/12, Eleni Stroulia, CS, UoA (Analytics, Big Data, and the Cloud)

Outline

• Background – Why?

• PaaS with "the Hadoop Ecosystem": HDFS, Hadoop, and HBase – What?

• The TAPoR Migration – How?

• Closing Remarks

Page 2

WHY?


Big Data… Cheap Hardware…

• Data is growing at an unprecedented rate – more people use the web and publish data

• Internet usage around the world: in 2000, 360 million; in 2011, 2 billion (1/3 of the earth's population)

• Facebook, in 2009, was uploading 60 TB of images every week – things are on the Internet

• A jet engine produces 10 TB of data every 30 flight minutes

• Commodity hardware is cheap
• Owning and maintaining hardware is expensive

Page 3

Internet World Usage

• 2000: 360 million
• 2011: 2 billion (1/3 of the earth's population)

• Source: http://www.internetworldstats.com/stats.htm

WHAT?


Page 4

Cloud Infrastructure: IaaS

• Providers offer on-demand virtual computation, memory, and network resources

• Users install operating-system images and application software on the machines

• Computing is billed as a utility (pay per use)

IaaS: Infrastructure as a Service

Platform Cloud: PaaS

• Providers deliver a solution stack (on top of the infrastructure), i.e., operating system, programming-language environment, database, web server

• Users develop, run, and maintain their applications on this platform

• Some platforms are "elastic", i.e., they adapt the underlying resources based on application demands

PaaS: Platform as a Service

Page 5

Software Cloud: SaaS

• Providers install and operate application software in the cloud

• Users use cloud clients to access the software

• These applications are elastic
• Work is distributed by load balancers

• Applications can be multitenant (a machine may serve more than one user organization)

• SaaS pricing is typically a (monthly or yearly) flat fee per user

SaaS: Software as a Service

Google's Solution: Scalability through Virtualization

• Key observation: many computations are data parallel

• Solution elements (Google → Apache):

1. MapReduce → Hadoop
2. GFS → HDFS
3. BigTable → HBase

Page 6


MapReduce/Hadoop

• Inspired by functional programming: Input – Map() – Copy/Sort – Reduce() – Output

• The platform takes care of:
  – RPC
  – job scheduling
  – data locality
  – fault tolerance

1. The program uses the MapReduce library to split the input files into M pieces.

2. It starts master and worker nodes. The master assigns each of the workers any one of M map tasks and R reduce tasks.

3. A worker assigned a map task reads the contents of the corresponding input split, parses key/value pairs, and passes each pair to the user-defined map function. The intermediate key/value pairs produced by the map function are buffered in memory.

4. Periodically, the buffered pairs are written to local disk, partitioned into R regions by the partitioning function.

5. The master notifies a reduce worker about these locations; the reduce worker uses RPC to read the buffered data from the local worker disks and sorts the data by the intermediate keys.

6. The reduce worker iterates over the sorted intermediate data; for each unique intermediate key, it passes the key and intermediate values to the user's reduce function. The output of the reduce function is appended to a final output file for this reduce partition.
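The six steps above can be simulated in one process. Below is a minimal Python sketch of the map/partition/sort/reduce flow, using word count as the user-defined functions; the names `map_fn`, `reduce_fn`, and `map_reduce` are illustrative and not part of Hadoop's actual API.

```python
from collections import defaultdict

# Illustrative user-defined functions (word count), as in steps 3 and 6.
def map_fn(doc_id, text):
    """Emit an intermediate (word, 1) pair for every word in the split."""
    for word in text.split():
        yield word.lower(), 1

def reduce_fn(key, values):
    """Combine all values buffered under one unique intermediate key."""
    return key, sum(values)

def map_reduce(inputs, map_fn, reduce_fn, R=2):
    # Steps 3-4: map each split; partition buffered pairs into R regions.
    regions = [defaultdict(list) for _ in range(R)]
    for doc_id, text in inputs:
        for k, v in map_fn(doc_id, text):
            regions[hash(k) % R][k].append(v)
    # Steps 5-6: each "reduce worker" sorts its region by key and calls
    # reduce_fn once per unique key, appending results to the output.
    output = []
    for region in regions:
        for k in sorted(region):
            output.append(reduce_fn(k, region[k]))
    return output

docs = [("doc1", "big data big cloud"), ("doc2", "big cloud")]
print(map_reduce(docs, map_fn, reduce_fn))
```

Note that output is sorted only within each reduce partition, as in real MapReduce; a global order would require a custom partitioner.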


GFS/HDFS

• Distributed file system

• Fault tolerance by replication

• Sequential reads of large data

• Random reads of small data (a few KBs)

• Write once; read multiple times
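The replication idea can be shown with a toy placement policy. This is only a sketch under a simplistic round-robin assumption; real HDFS placement is rack-aware, and all names here (`place_replicas`, `dn1`, `blk_1`) are hypothetical.

```python
import itertools

def place_replicas(blocks, nodes, replication=3):
    """Toy placement: assign each block to `replication` nodes, round-robin.
    Assumes len(nodes) >= replication so replicas land on distinct nodes."""
    ring = itertools.cycle(nodes)
    return {block: [next(ring) for _ in range(replication)] for block in blocks}

placement = place_replicas(["blk_1", "blk_2"], ["dn1", "dn2", "dn3", "dn4"])
# blk_1 -> dn1, dn2, dn3; blk_2 -> dn4, dn1, dn2

# Fault tolerance by replication: lose one DataNode, every block survives.
dead = "dn2"
survivors = {b: [n for n in ns if n != dead] for b, ns in placement.items()}
assert all(len(ns) >= 2 for ns in survivors.values())
print(survivors)
```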


Page 7

BigTable/HBase

• A distributed, 3-D table data structure – time as the third dimension (versioning)

• Rows are sorted based on a primary key
• Supports:
  – updates
  – random reads
  – real-time querying

HBase Tables

• Sorted by RowKey
• A table has one or more "column families"
• A column family is:
  – a group of column qualifiers (defined at run time)
  – stored as one file in HDFS
• Sparse tables are supported
• Timestamp: the third dimension
• A cell is identified by Table:RowKey:CF:CQ:timestamp
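The cell model above (RowKey:CF:CQ:timestamp with versioning) can be mimicked by a small in-memory class. This is a sketch, not the HBase client API; `ToyHBaseTable` and the sample data are illustrative only.

```python
import time

class ToyHBaseTable:
    """Toy in-memory model of an HBase table: cells keyed by
    (row_key, column_family, column_qualifier), each holding a list of
    (timestamp, value) versions, newest first. Sparse by construction:
    only cells that were actually written occupy any space."""
    def __init__(self, column_families):
        self.column_families = set(column_families)  # fixed at creation
        self.cells = {}

    def put(self, row_key, cf, cq, value, ts=None):
        assert cf in self.column_families  # qualifiers, by contrast, are free
        versions = self.cells.setdefault((row_key, cf, cq), [])
        versions.append((ts if ts is not None else time.time(), value))
        versions.sort(reverse=True)  # newest version first

    def get(self, row_key, cf, cq, ts=None):
        """Return the newest value, or the newest no later than ts."""
        for t, v in self.cells.get((row_key, cf, cq), []):
            if ts is None or t <= ts:
                return v
        return None

t = ToyHBaseTable(["bl", "spl"])
t.put("doc1", "bl", "foo", "3123,4223", ts=1)
t.put("doc1", "bl", "foo", "3123,4223,9001", ts=2)
print(t.get("doc1", "bl", "foo"))        # newest version: "3123,4223,9001"
print(t.get("doc1", "bl", "foo", ts=1))  # as of timestamp 1: "3123,4223"
```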


Page 8

HOW?


TAPoR


Page 9

Three Migration Stories

• Migrating to IaaS:
1. No architectural changes; deploy the software (with a load balancer) to multiple machines (on Amazon EC2)

   Improves latency BUT does not address the scalability problem

• Migrating to PaaS: using Hadoop, create indices ✓
2. Store on HDFS
3. Store to HBase

Migrating to PaaS


Page 10

Indices on HDFS

• An index has, for each word, a count of its occurrences in the collection, a list of the files that word appears in, and the byte locations within each of those files.

• We need to keep key-value pairs sorted by source file.
• Map: each word is emitted as a key, with its byte location and the corresponding document ID as values.
• Reduce: the indices for each word are combined into a collective index, sorted alphabetically.

• A separate index is sorted by word frequency (to support the top-k words operation)

word   count   documents              byte locations
foo    #6      doc1, doc4, doc12      doc1: 3123, 4223; doc4: …
bar    #234    doc1, doc4, doc12, …   doc1: 3123, 4223, …; doc4: …
foo2   #199
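The map and reduce steps described above can be sketched as a single-process Python approximation of the indexing job. The function names are illustrative, and the byte-offset arithmetic assumes single-space-separated text; the real job runs as Hadoop map and reduce tasks over HDFS splits.

```python
from collections import defaultdict

def index_map(doc_id, text):
    """Map: emit (word, (doc_id, byte_offset)) for each word occurrence."""
    offset = 0
    for word in text.split(" "):
        yield word, (doc_id, offset)
        offset += len(word) + 1  # +1 for the single-space separator

def index_reduce(word, postings):
    """Reduce: combine one word's postings into a collective index entry:
    (word, occurrence count, {doc_id: [byte offsets]})."""
    by_doc = defaultdict(list)
    for doc_id, off in postings:
        by_doc[doc_id].append(off)
    return word, len(postings), dict(by_doc)

def build_index(docs):
    # Shuffle/group stand-in: gather every posting list under its word.
    grouped = defaultdict(list)
    for doc_id, text in docs:
        for word, posting in index_map(doc_id, text):
            grouped[word].append(posting)
    # Collective index, sorted alphabetically by word.
    return [index_reduce(w, grouped[w]) for w in sorted(grouped)]

index = build_index([("doc1", "foo bar foo"), ("doc4", "foo")])
for word, count, locations in index:
    print(word, count, locations)
```

The frequency-sorted index mentioned above would simply re-sort these entries by the count field.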

Indices on HBase

• The row key is the document ID
• Two column families, "bl" and "spl" ("byte location" and "special keywords")

• Example: the word "foo" occurred twice in Document 1, at byte offsets 3123 and 4223

• The top-K words are stored in the "spl" column family
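The schema just described can be mocked with nested dicts standing in for HBase. The "foo" offsets come from the slide; the "top3" qualifier and its contents are hypothetical placeholders, and `byte_locations` is an illustrative helper, not an HBase call.

```python
# row key = document ID; column family "bl" maps each word (qualifier)
# to its byte offsets; "spl" holds precomputed special keywords (top-K).
table = {
    "doc1": {
        "bl":  {"foo": [3123, 4223], "bar": [3123]},   # offsets from the slide
        "spl": {"top3": ["bar", "foo", "foo2"]},       # hypothetical top-K cell
    },
}

def byte_locations(table, doc_id, word):
    """Look up one cell by Table:RowKey:'bl':word, empty list if absent."""
    return table.get(doc_id, {}).get("bl", {}).get(word, [])

print(byte_locations(table, "doc1", "foo"))  # [3123, 4223]
```

Because the row key is the document ID, fetching all of one document's index data is a single-row read, which is what makes real-time querying cheap.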

Page 11

Results


In Conclusion

• Infrastructure and computation must scale to "Big Data"

• Migration must become more systematic
• Migration to IaaS is simpler but less effective than migration to PaaS

• Migration to PaaS usually requires rearchitecting for:
  – data preprocessing and indexing
  – reimplementation of features to rely on pre-computed indices

• The cost-effectiveness question is application specific

Page 12

Thank You!

• Eleni Stroulia
• Professor, NSERC/AITF (w. IBM support) IRC on "Service Systems Management"

• Computing Science
• University of Alberta
• http://ssrg.cs.ualberta.ca/

• Member of the SAVI Strategic Research Network - http://savinetwork.ca/