analytics in the cloud - usenix · 2019-02-25 · analytics in the cloud dow we really need to...

15
© 2002 IBM Corporation IBM Research Storage Systems Research © 2008 IBM Corporation Analytics in the cloud Dow we really need to reinvent the storage stack? Image courtesy NASA / ESA R. Ananthanarayanan, Karan Gupta, Prashant Pandey, Himabindu Pucha, Prasenjit Sarkar, Mansi Shah, Renu Tewari

Upload: others

Post on 24-Jun-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Analytics in the cloud - USENIX · 2019-02-25 · Analytics in the cloud Dow we really need to reinvent the storage stack? Image courtesy NASA / ESA R. Ananthanarayanan, Karan Gupta,

© 2002 IBM Corporation

IBM Research

Storage Systems Research © 2008 IBM Corporation

Analytics in the cloud

Dow we really need to reinvent the storage stack?

Image courtesy NASA / ESA

R. Ananthanarayanan, Karan Gupta, Prashant Pandey,

Himabindu Pucha, Prasenjit Sarkar, Mansi Shah, Renu

Tewari

Page 2: Analytics in the cloud - USENIX · 2019-02-25 · Analytics in the cloud Dow we really need to reinvent the storage stack? Image courtesy NASA / ESA R. Ananthanarayanan, Karan Gupta,

IBM Research

Storage Systems Research © 2008 IBM Corporation

Data-Intensive Internet Scale Applications

Typical Applications

• Web-scale search, indexing, mining

• Genomic sequencing

• brain-scale network simulations

Page 3: Analytics in the cloud - USENIX · 2019-02-25 · Analytics in the cloud Dow we really need to reinvent the storage stack? Image courtesy NASA / ESA R. Ananthanarayanan, Karan Gupta,

IBM Research

Storage Systems Research © 2008 IBM Corporation

Data-Intensive Internet Scale Applications

� Key Requirements

• Scale to very large data sets

• Platform needs to scale to 1000’s of nodes

• Built of commodity hardware for cost efficiency

• Tolerate failures during “every” job execution

• Support data shipping to reduce network requirements

Page 4: Analytics in the cloud - USENIX · 2019-02-25 · Analytics in the cloud Dow we really need to reinvent the storage stack? Image courtesy NASA / ESA R. Ananthanarayanan, Karan Gupta,

IBM Research

Storage Systems Research © 2008 IBM Corporation

MapReduce for analytics

� MapReduce is emerging as a model for large-scale analytics application

� Important design goals are extreme-scalability and fault-tolerance

� Storage layer is separated and has well-defined requirements

QuickTime™ and a decompressor

are needed to see this picture.

Image source: http://developer.yahoo.com/hadoop/tutorial/module1.html

Page 5: Analytics in the cloud - USENIX · 2019-02-25 · Analytics in the cloud Dow we really need to reinvent the storage stack? Image courtesy NASA / ESA R. Ananthanarayanan, Karan Gupta,

IBM Research

Storage Systems Research © 2008 IBM Corporation

MapReduce Data-store requirements

� Provide a hierarchical namespace with directories and files

� Allow applications to read/write data to files

� Protect data availability and reliability in the face of node and

disk failures

� Provide high bandwidth access to reasonably-sized chunks of

data to all compute nodes (not necessarily all-to-all)

� Provide chunk access-affinity information to allow proper

scheduling of tasks

Page 6: Analytics in the cloud - USENIX · 2019-02-25 · Analytics in the cloud Dow we really need to reinvent the storage stack? Image courtesy NASA / ESA R. Ananthanarayanan, Karan Gupta,

IBM Research

Storage Systems Research © 2008 IBM Corporation

Data store options: Cluster FS Vs Specialized FS

YesYesCommodity hardware compliant

Yes

Specialized FS

YesScaling

Cluster FS

Page 7: Analytics in the cloud - USENIX · 2019-02-25 · Analytics in the cloud Dow we really need to reinvent the storage stack? Image courtesy NASA / ESA R. Ananthanarayanan, Karan Gupta,

IBM Research

Storage Systems Research © 2008 IBM Corporation

YesNo Mature management tools

YesYesCommodity hardware compliant

YesNoTraditional application support

Yes

Specialized FS

YesScaling

Cluster FS

Data store options: Cluster FS Vs Specialized FS

Page 8: Analytics in the cloud - USENIX · 2019-02-25 · Analytics in the cloud Dow we really need to reinvent the storage stack? Image courtesy NASA / ESA R. Ananthanarayanan, Karan Gupta,

IBM Research

Storage Systems Research © 2008 IBM Corporation

YesNo Mature management tools

YesYesCommodity hardware compliant

YesNoTraditional application support

Yes

Yes

Specialized FS

NoTuned for Hadoop

YesScaling

Cluster FS

Data store options: Cluster FS Vs Specialized FS

Page 9: Analytics in the cloud - USENIX · 2019-02-25 · Analytics in the cloud Dow we really need to reinvent the storage stack? Image courtesy NASA / ESA R. Ananthanarayanan, Karan Gupta,

IBM Research

Storage Systems Research © 2008 IBM Corporation

Modifying a Cluster Filesystem for MapReduce

� GPFS

• Mature filesystem - many large production installations

• High performance, Highly scalable

• Reliability features focused on SAN environments

� Supports rack-aware 2-way replication

• POSIX interface

• Supports shared disk (SAN) and shared-nothing setups

• Not optimized for MapReduce workloads

� Does not expose data location information

� largest block size = 16 MB

� Changes for Hadoop:

• Make blocks bigger

• Let the platform know where the big blocks are

• Optimize replication and placement to reduce network usage

Page 10: Analytics in the cloud - USENIX · 2019-02-25 · Analytics in the cloud Dow we really need to reinvent the storage stack? Image courtesy NASA / ESA R. Ananthanarayanan, Karan Gupta,

IBM Research

Storage Systems Research © 2008 IBM Corporation

Key change: Metablocks

� Works for many workloads

• Small FS blocks (eg: 512K)

• Large Application blks (eg: 64M)

� New allocation scheme

• Metablock size granularity for

wide striping

� Block map operates on large

Metablock size

� All FS operations operate on small

regular block size

� Additional changes to provide

block location information and

“write affinity”

Application “meta-block”FS block

New allocation policy

Page 11: Analytics in the cloud - USENIX · 2019-02-25 · Analytics in the cloud Dow we really need to reinvent the storage stack? Image courtesy NASA / ESA R. Ananthanarayanan, Karan Gupta,

IBM Research

Storage Systems Research © 2008 IBM Corporation

MapReduce performance

Test bed

iDataPlex: 42 nodes

8 cores, 8GB RAM

4+1 disks per-node

16 nodes

160 GB data

(replication factor = 2)

Hadoop : version 0.18.1

GPFS: version pre3.3

0.00

500.00

1000.00

1500.00

2000.00

2500.00

Grep TeraGen Terasort

Time (sec)

GPFS rack=1

HDFS rack=1

GPFS rack=2

HDFS rack=2

Page 12: Analytics in the cloud - USENIX · 2019-02-25 · Analytics in the cloud Dow we really need to reinvent the storage stack? Image courtesy NASA / ESA R. Ananthanarayanan, Karan Gupta,

IBM Research

Storage Systems Research © 2008 IBM Corporation

Impact on traditional workloads

0

0.5

1

1.5

2

2.5

512K 512K, metablock 16M

Normalized perf

Normalized Random perf Normalized Seq read perf

iDataPlex: 42 nodes

8 cores, 8GB RAM

4+1 disks per-node

GPFS: version pre3.3

Bonnie filesystem benchmark

Page 13: Analytics in the cloud - USENIX · 2019-02-25 · Analytics in the cloud Dow we really need to reinvent the storage stack? Image courtesy NASA / ESA R. Ananthanarayanan, Karan Gupta,

IBM Research

Storage Systems Research © 2008 IBM Corporation

Things that didn’t work

� Large filesystem block-size

� Turn-off Prefetching

� Create alignment of records

to block boundaries 0

100

200

300

400

500

600

700

800

HDFS

GPFS

16M

GPFS

16ML

GPFS

16MLA

GPFS

16MLA

NP

File System

0

0.5

1

1.5

2

2.5

512K 16M 16M, no-prefetch

Time (sec)

Normalized Random perf Normalized Seq read perf

Page 14: Analytics in the cloud - USENIX · 2019-02-25 · Analytics in the cloud Dow we really need to reinvent the storage stack? Image courtesy NASA / ESA R. Ananthanarayanan, Karan Gupta,

IBM Research

Storage Systems Research © 2008 IBM Corporation

Advantages of traditional filesystems

� Traditional filesystems have solved many hard problems like

access control, quotas, snapshots …

� Allow traditional and MapReduce applications to share the

same input data.

� Exploit Filesystem tools & scripts based on “regular”

filesystems.

� Re-use of Backup/Archive solutions built around particular

filesystems.

� Mixed analytics pipelines.

Load AnalyzeCrawl Output Serve

Crawl Analyze Serve

Using a MapReduce-specific filesystem (e.g. HDFS):

Using a general-purpose filesystem (e.g. GPFS):

Crawler writes to a traditional filesystem Into mapreduce filesystem Back to traditional fielsystem

Page 15: Analytics in the cloud - USENIX · 2019-02-25 · Analytics in the cloud Dow we really need to reinvent the storage stack? Image courtesy NASA / ESA R. Ananthanarayanan, Karan Gupta,

IBM Research

Storage Systems Research © 2008 IBM Corporation

Conclusion

� MapReduce platforms can use traditional filesystems without

loss of performance.

� There are important reasons why traditional filesystems are

attractive to users of MapReduce.