
Page 1: HPC Growing Pains

HPC Growing Pains: IT Lessons Learned from the Biomedical Data Deluge

John L. Wofford, Center for Computational Biology & Bioinformatics

Columbia University

March 27, 2012

Page 2: HPC Growing Pains

What is C2B2?

• Internationally recognized biomedical computing center.

• Broad range of computational biomedical and biology research, from biophysics to genomics.

• More than 15 labs and nearly 200 faculty, staff, students, and postdocs across multiple campuses and 8 departments.

• IT staff of 8, covering everything from desktop to HPC & datacenter.

Affiliates


Page 3: HPC Growing Pains

HPC Growing Pains

                             Before (2008)   After (2012)
Total CPU cores              ~500            ~4,500
Largest cluster: CPU         400 cores       ~4,000 cores
Largest cluster: memory      800 GB          8 TB
Annual CPU-hours             2 M             >50 M
Average daily active users   20              120
Storage capacity             30 TB           ~1 PB
Data center space            800 sq. ft.     4,000 sq. ft.

Over the past 3 years we have grown our HPC resources by an order of magnitude, driven largely by genomic data storage and processing demands.


Page 4: HPC Growing Pains

Outline

I. Intro: Biomedical data growth

II. Storage challenges

II.1. Performance

II.2. Capacity

II.3. Data integrity

III. Conclusions

IT Lessons Learned from the Biomedical Data Deluge


Page 5: HPC Growing Pains

What data deluge?

• A few stock facts:

• With a collective 269 petabytes of data, education was among the top 10 sectors of the U.S. economy storing the largest amounts of data in 2009, according to a McKinsey Global Institute survey.

• The world will generate 1.8 zettabytes of data this year alone, according to IDC's 2011 Digital Universe survey.

• Worldwide data volume is growing by at least 59% annually, a Gartner report estimates, outrunning Kryder's law for growth in disk capacity per unit cost.

• Biomedical data, driven primarily by gene sequencing, is growing dramatically faster than both the industry average and Kryder's law.

• ...and not only does that data need to be stored, it needs to be heavily analyzed, demanding both performance and capacity.
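The compounding is what hurts. A minimal sketch of the arithmetic, using the 59% growth figure cited above together with an assumed starting size and an assumed Kryder-style improvement rate (both illustrative, not figures from the talk):

# Back-of-the-envelope: data growing ~59%/yr versus an assumed ~25%/yr
# improvement in disk capacity per dollar shows why "just buy bigger disks
# next year" stops working.

data_tb = 30.0          # assumed starting usage in TB (roughly C2B2 in 2008)
data_growth = 1.59      # 59% annual data growth (the Gartner estimate above)
disk_gain = 1.25        # assumed annual gain in capacity per dollar

for year in range(5):
    usage = data_tb * data_growth ** year
    relative_spend = (data_growth / disk_gain) ** year
    print(f"year {year}: ~{usage:6.0f} TB, relative spend {relative_spend:4.1f}x")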


Page 6: HPC Growing Pains

Sequence data production rates


Page 7: HPC Growing Pains

From Kahn S.D. On the future of genomic data. Science 2011;331:728-729.


Page 8: HPC Growing Pains

[Figure: C2B2 data growth, June '08 - March '12. Y axis: data usage (TB), 0-3000; X axis: months since June '08, 0-60. Series: usage (logical), usage trend, raw requirement (50% overhead), raw capacity, and the industry trend (59% annual growth).]


Page 9: HPC Growing Pains

The storage challenge: design a storage system that can:

1. Perform well enough to analyze the data (i.e., stand up to a Top500 supercomputer);

2. Scale from terabytes to petabytes (without having to constantly rebuild);

3. Protect important data (from crashes, users, and floods).


Page 10: HPC Growing Pains

Performance

• We have 4000 CPUs working around the clock on analyzing data; we want to keep them all fed with data all of the time.

• Our workload is “random” and not well behaved. It’s notoriously difficult to design for this kind of workload. Ideally, we want a solution that can be flexible as our workloads change.

• The more disks we have spinning in parallel, the better the performance, so we're going to need a lot of disks. As a rough heuristic, one disk per compute node would mean ~500 disks (a rough sizing sketch follows below).

• But, to make that many disks useful, we’re going to need a lot of processing and network capabilities.

For parallel processes you need parallel file access.
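To put rough numbers on "a lot of disks", here is a back-of-the-envelope sizing sketch; the per-core demand and per-disk figures are assumptions for illustration, not measurements from our cluster:

# How many spindles does it take to keep ~4000 cores fed?
cores = 4000
mb_per_s_per_core = 1.0    # assumed average sustained streaming demand per core
iops_per_core = 25         # assumed small random-I/O demand per core

disk_mb_per_s = 100        # assumed sequential throughput of one SATA disk
disk_iops = 120            # assumed random IOPS of one 7200 rpm disk

need_mb = cores * mb_per_s_per_core
need_iops = cores * iops_per_core

print(f"streaming: {need_mb / 1000:.1f} GB/s -> at least {need_mb / disk_mb_per_s:.0f} disks")
print(f"random   : {need_iops} IOPS -> at least {need_iops / disk_iops:.0f} disks")
# With numbers like these, random I/O (not streaming throughput) sets the spindle count.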


Page 11: HPC Growing Pains

Traditional NAS architecture: a single NAS head with multiple disk arrays

• Pro

• Support: Time-tested architecture with many major, competing vendors.

• Capacity scaling: relatively easy on modern NAS.

• Con

• Performance scaling is difficult and unpredictable.

• Management: Storage pools must be managed and tuned.

• Reliability: the NAS head is a single point of failure.

[Diagram: Traditional NAS, single controller with many arrays. A NAS head (CPU, cache, network) processes network file requests (NFS, CIFS, etc.) and manages SAN storage pools; disk arrays (JBOD or RAID) attach over a SAN or direct interconnect (FC, SATA, ...); storage clients (cluster nodes, servers, desktops, ...) connect to the NAS head. Example vendors: NetApp, BlueArc, EMC, ...]


Page 12: HPC Growing Pains

Clustered NAS architecture: a single filesystem distributed across many nodes.

• Pro

• Capacity scaling: new nodes automatically integrate.

• Performance scaling: new nodes add CPU, Cache and network performance.

• Reliability: most architectures can survive multiple node failures.

• Con

• Support: Relatively new technology. Few vendors (but rapidly growing).

[Diagram: Clustered NAS. A single filesystem is presented by multiple nodes, each with its own CPU, cache, network interface, and disk pool; a high-speed backend network transfers data between nodes; storage clients (cluster nodes, servers, desktops, ...) can connect to any node. Example vendors: Isilon, Panasas, Gluster, ...]
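To make "a single filesystem presented by multiple nodes" concrete, here is a toy sketch of deterministic data placement; it is not any vendor's actual algorithm, and the node names are hypothetical.

# Toy illustration only: deterministic placement of paths onto storage nodes.
# Real clustered NAS systems use consistent hashing, striping, or directory
# services so that adding a node moves only a fraction of the data.
import hashlib

NODES = ["nas-node-1", "nas-node-2", "nas-node-3"]   # hypothetical node names

def owner(path: str, nodes=NODES) -> str:
    digest = hashlib.md5(path.encode()).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]

for p in ["/genomes/hg19/chr1.fa", "/projects/lab_a/run42/reads.fastq"]:
    print(p, "->", owner(p))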


Page 13: HPC Growing Pains

Clustered NAS doesn't solve everything...

• In late 2009 we had scaled to where we thought we should be, but our system was unresponsive, with constant, very high load.

• More puzzling, the load didn't seem to have any correlation with network traffic, disk throughput, or even load on the compute cluster.


Page 14: HPC Growing Pains

What we found (with the help of good analytics)


Page 15: HPC Growing Pains

Namespace reads: namespace operations consume CPU and waste I/O capability.

• It's common in biomedical data to have thousands or even millions of small files in a single project.

• We have ~500M files, with an average file size of less than 8 KB.

• Many genome “databases” are directory structures of flat files that get “indexed” by filename (NCBI, for instance, hosts > 20k files in their databases).

• Our system was thrashing, and we weren't getting much I/O done: the 40% of operations that were namespace reads were killing our performance.

Distribution of protocol operations: namespace read 40%, read 32%, write 15%, other 13%.
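A minimal sketch of how to characterize this kind of workload on your own filesystem; the path is a placeholder:

# Walk a directory tree and report file count and average file size: the two
# numbers that told us our workload was dominated by tiny files.
import os

def profile(root):
    count, total_bytes = 0, 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                total_bytes += os.path.getsize(os.path.join(dirpath, name))
                count += 1
            except OSError:
                pass  # file disappeared or is unreadable; skip it
    avg_kb = (total_bytes / count / 1024) if count else 0
    print(f"{root}: {count} files, average size {avg_kb:.1f} KB")

profile("/path/to/project")   # placeholder path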


Page 16: HPC Growing Pains

SSD-Accelerated Namespace: namespace data lives on SSDs (very low seek time); ordinary data lives on ordinary disks.

[Diagram: SSD-accelerated namespace. Each SSD-enabled node stores filesystem metadata on an SSD and file data on ordinary disks; solid-state disks provide fast seek times for namespace reads.]

• Namespace reads are seek intensive (not throughput intensive).

• SSD seek times are generally more than 40x faster than those of spinning disks.

• We were able to spread our filesystem metadata over SSDs on our nodes, dramatically increasing namespace performance and decreasing system load.

• We experienced an immediate ~8x increase in filesystem responsiveness, and an overall increase in system performance.

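A minimal microbenchmark sketch of the effect described above: time pure namespace operations (stat calls), which touch only filesystem metadata, over a sample of files. The mount point is a placeholder.

# Time stat() calls over a sample of files: a pure namespace workload that hits
# filesystem metadata (and so the SSDs) without reading any file data.
import os
import random
import time

def collect_paths(root, limit=100_000):
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        paths.extend(os.path.join(dirpath, f) for f in filenames)
        if len(paths) >= limit:
            break
    return paths

def avg_stat_latency(paths, samples=10_000):
    chosen = random.choices(paths, k=samples)
    start = time.perf_counter()
    for p in chosen:
        try:
            os.stat(p)
        except OSError:
            pass
    return (time.perf_counter() - start) / samples

paths = collect_paths("/path/to/filesystem")   # placeholder mount point
if paths:
    print(f"average stat() latency: {avg_stat_latency(paths) * 1e6:.0f} microseconds")
else:
    print("no files found under that path")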


Page 17: HPC Growing Pains

Capacity

1. While a high-performance clustered NAS naturally scales capacity along with performance, it's an expensive way to build capacity.

2. We don’t want to have the “big” filesystem and the “fast” filesystem. We want everyone to see the same files from everywhere.

3. High-performance systems benefit from uniform hardware. Capacity scaling benefits from being able to use the latest, biggest, densest disks.

4. In fact, a clustered NAS is typically built from entirely uniform hardware, so how do you upgrade it without time-consuming data migrations?

How we scaled a single filesystem from 168 TB to 1 PB


Page 18: HPC Growing Pains

Multi-tiered clustered NAS: a single filesystem distributed across many pools of nodes.

• Multiple “pools” of nodes share a single filesystem namespace.

• Different pools can have different performance/capacity characteristics, allowing capacity and performance to scale independently (a sketch of a rule-based placement policy follows the diagram below).

• New pools can be added and old ones removed, allowing seamless upgrades.

• All pools are active, so nodes configured for large capacity can still serve data to low-demand devices.

[Diagram: Multi-tiered clustered NAS with multiple pools and a single namespace. Rule-based data migration moves files between a high-speed storage pool (SSD-accelerated storage nodes serving compute clusters, sequencers, etc.) and a high-capacity pool (nearline storage nodes serving infrastructure servers, desktops, etc.), all on a shared backend network. Clients include high-performance compute clusters (blade chassis, compute nodes, virtual machines), data-generating research equipment (e.g. gene sequencers), server infrastructure, and desktops & workstations.]
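As a concrete illustration of the rule-based migration shown in the diagram, here is a minimal sketch of a placement policy. Our filesystem's policy engine handles this internally; the pool names, file extensions, and age threshold below are illustrative assumptions, not our actual rules.

# Decide which pool a file should live in based on type and last-access age.
# In a tiered clustered NAS the file's path stays the same; only the pool that
# holds its blocks changes.
import os
import time

HOT_POOL = "high-speed"                    # hypothetical pool names
COLD_POOL = "nearline"
HOT_SUFFIXES = (".fastq", ".bam", ".fa")   # assumed "active analysis" file types
AGE_LIMIT_DAYS = 90                        # assumed threshold

def target_pool(path):
    """Return the pool a file should live in under these illustrative rules."""
    age_days = (time.time() - os.stat(path).st_atime) / 86400
    if age_days >= AGE_LIMIT_DAYS:
        return COLD_POOL                   # untouched for a while -> nearline
    if path.endswith(HOT_SUFFIXES):
        return HOT_POOL                    # active analysis data -> fast pool
    return COLD_POOL                       # everything else can live on nearline

if __name__ == "__main__":
    print(target_pool(__file__))           # demo on this script itself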


Page 19: HPC Growing Pains

Evolution of a filesystem

[Chart: evolution of filesystem raw capacity vs. requirement, 0-1000 TB over roughly 50 months.]

2008, inception: 168 TB
2009, new nodes: 276 TB
2010 Q1, new nodes (separate cluster): 496 TB
2010 Q3, merged pools + new nodes (SSD): 672 TB
2010 Q4, swapped original nodes for new nodes: 648 TB
2011, capacity upgrade (denser nodes): 984 TB


Page 20: HPC Growing Pains

Caveat: A single namespace has challenges

• Since the namespace is never refreshed, it tends to grow and grow. We currently have > 500 million files in our filesystem.

• While it’s nice to have all of your files in one space, it takes a lot of effort to keep it organized.

• In the initial deployment, we spent roughly 30x longer planning the filesystem structure than deploying the hardware.

• If you plan your filesystem poorly, it could take a long time to relocate or remove all of those files.


Page 21: HPC Growing Pains

Data Integrity

1. Users: “Oops! I didn’t mean to delete that!”

2. Glitches: “Error mounting: mount: wrong fs type, bad option, bad superblock on /dev/... ”

3. Floods: “Who installed the water main over the data center?!”

How do you protect large-scale, important data from users, glitches & floods?


Page 22: HPC Growing Pains

Tape vs. Disk-to-Disk: tape is dead, right?

[Diagram: tape backup (NAS file servers feed a backup server, which writes to a tape library) vs. disk-to-disk backup (a primary NAS replicates to a secondary NAS). The two approaches are compared on: fast, easy to maintain, reliable, cheap, low-power, long shelf life.]


Page 23: HPC Growing Pains

• Using only tape is impractical: LTO-5 writes roughly 1 TB in ~6 hours, so 200 TB takes about 50 days (the arithmetic is sketched below). You can split the job across drives, but that becomes a management nightmare.

• Using only disk is cost prohibitive: compare the price of a complete disk system with that of a complete tape system. Plus:

• you need to keep the disks powered;

• there's no easy (or safe) way to archive disks;

• it has to grow faster than primary storage (if you want historical archives).

Neither option is ideal on its own.
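For reference, the arithmetic behind the tape bullet above, taking the ~6 hours per TB figure as given:

# LTO-5 at roughly 1 TB per ~6 hours of sustained writing (the figure cited above).
hours_per_tb = 6.0
dataset_tb = 200

single_drive_days = dataset_tb * hours_per_tb / 24
print(f"{dataset_tb} TB on one drive : {single_drive_days:.0f} days")

for drives in (2, 4, 8):
    print(f"{dataset_tb} TB on {drives} drives: {single_drive_days / drives:.1f} days "
          f"(but {drives}x the tapes and jobs to track)")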


Page 24: HPC Growing Pains

Our middle ground: snapshots + replication + tape = protection from:

• Users: frequent snapshots on the source provide easy “oops” recovery to the user.

• Glitches: replication provides rapid short-term recovery, and added snapshots extend the replication archive into the mid-term (~6 mo.).

• Floods: tape backup provides cheap, reliable archival of data for large-scale disaster recovery (or for that important file from ’06). Leaving backup windows flexible keeps tape manageable (a retention-policy sketch follows the diagram below).

[Diagram: data backup path. The primary storage cluster (SSD-accelerated storage nodes; short-term snapshots for easy user recovery) replicates frequently (daily) to a replication cluster (SATA storage nodes; live replication of critical data with historical snapshots). The replication cluster is lazily archived (~6 mo.) to tape for long-term storage, and one copy per year goes to an offsite disaster-recovery tape library.]
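A minimal sketch of the tiered retention idea described above: daily snapshots in the short term, monthlies out to roughly six months on the replication cluster, and one yearly copy on tape. The exact windows below are illustrative assumptions, not our production schedule.

# Decide whether a snapshot taken on `snap` should still be retained on `today`.
from datetime import date, timedelta

def keep(snap: date, today: date) -> bool:
    age = (today - snap).days
    if age <= 14:
        return True                              # daily snapshots for two weeks
    if age <= 180:
        return snap.day == 1                     # monthlies out to ~6 months
    return snap.month == 1 and snap.day == 1     # one copy per year, kept on tape

today = date(2012, 3, 27)
for days_ago in (3, 45, 86, 200, 450):
    snap = today - timedelta(days=days_ago)
    print(snap, "keep" if keep(snap, today) else "expire")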


Page 25: HPC Growing Pains

Backup Infrastructure

The big picture:

[Diagram: the big picture. A multi-tiered, scale-out storage architecture reaches from the HPC infrastructure to the desktop: high-performance compute clusters (blade chassis with compute nodes and virtual machines), data-generating research equipment (e.g. gene sequencers), server and virtualization infrastructure, and desktops & workstations all connect over a 10 Gbps aggregation network to the primary storage cluster (multiple pools, single filesystem, with rule-based migration between the high-speed SSD-accelerated pool and the nearline pool). The primary cluster replicates critical data to a replication cluster, which feeds tape storage for long-term archival and an offsite tape library for disaster recovery.]



Page 28: HPC Growing Pains

Conclusion: putting it all together.

• With clustered NAS and SSD acceleration, we regularly see filesystem throughput in excess of 10 Gbps and IOPS well over 500k without issue.

• So far we've managed to stay ahead of our data-growth curve with multi-tiered storage. We plan to at least double capacity in the next 6-12 months with no major architectural changes.

• With the combination of snapshots, disk-to-disk replication, and tape, we get daily backups of all important data as well as long-term archives.

• Thank you! Questions?
