Scientific Computing on Amazon Web Services Dave Cuthbert Solutions Architect [email protected]

Upload: barrie-hutchinson

Post on 31-Dec-2015


TRANSCRIPT


Scientific Computing on Amazon Web Services
Dave Cuthbert, Solutions Architect
[email protected]

Two Facets (That I'll Mention Today)
Facet 1: Availability of scientific applications
General-purpose analysis: Python (SciPy, NumPy, IPython notebooks). Octave, R, C, C++, Fortran, ...
Databases/data formats: NetCDF, HDF, Cassandra, MongoDB, CouchDB, Redis, Berkeley DB, MySQL/MariaDB, PostgreSQL, ...
Commercial applications are widely available. Licensing can be thorny.

Two Facets (That I'll Mention Today)
Facet 2: Cycles
What everyone thinks: HPC.
Mental trap 1: It's not real science if it's not running on an HPC cluster.
Mental trap 2: If your lab has an HPC cluster, you should be coding for it.

So everyone demands cluster time, and...

A Typical HPC Cluster Workload

But What Is HPC, Anyway?
If I wanted to start a flame war: What is real HPC?

[CLICK] Hadoop and Spark
[CLICK] InfiniBand and low-latency interconnects
[CLICK] GPU
[CLICK] Vector processors, SIMD
[CLICK] TPC

HPC Is Not A Panacea!
[Diagram labels: Hadoop, GPU, Low Latency]

It's A Trap!
Facet 2: Cycles
What everyone thinks: HPC.
Mental trap 1: It's not real science if it's not running on an HPC cluster.
Mental trap 2: If your lab has an HPC cluster, you should be coding for it.
The right systems for the job.

How AWS Is Attacking The Problem

[Diagram: SLURM cluster on AWS]
Region: us-west-2 (Oregon)
VPC space: 192.168.0.0/16
Availability Zone us-west-2a: subnet 192.168.0.0/24
Availability Zone us-west-2b: subnet 192.168.1.0/24
Availability Zone us-west-2c: subnet 192.168.2.0/24
Instances (Amazon Linux with SLURM AMI): controller, node-0 through node-11
VBL S3 bucket: scripts, code, input decks, output files
SQS queues: work request queue, work response queue
CloudFormation template; CloudFormation (bootstrap controller)
Internet gateway

Standard high-availability deployment.
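The network layout in the diagram can be sketched as a CloudFormation template built from Python. This is only an illustration under stated assumptions: the CIDR ranges and zone names come from the slide, but the logical resource names (ClusterVpc, Subnet0, ...) are hypothetical, and the template the talk actually used is not shown in the deck.

```python
import ipaddress
import json

# Values taken from the diagram.
VPC_CIDR = "192.168.0.0/16"
ZONES = ["us-west-2a", "us-west-2b", "us-west-2c"]


def build_network_template(vpc_cidr=VPC_CIDR, zones=ZONES):
    """Return a CloudFormation template (as a dict) with one /24 subnet per zone."""
    vpc = ipaddress.ip_network(vpc_cidr)
    # subnets() yields the /24 blocks in ascending order: .0.0/24, .1.0/24, ...
    blocks = vpc.subnets(new_prefix=24)
    resources = {
        # "ClusterVpc" and "SubnetN" are hypothetical logical names.
        "ClusterVpc": {
            "Type": "AWS::EC2::VPC",
            "Properties": {"CidrBlock": str(vpc)},
        }
    }
    for index, (zone, block) in enumerate(zip(zones, blocks)):
        resources[f"Subnet{index}"] = {
            "Type": "AWS::EC2::Subnet",
            "Properties": {
                "VpcId": {"Ref": "ClusterVpc"},
                "AvailabilityZone": zone,
                "CidrBlock": str(block),
            },
        }
    return {"AWSTemplateFormatVersion": "2010-09-09", "Resources": resources}


if __name__ == "__main__":
    print(json.dumps(build_network_template(), indent=2))
```

Spreading the subnets over three Availability Zones is what makes this the "standard high-availability" shape; a real template would also need the internet gateway, routes, and the instances themselves.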

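The diagram shows a work request queue and a work response queue in SQS but not how the nodes use them. One plausible pattern, sketched here as an assumption rather than as the talk's actual code, is for each node to pull one request, run it, post the result, and only then delete the request. The function name process_one, the JSON message format, and the handler callback are all hypothetical; the three client methods are the standard boto3 SQS calls.

```python
import json


def process_one(sqs, request_url, response_url, handler, wait_seconds=20):
    """Handle at most one work request; return True if a message was processed.

    `sqs` is expected to behave like a boto3 SQS client.
    """
    reply = sqs.receive_message(
        QueueUrl=request_url,
        MaxNumberOfMessages=1,
        WaitTimeSeconds=wait_seconds,  # long polling instead of busy-waiting
    )
    messages = reply.get("Messages", [])
    if not messages:
        return False
    message = messages[0]
    # Assumption: request bodies are JSON documents.
    result = handler(json.loads(message["Body"]))
    sqs.send_message(QueueUrl=response_url, MessageBody=json.dumps(result))
    # Delete only after the response is queued, so that a request held by a
    # crashed node becomes visible again and another node can pick it up.
    sqs.delete_message(QueueUrl=request_url, ReceiptHandle=message["ReceiptHandle"])
    return True
```

On a node this would be called in a loop with a client created by boto3.client("sqs", region_name="us-west-2") and the two queue URLs. Because nodes only talk to the queues, they can be added or swapped for other instance types without changing the controller.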

Let's say we recode for GPUs. 1x or 4x GRID K520 (1536 cores, 6144 cores); 15 GB or 60 GB main memory.
[CLICK] Add GPU.
[CLICK] Remove C3s.


Map reduce.


Round-trip latency between nodes (µs):
Path                        min    p50    p90    p99    max
Within one AZ               229    239    258    280    472
us-west-2a to us-west-2b    329    340    354    377    611
Another pair of AZs        1048   1052   1094   1182   2125

Let's say we find we need the inter-node communication.
[CLICK] Run a test, intra-AZ.
[CLICK] OK, around 250 microseconds round trip.
OK, what about inter-AZ?
[CLICK] We'll go from 2a to 2b.
[CLICK] Hrm. Getting worse, but still in the ballpark. Jitter's gone up, though.
[CLICK] Is it the same between AZs?
[CLICK] Ugh. 4x what we saw earlier. Jitter stays controlled until we lose a packet.
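The min/p50/p90/p99/max figures on this slide are a percentile summary of many round-trip samples. A minimal sketch of how such a summary can be computed follows; it assumes the samples are already collected as a list of microsecond values (how the talk gathered them is not stated), and it uses the nearest-rank percentile definition, which may differ slightly from whatever tool produced the slide's numbers.

```python
import math


def latency_summary(samples_us):
    """Return min, p50, p90, p99, and max of round-trip samples (microseconds)."""
    ordered = sorted(samples_us)
    if not ordered:
        raise ValueError("need at least one sample")

    def percentile(p):
        # Nearest-rank method: the smallest sample such that at least
        # p percent of all samples are less than or equal to it.
        rank = max(1, math.ceil(p * len(ordered) / 100))
        return ordered[rank - 1]

    return {
        "min": ordered[0],
        "p50": percentile(50),
        "p90": percentile(90),
        "p99": percentile(99),
        "max": ordered[-1],
    }
```

Reporting the tail (p99, max) next to the median is what exposes jitter: in the slide's numbers the slowest path has a median of 1052 µs against 239 µs within one AZ, a ratio of roughly 4.4, while its maximum of 2125 µs is about twice its own median.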


Fortunately there is a tool we can use.
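The tool the next slide shows is an EC2 placement group. As a rough sketch, a cluster placement group is created once and then named when launching instances; a cluster group lives in a single Availability Zone, which fits the next diagram collapsing to us-west-2a. The helper below only builds the request parameters and sends nothing. Its name and every argument value in the usage note (group name, AMI ID, instance type) are hypothetical; the parameter keys are those accepted by the boto3 EC2 client methods create_placement_group and run_instances.

```python
def placement_group_requests(group_name, image_id, instance_type, count):
    """Build keyword arguments for the two EC2 API calls; nothing is sent."""
    # "cluster" asks EC2 to pack the instances close together on the network.
    create_group = {"GroupName": group_name, "Strategy": "cluster"}
    run_instances = {
        "ImageId": image_id,
        "InstanceType": instance_type,
        "MinCount": count,  # all-or-nothing launch: min equals max
        "MaxCount": count,
        "Placement": {"GroupName": group_name},
    }
    return create_group, run_instances
```

With credentials in place, the two dictionaries could be passed as ec2.create_placement_group(**create_group) and ec2.run_instances(**run_instances), where ec2 = boto3.client("ec2", region_name="us-west-2"). A call such as placement_group_requests("placement-group-a", "ami-00000000", "c3.8xlarge", 12) would cover the twelve nodes in the diagram, with the AMI ID and instance type as placeholders.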

[Diagram: the same cluster in a single Availability Zone]
Region: us-west-2 (Oregon)
VPC space: 192.168.0.0/16
Availability Zone us-west-2a: subnet 192.168.0.0/24
Placement Group A
Instances (Amazon Linux with SLURM AMI): controller, node-0 through node-11
VBL S3 bucket, SQS queues, CloudFormation, and Internet gateway as before

Round-trip latency with Placement Group A (µs), two result sets shown on the slide:
min    p50    p90    p99    max
 85     96    106    189    233
 87     99    174    189    246

Is AWS The Silver Bullet?
"No silver bullets." (Fred Brooks)
Commonly heard latency number: 10 µs
Proximity to other resources might be an issue.

People-hours are more expensive than core-hours.
Enable facilities like NERSC to focus on harder problems not served (or currently served) by COTS.

THANK YOU!