HPC, Grid and Cloud Computing - The Past, Present, and Future Challenge
HPC, Grid and Cloud Computing - The Past, Present and Future
Jason Shih, Academia Sinica Grid Computing
FBI Minimalism, Nov 3rd, 2010
Outline
Trend in HPC Grid: eScience Research @ PetaScale Cloud Hype and Observation Future Exploration Path of Computing Summary
About ASGC
Large Hadron Collider (LHC)
Avian Flu Drug Discovery Grid Application Platform
A Worldwide Grid Infrastructure
Asia Pacific Regional Operation Center
>280 sites, >45 countries >80,000 CPUs, >20 PetaBytes >14,000 users, >200 VOs >250,000 jobs/day
Best Demo Award of EGEE’07!
Lightweight Problem Solving Framework!
1. Most Reliable T1: 98.83%!
2. Very Highly Performing and Most Stable Site in CCRC08!
Max CERN/T1-ASGC Point2Point Inbound: 9.3 Gbps!
100 meters underground, 27 km in circumference; located near Geneva
Emerging Trends and Technologies: 2009-2010
Hype Cycle for Storage Technologies - 2010
Trend in High Performance Computing
Ugly? Performance of HPC Cluster
272 (52%) of the world's fastest clusters have efficiency lower than 80% (Rmax/Rpeak)
Only 115 (18%) can drive over 90% of theoretical peak
Sampled from the Top500 list of HPC clusters
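Efficiency here is simply Rmax/Rpeak from the Top500 list. A minimal sketch of the bookkeeping, using made-up sample entries rather than actual Top500 data:

# Sketch: compute cluster efficiency (Rmax/Rpeak) and count how many fall below 80%.
# The entries below are illustrative placeholders, not real Top500 data.
clusters = [
    {"name": "site-A", "rmax_tflops": 825.0, "rpeak_tflops": 1028.0},
    {"name": "site-B", "rmax_tflops": 433.2, "rpeak_tflops": 921.6},
    {"name": "site-C", "rmax_tflops": 190.0, "rpeak_tflops": 201.0},
]

for c in clusters:
    eff = c["rmax_tflops"] / c["rpeak_tflops"]
    print(f"{c['name']}: efficiency = {eff:.1%}")

below_80 = sum(1 for c in clusters if c["rmax_tflops"] / c["rpeak_tflops"] < 0.80)
print(f"{below_80} of {len(clusters)} clusters below 80% efficiency")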
Trend of Cluster Efficiency 2005-2009
Performance and Efficiency
The top-performing 20% of clusters contribute 60% of total computing power (27.98 PF)
5 clusters have efficiency < 30%
Impact Factor: Interconnectivity - Capacity and Cluster Efficiency
Over 52% of clusters are based on GbE, with efficiency of only around 50%
InfiniBand is adopted by ~36% of HPC clusters
HPC Cluster Interconnect - IB SDR, DDR and QDR in the Top500
Promising efficiency >= 80%
The majority of IB-ready clusters adopt DDR (87%, Nov 2009)
They contribute 44% of total computing power (~28 PFlops)
Avg efficiency ~78%
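Numbers like these come from grouping Top500-style records by interconnect family. A small sketch of that grouping; the field names and records are illustrative, not the actual survey data:

from collections import defaultdict

# Sketch: average efficiency per interconnect family; records are illustrative only.
records = [
    {"interconnect": "GbE",            "rmax": 35.0,  "rpeak": 70.1},
    {"interconnect": "InfiniBand DDR", "rmax": 162.0, "rpeak": 205.0},
    {"interconnect": "InfiniBand QDR", "rmax": 280.0, "rpeak": 340.0},
]

by_family = defaultdict(list)
for r in records:
    by_family[r["interconnect"]].append(r["rmax"] / r["rpeak"])

for family, effs in sorted(by_family.items()):
    avg = sum(effs) / len(effs)
    print(f"{family}: {len(effs)} systems, average efficiency {avg:.1%}")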
Trend in HPC Interconnects: Infiniband Roadmap
Common semantics
Programmer productivity
Ease of deployment
HPC filesystems are more mature, with a wider feature set:
High concurrent read and write (sketch below)
In the comfort zone of programmers (vs. cloud filesystems)
Wide support, adoption and acceptance; pNFS working toward equivalence
Reuse of standard data management tools: backup, disaster recovery and tiering
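The "comfort zone" point is that a parallel POSIX filesystem looks like ordinary files to application code, so concurrent access needs no special API. A minimal sketch; the mount point, worker count and per-file work are arbitrary assumptions:

from concurrent.futures import ProcessPoolExecutor
import os

def checksum(path):
    # Read one file with plain POSIX I/O; stand-in for real per-file analysis.
    total = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            total = (total + sum(chunk)) & 0xFFFFFFFF
    return path, total

if __name__ == "__main__":
    shared_dir = "/mnt/shared"   # hypothetical parallel-filesystem mount point
    files = [os.path.join(shared_dir, n) for n in os.listdir(shared_dir)]
    # Many workers read the shared filesystem concurrently with ordinary file I/O.
    with ProcessPoolExecutor(max_workers=8) as pool:
        for path, value in pool.map(checksum, files):
            print(path, value)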
Evolution of Processors
Trend in HPC
Some Observations & Looking to the Future (I): Computing Paradigm
(Almost) Free FLOPS / (Almost) Free Logic Operations
Data Access (Memory) Is a Major Bottleneck (see the sketch after this list)
Synchronization Is the Most Expensive Operation
Data Communication Is a Big Factor in Performance
I/O Is Still a Major Programming Consideration
MPI Coding Is the Motherhood of Large-Scale Computing
Computing in Conjunction with Massive Data Management
Finding Parallelism Is Not the Whole Issue in Programming: Data Layout, Data Movement, Data Reuse, Frequency of Interconnected Data Communication
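To make the "data access is the bottleneck" and "data layout" points concrete, a small NumPy sketch comparing a layout-friendly traversal with a strided one; timings are machine-dependent and the array size is arbitrary:

import time
import numpy as np

# Sketch: the same reduction, walking memory contiguously vs. with a large stride.
a = np.random.rand(4000, 4000)   # C (row-major) layout by default

t0 = time.perf_counter()
s_rows = sum(a[i, :].sum() for i in range(a.shape[0]))   # contiguous row slices
t1 = time.perf_counter()
s_cols = sum(a[:, j].sum() for j in range(a.shape[1]))   # strided column slices
t2 = time.perf_counter()

print(f"row-wise:    {t1 - t0:.3f} s")
print(f"column-wise: {t2 - t1:.3f} s   (same arithmetic, worse memory access)")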
Some Observations & Looking to the Future (II): Emerging New Possibilities
Massive “Small” Computing Elements with On-Board Memory
Computing Nodes Can Be Configured Dynamically (Including Failure Recovery)
Network Switches (Within an On-Site Complex) Will Nearly Match Memory Performance
Parallel I/O Support for Massively Parallel Systems
Asynchronous Computing/Communication Operation (see the prefetch sketch after this list)
Sophisticated Data Pre-fetch Schemes (Hardware/Algorithm)
Automated Dynamic Load Balancing Methods
Very High-Order Difference Schemes (Also Implicit Methods)
Full Coupling of Formerly Split Operators
Fine Numerical Computational Grids (Grid Number > 10,000)
Full Simulation of Proteins
Full Coupling of Computational Models
Grid Computing for All
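One of the listed possibilities, overlapping computation with data movement, can be sketched as a simple software prefetch: the next input is fetched in the background while the current one is processed. The fetch and compute bodies below are placeholders, not any real framework:

from concurrent.futures import ThreadPoolExecutor

def fetch(block_id):
    # Placeholder for reading one data block from disk or the network.
    return bytes(block_id % 256 for _ in range(1024))

def compute(block):
    # Placeholder for the numerical work performed on one block.
    return len(block)

def run(block_ids):
    results = []
    with ThreadPoolExecutor(max_workers=1) as io:
        pending = io.submit(fetch, block_ids[0])
        for next_id in block_ids[1:] + [None]:
            block = pending.result()                 # wait for the prefetched block
            if next_id is not None:
                pending = io.submit(fetch, next_id)  # start fetching the next block
            results.append(compute(block))           # computation overlaps the fetch
    return results

print(run([1, 2, 3, 4]))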
Some Observations & Looking to the Future (III)
Systems will get more complicated & computing tools will get more sophisticated:
Vendor Support & User Readiness?
Grid: eScience Research @ PetaScale
WLCG Computing Model - The Tier Structure
Tier-0 (CERN): data recording, initial data reconstruction, data distribution
Tier-1 (11 centres): permanent storage, re-processing, analysis
Tier-2 (~130 centres): simulation, end-user analysis
Archeology Astronomy Astrophysics Civil Protection Comp. Chemistry Earth Sciences Finance Fusion Geophysics High Energy Physics Life Sciences Multimedia Material Sciences …
Objectives
Building a sustainable research and collaboration infrastructure
Supporting research through e-Science, focusing on data-intensive sciences and applications that require cross-disciplinary distributed collaboration
ASGC Milestone
Operational since the deployment of LCG0 in 2002
ASGC CA established in 2005 (IGTF membership in the same year)
Tier-1 Center responsibility started in 2005
The federated Taiwan Tier-2 center (Taiwan Analysis Facility, TAF) is also collocated at ASGC
Representative of the EGEE e-Science Asia Federation since joining EGEE in 2004
Providing Asia Pacific Regional Operation Center (APROC) services to the region-wide WLCG/EGEE production infrastructure since 2005
Initiated the Avian Flu Drug Discovery Project in collaboration with EGEE in 2006
Start of the EUAsiaGrid Project in April 2008
LHC First Beam – Computing at the Petascale
ATLAS: General purpose, pp, heavy ions
CMS: General purpose, pp, heavy ions
ALICE: Heavy ions, pp
LHCb: B-physics, CP violation
Size of LHC Detector
(Comparison with CERN Bld. 40, ATLAS, CMS)
7,000 tons, 25 meters in height, 45 meters in length
ATLAS Detector
UNESCO Information Preservation debate, April 2007
http://www.damtp.cam.ac.uk/user/gr/public/bb_history.html
Standard Cosmology
Good model from 0.01 sec after Big Bang
Supported by considerable observational evidence
Elementary Particle Physics
From the Standard Model into the unknown: towards energies of 1 TeV and beyond: the Terascale
Towards Quantum Gravity
From the unknown into the unknown...
(Figure axes: Time vs. Energy, Density, Temperature)
WLCG Timeline
First Beam on LHC, Sep. 10, 2008
Severe incident after 3 weeks of operation (3.5 TeV)
Petabyte Scale Data Challenges
Why Petabyte? The experiment computing models, compared with conventional data management
Challenges
Performance: LAN and WAN activities
Sufficient bandwidth between CPU farms; eliminate uplink bottlenecks (switch tiers) - back-of-the-envelope sketch after this list
Fast response to critical events
Fabric infrastructure & service level agreements
Scalability and manageability
Robust DB engine (Oracle RAC)
Knowledge base and adequate administration (training)
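A back-of-the-envelope way to see why LAN/WAN bandwidth dominates the design: moving petabyte-scale datasets within a fixed window implies a sustained rate that quickly exceeds a single GbE uplink. The dataset size and window below are illustrative assumptions:

def required_gbps(dataset_tb, window_days):
    # Sustained rate needed to move dataset_tb terabytes within window_days days.
    bits = dataset_tb * 1e12 * 8      # decimal TB -> bits
    seconds = window_days * 86400
    return bits / seconds / 1e9       # Gbit/s

# Illustrative figures only: 1 PB redistributed within 30 days.
print(f"{required_gbps(1000, 30):.1f} Gbit/s sustained")   # about 3.1 Gbit/s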
Tier Model and Data Management Components
Disk Pool Configuration - T1 MSS (CASTOR)
Distribution of Free Capacity - Per Disk Server vs. per Pool
Storage Server Generation - Drive vs. Net Capacity (RAID6)
Net capacity per disk server across generations: 15 TB/DS, 21 TB/DS, 31 TB/DS, 40 TB/DS
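The net figures reflect the RAID6 overhead of two parity drives per array. A quick sketch of that arithmetic; the drive counts and sizes are assumptions for illustration, not the actual server configurations, and real net capacity is further reduced by filesystem overhead:

def raid6_net_tb(drive_count, drive_tb):
    # RAID6 keeps two drives' worth of parity, so usable space is (N - 2) drives.
    return (drive_count - 2) * drive_tb

# Illustrative generations of a 24-bay disk server with growing drive sizes.
for drives, size_tb in [(24, 0.75), (24, 1.0), (24, 1.5), (24, 2.0)]:
    print(f"{drives} x {size_tb} TB drives -> {raid6_net_tb(drives, size_tb):.1f} TB net")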
IDC colocation facility installation completed on Mar 27th; tape system delayed until after Apr 9th
Realignment and RMA of faulty parts
Storage Farm: ~110 RAID subsystems deployed since 2003, supporting both Tier-1 and Tier-2 storage fabrics
DAS connection to front-end blade servers
Flexible switching of front-end servers according to performance requirements
4-8 Gb Fibre Channel connectivity
Computing/Storage System Infrastructure
Throughput of WLCG Experiments
Throughput defined as job efficiency x number of running jobs
Characteristics of the 4 LHC experiments show that inefficiency is due to poor coding
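The throughput definition on this slide is just a product. A minimal sketch with made-up efficiency and job-count values, not the measured figures for the experiments:

# Sketch: throughput as job efficiency x running jobs; the values are made up.
experiments = {
    "ALICE": {"efficiency": 0.70, "running_jobs": 4000},
    "ATLAS": {"efficiency": 0.85, "running_jobs": 12000},
    "CMS":   {"efficiency": 0.80, "running_jobs": 10000},
    "LHCb":  {"efficiency": 0.90, "running_jobs": 3000},
}

for name, e in experiments.items():
    throughput = e["efficiency"] * e["running_jobs"]
    print(f"{name}: ~{throughput:.0f} effective job slots")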
Reliability from Different Perspectives
Storage Fabric Management – The Challenges: Events Management
Cloud Hype and Observation
Open Cloud Consortium
Cloud Hype
Metacomputing (~1987, L. Smarr)
Grid Computing (~1997, I. Foster, C. Kesselman)
Cloud Computing (~2007, E. Schmidt?)
Type of Infrastructure
Proprietary solutions by public providers: turnkey solutions developed internally, as they own the software and hardware solution/technology
Cloud-specific support: developers of specific hardware and/or software solutions that are utilized by service providers or used internally when building private clouds
Traditional providers: leverage or tweak their existing
Grid and Cloud Comparison: Cost & Performance, Scale & Usability, Service Mapping, Interoperability, Application Scenarios
Cloud Computing: “X” as a Service - Types of Cloud, Layered Service Model, Reference Model
Virtualization is not Cloud computing
Ref: Linux-based virtualization for HPC clusters.
Performance Overhead: FV (full virtualization) vs. PV (paravirtualization)
Disk I/O and network throughput (VM scalability)
Cloud Infrastructure Best Practices & Real-World Performance
Start up: 60 ~ 44 s; Restart: 30 ~ 27 s; Deletion: 60 ~ <5 s
Migrate: 30 VMs ~ 26.8 s, 60 VMs ~ 40 s, 120 VMs ~ 89 s
Stop: 30 VMs ~ 27.4 s, 60 VMs ~ 26 s, 120 VMs ~ 57 s
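Figures like these come from timing bulk VM lifecycle operations. A generic timing-harness sketch; start_vm and stop_vm are placeholders for whatever cloud controller or hypervisor API is actually in use, not a real API:

import time
from concurrent.futures import ThreadPoolExecutor

def start_vm(vm_id):
    time.sleep(0.01)   # placeholder for a call to the cloud controller / hypervisor

def stop_vm(vm_id):
    time.sleep(0.01)   # placeholder for a call to the cloud controller / hypervisor

def timed_batch(operation, count):
    # Issue `count` operations in parallel and return the wall-clock time.
    t0 = time.perf_counter()
    with ThreadPoolExecutor(max_workers=16) as pool:
        list(pool.map(operation, range(count)))
    return time.perf_counter() - t0

for n in (30, 60, 120):
    print(f"start {n} VMs: {timed_batch(start_vm, n):.1f} s, "
          f"stop {n} VMs: {timed_batch(stop_vm, n):.1f} s")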
Virtualization: HEP Best Practices
Grid over Cloud or Cloud over Grid?
Power Consumption Challenge
Conclusion: My Opinion
Future of Computing: Technology-Push & Demand-Pull
Emergence of new science paradigms
Virtualization: promising technology, but being overemphasized
Green computing: cloud service transparency & a common platform
More computing power ~ more power consumption: the challenge
Private clouds will be the predominant way
Commercial (public) clouds are not expected to evolve fast
Acknowledgment
Thanks for valuable discussions/inputs from TCloud (Cloud OS: Elaster)
Professional technical support from Silvershine Tech. at the beginning of the collaboration
The interesting thing about Cloud Computing is that we’ve defined Cloud Computing to include everything that we already do….. I don’t understand what we would do differently in the light of Cloud Computing other than change the wording of some of our ads.
Larry Ellison, quoted in the Wall Street Journal, Sep 26, 2008
Issues
Scalability? Infrastructure operation vs. performance
Assessment
Application-aware cloud services
Cost analysis
Data center power usage - PUE (see the sketch below)
Cloud myths
Top 10 cloud computing trends:
http://www.focus.com/articles/hosting-bandwidth/top-10-cloud-computing-trends/
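PUE, mentioned in the assessment list above, is just the ratio of total facility power to IT equipment power. A worked sketch with illustrative figures, not ASGC measurements:

def pue(total_facility_kw, it_equipment_kw):
    # Power Usage Effectiveness: total facility power over IT equipment power.
    return total_facility_kw / it_equipment_kw

# Illustrative readings: 1800 kW drawn by the facility, 1000 kW of that by IT gear.
print(f"PUE = {pue(1800, 1000):.2f}")   # 1.80; closer to 1.0 means less overhead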
Use Cases & Best Practices
Issues (II)
Volunteer computing (BOINC)? Total capacity & performance, success stories & research disciplines
What’s hindering cloud adoption? Try humans. http://gigaom.com/cloud/whats-hindering-cloud-adoption-how-about-humans/
Future projections?
Service readiness? Service level? Technical barriers?