TRANSCRIPT
Copyright 2011, Intel, All rights reserved
Intel Insights at SC11
Illuminate Insight Solve
Intel® Technical Computing Group
Dr. Rajeeb Hazra, General Manager, Intel Technical Computing Group
Legal Disclaimers Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
Intel does not control or audit the design or implementation of third party benchmarks or Web sites referenced in this document. Intel encourages all of its customers to visit the referenced Web sites or others where similar performance benchmarks are reported and confirm whether the referenced benchmarks are accurate and reflect performance of systems available for purchase.
Relative performance is calculated by assigning a baseline value of 1.0 to one benchmark result, and then dividing the actual benchmark result for the baseline platform into each of the specific benchmark results of each of the other platforms, and assigning them a relative performance number that correlates with the performance improvements reported.
SPEC, SPECint, SPECfp, SPECrate, SPECpower, SPECjAppServer, SPECjbb, SPECjvm, SPECWeb, SPECompM, SPECompL, SPEC MPI, and SPECjEnterprise* are trademarks of the Standard Performance Evaluation Corporation. See http://www.spec.org for more information. TPC-C, TPC-H, and TPC-E are trademarks of the Transaction Processing Performance Council. See http://www.tpc.org for more information.
Intel® Virtualization Technology requires a computer system with an enabled Intel® processor, BIOS, virtual machine monitor (VMM) and, for some uses, certain platform software enabled for it. Functionality, performance or other benefits will vary depending on hardware and software configurations and may require a BIOS update. Software applications may not be compatible with all operating systems. Please check with your application vendor.
Hyper-Threading Technology requires a computer system with a processor supporting HT Technology and an HT Technology-enabled chipset, BIOS and operating system. Performance will vary depending on the specific hardware and software you use. For more information including details on which processors support HT Technology, see here
Intel® Turbo Boost Technology requires a Platform with a processor with Intel Turbo Boost Technology capability. Intel Turbo Boost Technology performance varies depending on hardware, software and overall system configuration. Check with your platform manufacturer on whether your system delivers Intel Turbo Boost Technology. For more information, see http://www.intel.com/technology/turboboost
No computer system can provide absolute security under all conditions. Intel® Trusted Execution Technology (Intel® TXT) requires a computer system with Intel® Virtualization Technology, an Intel TXT-enabled processor, chipset, BIOS, Authenticated Code Modules and an Intel TXT-compatible measured launch environment (MLE). Intel TXT also requires the system to contain a TPM v1.2. For more information, visit http://www.intel.com/technology/security. In addition, Intel TXT requires that the original equipment manufacturer provides TPM functionality, which requires a TPM-supported BIOS. TPM functionality must be initialized and may not be available in all countries.
Intel® AES-NI requires a computer system with an AES-NI enabled processor, as well as non-Intel software to execute the instructions in the correct sequence. AES-NI is available on Intel® Core™ i5-600 Desktop Processor Series, Intel® Core™ i7-600 Mobile Processor Series, and Intel® Core™ i5-500 Mobile Processor Series. For availability, consult your reseller or system manufacturer. For more information, see http://software.intel.com/en-us/articles/intel-advanced-encryption-standard-instructions-aes-ni/
Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor series, not across different processor sequences. See http://www.intel.com/products/processor_number for details. Intel products are not intended for use in medical, life saving, life sustaining, critical control or safety systems, or in nuclear facility applications. All dates and products specified are for planning purposes only and are subject to change without notice.
Copyright © 2011 Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon and Intel Core are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. All dates and products specified are for planning purposes only and are subject to change without notice
Technical Computing
Computing which serves an intellectual property design,
creation, simulation or analysis purpose…
…customers demand the capability and the capacity to
get results fast, and the reliability to get results right
Universe of Usage, Continuum of Solutions
Workstations, Servers, Supercomputers, Cloud
Technical Computing Group
High Performance Computing, Workstation, and future Intel® MIC products
Roughly 1/3 of Intel’s Data Center Business
Driving future Workstation and HPC architectures
A Pivotal Moment in the Virtuous Cycle: A Universe of Opportunity
• Technology reaching new levels of performance
• Technical computing indispensable as a national and corporate competency globally
• Exascale requirements and technical momentum accelerating
State of Technical Computing: Recession not slowing investment
"China has for the first time unveiled a supercomputer using domestically developed microprocessor chips" (Wall St. Journal, 10/28/11)
"The Tri-Lab Linux Capacity Cluster 2 (TLCC2) award is a multi-million and multi-year contract" (HPC Wire, 6/08/11)
"The estimated investment [in TACC's Stampede system] will be more than $50 million over four years" (eWeek.com, 9/26/11)
"Russia invests in first petaflop capable supercomputer" (Softpedia.com, 3/11/11)
Architecture Presence in the Top500
Intel on the November 2011 Top500 List
Source: Top500.org, November 2011
• Intel® Xeon® processor 5600: #1 processor generation on the list (223 systems)
• Intel® Xeon® processor E5 family in 10 spots on the list, including #15 (LLNL)
• Publicly announced top-10 systems coming soon: GENCI Curie, LRZ, IFERC
• 85% of new systems on the Top500 list; 93% of new China systems
• 5 systems in the top 10 (#2, #4, #5, #7, #9), unchanged from June 2011
• …and…
[Chart: architecture share of Top500 systems, 0–100%, from June 2006 through November 2011: Intel, IBM Power, AMD, Other]
Breakthrough Performance
¹ 8 GT/s and 128b/130b encoding in the PCIe* 3.0 specification is estimated to double the interconnect bandwidth over the PCIe* 2.0 specification
* Other names and brands may be claimed as the property of others
Intel® Xeon® Processor E5 Family
The Foundation of High Performance Computing
• Max 152 GF effective flops per socket, 91% efficiency
• 172 peak Gflops per socket: 2x improvement with Intel® AVX
• Industry's first integrated PCI Express* 3.0: 2x the bandwidth¹
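As a quick sanity check on those headline numbers, here is a minimal sketch of the arithmetic, assuming the usual Sandy Bridge figures of 8 double-precision flops per cycle with Intel AVX (4 with SSE) and the per-lane signaling rates of the PCIe 2.0 and 3.0 specifications; beyond the 8 cores at 2.7 GHz noted later on this slide, these constants are assumptions, not values from the deck.

    /* Sanity check of the headline numbers above. Assumptions (not on
     * the slide): 8 DP flops/cycle with AVX vs. 4 with SSE on Sandy
     * Bridge; PCIe per-lane rates from the 2.0 / 3.0 specifications. */
    #include <stdio.h>

    int main(void) {
        double cores = 8, ghz = 2.7;                 /* Xeon E5: 8C, 2.7 GHz */
        double peak_avx = cores * ghz * 8;           /* 172.8 GF per socket  */
        double peak_sse = cores * ghz * 4;           /* 86.4 GF: AVX = 2x    */

        /* Usable bandwidth per lane, per direction, in GB/s. */
        double pcie2 = 5.0 * (8.0 / 10.0) / 8.0;     /* 8b/10b    -> 0.50    */
        double pcie3 = 8.0 * (128.0 / 130.0) / 8.0;  /* 128b/130b -> ~0.985  */

        printf("peak: %.1f GF (AVX) vs %.1f GF (SSE), %.2fx\n",
               peak_avx, peak_sse, peak_avx / peak_sse);
        printf("PCIe 3.0 vs 2.0 per lane: %.2fx\n", pcie3 / pcie2);
        return 0;
    }

The first line reproduces the "172 peak Gflops, 2x with AVX" claim; the second prints roughly 1.97x, which is where the "2x the bandwidth" rounding comes from.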
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Configurations: Intel internal measurements, October 2011; see backup for configuration details. For more information go to http://www.intel.com/performance. Any difference in system hardware or software design or configuration may affect actual performance. Copyright © 2011, Intel Corporation.
[Chart: effective vs. peak Gflops (0–200) for the SNB cluster and ITL system]
Nov 2011 Top500 listing: #105 SNB cluster, #20 ITL
Source: Top500.org, November 2011
ANNOUNCING
Intel® Xeon® Processor E5 Family: Increased Application Performance up to 1.7x

Relative performance vs. Intel® Xeon® X5690 baseline (3.46 GHz, 6C = 1.0):
  Matrix Multiplication (Linpack), synthetic   2.1
  Life Sciences (real-world)                   1.3
  CAE (real-world)                             1.4
  Energy (real-world)                          1.5
  FSI (real-world)                             1.5
  Numerical Weather (real-world)               1.7
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Configurations: Intel internal measurements, October 2011; see backup for configuration details. For more information go to http://www.intel.com/performance. Any difference in system hardware or software design or configuration may affect actual performance. Copyright © 2011, Intel Corporation.
Relative geometric mean scores by segment; higher is better. Actual performance will vary by workload.
Intel® Xeon® Processor E5 Family (8C, 2.7 GHz)
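The configuration note at the end of this deck confirms that each segment score is a geometric mean of application measurements within that vertical. As a minimal sketch of that aggregation step, the example below computes a geometric mean, using the rounded per-segment chart values above as inputs (the underlying per-application ratios are not listed in the deck).

    /* Geometric mean of relative scores (each already normalized to
     * the X5690 baseline = 1.0), the aggregation named above. */
    #include <math.h>
    #include <stdio.h>

    static double geomean(const double *x, int n) {
        double log_sum = 0.0;
        for (int i = 0; i < n; i++)
            log_sum += log(x[i]);      /* sum logs to avoid overflow */
        return exp(log_sum / n);       /* exp of the mean log        */
    }

    int main(void) {
        /* Rounded real-world segment scores from the chart above. */
        double segments[] = { 1.3, 1.4, 1.5, 1.5, 1.7 };
        printf("geometric mean across segments: %.2f\n",
               geomean(segments, 5));  /* prints ~1.47 */
        return 0;
    }

Summing logs rather than multiplying raw values keeps the computation stable when many ratios are combined.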
GENCI in the French and European HPC ecosystem
Alain Lichnewsky, CSO - GENCI
GENCI: Grand équipement national de calcul intensif (France's national facility for intensive computing)
• Fund and deploy the major computing equipment of the French academic research computing centers
• Contribute to the organization of a European HPC ecosystem
• Promote the use of simulation and HPC in fundamental and industrial research
Total capacity of Tier-1 HPC resources available to the French scientific community in 2011: from 20 to 620 TFlops, a 31x increase!
[Pie chart: shares of 49%, 20%, 20%, 10%, and 1% across the systems below]
NEC SX8 Brodie (IDRIS), NEC SX8R-SX9 Mercure (CCRT), Cluster BULL Titane (CCRT), Cluster SGI ICE Jade (CINES), Cluster IBM Hera (CINES), Cluster BULL Platine (CCRT), Cluster BULL Curie fat nodes (TGCC), Cluster IBM Vargas (IDRIS), IBM BG/P Babel (IDRIS) = 399 TFlops, > 33,000 cores
Operating the PRACE Research Infrastructure since April 2010
• PRACE AISBL (Association Internationale Sans But Lucratif, an international non-profit association) created with 20 countries, head office in Brussels
• Now 21 member countries, more to come soon
• France is represented by GENCI
PRACE has been providing Tier-0 services since August 2010
• In Germany and in France since 2010
• In Italy and in Spain from 2012
Funding secured for 2010-2015
• €400 million from France, Germany, Spain, and Italy, provided as Tier-0 services on a TCO basis
• €70+ million from EC FP7 for preparatory and implementation phases, complemented by nearly €60 million from PRACE members
PRACE: The Partnership for Advanced Computing in Europe
CURIE, a tool to deliver on the French commitment to PRACE
• 2 PFlops at the end of 2011
• More than 92,000 cores
• 360 TB of main memory
• 15 PB attached storage @ 250 GB/s
• 120 racks, <200 m², 2.5 MW
Installed at TGCC: a technical and scientific environment of the highest quality
A deep dive into CURIE
A fat-node partition
• 360 BULL S6010 servers; per node: 4 sockets Intel® Xeon® processor 7500 series @ 2.26 GHz, 128 GB memory, 1 TB disk, 1 QDR link
• Q2 2012: the 360 servers become 90 fat nodes via the BULL BCS switch (128 cores, 512 GB, same SSI)
• 105 TF peak performance, available since January 2011
A hybrid-node partition
• 144 BULL B505 hybrid blades; per blade: 2 sockets Intel® Xeon® processor 5600 @ 2.67 GHz, 2 NVIDIA M2090, 128 GB SSD, 1 QDR link
• 200 TF peak performance, available since October 2011
A thin-node partition
• 5040 BULL B510 blades; per blade: 2 Intel® Xeon® processors E5 @ 2.7 GHz, 64 GB memory, 128 GB SSD, 1 QDR link
• 10,080 processors, 80,640 cores
• 1.7 PF peak performance, available beginning of 2012
The first Intel-based 2 PF system in the PRACE AISBL
Very promising first results on the initial 30k Xeon E5 cores installed
• Memory controller, L3 management, and DDR3-1600 deliver a Stream improvement
• Promising HPL performance: 90% efficiency on a single blade; between 83% and 87% expected on the full system (see the sketch below)
• Very good power management and DVFS implementation
• Mature compilers and MKL for the AVX generation
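Here is the sketch referenced above: a minimal check of how the quoted efficiencies relate to peak, assuming the 8-core 2.7 GHz E5 sockets at 8 double-precision flops per cycle from the earlier slide; the DDR3-1600 Stream point appears only as a comment, since the deck quotes no bandwidth figure.

    /* How the CURIE thin-node efficiency figures relate to peak,
     * assuming 8 cores/socket at 2.7 GHz and 8 DP flops/cycle (AVX).
     * (Stream side note: 4 channels of DDR3-1600 would give a socket
     * 51.2 GB/s peak memory bandwidth; an assumption, not quoted.) */
    #include <stdio.h>

    int main(void) {
        double gf_socket = 8 * 2.7 * 8;                   /* 172.8 GF */
        double blades = 5040, sockets = 2;

        double blade_peak = sockets * gf_socket;          /* 345.6 GF */
        double system_pf  = blades * blade_peak / 1e6;    /* ~1.74 PF */

        printf("cores: %.0f\n", blades * sockets * 8);    /* 80640    */
        printf("blade HPL at 90%%: %.0f of %.1f GF\n",
               0.90 * blade_peak, blade_peak);            /* ~311 GF  */
        printf("system HPL at 83-87%%: %.2f-%.2f of %.2f PF\n",
               0.83 * system_pf, 0.87 * system_pf, system_pf);
        return 0;
    }

The computed core count (80,640) and system peak (~1.74 PF) line up with the partition figures quoted above.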
Beyond Stream and HPL: a lot of key scientific challenges to address
Specific grand challenges about to start on CURIE
• Massive simulations in fusion, astrophysics, combustion and Life Sciences on the whole machine
CURIE open to PRACE users on March 1st, 2012
CURIE: A tool for Science & Open Innovation
In the field of medicine: molecular simulations for Alzheimer's disease
• Solve some key issues in the chemistry of Alzheimer's disease:
  – The formation of amyloid plaques, a hallmark of the disease
  – Role of metallic ions (Cu, Zn, Fe)
  – Comparison of two models, human and mouse
• Optimisation of the QMC=Chem code done with the French Intel Exascale Lab (ECR)
CURIE: A tool for Science & Open Innovation
Ongoing Grand Challenge on up to 50k cores of CURIE
Intel’s Many Core and Multi-core Engines
• Die Size not to scale
Intel Xeon processor:
• Foundation of HPC Performance
• Suited for full scope of workloads
• Industry-leading performance/watt for serial and highly parallel workloads
MIC Architecture:
• Optimized for highly parallelized compute intensive workloads
• Common software tools with Xeon enabling efficient app readiness and performance tuning
Multi-core Intel® Xeon® processor at 2.26-3.5 GHz
Many Integrated Cores at 1-1.2 GHz
Texas Advanced Computing Center
Die Sizes not to scale
Planned January 2013: "Stampede", a 10-petaflop Intel® Xeon / Intel® MIC national cyberinfrastructure supercomputer
Operational October 2011
Targeting codes in Biology, Astrophysics, and Computational Fluid Dynamics (CFD)
TACC: Dell Experimental Cluster
8 Intel® Xeon® nodes / 8 Intel® MIC nodes
Sandia National Laboratories
Operational October 2011; evaluating highly parallel programming models
Computational fluid dynamics application enabled in a few days; 3D results targeted
“Arthur” Experimental Cluster
42 Intel® Xeon® nodes / 84 Intel® MIC nodes
Die Sizes not to scale
Jeff Nichols, Acting Director, National Center for Computational Sciences / National Leadership Computing Facility, Oak Ridge National Laboratory
Driving leadership technology in parallel computing
– High performance and parallelism, regardless of architecture
Experience with Intel® MIC Architecture
− Programmability and performance promise
Jeff Nichols, Associate Laboratory Director, Computing and Computational Sciences ([email protected])
Robert Harrison, Director, Joint Institute for Computational Sciences ([email protected])
Opportunities and Challenges Posed by Exascale Computing: ORNL's Plans and Perspectives
Our vision for sustained leadership and scientific impact
• Provide world’s most powerful open resources for scalable computing and simulation, data and analytics, and infrastructure for science
• Follow a well-defined path for maintaining world leadership in these critical areas
• Attract the brightest talent and partnerships from all over the world
• Deliver leading-edge science relevant to missions of DOE and key federal and state agencies
• Invest in cross-cutting partnerships with industry
• Provide unique opportunity for innovation based on multiagency collaboration
• Invest in education and training
Partnerships are key to sustained leadership
• International
• Interagency (multiagency)
• Multi-institutional
• Multi-disciplinary
• Industrial / vendor
• Experiment, theory, and simulation
• Facility and R&D
Executing the vision
• Applications: scalable compute (chemistry, climate, fusion, combustion, energy grid, …) and data (discrete-event, real-time, sensors, social networks); research and development co-design end stations
• Algorithms and analytics: solvers, 13 "motifs," clock-constrained information fusion, geospatial-temporal
• Tools and middleware: compilers, performance optimization, global arrays, streaming, tagging, smart data, agent-based
• Hardware systems and associated software: heterogeneous multicore, massively threaded, visualization systems, mobile systems, workflow, operating systems, programming models, …; acquisition, deployment, operations
• Infrastructure: power, cooling, networking, hierarchical storage, cyber security, experimental systems, …
Partnerships are essential to success
• Co-design of applications, computational environment, and platforms
• Application teams with dual responsibility
– Mission/science
– Exascale co-design
• Simulation environment characterized by broad community participation
– Common across all applications and platforms
– Leveraging open source software and product support
– Supporting both evolutionary and revolutionary approaches
• Long-term industry partnerships
– Leverage and influence business plans of vendor partners
– Joint R&D and leveraged community efforts reduce vendor risk
– Minimum of 2 tracks provides competitiveness, risk reduction, and architectural diversity, as does deployment of at least 2 platform generations to get to exascale
National Institute for Computational Sciences
Joint Institute for Computational Sciences, University of Tennessee & ORNL
Funded by the National Science Foundation (NSF)
Operates Kraken, a 1.17 PetaFLOP Cray XT5 that is the NSF’s most productive supercomputer
Major partner in the NSF’s Extreme Science and Engineering Discovery Environment (XSEDE)
Application Acceleration Center of Excellence (AACE)
Joint Institute for Computational Sciences
Director: R. Glenn Brook ([email protected])
Prepare the national supercomputing community for effective and efficient use of future architectures
– Research, education, outreach
– Future technology evaluation and development in partnership with multiple vendors
Intel® Many Integrated Core (Intel® MIC) partner
– Multiple Knights Ferry prototype cards in multiple platforms (e.g., clusters from Cray, Appro, …)*
– Evaluate as future technology for NSF applications
– Collaborate with Intel in testing and platform design
*See demo platforms in the ORNL/UT and Intel booths at SC11
Experience with the Knights Ferry design and development kit
Unparalleled productivity: in under 3 months
– Ported all of NWChem (chemistry), ENZO (astro.), ELK (mat. sci.), MADNESS (app. math.), MPI, GA, …
– Correct ports in less than one day each
– Circa 5M LOC (Fortran 77/90, C, C++, Python)
– MPI, Global Arrays, …
Most of this software does not run on GPGPUs and probably never will, due to cost and complexity
Demonstrated execution modes:
– Native mode: KNF is a fully networked Linux system
– Offload mode: KNF is an attached accelerator
– Reverse offload mode: KNF in native mode offloads to the host
– Cluster mode: parallel application distributed across multiple KNF cards and hosts using MPI
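For flavor, here is a minimal sketch of the offload mode in code. The pragma follows Intel's Language Extensions for Offload as shipped with the Intel compilers of that era (an assumption; the slide names no syntax), and a compiler without that support simply ignores the unknown pragma and runs the loop on the host.

    /* Offload-mode sketch: the host ships a parallel loop to the MIC
     * card. #pragma offload is Intel's Language Extensions for
     * Offload (assumed syntax; requires an Intel compiler with MIC
     * support). Other compilers ignore it and run on the host. */
    #include <stdio.h>

    #define N 1024

    int main(void) {
        static float in[N], out[N];
        for (int i = 0; i < N; i++) in[i] = (float)i;

        #pragma offload target(mic) in(in) out(out)
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            out[i] = in[i] * in[i];    /* trivially parallel kernel */

        printf("out[%d] = %.1f\n", N - 1, out[N - 1]);
        return 0;
    }

Native and reverse-offload modes use the same source; only where the binary runs (and which side initiates work) changes.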
Initial scaling results
Sod shock test problem: Eulerian inviscid flow on an unstructured grid, five variables per grid point. Speedup measured in native-mode OpenMP on the KNF using both single and double precision [Brook and Hulguin]
[Chart: speedup vs. number of threads (1 to 128) for single precision, double precision, and double precision at 2x problem size, against ideal scaling; annotated 92%–96% of ideal speedup and 73%–99% of ideal speedup]
[Chart: observed speedup vs. number of threads (1 to 32) against ideal scaling]
ENZO-C astrophysics/cosmology, 128³ non-AMR, pure MPI mode (1 process per core) [Harkness]
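The "% of ideal speedup" annotations are simply measured speedup divided by the ideal speedup, which for strong scaling from one thread is the thread count. A minimal sketch follows; the sample speedup values are hypothetical, for illustration only.

    /* Parallel efficiency as used in the chart annotations:
     * efficiency = measured speedup / ideal speedup (= thread count).
     * The sample speedup values below are hypothetical. */
    #include <stdio.h>

    int main(void) {
        int    threads[] = { 2, 4, 8, 16, 32 };
        double speedup[] = { 1.98, 3.90, 7.60, 15.1, 29.8 };

        for (int i = 0; i < 5; i++)
            printf("%2d threads: speedup %5.2f -> %3.0f%% of ideal\n",
                   threads[i], speedup[i],
                   100.0 * speedup[i] / threads[i]);
        return 0;
    }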
Productivity and performance
• Productivity
• Fully portable standards-compliant source code
• Complete environment for scientific computing including robust, fast, math libraries (e.g., MKL)
• Rich vector intrinsics with full predication
• Performance
• Embarrassingly parallel code trivially runs well
• Established MPI + OMP/threads + scalar-vector programming model enables vastly more complex algorithms to run efficiently (a minimal hybrid sketch follows below)
• But Knights Ferry is not a product
• Need a faster & bigger platform to buy in quantity
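Here is the minimal hybrid sketch referenced above: MPI between ranks, OpenMP threads within each rank, vectorizable inner work left to the compiler. It is illustrative only, not code from the labs; any MPI implementation should run it (e.g., built with mpicc -fopenmp).

    /* Minimal hybrid MPI + OpenMP skeleton of the programming model
     * described above: MPI between ranks, OpenMP threads within each
     * rank. Illustrative only.
     * Build (one possibility): mpicc -fopenmp hybrid.c -o hybrid */
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, nranks, provided;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        double local = 0.0;
        #pragma omp parallel for reduction(+:local)
        for (int i = rank; i < 1000000; i += nranks)
            local += 1.0 / (1.0 + (double)i);   /* per-rank slice of work */

        double total;
        MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("total = %f (ranks: %d, threads/rank: %d)\n",
                   total, nranks, omp_get_max_threads());

        MPI_Finalize();
        return 0;
    }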
Knights Corner
Knights Corner Performance
Shattering Barriers: Crossing 1 Sustained TeraFlops

               ASCI Red (1997)                Knights Corner (2011)
  Milestone    First system to sustain 1 TF   First chip to sustain 1 TF
  Compute      9,298 Pentium II Xeon          One 22nm chip
  OS           Cougar                         Linux
  Footprint    72 cabinets                    1 PCI Express slot
Source and Photo: http://en.wikipedia.org/wiki/ASCI_Red
ANNOUNCING
Software Development and System Environment
• Die sizes not to scale
Intel® Xeon® Processor and Intel® Many Integrated Core Architecture:
• Same comprehensive set of tools
• Established HPC operating system
Application source code builds with a compiler switch. No changes required!
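As a hedged illustration of that point, a single OpenMP source file can target the host Xeon or the coprocessor with only a change to the build line. The -mmic switch in the comments is the Intel compiler's native-MIC flag as later documented for Knights Corner; that flag name is an assumption here, since the slide names no specific switch.

    /* One source file, two targets, changing only the build line.
     * Host Xeon:  icc -O2 -openmp -xAVX  saxpy.c -o saxpy.host
     * MIC native: icc -O2 -openmp -mmic  saxpy.c -o saxpy.mic
     * (-mmic is the Intel compiler's native-MIC switch for Knights
     *  Corner; an assumption, the slide names no specific flag.) */
    #include <stdio.h>

    #define N (1 << 20)
    static float x[N], y[N];

    int main(void) {
        for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            y[i] = 2.0f * x[i] + y[i];     /* saxpy: same code both ways */

        printf("y[0] = %f\n", y[0]);
        return 0;
    }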
Exascale Systems Design
Needs a Multi-disciplinary Approach:
• Microprocessor
• Memory
• Interconnect
• Parallel software
• Power management
• Reliability and resiliency
Intel Exascale Labs - Europe
Strong commitment to advancing the leading edge of computing: Intel collaborating with the HPC community and European researchers. Four labs in Europe, with Exascale computing as the central topic, each under a signed collaboration agreement:
• ExaScale Computing Research Center, Paris: performance and scalability of Exascale applications
• Exascale Cluster Lab, Jülich: Exascale cluster scalability and reliability
• Exascience Lab, Leuven: space weather prediction, architectural simulation and visualization, numerical kernels
• Intel-BSC Exascale Lab, Barcelona: scalable run-time systems, Exascale tools, new algorithms
Intel Xeon processor E5 will serve as the foundation of HPC in 2012
Intel MIC architecture with industry-leading performance and a familiar and useful usage model
Our partnership ramp to Exascale
Copyright © 2011, Intel Corporation. All rights reserved.
Legal Disclaimers: Linpack and HPC suite
For the 2.1X Linpack claim: New configuration: 2S Intel® Xeon® E5-2680 score of 342.7 Gflops based on Intel internal measurements as of 7 September 2011 using an Intel® Rose City platform with two Intel® Xeon® processor E5-2680, Turbo enabled, EIST enabled, Hyper-Threading enabled, 64 GB memory (8 x 8GB DDR3-1600), Red Hat* Enterprise Linux Server 6.1 beta for x86_64. Baseline: Intel® Xeon® 5600 processor platform with two Intel® Xeon® Processor X5690 (6-core, 3.46GHz, 12MB L3 cache, 6.4GT/s, B1-stepping), EIST enabled, Turbo Boost enabled, Hyper-Threading disabled, 48GB memory (12x 4GB DDR3-1333 REG ECC), 160GB SATA 7200RPM HDD, Red Hat* Enterprise Linux Server 5.5 for x86_64 with kernel 2.6.35.10; score: 159.40 Gflops. Source: Intel internal testing as of April 2011, Intel SSG TR#1224. For the HPC suite: Baseline: 2S Intel® Xeon® X5690 HPC suite geometric mean of application measurements by vertical (CAE, Energy, FSI, Life Science, NWS); actual performance will vary by workload. Based on Intel internal measurements as of October 2011 using an Intel® Xeon® 5600 processor platform with two Intel® Xeon® X5690, Turbo enabled, best Hyper-Threading configuration, 48GB DDR3-1333 memory, Red Hat* EL5-U5. New configuration: 2S Intel® Xeon® E5-2680 HPC suite geometric mean of application measurements by vertical (CAE, Energy, FSI, Life Science, NWS); actual performance will vary by workload. Based on Intel internal measurements as of October 2011 using an Intel® Canoe Pass platform with two Xeon® E5-2680 (C0 step), Turbo enabled, best Hyper-Threading configuration, 32 GB DDR3-1600 memory, Red Hat* Enterprise Linux 6.1, 2.6.39.3 kernel.