TRANSCRIPT
Copyright 2011, Intel, All rights reserved
Intel Insights at SC11
Illuminate Insight Solve
Intel® Technical Computing Group
Dr. Rajeeb Hazra, General Manager, Intel Technical Computing Group
Legal Disclaimers Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
Intel does not control or audit the design or implementation of third party benchmarks or Web sites referenced in this document. Intel encourages all of its customers to visit the referenced Web sites or others where similar performance benchmarks are reported and confirm whether the referenced benchmarks are accurate and reflect performance of systems available for purchase.
Relative performance is calculated by assigning a baseline value of 1.0 to one benchmark result, and then dividing the actual benchmark result for the baseline platform into each of the specific benchmark results of each of the other platforms, and assigning them a relative performance number that correlates with the performance improvements reported.
SPEC, SPECint, SPECfp, SPECrate, SPECpower, SPECjAppServer, SPECjbb, SPECjvm, SPECWeb, SPECompM, SPECompL, SPEC MPI, and SPECjEnterprise* are trademarks of the Standard Performance Evaluation Corporation. See http://www.spec.org for more information. TPC-C, TPC-H, and TPC-E are trademarks of the Transaction Processing Performance Council. See http://www.tpc.org for more information.
Intel® Virtualization Technology requires a computer system with an enabled Intel® processor, BIOS, virtual machine monitor (VMM) and, for some uses, certain platform software enabled for it. Functionality, performance or other benefits will vary depending on hardware and software configurations and may require a BIOS update. Software applications may not be compatible with all operating systems. Please check with your application vendor.
Hyper-Threading Technology requires a computer system with a processor supporting HT Technology and an HT Technology-enabled chipset, BIOS and operating system. Performance will vary depending on the specific hardware and software you use. For more information including details on which processors support HT Technology, see here
Intel® Turbo Boost Technology requires a Platform with a processor with Intel Turbo Boost Technology capability. Intel Turbo Boost Technology performance varies depending on hardware, software and overall system configuration. Check with your platform manufacturer on whether your system delivers Intel Turbo Boost Technology. For more information, see http://www.intel.com/technology/turboboost
No computer system can provide absolute security under all conditions. Intel® Trusted Execution Technology (Intel® TXT) requires a computer system with Intel® Virtualization Technology, an Intel TXT-enabled processor, chipset, BIOS, Authenticated Code Modules and an Intel TXT-compatible measured launch environment (MLE). Intel TXT also requires the system to contain a TPM v1.2. For more information, visit http://www.intel.com/technology/security. In addition, Intel TXT requires that the original equipment manufacturer provides TPM functionality, which requires a TPM-supported BIOS. TPM functionality must be initialized and may not be available in all countries.
Intel® AES-NI requires a computer system with an AES-NI enabled processor, as well as non-Intel software to execute the instructions in the correct sequence. AES-NI is available on Intel® Core™ i5-600 Desktop Processor Series, Intel® Core™ i7-600 Mobile Processor Series, and Intel® Core™ i5-500 Mobile Processor Series. For availability, consult your reseller or system manufacturer. For more information, see http://software.intel.com/en-us/articles/intel-advanced-encryption-standard-instructions-aes-ni/
Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor series, not across different processor sequences. See http://www.intel.com/products/processor_number for details. Intel products are not intended for use in medical, life saving, life sustaining, critical control or safety systems, or in nuclear facility applications. All dates and products specified are for planning purposes only and are subject to change without notice.
Copyright © 2011 Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon and Intel Core are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. All dates and products specified are for planning purposes only and are subject to change without notice
Technical Computing
Computing which serves an intellectual property design,
creation, simulation or analysis purpose…
…customers demand the capability and the capacity to
get results fast, and the reliability to get results right
Universe of Usage, Continuum of Solutions
Workstations, Servers, Supercomputers, Cloud
Technical Computing Group
High Performance Computing, Workstation, and future Intel® MIC products
Roughly 1/3 of Intel’s Data Center Business
Driving future Workstation and HPC architectures
A Pivotal Moment in the Virtuous Cycle: A Universe of Opportunity
• Technology reaching new levels of performance
• Technical computing indispensable as a national and corporate competency globally
• Exascale requirements and technical momentum accelerating
State of Technical Computing: Recession not slowing investment
"China has for the first time unveiled a supercomputer using domestically developed microprocessor chips" (Wall St. Journal, 10/28/11)
"The Tri-Lab Linux Capacity Cluster 2 (TLCC2) award is a multi-million and multi-year contract" (HPC Wire, 6/08/11)
"The estimated investment [in TACC's Stampede system] will be more than $50 million over four years" (eWeek.com, 9/26/11)
"Russia invests in first petaflop capable supercomputer" (Softpedia.com, 3/11/11)
Architecture Presence in the Top500
Intel on the November 2011 Top500 List
Source: Top500.org, November 2011
• Intel® Xeon® processor 5600: #1 processor generation on the list (223 systems)
• Intel® Xeon® processor E5 family in 10 spots on the list, including #15 (LLNL)
• Publicly announced top-10 systems coming soon: GENCI Curie, LRZ, IFERC
• 85% of new systems on the Top500 list; 93% of new China systems
• 5 systems in the top 10 (#2, #4, #5, #7, #9), unchanged from June 2011
• …and…
[Chart: architecture share of Top500 systems, 0–100%, from June 2006 through November 2011: Intel, IBM Power, AMD, Other]
Breakthrough Performance
¹ 8 GT/s and 128b/130b encoding in the PCIe* 3.0 specification is estimated to double the interconnect bandwidth over the PCIe* 2.0 specification
* Other names and brands may be claimed as the property of others
Intel® Xeon® Processor E5 Family
The Foundation of High Performance Computing
• Max 152 GF effective flops per socket, 91% efficiency
• 172 peak Gflops per socket: 2x improvement with Intel® AVX
• Industry's first integrated PCI Express* 3.0: 2x the bandwidth¹
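As a quick sanity check on those headline numbers, here is a minimal sketch of the arithmetic, assuming the usual Sandy Bridge figures of 8 double-precision flops per cycle with Intel AVX (4 with SSE) and the per-lane signaling rates of the PCIe 2.0 and 3.0 specifications; beyond the 8 cores at 2.7 GHz noted later on this slide, these constants are assumptions, not values from the deck.

    /* Sanity check of the headline numbers above. Assumptions (not on
     * the slide): 8 DP flops/cycle with AVX vs. 4 with SSE on Sandy
     * Bridge; PCIe per-lane rates from the 2.0 / 3.0 specifications. */
    #include <stdio.h>

    int main(void) {
        double cores = 8, ghz = 2.7;                 /* Xeon E5: 8C, 2.7 GHz */
        double peak_avx = cores * ghz * 8;           /* 172.8 GF per socket  */
        double peak_sse = cores * ghz * 4;           /* 86.4 GF: AVX = 2x    */

        /* Usable bandwidth per lane, per direction, in GB/s. */
        double pcie2 = 5.0 * (8.0 / 10.0) / 8.0;     /* 8b/10b    -> 0.50    */
        double pcie3 = 8.0 * (128.0 / 130.0) / 8.0;  /* 128b/130b -> ~0.985  */

        printf("peak: %.1f GF (AVX) vs %.1f GF (SSE), %.2fx\n",
               peak_avx, peak_sse, peak_avx / peak_sse);
        printf("PCIe 3.0 vs 2.0 per lane: %.2fx\n", pcie3 / pcie2);
        return 0;
    }

The first line reproduces the "172 peak Gflops, 2x with AVX" claim; the second prints roughly 1.97x, which is where the "2x the bandwidth" rounding comes from.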
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Configurations: Intel internal measurements, October 2011; see backup for configuration details. For more information go to http://www.intel.com/performance. Any difference in system hardware or software design or configuration may affect actual performance. Copyright © 2011, Intel Corporation.
[Chart: effective vs. peak Gflops (0–200) for the SNB cluster and ITL system]
Nov 2011 Top500 listing: #105 SNB cluster, #20 ITL
Source: Top500.org, November 2011
ANNOUNCING
Intel® Xeon® Processor E5 Family: Increased Application Performance up to 1.7x

Relative performance vs. Intel® Xeon® X5690 baseline (3.46 GHz, 6C = 1.0):
  Matrix Multiplication (Linpack), synthetic   2.1
  Life Sciences (real-world)                   1.3
  CAE (real-world)                             1.4
  Energy (real-world)                          1.5
  FSI (real-world)                             1.5
  Numerical Weather (real-world)               1.7
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Configurations: Intel internal measurements, October 2011; see backup for configuration details. For more information go to http://www.intel.com/performance. Any difference in system hardware or software design or configuration may affect actual performance. Copyright © 2011, Intel Corporation.
Relative geometric mean scores by segment; higher is better. Actual performance will vary by workload.
Intel® Xeon® Processor E5 Family (8C, 2.7 GHz)
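The configuration note at the end of this deck confirms that each segment score is a geometric mean of application measurements within that vertical. As a minimal sketch of that aggregation step, the example below computes a geometric mean, using the rounded per-segment chart values above as inputs (the underlying per-application ratios are not listed in the deck).

    /* Geometric mean of relative scores (each already normalized to
     * the X5690 baseline = 1.0), the aggregation named above. */
    #include <math.h>
    #include <stdio.h>

    static double geomean(const double *x, int n) {
        double log_sum = 0.0;
        for (int i = 0; i < n; i++)
            log_sum += log(x[i]);      /* sum logs to avoid overflow */
        return exp(log_sum / n);       /* exp of the mean log        */
    }

    int main(void) {
        /* Rounded real-world segment scores from the chart above. */
        double segments[] = { 1.3, 1.4, 1.5, 1.5, 1.7 };
        printf("geometric mean across segments: %.2f\n",
               geomean(segments, 5));  /* prints ~1.47 */
        return 0;
    }

Summing logs rather than multiplying raw values keeps the computation stable when many ratios are combined.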
GENCI in the French and European HPC ecosystem
Alain Lichnewsky, CSO - GENCI
GENCI: Grand équipement national de calcul intensif (France's national facility for intensive computing)
• Fund and deploy the major computing equipment of the French academic research computing centers
• Contribute to the organization of a European HPC ecosystem
• Promote the use of simulation and HPC in fundamental and industrial research
Total capacity of Tier-1 HPC resources available to the French scientific community in 2011: from 20 to 620 TFlops, a 31x increase!
[Pie chart: shares of 49%, 20%, 20%, 10%, and 1% across the systems below]
NEC SX8 Brodie (IDRIS), NEC SX8R-SX9 Mercure (CCRT), Cluster BULL Titane (CCRT), Cluster SGI ICE Jade (CINES), Cluster IBM Hera (CINES), Cluster BULL Platine (CCRT), Cluster BULL Curie fat nodes (TGCC), Cluster IBM Vargas (IDRIS), IBM BG/P Babel (IDRIS) = 399 TFlops, > 33,000 cores
Operating the PRACE Research Infrastructure since April 2010
• PRACE AISBL (Association Internationale Sans But Lucratif, an international non-profit association) created with 20 countries, head office in Brussels
• Now 21 member countries, more to come soon
• France is represented by GENCI
PRACE has been providing Tier-0 services since August 2010
• In Germany and in France since 2010
• In Italy and in Spain from 2012
Funding secured for 2010-2015
• €400 million from France, Germany, Spain, and Italy, provided as Tier-0 services on a TCO basis
• €70+ million from EC FP7 for preparatory and implementation phases, complemented by nearly €60 million from PRACE members
PRACE: The Partnership for Advanced Computing in Europe
CURIE, a tool to deliver on the French commitment to PRACE
• 2 PFlops at the end of 2011
• More than 92,000 cores
• 360 TB of main memory
• 15 PB attached storage @ 250 GB/s
• 120 racks, <200 m², 2.5 MW
Installed at TGCC: a technical and scientific environment of the highest quality
A deep dive into CURIE
A fat-node partition
• 360 BULL S6010 servers; per node: 4 sockets Intel® Xeon® processor 7500 series @ 2.26 GHz, 128 GB memory, 1 TB disk, 1 QDR link
• Q2 2012: the 360 servers become 90 fat nodes via the BULL BCS switch (128 cores, 512 GB, same SSI)
• 105 TF peak performance, available since January 2011
A hybrid-node partition
• 144 BULL B505 hybrid blades; per blade: 2 sockets Intel® Xeon® processor 5600 @ 2.67 GHz, 2 NVIDIA M2090, 128 GB SSD, 1 QDR link
• 200 TF peak performance, available since October 2011
A thin-node partition
• 5040 BULL B510 blades; per blade: 2 Intel® Xeon® processors E5 @ 2.7 GHz, 64 GB memory, 128 GB SSD, 1 QDR link
• 10,080 processors, 80,640 cores
• 1.7 PF peak performance, available beginning of 2012
The first Intel-based 2 PF system in the PRACE AISBL
Very promising first results on the initial 30k Xeon E5 cores installed
• Memory controller, L3 management, and DDR3-1600 deliver a Stream improvement
• Promising HPL performance: 90% efficiency on a single blade; between 83% and 87% expected on the full system (see the sketch below)
• Very good power management and DVFS implementation
• Mature compilers and MKL for the AVX generation
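Here is the sketch referenced above: a minimal check of how the quoted efficiencies relate to peak, assuming the 8-core 2.7 GHz E5 sockets at 8 double-precision flops per cycle from the earlier slide; the DDR3-1600 Stream point appears only as a comment, since the deck quotes no bandwidth figure.

    /* How the CURIE thin-node efficiency figures relate to peak,
     * assuming 8 cores/socket at 2.7 GHz and 8 DP flops/cycle (AVX).
     * (Stream side note: 4 channels of DDR3-1600 would give a socket
     * 51.2 GB/s peak memory bandwidth; an assumption, not quoted.) */
    #include <stdio.h>

    int main(void) {
        double gf_socket = 8 * 2.7 * 8;                   /* 172.8 GF */
        double blades = 5040, sockets = 2;

        double blade_peak = sockets * gf_socket;          /* 345.6 GF */
        double system_pf  = blades * blade_peak / 1e6;    /* ~1.74 PF */

        printf("cores: %.0f\n", blades * sockets * 8);    /* 80640    */
        printf("blade HPL at 90%%: %.0f of %.1f GF\n",
               0.90 * blade_peak, blade_peak);            /* ~311 GF  */
        printf("system HPL at 83-87%%: %.2f-%.2f of %.2f PF\n",
               0.83 * system_pf, 0.87 * system_pf, system_pf);
        return 0;
    }

The computed core count (80,640) and system peak (~1.74 PF) line up with the partition figures quoted above.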
Beyond Stream and HPL: a lot of key scientific challenges to address
Specific grand challenges about to start on CURIE
• Massive simulations in fusion, astrophysics, combustion and Life Sciences on the whole machine
CURIE open to PRACE users on March 1st, 2012
CURIE: A tool for Science & Open Innovation
In the field of medicine: molecular simulations for Alzheimer's disease
• Solve some key issues in the chemistry of Alzheimer's disease:
  – The formation of amyloid plaques, a hallmark of the disease
  – Role of metallic ions (Cu, Zn, Fe)
  – Comparison of two models, human and mouse
• Optimisation of the QMC=Chem code done with the French Intel Exascale Lab (ECR)
CURIE: A tool for Science & Open Innovation
Ongoing Grand Challenge on up to 50k cores of CURIE
Intel’s Many Core and Multi-core Engines
• Die Size not to scale
Intel Xeon processor:
• Foundation of HPC Performance
• Suited for full scope of workloads
• Industry-leading performance/watt for serial and highly parallel workloads
MIC Architecture:
• Optimized for highly parallelized compute intensive workloads
• Common software tools with Xeon enabling efficient app readiness and performance tuning
Multi-core Intel® Xeon® processor at 2.26-3.5 GHz
Many Integrated Cores at 1-1.2 GHz
Texas Advanced Computing Center
Die Sizes not to scale
Planned January 2013: "Stampede", a 10-petaflop Intel® Xeon / Intel® MIC national cyberinfrastructure supercomputer
Operational October 2011
Targeting codes in Biology, Astrophysics, and Computational Fluid Dynamics (CFD)
TACC: Dell Experimental Cluster
8 Intel® Xeon® nodes / 8 Intel® MIC nodes
Sandia National Laboratories
Operational October 2011; evaluating highly parallel programming models
Computational fluid dynamics application enabled in a few days; 3D results targeted
“Arthur” Experimental Cluster
42 Intel® Xeon® nodes / 84 Intel® MIC nodes
Die Sizes not to scale
Jeff Nichols, Acting Director, National Center for Computational Sciences / National Leadership Computing Facility, Oak Ridge National Laboratory
Driving leadership technology in parallel computing
– High performance and parallelism, regardless of architecture
Experience with Intel® MIC Architecture
− Programmability and performance promise
Jeff Nichols, Associate Laboratory Director, Computing and Computational Sciences ([email protected])
Robert Harrison, Director, Joint Institute for Computational Sciences ([email protected])
Opportunities and Challenges Posed by Exascale Computing: ORNL's Plans and Perspectives
Our vision for sustained leadership and scientific impact
• Provide world’s most powerful open resources for scalable computing and simulation, data and analytics, and infrastructure for science
• Follow a well-defined path for maintaining world leadership in these critical areas
• Attract the brightest talent and partnerships from all over the world
• Deliver leading-edge science relevant to missions of DOE and key federal and state agencies
• Invest in cross-cutting partnerships with industry
• Provide unique opportunity for innovation based on multiagency collaboration
• Invest in education and training
Partnerships are key to sustained leadership
• International
• Interagency (multiagency)
• Multi-institutional
• Multi-disciplinary
• Industrial / vendor
• Experiment, theory, and simulation
• Facility and R&D
Executing the vision
• Applications: scalable compute (chemistry, climate, fusion, combustion, energy grid, …) and data (discrete-event, real-time, sensors, social networks); research and development co-design end stations
• Algorithms and analytics: solvers, 13 "motifs," clock-constrained information fusion, geospatial-temporal
• Tools and middleware: compilers, performance optimization, global arrays, streaming, tagging, smart data, agent-based
• Hardware systems and associated software: heterogeneous multicore, massively threaded, visualization systems, mobile systems, workflow, operating systems, programming models, …; acquisition, deployment, operations
• Infrastructure: power, cooling, networking, hierarchical storage, cyber security, experimental systems, …
Partnerships are essential to success
• Co-design of applications, computational environment, and platforms
• Application teams with dual responsibility
– Mission/science
– Exascale co-design
• Simulation environment characterized by broad community participation
– Common across all applications and platforms
– Leveraging open source software and product support
– Supporting both evolutionary and revolutionary approaches
• Long-term industry partnerships
– Leverage and influence business plans of vendor partners
– Joint R&D and leveraged community efforts reduce vendor risk
– Minimum of 2 tracks provides competitiveness, risk reduction, and architectural diversity, as does deployment of at least 2 platform generations to get to exascale
National Institute for Computational Sciences
Joint Institute for Computational Sciences, University of Tennessee & ORNL
Funded by the National Science Foundation (NSF)
Operates Kraken, a 1.17 PetaFLOP Cray XT5 that is the NSF’s most productive supercomputer
Major partner in the NSF’s Extreme Science and Engineering Discovery Environment (XSEDE)
Application Acceleration Center of Excellence (AACE)
Joint Institute for Computational Sciences
Director: R. Glenn Brook ([email protected])
Prepare the national supercomputing community for effective and efficient use of future architectures
– Research, education, outreach
– Future technology evaluation and development in partnership with multiple vendors
Intel® Many Integrated Core (Intel® MIC) partner
– Multiple Knights Ferry prototype cards in multiple platforms (e.g., clusters from Cray, Appro, …)*
– Evaluate as future technology for NSF applications
– Collaborate with Intel in testing and platform design
*See demo platforms in the ORNL/UT and Intel booths at SC11
Experience with the Knights Ferry design and development kit
Unparalleled productivity: in under 3 months
– Ported all of NWChem (chemistry), ENZO (astro.), ELK (mat. sci.), MADNESS (app. math.), MPI, GA, …
– Correct ports in less than one day each
– Circa 5M LOC (Fortran 77/90, C, C++, Python)
– MPI, Global Arrays, …
Most of this software does not run on GPGPUs and probably never will, due to cost and complexity
Demonstrated execution modes:
– Native mode: KNF is a fully networked Linux system
– Offload mode: KNF is an attached accelerator
– Reverse offload mode: KNF in native mode offloads to the host
– Cluster mode: parallel application distributed across multiple KNF cards and hosts using MPI
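For flavor, here is a minimal sketch of the offload mode in code. The pragma follows Intel's Language Extensions for Offload as shipped with the Intel compilers of that era (an assumption; the slide names no syntax), and a compiler without that support simply ignores the unknown pragma and runs the loop on the host.

    /* Offload-mode sketch: the host ships a parallel loop to the MIC
     * card. #pragma offload is Intel's Language Extensions for
     * Offload (assumed syntax; requires an Intel compiler with MIC
     * support). Other compilers ignore it and run on the host. */
    #include <stdio.h>

    #define N 1024

    int main(void) {
        static float in[N], out[N];
        for (int i = 0; i < N; i++) in[i] = (float)i;

        #pragma offload target(mic) in(in) out(out)
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            out[i] = in[i] * in[i];    /* trivially parallel kernel */

        printf("out[%d] = %.1f\n", N - 1, out[N - 1]);
        return 0;
    }

Native and reverse-offload modes use the same source; only where the binary runs (and which side initiates work) changes.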
Initial scaling results
Sod shock test problem: Eulerian inviscid flow on an unstructured grid, five variables per grid point. Speedup measured in native-mode OpenMP on the KNF using both single and double precision [Brook and Hulguin]
[Chart: speedup vs. number of threads (1 to 128) for single precision, double precision, and double precision at 2x problem size, against ideal scaling; annotated 92%–96% of ideal speedup and 73%–99% of ideal speedup]
[Chart: observed speedup vs. number of threads (1 to 32) against ideal scaling]
ENZO-C astrophysics/cosmology, 128³ non-AMR, pure MPI mode (1 process per core) [Harkness]
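The "% of ideal speedup" annotations are simply measured speedup divided by the ideal speedup, which for strong scaling from one thread is the thread count. A minimal sketch follows; the sample speedup values are hypothetical, for illustration only.

    /* Parallel efficiency as used in the chart annotations:
     * efficiency = measured speedup / ideal speedup (= thread count).
     * The sample speedup values below are hypothetical. */
    #include <stdio.h>

    int main(void) {
        int    threads[] = { 2, 4, 8, 16, 32 };
        double speedup[] = { 1.98, 3.90, 7.60, 15.1, 29.8 };

        for (int i = 0; i < 5; i++)
            printf("%2d threads: speedup %5.2f -> %3.0f%% of ideal\n",
                   threads[i], speedup[i],
                   100.0 * speedup[i] / threads[i]);
        return 0;
    }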
Productivity and performance
• Productivity
• Fully portable standards-compliant source code
• Complete environment for scientific computing including robust, fast, math libraries (e.g., MKL)
• Rich vector intrinsics with full predication
• Performance
• Embarrassingly parallel code trivially runs well
• Established MPI + OMP/threads + scalar-vector programming model enables vastly more complex algorithms to run efficiently (a minimal hybrid sketch follows below)
• But Knights Ferry is not a product
• Need a faster & bigger platform to buy in quantity
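Here is the minimal hybrid sketch referenced above: MPI between ranks, OpenMP threads within each rank, vectorizable inner work left to the compiler. It is illustrative only, not code from the labs; any MPI implementation should run it (e.g., built with mpicc -fopenmp).

    /* Minimal hybrid MPI + OpenMP skeleton of the programming model
     * described above: MPI between ranks, OpenMP threads within each
     * rank. Illustrative only.
     * Build (one possibility): mpicc -fopenmp hybrid.c -o hybrid */
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, nranks, provided;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        double local = 0.0;
        #pragma omp parallel for reduction(+:local)
        for (int i = rank; i < 1000000; i += nranks)
            local += 1.0 / (1.0 + (double)i);   /* per-rank slice of work */

        double total;
        MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("total = %f (ranks: %d, threads/rank: %d)\n",
                   total, nranks, omp_get_max_threads());

        MPI_Finalize();
        return 0;
    }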
Knights Corner
Knights Corner Performance
Shattering Barriers: Crossing 1 Sustained TeraFlops

               ASCI Red (1997)                Knights Corner (2011)
  Milestone    First system to sustain 1 TF   First chip to sustain 1 TF
  Compute      9,298 Pentium II Xeon          One 22nm chip
  OS           Cougar                         Linux
  Footprint    72 cabinets                    1 PCI Express slot
Source and Photo: http://en.wikipedia.org/wiki/ASCI_Red
ANNOUNCING
Software Development and System Environment
• Die sizes not to scale
Intel® Xeon® Processor and Intel® Many Integrated Core Architecture:
• Same comprehensive set of tools
• Established HPC operating system
Application source code builds with a compiler switch. No changes required!
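As a hedged illustration of that point, a single OpenMP source file can target the host Xeon or the coprocessor with only a change to the build line. The -mmic switch in the comments is the Intel compiler's native-MIC flag as later documented for Knights Corner; that flag name is an assumption here, since the slide names no specific switch.

    /* One source file, two targets, changing only the build line.
     * Host Xeon:  icc -O2 -openmp -xAVX  saxpy.c -o saxpy.host
     * MIC native: icc -O2 -openmp -mmic  saxpy.c -o saxpy.mic
     * (-mmic is the Intel compiler's native-MIC switch for Knights
     *  Corner; an assumption, the slide names no specific flag.) */
    #include <stdio.h>

    #define N (1 << 20)
    static float x[N], y[N];

    int main(void) {
        for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            y[i] = 2.0f * x[i] + y[i];     /* saxpy: same code both ways */

        printf("y[0] = %f\n", y[0]);
        return 0;
    }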
Exascale Systems Design
Needs a Multi-disciplinary Approach:
• Microprocessor
• Memory
• Interconnect
• Parallel software
• Power management
• Reliability and resiliency
Intel Exascale Labs - Europe
Strong commitment to advancing the leading edge of computing: Intel collaborating with the HPC community and European researchers. Four labs in Europe, with Exascale computing as the central topic, each under a signed collaboration agreement:
• ExaScale Computing Research Center, Paris: performance and scalability of Exascale applications
• Exascale Cluster Lab, Jülich: Exascale cluster scalability and reliability
• Exascience Lab, Leuven: space weather prediction, architectural simulation and visualization, numerical kernels
• Intel-BSC Exascale Lab, Barcelona: scalable run-time systems, Exascale tools, new algorithms
Intel Xeon processor E5 will serve as the foundation of HPC in 2012
Intel MIC architecture with industry-leading performance and a familiar and useful usage model
Our partnership ramp to Exascale
Copyright © 2011, Intel Corporation. All rights reserved.
Legal Disclaimers: Linpack and HPC suite
For the 2.1X Linpack claim: New configuration: 2S Intel® Xeon® E5-2680 score of 342.7 Gflops based on Intel internal measurements as of 7 September 2011 using an Intel® Rose City platform with two Intel® Xeon® processor E5-2680, Turbo enabled, EIST enabled, Hyper-Threading enabled, 64 GB memory (8 x 8GB DDR3-1600), Red Hat* Enterprise Linux Server 6.1 beta for x86_64. Baseline: Intel® Xeon® 5600 processor platform with two Intel® Xeon® Processor X5690 (6-core, 3.46GHz, 12MB L3 cache, 6.4GT/s, B1-stepping), EIST enabled, Turbo Boost enabled, Hyper-Threading disabled, 48GB memory (12x 4GB DDR3-1333 REG ECC), 160GB SATA 7200RPM HDD, Red Hat* Enterprise Linux Server 5.5 for x86_64 with kernel 2.6.35.10; score: 159.40 Gflops. Source: Intel internal testing as of April 2011, Intel SSG TR#1224. For the HPC suite: Baseline: 2S Intel® Xeon® X5690 HPC suite geometric mean of application measurements by vertical (CAE, Energy, FSI, Life Science, NWS); actual performance will vary by workload. Based on Intel internal measurements as of October 2011 using an Intel® Xeon® 5600 processor platform with two Intel® Xeon® X5690, Turbo enabled, best Hyper-Threading configuration, 48GB DDR3-1333 memory, Red Hat* EL5-U5. New configuration: 2S Intel® Xeon® E5-2680 HPC suite geometric mean of application measurements by vertical (CAE, Energy, FSI, Life Science, NWS); actual performance will vary by workload. Based on Intel internal measurements as of October 2011 using an Intel® Canoe Pass platform with two Xeon® E5-2680 (C0 step), Turbo enabled, best Hyper-Threading configuration, 32 GB DDR3-1600 memory, Red Hat* Enterprise Linux 6.1, 2.6.39.3 kernel.