
Page 1: Center for Research on Multicore Computing (CRMC) Overview Ken Kennedy Rice University ken/Presentations/CRMC06.pdf

Center for Research on Multicore Computing (CRMC)

Overview

Ken Kennedy, Rice University

http://www.cs.rice.edu/~ken/Presentations/CRMC06.pdf

Page 2:

Center for High Performance Software Research

CRMC Overview

• Initial Participation from Three Institutions
— Rice: Ken Kennedy, Keith Cooper, John Mellor-Crummey, Scott Rixner
— Indiana: Geoffrey Fox, Dennis Gannon
— Tennessee: Jack Dongarra

• Activities
— Research and prototype development
— Community building
– Workshops and meetings
— Other outreach components (separately funded)

• Planning and Management
— Coordinated management and vision-building
– Model: CRPC

Page 3:

Management Strategy

• Pioneered in CRPC and honed in GrADS/VGrADS/LACSI

• Leadership forms a broad vision-building team
— Problem identification
— Willingness to redirect research to address new challenges
— Complementary research areas
– Willingness to look at problems from multiple dimensions
— Joint projects between sites

• Community-building activities
— Focused workshops on key topics
– CRPC: TSPLib, BLACS/ScaLAPACK
– LACSI: Autotuning
— Informal standardization
– CRPC: MPI and HPF

• Annual planning cycle
— Plan, research, report, evaluate, …

Page 4:

Research Areas I

• Compilers and programming tools
— Tools: performance analysis and prediction (HPCToolkit)
— Transformations: memory hierarchy and parallelism
— Automatic tuning strategies

• Programming models and languages
— High-level languages: Matlab, Python, R, etc.
— HPCS languages
— Programming models based on component integration

• Run-time systems
— Core run-time data movement library
— Integration with MPI

• Libraries
— Adaptive, reconfigurable libraries optimized for multicore systems

Page 5:

Research Areas II

• Applications for multicore systems
— Classical parallel/scientific applications
— Commercial applications, with advice from industrial partners

• Interface between software and architecture
— Facilities for managing bandwidth (controllable caches, scratch memory)
— Sample-based profiling facilities
— Heterogeneous cores

• Fault tolerance
— Redundant components
— Diskless checkpointing

• Multicore emulator
— Research platform for future systems

Page 6:

Performance Analysis and Prediction

• HPCToolkit (Mellor-Crummey)
— Uses sample-based profiling combined with binary analysis to report performance issues (recompilation not required)
– How to extend to the multicore environment?

• Performance prediction (Mellor-Crummey)
— Currently uses a performance prediction methodology that accurately accounts for the memory hierarchy
– Reuse-distance histograms built from training data, parameterized by input data size
– Accurately determines the frequency of misses at each reference
— Extension to shared-cache multicore systems (underway)
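The reuse-distance idea behind this methodology can be sketched in a few lines. This is an illustrative toy (the function name and list-based trace are assumptions for the sketch); HPCToolkit's actual methodology works on sampled profiles and binaries, not explicit traces:

```python
from collections import OrderedDict

def reuse_distances(trace):
    """Reuse distance of each access: the number of distinct addresses
    touched since the previous access to the same address (infinite for
    cold accesses). A fully associative LRU cache of capacity C hits
    exactly when the distance is < C, which is what makes the histogram
    predictive of miss frequency."""
    mru = OrderedDict()            # keys in least- to most-recently-used order
    dists = []
    for addr in trace:
        if addr in mru:
            order = list(mru)
            dists.append(len(order) - 1 - order.index(addr))
            mru.move_to_end(addr)  # refresh to most recently used
        else:
            dists.append(float("inf"))
            mru[addr] = None
    return dists

# Sweeping a, b, c repeatedly: each reuse sees 2 other distinct addresses.
print(reuse_distances(["a", "b", "c", "a", "b", "c"]))
# [inf, inf, inf, 2, 2, 2]
```

Binning these distances by input size, as the slide describes, gives the parameterized histograms used for prediction.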

Page 7:

Bandwidth Management

• Multicore raises computational power rapidly
— Bandwidth onto the chip is unlikely to keep up

• Multicore systems will feature shared caches
— Replaces false sharing with an enhanced probability of conflict misses

• Challenges for effective use of bandwidth
— Enhancing reuse when multiple processors are using the cache
— Reorganizing data to increase the density of cache-block use
— Reorganizing computation to ensure reuse of data by multiple cores
– Inter-core pipelining
— Managing conflict misses
– With and without architectural help

• Without architectural help
— Data reorganization within pages, and synchronization, to minimize conflict misses
– May require special memory-allocation run-time primitives
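The data-reorganization idea can be illustrated with simple set-index arithmetic. The cache geometry below is an assumption chosen for the sketch, not a specific machine:

```python
def set_index(addr, block=32, num_sets=1024):
    """Associativity group selected by an address
    (illustrative parameters: 32-byte blocks, 1024 sets)."""
    return (addr // block) % num_sets

way_bytes = 32 * 1024   # num_sets * block: the span of one cache way

# Arrays placed back-to-back at multiples of the way size collide:
# corresponding blocks of all three land in the same group.
naive = [k * way_bytes for k in range(3)]
print({set_index(base) for base in naive})        # one shared group: {0}

# One extra block of padding between the arrays separates the groups.
padded = [k * (way_bytes + 32) for k in range(3)]
print({set_index(base) for base in padded})       # three groups: {0, 1, 2}
```

A run-time allocator that controls placement within pages could apply the same arithmetic, which is why the slide notes that special allocation primitives may be required.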

Page 8:

Conflict Misses

• Unfortunate fact:
— If a scientific calculation is sweeping across strips of more than k arrays on a machine with k-way associativity, and
— all of the strips overlap in one associativity group, then
– every access to the overlapping group's location is a miss

[Figure: three array strips, labeled 1, 2, and 3, all mapping to the same associativity group]

On each outer loop iteration, 1 evicts 2, which evicts 3, which evicts 1. In a 2-way associative cache, all are misses!

This limits loop fusion, a profitable reuse strategy.
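The slide's scenario can be checked with a tiny LRU simulation of a single cache set (a hypothetical sketch, not CRMC code):

```python
from collections import deque

def simulate_set(ways, trace):
    """Count misses in one associativity group (cache set) under LRU.
    `trace` is the sequence of block tags that map to this set."""
    lines = deque(maxlen=ways)     # leftmost = least recently used
    misses = 0
    for tag in trace:
        if tag in lines:
            lines.remove(tag)      # hit: refresh to most recently used
        else:
            misses += 1            # miss: the LRU line is evicted on append
        lines.append(tag)
    return misses

# Strips of arrays 1, 2, 3 all map to the same group and are swept
# round-robin by the outer loop, as in the figure:
trace = [1, 2, 3] * 8
print(simulate_set(2, trace))      # 24 -- every access misses
print(simulate_set(3, trace))      # 3  -- only the cold misses remain
```

With 2-way associativity the round-robin sweep defeats LRU completely; one extra way (or one strip moved to a different group) recovers all the reuse.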

Page 9:

Controlling Conflicts: An Example

• Cache and Page Parameters
— 256K cache, 4-way set associative, 32-byte blocks
– 1024 associativity groups
— 64K page
– 2048 cache blocks
— Each block in a page maps to a unique associativity group
– 2 different lines in a page map to the same associativity group

• In General
— Let A = number of associativity groups in the cache
— Let P = number of cache blocks in a page
— If P ≥ A, then each block in a page maps to a single associativity group
– No matter where the page is loaded
— If P < A, then a block can map to A/P different associativity groups
– Depending on where the page is loaded
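The A/P rule can be written down directly. The cache sizes below are chosen to exercise both regimes and are illustrative assumptions, not the slide's exact configuration:

```python
def mapping_summary(cache_bytes, ways, block_bytes, page_bytes):
    """Derive the slide's A/P relationship.
    A = number of associativity groups (sets) in the cache,
    P = number of cache blocks in a page."""
    A = cache_bytes // (ways * block_bytes)
    P = page_bytes // block_bytes
    if P >= A:
        # Every page block maps to one fixed group, regardless of where
        # the page is loaded; P // A page blocks share each group.
        return A, P, P // A
    # Otherwise a block can land in A // P different groups,
    # depending on where the OS places the page.
    return A, P, A // P

# 128 KB cache, 4-way, 32 B blocks, 64 KB pages: P >= A, and two
# lines of the page share each group.
print(mapping_summary(128 * 1024, 4, 32, 64 * 1024))   # (1024, 2048, 2)
# 512 KB cache, same geometry: P < A, so each block can land in
# either of 2 groups depending on page placement.
print(mapping_summary(512 * 1024, 4, 32, 64 * 1024))   # (4096, 2048, 2)
```

The P ≥ A case is what makes within-page data allocation (next slide) a viable lever: placement inside the page fully determines the associativity group.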

Page 10:

Questions

• Can we do data allocation precisely within a page so that conflict misses are minimized in a given computation?
— Extensive work on minimizing self-conflict misses
— Little work on inter-array conflict minimization
— No work, to my knowledge, on interprocessor conflict minimization

• Can we synchronize computations so that multiple cores do not interfere with one another?
— Even reuse blocks across processors

• Might it be possible to convince vendors to provide additional features to help control the cache, particularly conflict misses?
— Allocation of part of the cache as a scratchpad
— Dynamic modification of the cache mapping

Page 11:

Parallelism

• On a shared-cache multicore chip, running the same program on multiple processors has a major advantage
— Possibility of reusing cache blocks across processors
— Some chance of controlling conflict misses

• How can parallelism be found and exploited?
— Automatic methods on scientific languages
– Much progress was made in the 90s
— Explicit parallel programming and thread-management paradigms
– Data parallel (HPF, Chapel)
– Partitioned global address space (Co-Array Fortran, UPC)
– Lightweight threading (OpenMP, CCR)
— Software synchronization primitives
— Integration of parallel component libraries
– Telescoping languages
– Parallel Matlab

Page 12:

Automatic Tuning

• Following ATLAS

• Tuning generalized component libraries in advance
— For different platforms
— For different contexts on the same platform
– May wish to choose a variant that uses a subset of the cache that does not conflict with the calling program

• Extensive work at Rice and Tennessee
— Heuristic search combined with compiler models cuts tuning time
— Many transformations: unroll-and-jam, tiling, fusion, etc.
– They interact with one another

• New challenges for multicore
— Tuning on-chip multiprocessors to use the shared (and non-shared) memory hierarchy effectively
— Management of on-chip parallelism and threading
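How heuristic search cuts tuning time can be shown with a toy tuner. The cost model below is a made-up stand-in for real timing runs, and the coordinate-descent heuristic is just one example of the kind of search the slide alludes to; real systems like ATLAS search over generated, compiled code:

```python
import itertools

def exhaustive_best(measure, space):
    """Baseline: measure every point of the cross-product (slow)."""
    return min(itertools.product(*space.values()),
               key=lambda pt: measure(dict(zip(space, pt))))

def coordinate_descent(measure, space):
    """Heuristic: tune one parameter at a time, holding the others
    fixed -- sum-of-axis-sizes measurements instead of their product."""
    current = {p: vals[0] for p, vals in space.items()}
    for p, vals in space.items():
        current[p] = min(vals, key=lambda v: measure({**current, p: v}))
    return current

# Hypothetical analytic cost model standing in for timing a variant
# (a real tuner would compile and run each unroll/tile combination):
def model(cfg):
    return (cfg["unroll"] - 4) ** 2 + (cfg["tile"] - 64) ** 2

space = {"unroll": [1, 2, 4, 8], "tile": [16, 32, 64, 128]}
print(exhaustive_best(model, space))     # (4, 64) after 16 measurements
print(coordinate_descent(model, space))  # same optimum after only 8
```

Since each measurement is a full compile-and-run cycle, shrinking the search from the product of the axes to their sum is exactly the kind of saving the slide's compiler models are after; model-based pruning shrinks it further.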

Page 13:

Other Compiler Challenges

• Multicore chips used in scalable parallel machines
— Multiple kinds of parallelism: on-chip, within an SMP group, distributed memory

• Heterogeneous multicore chips (Grid on a chip)
— In the Intel roadmap
— Challenge: decomposing computations to match the strengths of different cores
– Static and dynamic strategies may be required
– Performance models for subcomputations on different cores
– Interaction of heterogeneity and memory hierarchy
– Staging computations through shared cache (workflow steps running on different cores)

• Component-composition programming environments
— Graphical, or construction from scripts

Page 14:

Compiler Infrastructure

• D System Infrastructure
— Includes full dependence analysis
— Support for high-level transformations
– Register, cache, fusion
— Support for parallelism and communication management
– Originally used for HPF

• Telescoping Languages Infrastructure
— Constructed for Matlab compilation and component integration
— Constraint-based type analysis
– Produces type-jump functions for libraries
— Variant specialization and selection
— Applied to the parallel Matlab project

• Both are currently distributed under a BSD-style license (no GPL)

• Open64 Compiler Infrastructure
— GPL license

Page 15:

Proposal

• An NSF Center for Research on Multicore Computing
— Modeled after CRPC
– Core research program
– Multiple participating institutions
— Research
– Compilers and tools
– Architectural modifications, supported by simulation
– Run-time systems and communication/synchronization
– Driven by real applications from the NSF community
— Big community outreach program
– Specific topical workshops
— Major investment from Intel

• Coupled with a Multicore Computing Research Program
— Designed to foster a vibrant community of researchers

Page 16:

Leverage

• DOE SciDAC Projects
— Currently proposed: a major Enabling Technology Center
– Kennedy: CScADS (includes infrastructure development)
— Participants in several other relevant SciDAC efforts
– PERC2, PModels

• LACSI Projects
— Subject to the ASC budget

• Chip Vendors
— Intel, AMD, IBM (we have relationships with all)

• Microsoft

• HPCS Collaborations
— New languages and tools must run on systems using multicore chips

• New NSF Center?
— Community development, as with CRPC
— Major contribution from Intel