Download - The Stanford Pervasive Parallelism Lab
The Stanford
Pervasive Parallelism
Lab
A. Aiken, B. Dally, R. Fedkiw, P. Hanrahan, J. Hennessy, M. Horowitz, V. Koltun, C. Kozyrakis, K. Olukotun, M.
Rosenblum, S. Thrun
Pervasive Parallelism LaboratoryStanford University
The Looming Crisis Software developers will soon face systems with
> 1 TFLOP of compute power
20+ of cores, 100+ hardware threads
Heterogeneous cores (CPU+GPUs), app-specific accelerators
Deep memory hierarchies
Challenge: harness these devices productively Improve performance, power, reliability and security
The parallelism gap Yawning divide between the capabilities of today’s
programming environments, the requirements of emerging applications, and the challenges of future parallel architectures
The Stanford Pervasive Parallelism Laboratory
Goal: the parallel computing platform for 2012 Make parallel programming practical for the masses Algorithms, programming models, runtimes, and architectures for scalable parallelism (10,000s of threads)
Parallel computing a core component of CS education
PPL is a combination of Leading Stanford researchers across multiple domains
Applications, languages, software systems, architecture Leading companies in computer systems and software
Sun, AMD, Nvidia, IBM, Intel, HP An exciting vision for pervasive parallelism Open laboratory; all result in the open-source
The PPL Team Applications
Ron Fedkiw, Vladlen Koltun, Sebastian Thrun
Programming & software systems Alex Aiken, Pat Hanrahan, Mendel Rosenblum
Architecture Bill Dally, John Hennessy, Mark Horowitz, Christos Kozyrakis, Kunle Olukotun (Director)
The PPL Team Research expertise
Applications: graphics, physics simulation, visualization, AI, robotics, …
Software systems: virtual machines, GPGPU, stream programming, transactional programming, speculative parallelization, optimizing compilers, bug detection, security,…
Architecture: multi-core & multithreading, scalable shared-memory, transactional memory hardware, interconnect networks, low-power processors, stream processors, vector processors, …
Commercial success MIPS & SGI, Rambus, VMware, Niagara processors,
Renderman, Stream Processors Inc, Avici, Tableau, …
Guiding Principles Top down research
App & developer needs drive system High-level info flows to low-level system
Scalability In hardware resources (10, 000s threads) Developer productivity (ease of use)
HW provides flexible primitives Software synthesizes complete solutions
Build real, full system prototypes
The PPL VisionVirtual Worlds
Autonomous Vehicle
Financial Services
Physics DSL
Scripting DSL
Probabilistic DSL
Analytics DSL
Rendering DSL
Parallel Object Language
Hardware Architecture
OOO Cores SIMD Cores Threaded Cores
Scalable Interconnects
PartitionableHierarchies
Scalable Coherence
Isolation & Atomicity
Pervasive Monitoring
Common Parallel Runtime
Explicit / Static Implicit / Dynamic
The PPL VisionVirtual Worlds
Autonomous Vehicle
Financial Services
Physics DSL
Scripting DSL
Probabilistic DSL
Analytics DSL
Rendering DSL
Parallel Object Language
Hardware Architecture
OOO Cores SIMD Cores Threaded Cores
Scalable Interconnects
PartitionableHierarchies
Scalable Coherence
Isolation & Atomicity
Pervasive Monitoring
Common Parallel Runtime
Explicit / Static Implicit / Dynamic
Demanding Applications
Leverage domain expertise at Stanford CS research groups & national centers for scientific
computing From consumer apps to neuroinformatics
PPL
Media-XMedia-X
NIH NCBCNIH NCBC
DOE ASCDOE ASC
EnvironmentalScience
EnvironmentalScience
GeophysicsGeophysics
Web, MiningStreaming DBWeb, MiningStreaming DB
GraphicsGames
GraphicsGames
MobileHCI
MobileHCI
Existing Stanford research center
Seismic modeling
Existing Stanford CS research groups
AI/MLVisionAI/MLVision
Virtual Worlds Application Next-gen web platform
Immersive collaboration Social gaming Millions of players in vast landscape
Challenges Client-side game engine Server-side world simulation AI, physics, large-scale rendering Dynamic content, huge datasets
More at http://vw.stanford.edu/
Autonomous Vehicle Application
Cars that drive autonomously in traffic
Save lives & money Improve highway throughput Improve productivity
Challenges Client-side sensing, perception,
planning, & control Server-side data merging, pre-processing,
& post-processing, traffic control, model generation
Real-time, huge datasets
More at http://www.stanfordracing.org
The PPL VisionVirtual Worlds
Autonomous Vehicle
Financial Services
Physics DSL
Scripting DSL
Probabilistic DSL
Analytics DSL
Rendering DSL
Parallel Object Language
Hardware Architecture
OOO Cores SIMD Cores Threaded Cores
Scalable Interconnects
PartitionableHierarchies
Scalable Coherence
Isolation & Atomicity
Pervasive Monitoring
Common Parallel Runtime
Explicit / Static Implicit / Dynamic
Domain Specific Languages (DSL)
Leverage success of DSL across application domains SQL (data manipulation), Matlab (scientific), Ruby/Rails
(web),…
DSLs higher productivity for developers High-level data types & ops tailored to domain
E.g., relations, triangles, matrices, … Express high-level intent without specific implementation
artifacts Programmer isolated from details of specific system
DSLs scalable parallelism for the system Declarative description of parallelism & locality patterns
E.g., ops on relation elements, sub-array being processed, … Portable and scalable specification of parallelism
Automatically adjust data structures, mapping, and scheduling as systems scale up
DSL Research & Challenges Goal: create the tools for DSL development
Initial DSL targets Rendering, physics simulation, analytics,
probabilistic computations
Challenges DSL implementation embed in base PL
Start with Scala (OO, type-safe, functional, extensible)
Use Scala as a scripting DSL that ties multiple DSLs DSL-specific optimizations telescoping compilers
Use domain knowledge to optimize & annotate code Feedback to programmers ? …
The PPL VisionVirtual Worlds
Autonomous Vehicle
Financial Services
Physics DSL
Scripting DSL
Probabilistic DSL
Analytics DSL
Rendering DSL
Parallel Object Language
Hardware Architecture
OOO Cores SIMD Cores Threaded Cores
Scalable Interconnects
PartitionableHierarchies
Scalable Coherence
Isolation & Atomicity
Pervasive Monitoring
Common Parallel Runtime
Explicit / Static Implicit / Dynamic
Common Parallel Runtime (CPR)
Goals Provide common, portable, abstract target for all DSLs
Write once, run everywhere model Manages parallelism & locality
Achieve efficient execution (performance, power, …) Handles specifics of HW system
Approach Compile DSLs to common IR
Base language + low-level constructs & pragmas Forall, async/join, atomic, barrier, …
Per-object capabilities Read-only or write-only, output data, private, relaxed coherence, …
Combine static compilation + dynamic management Explicit management of regular tasks & predictable patterns
Implicit management of irregular parallelism
CPR Research & Challenges
Integrating & balancing opposing approaches Task-level & data-level parallelism Static & dynamic concurrency management Explicit & implicit memory management
Utilize high-level information from DSLs The key to overcoming difficult challenges
Adapt to changes in application behavior, OS decisions, runtime constraints
Manage heterogeneous HW resources
Utilize novel HW primitives To reduce overhead of communication, synchronization, … To understand runtime behavior on specific HW & adapt to it
The PPL VisionVirtual Worlds
Autonomous Vehicle
Financial Services
Physics DSL
Scripting DSL
Probabilistic DSL
Analytics DSL
Rendering DSL
Parallel Object Language
Hardware Architecture
OOO Cores SIMD Cores Threaded Cores
Scalable Interconnects
PartitionableHierarchies
Scalable Coherence
Isolation & Atomicity
Pervasive Monitoring
Common Parallel Runtime
Explicit / Static Implicit / Dynamic
Hardware Architecture @ 2012 The many-core chip
100s of cores OOO, threaded, & SIMD
Hierarchy of shared memories Scalable, on-chip network
The system Few many-core chips Per-chip DRAM channels Global address space
The data-center Cluster of systems
L2 Memory L2 Memory
L1L1
L1L1 L1
L2 Memory L2 Memory
L3 Memory
L1
L1L1
L1L1 L1 L1
L1 L1 L1 L1L1 L1
I/ODRAM
CTL
TCTC
TC OOOTC
SIMDTC
TCTC OOO
TCSIMD
OOO SIMD OOO SIMDTC TC
Architecture Challenges Heterogeneity
Balance of resources, granularity of parallelism
Support for parallelism & locality management
Synchronization, communication, … Explicit Vs. implicit locality management Runtime monitoring
Scalability On-chip/off-chip bandwidth & latency Scalability of key abstractions (e.g., coherence)
Beyond performance Power, fault-tolerance, QoS, security, virtualization
Architecture Research Revisit architecture & micro-architecture for
parallelism Define semantics & implementation of key primitives Communication, atomicity, isolation, partitioning,
coherence, consistency, checkpoint Fine-grain & bulk support
Software synthesizes primitives into execution systems
Streaming system: partitioning + bulk communication Thread-level spec: isolation + fine-grain communication Transactional memory: atomicity + isolation + consistency Security: partitioning + isolation Fault tolerance: isolation + checkpoint + bulk
communication
Challenges: interactions, scalability, cost, virtualization
Architecture Research Software-managed HW primitives
Exploit high-level knowledge from DSLs & CPR E.g., scale coherence using coarse-grain techniques
Coarse-grain in time: force coherence only when needed
Coarse-grain in space: object-based, selective coherence
Support for programmability & management Fine-grain monitoring, HW-assisted invariants Build upon primitives for concurrency Efficient interface to CPR
Scalable on-chip & off-chip interconnects High-radix network Adaptive routing
Research Methodology Conventional approaches are still useful
Develop app & SW system on existing platforms Multi-core, accelerators, clusters, …
Simulate novel HW mechanisms
Need some method that bridges HW & SW research Makes new HW features available for SW research Does not compromise HW speed, SW features, or scale Allows for full-system prototypes
Needed for research, convincing for industry, exciting for students
Approach: commodity chips + FPGAs in memory system Commodity chips: fast system with rich SW environment FPGAs: prototyping platform for new HW features Scale through cluster arrangement
MemoryMemoryMemoryMemory
MemoryMemory MemoryMemory
Core 0Core 0 Core 1Core 1
Core 2Core 2 Core 3Core 3
Core 0Core 0 Core 1Core 1
Core 2Core 2 Core 3Core 3
Core 0Core 0 Core 1Core 1
Core 2Core 2 Core 3Core 3
Core 0Core 0 Core 1Core 1
Core 2Core 2 Core 3Core 3
FARM: Flexible Architecture Research Machine
MemoryMemoryMemoryMemory
MemoryMemory MemoryMemory
Core 0Core 0 Core 1Core 1
Core 2Core 2 Core 3Core 3
Core 0Core 0 Core 1Core 1
Core 2Core 2 Core 3Core 3
Core 0Core 0 Core 1Core 1
Core 2Core 2 Core 3Core 3
FPGAFPGA
SRAMSRAM
IOIO
FARM: Flexible Architecture Research Machine
MemoryMemoryMemoryMemory
MemoryMemory MemoryMemory
Core 0Core 0 Core 1Core 1
Core 2Core 2 Core 3Core 3
Core 0Core 0 Core 1Core 1
Core 2Core 2 Core 3Core 3
FPGAFPGA
SRAMSRAM
IOIO
GPU/Stream
FARM: Flexible Architecture Research Machine
MemoryMemoryMemoryMemory
MemoryMemory MemoryMemory
Core 0Core 0 Core 1Core 1
Core 2Core 2 Core 3Core 3
Core 0Core 0 Core 1Core 1
Core 2Core 2 Core 3Core 3
Core 0Core 0 Core 1Core 1
Core 2Core 2 Core 3Core 3
FPGAFPGA
SRAMSRAM
(scalable)
InfinibandOr PCIe InterconnectI
OIO
FARM: Flexible Architecture Research Machine
Example FARM Uses Software research
SW development for heterogeneous systems Code generation & resource management
Scheduling system for large-scale parallelism Thread state management & adaptive control
Hardware research Scalable streaming & transactional HW
FPGA extends protocols throughout cluster Scalable shared memory
FPGA provides coarse-grain tracking Hybrid memory systems Custom processors & accelerators HW support for monitoring, scheduling, isolation,
virtualization, …
Conclusions PPL: a full system vision for pervasive
parallelism Applications, programming models, software systems, and
hardware architecture
Key initial ideas Domain-specific languages Combine implicit & explicit management Flexible HW features