Architecting Scalable and Responsive Applications
John Feo, Architect, Microsoft Corporation
SYMP02
The Free Lunch Is Over
Bad sequential code will run faster on a faster processor
Bad parallel code WILL NOT run faster on more cores
[Figure: Speedup (y-axis, 0 to 3) plotted against 1, 2, 4, 8, 16, and 32 on the x-axis]
Just using parallel code is not enough
How to think about different levels of parallelism
How to architect parallel algorithms
Optimization techniques
Agenda
No – I can compute my problem, today and tomorrow, in a reasonable time
Yes – I need to compute N instances of my problem (throughput): animated film, portfolio simulations
Yes – I need to compute one instance of my problem in time T (capability): director frames, intelligent avatars, speech recognition
Do I Need Parallelism?
Master sends out inputs and gathers results
Increase # of processors to decrease time to solution
Sequential worker code is okay, but only if
    problem instances are balanced; otherwise,
    problem instances fit on a single computing element; otherwise,
Throughput Computing
[Figure: four computing elements, each with its own Memory and CPU]
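The master/worker pattern on this slide can be sketched as follows. This is a minimal illustration, not code from the talk: `Simulate` is a hypothetical stand-in for the sequential worker code, and the class and method names are assumptions. The master hands out one input per problem instance and gathers one result per input; the worker itself stays sequential.

```csharp
using System;
using System.Threading.Tasks;

public static class ThroughputSketch
{
    // Hypothetical sequential worker: solves one independent problem instance.
    public static double Simulate(int seed)
    {
        var rng = new Random(seed);   // private generator, nothing shared between instances
        double sum = 0.0;
        for (int step = 0; step < 1000; step++) sum += rng.NextDouble();
        return sum / 1000;
    }

    // Master: sends out the inputs and gathers one result per input.
    public static double[] RunAll(int[] inputs)
    {
        var results = new double[inputs.Length];
        Parallel.For(0, inputs.Length, n =>
        {
            // Each slot is written by exactly one iteration, so no locking is needed.
            results[n] = Simulate(inputs[n]);
        });
        return results;
    }
}
```

Calling `ThroughputSketch.RunAll(inputs)` returns the results in input order. As the slide warns, this only scales if the instances take similar time and each one fits on a single computing element.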
Render a short sequence of movie frames quickly and accurately
    Don’t keep the expensive director waiting
    Get enough accuracy to enable the right decision
Create more challenging game adversaries
    Faster
    More skills
    Better accuracy
    More intelligence
Capability Computing
Problem always as big as the machine
Write parallel code that effectively uses all resources
    Decompose by task or data
    Communicate shared values
    Synchronize access to shared values
Challenges Of Capability Computing
Optimize for critical resource
You want enough parallelism to keep the machine busy
Enough, But Not Too Much…
Too much parallelism may increase communication/synchronization costs
Watch out for resources consumed by waiting tasks
Over-decompose to improve load balance
Rely on MS runtime to schedule and manage threads
Decompose to scale with problem size and processor count
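One way to read "over-decompose" and "scale with problem size and processor count" is to cut the iteration space into several chunks per core and let the runtime schedule them. The sketch below assumes that reading. `ChunksPerCore` is an illustrative tuning knob, not a value from the talk, and the class and method names are hypothetical.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

public static class DecomposeSketch
{
    // Assumed tuning knob (not from the talk): how many chunks to create per core.
    public const int ChunksPerCore = 8;

    // Chunk size grows with the problem size and shrinks with the core count.
    public static int ChunkSize(int n, int cores)
    {
        return Math.Max(1, n / (cores * ChunksPerCore));
    }

    // Applies work(i) to every i in [0, n), handing whole chunks to the runtime.
    public static void ForChunked(int n, Action<int> work)
    {
        if (n <= 0) return;
        int chunk = ChunkSize(n, Environment.ProcessorCount);
        Parallel.ForEach(Partitioner.Create(0, n, chunk), range =>
        {
            // range is [Item1, Item2): a contiguous block processed sequentially.
            for (int i = range.Item1; i < range.Item2; i++) work(i);
        });
    }
}
```

More chunks than cores gives the scheduler room to even out unequal chunks. Far more chunks than that mostly adds scheduling and synchronization cost, which is the "not too much" half of the slide.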
Decompose program by operations
Task Parallelism
Audio
Video
UI
Network
Avatars
Weapons
Vehicles
High level, big chunks
Communication is coarse grain
Synchronization is minimal
Number of tasks is small
May not scale
May not load balance
Take advantage of data parallelism within tasks
The Good And Bad Of Task Parallelism
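A small sketch of task parallelism with `Parallel.Invoke`, assuming three hypothetical subsystems (`MixAudio`, `RenderVideo`, `PollNetwork`) that stand in for the big chunks named on the slides. None of these names come from the talk.

```csharp
using System.Threading.Tasks;

public static class TaskParallelSketch
{
    // Hypothetical subsystems; each represents one large, mostly independent task.
    static string MixAudio()    { return "audio ready"; }
    static string RenderVideo() { return "video ready"; }
    static string PollNetwork() { return "network ready"; }

    public static string[] RunFrame()
    {
        var status = new string[3];
        // One task per operation; each task writes its own slot, so no locks are needed.
        Parallel.Invoke(
            () => status[0] = MixAudio(),
            () => status[1] = RenderVideo(),
            () => status[2] = PollNetwork());
        return status;   // Parallel.Invoke returns only after all three have finished
    }
}
```

With three tasks, at most three cores are ever busy, however many the machine has. That is the "may not scale" point, and the reason to look for data parallelism inside each task.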
Think of application as a set of elements
Lots of parallelism (easy to load balance)
Scales with problem size and # processors
Just for loops, so similar to sequential code
Data Parallelism
for all pixels
for all triangles
Most data parallel applications can be parallelized at different levels
    Cubes, planes, columns, cells
    Graphs, nodes, edges
    Volumes, objects, rays, pixels
Levels Of Data Parallelism
Sequential Matrix Multiply
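// B is indexed as B[j][k], so each C[i][j] is a dot product of row i of A and row j of B
for (int i = 0; i < N; i++)
    for (int j = 0; j < M; j++)
        for (int k = 0; k < N; k++)
            C[i][j] += A[i][k] * B[j][k];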
Parallel Matrix Multiply
Parallel.For(0, N, i =>
{
    // Row i of C is written by this iteration only, so no synchronization is needed
    for (int j = 0; j < M; j++)
        for (int k = 0; k < N; k++)
            C[i][j] += A[i][k] * B[j][k];
});
More Parallel Matrix Multiply
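Parallel.For(0, N * M, ij =>
{
    // C has N rows and M columns, so the flat index is split by M
    // (the slide divides by N, which is only correct when N == M)
    int i = ij / M;
    int j = ij % M;
    for (int k = 0; k < N; k++)
        C[i][j] += A[i][k] * B[j][k];
});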
Reduce Loads And Stores
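Parallel.For(0, N * M, ij =>
{
    int i = ij / M;   // split by M, the number of columns of C
    int j = ij % M;
    double sum = 0.0; // accumulate locally instead of loading and storing C[i][j] every step
    for (int k = 0; k < N; k++)
        sum += A[i][k] * B[j][k];
    C[i][j] = sum;    // one store per element
});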
Permute An Array Of Integers
for (int k = 0; k < NumberOfSwaps; k++)
{
    // Pick two random positions and swap their contents
    int i = (int) (N * Rnd.NextDouble());
    int j = (int) (N * Rnd.NextDouble());
    int temp = X[i];
    X[i] = X[j];
    X[j] = temp;
}
Parallel Permutation
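Parallel.For(0, NumberOfSwaps, k =>
{
    // Rnd must be safe to call from several threads (a shared System.Random is not)
    int i = (int) (N * Rnd.NextDouble());
    int j = (int) (N * Rnd.NextDouble());
    if (i == j) return;                       // nothing to swap ("continue" is not valid in a lambda)
    if (i > j) { int t = i; i = j; j = t; }   // always lock the lower index first to avoid deadlock
    // lock elements i and j, and swap
    // Locks[] is an assumption: one lock object per element, since C# cannot lock an int or an address
    lock (Locks[i])
    {
        lock (Locks[j])
        {
            int temp = X[i];
            X[i] = X[j];
            X[j] = temp;
        }
    }
});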
Maybe We Can Make It Easy…
Parallel.For(0, NumberOfSwaps, k =>
{
    int i = (int) (N * Rnd.NextDouble());
    int j = (int) (N * Rnd.NextDouble());
    // "transaction" is not C# syntax; the slide imagines a block that runs atomically
    transaction
    {
        int temp = X[i];
        X[i] = X[j];
        X[j] = temp;
    }
});
Use the best parallel algorithm
Use hardware accelerators
Use special instructions (SSE)
Push parallelism as far out as possible
Cut-off recursion
Accumulate locally
Pre-allocate memory
Collapse and fuse loops
Use parallel data structures
Remove I/O from inside parallel regions
Optimize, Optimize, Optimize
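"Accumulate locally" can be illustrated with the overload of `Parallel.For` that carries a per-task local value. The sketch below is not from the talk, and the class and method names are assumptions. Each task sums into its own local accumulator and touches the shared total only once, when it finishes.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class LocalAccumulateSketch
{
    public static long SumOfSquares(int[] data)
    {
        long total = 0;
        Parallel.For(0, data.Length,
            () => 0L,                                              // localInit: private accumulator per task
            (i, state, local) => local + (long)data[i] * data[i],  // body: no shared writes
            local => Interlocked.Add(ref total, local));           // localFinally: one shared update per task
        return total;
    }
}
```

The alternative, an `Interlocked.Add` or a lock on every iteration, gives the same result but makes every element pay for synchronization.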
Given a graph G(V, E) and nodes s and t, does there exist a path from s to t?
s is the source, t is the sink
Two common solutions:
    Horizon: iterative outer loop, data parallel inner loop, user managed “list of nodes”
    Recursion: data parallel loop with recursive call, rely on runtime system to manage program
Breadth-first Search
Horizon Method
[Figure: example graph with source s, sink t, and nodes 1 to 9 and A to E; horizons shown as s, then 1 3 5, then 2 4 6 8, then 7 9 t]
Our Code For Horizon Method
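private BlockingCollection<node> _horizon =
    new BlockingCollection<node>();
_horizon.Add(source);
// Consume nodes from the horizon in parallel; new neighbors are appended as they are found
Parallel.ForEach(_horizon.GetConsumingEnumerable(), n =>
{
    foreach (node nn in n.Neighbors())
    {
        if (nn.NotVisited())
        {
            if (nn == sink)
            {
                _horizon.CompleteAdding();   // sink found: stop accepting new nodes
                break;
            }
            else
            {
                try { _horizon.Add(nn); }
                catch (InvalidOperationException) { break; }   // adding already completed by another task
            }
        }
    }
});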
Not Visited
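public bool NotVisited()
{
    bool flag = false;
    if (visited == 0)              // cheap unlocked check first
    {
        lock (_visitLock)          // _visitLock is an assumption: a private object field on the node
        {
            if (visited == 0) { flag = true; visited = 1; }   // re-check under the lock
        }
    }
    return flag;
}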
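The same "exactly one caller wins" behavior can also be had without a lock. The sketch below swaps the slide's check-lock-check sequence for a single atomic compare-and-swap, `Interlocked.CompareExchange`. `VisitFlag` and `CountWinners` are hypothetical names for illustration, not part of the talk's code.

```csharp
using System.Threading;
using System.Threading.Tasks;

public class VisitFlag
{
    private int visited;   // 0 = not visited, 1 = visited

    // Returns true for exactly one caller: the one that flips visited from 0 to 1.
    public bool NotVisited()
    {
        return Interlocked.CompareExchange(ref visited, 1, 0) == 0;
    }

    // Demonstration helper: many concurrent callers race on one flag.
    public static int CountWinners(int callers)
    {
        var flag = new VisitFlag();
        int winners = 0;
        Parallel.For(0, callers, c =>
        {
            if (flag.NotVisited()) Interlocked.Increment(ref winners);
        });
        return winners;
    }
}
```

`VisitFlag.CountWinners(1000)` reports a single winner regardless of how the callers interleave, which is the property both search methods rely on.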
Recursive Method
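[Figure: example graph with source s, sink t, and nodes 1 to 9 and A to E, explored recursively from s; nodes shown in rows 1 3 5, then 2 4 6 8, then 9 7 t]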
Our Code For Recursive Method
public static bool BFS(node source)
{
    // The token source doubles as a shared "found" flag
    CancellationTokenSource found = new CancellationTokenSource();
    _BFS(source, found);
    return found.IsCancellationRequested;
}
Our Code For Recursive Method
private static void _BFS(node n, CancellationTokenSource found)
{
    if (found.IsCancellationRequested) return;   // sink already found elsewhere
    Parallel.ForEach(n.Neighbors(), nn =>
    {
        if (nn.NotVisited())
        {
            if (nn == sink) found.Cancel();      // signal every other branch to stop
            else _BFS(nn, found);
        }
    });
}
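For readers who want to run the recursive method, here is a self-contained sketch under stated assumptions. The `Node` class, the `BuildGraph` helper, and passing the sink as a parameter are illustrative choices, not the talk's code. The visited flag uses an atomic compare-and-swap rather than the lock shown on the slide.

```csharp
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public sealed class Node
{
    public readonly List<Node> Neighbors = new List<Node>();
    private int visited;   // 0 = not visited, 1 = visited

    // True for exactly one caller per node.
    public bool NotVisited() { return Interlocked.CompareExchange(ref visited, 1, 0) == 0; }
}

public static class ReachabilitySketch
{
    public static bool PathExists(Node source, Node sink)
    {
        if (source == sink) return true;
        using (var found = new CancellationTokenSource())
        {
            source.NotVisited();             // mark the source so it is not expanded twice
            Search(source, sink, found);
            return found.IsCancellationRequested;
        }
    }

    private static void Search(Node n, Node sink, CancellationTokenSource found)
    {
        if (found.IsCancellationRequested) return;
        Parallel.ForEach(n.Neighbors, nn =>
        {
            if (!nn.NotVisited()) return;    // another branch already owns this node
            if (nn == sink) found.Cancel();
            else Search(nn, sink, found);
        });
    }

    // Hypothetical helper: builds an undirected graph from index pairs.
    public static Node[] BuildGraph(int count, int[,] edges)
    {
        var nodes = new Node[count];
        for (int i = 0; i < count; i++) nodes[i] = new Node();
        for (int e = 0; e < edges.GetLength(0); e++)
        {
            nodes[edges[e, 0]].Neighbors.Add(nodes[edges[e, 1]]);
            nodes[edges[e, 1]].Neighbors.Add(nodes[edges[e, 0]]);
        }
        return nodes;
    }
}
```

The visited flags are never reset, so each graph built this way supports one search. The recursion depth follows the longest explored path, which is fine for small graphs but is one reason the optimization slide lists "cut-off recursion".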
Find the best parallel algorithm
Find the right level of parallelism
Think data parallel first
Use parallel data structures
Minimize shared data and synchronization
Optimize, optimize, optimize
Summary
Bad sequential code will run faster on a faster processor
Bad parallel code WILL NOT run faster on more cores
Learn more about Parallel Computing at: MSDN.com/concurrency
And download Parallel Extensions to the .NET Framework!
Connected Visual Computing
Jerry Bautista, PhD, Director, Microprocessor Technology Management, Intel Corporation
SOCIAL NETWORKING
USER-GENERATED CONTENT
VISUAL COMPUTING
Internet Trends Converging
BROADBAND CONNECTIVITY
MOBILE COMPUTING
Physics-based Animation
Expressive Faces
Video Search
Computer Vision
Ray-traced Graphics
Look real, act real, feel real
Visual Computing – 3D and more
Applications that use every available FLOP
Next: Connected Visual Computing
Bringing VC to connected usage models
3D Digital Entertainment
Virtual Worlds
Creating new digital worlds
Multiplayer Games
Internet Data
People Everywhere
The Actual World
CONNECTED
Rich Visual Interfaces
LIMITED, RICH
Better content quality, social interaction – a better user experience
Static Web, Web 2.0, CVC
Real-world data visualization
Enhancing the actual world
Earth Mapping
Augmented Reality
Social networking, collaboration, online gaming, online retail, and more.
Simulated environments
All company and/or product names may be trade names, trademarks and/or registered trademarks of the respective owners with which they are associated.
Virtual Worlds
Multiplayer Online Games
3D Cinema
• Realistic, representative visuals both professional and user-generated
• Socialization, education, entertainment, collaboration
Data Visualization
West Nile Virus Visualization
Visualizing Real World Information: Dust storm in Morocco
Virtual Colonoscopy
Sharing data and representing data in richer, more intuitive ways.
OpenSim N-body Simulation
Collaboration Environments
Virtual team rooms: Enterprise-class environments to allow virtual teams to have realistic, natural interactions
Virtual information environments: Information space for documents, app-sharing, and visualizations
Augmented Reality: Combines real world info with data overlays
Virtual Instruction
Mobile Augmented Reality (MAR) particularly compelling
Text Overlays
2D/3D Visual Overlays
Visual Search
Map Hybrids
Timeline: Today, 2010, 2012, 2014
Location Information
Identification & Hyperlinking
Translation
Meeting The Challenges Of CVC
Platform Optimization
• Server, client demands
• Network performance
• Energy-efficiency
Distributed Computing
• Scaling
• Client diversity
• Programmability
Visual Content
• Interoperability
• User creation
Mobile Experience
• Better connectivity, BW
• Sensor integration
Rich Interaction Versus Complexity
Interactions: Growing # of users
Scene complexity: Growing # of objects
Realism: Better object behavior
[Figure axis labels: richness, complexity, interaction complexity]
SERVERS: 10x More Work
75%+ Time = Compute Intensive Work
TYPE       SOFTWARE       MAX CLIENTS PER SERVER
MMORPGs    WoW            2500
VWs        Second Life    160
CLIENTS: 3x CPU, 20x GPU
65%+ Time = Compute Intensive Work
APPLICATION     % CPU UTILIZATION    % GPU UTILIZATION
2D Websites     20                   0-1
Google Earth    50                   10-15
Second Life     70                   35-75
NETWORK: 100x Bandwidth
Maximum Bandwidth Limited by Server to Client
[Figure: Bandwidth (in KB/s, 0 to 100) versus Time (in seconds, 25 to 150), Cached and Uncached]
Sources: WoW data (source www.warcraftrealms.com), Second Life data (source Intel Linden Labs CTO-CTO meeting and www.secondlife.com), and Intel measurements.
Platform Performance Demands
Scaling Performance: Parallelism
[Figure: Parallel Speedup (0 to 64) versus Cores (0 to 64), one curve per workload listed below]
Production Fluid
Production Face
Production Cloth
Game Fluid
Game Rigid Body
Game Cloth
Marching Cubes
Sports Video Analysis
Video Cast Indexing
Home Video Editing
Text Indexing
Ray Tracing
Foreground Estimation
Human Body Tracker
Portfolio Management
Geometric Mean
Graphics Rendering, Physical Simulation, Vision, Data Mining, Analytics
Applications Scale Well
Intel® Thread Checker
Intel Thread Checker is an analysis tool that pinpoints hard-to-find threading errors like data races and deadlocks in 32-bit and 64-bit applications.
Intel® Threading Building Blocks
Intel threading building blocks (Intel TBB) is a C++ runtime library that abstracts the low-level threading details necessary for optimal multicore performance. implementation work.
Current Parallel Programming Products
Smoke: A game framework to maximize core utilization
Framework built for Nehalem and future processors targeting N-threads
Uses real game technologies (Havok, FMOD, Ogre3D, DX9, etc.)
Well partitioned and configurable
*Other names and brands may be claimed as the property of others.
The Framework: How is Smoke highly threaded?
[Figure: Smoke block diagram. Labels: Engine, Managers, Framework, Scheduler, Parser, Environment, Service, Platform, Task, Scene CC, Object CC, UScene, UObject, Systems, Definition Files, Interfaces, System]
1. Scheduler manages system jobs
2. Change Control (CC) Manager minimizes thread synchronization
3. Data structured to support independent processing
4. System modularity (through interfaces)
5. Systems are specific to the demo (e.g. AI, physics, etc)
Ct Research: A throughput programming model
TVEC<F32> a(src1), b(src2);
TVEC<F32> c = a + b;
c.copyOut(dest);
[Figure: the two input vectors are split into four chunks, added by Thread 1 to Thread 4; Core 1 to Core 4 each have a SIMD Unit]
Ct JIT Compiler: Auto-vectorization, SSE, AVX, Larrabee
Programmer Thinks Serially; Ct Exploits Parallelism
Ct Parallel Runtime: Auto-Scale to Increasing Cores
User Writes Core Independent C++ Code
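To make the picture concrete, here is roughly what such a runtime has to do for `c = a + b`: split the vectors into chunks, give each chunk to a thread, and use SIMD instructions within each chunk. This is a hand-written C# analogue, not Ct and not Ct's API. It assumes `System.Numerics.Vector<float>` for the SIMD part, and the class name is hypothetical.

```csharp
using System;
using System.Collections.Concurrent;
using System.Numerics;
using System.Threading.Tasks;

public static class VectorAddSketch
{
    public static float[] Add(float[] a, float[] b)
    {
        if (a.Length != b.Length) throw new ArgumentException("length mismatch");
        var c = new float[a.Length];
        if (a.Length == 0) return c;

        int width = Vector<float>.Count;   // SIMD lanes per operation on this machine
        // One contiguous range per task, chosen by the runtime.
        Parallel.ForEach(Partitioner.Create(0, a.Length), range =>
        {
            int i = range.Item1;
            // SIMD part: add 'width' elements at a time.
            for (; i <= range.Item2 - width; i += width)
                (new Vector<float>(a, i) + new Vector<float>(b, i)).CopyTo(c, i);
            // Scalar tail for whatever does not fill a full SIMD register.
            for (; i < range.Item2; i++) c[i] = a[i] + b[i];
        });
        return c;
    }
}
```

The point of the slide is that the Ct programmer writes only the one-line `a + b`. The chunking, threading, and vector width seen here are what the JIT compiler and runtime are meant to handle.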
Global User Services, Agents, Data
Regional Simulation, Data assets, and Services
CVC Processing Loop: USERS ACT, WORLD REACTS, DISPLAYS REFRESH
DATA PIPES: Potential Bottlenecks
“Light” Clients
Rendering & Reasoning Services Cloud
Sensors
Other CVC Environments
Visual Computing Clients
Distributed Computing CVC Environment
Ecosystem Building Blocks
H/W Platforms (Server/Client): Device Manufacturers, Sales, OEMs, GPU Vendors, CPU Vendors
S/W Platforms (Engines): Content Tools, Development Tools, S/W Infrastructure, Game Engines, O/S
Service Providers: World Operators, Infrastructure, Marketing - Ad, Promotion…, Digital Asset Marketplace, Enterprise Integration
A broad effort is required to fully enable CVC
Open Standards Accelerated The Internet
[Figure: Walled Gardens (Proprietary, Proprietary, Proprietary) versus Open Standards (Browser, HTML, HTTP, Server), 1993-1995]
CVC Future Architecture: Common Building Blocks
Presentation: Rendering Services, A/V Effects Services, User Feedback
Behavior: User Input/Control, Scripted Behavior, Game Physics
Support Services: Asset & Inventory, Transactions, Identity
WORLD SIMULATOR
VIEW (CLIENT): Communication, Sensors/Context, User Input & Control, Rendering, Audio/Visual Effects
WORLD (SERVER): Simulation, Synchronization, Transactions, Communication, Identities & Assets
Support Services: Asset & Inventory, Transactions, Identity
Communication
Behavioral functions: User Input/Control, Scripted Behavior, Physics
Sensors/Context
Presentation: Rendering Services, A/V Effects Services, User Feedback
Tomorrow: More horizontal, open, building blocks
Today: Vertical, proprietary CVC apps
CVC Future Architecture: From Monolithic to Building Blocks
Example: OpenSim
Platform for “Creating and Deploying 3D Environments”
Diverse Dev Community: Virtual World Service Providers, IBM™, Microsoft™, Intel™
Highly modular architecture: Protocols, Physics, Script Engines
Visual Content
Easy User-Generation: Professional, End-user
Interoperability: Own, share “my” content
Scalable Delivery: Pre-distribution; Just in time distribution, caching
Example: Simplifying Content Creation (Parameterized Content Research)
3D Face Database
Create a Face Model
Simple Control Parameters
    FULLNESS: Full to Narrow
    FLATNESS: Flat to Round
    SHAPE: Square to Triangle
    CHIN: Sharp to Round
Customized Faces
Expression Modeling
Intel’s CVC Research Agenda
CHALLENGE: RESEARCH
Platform Optimization:
• Workload Characterization
• Understanding platform demands
• Optimizations for future platforms
Distributed Computation:
• Scalable system, app architectures
• Dynamic repartitioning of workloads
• Execution on diverse clients
Mobile Experience:
• Data-enhanced real world interaction
• Mirror-world creation and navigation
Visual Content:
• Parameterized Content
• Easy User-generated 3D Content
• Standards enabling content reuse
Summary
CVC apps offer a compelling user experience
…however…some real challenges at the platform HW and SW model
    Scaling through parallelism (many threads, many core)
    Several platform challenges – power, memory bandwidth, heterogeneous HW integration, scalable compute resources, etc.
    Distributed computing – from cloud to handheld and everything in between
    Programming models must make content creation, integration, and context aware delivery/interaction “seamless and easy”
    Simultaneously cannot ignore legacy usages
Promising research results - early implementations address many of these challenges…tremendous opportunity for HW/SW architecture innovation for substantive end-user benefit
Evals & Recordings
Please fill out your evaluation for this session at:
This session will be available as a recording at:
www.microsoftpdc.com
Please use the microphones provided
Q&A
© 2008 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.
Observations:
• Significant client/server compute every cycle
• Many aspects best computed on client
• Extensive use of MIPS, FLOPS, threads
• Partitioning depends on client capability, connectivity
Client (moving to TS) Tera-scale Server or Compute Cloud
User Inputs
Rendering
Audio/Visual Effects Animation Spatial Audio Smoke, Crowds, Fluids
Send Requested Update
World Simulation Collision Physics NPC Script Execution Simulation
Get Input
Display Updates
User Takes Action on the “World”
Collects Changes From All Users
Resolves All Object Behaviors and Interactions
Generates A New Per-User Model
More Server Compute
More Client Compute
Always Connected
Processing Will Span Client/Server
• Combining location data, a camera, online satellite maps and social networking
• Provides an enhanced view of the real world
Mirroring the Real World
OpenSim Architecture: At the center of interoperability innovation
CORE INFRASTRUCTURE: Identity, Inventory, Assets, Presence, World Map, Voice
DECENTRALIZED SIMULATORS
[Figure: WORLD MAP made up of many simulators, each marked S]
Simulator: Collision Detection, Script Engine, Object Model, Game Engine
Virtual Worlds: “Connected” Visual Computing
Users Collaborate & Play
Scenario Play
Virtual Teamroom
Users Create
World of Warcraft Avatar
Eiffel Tower in Google Earth
Users Explore and Learn
Qwaq Treefort Virtual Room
Machinima Interactive Movies
Users Enhance the Actual World
West Nile Virus Visualization
Visualizing Real World Information: Dust storm in Morocco
CVC apps will transform the Internet from 2D to 3D…but require LOTS of compute horsepower