AMD Accelerated Computing - UFRJ
Presentation on accelerated computing and APUs - UFRJ - May 2011
Agenda
X86 PROCESSOR EVOLUTION
THE GPU AS AN ACCELERATOR
ACCELERATED PROCESSING UNITS
INTRODUCTION TO OpenCL
Evolving x86 Processors
[Diagram: AMD “Istanbul” six-core architecture. Six cores (1 to 6), each with its own L2 cache, share an L3 cache; a crossbar connects them to the memory controller and to the HyperTransport links (PCI-e, chipset). Callouts: native six-core processor; balanced caches; lower memory latency; fast full-duplex bus]
4P/24-core system example: very good scalability
One memory controller for every processor
Full-duplex HyperTransport links (up to 5.2 GT/s)
Bus Optimization: HT Assist (Cache Probe Filtering)
Still the only available 4P system with Direct Connect Architecture
[Diagram: 4P system, each processor with its own memory]
Direct Connect Architecture 1.0 Balanced and Scalable Design to Support up to 6 Cores
[Diagram: four CPUs, each with 2 memory channels and 8 DIMMs per CPU]
No front side bus
Integrated memory controller
HyperTransport™ technology
NUMA memory architecture
Direct Connect Architecture 2.0 Balanced and Scalable Design to Support up to 16 Cores* per CPU
• 1-hop between processors
• Up to 50% more DIMMs
• Four memory channels
• Up to 33% increase in CPU to CPU communication speed
[Diagram: four CPUs, each with 4 memory channels and 12 DIMMs per CPU]
What is next for x86 CPUs
• More processor cores to come (12, 16, 16 double cores)
• More memory channels (improves memory bandwidth per core)
• Improved IPC (8 instructions per cycle is a target)
Top500 list - beyond the petaflop
Datacenters in the USA will spend more than $3 billion on energy in 2009
1997: Garry Kasparov vs. IBM Deep Blue
The World’s Most Powerful GPU
=
2011 GPU Architecture AMD Radeon™ HD 6900 Series
Dual graphics engines
New VLIW4 core architecture
Up to 24 SIMD engines
Up to 96 Texture Units
Upgraded render back-ends
Improved anti-aliasing performance
Fast 256-bit GDDR5 memory interface, up to 5.5 Gbps
New GPU compute features
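The memory interface figures on this slide translate into a peak bandwidth number. The sketch below assumes that "5.5 Gbps" is the per-pin data rate of the GDDR5 interface, which is the usual convention; the function name `peak_bandwidth_gbytes` is hypothetical and chosen only for illustration.

```c
#include <assert.h>

/* Theoretical peak bandwidth, in GB/s, of a memory interface.
 * bus_width_bits: interface width (256 on the slide)
 * gbps_per_pin:   per-pin data rate in Gbit/s (5.5 on the slide; assumed
 *                 here to be a per-pin figure) */
double peak_bandwidth_gbytes(int bus_width_bits, double gbps_per_pin)
{
    assert(bus_width_bits > 0 && gbps_per_pin > 0.0);
    /* Total Gbit/s across all pins, divided by 8 bits per byte. */
    return bus_width_bits * gbps_per_pin / 8.0;
}
```

Under that assumption, `peak_bandwidth_gbytes(256, 5.5)` gives 176 GB/s. This is a theoretical ceiling, not a measured throughput.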
Designing very efficient GPUs Full load: 180W; Idle:27W
[Chart: GFLOPS/W and GFLOPS/mm2 by GPU generation: ATI Radeon™ X1800 XT (Nov-05), X1900 XTX (Jan-06), HD 2900 PRO (Sep-07), HD 3870 (Nov-07), HD 4870 (Jun-08), HD 5870 (Oct-09). Data labels for the earlier GPUs: 7.50, 4.56, 4.50, 2.24, 2.21, 0.92, 2.01, 1.06, 1.07, 0.42. Highlighted data point: 14.47 GFLOPS/W, 7.90 GFLOPS/mm2]
Old and New in High Performance Computing
Old: Power is free, Transistors are expensive
New: Power expensive, Transistors free
(Can put more transistors on chip than can afford to turn on)
Old: Multiplies are slow, Memory access is fast
New: Multiplies fast, Memory slow
(up to 200 clocks to DRAM memory, 4 clocks for FP multiply)
Old: Increasing Instruction Level Parallelism via compiler innovation
New: Explicit thread and data parallelism must be exploited
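The "Multiplies fast, Memory slow" point can be made concrete with the slide's own figures. This is a minimal sketch using only the two clock counts quoted above; the function name is hypothetical.

```c
#include <assert.h>

/* How many back-to-back FP multiplies fit in the time of one DRAM access,
 * given both latencies in clock cycles. */
int multiplies_per_dram_access(int dram_latency_clocks, int fp_mul_clocks)
{
    assert(dram_latency_clocks > 0 && fp_mul_clocks > 0);
    return dram_latency_clocks / fp_mul_clocks;
}
```

With 200 clocks to DRAM and 4 clocks per multiply, one memory access costs as much as 50 multiplies, which is why keeping many threads in flight to hide memory latency matters more than saving arithmetic.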
GPUs: more than just gaming
[Chart: Processing power – millions of operations per second. Single Core: 12; Dual Core: 24; Quad Core: 48; Hexa Core: 72; 12 Cores: 144; Radeon HD 5970: 2700]
Wii Sports - Golf Oil exploration platform - 2010
Both use GPUs
DirectX® 11 Multi-Threading
Application, DirectX runtime, and DirectX driver can each run in separate threads
Tasks like loading a texture or compiling a shader can execute in parallel with main rendering thread
[Diagram: DirectX® 10 vs. DirectX® 11]
Today’s GPUs focused on
GAMING
ENTERTAINMENT
PRODUCTIVITY
DirectX® 11 Tessellation
Images courtesy of Unigine Corp.
[Images: No Tessellation (DirectX® 10) vs. Tessellation (DirectX® 11)]
5/26/2011
Research companies already using
Oil exploration, Weather forecast, Fluid Dynamics, Nature simulation
AMD Balanced Platform
Delivers optimal performance for a wide range of platform configurations
Other Highly Parallel Workloads
Graphics Workloads
Serial/Task-Parallel Workloads
CPU is excellent for running some algorithms
Ideal place to process if GPU is fully loaded
Great use for additional CPU cores
GPU is ideal for data parallel algorithms like image processing, CAE, etc
Great use for ATI Stream technology
Great use for additional GPUs
ATI Stream Technology is…
Heterogeneous: Developers leverage AMD GPUs and x86 CPUs for optimal application performance and user experience
High performance: Massively parallel, programmable GPU architecture delivers unprecedented performance and power efficiency
Industry Standards: OpenCL™ and DirectCompute 11 enable cross-platform development
Digital Content Creation, Engineering, Sciences, Government, Gaming, Productivity
Improvements already reached consumers
[Chart: Processor utilization, axis from 0% to 80%; series label: ATI Stream]
Adobe Flash plugin used by YouTube.com
Better image quality and video smoothness
Lower processor usage
GPU-accelerated video transcoding
Up to 6x faster when using an AMD graphics card
HD Video to iPod Video
Video Transcoding Sample: No GPU Acceleration (using four CPU cores)
CPU Usage: 100%; GPU Usage: 1%
Time to finish: 1h 52m; Peak power: 145W
Total energy: 0.23 kWh; Energy price: $0.15
Video Transcoding Sample: ATI GPU Acceleration (using hundreds of Stream Processors)
CPU Usage: 45% (vs. 100%); GPU Usage: 35% (vs. 1%)
Time to finish: 26m (vs. 1h 52m); Peak power: 198W (vs. 145W)
Total energy: 0.11 kWh (vs. 0.23 kWh); Energy price: $0.07 (vs. $0.15)
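The two transcoding runs above can be compared directly. The sketch below uses only the slide's figures (1h 52m and 0.23 without acceleration, 26m and 0.11 with it) and assumes the slide's energy totals are in kilowatt-hours; the function names are hypothetical.

```c
#include <assert.h>

/* Speed-up of an accelerated run over a baseline run (times in minutes). */
double speedup(double baseline_min, double accelerated_min)
{
    assert(baseline_min > 0.0 && accelerated_min > 0.0);
    return baseline_min / accelerated_min;
}

/* Fraction of energy saved relative to the baseline (both in kWh). */
double energy_saving(double baseline_kwh, double accelerated_kwh)
{
    assert(baseline_kwh > 0.0 && accelerated_kwh >= 0.0);
    return (baseline_kwh - accelerated_kwh) / baseline_kwh;
}
```

For this sample, `speedup(112.0, 26.0)` is about 4.3 and `energy_saving(0.23, 0.11)` is about 0.52. The accelerated run draws more peak power (198W against 145W) but finishes so much sooner that total energy roughly halves.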
FUSION TECHNOLOGY
Today
TeraFLOPS-class GPU
Up to 2 billion transistors
Gaming on multiple monitors
Full HD video and audio
Multi-core CPU
~800 million transistors
Multi-tasking
A new era in performance evolution
[Three charts, each marked "We are here"]
Single-Core (single-thread performance vs. time)
Challenge: power consumption, complexity
Multi-Core (performance vs. time x cores)
Challenge: power consumption, software
Heterogeneous computing (performance vs. time)
Pros: performance, power efficient
Cons: software availability
A new era in performance evolution
[Diagram labels: Single-Core, Multi-Core, Software Acceleration; CPU, GPU; Gaming, Multimedia; Core efficiency]
Putting it all together – The Future is Fusion
[Diagram: RV500 GPU core (2006), with a memory controller, four ring stops, and client interfaces, shown next to the AMD “Istanbul” six-core processor (L3 cache, six cores each with L2, crossbar, memory controller, HyperTransport, PCI-e, chipset)]
Putting it all together – The Future is Fusion
[Diagram: RV700 GPU core (2008-2009) shown next to the AMD “Istanbul” six-core processor (L3 cache, six cores each with L2, crossbar, memory controller, HyperTransport, PCI-e, chipset)]
Putting it all together – The Future is Fusion
[Diagram: RV700 GPU core and AMD “Istanbul” six-core processor, each labeled with a crossbar]
2011: welcome to the APU era!
GPU CPU
“Supercomputing power in a notebook platform whose battery lasts for a full day”
APU
One Design, Fewer Watts, Massive Capability
Dual-Core CPU + Northbridge + Discrete-level DirectX® 11 GPU = “Zacate” AMD Fusion APU
Three separate chips: 66 sq. mm / 13 watts; 117 sq. mm / 25 watts; 59 sq. mm / 8 watts
“Zacate” AMD Fusion APU: 75 sq. mm / 18 watts
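The "One Design, Fewer Watts" claim can be checked by adding up the slide's figures for the three separate chips (66, 117, and 59 sq. mm; 13, 25, and 8 watts) and comparing the totals with the APU (75 sq. mm, 18 watts). This is a minimal sketch; the struct and function names are hypothetical, and the totals do not depend on which figure belongs to which chip.

```c
#include <assert.h>
#include <stddef.h>

/* Die area and power of one chip, as quoted on the slide. */
struct chip {
    double area_mm2;
    double watts;
};

/* Sum area and power over n separate chips. */
struct chip combine(const struct chip *parts, size_t n)
{
    struct chip total = {0.0, 0.0};
    assert(parts != NULL && n > 0);
    for (size_t i = 0; i < n; i++) {
        total.area_mm2 += parts[i].area_mm2;
        total.watts += parts[i].watts;
    }
    return total;
}

/* Fractional reduction when going from `before` to `after`. */
double reduction(double before, double after)
{
    assert(before > 0.0);
    return (before - after) / before;
}
```

The three chips add up to 242 sq. mm and 46 watts, so the 75 sq. mm, 18 watt APU is roughly a 69% cut in area and a 61% cut in power by these figures.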
Graphics and Media Processing Efficiency Improvements
2010 IGP-based Platform
[Diagram: CPU chip (CPU cores, UNB, MC) and a separate chip (GPU, UVD, SB functions), with DDR3 DIMM memory and PCIe; links labeled ~7 GB/sec, ~17 GB/sec, ~17 GB/sec]
Bandwidth pinch points and latency hold back the GPU capabilities
2011 APU-based Platform
[Diagram: APU chip (CPU cores, GPU, UVD, UNB / MC), with DDR3 DIMM memory and PCIe; links labeled ~27 GB/sec, ~27 GB/sec]
3X bandwidth between GPU and memory
Even the same sized GPU is substantially more effective in this configuration
Eliminate latency and power associated with the extra chip crossing
Substantially smaller physical footprint
Graphics requires memory bandwidth to bring full capabilities to life
“Ontario” & “Zacate” Architecture
APU
>2 x86 CPU Cores (40nm “Bobcat” core – 1 MB L2, 64-bit FPU)
>C6 and power gating
>Array of SIMD Engines
• DX11 graphics performance
• Industry leading 3D and graphics processing
>3rd Generation Unified Video Decoder
>H.264, VC1, DivX/Xvid format
>DDR3 800-1066, 2 DIMMs, 64 bit channel
>BGA package
Display and I/O
>Two dedicated digital display interfaces
• Configurable externally as HDMI, DVI, and/or Display Port
• Also supports a single link LVDS for internal panels
>Integrated VGA
>5x8 PCIe®
> “Hudson” Fusion Controller Hub
Working together OpenCL
ATI Stream SDK: OpenCL™ For Multicore x86 CPUs and GPUs
The Power of Fusion: Developers leverage heterogeneous architecture to deliver superior user experience
• First complete OpenCL™ development platform
• Certified OpenCL 1.0 compliant by the Khronos Group
• Write code that can scale well on multi-core CPUs and GPUs
• AMD delivers on the promise of OpenCL™, with both high-performance CPU and GPU technologies
• Available for download now as part of ATI Stream SDK beta program – includes documentation, samples, and developer support
http://developer.amd.com/
OpenCL™: Game-Changing Development Enabling Broad Adoption of GP-GPU Capabilities
Industry standard API: Open, multiplatform development platform for heterogeneous architectures
The power of Fusion: Leverages CPUs and GPUs for balanced system approach
Broad industry support: Created by architects from AMD, Apple, IBM, Intel, Nvidia, Sony, etc.
Fast track development: Ratified in December 2008; AMD is the first company to provide a complete OpenCL solution
Momentum: Enormous interest from mainstream developers and application ISVs
More stream-enabled applications across all markets
Open Standards: Maximize Developer Freedom and Addressable Market
Vendor specific: Cross-platform limiters
• Apple Display Connector
• 3dfx Glide
• Nvidia CUDA
• Nvidia Cg
• Rambus
• Unified Display Interface
Vendor neutral: Cross-platform enablers
• Digital Visual Interface
• OpenCL™
• DirectX®
• Certified DP
• JEDEC
• OpenGL®
Comparing OpenCL™ and DirectX® 11 DirectCompute
How will developers choose between OpenCL™ and DirectX® 11 DirectCompute?
Feature set is similar in both APIs
DirectX® 11 DirectCompute
Easiest path to add compute capabilities to existing DirectX applications
Windows Vista® and Windows® 7 only
OpenCL™
Ideal path for new applications porting to the GPU for the first time
True multiplatform: Windows®, Linux®, MacOS
Natural programming without dealing with a graphics API
Anatomy of OpenCL™
Language Specification
• C-based cross-platform programming interface
• Subset of ISO C99 with language extensions - familiar to developers
• Well-defined numerical accuracy - IEEE 754 rounding behavior with defined maximum error
• Online or offline compilation and build of compute kernel executables
• Includes a rich set of built-in functions
Platform Layer API
• A hardware abstraction layer over diverse computational resources
• Query, select and initialize compute devices
• Create compute contexts and work-queues
Runtime API
• Execute compute kernels
• Manage scheduling, compute, and memory resources
OpenCL Example
Scalar
void square(int n, const float *a, float *result)
{
    int i;
    for (i = 0; i < n; i++)
        result[i] = a[i] * a[i];
}
Data-Parallel
__kernel void dp_square(__global const float *a, __global float *result)
{
    int id = get_global_id(0);
    result[id] = a[id] * a[id];
}
// dp_square executes over "n" work-items
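To see what "executes over n work-items" means, the data-parallel kernel can be emulated in plain C: the runtime calls the kernel body once per work-item, and each call learns its own index from `get_global_id(0)`. The sketch below is an illustration only, not OpenCL host code. `emu_get_global_id` and `run_dp_square` are hypothetical stand-ins, and the sequential loop is an assumption made for clarity, since a real OpenCL runtime may run work-items in any order and in parallel.

```c
#include <assert.h>
#include <stddef.h>

/* Index of the work-item currently being "executed" by the emulated runtime. */
static size_t current_global_id;

/* Hypothetical stand-in for OpenCL's get_global_id(0). */
static size_t emu_get_global_id(void)
{
    return current_global_id;
}

/* Same body as the data-parallel kernel, written as an ordinary C function:
 * each invocation squares exactly one element. */
static void dp_square(const float *a, float *result)
{
    size_t id = emu_get_global_id();
    result[id] = a[id] * a[id];
}

/* Emulates launching dp_square over n work-items, one call per work-item. */
void run_dp_square(const float *a, float *result, size_t n)
{
    assert(a != NULL && result != NULL);
    for (size_t i = 0; i < n; i++) {
        current_global_id = i;  /* the runtime assigns each work-item its id */
        dp_square(a, result);
    }
}
```

Calling `run_dp_square(a, result, 4)` on `{1, 2, 3, -4}` fills `result` with `{1, 4, 9, 16}`. The loop that the scalar version writes explicitly has moved out of the kernel and into the runtime, which is what lets the same kernel scale across CPU cores or GPU SIMD engines.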
Summary
X86 PROCESSOR EVOLUTION
THE GPU AS AN ACCELERATOR
ACCELERATED PROCESSING UNITS
INTRODUCTION TO OpenCL
http://developer.amd.com
Thank you!