1
Performance Optimization for Mobile Memory Sub-Systems
Saira Malik MemCon 2015
2
What to Optimize / How to Optimize?
Need: Better QoS
Power
Bandwidth
Latency
Performance Exploration
Understanding Application Demands
Providing great Quality-Of-Service
3
Overview
§ ARM-based mobile SoC components and memory traffic requirements
§ Memory controller traffic optimization challenges
§ Role of memory controller and interconnect in performance optimization
§ Quality of Service (QoS) mechanisms
§ Performance results – block and system level
§ Summary and ongoing work
4
Example ARM-based Mobile Sub-system
[Block diagram. Labeled components:
 CPUs: Cortex-A72 (Cortex big CPU), Cortex-A53 (Cortex little CPU)
 GPU, video, display: Mali-T880 GPU, Mali-V550 Video, Mali-DP550 Display
 Other masters: I/O Coherent Masters, Peripherals
 Interrupt controller: GIC-500
 Memory management units (MMU): MMU-500 (x2)
 Cache coherent interconnect: CoreLink CCI-500 with Snoop Filter
 Hybrid interconnect, non-coherent interconnect: HNC-Beckett (x2)
 Memory system (integrated TrustZone): DMC-500 memory controllers (x4)
 DRAM: LPDDR3/LPDDR4
 Other labels: ELA, STM, PTM, Archer]
5
System Performance Analysis Flowchart
[Flowchart boxes:]
• Define use cases (mobile): camera, display/video, CPU
• Understand agent requirements
• Analyze workload (operation specific)
• Performance contracts
• Requirements: loss profile, latency sensitive, high BW, power budget
• Latency contracts: average vs. worst-case latency
• Determine traffic profiles: traffic nature, address distribution
• BW requirements
• Characterize agents
6
Memory Traffic Characterization
[Table columns: Agent, Traffic Type, Addressing, Traffic Nature, Latency, Contract with System]

Agent: CPU
  Traffic Type: Cache misses, write-backs
  Addressing: Random
  Traffic Nature: Low BW, persistent and urgent
  Latency: Low
  Contract with System: Low latency (LL)

Agent: GPU
  Traffic Type: Large-packet compute or gaming data
  Addressing: Random, large contiguous access
  Traffic Nature: Large bursts of traffic with quiet periods
  Latency: Variable
  Contract with System: Min guaranteed bandwidth (HB)

Agent: Display
  Traffic Type: Continuous image data streaming
  Addressing: Longer bursts with spatial locality
  Traffic Nature: Real time, non-disruptable traffic in large bursts with quiet periods
  Latency: Sensitive. Med-High (based on buffer size)
  Contract with System: Max guaranteed latency for real-time operation (RT)
7
Role of Memory Sub-System
[Block diagram repeated from the slide "Example ARM-based Mobile Sub-system"]
Memory Controller Characteristics:
• Fastest access to memory
• Secure access
• Max. system performance within a constrained energy budget
• Max utilization of DRAM
  • Highest possible bytes/ACT
  • Parallelize access (multiple banks)
  • Minimum turnaround time
  • Per-bank refreshes
• Obey DRAM timing parameters
• Optimize for its critical resource (DRAM)
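To make "highest possible bytes/ACT" concrete, the following is a minimal sketch that counts row activations for a stream of accesses. It assumes a simplified open-page model (one open row per bank, no refresh, no timing constraints) and a 64-byte burst; the function name and parameters are hypothetical and not taken from the DMC-500.

```python
def bytes_per_activate(accesses, burst_bytes=64):
    """Return the average number of bytes transferred per row activation (ACT).

    accesses: iterable of (bank, row) tuples in issue order.
    Simplified open-page model: each bank keeps one row open, and an access
    to any other row in that bank costs a new ACT. burst_bytes is an assumed
    transfer size per access.
    """
    open_rows = {}   # bank -> currently open row
    activates = 0
    count = 0
    for bank, row in accesses:
        count += 1
        if open_rows.get(bank) != row:
            activates += 1            # row miss: a new ACT is needed
            open_rows[bank] = row
    return count * burst_bytes / activates if activates else 0.0


# The same eight accesses to one bank, issued in two different orders.
interleaved = [(0, 1), (0, 2)] * 4      # alternates between rows 1 and 2
grouped = sorted(interleaved)           # all row 1 accesses, then all row 2
```

In this toy model, grouping accesses by row raises bytes/ACT because each row is activated once instead of once per access, which is the kind of reordering a memory controller can exploit.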
8
Memory Sub-System: SoC Challenges
• Agents have different performance requirements
  • Cannot treat all requests equally
• Has an integrated role
  • Effective resource allocation based on system intent
  • Close integration with system
  • Optimized with interconnect
  • Reduce bottlenecks
• Should take informed decisions
  • Trade memory efficiency for system efficiency
Memory sub-system integrated in SoC
Solve a multi-objective problem to satisfy all IP traffic types simultaneously in all conditions
9
System Performance Challenge/How to Solve?
§ How to efficiently arbitrate among agents that have different performance requirements?
  § Apply QoS mechanisms
    § Ensure traffic demands are met
    § Avoid congestion
    § Throttling to bound memory capacity
§ What are the steps involved?
  1. Interpret system characteristics
  2. Translate into quantitative attributes
  3. Categorize traffic streams
  4. Apply QoS mechanisms
  5. Measure it
  6. Desired system performance achieved? If yes, done. If no, change parameters and repeat.
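The apply, measure, and adjust steps form a feedback loop. A minimal sketch of that loop follows; every name in it is hypothetical, and the "measurement" is a made-up toy model rather than real data.

```python
def tune_qos(params, measure, target_met, adjust, max_iters=20):
    """Repeat apply -> measure -> adjust until the target is met."""
    for _ in range(max_iters):
        result = measure(params)          # apply QoS settings and measure
        if target_met(result):
            return params, result         # "Yes" branch of the flowchart
        params = adjust(params, result)   # "No" branch: change parameters
    raise RuntimeError("target not met within max_iters")


# Toy stand-ins (not real measurements): GPU bandwidth grows with its Qv.
def toy_measure(params):
    return {"gpu_bw": 10 + 5 * params["gpu_qv"]}


def toy_target_met(result):
    return result["gpu_bw"] >= 40


def toy_adjust(params, result):
    return {**params, "gpu_qv": params["gpu_qv"] + 1}


final_params, final_result = tune_qos(
    {"gpu_qv": 3}, toy_measure, toy_target_met, toy_adjust
)
```

In practice the measure step would be a simulation or emulation run, and the adjust step would change whichever QoS parameters the system exposes.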
10
Translation to QoS Attributes
§ Translate agent characteristics into measurable attributes
§ Assign agents a priority value (Qv)
  § Use AXI AxQoS signal
§ Each request entering DMC has Qv assigned
  § Provides capability to understand priority
  § DMC to honor Qv for arbitration
  § Maps Q-values to a Qv band

Fabric Master         Fabric QoS
Reserved              15
-                     14
Disp-0                13
Disp-1                12
Video                 11
Big-CPU/Little-CPU    10
GPU                   7

Qv band      Qv range
High-High    15-12
High         11-8
Medium       7-4
Low          3-0
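The mapping from a 4-bit Qv to a band can be written down directly from the ranges on this slide. The sketch below uses the slide's master-to-Qv assignments; the function and variable names are illustrative only.

```python
# Fabric QoS values per master, copied from the slide.
FABRIC_QOS = {
    "Disp-0": 13,
    "Disp-1": 12,
    "Video": 11,
    "Big-CPU/Little-CPU": 10,
    "GPU": 7,
}


def qv_band(qv):
    """Map a 4-bit Qv (0-15) to the band ranges shown on the slide."""
    if not 0 <= qv <= 15:
        raise ValueError("Qv must fit in 4 bits (0-15)")
    if qv >= 12:
        return "High-High"
    if qv >= 8:
        return "High"
    if qv >= 4:
        return "Medium"
    return "Low"


# Band seen by the DMC for each master's default Qv.
bands = {master: qv_band(qv) for master, qv in FABRIC_QOS.items()}
```

With these defaults the two display streams land in High-High, video and the CPUs in High, and the GPU in Medium.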
11
QoS Mechanisms and their Measurement
§ Key QoS mechanisms supported by ARM SoCs
  1. Static QoS
  2. Priority Escalation
  3. Regulations
     a) Macro-regulation based on system traffic
     b) Micro-regulation via deadline scheduling
     c) Complete regulation via QoSACCEPT protocol
§ Measurement Test Setup
  § Unit Level Experiments
    § Single DMC
    § BFMs to inject equal traffic with different Qv
    § Random traffic simulation
    § Tested for un-congested and congested environments
  § System Level Experiments
    § Multiple DMC + CCI + rest of mobile sub-system
    § RTL running in emulation environment with real-world traffic from standard benchmarks
§ Definitions
  § Non-congested Environment: fair allocation of DMC buffer slots; fair allocation of DRAM bank resources
  § Congested Environment: Qv is used to test for arbitration and resource allocation
12
1. Static QoS Arbitration
[Charts: Non-Congested System; Congested System]
Key Issue: Low priority requests can suffer starvation!
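The starvation issue can be reproduced with a toy strict-priority arbiter. This is a minimal sketch assuming one request is served per cycle and Qv never changes; it is not a model of the DMC itself, and all names are hypothetical.

```python
def static_arbiter(arrivals, cycles):
    """Serve one request per cycle, always picking the highest fixed Qv.

    arrivals: dict mapping cycle -> list of (name, qv) requests arriving then.
    Returns a dict mapping request name -> cycle in which it was served.
    """
    pending = []
    served = {}
    for cycle in range(cycles):
        pending.extend(arrivals.get(cycle, []))
        if pending:
            winner = max(pending, key=lambda req: req[1])  # highest Qv wins
            pending.remove(winner)
            served[winner[0]] = cycle
    return served


# Congested case: one low-priority request competes with a steady stream
# of high-priority requests (one new arrival every cycle).
congested = {c: [(f"hi{c}", 12)] for c in range(10)}
congested[0].append(("lo", 3))
served_congested = static_arbiter(congested, cycles=10)

# Non-congested case: the same two requests with no further arrivals.
served_idle = static_arbiter({0: [("hi0", 12), ("lo", 3)]}, cycles=10)
```

In the congested run the low-priority request is still waiting when the simulation ends, while in the non-congested run it is served as soon as the single high-priority request has gone.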
13
2. Priority Escalation
§ To permit liveness, Qv should be allowed to change over time
§ Enable priority escalation
  § Tracks aging on transactions
  § Raises priority
  § Prevents starvation
§ Measurements (Test Setup)
[Chart: Congested System with Escalation]
• Priority escalation improves transaction latency response for transactions with low QoS settings
Key Issue: No guarantee for deadline requirements. No dynamic control.
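A minimal sketch of aging-based escalation follows, assuming the effective Qv rises by one for every `age_step` cycles a request has waited. The linear aging rule and the `age_step` parameter are assumptions made for illustration; the slides do not specify how the DMC ages transactions.

```python
def escalating_arbiter(arrivals, cycles, age_step=4):
    """Serve one request per cycle, using a Qv that grows with waiting time.

    arrivals: dict mapping cycle -> list of (name, qv) requests arriving then.
    Returns a dict mapping request name -> cycle in which it was served.
    """
    pending = []   # entries are (name, base_qv, arrival_cycle)
    served = {}
    for cycle in range(cycles):
        pending.extend((n, q, cycle) for n, q in arrivals.get(cycle, []))
        if not pending:
            continue

        def effective_qv(req):
            _, base_qv, arrived = req
            # Hypothetical aging rule: +1 Qv per age_step cycles, capped at 15.
            return min(15, base_qv + (cycle - arrived) // age_step)

        winner = max(pending, key=effective_qv)
        pending.remove(winner)
        served[winner[0]] = cycle
    return served


# One low-priority request against a steady high-priority stream.
arrivals = {c: [(f"hi{c}", 12)] for c in range(60)}
arrivals[0].append(("lo", 3))
served = escalating_arbiter(arrivals, cycles=60)
```

Here the low-priority request is eventually served once its aged Qv catches up with the incoming stream, but nothing bounds that wait to a specific deadline, which matches the key issue noted on the slide.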
14
[Block diagram repeated from the slide "Example ARM-based Mobile Sub-system"]
3. Qv Regulation
§ Applies regulations at different levels
  § Source (agents/interconnect)
  § Destination (memory sub-system)
§ Overrides source Qv based on system needs
§ Allows a transaction stream to modify its priority over time to achieve target service at minimal cost to the system
[Diagram labels: Macro-Regulation, Micro-Regulation]
15
3a. Macro-Regulation at System Level : Example Scenario
[Chart: Low Latency (LL), Real Time (RT), and High Bandwidth (HB) traffic across the bands High-High, High, Medium, Low]
• Low Latency traffic is macro-regulated from H to M
  • If the CPU swamps the system, it stops getting special treatment
• Real Time is HH and regulated to manage deadline
  • Get deterministic latency through network
  • Differentiate deadline through Qv
• High Bandwidth is L, macro-escalated to M
  • Default to low priority
  • Escalate if not achieving sufficient service
Approach can be used to infer the state of the system.
Can set policy to achieve goals in context of system status and associated costs.
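The three macro-regulation rules can be sketched as a small policy function. The thresholds `ll_bw_cap` and `hb_bw_floor` are hypothetical, as is the idea of driving the decision from an observed bandwidth share; the slide states only the direction of each adjustment. Real Time is simply held at HH here, which simplifies the deadline-based regulation the slide mentions.

```python
def macro_regulate(traffic_class, observed_bw, ll_bw_cap=0.30, hb_bw_floor=0.20):
    """Return the priority band for a stream.

    observed_bw is the stream's observed share of memory bandwidth (0.0-1.0).
    ll_bw_cap and hb_bw_floor are hypothetical policy thresholds.
    """
    if traffic_class == "RT":
        return "HH"    # real time keeps the top band in this sketch
    if traffic_class == "LL":
        # A CPU that swamps the system stops getting special treatment.
        return "M" if observed_bw > ll_bw_cap else "H"
    if traffic_class == "HB":
        # Default to low priority; escalate if service is insufficient.
        return "M" if observed_bw < hb_bw_floor else "L"
    raise ValueError(f"unknown traffic class: {traffic_class}")
```

For example, `macro_regulate("LL", 0.5)` demotes a CPU stream using half the bandwidth to M, while `macro_regulate("HB", 0.05)` escalates a starved GPU stream from L to M.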
16
3b. Micro Regulation – DMC
[Chart: Low Latency (LL), Real Time (RT), and High Bandwidth (HB) traffic across the bands High-High, High, Medium, Low]
• Real Time is remapped to M on ingress until it times out
  • Remap based on timeout latency
  • Qv/QID to override and manage QoS over time
All requests are escalated over time to avoid starvation
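A minimal sketch of the remap-until-timeout idea follows. It assumes that a Real Time request is treated as Medium on ingress and promoted to High-High once it has waited `rt_timeout` cycles; the timeout value, the promotion target, and all names are assumptions for illustration. The general escalation of all requests over time is left out to keep the example short.

```python
BAND_RANK = {"L": 0, "M": 1, "H": 2, "HH": 3}


def dmc_band(traffic_class, waited, rt_timeout=80):
    """Band the arbiter uses for a request that has waited `waited` cycles."""
    if traffic_class == "RT":
        # Remapped to Medium on ingress; promoted once the timeout expires.
        return "HH" if waited >= rt_timeout else "M"
    if traffic_class == "LL":
        return "H"
    if traffic_class == "HB":
        return "L"
    raise ValueError(f"unknown traffic class: {traffic_class}")


def pick(requests, rt_timeout=80):
    """requests: list of (name, traffic_class, waited). Return the winner's name."""
    best = max(requests, key=lambda r: BAND_RANK[dmc_band(r[1], r[2], rt_timeout)])
    return best[0]
```

With these assumptions a CPU request wins over a fresh display request, and the display request wins once it has waited past the timeout, so low-latency traffic is favored only while the real-time deadline is not at risk.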
17
Micro-Regulation via Deadline Scheduling
[Chart: Deadline arbitration with inc. LL and RT traffic. Series: Real Time, High BW, Low Latency. Annotations: "Programmed Latency Deadline for RT Master"; "Latency of LL Master Remains Lower Prior to RT Deadline"]
• Low latency (LL) transactions have better latency response than other transactions until the latency deadline approaches
• The latency deadline forces near 100% of Real-time (RT) transactions to complete ahead of LL transactions
18
[Block diagram repeated from the slide "Example ARM-based Mobile Sub-system"]
3c. QoSAccept
§ Issue: CPU Latency not bounded when Real Time traffic exceeds
§ Solution:
  § QoSAccept threshold visibility from DMC to the interconnect
  § Master (interconnect) decides which transactions to forward to slave (DMC) based on the QoS levels acceptable by slave
Ensures non-blocking access for real-time and minimizes CPU latency
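The forwarding decision can be sketched as a simple filter, assuming the DMC advertises a minimum acceptable QoS value and the interconnect holds back anything below it. The function name and the list-based queue are hypothetical; real hardware makes this decision per transaction on the interface signals.

```python
def forward_to_dmc(pending, qos_accept):
    """Split pending transactions into (forwarded, held) lists.

    pending: list of (master, qos) tuples waiting in the interconnect.
    qos_accept: minimum QoS value the DMC currently advertises as acceptable.
    """
    forwarded = [t for t in pending if t[1] >= qos_accept]
    held = [t for t in pending if t[1] < qos_accept]
    return forwarded, held


pending = [("CPU", 15), ("GPU", 11), ("Disp-0", 13), ("GPU", 7)]
forwarded, held = forward_to_dmc(pending, qos_accept=12)
```

With a threshold of 12, the CPU and display transactions go to the DMC while the GPU transactions wait in the interconnect, so they do not occupy DMC resources ahead of higher-priority traffic.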
19
Test Experiment on System with QoSAccept
Test Setup
§ 1x CPUs and 3x GPUs, interleaved
§ Shareable traffic: CPU 100%, GPU 30%
§ Random address, read only, 0% hit
§ Used 3x IO masters to saturate DMC with RT traffic
§ Threshold value for QoSACCEPT = 12

[Chart: CCI Internal Latency for CPU accesses. Y-axis: clock cycles. Bars: "GPU - off"; "GPU-on, GPU=11, CPU=11"; "GPU-on, GPU=11, CPU=15"; "All with QoSAccept". Chart annotations: 55%, 83%]

QoSAccept   GPU   GPU QoS Value   CPU QoS Value
OFF         OFF   -               11
OFF         ON    11              11
OFF         ON    11              15
ON          ON    11              15

QoSACCEPT enables optimized CPU latency within the interconnect with GPU traffic ON
20
Summary
§ Ongoing work
  § Effect of enabling both macro and micro regulation on overall system efficiency
  § Better QoS management by obtaining system heuristics
§ As a Systems-IP company we realize it's a system-level issue and not just a DMC issue.
§ We are trying to enable schemes and provide ways through our IPs to help our customers build better and more efficient high-performance products
21
Thank You
22
Performance runs on ARM RTL Emulation
[Chart: Dual Display BW sweep. X-axis: Total DPU BW (%). Y-axis: Bandwidth (%). Series: Total DPU, GPU_BW, LCPU_BW]
[Chart: Dual Display BW sweep. X-axis: Total DPU BW (%). Y-axes: DPU FIFO Usage (%), Master Rd Latency. Series: DPU0_RdLat, LCPU_RdLat, GPU_RdLat, DPU0_FIFO]
• Well managed DPU latency with heavy CPU and GPU traffic
• GPU BW is impacted by CPU as expected (lower priority)