![Page 1: A Lightweight Hybrid Hardware/Software Approach for Object-Relative Memory Profiling Licheng Chen, Zehan Cui, Yungang Bao, Mingyu Chen, Yongbing Huang,](https://reader030.vdocuments.net/reader030/viewer/2022032605/56649e7b5503460f94b7bb64/html5/thumbnails/1.jpg)
A Lightweight Hybrid Hardware/Software Approach for Object-Relative Memory Profiling
Licheng Chen, Zehan Cui, Yungang Bao, Mingyu Chen, Yongbing Huang, and Guangming Tan
Institute of Computing Technology (ICT) Chinese Academy of Sciences (CAS)
ISPASS 2012, April 2, 2012
![Page 2: A Lightweight Hybrid Hardware/Software Approach for Object-Relative Memory Profiling Licheng Chen, Zehan Cui, Yungang Bao, Mingyu Chen, Yongbing Huang,](https://reader030.vdocuments.net/reader030/viewer/2022032605/56649e7b5503460f94b7bb64/html5/thumbnails/2.jpg)
Background
• Memory behavior is the key factor in the performance of a program.
• Understanding memory behavior is important for identifying the bottlenecks of both architecture and applications.
• For example:
– The TLB is an essential component of the memory system
– Applications' working sets tend to grow larger and larger, leading to serious TLB misses
– Study 1: TLB misses can degrade system performance by 5~14% [Bhargava'08]
– Study 2: a large number of TLB misses in multi-threaded programs are redundant and predictable, which implies optimization potential [Bhattacharjee'08]
Done by memory profiling
![Page 3: A Lightweight Hybrid Hardware/Software Approach for Object-Relative Memory Profiling Licheng Chen, Zehan Cui, Yungang Bao, Mingyu Chen, Yongbing Huang,](https://reader030.vdocuments.net/reader030/viewer/2022032605/56649e7b5503460f94b7bb64/html5/thumbnails/3.jpg)
Memory Profiling
• Memory profiling collects memory behavior information during the execution of programs.
• Profiling can be performed
– for different hardware components: TLB / Cache / DRAM
– at different software levels: Objects (Array, List, etc.) / Function / Application / Whole System
![Page 4: A Lightweight Hybrid Hardware/Software Approach for Object-Relative Memory Profiling Licheng Chen, Zehan Cui, Yungang Bao, Mingyu Chen, Yongbing Huang,](https://reader030.vdocuments.net/reader030/viewer/2022032605/56649e7b5503460f94b7bb64/html5/thumbnails/4.jpg)
Object Memory Profiling
• An object refers to a group of data stored as a unit [Wu'04]
– Distinguish regular patterns from mixed and irregular traces
• Valuable for optimization
– Memory trace compression
– Data layout
– Object-level prefetching
– Cache partition [Soft-OLP, PACT 2009]
[Figure: Whole System Traces, Application Traces, Object Trace; patterns range from Irregular to Regular]
![Page 5: A Lightweight Hybrid Hardware/Software Approach for Object-Relative Memory Profiling Licheng Chen, Zehan Cui, Yungang Bao, Mingyu Chen, Yongbing Huang,](https://reader030.vdocuments.net/reader030/viewer/2022032605/56649e7b5503460f94b7bb64/html5/thumbnails/5.jpg)
Current Profiling Approaches
• Existing approaches
– Compiler-driven: requires re-compiling/re-linking and source code
– Instrumentation: heavy overhead
– Simulation: accuracy problems, slow
– Performance counters: lack detailed information
• None of them can observe page table walks caused by TLB misses
• We propose a hybrid hardware/software approach for object memory profiling
– Accurate: real application & real system
– Lightweight
– Tracks page table walks at object level
![Page 6: A Lightweight Hybrid Hardware/Software Approach for Object-Relative Memory Profiling Licheng Chen, Zehan Cui, Yungang Bao, Mingyu Chen, Yongbing Huang,](https://reader030.vdocuments.net/reader030/viewer/2022032605/56649e7b5503460f94b7bb64/html5/thumbnails/6.jpg)
Outline
• Background
• Design and Implementation
• Experimental Results
• Conclusion
![Page 7: A Lightweight Hybrid Hardware/Software Approach for Object-Relative Memory Profiling Licheng Chen, Zehan Cui, Yungang Bao, Mingyu Chen, Yongbing Huang,](https://reader030.vdocuments.net/reader030/viewer/2022032605/56649e7b5503460f94b7bb64/html5/thumbnails/7.jpg)
An Overview
[Figure: Physical Address Trace, Virtual Address Trace, and Object Access Pattern]
Object Access Pattern: Matrix (VA: 0x1f05000)
Virtual Address Trace: 0x1f05000, 0x1f06000, 0x1f07000, ……, 0x1f15000, 0x1f16000, 0x1f17000, ……, 0x1f25000, 0x1f26000, ……
Physical Address Trace: 0x398f24a, 0x398f24b, 0x398f24c, ……, 0x1af4aa, 0x1af4a6, 0x1af4a8, ……, 0x38d2cfc, 0x38d2cfd, ……
![Page 8: A Lightweight Hybrid Hardware/Software Approach for Object-Relative Memory Profiling Licheng Chen, Zehan Cui, Yungang Bao, Mingyu Chen, Yongbing Huang,](https://reader030.vdocuments.net/reader030/viewer/2022032605/56649e7b5503460f94b7bb64/html5/thumbnails/8.jpg)
HMTT
• Hybrid Memory Trace Toolkit
– A DDR3 SDRAM compatible memory trace monitoring system
– Adopts hardware snooping technology
[Photo labels: DIMM plugged on the other side; PCIE Cable Connector]
Memory Trace: <time_stamp, r/w, phy_addr>
Advantages:
• Platform independent
• Negligible overhead
• Full-system real memory traces, including OS and page table walks
![Page 9: A Lightweight Hybrid Hardware/Software Approach for Object-Relative Memory Profiling Licheng Chen, Zehan Cui, Yungang Bao, Mingyu Chen, Yongbing Huang,](https://reader030.vdocuments.net/reader030/viewer/2022032605/56649e7b5503460f94b7bb64/html5/thumbnails/9.jpg)
Challenge (1)

• How to translate the physical address trace into the virtual address trace of a specific process?
• Modify the OS kernel to obtain the page table
• Look up each phy_addr in the dumped page table
• Generate the virtual trace of each process
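The lookup step can be illustrated with a small sketch. It assumes 4 KB pages and a dumped page table represented as a plain dictionary per process; the names `build_reverse_table` and `phys_to_virt` and the sample numbers are hypothetical and are not taken from HMTT.

```python
PAGE_SHIFT = 12                      # assumption: 4 KB pages
PAGE_MASK = (1 << PAGE_SHIFT) - 1


def build_reverse_table(dumped_page_tables):
    """Invert {pid: {vpn: pfn}} into {pfn: [(pid, vpn), ...]}."""
    reverse = {}
    for pid, table in dumped_page_tables.items():
        for vpn, pfn in table.items():
            reverse.setdefault(pfn, []).append((pid, vpn))
    return reverse


def phys_to_virt(reverse, phy_addr, pid):
    """Return the virtual address of phy_addr in process pid, or None."""
    pfn = phy_addr >> PAGE_SHIFT
    offset = phy_addr & PAGE_MASK
    for owner, vpn in reverse.get(pfn, []):
        if owner == pid:
            return (vpn << PAGE_SHIFT) | offset
    return None


# Hypothetical dump: process 1234 maps virtual page 0x1f05 to frame 0x398f2.
reverse = build_reverse_table({1234: {0x1F05: 0x398F2}})
print(hex(phys_to_virt(reverse, 0x398F2040, 1234)))
```

Keeping a list per frame lets one physical page belong to several processes, which matters when pages are shared.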
![Page 10: A Lightweight Hybrid Hardware/Software Approach for Object-Relative Memory Profiling Licheng Chen, Zehan Cui, Yungang Bao, Mingyu Chen, Yongbing Huang,](https://reader030.vdocuments.net/reader030/viewer/2022032605/56649e7b5503460f94b7bb64/html5/thumbnails/10.jpg)
Challenge (2)
• How to synchronize hardware and software when a page table update occurs in the kernel?
• Physical page allocation/free in the kernel
• Trigger annotations in the OS VM module
• Update the dumped page table
• Send a sync_tag to the hardware
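One way to picture the synchronization is an offline replay in which every sync_tag found in the hardware trace applies the next logged page table update before later accesses are translated. The sketch below is only an assumption about how such a replay could look: the record shapes, the `replay` function, and the one-mapping-per-frame simplification are all hypothetical.

```python
PAGE_SHIFT = 12                      # assumption: 4 KB pages
PAGE_MASK = (1 << PAGE_SHIFT) - 1


def replay(trace, updates, pid):
    """Translate a trace while applying page table updates in trace order.

    trace:   list of ("sync",) or ("mem", phy_addr) records (hypothetical shape)
    updates: list of ("map" | "unmap", pid, vpn, pfn), logged by the kernel side
    Returns one virtual address (or None) per "mem" record.
    """
    frame_owner = {}                 # pfn -> (pid, vpn); assumes no shared pages
    pending = iter(updates)
    result = []
    for record in trace:
        if record[0] == "sync":
            op, upd_pid, vpn, pfn = next(pending)
            if op == "map":
                frame_owner[pfn] = (upd_pid, vpn)
            else:
                frame_owner.pop(pfn, None)
            continue
        phy_addr = record[1]
        hit = frame_owner.get(phy_addr >> PAGE_SHIFT)
        if hit is not None and hit[0] == pid:
            result.append((hit[1] << PAGE_SHIFT) | (phy_addr & PAGE_MASK))
        else:
            result.append(None)
    return result
```

The same physical address can therefore translate differently before and after a sync_tag, which is why the ordering of tags within the trace matters.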
![Page 11: A Lightweight Hybrid Hardware/Software Approach for Object-Relative Memory Profiling Licheng Chen, Zehan Cui, Yungang Bao, Mingyu Chen, Yongbing Huang,](https://reader030.vdocuments.net/reader030/viewer/2022032605/56649e7b5503460f94b7bb64/html5/thumbnails/11.jpg)
Challenge (3)
• How to translate virtual addresses to objects without modifying source code?
[Figure: matrix = malloc(0x1000) → Object: matrix in the Virtual Address Space; matrix = mymalloc(0x1000) → Object-VA Mapping Table]
• The role of malloc() is to map VA to object
• Use dynamic library overriding to replace malloc()
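The object-VA mapping table can be sketched as a sorted interval table. This is a minimal offline model, assuming the replaced malloc() logs a (name, start, size) entry per allocation; the class `ObjectTable` and its methods are hypothetical names, and the interception inside a shared library is not shown.

```python
import bisect


class ObjectTable:
    """Maps virtual addresses to the object whose allocation contains them."""

    def __init__(self):
        self._starts = []            # sorted allocation start addresses
        self._entries = []           # parallel list of (start, size, name)

    def record_alloc(self, name, start, size):
        """Log one allocation, keeping both lists sorted by start address."""
        i = bisect.bisect_left(self._starts, start)
        self._starts.insert(i, start)
        self._entries.insert(i, (start, size, name))

    def lookup(self, va):
        """Return the name of the object containing va, or None."""
        i = bisect.bisect_right(self._starts, va) - 1
        if i >= 0:
            start, size, name = self._entries[i]
            if va < start + size:
                return name
        return None


table = ObjectTable()
table.record_alloc("matrix", 0x1F05000, 0x1000)   # values from the slide's example
print(table.lookup(0x1F05040))
```

A binary search keeps each lookup logarithmic in the number of live objects, which helps when a trace holds many references.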
![Page 12: A Lightweight Hybrid Hardware/Software Approach for Object-Relative Memory Profiling Licheng Chen, Zehan Cui, Yungang Bao, Mingyu Chen, Yongbing Huang,](https://reader030.vdocuments.net/reader030/viewer/2022032605/56649e7b5503460f94b7bb64/html5/thumbnails/12.jpg)
Put them all together
[Figure: Physical Address Trace → (Dumped Page Table, sync_tag, page walk) → Virtual Address Trace → (Object-VA Mapping Table) → Object Access Pattern, Matrix (VA: 0x1f05000); same example addresses as in the overview]
Use the page table to distinguish three types of memory access:
• Sync_tag → update page table
• Access to the page table itself → page table walk due to TLB miss
• Other memory access → virtual address
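The three-way split above can be sketched as a small classifier over <time_stamp, r/w, phy_addr> records. It assumes that a sync_tag shows up as an access inside a reserved physical region and that the set of frames holding page tables is known from the dumped page table; `classify`, `summarize`, and the region bounds used in the example are hypothetical.

```python
from collections import Counter

PAGE_SHIFT = 12                      # assumption: 4 KB pages


def classify(phy_addr, sync_region, page_table_frames):
    """Label one physical access as 'sync_tag', 'page_walk', or 'data'."""
    low, high = sync_region
    if low <= phy_addr < high:
        return "sync_tag"            # triggers an update of the dumped page table
    if (phy_addr >> PAGE_SHIFT) in page_table_frames:
        return "page_walk"           # the access touches a page table page
    return "data"                    # ordinary access, to be translated to a VA


def summarize(trace, sync_region, page_table_frames):
    """Count access types over (time_stamp, rw, phy_addr) records."""
    return Counter(
        classify(phy_addr, sync_region, page_table_frames)
        for _time_stamp, _rw, phy_addr in trace
    )


# Hypothetical setup: top 0.25 GB of a 4 GB space reserved, one page table frame.
region = (0xF0000000, 0x100000000)
trace = [(0, "w", 0xF0000010), (1, "r", 0x1AF4A008), (2, "r", 0x398F2040)]
print(summarize(trace, region, {0x1AF4A}))
```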
![Page 13: A Lightweight Hybrid Hardware/Software Approach for Object-Relative Memory Profiling Licheng Chen, Zehan Cui, Yungang Bao, Mingyu Chen, Yongbing Huang,](https://reader030.vdocuments.net/reader030/viewer/2022032605/56649e7b5503460f94b7bb64/html5/thumbnails/13.jpg)
Evaluation Methodology
Processor: Intel Xeon E5504, 2.0GHz, 2 sockets, 4 cores per socket (8 cores in total)
Private Cache:
– L1 D-Cache: 32KB, 8-way, 64Byte/line
– L1 I-Cache: 32KB, 4-way, 64Byte/line
– L2: 256KB, 8-way, 64Byte/line
Shared Cache: L3 4MB, 16-way, 64Byte/line
TLB (private):
– DTLB0: 64 entries for 4-KByte pages, 32 entries for huge pages (2MByte)
– TLB1: 512 entries for 4-KByte pages
Memory: DDR3-800 RDIMM, dual-rank, plugged into Socket 0, 4GB
– 0.25GB reserved for HMTT configuration and buffer
– 3.75GB system available
Operating System: CentOS 5.3, Linux kernel 2.6.32.18
Benchmarks:
– Multithreaded PARSEC 2.1
– A custom hybrid MPI/pthread BFS implementation of Graph500-1.2
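As a back-of-the-envelope aid that is not part of the slides, the TLB sizes listed above can be turned into "TLB reach", i.e. entries multiplied by page size. The helper name `tlb_reach` is hypothetical; the entry counts and page sizes come from the configuration above.

```python
KB = 1024
MB = 1024 * KB


def tlb_reach(entries, page_size):
    """Bytes of memory that a fully populated TLB level can map at once."""
    return entries * page_size


print(tlb_reach(64, 4 * KB) // KB, "KB")    # DTLB0, 4-KByte pages
print(tlb_reach(512, 4 * KB) // MB, "MB")   # TLB1, 4-KByte pages
print(tlb_reach(32, 2 * MB) // MB, "MB")    # DTLB0, 2-MByte huge pages
```

With 4-KByte pages even the 512-entry level covers only 2 MB, while 32 huge-page entries cover 64 MB, which gives some intuition for why large objects with scattered accesses can generate many page walks.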
![Page 14: A Lightweight Hybrid Hardware/Software Approach for Object-Relative Memory Profiling Licheng Chen, Zehan Cui, Yungang Bao, Mingyu Chen, Yongbing Huang,](https://reader030.vdocuments.net/reader030/viewer/2022032605/56649e7b5503460f94b7bb64/html5/thumbnails/14.jpg)
Validation
• For the SpMV benchmark (CSR): y = ax * xhost
Our system is able to distinguish regular access patterns from irregular patterns
• Micro-benchmark: the error is less than 2%
![Page 15: A Lightweight Hybrid Hardware/Software Approach for Object-Relative Memory Profiling Licheng Chen, Zehan Cui, Yungang Bao, Mingyu Chen, Yongbing Huang,](https://reader030.vdocuments.net/reader030/viewer/2022032605/56649e7b5503460f94b7bb64/html5/thumbnails/15.jpg)
Overhead
• Two main overheads:
– Dumping page table traces: +dump_pt
– Dumping object-VA mapping: +dump_obj
• Monitoring objects >= 4KB: these account for most memory references
[Figure: normalized overhead (y-axis from 0.96 to 1.06) for Origin, +dump_pt, and +dump_obj; annotations: <1%, <2%]
![Page 16: A Lightweight Hybrid Hardware/Software Approach for Object-Relative Memory Profiling Licheng Chen, Zehan Cui, Yungang Bao, Mingyu Chen, Yongbing Huang,](https://reader030.vdocuments.net/reader030/viewer/2022032605/56649e7b5503460f94b7bb64/html5/thumbnails/16.jpg)
Case Study 1: BFS (Breadth-First Search)
• The column object accounts for about 71% of page walks → key object
• Optimization: use huge pages for the column object
– Speedup: about 12% for 8-thread, 8% for 128-thread
[Figure: normalized speedup (0.8 to 1.4) vs. number of threads (1, 2, 4, 8, 16, 32, 64, 128), w/o hugetlb vs. w/ hugetlb; annotation: 8.18%]
[Figure: percentage of page walks (0% to 120%) vs. number of threads (1, 2, 4, 32, 128) for objects rowstarts, column, pred, oldq, newq, visited]
![Page 17: A Lightweight Hybrid Hardware/Software Approach for Object-Relative Memory Profiling Licheng Chen, Zehan Cui, Yungang Bao, Mingyu Chen, Yongbing Huang,](https://reader030.vdocuments.net/reader030/viewer/2022032605/56649e7b5503460f94b7bb64/html5/thumbnails/17.jpg)
Case Study 2: Canneal (PARSEC)
• Cache-aware simulated annealing (SA) to minimize the routing cost of a chip design
• Two objects contribute most of the memory accesses: _elements and _location
[Figure: Main Objects in Canneal; number of memory requests (0E+00 to 1E+09) for _elements_r, _elements_w, _location_r, _location_w, and others, with 1, 2, 4, and 8 threads]
The number of memory accesses barely changes as the thread count increases.
![Page 18: A Lightweight Hybrid Hardware/Software Approach for Object-Relative Memory Profiling Licheng Chen, Zehan Cui, Yungang Bao, Mingyu Chen, Yongbing Huang,](https://reader030.vdocuments.net/reader030/viewer/2022032605/56649e7b5503460f94b7bb64/html5/thumbnails/18.jpg)
Case Study 2: Canneal
[Figure: number of page walks (0E+00 to 3E+08) vs. number of threads (1, 2, 4, 8) for total, _elements, and _locations]
• The _elements object contributes most of the increased page walks
• Put the _elements object into huge pages to reduce TLB misses → Speedup: about 5% for 8-thread
[Figure: normalized speedup (0.9 to 1.1) vs. number of threads (1, 2, 4, 8), w/o hugetlb vs. w/ hugetlb]
![Page 19: A Lightweight Hybrid Hardware/Software Approach for Object-Relative Memory Profiling Licheng Chen, Zehan Cui, Yungang Bao, Mingyu Chen, Yongbing Huang,](https://reader030.vdocuments.net/reader030/viewer/2022032605/56649e7b5503460f94b7bb64/html5/thumbnails/19.jpg)
A Visual Demo of the HMTT
![Page 20: A Lightweight Hybrid Hardware/Software Approach for Object-Relative Memory Profiling Licheng Chen, Zehan Cui, Yungang Bao, Mingyu Chen, Yongbing Huang,](https://reader030.vdocuments.net/reader030/viewer/2022032605/56649e7b5503460f94b7bb64/html5/thumbnails/20.jpg)
Conclusion
• We have designed and implemented a hybrid hardware/software approach to conduct object-relative memory profiling.
– Accurate: real application & real system
– Lightweight
– Tracks page table walks at object level
• We present two case studies to show how the approach can help users better understand memory behavior and optimize performance.
• We intend to use this approach to analyze virtual machines running on real machines.
![Page 21: A Lightweight Hybrid Hardware/Software Approach for Object-Relative Memory Profiling Licheng Chen, Zehan Cui, Yungang Bao, Mingyu Chen, Yongbing Huang,](https://reader030.vdocuments.net/reader030/viewer/2022032605/56649e7b5503460f94b7bb64/html5/thumbnails/21.jpg)
Thanks! & Questions?
![Page 22: A Lightweight Hybrid Hardware/Software Approach for Object-Relative Memory Profiling Licheng Chen, Zehan Cui, Yungang Bao, Mingyu Chen, Yongbing Huang,](https://reader030.vdocuments.net/reader030/viewer/2022032605/56649e7b5503460f94b7bb64/html5/thumbnails/22.jpg)
Extra Slides
![Page 23: A Lightweight Hybrid Hardware/Software Approach for Object-Relative Memory Profiling Licheng Chen, Zehan Cui, Yungang Bao, Mingyu Chen, Yongbing Huang,](https://reader030.vdocuments.net/reader030/viewer/2022032605/56649e7b5503460f94b7bb64/html5/thumbnails/23.jpg)
Memory Profiling Approaches
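
| Approach | Accurate | Detailed | Low overhead | Page walks+ |
|---|---|---|---|---|
| Instrument | √ | √ | × | × |
| Simulator | * | √ | × | × |
| Performance Counter | √ | × | √ | * |
| Compiler | √ | √ | √ | × |
| Hybrid H/S | √ | √ | √ | √ |

Note: √-Yes, ×-No, *-Maybe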
![Page 24: A Lightweight Hybrid Hardware/Software Approach for Object-Relative Memory Profiling Licheng Chen, Zehan Cui, Yungang Bao, Mingyu Chen, Yongbing Huang,](https://reader030.vdocuments.net/reader030/viewer/2022032605/56649e7b5503460f94b7bb64/html5/thumbnails/24.jpg)
Reverse Page Table
• Physical address → pid, virtual address
[Figure: table indexed by physical page number (0, 1, 2, 3, ..., N-1); each entry holds a list of (Vaddr, pid) pairs, e.g. Vaddr1 pid1 ... Vaddrk pidk]
![Page 25: A Lightweight Hybrid Hardware/Software Approach for Object-Relative Memory Profiling Licheng Chen, Zehan Cui, Yungang Bao, Mingyu Chen, Yongbing Huang,](https://reader030.vdocuments.net/reader030/viewer/2022032605/56649e7b5503460f94b7bb64/html5/thumbnails/25.jpg)
Validation

| Obj | Read | Write | Rate | Per | Error |
|---|---|---|---|---|---|
| a0 | 4,194,370 | 0 | 4:0 | 4:0 | 0% |
| a1 | 4,194,310 | 1,048,576 | 4:1 | 4:1 | 0% |
| a2 | 4,194,369 | 2,096,927 | 4:2 | 4:2 | 0% |
| a3 | 4,194,303 | 3,087,379 | 4:2.94 | 4:3 | 2.04% |
| a4 | 4,194,436 | 4,149,586 | 4:3.96 | 4:4 | 1.01% |

Access objects with different patterns:
• a0: all read accesses, forward
• a1: 3/4 read and 1/4 write accesses, forward
• a2: 2/4 read and 2/4 write accesses, forward
• a3: 1/4 read and 3/4 write accesses, backward
• a4: all write accesses, backward
Size 256MB, access step 64B, requests: 4M
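The Error column can be reproduced from the Read and Write counts under one assumption that the slides do not state: the measured ratio is normalized to 4 reads, rounded to two decimals, and the error is taken relative to that measured value. The function `ratio_error` is a hypothetical helper written for this check.

```python
def ratio_error(reads, writes, expected):
    """Return (measured writes per 4 reads, relative error vs. expected).

    Assumed formula (inferred from the table, not given in the slides):
    error = |expected - measured| / measured, with measured rounded to 2 decimals.
    """
    measured = round(4 * writes / reads, 2)
    if measured == expected:
        return measured, 0.0
    if measured == 0:
        return measured, float("inf")   # expected writes but none were observed
    return measured, abs(expected - measured) / measured


# Rows a3 and a4 of the table above.
for name, reads, writes, expected in [
    ("a3", 4_194_303, 3_087_379, 3),
    ("a4", 4_194_436, 4_149_586, 4),
]:
    measured, error = ratio_error(reads, writes, expected)
    print(name, f"4:{measured}", f"{error * 100:.2f}%")

# 256 MB walked in 64 B steps gives the 4M requests quoted above.
print((256 * 1024 * 1024) // 64)
```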
![Page 26: A Lightweight Hybrid Hardware/Software Approach for Object-Relative Memory Profiling Licheng Chen, Zehan Cui, Yungang Bao, Mingyu Chen, Yongbing Huang,](https://reader030.vdocuments.net/reader030/viewer/2022032605/56649e7b5503460f94b7bb64/html5/thumbnails/26.jpg)
HMTT Configuration Space
• A reserved physical memory region
• Can be accessed from both source code and binary code