capo - university of illinois urbana-champaign
Post on 03-Jan-2022
4 Views
Preview:
TRANSCRIPT
Capo: A Software-Hardware Interface for Practical
Deterministic Multiprocessor Replay
Department of Computer ScienceUniversity of Illinois at Urbana-Champaign
Pablo Montesinos, Matthew Hicks, Samuel T. King and Josep Torrellas
Pablo Montesinos Capo: Practical Deterministic Replay of Multiprocessors
Motivation: Time Travel
Allows us to visit and recreate past states and events in computer
Wide range of uses:
Debugging
Security
Enabled by using Deterministic Replay of Execution
2
Pablo Montesinos Capo: Practical Deterministic Replay of Multiprocessors
Phase I: Initial Execution (a.k.a Recording)
Execute and record certain non-deterministic events into log
Sources of non-determinism: interrupts, memory access interleaving ...
Phase II: Replay
Restore to a previous checkpoint
Re-execute and use log to force software down the same execution path
How Deterministic Replay Works3
Pablo Montesinos Capo: Practical Deterministic Replay of Multiprocessors
SW-Based Deterministic Replay
Flexible, integrate well with rest of SW stack
Very slow or non-applicable to multiprocessor execution:
Software is slow at capturing memory access interleaving
4
Library
Compiler
Virtual Machine
Operating System
Virtual Machine Monitor
SW-based schemes
Pablo Montesinos Capo: Practical Deterministic Replay of Multiprocessors
HW-Based Deterministic Replay of Multiprocessors
HW can record interleaving of shared-memory accesses effectively:
Small Memory Access Interleaving Log
Little overhead
Limitation: integration with SW stack is poor
5
Library
Compiler
Virtual Machine
Operating System
Virtual Machine Monitor
Hardware-based schemes
SW-based schemes
Pablo Montesinos Capo: Practical Deterministic Replay of Multiprocessors
Limitations of HW-Based Replay of Multiprocessors
Past proposals focused only on HW primitive for recording and replaying
How does it integrate with the SW stack?
Cannot separate SW being recorded/replayed from the rest
Paradox: where does the SW that manages the logs go?
Require complex VMM or simulator to replay execution
Can’t mix recording, replay and normal execution simultaneously in the machine
6
We must adapt HW-based replay systems and carefully integrate them with SW in order to make
HW-based replay practical
Pablo Montesinos Capo: Practical Deterministic Replay of Multiprocessors
Capo Contributions
SW-HW interface for practical HW-assisted deterministic replay
Works with any HW-based replay system
Replay Sphere: new abstraction
Isolates SW that is being recorded (replayed) from the rest
Separates the responsibilities of the HW and the SW components
CapoOne: Linux-based prototype
7
Pablo Montesinos Capo: Practical Deterministic Replay of Multiprocessors
Replay Sphere: Set of threads recorded andreplayed as a unit and their address space
Only user-mode threads run inside spheres
Threads inside a sphere: R-threads
Replay spheres and processes:
R-threads that share memory must run within same sphere
Many processes can run within the same sphere
Replay Sphere 2
Replaying
Replay Sphere: Isolating Processes8
Replay Sphere 1
Recording
OS
Replay Sphere 1
Recording
Thread103
Thread128
Thread26
R-thread1
R-thread2
R-thread 1
Thread39
R-thread3
CPU1
CPU2
CPU3
CPU4
Replay HW
Replay Sphere Manager
CPU1
CPU3
CPU2
CPU4
Pablo Montesinos Capo: Practical Deterministic Replay of Multiprocessors
Replay Sphere: Separating Responsibilities
HW:
Records memory access interleaving of R-threads running within same sphere
Produces per-sphere Memory Access Interleaving Log
Enforces same memory access interleaving during replay
SW (Replay Sphere Manager):
Logs the other sources of non-determinism that affect the sphere
Produces per-sphere Input Log
Includes system call return values, signals, data copied into the sphere...
Injects data from log into sphere during replay
9
Pablo Montesinos Capo: Practical Deterministic Replay of Multiprocessors
Other Replay Sphere Manager Responsibilities
Assign the same virtual memory addresses during recording/replay
Assign the same IDs to R-threads during recording/replay
Manage Memory Access Interleaving Log and Input Log
Manage replay HW resources
10
Pablo Montesinos Capo: Practical Deterministic Replay of Multiprocessors
Capo’s HW Interface
Works with any HW-based replay system
Per-processor R-Thread Control Block:
Sphere ID register
R-Thread ID register
Per-sphere Replay Sphere Control Block:
Mode register: specifies whether the sphere is recording or replaying
Log pointers: insert to / remove from Memory Access Interleaving Log
11
Pablo Montesinos Capo: Practical Deterministic Replay of Multiprocessors
Virtualizing the Replay HW
Replay sphere manager schedules spheres into hardware contexts
12
Replay Sphere Manager
Sphere 1 Recording
Sphere 2Replaying
Sphere 3Recording
(ready)
Replay Sphere Control Block
Replay SphereControl Block
HW
SWLog 1 Log 2
Log 3
Replay SphereControl Block
Replay Sphere Control Block
ModeLog Pointers
ModeLog Pointers
(running) (running)
Pablo Montesinos Capo: Practical Deterministic Replay of Multiprocessors
Three Key Challenges
Ensuring deterministic interleaving when OS copies data into a sphere
Using fewer processors during replay than were used during recording
Emulating vs. re-executing system calls
13
3
2
1
Pablo Montesinos Capo: Practical Deterministic Replay of Multiprocessors
OS Copies Data into Spheres
Problem: interleaving between OS copies and R-threads not recorded
Solution: insert copy_to_user into sphere:
HW can log memory access interleaving
copy_to_user exits sphere once copy is over
14
OS
Replay Sphere 1 - Recording
B
Y
E
\0
Replay Sphere Manager
R-thread 1 R-thread 2
copy_to_user
H
I
!
\0
X = buf[2]buf[3] = Y
read(&buf)
Log 1Log 1
buf
Pablo Montesinos Capo: Practical Deterministic Replay of Multiprocessors
Replaying with a Lower Processor Count
Problem: R-thread that should replay next log entry not scheduled in CPU
Solution 1: HW detects problem and raises interrupt
Efficient, but it requires additional HW and SW support
Solution 2: SW inspects Interleaving Log and tries to prevent problem
Not trivial, requires changes to OS scheduler
Solution 3: Do nothing, simply wait for OS to schedule R-thread
Simple, but can hurt performance
15
Pablo Montesinos Capo: Practical Deterministic Replay of Multiprocessors
DeLorean HW Replay System
Ubuntu Linux
CapoOne: First Capo Implementation
Simulated replay HW:
DeLorean HW system [Montesinos ISCA’08]
Augmented with Capo’s HW interface
Modified 2.6.24 Linux kernel
Supports replay spheres, R-threads
New, deterministic copy_to_user
Split Replay Sphere Manager:
User-level component based on ptrace
Kernel-level component schedules spheres and R-threads
16
Replay Sphere 1
Recording
R-thread1
R-thread2
Replay Sphere Manager
Log 1
ptrace
new_copy_to_user
ReplaySphere
Manager
Replay Sphere Control Blocks R-thread Control Blocks
Pablo Montesinos Capo: Practical Deterministic Replay of Multiprocessors
Also in the Paper
CapoOne implementation details
Lessons learned during CapoOne’s development
Emulating vs. Re-Executing System calls
Using Capo with different HW-Based replay systems
17
Pablo Montesinos Capo: Practical Deterministic Replay of Multiprocessors
CapoOne Evaluation Setup
Two HW configurations
Simulated DeLorean HW replay system (SIMICS): 4 x86 processors
Real hardware: 4-Core x86 Intel processor without DeLorean HW
SW: Ubuntu 7.10 with Replay Sphere Manager
Modified 2.6.24 Kernel
Benchmarks:
Scientific Benchmarks: SPLASH-2
System benchmarks: Apache, Compilation
18
Pablo Montesinos Capo: Practical Deterministic Replay of Multiprocessors
Overall Log Size
Memory Access Interleaving Log takes most of the space
Small overall log: 3.17 bits/kilo-instruction
19
0
1
2
3
4
SPLASH2-avg Apache-avg Compilation-avg
Log
Siz
e (b
its/k
ilo-in
stru
ctio
n)
Memory Access Interleaving LogInput Log
Pablo Montesinos Capo: Practical Deterministic Replay of Multiprocessors
Recording Performance
Moderate overhead: 21% for SPLASH2 and 41% average for system apps
Minimal timing distortion for debugging concurrency defects
20
0
1
2
SPLASH2-avg Apache-avg Compilation-avg
Nor
mal
ized
Exe
cutio
n Ti
me
Standard executionptrace’s interposition overheadRest of Replay Sphere Manager
Pablo Montesinos Capo: Practical Deterministic Replay of Multiprocessors
Replay Performance: SPLASH-2
Emulating system calls reduces cycles during replay
Replay takes only 80% more cycles
R-Threads must wait for their turn to commit
21
0
1
2
Record Replay
Nor
mal
ized
Cyc
les Execution
Stall
Pablo Montesinos Capo: Practical Deterministic Replay of Multiprocessors
Conclusions
Capo enables practical replay of execution for systems with replay HW
The Replay Sphere is a powerful abstraction
Enable mixing recording, replay and standard execution
CapoOne: first Capo prototype
Working system
Good performance (21-41% recording overhead, 80% replay overhead)
Good for debugging concurrency defects
22
Capo: A Software-Hardware Interface for Practical
Deterministic Multiprocessor Replay
Department of Computer ScienceUniversity of Illinois at Urbana-Champaign
Pablo Montesinos, Matthew Hicks, Samuel T. King and Josep Torrellas
Pablo Montesinos Capo: Practical Deterministic Replay of Multiprocessors
CapoOne: Basic HW Operation24
Proc 0 Commit Request Commit Request
7 3
Ok Ok
CS LogA Size R-threadID = 7 R-threadID = 3
Chunk A
Proc 1
CS LogB Size
Chunk BArbiter
Two new per-processor registers: RSID, R-threadID
Arbiter now supports concurrent spheres
Manages an Interleaving Log for each of them
RSIDR-threadID
0
7
RSID
R-threadID
1
3
InterleavingLogs
Pablo Montesinos Capo: Practical Deterministic Replay of Multiprocessors
Today’s Agenda: Towards the Perfect Replay System
DeLorean: New hardware replay engine
Very efficient multiprocessor support
Vastly improved log requirements
Capo: New SW-HW interface for replay
Makes HW-based replay systems practical
Evaluation
Future work
25
Pablo Montesinos Capo: Practical Deterministic Replay of Multiprocessors
DMA
DMA Log
Network
Overall DeLorean System
Proc 0
PI Log
CS Log
Chunk A Arbiter
Interrupt Log
I/O Log
Proc 1
CS Log
Chunk B
Interrupt Log
I/O Log
26
Interrupt, I/O and DMA logs are common to other HW-based schemes
Pablo Montesinos Capo: Practical Deterministic Replay of Multiprocessors
CapoOne: HW Implementation
No need for DeLorean’s Interrupt Log, DMA Log nor I/O Log
PI Log becomes the per-sphere Interleaving Log
CS Log becomes a per-R-thread Log
Chunks only have instructions from one application (or the kernel)
27
DMA
NetworkProc 0
PI Log
CS Log
Chunk A Arbiter
Proc 1
CS Log
Chunk B
DMA Log
Interrupt Log
I/O Log
Interrupt Log
I/O Log
Interleaving Log
Pablo Montesinos Capo: Practical Deterministic Replay of Multiprocessors
Replay Sphere 1
Replaying
Replay Sphere 1
Replaying
Emulating vs Re-Executing System Calls
During replay, the RSM emulates most system calls:
RSM injects return values from Sphere Input Log, squashes outputs
Some have to be re-executed
Thread management (clone)
Address space modification (mprotect)
28
R-thread1
R-thread2
read()Log 1 fork()
OS
RSM
sys_fork
thread674
R-thread3
Pablo Montesinos Capo: Practical Deterministic Replay of Multiprocessors
Implicit Dependencies
R-thread changes mapping or protection of address space, and another R-thread uses this changed address space
RSM can express these dependencies to hardware so these interactions can be recorded
29
Page Table
Sphere 1
R-thread 1
mprotect X
R-thread 2
while(1){ *x = *x+1}
CPU 1 CPU 2
top related