University of Colorado at Boulder
Core Research Lab
FastForward for Efficient Pipeline Parallelism: A Cache-Optimized Concurrent Lock-Free Queue
John Giacomoni, Tipp Moseley, and Manish Vachharajani
University of Colorado at Boulder
2008.02.21
Why Pipelines?
• Multicore systems are the future
• Many apps can be pipelined if the granularity is fine enough
– Granularity of roughly 1 µs or less
– Roughly 3.5× an interrupt handler
Fine-Grain Pipelining Examples
• Network processing:
– Intrusion detection (NID)
– Traffic filtering (e.g., P2P filtering)
– Traffic shaping (e.g., packet prioritization)
Network Processing Scenarios
Link      Rate (Mbps)  Frames/s (fps)  ns/frame
T-1               1.5           2,941   340,000
T-3              45.0          90,909    11,000
OC-3            155.0         333,333     3,000
OC-12           622.0       1,219,512       820
GigE          1,000.0       1,488,095       672
OC-48         2,500.0       5,000,000       200
10 GigE      10,000.0      14,925,373        67
OC-192        9,500.0      19,697,843        51
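The GigE row is consistent with minimum-size Ethernet frames: 64 bytes of frame plus 8 bytes of preamble and 12 bytes of inter-frame gap, or 672 bit times on the wire. The slides do not state this framing, so treat it as an assumption. A minimal sketch of the arithmetic, with hypothetical helper names:

```c
#include <stdint.h>

/* Assumed framing: 64-byte minimum frame + 8-byte preamble
 * + 12-byte inter-frame gap = 84 bytes occupied on the wire. */
enum { ETH_MIN_WIRE_BYTES = 64 + 8 + 12 };

/* Time budget per frame, in whole nanoseconds (truncated). */
static uint64_t ns_per_frame(uint64_t link_bps, uint64_t wire_bytes) {
    return wire_bytes * 8u * 1000000000ull / link_bps;
}

/* Maximum frame rate, in whole frames per second (truncated). */
static uint64_t frames_per_second(uint64_t link_bps, uint64_t wire_bytes) {
    return link_bps / (wire_bytes * 8u);
}
```

Under these assumptions, `ns_per_frame(1000000000, ETH_MIN_WIRE_BYTES)` gives 672 and `frames_per_second(1000000000, ETH_MIN_WIRE_BYTES)` gives 1,488,095, which matches the GigE row. Every stage of a pipeline that keeps up with the link has to finish within that per-frame budget.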
Core-Placements
4x4 NUMA Organization (ex: AMD Opteron Barcelona)
[Figure: example placements of pipeline stages on cores; stage labels: IP, APP, OP, Dec, Enc]
Example 3 Stage Pipeline
Communication Overhead
Locks 320 ns
Lamport 160 ns
Hardware 10 ns
FastForward 28 ns
[Figure: communication overhead per operation; reference label: GigE]
More Fine-Grain Pipelining Examples
• Network processing:
– Intrusion detection (NID)
– Traffic filtering (e.g., P2P filtering)
– Traffic shaping (e.g., packet prioritization)
• Signal Processing
– Media transcoding/encoding/decoding
– Software Defined Radios
• Encryption
– Counter-Mode AES
• Other Domains
– Fine-grain kernels extracted from sequential applications
FastForward
• Cache-optimized point-to-point CLF (concurrent lock-free) queue
1. Fast
2. Robust against unbalanced stages
3. Hides die-die communication
4. Works with strong to weak memory consistency models
Lamport’s CLF Queue (1)
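lamp_enqueue(data) {
    NH = NEXT(head);
    while (NH == tail) {}      /* spin while the queue is full */
    buf[head] = data;
    head = NH;                 /* publish the new entry */
}
lamp_dequeue(*data) {
    while (head == tail) {}    /* spin while the queue is empty */
    *data = buf[tail];
    tail = NEXT(tail);         /* release the slot */
}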
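The pseudocode above can be sketched in C11 roughly as follows. This is a sketch under stated assumptions rather than the authors' implementation: it assumes one producer and one consumer, uses C11 atomics for head and tail so the ordering is explicit (the slide's plain variables rely on a strong memory model), and returns false instead of spinning. The names lamp_queue, LAMP_SIZE, lamp_try_enqueue, and lamp_try_dequeue are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define LAMP_SIZE 8                        /* hypothetical capacity */
#define LAMP_NEXT(i) (((i) + 1) % LAMP_SIZE)

typedef struct {
    void *buf[LAMP_SIZE];
    atomic_size_t head;   /* written by the producer, read by both sides */
    atomic_size_t tail;   /* written by the consumer, read by both sides */
} lamp_queue;

static void lamp_init(lamp_queue *q) {
    atomic_init(&q->head, 0);
    atomic_init(&q->tail, 0);
}

/* Producer side: returns false when the queue is full. */
static bool lamp_try_enqueue(lamp_queue *q, void *data) {
    size_t head = atomic_load_explicit(&q->head, memory_order_relaxed);
    size_t nh = LAMP_NEXT(head);
    if (nh == atomic_load_explicit(&q->tail, memory_order_acquire))
        return false;                      /* full: one slot always stays unused */
    q->buf[head] = data;
    atomic_store_explicit(&q->head, nh, memory_order_release);
    return true;
}

/* Consumer side: returns false when the queue is empty. */
static bool lamp_try_dequeue(lamp_queue *q, void **data) {
    size_t tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
    if (tail == atomic_load_explicit(&q->head, memory_order_acquire))
        return false;                      /* empty */
    *data = q->buf[tail];
    atomic_store_explicit(&q->tail, LAMP_NEXT(tail), memory_order_release);
    return true;
}
```

Note that both operations read head and tail. Each side therefore reads a variable that the other side keeps writing, which is the shared state the following slides examine.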
AMD Opteron Cache Example
[Figure: cache diagram; visible label: M]
Lamport’s CLF Queue (2)
[Figure: lamp_enqueue code next to the memory layout; head and tail side by side, followed by buf[0] … buf[n]]
Observe the mandatory cacheline ping-ponging for each enqueue and dequeue operation.
Lamport’s CLF Queue (3)
[Figure: lamp_enqueue code next to the memory layout; head and tail shown apart, with buf[0] … buf[n]]
Observe how cachelines will still ping-pong.
What if the head/tail comparison were eliminated?
FastForward CLF Queue (1)
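lamp_enqueue(data) {
    NH = NEXT(head);
    while (NH == tail) {}      /* compares head against tail */
    buf[head] = data;
    head = NH;
}
ff_enqueue(data) {
    while (0 != buf[head]) {}  /* spin until the slot itself is empty (0) */
    buf[head] = data;
    head = NEXT(head);         /* tail is never read */
}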
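The slide shows only ff_enqueue, as pseudocode. A minimal C11 sketch of the same idea follows, assuming a single producer and a single consumer, NULL (0) as the empty-slot marker, and C11 atomics on each slot so the ordering is explicit. The names ff_queue, FF_SIZE, ff_try_enqueue, and ff_try_dequeue are hypothetical; the dequeue side is an assumed mirror image of the enqueue shown on the slide; and both functions return false instead of spinning.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define FF_SIZE 8                          /* hypothetical capacity */
#define FF_NEXT(i) (((i) + 1) % FF_SIZE)

typedef struct {
    _Atomic(void *) buf[FF_SIZE];          /* NULL marks an empty slot */
    size_t head;                           /* private to the producer */
    size_t tail;                           /* private to the consumer */
} ff_queue;

static void ff_init(ff_queue *q) {
    for (size_t i = 0; i < FF_SIZE; i++)
        atomic_init(&q->buf[i], NULL);
    q->head = 0;
    q->tail = 0;
}

/* Producer side: full/empty is decided by the slot, not by head vs. tail. */
static bool ff_try_enqueue(ff_queue *q, void *data) {
    if (data == NULL)
        return false;                      /* NULL is reserved as the empty marker */
    if (atomic_load_explicit(&q->buf[q->head], memory_order_acquire) != NULL)
        return false;                      /* slot still occupied: queue is full */
    atomic_store_explicit(&q->buf[q->head], data, memory_order_release);
    q->head = FF_NEXT(q->head);
    return true;
}

/* Consumer side (assumed): take the value, then mark the slot empty again. */
static bool ff_try_dequeue(ff_queue *q, void **data) {
    void *v = atomic_load_explicit(&q->buf[q->tail], memory_order_acquire);
    if (v == NULL)
        return false;                      /* slot empty: nothing to dequeue */
    *data = v;
    atomic_store_explicit(&q->buf[q->tail], NULL, memory_order_release);
    q->tail = FF_NEXT(q->tail);
    return true;
}
```

In this sketch head is touched only by the producer and tail only by the consumer, so neither index needs to be shared. The cost is that one value (here NULL) cannot be stored in the queue.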
FastForward CLF Queue (2)
[Figure: ff_enqueue code next to the memory layout; head, tail, and buf[0] … buf[n]]
Observe how head/tail cachelines will NOT ping-pong.
BUT, “buf” will still cause the cachelines to ping-pong.
FastForward CLF Queue (3)
[Figure: ff_enqueue code next to the memory layout; head, tail, and buf[0] … buf[n]]
Solution: Temporally slip stages by a cacheline.
N:1 reduction in coherence misses per stage.
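One way to read the N:1 figure is that N is the number of queue entries that fit in one cacheline: if producer and consumer stay at least a cacheline apart, a line changes hands once per N entries instead of once per entry. The sketch below works through that arithmetic. It assumes 64-byte cachelines and a cacheline-aligned buf, neither of which is stated on the slide, and the helper names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

#define CACHELINE_BYTES 64   /* assumed line size */

/* N: how many queue entries share one cacheline. */
static size_t entries_per_cacheline(size_t entry_bytes) {
    return CACHELINE_BYTES / entry_bytes;
}

/* True when the producer's and consumer's current slots fall in the
 * same cacheline (buf is assumed to start on a cacheline boundary). */
static bool share_cacheline(size_t head, size_t tail, size_t entry_bytes) {
    size_t n = entries_per_cacheline(entry_bytes);
    return head / n == tail / n;
}
```

With 8-byte pointer entries this gives N = 8, so slots 0 through 7 share a line while slots 7 and 8 do not. Keeping head and tail in different lines is what the slip is meant to achieve.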
Slip Timing
Slip Timing Lost
Maintaining Slip (Concepts)
• Use distance as the quality metric
– Explicitly compare head/tail
– Causes cache ping-ponging
– Perform rarely
Maintaining Slip (Method)
adjust_slip() {
    dist = distance(producer, consumer);
    if (dist < DANGER) {                              /* slip is nearly lost */
        dist_old = 0;
        do {
            dist_old = dist;
            spin_wait(avg_stage_time * (OK - dist));  /* wait for the missing entries */
            dist = distance(producer, consumer);
        } while (dist < OK && dist > dist_old);       /* stop at OK or when no progress */
    }
}
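A compilable sketch of the method above, under several assumptions not stated on the slide: the producer and consumer indices are published as C11 atomics, distance is measured modulo the queue size, spin_wait counts loop iterations rather than time, and QUEUE_SIZE, DANGER, and OK have the illustrative values shown. The return value is added only so the result can be inspected.

```c
#include <stdatomic.h>
#include <stddef.h>

#define QUEUE_SIZE 64   /* hypothetical ring size */
#define DANGER 8        /* hypothetical: one cacheline of 8-byte entries */
#define OK 16           /* hypothetical: two cachelines */

/* Entries between producer and consumer, accounting for wrap-around. */
static size_t distance(size_t producer, size_t consumer) {
    return (producer + QUEUE_SIZE - consumer) % QUEUE_SIZE;
}

/* Busy-wait for a number of loop iterations (a stand-in for a timed wait). */
static void spin_wait(unsigned long iterations) {
    for (volatile unsigned long i = 0; i < iterations; i++) { }
}

/* Returns the last observed distance. */
static size_t adjust_slip(atomic_size_t *producer, atomic_size_t *consumer,
                          unsigned long avg_stage_iters) {
    size_t dist = distance(atomic_load(producer), atomic_load(consumer));
    if (dist < DANGER) {
        size_t dist_old;
        do {
            dist_old = dist;
            /* Wait roughly as long as it takes to produce the missing entries. */
            spin_wait(avg_stage_iters * (OK - dist));
            dist = distance(atomic_load(producer), atomic_load(consumer));
        } while (dist < OK && dist > dist_old);  /* stop at OK or when the gap stops growing */
    }
    return dist;
}
```

The loop exits either when the gap reaches OK or when waiting no longer widens it, so a stalled producer cannot trap the caller. Reading both indices is the explicit head/tail comparison that the concepts slide says to perform rarely.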
Comparative Performance
[Figure panels: Lamport; FastForward]
Thrashing and Auto-Balancing
[Figure panels: FastForward (Thrashing); FastForward (Balanced)]
Cache Verification
[Figure panels: FastForward (Thrashing); FastForward (Balanced)]
On/Off Die Communications
[Figure: on-die communication and off-die communication; visible label: M]
On/Off-die Performance
[Figure panels: FastForward (On-Die); FastForward (Off-Die)]
Proven Property
• “In the program order of the consumer, the consumer dequeues values in the same order that they were enqueued in the producer's program order.”
Work in Progress
• Operating Systems
– 27.5 ns/op
    • 3.1 % cost reduction vs. reported 28.5 ns
– Reduced jitter
• Applications
– 128-bit AES encrypting filter
    • Ethernet layer encryption at 1.45 Mfps
    • IP layer encryption at 1.51 Mfps
    • ~10 lines of code for each
Gazing into the Crystal Ball
Locks 320 ns
Lamport 160 ns
Hardware 10 ns
FastForward 28 ns
[Figure: communication overhead per operation; reference label: GigE]
Shared Memory Accelerated Queues: Now Available!
http://ce.colorado.edu/core