
Page 1:

A Lock-Free, Cache-Efficient Multi-Core Synchronization Mechanism for Line-Rate Network Traffic Monitoring

Patrick P. C. Lee¹, Tian Bu², Girish Chandranmenon²

¹The Chinese University of Hong Kong  ²Bell Labs, Alcatel-Lucent

April 2010

Page 2:

Outline

Motivation

MCRingBuffer, a multi-core ring buffer

Parallel network monitoring prototype

Conclusions

Page 3:

Network Traffic Monitoring

Monitoring data streams in today’s networks is essential for network management:

Accounting, resource provisioning, failure diagnosis, intrusion detection/prevention

Goal: achieve line-rate monitoring. Monitoring speed must keep up with link bandwidth (i.e., prepare for the worst)

Challenges: Data volume keeps increasing (e.g., to Gigabit scales), so single-CPU systems may no longer support line-rate monitoring

Page 4:

Can Multi-Core Help?

Can multi-core architectures help line-rate monitoring? Parallelize packet processing

The answer should be “yes”… yet exploiting the full potential of multi-core is still challenging

Inter-core communication has overhead:
Upper layer: protocol messages
Lower layer: thread synchronization in shared data structures

[Figure: single-core case (raw packets fed to one core) vs. multi-core case (raw packets spread across a quad-core CPU)]

Page 5:

Can Multi-Core Help?

Multi-core helps only if we minimize inter-core communication overhead

Let’s focus on minimizing thread synchronization, which benefits a broad class of multi-threaded network monitoring applications

Page 6:

Our Contribution

Why lock-free? Allows concurrent thread accesses

Why cache-efficient? Saves expensive memory accesses

We embed the mechanism in MCRingBuffer, a lock-free, cache-efficient shared ring buffer tailored for multi-core architectures

Design a lock-free, cache-efficient multi-core synchronization mechanism for high-speed network traffic monitoring

Page 7:

Producer/Consumer Problem

Classical OS problem

Ring buffer: bounded buffer with fixed number of slots

Thread synchronization:
Producer inserts elements when buffer is not full
Consumer extracts elements when buffer is not empty
First-in-first-out (FIFO): elements are extracted in the same order they were inserted

[Figure: producer inserts elements into a ring buffer; consumer extracts them]

Page 8:

Producer/Consumer Problem

Ring buffer in multi-core context:

[Figure: producer and consumer threads run on two cores of a CPU, each with its own L1 cache and a shared L2 cache; the control variables and the ring buffer reside in memory, reached over the system bus]

Thread synchronization operates on control variables. Make the operations as cache-friendly as possible.

Page 9:

Lamport’s Lock-Free Ring Buffer

Operates on two control variables, read and write, which point to the next read and write slots respectively

[Figure: ring buffer slots 0 to N−1, with read and write pointers]

Insert(T element)
1: wait until NEXT(write) != read
2: buffer[write] = element
3: write = NEXT(write)

Extract(T* element)
1: wait until read != write
2: *element = buffer[read]
3: read = NEXT(read)

[Lamport, Comm. of ACM, 1977]

NEXT(x) = (x + 1) % N
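Lamport’s algorithm above can be rendered as compilable C. This is a minimal single-threaded sketch for illustration only: the slide’s blocking “wait until” loops are replaced with non-blocking failure returns, the function names are ours, and (as the assumptions slide notes) correctness under concurrency relies on atomic int accesses and sequential consistency.

```c
#define N 8                      /* ring buffer size; one slot stays unused */
#define NEXT(x) (((x) + 1) % N)

static int buffer[N];
static volatile int read_idx = 0, write_idx = 0;  /* control variables */

/* Producer: insert an element; returns 0 if full (the slide spins instead). */
int lamport_insert(int element) {
    if (NEXT(write_idx) == read_idx) return 0;    /* full */
    buffer[write_idx] = element;
    write_idx = NEXT(write_idx);
    return 1;
}

/* Consumer: extract an element; returns 0 if empty. */
int lamport_extract(int *element) {
    if (read_idx == write_idx) return 0;          /* empty */
    *element = buffer[read_idx];
    read_idx = NEXT(read_idx);
    return 1;
}
```

Note that because “full” is detected by NEXT(write) == read, at most N−1 elements can be buffered at once.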

Page 10:

Previous Work

FastForward [Giacomoni et al., PPoPP, 2008]:
Couples data/control operations
Needs a special NULL data element defined by applications

Hardware-primitive ring buffers:
Support multiple producers/multiple consumers
Use hardware synchronization primitives (e.g., compare-and-swap)
Hardware primitives are expensive in general

Page 11:

MCRingBuffer Overview

Goal: use Lamport’s ring buffer as a building block to further minimize cost of thread synchronization

Properties:
Lock-free: allows concurrent accesses by producer and consumer
Cache-efficient: improves cache locality of synchronization
Generic: no assumptions on data types or insert/extract patterns
Deployable: works on general-purpose multi-core CPUs

Components:
Cache-line protection
Batch updates of control variables

Page 12:

MCRingBuffer Assumptions

Assumptions inherited from Lamport’s ring buffer:
Single producer/single consumer
Reads/writes of read and write are atomic
Memory accesses follow sequential consistency

Page 13:

Cache-line Protection

A cache operates in units of cache lines

False sharing occurs when two threads access different variables on the same cache line

The cache line is invalidated when a thread modifies a variable, and reloaded from memory when a thread reads a different variable on that line, even if that variable is unchanged

[Figure: read, write, and N share one cache line]

read/write are modified frequently for thread synchronization, so N (the ring buffer size) is reloaded from memory even though it is a constant

Page 14:

Cache-line Protection

Add padding bytes to avoid false sharing

[Figure: read/write plus cachePad1 fill one cache line; N plus cachePad2 fill the next]

int read
int write
char cachePad1[CL - 2*sizeof(int)]
int N
char cachePad2[CL - sizeof(int)]

CL = cache line size
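The padded layout above can be written as a C struct. This is a sketch assuming a 64-byte cache line (on a real machine, CL should be queried or measured); the C11 static assertions check that the constant N lands on its own cache line, away from the frequently modified read/write.

```c
#include <stddef.h>

#define CL 64  /* assumed cache-line size in bytes */

struct RingControl {
    /* frequently modified control variables: first cache line */
    int read;
    int write;
    char cachePad1[CL - 2 * sizeof(int)];
    /* constants: second cache line, never invalidated by read/write updates */
    int N;
    char cachePad2[CL - sizeof(int)];
};

/* Verify the intended layout at compile time. */
_Static_assert(offsetof(struct RingControl, N) == CL,
               "N must start on its own cache line");
_Static_assert(sizeof(struct RingControl) == 2 * CL,
               "struct must span exactly two cache lines");
```

With this layout, a write to read or write invalidates only the first line in the other core’s cache; the line holding N stays valid.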

Page 15:

Cache-line Protection

Use cache-line protection to minimize memory accesses

[Figure: four padded cache lines — shared variables (read, write); consumer’s local variables (localWrite, nextRead); producer’s local variables (localRead, nextWrite); constants (N)]

Shared variables are the main controls of synchronization

Use local variables to “guess” shared variables

Goal: minimize freq. of reading shared control variables
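The “guessing” idea can be sketched for the producer side in C (the consumer side is symmetric, using localWrite). This is our own illustrative code, not the paper’s listing: the producer trusts its private localRead and touches the shared read (a likely cache miss) only when the guess says the buffer is full.

```c
#define N 8
#define NEXT(x) (((x) + 1) % N)

static int buffer[N];
static volatile int read_s = 0, write_s = 0;  /* shared control variables */
static int localRead = 0;                     /* producer-private guess of read */

int insert_guess(int element) {
    if (NEXT(write_s) == localRead) {
        /* Guess says full: refresh the guess from the shared variable
           (this is the only access that may miss in cache). */
        localRead = read_s;
        if (NEXT(write_s) == localRead) return 0;   /* really full */
    }
    buffer[write_s] = element;
    write_s = NEXT(write_s);
    return 1;
}
```

The guess is safe because read only ever advances: a stale localRead can only under-estimate the free space, never over-estimate it.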

Page 16:

Batch Updates of Control Variables

Intuition: nextRead/nextWrite are the positions to read/write next; update the shared read/write only after batchSize reads/writes

Producer:
buffer[nextWrite] = element
nextWrite = NEXT(nextWrite)
wBatch++
if (wBatch >= batchSize) {
  write = nextWrite
  wBatch = 0
}

Consumer:
*element = buffer[nextRead]
nextRead = NEXT(nextRead)
rBatch++
if (rBatch >= batchSize) {
  read = nextRead
  rBatch = 0
}

Goal: minimize freq. of writing shared control variables
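The batched producer above, combined with the local-variable guess, might look like this in C. It is an illustrative sketch with an assumed batchSize of 4; the consumer side mirrors it with nextRead/rBatch.

```c
#define N 8
#define NEXT(x) (((x) + 1) % N)
#define BATCH_SIZE 4   /* assumed batchSize */

static int buffer[N];
static volatile int read_s = 0, write_s = 0;  /* shared control variables */
static int localRead = 0;                     /* producer's guess of read */
static int nextWrite = 0, wBatch = 0;         /* producer-private position/counter */

int insert_batch(int element) {
    if (NEXT(nextWrite) == localRead) {
        localRead = read_s;
        if (NEXT(nextWrite) == localRead) return 0;  /* full */
    }
    buffer[nextWrite] = element;
    nextWrite = NEXT(nextWrite);
    if (++wBatch >= BATCH_SIZE) {
        write_s = nextWrite;  /* publish a whole batch with one shared write */
        wBatch = 0;
    }
    return 1;
}
```

The shared write is modified once per BATCH_SIZE insertions instead of once per insertion, cutting the number of cache-line invalidations seen by the consumer.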

Page 17:

Batch Updates of Control Variables

Limitation: read/write are advanced on a per-batch basis

Elements may not be extracted even though the buffer is not empty

However, if elements are raw packets in high-speed networks, read/write will be updated regularly

Page 18:

Correctness of MCRingBuffer

Correctness based on Lamport’s ring buffer:

Lamport’s:
Insert only if write − read < N
Extract only if read < write

We prove for MCRingBuffer:
Insert only if nextWrite − nextRead < N
Extract only if nextRead < nextWrite

Details in the paper.

Page 19:

Evaluation

Hardware: Intel Xeon 5355 quad-core
Sibling cores: pair of cores sharing an L2 cache
Non-sibling cores: pair of cores not sharing an L2 cache

Ring buffers:
LockRingBuffer: lock-based ring buffer
BasicRingBuffer: Lamport’s ring buffer
MCRingBuffer with batchSize = 1: cache-line protection only
MCRingBuffer with batchSize > 1: cache-line protection + batch control updates

Metrics:
Throughput: number of insert/extract pairs per second
Number of L2 cache misses: number of cache-line reload operations

Page 20:

Experiment 1: Throughput vs. element size

[Figure: throughput vs. element size, for sibling and non-sibling cores]

MCRingBuffer with batchSize > 1 has a higher throughput gain (up to 5x) for smaller element sizes

buffer capacity = 2K elements

Page 21:

Experiment 2: Throughput vs. buffer capacity

[Figure: throughput vs. buffer capacity, for sibling and non-sibling cores]

MCRingBuffer’s throughput is invariant once the buffer capacity is large enough

element size = 128 bytes

Page 22:

Experiment 3

Code profiling from Intel VTune Performance Analyzer

Metric                   BasicRingBuffer   MCRingBuffer (batchSize = 50)
# core cycles            1130M / 1097M     137M / 113M
# retired instructions   358M / 287M       231M / 219M
# L2 cache misses        746K / 808K       102K / 80K

Metric numbers are for 10M inserts/extracts; element size = 8 bytes, capacity = 2K elements

MCRingBuffer improves cache locality

Page 23:

Recap of Evaluation

MCRingBuffer improves throughput in various scenarios:
Different data sizes
Different buffer capacities
Sibling/non-sibling cores

MCRingBuffer achieves its higher throughput gain via:
Careful organization of control variables
Careful accesses to control variables

MCRingBuffer’s gain does not require any special insert/extract patterns

Page 24:

Parallel Traffic Monitoring

Applying MCRingBuffer to parallel traffic monitoring

[Figure: raw packets enter a Dispatcher, which pushes decoded packets through ring buffers to several SubAnalyzers; the SubAnalyzers send state reports to a MainAnalyzer]

Page 25:

Parallel Traffic Monitoring

Dispatch stage:
Decode raw packets
Distribute decoded packets by (srcIP, dstIP)

SubAnalysis stage:
Local analysis on address pairs
e.g., 5-tuple flow stats, vertical portscans

MainAnalysis stage:
Global analysis: aggregate results of all SubAnalyzers
e.g., per-source volume, horizontal portscans

Pipeline: Dispatch → SubAnalysis → MainAnalysis

Evaluation results: MCRingBuffer helps scale up packet-processing throughput (details in the paper)
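The dispatch rule can be sketched in C. The packet struct, hash mix, and SubAnalyzer count below are all hypothetical; the point is only that hashing on (srcIP, dstIP) pins each address pair to one SubAnalyzer, so per-pair state never needs cross-thread locking.

```c
#include <stdint.h>

#define NUM_SUBANALYZERS 3                  /* assumed worker count */

struct Packet { uint32_t srcIP, dstIP; };   /* hypothetical decoded-packet record */

/* Map a packet to a SubAnalyzer index by hashing its address pair. */
unsigned dispatch(const struct Packet *p) {
    uint64_t h = ((uint64_t)p->srcIP << 32) | p->dstIP;
    /* 64-bit finalizer-style mix to spread nearby addresses */
    h ^= h >> 33; h *= 0xff51afd7ed558ccdULL; h ^= h >> 33;
    return (unsigned)(h % NUM_SUBANALYZERS);
}
```

Because the mapping is deterministic, all packets of an address pair flow through the same ring buffer, preserving per-pair FIFO order into the SubAnalyzer.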

Page 26:

Take-away Messages

Proposed a building block for parallel traffic monitoring: a lock-free, cache-efficient synchronization mechanism

Next question: How do we apply MCRingBuffer to different network monitoring problems?