Low latency & Mechanical Sympathy: Issues and solutions
Jean-Philippe BEMPEL @jpbempel
Performance Architect
http://jpbempel.blogspot.com
© ULLINK 2015 – Private & Confidential

Post on 14-Aug-2015

UL BRIDGE: Low latency order router

December 16, 2014 © ULLINK 2014 – Private & Confidential

• pure Java SE application

• FIX protocol / Direct Market Access connectivity (TCP)

• transform to an intermediate form (UL Message)

• process & route orders in less than 100 µs

Low Latency

• Process in under 1 ms

• No (Disk) I/O involved

• Hardware architecture has a major impact on performance

• Memory is the new disk

• Mechanical Sympathy

GC pauses

• Minor GC and Major GC are both Stop-The-World (STW) events

• Major GC: none during trading hours; the Java heap is sized accordingly
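
One way to check the "no major GC during trading hours" objective is to compare the JVM's standard GC MXBean counters at the start and end of a session. The following is a minimal sketch; the class name GcWatch and its methods are hypothetical, and which collector name corresponds to a major GC depends on the JVM and the GC algorithm in use.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: snapshots collection counts per collector.
final class GcWatch {

    // Returns collector name -> number of collections so far.
    // getCollectionCount() may return -1 when the count is undefined.
    static Map<String, Long> collectionCounts() {
        Map<String, Long> counts = new LinkedHashMap<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            counts.put(gc.getName(), gc.getCollectionCount());
        }
        return counts;
    }

    // Returns only the collectors whose count changed between two snapshots.
    static Map<String, Long> delta(Map<String, Long> before, Map<String, Long> after) {
        Map<String, Long> diff = new LinkedHashMap<>();
        for (Map.Entry<String, Long> e : after.entrySet()) {
            long change = e.getValue() - before.getOrDefault(e.getKey(), 0L);
            if (change != 0) {
                diff.put(e.getKey(), change);
            }
        }
        return diff;
    }

    public static void main(String[] args) {
        Map<String, Long> atOpen = collectionCounts();
        // ... the trading session would run here ...
        Map<String, Long> atClose = collectionCounts();
        System.out.println("Collections during session: " + delta(atOpen, atClose));
    }
}
```

If the old-generation collector appears in the delta, the heap was probably not sized generously enough for the session.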

• Minor GC: some tricks reduce both frequency and pause time
  Frequency: increase the Young Gen size
  Minor pause time factors are:
    • Reference traversal
    • Object copy (to survivor or tenured space)
    • Card scanning

• Allocation should be controlled carefully

• Reduce the frequency of minor GC occurrences

• This helps achieve latency objectives at percentiles above the 99th
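
A common way to control allocation on the hot path is to reuse preallocated objects instead of creating new ones per order. The sketch below illustrates the idea only; OrderMessage, MessagePool and their fields are hypothetical names, and the pool assumes a single owning thread.

```java
import java.util.ArrayDeque;

// Hypothetical message type; the fields are illustrative only.
final class OrderMessage {
    long orderId;
    long price;
    int quantity;

    void clear() {
        orderId = 0;
        price = 0;
        quantity = 0;
    }
}

// Single-threaded pool: every instance is allocated up front,
// so acquire/release create no garbage in steady state.
final class MessagePool {
    private final ArrayDeque<OrderMessage> free;

    MessagePool(int capacity) {
        free = new ArrayDeque<>(capacity);
        for (int i = 0; i < capacity; i++) {
            free.push(new OrderMessage());
        }
    }

    // Returns a recycled instance, or null when the pool is exhausted.
    OrderMessage acquire() {
        return free.poll();
    }

    // Resets the instance and makes it available again.
    void release(OrderMessage message) {
        message.clear();
        free.push(message);
    }

    int available() {
        return free.size();
    }

    public static void main(String[] args) {
        MessagePool pool = new MessagePool(1024);
        OrderMessage m = pool.acquire();
        m.orderId = 1;
        // ... process and route the order ...
        pool.release(m);
        System.out.println("free instances: " + pool.available());
    }
}
```

Less allocation means the Young Gen fills more slowly, which lowers minor GC frequency without touching the collector settings.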

Why not the CMS algorithm?

• Promotion allocation (free list) increases minor pause time

• Major GC fallback is unpredictable (fragmentation, promotion failure, …)

Azul?

• We are partners

• Best GC algorithm in the world

• Our tests show a maximum pause time of 2 ms

• Execution performance not as good as HotSpot but ...

Power Management

• CPUs embed technologies to reduce power usage

• Frequency scaling during execution (P-states)
  P0 = nominal frequency

• Deep sleep modes during idle phases (C-states)
  C0 = running, C1 = idle

• Wake-up latency from an idle state (reported in microseconds):
  cat /sys/devices/system/cpu/cpu0/cpuidle/state3/latency
  200

• Few orders per day, so the CPU is idle most of the time
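
The same sysfs files can be read from Java to list every idle state and its wake-up cost. This is a minimal sketch assuming the Linux cpuidle layout (one stateN directory per idle state, each with a name and a latency file); the class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;

final class CpuIdleLatency {

    // Reads <cpuidleDir>/stateN/name and <cpuidleDir>/stateN/latency.
    // Returns idle state name -> exit latency in microseconds.
    static Map<String, Integer> exitLatencies(Path cpuidleDir) {
        Map<String, Integer> result = new TreeMap<>();
        try (DirectoryStream<Path> states = Files.newDirectoryStream(cpuidleDir, "state*")) {
            for (Path state : states) {
                String name = read(state.resolve("name"));
                result.put(name, Integer.parseInt(read(state.resolve("latency"))));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }

    private static String read(Path file) throws IOException {
        return new String(Files.readAllBytes(file), StandardCharsets.US_ASCII).trim();
    }

    public static void main(String[] args) {
        Path dir = Paths.get("/sys/devices/system/cpu/cpu0/cpuidle");
        if (!Files.isDirectory(dir)) {
            System.out.println("cpuidle information not available on this system");
            return;
        }
        exitLatencies(dir).forEach((name, us) -> System.out.println(name + ": " + us + " us"));
    }
}
```

A deep state with an exit latency of 200 microseconds is already twice the 100 µs routing budget, which is why idle phases matter so much here.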

• Disable these power-saving states at the BIOS level

• Some OS drivers are also very aggressive about entering deep idle states (intel_idle)
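
Before tuning, it helps to know which idle driver the kernel is actually using. The sketch below reads the cpuidle driver name that Linux exposes in sysfs; the class and method names are hypothetical, and the "unknown" fallback is an assumption for systems where the file is absent.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class IdleDriverCheck {

    // Returns the active cpuidle driver name, or "unknown" if the file cannot be read.
    static String currentDriver(Path driverFile) {
        if (!Files.isReadable(driverFile)) {
            return "unknown";
        }
        try {
            return new String(Files.readAllBytes(driverFile), StandardCharsets.US_ASCII).trim();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static boolean isIntelIdle(String driver) {
        return "intel_idle".equals(driver);
    }

    public static void main(String[] args) {
        String driver = currentDriver(Paths.get("/sys/devices/system/cpu/cpuidle/current_driver"));
        System.out.println("cpuidle driver: " + driver);
        if (isIntelIdle(driver)) {
            System.out.println("intel_idle is active: check its C-state limits as well as the BIOS");
        }
    }
}
```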

NUMA

• Non-Uniform Memory Access

• Applies to machines with 2 or more CPUs (sockets)

• One memory controller per socket (node)

• Local memory vs. remote memory

• Avoid remote memory access

• Avoid scheduling a thread on one node while its memory is allocated on another

• Bind the process to one CPU (socket) only, using numactl
  Restrictive but effective
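
Binding with numactl can be sketched as a small launcher that prefixes the JVM command line, using the --cpunodebind and --membind options so that both scheduling and allocation stay on the same node. The class name, the node number and the "router.jar" application name are placeholders, not the actual deployment.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class NumaLauncher {

    // Prefixes a command with numactl so that CPU scheduling and
    // memory allocation are both restricted to the given node.
    static List<String> bindToNode(int node, List<String> command) {
        List<String> full = new ArrayList<>();
        full.add("numactl");
        full.add("--cpunodebind=" + node);
        full.add("--membind=" + node);
        full.addAll(command);
        return full;
    }

    public static void main(String[] args) {
        // "router.jar" is a placeholder application name.
        List<String> cmd = bindToNode(0, Arrays.asList("java", "-jar", "router.jar"));
        System.out.println(String.join(" ", cmd));
        // To actually launch: new ProcessBuilder(cmd).inheritIO().start();
    }
}
```

The restriction is that the process can then use only the cores and memory of that one socket.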

Cache pollution

• The CPU's L3 cache is shared across cores

• Non-critical threads can “pollute” the cache
  • They evict data required by critical threads
  • This adds latency when that data is accessed again

• Critical threads need to be isolated from non-critical ones

• OS scheduling does not help in this case

• Threads need to be pinned manually to cores

• Thread affinity allows us to do this

• Identify the critical and non-critical threads in the application
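
Java SE has no standard API for setting thread affinity, so pinning goes through a native call or an external tool such as taskset. The sketch below only builds the CPU bit mask and the matching taskset command for an already running thread; the class name, the core numbers and the thread id 1234 are hypothetical.

```java
final class AffinityMask {

    // Builds a CPU affinity bit mask: bit N set means core N is allowed.
    static long maskOf(int... cores) {
        long mask = 0L;
        for (int core : cores) {
            if (core < 0 || core > 63) {
                throw new IllegalArgumentException("core out of range: " + core);
            }
            mask |= 1L << core;
        }
        return mask;
    }

    // Formats a taskset invocation that applies the mask to an existing
    // thread, identified by its native (OS-level) thread id.
    static String tasksetCommand(long mask, long nativeThreadId) {
        return "taskset -p 0x" + Long.toHexString(mask) + " " + nativeThreadId;
    }

    public static void main(String[] args) {
        // Example: restrict a critical thread (placeholder id 1234) to cores 2 and 3.
        System.out.println(tasksetCommand(maskOf(2, 3), 1234));
    }
}
```

Non-critical threads would get the complementary mask, so that they never run on the cores reserved for critical ones.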

• Other processes can also pollute the L3 cache

• All other processes need to be kept away from the critical threads

• The whole CPU needs to be dedicated

• No OS scheduling should be performed on it (isolcpus)
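
isolcpus is a kernel boot parameter, so an application can check at startup which cores were actually isolated by inspecting the kernel command line (available in /proc/cmdline on Linux). This is a minimal sketch with hypothetical names; it assumes the plain CPU-list form such as isolcpus=2-5,8 and does not handle the optional flag prefixes that newer kernels accept.

```java
import java.util.ArrayList;
import java.util.List;

final class IsolCpus {

    // Parses a CPU list such as "2-5,8" into individual CPU ids.
    static List<Integer> parseCpuList(String list) {
        List<Integer> cpus = new ArrayList<>();
        for (String part : list.split(",")) {
            String[] range = part.trim().split("-");
            int first = Integer.parseInt(range[0]);
            int last = range.length > 1 ? Integer.parseInt(range[1]) : first;
            for (int cpu = first; cpu <= last; cpu++) {
                cpus.add(cpu);
            }
        }
        return cpus;
    }

    // Extracts the isolcpus= value from a kernel command line.
    // Returns an empty list when the parameter is absent.
    static List<Integer> isolatedCpus(String cmdline) {
        for (String token : cmdline.trim().split("\\s+")) {
            if (token.startsWith("isolcpus=")) {
                return parseCpuList(token.substring("isolcpus=".length()));
            }
        }
        return new ArrayList<>();
    }

    public static void main(String[] args) {
        // Sample command line for illustration; a real check would read /proc/cmdline.
        String sample = "BOOT_IMAGE=/vmlinuz ro quiet isolcpus=2-5,8";
        System.out.println("isolated CPUs: " + isolatedCpus(sample));
    }
}
```

Critical threads are then pinned explicitly onto the isolated cores, since the scheduler no longer places anything there by itself.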

Conclusion

• You can optimize a process without changing your code

• Know your hardware and your OS

• This is Mechanical Sympathy

Q&A
