Java Concurrency Optimization: Concurrent Queues

Concurrent Queues

Author: Zhou Chen | Data Platform - DXP | Weibo: @MinZhou

Email: [email protected]

Java Concurrency Optimization: Blocking Queues

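About Me

• Alias: Zhou Chen (周忱, chén)

• Real name: Zhou Min (周敏)

• Weibo: @MinZhou

• Twitter: @minzhou

• Joined Taobao in June 2010

• Former leader of Taobao's Hadoop & Hive development team

• Currently a "temp worker" on the Yunti cross-datacenter project

• Hive contributor

• Free and open-source software enthusiast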

Data eXchange Platform| zhouchen.zm

What Is a Queue?



Uses of Queues



ArrayBlockingQueue & LinkedBlockingQueue



• BlockingQueue

• ArrayBlockingQueue: array-based implementation

• LinkedBlockingQueue: linked-list implementation

• Ops: about 3 million
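As a baseline, here is a minimal sketch of how both implementations are driven through the shared BlockingQueue interface with one producer and one consumer. The class name BlockingQueueDemo and the method transfer are hypothetical and do not come from the slides; the capacity and element count are arbitrary.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical demo class: one producer and one consumer over a BlockingQueue.
class BlockingQueueDemo {

    // Pushes the numbers 0..count-1 through the queue and returns the sum
    // observed by the consumer thread.
    static long transfer(final BlockingQueue<Long> queue, final int count) {
        final long[] sum = new long[1];
        Thread consumer = new Thread(() -> {
            try {
                for (int i = 0; i < count; i++) {
                    sum[0] += queue.take();   // blocks while the queue is empty
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        consumer.start();
        try {
            for (long i = 0; i < count; i++) {
                queue.put(i);                 // blocks while the queue is full
            }
            consumer.join();                  // join makes sum[0] visible to this thread
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return sum[0];
    }

    public static void main(String[] args) {
        System.out.println(transfer(new ArrayBlockingQueue<>(1024), 100_000));
        System.out.println(transfer(new LinkedBlockingQueue<>(1024), 100_000));
    }
}
```

Both calls go through the same lock-protected put/take path, which is the cost the later steps try to remove.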

Performance Problems with Queues



• Linked list is the EVIL of performance

• Write contention on the three variables head, tail, and size

• Coarse-grained locks on put/take and offer/poll

• GC problems

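The Single Writer Principle

Method                               Time (ms)
Single thread, long                  300
Single thread, volatile long         4,700
Single thread, AtomicLong (CAS)      5,700
Two threads, AtomicLong (CAS)        18,000
Single thread, synchronized + long   10,000
Two threads, synchronized + long     118,000

• Time taken to increment a single variable 500,000,000 times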
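The single-threaded rows of such a measurement could be reproduced roughly as follows. This is only a sketch: the class SingleWriterCost and its fields are hypothetical, the iteration count is reduced from the slide's 500,000,000, and a real measurement would need JIT warm-up (for example with a harness such as JMH), so no timings are claimed here.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical micro-benchmark sketch; names are illustrative, not from the slides.
class SingleWriterCost {
    static long plain;
    static volatile long vol;
    static final AtomicLong atomic = new AtomicLong();
    static long guarded;
    static final Object lock = new Object();

    // Runs body n times and returns the elapsed time in milliseconds.
    static long time(final long n, final Runnable body) {
        long start = System.nanoTime();
        for (long i = 0; i < n; i++) {
            body.run();
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        final long n = 50_000_000L; // assumption: smaller than the slide's count
        System.out.println("long:                " + time(n, () -> plain++) + " ms");
        System.out.println("volatile long:       " + time(n, () -> vol++) + " ms");
        System.out.println("AtomicLong:          " + time(n, () -> atomic.incrementAndGet()) + " ms");
        System.out.println("synchronized + long: "
                + time(n, () -> { synchronized (lock) { guarded++; } }) + " ms");
    }
}
```

The two-thread rows would additionally start a second thread running the same body against the same variable, which is where contention enters.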

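Step 1: Ring Buffer Queue



• No write contention (each index has exactly one writer), so no locks are needed, not even CAS

• The volatile keyword makes each thread's writes visible to the other thread

• No need to maintain size

• Ops: about 11 million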
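The bullets above can be sketched as a single-producer, single-consumer ring buffer. This is an illustrative sketch, not the author's code: the class name VolatileSpscQueue is hypothetical, and it is only safe with exactly one producer thread and one consumer thread.

```java
// Hypothetical single-producer/single-consumer queue: tail is written only by
// the producer, head only by the consumer, so no lock or CAS is required.
class VolatileSpscQueue<E> {
    private final E[] buffer;
    private volatile long head = 0; // next slot to read, written by the consumer only
    private volatile long tail = 0; // next slot to write, written by the producer only

    @SuppressWarnings("unchecked")
    VolatileSpscQueue(final int capacity) {
        buffer = (E[]) new Object[capacity];
    }

    // Producer side. Returns false when the queue is full.
    boolean offer(final E e) {
        final long currentTail = tail;
        if (currentTail - head >= buffer.length) {
            return false;
        }
        buffer[(int) (currentTail % buffer.length)] = e;
        tail = currentTail + 1; // volatile write publishes the element
        return true;
    }

    // Consumer side. Returns null when the queue is empty.
    E poll() {
        final long currentHead = head;
        if (currentHead >= tail) {
            return null;
        }
        final int index = (int) (currentHead % buffer.length);
        final E e = buffer[index];
        buffer[index] = null;   // release the reference for GC
        head = currentHead + 1; // volatile write frees the slot for the producer
        return e;
    }
}
```

Emptiness and fullness are derived from head and tail alone, which is why no shared size field is needed.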

Memory Barriers



• Load Buffer

• Store Buffer

• CPU serializing instructions

– CPUID

– SFENCE

– LFENCE

– MFENCE

• LOCK-prefixed instructions

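Step 2: lazySet



• AtomicXXX.lazySet() guarantees StoreStore ordering

• But it does not guarantee StoreLoad ordering

• Guarantees eventual consistency (the write becomes visible to other threads eventually)

• A lightweight volatile write

• Unsafe.putOrderedXXX

• Ops: about 17 million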

"This is a niche method that is sometimes useful when fine-tuning code using non-blocking data structures. The semantics are that the write is guaranteed not to be re-ordered with any previous write, but may be reordered with subsequent operations (or equivalently, might not be visible to other threads) until some other volatile write or synchronizing action occurs."

--Doug Lea
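Applied to a single-producer, single-consumer ring buffer, the lazySet idea might look like the sketch below. The class LazySetSpscQueue is hypothetical and not taken from the slides; it assumes exactly one producer and one consumer, and it relies only on AtomicLong.get() and AtomicLong.lazySet().

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical SPSC queue that publishes its indices with lazySet instead of
// a full volatile write, avoiding a StoreLoad barrier on every operation.
class LazySetSpscQueue<E> {
    private final E[] buffer;
    private final AtomicLong head = new AtomicLong(0); // written by the consumer only
    private final AtomicLong tail = new AtomicLong(0); // written by the producer only

    @SuppressWarnings("unchecked")
    LazySetSpscQueue(final int capacity) {
        buffer = (E[]) new Object[capacity];
    }

    boolean offer(final E e) {
        final long currentTail = tail.get();
        if (currentTail - head.get() >= buffer.length) {
            return false; // full
        }
        buffer[(int) (currentTail % buffer.length)] = e;
        // Ordered store: the element write above cannot be reordered after it.
        tail.lazySet(currentTail + 1);
        return true;
    }

    E poll() {
        final long currentHead = head.get();
        if (currentHead >= tail.get()) {
            return null; // empty
        }
        final int index = (int) (currentHead % buffer.length);
        final E e = buffer[index];
        buffer[index] = null;
        // Ordered store: the slot is cleared before the producer can see it as free.
        head.lazySet(currentHead + 1);
        return e;
    }
}
```

The trade-off is that the other thread may observe the new index slightly later than with a volatile write, which a polling queue can tolerate.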

Step 3: Optimizing the Modulo



• Replace % with & (2^k - 1), which requires the capacity to be a power of two (2^k)

• Ops: about 22 million

Before:

public boolean offer(final E e) {
    …
    buffer[(int) (currentTail % buffer.length)] = e;
    …
}

After:

public boolean offer(final E e) {
    …
    // mask = buffer.length - 1, where buffer.length is a power of two
    buffer[(int) (currentTail & mask)] = e;
    …
}
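The equivalence between the two index computations holds only for non-negative sequences and power-of-two capacities. A small sketch, using a hypothetical helper class PowerOfTwoIndex, makes that assumption explicit:

```java
// Hypothetical helpers illustrating the mask trick; not part of the slides.
class PowerOfTwoIndex {

    // Rounds n up to the next power of two. Valid for 1 <= n <= 2^30.
    static int nextPowerOfTwo(final int n) {
        return 1 << (32 - Integer.numberOfLeadingZeros(n - 1));
    }

    // Slot index using the modulo operator.
    static int indexByModulo(final long sequence, final int capacity) {
        return (int) (sequence % capacity);
    }

    // Slot index using a bit mask; capacity must be a power of two.
    static int indexByMask(final long sequence, final int capacity) {
        return (int) (sequence & (capacity - 1));
    }
}
```

A queue would typically round the requested capacity up once in its constructor and store capacity - 1 as the mask.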

False Sharing



Step 4: Eliminating False Sharing



• Ops: about 40 million

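import java.util.concurrent.atomic.AtomicLong;

public class PaddedAtomicLong extends AtomicLong {
    private static final long serialVersionUID = 1L;

    public PaddedAtomicLong() {
    }

    public PaddedAtomicLong(final long initialValue) {
        super(initialValue);
    }

    // Padding fields that push neighbouring data off this counter's cache line
    public long p1, p2, p3, p4, p5, p6;
}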
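To see whether padding matters on a given machine, two threads can each update their own counter, first with plain AtomicLong instances and then with padded ones. The sketch below is an assumption-laden illustration: FalseSharingDemo and its nested Padded class are hypothetical, the JVM is free to lay out or reorder fields as it likes, and two separately allocated objects are not guaranteed to share a cache line, so the timings may or may not differ.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical experiment: two threads, each incrementing only its own counter.
class FalseSharingDemo {

    // Same idea as the PaddedAtomicLong above, redefined to keep this sketch self-contained.
    static class Padded extends AtomicLong {
        private static final long serialVersionUID = 1L;
        public long p1, p2, p3, p4, p5, p6;
    }

    // Runs both writers to completion and returns the elapsed milliseconds.
    static long run(final AtomicLong a, final AtomicLong b, final long iterations) {
        Thread t1 = new Thread(() -> {
            for (long i = 0; i < iterations; i++) {
                a.incrementAndGet();
            }
        });
        Thread t2 = new Thread(() -> {
            for (long i = 0; i < iterations; i++) {
                b.incrementAndGet();
            }
        });
        long start = System.nanoTime();
        t1.start();
        t2.start();
        try {
            t1.join();
            t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        final long n = 50_000_000L; // arbitrary iteration count
        System.out.println("unpadded: " + run(new AtomicLong(), new AtomicLong(), n) + " ms");
        System.out.println("padded:   " + run(new Padded(), new Padded(), n) + " ms");
    }
}
```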

CPU Cache



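How Memory Layout Affects Performance



• Tests

– Read memory sequentially

– Access randomly within one memory page, then move on to another page and access it randomly

– Fully random access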

• https://gist.github.com/coderplay/4453283
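The three access patterns can be approximated in a few lines. The following is not the code from the gist: AccessPatterns is a hypothetical class, a 4 KiB page size is assumed, and a fixed odd-stride permutation stands in for true randomness. The array length must be a power of two and a multiple of PAGE_WORDS.

```java
// Hypothetical sketch of the three traversal orders; each visits every element once.
class AccessPatterns {
    static final int PAGE_WORDS = 512;  // assumed 4 KiB page / 8-byte long
    static final int STRIDE = 514_229;  // odd, so (i * STRIDE) & mask is a permutation

    // Walks the array from the first element to the last.
    static long sequential(final long[] data) {
        long sum = 0;
        for (int i = 0; i < data.length; i++) {
            sum += data[i];
        }
        return sum;
    }

    // Scattered order inside each page, pages visited one after another.
    static long randomWithinPage(final long[] data) {
        long sum = 0;
        for (int page = 0; page < data.length; page += PAGE_WORDS) {
            for (int i = 0; i < PAGE_WORDS; i++) {
                sum += data[page + ((i * STRIDE) & (PAGE_WORDS - 1))];
            }
        }
        return sum;
    }

    // Scattered order across the whole array.
    static long fullyRandom(final long[] data) {
        long sum = 0;
        final int mask = data.length - 1;
        for (int i = 0; i < data.length; i++) {
            sum += data[(i * STRIDE) & mask];
        }
        return sum;
    }
}
```

Timing each method over an array much larger than the last-level cache is what would expose the differences between the patterns; all three return the same sum.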


Cache Line



cat /sys/devices/system/cpu/cpu0/cache/index0/*

Step 5: Optimizing Memory Layout



• Use a direct ByteBuffer

• Use Unsafe to align to page boundaries

• Keep memory contiguous

• Ops: about 68 million
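The direct ByteBuffer part of this step might look like the sketch below, which stores long values contiguously off-heap instead of as references to scattered objects. DirectLongRing is a hypothetical class; page alignment via Unsafe is deliberately not shown, and the capacity is assumed to be a power of two.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical ring storage backed by one contiguous off-heap block.
class DirectLongRing {
    private final ByteBuffer buffer;
    private final int mask;

    // capacityPowerOfTwo must be a power of two so that masking works.
    DirectLongRing(final int capacityPowerOfTwo) {
        buffer = ByteBuffer.allocateDirect(capacityPowerOfTwo * Long.BYTES)
                           .order(ByteOrder.nativeOrder());
        mask = capacityPowerOfTwo - 1;
    }

    // Writes value into the slot for the given sequence (absolute put, 8 bytes per slot).
    void set(final long sequence, final long value) {
        buffer.putLong((int) (sequence & mask) << 3, value);
    }

    // Reads the value stored in the slot for the given sequence.
    long get(final long sequence) {
        return buffer.getLong((int) (sequence & mask) << 3);
    }
}
```

A queue built on this storage would still need head and tail indices with the ordering guarantees described in the earlier steps.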

Step 6: yield() vs LockSupport.parkNanos(1)



• Fewer StoreLoad barriers

• Less cache-coherence noise, which improves the cache hit rate

• Ops: about 110 million
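Both calls are ways for a thread to back off when the queue has nothing for it. A sketch of a consumer loop with a pluggable idle strategy follows; BackoffDemo and IdleStrategy are hypothetical names, and which strategy performs better is left to measurement on the target hardware.

```java
import java.util.Queue;
import java.util.concurrent.locks.LockSupport;

// Hypothetical consumer loop that backs off whenever the queue is empty.
class BackoffDemo {

    interface IdleStrategy {
        void idle();
    }

    static final IdleStrategy YIELD = Thread::yield;
    static final IdleStrategy PARK = () -> LockSupport.parkNanos(1L);

    // Polls until count elements have been received and returns their sum.
    static long drain(final Queue<Long> queue, final int count, final IdleStrategy strategy) {
        long sum = 0;
        int received = 0;
        while (received < count) {
            final Long value = queue.poll();
            if (value == null) {
                strategy.idle(); // back off instead of re-reading the shared index at once
                continue;
            }
            sum += value;
            received++;
        }
        return sum;
    }
}
```

Backing off for longer means the waiting thread reads the other side's index less often, which is one way to read the two bullets above.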

Other Optimizations



• Preallocate the ring buffer: zero GC

• Batch production and consumption

• Wait free

• Ops can reach 220 million!

• CPU affinity
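Preallocation and batching can be combined in one small sketch: a ring of primitive long values allocated once, whose consumer reads the producer's index once, processes everything available, and publishes its own index once per batch. BatchingLongRing is a hypothetical class for a single producer and a single consumer; it is not the author's implementation.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongConsumer;

// Hypothetical SPSC ring of primitives with batch consumption.
class BatchingLongRing {
    private final long[] buffer; // allocated once, no per-element garbage
    private final int mask;
    private final AtomicLong head = new AtomicLong(); // written by the consumer only
    private final AtomicLong tail = new AtomicLong(); // written by the producer only

    // capacityPowerOfTwo must be a power of two.
    BatchingLongRing(final int capacityPowerOfTwo) {
        buffer = new long[capacityPowerOfTwo];
        mask = capacityPowerOfTwo - 1;
    }

    // Producer side. Returns false when the ring is full.
    boolean offer(final long value) {
        final long t = tail.get();
        if (t - head.get() > mask) {
            return false;
        }
        buffer[(int) (t & mask)] = value;
        tail.lazySet(t + 1);
        return true;
    }

    // Consumer side. Handles every element currently available and
    // publishes head once for the whole batch. Returns the batch size.
    int drain(final LongConsumer handler) {
        final long h = head.get();
        final long available = tail.get();
        for (long s = h; s < available; s++) {
            handler.accept(buffer[(int) (s & mask)]);
        }
        head.lazySet(available);
        return (int) (available - h);
    }
}
```

Publishing head once per batch trades a little latency for fewer shared writes; the producer sees freed slots only after the whole batch has been handled.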

Food for Thought



• Multiple consumers

• Multiple producers

Tools



• top

• vmstat

• lscpu

• perf

• Valgrind tools suite

• OProfile

• SystemTap

• numactl

• Intel VTune

• Intel PTU

• Intel PCM + ksysguard

• MAT

Code



$ git clone https://github.com/coderplay/javaopt.git

$ java -cp bin javaopt.queue.QueuePerfTest n


Recommended Reading



• What every programmer should know about memory

• Intel® 64 and IA-32 Architectures Software Developer Manuals

• The Art of Multiprocessor Programming

• The JSR-133 Cookbook for Compiler Writers (Java Memory Model)

• My blog: http://coderplay.javaeye.com

Q & A


