TRANSCRIPT
Efficient Parallel Learning of Word2Vec
Jeroen B. P. Vuurens1, Carsten Eickhoff2, and Arjen P. de Vries3
1The Hague University of Applied Science
2ETH Zurich
3Radboud University Nijmegen
June 24, 2016
J. Vuurens et al. Efficient Parallel Learning of Word2Vec June 24, 2016 1 / 14
Word2Vec
Simple method for low-dimensional feature representation of words
Beneficial properties:
- Unsupervised
- Semantics-preserving (up to a point...)
Recently very popular
Figure courtesy of T. Mikolov et al.
More is more. . .
Figure courtesy of http://deepdist.com/J. Vuurens et al. Efficient Parallel Learning of Word2Vec June 24, 2016 3 / 14
Parallel Training
Shared model θ
Parallel SGD threads:
- Draw a random training example x_i
- Acquire a lock on θ
- Read θ
- Update θ ← θ − α∇L(f_θ(x_i), y_i)
- Release lock
Lots of waiting...
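The locked update loop above can be sketched as follows. This is a minimal toy, not the word2vec objective: the one-dimensional model, squared-error gradient, and data are illustrative stand-ins that make the lock-acquire/read/update/release pattern concrete.

```python
import random
import threading

# Toy shared model: a single weight, protected by one global lock.
theta = [0.0]
lock = threading.Lock()
alpha = 0.1
data = [(x, 2.0 * x) for x in [1.0, 2.0, 3.0]]  # noiseless targets, true weight = 2.0

def grad(theta, x, y):
    # d/dtheta of the squared error 0.5 * (theta*x - y)^2
    return (theta[0] * x - y) * x

def worker(steps):
    for _ in range(steps):
        x, y = random.choice(data)   # draw a random training example x_i
        with lock:                   # acquire a lock on theta
            g = grad(theta, x, y)    # read theta, compute the gradient
            theta[0] -= alpha * g    # update theta
                                     # lock is released on block exit

threads = [threading.Thread(target=worker, args=(500,)) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(theta[0])  # converges near 2.0
```

Every thread serializes on the same lock, which is exactly the "lots of waiting" problem: with many cores, most time is spent contending for the lock rather than computing gradients.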
Hogwild!
Simply skip the locking:
- Draw a random training example x_i
- Read the current state of θ
- Update θ ← θ − α∇L(f_θ(x_i), y_i)
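The Hogwild! variant is the same toy loop with the lock simply removed. Races between threads are tolerated; when updates are sparse they rarely collide, so the model still converges. The model and data are again illustrative, not word2vec itself.

```python
import random
import threading

# Same toy model as before, but now updated lock-free.
theta = [0.0]
alpha = 0.1
data = [(x, 2.0 * x) for x in [1.0, 2.0, 3.0]]  # noiseless targets, true weight = 2.0

def worker(steps):
    for _ in range(steps):
        x, y = random.choice(data)       # draw a random training example x_i
        g = (theta[0] * x - y) * x       # read the *current* state of theta
        theta[0] -= alpha * g            # racy, unsynchronized update

threads = [threading.Thread(target=worker, args=(500,)) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(theta[0])  # still converges near 2.0 despite the races
```

An occasional lost update only delays convergence slightly; it does not change the fixed point the threads are driving toward.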
Parallel Word2Vec
Intel Xeon CPU E5-2698 v3, 32 cores
Original C implementation + Gensim
Hierarchical Softmax
- Binary Huffman tree
- V − 1 internal nodes
- Each word w is represented by a number of binary decisions
- The tree's top nodes are part of most paths
Figure courtesy of X. Rong
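The tree construction can be sketched with a standard heap-based Huffman build; the tiny vocabulary and frequencies are made up for illustration. It shows the two properties the slide relies on: a vocabulary of V words yields V − 1 internal merges, and frequent words sit near the root, so their paths (the sequences of binary decisions) are short.

```python
import heapq
import itertools

# Hypothetical word frequencies; real word2vec counts these from the corpus.
freqs = {"the": 100, "of": 60, "cat": 5, "zymurgy": 1}

counter = itertools.count()  # tie-breaker so the heap never compares payloads
heap = [(f, next(counter), w) for w, f in freqs.items()]
heapq.heapify(heap)
merges = 0
while len(heap) > 1:
    f1, _, a = heapq.heappop(heap)   # two least frequent subtrees...
    f2, _, b = heapq.heappop(heap)
    heapq.heappush(heap, (f1 + f2, next(counter), (a, b)))  # ...merge into one node
    merges += 1                      # each merge creates one internal node

def code_lengths(node, depth=0):
    """Depth of each leaf = number of binary decisions for that word."""
    if isinstance(node, str):
        return {node: depth}
    left, right = node
    out = code_lengths(left, depth + 1)
    out.update(code_lengths(right, depth + 1))
    return out

lengths = code_lengths(heap[0][2])
print(merges)                        # V - 1 = 3 internal nodes
print(lengths["the"], lengths["zymurgy"])  # frequent word gets the shorter path
```

Because every path starts at the root, the handful of nodes near the top are touched by almost every training example, which is what makes them a contention hotspot.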
Zipf’s Law
Figure courtesy of http://wugology.com/
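A quick back-of-the-envelope check of why Zipf's Law makes the top of the tree so hot: if the rank-r word has frequency roughly proportional to 1/r (the idealized Zipf distribution; real corpora only approximate this), a tiny fraction of the vocabulary accounts for most of the tokens, and hence most of the traffic through the top tree nodes.

```python
# Idealized Zipf distribution over a 10,000-word vocabulary.
V = 10000
weights = [1.0 / r for r in range(1, V + 1)]
total = sum(weights)

# Token share of the 100 most frequent words (1% of the vocabulary).
top_share = sum(weights[:100]) / total
print(top_share)  # more than half of all tokens
```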
Cached Huffman Trees
- Cache the top c nodes in the tree
- Each thread works on its own, possibly stale, copy of these top nodes
- Update the cache every u terms
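The caching scheme can be sketched as follows, single-threaded for clarity and with made-up names, dimensions, and update values (the real implementation is the authors' Python/Cython code at http://cythnn.github.io). A worker keeps a private copy of the vectors for the top c nodes, trains against that stale copy, and folds its accumulated change back into the shared model every u processed terms; updates to the rarely-touched deeper nodes go straight to the shared model as before.

```python
c, u = 2, 3                                  # cache top-2 nodes, sync every 3 terms
shared = {n: [0.0] for n in range(5)}        # 5 internal nodes, 1-dim toy vectors

class Worker:
    def __init__(self):
        self.cache = {n: list(shared[n]) for n in range(c)}  # stale private copy
        self.base = {n: list(shared[n]) for n in range(c)}   # snapshot at last sync
        self.seen = 0

    def update(self, node, delta):
        if node < c:
            self.cache[node][0] += delta     # hot top node: local, contention-free
        else:
            shared[node][0] += delta         # rare deep node: hit the shared model
        self.seen += 1
        if self.seen % u == 0:
            self.sync()

    def sync(self):
        for n in range(c):
            change = self.cache[n][0] - self.base[n][0]
            shared[n][0] += change           # merge accumulated local delta
            self.cache[n] = list(shared[n])  # refresh the stale copy
            self.base[n] = list(shared[n])

w = Worker()
for node, delta in [(0, 1.0), (1, 0.5), (4, 0.2), (0, 1.0), (0, 1.0), (1, 0.5)]:
    w.update(node, delta)
print(shared[0][0], shared[1][0], shared[4][0])
```

The shared top nodes are now written only once every u terms instead of on every single training example, which removes the memory-traffic hotspot while keeping the model consistent up to a bounded staleness.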
Efficiency
- Python/Cython implementation of cached Huffman trees
- The same scaling problem persists at c = 0 (no caching)
- Significantly better performance at c = 31
Cache Size
- Consistent improvements for all c ≤ 31
- Best results for 1 ≤ u ≤ 10
- Overly large choices of u degrade model quality
Effectiveness
- Stable model quality
- Slight quality edge for the Gensim implementation
Conclusion
Hierarchical Softmax scales badly beyond 4-8 nodes:
- Frequent memory accesses to top nodes
- Zipf's Law
Caching a few top nodes:
- 4x speed-up
- Constant model quality
Try it yourself: http://cythnn.github.io
Thank You!
[email protected]