TRANSCRIPT
Efficient Parallel Learning of Word2Vec
Jeroen B. P. Vuurens1, Carsten Eickhoff2, and Arjen P. de Vries3
1The Hague University of Applied Science
2ETH Zurich
3Radboud University Nijmegen
June 24, 2016
J. Vuurens et al. Efficient Parallel Learning of Word2Vec June 24, 2016 1 / 14
Word2Vec
Simple method for low-dimensional feature representation of words
Beneficial properties:
- Unsupervised
- Semantics-preserving (up to a point...)
Recently very popular
Figure courtesy of T. Mikolov et al.
More is more. . .
Figure courtesy of http://deepdist.com/J. Vuurens et al. Efficient Parallel Learning of Word2Vec June 24, 2016 3 / 14
Parallel Training
Shared model θ
Parallel SGD threads:
- Draw a random training example x_i
- Acquire a lock on θ
- Read θ
- Update θ ← θ − α∇L(f_θ(x_i), y_i)
- Release lock
Lots of waiting...
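The locked update loop above can be sketched as follows. This is a minimal toy, not the word2vec objective: the one-dimensional model, squared-error gradient, and data are illustrative stand-ins that make the lock-acquire/read/update/release pattern concrete.

```python
import random
import threading

# Toy shared model: a single weight, protected by one global lock.
theta = [0.0]
lock = threading.Lock()
alpha = 0.1
data = [(x, 2.0 * x) for x in [1.0, 2.0, 3.0]]  # noiseless targets, true weight = 2.0

def grad(theta, x, y):
    # d/dtheta of the squared error 0.5 * (theta*x - y)^2
    return (theta[0] * x - y) * x

def worker(steps):
    for _ in range(steps):
        x, y = random.choice(data)   # draw a random training example x_i
        with lock:                   # acquire a lock on theta
            g = grad(theta, x, y)    # read theta, compute the gradient
            theta[0] -= alpha * g    # update theta
                                     # lock is released on block exit

threads = [threading.Thread(target=worker, args=(500,)) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(theta[0])  # converges near 2.0
```

Every thread serializes on the same lock, which is exactly the "lots of waiting" problem: with many cores, most time is spent contending for the lock rather than computing gradients.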
Hogwild!
Simply skip the locking:
- Draw a random training example x_i
- Read the current state of θ
- Update θ ← θ − α∇L(f_θ(x_i), y_i)
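The Hogwild! variant is the same toy loop with the lock simply removed. Races between threads are tolerated; when updates are sparse they rarely collide, so the model still converges. The model and data are again illustrative, not word2vec itself.

```python
import random
import threading

# Same toy model as before, but now updated lock-free.
theta = [0.0]
alpha = 0.1
data = [(x, 2.0 * x) for x in [1.0, 2.0, 3.0]]  # noiseless targets, true weight = 2.0

def worker(steps):
    for _ in range(steps):
        x, y = random.choice(data)       # draw a random training example x_i
        g = (theta[0] * x - y) * x       # read the *current* state of theta
        theta[0] -= alpha * g            # racy, unsynchronized update

threads = [threading.Thread(target=worker, args=(500,)) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(theta[0])  # still converges near 2.0 despite the races
```

An occasional lost update only delays convergence slightly; it does not change the fixed point the threads are driving toward.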
Parallel Word2Vec
Intel Xeon CPU E5-2698 v3, 32 cores
Original C implementation + Gensim
Hierarchical Softmax
- Binary Huffman tree
- V − 1 internal nodes
- Each word w is represented by a number of binary decisions
- The tree's top nodes are part of most paths
Figure courtesy of X. Rong
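The tree construction can be sketched with a standard heap-based Huffman build; the tiny vocabulary and frequencies are made up for illustration. It shows the two properties the slide relies on: a vocabulary of V words yields V − 1 internal merges, and frequent words sit near the root, so their paths (the sequences of binary decisions) are short.

```python
import heapq
import itertools

# Hypothetical word frequencies; real word2vec counts these from the corpus.
freqs = {"the": 100, "of": 60, "cat": 5, "zymurgy": 1}

counter = itertools.count()  # tie-breaker so the heap never compares payloads
heap = [(f, next(counter), w) for w, f in freqs.items()]
heapq.heapify(heap)
merges = 0
while len(heap) > 1:
    f1, _, a = heapq.heappop(heap)   # two least frequent subtrees...
    f2, _, b = heapq.heappop(heap)
    heapq.heappush(heap, (f1 + f2, next(counter), (a, b)))  # ...merge into one node
    merges += 1                      # each merge creates one internal node

def code_lengths(node, depth=0):
    """Depth of each leaf = number of binary decisions for that word."""
    if isinstance(node, str):
        return {node: depth}
    left, right = node
    out = code_lengths(left, depth + 1)
    out.update(code_lengths(right, depth + 1))
    return out

lengths = code_lengths(heap[0][2])
print(merges)                        # V - 1 = 3 internal nodes
print(lengths["the"], lengths["zymurgy"])  # frequent word gets the shorter path
```

Because every path starts at the root, the handful of nodes near the top are touched by almost every training example, which is what makes them a contention hotspot.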
Zipf’s Law
Figure courtesy of http://wugology.com/
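A quick back-of-the-envelope check of why Zipf's Law makes the top of the tree so hot: if the rank-r word has frequency roughly proportional to 1/r (the idealized Zipf distribution; real corpora only approximate this), a tiny fraction of the vocabulary accounts for most of the tokens, and hence most of the traffic through the top tree nodes.

```python
# Idealized Zipf distribution over a 10,000-word vocabulary.
V = 10000
weights = [1.0 / r for r in range(1, V + 1)]
total = sum(weights)

# Token share of the 100 most frequent words (1% of the vocabulary).
top_share = sum(weights[:100]) / total
print(top_share)  # more than half of all tokens
```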
Cached Huffman Trees
- Cache the top c nodes in the tree
- Each thread works on its own, possibly stale, copy of these top nodes
- Update the cache every u terms
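The caching scheme can be sketched as follows, single-threaded for clarity and with made-up names, dimensions, and update values (the real implementation is the authors' Python/Cython code at http://cythnn.github.io). A worker keeps a private copy of the vectors for the top c nodes, trains against that stale copy, and folds its accumulated change back into the shared model every u processed terms; updates to the rarely-touched deeper nodes go straight to the shared model as before.

```python
c, u = 2, 3                                  # cache top-2 nodes, sync every 3 terms
shared = {n: [0.0] for n in range(5)}        # 5 internal nodes, 1-dim toy vectors

class Worker:
    def __init__(self):
        self.cache = {n: list(shared[n]) for n in range(c)}  # stale private copy
        self.base = {n: list(shared[n]) for n in range(c)}   # snapshot at last sync
        self.seen = 0

    def update(self, node, delta):
        if node < c:
            self.cache[node][0] += delta     # hot top node: local, contention-free
        else:
            shared[node][0] += delta         # rare deep node: hit the shared model
        self.seen += 1
        if self.seen % u == 0:
            self.sync()

    def sync(self):
        for n in range(c):
            change = self.cache[n][0] - self.base[n][0]
            shared[n][0] += change           # merge accumulated local delta
            self.cache[n] = list(shared[n])  # refresh the stale copy
            self.base[n] = list(shared[n])

w = Worker()
for node, delta in [(0, 1.0), (1, 0.5), (4, 0.2), (0, 1.0), (0, 1.0), (1, 0.5)]:
    w.update(node, delta)
print(shared[0][0], shared[1][0], shared[4][0])
```

The shared top nodes are now written only once every u terms instead of on every single training example, which removes the memory-traffic hotspot while keeping the model consistent up to a bounded staleness.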
Efficiency
- Python/Cython implementation of cached Huffman trees
- The same scaling problem persists at c = 0 (no caching)
- Significantly better performance at c = 31
Cache Size
- Consistent improvements for all c ≤ 31
- Best results for 1 ≤ u ≤ 10
- Overly large choices of u degrade model quality
Effectiveness
- Stable model quality
- Slight quality edge for the Gensim implementation
Conclusion
Hierarchical Softmax scales badly beyond 4-8 nodes:
- Frequent memory accesses to top nodes
- Zipf's Law
Caching a few top nodes:
- 4x speed-up
- Constant model quality
Try it yourself: http://cythnn.github.io
Thank You!
[email protected]