bpemb: tokenization-free pre-trained 0.71ex subword ... › publication › heinzer... · bpemb:...

1
BPEmb: Tokenization-free Pre-trained Subword Embeddings in 275 Languages Benjamin Heinzerling 1,2 Michael Strube 2 1 AIPHES 2 Heidelberg Institute for Theoretical Studies Myxomatosis -osis “state, abnormal condition, or action” -oma / -omato Forming nouns indicating tumors or masses myxo From Ancient Greek uxa, “mucus” Computational Approximations to Morphological Analysis 1. Split into subwords characters: m, y, x, o, m, a, t, o, s, i, s ngrams, e.g.: myx, yxo, xom, oma, mat, ato, tos, osi, sis FastText: myx + yxo + xom + oma + mat + ato + tos + osi + sis + myxo + yxom + xoma + omat + mato + atos + tosi + osis + myxom + . . . + xomato + omatos + matosi + atosis byte pairs 2. Learn function that infers word meaning from subwords Byte-Pair Encoding (BPE) (Gage, 1994) ABABCABCD Most frequent pair: A B Merge pair A B to X: XXCXCD Most frequent pair: X C Merge pair X C to Y: X Y Y D Symbol table: X: A B Y: X C BPE for Text (Sennrich et al., 2016) ABABCABCD Most frequent pair: A B Merge symbol pair A B into AB: AB AB C AB C D Most frequent pair: AB C Merge pair AB C into ABC: AB ABC ABC D Symbol table: A B: AB AB C: ABC BPE Applied to English Wikipedia t t a a h e he i n in t he the er s on c re o w is an in ... ough series int ai stit ery ister ... igo osis jose ... omato sis ... Subword-based Entity Typing myxomatosis myx omat osis 1 2 3 4 5 6 7 8 9 /sickness Results: English Entity Typing Words Characters FastText BPEmb 0.2 0.3 0.4 0.5 0.6 Entity Typing Accuracy Average RNN CNN Unsupervised Segmentation with BPE Merge ops Byte-pair encoded text 1000 to y od a station is a r ail way station on the ch ¯ u¯o main l ine 3000 to y od a station is a railway station on the ch ¯ u¯o main line 10000 toy oda station is a railway station on the ch ¯ u¯o main line 50000 toy oda station is a railway station on the ch¯ u¯o main line 100000 toy oda station is a railway station on the ch¯ u¯o main line Tokenized toyoda station is a railway station on the ch¯ u¯o main line 10000 [ D o% . ( JR o% ) ?% = ¿ j D 25000 [ D o%. ( JR o% ) ?% =¿jD 50000 [ D o%. ( JR o% ) ?%=¿jD Tokenized [D o% . * JR o% + ?%= ¿ jD 5000 ( (H ) / 3Ø o = 6 S +K 10000 ( ( H ) / 3Ø o = 6 S+K 25000 ( (H ) / 3Ø o = 6 S+K 50000 ( (H ) / 3Ø o = 6S+K Tokenized * (H + / 3 Ø o = 6 S + K Download Embeddings and BPE Models in 275 Languages https://github.com/bheinzerling/bpemb Acknowledgements: This work has been supported by the German Research Foundation as part of the Research Training Group “Adaptive Preparation of Information from Heterogeneous Sources” (AIPHES) under grant No. GRK 1994/1.

Upload: others

Post on 03-Jul-2020

14 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: BPEmb: Tokenization-free Pre-trained 0.71ex Subword ... › publication › heinzer... · BPEmb: Tokenization-free Pre-trained Subword Embeddings in 275 Languages Benjamin Heinzerling1,2

BPEmb: Tokenization-free Pre-trainedSubword Embeddings in 275 LanguagesBenjamin Heinzerling1,2 Michael Strube2

1AIPHES 2Heidelberg Institute for Theoretical Studies

Myxomatosis

-osis“state, abnormal condition, oraction”

-oma / -omatoForming nouns indicatingtumors or masses

myxoFrom Ancient Greek muxa,“mucus”

Computational Approximations to Morphological Analysis

1. Split into subwordscharacters: m, y, x, o, m, a, t, o, s, i, sngrams, e.g.: myx, yxo, xom, oma, mat, ato, tos, osi, sisFastText: myx + yxo + xom + oma + mat + ato + tos + osi +

sis + myxo + yxom + xoma + omat + mato + atos + tosi + osis +myxom + . . . + xomato + omatos + matosi + atosis

byte pairs2. Learn function that infers word meaning from subwords

Byte-Pair Encoding (BPE)(Gage, 1994)

A B A B C A B C D

Most frequent pair: A B

Merge pair A B to X:X X C X C D

Most frequent pair: X C

Merge pair X C to Y:X Y Y D

Symbol table:X: A BY: X C

BPE for Text(Sennrich et al., 2016)

A B A B C A B C D

Most frequent pair: A B

Merge symbol pair A B into AB:AB AB C AB C D

Most frequent pair: AB C

Merge pair AB C into ABC:AB ABC ABC D

Symbol table:A B: ABAB C: ABC

BPE Applied to English Wikipedia

t → ta → a

h e → hei n → in

t he → theers

onc

re

owisanin

. . .oughseriesintai

stitery

ister. . .igoosisjose. . .

omatosis. . .

Subword-based Entity Typing

myxomatosis

myx

omat

osis

[x1 x2 x3

]

[x4 x5 x6

]

[x7 x8 x9

]/sickness

Results: English Entity Typing

Words Characters FastText BPEmb

0.2

0.3

0.4

0.5

0.6

Entity

TypingAccuracy

Average

RNN

CNN

Unsupervised Segmentation with BPE

Merge ops Byte-pair encoded text

1000 to y od a station is a r ail way station on the ch u o main l ine3000 to y od a station is a railway station on the ch u o main line

10000 toy oda station is a railway station on the ch u o main line50000 toy oda station is a railway station on the chu o main line

100000 toy oda station is a railway station on the chuo main lineTokenized toyoda station is a railway station on the chuo main line

10000 豐 田 站 是 東 日本 旅 客 鐵 道 ( JR 東 日本 ) 中央 本 線 的 鐵路 車站25000 豐田 站是 東日本旅客鐵道 ( JR 東日本 ) 中央 本 線的鐵路車站50000 豐田 站是 東日本旅客鐵道 ( JR 東日本 ) 中央 本線的鐵路車站

Tokenized 豐田站 是 東日本 旅客 鐵道 ( JR 東日本 ) 中央本線 的 鐵路車站

5000 豊 田 駅 ( と よ だ え き ) は 、 東京都 日 野 市 豊 田 四 丁目 にある10000 豊 田 駅 ( と よ だ えき ) は 、 東京都 日 野市 豊 田 四 丁目にある25000 豊 田駅 ( とよ だ えき ) は 、 東京都 日 野市 豊田 四 丁目にある50000 豊 田駅 ( とよ だ えき ) は 、 東京都 日 野市 豊田 四丁目にある

Tokenized 豊田 駅 ( と よ だ え き ) は 、 東京 都 日野 市 豊田 四 丁目 に ある

Download Embeddings and BPE Models in 275 Languages

https://github.com/bheinzerling/bpemb

Acknowledgements: This work has been supported by the German Research Foundation as part of theResearch Training Group “Adaptive Preparation of Information from Heterogeneous Sources” (AIPHES)under grant No. GRK 1994/1.