
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 767–772, Copenhagen, Denmark, September 7–11, 2017. © 2017 Association for Computational Linguistics

Segmentation-Free Word Embedding for Unsegmented Languages∗

Takamasa Oshikiri
Graduate School of Engineering Science, Osaka University
RIKEN Center for Advanced Intelligence Project
CyberAgent, Inc.

[email protected]

Abstract

In this paper, we propose a new pipeline of word embedding for unsegmented languages, called segmentation-free word embedding, which does not require word segmentation as a preprocessing step. Unlike space-delimited languages, unsegmented languages, such as Chinese and Japanese, require word segmentation as a preprocessing step. However, word segmentation, which often requires manually annotated resources, is difficult and expensive, and unavoidable segmentation errors affect downstream tasks. To avoid these problems in learning word vectors for unsegmented languages, we consider word co-occurrence statistics over all possible segmentation candidates based on frequent character n-grams, instead of the segmented sentences provided by conventional word segmenters. Our experiments on noun category prediction tasks on raw Twitter, Weibo, and Wikipedia corpora show that the proposed method outperforms conventional approaches that require word segmenters.

1 Introduction

Word embedding, which learns dense vector representations of words from large text corpora, has received much attention in the natural language processing (NLP) community in recent years. Such representations are reported to capture semantic and syntactic properties of words well (Bengio et al., 2003; Mikolov et al., 2013; Pennington et al., 2014), and they are useful for many downstream NLP tasks, including part-of-speech tagging, syntactic parsing, and machine translation (Huang et al., 2011; Socher et al., 2013; Sutskever et al., 2014).

∗ This work was done while the author was at Shimodaira laboratory, Division of Mathematical Science, Graduate School of Engineering Science, Osaka University, and Mathematical Statistics Team, RIKEN Center for Advanced Intelligence Project.

Figure 1: t-SNE projections of vector representations of Japanese nouns generated by our proposed method without a word dictionary. These proper nouns are color-coded according to their categories, which were extracted from Wikidata. (The legend categories are country, ship, food, chemical compound, profession, railway station, manga, prefecture of Japan, and human; the scatter plot itself is omitted here.)

In order to train word embedding models on a raw text corpus, we have to perform word segmentation as a preprocessing step. In space-delimited languages such as English and Spanish, simple rule-based and co-occurrence-based approaches offer reasonable segmentations. On the other hand, these approaches are impractical for unsegmented languages such as Chinese, Japanese, and Thai. Therefore, machine learning-based approaches are widely used in NLP for unsegmented languages. Conditional random field (CRF)-based supervised word segmentation (Kudo et al., 2004; Tseng et al., 2005) remains the most widely used approach in Japanese and Chinese NLP (Prettenhofer and Stein, 2010; Funaki and Nakayama, 2015; Ishiwatari et al., 2015; Nakazawa et al., 2016).

However, using supervised word segmentation as a preprocessing step of a word embedding pipeline raises several problems. First, it requires language-specific manually annotated resources such as word dictionaries and segmented corpora. Since such resources are typically unavailable for domain-specific corpora (e.g., Twitter or Weibo corpora, which contain many neologisms and informal words), we have to create them ourselves when they are needed. Second, supervised segmenters cannot take advantage of word occurrence frequencies in a corpus. Even if a certain proper noun (e.g., “老人と海” (The Old Man and the Sea)) occurs frequently in a corpus, a word segmenter will keep splitting it erroneously (e.g., “老人/と/海” (an old man / and / a sea)) as long as it is not registered in the word dictionary. Because of the segmentation errors caused by these problems, the downstream word embedding model cannot learn vector representations of proper nouns, neologisms, and informal words.

In this paper, in order to learn word vectors from a raw text corpus while avoiding the above problems, we propose a new word segmentation-free pipeline for word embedding, referred to as segmentation-free word embedding (sembei). Our framework first enumerates all possible segmentations (referred to as a frequent n-gram lattice) based on character n-grams that frequently occur in the raw corpus, and then learns n-gram vectors from co-occurrence frequencies over the frequent n-gram lattice. Using the general idea of segmentation-free word embedding, we can extend existing word embedding models. Specifically, in this paper, we propose a segmentation-free version of the widely used skip-gram model with negative sampling (SGNS) (Mikolov et al., 2013), which we refer to as SGNS-sembei.

Although the frequent character n-grams necessarily include many non-words (i.e., n-grams that are not words), our results remarkably show that nearest neighbor search works well for frequent words and even proper nouns (e.g., the nearest neighbors of the n-gram “ドイツ” (Germany) are “中国” (China), “イギリス” (United Kingdom), etc.). This observation suggests that the proposed method can be used for automatic acquisition of synonyms from large raw text corpora.
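As a rough illustration of how such a search could be carried out once n-gram vectors are available, the following sketch ranks vocabulary items by cosine similarity. It is not part of the paper's released C++ implementation; the names `vectors` and `vocab` are hypothetical, and only NumPy is assumed.

```python
import numpy as np

def nearest_neighbors(query, vectors, vocab, k=3):
    """Return the k n-grams whose vectors have the highest cosine
    similarity to the query n-gram's vector.

    vectors: (|V|, d) matrix of learned n-gram vectors.
    vocab:   dict mapping each n-gram string to its row index.
    """
    v = vectors[vocab[query]]
    sims = vectors @ v / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(v) + 1e-12)
    inv = {i: gram for gram, i in vocab.items()}
    ranked = [inv[i] for i in np.argsort(-sims) if inv[i] != query]
    return ranked[:k]
```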

We conduct experiments on a noun category prediction task on several corpora and observe that our method outperforms the conventional approaches that use word segmenters. Fig. 1 shows a t-SNE projection of vector representations of Japanese nouns learned only from a raw Twitter corpus. We can see that the proposed method learns vector representations of these nouns, and the learnt representations achieve good separation based on their categories.

2 Related Work

There are some representation models that do not rely on any segmenters. Dhingra et al. (2016) proposed a character-based RNN model for vector representations of tweets, and Schütze (2017) proposed a new text embedding method that learns n-gram vectors from a randomly segmented corpus and then constructs text embeddings by summing up the n-gram vectors. In the field of representation learning for biological sequences (e.g., DNA and RNA), Asgari and Mofrad (2015) applied the skip-gram model (Mikolov et al., 2013) to fixed-length fragments of biological sequences. These methods mainly aim at learning vector representations of texts or biological sequences rather than of words or fragments of sequences. In this paper, on the other hand, we focus on learning vector representations of words from a raw corpus of an unsegmented language.

3 Conventional Approaches to Word Embeddings

Word embedding is also commonly used in NLP for unsegmented languages (Prettenhofer and Stein, 2010; Funaki and Nakayama, 2015; Ishiwatari et al., 2015). These studies typically segment a raw corpus into words using a word segmenter or a morphological analyzer, and then feed the segmented corpus to word embedding models (e.g., the skip-gram model (Mikolov et al., 2013) or GloVe (Pennington et al., 2014)), as in the case of space-delimited languages. The flowchart of the above process is shown in the left part of Fig. 2.
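For concreteness, the conventional pipeline could look like the following sketch, which segments Japanese text with MeCab and then trains a skip-gram model with gensim. This is an illustrative assumption rather than the exact setup of the paper's experiments; it presumes the `mecab-python3` and `gensim` (4.x) packages and a plain-text file `corpus.txt`.

```python
import MeCab
from gensim.models import Word2Vec

# Step 1: word segmentation with MeCab (-Owakati prints space-separated tokens).
tagger = MeCab.Tagger("-Owakati")
with open("corpus.txt", encoding="utf-8") as f:
    sentences = [tagger.parse(line.strip()).split() for line in f if line.strip()]

# Step 2: feed the segmented corpus to a word embedding model (SGNS here).
model = Word2Vec(sentences, vector_size=200, window=5, sg=1,
                 negative=10, min_count=5, epochs=5)
# model.wv now holds the word vectors used by downstream tasks.
```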

3.1 The original SGNS

Figure 2: Flowcharts of the previous and proposed pipelines. (Previous: a raw corpus and a dictionary are used to construct a word lattice, a morphological analyzer finds the optimum path, and word embedding on the segmented text yields word vectors that are evaluated on tasks. Proposed: a raw corpus and frequent n-grams are used to construct an n-gram lattice, and n-gram embedding yields n-gram vectors that are evaluated on tasks.) Morphological analyzers are in charge of the shaded part. Our main idea is to replace a word dictionary with a set of frequent character n-grams, and to omit the identification of the optimum path.

The original skip-gram model with negative sampling (Mikolov et al., 2013) (we refer to it as the original SGNS) learns vector representations of words v_w and their contexts v_c that maximize the following objective function:

\[
\underset{\{v_w\}\cup\{v_c\}}{\mathrm{maximize}} \;\; \sum_{(w,c)\in D} \log \sigma(v_w^{\top} v_c) \;+\; \sum_{(w,c)\in D'} \log \sigma(-v_w^{\top} v_c) \qquad (1)
\]

where σ(x) := (1 + e^{−x})^{−1}, D is a multiset (bag) of positive samples (i.e., pairs that co-occur in the corpus), and D′ is a multiset of negative samples. This objective function is maximized using stochastic gradient descent (SGD).
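A minimal sketch of objective (1), assuming NumPy and dictionaries `v_w` and `v_c` that map words and contexts to vectors (all names are illustrative, not the paper's implementation):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_objective(v_w, v_c, positive, negative):
    """Objective (1): log-sigmoid of the dot product over positive pairs,
    plus log-sigmoid of the negated dot product over negative pairs."""
    pos = sum(np.log(sigmoid(v_w[w] @ v_c[c])) for w, c in positive)
    neg = sum(np.log(sigmoid(-(v_w[w] @ v_c[c]))) for w, c in negative)
    return pos + neg
```

In practice the sums are never materialized: SGD updates v_w and v_c one sampled pair at a time.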

4 Segmentation-Free Word Embeddings

In this section, we first introduce the general idea of segmentation-free word embeddings (sembei), and then propose a segmentation-free version of the SGNS.

While conventional word embedding approaches learn word vectors from segmented corpora provided by word segmenters, our approach learns n-gram vectors from raw corpora, as in the right part of Fig. 2. In order to learn n-gram vectors from a raw corpus of an unsegmented language, we first construct a frequent n-gram lattice, which represents all possible segmentations based on frequent character n-grams of the corpus, in the same way as the word lattices used in morphological analysis are constructed. Then, we learn n-gram vectors using co-occurrence statistics over the frequent n-gram lattice instead of over segmented corpora as in conventional approaches.
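The lattice construction can be sketched as follows: every occurrence of a frequent character n-gram becomes a node, and two nodes are connected when one ends exactly where the other begins. This is a simplified illustration under those assumptions, not the paper's C++ implementation; `frequent_ngrams` is assumed to be a set of strings.

```python
def build_ngram_lattice(sentence, frequent_ngrams, max_n=8):
    """Nodes are (start, end, ngram) spans of the sentence whose substring
    is a frequent n-gram; edges connect adjacent spans."""
    nodes = []
    for i in range(len(sentence)):
        for n in range(1, max_n + 1):
            gram = sentence[i:i + n]
            if len(gram) == n and gram in frequent_ngrams:
                nodes.append((i, i + n, gram))
    edges = [(a, b) for a in nodes for b in nodes if a[1] == b[0]]
    return nodes, edges
```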

4.1 Segmentation-Free Version of the SGNS

Here, we introduce a segmentation-free version of the SGNS, referred to as SGNS-sembei, as an application of the idea of segmentation-free word embedding. Our method simply optimizes the original SGNS objective function (1) with a slight modification: the definition of the multiset of positive samples D is changed.

In SGNS-sembei, D is redefined as the multiset of character n-gram pairs (w, c) such that w and c occur adjacently in the corpus (i.e., w and c are connected in the frequent n-gram lattice). In addition, to discriminate between co-occurrences with different orders in the frequent n-gram lattice, we define contextual words together with their relative positions to the center word, in the same way as Ling et al. (2015).
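Continuing the lattice sketch above, the positive samples could be collected as in the following illustration; this is our reading of the definition, not the released implementation. Each context is tagged with its relative position (+1 for a right neighbor, −1 for a left neighbor), in the spirit of Ling et al. (2015).

```python
def positive_samples(nodes):
    """Positive pairs (w, c): n-grams that are adjacent in the lattice,
    with contexts annotated by their relative position to the center."""
    D = []
    for (_, end_w, w) in nodes:
        for (start_c, _, c) in nodes:
            if end_w == start_c:          # c immediately follows w
                D.append((w, ("+1", c)))  # c is a right context of w
                D.append((c, ("-1", w)))  # w is a left context of c
    return D
```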

We also redefine the multiset of negative samples D′ using D in the same way as the original SGNS, and then optimize the objective function (1) using SGD.

Table 1: Examples of entity labels (in Japanese, with English for reference) and their categories extracted from Wikidata.

label (ja)       label (en)         category
ドイツ           Germany            country
二酸化炭素       carbon dioxide     chemical compound
消防士           firefighter        profession
アップルパイ     apple pie          food
長友佑都         Yuto Nagatomo      human

5 Experiment

In this section, we evaluate our method on the noun category prediction task using Twitter, Weibo, and Wikipedia corpora.

The C++ implementation of the proposed method is available on GitHub1.

1 https://github.com/oshikiri/w2v-sembei

5.1 Settings

We used four raw text corpora: Wikipedia (Japanese), Wikipedia (Chinese), Twitter (Japanese), and Weibo (Chinese). The Wikipedia corpora consist of only a part of the Wikipedia dumps2 (dated February 20th, 2017), with HTML tags removed. The Weiboscope corpus (Chau et al., 2013) consists of 226,841,122 posts, mainly in Chinese, and we use only a part of it. The Twitter corpus consists of 17,316,968 Japanese tweets collected from October 26th, 2016 to November 22nd, 2016 via the Twitter Streaming API. We removed hashtags, user IDs, and URLs from the Twitter and Weibo corpora. We extracted about 1,460k frequent n-grams3 as the frequent character n-grams for our proposed method.
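The frequent n-grams are defined (see footnote 3 below) as the union of the top-k_n most frequent character n-grams for each length n. A minimal sketch of that selection, assuming the corpus is available in memory as a list of strings (not the paper's implementation):

```python
from collections import Counter

def select_frequent_ngrams(lines, ks):
    """Union of the top-k_n character n-grams for each n, where ks[n-1] = k_n."""
    selected = set()
    for n, k in enumerate(ks, start=1):
        counts = Counter()
        for line in lines:
            for i in range(len(line) - n + 1):
                counts[line[i:i + n]] += 1
        selected.update(gram for gram, _ in counts.most_common(k))
    return selected

# Example: the cut-offs used for the Japanese corpora (footnote 3).
ks_japanese = [10000, 300000, 300000, 300000, 200000, 200000, 100000, 50000]
```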

We extracted noun-category pairs from Wikidata (Vrandečić and Krötzsch, 2014) (we used the dump dated January 9th, 2017) as follows. We first extracted the Wikidata entities whose headwords are also among the 1,460k frequent n-grams and whose “instance of” properties belong to a predetermined category set4, and then collected the names and categories of these entities. Examples of the extracted noun-category pairs are shown in Table 1.

We randomly split the noun-category pairs into a train (60%) and a test (40%) set. We trained linear C-SVM classifiers (Hastie et al., 2009) on the train set to predict categories from the vector representations of the nouns. For each vector representation, we performed a grid search over (C, classifier) ∈ {0.5, 1, 5, 10, 50, 100} × {one-vs-one, one-vs-rest} for the linear SVM using the train set, and reported the best scores on the test set.

2 We used {ja,zh}wiki-20170220-pages-articles1.xml from https://dumps.wikimedia.org

3 In this experiment, we defined the frequent n-grams as the union of the top-k_n frequent n-grams, where n and k_n are pre-specified numbers. We used n = 8, (k_1, ..., k_8) = (10000, 300000, 300000, 300000, 200000, 200000, 100000, 50000) for the Japanese corpora, and n = 7, (k_1, ..., k_7) = (10000, 400000, 400000, 300000, 200000, 100000, 50000) for the Chinese corpora.

4 {country, profession, ship, railway station, food, chemical compound, prefecture of Japan, manga, human} for Japanese, and {country, profession, television series, business enterprise, city, chemical compound, taxon, human} for Chinese.
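The evaluation protocol above could be sketched with scikit-learn as follows. The cross-validated model selection on the training split is our assumption about how "using the train set" is realized; `X` (noun vectors) and `y` (category labels) are hypothetical inputs.

```python
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import LinearSVC

def evaluate_noun_vectors(X, y, seed=0):
    """60/40 train/test split, grid search over C and the multiclass scheme
    on the train set, micro F-score reported on the test set."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.6, random_state=seed)
    best_clf, best_cv = None, -1.0
    for C in [0.5, 1, 5, 10, 50, 100]:
        for wrapper in (OneVsOneClassifier, OneVsRestClassifier):
            clf = wrapper(LinearSVC(C=C))
            cv = cross_val_score(clf, X_tr, y_tr, cv=5, scoring="f1_micro").mean()
            if cv > best_cv:
                best_clf, best_cv = clf, cv
    best_clf.fit(X_tr, y_tr)
    return f1_score(y_te, best_clf.predict(X_te), average="micro")
```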

5.2 Baseline Systems

We compared SGNS-sembei with the conventional approaches that use the original SGNS and word segmenters. To segment the raw corpora, we used MeCab (Kudo et al., 2004) for the Japanese corpora and the Stanford Word Segmenter (Tseng et al., 2005) for the Chinese corpora, with their default dictionaries5. We ignored words that occur fewer than 5 times. We also ran these baseline systems in an ideal setting: running the word segmenters with the default dictionaries plus additional dictionaries consisting of the nouns extracted in § 5.1.

We performed a grid search over (h, t, n_neg) ∈ {5, 8, 10} × {10^{-5}, 10^{-4}, 10^{-3}} × {3, 10, 25}, where h is the size of the context window, t is the sampling threshold, and n_neg is the number of negative samples.
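For reference, this baseline grid search could be written with gensim's SGNS implementation roughly as below; gensim is our assumption (the paper does not name the word2vec implementation used for the baselines), and `segmented_corpus` stands for the MeCab- or Stanford-segmented token lists.

```python
from itertools import product
from gensim.models import Word2Vec

def baseline_grid(segmented_corpus):
    """Train one SGNS model per (window, sampling threshold, negatives) setting."""
    models = {}
    for h, t, n_neg in product([5, 8, 10], [1e-5, 1e-4, 1e-3], [3, 10, 25]):
        models[(h, t, n_neg)] = Word2Vec(
            segmented_corpus, vector_size=200, window=h, sample=t,
            negative=n_neg, sg=1, min_count=5, epochs=5)
    return models
```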

5.3 Results

In both the original SGNS and SGNS-sembei, we fixed the dimensionality of the vector representations to 200 and the number of iterations to 5. In SGNS-sembei, we used n_neg = 10 negative samples, a context window of size h = 1, and an initial learning rate of α_init = 0.01.

The resulting micro F-scores and coverages (i.e., the percentages of noun-category pairs whose nouns have a vector representation) are shown in Table 2, and the t-SNE (Maaten and Hinton, 2008) projections of the Japanese noun vectors learned from the Twitter corpus are shown in Fig. 1. We observed that our proposed method outperforms the conventional approaches that use word segmenters. Furthermore, the coverages of our method were higher than those of the SGNS with the default dictionary (especially in Japanese) and competitive with those of the SGNS with the default dictionary and Wikidata (which is an ideal setting), even though our method does not require any manually annotated resources. We can also see in Fig. 1 that the learnt representations achieve good separation based on their categories. Nearest neighbor search using the Twitter and Weibo corpora was also performed as a preliminary experiment, and, surprisingly, it worked well for frequent words, as in Table 3.

Table 2: Micro F-scores (higher is better) and coverages [%] (in parentheses, higher is better).

Method        dictionary:           Japanese                     Chinese
              default   Wikidata    Wikipedia    Twitter         Wikipedia    Weibo
SGNS          yes       -           0.896 (34)   0.761 (46)      0.889 (86)   0.766 (88)
SGNS          yes       yes         0.945 (98)   0.867 (96)      0.891 (94)   0.765 (93)
SGNS-sembei   -         -           0.949 (100)  0.870 (100)     0.891 (100)  0.811 (100)

5 We use mecab-ipadic v2.7.0 for MeCab and dict-chris6.ser.gz for the Stanford Word Segmenter.

Table 3: Results of nearest neighbor search for frequent words.

Language    Query                  3-Nearest Neighbors
Japanese    ドイツ (Germany)        中国 (China), イギリス (UK), ポーランド (Poland)
Japanese    酸素 (oxygen)           水素 (hydrogen), 鉄分 (iron), 二酸化炭素 (carbon dioxide)
Chinese     德国 (Germany)          美国 (USA), 英国 (UK), 法国 (France)
Chinese     羽毛球 (badminton)      台球 (billiards), 网球 (tennis), 乒乓球 (ping-pong)

6 Conclusion

We proposed segmentation-free word embedding for unsegmented languages. Although our method does not rely on any manually annotated resources, experimental results on the noun category prediction task on several corpora showed that our method outperforms conventional approaches that rely on such resources.

As an anonymous reviewer suggested, a possible direction of future work is to leverage another word segmentation approach which uses linguistic features, such as the Stanford Word Segmenter (Tseng et al., 2005) with k-best segmentations.

Acknowledgments

I would like to thank Hidetoshi Shimodaira, Thong Pham, Kazuki Fukui, and anonymous reviewers for their helpful suggestions. This work was partially supported by grants from Japan Society for the Promotion of Science KAKENHI (JP16H02789) to HS.

References

Ehsaneddin Asgari and Mohammad R. K. Mofrad. 2015. Continuous distributed representation of biological sequences for deep proteomics and genomics. PLOS ONE, 10(11):1–15.

Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. 2003. A neural probabilistic language model. Journal of Machine Learning Research, 3:1137–1155.

Michael Chau, Chung-hong Chan, and King-wa Fu. 2013. Assessing censorship on microblogs in China: Discriminatory keyword analysis and the real-name registration policy. IEEE Internet Computing, 17:42–50.

Bhuwan Dhingra, Zhong Zhou, Dylan Fitzpatrick, Michael Muehl, and William Cohen. 2016. Tweet2Vec: Character-based distributed representations for social media. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 269–274, Berlin, Germany. Association for Computational Linguistics.

Ruka Funaki and Hideki Nakayama. 2015. Image-mediated learning for zero-shot cross-lingual document retrieval. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 585–590, Lisbon, Portugal. Association for Computational Linguistics.

Trevor Hastie, Robert Tibshirani, and Jerome Friedman. 2009. The Elements of Statistical Learning, 2nd edition. Springer New York.

Fei Huang, Alexander Yates, Arun Ahuja, and Doug Downey. 2011. Language models as representations for weakly supervised NLP tasks. In Proceedings of the Fifteenth Conference on Computational Natural Language Learning, pages 125–134, Portland, Oregon, USA. Association for Computational Linguistics.

Shonosuke Ishiwatari, Nobuhiro Kaji, Naoki Yoshinaga, Masashi Toyoda, and Masaru Kitsuregawa. 2015. Accurate cross-lingual projection between count-based word vectors by exploiting translatable context pairs. In Proceedings of the Nineteenth Conference on Computational Natural Language Learning, pages 300–304, Beijing, China. Association for Computational Linguistics.

Taku Kudo, Kaoru Yamamoto, and Yuji Matsumoto. 2004. Applying conditional random fields to Japanese morphological analysis. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pages 230–237, Barcelona, Spain. Association for Computational Linguistics. http://taku910.github.io/mecab/.

Wang Ling, Chris Dyer, Alan W Black, and Isabel Trancoso. 2015. Two/too simple adaptations of word2vec for syntax problems. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1299–1304, Denver, Colorado. Association for Computational Linguistics.

Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

Toshiaki Nakazawa, Chenchen Ding, Hideya Mino, Isao Goto, Graham Neubig, and Sadao Kurohashi. 2016. Overview of the 3rd Workshop on Asian Translation. In Proceedings of the 3rd Workshop on Asian Translation (WAT2016), pages 1–46, Osaka, Japan. The COLING 2016 Organizing Committee.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 1532–1543, Doha, Qatar. Association for Computational Linguistics.

Peter Prettenhofer and Benno Stein. 2010. Cross-language text classification using structural correspondence learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 1118–1127, Uppsala, Sweden. Association for Computational Linguistics.

Hinrich Schütze. 2017. Nonsymbolic text representation. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 785–796, Valencia, Spain. Association for Computational Linguistics.

Richard Socher, John Bauer, Christopher D. Manning, and Andrew Y. Ng. 2013. Parsing with compositional vector grammars. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 455–465, Sofia, Bulgaria. Association for Computational Linguistics.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27, pages 3104–3112. Curran Associates, Inc.

Huihsin Tseng, Pichuan Chang, Galen Andrew, Daniel Jurafsky, and Christopher Manning. 2005. A conditional random field word segmenter for SIGHAN Bakeoff 2005. In Proceedings of the Fourth SIGHAN Workshop on Chinese Language Processing, volume 171, Jeju Island, Korea. http://nlp.stanford.edu/software/segmenter.shtml.

Denny Vrandečić and Markus Krötzsch. 2014. Wikidata: A free collaborative knowledgebase. Communications of the ACM, 57(10):78–85.