


Research Article

A Novel Deep Learning Method for Obtaining Bilingual Corpus from Multilingual Website

ShaoLin Zhu,1,2,3 Xiao Li,1,2 YaTing Yang,1,2 Lei Wang,1,2 and ChengGang Mi1,2

1The Xinjiang Technical Institute of Physics & Chemistry, Chinese Academy of Sciences, Urumqi, China
2Key Laboratory of Speech Language Information Processing of Xinjiang, Urumqi, China
3University of Chinese Academy of Sciences, Beijing, China

Correspondence should be addressed to YaTing Yang; yangyt@ms.xjb.ac.cn

Received 3 April 2018; Accepted 10 December 2018; Published 10 January 2019

Academic Editor: Emilio Insfran Pelozo

Copyright © 2019 ShaoLin Zhu et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Machine translation needs a large number of parallel sentence pairs to achieve good translation performance. However, the lack of parallel corpora heavily limits machine translation for low-resource language pairs. We propose a novel method that combines continuous word embeddings with deep learning to obtain parallel sentences. Since parallel sentences are very valuable for low-resource language pairs, we introduce cross-lingual semantic representation to induce bilingual signals. Our experiments show that we can achieve promising results despite lacking external resources for low-resource languages. Finally, we construct a state-of-the-art machine translation system for a low-resource language pair.

1. Introduction

A parallel corpus is one of the most valuable resources for many multilingual natural language processing applications; in particular, parallel corpora play a pivotal role in statistical machine translation (SMT) and neural machine translation (NMT). Thus, many approaches have been proposed to obtain bilingual corpora automatically. These can be roughly divided into two categories: (i) the first directly crawls web content and uses multiple features to filter parallel webpage content. Those features mainly include tokens in URLs, anchors around links, image alt attributes, and HTML tags [1-5]. This method recognizes parallel pages by computing the similarity of those features. (ii) The second selects parallel sentences by building classifiers, mainly maximum entropy, Bayesian, SVM, and neural network classifiers [2, 6-10]. Both methods have been demonstrated to obtain bilingual corpora in some language pairs. However, they only suit specific websites or resource-rich language pairs, so low-resource languages still underachieve in harvesting bilingual corpora.

We believe that two major challenges limit obtaining bilingual parallel corpora for low-resource languages. First, dynamic websites have a more complicated structure, and it is difficult to filter a parallel corpus by recognizing multiple features. For example, previous work obtains parallel corpora for major languages from Wikipedia and Twitter [4, 11]; however, many news websites have a more complicated structure than Wikipedia and Twitter. Second, although a classifier is a good solution for selecting parallel sentences from numerous noisy data, the number of available parallel sentences has an impact on the classifier [6, 12], and there is not enough parallel training corpus to construct a classifier for low-resource language pairs.

Recently, with the surge of continuous vector representations and the extensive application of deep learning, an interesting approach is to induce bilingual semantic clues from monolingual data with the new neural-network-inspired vector representations. Following those methods, we can alleviate the limits of low-resource language pairs for obtaining a parallel bilingual corpus. In this paper, we use continuous word embeddings to induce bilingual representations by establishing a cross-lingual mapping. Then, those bilingual representations are used to find parallel sentences. This step avoids the effect of HTML structure, as current websites are developed from dynamic modules. Finally, we construct a bidirectional recurrent neural network (LSTM-BiRNN) classifier to extract parallel sentences. We use the

Hindawi, Mathematical Problems in Engineering, Volume 2019, Article ID 7495436, 7 pages, https://doi.org/10.1155/2019/7495436


Figure 1: The architecture of obtaining parallel sentences. (Pipeline: a list of URLs feeds a website crawler; an HTML tag/URL analyzer yields candidate documents; L1 and L2 documents provide monolingual data that is cut into L1/L2 sentences and turned into L1/L2 word embeddings; exposing the bilingual signal produces a bilingual lexicon, which yields parallel sentences.)

parallel corpus obtained by the word-overlap model to train this classifier and perform the extraction process. To justify the effectiveness of the proposed approach, we obtain a Uyghur-Chinese parallel corpus from multilingual websites to train SMT systems and show improvements in BLEU (bilingual evaluation understudy) scores. Our experiments also show that we can achieve promising results without any specific feature engineering or external resources.

2. Related Works

The amount of information available on the Internet is expanding rapidly, and many works attempt to construct training corpora for machine translation from websites. A variety of approaches have been proposed to extract parallel sentences from the web. Those approaches can be divided into two strategies.

First, many approaches treat collecting parallel sentences as a text classification problem [6, 13, 14], using, for example, SVM classifiers and neural network classifiers. For instance, [6] proposed a siamese bidirectional recurrent neural network to construct a state-of-the-art classifier and detect parallel sentences. They remove the need for any domain-specific feature engineering or reliance on multiple models, requiring only raw parallel sentences. However, parallel sentences themselves are a very scarce resource for some low-resource language pairs, so this otherwise excellent method may not be suitable for some low-resource applications.

Second, many other works use the HTML structure of the web pages, URLs, image alt attributes, and so on to detect possible parallel sentences [1, 3, 15]. For instance, [7] use the links between translated articles in Wikipedia to crawl parallel sentences or words. These methods have proven useful for specific websites; the real challenge is to find strategies that allow extending them to crawl the web in an unsupervised fashion.

Espla-Gomis et al. developed an excellent tool, namely Bitextor, a free/open-source tool for harvesting parallel data from multilingual websites. It is highly modular and is aimed at allowing users to easily obtain segment-aligned parallel corpora from the Internet. It mainly obtains parallel sentences by comparing the HTML structure of the documents and the number of aligned words in a bilingual lexicon. Users only provide a bilingual lexicon, and the system can extract parallel data quickly and automatically. The real challenge is that a bilingual lexicon is not easy to obtain for some low-resource language pairs.

3. Methodology

The first step of obtaining a parallel corpus is harvesting a source of data. We use a web crawler to harvest monolingual data and construct the continuous word representations. Following the work of [3], we use multiple features to get the candidate data. Then we extend the works of [16, 17] to learn a bilingual signal. With the help of the bilingual signal, we can induce parallel sentences. The general architecture of obtaining a parallel corpus is presented in Figure 1.

3.1. Crawling Web-Data and Candidate Documents. The first step of harvesting a bilingual parallel corpus is using a web crawler to download data. However, unlike previous works that downloaded a mirror of a webpage, we only download texts that do not contain HTML tags. As current websites are developed from modules, pages on the same theme usually have the same HTML structure.

When we perform the downloading process, we use the Scrapy toolkit (https://pypi.python.org/pypi/Scrapy), written in Python. It is an excellent toolkit that allows users to set specific content to crawl. The next step is selecting candidate document pairs. As we all know, a website contains hundreds of thousands of documents, and if we match the whole website, the matching procedure is very slow and imprecise. To solve this problem, we borrow the idea of [2] and add a time window. The main characteristic of a news website is time, and every webpage has a publication time. Documents on the same topic are often reported within a period by different


languages. Thus, we use a heuristic that assumes documents with similar content are more likely to have publication dates close to each other. Therefore, each query is in fact run only against source documents published within a window of some days around the publication date of the target query document. In this procedure, we set the window size to three days. Then each query searches only a few documents and gets a higher precision. In the next section, we will introduce how to identify two multilingual documents that are parallel.
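The windowing heuristic above can be sketched as follows. This is a minimal illustration, not the authors' code: the three-day window comes from the text, while the function name and the document fields (`id`, `date`) are our own assumptions.

```python
from datetime import date, timedelta

def candidate_pairs(source_docs, target_docs, window_days=3):
    """Pair source/target documents whose publication dates fall
    within +/- window_days of each other, as in Section 3.1."""
    window = timedelta(days=window_days)
    pairs = []
    for src in source_docs:
        for tgt in target_docs:
            if abs(src["date"] - tgt["date"]) <= window:
                pairs.append((src["id"], tgt["id"]))
    return pairs

# Hypothetical documents with publication dates.
uyghur_docs = [{"id": "uy1", "date": date(2018, 4, 1)}]
chinese_docs = [{"id": "zh1", "date": date(2018, 4, 3)},
                {"id": "zh2", "date": date(2018, 4, 20)}]
print(candidate_pairs(uyghur_docs, chinese_docs))  # only ("uy1", "zh1") survives
```

Only documents within the window are compared, which is what keeps each query cheap and precise.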

3.2. Inducing Bilingual Signal. In this paper, we follow the works of [16, 17], which induce a bilingual lexicon from non-parallel data. In order to learn a bilingual lexicon from monolingual corpora, we must construct a bilingual semantic representation. However, unlike the usual task of learning a precise bilingual lexicon, our objective is harvesting more bilingual signal from multilingual data; in this step we care more about recall than precision. Our objective function is

$$\mathcal{T}(W_i^{V_s}, W_j^{V_t}) = \alpha \mathcal{T}_{mono} + \beta \mathcal{T}_{match} \quad (1)$$

where $W_i^{V_s}$ is one word in the vocabulary of $V_s$, while the reverse direction follows by symmetry for $W_j^{V_t}$. At the same time, in order to normalize $\mathcal{T}(W_i^{V_s}, W_j^{V_t})$, we set the sum of $\alpha$ and $\beta$ to 1. Parameters $\alpha$ and $\beta$ mainly control the influence of the monolingual and bilingual components.

Unlike the usual monolingual term that explains regularities in monolingual corpora, we use the term $\mathcal{T}_{mono}$ to explain the translation probability between two words. Since semantically similar words are closer in distance, we can reveal more translation pairs by measuring the distance of two words from seeds: if two words are both close to one seed pair in distance, they are more likely translations of each other.

$$\mathcal{T}_{mono} = \mathcal{T}^s_{mono} + \mathcal{T}^t_{mono} \quad (2)$$

$$\mathcal{T}^s_{mono} = \min_{\langle ss,tt \rangle \in d} \left\| W_i^{V_s} - W_{ss}^{V_s} \right\| \quad (3)$$

$$\mathcal{T}^t_{mono} = \min_{\langle ss,tt \rangle \in d} \left\| W_j^{V_t} - W_{tt}^{V_t} \right\| \quad (4)$$

Our monolingual term $\mathcal{T}_{mono}$ encourages embeddings of word translation pairs from a seed lexicon $d$ to move closer. $W_i^{V_s}$ and $W_{ss}^{V_s}$ are two source words, the latter from the seed lexicon; $\mathcal{T}^s_{mono}$ computes the semantic similarity between the words $i$ and $ss$. For the target side, we have the same definition.
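The monolingual term in (2)-(4) can be sketched as follows. This is a toy illustration under our own assumptions (2-dimensional embeddings, Euclidean distance, and function names of our choosing), not the authors' implementation.

```python
import math

def dist(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def t_mono_side(word_vec, seed_vecs):
    """Minimum distance from a word embedding to any seed-lexicon
    embedding on the same side, as in (3) and (4)."""
    return min(dist(word_vec, s) for s in seed_vecs)

def t_mono(src_vec, tgt_vec, src_seed_vecs, tgt_seed_vecs):
    """T_mono = T^s_mono + T^t_mono, as in (2)."""
    return (t_mono_side(src_vec, src_seed_vecs) +
            t_mono_side(tgt_vec, tgt_seed_vecs))

# Toy vectors: the source word is near one source seed,
# the target word is near one target seed.
src_seeds = [(0.0, 0.0), (5.0, 5.0)]
tgt_seeds = [(1.0, 1.0)]
score = t_mono((0.0, 1.0), (1.0, 2.0), src_seeds, tgt_seeds)
print(round(score, 3))  # 1.0 + 1.0 = 2.0
```

A small score means both words sit close to some seed pair, which is the signal the term rewards.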

Our matching term $\mathcal{T}_{match}$ can expose how source words translate to target words. The matching term actually induces

$$\mathcal{T}_{match} = \underset{s \in V_s \text{ and } s \notin d}{\arg\max} \left[ \mathcal{M}_{st} \right] \cos(W_s^{V_s}, W_t^{V_t}) \quad (5)$$

As we learn the bilingual signal from monolingual corpora, the source word vectors and target word vectors are trained independently of each other, so the two sets of vectors do not lie in one vector space. In order to solve this problem, we follow the method of [18] to convert the monolingual vector spaces to a shared space. Our objective is optimizing the cross-lingual matching regularizer

$$\mathcal{M}_{st} = \sum_i \sum_j a_{ij} \left\| w_i^s - w_j^t \right\|^2 \quad (6)$$

$$= (\mathbf{R}^S - \mathbf{R}^T)^\top A (\mathbf{R}^S - \mathbf{R}^T) \quad (7)$$

In the above formula, $A$ is the similarity matrix of two words, where $a_{ij}$ encodes the translation score of word $i$ in the source with word $j$ in the target; $w_i^s$ is the K-dimensional word embedding, which is stacked to form a (V×K)-dimensional matrix $\mathbf{R}$.

To use a simple example to explain this procedure: assume that we have an English lexicon {perform, believe, talk}, a Chinese lexicon {zhixing, shixing, jiaotan}, and an English-Chinese seed lexicon {(conduct, jinxing)}. Assuming that steps (1)-(2) have already been conducted, we can calculate that "perform" is close to "conduct" in the source and that "zhixing" and "shixing" are close to "jinxing" in the target. We can then add (perform, zhixing) and (perform, shixing) into the original lexicon.
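The perform/zhixing illustration can be sketched as follows. The embeddings are invented for illustration, and cosine similarity stands in for "closer in distance"; the function names and threshold are our own assumptions, not the paper's settings.

```python
import math

def cos_sim(u, v):
    """Cosine similarity between two embedding vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den

def expand_lexicon(seed_pairs, src_vecs, tgt_vecs, threshold=0.9):
    """Add (source, target) pairs whose embeddings are close to the
    two sides of an existing seed entry (a simplified reading of
    the expansion step in Section 3.2)."""
    new_pairs = set()
    for seed_src, seed_tgt in seed_pairs:
        for s, sv in src_vecs.items():
            if s == seed_src or cos_sim(sv, src_vecs[seed_src]) <= threshold:
                continue
            for t, tv in tgt_vecs.items():
                if t == seed_tgt:
                    continue
                if cos_sim(tv, tgt_vecs[seed_tgt]) > threshold:
                    new_pairs.add((s, t))
    return new_pairs

# Toy embeddings: "perform" lies near the seed word "conduct";
# "zhixing" and "shixing" lie near the seed word "jinxing".
src_vecs = {"conduct": (1.0, 0.0), "perform": (0.99, 0.1), "talk": (0.0, 1.0)}
tgt_vecs = {"jinxing": (1.0, 0.0), "zhixing": (0.98, 0.15),
            "shixing": (0.97, 0.2), "jiaotan": (0.0, 1.0)}
print(expand_lexicon({("conduct", "jinxing")}, src_vecs, tgt_vecs))
```

With these toy vectors, the function reproduces the example's result: (perform, zhixing) and (perform, shixing) enter the lexicon, while "talk"/"jiaotan" stay out.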

3.3. Parallel Sentences Identification. In our particular situation, that is, seriously low-resource language pairs, although a classifier is a good method to identify parallel sentences, we do not have enough parallel sentences to train such a classifier. So, in the initial stage, we use a word-overlap model as a filter to select parallel sentences. The word-overlap model must borrow a bilingual lexicon, and parallel sentences can then be identified by the number of co-occurring word pairs. This process can be represented as

$$\text{score}(\mathbf{d}_s, \mathbf{d}_t) = \frac{|\mathbf{d}_s \cap \mathbf{d}_t|}{\min(|\mathbf{d}_s|, |\mathbf{d}_t|)} \quad (8)$$

From the above, we can conclude that inducing the bilingual signal is an important step. Using the bilingual lexicon, we can quickly compute word alignments for two sentences. In order to obtain massive numbers of parallel sentences, we need a high-coverage bilingual lexicon. However, we may not get a large-coverage bilingual lexicon, with the result that we can only get a small parallel corpus using the word-overlap model; we can also observe this in the experiments in Section 4. In order to get a large number of parallel sentences with high accuracy, we next use a classifier to get more parallel data. The classifier has already proved to be an excellent method to extract parallel sentences. We follow the work of [6] to train a BiRNN (bidirectional recurrent neural network) classifier. Our neural network architecture is illustrated in Figure 2.
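The overlap score in (8) can be sketched as follows. This is a minimal sketch under our own assumptions: tokenized sentences, a toy lexicon with latinized entries, and an intersection counted via lexicon translations; names are ours.

```python
def overlap_score(src_tokens, tgt_tokens, lexicon):
    """score(d_s, d_t) = |d_s ∩ d_t| / min(|d_s|, |d_t|), where the
    intersection counts source words whose lexicon translation
    appears in the target sentence (equation (8))."""
    translated = {lexicon[w] for w in src_tokens if w in lexicon}
    overlap = len(translated & set(tgt_tokens))
    return overlap / min(len(src_tokens), len(tgt_tokens))

# Toy latinized lexicon entries, purely for illustration.
lexicon = {"kitab": "book", "yaxshi": "good"}
score = overlap_score(["bu", "kitab", "yaxshi"], ["a", "good", "book"], lexicon)
print(round(score, 4))  # 2 of min(3, 3) words align
```

A pair is kept when the score clears some threshold, so a richer lexicon directly raises recall, which is why the coverage of the induced lexicon matters.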

Like most previous approaches that train a neural network classifier using parallel sentences, our method also converts the sentences into vectors. However, instead of using word vectors as input, we use fixed-size sentence vectors as input. For the ReLU layer, we can define

$$\mathbf{x}_i = \text{sigm}(\mathbf{w}_i \mathbf{s}_i + \mathbf{b}) \quad (9)$$


Figure 2: Architecture for bidirectional recurrent neural networks. The fully connected layers predict the probability of a parallel sentence pair. (The input sentence pair (s_1 ... s_n, t_1 ... t_m) is mapped to vectors x_1 ... x_n, fed into the forward hidden states h^f and backward hidden states h^b of a Bi-LSTM, followed by a ReLU layer and fully connected layers producing the output y.)

The BiRNN layer contains a feed-forward and a feed-backward neural network layer. This can be described by

$$h_i^f = \sigma\left(w_{xh}^f x_i + w_{hh}^f h_{i-1} + b^f\right) \quad (10)$$

$$h_i^b = \sigma\left(w_{xh}^b x_i + w_{hh}^b h_{i+1} + b^b\right) \quad (11)$$

$$\mathbf{h} = h_i^f + h_i^b \quad (12)$$

For prediction, a sentence pair is identified as parallel if the probability exceeds a threshold $\sigma$. We compute it as follows:

$$y = \begin{cases} 1 & \text{if } p(y = 1 \mid \mathbf{h}) > \sigma \\ 0 & \text{otherwise} \end{cases} \quad (13)$$
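Equations (10)-(13) can be sketched for a single feature dimension as follows. The scalar weights, sigmoid activation, and threshold value are illustrative assumptions of ours, not the trained parameters of the paper's classifier.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def birnn_states(xs, w_xh=0.5, w_hh=0.3, b=0.1):
    """Scalar sketch of (10)-(12): the forward pass reads the inputs
    left-to-right (state depends on h_{i-1}), the backward pass reads
    them right-to-left (state depends on h_{i+1}), and the two hidden
    states are summed per position."""
    n = len(xs)
    h_f, h_b = [0.0] * n, [0.0] * n
    prev = 0.0
    for i in range(n):                 # forward direction, eq. (10)
        prev = sigmoid(w_xh * xs[i] + w_hh * prev + b)
        h_f[i] = prev
    prev = 0.0
    for i in range(n - 1, -1, -1):     # backward direction, eq. (11)
        prev = sigmoid(w_xh * xs[i] + w_hh * prev + b)
        h_b[i] = prev
    return [f + bwd for f, bwd in zip(h_f, h_b)]  # eq. (12)

def is_parallel(p, sigma=0.5):
    """Decision rule (13): label 1 iff p(y = 1 | h) exceeds sigma."""
    return 1 if p > sigma else 0

h = birnn_states([1.0, -1.0, 0.5])
print(is_parallel(0.8), is_parallel(0.2))  # 1 0
```

The real classifier uses LSTM cells and vector-valued states, but the control flow (two passes over the sequence, then a thresholded probability) is the same.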

4. Experiments

To assess the effectiveness of our method, we compare it in different settings against the baseline. We mainly pay attention to solving the low-resource problem and constructing a low-resource language-pair translation system; as an instantiation of this goal, we conduct a detailed study on Uyghur-Chinese.

4.1. Experiment Setup

4.1.1. Data. In our experiments, the actual systems obtain a bilingual parallel corpus from three multilingual websites: TianShan (http://www.ts.cn), RenMin (http://www.people.com.cn), and KunLun (http://www.xjkunlun.cn). We only retain webpages whose documents have more than 20 words. The statistics of the preprocessed corpora are given in Table 1. Note that we only select sentences whose length exceeds 10 words. The data of our experiment are available at https://pan.baidu.com/s/1EePrHOjhuN-jTb-vNiSgTA.

Table 1: Experiment set statistics.

Websites    Languages    Webpages    Sentences
TianShan    Chinese      249238      3839000
            Uyghur       48907       427000
RenMin      Chinese      451972      5500000
            Uyghur       99578       590000
KunLun      Chinese      44046       641000
            Uyghur       27419       324000

4.1.2. Evaluations and Ground Truth. In order to carry out an objective evaluation of the obtained parallel sentences, we perform two evaluation methods. The first is translation accuracy, which is the proportion of truly parallel sentence pairs among all obtained sentence pairs. As we obtain data from an open data platform, we cannot get a standard set of translation pairs to compute the translation accuracy of the obtained parallel sentences, so we manually evaluate the accuracy of a random sample of the obtained parallel sentences; in the experiments, we randomly select 500 obtained parallel sentence pairs for manual evaluation. The second is to use the obtained parallel sentences to construct a machine translation system, with the BLEU score as the evaluation metric.

4.1.3. Baseline. For comparison, we use a parallel sentence extraction system, Bitextor. The system is a free/open-source tool for harvesting parallel data from multilingual websites. The user is required to provide one or more URLs of websites to be processed, the two languages for which the parallel corpus will be produced, and a bilingual lexicon in these two languages. This system can automatically analyze the structure of webpages and obtain parallel data via the bilingual lexicon. Thus, we vary the size of the bilingual lexicon to test how it affects the obtaining of parallel sentences.

Another problem is evaluating the classifier. We all know that we must use parallel sentences to train the classifier and then use the classifier to predict parallel sentences, and the size of the parallel training set affects the classifier's performance. So we select different numbers of parallel sentences to train the classifier and test it.

4.2. Effect of Bilingual Lexicon Size. In order to investigate the effectiveness of our system for obtaining parallel sentences in low-resource language pairs, we run ours and Bitextor with different bilingual lexicon sizes. As Bitextor does not use the time window as a feature to select parallel data while our system does, we use the time window to filter the results of Bitextor in order to keep the experimental setup consistent. We record the performance while varying the lexicon size used in the training process, as shown in Figures 3 and 4. The results are for Uyghur-Chinese.

In the experiments, we use 600, 1500, 5000, and 10000 lexicon entries to conduct the parallel sentence obtaining process. From Figures 3 and 4, we can immediately see the important role the bilingual lexicon plays in the process of obtaining parallel sentences. We observe that Bitextor does not


Table 2: The size and accuracy of obtained parallel sentences for different numbers of training parallel sentences.

Model                  2000     5000     10000    20000    40000
LSTM       size        13000    33000    65000    92000    126000
           accuracy    0.60     0.71     0.78     0.81     0.82
C-BiRNN    size        14000    28000    58000    86000    121000
           accuracy    0.58     0.63     0.68     0.70     0.72

Figure 3: Precision of the result as the number of bilingual lexicon entries varies (x-axis: lexicon entries, 2000-10000; y-axis: precision, 0.70-0.90; curves: Bitextor and Ours).

Figure 4: Size of the result as the number of bilingual lexicon entries varies (x-axis: lexicon entries, 2000-10000; y-axis: number of sentence pairs, 10000-70000; curves: Bitextor and Ours).

obtain parallel sentences when the size of the bilingual lexicon is very small. However, we see that our system can get a very respectable result under low resources. We can easily find that Bitextor has a very unstable performance across different lexicon sizes, whereas ours keeps relatively stable results in both the number and the accuracy of obtained parallel sentences. Even when the lexicon has only a few entries, such as 600, ours can get a very respectable number and accuracy of parallel sentences. This result, combined with the inadequate performance of the baseline, conforms to our expectation of obtaining parallel sentences for low-resource language pairs.

From Figures 3 and 4, we can obtain many parallel sentences with a high accuracy. However, we still find that we cannot obtain enough parallel sentences for actual natural language processing tasks such as SMT. We analyze two factors affecting the number of obtained parallel sentences. (1) Despite constructing word embeddings and using the methods in Section 3, we can find more bilingual signals; however, we only retain words that occur at least 1000 times (a lower frequency threshold makes the accuracy very low, and low-frequency words cannot get a fine word embedding), and this seriously limits the size of the obtained bilingual signal. (2) We set a time window to filter noise, which prevents a large number of candidates from being included in the final parallel sentences.
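The frequency cutoff mentioned in factor (1) can be sketched as follows. The 1000-occurrence threshold comes from the text; the function name and the toy corpus are our own illustration.

```python
from collections import Counter

def embedding_vocabulary(tokens, min_count=1000):
    """Keep only words frequent enough to get a reliable embedding,
    as in the filtering step described above."""
    counts = Counter(tokens)
    return {w for w, c in counts.items() if c >= min_count}

# Toy corpus: "the" occurs 1200 times, "rare" only 3 times.
corpus = ["the"] * 1200 + ["rare"] * 3
print(embedding_vocabulary(corpus))  # {'the'}
```

This makes the trade-off concrete: dropping rare words protects embedding quality but directly shrinks the vocabulary from which bilingual signal can be induced.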

4.3. Effect of Parallel Sentences on Classifier Experiments. Using the bilingual signal, we can obtain a certain number of parallel sentences. From Section 4.2, we can see that the number of obtained parallel sentences is not large enough. Constructing a bilingual classifier is a good method to filter bilingual parallel sentences from monolingual corpora. In this section, we will discuss how the classifier affects the obtaining of parallel sentences and which factors affect the process.

In this experiment, we construct LSTM-based and classical bidirectional recurrent neural network (C-BiRNN) classifiers to filter parallel sentences from monolingual corpora. At the same time, we use 2000, 5000, 10000, 20000, and 40000 parallel sentences to train the classifiers. Table 2 shows the results of the tested systems for Uyghur-Chinese. We can observe that the two neural networks have different performance in both the size and the accuracy of obtained parallel sentences. From the table, the LSTM has a better result than the other in both size and accuracy. We attribute this to the fact that the LSTM has a better neural network structure for remembering more information than the classical bidirectional recurrent neural network (C-BiRNN). Another interesting finding is that the size of the training corpus plays a big role in filtering bilingual parallel sentences. When the number of training parallel sentences is only 2000, the two tested systems obtain only a few results, and, most unacceptably, the results have such a low accuracy that they cannot be used in any natural language processing task. However, as the training parallel sentences increase, the size and accuracy have a


Table 3: Statistics of the size and precision of parallel sentences extracted from multilingual websites.

Model            Training corpus    Sentences    Precision
Bitextor&LSTM    30000              117900       0.70
                 40000              124200       0.70
Ours&LSTM        30000              120200       0.81
                 40000              127900       0.82

Table 4: BLEU scores on Uyghur-Chinese SMT using different training corpora.

Model                           BLEU     Sentences
Bitextor&LSTM&SMT (baseline)    5.6      100000
Ours&LSTM&SMT                   15.81    100000

great improvement for both C-BiRNN and LSTM. We can conclude that the number of training parallel sentences has a big influence on the performance of the classifier. This conclusion demonstrates the importance of our inducing of the bilingual signal: only by the methods detailed in Section 3.2 and the experiment in Section 4.2 can we obtain enough parallel sentences to train a state-of-the-art classifier.

4.4. Machine Translation Evaluation. Our final objective in obtaining parallel sentences is training a machine translation system to perform the translation task for a low-resource language pair. In order to justify the effectiveness of our methods, we obtain parallel sentences to construct a machine translation system for the low-resource Uyghur-Chinese language pair and evaluate its quality by measuring the BLEU score of the SMT system. We use the state-of-the-art free/open-source Moses [19] to train a phrase-based translation system.

In our experiment, we use Bitextor and our method to obtain training parallel sentences, and the classifiers all use the LSTM neural network. The reason for using Bitextor is that we need a baseline system to measure against. For the two methods, we select 30000 and 40000 sentence pairs as training corpora to construct the classifiers and obtain enough training corpus to train the machine translation system. The first experiment is selecting a sufficient number of parallel sentences (see Table 3). We can see that ours exceeds Bitextor when using the same classifier. Although both can get many candidate parallel sentences, the results of Bitextor have a low precision. We attribute this to the fact that Bitextor needs a sufficient bilingual lexicon while ours does not.

Next, we use the extraction procedure described in Section 3 to train several Uyghur-Chinese machine translation systems. For the baseline SMT system, we use parallel sentences obtained by Bitextor to train a classifier, which then obtains the final training corpus for the SMT system. Table 4 shows the BLEU scores for the different SMT systems.

We can see that our approach gets a higher BLEU score than the baseline. In the experiment, both use 30000 sentence pairs to train the classifier. Combining Table 3 with Table 4, we believe that the baseline cannot get a very high accuracy of parallel sentences, which makes the SMT system perform poorly. As we all know, the quality of the training corpus heavily affects the performance of an SMT system. We further note that Bitextor needs a bilingual lexicon to guarantee a high accuracy of the parallel corpus; although it is an excellent system for obtaining parallel corpora, it shows a poor performance for low-resource language pairs. This experiment clearly indicates the benefit of obtaining parallel sentences using our method. It is important to note that we can construct a machine translation system for low-resource language pairs.

5. Conclusion

In this paper, we present a new minimal-supervision method to obtain parallel sentences for solving the low-resource problem in natural language processing. Our experiments show that our approach outperforms the traditional system in obtaining parallel corpora from multilingual websites for low-resource language pairs.

Our method mainly contains three steps. First, we use Word2vec to train two monolingual word embeddings; with a small bilingual lexicon of about hundreds of words, we can induce more bilingual signals. Then, a word-overlap model finds some parallel sentences; this step avoids the effect of HTML structure, as current websites are developed from dynamic modules. Finally, we construct an LSTM-BiRNN classifier to extract parallel sentences, using the parallel corpus obtained in the previous step to train this classifier and perform the extraction process. We use the finally obtained parallel sentences to construct a Uyghur-Chinese SMT system to measure our method. The experiments indicate that our method can get state-of-the-art results for a low-resource language pair.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work is supported by the Xinjiang Fun (Grant no2015KL031) the West Light Foundation of the ChineseAcademy of Sciences (Grant no 2015-XBQN-B-10) theXinjiang Science and Technology Major Project (Grant no

Mathematical Problems in Engineering 7

2016A03007-3) and Natural Science Foundation of Xinjiang(Grant no 2015211B034)

References

[1] L Barbosa V Sridhar K and M Yarmohammadi ldquoHarvestingParallel Text in Multiple Languages with Limited SupervisionrdquoInternational Conference on Computational Linguistics pp 201ndash214 2012

[2] D S Munteanu and D Marcu ldquoImproving machine translationperformance by exploiting non-parallel corporardquo Computa-tional Linguistics vol 31 no 4 pp 477ndash504 2005

[3] M Espla-Gomis M Forcada S Ortiz Rojas and J Ferrandez-Tordera ldquoBitextorrsquos participation in WMTrsquo16 shared task ondocument alignmentrdquo in Proceedings of the First Conference onMachine Translation Volume 2 Shared Task Papers pp 685ndash691 Berlin Germany August 2016

[4] W Ling L Marujo C Dyer A W Black and I TrancosoldquoCrowdsourcing High-Quality Parallel Data Extraction fromTwitterrdquo in Proceedings of the Ninth Workshop on StatisticalMachine Translation pp 426ndash436 Baltimore Maryland USAJune 2014

[5] A Khwileh H Afli G Jones and A Way ldquoIdentifyingEffectiveTranslations forCross-lingualArabic-to-EnglishUser-generated Speech Searchrdquo in Proceedings of the Third ArabicNatural Language Processing Workshop pp 100ndash109 ValenciaSpain April 2017

[6] F Gregoire and P Langlais ldquoA Deep Neural Network ApproachTo Parallel Sentence Extractionrdquo 2017 httpsarxivorgabs170909783

[7] J R Smith C Quirk and K Toutanova ldquoExtracting par-allel sentences from comparable corpora using documentlevel alignmentrdquo in Proceedings of the 2010 Human LanguageTechnologies Conference ofthe North American Chapter of theAssociation for Computational Linguistics NAACL HLT 2010pp 403ndash411 USA June 2010

[8] C Tillmann and S Hewavitharana ldquoAn efficient unified extrac-tion algorithm for bilingual datardquo in Proceedings of the 12thAnnual Conference of the International Speech CommunicationAssociation INTERSPEECH 2011 pp 2093ndash2096 Italy August2011

[9] R G Hussain M A Ghazanfar M A Azam U Naeemand S Ur Rehman ldquoA performance comparison of machinelearning classification approaches for robust activity of dailyliving recognitionrdquo Artificial Intelligence Review pp 1ndash23 2018

[10] M A Ghazanfar S A Alahmari Y F Aldhafiri AMustaqeemM Maqsood and M A Azam ldquoUsing machine learning clas-sifiers to predict stock exchange indexrdquo International Journal ofMachine Learning and Computing vol 7 no 2 pp 24ndash29 2017

[11] C Chu T Nakazawa and S Kurohashi ldquoConstructing aChinese-Japanese parallel corpus from wikipediardquo in Proceed-ings of the 9th International Conference on Language Resourcesand Evaluation LREC 2014 pp 642ndash647 Iceland May 2014

[12] A Barron-Cedeno C Espana-Bonet J Boldoba and LMarquez ldquoA Factory of Comparable Corpora fromWikipediardquoin Proceedings of the Eighth Workshop on Building and UsingComparable Corpora pp 3ndash13 Beijing China July 2015

[13] V K Rangarajan Sridhar L Barbosa and S Bangalore ldquoAscalable approach to building a parallel corpus from the Webrdquoin Proceedings of the 12th Annual Conference of the InternationalSpeech Communication Association INTERSPEECH 2011 pp2113ndash2116 Italy August 2011

[14] A Antonova and A Misyurev ldquoBuilding a web-based parallelcorpus and filtering outmachine-translated textrdquoTheWorkshopon Building Using Comparable Corpora Comparable Corporathe Web pp 136ndash144 2011

[15] V Papavassiliou P Prokopidis and G Thurmair ldquoA modularopen-source focused crawler for mining monolingual andbilingual corpora from the webrdquo The Workshop on Building ampUsing Comparable Corpora pp 43ndash51 2013

[16] T Mikolov K Chen and G Corrado ldquoEfficient Estimationof Word Representations in Vector Spacerdquo Computation andLanguage 2013

[17] M Zhang H Peng Y Liu H Luan and M Sun ldquoBilinguallexicon induction from non-parallel data with minimal super-visionrdquo in Proceedings of the 31st AAAI Conference on ArtificialIntelligence AAAI 2017 pp 3379ndash3385 USA February 2017

[18] S Gouws Y Bengio and G Corrado ldquoBilBOWA Fast bilin-gual distributed representations without word alignmentsrdquo inProceedings of the 32nd International Conference on MachineLearning ICML 2015 pp 748ndash756 France July 2015

[19] P Koehn R Zens C Dyer et al ldquoMoses open source toolkitfor statistical machine translationrdquo in Proceedings of the 45thAnnual Meeting of the ACL on Interactive Poster and Demon-stration Sessions (ACL rsquo07) pp 177ndash180 Prague CzechRepublicJune 2007


Figure 1: The architecture of obtaining parallel sentences. (The pipeline runs from a list of URLs through a website crawler, HTML tag/URL analysis, and sentence splitting to L1/L2 monolingual word embeddings, which expose the bilingual signal and a bilingual lexicon used to extract parallel sentences.)

parallel corpus obtained by the word-overlap model to train this classifier and perform the extraction process. To justify the effectiveness of the proposed approach, we obtain a Uyghur-Chinese parallel corpus from multilingual websites to train SMT systems and show improvements in BLEU (bilingual evaluation understudy) scores. Our experiments also show that we can achieve promising results without any domain-specific feature engineering or external resources.

2. Related Works

The amount of information available on the Internet is expanding rapidly, and many works attempt to construct training corpora for machine translation from websites. A variety of approaches have been proposed to extract parallel sentences from the web. These approaches can be divided into two strategies.

First, many approaches treat collecting parallel sentences as a text classification problem [6, 13, 14], using, for example, SVM or neural network classifiers. Reference [6] proposed a siamese bidirectional recurrent neural network to build a state-of-the-art classifier that detects parallel sentences. It removes the need for any domain-specific feature engineering or reliance on multiple models, requiring only raw parallel sentences. However, parallel sentences are themselves scarce for low-resource language pairs, so this otherwise excellent method may not be suitable for low-resource applications.

Second, many other works use the HTML structure of web pages, URLs, image alt text, and similar cues to detect possible parallel sentences [1, 3, 15]. For instance, [7] uses the links between translated articles in Wikipedia to crawl parallel sentences and words. These methods have proven useful for specific websites; the real challenge is to find strategies that extend them to crawl the Web in an unsupervised fashion.

Esplà-Gomis et al. developed an excellent tool, Bitextor, a free/open-source tool for harvesting parallel data from multilingual websites. It is highly modular and aimed at allowing users to easily obtain segment-aligned parallel corpora from the Internet. It mainly obtains parallel sentences by comparing the HTML structure of documents and the number of words aligned by a bilingual lexicon. Users only provide a bilingual lexicon, and the system extracts parallel data quickly and automatically. The real challenge is that a bilingual lexicon is not easy to obtain for some low-resource language pairs.

3. Methodology

The first step of obtaining a parallel corpus is harvesting a source of data. We use a web crawler to harvest monolingual data and construct continuous word representations. Following the work of [3], we use multiple features to select the candidate data. We then extend the works of [16, 17] to learn a bilingual signal, with whose help we can induce parallel sentences. The general architecture of obtaining the parallel corpus is presented in Figure 1.

3.1. Crawling Web Data and Candidate Documents. The first step of harvesting a bilingual parallel corpus is using a web crawler to download data. However, unlike previous works that downloaded a mirror of each webpage, we only download text that does not contain HTML tags. As current websites are built from modules, pages on the same theme usually share the same HTML structure.

For downloading we use the Scrapy toolkit (https://pypi.python.org/pypi/Scrapy), written in Python. It is an excellent toolkit that lets the user specify exactly which content to crawl. The next step is selecting candidate document pairs. A website contains hundreds of thousands of documents, and matching against the whole website would be very slow and imprecise. To solve this problem, we borrow the idea of [2] and add a window of time. A key characteristic of news websites is time: every webpage has a publication time, and documents on the same topic are often reported within a short period in different


languages. Thus, we use a heuristic which assumes that documents with similar content are more likely to have publication dates close to each other. Each query is therefore run only against source documents published within a window of a few days around the publication date of the target query document. We set the window size to three days, so each query searches only a few documents and achieves higher precision. The next section introduces how we identify two multilingual documents as parallel.
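The time-window heuristic above can be sketched as follows; `candidate_pairs` and its arguments are illustrative names, not part of the authors' released code.

```python
from datetime import date, timedelta

def candidate_pairs(source_docs, target_docs, window_days=3):
    """Pair each target document only with source documents whose
    publication date falls within +/- window_days of its own.
    Each doc is a (doc_id, publication_date) tuple (hypothetical format)."""
    window = timedelta(days=window_days)
    pairs = []
    for tgt_id, tgt_date in target_docs:
        for src_id, src_date in source_docs:
            # Keep only candidates inside the publication-date window.
            if abs(src_date - tgt_date) <= window:
                pairs.append((src_id, tgt_id))
    return pairs
```

With the three-day window used in the paper, a target article published on April 3 would only be compared against source articles from March 31 to April 6, drastically shrinking the search space.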

3.2. Inducing Bilingual Signal. In this paper we follow the works of [16, 17], which induce a bilingual lexicon from non-parallel data. To learn a bilingual lexicon from monolingual corpora, we must construct a bilingual semantic representation. However, unlike the usual task of learning a precise bilingual lexicon, our objective is to harvest more bilingual signal from multilingual data; in this step we care more about recall than precision. Our objective function is

$$\mathcal{T}(W^i_{V_s}, W^j_{V_t}) = \alpha \mathcal{T}_{mono} + \beta \mathcal{T}_{match} \qquad (1)$$

where $W^i_{V_s}$ is a word in the vocabulary $V_s$, and the reverse direction follows by symmetry for $W^j_{V_t}$. To normalize $\mathcal{T}(W^i_{V_s}, W^j_{V_t})$, we set the sum of $\alpha$ and $\beta$ to 1. The parameters $\alpha$ and $\beta$ weight the influence of the monolingual and bilingual components.

Unlike the usual monolingual term $\mathcal{T}_{mono}$ that explains regularities in monolingual corpora, we use this term to explain the translation probability between word pairs. Since semantically similar words are closer in distance, we can reveal more translation pairs by measuring the distance of two words from the seeds: if two words are both close to one seed pair, they are more likely translations of each other.

$$\mathcal{T}_{mono} = \mathcal{T}^s_{mono} + \mathcal{T}^t_{mono} \qquad (2)$$

$$\mathcal{T}^s_{mono} = \min_{\langle ss,tt\rangle \in d} \left\| W^i_{V_s} - W^{ss}_{V_s} \right\| \qquad (3)$$

$$\mathcal{T}^t_{mono} = \min_{\langle ss,tt\rangle \in d} \left\| W^i_{V_t} - W^{tt}_{V_t} \right\| \qquad (4)$$

Our monolingual term $\mathcal{T}_{mono}$ encourages the embeddings of word translation pairs from a seed lexicon $d$ to move closer. $W^i_{V_s}$ and $W^{ss}_{V_s}$ are two source-side words, the latter from the seed lexicon; $\mathcal{T}^s_{mono}$ computes the semantic similarity between words $i$ and $ss$. The target side has the same definition.
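Under the definitions above, Eqs. (2)-(4) can be sketched with NumPy; the function and argument names are illustrative, not the authors' implementation.

```python
import numpy as np

def t_mono(word_src, word_tgt, seed_lexicon, emb_src, emb_tgt):
    """Eq. (2): sum of the source-side (Eq. 3) and target-side (Eq. 4)
    distances from a candidate word to its nearest seed-lexicon word.
    seed_lexicon is a list of (ss, tt) pairs; emb_* map words to vectors."""
    t_s = min(np.linalg.norm(emb_src[word_src] - emb_src[ss])
              for ss, _ in seed_lexicon)          # Eq. (3)
    t_t = min(np.linalg.norm(emb_tgt[word_tgt] - emb_tgt[tt])
              for _, tt in seed_lexicon)          # Eq. (4)
    return t_s + t_t                              # Eq. (2)
```

A small value means the candidate pair sits close to some seed pair on both sides, which is exactly the signal the monolingual term rewards.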

Our matching term $\mathcal{T}_{match}$ exposes how source words translate to target words. It is defined as

$$\mathcal{T}_{match} = \operatorname{argmax}_{s \in V_s,\ s \notin d} \left[ \mathcal{M}_{st} \right] \cos\left( W^s_{V_s}, W^t_{V_t} \right) \qquad (5)$$

As we learn the bilingual signal from monolingual corpora, the source and target word vectors are trained independently of each other and therefore do not lie in one vector space. To solve this problem, we follow the method of [18] to convert the monolingual vector spaces into a shared space. Our objective is to optimize the cross-lingual matching regularizer

$$\mathcal{M}_{st} = \sum_i \sum_j a_{ij} \left\| w^s_i - w^t_j \right\|^2 \qquad (6)$$

$$= (R^S - R^T)^\top A \, (R^S - R^T) \qquad (7)$$

In the formula above, $A$ is the word-similarity matrix, where $a_{ij}$ encodes the translation score of word $i$ in the source with word $j$ in the target. $w^s_i$ is a K-dimensional word embedding; the embeddings are stacked to form a (V×K)-dimensional matrix $R$.

A simple example illustrates this procedure. Assume we have an English lexicon {perform, believe, talk}, a Chinese lexicon {zhixing, shixing, jiaotan}, and an English-Chinese seed lexicon {(conduct, jinxing)}. Having already carried out steps (1)-(2), we can calculate that "perform" is close to "conduct" in the source space and that "zhixing" and "shixing" are close to "jinxing" in the target space. We can therefore add (perform, zhixing) and (perform, shixing) to the original lexicon.
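The worked example above can be sketched as a nearest-neighbor expansion of the seed lexicon. This is an illustrative reconstruction under the assumption that both embedding spaces have already been projected into the shared space of Eq. (6)-(7); the names, vectors, and threshold are hypothetical.

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def expand_lexicon(seed, emb_src, emb_tgt, threshold=0.9):
    """For every non-seed source word close to a seed's source side,
    pair it with every target word close to that seed's target side
    (sketch of the bilingual-signal induction in Section 3.2)."""
    new_pairs = []
    seed_src = {s for s, _ in seed}
    for s, vs in emb_src.items():
        if s in seed_src:                      # seed entries stay as-is
            continue
        for seed_s, seed_t in seed:
            if cosine(vs, emb_src[seed_s]) < threshold:
                continue                       # not near this seed's source word
            for t, vt in emb_tgt.items():
                if cosine(vt, emb_tgt[seed_t]) >= threshold:
                    new_pairs.append((s, t))   # induced bilingual signal
    return new_pairs
```

Run on the paper's example, "perform" (near "conduct") gets paired with "zhixing" and "shixing" (both near "jinxing"), while unrelated words such as "talk" produce no new entries.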

3.3. Parallel Sentence Identification. In our particular situation of severely low-resource language pairs, although a classifier is a good method to identify parallel sentences, we do not have enough parallel sentences to train one. So, in the initial stage, we use a word-overlap model to select parallel sentences. The word-overlap model borrows a bilingual lexicon, and parallel sentences are identified by the number of co-occurring word pairs. This process can be represented as

$$\mathrm{score}(d_s, d_t) = \frac{|d_s \cap d_t|}{\min(|d_s|, |d_t|)} \qquad (8)$$

From the above we can conclude that inducing the bilingual signal is an important step. Using the bilingual lexicon, we can quickly compute word alignments between two sentences. To obtain massive numbers of parallel sentences, we would need a high-coverage bilingual lexicon; without one, the word-overlap model yields only a small parallel corpus, as the experiments in Section 4 confirm. To obtain a large number of parallel sentences with high accuracy, we next use a classifier to collect more parallel data; classifiers have already proved to be an excellent method for extracting parallel sentences. We follow the work of [6] to train a BiRNN (bidirectional recurrent neural network) classifier. Our neural network architecture is illustrated in Figure 2.

Like most previous approaches that train a neural network classifier on parallel sentences, our method converts the sentences into vectors. However, instead of using word vectors as input, we use fixed-size sentence vectors. For the ReLU layer we define

$$x_i = \mathrm{sigm}(w_i s_i + b) \qquad (9)$$


Figure 2: Architecture of the bidirectional recurrent neural network (a Bi-LSTM with forward states $h^f$, backward states $h^b$, a ReLU layer, and fully connected layers). The fully connected layers predict the probability that a sentence pair is parallel.

The BiRNN layer contains feed-forward and feed-backward recurrent layers, described by

$$h^f_i = \sigma\left(w^f_{xh} x_i + w^f_{hh} h_{i-1} + b^f\right) \qquad (10)$$

$$h^b_i = \sigma\left(w^b_{xh} x_i + w^b_{hh} h_{i+1} + b^b\right) \qquad (11)$$

$$\mathbf{h} = h^f_i + h^b_i \qquad (12)$$

For prediction, a sentence pair is identified as parallel if the probability exceeds a threshold:

$$y = \begin{cases} 1 & \text{if } p(y = 1 \mid \mathbf{h}) > \sigma \\ 0 & \text{otherwise} \end{cases} \qquad (13)$$

4. Experiments

To assess the effectiveness of our method, we compare it in different settings against the baseline. As we mainly aim to address the low-resource problem and construct a low-resource language-pair translation system, we conduct a detailed study on Uyghur-Chinese as an instantiation of this goal.

4.1. Experiment Setup

4.1.1. Data. In our experiments the systems obtain bilingual parallel corpora from three multilingual websites: TianShan (http://www.ts.cn), RenMin (http://www.people.com.cn), and KunLun (http://www.xjkunlun.cn). We only retain webpages whose documents have more than 20 words. The statistics of the preprocessed corpora are given in Table 1. Note that we only select sentences longer than 10 words. The data of our experiment is available at https://pan.baidu.com/s/1EePrHOjhuN-jTb-vNiSgTA.

Table 1: Experiment set statistics.

Website   | Language | Webpages | Sentences
TianShan  | Chinese  | 249,238  | 3,839,000
TianShan  | Uyghur   | 48,907   | 427,000
RenMin    | Chinese  | 451,972  | 5,500,000
RenMin    | Uyghur   | 99,578   | 590,000
KunLun    | Chinese  | 44,046   | 641,000
KunLun    | Uyghur   | 27,419   | 324,000

4.1.2. Evaluation and Ground Truth. To evaluate the obtained parallel sentences objectively, we use two methods. The first is translation accuracy: the proportion of truly parallel sentence pairs among all obtained sentence pairs. As we obtain data from an open platform, we cannot get a standard set of translation pairs to compute this accuracy automatically, so we manually evaluate a random sample of the obtained parallel sentences; in our experiments we randomly select 500 obtained sentence pairs for manual evaluation. The second method uses the obtained parallel sentences to construct a machine translation system, with the BLEU score as the evaluation metric.

4.1.3. Baseline. For comparison we use the parallel sentence extraction system Bitextor, a free/open-source tool for harvesting parallel data from multilingual websites. The user provides one or more URLs of websites to be processed, the two languages for which the parallel corpus will be produced, and a bilingual lexicon in these two languages. The system automatically analyzes the structure of webpages and obtains parallel data via the bilingual lexicon. We therefore vary the size of the bilingual lexicon to test how it affects the harvesting of parallel sentences.

Another issue is evaluating the classifier. Parallel sentences are needed to train the classifier, which then predicts parallel sentences, and the amount of training data affects classifier performance. We therefore select different numbers of parallel sentences to train the classifier and test it.

4.2. Effect of Bilingual Lexicon Size. To investigate the effectiveness of our system for obtaining parallel sentences in low-resource language pairs, we run ours and Bitextor with bilingual lexicons of different sizes. As Bitextor does not use the time window as a feature to select parallel data, while our system does, we also use the time window to filter Bitextor's results in order to keep the comparison consistent. We record the performance while varying the lexicon size used in the training process, shown in Table 2. The table is for Uyghur-Chinese.

In the experiments we use lexicons of 600, 1,500, 5,000, and 10,000 entries to run the parallel sentence harvesting process. From Figures 3 and 4 we can immediately see the important role the bilingual lexicon plays in obtaining parallel sentences. We observe that Bitextor does not


Table 2: The size and accuracy of obtained parallel sentences for different numbers of training sentence pairs.

Model    | Metric   | 2000   | 5000   | 10000  | 20000  | 40000
LSTM     | size     | 13,000 | 33,000 | 65,000 | 92,000 | 126,000
LSTM     | accuracy | 0.60   | 0.71   | 0.78   | 0.81   | 0.82
C-BiRNN  | size     | 14,000 | 28,000 | 58,000 | 86,000 | 121,000
C-BiRNN  | accuracy | 0.58   | 0.63   | 0.68   | 0.70   | 0.72

Figure 3: Precision of the results as the number of bilingual lexicon entries varies (Bitextor vs. ours).

Figure 4: Size of the results as the number of bilingual lexicon entries varies (Bitextor vs. ours).

obtain parallel sentences when the bilingual lexicon is very small, whereas our system achieves a respectable result under low resources. Bitextor's performance is very unstable across lexicon sizes, while ours stays relatively stable in both the number and the accuracy of obtained parallel sentences. Even when the lexicon has only about 600 entries, ours obtains a respectable number of parallel sentences at good accuracy. This result, combined with the inadequate performance of the baseline, confirms our expectation that we can obtain parallel sentences for low-resource language pairs.

From Figures 3 and 4, we can obtain many parallel sentences with high accuracy. However, we still cannot obtain enough parallel sentences for practical natural language processing tasks such as SMT. Two factors limit the number of obtained parallel sentences: (1) although constructing word embeddings with the methods of Section 3 reveals more bilingual signals, we only retain words that occur at least 1,000 times (a lower threshold makes the accuracy very low, and rare words do not get good embeddings), which seriously limits the size of the bilingual signal; (2) the time window we set to filter noise prevents many candidates from reaching the final parallel sentences.

4.3. Effect of Parallel Sentences on Classifier Experiments. Using the bilingual signal, we can obtain a certain number of parallel sentences, but Section 4.2 shows that this number is not large enough. Building a bilingual classifier is a good way to filter parallel sentences from monolingual corpora. In this section we discuss how the classifier affects the harvesting of parallel sentences and which factors affect the process.

In this experiment we construct LSTM-based and classical bidirectional recurrent neural network (C-BiRNN) classifiers to filter parallel sentences from monolingual corpora, training each on 2,000, 5,000, 10,000, 20,000, and 40,000 parallel sentences. Table 2 shows the results of the tested systems for Uyghur-Chinese. The two networks perform differently in both the size and the accuracy of obtained parallel sentences; the LSTM does better on both counts, which we attribute to its network structure remembering more information than the classical C-BiRNN. Another interesting finding is that the size of the training corpus plays a big role in filtering bilingual parallel sentences. When only 2,000 training pairs are available, both systems obtain few results, and, worse, at an accuracy too low for any natural language processing task. As the training parallel sentences increase, the size and accuracy show a


Table 3: Size and precision of parallel sentences extracted from multilingual websites.

Model          | Training corpus | Sentences | Precision
Bitextor&LSTM  | 30,000          | 117,900   | 0.70
Bitextor&LSTM  | 40,000          | 124,200   | 0.70
Ours&LSTM      | 30,000          | 120,200   | 0.81
Ours&LSTM      | 40,000          | 127,900   | 0.82

Table 4: BLEU scores on Uyghur-Chinese SMT using different training corpora.

Model                         | BLEU  | Sentences
Bitextor&LSTM&SMT (baseline)  | 5.6   | 100,000
Ours&LSTM&SMT                 | 15.81 | 100,000

great improvement for both C-BiRNN and LSTM. We conclude that the number of training parallel sentences strongly influences classifier performance, which underlines the importance of our bilingual signal induction: only through the methods detailed in Section 3.2 and the experiment in Section 4.2 can we obtain enough parallel sentences to train a state-of-the-art classifier.

4.4. Machine Translation Evaluation. Our final objective in obtaining parallel sentences is to train a machine translation system for a low-resource language pair. To justify the effectiveness of our methods, we use the obtained parallel sentences to construct a machine translation system for the low-resource Uyghur-Chinese pair and evaluate its quality by measuring the BLEU score of the SMT system. We use the state-of-the-art free/open-source Moses [19] to train a phrase-based translation system.

In our experiment we use Bitextor and our method to obtain training parallel sentences; both classifiers use the LSTM network. We use Bitextor because we need a baseline system to measure against. For each method we select 30,000 and 40,000 sentence pairs as classifier training corpora and then harvest enough training data for the machine translation system. The first experiment selects the parallel sentences (see Table 3). Ours exceeds Bitextor under the same classifier: although both methods find many candidate parallel sentences, Bitextor's results have low precision. We attribute this to Bitextor needing a sufficiently large bilingual lexicon, which our method does not.

Next, we use the full extraction procedure described in Section 3 to train Uyghur-Chinese machine translation systems. For the baseline SMT system, we use parallel sentences obtained by Bitextor to train a classifier that harvests the final training corpus. Table 4 shows the BLEU scores for the different SMT systems.

We can see that our approach achieves a higher BLEU score than the baseline; in this experiment both use 30,000 sentence pairs to train the classifier. Combining Table 3 with Table 4, we believe that the baseline cannot reach a high accuracy of parallel sentences, which gives the SMT system a low performance. As is well known, the quality of the training

corpus heavily affects the performance of an SMT system. Our further analysis is that Bitextor needs a bilingual lexicon to guarantee a high-accuracy parallel corpus; although it is an excellent system for obtaining parallel corpora, it performs poorly for low-resource language pairs. This experiment clearly indicates the benefit of obtaining parallel sentences with our method. Importantly, it lets us construct a machine translation system for low-resource language pairs.

5. Conclusion

In this paper we present a new minimally supervised method to obtain parallel sentences for addressing the low-resource problem in natural language processing. Our experiments show that our approach outperforms a traditional system at obtaining parallel corpora from multilingual websites for low-resource language pairs.

Our method mainly contains three steps. First, we use Word2vec to train two monolingual word embeddings; with a small bilingual lexicon of a few hundred words, we can induce more bilingual signals. Then a word-overlap model finds some parallel sentences; this step avoids the effect of HTML structure, as current websites are built from dynamic modules. Finally, we construct an LSTM-BiRNN classifier to extract parallel sentences, training it on the parallel corpus obtained in the previous step and then performing the extraction. We use the final parallel sentences to construct a Uyghur-Chinese SMT system to measure our method. The experiments indicate that our method achieves state-of-the-art results for a low-resource language pair.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work is supported by the Xinjiang Fund (Grant no. 2015KL031), the West Light Foundation of the Chinese Academy of Sciences (Grant no. 2015-XBQN-B-10), the Xinjiang Science and Technology Major Project (Grant no. 2016A03007-3), and the Natural Science Foundation of Xinjiang (Grant no. 2015211B034).

References

[1] L. Barbosa, V. K. Sridhar, and M. Yarmohammadi, "Harvesting Parallel Text in Multiple Languages with Limited Supervision," International Conference on Computational Linguistics, pp. 201–214, 2012.

[2] D. S. Munteanu and D. Marcu, "Improving machine translation performance by exploiting non-parallel corpora," Computational Linguistics, vol. 31, no. 4, pp. 477–504, 2005.

[3] M. Esplà-Gomis, M. Forcada, S. Ortiz Rojas, and J. Ferrández-Tordera, "Bitextor's participation in WMT'16 shared task on document alignment," in Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers, pp. 685–691, Berlin, Germany, August 2016.

[4] W. Ling, L. Marujo, C. Dyer, A. W. Black, and I. Trancoso, "Crowdsourcing High-Quality Parallel Data Extraction from Twitter," in Proceedings of the Ninth Workshop on Statistical Machine Translation, pp. 426–436, Baltimore, Maryland, USA, June 2014.

[5] A. Khwileh, H. Afli, G. Jones, and A. Way, "Identifying Effective Translations for Cross-lingual Arabic-to-English User-generated Speech Search," in Proceedings of the Third Arabic Natural Language Processing Workshop, pp. 100–109, Valencia, Spain, April 2017.

[6] F. Grégoire and P. Langlais, "A Deep Neural Network Approach To Parallel Sentence Extraction," 2017, https://arxiv.org/abs/1709.09783.

[7] J. R. Smith, C. Quirk, and K. Toutanova, "Extracting parallel sentences from comparable corpora using document level alignment," in Proceedings of the 2010 Human Language Technologies Conference of the North American Chapter of the Association for Computational Linguistics, NAACL HLT 2010, pp. 403–411, USA, June 2010.

[8] C. Tillmann and S. Hewavitharana, "An efficient unified extraction algorithm for bilingual data," in Proceedings of the 12th Annual Conference of the International Speech Communication Association, INTERSPEECH 2011, pp. 2093–2096, Italy, August 2011.

[9] R. G. Hussain, M. A. Ghazanfar, M. A. Azam, U. Naeem, and S. Ur Rehman, "A performance comparison of machine learning classification approaches for robust activity of daily living recognition," Artificial Intelligence Review, pp. 1–23, 2018.

[10] M. A. Ghazanfar, S. A. Alahmari, Y. F. Aldhafiri, A. Mustaqeem, M. Maqsood, and M. A. Azam, "Using machine learning classifiers to predict stock exchange index," International Journal of Machine Learning and Computing, vol. 7, no. 2, pp. 24–29, 2017.

[11] C. Chu, T. Nakazawa, and S. Kurohashi, "Constructing a Chinese-Japanese parallel corpus from Wikipedia," in Proceedings of the 9th International Conference on Language Resources and Evaluation, LREC 2014, pp. 642–647, Iceland, May 2014.

[12] A. Barrón-Cedeño, C. España-Bonet, J. Boldoba, and L. Màrquez, "A Factory of Comparable Corpora from Wikipedia," in Proceedings of the Eighth Workshop on Building and Using Comparable Corpora, pp. 3–13, Beijing, China, July 2015.

[13] V. K. Rangarajan Sridhar, L. Barbosa, and S. Bangalore, "A scalable approach to building a parallel corpus from the Web," in Proceedings of the 12th Annual Conference of the International Speech Communication Association, INTERSPEECH 2011, pp. 2113–2116, Italy, August 2011.

[14] A. Antonova and A. Misyurev, "Building a web-based parallel corpus and filtering out machine-translated text," The Workshop on Building and Using Comparable Corpora: Comparable Corpora and the Web, pp. 136–144, 2011.

[15] V. Papavassiliou, P. Prokopidis, and G. Thurmair, "A modular open-source focused crawler for mining monolingual and bilingual corpora from the web," The Workshop on Building & Using Comparable Corpora, pp. 43–51, 2013.

[16] T. Mikolov, K. Chen, and G. Corrado, "Efficient Estimation of Word Representations in Vector Space," Computation and Language, 2013.

[17] M. Zhang, H. Peng, Y. Liu, H. Luan, and M. Sun, "Bilingual lexicon induction from non-parallel data with minimal supervision," in Proceedings of the 31st AAAI Conference on Artificial Intelligence, AAAI 2017, pp. 3379–3385, USA, February 2017.

[18] S. Gouws, Y. Bengio, and G. Corrado, "BilBOWA: Fast bilingual distributed representations without word alignments," in Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, pp. 748–756, France, July 2015.

[19] P. Koehn, R. Zens, C. Dyer et al., "Moses: open source toolkit for statistical machine translation," in Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions (ACL '07), pp. 177–180, Prague, Czech Republic, June 2007.

[19] P Koehn R Zens C Dyer et al ldquoMoses open source toolkitfor statistical machine translationrdquo in Proceedings of the 45thAnnual Meeting of the ACL on Interactive Poster and Demon-stration Sessions (ACL rsquo07) pp 177ndash180 Prague CzechRepublicJune 2007


Mathematical Problems in Engineering 3

language. Thus, we use a heuristic which assumes that documents with similar content are likely to have publication dates close to each other. Therefore, each query is in fact run only against source documents published within a window of some days around the publication date of the target query document. In this procedure we set the window size to three days. Each query then searches far fewer documents and achieves a higher precision. In the next section we introduce how to identify two multilingual documents that are parallel.
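As an illustrative sketch (not the authors' code; the document structure and field names here are hypothetical), the three-day window heuristic can be expressed as:

```python
from datetime import date, timedelta

def candidate_documents(target_date, source_docs, window_days=3):
    """Keep only source documents published within +/- window_days of the
    target document's publication date (the paper's window heuristic)."""
    window = timedelta(days=window_days)
    return [d for d in source_docs if abs(d["date"] - target_date) <= window]

# Hypothetical toy documents, each with a publication date.
docs = [
    {"id": "a", "date": date(2018, 5, 1)},
    {"id": "b", "date": date(2018, 5, 3)},
    {"id": "c", "date": date(2018, 5, 9)},
]
print([d["id"] for d in candidate_documents(date(2018, 5, 2), docs)])  # ['a', 'b']
```

Narrowing the candidate pool this way trades a little recall for much higher precision, which matches the paper's stated goal for this stage.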

3.2. Inducing Bilingual Signal. In this paper we follow the works of [16, 17], which induce bilingual lexicons from non-parallel data. In order to learn a bilingual lexicon from monolingual corpora, we must construct a bilingual semantic representation. However, unlike the usual task of learning a precise bilingual lexicon, our objective is harvesting more bilingual signal from multilingual data; in this step we care more about recall than about precision. Our objective function is

$\mathcal{T}(W_i^{V_s}, W_j^{V_t}) = \alpha\,\mathcal{T}_{mono} + \beta\,\mathcal{T}_{match}$  (1)

where $W_i^{V_s}$ is a word in the vocabulary $V_s$, and the reverse direction follows by symmetry for $W_j^{V_t}$. To normalize $\mathcal{T}(W_i^{V_s}, W_j^{V_t})$, we set the sum of $\alpha$ and $\beta$ to 1; the parameters $\alpha$ and $\beta$ control the influence of the monolingual and bilingual components.

Unlike the usual monolingual term $\mathcal{T}_{mono}$ that explains regularities in monolingual corpora, we use this term to model the translation probability between two words. Since semantically similar words are closer in embedding space, we can reveal more translation pairs by measuring the distance of words from the seeds: if two words each lie close to the two sides of one seed pair, they are more likely to be translations of each other.

$\mathcal{T}_{mono} = \mathcal{T}_{mono}^{s} + \mathcal{T}_{mono}^{t}$  (2)

$\mathcal{T}_{mono}^{s} = \min_{\langle ss,\,tt\rangle \in d} \left\| W_i^{V_s} - W_{ss}^{V_s} \right\|$  (3)

$\mathcal{T}_{mono}^{t} = \min_{\langle ss,\,tt\rangle \in d} \left\| W_j^{V_t} - W_{tt}^{V_t} \right\|$  (4)

Our monolingual term $\mathcal{T}_{mono}$ encourages the embeddings of word translation pairs from a seed lexicon $d$ to move closer. $W_i^{V_s}$ and $W_{ss}^{V_s}$ are two source words, the latter drawn from the seed lexicon; $\mathcal{T}_{mono}^{s}$ computes the semantic similarity between the words $i$ and $ss$. For the target side we have the same definition.
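A minimal numpy sketch of the monolingual term of Eqs. (3)-(4): the score of a candidate word is its distance to the nearest seed-lexicon word on the same side. The function name and the toy 2-d vectors are ours, for illustration only:

```python
import numpy as np

def mono_term(word_vec, seed_vecs):
    """T_mono for one language side: Euclidean distance from a candidate
    word's embedding to the nearest seed-lexicon embedding."""
    return min(float(np.linalg.norm(word_vec - s)) for s in seed_vecs)

# Toy 2-d embeddings: seeds at (0, 0) and (4, 0), candidate at (1, 0).
seeds = [np.array([0.0, 0.0]), np.array([4.0, 0.0])]
print(mono_term(np.array([1.0, 0.0]), seeds))  # 1.0 (nearest seed is (0, 0))
```

A small value means the candidate sits in a well-charted region of the seed lexicon, which is exactly when the distance signal is most trustworthy.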

Our matching term $\mathcal{T}_{match}$ captures how source words translate to target words. The matching term can induce

$\mathcal{T}_{match} = \operatorname*{argmax}_{s \in V_s,\ s \notin d} \left[\mathcal{M}_{st}\right] \cos\!\left(W_s^{V_s}, W_t^{V_t}\right)$  (5)

As we learn the bilingual signal from monolingual corpora, the source and target word vectors are trained independently of each other; the two sets of vectors do not lie in one vector space. To solve this problem, we follow the method of [18] to convert the monolingual vector spaces into a shared space. Our objective is to optimize the cross-lingual matching regularizer:

$\mathcal{M}_{st} = \sum_i \sum_j a_{ij} \left\| w_i^s - w_j^t \right\|^2$  (6)

$\phantom{\mathcal{M}_{st}} = (R^S - R^T)^{\top} A\, (R^S - R^T)$  (7)

In the formulas above, $A$ is the similarity matrix of word pairs, where $a_{ij}$ encodes the translation score of word $i$ in the source language with word $j$ in the target language; $w_i^s$ is a $K$-dimensional word embedding, and the embeddings are stacked to form a $(V \times K)$-dimensional matrix $R$.
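The double sum of Eq. (6) can be sketched directly in numpy. Everything below (matrix sizes, random toy values) is our illustrative assumption, not the paper's data:

```python
import numpy as np

def matching_regularizer(Rs, Rt, A):
    """Cross-lingual matching regularizer of Eq. (6): sum over word pairs of
    the translation-score-weighted squared embedding distance."""
    total = 0.0
    for i in range(Rs.shape[0]):
        for j in range(Rt.shape[0]):
            total += A[i, j] * float(np.sum((Rs[i] - Rt[j]) ** 2))
    return total

rng = np.random.default_rng(0)
Rs = rng.normal(size=(3, 4))          # 3 source words, K = 4 dimensions
Rt = rng.normal(size=(2, 4))          # 2 target words
A = np.abs(rng.normal(size=(3, 2)))   # hypothetical translation scores a_ij
print(matching_regularizer(Rs, Rt, A) >= 0.0)  # True: a weighted sum of squares
```

With non-negative scores $a_{ij}$ the regularizer is non-negative, and minimizing it pulls likely translation pairs together in the shared space, which is what the quadratic form of Eq. (7) expresses compactly.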

To explain this procedure with a simple example, assume that we have an English lexicon {perform, believe, talk}, a Chinese lexicon {zhixing, shixing, jiaotan}, and a seed English-Chinese lexicon {(conduct, jinxing)}. Assuming that steps (1) and (2) have already been carried out, we can calculate that "perform" is close to "conduct" on the source side and that "zhixing" and "shixing" are close to "jinxing" on the target side. We can then add (perform, zhixing) and (perform, shixing) to the original lexicon.
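The worked example above can be sketched in code. The toy 2-d embeddings and the radius threshold are invented purely so that the sketch reproduces the example; a real system would use trained Word2vec embeddings:

```python
import numpy as np

def expand_lexicon(src_emb, tgt_emb, seed_pairs, radius=0.5):
    """For each seed pair (s, t), pair every non-seed source word whose
    embedding lies within `radius` of s with every target word within
    `radius` of t, and add those pairs to the lexicon."""
    new_pairs = set()
    for s, t in seed_pairs:
        near_src = [w for w, v in src_emb.items()
                    if w != s and np.linalg.norm(v - src_emb[s]) <= radius]
        near_tgt = [w for w, v in tgt_emb.items()
                    if w != t and np.linalg.norm(v - tgt_emb[t]) <= radius]
        new_pairs.update((ws, wt) for ws in near_src for wt in near_tgt)
    return new_pairs

# "perform" sits near the seed "conduct"; "zhixing"/"shixing" near "jinxing".
src = {"conduct": np.array([1.0, 0.0]), "perform": np.array([1.2, 0.0]),
       "talk": np.array([5.0, 5.0])}
tgt = {"jinxing": np.array([0.0, 1.0]), "zhixing": np.array([0.0, 1.2]),
       "shixing": np.array([0.0, 0.9]), "jiaotan": np.array([6.0, 6.0])}
print(sorted(expand_lexicon(src, tgt, [("conduct", "jinxing")])))
# [('perform', 'shixing'), ('perform', 'zhixing')]
```

Note that this deliberately over-generates, in line with the paper's stated preference for recall over precision at this stage.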

3.3. Parallel Sentence Identification. In our particular situation of severely low-resource language pairs, although a classifier is a good method to identify parallel sentences, we do not have enough parallel sentences to train one. So, in the initial stage, we use a word-overlap model as a filter to select parallel sentences. The word-overlap model must borrow a bilingual lexicon, and parallel sentences are identified by the number of co-occurring word pairs. This process can be represented as

$\operatorname{score}(d_s, d_t) = \dfrac{|d_s \cap d_t|}{\min(|d_s|, |d_t|)}$  (8)
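A sketch of the word-overlap score of Eq. (8), with the source side mapped through a toy bilingual lexicon first. The lexicon entries are invented for illustration only:

```python
def overlap_score(src_words, tgt_words, lexicon):
    """Eq. (8): number of co-occurring word pairs (via the bilingual
    lexicon) divided by the length of the shorter sentence."""
    src, tgt = set(src_words), set(tgt_words)
    translated = {lexicon[w] for w in src if w in lexicon}
    return len(translated & tgt) / min(len(src), len(tgt))

lex = {"water": "su", "good": "yaxshi"}  # toy entries, for illustration only
s = overlap_score(["water", "is", "good"], ["su", "yaxshi", "bar", "dur"], lex)
print(round(s, 3))  # 0.667 -> 2 matched pairs / min(3, 4)
```

The score's sensitivity to lexicon coverage is visible here: any source word missing from the lexicon can never contribute a match, which is why the paper stresses obtaining a high-coverage lexicon first.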

From the above we can conclude that inducing the bilingual signal is an important step. Using the bilingual lexicon, we can quickly compute the word alignments between two sentences. To ensure that we obtain a massive number of parallel sentences, we need a high-coverage bilingual lexicon. However, we may not be able to obtain one, with the result that the word-overlap model alone yields only a small parallel corpus; we can also observe this in the experiments of Section 4. To obtain a large number of parallel sentences with high accuracy, we next use a classifier to get more parallel data. Classifiers have already proved to be an excellent method for extracting parallel sentences. We follow the work of [6] to train a BiRNN (bidirectional recurrent neural network) classifier. Our neural network architecture is illustrated in Figure 2.

Like most previous approaches that train a neural network classifier on parallel sentences, our method also converts the sentences into vectors. However, instead of using word vectors as input, we use fixed-size sentence vectors. For the input layer we define

$x_i = \operatorname{sigm}(w_i s_i + b)$  (9)


[Figure 2 appears here: a Bi-LSTM over input vectors $x_1, \dots, x_n$ with forward states $h^f$ and backward states $h^b$, followed by ReLU and fully connected layers producing $y$.]

Figure 2: Architecture of the bidirectional recurrent neural network. The fully connected layers predict the probability of a parallel sentence pair.

The BiRNN layer contains a feed-forward and a feed-backward neural network layer. This can be described by

$h_i^f = \sigma\!\left(w_{xh}^f x_i + w_{hh}^f h_{i-1}^f + b^f\right)$  (10)

$h_i^b = \sigma\!\left(w_{xh}^b x_i + w_{hh}^b h_{i+1}^b + b^b\right)$  (11)

$h = h_i^f + h_i^b$  (12)
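Equations (10)-(12) can be sketched with plain numpy. The weight shapes and random toy inputs are our assumptions; in a real system these parameters are learned:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def birnn_states(xs, wf_xh, wf_hh, bf, wb_xh, wb_hh, bb):
    """Eqs. (10)-(12): a forward and a backward recurrent pass over the
    inputs; the combined state is the sum of the two hidden states."""
    n, h = len(xs), bf.shape[0]
    hf, hb = np.zeros((n, h)), np.zeros((n, h))
    prev = np.zeros(h)
    for i in range(n):                    # forward pass, Eq. (10)
        prev = sigmoid(wf_xh @ xs[i] + wf_hh @ prev + bf)
        hf[i] = prev
    prev = np.zeros(h)
    for i in reversed(range(n)):          # backward pass, Eq. (11)
        prev = sigmoid(wb_xh @ xs[i] + wb_hh @ prev + bb)
        hb[i] = prev
    return hf + hb                        # Eq. (12)

rng = np.random.default_rng(0)
xs = rng.normal(size=(5, 3))              # five 3-dimensional input vectors
W = lambda r, c: 0.1 * rng.normal(size=(r, c))
h = birnn_states(xs, W(4, 3), W(4, 4), np.zeros(4), W(4, 3), W(4, 4), np.zeros(4))
print(h.shape)  # (5, 4): one combined hidden state per input position
```

Summing the two directional states, as Eq. (12) does, gives each position a representation informed by both its left and right context before the fully connected layers score the pair.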

For prediction, a sentence pair is identified as parallel if the probability exceeds a threshold $\sigma$. We compute

$y = \begin{cases} 1 & \text{if } p(y = 1 \mid h) > \sigma \\ 0 & \text{otherwise} \end{cases}$  (13)

4. Experiments

To assess the effectiveness of our method, we compare it in different settings against the baseline. We mainly aim to address the low-resource problem and construct a translation system for a low-resource language pair; as an instantiation of this goal, we conduct a detailed study on Uyghur-Chinese.

4.1. Experiment Setup

4.1.1. Data. In our experiments, the systems obtain the bilingual parallel corpus from three multilingual websites: TianShan (http://www.ts.cn), RenMin (http://www.people.com.cn), and KunLun (http://www.xjkunlun.cn). We only retain webpages whose documents have more than 20 words. The statistics of the preprocessed corpora are given in Table 1. Note that we only select sentences whose length exceeds 10 words. The data of our experiment is available at https://pan.baidu.com/s/1EePrHOjhuN-jTb-vNiSgTA.

Table 1: Experiment data statistics.

Websites   Language   Webpages   Sentences
TianShan   Chinese    249,238    3,839,000
           Uyghur      48,907      427,000
RenMin     Chinese    451,972    5,500,000
           Uyghur      99,578      590,000
KunLun     Chinese     44,046      641,000
           Uyghur      27,419      324,000

4.1.2. Evaluations and Ground Truth. To carry out an objective evaluation of the obtained parallel sentences, we use two evaluation methods. The first is translation accuracy, the proportion of truly parallel sentence pairs among all obtained sentence pairs. As we obtain data from open platforms, we cannot get a standard set of translation pairs against which to compute the translation accuracy of the obtained parallel sentences, so we manually evaluate the accuracy of a random sample: in the experiments, we randomly select 500 obtained parallel sentence pairs for manual evaluation. The second method is to use the obtained parallel sentences to construct a machine translation system, with the BLEU score as the evaluation metric.

4.1.3. Baseline. For comparison, we use the parallel sentence extraction system Bitextor, a free/open-source tool for harvesting parallel data from multilingual websites. The user is required to provide one or more URLs of websites to be processed, the two languages for which the parallel corpus will be produced, and a bilingual lexicon in these two languages. The system automatically analyzes the structure of webpages and obtains parallel data via the bilingual lexicon. Thus, we vary the size of the bilingual lexicon to test how it affects the harvesting of parallel sentences.

Another question is how to evaluate the classifier. We must use parallel sentences to train the classifier and then use the classifier to predict parallel sentences, and the amount of parallel training data affects the classifier's performance. So we select different numbers of parallel sentences to train the classifier and test it.

4.2. Effect of Bilingual Lexicon Size. To investigate the effectiveness of our system for obtaining parallel sentences in low-resource language pairs, we run ours and Bitextor with different bilingual lexicon sizes. As Bitextor does not use the time window as a feature to select parallel data while our system does, we use the time window to filter the results of Bitextor in order to keep the experiments consistent. We record the performance while varying the lexicon size used in the training process, shown in Figures 3 and 4; the results are for Uyghur-Chinese.

In the experiments, we use lexicons of 600, 1,500, 5,000, and 10,000 entries to conduct the parallel sentence extraction process. From Figures 3 and 4 we can immediately see the important role the bilingual lexicon plays in this process. We observe that Bitextor does not


Table 2: The size and accuracy of obtained parallel sentences for different numbers of training sentences.

Model                  2,000    5,000   10,000   20,000    40,000
LSTM      size        13,000   33,000   65,000   92,000   126,000
          accuracy      0.60     0.71     0.78     0.81      0.82
C-BiRNN   size        14,000   28,000   58,000   86,000   121,000
          accuracy      0.58     0.63     0.68     0.70      0.72

[Figure 3 appears here: precision (0.70-0.90) plotted against bilingual lexicon size (2,000-10,000 entries) for Bitextor and ours.]

Figure 3: Precision of the results as a function of the number of bilingual lexicon entries.

[Figure 4 appears here: number of extracted sentence pairs (10,000-70,000) plotted against bilingual lexicon size (2,000-10,000 entries) for Bitextor and ours.]

Figure 4: Size of the results as a function of the number of bilingual lexicon entries.

obtain parallel sentences when the size of the bilingual lexicon is very small, whereas our system achieves a very respectable result under low resources. We can easily find that Bitextor has very unstable performance across lexicon sizes, while ours keeps relatively stable results in both the number and the accuracy of obtained parallel sentences. Even when the lexicon has only a few entries, such as 600, ours obtains a very respectable number and accuracy of parallel sentences. This result, combined with the inadequate performance of the baseline, conforms to our expectation of obtaining parallel sentences for low-resource language pairs.

From Figures 3 and 4, we can obtain many parallel sentences with high accuracy. However, we still find that we cannot obtain enough parallel sentences for actual natural language processing tasks such as SMT. We analyze two factors limiting the number of obtained parallel sentences: (1) although constructing word embeddings and using the methods in Section 3 lets us find more bilingual signals, we only retain words that occur at least 1,000 times (a lower frequency threshold makes the accuracy very low, since infrequent words do not get good word embeddings), and this seriously limits the size of the bilingual signal; (2) we set a time window to filter noise, which prevents many candidates from being included in the final parallel sentences.

4.3. Effect of Parallel Sentences on the Classifier. Using the bilingual signal, we can obtain a certain number of parallel sentences; from Section 4.2 we can see that this number is not large enough. Constructing a bilingual classifier is a good method to filter parallel sentences from monolingual corpora. In this section we discuss how the classifier affects the extraction of parallel sentences and which factors affect the extraction process.

In this experiment, we construct LSTM-based and classical bidirectional recurrent neural network (C-BiRNN) classifiers to filter parallel sentences from monolingual corpora. We train each classifier with 2,000, 5,000, 10,000, 20,000, and 40,000 parallel sentences. Table 2 shows the results of the tested systems for Uyghur-Chinese. We can observe that the two neural networks differ in both the size and the accuracy of the obtained parallel sentences. From the table, the LSTM achieves the better result in both size and accuracy; we attribute this to the fact that the LSTM has a network structure that remembers more information than the classical bidirectional recurrent neural network (C-BiRNN). Another interesting finding is that the size of the training corpus plays a big role in filtering bilingual parallel sentences. When the number of training parallel sentences is only 2,000, the two tested systems obtain few results, and, most unacceptably, the results have such low accuracy that they cannot be used in any natural language processing task. However, as the training parallel sentences increase, the size and accuracy have a


Table 3: Statistics of the size and precision of parallel sentences extracted from multilingual websites.

Model            Training sentences   Extracted sentences   Precision
Bitextor&LSTM    30,000               117,900               0.70
                 40,000               124,200               0.70
Ours&LSTM        30,000               120,200               0.81
                 40,000               127,900               0.82

Table 4: BLEU scores on Uyghur-Chinese SMT using different training corpora.

Model                           BLEU    Sentences
Bitextor&LSTM&SMT (baseline)    5.6     100,000
Ours&LSTM&SMT                   15.81   100,000

great improvement for both C-BiRNN and LSTM. We can conclude that the number of training parallel sentences has a big influence on the performance of the classifier. This conclusion underlines the importance of our bilingual signal induction: only with the methods detailed in Section 3.2 and the experiment in Section 4.2 can we obtain enough parallel sentences to train a state-of-the-art classifier.

4.4. Machine Translation Evaluation. The final objective of obtaining parallel sentences is training a machine translation system to perform the translation task for a low-resource language pair. To justify the effectiveness of our method, we use the obtained parallel sentences to construct a machine translation system for the low-resource Uyghur-Chinese language pair and evaluate its quality by measuring the BLEU score of the SMT system. We use the state-of-the-art free/open-source Moses toolkit [19] to train a phrase-based translation system.

In our experiment, we use Bitextor and our method to obtain the training parallel sentences; the classifiers all use the LSTM neural network. The reason for using Bitextor is that we need a baseline system for comparison. For both methods, we select 30,000 and 40,000 sentence pairs as training corpora to construct the classifier and obtain enough training data for the machine translation system. The first experiment selects a sufficient number of parallel sentences (see Table 3). We can see that ours exceeds Bitextor when using the same classifier. Although both can produce many candidate parallel sentences, the results of Bitextor have low precision; we attribute this to the fact that Bitextor needs a sufficiently large bilingual lexicon while ours does not.

Next, we use the extraction procedure described in Section 3 to train Uyghur-Chinese machine translation systems. As the baseline SMT system, we use parallel sentences obtained by Bitextor to train a classifier and obtain the final training corpus for the SMT system. Table 4 shows the BLEU scores for the different SMT systems.

We can see that our approach achieves a higher BLEU score than the baseline. In this experiment, both methods use 30,000 sentence pairs to train the classifier. Combining Table 3 with Table 4, we believe that the baseline cannot achieve high accuracy of parallel sentences, which makes its SMT system perform poorly. As is well known, the quality of the training corpus heavily affects the performance of an SMT system. We further observe that Bitextor needs a bilingual lexicon to guarantee high accuracy of the parallel corpus; although it is an excellent system for obtaining parallel corpora, it performs poorly for low-resource language pairs. This experiment clearly indicates the benefit of obtaining parallel sentences with our method. It is important to note that we can thus construct a machine translation system for low-resource language pairs.

5. Conclusion

In this paper, we present a new minimally supervised method to obtain parallel sentences, addressing the low-resource problem in natural language processing. Our experiments show that our approach outperforms a traditional system at obtaining parallel corpora from multilingual websites for low-resource language pairs.

Our method mainly contains three steps. First, we use Word2vec to train two monolingual word embeddings; with a small bilingual lexicon of a few hundred words, we can induce more bilingual signals. Then a word-overlap model finds some parallel sentences; this step avoids the effect of HTML structure, as current websites are built from dynamic modules. Finally, we construct an LSTM-BiRNN classifier to extract parallel sentences, using the parallel corpus obtained in the previous step to train it and perform the extraction process. We use the finally obtained parallel sentences to construct a Uyghur-Chinese SMT system to measure our method. The experiments indicate that our method achieves state-of-the-art results for a low-resource language pair.

Data Availability

The data used to support the findings of this study areavailable from the corresponding author upon request

Conflicts of Interest

The authors declare that they have no conflicts of interest

Acknowledgments

This work is supported by the Xinjiang Fund (Grant no. 2015KL031), the West Light Foundation of the Chinese Academy of Sciences (Grant no. 2015-XBQN-B-10), the Xinjiang Science and Technology Major Project (Grant no. 2016A03007-3), and the Natural Science Foundation of Xinjiang (Grant no. 2015211B034).

References

[1] L. Barbosa, V. K. Sridhar, and M. Yarmohammadi, "Harvesting parallel text in multiple languages with limited supervision," in Proceedings of the International Conference on Computational Linguistics, pp. 201–214, 2012.

[2] D. S. Munteanu and D. Marcu, "Improving machine translation performance by exploiting non-parallel corpora," Computational Linguistics, vol. 31, no. 4, pp. 477–504, 2005.

[3] M. Esplà-Gomis, M. Forcada, S. Ortiz Rojas, and J. Ferrández-Tordera, "Bitextor's participation in WMT'16 shared task on document alignment," in Proceedings of the First Conference on Machine Translation, Volume 2: Shared Task Papers, pp. 685–691, Berlin, Germany, August 2016.

[4] W. Ling, L. Marujo, C. Dyer, A. W. Black, and I. Trancoso, "Crowdsourcing high-quality parallel data extraction from Twitter," in Proceedings of the Ninth Workshop on Statistical Machine Translation, pp. 426–436, Baltimore, Maryland, USA, June 2014.

[5] A. Khwileh, H. Afli, G. Jones, and A. Way, "Identifying effective translations for cross-lingual Arabic-to-English user-generated speech search," in Proceedings of the Third Arabic Natural Language Processing Workshop, pp. 100–109, Valencia, Spain, April 2017.

[6] F. Grégoire and P. Langlais, "A deep neural network approach to parallel sentence extraction," 2017, https://arxiv.org/abs/1709.09783.

[7] J. R. Smith, C. Quirk, and K. Toutanova, "Extracting parallel sentences from comparable corpora using document level alignment," in Proceedings of the 2010 Human Language Technologies Conference of the North American Chapter of the Association for Computational Linguistics (NAACL HLT 2010), pp. 403–411, USA, June 2010.

[8] C. Tillmann and S. Hewavitharana, "An efficient unified extraction algorithm for bilingual data," in Proceedings of the 12th Annual Conference of the International Speech Communication Association (INTERSPEECH 2011), pp. 2093–2096, Italy, August 2011.

[9] R. G. Hussain, M. A. Ghazanfar, M. A. Azam, U. Naeem, and S. Ur Rehman, "A performance comparison of machine learning classification approaches for robust activity of daily living recognition," Artificial Intelligence Review, pp. 1–23, 2018.

[10] M. A. Ghazanfar, S. A. Alahmari, Y. F. Aldhafiri, A. Mustaqeem, M. Maqsood, and M. A. Azam, "Using machine learning classifiers to predict stock exchange index," International Journal of Machine Learning and Computing, vol. 7, no. 2, pp. 24–29, 2017.

[11] C. Chu, T. Nakazawa, and S. Kurohashi, "Constructing a Chinese-Japanese parallel corpus from Wikipedia," in Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC 2014), pp. 642–647, Iceland, May 2014.

[12] A. Barrón-Cedeño, C. España-Bonet, J. Boldoba, and L. Màrquez, "A factory of comparable corpora from Wikipedia," in Proceedings of the Eighth Workshop on Building and Using Comparable Corpora, pp. 3–13, Beijing, China, July 2015.

[13] V. K. Rangarajan Sridhar, L. Barbosa, and S. Bangalore, "A scalable approach to building a parallel corpus from the Web," in Proceedings of the 12th Annual Conference of the International Speech Communication Association (INTERSPEECH 2011), pp. 2113–2116, Italy, August 2011.

[14] A. Antonova and A. Misyurev, "Building a web-based parallel corpus and filtering out machine-translated text," in Proceedings of the Workshop on Building and Using Comparable Corpora: Comparable Corpora and the Web, pp. 136–144, 2011.

[15] V. Papavassiliou, P. Prokopidis, and G. Thurmair, "A modular open-source focused crawler for mining monolingual and bilingual corpora from the web," in Proceedings of the Workshop on Building and Using Comparable Corpora, pp. 43–51, 2013.

[16] T. Mikolov, K. Chen, and G. Corrado, "Efficient estimation of word representations in vector space," Computation and Language, 2013.

[17] M. Zhang, H. Peng, Y. Liu, H. Luan, and M. Sun, "Bilingual lexicon induction from non-parallel data with minimal supervision," in Proceedings of the 31st AAAI Conference on Artificial Intelligence (AAAI 2017), pp. 3379–3385, USA, February 2017.

[18] S. Gouws, Y. Bengio, and G. Corrado, "BilBOWA: fast bilingual distributed representations without word alignments," in Proceedings of the 32nd International Conference on Machine Learning (ICML 2015), pp. 748–756, France, July 2015.

[19] P. Koehn, R. Zens, C. Dyer et al., "Moses: open source toolkit for statistical machine translation," in Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions (ACL '07), pp. 177–180, Prague, Czech Republic, June 2007.


4 Mathematical Problems in Engineering

Fully connected layers

ReLU

Bi-LSTM

ℎb1

ℎbi ℎb

j ℎbn

ℎf1 ℎ

fi ℎ

fj ℎ

fn

Qxℎ

Q<xℎ

x1 xi xj xn

tmt1sns1

y

Figure 2 Architecture for bidirectional recurrent neural networksThe fully connected layers predict the probability of parallel sentencepair

For the BiRNN layer it contains feed-forward and feed-backward neural networks layer This can be described by

ℎ119891119894 = 0 (119908119891119909ℎ

119909119894 + 119908119891ℎℎ

ℎ119894minus1 + 119887119891) (10)

ℎ119887119894 = 0 (119908119887119909ℎ119909119894 + 119908119887ℎℎℎ119894minus1 + 119887119887) (11)

h = ℎ119891119894 + ℎ119887119894 (12)

For prediction a sentence pair can be identified as parallelif the probability exceeds a threshold We can compute asfollows

119910 =

1 119894119891 119901 (y = 1 | h) gt 1205900 119900119905ℎ119890119903119908119894119904119890

(13)

4 Experiments

To assess the effectiveness of our method we compare itin different setting against the baseline As we mainly payattention to solve low-resource problem and construct low-resource language pair translation system As an instantiationof this goal we conduct a detailed study on Uyghur-Chinese

41 Experiment Setup

411 Data In our experiments the actual systems obtainbilingual parallel corpus from three multilingual websiteTianShan (httpwwwtscn) RenMin (httpwwwpeoplecomcn) and KunLun (httpwwwxjkunluncn) web-site We only retain those webpages of documents hav-ing more than 20 words The statistics of the preprocessedcorpora is given in Table 1 Pay attention that we onlyselect the length of sentence that exceeds 10 words The dataof our experiment is available at httpspanbaiducoms1EePrHOjhuN-jTb-vNiSgTA

Table 1 Experiment set statistics

Websites languages webpages sentences

TianShan Chinese 249238 3839000Uyghur 48907 427000

RenMin Chinese 451972 5500000Uyghur 99578 590000

KunLun Chinese 44046 641000Uyghur 27419 324000

412 Evaluations and Ground Truth In order to carry outan objective evaluation for obtained parallel sentences weperform two methods to evaluate The first is translationaccuracy which is the proportion of truly parallel sentencepairs among all obtained sentences pairs As we obtain datafrom open data platform we canrsquot get a standard transla-tion language pairs to compute the translation accuracy ofobtained parallel sentences So we use manually evaluatethe accuracy of a random sample of the obtained parallelsentences In experiments we randomly select 500 obtainedparallel sentences to conduct manual evaluation Another isthat use obtained parallel sentences to construct machinetranslation system and the BLEU score as an evaluationmetric

4.1.3. Baseline. For comparison we use a parallel-sentence extraction system, Bitextor, a free/open-source tool for harvesting parallel data from multilingual websites. The user provides one or more website URLs, the two languages for which the parallel corpus will be produced, and a bilingual lexicon for these two languages. The system automatically analyzes the structure of webpages and obtains parallel data using the bilingual lexicon. We therefore vary the size of the bilingual lexicon to test how it affects the extraction of parallel sentences.

Another issue is evaluating the classifier. We must use parallel sentences to train the classifier and then use it to predict parallel sentences, and the amount of training data affects classifier performance. We therefore train and test the classifier with different numbers of parallel sentences.

4.2. Effect of Bilingual Lexicon Size. To investigate the effectiveness of our system for obtaining parallel sentences in low-resource language pairs, we run ours and Bitextor with bilingual lexicons of different sizes. Since Bitextor does not use the time window as a feature to select parallel data while our system does, we filter Bitextor's results with the same time window to keep the experiments consistent. We record the performance on Uyghur-Chinese as the lexicon size varies, shown in Figures 3 and 4.

In these experiments we use lexicons of 600, 1,500, 5,000, and 10,000 entries to drive the extraction process. From Figures 3 and 4 we can immediately see the important role the bilingual lexicon plays in obtaining parallel sentences. We observe that Bitextor does not

Mathematical Problems in Engineering 5

Table 2: Size and accuracy of obtained parallel sentences for different numbers of training parallel sentences.

Model     Metric     2,000    5,000    10,000   20,000   40,000
LSTM      size       13,000   33,000   65,000   92,000   126,000
LSTM      accuracy   0.60     0.71     0.78     0.81     0.82
C-BiRNN   size       14,000   28,000   58,000   86,000   121,000
C-BiRNN   accuracy   0.58     0.63     0.68     0.70     0.72

[Figure 3: Precision of the obtained results as a function of bilingual lexicon entries (x-axis: 2,000-10,000), comparing Bitextor and ours; precision axis spans 0.70-0.90.]

[Figure 4: Number of obtained sentence pairs as a function of bilingual lexicon entries (x-axis: 2,000-10,000), comparing Bitextor and ours; size axis spans 10,000-70,000.]

obtain parallel sentences when the bilingual lexicon is very small, whereas our system still achieves respectable results under low-resource conditions. Bitextor's performance is very unstable across lexicon sizes, while ours remains relatively stable in both the number and the accuracy of obtained parallel sentences. Even when the lexicon has as few as 600 entries, ours obtains a respectable number of parallel sentences at good accuracy. This result, combined with the inadequate performance of the baseline, conforms to our expectation of obtaining parallel sentences for low-resource language pairs.

From Figures 3 and 4 we see that we can obtain many parallel sentences with high accuracy. However, we still cannot obtain enough parallel sentences for practical natural language processing tasks such as SMT. We identify two factors limiting the number of obtained parallel sentences: (1) although the word embeddings and the methods in Section 3 let us find more bilingual signals, we only retain words that occur at least 1,000 times (a lower frequency threshold makes the accuracy very low, since rare words do not get good word embeddings), and this seriously limits the size of the bilingual signal; (2) the time window we set to filter noise prevents many candidates from reaching the final set of parallel sentences.

4.3. Effect of Parallel Sentences on the Classifier. Using the bilingual signal we can obtain a certain number of parallel sentences, but as Section 4.2 shows, that number is not large enough. Building a bilingual classifier is a good way to filter parallel sentences out of monolingual corpora. In this section we discuss how the classifier affects the extraction of parallel sentences and which factors influence this process.

In this experiment we build classifiers based on LSTM and on a classical bidirectional recurrent neural network (C-BiRNN) to filter parallel sentences from monolingual corpora, training each with 2,000, 5,000, 10,000, 20,000, and 40,000 parallel sentences. Table 2 shows the results for Uyghur-Chinese. The two networks perform differently in both the size and the accuracy of the obtained parallel sentences: the LSTM gives better results on both counts, which we attribute to its ability to remember more information than the C-BiRNN. Another interesting finding is that the size of the training corpus plays a large role in filtering bilingual parallel sentences. With only 2,000 training pairs, both systems obtain few results, and, worse, the results have such low accuracy that they cannot be used in any natural language processing task. However, as the number of training parallel sentences increases, both size and accuracy show a


Table 3: Statistics of the size and precision of parallel sentences extracted from multilingual websites.

Model           Training corpus   Sentences   Precision
Bitextor&LSTM   30,000            117,900     0.70
Bitextor&LSTM   40,000            124,200     0.70
Ours&LSTM       30,000            120,200     0.81
Ours&LSTM       40,000            127,900     0.82

Table 4: BLEU scores on Uyghur-Chinese SMT using different training corpora.

Model                            BLEU    Sentences
Bitextor&LSTM&SMT (baseline)     5.6     100,000
Ours&LSTM&SMT                    15.81   100,000

great improvement for both C-BiRNN and LSTM. We conclude that the number of training parallel sentences has a large influence on classifier performance, which underlines the importance of our bilingual-signal induction: only with the methods detailed in Section 3.2 and the experiment in Section 4.2 can we obtain enough parallel sentences to train a state-of-the-art classifier.
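The training-size effect in Table 2 can be illustrated with a toy stand-in. The sketch below trains a one-feature logistic-regression scorer (not the paper's LSTM or C-BiRNN) on synthetic sentence-pair scores and varies the number of training pairs; all data and hyperparameters are invented for illustration:

```python
import random
import math

random.seed(0)

def make_pair(parallel):
    """Toy stand-in for the sentence-pair representation h: a single
    lexical-overlap-like score, higher for parallel pairs. Synthetic."""
    base = 0.7 if parallel else 0.3
    return ([base + random.uniform(-0.2, 0.2)], 1 if parallel else 0)

def train(data, epochs=200, lr=0.5):
    """Fit a 1-feature logistic-regression scorer by plain SGD."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in data:
            p = 1.0 / (1.0 + math.exp(-(w * x[0] + b)))
            w -= lr * (p - y) * x[0]
            b -= lr * (p - y)
    return w, b

def accuracy(model, data):
    w, b = model
    hits = sum(
        int((1.0 / (1.0 + math.exp(-(w * x[0] + b))) > 0.5) == bool(y))
        for x, y in data
    )
    return hits / len(data)

test_set = [make_pair(i % 2 == 0) for i in range(200)]
results = {}
for n in (20, 200, 2000):
    train_set = [make_pair(i % 2 == 0) for i in range(n)]
    results[n] = accuracy(train(train_set), test_set)
    print(f"training pairs={n}: test accuracy {results[n]:.2f}")
```

Even for this trivial model, more labeled pairs give a more reliable decision boundary, mirroring the trend in Table 2, though the absolute numbers here are meaningless.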

4.4. Machine Translation Evaluation. Our final objective in obtaining parallel sentences is to train a machine translation system for a low-resource language pair. To justify the effectiveness of our methods, we use the obtained parallel sentences to build a machine translation system for the low-resource Uyghur-Chinese pair and evaluate its quality by measuring the BLEU score of the SMT system. We use the state-of-the-art free/open-source Moses [19] to train a phrase-based translation system.

In this experiment we use Bitextor and our method to obtain training parallel sentences; both classifiers use the LSTM network. We use Bitextor because we need a baseline system for comparison. For both methods we select 30,000 and 40,000 sentence pairs as training corpora to build the classifier and then obtain enough training data for the machine translation system. The first experiment selects a sufficient number of parallel sentences (see Table 3). Ours exceeds Bitextor under the same classifier: although both can produce many candidate parallel sentences, Bitextor's results have low precision. We attribute this to the fact that Bitextor needs a sufficiently large bilingual lexicon while ours does not.

Next, we use the extraction procedure described in Section 3 to train several Uyghur-Chinese machine translation systems. For the baseline SMT system, we use parallel sentences obtained by Bitextor to train a classifier and obtain the final training corpus for SMT. Table 4 shows the BLEU scores of the different SMT systems.

Our approach achieves a higher BLEU score than the baseline; in this experiment both methods use 30,000 sentence pairs to train the classifier. Combining Table 3 with Table 4, we conclude that the baseline cannot reach a high accuracy of parallel sentences, which leaves the SMT system with low performance, since the quality of the training corpus heavily affects an SMT system. We further note that Bitextor needs a bilingual lexicon to guarantee a high-accuracy parallel corpus: although it is an excellent system for obtaining parallel corpora, it performs poorly for low-resource language pairs. This experiment clearly indicates the benefit of obtaining parallel sentences with our method; notably, it lets us construct a machine translation system for low-resource language pairs.

5. Conclusion

In this paper we present a new minimal-supervision method for obtaining parallel sentences, addressing the low-resource problem in natural language processing. Our experiments show that our approach outperforms a traditional system at obtaining parallel corpora from multilingual websites for low-resource language pairs.

Our method contains three main steps. First, we use Word2vec to train two monolingual word embeddings; with a small bilingual lexicon of a few hundred words, we can induce more bilingual signals. Then a word-overlap model finds some parallel sentences; this step avoids relying on HTML structure, since current websites are built from dynamic modules. Finally, we construct an LSTM-BiRNN classifier to extract parallel sentences: we use the parallel corpus obtained in the previous step to train this classifier and then run the extraction. We use the final obtained parallel sentences to build a Uyghur-Chinese SMT system to measure our method. The experiments indicate that our method achieves state-of-the-art results for a low-resource language pair.
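The word-overlap step summarized above can be sketched as scoring a candidate pair by the fraction of source words whose lexicon translation appears in the target sentence. The tiny Uyghur-Chinese lexicon and example sentences here are illustrative stand-ins, not the paper's actual data or model:

```python
# Toy induced Uyghur -> Chinese lexicon (illustrative entries only).
lexicon = {"kitab": "书", "yaxshi": "好", "adem": "人"}

def overlap_score(src_tokens, tgt_tokens, lexicon):
    """Fraction of lexicon-covered source words whose translation
    occurs in the target sentence; 0.0 if no word is covered."""
    tgt = set(tgt_tokens)
    translatable = [w for w in src_tokens if w in lexicon]
    if not translatable:
        return 0.0
    hits = sum(1 for w in translatable if lexicon[w] in tgt)
    return hits / len(translatable)

print(overlap_score(["kitab", "yaxshi"], ["这", "书", "好"], lexicon))  # 1.0
print(overlap_score(["kitab", "adem"], ["这", "书"], lexicon))          # 0.5
```

Pairs scoring above a chosen threshold would become the seed parallel sentences used to train the classifier; the threshold itself would be tuned on held-out judgments.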

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest

Acknowledgments

This work is supported by the Xinjiang Fund (Grant no. 2015KL031), the West Light Foundation of the Chinese Academy of Sciences (Grant no. 2015-XBQN-B-10), the Xinjiang Science and Technology Major Project (Grant no. 2016A03007-3), and the Natural Science Foundation of Xinjiang (Grant no. 2015211B034).

References

[1] L. Barbosa, V. K. Sridhar, and M. Yarmohammadi, "Harvesting parallel text in multiple languages with limited supervision," in International Conference on Computational Linguistics, pp. 201-214, 2012.

[2] D. S. Munteanu and D. Marcu, "Improving machine translation performance by exploiting non-parallel corpora," Computational Linguistics, vol. 31, no. 4, pp. 477-504, 2005.

[3] M. Espla-Gomis, M. Forcada, S. Ortiz Rojas, and J. Ferrandez-Tordera, "Bitextor's participation in WMT'16 shared task on document alignment," in Proceedings of the First Conference on Machine Translation, Volume 2: Shared Task Papers, pp. 685-691, Berlin, Germany, August 2016.

[4] W. Ling, L. Marujo, C. Dyer, A. W. Black, and I. Trancoso, "Crowdsourcing high-quality parallel data extraction from Twitter," in Proceedings of the Ninth Workshop on Statistical Machine Translation, pp. 426-436, Baltimore, Maryland, USA, June 2014.

[5] A. Khwileh, H. Afli, G. Jones, and A. Way, "Identifying effective translations for cross-lingual Arabic-to-English user-generated speech search," in Proceedings of the Third Arabic Natural Language Processing Workshop, pp. 100-109, Valencia, Spain, April 2017.

[6] F. Gregoire and P. Langlais, "A deep neural network approach to parallel sentence extraction," 2017, https://arxiv.org/abs/1709.09783.

[7] J. R. Smith, C. Quirk, and K. Toutanova, "Extracting parallel sentences from comparable corpora using document level alignment," in Proceedings of the 2010 Human Language Technologies Conference of the North American Chapter of the Association for Computational Linguistics (NAACL HLT 2010), pp. 403-411, USA, June 2010.

[8] C. Tillmann and S. Hewavitharana, "An efficient unified extraction algorithm for bilingual data," in Proceedings of the 12th Annual Conference of the International Speech Communication Association (INTERSPEECH 2011), pp. 2093-2096, Italy, August 2011.

[9] R. G. Hussain, M. A. Ghazanfar, M. A. Azam, U. Naeem, and S. Ur Rehman, "A performance comparison of machine learning classification approaches for robust activity of daily living recognition," Artificial Intelligence Review, pp. 1-23, 2018.

[10] M. A. Ghazanfar, S. A. Alahmari, Y. F. Aldhafiri, A. Mustaqeem, M. Maqsood, and M. A. Azam, "Using machine learning classifiers to predict stock exchange index," International Journal of Machine Learning and Computing, vol. 7, no. 2, pp. 24-29, 2017.

[11] C. Chu, T. Nakazawa, and S. Kurohashi, "Constructing a Chinese-Japanese parallel corpus from Wikipedia," in Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC 2014), pp. 642-647, Iceland, May 2014.

[12] A. Barron-Cedeno, C. Espana-Bonet, J. Boldoba, and L. Marquez, "A factory of comparable corpora from Wikipedia," in Proceedings of the Eighth Workshop on Building and Using Comparable Corpora, pp. 3-13, Beijing, China, July 2015.

[13] V. K. Rangarajan Sridhar, L. Barbosa, and S. Bangalore, "A scalable approach to building a parallel corpus from the Web," in Proceedings of the 12th Annual Conference of the International Speech Communication Association (INTERSPEECH 2011), pp. 2113-2116, Italy, August 2011.

[14] A. Antonova and A. Misyurev, "Building a web-based parallel corpus and filtering out machine-translated text," in The Workshop on Building and Using Comparable Corpora: Comparable Corpora and the Web, pp. 136-144, 2011.

[15] V. Papavassiliou, P. Prokopidis, and G. Thurmair, "A modular open-source focused crawler for mining monolingual and bilingual corpora from the web," in The Workshop on Building and Using Comparable Corpora, pp. 43-51, 2013.

[16] T. Mikolov, K. Chen, and G. Corrado, "Efficient estimation of word representations in vector space," Computation and Language, 2013.

[17] M. Zhang, H. Peng, Y. Liu, H. Luan, and M. Sun, "Bilingual lexicon induction from non-parallel data with minimal supervision," in Proceedings of the 31st AAAI Conference on Artificial Intelligence (AAAI 2017), pp. 3379-3385, USA, February 2017.

[18] S. Gouws, Y. Bengio, and G. Corrado, "BilBOWA: fast bilingual distributed representations without word alignments," in Proceedings of the 32nd International Conference on Machine Learning (ICML 2015), pp. 748-756, France, July 2015.

[19] P. Koehn, R. Zens, C. Dyer et al., "Moses: open source toolkit for statistical machine translation," in Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions (ACL '07), pp. 177-180, Prague, Czech Republic, June 2007.


Page 5: A Novel Deep Learning Method for Obtaining Bilingual Corpus …downloads.hindawi.com/journals/mpe/2019/7495436.pdf · 2019. 7. 30. · MathematicalProblemsinEngineering Fully connected

Mathematical Problems in Engineering 5

Table 2 The size and accuracy of obtaining parallel sentences in different number of training corpus

Model The number of training parallel sentences2000 5000 10000 20000 40000

LSTM size 13000 33000 65000 92000 126000accuracy 06 071 078 081 082

C-BiRNN size 14000 28000 58000 86000 121000accuracy 058 063 068 070 072

Lexicon

BitextorOurs

100008000600040002000

070

075

080

085

Prec

ision

090

Figure 3 Precision of result as the entries of bilingual lexicon

BitextorOurs

Lexicon100008000600040002000

10000

20000

30000

Size

40000

50000

60000

70000

Figure 4 Size of result as the entries of bilingual lexicon

obtain parallel sentences when the size of bilingual lexiconis very small However we see that our system can get avery objective result under low resources We can easily findthat Bitextor has a very unstable performance with differentlexicon size However ours can keep relatively stable resultsno matter the number and accuracy of obtaining parallelsentences When the lexicon entry only is a little such as

600 ours can get a very objective number and accuracy ofparallel sentences This result combined with the inadequateperformance of the baseline conforms to our expectationthat obtain parallel sentences for low-resources languagepairs

From Figures 3 and 4 we can obtain a lot of parallelsentences with a high accuracy However we still find thatwe canrsquot obtain enough parallel sentences for actual naturallanguage processing such as SMT We analyze two factorsaffecting the number of obtaining parallel sentences (1)despite constructing word embeddings and using methodsin Section 3 we can find more bilingual signals Howeverwe only retain words that occur at least 1000 times (Lowertimes threshold make the accuracy very low and low timeword cannot get a fine word embedding) and it seriouslylimits obtaining the size of bilingual signal (2) We set atime window to filter noisy it makes large candidate not beobtained in final parallel sentences

43 Effect of Parallel Sentences for Classifier ExperimentsUsing bilingual signal we can obtain a certain number ofparallel From Section 42 we can see that the number ofobtaining parallel sentences is not large enough Construct-ing bilingual classifier is a good method to filter bilingualparallel sentences from monolingual corpus In this sectionwe will discuss the classifier how affect the obtaining parallelsentences and which factor affects the processing of obtainingparallel sentences

In this experiment we construct based onLSTMand clas-sical bidirectional recurrent neural network (C-BiRNN) clas-sifiers to filter parallel sentences frommonolingual corpus Atthe same time we use 2000 5000 10000 20000 40000number of parallel sentences to train classifier Table 2 showsthe results of the tested systems for Uyghur-Chinese We canobserve that the two neural work to have a different perfor-mance for size and accuracy of obtaining parallel sentencesFrom the table the LSTM have a better result than the othernomatter the size and accuracyWe attribute it to the fact thatLSTM have a better neural network structure to remembermore information than the classical bidirectional recurrentneural network (C-BiRNN) Another interesting finding isthat the size of training corpus plays a big role to filterbilingual parallel sentences When the number of trainingparallel sentences is only 2000 the two testing systems onlyobtain a few results and the most unacceptable is that theresult has a very low accuracy so that the result cannot useany natural language process tasks However as the trainingparallel sentences increase the size and accuracy have a

6 Mathematical Problems in Engineering

Table 3 Statistics of the size and precision of parallel sentences extracted from multilingual websites

Model Training corpus sentences precision

BitextorampLSTM 30000 117900 07040000 124200 070

OursampLSTM 30000 120200 08140000 127900 082

Table 4 BLEU scores on Uyghur-Chinese SMT using differenttraining corpus

Model BLEU sentencesBitextorampLSTM ampSMT(baseline) 56 100000Ours ampLSTM ampSMT 1581 100000

great improvement no matter C-BiRNN and LSTM We canconclude that the number of training parallel sentences has abig influence on the performance of classifierThis conclusioncan present the importance of our inducing bilingual signalOnly by the methods detailed in Section 32 and experimentin Section 42 canwe obtain enoughparallel sentences to trainan state-of-the-art classifier

44 Machine Translation Evaluation Our final objective ofobtaining parallel sentences is training a machine translationsystem to perform translation task for low-resource languagepair In order to justify the effectiveness of our methods weobtain parallel sentences to construct a machine translationsystem in low-resources Uyghur-Chinese language pair andevaluate its quality by measuring the BLEU score on SMTsystem We use an state-of-the-art freeopen source Moses[19] to train phrase-based translation system

In our experiment we use the Bitextor and ours methodto obtain training parallel sentences and the classifiers alluse LSTM neural network The reason of using Bitextor isthat we need a baseline system to measure For two methodswe select 30000 40000 sentence pairs as training corpusto construct classifier and obtain enough training corpusto train machine translation system The first experiment isselecting enough number of parallel sentences (see Table 3)We can see that ours exceed Bitextor under using sameclassifier Although the two can get many of candidateparallel sentences the results of Bitextor have a low precisionWe attribute the reason that the Bitextor needs an enoughbilingual lexicon and ours does not

In next section we will use the collocated extractionprocedure described in Section 3 to train some machineUyghur-Chinese translation systems As the baseline SMTsystem we use parallel sentences obtained by Bitextor to traina classifier to obtain final training corpus for SMT systemTable 4 shows the BLEU scores for the different SMT systems

We can see that our approach can get a higher BLEUscore than the baseline In the experiment we all use 30000sentence pairs to train classifier Combining Table 3 withTable 4 we can believe that the baseline cannot get a very highaccuracy of parallel sentences and makes SMT system havea low performance As we all know the quality of training

corpus heavily affects the performance of SMT system Wefurther analyze the Bitextor need of a bilingual lexicon toguarantee a high accuracy of parallel corpus Although it isan excellent system to obtain parallel corpus it will showa poor performance for low-resources language pairs Thisexperiment clearly indicates the benefit of obtaining parallelsentences using our method It is important to note thatwe can construct a machine translation system with low-resources language pairs

5 Conclusion

In this paper we present a new minimal supervision methodto obtain parallel sentences for solving low-resources prob-lem in natural language processing Our experiments showthat our approach outperforms the traditional system toobtain parallel corpus from multilingual websites for low-resources language pairs

Our methods mainly contain three steps First we useWord2vec to train two monolingual word embeddings Bya small bilingual lexicon about hundreds of words we caninduce more bilingual signals Then using a word-overlapmodel finds some parallel sentences This step avoids theeffect of HTML structure as the current website is developedinto dynamic modules Finally we construct a LSTM-BiRNNclassifier to extract parallel sentences We use the parallelcorpus obtaining in above step to train this classifier andperform extracting process We use the final obtaining par-allel sentences to construct a Uyghur-Chinese SMT systemto measure our method The experiments indicate that ourmethod can get state-of-the-art results in low-resourceslanguage pair

Data Availability

The data used to support the findings of this study areavailable from the corresponding author upon request

Conflicts of Interest

The authors declare that they have no conflicts of interest

Acknowledgments

This work is supported by the Xinjiang Fun (Grant no2015KL031) the West Light Foundation of the ChineseAcademy of Sciences (Grant no 2015-XBQN-B-10) theXinjiang Science and Technology Major Project (Grant no

Mathematical Problems in Engineering 7

2016A03007-3) and Natural Science Foundation of Xinjiang(Grant no 2015211B034)

References

[1] L Barbosa V Sridhar K and M Yarmohammadi ldquoHarvestingParallel Text in Multiple Languages with Limited SupervisionrdquoInternational Conference on Computational Linguistics pp 201ndash214 2012

[2] D S Munteanu and D Marcu ldquoImproving machine translationperformance by exploiting non-parallel corporardquo Computa-tional Linguistics vol 31 no 4 pp 477ndash504 2005

[3] M Espla-Gomis M Forcada S Ortiz Rojas and J Ferrandez-Tordera ldquoBitextorrsquos participation in WMTrsquo16 shared task ondocument alignmentrdquo in Proceedings of the First Conference onMachine Translation Volume 2 Shared Task Papers pp 685ndash691 Berlin Germany August 2016

[4] W Ling L Marujo C Dyer A W Black and I TrancosoldquoCrowdsourcing High-Quality Parallel Data Extraction fromTwitterrdquo in Proceedings of the Ninth Workshop on StatisticalMachine Translation pp 426ndash436 Baltimore Maryland USAJune 2014

[5] A Khwileh H Afli G Jones and A Way ldquoIdentifyingEffectiveTranslations forCross-lingualArabic-to-EnglishUser-generated Speech Searchrdquo in Proceedings of the Third ArabicNatural Language Processing Workshop pp 100ndash109 ValenciaSpain April 2017

[6] F Gregoire and P Langlais ldquoA Deep Neural Network ApproachTo Parallel Sentence Extractionrdquo 2017 httpsarxivorgabs170909783

[7] J R Smith C Quirk and K Toutanova ldquoExtracting par-allel sentences from comparable corpora using documentlevel alignmentrdquo in Proceedings of the 2010 Human LanguageTechnologies Conference ofthe North American Chapter of theAssociation for Computational Linguistics NAACL HLT 2010pp 403ndash411 USA June 2010

[8] C Tillmann and S Hewavitharana ldquoAn efficient unified extrac-tion algorithm for bilingual datardquo in Proceedings of the 12thAnnual Conference of the International Speech CommunicationAssociation INTERSPEECH 2011 pp 2093ndash2096 Italy August2011

[9] R G Hussain M A Ghazanfar M A Azam U Naeemand S Ur Rehman ldquoA performance comparison of machinelearning classification approaches for robust activity of dailyliving recognitionrdquo Artificial Intelligence Review pp 1ndash23 2018

[10] M A Ghazanfar S A Alahmari Y F Aldhafiri AMustaqeemM Maqsood and M A Azam ldquoUsing machine learning clas-sifiers to predict stock exchange indexrdquo International Journal ofMachine Learning and Computing vol 7 no 2 pp 24ndash29 2017

[11] C Chu T Nakazawa and S Kurohashi ldquoConstructing aChinese-Japanese parallel corpus from wikipediardquo in Proceed-ings of the 9th International Conference on Language Resourcesand Evaluation LREC 2014 pp 642ndash647 Iceland May 2014

[12] A Barron-Cedeno C Espana-Bonet J Boldoba and LMarquez ldquoA Factory of Comparable Corpora fromWikipediardquoin Proceedings of the Eighth Workshop on Building and UsingComparable Corpora pp 3ndash13 Beijing China July 2015

[13] V K Rangarajan Sridhar L Barbosa and S Bangalore ldquoAscalable approach to building a parallel corpus from the Webrdquoin Proceedings of the 12th Annual Conference of the InternationalSpeech Communication Association INTERSPEECH 2011 pp2113ndash2116 Italy August 2011

[14] A Antonova and A Misyurev ldquoBuilding a web-based parallelcorpus and filtering outmachine-translated textrdquoTheWorkshopon Building Using Comparable Corpora Comparable Corporathe Web pp 136ndash144 2011

[15] V Papavassiliou P Prokopidis and G Thurmair ldquoA modularopen-source focused crawler for mining monolingual andbilingual corpora from the webrdquo The Workshop on Building ampUsing Comparable Corpora pp 43ndash51 2013

[16] T Mikolov K Chen and G Corrado ldquoEfficient Estimationof Word Representations in Vector Spacerdquo Computation andLanguage 2013

[17] M Zhang H Peng Y Liu H Luan and M Sun ldquoBilinguallexicon induction from non-parallel data with minimal super-visionrdquo in Proceedings of the 31st AAAI Conference on ArtificialIntelligence AAAI 2017 pp 3379ndash3385 USA February 2017

[18] S Gouws Y Bengio and G Corrado ldquoBilBOWA Fast bilin-gual distributed representations without word alignmentsrdquo inProceedings of the 32nd International Conference on MachineLearning ICML 2015 pp 748ndash756 France July 2015

[19] P Koehn R Zens C Dyer et al ldquoMoses open source toolkitfor statistical machine translationrdquo in Proceedings of the 45thAnnual Meeting of the ACL on Interactive Poster and Demon-stration Sessions (ACL rsquo07) pp 177ndash180 Prague CzechRepublicJune 2007

Hindawiwwwhindawicom Volume 2018

MathematicsJournal of

Hindawiwwwhindawicom Volume 2018

Mathematical Problems in Engineering

Applied MathematicsJournal of

Hindawiwwwhindawicom Volume 2018

Probability and StatisticsHindawiwwwhindawicom Volume 2018

Journal of

Hindawiwwwhindawicom Volume 2018

Mathematical PhysicsAdvances in

Complex AnalysisJournal of

Hindawiwwwhindawicom Volume 2018

OptimizationJournal of

Hindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom Volume 2018

Engineering Mathematics

International Journal of

Hindawiwwwhindawicom Volume 2018

Operations ResearchAdvances in

Journal of

Hindawiwwwhindawicom Volume 2018

Function SpacesAbstract and Applied AnalysisHindawiwwwhindawicom Volume 2018

International Journal of Mathematics and Mathematical Sciences

Hindawiwwwhindawicom Volume 2018

Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Hindawiwwwhindawicom Volume 2018Volume 2018

Numerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisAdvances inAdvances in Discrete Dynamics in

Nature and SocietyHindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom

Dierential EquationsInternational Journal of

Volume 2018

Hindawiwwwhindawicom Volume 2018

Decision SciencesAdvances in

Hindawiwwwhindawicom Volume 2018

AnalysisInternational Journal of

Hindawiwwwhindawicom Volume 2018

Stochastic AnalysisInternational Journal of

Submit your manuscripts atwwwhindawicom

Page 6: A Novel Deep Learning Method for Obtaining Bilingual Corpus …downloads.hindawi.com/journals/mpe/2019/7495436.pdf · 2019. 7. 30. · MathematicalProblemsinEngineering Fully connected

6 Mathematical Problems in Engineering

Table 3 Statistics of the size and precision of parallel sentences extracted from multilingual websites

Model Training corpus sentences precision

BitextorampLSTM 30000 117900 07040000 124200 070

OursampLSTM 30000 120200 08140000 127900 082

Table 4 BLEU scores on Uyghur-Chinese SMT using differenttraining corpus

Model BLEU sentencesBitextorampLSTM ampSMT(baseline) 56 100000Ours ampLSTM ampSMT 1581 100000

great improvement no matter C-BiRNN and LSTM We canconclude that the number of training parallel sentences has abig influence on the performance of classifierThis conclusioncan present the importance of our inducing bilingual signalOnly by the methods detailed in Section 32 and experimentin Section 42 canwe obtain enoughparallel sentences to trainan state-of-the-art classifier

4.4. Machine Translation Evaluation. Our final objective in obtaining parallel sentences is to train a machine translation system for a low-resource language pair. To justify the effectiveness of our method, we use the obtained parallel sentences to construct a machine translation system for the low-resource Uyghur-Chinese language pair and evaluate its quality by measuring the BLEU score of the SMT system. We use the state-of-the-art, free, open-source Moses toolkit [19] to train a phrase-based translation system.
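The BLEU score used for this evaluation can be illustrated with a minimal, self-contained sketch of single-reference sentence-level BLEU (a simplification of the corpus-level metric Moses reports; the smoothing and function names are ours):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def sentence_bleu(reference, hypothesis, max_n=4):
    """Simplified BLEU: geometric mean of clipped n-gram precisions
    (with +1 smoothing) times a brevity penalty, single reference."""
    log_prec_sum = 0.0
    for n in range(1, max_n + 1):
        hyp_counts = Counter(ngrams(hypothesis, n))
        ref_counts = Counter(ngrams(reference, n))
        clipped = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = max(sum(hyp_counts.values()), 1)
        # +1 smoothing keeps the geometric mean defined for short sentences
        log_prec_sum += math.log((clipped + 1) / (total + 1))
    brevity = min(1.0, math.exp(1 - len(reference) / max(len(hypothesis), 1)))
    return brevity * math.exp(log_prec_sum / max_n)
```

A hypothesis identical to the reference scores 1.0; shorter or divergent hypotheses are penalized by the clipped precisions and the brevity penalty.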

In our experiments, we use Bitextor and our method to obtain training parallel sentences, and the classifiers all use the LSTM neural network. We use Bitextor because we need a baseline system for comparison. For both methods, we select 30,000 and 40,000 sentence pairs as training corpora to construct the classifier and obtain enough training data for the machine translation system. The first experiment selects a sufficient number of parallel sentences (see Table 3). We can see that our method exceeds Bitextor when using the same classifier. Although both methods find many candidate parallel sentences, the Bitextor results have low precision. We attribute this to the fact that Bitextor requires a sufficiently large bilingual lexicon, whereas our method does not.

In the next section, we use the extraction procedure described in Section 3 to train several Uyghur-Chinese machine translation systems. For the baseline SMT system, we use parallel sentences obtained by Bitextor to train a classifier that yields the final training corpus for the SMT system. Table 4 shows the BLEU scores of the different SMT systems.

We can see that our approach obtains a higher BLEU score than the baseline. In this experiment, both systems use 30,000 sentence pairs to train the classifier. Combining Table 3 with Table 4, we conclude that the baseline cannot achieve high accuracy of parallel sentences, which leaves the SMT system with low performance. As is well known, the quality of the training corpus heavily affects the performance of an SMT system. We further note that Bitextor needs a bilingual lexicon to guarantee a high-accuracy parallel corpus; although it is an excellent system for obtaining parallel corpora, it performs poorly for low-resource language pairs. This experiment clearly indicates the benefit of obtaining parallel sentences with our method. Notably, it allows us to construct a machine translation system for low-resource language pairs.

5. Conclusion

In this paper, we present a new minimally supervised method to obtain parallel sentences, addressing the low-resource problem in natural language processing. Our experiments show that our approach outperforms a traditional system at obtaining parallel corpora from multilingual websites for low-resource language pairs.

Our method comprises three steps. First, we use Word2vec to train two monolingual word embeddings; with a small bilingual lexicon of a few hundred words, we can induce more bilingual signals. Then, a word-overlap model finds some parallel sentences; this step avoids the effect of HTML structure, as current websites are built from dynamic modules. Finally, we construct an LSTM-BiRNN classifier to extract parallel sentences: we train this classifier on the parallel corpus obtained in the previous step and then run the extraction. We use the final extracted parallel sentences to construct a Uyghur-Chinese SMT system to evaluate our method. The experiments indicate that our method achieves state-of-the-art results for a low-resource language pair.
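The word-overlap step of this pipeline can be sketched roughly as follows, assuming a small induced lexicon mapping each source word to candidate translations. All names, the toy French-English data, and the 0.5 threshold are illustrative, not the paper's exact procedure:

```python
def overlap_score(src_tokens, tgt_tokens, lexicon):
    """Fraction of source words whose lexicon translation appears in
    the target sentence; a cheap filter for candidate parallel pairs."""
    tgt_set = set(tgt_tokens)
    hits = sum(1 for w in src_tokens
               if any(t in tgt_set for t in lexicon.get(w, ())))
    return hits / max(len(src_tokens), 1)

def find_candidates(src_sents, tgt_sents, lexicon, threshold=0.5):
    """Pair each source sentence with its best-overlapping target
    sentence, keeping only pairs that clear the threshold."""
    pairs = []
    for src in src_sents:
        best = max(tgt_sents, key=lambda t: overlap_score(src, t, lexicon))
        if overlap_score(src, best, lexicon) >= threshold:
            pairs.append((src, best))
    return pairs
```

Because the match depends only on sentence content, not on page layout, such a filter is unaffected by the HTML structure of dynamically generated pages.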

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work is supported by the Xinjiang Fund (Grant no. 2015KL031), the West Light Foundation of the Chinese Academy of Sciences (Grant no. 2015-XBQN-B-10), the Xinjiang Science and Technology Major Project (Grant no. 2016A03007-3), and the Natural Science Foundation of Xinjiang (Grant no. 2015211B034).

References

[1] L. Barbosa, V. K. Sridhar, and M. Yarmohammadi, "Harvesting parallel text in multiple languages with limited supervision," in International Conference on Computational Linguistics, pp. 201–214, 2012.

[2] D. S. Munteanu and D. Marcu, "Improving machine translation performance by exploiting non-parallel corpora," Computational Linguistics, vol. 31, no. 4, pp. 477–504, 2005.

[3] M. Espla-Gomis, M. Forcada, S. Ortiz Rojas, and J. Ferrandez-Tordera, "Bitextor's participation in WMT'16 shared task on document alignment," in Proceedings of the First Conference on Machine Translation, Volume 2: Shared Task Papers, pp. 685–691, Berlin, Germany, August 2016.

[4] W. Ling, L. Marujo, C. Dyer, A. W. Black, and I. Trancoso, "Crowdsourcing high-quality parallel data extraction from Twitter," in Proceedings of the Ninth Workshop on Statistical Machine Translation, pp. 426–436, Baltimore, Maryland, USA, June 2014.

[5] A. Khwileh, H. Afli, G. Jones, and A. Way, "Identifying effective translations for cross-lingual Arabic-to-English user-generated speech search," in Proceedings of the Third Arabic Natural Language Processing Workshop, pp. 100–109, Valencia, Spain, April 2017.

[6] F. Gregoire and P. Langlais, "A deep neural network approach to parallel sentence extraction," 2017, https://arxiv.org/abs/1709.09783.

[7] J. R. Smith, C. Quirk, and K. Toutanova, "Extracting parallel sentences from comparable corpora using document level alignment," in Proceedings of the 2010 Human Language Technologies Conference of the North American Chapter of the Association for Computational Linguistics (NAACL HLT 2010), pp. 403–411, USA, June 2010.

[8] C. Tillmann and S. Hewavitharana, "An efficient unified extraction algorithm for bilingual data," in Proceedings of the 12th Annual Conference of the International Speech Communication Association (INTERSPEECH 2011), pp. 2093–2096, Italy, August 2011.

[9] R. G. Hussain, M. A. Ghazanfar, M. A. Azam, U. Naeem, and S. Ur Rehman, "A performance comparison of machine learning classification approaches for robust activity of daily living recognition," Artificial Intelligence Review, pp. 1–23, 2018.

[10] M. A. Ghazanfar, S. A. Alahmari, Y. F. Aldhafiri, A. Mustaqeem, M. Maqsood, and M. A. Azam, "Using machine learning classifiers to predict stock exchange index," International Journal of Machine Learning and Computing, vol. 7, no. 2, pp. 24–29, 2017.

[11] C. Chu, T. Nakazawa, and S. Kurohashi, "Constructing a Chinese-Japanese parallel corpus from Wikipedia," in Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC 2014), pp. 642–647, Iceland, May 2014.

[12] A. Barron-Cedeno, C. Espana-Bonet, J. Boldoba, and L. Marquez, "A factory of comparable corpora from Wikipedia," in Proceedings of the Eighth Workshop on Building and Using Comparable Corpora, pp. 3–13, Beijing, China, July 2015.

[13] V. K. Rangarajan Sridhar, L. Barbosa, and S. Bangalore, "A scalable approach to building a parallel corpus from the Web," in Proceedings of the 12th Annual Conference of the International Speech Communication Association (INTERSPEECH 2011), pp. 2113–2116, Italy, August 2011.

[14] A. Antonova and A. Misyurev, "Building a web-based parallel corpus and filtering out machine-translated text," in The Workshop on Building and Using Comparable Corpora: Comparable Corpora and the Web, pp. 136–144, 2011.

[15] V. Papavassiliou, P. Prokopidis, and G. Thurmair, "A modular open-source focused crawler for mining monolingual and bilingual corpora from the web," in The Workshop on Building and Using Comparable Corpora, pp. 43–51, 2013.

[16] T. Mikolov, K. Chen, and G. Corrado, "Efficient estimation of word representations in vector space," Computation and Language, 2013.

[17] M. Zhang, H. Peng, Y. Liu, H. Luan, and M. Sun, "Bilingual lexicon induction from non-parallel data with minimal supervision," in Proceedings of the 31st AAAI Conference on Artificial Intelligence (AAAI 2017), pp. 3379–3385, USA, February 2017.

[18] S. Gouws, Y. Bengio, and G. Corrado, "BilBOWA: fast bilingual distributed representations without word alignments," in Proceedings of the 32nd International Conference on Machine Learning (ICML 2015), pp. 748–756, France, July 2015.

[19] P. Koehn, R. Zens, C. Dyer et al., "Moses: open source toolkit for statistical machine translation," in Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions (ACL '07), pp. 177–180, Prague, Czech Republic, June 2007.
