a focused crawler for romanian words discovery

Authors

University Politehnica of Bucharest

A Focused Crawler for Romanian Words Discovery

Ionuț-Gabriel RaduTraian Rebedea [email protected]

Overview

• Introduction• Objective• RWScraper• Related Work • RWScraper: Implementation• Results• Conclusions

01.05.23 RoEduNet Conference 2014 – Chișinău, R. Moldova 2

Introduction• All natural languages are subject to change

over time • As the Web becomes more prevalent, it also

constitutes a major source for identifying language evolution

• Due to large amounts of Romanian web content, the rate of change has increased significantly


Objective• To provide a mechanism to identify new

words (e.g. neologisms) that entered the Romanian language

• Develop a specialized (focused) web crawler for analyzing Romanian web pages and identifying new words


Focused Web Crawling• Crawling the web with a specific purpose:

– “Focus” the spiders to specific content (e.g. people search, scientific publications, products, etc.)

– Ignore other web pages and domains


Solution: RWScraper• RWScraper (Romanian Word Scraper) - is able

to solve the following problems:– Identify Romanian texts;– Distinguish between proper names and common

nouns;– Create a database with new words along with

context information and metadata. In order to identify new

– Discover the most frequent spelling errors in Romanian online texts.


RWScraper – Text Processing• Each word discovered in a Romanian text is looked in

the database provided by www.dexonline.ro, which contains definitions from several Romanian dictionaries (DEX, DOOM, etc.)

• Text Processing Pipeline– Text Normalization– Language Validation– Sentence Segmentation– Sentence-Level Language

Identification– Word Tokenization


Related Work: Neologisms Identification

• A study for Japanese:– Scanning existing Japanese corpora for possible ”new” words,

typically by processing the texts through segmentation software and dealing with the ”out-of-lexicon” problem

– Simulating the Japanese morphological processes to create new possible words and then test for the presence of them in large corpora

• Identification of lexical discriminants (e.g. termed, called, known as) and punctuation discriminants (e.g. single and double quotes) for introducing new words– This method is able to identify a significantly smaller number of

potential new words due to the limited number of lexical discriminant patterns.

• Using data about the frequency of words usage over time


Related Work:Language Identification

• Common Words Methods– Store and use a list with the most frequent words for each language

• Unique Letter Combinations– Database with the most frequent sequences of letters in a language,

not necessarily valid words– The main disadvantage: the poor performance on short texts– The main advantage: it does not require word tokenization

• Language Identification Using N-Grams– Every language has several specific frequently used character n-grams– For a particular language L, the n-gram ordered dictionary is called n-

gram language profile– For a new text, we compute the distance to all computed language

profiles• Markov Models for Language Identification

– The word can be represented as a Markov chain where letters are states

– Compute a Markov model for each language


RWScraper: Implementation• RWScraper is a focused crawler for Romanian

web pages• Developed using Scrapy: open-source scraping

framework in Python• It uses three main concepts:

– Spiders: responsible for defining rules to restrict the crawled content to our area of interest

– Items: data we want to scrape from the web pages– Pipelines: text processing tasks that act on the

crawled web resources


RWScraper Language Validation• Divide the texts into two categories:

– Diacritics free texts - DIAFREE – Genuine Romanian texts – GEN

• 6.40% of the characters in the Romanian texts part of the ro_eu_parliament corpus are diacritics

• One of the problems with this approach is that 4.14% of texts contained ș, â, and î. Unfortunately, there are also other languages that possess these diacritics

• Romanian is the only language that uses ț and ă• Our assumption: if a text has over 600 characters and

has no ț/ă are found– Then it is DIAFREE– Otherwise is GEN


RWScraper Language Validation• Build language profiles, consisting of:

– Character bigrams and trigrams frequency– Common words frequency– Diacritics frequency– Rare characters frequency– Double consonant frequency– Single quotes frequency


Results: Language Validation• 105 texts are divided into: 20 Romanian with diacritics (RO1 -

RO20), 20 Romanian without diacritics (RO21- RO40), 20 Italian, 15 English, 10 Spanish, 5 Latin, 5 French, 5 Turkish texts, 3 Catalan texts, and 2 Aromanian

• The size of the texts varied from 9KB to 2:5MB, the average size being 253:4KB

• Average scores for the discriminator function– Lower score means higher probability for the text to be written in

Romanian– Used to set the discriminant score to 0.77 to separate between

Romanian and non-Romanian texts


Results• Processed 264,328 online documents

– Only 12,555 documents contained new words• From this set of texts, we extracted 698,341

– Only 47,363 phrases contained new words• Discovered 53,724 new words

– 21,343 are proper names• The remaining tokens are common words and they are

divided into the following main categories: – Misspelled words (approximately 35%)– Technical words (approximately 15%)– Argotic words (approximately 10%)– Clitics, regionalisms, archaisms, alternative forms for

existing words account for the rest (cca. 40%)


Results• Most frequent new words


Conclusions• RWScraper is a simple new Romanian words discovery

system• The project has also managed to create a large

database of Romanian words extracted from the WWW– Statistics about common proper names, frequent spelling

mistakes and newly-invented words• There are several elements that could be further

improved– The accuracy of the NLP components used by the system– A more pertinent analysis of the words identified by the

system


Thank you!

Questions?

Discussion


This work has been funded by the Sectorial Operational Programme

Human Resources Development 2007-2013 of the Romanian Ministry of

European Funds through the Financial Agreement

POSDRU/159/1.5/S/132397

a focused crawler for romanian words discovery

Education