a focused crawler for romanian words discovery
TRANSCRIPT
Authors
University Politehnica of Bucharest
A Focused Crawler for Romanian Words Discovery
Ionuț-Gabriel RaduTraian Rebedea [email protected]
Overview
• Introduction• Objective• RWScraper• Related Work • RWScraper: Implementation• Results• Conclusions
01.05.23 RoEduNet Conference 2014 – Chișinău, R. Moldova 2
Introduction• All natural languages are subject to change
over time • As the Web becomes more prevalent, it also
constitutes a major source for identifying language evolution
• Due to large amounts of Romanian web content, the rate of change has increased significantly
01.05.23 RoEduNet Conference 2014 – Chișinău, R. Moldova 3
Objective• To provide a mechanism to identify new
words (e.g. neologisms) that entered the Romanian language
• Develop a specialized (focused) web crawler for analyzing Romanian web pages and identifying new words
01.05.23 RoEduNet Conference 2014 – Chișinău, R. Moldova 4
Focused Web Crawling• Crawling the web with a specific purpose:
– “Focus” the spiders to specific content (e.g. people search, scientific publications, products, etc.)
– Ignore other web pages and domains
01.05.23 RoEduNet Conference 2014 – Chișinău, R. Moldova 5
Solution: RWScraper• RWScraper (Romanian Word Scraper) - is able
to solve the following problems:– Identify Romanian texts;– Distinguish between proper names and common
nouns;– Create a database with new words along with
context information and metadata. In order to identify new
– Discover the most frequent spelling errors in Romanian online texts.
01.05.23 RoEduNet Conference 2014 – Chișinău, R. Moldova 6
RWScraper – Text Processing• Each word discovered in a Romanian text is looked in
the database provided by www.dexonline.ro, which contains definitions from several Romanian dictionaries (DEX, DOOM, etc.)
• Text Processing Pipeline– Text Normalization– Language Validation– Sentence Segmentation– Sentence-Level Language
Identification– Word Tokenization
01.05.23 RoEduNet Conference 2014 – Chișinău, R. Moldova 7
Related Work: Neologisms Identification
• A study for Japanese:– Scanning existing Japanese corpora for possible ”new” words,
typically by processing the texts through segmentation software and dealing with the ”out-of-lexicon” problem
– Simulating the Japanese morphological processes to create new possible words and then test for the presence of them in large corpora
• Identification of lexical discriminants (e.g. termed, called, known as) and punctuation discriminants (e.g. single and double quotes) for introducing new words– This method is able to identify a significantly smaller number of
potential new words due to the limited number of lexical discriminant patterns.
• Using data about the frequency of words usage over time
01.05.23 RoEduNet Conference 2014 – Chișinău, R. Moldova 8
Related Work:Language Identification
• Common Words Methods– Store and use a list with the most frequent words for each language
• Unique Letter Combinations– Database with the most frequent sequences of letters in a language,
not necessarily valid words– The main disadvantage: the poor performance on short texts– The main advantage: it does not require word tokenization
• Language Identification Using N-Grams– Every language has several specific frequently used character n-grams– For a particular language L, the n-gram ordered dictionary is called n-
gram language profile– For a new text, we compute the distance to all computed language
profiles• Markov Models for Language Identification
– The word can be represented as a Markov chain where letters are states
– Compute a Markov model for each language
01.05.23 RoEduNet Conference 2014 – Chișinău, R. Moldova 9
RWScraper: Implementation• RWScraper is a focused crawler for Romanian
web pages• Developed using Scrapy: open-source scraping
framework in Python• It uses three main concepts:
– Spiders: responsible for defining rules to restrict the crawled content to our area of interest
– Items: data we want to scrape from the web pages– Pipelines: text processing tasks that act on the
crawled web resources
01.05.23 RoEduNet Conference 2014 – Chișinău, R. Moldova 10
RWScraper Language Validation• Divide the texts into two categories:
– Diacritics free texts - DIAFREE – Genuine Romanian texts – GEN
• 6.40% of the characters in the Romanian texts part of the ro_eu_parliament corpus are diacritics
• One of the problems with this approach is that 4.14% of texts contained ș, â, and î. Unfortunately, there are also other languages that possess these diacritics
• Romanian is the only language that uses ț and ă• Our assumption: if a text has over 600 characters and
has no ț/ă are found– Then it is DIAFREE– Otherwise is GEN
01.05.23 RoEduNet Conference 2014 – Chișinău, R. Moldova 11
RWScraper Language Validation• Build language profiles, consisting of:
– Character bigrams and trigrams frequency– Common words frequency– Diacritics frequency– Rare characters frequency– Double consonant frequency– Single quotes frequency
01.05.23 RoEduNet Conference 2014 – Chișinău, R. Moldova 12
Results: Language Validation• 105 texts are divided into: 20 Romanian with diacritics (RO1 -
RO20), 20 Romanian without diacritics (RO21- RO40), 20 Italian, 15 English, 10 Spanish, 5 Latin, 5 French, 5 Turkish texts, 3 Catalan texts, and 2 Aromanian
• The size of the texts varied from 9KB to 2:5MB, the average size being 253:4KB
• Average scores for the discriminator function– Lower score means higher probability for the text to be written in
Romanian– Used to set the discriminant score to 0.77 to separate between
Romanian and non-Romanian texts
01.05.23 RoEduNet Conference 2014 – Chișinău, R. Moldova 13
Results• Processed 264,328 online documents
– Only 12,555 documents contained new words• From this set of texts, we extracted 698,341
– Only 47,363 phrases contained new words• Discovered 53,724 new words
– 21,343 are proper names• The remaining tokens are common words and they are
divided into the following main categories: – Misspelled words (approximately 35%)– Technical words (approximately 15%)– Argotic words (approximately 10%)– Clitics, regionalisms, archaisms, alternative forms for
existing words account for the rest (cca. 40%)
01.05.23 RoEduNet Conference 2014 – Chișinău, R. Moldova 14
Results• Most frequent new words
01.05.23 RoEduNet Conference 2014 – Chișinău, R. Moldova 15
Conclusions• RWScraper is a simple new Romanian words discovery
system• The project has also managed to create a large
database of Romanian words extracted from the WWW– Statistics about common proper names, frequent spelling
mistakes and newly-invented words• There are several elements that could be further
improved– The accuracy of the NLP components used by the system– A more pertinent analysis of the words identified by the
system
01.05.23 RoEduNet Conference 2014 – Chișinău, R. Moldova 16
Thank you!
Questions?
Discussion
01.05.23 RoEduNet Conference 2014 – Chișinău, R. Moldova 17
This work has been funded by the Sectorial Operational Programme
Human Resources Development 2007-2013 of the Romanian Ministry of
European Funds through the Financial Agreement
POSDRU/159/1.5/S/132397