analyzing and improving the quality of a historical news collection using language technology and...
DESCRIPTION
This presentation was given by Timo Honkela (National Library of Finland and University of Helsinki) at the IFLA 2014 Pre-Conference "Digital Transformation and the Changing Role of News Media in the 21st Century", Geneva, Switzerland, August 13, 2014. The presentation consists of three main parts: (1) Background, (2) OCR result analysis and correction, and (3) Potential directions for future research in socio-cultural text mining of newspaper collections. The published paper covers items 1 and 2. The abstract of the paper is provided below.
Analyzing and Improving the Quality of a Historical News Collection using Language Technology and Statistical Machine Learning Methods
Kimmo Kettunen, Timo Honkela, Krister Lindén, Pekka Kauppinen, Tuula Pääkkönen and Jukka Kervinen
Abstract
In this paper, we study how to analyze and improve the quality of a large historical newspaper collection. The National Library of Finland has digitized millions of newspaper pages. The quality of the outcome of the OCR process is limited, especially with regard to the oldest parts of the collection. Approaches such as crowdsourcing have been used in this field to improve the quality of the texts, but in this case the volume of the materials makes it impossible to manually edit any substantial proportion of the texts. Therefore, we experiment with quality evaluation and improvement methods based on corpus statistics, language technology and machine learning in order to find ways to automate the analysis and improvement process. The final objective is to reach a clear reduction in the human effort needed in the post-processing of the texts. We present quantitative evaluations of the current quality of the corpus, describe challenges related to texts written in a morphologically complex language, and describe two different approaches to achieve quality improvements.
TRANSCRIPT
Kettunen, Honkela, Linden, Kauppinen, Pääkkönen & Kervinen 2014
Kimmo Kettunen 1, Timo Honkela 1,2, Krister Lindén 2, Pekka Kauppinen 2, Tuula Pääkkönen 1 & Jukka Kervinen 1
Analyzing and Improving the Quality of a Historical News Collection
using Language Technology and Statistical Machine Learning Methods
IFLA Pre-Conference Geneva, Switzerland, 13th of August, 2014
Presented by Timo Honkela
HELSINKI: Department of Modern Languages, Language Technology
MIKKELI: Center for Preservation and Digitisation
www.fmi.fi http://oppimateriaalit.internetix.fi
HonkeLA KettuNEN KauppiNEN PääkköNEN KerviNEN
Lindén
Structure of the presentation
● Some background on the digitalization process
● Introducing the paper content: analysis and correction of OCR results
● Discussion on future steps: in-depth analysis of newspaper contents to promote research in humanities and social sciences
Historical newspaper collection
● The National Library of Finland has digitized a large proportion of the historical newspapers published in Finland between 1771 and 1910 (Bremer-Laamanen 2001, 2005).
● This collection contains approximately 1.95 million pages in Finnish and Swedish
● According to Legal Deposit law, the National Library of Finland receives a copy of each newspaper and magazine published in Finland.
Digitisation of the historical newspaper collection
● In the post-processing phase, the material is processed so that it can be shared with the library sector, researchers, and the general public.
● The scanned images are enhanced and run through background software and processes which create METS/ALTO metadata (CCS Docworks)
● The optical character recognition (OCR) is conducted at the same time in order to get the text content from the materials.
Two channels
● Search and exploration interface (“Digi”)
– Approximate search, focusing based on time/place, indexed contents, index creation using morphological analysis, etc.
– Digitalkoot: enables the public to collectively mark and collect articles (crowdsourcing)
● Corpus (FIN-CLARIN)
– Mainly used by linguists
– Includes keyword-in-context (n-gram) view
– Morphological and syntactical analysis results
Search interface
http://digi.kansalliskirjasto.fi
FIN-CLARIN corpus
OCR Challenges
● Despite recent developments in OCR software, challenges remain, as some of the material is very old, with
– varying paper and print quality,
– varying number of columns and layout patterns,
– different languages (mainly Finnish and Swedish but also French, German, etc.), and
– varying font types (Fraktur and Antiqua)
OCR Challenges
● The amount of material is such that human efforts – even crowdsourced – can only be a partial solution
● Fully or partially automated processes are needed
A very long tail of low frequency forms...
zzhdysvautki Yhdyspankki
v, u, p ? u, n, ll ?
taioafliftiutpn tavallisuuden
Sources of complexity
Word (lexeme)
Inflections
Typos
Recognition errors
Historical differences
“Recognized” surface word
Inflections:
Complexity of Finnish at the level of word forms
Kimmo Koskenniemi (2013): Johdatus kieliteknologiaan, sen merkitykseen ja sovelluksiin (Introduction to language technology, its significance and applications)
https://helda.helsinki.fi/bitstream/handle/10138/38503/kt-johd.pdf?sequence=1
Typos
Not a major source of problems, but they do exist
Basel
Most likely not a stain
Historical differences
● All the time, new names and words are being introduced
● Even more static morphological aspects evolve over centuries
Net outcome
● A collection of millions of newspaper pages gives rise to a list of hundreds of millions of different word forms that have been found in the process
● A large proportion of these forms is not correct
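The long tail of word forms described on these slides can be illustrated with a small frequency count. This is only a minimal sketch: the function name `long_tail_summary` and the toy token list are assumptions made for illustration, and the numbers say nothing about the actual collection.

```python
from collections import Counter


def long_tail_summary(tokens):
    """Return (distinct forms, forms seen exactly once, share of such forms)."""
    counts = Counter(tokens)
    distinct = len(counts)
    singletons = sum(1 for c in counts.values() if c == 1)
    share = singletons / distinct if distinct else 0.0
    return distinct, singletons, share


# Toy token list, invented for illustration (not real corpus statistics).
tokens = ["ja", "ja", "ja", "on", "on",
          "tavallisuuden", "taioafliftiutpn", "zzhdysvautki"]
distinct, singletons, share = long_tail_summary(tokens)
print(distinct, singletons, share)
```

In a real OCR corpus, forms that occur only once are a natural first place to look for recognition errors, although rare but correct inflected forms and names will also end up there.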
Detection and correction
● Improving OCR quality – not considered here
● Improving the OCR output based on linguistic knowledge and statistical considerations
– Detecting incorrect forms
– Correcting the incorrect form
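The detection step could be sketched roughly as follows. This is not the method of the paper: a plain Python set stands in for a morphological analyzer or special dictionary, and the character-trigram check, the function names and the tiny lexicon are all assumptions made for illustration.

```python
def char_ngrams(word, n=3):
    """Character n-grams of a lowercased word, padded with '#' at both ends."""
    padded = "#" * (n - 1) + word.lower() + "#" * (n - 1)
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def build_ngram_set(lexicon, n=3):
    """Collect every character n-gram that occurs in the known words."""
    seen = set()
    for known_word in lexicon:
        seen.update(char_ngrams(known_word, n))
    return seen


def suspicion_score(word, lexicon, known_ngrams, n=3):
    """0.0 for a lexicon word; otherwise the fraction of unseen n-grams."""
    if word.lower() in lexicon:
        return 0.0
    grams = char_ngrams(word, n)
    unseen = sum(1 for g in grams if g not in known_ngrams)
    return unseen / len(grams)


# Hypothetical stand-in lexicon; a real system would query an analyzer.
LEXICON = {"yhdyspankki", "tavallisuuden", "pankki", "tavallinen"}
KNOWN = build_ngram_set(LEXICON)

print(suspicion_score("Yhdyspankki", LEXICON, KNOWN))   # known form
print(suspicion_score("zzhdysvautki", LEXICON, KNOWN))  # OCR-damaged form
```

A score near 1 means that almost none of the word's letter sequences occur in known vocabulary. For a morphologically rich language the lexicon lookup would need to accept inflected forms, which is why an analyzer is preferable to a flat word list.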
Introduction to the basic ideas
● Detection
– Morphological analyzer
– Special dictionaries (e.g. names)
– N-grams
● Correction
– Transformation rules created through a supervised learning scheme
– Edit distance approach using corpus statistics
– Weighted edit distance based on letter shapes
– Future: context information (problem of sparsity)
Please see the paper for methodological details and analysis results
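To make the correction idea more concrete, here is a minimal sketch of a weighted edit distance combined with corpus frequencies. It is an illustration under stated assumptions, not the implementation from the paper: the substitution costs in `SUB_COST`, the frequencies in `FREQ`, the `max_dist` threshold and the function names are all invented for the example.

```python
def weighted_edit_distance(source, target, sub_cost, default=1.0):
    """Levenshtein distance in which selected substitutions are cheaper."""
    prev = [j * default for j in range(len(target) + 1)]
    for i, s in enumerate(source, 1):
        curr = [i * default]
        for j, t in enumerate(target, 1):
            sub = 0.0 if s == t else sub_cost.get((s, t), default)
            curr.append(min(prev[j] + default,       # delete s
                            curr[j - 1] + default,   # insert t
                            prev[j - 1] + sub))      # substitute s -> t
        prev = curr
    return prev[-1]


def best_correction(word, freq, sub_cost, max_dist=2.0):
    """Closest candidate within max_dist; ties go to the more frequent form."""
    scored = []
    for candidate, count in freq.items():
        dist = weighted_edit_distance(word, candidate, sub_cost)
        if dist <= max_dist:
            scored.append((dist, -count, candidate))
    return min(scored)[2] if scored else None


# Hypothetical costs: visually similar letter pairs are cheap to confuse.
SUB_COST = {("v", "p"): 0.3, ("p", "v"): 0.3,
            ("u", "n"): 0.3, ("n", "u"): 0.3}
# Hypothetical corpus frequencies of accepted word forms.
FREQ = {"yhdyspankki": 40, "yhdyspankin": 12}

print(weighted_edit_distance("yhdysvautki", "yhdyspankki", SUB_COST))
print(best_correction("yhdysvautki", FREQ, SUB_COST))
```

With uniform costs the damaged form is three edits away from "yhdyspankki"; lowering the cost of shape-similar substitutions brings it within the threshold while less plausible candidates stay outside it. In practice the costs would be estimated from data, for example from measured similarity between letter shapes, rather than set by hand.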
Similarity diagram of Fraktur letter shapes (a self-organizing map)
Socio-Historical Text Mining of Newspaper Collections
Research direction
Areas of analysis
● Named entity recognition (people, organizations, places, events)
● Time series analysis
● Social network analysis
● Topic modeling
cf. Virginie Fortun's presentation
Areas of analysis
● Multidimensional sentiment analysis
● Analysis of social and historical context
● Intercultural and multilingual analysis
● Analysis of point of view
● Analysis of subjective understanding
Stella Wisdom & Neil Smyth
Earlier related results
Learning meaning from context:
Maps of words in Grimm fairy tales
Honkela, Pulkki & Kohonen 1995
Multidimensional sentiment using the PERMA model
● Seligman and his colleagues have developed the PERMA model, which addresses different aspects of well-being.
● The model includes five components related to subjective well-being:
– Positive emotion (P),
– Engagement (E),
– Relationships (R),
– Meaning (M) and
– Achievement (A)
Honkela, Korhonen, Lagus & Saarinen 2014
PERMA profiles of different corpora
Honkela, Korhonen, Lagus & Saarinen 2014
Timo Honkela, Juha Raitio, Krista Lagus, Ilari T. Nieminen, Nina Honkela, and Mika Pantzar: Subjects on objects in contexts: Using GICA method to quantify epistemological subjectivity (IJCNN 2012)
Analysis of the subjective meaning: word 'health'
Analysis of the State of the Union Addresses
Socio-Historical Text Mining of Newspaper Collections
A call for interdisciplinary international collaboration
Libraries, researchers within journalism, corpus linguistics, history, sociology, political science,
psychology, computer science, machine learning, etc.
Merci! Danke schön!
Grazie! Multumesc! ¡Gracias!
Thank you! Kiitos! Tack! 謝謝!
Σας ευχαριστούμε!