Page 1: Computational History and the Transformation of Public ... · Improvements for data & presentation of data Main achievements - OCR quality estimation for old data - A new OCR pipeline

Computational History and the Transformation of Public Discourse in Finland, 1640–1910

COMHIS

Consortium partners

● National Library of Finland, Centre for Preservation and Digitisation (PI Kettunen)

● University of Helsinki, Faculty of Humanities (PI Tolonen)

● University of Turku, Dept of Future Technologies (PI Salakoski)

● University of Turku, Dept of Cultural History (PI Salmi)

Presenters today

● Kimmo Kettunen, Mikko Tolonen and Hannu Salmi

The National Library of Finland, DH research

People involved in the project:

Kimmo Kettunen (PI)

Teemu Ruokolainen

Mika Koistinen

Erno Liukkonen

+ Computational support from ICT in Mikkeli (especially Tuula Pääkkönen)

Data of the NLF - lots of open data released

Consists of the digitized historical newspapers & journals in Finnish and Swedish from 1771-1918 (copyright law, agreements with Kopiosto and general cautiousness shift the end year slightly)

Data packages released over the years (starting from no open data at all)

- First an open data package from 1771-1910 was made available (2 M pages)

- Later newspapers 1911-1917

- Uusi Suometar 1869-1917 with improved OCR

- Metadata (METS) for newspapers & journals 1771-1917

- OCR ground truth data for Finnish and Swedish

Improvements for data & presentation of data

Main achievements

- OCR quality estimation for old data

- A new OCR pipeline for Finnish based on Tesseract 3.04.01 (clear improvement: e.g. for 86 000 pages of Uusi Suometar, a mean improvement in word recognition of 15 percentage points)

- Named entity recognition training and evaluation data for Finnish: with Stanford NER, F-scores of 0.72 (persons) and 0.79 (locations) → a working NER model

- Article extraction training and evaluation data for Uusi Suometar → decent results
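The two evaluation figures above can be illustrated with a minimal sketch. It assumes that word recognition is measured as the share of tokens found in a lexicon, and that the improvement is the mean per-page difference in that share, in percentage points. The slides do not specify the recognizer that was actually used, so the toy `lexicon` and the function names here are hypothetical. The F-score is the standard harmonic mean of precision and recall.

```python
def recognition_rate(tokens, lexicon):
    """Share of tokens (in percent) that are found in the lexicon."""
    if not tokens:
        return 0.0
    known = sum(1 for tok in tokens if tok.lower() in lexicon)
    return 100.0 * known / len(tokens)


def mean_improvement(pages_old, pages_new, lexicon):
    """Mean per-page change in recognition rate, in percentage points."""
    diffs = [
        recognition_rate(new, lexicon) - recognition_rate(old, lexicon)
        for old, new in zip(pages_old, pages_new)
    ]
    return sum(diffs) / len(diffs)


def f_score(precision, recall):
    """Balanced F-score (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy lexicon and one page in two OCR versions (illustrative values only)
lexicon = {"suomi", "on", "maa"}
old_ocr = [["suomi", "0n", "maa", "x"]]
new_ocr = [["suomi", "on", "maa", "x"]]
print(mean_improvement(old_ocr, new_ocr, lexicon))
```

Reporting the change in percentage points rather than as a relative percentage keeps the figure comparable across pages whose starting quality differs.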

What the user sees: digi.kansalliskirjasto.fi

- The user interface of the presentation system digi.kansalliskirjasto.fi has been improved at least twice during the four years

- New ways to see data: NER and article extraction for Uusi Suometar

- Automatic clippings in Uusi Suometar with PDF & text available (based on automatic article extraction software)

Publications of NLF

1) Kettunen, Kimmo, Mäkelä, Eetu, Ruokolainen, Teemu, Kuokkala, Juha, Löfberg, Laura, 2017. Old Content and Modern Tools – Searching Named Entities in a Finnish OCRed Historical Newspaper Collection 1771–1910. Digital Humanities Quarterly 11 (3).

2) Pääkkönen, Tuula, Kervinen, Jukka, Nivala, Asko, Kettunen, Kimmo, Mäkelä, Eetu, 2016. Exporting Finnish Digitized Historical Newspaper Contents for Offline Use. D-Lib Magazine, July/August 2016.

3) Kettunen, Kimmo, Pääkkönen, Tuula, 2018. Kansalliskirjaston historialliset sanoma- ja aikakauslehdet avoimena digitaalisena datana - datapaketteja, rajapintoja, käyttäjiä ja tutkimusongelmia. Informaatiotutkimus 4, 94–109.

4) Pääkkönen, Tuula, Kervinen, Jukka, Kettunen, Kimmo, 2018. Digitisation and Digital Library Presentation System – A Resource-Conscientious Approach. Proc. of Digital Humanities in the Nordic Countries 3rd Conference, Helsinki, Finland, March 7-9, 2018, CEUR-WS.org/Vol-2084/

5) Kettunen, Kimmo, Koistinen, Mika, Kervinen, Jukka, 2019. Tidying up the Mess – on a Way to Improved Quality in a Historical Finnish Newspaper and Journal Collection 1771-1910. DATeCH 2019.

6) La Mela, Matti, Tamper, Minna, Kettunen, Kimmo, 2019. Finding Nineteenth-century Berry Spots: Recognizing and Linking Place Names in a Historical Newspaper Berry-picking Corpus. DHN2019.

7) Pääkkönen, Tuula, 2019. Digital heritage presentation system development + new material types: early findings. In Jarmo Harri Jantunen, Sisko Brunni, Niina Kunnas, Santeri Palviainen and Katja Västi (eds), Proceedings of the Research Data and Humanities (RDHum) 2019 Conference: Data, Methods and Tools. Studia humaniora ouluensia, 67-80.

8) Ruokolainen, Teemu, Kettunen, Kimmo (2020, submitted in Sep. 2019). Name the Name - Named Entity Recognition in OCRed 19th and Early 20th Century Finnish Newspaper and Journal Collection Data.

Helsinki Computational History Group

Leo Lahti², Ville Vaara¹, Jani Marjanen¹, Hege Roivainen¹, Ali Ijaz¹, Simon Hengchen¹, Iiro Tiihonen¹, Tanja Säily¹, Antti Kanner¹, Mark Hill¹, Eetu Mäkelä¹, Mikko Tolonen¹

¹ University of Helsinki

² University of Turku

Academy of Finland DigiHum project as seed money

Modes of collaboration: twin collaborations, projects, research groups

● Research group: expected timeframe 10+ years

- In practice mutates into all other modes of collaboration

● Twin collaboration: timeframe 1-2 years

- E.g. a humanities partner seeking a data science partner for a particular case

● Project collaboration: timeframe 3-5 years

- Either straightforward research projects or collaborations that mix research and infrastructure

Helsinki Computational History Group

● Focus on early modern public discourse during the hand-press era (1450-1830).

● In text analysis we are often dealing with noisy sources.

● Work on Europe in general (e.g. CERL sources), but also a strong focus on eighteenth-century Britain (ECCO & ESTC datasets).

● Also studying newspaper publicity in nineteenth-century Europe, and in Finland in particular.

● The methodological strategy is to combine harmonizing of metadata with text mining of full-text sources.

Helsinki Computational History Group Strategy

Text mining of large corpora

• Objective: understanding conceptual change, uses of language

• Sources: full-text databases (ECCO, EEBO, Finnish Newspapers etc.)

• Potential: Theoretically great, the future?

• In practice: raw data almost never openly available; if it is, tied to limited interfaces

• Scalability with open research data: data-driven approach

• Methodological perspective: Messy to study historical sources, intellectual input not guaranteed.

Metadata as a quantitative tool

• Objective: Quantitative study of material objects

• Sources: World is full of different metadata collections

• Potential: Greatly underestimated (even by librarians)

• In practice: difficulties with open access to raw data and supporting data sources, but not impossible.

• Scalability with open research data: fantastic

• Methodological perspective: excellent for borrowing best practices from other scientific fields. Quality of catalogues varies.

Fig. 1 Paper consumption by book format in the FNB 1640–1911 (a) and the SNB 1640–1828 (b)

Fig. 2 Newspapers published annually according to approximate page size, 1800–1917

Publications

● Tolonen, M., Marjanen, J., Kanner, A., Mäkelä, E., Lahti, L., Vaara, V., ... Lähteenoja, V. (2017). OCTAVO – Analysing Early Modern Public Communication [poster]. Presented in Digital Humanities at Oxford Summer School. 2017.

● Lahti, L., Vaara, V., Marjanen, J., Roivainen, H., Ijaz, A., Hengchen, S., ... Tolonen, M. (2018). Quantitative analysis of public discourse in Europe 1470-1910. Abstract from Digital Humanities Benelux (DH Benelux 2018), Amsterdam, Netherlands.

● Tolonen, M., Mäkelä, E., Marjanen, J., Kanner, A., Lahti, L., Ginter, F., ... Sippola, R. (2018). Metadata Analysis and Text Reuse Detection: Reassessing public discourse in Finland through newspapers and journals 1771–1917. Poster session presented at Digital humanities in the Nordic Countries DHN2018, Helsinki, Finland.

● Tolonen, M., Lahti, L., Marjanen, J., & Roivainen, H. (2018). A Quantitative Approach to Book-Printing in Sweden and Finland, 1640-1828. Historical Methods, 57-78. https://doi.org/10.1080/01615440.2018.1526657

● Lahti, L., Marjanen, J. P., Roivainen, H. H. M., & Tolonen, M. S. (2019). Bibliographic Data Science and the History of the Book (c. 1500–1800). Cataloging & Classification Quarterly, 57(1), 5-23. https://doi.org/10.1080/01639374.2018.1543747

● Marjanen, J., Vaara, V., Kanner, A., Roivainen, H., Mäkelä, E., Lahti, L., & Tolonen, M. (2019). A National Public Sphere? Analyzing the Language, Location, and Form of Newspapers in Finland, 1771–1917. Journal of European Periodical Studies, 4(1), 54-77. https://doi.org/10.21825/jeps.v4i1.10483

● Hengchen, S., Ros, R., & Marjanen, J. (2019). A data-driven approach to the changing vocabulary of the ‘nation’ in English, Dutch, Swedish and Finnish newspapers, 1750-1950. Abstract from Digital Humanities 2019, Utrecht, Netherlands.

Two subprojects of the consortium in Turku

Department of Future Technologies: Tapio Salakoski (PI), Filip Ginter, Aleksi Vesanto

Department of Cultural History: Hannu Salmi (PI), Asko Nivala, Petri Paju, Heli Rantala, Reetta Sippola

Full-text mining of newspapers and magazines, 1771–1920: text reuse

• We developed a special solution for detecting text reuse, based on NCBI BLAST, software originally created for comparing and aligning biomedical sequences.

• In our method, we encoded the data into protein sequences, which could then be read by BLAST to identify overlapping regions in the sequences. These pairs were then clustered so that overlapping passages were interpreted as part of the same cluster. A cluster thus contains all found occurrences of a particular reuse.

• We processed 5 million pages of newspapers and magazines and found 61 million occurrences of similarity, which formed 13.8 million clusters of reuse.

• The method was published in 2017 (Vesanto et al.).

• The software and guidelines are openly available at https://github.com/avjves/textreuse-blast
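The encoding and clustering steps described above can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the project's implementation: the character-to-amino-acid mapping (most frequent letters mapped to the 20 standard amino-acid codes), the hit format `(document, start, end)`, and the function names are all hypothetical. The BLAST alignment itself is left out, and `hits` stands for the aligned passage pairs that BLAST would return.

```python
from collections import Counter

AMINO = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino-acid letters


def build_encoding(texts):
    """Map the most frequent letters of the corpus to amino-acid letters."""
    counts = Counter(ch for t in texts for ch in t.lower() if ch.isalpha())
    common = [ch for ch, _ in counts.most_common(len(AMINO))]
    return dict(zip(common, AMINO))


def encode(text, table):
    """Encode text as a protein-like string; unmapped characters are dropped."""
    return "".join(table[ch] for ch in text.lower() if ch in table)


def cluster_hits(hits):
    """Group aligned passage pairs into clusters with union-find.

    hits: list of pairs ((doc, start, end), (doc, start, end)).
    Two passages share a cluster if they were aligned with each other,
    or if they overlap within the same document.
    """
    passages = sorted({p for pair in hits for p in pair})
    parent = {p: p for p in passages}

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path halving
            p = parent[p]
        return p

    def union(a, b):
        parent[find(a)] = find(b)

    for a, b in hits:
        union(a, b)
    # Passages are sorted by (doc, start, end), so the inner scan can stop
    # at the first passage that starts after the current one ends.
    for i, a in enumerate(passages):
        for b in passages[i + 1:]:
            if b[0] != a[0] or b[1] >= a[2]:
                break
            union(a, b)

    clusters = {}
    for p in passages:
        clusters.setdefault(find(p), []).append(p)
    return sorted(clusters.values())


table = build_encoding(["aab"])
print(encode("ab ba", table))
hits = [
    (("A", 0, 100), ("B", 10, 110)),
    (("B", 50, 150), ("C", 0, 100)),
    (("D", 0, 50), ("E", 0, 50)),
]
print(len(cluster_hits(hits)))
```

Dropping unmapped characters is one possible way to make the encoding tolerant of OCR noise; how the published software handles rare characters should be checked against its repository.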

Text reuse database

Press as a network

Year 1877 (59 581 clusters)

Year 1897 (240 933 clusters)

Source: comhis.fi/clusters

Timescapes of text reuse

• In 1776, the Suomenkieliset Tieto-Sanomat published an announcement by the Royal Swedish Academy of Sciences on a prize awarded to a peasant who seeded the largest amount of root crops in one field (Cluster ID 6317820).

• Timescape of the cluster, copied 19 times within the time span of 142 years.

Source: comhis.fi/clusters.
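The time span of a cluster can be read as the distance between its earliest and latest occurrence. A minimal sketch, assuming each occurrence is reduced to its publication year. The function name and the example years are placeholders, not data from the cluster database.

```python
def cluster_span(years):
    """Return (number of occurrences, span in years) for one reuse cluster."""
    if not years:
        raise ValueError("a cluster needs at least one occurrence")
    return len(years), max(years) - min(years)


# Placeholder years for illustration only, not the data of any real cluster
occurrences, span = cluster_span([1800, 1805, 1850])
print(occurrences, span)
```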

Social outreach & future steps

• Press releases during the project: Petri Paju’s findings on the role of newspapers in promoting lichen food were published, for example, in Helsingin Sanomat 29 Aug 2019

• Blog project on everyday life in Turku 1917–1918, https://blogit.utu.fi/elamaaturussa/

• Future of our database: negotiation with the National Library of Finland to integrate reuse clusters into digi.kansalliskirjasto.fi

• The project has boosted the study of digital history in Turku and fuelled interdisciplinary collaborations, see digitalhistory.fi

• Future steps 1: text reuse over the Baltic Sea, bringing together Swedish and Finnish-Swedish corpora

• Future steps 2: intermedial relations

Publications

• Nivala A., Salmi H., Sarjala J., History and Virtual Topology: The Nineteenth-Century Press as Material Flow. Historein Vol. 17, No 2 (2018), DOI: http://dx.doi.org/10.12681/historein.14612

• Paju, P. Ensimmäiset naiset insinöörien ja arkkitehtien yhdistyksissä. Tekniikan Waiheita 36, 1/2018, 5–24. https://journal.fi/tekniikanwaiheita/article/view/82350

• Paju P., Jäkälän paluu: Jäkälävalistus ja tekstien uudelleenkäyttö historiallisen tutkimusteeman jäsentäjänä. Ennen ja nyt 2, 2019: http://www.ennenjanyt.net/2019/08/jakalan-paluu-jakalavalistus-ja-tekstien-uudelleenkaytto-historiallisen-tutkimusteeman-jasentajana/

• Paju P., Rantala H., Salmi H.: Digitaalinen historiantutkimus. Erikoisnumero. Ennen ja nyt 2, 2019, http://www.ennenjanyt.net/numerot/2019-2/

• Paju P., Rantala H., Salmi H., Tietokannoista tulkintoihin: digitaalisen historiantutkimuksen käytäntöjä. Ennen ja nyt 2, 2019, http://www.ennenjanyt.net/2019/08/tietokannoista-tulkintoihin-digitaalisen-historiantutkimuksen-kaytantoja/

• Rantala H., Salmi H., Vesanto A., Ginter F., Tekstien pitkä elämä: Ajassa liikkuvat tekstit suomalaisessa sanomalehdistössä 1771–1920. Ennen ja nyt 2, 2019: http://www.ennenjanyt.net/2019/08/tekstien-pitka-elama-ajassa-liikkuvat-tekstit-suomalaisessa-sanomalehdistossa-1771-1920/

• Rantala H., Nivala A., Salmi H., Paju P., Sippola R., Vesanto A. & Ginter F., Tekstien uudelleenkäyttö suomalaisessa sanoma- ja aikakauslehdistössä 1771–1920. Digitaalisten ihmistieteiden näkökulma. Historiallinen Aikakauskirja 1, 2019: 53–67.

• Salmi H., Nivala A., Rantala H., Sippola R., Vesanto A. & Ginter F., Återanvändningen av text i den finska tidningspressen 1771–1853. Historisk tidskrift för Finland 1, 2018: 46–76.

• Salmi H., Rantala H., Vesanto A. & Ginter F., The Long-Term Reuse of Text in the Finnish Press, 1771–1920. Proceedings of the 4th Digital Humanities in the Nordic Countries 2019. Copenhagen, Denmark 6–8 March 2019, http://ceur-ws.org/Vol-2364/36_paper.pdf

• Salmi H. Viraalisuus – kulttuurihistoriallinen näkökulma. Niin & näin 1, 2018: 71–79.

• Vesanto A., Nivala A., Rantala H., Salakoski T., Salmi H. & Ginter F., Applying BLAST to Text Reuse Detection in Finnish Newspapers and Journals, 1771–1910. In: Proceedings of the 21st Nordic Conference of Computational Linguistics. Gothenburg, Sweden, 23–24 May 2017 (Linköping 2017), http://www.ep.liu.se/ecp/133/010/ecp17133010.pdf

• Vesanto A., Nivala A., Salakoski T., Salmi H. & Ginter F.: A System for Identifying and Exploring Text Repetition in Large Historical Document Corpora. In: Proceedings of the 21st Nordic Conference of Computational Linguistics. Gothenburg, Sweden, 23–24 May 2017 (Linköping 2017), http://www.ep.liu.se/ecp/131/049/ecp17131049.pdf