Computational History and the Transformation of Public Discourse in Finland, 1640–1910
COMHIS
Consortium partners
● National Library of Finland, Centre for Preservation and Digitisation (PI Kettunen)
● University of Helsinki, Faculty of Humanities (PI Tolonen)
● University of Turku, Dept of Future Technologies (PI Salakoski)
● University of Turku, Dept of Cultural History (PI Salmi)
Presenters today
● Kimmo Kettunen, Mikko Tolonen and Hannu Salmi
The National Library of Finland, DH research
People involved in the project:
Kimmo Kettunen (PI)
Teemu Ruokolainen
Mika Koistinen
Erno Liukkonen
+ Computational support from ICT in Mikkeli (especially Tuula Pääkkönen)
Data of the NLF - lots of open data released
Consists of the digitized historical newspapers & journals in Finnish and Swedish
from 1771-1918 (copyright law, agreements with Kopiosto and general
cautiousness shift the end year slightly)
Data packages released over the years (starting from no open data at all)
- First an open data package from 1771-1910 was made available (2 M pages)
- Later newspapers 1911-1917
- Uusi Suometar 1869-1917 with improved OCR
- Metadata (METS) for newspapers & journals 1771-1917
- OCR ground truth data for Finnish and Swedish
Improvements for data & presentation of data
Main achievements
- OCR quality estimation for old data
- A new OCR pipeline for Finnish based on Tesseract 3.04.01 (clear
improvement: e.g. for 86 000 pages of Uusi Suometar, a mean improvement in
word recognition of 15 percentage points)
- Named Entity recognition training and evaluation data for Finnish: with
Stanford NER: F scores of 0.72 (persons) and 0.79 (locations) → a working
NER model
- Article extraction training and evaluation data for Uusi Suometar → decent
results
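As a reminder of how the F scores quoted above are usually derived, here is a minimal sketch. It assumes the standard F-measure (the harmonic mean of precision and recall when beta = 1); the slides do not state which variant was used, and the precision and recall values below are purely illustrative, not the project's figures.

```python
def f_score(precision: float, recall: float, beta: float = 1.0) -> float:
    """Standard F-measure; beta = 1 gives the harmonic mean (F1).

    Assumption: the slides report plain F1. They do not say so explicitly.
    """
    if precision == 0 and recall == 0:
        return 0.0  # avoid division by zero when nothing was found
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Illustrative values only (not taken from the NER evaluation).
example = f_score(precision=0.5, recall=0.5)
```

Because the harmonic mean is pulled towards the lower of the two values, a model cannot reach a high F score through precision or recall alone.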
What the user sees: digi.kansalliskirjasto.fi
- The user interface of the presentation system digi.kansalliskirjasto.fi has been
improved at least twice during the four years of the project
- New ways to see data: NER and article extraction for Uusi Suometar
- Automatic clippings in Uusi Suometar with pdf & text available (based on
automatic article extraction software)
Publications of NLF
1) Kettunen, Kimmo, Mäkelä, Eetu, Ruokolainen, Teemu, Kuokkala, Juha, Löfberg, Laura, 2017. Old Content and Modern Tools – Searching Named Entities in a
Finnish OCRed Historical Newspaper Collection 1771–1910. Digital Humanities Quarterly 11 (3).
2) Pääkkönen, Tuula, Kervinen, Jukka, Nivala, Asko, Kettunen, Kimmo, Mäkelä Eetu, 2016. Exporting Finnish Digitized Historical Newspaper Contents for Offline
Use. D-Lib Magazine, July/August 2016.
3) Kettunen, Kimmo, Pääkkönen, Tuula, 2018. Kansalliskirjaston historialliset sanoma- ja aikakauslehdet avoimena digitaalisena datana - datapaketteja, rajapintoja,
käyttäjiä ja tutkimusongelmia. Informaatiotutkimus 4, 94–109.
4) Pääkkönen, Tuula, Kervinen, Jukka, Kettunen, Kimmo, 2018. Digitisation and Digital Library Presentation System – A Resource-Conscientious Approach. Proc. of
Digital Humanities in the Nordic Countries 3rd Conference, Helsinki, Finland, March 7-9, 2018, CEUR-WS.org/Vol-2084/
5) Kettunen, Kimmo, Koistinen, Mika, Kervinen, Jukka, 2019. Tidying up the Mess – on a Way to Improved Quality in a Historical Finnish Newspaper and Journal
Collection 1771-1910. DATeCH 2019.
6) La Mela, Matti, Tamper, Minna, Kettunen, Kimmo, 2019. Finding Nineteenth-century Berry Spots: Recognizing and Linking Place Names in a Historical Newspaper
Berry-picking Corpus. DHN2019.
7) Pääkkönen, Tuula, 2019. Digital heritage presentation system development + new material types: early findings. In Jarmo Harri Jantunen, Sisko Brunni, Niina
Kunnas, Santeri Palviainen and Katja Västi (eds), Proceedings of the Research Data and Humanities (RDHum) 2019 Conference: Data, Methods and Tools. Studia
humaniora ouluensia, 67-80.
8) Ruokolainen, Teemu, Kettunen, Kimmo (2020, submitted in Sep. 2019). Name the Name - Named Entity Recognition in OCRed 19th and Early 20th Century
Finnish Newspaper and Journal Collection Data
Helsinki Computational History Group
Leo Lahti², Ville Vaara¹, Jani Marjanen¹, Hege Roivainen¹, Ali Ijaz¹, Simon Hengchen¹, Iiro Tiihonen¹, Tanja Säily¹, Antti Kanner¹, Mark Hill¹, Eetu
Mäkelä¹, Mikko Tolonen¹
¹ University of Helsinki
² University of Turku
Academy of Finland DigiHum project as seed money
Research group
- Expected timeframe: 10+ years
- In practice mutates into all other modes of collaboration
Twin collaboration
- Timeframe: 1-2 years
- For example, a humanities partner seeking a data science partner for a particular case
Project collaboration
- Timeframe: 3-5 years
- Either straightforward research projects or collaborations that mix research and infrastructure
Helsinki Computational History Group
● Focus on early modern public discourse during the hand-press
era (1450-1830).
● In text analysis we are often dealing with noisy sources.
● Work on Europe in general (e.g. CERL sources), but also a
strong focus on eighteenth-century Britain (ECCO & ESTC
datasets).
● Also studying newspaper publicity in nineteenth-century
Europe, and in Finland in particular.
● Methodological strategy is to combine harmonizing of
metadata with text mining of full-text sources.
Helsinki Computational History Group Strategy
Text mining of large corpora
• Objective: understanding conceptual change, uses of language
• Sources: full-text databases (ECCO, EEBO, Finnish Newspapers etc.)
• Potential: Theoretically great, the future?
• In practice: raw data almost never openly available; if it is, tied to limited interfaces
• Scalability with open research data: data-driven approach
• Methodological perspective: Messy to study historical sources, intellectual input not guaranteed.
Metadata as a quantitative tool
• Objective: Quantitative study of material objects
• Sources: World is full of different metadata collections
• Potential: Greatly underestimated (even by librarians)
• In practice: difficulties with open access to raw data and supporting data sources, but not impossible.
• Scalability with open research data: fantastic
• Methodological perspective: excellent for borrowing best practices from other scientific fields. Quality of catalogues varies.
Fig. 1 Paper consumption by book format in the FNB 1640–1911 (a) and the SNB 1640–1828 (b)
Fig. 2 Newspapers published annually according to approximate page size, 1800–1917
Publications
● Tolonen, M., Marjanen, J., Kanner, A., Mäkelä, E., Lahti, L., Vaara, V., ... Lähteenoja, V. (2017). OCTAVO – Analysing Early
Modern Public Communication [poster]. Presented in Digital Humanities at Oxford Summer School. 2017.
● Lahti, L., Vaara, V., Marjanen, J., Roivainen, H., Ijaz, A., Hengchen, S., ... Tolonen, M. (2018). Quantitative analysis of public
discourse in Europe 1470-1910. Abstract from Digital Humanities Benelux (DH Benelux 2018), Amsterdam, Netherlands.
● Tolonen, M., Mäkelä, E., Marjanen, J., Kanner, A., Lahti, L., Ginter, F., ... Sippola, R. (2018). Metadata Analysis and Text Reuse
Detection: Reassessing public discourse in Finland through newspapers and journals 1771–1917. Poster session presented at
Digital humanities in the Nordic Countries DHN2018, Helsinki, Finland.
● Tolonen, M., Lahti, L., Marjanen, J., & Roivainen, H. (2018). A Quantitative Approach to Book-Printing in Sweden and Finland,
1640-1828. Historical Methods, 57-78. https://doi.org/10.1080/01615440.2018.1526657
● Lahti, L., Marjanen, J. P., Roivainen, H. H. M., & Tolonen, M. S. (2019). Bibliographic Data Science and the History of the Book
(c. 1500–1800). Cataloging & Classification Quarterly, 57(1), 5-23. https://doi.org/10.1080/01639374.2018.1543747
● Marjanen, J., Vaara, V., Kanner, A., Roivainen, H., Mäkelä, E., Lahti, L., & Tolonen, M. (2019). A National Public Sphere?
Analyzing the Language, Location, and Form of Newspapers in Finland, 1771–1917. Journal of European Periodical Studies,
4(1), 54-77. https://doi.org/10.21825/jeps.v4i1.10483
● Hengchen, S., Ros, R., & Marjanen, J. (2019). A data-driven approach to the changing vocabulary of the ‘nation’ in English,
Dutch, Swedish and Finnish newspapers, 1750-1950. Abstract from Digital Humanities 2019, Utrecht, Netherlands.
Department of Future Technologies: Tapio Salakoski (PI), Filip Ginter, Aleksi Vesanto
Department of Cultural History: Hannu Salmi (PI), Asko Nivala, Petri Paju, Heli Rantala, Reetta Sippola
Two subprojects of the consortium in Turku
Full-text mining of newspapers and magazines, 1771–1920: text reuse
• We developed a dedicated solution for detecting text reuse, based on NCBI BLAST, software originally created for comparing and aligning biomedical sequences.
• In our method, we encoded the data as protein sequences, which BLAST could then read to identify overlapping regions in the sequences. The resulting pairs were then clustered so that overlapping passages were interpreted as part of the same cluster. A cluster thus contains all found occurrences of a particular reuse.
• We processed 5 million pages of newspapers and magazines and found 61 million occurrences of similarity, which formed 13.8 million clusters of reuse.
• The method was published in 2017 (Vesanto et al.).
• The software and guidelines are openly available at https://github.com/avjves/textreuse-blast
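The two ideas described above, encoding text as protein-like sequences and grouping overlapping hit pairs into clusters, can be illustrated with a minimal sketch. This is not the project's implementation (see the repository linked above for that). The letter-to-symbol mapping, the function names `encode` and `cluster_hits`, and the passage identifiers are all hypothetical. BLAST itself is not called; the pairs stand in for the alignments it would report.

```python
from collections import defaultdict

# Hypothetical 20-symbol alphabet resembling amino-acid codes.
# The mapping actually used by the project is not given in the slides.
AMINO = "ACDEFGHIKLMNPQRSTVWY"


def encode(text: str, alphabet: str = AMINO) -> str:
    """Map each letter to a protein-like symbol (illustrative, lossy mapping)."""
    letters = [c for c in text.lower() if c.isalpha()]
    return "".join(alphabet[ord(c) % len(alphabet)] for c in letters)


def cluster_hits(pairs):
    """Group passages linked by pairwise hits, using union-find.

    `pairs` is an iterable of (passage_id, passage_id) tuples, standing in
    for the overlapping regions an aligner such as BLAST would report.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = defaultdict(set)
    for node in list(parent):
        clusters[find(node)].add(node)
    return sorted((sorted(c) for c in clusters.values()), key=lambda c: c[0])


# Hypothetical passage identifiers: p1-p2 and p2-p3 overlap, p4-p5 overlap.
hits = [("p1", "p2"), ("p2", "p3"), ("p4", "p5")]
clusters = cluster_hits(hits)
```

Here `p1`, `p2` and `p3` end up in one cluster even though `p1` and `p3` were never matched directly, which is the sense in which a cluster gathers all found occurrences of one reuse.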
Text reuse database
Press as a network
Year 1877 (59 581 clusters)
Year 1897 (240 933 clusters)
Source: comhis.fi/clusters
Timescapes of text reuse
• In 1776, the Suomenkieliset Tieto-Sanomat
published an announcement by the Royal
Swedish Academy of Sciences on a prize
awarded to a peasant who seeded the largest
amount of root crops in one field (Cluster ID
6317820).
• Timescape of the cluster, copied 19 times
within the time span of 142 years.
Source: comhis.fi/clusters.
Social outreach & future steps
• Press releases during the project: Petri Paju’s findings on the role of newspapers in
promoting lichen food were published, for example, in Helsingin Sanomat 29 Aug 2019
• Blog project on everyday life in Turku 1917–1918, https://blogit.utu.fi/elamaaturussa/
• Future of our database: negotiation with the National Library of Finland to integrate
reuse clusters into digi.kansalliskirjasto.fi
• The project has boosted the study of digital history in Turku and fuelled interdisciplinary
collaborations, see digitalhistory.fi
• Future steps 1: text reuse across the Baltic Sea, bringing together Swedish and Finnish-Swedish corpora
• Future steps 2: intermedial relations
Publications
• Nivala A., Salmi H., Sarjala J., History and Virtual Topology: The Nineteenth-Century Press as Material Flow. Historein Vol. 17, No 2 (2018), DOI: http://dx.doi.org/10.12681/historein.14612
• Paju, P. Ensimmäiset naiset insinöörien ja arkkitehtien yhdistyksissä. Tekniikan Waiheita 36, 1/2018, 5–24. https://journal.fi/tekniikanwaiheita/article/view/82350
• Paju P., Jäkälän paluu: Jäkälävalistus ja tekstien uudelleenkäyttö historiallisen tutkimusteeman jäsentäjänä. Ennen ja nyt 2, 2019: http://www.ennenjanyt.net/2019/08/jakalan-paluu-jakalavalistus-ja-tekstien-uudelleenkaytto-historiallisen-tutkimusteeman-jasentajana/
• Paju P., Rantala H., Salmi H.: Digitaalinen historiantutkimus. Erikoisnumero. Ennen ja nyt 2, 2019, http://www.ennenjanyt.net/numerot/2019-2/
• Paju P., Rantala H., Salmi H., Tietokannoista tulkintoihin: digitaalisen historiantutkimuksen käytäntöjä. Ennen ja nyt 2, 2019, http://www.ennenjanyt.net/2019/08/tietokannoista-tulkintoihin-digitaalisen-historiantutkimuksen-kaytantoja/
• Rantala H., Salmi H., Vesanto A., Ginter F., Tekstien pitkä elämä: Ajassa liikkuvat tekstit suomalaisessa sanomalehdistössä 1771–1920. Ennen ja nyt 2, 2019: http://www.ennenjanyt.net/2019/08/tekstien-pitka-elama-ajassa-liikkuvat-tekstit-suomalaisessa-sanomalehdistossa-1771-1920/
• Rantala H., Nivala A., Salmi H., Paju P., Sippola R., Vesanto A. & Ginter F., Tekstien uudelleenkäyttö suomalaisessa sanoma- ja aikakauslehdistössä 1771–1920. Digitaalisten ihmistieteiden näkökulma. Historiallinen Aikakauskirja 1, 2019: 53–67.
• Salmi H., Nivala A., Rantala H., Sippola R., Vesanto A. & Ginter F., Återanvändningen av text i den finska tidningspressen 1771–1853. Historisk tidskrift för Finland 1, 2018: 46–76.
• Salmi H., Rantala H., Vesanto A. & Ginter F., The Long-Term Reuse of Text in the Finnish Press, 1771–1920. Proceedings of the 4th Digital Humanities in the Nordic Countries 2019. Copenhagen, Denmark 6–8 March 2019, http://ceur-ws.org/Vol-2364/36_paper.pdf
• Salmi H. Viraalisuus – kulttuurihistoriallinen näkökulma. Niin & näin 1, 2018: 71–79.
• Vesanto A., Nivala A., Rantala H., Salakoski T., Salmi H. & Ginter F., Applying BLAST to Text Reuse Detection in Finnish Newspapers and Journals, 1771–1910. In: Proceedings of the 21st Nordic Conference of Computational Linguistics. Gothenburg, Sweden, 23–24 May 2017 (Linköping 2017), http://www.ep.liu.se/ecp/133/010/ecp17133010.pdf
• Vesanto A., Nivala A., Salakoski T., Salmi H. & Ginter F.: A System for Identifying and Exploring Text Repetition in Large Historical Document Corpora. In: Proceedings of the 21st Nordic Conference of Computational Linguistics. Gothenburg, Sweden, 23–24 May 2017 (Linköping 2017), http://www.ep.liu.se/ecp/131/049/ecp17131049.pdf