Giovanni Colavizza, Frédéric Kaplan
On Mining Citations to Primary and Secondary Sources in Historiography
Motivation - the Scholar
Sciences: Google Scholar; English; low-cost information gathering
Humanities: no Google Scholar-like system; multiple languages; high-cost information gathering
Issues: lack of data [Sula and Miller, 2014] leads to an absence of services. Estimated coverage of Web of Science for the Humanities: circa 13% [Mingers and Leydesdorff, 2015].
Motivation - the Footnote
How do humanists cite? In footnotes [see e.g. Hellqvist, 2009]
Motivation - the Archive
Approximately half of the citations are to primary sources [Wiberley Jr., 2009]
Project: Linked Books
In the context of the Venice Time Machine
Partners:
• Ca’ Foscari Library System
• Biblioteca Marciana
• Istituto Veneto di Scienze, Lettere ed Arti
• Archivio di Stato di Venezia
• EPFL
Data acquisition
Corpus and annotation:
• 4 journals (covering 150 years)
• 2’000 monographs
• digitisation almost complete
• annotation ongoing (samples from 1 journal and 1’000 monographs done, approx. 10’000 annotated citations)
Goals
1. Data: a new citation corpus (History of Venice) and a pipeline for citation extraction from footnotes.
2. Analytical framework: development or adaptation of methods from bibliometrics to the humanities.
3. Services: a Google Scholar for the History of Venice, accounting for citations to both primary and secondary sources.
Pipeline
Text block detection
Citation extraction
Citation parsing
Text block detection
Method: SVM classifier. On what: text lines. Why? Citations are in footnotes, so keeping only footnote lines filters the input space for the later steps.
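The slides do not list the features used by the line classifier. As a minimal sketch, assuming a handful of hypothetical features (a leading footnote marker, a page abbreviation, digit density, vertical position on the page), each text line could be turned into a numeric vector like this. Such vectors could then be passed to any SVM implementation; the feature choice is an illustration, not the authors' feature set.

```python
import re

# Hypothetical features: none of them are taken from the slides.
FOOTNOTE_MARKER = re.compile(r"^\s*\(?\d{1,3}[\).]?\s")  # e.g. "12 " or "(3) "
PAGE_ABBREV = re.compile(r"\bpp?\.\s*\d")                # e.g. "p. 37", "pp. 3"


def line_features(line: str, line_index: int, lines_on_page: int) -> list[float]:
    """Turn one text line into a numeric vector for a line classifier."""
    stripped = line.strip()
    n_chars = len(stripped)
    n_digits = sum(ch.isdigit() for ch in stripped)
    return [
        1.0 if FOOTNOTE_MARKER.match(line) else 0.0,  # starts like a footnote
        1.0 if PAGE_ABBREV.search(stripped) else 0.0,  # mentions a page number
        n_digits / n_chars if n_chars else 0.0,        # digit density
        line_index / max(lines_on_page - 1, 1),        # 0.0 = top, 1.0 = bottom
    ]


# A footnote-like line at the bottom of a 10-line page versus a body line at the top.
footnote_vec = line_features("12 Ivi, p. 37.", 9, 10)
body_vec = line_features("The Venetian Senate met in 1509.", 0, 10)
```

Working at line level keeps the classifier independent of any layout analysis, which matches the slide's motivation of filtering the input before citation extraction.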
Text block detection
Main causes of errors:
- partial citations (e.g. “Ivi., p. 37. some text”): FALSE NEGATIVE (critical)
- in-text shortened citations: FALSE POSITIVE (not critical)
- footnotes without citations: FALSE NEGATIVE (not critical)
Next steps: finalise with extra features, layout detection on images.
Citation extraction
Method: CRF classifier. On what: words. Why? Citations need to be identified and separated from the surrounding text. Citations are classified as primary or secondary.
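A word-level sequence labeller typically emits one tag per word, which then has to be grouped into citation spans. The sketch below assumes a BIO tagging scheme with the hypothetical labels `B-PRIMARY`, `I-PRIMARY`, `B-SECONDARY`, `I-SECONDARY` and `O`; the slides do not state which scheme is used, and the example footnote is invented for illustration.

```python
def bio_to_spans(labels):
    """Group word-level BIO labels into (start, end, class) spans; end is exclusive."""
    spans = []
    start, current = None, None
    for i, label in enumerate(labels):
        prefix, _, cls = label.partition("-")
        continues = prefix == "I" and cls == current
        if not continues:
            if current is not None:
                spans.append((start, i, current))  # close the open span
                start, current = None, None
            if prefix in ("B", "I"):  # a stray I- tag opens a new span
                start, current = i, cls
    if current is not None:
        spans.append((start, len(labels), current))
    return spans


# Invented footnote, one label per word (labels are hypothetical).
words = ["Cfr.", "ASVe,", "Senato,", "reg.", "12;", "see",
         "Lane,", "Venice,", "p.", "37."]
labels = ["O", "B-PRIMARY", "I-PRIMARY", "I-PRIMARY", "I-PRIMARY", "O",
          "B-SECONDARY", "I-SECONDARY", "I-SECONDARY", "I-SECONDARY"]

spans = bio_to_spans(labels)
print(spans)  # [(1, 5, 'PRIMARY'), (6, 10, 'SECONDARY')]
```

Encoding the primary/secondary distinction in the label itself means a single model handles both boundary detection and classification, which is one plausible reading of the slide.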
Citation extraction
Main causes of error:
- Wrong boundaries (instance accuracy 0.78, critical)
- Wrong class (primary vs. secondary, not critical)
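The slides do not define "instance accuracy". One common reading, assumed here, is the fraction of annotated citations whose boundaries are reproduced exactly by the system. The sketch below represents each citation as a hypothetical `(start, end, class)` tuple of word offsets and can optionally require the class to match as well; the spans are made up for illustration.

```python
def instance_accuracy(gold_spans, predicted_spans, check_class=False):
    """Fraction of gold citation spans reproduced exactly by the predictions."""
    if not gold_spans:
        return 0.0
    # Compare full tuples, or only (start, end) when the class is ignored.
    key = (lambda s: s) if check_class else (lambda s: s[:2])
    predicted = {key(s) for s in predicted_spans}
    hits = sum(1 for s in gold_spans if key(s) in predicted)
    return hits / len(gold_spans)


# Invented spans: one boundary error and one class error.
gold = [(1, 5, "PRIMARY"), (6, 10, "SECONDARY"),
        (12, 15, "SECONDARY"), (20, 24, "PRIMARY")]
pred = [(1, 5, "SECONDARY"), (6, 10, "SECONDARY"),
        (12, 16, "SECONDARY"), (20, 24, "PRIMARY")]

boundary_only = instance_accuracy(gold, pred)                 # 0.75
with_class = instance_accuracy(gold, pred, check_class=True)  # 0.5
```

Scoring boundaries and classes separately mirrors the slide's distinction between the critical error (wrong boundaries) and the non-critical one (wrong class).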
Next steps: finalise with extra features, add rules.
Citation parsing
Method: CRF classifier. On what: words. Why? The elements of a citation need to be identified in order to identify the cited source.
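CRF toolkits commonly take one feature dictionary per word. As a minimal sketch, assuming hypothetical features (word shape plus the neighbouring words), the input for a citation parser might be built as follows. The feature names and the example citation are illustrative and do not come from the slides.

```python
def word_features(words, i):
    """Build a feature dictionary for the word at position i (hypothetical features)."""
    word = words[i]
    return {
        "lower": word.lower(),
        "is_capitalised": word[:1].isupper(),
        "has_digit": any(ch.isdigit() for ch in word),
        "ends_with_period": word.endswith("."),
        # Context features let the model use the order of citation elements.
        "prev_lower": words[i - 1].lower() if i > 0 else "<START>",
        "next_lower": words[i + 1].lower() if i < len(words) - 1 else "<END>",
    }


# Invented secondary-source citation, already isolated from its footnote.
citation = ["Lane,", "Venice,", "p.", "37."]
features = [word_features(citation, i) for i in range(len(citation))]
```

Each dictionary would be paired with an element label (for example author, title or page) during training; the label set itself is not given in the slides.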
Citation parsing
Main causes of error:
- Wrong classes (e.g. Editor mistaken for Author; not critical: reduce class space)
- Sensitive to under-represented classes (e.g. archival terminology; not critical)
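Reducing the class space can be as simple as merging easily confused labels before training. The sketch below assumes BIO-style labels with hypothetical class names (`EDITOR`, `AUTHOR`, `TITLE`); the actual label inventory is not given in the slides.

```python
# Hypothetical merge table: map each confusable class onto a broader one.
MERGED_LABELS = {"EDITOR": "AUTHOR"}


def simplify_labels(labels, merged=MERGED_LABELS):
    """Rewrite BIO labels so that merged classes share a single name."""
    simplified = []
    for label in labels:
        prefix, sep, cls = label.partition("-")
        # Labels without a class part (such as "O") are left untouched.
        simplified.append(prefix + sep + merged.get(cls, cls) if sep else label)
    return simplified


original = ["B-EDITOR", "I-EDITOR", "O", "B-TITLE"]
reduced = simplify_labels(original)
print(reduced)  # ['B-AUTHOR', 'I-AUTHOR', 'O', 'B-TITLE']
```

The trade-off is that the merged distinction (here, editor versus author) is no longer available downstream, which is acceptable only if source identification does not depend on it.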
Next steps: finalise with extra features, simplify class space, add rules, lookup features.
What’s next
Towards a pipeline for citation extraction:
1. Normalisation
2. Linkage
3. Database as linked data
4. Publication
Giovanni Colavizza, Frédéric Kaplan
Thank you