Natural Language Processing in Digital Humanities: application examples
Presentation at IXA, January 2016
Pablo Ruiz Fabo — LATTICE Lab
Summary
• Digital Humanities’ needs in terms of text analysis tools
• Two examples of pipelines to address these needs:
  – Entity Linking and the PoliInformatics corpus
  – Proposition Extraction and the Earth Negotiations Bulletin corpus
• Implementation choices, demo, evaluation
Digital Humanities (DH)
• Application of computational methods to questions in fields like the Humanities or Social Sciences.
  – A goal is to allow asking questions and attaining findings that would be impossible without computational means (Berry, 2012)
  – Critical reflection: “as much of a focus on what the computational techniques obscure as reveal” (Meeks and Weingart, 2012)
DH and Text: Topic Models
• Very popular tool in DH: LDA (MALLET) (discussion in Meeks and Weingart, 2012)
[Figure: Europarl corpus topics for all speeches by the French Green Party (Les Verts), 1999–2004 session]
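For orientation, a minimal topic-modeling sketch. It uses gensim’s LDA rather than MALLET (which the slide names), and the toy `texts` variable is a placeholder for a tokenized, stopword-filtered corpus:

```python
from gensim import corpora, models

# Placeholder: each document as a list of tokens (stopwords already removed).
texts = [["climate", "energy", "emissions"],
         ["agriculture", "policy", "emissions"]]

dictionary = corpora.Dictionary(texts)
bow_corpus = [dictionary.doc2bow(doc) for doc in texts]

# Train an LDA model; num_topics is corpus-dependent.
lda = models.LdaModel(bow_corpus, id2word=dictionary, num_topics=2, passes=10)

for topic_id, words in lda.show_topics(num_topics=2, num_words=5, formatted=False):
    print(topic_id, [word for word, _ in words])
```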
DH and Text …
• But there are many other Text Analysis / Language Technology / Natural Language Processing applications potentially useful for DH questions …
DH methods: Network analysis
• Transforming a collection of texts into a network (a graph)
• Network’s nodes:
  – Actors in the corpus: People, Institutions …
  – Concepts in the corpus
• Relevant NLP tasks: Named Entity Recognition / Disambiguation; Entity and Concept Linking
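As an illustration of the text-to-network idea, a minimal sketch: given per-document lists of linked entities (the output of an EL tool, here a hand-written placeholder), build a co-occurrence graph with networkx:

```python
import itertools
import networkx as nx

# Placeholder for EL output: one list of entities per document.
docs_entities = [
    ["European_Union", "China", "Kyoto_Protocol"],
    ["China", "Kyoto_Protocol", "New_Zealand"],
]

G = nx.Graph()
for entities in docs_entities:
    # One edge per pair of entities co-occurring in the same document.
    for e1, e2 in itertools.combinations(sorted(set(entities)), 2):
        if G.has_edge(e1, e2):
            G[e1][e2]["weight"] += 1
        else:
            G.add_edge(e1, e2, weight=1)

print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges")
```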
Text => Network via EL: User Needs
Venturini et al. (2012), Once Upon a Text
[médialab at Sciences Po, Paris]
The careful use of natural language processing algorithms could provide better filtering metrics and support in expression merging
Manual filtering is crucial because it allows entities to be reduced to a size appropriate for analysis, but also because it recovers important entities that the automatic filtering could have excluded.
Though AlchemyAPI offers a trustworthy
service, we don’t like relying on it. In
particular, we don’t like that the service is
offered as a “black box” and that the exact
extraction algorithm is secret.
Venturini and Guido (2012). Once upon a text:
An Actor-Network Theory tale in text analytics.
Text => Network via EL: User needs
https://github.com/medialab/ANTA
Entity Linking in DH: User Needs
• A variety of corpora needs to be treated
• The EL literature shows that tools’ performance varies with the corpus:
TOOL             AIDA/CoNLL (news, sports)    IITB (web, various topics)
                 P      R      F1             P      R      F1
Spotlight        31.2   40.4   35.2           46.2   50.0   48.0
TagMe            61.4   55.5   58.3           45.2   42.0   43.6
WikipediaMiner   46.9   52.8   49.7           56.8   48.2   43.6
AIDA             63.3   29.1   39.8           65.7    4.1    7.6
Data from Cornolti et al. (2013). Metric: Weak Annotation Match.
• Correlations between the number of occurrences of a textual feature and tool performance (Usbeck et al., 2015)
CORRELATIONS     Nbr. PER   Nbr. ORG   Nbr. LOC   Nbr. OTHER
Babelfy           0.769     -0.376      0.254     -0.431
Spotlight         0.217     -0.480     -0.461      0.26
TagMe             0.257     -0.272     -0.194      0.036
WikipediaMiner    0.082     -0.679     -0.632      0.497
Data from 20 Nov 2015, GERBIL platform (gerbil.aksw.org/gerbil/overview), A2KB/Ma task.
Entity Linking approach given user needs
NEED: Avoiding “black boxes”
APPROACH: Open-source tools
NEED: Treating a variety of corpora, knowing that tools’ performance varies with the corpora
APPROACH: Combining tools to get complementary results
NEED: Manual filtering of entities, with information to guide the filtering
APPROACH: Providing annotation quality metrics to users; simultaneous access to metrics and text to validate annotations; optional automatic annotation selection
Open Source Tool Combination
• Open-source tools which link to generic ontologies (DBpedia, YAGO, BabelNet)
[P.S.: OK, maybe Babelfy is not exactly open source … we could perhaps use AGDISTIS: https://github.com/AKSW/AGDISTIS]
Metrics to guide filtering?
• Confidence scores
• Coherence scores
EL: Annotation confidence
Example (the figure showed per-annotation confidence scores):
SOCCER – JAPAN GET LUCKY WIN, CHINA IN SURPRISE DEFEAT
EL: Annotation coherence
• Wikipedia Link-Based Measure: relies on common inbound links
  – Milne & Witten (2008) [original proposal]
  – Ferragina et al. (2010) [optimizations]
  – Hoffart et al. (2011) [among other measures]
  – Moro et al. (2014) [other measures]
Milne-Witten coherence between entities e1 and e2 (as in Hoffart et al. 2011)
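The slide displayed the measure as a figure; reconstructing it here from Milne & Witten (2008) in the inbound-link notation of Hoffart et al. (2011), where $L_{e}$ is the set of Wikipedia pages linking to entity $e$ and $E$ is the set of all entities:

$$\mathrm{coh}(e_1, e_2) \;=\; 1 \;-\; \frac{\log\bigl(\max(|L_{e_1}|, |L_{e_2}|)\bigr) \;-\; \log\,|L_{e_1} \cap L_{e_2}|}{\log |E| \;-\; \log\bigl(\min(|L_{e_1}|, |L_{e_2}|)\bigr)}$$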
Entity Linking Demo
Demo: PoliInformatics Corpus
• NLP Unshared Task in PoliInformatics 2014
  – “Who was the financial crisis?”
    • Participants: individuals, industries, …
  – “What was the financial crisis?”
    • Causes, proposals for reform, …
• Heterogeneous corpus containing:
  – Congress hearings transcripts
  – Official reports on the crisis by Congress
  – Bills
  – Etc.
EL Demo: Corpus
Congress hearings: interviews with witnesses
EL Demo: Corpus
Official report by Congress on the causes of the crisis
Entity Linking Demo
Description: Ruiz, Poibeau & Mélanie (demo at NAACL-HLT 2015).
Evaluation
• … as an NLP-related system
• … in terms of Digital Humanities
Evaluation as an NLP system
• Does a selection among the annotations provided by several systems outperform each of those systems’ annotations taken individually?
• Combination method: ROVER, where each system is weighted by its precision on a test corpus (sketched below)
  – Fiscus (1997), for ASR
  – De la Clergerie et al. (2008), for parsing
  – Ruiz and Poibeau (2015), *SEM poster
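A minimal sketch of the weighted-voting idea (not the authors’ exact implementation): each system’s annotations vote with a weight set to that system’s precision on a held-out corpus, and an annotation is kept when its accumulated weight clears a threshold. Span alignment across systems (weak vs. strong match) is glossed over, and annotations are treated as hashable (span, entity) pairs; the weights below are made up:

```python
from collections import defaultdict

def rover_select(annotations_by_system, weights, threshold):
    """annotations_by_system: {system: set of (span, entity) pairs};
    weights: {system: precision on a test corpus}.
    Returns the annotations whose summed vote weight reaches the threshold."""
    votes = defaultdict(float)
    for system, annotations in annotations_by_system.items():
        for annotation in annotations:
            votes[annotation] += weights[system]
    return {ann for ann, weight in votes.items() if weight >= threshold}

# Hypothetical usage with made-up weights and outputs:
weights = {"tagme": 0.61, "spotlight": 0.31, "aida": 0.63}
systems = {
    "tagme": {(("Japan", 7, 12), "Japan_national_football_team")},
    "spotlight": {(("Japan", 7, 12), "Japan")},
    "aida": {(("Japan", 7, 12), "Japan_national_football_team")},
}
print(rover_select(systems, weights, threshold=1.0))
```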
ROVER: system weights
• Assume two application scenarios:
  – User wants entities only
  – User wants entities and concepts
• For entities only:
  – Weight according to precision on the AIDA/CoNLL-B corpus (no concepts in the gold standard)
• For entities and concepts:
  – Weight according to precision on the IITB corpus (many concepts in the gold standard)
Corpora: Cornolti et al. (2013), BAT Framework: https://github.com/marcocor/bat-framework/tree/master/benchmark/datasets
ROVER: testing
• Entities only was tested on the MSNBC corpus (reference set has no concepts)
• Entities and concepts was tested on the AQUAINT corpus (reference set has concepts)
Evaluation as an NLP system
[Results figures. Metric: micro-averaged Strong Annotation Match; t = tool’s optimal confidence threshold; * p < 0.05 (random permutation test)]
Evaluation for DH
• Possible researcher objectives:
  – Finding evidence about a research question
  – Doing that faster or with less manual work
  – Obtaining networks that confirm (or challenge) previously available knowledge
  – Obtaining quantitative evidence where only qualitative evidence was available
• Do the following elements help?
  – The UI and the workflows it allows
  – The confidence and coherence scores
  – The automatic annotation selection
Proposition Extraction
Prop Ext: Corpus
• Daily reports on international climate conferences (Conference of the Parties, COP), like COP-21, which took place in December 2015.
• A summary of participant countries’ proposals.
Pipeline Objectives
• Identify different countries’ participations
• Identify negotiation points supported or opposed by participants
• Help researchers compare countries’ positions via keyphrase and entity extraction on the negotiation points
• Provide more detailed analysis than prior work on this corpus, which was based on word co-occurrence methods (Venturini et al., 2014)
Typical corpus sentence
The EU, with NEW ZEALAND and opposed by CHINA, MALAYSIA and BHUTAN, supported including the promotion of natural regeneration within the definitions of "afforestation" and "reforestation."
Elements of that sentence (highlighted on the slides):
• Actors (countries): the EU, NEW ZEALAND, CHINA, MALAYSIA, BHUTAN
• Messages (negotiation points): "including the promotion of natural regeneration within the definitions of 'afforestation' and 'reforestation.'"
• Predicates (support/opposition): "opposed by", "supported"
Actor + Predicate + Message = Proposition
Propositions
    ACTORS           VERBAL PREDICATES   MESSAGES
 1  European_Union   supported           (shared message, below)
 2  New_Zealand      supported
 3  China            ~supported
 4  Malaysia         ~supported
 5  Bhutan           ~supported
Shared message: including the promotion of natural regeneration within the definitions of "afforestation" and "reforestation."
    ACTORS                NOMINAL PREDICATES   MESSAGES
 1  Group_of_77 / China   proposal             to include research and development in the transport and energy sectors in the priority areas to be financed by the SCCF.
Sources of proposition info?
• Open Relation Extraction
  – OLLIE (Mausam et al., 2012, EMNLP)
    • https://github.com/knowitall/ollie
  – Open Information Extraction 4.0
    • https://github.com/knowitall/openie
• Traditional sources
  – Syntactic dependency parsing
  – Semantic Role Labeling (CoNLL 2008/2009 shared tasks for both)
  – IXA Pipes wrapper for MATE-tools
Can we do this with patterns?
[ The EU, with NEW ZEALAND ] and [ opposed by CHINA, MALAYSIA and BHUTAN ], [ supported including the promotion of natural regeneration within the definitions of "afforestation" and "reforestation." ]
Cf. Salway et al. (2014, ACL), a grammar induction approach exploiting the ADIOS algorithm from Solan et al. (2005, PNAS)
Can we do this with patterns?
• Maybe, but …
  – What about anaphora resolution?
  – What about negation?
• An NLP pipeline deals with these phenomena in a uniform way (unlike “linguistically-agnostic” patterns)
• IXA Pipes provides SRL info and coreference chains (and syntactic dependencies if needed)
Using SRL info
• We have propositions (events) involving:
  – Speakers: generally in the predicate’s A0 role
    • A list of countries and other actors was created manually from specialized sources (the UNFCCC site)
  – Reporting verbs / reporting-related nouns
    • A list was created based on VerbNet and NomBank (using the NLTK interface)
  – Messages communicated by the speakers: generally the predicate’s A1 role (complemented with adjunct roles …)
• What about “opposed by …”? (treated below)
Example: The EU, with NEW ZEALAND and opposed by CHINA, MALAYSIA and BHUTAN, supported including the promotion of natural regeneration within the definitions of "afforestation" and "reforestation."
Using SRL Info: Generic rule (sketched below)
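A minimal sketch of such a rule, under assumptions about the data structures: SRL output is reduced to frames with a predicate lemma and role fillers, `COUNTRIES` stands for the manually built actor list, and `REPORTING_PREDICATES` for the VerbNet/NomBank-based list. Neither the frame format nor these names come from the actual pipeline:

```python
COUNTRIES = {"EU", "New Zealand", "China", "Malaysia", "Bhutan"}
REPORTING_PREDICATES = {"support", "oppose", "propose", "state"}

def extract_propositions(frames):
    """frames: list of dicts like
    {'lemma': 'support', 'A0': ['The', 'EU'], 'A1': ['including', ...]}.
    Yields (actor, predicate, message) triples for reporting predicates."""
    for frame in frames:
        if frame["lemma"] not in REPORTING_PREDICATES:
            continue
        a0_text = " ".join(frame.get("A0", []))
        message = " ".join(frame.get("A1", []))  # plus adjunct roles, omitted here
        for actor in COUNTRIES:
            if actor in a0_text:
                yield (actor, frame["lemma"], message)
```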
“Opposed by” spans
• A rule checks whether any role contains a span introduced by “opposed by”. If so:
  – The main verb, and the negotiation point related to it, are found
  – A proposition is created with:
    • Each actor in the “opposed by” span
    • A negated form of the main verb
    • The main verb’s message
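Continuing the sketch above, a hedged version of the “opposed by” rule; the frame format and helper names are the same assumptions as before, not the actual code:

```python
COUNTRIES = {"EU", "New Zealand", "China", "Malaysia", "Bhutan"}  # as above

def opposed_by_propositions(frame):
    """For each role filler containing an 'opposed by' span, emit one
    proposition per actor found in that span, with the main predicate
    negated (written here with the slides' '~' convention)."""
    message = " ".join(frame.get("A1", []))
    for role, tokens in frame.items():
        if role == "lemma":  # 'lemma' maps to a string, not a token list
            continue
        text = " ".join(tokens)
        if "opposed by" in text.lower():
            span = text.lower().split("opposed by", 1)[1]
            for actor in COUNTRIES:
                if actor.lower() in span:
                    yield (actor, "~" + frame["lemma"], message)
```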
Treating Pronominal anaphora
• IXA Pipes’ CorefGraph coreference chains
• A small subset of possible cases is treated
• Note: he/she can refer to a country (according to the country representative’s gender)
• Rules:
  – Pronoun antecedents are only searched for in the same sentence or the preceding one
  – The actor in the main verb’s subject (from dependency parsing) is the antecedent of a sentence-initial he/she in the following sentence
• Evaluation: accurate. But coverage … ?
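A sketch of the second rule under assumed data structures (one dict per sentence, with the main verb’s subject precomputed from the dependency parse; none of these names come from the actual code):

```python
ACTORS = {"EU", "New Zealand", "China", "Malaysia", "Bhutan"}  # plus other actors

def resolve_sentence_initial_pronoun(sentences, i):
    """If sentence i starts with 'He'/'She', return as antecedent the actor
    that is the main verb's subject in the preceding sentence, if any."""
    tokens = sentences[i]["tokens"]
    if i > 0 and tokens and tokens[0].lower() in {"he", "she"}:
        subject = sentences[i - 1].get("main_verb_subject")
        if subject in ACTORS:
            return subject
    return None
```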
Treating negation
• AM-NEG roles from SRL
• Surface cues: negative items in a small window preceding a predicate
  – not, no, lack of, …
• Problems …
  – “There was no lack of acceptance by …”
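A sketch of the surface-cue check, including the failure mode the slide points out; the cue list and window size are illustrative, not the pipeline’s actual values:

```python
NEGATIVE_CUES = ("not", "no", "lack of", "failed to")

def negated_by_surface_cue(tokens, predicate_index, window=4):
    """Look for a negative cue in a small window before the predicate.
    Naive on purpose: 'There was no lack of acceptance by ...' has two
    cues that cancel out, which a single-cue check gets wrong."""
    start = max(0, predicate_index - window)
    context = " ".join(tokens[start:predicate_index]).lower()
    return any(cue in context for cue in NEGATIVE_CUES)
```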
Overall goal
• Help researchers compare countries’ positions, via keyphrases and linked entities in the messages supported or opposed by countries
• Who agrees with whom?
Proposition Extraction Demo
Keyphrase Extraction
• YaTeA (Aubin and Hamon, 2006); a rough stand-in sketch follows below
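YaTeA itself is a Perl term extractor; as a hypothetical stand-in, a POS-pattern chunker with NLTK shows the general idea of pattern-based keyphrase candidates:

```python
import nltk  # assumes nltk data ('punkt', 'averaged_perceptron_tagger') is installed

# Crude term pattern: optional adjectives followed by one or more nouns.
GRAMMAR = "TERM: {<JJ>*<NN.*>+}"

def candidate_keyphrases(sentence):
    tokens = nltk.word_tokenize(sentence)
    tagged = nltk.pos_tag(tokens)
    tree = nltk.RegexpParser(GRAMMAR).parse(tagged)
    return [" ".join(word for word, tag in subtree.leaves())
            for subtree in tree.subtrees() if subtree.label() == "TERM"]

print(candidate_keyphrases("The EU supported the promotion of natural regeneration."))
```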
Evaluation as an NLP system
• 311 propositions from the ENB corpus: F1 = .69
• 631 propositions from the IPCC corpus (negotiations on the creation of official scientific reports): F1 = .72
• Metric: exact match of all three proposition elements
[Figure: ENB results]
Evaluation for DH
• Is this pipeline (and UI) helping researchers analyze climate negotiations?
  – Comparing actors’ positions
  – Looking for answers to research questions
  – Drawing attention to overlooked evidence
  – Confirming one view on a controversial point
  – …
Future work
• More implementation
• Actual domain-expert evaluation for DH purposes
• Assessing the contribution of the work based on domain-expert evaluation
Other technologies
• Discourse analysis (e.g. Xue et al., 2015, CoNLL shared task)
  – Testing the Lin et al. (2014) PDTB parser on the subcorpora of the IPCC corpus:
    • Assessment Report
    • Summary for Policymakers
    • Technical Summary
• Semantic Textual Similarity (Agirre et al., 2012 onwards, SemEval)
  – Testing the TakeLab system (from SemEval 2012)
  – Comparing to “non-semantic” similarity (see the sketch after this list)
  – Perhaps Interpretable STS (Agirre et al., SemEval 2015 pilot)
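For the “non-semantic” similarity baseline, a minimal sketch with scikit-learn; the slide does not specify the baseline, and TF-IDF cosine is one common choice:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def surface_similarity(sentence_a, sentence_b):
    """Cosine over TF-IDF vectors: a purely lexical similarity to compare
    against STS systems such as TakeLab."""
    vectors = TfidfVectorizer().fit_transform([sentence_a, sentence_b])
    return cosine_similarity(vectors[0], vectors[1])[0, 0]

print(surface_similarity("Countries supported the proposal.",
                         "The proposal was opposed by several parties."))
```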
Summary
• Proposing to researchers in the Humanities or Social Sciences:
  – Language technologies that help move beyond word co-occurrence methods
• Making researchers in other domains familiar with a broader variety of language technologies
• Evaluating the impact of the tools on those researchers’ activities
Some references (1)
Sophie Aubin and Thierry Hamon (2006). Improving Term Extraction with Terminological Resources. In Advances in Natural Language Processing: 5th International Conference on NLP, FinTAL 2006, pp. 380–387. LNAI 4139. Springer.
David Berry (2012). Understanding Digital Humanities. Palgrave.
Marco Cornolti, Paolo Ferragina, and Massimiliano Ciaramita (2013). A framework for benchmarking entity-annotation systems. In Proc. of WWW, 249–260.
Éric V. De La Clergerie, Olivier Hamon, Djamel Mostefa, Christelle Ayache, Patrick Paroubek, and Anne Vilnat (2008). PASSAGE: from French parser evaluation to large sized treebank. In Proc. of LREC 2008.
Paolo Ferragina and Ugo Scaiella (2010). TAGME: on-the-fly annotation of short text fragments (by Wikipedia entities). In Proc. of CIKM ’10, 1625–1628.
Jonathan G. Fiscus (1997). A post-processing system to yield reduced word error rates: Recognizer Output Voting Error Reduction (ROVER). In Proc. of the IEEE ASRU Workshop, 347–354.
Johannes Hoffart, Mohamed Amir Yosef, Ilaria Bordino, Hagen Fürstenau, Manfred Pinkal, Marc Spaniol, Bilyana Taneva, Stefan Thater, and Gerhard Weikum (2011). Robust disambiguation of named entities in text. In Proc. of EMNLP, 782–792.
Heng Ji, Joel Nothman, and Ben Hachey (2014). Overview of TAC-KBP2014 Entity Discovery and Linking Tasks. In Proc. of the Text Analysis Conference.
Ziheng Lin, Hwee Tou Ng, and Min-Yen Kan (2014). A PDTB-Styled End-to-End Discourse Parser. Natural Language Engineering 20(2), 151–184.
Mausam, Michael Schmitz, Robert Bart, Stephen Soderland, and Oren Etzioni (2012). Open Language Learning for Information Extraction. In Proc. of EMNLP-CoNLL, 523–534.
Elijah Meeks and Scott B. Weingart (2012). The Digital Humanities Contribution to Topic Modeling. Journal of Digital Humanities 2(1).
Pablo N. Mendes, Max Jakob, Andrés García-Silva, and Christian Bizer (2011). DBpedia Spotlight: shedding light on the web of documents. In Proc. of the 7th Int. Conf. on Semantic Systems (I-SEMANTICS ’11), 1–8.
David Milne and Ian H. Witten (2008). An effective, low-cost measure of semantic relatedness obtained from Wikipedia links. In Proc. of the AAAI Workshop on Wikipedia and Artificial Intelligence: an Evolving Synergy, 25–30.
Andrea Moro, Alessandro Raganato, and Roberto Navigli (2014). Entity Linking meets Word Sense Disambiguation: A Unified Approach. Transactions of the ACL 2, 231–244.
Thierry Poibeau, Horacio Saggion, Jakub Piskorski, and Roman Yangarber (eds.) (2012). Multi-source, Multilingual Information Extraction and Summarization. Springer Science & Business Media.
P. Ruiz, T. Poibeau, and F. Mélanie (2015). Entity Linking with corpus coherence combining open source annotators. In Proc. of NAACL-HLT (Demonstrations).
P. Ruiz and T. Poibeau (2015). Combining Open Source Annotators for Entity Linking through Weighted Voting. In Proc. of *SEM. Denver, USA.
Some references (2)
Satoshi Sekine, Kiyoshi Sudo, and Chikashi Nobata (2002). Extended Named Entity Hierarchy. In Proc. of LREC.
Erik F. Tjong Kim Sang and Fien De Meulder (2003). Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition. In Proc. of CoNLL (ACL).
Ricardo Usbeck et al. (2015). GERBIL: General Entity Annotator Benchmarking Framework. In Proc. of WWW.
Tommaso Venturini and Daniele Guido (2012). Once Upon a Text: An ANT Tale in Text Analysis. Sociologica 6(3). [Note: ANT = Actor-Network Theory]
T. Venturini, N. Baya Laffite, J.-P. Cointet, I. Gray, V. Zabban, and K. De Pryck (2014). Three Maps and Three Misunderstandings: A Digital Mapping of Climate Diplomacy. Big Data & Society 1(2).
… plus several references to work by IXA
Thank you!
[email protected] http://www.lattice.cnrs.fr/Pablo-Ruiz-Fabo,541
Supplemental Slides
Co-occurrence methods: Wordfish
Wikipedia-Link-Based Relatedness
Milne, David, and Ian H. Witten (2008). “An Effective, Low-Cost Measure of Semantic Relatedness Obtained from Wikipedia Links.”
Coherence: other examples
“Thomas and Mario are strikers playing in Munich” (Moro and Navigli, 2014)