Linguistic Processing in Lattice-Based
Taxonomy Construction
Anastasia Novokreshchenova, Maria Shabanova, Dmitry Zaytsev and Nina Belyaeva
State University Higher School of Economics, Moscow
School of Applied Mathematics and Computer Science
CLA 2010, Seville, Spain, October 19-21, 2010
Outline
• Motivation in Social Studies and the Data
• Building a lattice-based taxonomy over a text corpus
• Natural language processing techniques for automatic attribute acquisition
  – Keyword extraction
  – Probabilistic latent modeling of text
  – Named entity recognition
Motivation
• Represent the structure of a given domain in the form of a lattice-based taxonomy
  – Interdisciplinary research project “Discrete mathematical models for political analysis of democratic institutions and human rights”
  – Speeches of Western leaders and international organizations
  – The context in which Russia is addressed
  – The role and importance of the democracy and human rights agenda
• Construct a context from the text corpora
  – Extract the set of attributes from texts for describing the documents
  – Analyze and develop natural language processing methods
CONSTRUCTING LATTICE-BASED TAXONOMY OVER A TEXT CORPUS
• Preliminary text processing
• Attribute extraction for describing the documents
• Building and pruning the lattice
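The lattice-building step can be illustrated with a minimal sketch. It assumes a formal context given as a mapping from documents to attribute sets, enumerates formal concepts by closing every attribute subset (feasible only for toy data), and prunes by a minimum number of documents per concept. The slides do not state which pruning criterion was used, so the support threshold and all function names here are assumptions made for illustration.

```python
from itertools import combinations


def derive_extent(context, attrs):
    """Documents that possess every attribute in attrs."""
    return frozenset(d for d, doc_attrs in context.items() if attrs <= doc_attrs)


def derive_intent(context, docs, all_attrs):
    """Attributes shared by every document in docs."""
    intent = set(all_attrs)
    for d in docs:
        intent &= context[d]
    return frozenset(intent)


def formal_concepts(context):
    """Enumerate (extent, intent) pairs by closing each attribute subset.

    Exponential in the number of attributes: a sketch for toy data only.
    """
    all_attrs = frozenset().union(*context.values())
    concepts = set()
    for size in range(len(all_attrs) + 1):
        for subset in combinations(sorted(all_attrs), size):
            extent = derive_extent(context, frozenset(subset))
            intent = derive_intent(context, extent, all_attrs)
            concepts.add((extent, intent))
    return concepts


def prune_by_support(concepts, min_docs):
    """Keep only concepts whose extent holds at least min_docs documents (assumed criterion)."""
    return {c for c in concepts if len(c[0]) >= min_docs}


# Hypothetical toy context: three documents described by stemmed terms.
context = {
    "Doc1": frozenset({"russia", "secur"}),
    "Doc2": frozenset({"russia", "europ"}),
    "Doc3": frozenset({"russia", "secur", "europ"}),
}
concepts = formal_concepts(context)
pruned = prune_by_support(concepts, min_docs=2)
for extent, intent in sorted(pruned, key=lambda c: -len(c[0])):
    print(sorted(extent), sorted(intent))
```

Ordering the remaining concepts by inclusion of their extents gives the taxonomy; dedicated algorithms such as NextClosure would replace the brute-force enumeration on real corpora.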
THREE KINDS OF TAXONOMIES
Three kinds of taxonomies, depending on the attribute type:
• frequent words
• latent topics
• named entities
BUILDING A TAXONOMY WITH FREQUENT WORDS
• eliminating stop-words
• stemming: collapsing all morphological variants of a term to a single root form
• describing each document with its N most frequent terms
• building and pruning the lattice
        t1    …    tn
Doc1    X     …    -
Doc2    -     …    X
…
DocT    X     …    X
$$tf_{ij} = \frac{n_{ij}}{\sum_{k} n_{ik}}$$
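The frequent-word steps can be sketched as follows. This is an illustrative sketch rather than the authors' implementation: the stop-word list is a tiny placeholder, `naive_stem` is a crude stand-in for a real stemmer, and relative term frequency is taken as a term's count divided by the total count of all remaining terms in the document. All names are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (assumption; a real list is much longer).
STOP_WORDS = {"the", "of", "and", "with", "a", "to", "in", "is", "on"}


def naive_stem(word):
    """Crude suffix stripping, standing in for a real stemming algorithm."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def term_frequencies(text):
    """Relative frequency of each stemmed, non-stop-word term in one document."""
    words = re.findall(r"[a-z]+", text.lower())
    tokens = [naive_stem(w) for w in words if w not in STOP_WORDS]
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}


def top_terms(text, n):
    """The n most frequent terms; ties are broken alphabetically."""
    tf = term_frequencies(text)
    ranked = sorted(tf, key=lambda t: (-tf[t], t))
    return set(ranked[:n])


def build_context(docs, n):
    """Binary context: each document is described by its n most frequent terms."""
    return {name: top_terms(text, n) for name, text in docs.items()}


# Hypothetical toy corpus.
docs = {
    "Doc1": "Security and energy security with Russia",
    "Doc2": "Russia and Europe partners",
}
print(build_context(docs, n=2))
```

Each document's term set corresponds to one row of crosses in the document-by-term table.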
31 FORMAL CONCEPTS OF THE LATTICE BASED ON FREQUENT WORDS
Figures in squares show the number of documents in each concept
ACCORDING TO THE WORD-FREQUENCY TAXONOMY:
• security issues and Russia's relationships with Europe are the most discussed topics, along with some global problems
• democracy and human rights are not included in the presented taxonomy due to pruning
  ◦ the words "democracy", "human" and "right" appear in the concepts that include speeches by Barack Obama and Hillary Clinton
Probabilistic latent semantic analysis (pLSA)
• P( z ) – the distribution over topics z in a particular document
• P( w | z ) – the probability distribution over words w given topic z
• T is the number of topics
$$P(w_i) = \sum_{j=1}^{T} P(w_i \mid z_i = j)\, P(z_i = j)$$
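A small numeric sketch of the mixture: the probability of a word is the sum, over all T topics, of the word's probability under the topic times the topic's weight in the document. The two topics and all probabilities below are made-up illustrative values, not results from the corpus.

```python
def word_probability(word, p_w_given_z, p_z):
    """P(w) = sum over topics j of P(w | z = j) * P(z = j)."""
    return sum(
        p_w_given_z[j].get(word, 0.0) * p_z[j]
        for j in range(len(p_z))
    )


# Hypothetical model with T = 2 topics.
p_z = [0.75, 0.25]  # P(z = j) for one document
p_w_given_z = [
    {"russia": 0.5, "energy": 0.5},      # P(w | z = 0)
    {"russia": 0.25, "democracy": 0.75},  # P(w | z = 1)
]

for w in ("russia", "energy", "democracy"):
    print(w, word_probability(w, p_w_given_z, p_z))
```

For "russia" this gives 0.5 · 0.75 + 0.25 · 0.25 = 0.4375, and the three word probabilities sum to 1.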
BUILDING A TAXONOMY WITH LATENT TOPICS
• probabilistic modeling of text: documents are represented as random mixtures over latent topics
• each topic is characterized by a distribution over words
• 20 topics were derived from the 26 documents
• the 20 topics were used as attributes for describing the documents
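The slides do not say how topic mixtures were turned into binary attributes. One plausible option, shown here purely as an assumption, is to assign a topic to a document when its weight in that document reaches a threshold. The threshold value, the function name and the weights are all hypothetical.

```python
def topics_as_attributes(doc_topic, threshold):
    """Map each document to the set of topic indices whose weight reaches the threshold."""
    return {
        doc: {j for j, weight in enumerate(weights) if weight >= threshold}
        for doc, weights in doc_topic.items()
    }


# Hypothetical per-document topic weights for T = 3 topics.
doc_topic = {
    "Doc1": [0.70, 0.25, 0.05],
    "Doc2": [0.10, 0.10, 0.80],
}
context = topics_as_attributes(doc_topic, threshold=0.2)
print(context)
```

The resulting document-to-topic sets form a binary context of the same shape as the one built from frequent words, so the same lattice construction applies.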
6 OF THE 20 TOPICS DERIVED FROM THE DOCUMENTS: WORD DISTRIBUTIONS OVER TOPICS
Topics (columns):
(1) Economics and financial crisis
(2) Democracy and human rights
(3) Future of the US and weapon issues
(4) France and ecological problems
(5) Russian-Georgian conflict
(6) Russia and energy issues

(1)          (2)          (3)          (4)          (5)          (6)
crisi        right        nation       franc        georgia      russia
presid       human        unit         summit       russian      russian
finance      govern       nuclear      responc      intern       interest
econom       peopl        america      final        georgian     energy
system       democraci    american     french       territori    medvedev
govern       work         interest     preapar      south        issu
reform       women        futur        longer       order        rule
propos       democrat     weapon       lead         process      trust
time         protect      alli         choic        ethnic       dialog
market       principl     centuri      environment  feder        area
subject      societi      war          african      direct       agreement
unit         account      common       debat        address      partnership
bank         univers      year         renew        ossetia      trade
septemb      commun       prosper      organ        plan         law
reason       leader       forward      africa       sepatatism   intern
euro         Life         partnership  collect      august       neighbor
war          clinton      great        contribut    absolut      common
promot       independ     goal         ambiti       bomb         gas
ACCORDING TO THE LATENT-TOPIC TAXONOMY
• The most topical subjects are those connected with:
  ◦ the European Union
  ◦ global problems
  ◦ security issues
  ◦ energy resources
  ◦ the Russian-Georgian conflict
  ◦ possible ways of solving conflicts and problems
• The topic of democracy and human rights is not included in the presented taxonomy due to pruning
  ◦ the concept with this topic includes speeches by Barack Obama and Nicolas Sarkozy
BUILDING A TAXONOMY WITH NAMED ENTITIES
• 38 paragraphs dealing solely with issues concerning Russia were extracted from the 26 documents
• three types of named entities were used to describe the documents
  ◦ names of persons
  ◦ organizations
  ◦ geographical objects
CONCLUDING REMARKS
• several techniques have been proposed to build a context over a text corpus
• frequent words made it possible to identify which questions foreign leaders raise most frequently regarding Russia
• latent topic modeling made it possible to specify and describe these issues more thoroughly
• named entities would be more informative when used in the context of latent topics
• the corpus of texts should be expanded