studying the history of ideas using topic models
DESCRIPTION
Studying the History of Ideas Using Topic Models. D. Hall, D. Jurafsky, & C. D. Manning, Stanford University. EMNLP 2008. Agenda: Introduction; Methodology; Historical trends in computational linguistics; Is computational linguistics becoming more applied?
![Page 1: Studying the History of Ideas Using Topic Models](https://reader035.vdocuments.net/reader035/viewer/2022062305/5681668c550346895dda5394/html5/thumbnails/1.jpg)
Studying the History of Ideas Using Topic Models
D. Hall, D. Jurafsky, & C. D. Manning
Stanford University
EMNLP 2008
![Page 2: Studying the History of Ideas Using Topic Models](https://reader035.vdocuments.net/reader035/viewer/2022062305/5681668c550346895dda5394/html5/thumbnails/2.jpg)
Agenda
• Introduction
• Methodology
• Historical trends in computational linguistics
• Is computational linguistics becoming more applied?
• Differences and similarities among COLING, ACL, and EMNLP
• Conclusion
![Page 3: Studying the History of Ideas Using Topic Models](https://reader035.vdocuments.net/reader035/viewer/2022062305/5681668c550346895dda5394/html5/thumbnails/3.jpg)
Goal
• Identify and study the exploration of ideas in a scientific field over time.
– Periods of gradual development.
– Major ruptures.
– Waxing and waning of both topic areas and connections with applied topics and nearby fields.
![Page 4: Studying the History of Ideas Using Topic Models](https://reader035.vdocuments.net/reader035/viewer/2022062305/5681668c550346895dda5394/html5/thumbnails/4.jpg)
Citation graphs
![Page 5: Studying the History of Ideas Using Topic Models](https://reader035.vdocuments.net/reader035/viewer/2022062305/5681668c550346895dda5394/html5/thumbnails/5.jpg)
Change of ideas
• Rather than deal with papers or authors, this paper is focused on the change of ideas in a field over time.
• Apply Kuhn’s insight that vocabulary and vocabulary shift are crucial indicators of ideas and shifts in ideas.
• Operationalize this insight with the unsupervised topic model Latent Dirichlet Allocation (LDA; Blei et al., 2003).
![Page 6: Studying the History of Ideas Using Topic Models](https://reader035.vdocuments.net/reader035/viewer/2022062305/5681668c550346895dda5394/html5/thumbnails/6.jpg)
Analyzing the trends in CL
• 12,500 documents of the ACL Anthology have been analyzed.
• Has the CL field gotten more theoretical or more applied?
• What topics have declined over the years, and which ones have remained constant?
• How have fields like Dialogue or MT changed over the years?
• Are there differences among the conferences?
![Page 7: Studying the History of Ideas Using Topic Models](https://reader035.vdocuments.net/reader035/viewer/2022062305/5681668c550346895dda5394/html5/thumbnails/7.jpg)
ACL Anthology
• A public repository of all papers in the major journals, conferences, and workshops.
– Computational Linguistics.
– ACL, COLING, EMNLP, and so on.
• Comprises over 14,000 documents.
• From 1965 to 2008.
• Indexed by conference and year.
• Used as the basis of citation analysis work (Joseph & Radev, 2007).
![Page 8: Studying the History of Ideas Using Topic Models](https://reader035.vdocuments.net/reader035/viewer/2022062305/5681668c550346895dda5394/html5/thumbnails/8.jpg)
Data in the ACL Anthology
![Page 9: Studying the History of Ideas Using Topic Models](https://reader035.vdocuments.net/reader035/viewer/2022062305/5681668c550346895dda5394/html5/thumbnails/9.jpg)
Latent Dirichlet Allocation (LDA)
• A generative latent variable model that treats documents as bags of words generated by one or more topics.
– Each document is represented as a multinomial distribution over topics.
– Each topic is in turn characterized by a multinomial distribution over words.
• Parameter estimation using collapsed Gibbs sampling (Griffiths & Steyvers, 2004).
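A minimal sketch of collapsed Gibbs sampling for LDA, written in plain Python to make the sampling step concrete. It is an illustration, not the authors' implementation: the function name `lda_gibbs`, the symmetric hyperparameters `alpha` and `beta`, their default values, and the iteration count are all assumptions. Documents are assumed to be lists of integer word ids.

```python
import random

def lda_gibbs(docs, num_topics, vocab_size, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampler for LDA; docs are lists of integer word ids."""
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]                  # n(d, z)
    topic_word = [[0] * vocab_size for _ in range(num_topics)]    # n(z, w)
    topic_total = [0] * num_topics                                # n(z)
    assign = []

    # Give every token a random initial topic and record the counts.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assign.append(zs)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assign[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # p(z | everything else) is proportional to
                # (n(d,z) + alpha) * (n(z,w) + beta) / (n(z) + V * beta).
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                # Record the newly sampled assignment.
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    # Smoothed per-document topic distributions (each row sums to 1).
    theta = [
        [(row[k] + alpha) / (len(doc) + num_topics * alpha) for k in range(num_topics)]
        for row, doc in zip(doc_topic, docs)
    ]
    return theta, topic_word
```

Each row of `theta` is one document's distribution over topics, and each row of `topic_word` holds the raw word counts for one topic.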
![Page 10: Studying the History of Ideas Using Topic Models](https://reader035.vdocuments.net/reader035/viewer/2022062305/5681668c550346895dda5394/html5/thumbnails/10.jpg)
Topic Modeling
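• The empirical probability that an arbitrary paper d written in year y was about topic z:

$$\hat{p}(z \mid y) = \sum_{d : t_d = y} \hat{p}(z \mid d)\, \hat{p}(d \mid y) = \frac{1}{C} \sum_{d : t_d = y} \hat{p}(z \mid d) = \frac{1}{C} \sum_{d : t_d = y} \sum_{z'_i \in d} I(z'_i = z)$$

• I is the indicator function, t_d is the year document d was written, and p̂(d|y) is set to the constant 1/C.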
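The per-year topic probability can be sketched as an average of per-document topic distributions. This is an illustration under stated assumptions: the function name `topic_by_year` is hypothetical, each document is assumed to arrive with an already normalized topic distribution, and C is taken to be the number of documents written in year y.

```python
from collections import defaultdict

def topic_by_year(doc_years, doc_topics):
    """Average the topic distributions of all documents written in each year.

    doc_years[i] is the year of document i; doc_topics[i] is its
    distribution over topics (a list of probabilities summing to 1).
    """
    sums = {}
    counts = defaultdict(int)
    for year, theta in zip(doc_years, doc_topics):
        if year not in sums:
            sums[year] = [0.0] * len(theta)
        for z, p in enumerate(theta):
            sums[year][z] += p
        counts[year] += 1          # C for this year (assumed: document count)
    return {y: [s / counts[y] for s in sums[y]] for y in sums}
```

Plotting one topic's entry across years gives the kind of trend curve discussed in the following slides.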
![Page 11: Studying the History of Ideas Using Topic Models](https://reader035.vdocuments.net/reader035/viewer/2022062305/5681668c550346895dda5394/html5/thumbnails/11.jpg)
Topic selection
• Applied LDA to induce 100 topics, and kept the 36 that were relevant.
• Hand-selected seed words for 10 more topics to improve coverage of the field.
• These 46 topics were used as priors for a new 100-topic run.
• Finally, 43 topics were selected.
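The slides do not say how the selected topics were turned into priors. One common way to seed a topic model, shown here purely as an assumption, is to give each seeded topic extra Dirichlet pseudo-counts on its seed words and leave the remaining topics with a flat prior. The function name `seeded_topic_priors` and the `base` and `boost` values are hypothetical.

```python
def seeded_topic_priors(seed_words_per_topic, vocab, num_topics, base=0.01, boost=1.0):
    """Build one topic-word prior vector per topic.

    Topic k gets `boost` extra pseudo-counts on each of its seed words;
    topics without seeds keep the flat `base` prior.
    """
    if len(seed_words_per_topic) > num_topics:
        raise ValueError("more seeded topics than topics in the run")
    index = {word: i for i, word in enumerate(vocab)}
    priors = [[base] * len(vocab) for _ in range(num_topics)]
    for k, words in enumerate(seed_words_per_topic):
        for word in words:
            if word in index:      # ignore seed words missing from the vocabulary
                priors[k][index[word]] += boost
    return priors
```

In a sampler, these vectors would replace the single symmetric topic-word hyperparameter, so seeded topics start out biased toward their seed words while the unseeded topics remain free to capture anything else.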
![Page 12: Studying the History of Ideas Using Topic Models](https://reader035.vdocuments.net/reader035/viewer/2022062305/5681668c550346895dda5394/html5/thumbnails/12.jpg)
![Page 13: Studying the History of Ideas Using Topic Models](https://reader035.vdocuments.net/reader035/viewer/2022062305/5681668c550346895dda5394/html5/thumbnails/13.jpg)
Topics becoming more important
![Page 14: Studying the History of Ideas Using Topic Models](https://reader035.vdocuments.net/reader035/viewer/2022062305/5681668c550346895dda5394/html5/thumbnails/14.jpg)
Trend of probabilistic models
• The probabilistic model topic increases around 1988, which seems to have been an important year for this topic.
• What do the papers from 1988 tell us about how probabilistic models entered the field?
![Page 15: Studying the History of Ideas Using Topic Models](https://reader035.vdocuments.net/reader035/viewer/2022062305/5681668c550346895dda5394/html5/thumbnails/15.jpg)
![Page 16: Studying the History of Ideas Using Topic Models](https://reader035.vdocuments.net/reader035/viewer/2022062305/5681668c550346895dda5394/html5/thumbnails/16.jpg)
![Page 17: Studying the History of Ideas Using Topic Models](https://reader035.vdocuments.net/reader035/viewer/2022062305/5681668c550346895dda5394/html5/thumbnails/17.jpg)
![Page 18: Studying the History of Ideas Using Topic Models](https://reader035.vdocuments.net/reader035/viewer/2022062305/5681668c550346895dda5394/html5/thumbnails/18.jpg)
Analysis
• 9 of the 10 papers appeared in conference proceedings rather than journals.
– New ideas appear in conferences.
• 5 of the conference papers appeared in COLING, compared to only 1 in ACL.
– COLING is more receptive than ACL to new ideas.
• 6 of the 10 papers either focus on speech or were written by authors who had published on speech recognition topics.
– Speech recognition is an EE field which made early use of probabilistic and statistical methodologies.
![Page 19: Studying the History of Ideas Using Topic Models](https://reader035.vdocuments.net/reader035/viewer/2022062305/5681668c550346895dda5394/html5/thumbnails/19.jpg)
Topics that have declined
![Page 20: Studying the History of Ideas Using Topic Models](https://reader035.vdocuments.net/reader035/viewer/2022062305/5681668c550346895dda5394/html5/thumbnails/20.jpg)
Including lexical semantics, conceptual semantics/story understanding, computational semantics, WordNet, WSD, semantic role labeling, RTE and paraphrase, MUC information extraction, and events/temporal.
![Page 21: Studying the History of Ideas Using Topic Models](https://reader035.vdocuments.net/reader035/viewer/2022062305/5681668c550346895dda5394/html5/thumbnails/21.jpg)
Paradigm shift in machine translation
![Page 22: Studying the History of Ideas Using Topic Models](https://reader035.vdocuments.net/reader035/viewer/2022062305/5681668c550346895dda5394/html5/thumbnails/22.jpg)
Paradigm shift in dialogue
![Page 23: Studying the History of Ideas Using Topic Models](https://reader035.vdocuments.net/reader035/viewer/2022062305/5681668c550346895dda5394/html5/thumbnails/23.jpg)
Peaked topics
![Page 24: Studying the History of Ideas Using Topic Models](https://reader035.vdocuments.net/reader035/viewer/2022062305/5681668c550346895dda5394/html5/thumbnails/24.jpg)
Is CL becoming more applied?
Including machine translation, spelling correction, dialogue systems, information retrieval, call routing, speech recognition, and biomedical applications.
![Page 25: Studying the History of Ideas Using Topic Models](https://reader035.vdocuments.net/reader035/viewer/2022062305/5681668c550346895dda5394/html5/thumbnails/25.jpg)
Six applied topics over time
The years 1989-1994 correspond exactly to the DARPA Speech and Natural Language Workshop, held at different locations.
![Page 26: Studying the History of Ideas Using Topic Models](https://reader035.vdocuments.net/reader035/viewer/2022062305/5681668c550346895dda5394/html5/thumbnails/26.jpg)
Differences and similarities among COLING, ACL, and EMNLP
• Are the topics of these conferences converging?
• Are the probabilistic and machine learning trends that are dominant in ACL becoming dominant in COLING as well?
• Is EMNLP adopting some of the topics that are popular at COLING?
![Page 27: Studying the History of Ideas Using Topic Models](https://reader035.vdocuments.net/reader035/viewer/2022062305/5681668c550346895dda5394/html5/thumbnails/27.jpg)
Entropy of the 3 conferences over time
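As a rough illustration of the quantity being plotted, the entropy of a conference's topic distribution in a given year can be computed as below. The function name `topic_entropy` and the choice of base-2 logarithms are assumptions; the slide does not state the base.

```python
import math

def topic_entropy(dist):
    """Shannon entropy (in bits) of a topic distribution.

    Topics with zero probability contribute nothing to the sum.
    """
    return -sum(p * math.log2(p) for p in dist if p > 0)
```

A conference that spreads its papers evenly over many topics has high entropy, while one concentrated on a few topics has low entropy, so a rising curve indicates broadening topic coverage.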
![Page 28: Studying the History of Ideas Using Topic Models](https://reader035.vdocuments.net/reader035/viewer/2022062305/5681668c550346895dda5394/html5/thumbnails/28.jpg)
Divergence between the 3 conferences
The Jensen-Shannon (JS) divergence between each pair of conferences is plotted.
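A small sketch of the JS divergence between two topic distributions, for instance those of two conferences in the same year. The function names `kl_divergence` and `js_divergence` and the base-2 logarithm are assumptions made for illustration.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits; terms with p_i = 0 are skipped."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: mean KL divergence of p and q from their midpoint."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
```

Unlike the KL divergence, the JS divergence is symmetric and always finite, which makes it convenient for comparing conferences pairwise: with base-2 logarithms it ranges from 0 for identical distributions to 1 for distributions with no topics in common.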
![Page 29: Studying the History of Ideas Using Topic Models](https://reader035.vdocuments.net/reader035/viewer/2022062305/5681668c550346895dda5394/html5/thumbnails/29.jpg)
Conclusion
• The proposed method discovers a number of trends in computational linguistics.
• It shows a convergence over time in the topic coverage of ACL, COLING, and EMNLP, as well as an expansion of topic diversity.
• The growth and convergence of the 3 conferences, perhaps influenced by the need to increase recall, seems to be leading toward a tripartite realization of a single new “latent” conference.
![Page 30: Studying the History of Ideas Using Topic Models](https://reader035.vdocuments.net/reader035/viewer/2022062305/5681668c550346895dda5394/html5/thumbnails/30.jpg)