bcs_seminar.ppt
TRANSCRIPT
[Slide 1]
Lexical Chains for Topic Detection and Tracking
British Classification Society
Feb 23rd 2001
Joe Carthy & Nicola Stokes, University College Dublin
[email protected]@ucd.ie
http://www.cs.ucd.ie/staff/jcarthy
Tel. +353 1 706 2481 or 706 2469
Fax. +353 1 269 7262
[Slide 2]
Topic Detection and Tracking
• Topic Detection and Tracking (TDT)
  – DARPA-funded TDT project with UMass, CMU and Dragon Systems
  – Domain is all broadcast news: written and spoken
• TDT includes:
  – First Story Detection
  – Event Tracking
  – Segmentation
• Applications:
  – digital news editors
  – media analysts
  – equity traders
[Slide 3]
Topic Tracking and Detection
• Tracking may be defined as:
  – Take a corpus of news stories
  – Given 1 (or 2, 4, 8, 16) sample stories about an event
  – Find all subsequent stories in the corpus about that event
• Detection: Is this a new story?
[Slide 4]
Topic Tracking and Detection
• An event is defined by a list of stories that discuss the event, e.g. "Kobe earthquake" is defined by the first story that describes this event.
[Slide 5]
UCD TDT ARCHITECTURE
[Diagram: system components — SERVER, Lexical Chainer, Event Tracker, Event Detector]
[Slide 6]
Topic Detection and Tracking
[Diagram: a data stream of incoming news stories. Current story — DATE: 02:36, TITLE: "O.J. Simpson Bought Knife, Murder Hearing Told" — is compared against previous stories: O.J. Simpson Murder Trial, NYC Subway Bombings, Carlos the Jackal]
[Slide 7]
Benchmark Systems
• Implemented benchmark systems using conventional IR techniques:
  – Stemmed keywords (Porter)
  – Stopword removal
  – Term weighting (Robertson, Sparck Jones)
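The benchmark pipeline can be sketched in a few lines. This is a toy tf-idf weighting, a stand-in for the Robertson/Sparck Jones weighting the slides cite (whose exact formula is not given here), and it assumes the tokens have already been stemmed and stopword-filtered:

```python
import math
from collections import Counter

def tf_idf_weights(docs):
    """Toy term weighting over already-stemmed, stopword-free token lists."""
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))           # document frequency per term
    weights = []
    for doc in docs:
        tf = Counter(doc)
        # tf x idf, with idf = log(N / df); a stand-in for the exact
        # Robertson / Sparck Jones weight, which is not given on the slide
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

# Hypothetical three-document corpus for illustration
docs = [["quake", "kobe", "quake"], ["trial", "simpson"], ["quake", "trial"]]
w = tf_idf_weights(docs)
```

Terms that occur in every document get weight zero, which is the intended behaviour: they carry no tracking signal.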
[Slide 8]
Lexical Chaining
– Lexical chains build on textual cohesion (Halliday & Hasan)
– Cohesion: the text makes sense as a whole
– Cohesion occurs where the interpretation of one item depends on that of another item in the text. It is this dependency that gives rise to cohesion.
[Slide 9]
Lexical Chaining
– Where the cohesive elements occur over a number of sentences, a cohesive chain is formed.
– For example, the sentences:
John had mud pie for dessert. Mud pie is made of chocolate. John really enjoyed it.
– give rise to the lexical chain: {mud pie, dessert, mud pie, chocolate, it}
– Lexical cohesion is, as the name suggests, lexical: it involves the selection of a lexical item that is in some way related to one occurring previously.
[Slide 10]
Lexical Chaining
– Reiteration is a form of lexical cohesion which involves the repetition of a lexical item. This may be simple repetition of the word, but it also includes the use of a synonym, near-synonym or superordinate.
– For example, in the sentences "John bought a Jag. He loves the car.", the superordinate car refers back to the subordinate Jag.
– The part-whole relationship is also an example of lexical cohesion, e.g. airplane and wing.
– A lexical chain is a sequence of related words in the text, spanning short or long distances.
[Slide 11]
Lexical Chaining
– A chain is independent of the grammatical structure of the text; in effect, it is a list of words that captures a portion of the cohesive structure of the text.
– A lexical chain can provide a context for the resolution of an ambiguous term and enable identification of the concept the term represents, i.e. word sense disambiguation.
– Morris and Hirst were the first researchers to suggest the use of lexical chains to determine the structure of texts.
[Slide 12]
Lexical Chaining
– By identifying the lexical chains in a news story we hope to identify the focus of the story. This can then be used in tracking and detection.
– It is important to realise that determining lexical chains is not a sophisticated natural language analysis process.
– Other applications of lexical chaining:
  • Hypertext links: Green
  • Summarisation: Barzilay
  • Segmentation: Okumura and Honda
  • IR: Stairmand, Ellman, Mochizuki
  • Malapropism detection: St-Onge
  • Multimedia indexing: Kazman, Al-Halimi
[Slide 13]
Chain Generation
– In order to construct lexical chains we must be able to identify relationships between terms.
– This is made possible by the use of WordNet.
– WordNet is a computational lexicon which was developed at Princeton University.
– In WordNet, synonym sets (synsets) are used to represent concepts where a synonym set corresponds to a concept and consists of all those terms that may be used to refer to that concept.
[Slide 14]
Chain Generation
– For example, take the concept airplane: it is represented by the synset {airplane, aeroplane, plane}.
– A WordNet synset has a numerical identifier such as 02054514.
– Links between synsets in WordNet represent conceptual relations such as synonymy, hyponymy, meronymy (part-of) etc.
– The synset identifier can be used to represent the concept referred to in the synset, for indexing and lexical chaining purposes.
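The synset-id indexing idea can be illustrated with a toy lexicon. The airplane synset and id 02054514 come from the slides, as do the railway-carriage and automobile ids; the extra term "motorcar" and the `synset_ids` helper are hypothetical additions:

```python
# Hypothetical miniature lexicon in WordNet's style: each synset id maps to
# the set of terms that can express that concept.  Id 02054514 and the
# airplane synset are taken from the slides; "motorcar" is illustrative.
SYNSETS = {
    "02054514": {"airplane", "aeroplane", "plane"},
    "324932": {"railway carriage", "car"},
    "057643": {"automobile", "car", "motorcar"},
}

def synset_ids(term):
    """All synset identifiers whose synset contains the term — one id per
    sense, so the id (rather than the raw word) can serve as an index key."""
    return {sid for sid, terms in SYNSETS.items() if term in terms}
```

An ambiguous word such as "car" maps to two ids (two senses), which is exactly why chaining over synset ids rather than surface words requires disambiguation.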
[Slide 15]
Word Sense Disambiguation
[Diagram: word sense disambiguation example. The 1st term EXHAUST has senses Car_exhaust 32748 and Tire_out/Fatigue 374222; term_i CAR has senses Railway carriage 324932 and Automobile 057643; part-of and has-a links connect these to Train 3984.]
[Slide 16]
Chain Generation
• Chaining procedure for a story:
  – Take the i-th term in the story and generate the set Neighbour_i of its related synsets.
  – For each other term, if it is a member of the set Neighbour_i, add it to the lexical chain for term_i.
  – If the lexical chain contains 3 or more elements, store the chain in a chain index file.
  – Repeat the above for all terms in the story.
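A minimal sketch of this chaining procedure, assuming the WordNet lookup is replaced by a precomputed `neighbours` table of related synset ids (the table below is hypothetical) and approximating the membership test by set intersection:

```python
def build_chains(terms, neighbours, min_len=3):
    """Sketch of the chaining procedure above.  `neighbours` maps each term
    to the set of synset ids related to it (assumed to come from a WordNet
    lookup, elided here).  Two terms are chained when their neighbour sets
    intersect — a simplification of the slide's membership test."""
    chains = []
    for i, term in enumerate(terms):
        chain = [term]
        for other in terms[i + 1:]:
            if neighbours.get(term, set()) & neighbours.get(other, set()):
                chain.append(other)
        if len(chain) >= min_len:     # store only chains of 3+ elements
            chains.append(chain)
    return chains

# Hypothetical neighbour sets (synset ids are illustrative):
nb = {"car": {"324932", "057643"}, "automobile": {"057643"},
      "engine": {"057643", "3984"}, "train": {"3984"}}
chains = build_chains(["car", "automobile", "engine", "train"], nb)
```

Here only the chain seeded at "car" reaches the three-element threshold; shorter candidate chains are discarded, as the procedure specifies.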
[Slide 17]
– Computing Chain_Sim(Trackset_i, Story_j)
• The Overlap Coefficient may be defined as follows, for two lexical chains c1 and c2:
• Overlap Coefficient = |c1 ∩ c2| / min(|c1|, |c2|)
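The overlap coefficient is straightforward once chains are represented as term sets; `overlap_coefficient` is an illustrative helper, not code from the UCD system:

```python
def overlap_coefficient(c1, c2):
    """|c1 ∩ c2| / min(|c1|, |c2|) for two lexical chains given as
    collections of terms; defined as 0.0 if either chain is empty."""
    s1, s2 = set(c1), set(c2)
    if not s1 or not s2:
        return 0.0
    return len(s1 & s2) / min(len(s1), len(s2))
```

Normalising by the smaller chain means a short chain fully contained in a longer one scores 1.0, which suits comparing a compact tracking query against a full story.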
[Slide 18]
Evaluation Metrics
– The system returns a set S of documents (S' is the set of documents not returned):
  • a = # in S discussing new events
  • b = # in S not discussing new events
  • c = # in S' discussing new events
  • d = # in S' not discussing new events
– Recall = a / (a + c)
– Precision = a / (a + b)
– Miss Rate = c / (a + c) = 1 − Recall
– False Alarm Rate = b / (b + d) = Fallout
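These four rates follow directly from the counts a, b, c, d; a small illustrative helper (not from the original system):

```python
def tdt_metrics(a, b, c, d):
    """Recall, precision, miss rate and false-alarm rate from the four
    counts defined above (a, b over the returned set S; c, d over S')."""
    recall = a / (a + c)
    precision = a / (a + b)
    miss_rate = c / (a + c)           # equals 1 - recall
    false_alarm = b / (b + d)         # equals fallout
    return recall, precision, miss_rate, false_alarm
```

For example, with a = 8, b = 2, c = 2, d = 88, recall and precision are both 0.8, the miss rate is 0.2, and the false-alarm rate is 2/90.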
[Slide 19]
Tracking Results
[Chart: Average Recall (0–100%) vs Threshold (0.25–0.6) for LexTrack-O (Nt = 1), KeyTrack (Nt = 1) and LexTrack (Nt = 1)]
[Slide 20]
Tracking Results
[Chart: % Miss Rate (0–100%) vs Threshold (0.25–0.6) for LexTrack-O (Nt = 1), KeyTrack (Nt = 1) and LexTrack (Nt = 1)]
[Slide 21]
Detection Results
[Chart: Detection performance — % Misses (0–100%) vs % False Alarms (0–100%) for Lex_Detect, TRAD and CHAINS_ONLY]
[Slide 22]
Analysis of results
– Expected trade-off between precision and recall
– A small number of stories is sufficient to construct a tracking query
– Performance in line with other TDT researchers
– Lexical chains: improvement not significant?
[Slide 23]
TDT and Lexical Chain References
• Allan, J., Carbonell, J., Doddington, G., Yamron, J., and Yang, Y., “Topic Detection and Tracking Pilot Study: Final Report”, Proceedings of the DARPA Broadcast News Transcription and Understanding Workshop, Morgan Kaufmann, San Francisco, 1998.
• Allan, J., Papka, R., and Lavrenko, V., “Online New Event Detection and Tracking”, Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Melbourne, Australia, August 1998.
• Barzilay, R., “Lexical Chains for Summarization”, M.Sc. Thesis, Ben-Gurion University of the Negev, Israel, November 1997.
• Barzilay, R., and Elhadad, M., “Using Lexical Chains for Text Summarization”, The Fifth Bar-Ilan Symposium on Foundations of Artificial Intelligence Focusing on Intelligent Agents, Bar-Ilan University, Ramat Gan, Israel, June 1997.
• Budanitsky, A., “Lexical Semantic Relatedness and its Application in Natural Language Processing”, (PhD thesis) Technical Report CSRG-390, University of Toronto, 1999.
• Ellman, J., “Using Roget's Thesaurus to Determine the Similarity of Texts”, PhD Thesis, University of Sunderland, 2000.
• Fellbaum, C., (Ed.), WordNet: An Electronic Lexical Database and Some of its Applications, MIT Press, 1998.
• Green, S.J., “Automatically Generating Hypertext by Computing Semantic Similarity”, Ph.D. Thesis, University of Toronto, 1997.
[Slide 24]
• Halliday, M.A.K. and Hasan, R., “Cohesion In English”, Longman , 1976.
• Hatch, P., "Lexical Chaining for the Online Detection of New Events", M.Sc. Thesis, University College Dublin, 2000.
• Hirst, G., and St-Onge, D., “Lexical Chains as Representations of Context for the Detection and Correction of Malapropisms”, in WordNet: An Electronic Lexical Database and Some of its Applications, Fellbaum, C., (Ed.), MIT Press, 1998.
• Kazman, R., Al-Halimi, R., Hunt, W., and Mantei, M., “Four Paradigms for Indexing Video Conferences”, IEEE MultiMedia, 3 (1), Spring 1996.
• Mochizuki, H., Iwayama, M., and Okumura, M., “Passage Level Document Retrieval Using Lexical Chains”, RIAO 2000, Content Based Multimedia Information Access, 491-506, 2000.
• Morris J., and Hirst, G., “Lexical Cohesion, the Thesaurus, and the Structure of Text”, Computational Linguistics, 17 (1), 211-232, 1991.
• Okumura, M., and Honda, T., “Word Sense Disambiguation and Text Segmentation Based on Lexical Cohesion”, In Proceedings of the Fifteenth International Conference on Computational Linguistics (COLING-94), Vol. 2, 775-761, Kyoto, Japan, August 1994.
• Porter, M.F., “An Algorithm for Suffix Stripping”, Program, 14, 130-137, 1980.
• Robertson, S.E., and Sparck Jones, K., "Simple Approaches to Text Retrieval", University of Cambridge Computing Laboratory Technical Report Number 356, May 1997.
• Stairmand, M.A., “A Computational Analysis of Lexical Cohesion with Applications in Information Retrieval”, Ph.D. Thesis, UMIST, 1996.
• Stokes, N., and Carthy, J., "First Story Detection using a Composite Document Representation", HLT 2001, Human Language Technology Conference, San Diego, California, March 18-21, 2001.
• TDT2000, “The Year 2000 Topic Detection and Tracking (TDT2000) Task Definition and Evaluation Plan”, available at the following URL: http://morph.ldc.upenn.edu/TDT/Guide/manual.front.html, November 2000.