(C) 2000, The University of Michigan
Language and Information
Handout #4
November 9, 2000
Course Information
• Instructor: Dragomir R. Radev ([email protected])
• Office: 305A, West Hall
• Phone: (734) 615-5225
• Office hours: TTh 3-4
• Course page: http://www.si.umich.edu/~radev/760
• Class meets on Thursdays, 5-8 PM in 311 West Hall
Readings
• Textbook:
– Oakes Ch.3: 95-96, 110-120
– Oakes Ch.4: 149-150, 158-166, 182-189
– Oakes Ch.5: 199-212, 221-223, 236-247
• Additional readings
– Knight “Statistical Machine Translation Workbook” (http://www.clsp.jhu.edu/ws99/)
– McKeown & Radev “Collocations”
– Optional: M&S chapters 4, 5, 6, 13, 14
Statistical Machine Translation and Language Modeling
The Noisy Channel Model
• Source-channel model of communication
• Parametric probabilistic models of language and translation
• Training such models
Statistics
• Given f, guess e
[Diagram: e → encoder (E → F) → f → decoder (F → E) → e′]
$$e' = \arg\max_e P(e \mid f) = \arg\max_e P(f \mid e)\, P(e)$$
where P(f | e) is the translation model and P(e) is the language model.
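As a minimal sketch of the decision rule on this slide, the Python below picks the English candidate e that maximizes P(f|e) P(e). The candidate sentences and every probability value are invented toy numbers, and `decode` is a hypothetical helper name rather than part of any toolkit.

```python
import math

# Toy language model P(e) and translation model P(f|e); all numbers are invented.
LANGUAGE_MODEL = {
    "the green worm": 0.6,
    "the worm green": 0.1,
    "green the worm": 0.3,
}
TRANSLATION_MODEL = {
    ("le ver vert", "the green worm"): 0.4,
    ("le ver vert", "the worm green"): 0.5,
    ("le ver vert", "green the worm"): 0.1,
}

def decode(f, candidates):
    """Return the candidate e maximizing P(f|e) * P(e), scored in log space."""
    def score(e):
        p = TRANSLATION_MODEL.get((f, e), 0.0) * LANGUAGE_MODEL.get(e, 0.0)
        return math.log(p) if p > 0 else float("-inf")
    return max(candidates, key=score)

best = decode("le ver vert", list(LANGUAGE_MODEL))
```

In this toy setup the translation model alone would prefer the word-for-word candidate, and the language model is what tips the choice toward the fluent one.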
Parametric probabilistic models
• Language model (LM)
$$P(e) = P(e_1, e_2, \ldots, e_L) = P(e_1)\, P(e_2 \mid e_1) \cdots P(e_L \mid e_1 \ldots e_{L-1})$$
$$P(e_L \mid e_1 \ldots e_{L-1}) \approx P(e_L \mid e_{L-2}, e_{L-1})$$
• Deleted interpolation
• Translation model (TM)
Alignment: $P(f, a \mid e)$
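The trigram approximation and the idea of interpolating estimates of different orders can be sketched as follows. This is an illustration under stated assumptions: the interpolation weights are fixed by hand here, whereas deleted interpolation proper estimates them from held-out data, and the function names and the toy sentence are hypothetical.

```python
from collections import Counter

def train_counts(tokens):
    """Count unigrams, bigrams and trigrams in a token list."""
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
    return uni, bi, tri

def interpolated_prob(w1, w2, w3, uni, bi, tri, lambdas=(0.1, 0.3, 0.6)):
    """P(w3 | w1, w2) as a weighted mix of unigram, bigram and trigram estimates."""
    l1, l2, l3 = lambdas  # hand-picked weights (assumption), summing to 1
    total = sum(uni.values())
    p_uni = uni[w3] / total if total else 0.0
    p_bi = bi[(w2, w3)] / uni[w2] if uni[w2] else 0.0
    p_tri = tri[(w1, w2, w3)] / bi[(w1, w2)] if bi[(w1, w2)] else 0.0
    return l1 * p_uni + l2 * p_bi + l3 * p_tri

tokens = "the board met and the board voted".split()
uni, bi, tri = train_counts(tokens)
p = interpolated_prob("the", "board", "met", uni, bi, tri)
```

Mixing in the lower-order estimates keeps the probability non-zero for trigrams that never occurred in training, as long as the final word itself was seen.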
IBM’s EM trained models
1. Word translation
2. Local alignment
3. Fertilities
4. Class-based alignment
5. Non-deficient algorithm (avoid overlaps, overflow)
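The first model in the list (word translation) is usually trained with EM. Below is a minimal sketch of that procedure, under simplifying assumptions: no NULL word, a two-sentence invented corpus, and a hypothetical function name `train_model1`. It is an illustration of the idea, not a reproduction of the original IBM implementation.

```python
from collections import defaultdict

def train_model1(pairs, iterations=10):
    """EM for word-translation probabilities t(f|e) on (foreign, english) sentence pairs."""
    f_vocab = {f for fs, _ in pairs for f in fs}
    t = defaultdict(lambda: 1.0 / len(f_vocab))  # uniform starting point
    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (f, e) links
        total = defaultdict(float)  # expected count of e
        for fs, es in pairs:
            for f in fs:
                norm = sum(t[(f, e)] for e in es)
                for e in es:
                    frac = t[(f, e)] / norm  # E-step: posterior that e generated f
                    count[(f, e)] += frac
                    total[e] += frac
        for (f, e), c in count.items():  # M-step: renormalize per English word
            t[(f, e)] = c / total[e]
    return t

pairs = [
    ("la maison".split(), "the house".split()),
    ("la fleur".split(), "the flower".split()),
]
t = train_model1(pairs)
```

Because "la" co-occurs with "the" in both pairs, EM gradually shifts probability mass so that "maison" is explained by "house" and "la" by "the".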
Lexical Semantics and WordNet
Meanings of words
• Lexemes, lexicon, sense(s)
• Examples:
– Red, n: the color of blood or a ruby
– Blood, n: the red liquid that circulates in the heart, arteries and veins of animals
– Right, adj: located nearer the right hand esp. being on the right when facing the same direction as the observer
• Do dictionaries give us definitions?
Relations among words
• Homonymy:
– Instead, a bank can hold the investments in a custodial account in the client’s name.
– But as agriculture burgeons on the east bank, the river will shrink even more.
• Other examples: be/bee?, wood/would?
• Homophones
• Homographs
• Applications: spelling correction, speech recognition, text-to-speech
• Example: Un ver vert va vers un verre vert. (French: “A green worm goes toward a green glass.”)
Polysemy
• They rarely serve red meat, preferring to prepare seafood, poultry, or game birds.
• He served as U.S. ambassador to Norway in 1976 and 1977.
• He might have served his time, come out and led an upstanding life.
• Homonymy: distinct and unrelated meanings, possibly with different etymology (multiple lexemes).
• Polysemy: single lexeme with two meanings.
• Example: an “idea bank”
Synonymy
• Principle of substitutability
• How big is this plane?
• Would I be flying on a large or small plane?
• Miss Nelson, for instance, became a kind of big sister to Mrs. Van Tassel’s son, Benjamin.
• ?? Miss Nelson, for instance, became a kind of large sister to Mrs. Van Tassel’s son, Benjamin.
• What is the cheapest first class fare?
• ?? What is the cheapest first class cost?
Semantic Networks
• Used to represent relationships between words
• Example: WordNet - created by George Miller’s team at Princeton (http://www.cogsci.princeton.edu/~wn)
• Based on synsets (synonyms, interchangeable words) and lexical matrices
Lexical matrix
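| Word meanings \ Word forms | F1 | F2 | F3 | … | Fn |
|---|---|---|---|---|---|
| M1 | E1,1 | E1,2 | | | |
| M2 | | E2,2 | | | |
| … | | | | … | |
| Mm | | | | | Em,n |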
Synsets
• Disambiguation
– {board, plank}
– {board, committee}
• Synonyms
– substitution
– weak substitution
– synonyms must be of the same part of speech
$ ./wn board -hypen
Synonyms/Hypernyms (Ordered by Frequency) of noun board
9 senses of board
Sense 1
board => committee, commission => administrative unit => unit, social unit => organization, organisation => social group => group, grouping
Sense 2
board => sheet, flat solid => artifact, artefact => object, physical object => entity, something
Sense 3
board, plank => lumber, timber => building material => artifact, artefact => object, physical object => entity, something
Sense 4
display panel, display board, board => display => electronic device => device => instrumentality, instrumentation => artifact, artefact => object, physical object => entity, something
Sense 5
board, gameboard => surface => artifact, artefact => object, physical object => entity, something
Sense 6
board, table => fare => food, nutrient => substance, matter => object, physical object => entity, something
Sense 7
control panel, instrument panel, control board, board, panel => electrical device => device => instrumentality, instrumentation => artifact, artefact => object, physical object => entity, something
Sense 8
circuit board, circuit card, board, card => printed circuit => computer circuit => circuit, electrical circuit, electric circuit => electrical device => device => instrumentality, instrumentation => artifact, artefact => object, physical object => entity, something
Sense 9
dining table, board => table => furniture, piece of furniture, article of furniture => furnishings => instrumentality, instrumentation => artifact, artefact => object, physical object => entity, something
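The hypernym chains in the `wn` output above can be walked programmatically. The sketch below hard-codes the links for senses 1 and 3 of "board" as a plain dictionary; the key strings and the helper names `hypernym_chain` and `is_a` are hypothetical, and a real WordNet synset may have more than one hypernym, which this toy structure does not model.

```python
# Child -> parent links copied from senses 1 and 3 of "board" in the wn output.
HYPERNYM = {
    "board (sense 1)": "committee, commission",
    "committee, commission": "administrative unit",
    "administrative unit": "unit, social unit",
    "unit, social unit": "organization, organisation",
    "organization, organisation": "social group",
    "social group": "group, grouping",
    "board, plank": "lumber, timber",
    "lumber, timber": "building material",
    "building material": "artifact, artefact",
    "artifact, artefact": "object, physical object",
    "object, physical object": "entity, something",
}

def hypernym_chain(synset):
    """Follow parent links from a synset up to its top-level concept."""
    chain = [synset]
    while chain[-1] in HYPERNYM:
        chain.append(HYPERNYM[chain[-1]])
    return chain

def is_a(synset, ancestor):
    """True if ancestor appears somewhere above synset in the hierarchy."""
    return ancestor in hypernym_chain(synset)[1:]

plank_chain = hypernym_chain("board, plank")
```

With this data, the plank sense is an artifact while the committee sense is not, which is the kind of distinction a disambiguation step could use.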
Antonymy
• “x” vs. “not-x”
• “rich” vs. “poor”?
• {rise, ascend} vs. {fall, descend}
Other relations
• Meronymy: X is a meronym of Y when native speakers of English accept sentences similar to “X is a part of Y”, “X is a member of Y”.
• Hyponymy: {tree} is a hyponym of {plant}.
• Hierarchical structure based on hyponymy (and hypernymy).
Other features of WordNet
• Index of familiarity
• Polysemy
Familiarity and polysemy
board used as a noun is familiar (polysemy count = 9)
bird used as a noun is common (polysemy count = 5)
cat used as a noun is common (polysemy count = 7)
house used as a noun is familiar (polysemy count = 11)
information used as a noun is common (polysemy count = 5)
retrieval used as a noun is uncommon (polysemy count = 3)
serendipity used as a noun is very rare (polysemy count = 1)
Compound nouns
advisory board
appeals board
backboard
backgammon board
baseboard
basketball backboard
big board
billboard
binder's board
binder board
blackboard
board game
board measure
board meeting
board member
board of appeals
board of directors
board of education
board of regents
board of trustees
Overview of senses
1. board -- (a committee having supervisory powers; "the board has seven members")
2. board -- (a flat piece of material designed for a special purpose; "he nailed boards across the windows")
3. board, plank -- (a stout length of sawn timber; made in a wide variety of sizes and used for many purposes)
4. display panel, display board, board -- (a board on which information can be displayed to public view)
5. board, gameboard -- (a flat portable surface (usually rectangular) designed for board games; "he got out the board and set up the pieces")
6. board, table -- (food or meals in general; "she sets a fine table"; "room and board")
7. control panel, instrument panel, control board, board, panel -- (an insulated panel containing switches and dials and meters for controlling electrical devices; "he checked the instrument panel"; "suddenly the board lit up like a Christmas tree")
8. circuit board, circuit card, board, card -- (a printed circuit that can be inserted into expansion slots in a computer to increase the computer's capabilities)
9. dining table, board -- (a table at which meals are served; "he helped her clear the dining table"; "a feast was spread upon the board")
Top-level concepts
{act, action, activity}
{animal, fauna}
{artifact}
{attribute, property}
{body, corpus}
{cognition, knowledge}
{communication}
{event, happening}
{feeling, emotion}
{food}
{group, collection}
{location, place}
{motive}
{natural object}
{natural phenomenon}
{person, human being}
{plant, flora}
{possession}
{process}
{quantity, amount}
{relation}
{shape}
{state, condition}
{substance}
{time}
Information Extraction
Types of Information Extraction
• Template filling
• Language reuse
• Biographical information
• Question answering
MUC-4 Example
INCIDENT: DATE                  30 OCT 89
INCIDENT: LOCATION              EL SALVADOR
INCIDENT: TYPE                  ATTACK
INCIDENT: STAGE OF EXECUTION    ACCOMPLISHED
INCIDENT: INSTRUMENT ID
INCIDENT: INSTRUMENT TYPE
PERP: INCIDENT CATEGORY         TERRORIST ACT
PERP: INDIVIDUAL ID             "TERRORIST"
PERP: ORGANIZATION ID           "THE FMLN"
PERP: ORG. CONFIDENCE           REPORTED: "THE FMLN"
PHYS TGT: ID
PHYS TGT: TYPE
PHYS TGT: NUMBER
PHYS TGT: FOREIGN NATION
PHYS TGT: EFFECT OF INCIDENT
PHYS TGT: TOTAL NUMBER
HUM TGT: NAME
HUM TGT: DESCRIPTION            "1 CIVILIAN"
HUM TGT: TYPE                   CIVILIAN: "1 CIVILIAN"
HUM TGT: NUMBER                 1: "1 CIVILIAN"
HUM TGT: FOREIGN NATION
HUM TGT: EFFECT OF INCIDENT     DEATH: "1 CIVILIAN"
HUM TGT: TOTAL NUMBER
On October 30, 1989, one civilian was killed in a reported FMLN attack in El Salvador.
Language reuse
[Figure: the NP “Yugoslav President Slobodan Milosevic”, with “Yugoslav President” labeled [description] and “Slobodan Milosevic” labeled [entity]. Caption: Phrase to be reused.]
Example
[Figure: “Andrija Hebrang , The Croatian Defense Minister” analyzed as NP [entity] “Andrija Hebrang”, Punc “,”, NP [description] “The Croatian Defense Minister”.]
Issues involved
• Text generation depends on lexical resources
• Lexical choice
• Corpus processing vs. manual compilation
• Deliberate decisions by writers
• Difficult to encode by hand
• Dynamically updated (Scott O’Grady)
• No full semantic representation
Named entities
Richard Butler met Tareq Aziz Monday after rejecting Iraqi attempts to set deadlines for finishing his work.
Yitzhak Mordechai will meet Mahmoud Abbas at 7 p.m. (1600 GMT) in Tel Aviv after a 16-month-long impasse in peacemaking.
Sinn Fein deferred a vote on Northern Ireland's peace deal Sunday.
Hundreds of troops patrolled Dili on Friday during the anniversary of Indonesia's 1976 annexation of the territory.
Entities + Descriptions
Chief U.N. arms inspector Richard Butler met Iraq’s Deputy Prime Minister Tareq Aziz Monday after rejecting Iraqi attempts to set deadlines for finishing his work.
Israel's Defense Minister Yitzhak Mordechai will meet senior Palestinian negotiator Mahmoud Abbas at 7 p.m. (1600 GMT) in Tel Aviv after a 16-month-long impasse in peacemaking.
Sinn Fein, the political wing of the Irish Republican Army, deferred a vote on Northern Ireland's peace deal Sunday.
Hundreds of troops patrolled Dili, the Timorese capital, on Friday during the anniversary of Indonesia's 1976 annexation of the territory.
Building a database of descriptions
• Size of database: 59,333 entities and 193,228 descriptions as of 08/01/98
• Text processed: 494 MB (ClariNet, Reuters, UPI)
• Length: 1-15 lexical items
• Accuracy: (precision 94%, recall 55%)
Multiple descriptions per entity
Profile for Ung Huot:
A senior member
Cambodia’s
Cambodian foreign minister
Co-premier
First prime minister
Foreign minister
His excellency
Mr.
New co-premier
New first prime minister
Newly-appointed first prime minister
Premier
Language reuse and regeneration
CONCEPTS + CONSTRAINTS = CONSTRUCTS
Corpus analysis: determining constraints
Text generation: applying constraints
Language reuse and regeneration
• Understanding: full parsing is expensive
• Generation: expensive to use full parses
• Bypassing certain stages (e.g., syntax)
• Not(!) template-based: still requires extraction, analysis, context identification, modification, and generation
• Factual sentences, sentence fragments
• Reusability of a phrase
Context-dependent solution
Redefining the relation:
$$\mathrm{DescriptionOf}(E, C) = \{\, D_{i,C} \mid D_{i,C} \text{ is a description of } E \text{ in context } C \,\}$$
If named entity E appears in text and the context is C: insert DescriptionOf(E, C) in text.
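The context-dependent relation can be sketched as a lookup keyed on the (entity, context) pair. The table contents, the context labels and the function names below are toy assumptions for illustration; how the context C is actually determined is left out of this sketch.

```python
# (entity, context) -> descriptions observed for that entity in that context; toy data.
DESCRIPTIONS = {
    ("Bill Clinton", "foreign relations"): ["U.S. President"],
    ("Bill Clinton", "elections"): ["Democratic presidential candidate"],
}

def description_of(entity, context):
    """Return the set DescriptionOf(E, C) as a list (empty if nothing is known)."""
    return DESCRIPTIONS.get((entity, context), [])

def insert_description(text, entity, context):
    """Prefix the first mention of entity with a description suited to the context."""
    candidates = description_of(entity, context)
    if not candidates or entity not in text:
        return text
    return text.replace(entity, f"{candidates[0]} {entity}", 1)

sentence = insert_description("Bill Clinton met reporters.", "Bill Clinton", "elections")
```

Taking the first candidate is the simplest possible policy; a learned ranking over the candidates would slot in at that point.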
Multiple descriptions per entity
Profile for Bill Clinton:
U.S. President
President
An Arkansas native
Democratic presidential candidate
Choosing the right description
| Bill Clinton | CONTEXT |
|---|---|
| U.S. President | foreign relations |
| President | national affairs |
| An Arkansas native | false bomb alert in AR |
| Democratic presidential candidate | elections |

Pragmatic and semantic constraints on lexical choice.
Semantic information from WordNet
• All words contribute to the semantic representation
• First sense is used only
• What is a synset?
WordNet synset hierarchy
{07063762} director, manager, managing director
{07063507} administrator, decision maker
{07311393} head, chief, top dog
{06950891} leader
{00004123} person, individual, someone, somebody, human
{00002086} life form, organism, being, living thing
{00001740} entity, something
Lexico-semantic matrix
Profile for Ung Huot
Word synsets: {07147929} premier, {07009772} Kampuchean, …
Parent synsets: {07412658} minister, {07087841} associate
[Table: one row per description, with X marking a matching synset. The marks are listed in their original order; the column each mark belongs to did not survive extraction.]
A senior member: … X
Cambodia's: X …
Cambodian foreign minister: X … X
Co-premier: X … X
First prime minister: X … X
Foreign minister: … X
His excellency: …
Mr.: …
New co-premier: X … X
New first prime minister: X … X
Newly-appointed first prime minister: X … X
Premier: X … X
Prime minister: X … X
Choosing the right description
• Topic approximation by context: words that appear near the entity in the text (bag)
• Name of the entity (set)
• Length of article (continuous)
• Profile: set of all descriptions for that entity (bag) - parent synset offsets for all words wi.
• Semantic information: WordNet synset offsets (bag)
Choosing the right description
(Context, Entity, Description, Length, Profile, Parent) Classes
Ripper feature vector [Cohen 1996]
Example (training)

| T# | Context | Entity | Description | Len | Profile | Parent | Classes |
|---|---|---|---|---|---|---|---|
| 1 | Election, promised, said, carry, party … | Kim Dae-Jung | Veteran opposition leader | 949 | Candidate, chief, policymaker, Korean ... | person, leader, Asian, important person ... | {07136302} {07486519} {07311393} {06950891} {07486079} |
| 2 | Introduced, responsible, running, should, bringing … | Kim Dae-Jung | South Korea's opposition candidate | 629 | Candidate, chief, policymaker, Korean ... | person, leader, Asian, important person ... | {07136302} {07486519} {07311393} {06950891} {07486079} |
| 3 | Attend, during, party, time, traditionally … | Kim Dae-Jung | A front-runner | 535 | Candidate, chief, policymaker, Korean ... | person, leader, Asian, important person ... | {07136302} {07486519} {07311393} {06950891} {07486079} |
| 4 | Discuss, making, party, statement, said … | Kim Dae-Jung | A front-runner | 1114 | Candidate, chief, policymaker, Korean ... | person, leader, Asian, important person ... | {07136302} {07486519} {07311393} {06950891} {07486079} |
| 5 | New, party, politics, in, it … | Kim Dae-Jung | South Korea's president-elect | 449 | Candidate, chief, policymaker, Korean ... | person, leader, Asian, important person ... | {07136302} {07486519} {07311393} {06950891} {07486079} |
Sample rules
Total number of rules: 4085 for 100,000 inputs
{07136302} IF PROFILE ~ P{07136302} LENGTH <= 603 LENGTH >= 361 .
{07136302} IF PROFILE ~ P{07136302} CONTEXT ~ presidential LENGTH <= 412 .
{07136302} IF PROFILE ~ P{07136302} CONTEXT ~ nominee CONTEXT ~ during .
{07136302} IF PROFILE ~ P{07136302} CONTEXT ~ case .
{07136302} IF PROFILE ~ P{07136302} LENGTH <= 603 LENGTH >= 390 LENGTH <= 412 .
{07136302} IF PROFILE ~ P{07136302} CONTEXT ~ nominee CONTEXT ~ and .
Evaluation
• 35,206 tuples; 11,504 distinct entities; 3.06 DDPE
• Training: 90% of corpus (10,353 entities)
• Test: 10% of corpus (1,151 entities)
Evaluation
• Rule format (each matching rule adds constraints):
X [A] (evidence of A)
Y [B] (evidence of B)
X Y [A] [B] (evidence of A and B)
• Classes are in $2^W$ (powerset of WN nodes)
• P&R on the constraints selected by system
Definition of precision and recall
| Model | System | P | R |
|---|---|---|---|
| [A] [B] [C] | [A] [B] [D] | 66.7 % | 66.7 % |
| [A] [B] [C] | [B] [D] | 50.0 % | 33.3 % |
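Precision and recall over sets of constraints can be computed as below. This is a minimal sketch with a hypothetical function name; it treats the model and system outputs as plain sets of class labels. The percentages on this slide (66.7 %, 66.7 %, 50.0 %, 33.3 %) are consistent with comparing the model set [A] [B] [C] against the system sets [A] [B] [D] and [B] [D].

```python
def precision_recall(model, system):
    """Precision and recall of the system's constraint set against the model's."""
    model, system = set(model), set(system)
    correct = len(model & system)  # constraints the system got right
    precision = correct / len(system) if system else 0.0
    recall = correct / len(model) if model else 0.0
    return precision, recall

p1, r1 = precision_recall({"A", "B", "C"}, {"A", "B", "D"})
p2, r2 = precision_recall({"A", "B", "C"}, {"B", "D"})
```

Precision divides by what the system proposed and recall by what the model contains, which is why the shorter system output scores higher on precision relative to its recall.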
Precision and recall

| Training set | Word nodes only: Precision | Word nodes only: Recall | Word and parent nodes: Precision | Word and parent nodes: Recall |
|---|---|---|---|---|
| 500 | 64.29% | 2.86% | 78.57% | 2.86% |
| 1000 | 71.43% | 2.86% | 85.71% | 2.86% |
| 2000 | 42.86% | 40.71% | 67.86% | 62.14% |
| 5000 | 59.33% | 48.40% | 64.67% | 53.73% |
| 10000 | 69.72% | 45.04% | 74.44% | 59.32% |
| 15000 | 76.24% | 44.02% | 73.39% | 53.17% |
| 20000 | 76.25% | 49.91% | 79.08% | 58.70% |
| 25000 | 83.37% | 52.26% | 82.39% | 57.49% |
| 30000 | 80.14% | 50.55% | 82.77% | 57.66% |
| 50000 | 83.13% | 58.53% | 88.87% | 63.39% |
| 100000 | 85.42% | 62.81% | 89.70% | 64.64% |
| 150000 | 87.07% | 63.17% | | |
| 200000 | 85.73% | 62.86% | | |
| 250000 | 87.15% | 63.85% | | |
Question Answering
Question answering
Q: When did Nelson Mandela become president of South Africa?
A: 10 May 1994
Q: How tall is the Matterhorn?
A: The institute revised the Matterhorn 's height to 14,776 feet 9 inches
Q: How tall is the replica of the Matterhorn at Disneyland?
A: In fact he has climbed the 147-foot Matterhorn at Disneyland every week end for the last 3 1/2 years
Q: If Iraq attacks a neighboring country, what should the US do?
A: ??
Q: Why did David Koresh ask the FBI for a word processor?
Q: Name the designer of the shoe that spawned millions of plastic imitations, known as "jellies".
Q: What is the brightest star visible from Earth?
Q: What are the Valdez Principles?
Q: Name a film that has won the Golden Bear in the Berlin Film Festival?
Q: Name a country that is developing a magnetic levitation railway system?
Q: Name the first private citizen to fly in space.
Q: What did Shostakovich write for Rostropovich?
Q: What is the term for the sum of all genetic material in a given organism?
Q: What is considered the costliest disaster the insurance industry has ever faced?
Q: What is Head Start?
Q: What was Agent Orange used for during the Vietnam War?
Q: What did John Hinckley do to impress Jodie Foster?
Q: What was the first Gilbert and Sullivan opera?
Q: What did Richard Feynman say upon hearing he would receive the Nobel Prize in Physics?
Q: How did Socrates die?
Q: Why are electric cars less efficient in the north-east than in California?
The TREC evaluation
• Document retrieval
• Eight years
• Information retrieval?
• Corpus: texts and questions
[Figure: system architecture diagram. Labels: documents, query, Textract, Resporator, Indexer, Index, Query Processing, Search, Hit List, AnSel/Werlect, Ranked Hit List, GuruQA, Answer selection.]
Prager et al. 2000 (SIGIR)
Radev et al. 2000 (ANLP/NAACL)
| QA-Token | Question type | Example |
|---|---|---|
| PLACE$ | Where | In the Rocky Mountains |
| COUNTRY$ | Where/What country | United Kingdom |
| STATE$ | Where/What state | Massachusetts |
| PERSON$ | Who | Albert Einstein |
| ROLE$ | Who | Doctor |
| NAME$ | Who/What/Which | The Shakespeare Festival |
| ORG$ | Who/What | The US Post Office |
| DURATION$ | How long | For 5 centuries |
| AGE$ | How old | 30 years old |
| YEAR$ | When/What year | 1999 |
| TIME$ | When | In the afternoon |
| DATE$ | When/What date | July 4th, 1776 |
| VOLUME$ | How big | 3 gallons |
| AREA$ | How big | 4 square inches |
| LENGTH$ | How big/long/high | 3 miles |
| WEIGHT$ | How big/heavy | 25 tons |
| NUMBER$ | How many | 1,234.5 |
| METHOD$ | How | By rubbing |
| RATE$ | How much | 50 per cent |
| MONEY$ | How much | 4 million dollars |
<p><NUMBER>1</NUMBER></p>
<p><QUERY>Who is the author of the book, "The Iron Lady: A Biography of Margaret Thatcher"?</QUERY></p>
<p><PROCESSED_QUERY>@excwin(*dynamic* @weight(200*Iron_Lady) @weight(200 Biography_of_Margaret_Thatcher) @weight(200 Margaret) @weight(100 author) @weight(100 book) @weight(100 iron) @weight(100 lady) @weight(100 :) @weight(100 biography) @weight(100 thatcher) @weight(400 @syn(PERSON$ NAME$)) )</PROCESSED_QUERY></p>
<p><DOC>LA090290-0118</DOC></p>
<p><SCORE>1020.8114</SCORE></p>
<TEXT><p>THE IRON LADY; A <span class="NAME">Biography of Margaret Thatcher</span> by <span class="PERSON">Hugo Young</span> (<span class="ORG">Farrar , Straus & Giroux</span>) The central riddle revealed here is why, as a woman <span class="PLACEDEF">in a man</span>'s world, <span class="PERSON">Margaret Thatcher</span> evinces such an exclusionary attitude toward women.</p></TEXT>
| SYN-set | N | Score | Score/N |
|---|---|---|---|
| PERSON NAME | 30 | 16.5 | 55.0% |
| PLACE COUNTRY STATE NAME PLACEDEF | 21 | 7.08 | 33.7% |
| NAME | 18 | 3.67 | 20.4% |
| DATE YEAR | 18 | 5.31 | 29.5% |
| PERSON ORG NAME ROLE | 19 | 4.62 | 24.3% |
| undefined | 19 | 11.45 | 60.3% |
| NUMBER | 18 | 8.00 | 44.4% |
| PLACE NAME PLACEDEF | 14 | 10.00 | 71.4% |
| PERSON ORG PLACE NAME PLACEDEF | 10 | 3.03 | 30.3% |
| MONEY RATE | 6 | 1.50 | 25% |
| ORG NAME | 4 | 1.25 | 31.2% |
| SIZE1 | 4 | 2.50 | 62.5% |
| SIZE1 DURATION | 3 | 0.83 | 27.7% |
| STATE | 3 | 2.00 | 66.7% |
| COUNTRY | 3 | 1.33 | 44.3% |
| YEAR | 2 | 1.00 | 50.0% |
| RATE | 2 | 1.50 | 75.0% |
| TIME DURATION | 1 | 0.00 | 0.0% |
| SIZE1 SIZE2 | 1 | 0.00 | 0.0% |
| DURATION TIME | 1 | 0.33 | 33.3% |
| DATE | 1 | 0 | 0.00% |
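| Span | Span type | Number | Rspanno | Count | Notinq | Type | Avgdst | Sscore | TOTAL |
|---|---|---|---|---|---|---|---|---|---|
| Ollie Matson | PERSON | 3 | 3 | 6 | 2 | 1 | 12 | 0.02507 | -7.53 |
| Lou Vasquez | PERSON | 1 | 1 | 6 | 2 | 1 | 16 | 0.02507 | -9.93 |
| Tim O'Donohue | PERSON | 17 | 1 | 4 | 2 | 1 | 8 | 0.02257 | -12.57 |
| Athletic Director Dave Cowen | PERSON | 23 | 6 | 4 | 4 | 1 | 11 | 0.02257 | -15.87 |
| Johnny Ceballos | PERSON | 22 | 5 | 4 | 1 | 1 | 9 | 0.02257 | -19.07 |
| Civic Center Director Martin Durham | PERSON | 13 | 1 | 2 | 5 | 1 | 16 | 0.02505 | -19.36 |
| Johnny Hodges | PERSON | 25 | 2 | 4 | 1 | 1 | 15 | 0.02256 | -25.22 |
| Derric Evans | PERSON | 33 | 4 | 4 | 2 | 1 | 14 | 0.02256 | -25.37 |
| NEWSWIRE Johnny Majors | PERSON | 30 | 1 | 4 | 2 | 1 | 17 | 0.02256 | -25.47 |
| Woodbridge High School | ORG | 18 | 2 | 4 | 1 | 2 | 6 | 0.02257 | -28.37 |
| Evan | PERSON | 37 | 6 | 4 | 1 | 1 | 14 | 0.02256 | -29.57 |
| Gary Edwards | PERSON | 38 | 7 | 4 | 2 | 1 | 17 | 0.02256 | -30.87 |
| O.J. Simpson | NAME | 2 | 2 | 6 | 2 | 3 | 12 | 0.02507 | -37.40 |
| South Lake Tahoe | NAME | 7 | 5 | 6 | 3 | 3 | 14 | 0.02507 | -40.06 |
| Washington High | NAME | 10 | 6 | 6 | 1 | 3 | 18 | 0.02507 | -49.80 |
| Morgan | NAME | 26 | 3 | 4 | 1 | 3 | 12 | 0.02256 | -52.52 |
| Tennesseefootball | NAME | 31 | 2 | 4 | 1 | 3 | 15 | 0.02256 | -56.27 |
| Ellington | NAME | 24 | 1 | 4 | 1 | 3 | 20 | 0.02256 | -59.42 |
| assistant | ROLE | 21 | 4 | 4 | 1 | 4 | 8 | 0.02257 | -62.77 |
| the Volunteers | ROLE | 34 | 5 | 4 | 2 | 4 | 14 | 0.02256 | -71.17 |
| Johnny Mathis | PERSON | 4 | 4 | 6 | -100 | 1 | 11 | 0.02507 | -211.33 |
| Mathis | NAME | 14 | 2 | 2 | -100 | 3 | 10 | 0.02505 | -254.16 |
| coach | ROLE | 19 | 3 | 4 | -100 | 4 | 4 | 0.02257 | -259.67 |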
![Page 62: Language and Information](https://reader035.vdocuments.net/reader035/viewer/2022062816/568155b7550346895dc38df3/html5/thumbnails/62.jpg)
Features (1)
• Number: position of the span among all spans returned. Example: “Lou Vasquez” was the first span returned by GuruQA on the sample question.
• Rspanno: position of the span among all spans returned within the current passage.
• Count: number of spans of any span class retrieved within the current passage.
• Notinq: the number of words in the span that do not appear in the query. Example: Notinq (“Woodbridge high school”) = 1, because both “high” and “school” appear in the query while “Woodbridge” does not. It is set to –100 when the actual value is 0.
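A minimal sketch of the Notinq feature, assuming simple lowercase word matching. The query string below is made up for illustration (the slides do not give the sample question), and `tokenize` is a hypothetical helper rather than GuruQA's tokenizer.

```python
import re


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def notinq(span: str, query: str) -> int:
    """Count span words that do not appear in the query.

    Following the slide, a count of 0 is replaced by the penalty value -100.
    """
    query_words = set(tokenize(query))
    missing = sum(1 for word in tokenize(span) if word not in query_words)
    return missing if missing > 0 else -100


# Hypothetical query, chosen so that "high" and "school" appear in it.
query = "Who is the baseball coach at the local high school?"
print(notinq("Woodbridge high school", query))  # 1
print(notinq("coach", query))                   # -100
```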
![Page 63: Language and Information](https://reader035.vdocuments.net/reader035/viewer/2022062816/568155b7550346895dc38df3/html5/thumbnails/63.jpg)
Features (2)
• Type: the position of the span type in the list of potential span types. Example: Type (“Lou Vasquez”) = 1, because the span type of “Lou Vasquez”, namely “PERSON”, appears first in the SYN-set “PERSON ORG NAME ROLE”.
• Avgdst: the average distance in words between the beginning of the span and the words in the query that also appear in the passage. Example: given the passage “Tim O'Donohue, Woodbridge High School's varsity baseball coach, resigned Monday and will be replaced by assistant Johnny Ceballos, Athletic Director Dave Cowen said.” and the span “Tim O’Donohue”, the value of Avgdst is equal to 8.
• Sscore: passage relevance as computed by GuruQA.
![Page 64: Language and Information](https://reader035.vdocuments.net/reader035/viewer/2022062816/568155b7550346895dc38df3/html5/thumbnails/64.jpg)
Combining evidence
• TOTAL (span) = – 0.3 * number – 0.5 * rspanno + 3.0 * count + 2.0 * notinq – 15.0 * type – 1.0 * avgdst + 1.5 * sscore
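The linear combination can be sketched in a few lines of Python. The weights are copied from the slide; the `SpanFeatures` container and the name `type_rank` are illustrative choices, and the feature values are those listed for “Tim O'Donohue” in the span table.

```python
from dataclasses import dataclass


@dataclass
class SpanFeatures:
    number: int      # rank of the span among all spans returned
    rspanno: int     # rank of the span within its passage
    count: int       # number of spans retrieved in the passage
    notinq: int      # span words not in the query (-100 if none)
    type_rank: int   # position of the span type in the SYN-set
    avgdst: float    # average distance to query words in the passage
    sscore: float    # passage relevance score


def total_score(f: SpanFeatures) -> float:
    """Weighted sum of the seven features, using the weights on the slide."""
    return (-0.3 * f.number - 0.5 * f.rspanno + 3.0 * f.count
            + 2.0 * f.notinq - 15.0 * f.type_rank
            - 1.0 * f.avgdst + 1.5 * f.sscore)


tim = SpanFeatures(number=17, rspanno=1, count=4, notinq=2,
                   type_rank=1, avgdst=8, sscore=0.02257)
print(round(total_score(tim), 2))  # -12.57
```

Because Notinq is set to –100 for spans made up entirely of query words, its weight of 2.0 pushes such spans to the bottom of the ranking.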
![Page 65: Language and Information](https://reader035.vdocuments.net/reader035/viewer/2022062816/568155b7550346895dc38df3/html5/thumbnails/65.jpg)
Extracted text

| Document ID | Score | Extract |
|---|---|---|
| LA053189-0069 | 892.5 | of O.J. Simpson , Ollie Matson and Johnny Mathis |
| LA053189-0069 | 890.1 | Lou Vasquez , track coach of O.J. Simpson , Ollie |
| LA060889-0181 | 887.4 | Tim O'Donohue , Woodbridge High School 's varsity |
| LA060889-0181 | 884.1 | nny Ceballos , Athletic Director Dave Cowen said. |
| LA060889-0181 | 880.9 | aced by assistant Johnny Ceballos , Athletic Direc |
![Page 66: Language and Information](https://reader035.vdocuments.net/reader035/viewer/2022062816/568155b7550346895dc38df3/html5/thumbnails/66.jpg)
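Results

50 bytes

| | First | Second | Third | Fourth | Fifth | TOTAL |
|---|---|---|---|---|---|---|
| # cases | 49 | 15 | 11 | 9 | 4 | 88 |
| Points | 49.00 | 7.50 | 3.67 | 2.25 | 0.80 | 63.22 |

250 bytes

| | First | Second | Third | Fourth | Fifth | TOTAL |
|---|---|---|---|---|---|---|
| # cases | 71 | 16 | 11 | 6 | 5 | 109 |
| Points | 71.00 | 8.00 | 3.67 | 1.50 | 1.00 | 85.17 |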
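The Points rows are consistent with reciprocal-rank scoring: each question whose correct answer appears at rank r contributes 1/r. The slides do not state the rule, so this sketch is an inference from the numbers; the function name is illustrative.

```python
def reciprocal_rank_points(cases_by_rank: list[int]) -> float:
    """Sum 1/rank over all cases; cases_by_rank[0] is the count at rank 1."""
    return sum(n / rank for rank, n in enumerate(cases_by_rank, start=1))


first_table = reciprocal_rank_points([49, 15, 11, 9, 4])
second_table = reciprocal_rank_points([71, 16, 11, 6, 5])
print(f"{first_table:.2f}")   # 63.22
print(f"{second_table:.2f}")  # 85.17
```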
![Page 67: Language and Information](https://reader035.vdocuments.net/reader035/viewer/2022062816/568155b7550346895dc38df3/html5/thumbnails/67.jpg)
Style and Authorship Analysis
![Page 68: Language and Information](https://reader035.vdocuments.net/reader035/viewer/2022062816/568155b7550346895dc38df3/html5/thumbnails/68.jpg)
Style and authorship analysis
• Use of nouns, verbs…
• Use of rare words
• Positional and contextual distribution
• Use of alternatives: “and/also”, “since/because”, “scarcely/hardly”
![Page 69: Language and Information](https://reader035.vdocuments.net/reader035/viewer/2022062816/568155b7550346895dc38df3/html5/thumbnails/69.jpg)
Sample problem
• 15th-century Latin work “De Imitatione Christi”
• Was it written by Thomas à Kempis or Jean Charlier de Gerson?
• Answer: by Kempis
• Why?
![Page 70: Language and Information](https://reader035.vdocuments.net/reader035/viewer/2022062816/568155b7550346895dc38df3/html5/thumbnails/70.jpg)
Yule’s K characteristic
• Vocabulary richness: measure of the probability that any randomly selected pair of words will be identical
$K = 10{,}000 \times \dfrac{M_2 - M_1}{M_1 \times M_1}$
• $M_1$, $M_2$: distribution moments
• $M_1$: total number of usages (words including repetitions)
• $M_2$: sum, over each frequency group from 1 to the maximum word frequency, of the number of vocabulary words in the group multiplied by the square of the frequency
![Page 71: Language and Information](https://reader035.vdocuments.net/reader035/viewer/2022062816/568155b7550346895dc38df3/html5/thumbnails/71.jpg)
Example
• Text consisting of 12 words, where two of the words occur once, two occur twice, and two occur three times.
• $M_0 = 6$
• $M_1 = 12$
• $M_2 = (2 \times 1^2) + (2 \times 2^2) + (2 \times 3^2) = 28$
• K increases as the diversity of the vocabulary decreases.
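A short sketch of Yule's K computed from a token list, using the formula K = 10,000 × (M2 − M1)/(M1 × M1). The toy text is made up to match the 12-word example (two words once, two twice, two three times); the function name is illustrative.

```python
from collections import Counter


def yules_k(words: list[str]) -> float:
    """Yule's K from the frequency spectrum of a token list."""
    word_freqs = Counter(words)              # word -> frequency
    spectrum = Counter(word_freqs.values())  # frequency -> number of words
    m1 = sum(freq * n_words for freq, n_words in spectrum.items())
    m2 = sum(freq ** 2 * n_words for freq, n_words in spectrum.items())
    return 10_000 * (m2 - m1) / (m1 * m1)


# Hypothetical 12-word text: a, b once; c, d twice; e, f three times.
text = "a b c c d d e e e f f f".split()
print(round(yules_k(text), 2))  # 1111.11
```

For this text M1 = 12 and M2 = 28, so K = 10,000 × 16 / 144 ≈ 1111.11.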
![Page 72: Language and Information](https://reader035.vdocuments.net/reader035/viewer/2022062816/568155b7550346895dc38df3/html5/thumbnails/72.jpg)
Example (cont’d)
• Three criteria used:
– total vocabulary size
– frequency distribution of the different words
– Yule’s K
– the mean frequency of the words in the sample
– the number of nouns unique to a particular sample
• Pearson’s coefficient used
![Page 73: Language and Information](https://reader035.vdocuments.net/reader035/viewer/2022062816/568155b7550346895dc38df3/html5/thumbnails/73.jpg)
Federalist papers
• Published in 1787-1788 to persuade the population of New York state to ratify the new American constitution
• Published under the pseudonym Publius; the three authors were James Madison, John Jay, and Alexander Hamilton.
• Before dying in a duel, Hamilton claimed some portion of the essays.
• It was agreed that Jay wrote 5 essays, Hamilton 43, and Madison 14. Three others were jointly written by Hamilton and Madison, and 12 were disputed.
![Page 74: Language and Information](https://reader035.vdocuments.net/reader035/viewer/2022062816/568155b7550346895dc38df3/html5/thumbnails/74.jpg)
Method
• Mosteller and Wallace (1963) used Bayesian statistics to determine which papers were written by whom.
• The authors had tried to imitate each other, so sentence length and other easily imitable features are not useful.
• Madison and Hamilton were found to vary in their use of “by” (H) and “to” (M), “enough” (H) and “whilst” (M).
![Page 75: Language and Information](https://reader035.vdocuments.net/reader035/viewer/2022062816/568155b7550346895dc38df3/html5/thumbnails/75.jpg)
Cluster Analysis
![Page 76: Language and Information](https://reader035.vdocuments.net/reader035/viewer/2022062816/568155b7550346895dc38df3/html5/thumbnails/76.jpg)
Clustering
• Idea: find similar objects and group them together
• Examples:
– all news stories on the same topic
– all documents from the same genre or language
• Types of clustering: classification (tracking) and categorization (detection)
![Page 77: Language and Information](https://reader035.vdocuments.net/reader035/viewer/2022062816/568155b7550346895dc38df3/html5/thumbnails/77.jpg)
Non-hierarchical clustering
• Concept of a centroid
• Document/centroid similarity
• Other parameters:
– number of clusters
– maximum and minimum size for each cluster
– vigilance parameter
– overlap between clusters
![Page 78: Language and Information](https://reader035.vdocuments.net/reader035/viewer/2022062816/568155b7550346895dc38df3/html5/thumbnails/78.jpg)
Hierarchical clustering
• Similarity matrix (expensive: the SIM matrix needs to be updated after every iteration)
• Average linkage method
• Dendrograms
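A minimal sketch of agglomerative clustering with average linkage, assuming a precomputed symmetric similarity matrix. For brevity it recomputes average similarities from the original matrix at each step instead of updating a cluster-level SIM matrix; the matrix values are made up. The returned merge list is the information a dendrogram would display.

```python
from itertools import combinations


def avg_sim(a: list[int], b: list[int], sim: list[list[float]]) -> float:
    """Average pairwise similarity between the members of two clusters."""
    return sum(sim[i][j] for i in a for j in b) / (len(a) * len(b))


def average_linkage(sim: list[list[float]], num_clusters: int = 1):
    """Repeatedly merge the two clusters with the highest average similarity."""
    clusters = [[i] for i in range(len(sim))]
    merges = []  # (cluster_a, cluster_b, similarity) in merge order
    while len(clusters) > num_clusters:
        a, b = max(combinations(clusters, 2),
                   key=lambda pair: avg_sim(pair[0], pair[1], sim))
        merges.append((a, b, avg_sim(a, b, sim)))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(sorted(a + b))
    return clusters, merges


# Hypothetical similarities among four documents.
sim = [[1.0, 0.9, 0.1, 0.2],
       [0.9, 1.0, 0.2, 0.1],
       [0.1, 0.2, 1.0, 0.8],
       [0.2, 0.1, 0.8, 1.0]]
clusters, merges = average_linkage(sim, num_clusters=2)
print(sorted(clusters))  # [[0, 1], [2, 3]]
```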
![Page 79: Language and Information](https://reader035.vdocuments.net/reader035/viewer/2022062816/568155b7550346895dc38df3/html5/thumbnails/79.jpg)
Introduction
• Abundance of newswire on the Web
• Multiple sources reporting on the same event
• Multiple modalities (speech, text)
• Summarization and filtering
![Page 80: Language and Information](https://reader035.vdocuments.net/reader035/viewer/2022062816/568155b7550346895dc38df3/html5/thumbnails/80.jpg)
Introduction
• TDT participation (topic detection and tracking)
– CIDR
• Multi-document summarization
– statistical, domain-dependent
– knowledge-based (SUMMONS)
![Page 81: Language and Information](https://reader035.vdocuments.net/reader035/viewer/2022062816/568155b7550346895dc38df3/html5/thumbnails/81.jpg)
Topics and events
• Topic = event (single act) or activity (ongoing action)
• Defined by content, time, and place of occurrence [Allan et al. 1998, Yang et al. 1998]
• Examples:
– Marine fighter pilot’s plane cuts cable in Italian Alps (February 3, 1998)
– Eduard Shevardnadze assassination attempt (February 9, 1998)
– Jonesboro shooting (March 24, 1998)
![Page 82: Language and Information](https://reader035.vdocuments.net/reader035/viewer/2022062816/568155b7550346895dc38df3/html5/thumbnails/82.jpg)
TDT overview
• Event detection: monitoring a continuous stream of news articles and identifying new salient events
• Event tracking: identifying stories that belong to predefined event topics
• [Story segmentation: identifying topic boundaries]
![Page 83: Language and Information](https://reader035.vdocuments.net/reader035/viewer/2022062816/568155b7550346895dc38df3/html5/thumbnails/83.jpg)
The TDT-2 corpus
• Corpus described in [Doddington et al. 1999, Cieri et al. 1999]
• One hundred topics, 54K stories, 6 sources
• Two newswire sources (AP, NYT); 2 TV stations (ABC, CNN-HN); 2 radio stations (PRI, VOA)
• 11 participants (4 industrial sites, 7 universities)
![Page 84: Language and Information](https://reader035.vdocuments.net/reader035/viewer/2022062816/568155b7550346895dc38df3/html5/thumbnails/84.jpg)
Detection conditions
• Default:
– Newswire + Audio - automatic transcription
– Deferral period of 10 source files
– Given boundaries for ASR
![Page 85: Language and Information](https://reader035.vdocuments.net/reader035/viewer/2022062816/568155b7550346895dc38df3/html5/thumbnails/85.jpg)
Description of the system
• Single-pass clustering algorithm
• Normalized, tf*idf-modified, cosine-based similarity between document and centroid
• Detection only, standard evaluation conditions, no deferral
![Page 86: Language and Information](https://reader035.vdocuments.net/reader035/viewer/2022062816/568155b7550346895dc38df3/html5/thumbnails/86.jpg)
Research problems
• Focus on speedup
• Search space of five experimental parameters
• Tradeoffs between parallelization and accuracy
![Page 87: Language and Information](https://reader035.vdocuments.net/reader035/viewer/2022062816/568155b7550346895dc38df3/html5/thumbnails/87.jpg)
Vector-based representation
[Figure: a Document vector and a Centroid vector plotted in a space with axes Term 1, Term 2, and Term 3]
![Page 88: Language and Information](https://reader035.vdocuments.net/reader035/viewer/2022062816/568155b7550346895dc38df3/html5/thumbnails/88.jpg)
Vector-based matching
• The cosine measure
$$\mathrm{sim}(D,C) = \frac{\sum_k \left(d_k \cdot c_k \cdot \mathrm{idf}(k)\right)}{\sqrt{\sum_k (d_k)^2 \cdot \sum_k (c_k)^2}}$$
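A minimal sketch of an idf-weighted cosine measure between a document and a centroid, both stored as sparse term-to-weight dictionaries. The square root in the denominator follows the standard cosine measure and is an assumption here; terms missing from the `idf` table default to a weight of 1.0, which is also an illustrative choice.

```python
import math


def idf_cosine(doc: dict[str, float], centroid: dict[str, float],
               idf: dict[str, float]) -> float:
    """Cosine-style similarity with each shared term weighted by idf(k)."""
    shared = doc.keys() & centroid.keys()
    numerator = sum(doc[k] * centroid[k] * idf.get(k, 1.0) for k in shared)
    denominator = math.sqrt(sum(w * w for w in doc.values())
                            * sum(w * w for w in centroid.values()))
    return numerator / denominator if denominator else 0.0


# Hypothetical document; the centroid weights echo the space-shuttle centroid.
doc = {"space": 2.0, "shuttle": 1.0}
centroid = {"space": 1.98, "shuttle": 1.17, "nasa": 0.51}
print(idf_cosine(doc, centroid, idf={}))
```

With every idf equal to 1 this reduces to the ordinary cosine; with larger idf weights the value can exceed 1, so thresholds should be tuned against the weighted measure.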
![Page 89: Language and Information](https://reader035.vdocuments.net/reader035/viewer/2022062816/568155b7550346895dc38df3/html5/thumbnails/89.jpg)
Description of the system
[Figure: the similarity sim between a document and a centroid is compared with the threshold T]
![Page 90: Language and Information](https://reader035.vdocuments.net/reader035/viewer/2022062816/568155b7550346895dc38df3/html5/thumbnails/90.jpg)
Description of the system
[Figure: the two cases, sim > T and sim < T]
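A minimal sketch of single-pass clustering against a similarity threshold T. It assumes the usual reading of the two cases: a document joins the most similar cluster when sim > T and otherwise starts a new cluster. Plain cosine stands in for the tf*idf-modified measure, centroids are kept as summed term vectors (cosine ignores vector length), and the documents are made up.

```python
import math
from collections import defaultdict


def cosine(a: dict, b: dict) -> float:
    """Plain cosine similarity between two sparse term vectors."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm = math.sqrt(sum(w * w for w in a.values())
                     * sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


def single_pass(docs: list[dict], threshold: float) -> list[list[int]]:
    """Assign each document, in arrival order, to a cluster in one pass."""
    clusters = []  # each cluster: {"members": [...], "sum": term vector}
    for index, doc in enumerate(docs):
        best = max(clusters, key=lambda c: cosine(doc, c["sum"]), default=None)
        if best is not None and cosine(doc, best["sum"]) > threshold:
            best["members"].append(index)          # sim > T: join the cluster
            for term, weight in doc.items():
                best["sum"][term] += weight
        else:                                      # otherwise: new cluster
            clusters.append({"members": [index],
                             "sum": defaultdict(float, doc)})
    return [c["members"] for c in clusters]


docs = [{"space": 2, "shuttle": 1},
        {"microsoft": 3, "windows": 1},
        {"space": 1, "nasa": 1},
        {"microsoft": 1, "justice": 1}]
print(single_pass(docs, threshold=0.1))  # [[0, 2], [1, 3]]
```

Because each document is seen once and clusters are never revisited, the result depends on document order.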
![Page 91: Language and Information](https://reader035.vdocuments.net/reader035/viewer/2022062816/568155b7550346895dc38df3/html5/thumbnails/91.jpg)
Centroid size

| Centroid | Terms (weight) |
|---|---|
| C 10007 (N=11) (10000) | crashes 1.00, safety 0.55, transportation 0.55, drivers 0.45, board 0.36, flight 0.27, buckle 0.27, pittsburgh 0.18, graduating 0.18, automobile 0.18 |
| C 00008 (N=113) (10000) | space 1.98, shuttle 1.17, station 0.75, nasa 0.51, columbia 0.37, mission 0.33, mir 0.30, astronauts 0.14, steering 0.11, safely 0.07 |
| C 10062 (N=161) | microsoft 3.24, justice 0.93, department 0.88, windows 0.98, corp 0.61, software 0.57, ellison 0.07, hatch 0.06, netscape 0.04, metcalfe 0.02 |
![Page 92: Language and Information](https://reader035.vdocuments.net/reader035/viewer/2022062816/568155b7550346895dc38df3/html5/thumbnails/92.jpg)
Centroid size

| Centroid | Terms (weight) |
|---|---|
| C 00022 (N=44) (10000) | diana 1.93, princess 1.52 |
| C 00025 (N=19) (10000) | albanians 3.00 |
| C 00026 (N=10) (10000) | universe 1.50, expansion 1.00, bang 0.90 |
| C 00035 (N=22) (10000) | airlines 1.45, finnair 0.45 |
| C 00031 (N=34) (10000) | el 1.85, nino 1.56 |
![Page 93: Language and Information](https://reader035.vdocuments.net/reader035/viewer/2022062816/568155b7550346895dc38df3/html5/thumbnails/93.jpg)
Centroid size

| Centroid | Terms (weight) |
|---|---|
| C 00022 (N=44) (10000) | diana 1.93, princess 1.52 |
| C 00025 (N=19) (10000) | albanians 3.00 |
| C 00026 (N=10) (10000) | universe 1.50, expansion 1.00, bang 0.90 |
| C 10007 (N=11) (10000) | crashes 1.00, safety 0.55, transportation 0.55, drivers 0.45, board 0.36, flight 0.27, buckle 0.27, pittsburgh 0.18, graduating 0.18, automobile 0.18 |
| C 00035 (N=22) (10000) | airlines 1.45, finnair 0.45 |
| C 00031 (N=34) (10000) | el 1.85, nino 1.56 |
| C 00008 (N=113) (10000) | space 1.98, shuttle 1.17, station 0.75, nasa 0.51, columbia 0.37, mission 0.33, mir 0.30, astronauts 0.14, steering 0.11, safely 0.07 |
| C 10062 (N=161) | microsoft 3.24, justice 0.93, department 0.88, windows 0.98, corp 0.61, software 0.57, ellison 0.07, hatch 0.06, netscape 0.04, metcalfe 0.02 |
![Page 94: Language and Information](https://reader035.vdocuments.net/reader035/viewer/2022062816/568155b7550346895dc38df3/html5/thumbnails/94.jpg)
Parameter space
• Similarity
– DECAY: Number of words at beginning of document that will be considered in computing vector similarities (50 - 1000)
– IDF: Minimum value for idf so that a word is considered (1 - 10)
– SIM: Similarity threshold (0.01 - 0.25)
• Centroids
– KEEPI: Keep all words whose tf*idf scores are above a certain threshold (1-10)
– KEEP: Keep at least that many words in centroid (1-50)
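One possible reading of the two centroid parameters, sketched in Python: keep every word whose score exceeds the KEEPI threshold, and if that leaves fewer than KEEP words, fall back to the KEEP highest-scoring words. How the two parameters actually interact in the system is not stated on the slide, so this is an assumption; the function name and the sample weights are illustrative.

```python
def prune_centroid(weights: dict[str, float], keepi: float,
                   keep: int) -> dict[str, float]:
    """Trim a centroid to its strongest words (assumed KEEPI/KEEP semantics)."""
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    kept = [(term, w) for term, w in ranked if w > keepi]
    if len(kept) < keep:
        # Too few words pass the threshold: keep the top `keep` words instead.
        kept = ranked[:keep]
    return dict(kept)


centroid = {"microsoft": 3.24, "windows": 0.98, "justice": 0.93,
            "netscape": 0.04}
print(prune_centroid(centroid, keepi=1.0, keep=2))
# {'microsoft': 3.24, 'windows': 0.98}
```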
![Page 95: Language and Information](https://reader035.vdocuments.net/reader035/viewer/2022062816/568155b7550346895dc38df3/html5/thumbnails/95.jpg)
Parameter selection (dev-test)
![Page 96: Language and Information](https://reader035.vdocuments.net/reader035/viewer/2022062816/568155b7550346895dc38df3/html5/thumbnails/96.jpg)
Cluster stability

| Term | 10000 docs | 22443 docs |
|---|---|---|
| suharto | 2.48 | 2.61 |
| jakarta | 0.58 | 0.58 |
| habibie | 0.47 | 0.53 |
| students | 0.45 | 0.43 |
| student | 0.22 | 0.21 |
| protesters | 0.20 | 0.19 |
| asean | 0.11 | 0.10 |
| campuses | 0.05 | 0.04 |
| geertz | 0.04 | 0.04 |
| medan | 0.04 | 0.04 |

| Term | 10000 docs | 22443 docs |
|---|---|---|
| microsoft | 3.31 | 3.24 |
| justice | 1.06 | 0.93 |
| department | 1.01 | 0.88 |
| windows | 0.90 | 0.98 |
| corp | 0.60 | 0.61 |
| software | 0.51 | 0.57 |
| ellison | 0.09 | 0.07 |
| hatch | 0.06 | 0.06 |
| netscape | 0.05 | 0.04 |
| metcalfe | 0.03 | 0.03 |
![Page 97: Language and Information](https://reader035.vdocuments.net/reader035/viewer/2022062816/568155b7550346895dc38df3/html5/thumbnails/97.jpg)
Parallelization
![Page 98: Language and Information](https://reader035.vdocuments.net/reader035/viewer/2022062816/568155b7550346895dc38df3/html5/thumbnails/98.jpg)
Parallelization
C(P)
![Page 99: Language and Information](https://reader035.vdocuments.net/reader035/viewer/2022062816/568155b7550346895dc38df3/html5/thumbnails/99.jpg)
Parallelization
![Page 100: Language and Information](https://reader035.vdocuments.net/reader035/viewer/2022062816/568155b7550346895dc38df3/html5/thumbnails/100.jpg)
Parallelization
![Page 101: Language and Information](https://reader035.vdocuments.net/reader035/viewer/2022062816/568155b7550346895dc38df3/html5/thumbnails/101.jpg)
Evaluation principles
$$C_{Det}(R,H) = C_{Miss} \cdot P_{Miss}(R,H) \cdot P_{topic} + C_{FalseAlarm} \cdot P_{FalseAlarm}(R,H) \cdot (1 - P_{topic})$$
$C_{Miss} = 1$
$C_{FalseAlarm} = 1$
$P_{Miss}(R,H) = N_{Miss}(R,H) / |R|$
$P_{FalseAlarm}(R,H) = N_{FalseAlarm}(R,H) / |S - R|$
$P_{topic} = 0.02$ (a priori probability)
R: set of stories in a reference target topic
H: set of stories in a system-defined topic
S: set of stories to be scored in eval corpus
Task: to determine $H(R) = \arg\min_H \{C_{Det}(R,H)\}$
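A minimal sketch of the detection cost using Python sets. It assumes that NMiss counts reference stories missing from H and NFalseAlarm counts stories in H that are not in R; the slide does not define these counts, and the story IDs are made up.

```python
def detection_cost(R: set, H: set, S: set, p_topic: float = 0.02,
                   c_miss: float = 1.0, c_false_alarm: float = 1.0) -> float:
    """C_Det for reference topic R, system topic H, and scored stories S."""
    p_miss = len(R - H) / len(R)             # missed reference stories
    p_false_alarm = len(H - R) / len(S - R)  # off-topic stories placed in H
    return (c_miss * p_miss * p_topic
            + c_false_alarm * p_false_alarm * (1 - p_topic))


S = set(range(1000))      # hypothetical evaluation corpus
R = set(range(20))        # hypothetical reference topic
H1 = set(range(5, 25))    # misses 5 stories, adds 5 false alarms
H2 = set(range(20))       # exact match
print(round(detection_cost(R, H1, S), 4))  # 0.01

# The task picks the system cluster with the lowest cost for each topic.
best = min([H1, H2], key=lambda H: detection_cost(R, H, S))
print(best == H2)  # True
```

As a check against the results slide, P(miss) = 0.3164 and P(fa) = 0.0014 give 0.3164 × 0.02 + 0.0014 × 0.98 ≈ 0.0077.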
![Page 102: Language and Information](https://reader035.vdocuments.net/reader035/viewer/2022062816/568155b7550346895dc38df3/html5/thumbnails/102.jpg)
Official results
![Page 103: Language and Information](https://reader035.vdocuments.net/reader035/viewer/2022062816/568155b7550346895dc38df3/html5/thumbnails/103.jpg)
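Results

| # | Parallel | Sim | Decay | Idf | Keep | Story-weighted P(miss) | Story-weighted P(fa) | Story-weighted Cdet | Topic-weighted P(miss) | Topic-weighted P(fa) | Topic-weighted Cdet |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | yes | .1 | 100 | 3 | 10 | 0.3861 | 0.0018 | 0.0095 | 0.3309 | 0.0018 | 0.0084 |
| 2 | no | .1 | 100 | 3 | 10 | 0.3164 | 0.0014 | 0.0077 | 0.3139 | 0.0014 | 0.0077 |
| 3 | no | .1 | 100 | 2 | 10 | 0.3178 | 0.0014 | 0.0077 | 0.2905 | 0.0014 | 0.0072 |
| 4 | no | .1 | 50 | 3 | 10 | 0.5045 | 0.0014 | 0.0114 | 0.3201 | 0.0014 | 0.0077 |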
![Page 104: Language and Information](https://reader035.vdocuments.net/reader035/viewer/2022062816/568155b7550346895dc38df3/html5/thumbnails/104.jpg)
Novelty detection
<DOCID> reute960109.0101 </DOCID>
<HEADER> reute 01-09 0057 </HEADER>
... German court convicts Vogel of extortion
BERLIN, Jan 9 (Reuter) - A German court on Tuesday convicted Wolfgang Vogel, the East Berlin lawyer famous for organising Cold War spy swaps, on charges that he extorted money from would-be East German emigrants. The Berlin court gave him a two-year suspended jail sentence and a fine -- less than the 3 3/8 years prosecutors had sought.
<DOCID> reute960109.0201 </DOCID>
<HEADER> reute 01-09 0582 </HEADER>
... East German spy-swap lawyer convicted of extortion
BERLIN (Reuter) - The East Berlin lawyer who became famous for engineering Cold War spy swaps, Wolfgang Vogel, was convicted by a German court Tuesday of extorting money from East German emigrants eager to flee to the West. Vogel, a close confidant of former East German leader Erich Honecker and one of the Soviet bloc's rare millionaires, was found guilty of perjury, four counts of blackmail and five counts of falsifying documents. The Berlin court gave him the two-year suspended sentence and a $63,500 fine. Prosecutors had pressed for a jail sentence of 3 3/8 years and a $215,000 penalty...