
Page 1: Language and Information


Language and Information

Handout #4

November 9, 2000

Page 2: Language and Information


Course Information

• Instructor: Dragomir R. Radev ([email protected])
• Office: 305A, West Hall
• Phone: (734) 615-5225
• Office hours: TTh 3-4
• Course page: http://www.si.umich.edu/~radev/760
• Class meets on Thursdays, 5-8 PM in 311 West Hall

Page 3: Language and Information


Readings

• Textbook:
– Oakes Ch. 3: 95-96, 110-120
– Oakes Ch. 4: 149-150, 158-166, 182-189
– Oakes Ch. 5: 199-212, 221-223, 236-247

• Additional readings:
– Knight, "Statistical Machine Translation Workbook" (http://www.clsp.jhu.edu/ws99/)
– McKeown & Radev, "Collocations"
– Optional: M&S chapters 4, 5, 6, 13, 14

Page 4: Language and Information


Statistical Machine Translation and Language Modeling

Page 5: Language and Information


The Noisy Channel Model

• Source-channel model of communication
• Parametric probabilistic models of language and translation
• Training such models

Page 6: Language and Information


Statistics

• Given f, guess e:

  e → encoder → f → decoder → e'
  (E)           (F)           (E)

e' = argmax_e P(e|f) = argmax_e P(f|e) P(e)

where P(f|e) is the translation model and P(e) is the language model.
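To make the decoding rule concrete, here is a minimal noisy-channel decoding sketch in Python; the candidate set and the two log-probability functions are hypothetical stand-ins, not part of the original slides:

def noisy_channel_decode(f, candidates, tm_logprob, lm_logprob):
    # e' = argmax_e P(e|f) = argmax_e P(f|e) P(e); maximize in log space
    # to avoid underflow: log P(f|e) + log P(e).
    return max(candidates, key=lambda e: tm_logprob(f, e) + lm_logprob(e))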

Page 7: Language and Information


Parametric probabilistic models

• Language model (LM):

  P(e) = P(e1, e2, …, eL) = P(e1) P(e2|e1) … P(eL|e1 … eL-1)

  Trigram approximation: P(ek|e1 … ek-1) ≈ P(ek|ek-2, ek-1)

• Deleted interpolation

• Translation model (TM):

  Alignment: P(f,a|e)
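As a sketch of the interpolation idea (real deleted interpolation estimates the lambda weights on held-out data; the weights here are fixed by hand for illustration):

def interp_trigram_prob(w1, w2, w3, uni, bi, tri, lambdas=(0.1, 0.3, 0.6)):
    # P(w3|w1,w2) ~ l1*P(w3) + l2*P(w3|w2) + l3*P(w3|w1,w2).
    # uni, bi, tri are dicts of relative-frequency estimates.
    l1, l2, l3 = lambdas
    return (l1 * uni.get(w3, 0.0)
            + l2 * bi.get((w2, w3), 0.0)
            + l3 * tri.get((w1, w2, w3), 0.0))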

Page 8: Language and Information


IBM’s EM trained models

1. Word translation
2. Local alignment
3. Fertilities
4. Class-based alignment
5. Non-deficient algorithm (avoiding overlaps, overflow)
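A minimal sketch of EM training for the first of these (word-translation probabilities, as in IBM Model 1); variable names and the toy corpus are illustrative, not from the original slides:

from collections import defaultdict

def train_model1(corpus, iterations=10):
    # EM for word-translation probabilities t(f|e).
    # corpus: list of (f_tokens, e_tokens) sentence pairs.
    f_vocab = {f for fs, _ in corpus for f in fs}
    t = defaultdict(lambda: 1.0 / len(f_vocab))   # uniform initialization
    for _ in range(iterations):
        count = defaultdict(float)                # expected counts c(f,e)
        total = defaultdict(float)                # marginal counts per e
        for fs, es in corpus:
            for f in fs:                          # E-step
                z = sum(t[(f, e)] for e in es)
                for e in es:
                    delta = t[(f, e)] / z
                    count[(f, e)] += delta
                    total[e] += delta
        for f, e in count:                        # M-step
            t[(f, e)] = count[(f, e)] / total[e]
    return t

# Toy usage: t comes to prefer maison<->house, bleue<->blue.
pairs = [("la maison".split(), "the house".split()),
         ("la maison bleue".split(), "the blue house".split())]
t = train_model1(pairs)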

Page 9: Language and Information


Lexical Semanticsand WordNet

Page 10: Language and Information


Meanings of words

• Lexemes, lexicon, sense(s)
• Examples:
– Red, n: the color of blood or a ruby
– Blood, n: the red liquid that circulates in the heart, arteries and veins of animals
– Right, adj: located nearer the right hand, esp. being on the right when facing the same direction as the observer
• Do dictionaries give us definitions?

Page 11: Language and Information


Relations among words

• Homonymy:
– Instead, a bank can hold the investments in a custodial account in the client's name.
– But as agriculture burgeons on the east bank, the river will shrink even more.
• Other examples: be/bee?, wood/would?
• Homophones
• Homographs
• Applications: spelling correction, speech recognition, text-to-speech
• Example: Un ver vert va vers un verre vert. ("A green worm goes toward a green glass.")

Page 12: Language and Information


Polysemy

• They rarely serve red meat, preferring to prepare seafood, poultry, or game birds.
• He served as U.S. ambassador to Norway in 1976 and 1977.
• He might have served his time, come out and led an upstanding life.
• Homonymy: distinct and unrelated meanings, possibly with different etymologies (multiple lexemes).
• Polysemy: a single lexeme with two or more related meanings.
• Example: an "idea bank"

Page 13: Language and Information


Synonymy

• Principle of substitutability
• How big is this plane?
• Would I be flying on a large or small plane?
• Miss Nelson, for instance, became a kind of big sister to Mrs. Van Tassel's son, Benjamin.
• ?? Miss Nelson, for instance, became a kind of large sister to Mrs. Van Tassel's son, Benjamin.
• What is the cheapest first class fare?
• ?? What is the cheapest first class cost?

Page 14: Language and Information


Semantic Networks

• Used to represent relationships between words

• Example: WordNet - created by George Miller’s team at Princeton (http://www.cogsci.princeton.edu/~wn)

• Based on synsets (synonyms, interchangeable words) and lexical matrices

Page 15: Language and Information


Lexical matrix

Rows are word meanings (M1 … Mm), columns are word forms (F1 … Fn); an entry Ei,j means form Fj can express meaning Mi.

Word Meanings   F1     F2     F3    …   Fn
M1              E1,1   E1,2
M2                     E2,2
…
Mm                                      Em,n

Page 16: Language and Information


Synsets

• Disambiguation:
– {board, plank}
– {board, committee}
• Synonyms:
– substitution
– weak substitution
– synonyms must be of the same part of speech

Page 17: Language and Information


$ ./wn board -hypen

Synonyms/Hypernyms (Ordered by Frequency) of noun board
9 senses of board

Sense 1
board
   => committee, commission
      => administrative unit
         => unit, social unit
            => organization, organisation
               => social group
                  => group, grouping

Sense 2
board
   => sheet, flat solid
      => artifact, artefact
         => object, physical object
            => entity, something

Sense 3
board, plank
   => lumber, timber
      => building material
         => artifact, artefact
            => object, physical object
               => entity, something

Page 18: Language and Information


Sense 4
display panel, display board, board
   => display
      => electronic device
         => device
            => instrumentality, instrumentation
               => artifact, artefact
                  => object, physical object
                     => entity, something

Sense 5
board, gameboard
   => surface
      => artifact, artefact
         => object, physical object
            => entity, something

Sense 6
board, table
   => fare
      => food, nutrient
         => substance, matter
            => object, physical object
               => entity, something

Page 19: Language and Information


Sense 7
control panel, instrument panel, control board, board, panel
   => electrical device
      => device
         => instrumentality, instrumentation
            => artifact, artefact
               => object, physical object
                  => entity, something

Sense 8
circuit board, circuit card, board, card
   => printed circuit
      => computer circuit
         => circuit, electrical circuit, electric circuit
            => electrical device
               => device
                  => instrumentality, instrumentation
                     => artifact, artefact
                        => object, physical object
                           => entity, something

Sense 9
dining table, board
   => table
      => furniture, piece of furniture, article of furniture
         => furnishings
            => instrumentality, instrumentation
               => artifact, artefact
                  => object, physical object
                     => entity, something
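Today the same sense inventory can be queried programmatically; a short sketch using NLTK's WordNet interface (assumes nltk is installed with its wordnet data downloaded; synset inventories differ slightly across WordNet versions):

from nltk.corpus import wordnet as wn

for i, synset in enumerate(wn.synsets("board", pos=wn.NOUN), start=1):
    lemmas = ", ".join(l.name() for l in synset.lemmas())
    print(f"Sense {i}: {lemmas} -- {synset.definition()}")
    # One hypernym chain, from the synset up to the root (entity):
    path = synset.hypernym_paths()[0][::-1]   # synset first, root last
    for depth, s in enumerate(path[1:], start=1):
        print("  " * depth + "=> " + s.name())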

Page 20: Language and Information


Antonymy

• “x” vs. “not-x”
• “rich” vs. “poor”?
• {rise, ascend} vs. {fall, descend}

Page 21: Language and Information


Other relations

• Meronymy: X is a meronym of Y when native speakers of English accept sentences similar to “X is a part of Y”, “X is a member of Y”.

• Hyponymy: {tree} is a hyponym of {plant}.
• Hierarchical structure based on hyponymy (and hypernymy).

Page 22: Language and Information


Other features of WordNet

• Index of familiarity
• Polysemy

Page 23: Language and Information


Familiarity and polysemy

board used as a noun is familiar (polysemy count = 9)
bird used as a noun is common (polysemy count = 5)
cat used as a noun is common (polysemy count = 7)
house used as a noun is familiar (polysemy count = 11)
information used as a noun is common (polysemy count = 5)
retrieval used as a noun is uncommon (polysemy count = 3)
serendipity used as a noun is very rare (polysemy count = 1)

Page 24: Language and Information


Compound nouns

advisory board
appeals board
backboard
backgammon board
baseboard
basketball backboard
big board
billboard
binder's board
binder board
blackboard
board game
board measure
board meeting
board member
board of appeals
board of directors
board of education
board of regents
board of trustees

Page 25: Language and Information


Overview of senses

1. board -- (a committee having supervisory powers; "the board has seven members")
2. board -- (a flat piece of material designed for a special purpose; "he nailed boards across the windows")
3. board, plank -- (a stout length of sawn timber; made in a wide variety of sizes and used for many purposes)
4. display panel, display board, board -- (a board on which information can be displayed to public view)
5. board, gameboard -- (a flat portable surface (usually rectangular) designed for board games; "he got out the board and set up the pieces")
6. board, table -- (food or meals in general; "she sets a fine table"; "room and board")
7. control panel, instrument panel, control board, board, panel -- (an insulated panel containing switches and dials and meters for controlling electrical devices; "he checked the instrument panel"; "suddenly the board lit up like a Christmas tree")
8. circuit board, circuit card, board, card -- (a printed circuit that can be inserted into expansion slots in a computer to increase the computer's capabilities)
9. dining table, board -- (a table at which meals are served; "he helped her clear the dining table"; "a feast was spread upon the board")

Page 26: Language and Information


Top-level concepts

{act, action, activity}
{animal, fauna}
{artifact}
{attribute, property}
{body, corpus}
{cognition, knowledge}
{communication}
{event, happening}
{feeling, emotion}
{food}
{group, collection}
{location, place}
{motive}
{natural object}
{natural phenomenon}
{person, human being}
{plant, flora}
{possession}
{process}
{quantity, amount}
{relation}
{shape}
{state, condition}
{substance}
{time}

Page 27: Language and Information


Information Extraction

Page 28: Language and Information


Types of Information Extraction

• Template filling
• Language reuse
• Biographical information
• Question answering

Page 29: Language and Information


MUC-4 Example

INCIDENT: DATE                30 OCT 89
INCIDENT: LOCATION            EL SALVADOR
INCIDENT: TYPE                ATTACK
INCIDENT: STAGE OF EXECUTION  ACCOMPLISHED
INCIDENT: INSTRUMENT ID
INCIDENT: INSTRUMENT TYPE
PERP: INCIDENT CATEGORY       TERRORIST ACT
PERP: INDIVIDUAL ID           "TERRORIST"
PERP: ORGANIZATION ID         "THE FMLN"
PERP: ORG. CONFIDENCE         REPORTED: "THE FMLN"
PHYS TGT: ID
PHYS TGT: TYPE
PHYS TGT: NUMBER
PHYS TGT: FOREIGN NATION
PHYS TGT: EFFECT OF INCIDENT
PHYS TGT: TOTAL NUMBER
HUM TGT: NAME
HUM TGT: DESCRIPTION          "1 CIVILIAN"
HUM TGT: TYPE                 CIVILIAN: "1 CIVILIAN"
HUM TGT: NUMBER               1: "1 CIVILIAN"
HUM TGT: FOREIGN NATION
HUM TGT: EFFECT OF INCIDENT   DEATH: "1 CIVILIAN"
HUM TGT: TOTAL NUMBER

Source text: On October 30, 1989, one civilian was killed in a reported FMLN attack in El Salvador.

Page 30: Language and Information


Language reuse

Phrase to be reused (an NP):

  Yugoslav President Slobodan Milosevic
  [description]      [entity]

Page 31: Language and Information


Example (NP → NP Punc NP):

  Andrija Hebrang ,  The Croatian Defense Minister
  [entity]           [description]

Page 32: Language and Information


Issues involved

• Text generation depends on lexical resources
• Lexical choice
• Corpus processing vs. manual compilation
• Deliberate decisions by writers
• Difficult to encode by hand
• Dynamically updated (Scott O'Grady)
• No full semantic representation

Page 33: Language and Information


Named entities

Richard Butler met Tareq Aziz Monday after rejecting Iraqi attempts to set deadlines for finishing his work.

Yitzhak Mordechai will meet Mahmoud Abbas at 7 p.m. (1600 GMT) in Tel Aviv after a 16-month-long impasse in peacemaking.

Sinn Fein deferred a vote on Northern Ireland's peace deal Sunday.

Hundreds of troops patrolled Dili on Friday during the anniversary of Indonesia's 1976 annexation of the territory.

Page 34: Language and Information


Entities + descriptions

Chief U.N. arms inspector Richard Butler met Iraq’s Deputy Prime Minister Tareq Aziz Monday after rejecting Iraqi attempts to set deadlines for finishing his work.

Israel's Defense Minister Yitzhak Mordechai will meet senior Palestinian negotiator Mahmoud Abbas at 7 p.m. (1600 GMT) in Tel Aviv after a 16-month-long impasse in peacemaking.

Sinn Fein, the political wing of the Irish Republican Army, deferred a vote on Northern Ireland's peace deal Sunday.

Hundreds of troops patrolled Dili, the Timorese capital, on Friday during the anniversary of Indonesia's 1976 annexation of the territory.

Page 35: Language and Information


Building a database of descriptions

• Size of database: 59,333 entities and 193,228 descriptions as of 08/01/98

• Text processed: 494 MB (ClariNet, Reuters, UPI)

• Length: 1-15 lexical items
• Accuracy: precision 94%, recall 55%

Page 36: Language and Information


Multiple descriptions per entity

Profile for Ung Huot:

A senior member
Cambodia's
Cambodian foreign minister
Co-premier
First prime minister
Foreign minister
His excellency
Mr.
New co-premier
New first prime minister
Newly-appointed first prime minister
Premier

Page 37: Language and Information


Language reuse and regeneration

CONCEPTS + CONSTRAINTS = CONSTRUCTS

Corpus analysis: determining constraints

Text generation: applying constraints

Page 38: Language and Information


Language reuse and regeneration

• Understanding: full parsing is expensive
• Generation: expensive to use full parses
• Bypassing certain stages (e.g., syntax)
• Not(!) template-based: still requires extraction, analysis, context identification, modification, and generation
• Factual sentences, sentence fragments
• Reusability of a phrase

Page 39: Language and Information


Context-dependent solution

Redefining the relation:

  DescriptionOf(E, C) = {Di,C : Di,C is a description of E in context C}

If named entity E appears in text and the context is C:
insert DescriptionOf(E, C) in text.
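A minimal sketch of the relation as a context-indexed lookup table (the entity and contexts shown are taken from the following slides; the data structure itself is an illustrative assumption):

# DescriptionOf(E, C): context-indexed descriptions per entity.
DESCRIPTIONS = {
    ("Bill Clinton", "foreign relations"): ["U.S. President"],
    ("Bill Clinton", "elections"): ["Democratic presidential candidate"],
}

def description_of(entity, context):
    # Return the stored descriptions of `entity` appropriate for `context`.
    return DESCRIPTIONS.get((entity, context), [])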

Page 40: Language and Information


Multiple descriptions per entity

Profile for Bill Clinton:

U.S. President
President
An Arkansas native
Democratic presidential candidate

Page 41: Language and Information


Choosing the right description

Bill Clinton                          CONTEXT

U.S. President ...................... foreign relations
President ........................... national affairs
An Arkansas native .................. false bomb alert in AR
Democratic presidential candidate ... elections

Pragmatic and semantic constraints on lexical choice.

Page 42: Language and Information


Semantic information from WordNet

• All words contribute to the semantic representation

• Only the first sense is used

• What is a synset?

Page 43: Language and Information


WordNet synset hierarchy

{07063762} director, manager, managing director
  => {07063507} administrator, decision maker
    => {07311393} head, chief, top dog
      => {06950891} leader
        => {00004123} person, individual, someone, somebody, human
          => {00002086} life form, organism, being, living thing
            => {00001740} entity, something

Page 44: Language and Information


Lexico-semantic matrix (profile for Ung Huot)

Columns: word synsets, e.g. {07147929} premier, {07009772} Kampuchean; parent synsets, e.g. {07412658} minister, {07087841} associate.

Description                            Word synsets   …   Parent synsets
A senior member                                       …   X
Cambodia's                             X              …
Cambodian foreign minister             X              …   X
Co-premier                             X              …   X
First prime minister                   X              …   X
Foreign minister                                      …   X
His excellency                                        …
Mr.                                                   …
New co-premier                         X              …   X
New first prime minister               X              …   X
Newly-appointed first prime minister   X              …   X
Premier                                X              …   X
Prime minister                         X              …   X

Page 45: Language and Information


Choosing the right description

• Topic approximation by context: words that appear near the entity in the text (bag)
• Name of the entity (set)
• Length of article (continuous)
• Profile: set of all descriptions for that entity (bag); parent synset offsets for all words wi
• Semantic information: WordNet synset offsets (bag)

Page 46: Language and Information


Choosing the right description

Ripper feature vector [Cohen 1996]:

  (Context, Entity, Description, Length, Profile, Parent) → Classes

Page 47: Language and Information


Example (training)

T# 1
  Context: Election, promised, said, carry, party …
  Entity: Kim Dae-Jung
  Description: Veteran opposition leader
  Len: 949
  Profile: Candidate, chief, policy maker, Korean …
  Parent: person, leader, Asian, important person …
  Classes: {07136302} {07486519} {07311393} {06950891} {07486079}

T# 2
  Context: Introduced, responsible, running, should, bringing …
  Entity: Kim Dae-Jung
  Description: South Korea's opposition candidate
  Len: 629
  Profile: Candidate, chief, policy maker, Korean …
  Parent: person, leader, Asian, important person …
  Classes: {07136302} {07486519} {07311393} {06950891} {07486079}

T# 3
  Context: Attend, during, party, time, traditionally …
  Entity: Kim Dae-Jung
  Description: A front-runner
  Len: 535
  Profile: Candidate, chief, policy maker, Korean …
  Parent: person, leader, Asian, important person …
  Classes: {07136302} {07486519} {07311393} {06950891} {07486079}

T# 4
  Context: Discuss, making, party, statement, said …
  Entity: Kim Dae-Jung
  Description: A front-runner
  Len: 1114
  Profile: Candidate, chief, policy maker, Korean …
  Parent: person, leader, Asian, important person …
  Classes: {07136302} {07486519} {07311393} {06950891} {07486079}

T# 5
  Context: New, party, politics, in, it …
  Entity: Kim Dae-Jung
  Description: South Korea's president-elect
  Len: 449
  Profile: Candidate, chief, policy maker, Korean …
  Parent: person, leader, Asian, important person …
  Classes: {07136302} {07486519} {07311393} {06950891} {07486079}

Page 48: Language and Information


Sample rules

Total number of rules: 4,085 for 100,000 inputs

{07136302} IF PROFILE ~ P{07136302} LENGTH <= 603 LENGTH >= 361 .
{07136302} IF PROFILE ~ P{07136302} CONTEXT ~ presidential LENGTH <= 412 .
{07136302} IF PROFILE ~ P{07136302} CONTEXT ~ nominee CONTEXT ~ during .
{07136302} IF PROFILE ~ P{07136302} CONTEXT ~ case .
{07136302} IF PROFILE ~ P{07136302} LENGTH <= 603 LENGTH >= 390 LENGTH <= 412 .
{07136302} IF PROFILE ~ P{07136302} CONTEXT ~ nominee CONTEXT ~ and .

Page 49: Language and Information


Evaluation

• 35,206 tuples; 11,504 distinct entities; 3.06 distinct descriptions per entity (DDPE)
• Training: 90% of corpus (10,353 entities)
• Test: 10% of corpus (1,151 entities)

Page 50: Language and Information


Evaluation

• Rule format (each matching rule adds constraints):

  X → [A]          (evidence of A)
  Y → [B]          (evidence of B)
  X Y → [A] [B]    (evidence of A and B)

• Classes are in 2^W (the powerset of WordNet nodes)
• Precision and recall are computed on the constraints selected by the system

Page 51: Language and Information


Definition of precision and recall

Model          System         P        R
[A] [B] [C]    [A] [B] [D]    66.7 %   66.7 %
[A] [B] [C]    [B] [D]        50.0 %   33.3 %
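A small sketch that reproduces the table's numbers, treating the model and system outputs as sets of constraints:

def precision_recall(model, system):
    # Set-based precision and recall over selected constraints.
    model, system = set(model), set(system)
    correct = len(model & system)
    precision = correct / len(system) if system else 0.0
    recall = correct / len(model) if model else 0.0
    return precision, recall

print(precision_recall({"A", "B", "C"}, {"A", "B", "D"}))  # ~ (0.667, 0.667)
print(precision_recall({"A", "B", "C"}, {"B", "D"}))       # ~ (0.5, 0.333)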

Page 52: Language and Information


Precision and recall

              Word nodes only        Word and parent nodes
Training set  Precision  Recall      Precision  Recall
500           64.29%     2.86%       78.57%     2.86%
1000          71.43%     2.86%       85.71%     2.86%
2000          42.86%     40.71%      67.86%     62.14%
5000          59.33%     48.40%      64.67%     53.73%
10000         69.72%     45.04%      74.44%     59.32%
15000         76.24%     44.02%      73.39%     53.17%
20000         76.25%     49.91%      79.08%     58.70%
25000         83.37%     52.26%      82.39%     57.49%
30000         80.14%     50.55%      82.77%     57.66%
50000         83.13%     58.53%      88.87%     63.39%
100000        85.42%     62.81%      89.70%     64.64%
150000        87.07%     63.17%
200000        85.73%     62.86%
250000        87.15%     63.85%

Page 53: Language and Information


Question Answering

Page 54: Language and Information


Question answering

Q: When did Nelson Mandela become president of South Africa?
A: 10 May 1994

Q: How tall is the Matterhorn?
A: The institute revised the Matterhorn 's height to 14,776 feet 9 inches

Q: How tall is the replica of the Matterhorn at Disneyland?
A: In fact he has climbed the 147-foot Matterhorn at Disneyland every week end for the last 3 1/2 years

Q: If Iraq attacks a neighboring country, what should the US do?
A: ??

Page 55: Language and Information


Q: Why did David Koresh ask the FBI for a word processor?
Q: Name the designer of the shoe that spawned millions of plastic imitations, known as "jellies".
Q: What is the brightest star visible from Earth?
Q: What are the Valdez Principles?
Q: Name a film that has won the Golden Bear in the Berlin Film Festival?
Q: Name a country that is developing a magnetic levitation railway system?
Q: Name the first private citizen to fly in space.
Q: What did Shostakovich write for Rostropovich?
Q: What is the term for the sum of all genetic material in a given organism?
Q: What is considered the costliest disaster the insurance industry has ever faced?
Q: What is Head Start?
Q: What was Agent Orange used for during the Vietnam War?
Q: What did John Hinckley do to impress Jodie Foster?
Q: What was the first Gilbert and Sullivan opera?
Q: What did Richard Feynman say upon hearing he would receive the Nobel Prize in Physics?
Q: How did Socrates die?
Q: Why are electric cars less efficient in the north-east than in California?

Page 56: Language and Information


The TREC evaluation

• Document retrieval
• Eight years
• Information retrieval?
• Corpus: texts and questions

Page 57: Language and Information


GuruQA system architecture:

  documents → Textract/Resporator → Indexer → Index
  query → Query Processing → Search (over Index) → Hit List
  Hit List → AnSel/Werlect (answer selection) → Ranked Hit List

[Prager et al. 2000 (SIGIR); Radev et al. 2000 (ANLP/NAACL)]

Page 58: Language and Information


QA-Token    Question type        Example
PLACE$      Where                In the Rocky Mountains
COUNTRY$    Where/What country   United Kingdom
STATE$      Where/What state     Massachusetts
PERSON$     Who                  Albert Einstein
ROLE$       Who                  Doctor
NAME$       Who/What/Which       The Shakespeare Festival
ORG$        Who/What             The US Post Office
DURATION$   How long             For 5 centuries
AGE$        How old              30 years old
YEAR$       When/What year       1999
TIME$       When                 In the afternoon
DATE$       When/What date       July 4th, 1776
VOLUME$     How big              3 gallons
AREA$       How big              4 square inches
LENGTH$     How big/long/high    3 miles
WEIGHT$     How big/heavy        25 tons
NUMBER$     How many             1,234.5
METHOD$     How                  By rubbing
RATE$       How much             50 per cent
MONEY$      How much             4 million dollars

Page 59: Language and Information


<p><NUMBER>1</NUMBER></p>
<p><QUERY>Who is the author of the book, "The Iron Lady: A Biography of Margaret Thatcher"?</QUERY></p>
<p><PROCESSED_QUERY>@excwin(*dynamic*
  @weight(200 Iron_Lady) @weight(200 Biography_of_Margaret_Thatcher)
  @weight(200 Margaret) @weight(100 author) @weight(100 book)
  @weight(100 iron) @weight(100 lady) @weight(100 :)
  @weight(100 biography) @weight(100 thatcher)
  @weight(400 @syn(PERSON$ NAME$)) )</PROCESSED_QUERY></p>
<p><DOC>LA090290-0118</DOC></p>
<p><SCORE>1020.8114</SCORE></p>
<TEXT><p>THE IRON LADY; A <span class="NAME">Biography of Margaret Thatcher</span> by <span class="PERSON">Hugo Young</span> (<span class="ORG">Farrar , Straus & Giroux</span>) The central riddle revealed here is why, as a woman <span class="PLACEDEF">in a man</span>'s world, <span class="PERSON">Margaret Thatcher</span> evinces such an exclusionary attitude toward women.</p></TEXT>

Page 60: Language and Information


SYN-set                            N    Score   Score/N
PERSON NAME                        30   16.5    55.0%
PLACE COUNTRY STATE NAME PLACEDEF  21   7.08    33.7%
NAME                               18   3.67    20.4%
DATE YEAR                          18   5.31    29.5%
PERSON ORG NAME ROLE               19   4.62    24.3%
undefined                          19   11.45   60.3%
NUMBER                             18   8.00    44.4%
PLACE NAME PLACEDEF                14   10.00   71.4%
PERSON ORG PLACE NAME PLACEDEF     10   3.03    30.3%
MONEY RATE                         6    1.50    25.0%
ORG NAME                           4    1.25    31.2%
SIZE1                              4    2.50    62.5%
SIZE1 DURATION                     3    0.83    27.7%
STATE                              3    2.00    66.7%
COUNTRY                            3    1.33    44.3%
YEAR                               2    1.00    50.0%
RATE                               2    1.50    75.0%
TIME DURATION                      1    0.00    0.0%
SIZE1 SIZE2                        1    0.00    0.0%
DURATION TIME                      1    0.33    33.3%
DATE                               1    0.00    0.0%

Page 61: Language and Information


Span                                 Span type  Number  Rspanno  Count  Notinq  Type  Avgdst  Sscore   TOTAL
Ollie Matson                         PERSON     3       3        6      2       1     12      0.02507  -7.53
Lou Vasquez                          PERSON     1       1        6      2       1     16      0.02507  -9.93
Tim O'Donohue                        PERSON     17      1        4      2       1     8       0.02257  -12.57
Athletic Director Dave Cowen         PERSON     23      6        4      4       1     11      0.02257  -15.87
Johnny Ceballos                      PERSON     22      5        4      1       1     9       0.02257  -19.07
Civic Center Director Martin Durham  PERSON     13      1        2      5       1     16      0.02505  -19.36
Johnny Hodges                        PERSON     25      2        4      1       1     15      0.02256  -25.22
Derric Evans                         PERSON     33      4        4      2       1     14      0.02256  -25.37
NEWSWIRE Johnny Majors               PERSON     30      1        4      2       1     17      0.02256  -25.47
Woodbridge High School               ORG        18      2        4      1       2     6       0.02257  -28.37
Evan                                 PERSON     37      6        4      1       1     14      0.02256  -29.57
Gary Edwards                         PERSON     38      7        4      2       1     17      0.02256  -30.87
O.J. Simpson                         NAME       2       2        6      2       3     12      0.02507  -37.40
South Lake Tahoe                     NAME       7       5        6      3       3     14      0.02507  -40.06
Washington High                      NAME       10      6        6      1       3     18      0.02507  -49.80
Morgan                               NAME       26      3        4      1       3     12      0.02256  -52.52
Tennessee football                   NAME       31      2        4      1       3     15      0.02256  -56.27
Ellington                            NAME       24      1        4      1       3     20      0.02256  -59.42
assistant                            ROLE       21      4        4      1       4     8       0.02257  -62.77
the Volunteers                       ROLE       34      5        4      2       4     14      0.02256  -71.17
Johnny Mathis                        PERSON     4       4        6      -100    1     11      0.02507  -211.33
Mathis                               NAME       14      2        2      -100    3     10      0.02505  -254.16
coach                                ROLE       19      3        4      -100    4     4       0.02257  -259.67

Page 62: Language and Information


Features (1)

• Number: position of the span among all spans returned. Example: "Lou Vasquez" was the first span returned by GuruQA on the sample question.
• Rspanno: position of the span among all spans returned within the current passage.
• Count: number of spans of any span class retrieved within the current passage.
• Notinq: the number of words in the span that do not appear in the query. Example: Notinq("Woodbridge High School") = 1, because both "high" and "school" appear in the query while "Woodbridge" does not. It is set to -100 when the actual value is 0.

Page 63: Language and Information


Features (2)

• Type: the position of the span type in the list of potential span types. Example: Type("Lou Vasquez") = 1, because the span type of "Lou Vasquez", namely "PERSON", appears first in the SYN-set "PERSON ORG NAME ROLE".
• Avgdst: the average distance in words between the beginning of the span and the words in the query that also appear in the passage. Example: given the passage "Tim O'Donohue, Woodbridge High School's varsity baseball coach, resigned Monday and will be replaced by assistant Johnny Ceballos, Athletic Director Dave Cowen said." and the span "Tim O'Donohue", the value of avgdst is equal to 8.
• Sscore: passage relevance as computed by GuruQA.

Page 64: Language and Information


Combining evidence

• TOTAL(span) = -0.3 * number - 0.5 * rspanno + 3.0 * count + 2.0 * notinq - 15.0 * type - 1.0 * avgdst + 1.5 * sscore
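A sketch of the same linear combination applied to two rows of the span table above (feature values copied from that table; the dict layout is an illustrative choice):

WEIGHTS = {"number": -0.3, "rspanno": -0.5, "count": 3.0,
           "notinq": 2.0, "type": -15.0, "avgdst": -1.0, "sscore": 1.5}

def total(features):
    # TOTAL(span): weighted sum of the AnSel features.
    return sum(w * features[name] for name, w in WEIGHTS.items())

spans = [
    {"span": "Lou Vasquez", "number": 1, "rspanno": 1, "count": 6,
     "notinq": 2, "type": 1, "avgdst": 16, "sscore": 0.02507},
    {"span": "Ollie Matson", "number": 3, "rspanno": 3, "count": 6,
     "notinq": 2, "type": 1, "avgdst": 12, "sscore": 0.02507},
]
for s in sorted(spans, key=total, reverse=True):  # Ollie Matson ranks first
    print(s["span"], round(total(s), 2))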

Page 65: Language and Information


Extracted text

Document ID     Score   Extract
LA053189-0069   892.5   of O.J. Simpson , Ollie Matson and Johnny Mathis
LA053189-0069   890.1   Lou Vasquez , track coach of O.J. Simpson , Ollie
LA060889-0181   887.4   Tim O'Donohue , Woodbridge High School 's varsity
LA060889-0181   884.1   nny Ceballos , Athletic Director Dave Cowen said.
LA060889-0181   880.9   aced by assistant Johnny Ceballos , Athletic Direc

Page 66: Language and Information


Results

50 bytes:
          First   Second  Third   Fourth  Fifth   TOTAL
# cases   49      15      11      9       4       88
Points    49.00   7.50    3.67    2.25    0.80    63.22

250 bytes:
          First   Second  Third   Fourth  Fifth   TOTAL
# cases   71      16      11      6       5       109
Points    71.00   8.00    3.67    1.50    1.00    85.17

Page 67: Language and Information


Style and Authorship Analysis

Page 68: Language and Information


Style and authorship analysis

• Use of nouns, verbs…
• Use of rare words
• Positional and contextual distribution
• Use of alternatives: "and/also", "since/because", "scarcely/hardly"

Page 69: Language and Information


Sample problem

• 15th-century Latin work "De Imitatione Christi"
• Was it written by Thomas à Kempis or Jean Charlier de Gerson?
• Answer: by Kempis
• Why?

Page 70: Language and Information


Yule’s K characteristic

• Vocabulary richness: a measure of the probability that any randomly selected pair of word tokens will be identical

  K = 10,000 × (M2 - M1) / (M1 × M1)

• M1, M2: distribution moments
• M1: total number of usages (word tokens, including repetitions)
• M2: the sum, over all frequency groups from 1 to the maximum word frequency, of the number of vocabulary words in each group multiplied by the square of the frequency (M2 = Σf Vf · f², where Vf is the number of distinct words occurring f times)

Page 71: Language and Information


Example

• Text consisting of 12 words, where two of the words occur once, two occur twice, and two occur three times.
• M0 = 6 (vocabulary size)
• M1 = 12
• M2 = (2 × 1²) + (2 × 2²) + (2 × 3²) = 28
• K = 10,000 × (28 - 12) / (12 × 12) ≈ 1111
• K increases as the diversity of the vocabulary decreases.
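The same computation as a sketch (the sample string is a made-up text matching the slide's frequency profile):

from collections import Counter

def yules_k(tokens):
    # Yule's K = 10,000 * (M2 - M1) / M1^2, with M1 the token count and
    # M2 the sum over distinct words of their squared frequencies.
    freqs = Counter(tokens)
    m1 = sum(freqs.values())
    m2 = sum(f * f for f in freqs.values())
    return 10000 * (m2 - m1) / (m1 * m1)

# Two words once, two twice, two three times: M1 = 12, M2 = 28.
print(yules_k("a b c c d d e e e f f f".split()))  # ~1111.1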

Page 72: Language and Information


Example (cont'd)

• Criteria used:
– total vocabulary size
– frequency distribution of the different words
– Yule's K
– the mean frequency of the words in the sample
– the number of nouns unique to a particular sample
• Pearson's coefficient used

Page 73: Language and Information


Federalist papers

• Published in 1787-1788 to persuade the population of New York state to ratify the new American constitution
• Published under the pseudonym Publius; the three authors were James Madison, John Jay, and Alexander Hamilton
• Before dying in a duel, Hamilton claimed some portion of the essays
• It was agreed that Jay wrote 5 essays, Hamilton 43, and Madison 14; three others were jointly written by Hamilton and Madison, and 12 were disputed

Page 74: Language and Information


Method

• Mosteller and Wallace (1963) used Bayesian statistics to determine which papers were written by whom.
• The authors had tried to imitate each other, so sentence length and other easily imitated features are not useful.
• Madison and Hamilton were found to vary in their use of "by" (H) and "to" (M), and of "enough" (H) and "whilst" (M).

Page 75: Language and Information


Cluster Analysis

Page 76: Language and Information


Clustering

• Idea: find similar objects and group them together
• Examples:
– all news stories on the same topic
– all documents from the same genre or language
• Types of clustering: classification (tracking) and categorization (detection)

Page 77: Language and Information


Non-hierarchical clustering

• Concept of a centroid
• Document/centroid similarity
• Other parameters:
– number of clusters
– maximum and minimum size for each cluster
– vigilance parameter
– overlap between clusters

Page 78: Language and Information


Hierarchical clustering

• Similarity matrix (expensive: the SIM matrix needs to be updated after every iteration)
• Average linkage method
• Dendrograms

Page 79: Language and Information


Introduction

• Abundance of newswire on the Web
• Multiple sources reporting on the same event
• Multiple modalities (speech, text)
• Summarization and filtering

Page 80: Language and Information


Introduction

• TDT participation: topic detection and tracking
– CIDR
• Multi-document summarization
– statistical, domain-dependent
– knowledge-based (SUMMONS)

Page 81: Language and Information


Topics and events

• Topic = event (single act) or activity (ongoing action)
• Defined by content, time, and place of occurrence [Allan et al. 1998, Yang et al. 1998]
• Examples:
– Marine fighter pilot's plane cuts cable in Italian Alps (February 3, 1998)
– Eduard Shevardnadze assassination attempt (February 9, 1998)
– Jonesboro shooting (March 24, 1998)

Page 82: Language and Information


TDT overview

• Event detection: monitoring a continuous stream of news articles and identifying new salient events

• Event tracking: identifying stories that belong to predefined event topics

• [Story segmentation: identifying topic boundaries]

Page 83: Language and Information


The TDT-2 corpus

• Corpus described in [Doddington et al. 1999, Cieri et al. 1999]

• One hundred topics, 54K stories, 6 sources
• Two newswire sources (AP, NYT), 2 TV stations (ABC, CNN-HN), 2 radio stations (PRI, VOA)
• 11 participants (4 industrial sites, 7 universities)

Page 84: Language and Information


Detection conditions

• Default:
– Newswire + audio (automatic transcription)
– Deferral period of 10 source files
– Given boundaries for ASR

Page 85: Language and Information


Description of the system

• Single-pass clustering algorithm
• Normalized, tf*idf-modified, cosine-based similarity between document and centroid
• Detection only, standard evaluation conditions, no deferral
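A minimal sketch of the single-pass loop (documents and centroids as {term: weight} dicts; the similarity function is supplied separately, e.g. the cosine measure shown two slides below; the running-average centroid update is an illustrative choice, not the slides' exact procedure):

def single_pass_cluster(docs, sim, threshold):
    # Assign each document to the most similar centroid if sim >= threshold;
    # otherwise start a new cluster.
    clusters = []   # each cluster: {"centroid": {term: weight}, "docs": [...]}
    for doc in docs:
        best, best_sim = None, threshold
        for c in clusters:
            s = sim(doc, c["centroid"])
            if s >= best_sim:
                best, best_sim = c, s
        if best is None:
            clusters.append({"centroid": dict(doc), "docs": [doc]})
        else:
            best["docs"].append(doc)
            cen, n = best["centroid"], len(best["docs"])
            for term in set(cen) | set(doc):   # running average of weights
                cen[term] = ((n - 1) * cen.get(term, 0.0)
                             + doc.get(term, 0.0)) / n
    return clusters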

Page 86: Language and Information


Research problems

• Focus on speedup
• Search space of five experimental parameters
• Tradeoffs between parallelization and accuracy

Page 87: Language and Information


Vector-based representation

[Figure: document and centroid vectors in term space, with axes Term 1, Term 2, Term 3]

Page 88: Language and Information


Vector-based matching

• The cosine measure:

  sim(D, C) = Σk (dk · ck · idf(k)) / ( sqrt(Σk dk²) · sqrt(Σk ck²) )
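A direct transcription of the formula as a sketch (documents and centroids as {term: tf-weight} dicts, idf as a {term: idf} dict):

import math

def cosine_sim(d, c, idf):
    # idf-weighted cosine similarity between document d and centroid c.
    num = sum(w * c.get(t, 0.0) * idf.get(t, 0.0) for t, w in d.items())
    den = (math.sqrt(sum(w * w for w in d.values()))
           * math.sqrt(sum(w * w for w in c.values())))
    return num / den if den else 0.0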

Page 89: Language and Information


Description of the system

[Figure: document-centroid similarity sim compared against threshold T]

Page 90: Language and Information


Description of the system

[Figure: if sim > T, the document joins the existing cluster; if sim < T, it starts a new cluster]

Page 91: Language and Information


Centroid size

C 10007 (N=11) (10000):
crashes 1.00, safety 0.55, transportation 0.55, drivers 0.45, board 0.36, flight 0.27, buckle 0.27, pittsburgh 0.18, graduating 0.18, automobile 0.18

C 00008 (N=113) (10000):
space 1.98, shuttle 1.17, station 0.75, nasa 0.51, columbia 0.37, mission 0.33, mir 0.30, astronauts 0.14, steering 0.11, safely 0.07

C 10062 (N=161):
microsoft 3.24, justice 0.93, department 0.88, windows 0.98, corp 0.61, software 0.57, ellison 0.07, hatch 0.06, netscape 0.04, metcalfe 0.02

Page 92: Language and Information


Centroid size

C 00022 (N=44) (10000):
diana 1.93, princess 1.52

C 00025 (N=19) (10000):
albanians 3.00

C 00026 (N=10) (10000):
universe 1.50, expansion 1.00, bang 0.90

C 00035 (N=22) (10000):
airlines 1.45, finnair 0.45

C 00031 (N=34) (10000):
el 1.85, nino 1.56


Page 94: Language and Information


Parameter space

• Similarity:
– DECAY: number of words at the beginning of the document considered in computing vector similarities (50-1000)
– IDF: minimum idf value for a word to be considered (1-10)
– SIM: similarity threshold (0.01-0.25)
• Centroids:
– KEEPI: keep all words whose tf*idf scores are above a certain threshold (1-10)
– KEEP: keep at least that many words in the centroid (1-50)

Page 95: Language and Information


Parameter selection (dev-test)

Page 96: Language and Information


Cluster stability

10000 docs           22443 docs
suharto 2.48         suharto 2.61
jakarta 0.58         jakarta 0.58
habibie 0.47         habibie 0.53
students 0.45        students 0.43
student 0.22         student 0.21
protesters 0.20      protesters 0.19
asean 0.11           asean 0.10
campuses 0.05        campuses 0.04
geertz 0.04          geertz 0.04
medan 0.04           medan 0.04

10000 docs           22443 docs
microsoft 3.31       microsoft 3.24
justice 1.06         justice 0.93
department 1.01      department 0.88
windows 0.90         windows 0.98
corp 0.60            corp 0.61
software 0.51        software 0.57
ellison 0.09         ellison 0.07
hatch 0.06           hatch 0.06
netscape 0.05        netscape 0.04
metcalfe 0.03        metcalfe 0.03

Page 97: Language and Information


Parallelization

Page 98: Language and Information


Parallelization

[Figure: C(P)]

Page 99: Language and Information


Parallelization

Page 100: Language and Information


Parallelization

Page 101: Language and Information


Evaluation principles

CDet(R,H) = CMiss · PMiss(R,H) · Ptopic + CFalseAlarm · PFalseAlarm(R,H) · (1 - Ptopic)

CMiss = 1
CFalseAlarm = 1
PMiss(R,H) = NMiss(R,H) / |R|
PFalseAlarm(R,H) = NFalseAlarm(R,H) / |S - R|
Ptopic = 0.02 (a priori probability)

R: set of stories in a reference target topic
H: set of stories in a system-defined topic
S: set of stories to be scored in the eval corpus

Task: determine H(R) = argmin_H CDet(R,H)

Page 102: Language and Information


Official results

Page 103: Language and Information


Results

                                    Story weighted               Topic weighted
#  Parallel  Sim  Decay  Idf  Keep  P(miss)  P(fa)    Cdet      P(miss)  P(fa)    Cdet
1  yes       .1   100    3    10    0.3861   0.0018   0.0095    0.3309   0.0018   0.0084
2  no        .1   100    3    10    0.3164   0.0014   0.0077    0.3139   0.0014   0.0077
3  no        .1   100    2    10    0.3178   0.0014   0.0077    0.2905   0.0014   0.0072
4  no        .1   50     3    10    0.5045   0.0014   0.0114    0.3201   0.0014   0.0077

Page 104: Language and Information


Novelty detection

<DOCID> reute960109.0101 </DOCID>
<HEADER> reute 01-09 0057 </HEADER>
...
German court convicts Vogel of extortion

BERLIN, Jan 9 (Reuter) - A German court on Tuesday convicted Wolfgang Vogel, the East Berlin lawyer famous for organising Cold War spy swaps, on charges that he extorted money from would-be East German emigrants. The Berlin court gave him a two-year suspended jail sentence and a fine -- less than the 3 3/8 years prosecutors had sought.

<DOCID> reute960109.0201 </DOCID>
<HEADER> reute 01-09 0582 </HEADER>
...
East German spy-swap lawyer convicted of extortion

BERLIN (Reuter) - The East Berlin lawyer who became famous for engineering Cold War spy swaps, Wolfgang Vogel, was convicted by a German court Tuesday of extorting money from East German emigrants eager to flee to the West. Vogel, a close confidant of former East German leader Erich Honecker and one of the Soviet bloc's rare millionaires, was found guilty of perjury, four counts of blackmail and five counts of falsifying documents. The Berlin court gave him the two-year suspended sentence and a $63,500 fine. Prosecutors had pressed for a jail sentence of 3 3/8 years and a $215,000 penalty...