Information extraction for multi-participant, task-oriented, synchronous, computer-mediated communication: a corpus study of chat data

Cassandre Creswell, Nicholas Schwartzmyer, Rohini Srihari

Janya, Inc.
www.janyainc.com

8 January 2007


Significance of the Problem

• Synchronous computer-mediated communication (SCMC), or chat, is an increasingly important means of communication in many settings, including the intelligence and military domains

• Information extraction (IE) as applied to chat could aid activities reliant upon real-time decision-making (e.g., entity tracking/targeting, monitoring teenagers' chat rooms, etc.)

• Most IE applications, including our Semantex™ system, have been developed for optimal performance on well-written text

• This necessitates research into the unique characteristics of chat and how they affect IE performance

– Perform a corpus study, i.e., a gap analysis, to prioritize tasks for a chat IE system

– Add to the nascent study of chat as a discourse type

• Focus on task-oriented chat discourse

– Involves participants exchanging information on a highly complex collaborative task in an unconstrained setting

– Task-oriented dialogs have played a significant role in understanding discourse-level coordination of meaning in conversation in the fields of psycholinguistics and computational linguistics


Task-oriented chat: The GeoTools corpus

• Need to develop a task-oriented corpus of chat data for research

• The GeoTools corpus

– 56 IRC logs for the GeoTools project (http://geotools.codehaus.org/IRC+Logs)

– Approximately 180K and 18,000 participant turns

• (cf. The COCONUT corpus < 14K)

– Interactions have the form of a business meeting

• Agenda of weekly problems/issues in the project, each discussed in some degree of detail

• Appropriate model of task-oriented discourse


Characteristics of chat data that distinguish it from narrative text

• It is dynamic. This results in:

– Epistemological uncertainty. Propositions and entities may be challenged and/or revised

– Structural Errors. Grammatical, spelling, and orthographic mistakes will be unedited in the text

rschulz also, did you get the CRSServices proj issue worked out?

rschulz (Jody)

jgarnett which one? I think we did - our wkt was not complete on the shapefile.prg

jgarnett bleck - I mean shapefile.prj

rschulz you had a test using crsservice to re-project geometries. It gave wierd results for the bounding box

rschulz I thought crsservice might be modifying its input geometries

jgarnett ah no I never figured that out


Characteristics of chat data that distinguish it from narrative text

• It is interactive. Content has contribution from multiple participants, allowing:

– Misunderstanding and disagreement but also

– Extensive implicit, shared knowledge

rschulz also, did you get the CRSServices proj issue worked out?

rschulz (Jody)

jgarnett which one? I think we did - our wkt was not complete on the shapefile.prg

jgarnett bleck - I mean shapefile.prj

rschulz you had a test using crsservice to re-project geometries. It gave wierd results for the bounding box

rschulz I thought crsservice might be modifying its input geometries

jgarnett ah no I never figured that out


Characteristics of chat data that distinguish it from narrative text

• It has a relatively unconstrained turn-taking structure

– Distinguishes chat from conversation as well

– Allows for multi-threaded, interleaved conversations

– Complicates resolution tasks

jgarnett While we wait

dblasby I almost have the “empty” hsql datastore ready to commit

cholmes Cool, I'd be happy to see it

polio ah, that was me

jeichar and I added crs support to postgis

cholmes Successfully?

dblasby thanks jody

cholmes where did you get the changes?

polio he made a crs factory thingy and I hooked it into postgis


Implications for Information Extraction: Surface Level

These characteristics complicate IE on two levels:

• Surface-level (Dynamic noise) affecting turn processing

– Spelling errors

ex: rschulz ...It gave wierd results. [sic]

ex: jgarnett I would love a repalcement [sic]

– Non-standard punctuation/orthographic decisions

ex: jmacgill sorry[,] gotta run[.]

– Ungrammatical/non-standard grammatical constructions

ex: jmacgill sorry [I] gotta run
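
The surface-level noise listed above could be normalized before turn processing along the following lines. This is a minimal sketch only: the lookup table and the function name `normalize_turn` are hypothetical and are not part of Semantex. The table covers just the two misspellings cited on this slide, and the function restores a turn-final period when none is present.

```python
import re

# Hypothetical correction table: only the two misspellings cited on the slide.
CORRECTIONS = {"wierd": "weird", "repalcement": "replacement"}


def normalize_turn(text):
    """Fix known misspellings and add turn-final punctuation if it is missing."""

    def fix(match):
        word = match.group(0)
        # Unknown words pass through unchanged; known ones are replaced
        # by the lowercase corrected form.
        return CORRECTIONS.get(word.lower(), word)

    fixed = re.sub(r"[A-Za-z]+", fix, text.strip())
    # Restore omitted turn-final punctuation.
    if fixed and fixed[-1] not in ".?!":
        fixed += "."
    return fixed


cleaned = normalize_turn("I would love a repalcement")
```

A real system would presumably replace the lookup table with a trained spelling and case-restoration component; the table here only keeps the example self-contained.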


Implications for Information Extraction: Discourse Level

• Discourse-level (Interactive noise) affecting 'document' processing

– Topic segmentation

– Time and event normalization

– Anaphora resolution

jgarnett While we wait

dblasby I almost have the “empty” hsql datastore ready to commit

cholmes Cool, I'd be happy to see it

polio ah, that was me

jeichar and I added crs support to postgis

cholmes Successfully?

dblasby thanks jody

cholmes where did you get the changes?

polio he made a crs factory thingy and I hooked it into postgis


Semantex™ 3.0

• Semantex

– Tags key entities (people, places, organizations, ...), relationships, and events

– Summarizes information in entity profiles

• Hybrid model

– Combines statistical and grammar-based approaches in a cascade of over 60 modules

– FST grammars

• Modular

– Semantex™ engine features plug-and-play ease of use

– Easily integrates additional modules, such as a DoD acronym tagger

– Supports a variety of data sources

• FBIS, USMTF, Lexis-Nexis, HUMINT, Factiva, Dialog, etc.


Semantex Generates Entity Profiles & Events from Documents

Organization Profile (EP402)

Profile Name: al-Barakaat
Descriptors: money transfer terrorist
Staff: Mohamed Barre (EP102)
Founder: Osama bin Laden (EP103)
Events: founding (Who: EP103, Org: EP402)

Source text: According to the Boston Globe, the al-Barakaat network was founded by Osama bin Laden… Mohamed Barre has been the money transfer agency’s broker…
Copyright 2001 Blethen Maine Newspapers, Inc. Portland Press Herald

Person Profile (EP101)

Profile Name: Herman Cohen
Mentions: official, Mr. Cohen
Aliases: Cohen
Position: assistant secretary of state
Where From: US

Source text: Assistant secretary of state, Herman Cohen,… as the US official currently visiting Khartoum, he has been …
Copyright 1992 Guardian Newspapers Limited The Guardian (London)
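
The organization profile and founding event shown on this slide can be pictured as linked records. The sketch below is an assumption for illustration: the class names `EntityProfile` and `Event` and their fields are hypothetical and do not describe Semantex's internal representation. Only the identifiers and values are taken from the slide.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class EntityProfile:
    # Hypothetical record type; field names are illustrative only.
    profile_id: str
    kind: str
    name: str
    attributes: Dict[str, str] = field(default_factory=dict)


@dataclass
class Event:
    # Roles map a role name to the id of the profile filling that role.
    event_type: str
    roles: Dict[str, str]


org = EntityProfile(
    profile_id="EP402",
    kind="Organization",
    name="al-Barakaat",
    attributes={
        "Descriptors": "money transfer terrorist",
        "Staff": "EP102",    # Mohamed Barre
        "Founder": "EP103",  # Osama bin Laden
    },
)

founding = Event(event_type="founding", roles={"Who": "EP103", "Org": "EP402"})
```

Storing profile ids rather than names in the event roles is what lets the event point back to the same profiles that the attribute slots refer to.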


Concept of Operations

• 2 x 2 approach

– 2 levels of processing

• Turn processing

• Discourse processing

– 2 channels for system I/O

• Mission channel: IE chatbot monitors chat session

• Alert channel: system reports flagged information to interested parties
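
The 2 x 2 approach can be sketched as a small monitor object: two processing levels (turn, discourse) and two I/O channels (mission, alert). Everything named here is hypothetical, including `ChatMonitor`, its methods, and the flagging predicate; the per-turn and discourse steps are placeholders standing in for real IE components.

```python
from typing import Callable, Dict, List


class ChatMonitor:
    """Hypothetical IE chatbot: reads the mission channel, writes the alert channel."""

    def __init__(self, is_flagged: Callable[[str], bool]):
        self.is_flagged = is_flagged
        self.discourse: List[Dict] = []  # discourse-level state
        self.alerts: List[str] = []      # stands in for the alert channel

    def process_turn(self, speaker: str, text: str) -> Dict:
        # Level 1: turn processing (placeholder for per-turn extraction).
        return {"speaker": speaker, "text": text, "tokens": text.split()}

    def update_discourse(self, turn: Dict) -> None:
        # Level 2: discourse processing (placeholder: keep the turn history).
        self.discourse.append(turn)

    def on_mission_turn(self, speaker: str, text: str) -> None:
        # Called for every turn observed on the mission channel.
        turn = self.process_turn(speaker, text)
        self.update_discourse(turn)
        if self.is_flagged(text):
            self.alerts.append(f"{speaker}: {text}")


monitor = ChatMonitor(is_flagged=lambda text: "postgis" in text.lower())
monitor.on_mission_turn("cholmes", "Cool, I'd be happy to see it")
monitor.on_mission_turn("jeichar", "and I added crs support to postgis")
```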


The GeoTools corpus annotation schema

Surface-level phenomena annotation

– Turn-final punctuation mark-up (incl. omission)

– Misspelling, non-standard punctuation/orthography

• annotator provides correct form

– Ungrammatical constructions

• violations of syntactic rules

• constructions not found in narrative text

– blends: gotcha, dunno, etc.

– constituent drop (incl. apostrophes)

– Annotated 50% of the corpus due to density (23/56 logs)


The GeoTools corpus annotation schema

Discourse-level phenomena annotation

– Location deixis

• Potentially important inferential construction, esp. for location normalization

ex: It's midnight here, what time is it there?

• mark deictics and describe what they resolve to

– Verb phrase ellipsis (VPE)

• Another potentially important inferential construction

ex: Although Max thinks I'll leave soon, I do want to

• mark the VPE and its antecedent


The GeoTools corpus annotation schema

– Noun Phrases (NP)

• 20% of the corpus

• Inspired by DRAMA and MATE/GNOME schemas

• mark referentiality type

– non-referential, anaphoric, non-NP antecedent

– mark antecedent, if applicable

– Sentential and non-sentential utterances

• 20% of the corpus

• Defined as a functionally independent clause

• Annotation activities include:

– Sentential status (+/-)

– Mark dialog act type (after DAMSL-SWDB schema)

– Link dependent utterances


The GeoTools corpus annotation


Surface-level annotation analysis

Surface-form noise is in fact very common in chat

Annotation type                 % of turns in which it is present
Non-standard orthography        48%
Non-standard punctuation        46.5%
Ungrammatical construction      17.5%
Misspelling                     12.8%


Surface-level annotation analysis

• Orthographic errors, in particular, show a tendency toward reduced formality:

Type of non-standard orthography (NSO)    % of NSO annotations
Should have initial uppercase             59.1%
Should be all uppercase                   12.6%
Should be all lowercase                   0.4%

• This can also be seen in the most common ungrammatical construction annotations

– constituent drop (particularly the subject)

– apostrophe omission (in contractions & possessives)

– blends


Utterance annotation analysis

Antecedent                Dependent                 Pair count
Statement, non-opinion    Statement, non-opinion    1017
Statement, opinion        Statement, non-opinion    147
Statement, non-opinion    Statement, opinion        146
Statement, non-opinion    Yes-No Question           96
Statement, non-opinion    Accept                    83
Yes-No Question           Statement, non-opinion    81
Accept                    Statement, non-opinion    68
Statement, opinion        Statement, opinion        59
Affirmative Answer        Statement, non-opinion    50
Wh-Question               Statement, non-opinion    47
Statement, non-opinion    Wh-Question               47
Statement, non-opinion    Offer-Suggest             47
Action-directive          Statement, non-opinion    45
Statement, non-opinion    Action-directive          40
Offer-Suggest             Statement, non-opinion    38

Utterance dependency pairs occurring for more than 10% of utterances
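
Pair counts of this kind can be tallied directly from the annotated links. The snippet below is a sketch under assumptions: each link is taken to be an (antecedent act, dependent act) tuple, and the four links are invented toy data, not drawn from the GeoTools annotations.

```python
from collections import Counter

# Invented toy links: (antecedent dialog act, dependent dialog act).
links = [
    ("Statement, non-opinion", "Statement, non-opinion"),
    ("Yes-No Question", "Statement, non-opinion"),
    ("Statement, non-opinion", "Statement, non-opinion"),
    ("Statement, non-opinion", "Accept"),
]

# Count each ordered pair, then rank pairs from most to least frequent.
pair_counts = Counter(links)
ranked = pair_counts.most_common()

for (antecedent, dependent), count in ranked:
    print(f"{antecedent:<26}{dependent:<26}{count}")
```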


Utterance annotation analysis

• 194 discrete utterance dependency chains

– Interleaved topics are still relatively local

• Median number of turns between linked utterances: 1

• Mean number of turns between linked utterances: 1.6

– These numbers hold for both sentential and non-sentential utterances

• Wider distribution of dialog act types than in narrative discourse, as is to be expected, but statements remain the dominant type
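
The locality figures on this slide (median 1, mean 1.6) are summary statistics over the turn distance of each link. A sketch of that calculation follows, with two assumptions: the distance is measured as the difference between the turn indices of the dependent and its antecedent, and the links below are invented toy data rather than corpus values.

```python
from statistics import mean, median


def link_distances(links):
    """Return dependent-minus-antecedent turn index for each link."""
    return [dependent - antecedent for antecedent, dependent in links]


# Invented toy links: (antecedent turn index, dependent turn index).
toy_links = [(0, 1), (1, 2), (2, 3), (1, 4), (5, 6)]

distances = link_distances(toy_links)
median_distance = median(distances)
mean_distance = mean(distances)
```

A few long-range links pull the mean above the median, which is the same pattern the corpus figures show.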


NP annotation analysis

• 13,225 NP annotations (excl. turn-initial usernames)

Type                                          % of NP annotations
Pronouns (personal, possessive, reflexive)    64.1%
Definite NP                                   21.6%
Other                                         9.7%
Demonstrative pronouns                        4.6%

• Furthermore, ~7% of definite NPs were coreference chain-initial, meaning the rest will have an inferable antecedent

Some implications:

– Entities in chat discourse may not rely as heavily on implicit knowledge as hypothesized

– Majority of entities are introduced anew in every chat discourse

• This would therefore allow their antecedents to be recoverable via IE


Verb phrase ellipsis annotation analysis

• Rare in our corpus (61 instances)

• Paucity is likely not a domain effect

• 61/61 had their antecedent in a distinct utterance

• 53/61 had their antecedent in a distinct turn

• Therefore, VPE resolution will be rather local, but necessitates discourse-level processing


Location deixis annotation analysis

• Also rare in our corpus

• This is likely not a domain effect

• 138 instances of deictic here

– 59% refer to the present chat

– 14% refer to the participant's location in the real world

• 84 instances of deictic there

– 1% refer to the present chat

– 10% refer to a location in the real world

• This distribution may also be domain-dependent


Consequences for a chat IE system

• Low-level chat IE (turn handling) requires few modifications to Semantex:

– Pre-existing case restoration modules can be retrained for punctuation

– Robust shallow parsing can handle many ungrammaticalities

• Discourse-level Chat IE System:

– moderately local context of many inference phenomena makes for a more tractable problem

– Topic segmentation and discourse modelling aided by dialog act tagging (utterance type classification)

– Multi-threaded discourse model: a strictly linear model will not suffice

Solution: A tree structure with turn-level processing outputs at the nodes

• Model updated with each utterance added

– Dynamic world model, parallel to the discourse model

• Consists of Concepts and Mentions

– Concepts: Events, Entities, Relationships

– Mentions: Token-based references to these Concepts

– Separation of concepts from mentions allows for truth value updating of a concept with each additional mention
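
The proposed design can be sketched as two small structures: a tree of utterance nodes for the multi-threaded discourse model, and concepts that accumulate mentions for the world model. All class and function names below are hypothetical and nothing here is taken from the Semantex implementation. The turns are borrowed from the GeoTools excerpts shown earlier, and the truth values assigned to them are an illustrative reading, not annotated data.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class UtteranceNode:
    # One node per utterance; interleaved threads become separate branches.
    speaker: str
    text: str
    parent: Optional["UtteranceNode"] = field(default=None, repr=False, compare=False)
    children: List["UtteranceNode"] = field(default_factory=list)

    def reply(self, speaker: str, text: str) -> "UtteranceNode":
        child = UtteranceNode(speaker, text, parent=self)
        self.children.append(child)
        return child


@dataclass
class Mention:
    text: str
    asserts: Optional[bool] = None  # what this mention claims about the concept


@dataclass
class Concept:
    kind: str   # "Event", "Entity", or "Relationship"
    label: str
    truth_value: Optional[bool] = None
    mentions: List[Mention] = field(default_factory=list)


def add_mention(concept: Concept, mention: Mention) -> None:
    """Attach a mention and let the latest assertion update the truth value."""
    concept.mentions.append(mention)
    if mention.asserts is not None:
        concept.truth_value = mention.asserts


# Discourse model: two interleaved threads hang off the same root.
root = UtteranceNode("session", "")
datastore = root.reply("dblasby", "I almost have the empty hsql datastore ready to commit")
datastore.reply("cholmes", "Cool, I'd be happy to see it")
crs = root.reply("jeichar", "and I added crs support to postgis")
question = crs.reply("cholmes", "Successfully?")

# World model: a later mention revises the concept's truth value.
issue = Concept("Event", "CRSServices proj issue worked out")
add_mention(issue, Mention("I think we did", asserts=True))
add_mention(issue, Mention("no I never figured that out", asserts=False))
```

Keeping mentions separate from the concept means the earlier "I think we did" is still on record after the revision, so the history of the claim stays recoverable.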