Information extraction for multi-participant, task-oriented, synchronous, computer-mediated communication: a corpus study of chat data
Cassandre Creswell, Nicholas Schwartzmyer, Rohini Srihari
Janya, Inc.
www.janyainc.com
8 January 2007
Significance of the Problem
• Synchronous computer-mediated communication (SCMC), or chat, is an increasingly important means of communication in many settings, including intelligence and military domains
• Information extraction (IE) applied to chat could aid activities that rely on real-time decision-making (e.g., entity tracking/targeting, monitoring teenagers' chat rooms)
• Most IE applications, including our Semantex™ system, have been developed for optimal performance on well-written text
• This necessitates research into the unique characteristics of chat and how they affect IE performance
– Perform a corpus study, i.e., a gap analysis, to prioritize tasks for a chat IE system
– Add to the nascent study of chat as a discourse type
• Focus on task-oriented chat discourse
– Involves participants exchanging information on a highly complex collaborative task in an unconstrained setting
– Task-oriented dialogs have played a significant role in understanding discourse-level coordination of meaning in conversation in the fields of psycholinguistics and computational linguistics
Task-oriented chat: The GeoTools corpus
• Need to develop a task-oriented corpus of chat data for research
• The GeoTools corpus
– 56 IRC logs for the GeoTools project (http://geotools.codehaus.org/IRC+Logs)
– Approximately 180K and 18,000 participant turns
• (cf. The COCONUT corpus < 14K)
– Interactions take the form of a business meeting
• Agenda of weekly problems/issues in the project, each discussed in some degree of detail
• Appropriate model of task-oriented discourse
Characteristics of chat data that distinguish it from narrative text
• It is dynamic. This results in:
– Epistemological uncertainty. Propositions and entities may be challenged and/or revised
– Structural Errors. Grammatical, spelling, and orthographic mistakes will be unedited in the text
rschulz also, did you get the CRSServices proj issue worked out?
rschulz (Jody)
jgarnett which one? I think we did - our wkt was not complete on the shapefile.prg
jgarnett bleck - I mean shapefile.prj
rschulz you had a test using crsservice to re-project geometries. It gave wierd results for the bounding box
rschulz I thought crsservice might be modifying its input geometries
jgarnett ah no I never figured that out
Characteristics of chat data that distinguish it from narrative text
• It is interactive. Content has contribution from multiple participants, allowing:
– Misunderstanding and disagreement but also
– Extensive implicit, shared knowledge
rschulz also, did you get the CRSServices proj issue worked out?
rschulz (Jody)
jgarnett which one? I think we did - our wkt was not complete on the shapefile.prg
jgarnett bleck - I mean shapefile.prj
rschulz you had a test using crsservice to re-project geometries. It gave wierd results for the bounding box
rschulz I thought crsservice might be modifying its input geometries
jgarnett ah no I never figured that out
Characteristics of chat data that distinguish it from narrative text
• It has a relatively unconstrained turn-taking structure
– Distinguishes chat from conversation as well
– Allows for multi-threaded, interleaved conversations
– Complicates resolution tasks
jgarnett While we wait dblasby I almost have the “empty” hsql datastore ready to commit
cholmes Cool, I'd be happy to see it
polio ah, that was me
jeichar and I added crs support to postgis
cholmes Successfully?
dblasby thanks jody
cholmes where did you get the changes?
polio he made a crs factory thingy and I hooked it into postgis
Implications for Information Extraction: Surface Level
These characteristics complicate IE on two levels:
• Surface-level (Dynamic noise) affecting turn processing
– Spelling errors
ex: rschulz ...It gave wierd results. [sic]
ex: jgarnett I would love a repalcement [sic]
– Non-standard punctuation/orthographic decisions
ex: jmacgill sorry[,] gotta run[.]
– Ungrammatical/non-standard grammatical constructions
ex: jmacgill sorry [I] gotta run
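The surface-level repairs listed on this slide can be pictured as a small per-turn normalization pass. The following is a minimal sketch only, not Semantex's method: the correction table `SPELLING_FIXES` and the function `normalize_turn` are hypothetical names introduced for illustration, and a real system would rely on trained case-restoration and punctuation models rather than hand-written rules.

```python
# Hypothetical correction table (an assumption for illustration); the two
# entries are the misspellings quoted on the slide.
SPELLING_FIXES = {"wierd": "weird", "repalcement": "replacement"}


def normalize_turn(text: str) -> str:
    """Apply toy surface-level repairs to the text of one chat turn."""
    # Replace known misspellings, token by token. Tokens with attached
    # punctuation (e.g. "wierd.") are not matched by this simple lookup.
    words = [SPELLING_FIXES.get(w.lower(), w) for w in text.split()]
    fixed = " ".join(words)
    # Restore turn-initial capitalization.
    if fixed and fixed[0].islower():
        fixed = fixed[0].upper() + fixed[1:]
    # Restore turn-final punctuation when it was omitted.
    if fixed and fixed[-1] not in ".?!":
        fixed += "."
    return fixed


print(normalize_turn("sorry gotta run"))
print(normalize_turn("It gave wierd results"))
```

Dropped constituents (the missing subject in "sorry [I] gotta run") are deliberately left alone here; the deck assigns those to robust shallow parsing rather than string repair.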
Implications for Information Extraction: Discourse Level
• Discourse-level (Interactive noise) affecting 'document' processing
– Topic segmentation
– Time and event normalization
– Anaphora resolution
jgarnett While we wait dblasby I almost have the “empty” hsql datastore ready to commit
cholmes Cool, I'd be happy to see it
polio ah, that was me
jeichar and I added crs support to postgis
cholmes Successfully?
dblasby thanks jody
cholmes where did you get the changes?
polio he made a crs factory thingy and I hooked it into postgis
Semantex™ 3.0
• Semantex
– Tags key entities (people, places, organizations, ...), relationships, events
– Summarizes information in entity profiles
• Hybrid model
– Combines statistical and grammar-based approaches in a cascade of over 60 modules
– FST grammars
• Modular
– Semantex™ engine features plug-and-play ease of use
– Easily integrate additional modules, such as a DoD acronym tagger
– Supports a variety of data sources
• FBIS, USMTF, Lexis-Nexis, HUMINT, Factiva, Dialog, etc.
Semantex Generates Entity Profiles & Events from Documents
Organization Profile (EP402)
Profile Name: al-Barakaat
Descriptors: money transfer terrorist
Staff: Mohamed Barre (EP102)
Founder: Osama bin Laden (EP103)
Events: founding (Who: EP103; Org: EP402)
______________________
According to the Boston Globe, the al-Barakaat network was founded by Osama bin Laden…Mohamed Barre has been the money transfer agency’s broker…
Copyright 2001 Blethen Maine Newspapers, Inc. Portland Press Herald
Person Profile (EP101)
Profile Name: Herman Cohen
Mentions: official, Mr. Cohen
Aliases: Cohen
Position: assistant secretary of state
Where From: US
________________________
Assistant secretary of state, Herman Cohen,…as the US official currently visiting Khartoum, he has been …
Copyright 1992 Guardian Newspapers Limited The Guardian (London)
Concept of Operations
• 2 x 2 approach
– 2 levels of processing
• Turn processing
• Discourse processing
– 2 channels for system I/O
• Mission channel: IE chatbot monitors chat session
• Alert channel: system reports flagged information to interested parties
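One way to picture the 2 x 2 arrangement is a monitor object with two processing methods and two channels. This is a rough sketch under stated assumptions, not the actual system design: `ChatMonitor`, `Turn`, and the `watch_terms` flag list are hypothetical, the alert channel is simulated with a plain list, and keyword matching stands in for real information extraction.

```python
from dataclasses import dataclass, field


@dataclass
class Turn:
    speaker: str
    text: str


@dataclass
class ChatMonitor:
    """Toy monitor: reads the mission channel, writes to the alert channel."""
    watch_terms: set                              # hypothetical flag list
    history: list = field(default_factory=list)   # discourse-level state
    alerts: list = field(default_factory=list)    # stands in for the alert channel

    def process_turn(self, turn):
        """Level 1, turn processing: find watch terms within a single turn."""
        tokens = {t.strip(".,?!").lower() for t in turn.text.split()}
        return tokens & self.watch_terms

    def process_discourse(self, turn, hits):
        """Level 2, discourse processing: place the turn in session context."""
        self.history.append(turn)
        for term in sorted(hits):
            self.alerts.append(
                f"turn {len(self.history)}: {turn.speaker} mentioned {term}"
            )

    def on_mission_message(self, speaker, text):
        """Entry point called for every message seen on the mission channel."""
        turn = Turn(speaker, text)
        self.process_discourse(turn, self.process_turn(turn))


monitor = ChatMonitor(watch_terms={"postgis"})
monitor.on_mission_message("cholmes", "Successfully?")
monitor.on_mission_message(
    "polio", "he made a crs factory thingy and I hooked it into postgis"
)
print(monitor.alerts)
```

Keeping the two levels as separate methods mirrors the split on the slide: the turn-level step sees only one message, while the discourse-level step is the only place that touches session history.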
The GeoTools corpus annotation schema
Surface-level phenomena annotation
– Turn-final punctuation mark-up (incl. omission)
– Misspelling, non-standard punctuation/orthography
• annotator provides correct form
– Ungrammatical constructions
• violations of syntactic rules
• constructions not found in narrative text
– blends: gotcha, dunno, etc.
– constituent drop (incl. apostrophes)
– Only 50% of the corpus annotated (23/56 logs), because of annotation density
The GeoTools corpus annotation schema
Discourse-level phenomena annotation
– Location deixis
• Potentially important inferential construction, esp. for location normalization
ex: It's midnight here, what time is it there?
• mark deictics and describe what they resolve to
– Verb phrase ellipsis (VPE)
• Another potentially important inferential construction
ex: Although Max thinks I'll leave soon, I do want to
• mark the VPE and its antecedent
The GeoTools corpus annotation schema
– Noun Phrases (NP)
• 20% of the corpus
• Inspired by DRAMA and MATE/GNOME schemas
• mark referentiality type
– non-referential, anaphoric, non-NP antecedent
– mark antecedent, if applicable
– Sentential and non-sentential utterances
• 20% of the corpus
• Defined as a functionally independent clause
• Annotation activities include:
– Sentential status (+/-)
– Mark dialog act type (after DAMSL-SWDB schema)
– Link dependent utterances
The GeoTools corpus annotation
Surface-level annotation analysis
Surface-form noise is in fact very common in chat
Annotation Type              % of turns in which it is present
Non-standard orthography     48%
Non-standard punctuation     46.5%
Ungrammatical Construction   17.5%
Misspelling                  12.8%
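Figures of this kind are per-turn prevalence rates: a turn counts once for a type no matter how many annotations of that type it carries. A minimal sketch of that calculation follows, assuming annotations are available as one list of type labels per turn; the function name `prevalence_by_type` and the sample data are made up for illustration and are not the corpus or the authors' tooling.

```python
from collections import Counter


def prevalence_by_type(turn_annotations):
    """Percent of turns that contain at least one annotation of each type.

    turn_annotations: one list of annotation-type labels per turn
    (an empty list for a turn with no annotations).
    """
    total = len(turn_annotations)
    if total == 0:
        return {}
    counts = Counter()
    for labels in turn_annotations:
        counts.update(set(labels))  # count each type at most once per turn
    return {label: 100.0 * n / total for label, n in counts.items()}


# Made-up sample: four turns, not taken from the GeoTools corpus.
sample = [
    ["misspelling", "misspelling"],
    ["non-standard punctuation"],
    [],
    ["misspelling", "non-standard punctuation"],
]
print(prevalence_by_type(sample))
```

Because a single turn can carry several annotation types, the percentages need not sum to 100%, which is consistent with the table above.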
Surface-level annotation analysis
• Orthographic errors, in particular, show a tendency toward reduced formality:
Type of non-standard orthography (NSO)   % of NSO annotations
Should have initial uppercase            59.1%
Should be all uppercase                  12.6%
Should be all lowercase                  0.4%
• This can also be seen in the most common ungrammatical construction annotations
– constituent drop (particularly the subject)
– apostrophe omission (in contractions & possessives)
– blends
Utterance annotation analysis
Antecedent               Dependent                Pair Count
Statement, non-opinion   Statement, non-opinion   1017
Statement, opinion       Statement, non-opinion   147
Statement, non-opinion   Statement, opinion       146
Statement, non-opinion   Yes-No Question          96
Statement, non-opinion   Accept                   83
Yes-No Question          Statement, non-opinion   81
Accept                   Statement, non-opinion   68
Statement, opinion       Statement, opinion       59
Affirmative Answer       Statement, non-opinion   50
Wh-Question              Statement, non-opinion   47
Statement, non-opinion   Wh-Question              47
Statement, non-opinion   Offer-Suggest            47
Action-directive         Statement, non-opinion   45
Statement, non-opinion   Action-directive         40
Offer-Suggest            Statement, non-opinion   38
Utterance dependency pairs occurring for more than 10% of utterances
Utterance annotation analysis
• 194 discrete utterance dependency chains
– Interleaved topics are still relatively local
• Median number of turns between linked utterances: 1
• Mean number of turns between linked utterances: 1.6
– These numbers hold for both sentential and non-sentential utterances
• Wider distribution of dialog act types than in narrative discourse, as is to be expected, but statements remain the dominant type
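The locality figures above are summary statistics over the distance between each dependent utterance and its antecedent. A small sketch of how such numbers can be computed, assuming each link is stored as a pair of turn indices and that distance is the difference between those indices (the slide does not define the measure precisely); the function name and link data are invented for illustration.

```python
from statistics import mean, median


def link_distances(links):
    """Turn distance for each (antecedent_turn, dependent_turn) index pair."""
    return [dependent - antecedent for antecedent, dependent in links]


# Made-up links for illustration only; not taken from the GeoTools corpus.
links = [(0, 1), (1, 2), (2, 5), (3, 4), (4, 5)]
distances = link_distances(links)

print("median:", median(distances))
print("mean:", mean(distances))
```

A median at the minimum distance combined with a somewhat larger mean is the pattern the slide describes: most links are adjacent, with a tail of longer links caused by interleaved threads.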
NP annotation analysis
• 13,225 NP annotations (excl. turn-initial usernames)
Type                                            % of NP annotations
Pronouns (personal, possessive, reflexive)      64.1%
Definite NP                                     21.6%
Other                                           9.7%
Demonstrative Pronouns                          4.6%
• Furthermore, ~7% of definite NPs were coreference chain-initial, meaning the rest will have an inferable antecedent
Some implications:
– Entities in chat discourse may not rely as heavily on implicit knowledge as hypothesized
– Majority of entities are introduced anew in every chat discourse
• This would therefore allow their antecedents to be recoverable via IE
Verb phrase ellipsis annotation analysis
• Rare in our corpus (61 instances)
• Paucity is likely not a domain effect
• 61/61 had their antecedent in a distinct utterance
• 53/61 had their antecedent in a distinct turn
• Therefore, VPE resolution will be rather local, but necessitates discourse-level processing
Location deixis annotation analysis
• Also rare in our corpus
• This is likely not a domain effect
• 138 instances of deictic here
– 59% refer to the present chat
– 14% refer to the participant's location in the real world
• 84 instances of deictic there
– 1% refer to the present chat
– 10% refer to a location in the real world
• This distribution may also be domain-dependent
Consequences for a chat IE system
• Low-level Chat IE (Turn Handling) requires little modification to Semantex:
– Pre-existing case restoration modules can be retrained for punctuation
– Robust shallow parsing can handle many ungrammaticalities
• Discourse-level Chat IE System:
– moderately local context of many inference phenomena makes for a more tractable problem
– Topic segmentation and discourse modelling aided by dialog act tagging (utterance type classification)
– Multi-threaded discourse model; a strictly linear model will not suffice
• Solution: a tree structure with turn-level processing outputs at the nodes
• Model updated with each utterance added
– Dynamic world model, parallel to the discourse model
• Consists of Concepts and Mentions
– Concepts: Events, Entities, Relationships
– Mentions: token-based references to these Concepts
– Separation of Concepts from Mentions allows the truth value of a Concept to be updated with each additional mention
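The two parallel models can be sketched as simple data structures. This is an illustrative sketch only, not the Semantex implementation: the class names, the `asserts` flag, and the "latest mention wins" truth policy are all assumptions made for the example, and the reading of jgarnett's self-correction as a retraction is one possible interpretation of the excerpt.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Mention:
    turn_id: int
    text: str
    asserts: bool = True   # False when the mention denies or retracts the concept


@dataclass
class Concept:
    kind: str                       # "Entity", "Event", or "Relationship"
    name: str
    mentions: list = field(default_factory=list)
    truth: Optional[bool] = None    # unknown until the first mention arrives

    def add_mention(self, mention):
        """Record a mention and update the truth value (toy policy: latest wins)."""
        self.mentions.append(mention)
        self.truth = mention.asserts


@dataclass
class DiscourseNode:
    """One turn in the discourse tree; children are turns that respond to it."""
    turn_id: int
    speaker: str
    text: str
    children: list = field(default_factory=list)

    def attach(self, child):
        self.children.append(child)
        return child


def threads(node, prefix=()):
    """Yield each root-to-leaf path of turn ids, i.e. one conversational thread."""
    path = prefix + (node.turn_id,)
    if not node.children:
        yield path
    for child in node.children:
        yield from threads(child, path)


# World model: the file name is asserted, then revised one turn later.
prg = Concept("Entity", "shapefile.prg")
prg.add_mention(Mention(3, "our wkt was not complete on the shapefile.prg"))
prg.add_mention(Mention(4, "bleck - I mean shapefile.prj", asserts=False))
print(prg.truth, len(prg.mentions))
```

Because each Concept keeps its full mention list, a later challenge or revision changes the current truth value without discarding the earlier evidence, and a branching tree lets interleaved threads be read off as separate root-to-leaf paths.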