Connecting the Docs: Integrating Information from Multiple Documents
Presentation to ASIS&T PNC Annual Meeting
Mark Wasson
Senior Architect, Research Scientist
LexisNexis New Technology Research
May 14, 2004
May 14, 2004 Connecting the Docs - Mark Wasson 2
Talk Outline
• Introduction
• Search and retrieval, classification and indexing
• Clustering and summarization
• Extraction and aggregation
• Record linkage
• Analysis, visualization and discovery
• Closing remarks, Q&A
• References and related materials
Introduction
What is Information Integration?
• Pull together an appropriate amount of information about some subject matter (company, person, topic, product, event, etc.) into a single information product
• Key steps
  – Target some subject matter
  – Find relevant information across all relevant sources
  – Focus on the particularly useful information
  – Connect information about the target found in different documents, sources
  – Eliminate redundant information
  – Package the information
Search and Retrieval, Classification and Indexing
Search and Retrieval
• Search basics
  – Choose sources, search tools
  – Formulate query
  – Submit search
  – Review results
  – Refine and repeat as appropriate
• The result is generally a set of documents
Search and Retrieval
• Accuracy varies widely
  – Recall (completeness)
  – Precision (correctness)
• What impacts results?
  – What you are searching for
  – Ambiguity, synonymy, variants
  – Source size and focus
  – Search functionality
  – Search engine algorithms, coverage
  – Data annotations and enhancements
  – Searcher’s skills, knowledge of the topic
• User still must analyze search results
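The two accuracy measures above can be made concrete with a small sketch. The function name and the document IDs are hypothetical illustrations, not part of any system discussed in this talk.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved document IDs.

    precision = share of retrieved documents that are relevant (correctness)
    recall    = share of relevant documents that were retrieved (completeness)
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical search: 4 documents come back, 3 of them are relevant,
# and the collection holds 6 relevant documents in total.
p, r = precision_recall(
    {"d1", "d2", "d3", "d4"},
    {"d1", "d2", "d3", "d5", "d6", "d7"},
)
print(p, r)
```

Note that recall can only be computed when the full set of relevant documents is known, which is rarely the case for a web search.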
Google “Mark Wasson”
Google “Mark Wasson” Results
• 57 references in Top 100 (April 22, 2004)
  – About me
  – My papers
  – My pictures
  – Conference programs and attendee lists
  – Citations of my papers
  – Links to my site and pictures
• Using the retrieval results
  – Need to know a lot about me to select, connect the 57
  – Must look at most of them to get a fairly complete profile
  – Must look at more than a few to get a solid introduction (unless you turn up a really good page early on)
Categorization and Indexing
Map documents to a taxonomy of topics
• Current state of the technology
  – State of the art at 90-95% accuracy (recall, precision)
  – Many at 80-85% accuracy
  – Often designed to work with human editors
  – Academic research community skeptical
• Big commercial applications
  – Inxight/Factiva
    • Machine learning technology/editorial hybrid
  – LexisNexis SmartIndexing
    • Knowledge-based approach
  – Thomson-West CaRE (used in West km)
    • Machine learning-based approach
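As a toy illustration of mapping documents to a taxonomy, here is a minimal term-list categorizer. It is only a sketch of the general knowledge-based idea; the taxonomy, the term lists and the `min_hits` cutoff are invented for this example and do not describe how any of the products named above work.

```python
# Hypothetical two-topic taxonomy; real term lists are far larger and
# usually carry weights, phrases and exclusions.
TAXONOMY = {
    "mergers": {"merger", "acquisition", "takeover"},
    "earnings": {"earnings", "profit", "revenue"},
}


def categorize(text, taxonomy=TAXONOMY, min_hits=2):
    """Assign every topic whose terms appear at least min_hits times."""
    tokens = [t.strip(".,;:!?") for t in text.lower().split()]
    labels = []
    for topic, terms in taxonomy.items():
        hits = sum(1 for t in tokens if t in terms)
        if hits >= min_hits:
            labels.append(topic)
    return labels


print(categorize("Acquisition talks stall as merger partners report weak earnings."))
```

Raising `min_hits` trades recall for precision, which is one reason different indexing systems end up with different recall and precision priorities.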
Categorization and Indexing Pros and Cons
• Pros
  – Creates sets of related documents
  – Higher accuracy (recall and precision)
  – With good organization and UI, can support ease of search, retrieval
• Cons
  – Coverage gaps
  – Incompatible scopes
  – Different recall, precision priorities
And you’re still dealing with documents
Clustering and Summarization
Statistical Document Clustering
• Find sets of potentially related documents
  – Create a feature representation for each document
    • Words, phrases, equivalences, variants, frequencies
    • Classifications
    • Publication attributes
  – Compare, score feature similarity
  – Cluster most similar documents together
• You’re still working with documents
  – Select most representative documents, one or more of those closest to a cluster’s centroid
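The clustering steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions: features are plain word counts, similarity is cosine similarity, clustering is a simple single-pass scheme, and the `threshold` value is arbitrary. None of this is claimed to be the method used by any system mentioned in the talk.

```python
import math
from collections import Counter


def features(text):
    """Feature representation: word -> frequency."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def cluster(docs, threshold=0.3):
    """Single-pass clustering: join the most similar cluster or start a new one."""
    clusters = []  # each cluster: {"members": [doc index], "centroid": Counter}
    for i, text in enumerate(docs):
        vec = features(text)
        best, best_sim = None, threshold
        for c in clusters:
            sim = cosine(vec, c["centroid"])
            if sim >= best_sim:
                best, best_sim = c, sim
        if best is None:
            clusters.append({"members": [i], "centroid": Counter(vec)})
        else:
            best["members"].append(i)
            # Summed counts point in the same direction as the mean vector,
            # which is all that cosine similarity needs.
            best["centroid"].update(vec)
    return clusters


def representative(docs, c):
    """Index of the member document closest to the cluster centroid."""
    return max(c["members"], key=lambda i: cosine(features(docs[i]), c["centroid"]))


docs = [
    "stocks fall as markets slide",
    "markets slide and stocks fall sharply",
    "team wins championship game",
    "championship game won by home team",
]
for c in cluster(docs):
    print(c["members"], "->", docs[representative(docs, c)])
```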
Clusters and Centroids
• Dots are documents
• Ovals are clusters
• Xs are centroids
Picture from CS5604 – Information Storage and Retrieval class notes, Ed Fox, Virginia Tech, http://ei.cs.vt.edu/~cs5604/
Google News
Google News
• Integrates information at the document level
  – Finds, retrieves, organizes, presents today’s news
  – Provides enough information for a good overview
  – Links are provided for those who want the details
• Beginning to go beyond documents
  – Sub-document
    • Headlines
    • Leading sentences
    • Pictures
  – Across documents
    • Story ranking based on cluster attributes
    • Representative documents are selected
The Information Unit
• Information takes lots of forms
  – Documents
  – Paragraphs
  – Sentences
  – Sentence fragments
  – Headlines, other document components
  – Tables
  – Databases
  – Directories
  – Lists
  – Facts
  – Ideas
  – Relationships (within, across documents)
Multidocument Summarization
• Identify related documents and create a single summary that captures their highlights
  – Document classification and clustering
  – Statistical sentence analysis
  – Extract key sentences, sentence fragments
  – Recombine the extracted information
  – Natural language analysis and generation to improve readability
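A minimal sketch of the statistical part of this pipeline follows: score sentences by how frequent their words are across the document set, extract the top-scoring ones, and skip sentences that mostly repeat one already chosen. The scoring formula, the `max_overlap` cutoff and the sample texts are assumptions made for illustration. The natural language analysis and generation step is not attempted here.

```python
import re
from collections import Counter


def sentences(text):
    """Naive sentence splitter on terminal punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def words(sentence):
    return re.findall(r"[a-z0-9']+", sentence.lower())


def summarize(docs, max_sentences=2, max_overlap=0.5):
    """Extract high-scoring, non-redundant sentences from related documents."""
    all_sents = [s for d in docs for s in sentences(d) if words(s)]
    freq = Counter(w for s in all_sents for w in words(s))

    def score(s):
        ws = words(s)
        return sum(freq[w] for w in ws) / len(ws)

    summary = []
    for s in sorted(all_sents, key=score, reverse=True):
        ws = set(words(s))
        # Jaccard word overlap with sentences already chosen.
        redundant = any(
            len(ws & set(words(t))) / len(ws | set(words(t))) > max_overlap
            for t in summary
        )
        if not redundant:
            summary.append(s)
        if len(summary) == max_sentences:
            break
    return summary


docs = [
    "The merger was approved today. Shares rose five percent.",
    "The merger was approved today by regulators. Analysts expect more deals.",
]
print(summarize(docs))
```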
Columbia Newsblaster Daily Page
Columbia Newsblaster Summary, Links
Extraction and Aggregation
Extraction and Aggregation
• Find related pieces of information across a document collection and package those pieces into a single information product
• Information can be spread across lots of sources
• Information can be found in lots of formats
• Information is not always explicitly linked
LexisNexis Company Dossiers
• Users want good information about companies
• Company information is found in numerous news, directory, financial, government, legal and other sources
  – Literally dozens of searches needed to find everything
• Company names are not always used consistently across sources
  – Need ability to create a common search key across content, e.g., normalized form of company names
• Information is presented in free text, lists, tables, databases and directory entry formats
  – Need ability to find and extract important information
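The idea of a normalized company name as a common search key can be sketched as follows. The suffix list and the normalization rules are hypothetical and deliberately small; this is not a description of how Company Dossier builds its keys.

```python
import re
from collections import defaultdict

# Hypothetical list of legal-form suffixes to drop; a production list
# would be much longer and language-aware.
SUFFIXES = {
    "inc", "incorporated", "corp", "corporation", "co", "company",
    "ltd", "limited", "llc", "plc",
}


def normalize_company(name):
    """Lowercase, strip punctuation, drop a leading 'the' and trailing suffixes."""
    tokens = re.sub(r"[^a-z0-9&]+", " ", name.lower()).split()
    if tokens and tokens[0] == "the":
        tokens = tokens[1:]
    while len(tokens) > 1 and tokens[-1] in SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


def group_by_key(mentions):
    """Group raw name mentions from different sources under one key."""
    groups = defaultdict(list)
    for mention in mentions:
        groups[normalize_company(mention)].append(mention)
    return dict(groups)


print(group_by_key([
    "The Coca-Cola Company",
    "Coca Cola Co., Inc.",
    "International Business Machines Corp.",
    "INTERNATIONAL BUSINESS MACHINES CORPORATION",
]))
```

Simple rules like these will not connect abbreviations or former names to the full name; those need a lookup table or other evidence.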
Company Dossier
Record Linkage
Record Linkage
• Record linkage techniques are used to connect related records when there is no explicit key
  – Data lacks explicit keys, such as ID numbers, normalized company names, etc.
  – Data lacks consistent features, such as unique names, presence of address or phone number, etc.
• Combine feature extraction and analysis
  – Identify, extract, normalize features as evidence
  – Compare features across records, looking for a preponderance of evidence of relatedness
  – Apply other heuristics, e.g., top-ranked, score threshold
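Comparing normalized features as evidence can be illustrated with a small scoring function. The feature names and weights are invented for this sketch; real systems typically estimate such weights from data and may also penalize disagreements.

```python
# Hypothetical weights: how much agreement on each feature counts as evidence.
WEIGHTS = {"last_name": 3.0, "first_name": 2.0, "city": 1.0, "phone": 4.0}


def normalize(value):
    """Keep only lowercase letters and digits so formatting differences vanish."""
    return "".join(ch for ch in value.lower() if ch.isalnum())


def evidence_score(a, b, weights=WEIGHTS):
    """Sum the weights of features present in both records that agree.

    A feature missing from either record contributes no evidence either way,
    since the data often lacks consistent features.
    """
    score = 0.0
    for feature, weight in weights.items():
        va, vb = a.get(feature), b.get(feature)
        if va and vb and normalize(va) == normalize(vb):
            score += weight
    return score


record_a = {"last_name": "O'Brien", "first_name": "Pat", "phone": "(555) 010-0000"}
record_b = {
    "last_name": "OBrien",
    "first_name": "Patrick",
    "city": "Seattle",
    "phone": "555-010-0000",
}
print(evidence_score(record_a, record_b))
```

Here the last name and phone number agree after normalization, the first names differ, and the city is missing from one record, so only two features count toward the score.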
Westlaw Profiler-related Research
• Users want background information on attorneys, judges and expert witnesses
• Information about attorneys and judges found in case law, jury verdicts, directories, etc.
• Information about expert witnesses found in jury verdicts, medical publications, news, websites, etc.
• People names are problematic
  – Many people with same names
  – Variation is common
• But the set of attorneys and judges is somewhat defined by directories
Westlaw Profiler-related Research (cont.)
• Link judges, attorneys between case law and West Legal Directory (Dozier & Haschart, 2000)
• Case law feature extraction
  – Find critical sections within cases
  – For each attorney, attempt to extract first name, middle name, last name, name suffix, firm name, city, state
  – For each judge, attempt to extract first name, middle name, last name, name suffix, court, date
  – Package features into Template Records
• West Legal Directory feature extraction
  – Extract similar features from directory entries for judges and attorneys
  – Package features into Biography Records
Westlaw Profiler-related Research (cont.)
• Match Template Records to Biography Records
  – Attempt to match normalized features between pairs of records to create a “match probability score”
  – For a given attorney or judge Template Record, the match to the Biography Record with the highest match probability score is likely the correct match
• Additional heuristics
  – The dates must be compatible
  – Highest match probability score must exceed threshold
  – No match is made if a tie score occurs
Westlaw Profiler-related Research (cont.)
• Attorney match accuracy
  – 99% precision, 92% recall
• Judge match accuracy
  – 98% precision, 90% recall
• Common causes of errors
  – Marriage-based name changes
  – Spelling errors in the data
  – Gaps in the directory, such as past positions
• See Dozier et al. (2003) for similar work with expert witness-related information
Analysis, Visualization and Discovery
From Integration to Exploration and Discovery
• Analytical, visualization and discovery tool uses
  – Summarize key information in a document set
  – Find and explain interesting facts, relationships and patterns in a document set
  – Discover previously unknown information
• Key components
  – Extract entities, co-occurrence patterns, subject-verb-object relationships
  – Coreference resolution, name variant linkage
  – Statistical analysis
  – Link analysis
  – Report generation tools
  – Data visualization tools
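One of the simpler components, co-occurrence patterns, can be sketched as counting how often pairs of entities appear in the same document. The sketch assumes entity extraction and name variant linkage have already been done, so each document is represented by a list of entity names; the names below are made up.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(doc_entities):
    """Count, for each unordered entity pair, the documents mentioning both."""
    pairs = Counter()
    for entities in doc_entities:
        # set() so repeated mentions within one document count once;
        # sorted() so each pair has a single canonical ordering.
        for a, b in combinations(sorted(set(entities)), 2):
            pairs[(a, b)] += 1
    return pairs


docs = [
    ["Acme", "Bolt Inc", "Jane Doe"],
    ["Acme", "Jane Doe", "Acme"],
    ["Bolt Inc"],
]
for pair, count in cooccurrence(docs).most_common():
    print(pair, count)
```

The resulting pair counts are the kind of edge weights that link analysis and graph-style visualizations start from.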
Insightful’s InFact Concept Graph
Example from Insightful website
ClearForest’s ClearResearch Relations Map
Example from ClearForest website
Closing Remarks
Closing Thoughts
“We have solved the information overload problem!”
• Content has exploded
  – Web: 0 pages → 1 billion pages → 6 billion pages?
  – Subscription services: Elsevier, Factiva, LexisNexis, Westlaw, lots of others
  – Deep web: 500 times bigger than surface web
• Even if we solve retrieval, classification, indexing
  – Amount of highly relevant material often overwhelming
Closing Thoughts
• Information integration is coming (some is here!)
  – Information retrieval
  – Document categorization and indexing
  – Document clustering
  – Entity identification
  – Information extraction
  – Relationship extraction
  – Information aggregation
  – Record linkage
  – Multidocument summarization
  – Analytical tools
  – Data visualization
  – Knowledge discovery
The End
Any questions?
Mark Wasson
http://www.emarkwasson.com
(206) 728-7109
Product and service names are trademarks or registered trademarks of their holders.
References and Related Materials
References and Related Materials
• ClearForest
  – ClearForest, http://www.clearforest.com
  – ClearResearch, http://www.clearforest.com/Products/Analytics/ClearResearch.asp
• Columbia
  – Columbia Natural Language Processing Group, http://www.cs.columbia.edu/nlp/
  – Columbia Newsblaster, http://newsblaster.cs.columbia.edu/
  – Schiffman et al. (2002). Experiments in Multidocument Summarization. 2002 Human Language Technology Conference.
  – McKeown et al. (2003). Columbia's Newsblaster: New Features and Future Directions. 2003 Human Language Technology-North American Association for Computational Linguistics Conference.
References and Related Materials
• Google
  – Google, http://www.google.com
  – Google News, http://news.google.com
• Insightful
  – Insightful, http://www.insightful.com
  – Insightful InFact, http://www.insightful.com/products/infact/
• Inxight
  – Inxight, http://www.inxight.com
  – Inxight classification, http://www.inxight.com/products/smartdiscovery/
  – Hersey (2003). Factiva Reaps Benefits from Automatic Text Classification – An End User Case Study. 3rd Workshop on Operational Text Classification Systems.
References and Related Materials
• LexisNexis
  – LexisNexis, http://www.lexisnexis.com
  – LexisNexis Company Dossier, http://www.lexisnexis.com/companydossier/
  – Wasson (2000). Large-scale Controlled Vocabulary Indexing for Named Entities. Language Technology Joint Conference: ANLP-NAACL 2000.
References and Related Materials
• Thomson-West
  – Thomson-West, http://west.thomson.com
  – Westlaw Profiler, http://west.thomson.com/store/product.asp?product%5Fid=Westlaw+Profiler&catalog%5Fname=wgstore
  – Dozier & Haschart (2000). Automatic Extraction and Linking of Person Names in Legal Text. RIAO-2000.
  – Dozier et al. (2003). Creation of an Expert Witness Database Through Text Mining. 9th International Conference on Artificial Intelligence and Law.
  – Dabney et al. (2003). West km 2.0 – Classifying Document Collections with CaRE. Thomson-West white paper.