Connecting the Docs: Integrating Information from Multiple Documents

Presentation to ASIS&T PNC Annual Meeting

Mark Wasson

Senior Architect, Research Scientist

LexisNexis New Technology Research

[email protected]

May 14, 2004

Talk Outline

• Introduction
• Search and retrieval, classification and indexing
• Clustering and summarization
• Extraction and aggregation
• Record linkage
• Analysis, visualization and discovery
• Closing remarks, Q&A
• References and related materials

Introduction

What is Information Integration?

• Pull together an appropriate amount of information about some subject matter (company, person, topic, product, event, etc.) into a single information product

• Key steps
  – Target some subject matter
  – Find relevant information across all relevant sources
  – Focus on the particularly useful information
  – Connect information about the target found in different documents, sources
  – Eliminate redundant information
  – Package the information

Search and Retrieval, Classification and Indexing

Search and Retrieval

• Search basics
  – Choose sources, search tools
  – Formulate query
  – Submit search
  – Review results
  – Refine and repeat as appropriate

• The result is generally a set of documents

Search and Retrieval

• Accuracy – all over the place
  – Recall (completeness)
  – Precision (correctness)

• What impacts results?
  – What you are searching for
  – Ambiguity, synonymy, variants
  – Source size and focus
  – Search functionality
  – Search engine algorithms, coverage
  – Data annotations and enhancements
  – Searcher’s skills, knowledge of the topic

• User still must analyze search results
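Recall and precision, as used on this slide, can be made concrete with a small calculation. The sketch below is illustrative only: the function name and the document IDs are hypothetical, and it assumes the set of truly relevant documents is known, which in practice it rarely is.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one search result set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were actually retrieved
    # Precision (correctness): share of retrieved documents that are relevant.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    # Recall (completeness): share of relevant documents that were retrieved.
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical document IDs, for illustration only.
retrieved_ids = ["d1", "d2", "d3", "d4"]
relevant_ids = ["d2", "d4", "d5"]
precision, recall = precision_recall(retrieved_ids, relevant_ids)
```

Here two of the four retrieved documents are relevant (precision 0.5), and two of the three relevant documents were found (recall about 0.67).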

Google “Mark Wasson”

Google “Mark Wasson” Results

• 57 references in Top 100 (April 22, 2004)
  – About me
  – My papers
  – My pictures
  – Conference programs and attendees lists
  – Cites to my papers
  – Links to my site and pictures

• Using the retrieval results
  – Need to know a lot about me to select, connect the 57
  – Look at most to get a fairly complete profile
  – Look at more than a few to get a solid introduction (unless you turn up a really good page early on)

Categorization and Indexing

Map documents to a taxonomy of topics

• Current state of the technology
  – State of the art at 90-95% accuracy (recall, precision)
  – Many at 80-85% accuracy
  – Often designed to work with human editors
  – Academic research community skeptical

• Big commercial applications
  – Inxight/Factiva
    • Machine learning technology/editorial hybrid
  – LexisNexis SmartIndexing
    • Knowledge-based approach
  – Thomson-West CaRE (used in West km)
    • Machine learning-based approach

Categorization and Indexing Pros and Cons

• Pros
  – Creates sets of related documents
  – Higher accuracy (recall and precision)
  – With good organization and UI, can support ease of search, retrieval

• Cons
  – Coverage gaps
  – Incompatible scopes
  – Different recall, precision priorities

And you’re still dealing with documents

Clustering and Summarization

Statistical Document Clustering

• Find sets of potentially related documents
  – Create a feature representation for each document
    • Words, phrases, equivalences, variants, frequencies
    • Classifications
    • Publication attributes
  – Compare, score feature similarity
  – Cluster most similar documents together

• You’re still working with documents
  – Select most representative documents, one or more of those closest to a cluster’s centroid
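The clustering steps on this slide can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the method of any system named in the talk: features are plain word counts, similarity is cosine similarity, clustering is a single greedy pass, and the 0.3 threshold and the sample headlines are invented for the example.

```python
import math
from collections import Counter


def features(text):
    """Bag-of-words term frequencies (a deliberately simple feature representation)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Average of the vectors: the 'X' in a cluster picture."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return Counter({t: c / len(vectors) for t, c in total.items()})


def cluster(docs, threshold=0.3):
    """Single pass: join the most similar existing cluster, or start a new one."""
    vecs = [features(d) for d in docs]
    clusters = []  # each cluster is a list of document indexes
    for i, v in enumerate(vecs):
        best, best_sim = None, threshold
        for members in clusters:
            sim = cosine(v, centroid([vecs[j] for j in members]))
            if sim >= best_sim:
                best, best_sim = members, sim
        if best is None:
            clusters.append([i])
        else:
            best.append(i)
    return clusters, vecs


def representative(members, vecs):
    """Pick the document closest to the cluster's centroid."""
    c = centroid([vecs[j] for j in members])
    return max(members, key=lambda j: cosine(vecs[j], c))


# Invented headlines, for illustration only.
docs = [
    "stocks fall as markets react to rate hike",
    "markets fall after rate hike surprises investors",
    "local team wins championship game",
    "championship game won by local team in overtime",
]
clusters, vecs = cluster(docs)
reps = [representative(members, vecs) for members in clusters]
```

With these four headlines the two finance stories and the two sports stories end up in separate clusters, and `reps` holds one representative document index per cluster. Note that the output is still a set of documents, which is the slide's point.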

Clusters and Centroids

• Dots are documents
• Ovals are clusters
• Xs are centroids

Picture from CS5604 – Information Storage and Retrieval class notes, Ed Fox, Virginia Tech, http://ei.cs.vt.edu/~cs5604/

Google News

Google News

• Integrates information at the document level
  – Finds, retrieves, organizes, presents today’s news
  – Enough info is provided to give a nice overview
  – Links are provided for those who want the details

• Beginning to go beyond documents
  – Sub-document
    • Headlines
    • Leading sentences
    • Pictures
  – Across documents
    • Story ranking based on cluster attributes
    • Representative documents are selected

The Information Unit

• Information takes lots of forms
  – Documents
  – Paragraphs
  – Sentences
  – Sentence fragments
  – Headlines, other document components
  – Tables
  – Databases
  – Directories
  – Lists
  – Facts
  – Ideas
  – Relationships (within, across documents)

Multidocument Summarization

• Identify related documents and create a single summary that captures their highlights
  – Document classification and clustering
  – Statistical sentence analysis
  – Extract key sentences, sentence fragments
  – Recombine the extracted information
  – Natural language analysis and generation to improve readability
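The statistical sentence analysis and key-sentence extraction steps can be illustrated with a toy extractive summarizer. This is only a sketch under simple assumptions, not how Columbia Newsblaster or any other system works: sentences are scored by how often their words recur across the document set, near-duplicate sentences are dropped using word overlap, and no language generation is attempted. All names, thresholds and sample texts are hypothetical.

```python
import re
from collections import Counter


def split_sentences(text):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def words(sentence):
    return re.findall(r"[a-z']+", sentence.lower())


def summarize(documents, max_sentences=2, max_overlap=0.6):
    """Pick high-scoring sentences from related documents, skipping redundant ones."""
    sentences = [s for d in documents for s in split_sentences(d) if words(s)]
    # How many sentences each word appears in, across the whole document set.
    freq = Counter(w for s in sentences for w in set(words(s)))

    def score(s):
        ws = words(s)
        return sum(freq[w] for w in ws) / len(ws)

    summary = []
    for s in sorted(sentences, key=score, reverse=True):
        ws = set(words(s))
        # Jaccard word overlap with sentences already chosen.
        redundant = any(
            len(ws & set(words(t))) / len(ws | set(words(t))) > max_overlap
            for t in summary
        )
        if not redundant:
            summary.append(s)
        if len(summary) == max_sentences:
            break
    return summary


# Invented texts, for illustration only.
documents = [
    "The company reported record profits. Shares rose sharply.",
    "The company reported record profits. Analysts were surprised.",
]
summary = summarize(documents)
```

The sentence that both documents share scores highest and is kept once; its duplicate is treated as redundant and skipped.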

Columbia Newsblaster Daily Page

Columbia Newsblaster Summary, Links

Extraction and Aggregation

Extraction and Aggregation

• Find related pieces of information across a document collection and package those pieces into a single information product

• Information can be spread across lots of sources
• Information can be found in lots of formats
• Information is not always explicitly linked

LexisNexis Company Dossiers

• Users want good information about companies

• Company information is found in numerous news, directory, financial, government, legal and other sources
  – Literally dozens of searches needed to find everything

• Company names are not always used consistently across sources
  – Need ability to create a common search key across content, e.g., normalized form of company names

• Information is presented in free text, lists, tables, databases and directory entry formats
  – Need ability to find and extract important information
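A normalized company name used as a common search key might look like the sketch below. It is an illustration only and does not describe how Company Dossier works: the suffix list, the function name and the sample records are all hypothetical, and simple normalization like this does not resolve abbreviations, former names or subsidiaries.

```python
import re
from collections import defaultdict

# Hypothetical suffix list; a production list would be much longer.
LEGAL_SUFFIXES = {
    "inc", "incorporated", "corp", "corporation",
    "co", "company", "ltd", "limited", "llc", "plc",
}


def normalize_company_name(name):
    """Lowercase, strip punctuation, and drop trailing legal-form words."""
    cleaned = re.sub(r"[^a-z0-9& ]", " ", name.lower().replace(".", ""))
    tokens = cleaned.split()
    while tokens and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


# Invented (source, company name) records, for illustration only.
records = [
    ("news", "Acme Widgets, Inc."),
    ("directory", "ACME WIDGETS INCORPORATED"),
    ("legal", "Acme Widgets Co., Ltd."),
    ("news", "Zenith Tools LLC"),
]

# Group records from different sources under the shared key.
index = defaultdict(list)
for source, name in records:
    index[normalize_company_name(name)].append((source, name))
```

All three spellings of the invented Acme Widgets name fall under the single key `acme widgets`, so one lookup reaches records from the news, directory and legal sources.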

Company Dossier

Company Dossier (cont.)

Company Dossier (cont.)

Company Dossier (cont.)

Company Dossier (cont.)

Record Linkage

Record Linkage

• Record linkage techniques are used to connect related records when there is no explicit key
  – Data lacks explicit keys, such as ID numbers, normalized company names, etc.
  – Data lacks consistent features, such as unique names, presence of address or phone number, etc.

• Combine feature extraction and analysis
  – Identify, extract, normalize features as evidence
  – Compare features across records, looking for a preponderance of evidence of relatedness
  – Apply other heuristics, e.g., top-ranked, score threshold
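The idea of comparing normalized features and weighing the evidence can be sketched as a simple additive score. This is an assumption-laden illustration, not a published algorithm: the feature names and weights are hypothetical, agreement adds a feature's weight, disagreement subtracts it, and a feature missing from either record counts as no evidence either way.

```python
def normalize(value):
    """Lowercase, drop periods and commas, and collapse whitespace."""
    return " ".join(str(value).lower().replace(".", "").replace(",", " ").split())


# Hypothetical feature weights: more specific features count for more.
WEIGHTS = {"last_name": 3.0, "first_name": 2.0, "city": 1.0, "phone": 4.0}


def evidence_score(rec_a, rec_b, weights=WEIGHTS):
    """Sum the evidence for and against two records describing the same entity."""
    score = 0.0
    for feature, weight in weights.items():
        a, b = rec_a.get(feature), rec_b.get(feature)
        if not a or not b:
            continue  # a missing feature is neither evidence for nor against
        score += weight if normalize(a) == normalize(b) else -weight
    return score


# Invented records, for illustration only.
record_a = {"first_name": "Jane", "last_name": "Doe", "city": "Seattle"}
record_b = {"first_name": "JANE", "last_name": "Doe,", "city": "Seattle", "phone": "555-0100"}
record_c = {"first_name": "John", "last_name": "Doe", "city": "Portland"}

score_ab = evidence_score(record_a, record_b)  # all shared features agree
score_ac = evidence_score(record_a, record_c)  # only the last name agrees
```

A score threshold or top-ranked rule would then decide which pairs are actually linked.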

Westlaw Profiler-related Research

• Users want background information on attorneys, judges and expert witnesses

• Information about attorneys and judges found in case law, jury verdicts, directories, etc.

• Information about expert witnesses found in jury verdicts, medical publications, news, websites, etc.

• People names are problematic
  – Many people with same names
  – Variation is common

• But set of attorneys, judges is somewhat defined by directories.

Westlaw Profiler-related Research (cont.)

• Link judges, attorneys between case law and West Legal Directory (Dozier & Haschart, 2000)

• Case law feature extraction
  – Find critical sections within cases
  – For each attorney, attempt to extract first name, middle name, last name, name suffix, firm name, city, state
  – For each judge, attempt to extract first name, middle name, last name, name suffix, court, date
  – Package features into Template Records

• West Legal Directory feature extraction
  – Extract similar features from directory entries for judges and attorneys
  – Package features into Biography Records

Westlaw Profiler-related Research (cont.)

• Match Template Records to Biography Records
  – Attempt to match normalized features between pairs of records to create a “match probability score”
  – For a given attorney or judge Template Record, the Biography Record with the highest match probability score is the likely correct match

• Additional heuristics
  – The dates must be compatible
  – Highest match probability score must exceed threshold
  – No match is made if a tie score occurs
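The selection rules on this slide (compatible dates, highest score above a threshold, no match on a tie) can be expressed as a short function. This sketch only illustrates those three rules. The slide does not say how the match probability score is computed or what threshold is used, so the scores are taken as given, the 0.8 threshold is arbitrary, and filtering on dates before ranking is an assumption.

```python
def choose_match(candidates, threshold=0.8):
    """Pick a Biography Record for one Template Record, or None.

    candidates: list of (biography_id, score, dates_compatible) tuples,
    where score is a precomputed match probability score.
    """
    # Rule 1: the dates must be compatible.
    eligible = [(bio_id, score) for bio_id, score, dates_ok in candidates if dates_ok]
    if not eligible:
        return None
    eligible.sort(key=lambda pair: pair[1], reverse=True)
    best_id, best_score = eligible[0]
    # Rule 2: the highest score must exceed the threshold.
    if best_score <= threshold:
        return None
    # Rule 3: no match is made if the top score is tied.
    if len(eligible) > 1 and eligible[1][1] == best_score:
        return None
    return best_id


# Invented candidates, for illustration only.
clear_winner = choose_match([("bio-1", 0.95, True), ("bio-2", 0.60, True)])
tied = choose_match([("bio-1", 0.95, True), ("bio-2", 0.95, True)])
too_weak = choose_match([("bio-1", 0.70, True)])
bad_dates = choose_match([("bio-1", 0.99, False), ("bio-2", 0.90, True)])
```

Declining to match on ties and weak scores trades recall for precision, which is consistent with the higher precision than recall figures reported for this work.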

Westlaw Profiler-related Research (cont.)

• Attorney match accuracy
  – 99% precision, 92% recall

• Judge match accuracy
  – 98% precision, 90% recall

• Common causes of errors
  – Marriage-based name changes
  – Spelling errors in the data
  – Gaps in the directory, such as past positions

• See Dozier et al. (2003) for similar work with expert witness-related information

Analysis, Visualization and Discovery

From Integration to Exploration and Discovery

• Analytical, visualization and discovery tool uses
  – Summarize key information in a document set
  – Find and explain interesting facts, relationships and patterns in a document set
  – Discover previously unknown information

• Key components
  – Extract entities, co-occurrence patterns, subject-verb-object relationships
  – Coreference resolution, name variant linkage
  – Statistical analysis
  – Link analysis
  – Report generation tools
  – Data visualization tools
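Co-occurrence patterns, one of the key components listed here, are often the raw material for link analysis and relationship maps. The sketch below is a minimal illustration and does not describe InFact, ClearResearch or any other product: it assumes entities have already been extracted and normalized per document, and the entity names are invented.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(doc_entities):
    """Count how many documents mention each pair of entities together.

    doc_entities: one collection of already-extracted entity names per document.
    """
    pairs = Counter()
    for entities in doc_entities:
        # Sort so each pair has one canonical order; use a set to count once per document.
        for a, b in combinations(sorted(set(entities)), 2):
            pairs[(a, b)] += 1
    return pairs


# Invented entity lists, for illustration only.
doc_entities = [
    ["Acme", "Zenith", "Jane Doe"],
    ["Acme", "Jane Doe"],
    ["Zenith"],
]
pairs = cooccurrence_counts(doc_entities)
strongest = pairs.most_common(1)[0]
```

The pair counts can be read as edge weights in a graph of entities, which is the kind of structure a relations map visualizes.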

Insightful’s InFact Concept Graph

Example from Insightful website

ClearForest’s ClearResearch Relations Map

Example from ClearForest website

Closing Remarks

Closing Thoughts

“We have solved the information overload problem!”

• Content has exploded
  – Web: 0 pages > 1 billion pages > 6 billion pages?
  – Subscription services: Elsevier, Factiva, LexisNexis, Westlaw, lots of others
  – Deep web: 500 times bigger than surface web

• Even if we solve retrieval, classification, indexing
  – Amount of highly relevant material often overwhelming

Closing Thoughts

• Information integration is coming (some is here!)
  – Information retrieval
  – Document categorization and indexing
  – Document clustering
  – Entity identification
  – Information extraction
  – Relationship extraction
  – Information aggregation
  – Record linkage
  – Multidocument summarization
  – Analytical tools
  – Data visualization
  – Knowledge discovery

The End

Any questions?

Mark Wasson

[email protected]

http://www.emarkwasson.com

(206) 728-7109

Product and service names are trademarks or registered trademarks of their holders.

References and Related Materials

References and Related Materials

• ClearForest
  – ClearForest, http://www.clearforest.com
  – ClearResearch, http://www.clearforest.com/Products/Analytics/ClearResearch.asp

• Columbia
  – Columbia Natural Language Processing Group, http://www.cs.columbia.edu/nlp/
  – Columbia Newsblaster, http://newsblaster.cs.columbia.edu/
  – Schiffman et al. (2002). Experiments in Multidocument Summarization. 2002 Human Language Technology Conference.
  – McKeown et al. (2003). Columbia's Newsblaster: New Features and Future Directions. 2003 Human Language Technology-North American Association for Computational Linguistics Conference.

References and Related Materials

• Google
  – Google, http://www.google.com
  – Google News, http://news.google.com

• Insightful
  – Insightful, http://www.insightful.com
  – Insightful InFact, http://www.insightful.com/products/infact/

• Inxight
  – Inxight, http://www.inxight.com
  – Inxight classification, http://www.inxight.com/products/smartdiscovery/
  – Hersey (2003). Factiva Reaps Benefits from Automatic Text Classification – An End User Case Study. 3rd Workshop on Operational Text Classification Systems.

References and Related Materials

• LexisNexis
  – LexisNexis, http://www.lexisnexis.com
  – LexisNexis Company Dossier, http://www.lexisnexis.com/companydossier/
  – Wasson (2000). Large-scale Controlled Vocabulary Indexing for Named Entities. Language Technology Joint Conference: ANLP-NAACL 2000.

References and Related Materials

• Thomson-West
  – Thomson-West, http://west.thomson.com
  – Westlaw Profiler, http://west.thomson.com/store/product.asp?product%5Fid=Westlaw+Profiler&catalog%5Fname=wgstore
  – Dozier & Haschart (2000). Automatic Extraction and Linking of Person Names in Legal Text. RIAO-2000.
  – Dozier et al. (2003). Creation of an Expert Witness Database Through Text Mining. 9th International Conference on Artificial Intelligence and Law.
  – Dabney et al. (2003). West km 2.0 – Classifying Document Collections with CaRE. Thomson-West white paper.