Text Mining & Visualization

Impressions of emerging capabilities

Cynthia Barcelon-Yang (speaker)

Yun Yun Yang (speaker)

Lucy Akers

Bristol-Myers Squibb

2007 PIUG Northeast Conference

New Brunswick, New Jersey

• Introduction – Text Mining & Visualization

• Overview of Text Mining Tools

■ Capabilities

■ Data Sources

■ Results

■ Strengths

• Summary

Why do we need a tool to do text mining?

Welcome to the age of too much information...

Typical questions asked of IP Operations

Often, the IP Operations group within an organization provides centralized support to a wide range of business units, and is responsible for answering questions such as the following:

• How many patents do we have concerning technology ‘x’?

• How does our portfolio compare with company ‘ABC’?

• Who is citing our portfolio?

• Which patents does business unit ‘xyz’ own?

• Which patents should we divest as a result of selling division XYZ?

• How do our invention disclosures compare with current granted patents?

• How do we improve our patent operations?

What is text mining?

(according to Marti Hearst of UC Berkeley School of Information)

■ The discovery of new, previously unknown information, by automatically extracting information from different written resources.

■ A variation on a field called data mining, that tries to find interesting patterns from large databases.

■ Many researchers think it will require a full simulation of how the mind works before we can write programs that read the way people do.

■ Related field: computational linguistics (also known as natural language processing)

■ Hearst distinguishes between "real" text mining, that discovers new pieces of knowledge, and approaches that find overall trends in textual data.

Text Mining Process

Courtesy of: Invention Machine Corp.

Common Tasks

• List generation (can be displayed as histograms)

• List cleanup and grouping of concepts

• Co-occurrence matrices and other graphing

• Clustering, categorization, grouping and extraction of text

• Mapping document clusters or concepts

• Adding temporal components to maps

• Citation analysis

• Subject/Action/Object (SAO) functions (a.k.a. NLP)

• Federated searching, e.g. on the Internet or intranets
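Two of the tasks above, list generation and co-occurrence matrices, can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's implementation; the three sample records, the field names, and both helper functions are invented for the sketch.

```python
from collections import Counter
from itertools import combinations

# Hypothetical records: each has an assignee and a set of index terms.
records = [
    {"assignee": "Company A", "terms": {"kinase", "inhibitor", "oncology"}},
    {"assignee": "Company B", "terms": {"kinase", "assay"}},
    {"assignee": "Company A", "terms": {"inhibitor", "kinase"}},
]

def generate_list(records, field):
    """List generation: count how often each value of a field occurs."""
    return Counter(r[field] for r in records)

def cooccurrence(records):
    """Count how often each unordered pair of terms appears in the same record."""
    pairs = Counter()
    for r in records:
        for a, b in combinations(sorted(r["terms"]), 2):
            pairs[(a, b)] += 1
    return pairs

assignee_counts = generate_list(records, "assignee")
term_pairs = cooccurrence(records)

# A text histogram of the generated list.
for name, n in assignee_counts.most_common():
    print(f"{name:10s} {'#' * n}")
```

The pair counts in `term_pairs` are the cells of a co-occurrence matrix; a tool would typically render them as a grid or a graph rather than a dictionary.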

Project Planning

■ Phase I

► Literature searches, key references, brainstorming of text/data mining & visualization

►Identify potential tools to evaluate

►Vendor onsite demonstrations

► Summary of initial tool evaluations

■ Phase II

►Pilot selected tools

► Identify potential client groups and interview representative clients

Investigation & Process Approach

■ Scout the literature/internet sources & brainstorm

■ Benchmark

■ “Patinformatics – Tools and Tasks” by Tony Trippe, World Patent Information 25 (2003) 211–221

■ “Data Visualization Tools - A Perspective from the Pharmaceutical Industry” by Jeannette Eldridge, World Patent Information 28 (2006) 43–49

■ Vendor demos

Tools Initially Identified

AnaVist

Anacubis

Aureka

Bioalma

BizInt

ClearForest

Delphion

Entrieva (Semio)

GoldFire

Inxight

M-CAM

Matheo Patent

OmniViz

PatAnalyst

Quosa

Technology Watch

Temis

VantagePoint

Vivisimo

Wisdomain

Wistract

Vendor Tool Demonstrations

1. Quosa

2. Inxight

3. PatAnalyst

4. OmniViz

5. Temis

6. Aureka

7. Wisdomain

8. GoldFire

9. VantagePoint

10. ClearForest

11. M-CAM

12. RefViz

* Overview of Vendor Tools

• Type of Tool

• Capabilities

• Data Sources

• Results

• Strengths

• Summary

* Text mining tool slides are provided courtesy of the vendors.

Text Mining Capabilities

• Keyword Analysis

■ Extracting nouns or noun phrases from text without understanding their meaning or relationships, and without counting the number of times the nouns appear

• Statistical Analysis

■ Frequency-based analysis – counting the number of times a word appears in the text

• Linguistic Analysis

■ Natural language processing (NLP) – “Trained Agent”

■ Semantic analysis
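Frequency-based analysis, the simplest of the capabilities above, can be sketched as follows. This is an illustrative example only: the stop-word list, the sample sentence, and the function name are assumptions, and the code does no noun-phrase detection or semantic analysis.

```python
import re
from collections import Counter

# A tiny illustrative stop-word list; real tools ship much larger ones.
STOP_WORDS = {"a", "an", "the", "of", "and", "for", "in", "is", "to", "with"}

def term_frequencies(text):
    """Lower-case the text, split it into words, and count the non-stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)

sample = "A pump for fuel. The fuel pump is coupled to the fuel tank."
freq = term_frequencies(sample)
print(freq.most_common(2))  # [('fuel', 3), ('pump', 2)]
```

Counting tells you which words dominate a text but nothing about how they relate; that gap is what the linguistic and semantic approaches try to close.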

Text Mining Data Sources

■ Unstructured text

► Full-text documents, emails

■ Structured text

► Database records, such as records from STN or PubMed

■ Hybrid content

► Patents: the front page is structured, the body text is not

Data Sources

I. General Data Sources (Unstructured):

ClearForest

GoldFire Innovator

Inxight

OmniViz

Temis

II. Bibliographic Data Sources (Structured):

Quosa

RefViz

VantagePoint

III. Patent-Focused (Hybrid):

Aureka

M-CAM

PatAnalyst

Wisdomain

Evaluation Template

• Type of Tool

■ Text mining software tool

■ Database content provider

■ Both

• Capabilities

■ Keyword analysis

■ Statistical analysis

■ Linguistic analysis

• Data Sources

■ Structured bibliographic data sources

■ Unstructured sources – full-text web, email, corporate repositories, etc.

■ Hybrid sources – patents, combination of structured/unstructured

• Results

■ Lists of documents

■ Tables

■ Charts/Graphs

■ Maps

• Strengths – Disclaimer: Our impressions only!

• Summary

GoldFire Innovator

• Type of tool – text mining tool

• Technology – Semantic Analysis

• Data Sources

■ Unstructured information from personal data, corporate data, deep web, content, patents, internet

► 15 million worldwide patents

► Database of over 8000 scientific effects

► 3000 cross-disciplinary scientific deep web websites

• Results

■ Static categorization of key concepts

■ Accurate answers to questions

■ Dynamic document summarization

GoldFire Innovator - Strengths

• Precision retrieval of targeted R&D content

► Retrieves information from context – semantic indexing

► Automated summaries and categorization

► Relevance filtering and ranking

• Using natural language query to search

► Ask the right questions - How to dry paper? How to balance diets?

• Innovation Trend Analysis

► Competitive analysis

► Technology analysis

► Patent relationship analysis – citation analysis

Inxight

• Type of Tool

■ Text mining software tool

• Capability

■ Natural Language Processing

■ Contextual extractions (leaning towards semantic analysis)

• Data Source

■ Unstructured text from websites, internal repositories, full-text documents

■ Documents have to be pre-processed to extract metadata and identify entity types

• Results

■ Hierarchical categorization

Inxight - Strengths

• Federated search capability

• Claims to have more accuracy than a human reader

• Software can work in 32 languages and can understand 27 entity types

• Can process 1.2 gigabytes per hour

• Claims to have the most powerful linguistic algorithms in the field

Temis

• Type of tool

■ Text Mining Solutions - software

• Capability

■ Natural language processing

► Insight Discoverer™ Extractor – information extraction server powered by XeLDA and used with specialized Skill Cartridges

► Insight Discoverer™ Categorizer – document categorization server

► Insight Discoverer™ Clusterer – automated classification server

► XeLDA – multilingual linguistic engine for natural language processing

► Skill Cartridge – a set of customizable knowledge components that define the information to be extracted. The two major knowledge components are multilingual dictionaries and multilingual extraction rules (which establish relationships between defined concepts).

Skill Cartridge Overview

• Open architecture

■ Plug & Play annotation components

■ Each defines areas of interest & extraction rules

■ Extraction rules describe the sentence structure that characterizes a concept

Figure: Insight Discoverer™ Extractor. Text (any kind, any format) is processed by XeLDA™ into words, and plug & play Skill Cartridges™ map the words to concepts (any concept). Two example cartridges are shown:

■ Merger & Acquisition: meaning = acquisition (target & buyer, amount & date, ...)

■ Positive & Negative Sentiment Analysis: meaning = satisfaction (people, companies, products; satisfaction; support; ...)
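The idea of a cartridge as a dictionary plus extraction rules can be sketched with a regular expression. The example below is a toy illustration, not the Temis or XeLDA API: the verb list, the rule pattern, the function name, and the sample sentence are all assumptions made for this sketch.

```python
import re

# Hypothetical "cartridge": a small dictionary of trigger verbs plus one
# extraction rule that describes the sentence structure of an acquisition.
ACQUISITION_VERBS = ["acquires", "acquired", "buys", "bought"]

# Rule: <Buyer> <verb> <Target> for <Amount>
RULE = re.compile(
    r"(?P<buyer>[A-Z]\w+) (?:" + "|".join(ACQUISITION_VERBS) + r") "
    r"(?P<target>[A-Z]\w+) for (?P<amount>\$[\d.]+ (?:million|billion))"
)

def extract_acquisitions(text):
    """Return one dict (buyer, target, amount) per match of the rule."""
    return [m.groupdict() for m in RULE.finditer(text)]

text = "Alpha acquired Beta for $2.5 billion. Gamma praised the deal."
print(extract_acquisitions(text))
```

Swapping in a different dictionary and rule set (for example, sentiment terms) changes what is extracted without touching the surrounding code, which is the plug & play point the slide makes.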

Temis

• Data Sources

■ Any kind, any format: internal & external data, documents, literature, patents, clinical trials, chemistry and biology, bioinformatics, internet, email, etc.

• Results

■ Clusters, rankings, lists to discover information trends and relationships

Temis - Strengths

• Searching by concepts

► Selecting concepts from concept tree

• Specialized Skill Cartridges

► General Skill Cartridges

– Analytics

– Text Mining 360°

– Competitive Intelligence

– Human Resources Management

► Life science Skill Cartridges

– Biological Entity Relationships – best selling

– Medical Entity Relationships

– Chemical Entity Relationships

– Competitive Intelligence Life Sciences Edition

Temis - Strengths

• Strong extraction, categorization, and clustering capabilities

• Robust XeLDA linguistic engine

• Quick trend analysis

• Chemical Document Browser – specialized extraction module for translating chemical substance nomenclature to chemical structures

OmniViz

• Type of tool

■ Visual-based data/text mining software

• Capability

■ Algorithm-based statistical analysis, not semantics

• Data source/type

■ Numeric, text, categorical, chemical structures, sequence, structured/unstructured text

• Results

■ Interactive visualization maps such as CoMet, Correlation, Galaxy Proximity, etc.

OmniViz - Strengths

■ Interactive visualizations

■ Supports analysis of large amounts of data (millions of documents) - numeric, categorical and full-text analysis, including patents.

■ Broad applications including gene expression, sequence & pathway analysis, chemical structures, cheminformatics, clinical trial, patent analysis, diagnosis and treatment, legal, marketing data, regulatory compliance, intelligence analysis, etc.

■ Flexible data import and merge capabilities

ClearForest

• Type of Tool

■ Text mining tool (text analytics solution)

• Capability

■ Semantic analysis/NLP

• Data Sources

■ Unstructured text – websites

■ Patents

■ Internal documents

■ Metadata

• Results

■ Structured data entities

■ List of potential solutions for identified issues

■ Visualization tools – trend graphs, category maps

► Color and font are used to show intensity of relationships

ClearForest

Text Analytics: How it Works

Figure (flow diagram). Components shown: input documents (Text, Word, Excel, Email, WWW, PDF) and database text fields; the Tagging Platform, which performs extraction across records, including domain-specific entities & relationships; output as XML and to a database (DB); unified analysis; and role-based interfaces.

Example of unstructured text tagged into structured fields:

Part | Problem | Condition

Fuel Pump | Fails | Corroded

Pump Relay | Shorts | Cold weather

Headlight | Fails | Running hot

Engine | Stalls | At low speeds

XML output for the first row:

<PartProblemCondition>

<Part> Fuel Pump </Part>

<Problem> Fails </Problem>

<Condition> Corroded </Condition>

</PartProblemCondition>
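The tagged XML record shown on this slide is what makes downstream analysis possible: once text has been turned into fields, it can be loaded like any other structured data. The sketch below uses Python's standard `xml.etree.ElementTree` to do that; the `<Records>` wrapper element and the `to_rows` helper are assumptions added for the example, not part of ClearForest's output format.

```python
import xml.etree.ElementTree as ET

# The record from the slide, wrapped in a root element invented here
# so that several records could be parsed from one string.
xml_output = """
<Records>
  <PartProblemCondition>
    <Part> Fuel Pump </Part>
    <Problem> Fails </Problem>
    <Condition> Corroded </Condition>
  </PartProblemCondition>
</Records>
"""

def to_rows(xml_text):
    """Turn each PartProblemCondition element into a dict of trimmed field values."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    for rec in root.iter("PartProblemCondition"):
        rows.append({child.tag: child.text.strip() for child in rec})
    return rows

rows = to_rows(xml_output)
print(rows)
```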

ClearForest

Packaged Extraction Modules: Patents

Inputs

■ U.S. Patent Search

■ MicroPatent Search

■ Database fields

■ Text, Word, Excel, etc.

Outputs

■ Structured data entities: Agent, Application Number, Assignee, Assignee Address, Examiner, Filing Date, Inventor, Inventor Address, IPC, Issue Date, Number of Claims, Patent Citations, Patent Number, US Class

■ Entities: Claim Element, Claim Invention, Extracted Terms, Invention Terms, Measurement Terms, Number of Claims, Patent Section, Problem Solved Terms, Problems Solved, Process Technology Terms, Technology Terms

ClearForest - Strengths

• Can be applied to a wide range of applications, as evidenced by the wide variety of available extraction modules

■ Security/intelligence gathering

■ Product/customer information

■ Corporate/People profiles

■ Patents

■ Biomedical entities

• Analytics tool can discover unexpected relationships between entities that would not otherwise have been uncovered by standard, manual methods.

VantagePoint

• Type of tool

■ Text mining software mainly used for technology assessment and company profiling

• Capability

■ Uses pattern matching, rule-based, and natural language processing techniques

• Data Sources

■ Works best with structured data - text data from bibliographic databases

• Results

■ Summaries, charts, matrices, maps, and graphs

VantagePoint - Key Features

• Rapid navigation in large abstract collections

• Helps find relationships within your data

• Visually displays relationships

• Buckets documents to help in categorization

• Utilities for cleaning data

• User-created thesauri for reducing data

• Scripting capabilities to automate knowledge-gathering

• Easily exports output to other applications

• Can be configured to text mine most forms of structured bibliographic data

VantagePoint - Strengths

• List creation and cleanup

■ Patent assignee, author, inventor

■ Pre-built IPC, user-created thesauri

• Analytical tool box

■ Rapid navigation in large abstract collections to answer who, where, what, when but not how and why

■ Visually displays relationships

• Scripting capabilities to automate knowledge-gathering

■ Configure to extract from structured databases
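List cleanup with a user-created thesaurus amounts to mapping name variants onto one preferred form before counting. The sketch below shows the idea in plain Python; it is not VantagePoint's thesaurus format or scripting interface, and the thesaurus entries, sample assignee strings, and `normalize` helper are invented for illustration.

```python
from collections import Counter

# Hypothetical user-created thesaurus: variant spelling -> preferred name.
THESAURUS = {
    "bristol myers squibb co": "Bristol-Myers Squibb",
    "bristol-myers squibb company": "Bristol-Myers Squibb",
    "bms": "Bristol-Myers Squibb",
}

def normalize(name):
    """Map a raw assignee string to its preferred form, if the thesaurus has one."""
    key = name.strip().lower().rstrip(".")
    return THESAURUS.get(key, name.strip())

raw_assignees = [
    "Bristol Myers Squibb Co.",
    "BMS",
    "Example Corp",
    "bristol-myers squibb company",
]
cleaned = Counter(normalize(n) for n in raw_assignees)
print(cleaned)
```

Without the cleanup step the same company would appear as three separate entries in an assignee list, which distorts any ranking or chart built on it.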

RefViz

• Type of tool

■ Text Analysis and Data Visualization software

• Capability

■ Statistical and Linguistic analysis

► “mathematical signature” – relationship of words

► Uses a thesaurus tool

• Data Sources

■ Only structured data from title, abstracts/notes fields, or ISI Web of Science, PubMed, OCLC, Output

• Results

■ “Galaxy” & matrix visualization

RefViz - Strengths

■ Reference Retriever™ can search multiple online sources simultaneously

■ Can be used together with EndNote, ProCite, and Reference Manager to provide an additional level of analysis to existing reference collections

■ Analyzes large numbers of references by thematic content

■ Interactive, visual landscape

Reveal trends and associations in references

The Galaxy view organizes references according to how they are related conceptually.

References on farming and herbs, either their cultivation or use as herbicides, are found in the upper left region of the Galaxy.

Groups in the lower right focus on herbs in medicine.

The region in between farming and medicine contains a mix of references about herbage diets in farm animals, herbal extracts from plants, and research on health effects of herbicide exposure.

Quosa

• Type of tool

■ Text mining tool based on concept extraction/clustering

• Capability

■ Statistical analysis (term extraction, frequency ranking, concept extraction using dynamic extraction algorithm from MIT/Harvard)

• Data sources

■ Unstructured text - PubMed, Ovid, Google Scholar

■ Patents

■ Internal documents

• Results

■ Highly organized collection of documents (folders on shared server or local machine)

■ Team sharing and annotating

Quosa - Strengths

Full-text retrieval and management of scientific documents

■ Get full articles from a journal or patent gateway

► PubMed, Ovid, USPTO website

■ Document Summary from My Article Organizer

■ Download to EndNote

M-CAM Doors™

• Type of tool

■ Patent database provider, with text analysis and risk management solution

• Capability

■ Linguistic & semantic-based analysis, multilingual

• Data Sources

■ Patents from over 88 patenting authorities, 50 million patent documents

■ Journal articles (by the end of summer 2006)

• Results

■ “Compass” citation view

■ “Magellan” telescope & hourglass – patent life timeline

■ Patent uniqueness and enforceability analysis

■ Competitive intelligence analysis - financial risk analysis for merger/acquisition and stock trading

M-CAM Doors™

Hourglass view – shows behavior and intent

Red bar – cited patents

Blue bar – citing patents

Green bar – concurrent art – share pendency

Purple bar – volume of uncited patents

Orange bar – volume of patents that did not cite subject patent

M-CAM Doors™ - Strengths

• Powerful visual interface for citation analysis with related family & legal status views

• Can rate each patent for its uniqueness, reliance on related patents, and enforcement potential – based on Hourglass view

• Can rank patent clusters by relevance to business objectives

• Competitive Intelligence/Investment Research

■ New Patent Thursday™, Patent Portfolio Confidence Rating™, Custom PPCR™

PatAnalyst

• Type of tool

■ Patent database provider – integrated source (UNIPAT) of patent databases from US, PCT, EPO, PAJ, Germany, UK, France and Switzerland

■ Patent search & examination service

• Capability

■ No text mining algorithm

• Data Sources

■ 51.5 million patent documents – bibliographic data from 70 countries from EPO

■ 15 million full-text documents – 8 countries/patenting authorities

• Results

■ Viewer – analyze and organize the patent documents/families

■ Easy-to-use analytical colored text-highlighting of keywords

■ Organized folders of documents

PatAnalyst - Strengths

• Powerful user interface with enhanced display features

■ Keywords are highlighted in different colors

■ Side-by-side views of full-text and standard bibliographic data

■ Integrated IPC category trees

■ “Live” legal status & patent family tree view from EPO Viewer (EPOQUE)

■ Combined search of full-text & bibliographic data

Aureka

• Type of tool

■ Content and software tool specializing in visualization and citation analysis

• Capability

■ Keyword and Statistical Analysis

• Data Sources

■ Patent databases listed in MicroPatent’s FullText collection

• Results

■ ThemeScape maps, hyperbolic citation trees, text clusters

Aureka ThemeScape Map of Stem Cell Technology

A ThemeScape map of a large set of documents provides an initial view of the content. Additional probing and analysis of the map will help to reveal more insight.

Citation Tree of Patent EP0778277

A cited patent provides insight into a corporation’s strategic intent with a patent: building a picket fence, treating it as a non-core patent, or a lack of R&D interest.

Aureka – Strengths

• Strong citation analysis tool

► Interactive citation tree – intelligence analysis and strategic planning

• Annotation capabilities

• Strong visualization analysis

► Patent mapping with ThemeScape

► Clustering by Vivisimo

Wisdomain

• Type of tool

■ Content and software tool. Web-based searching and citation tool. Analysis module is local

• Capability

■ Keyword analysis, citation map visualized searching

• Data Sources

■ Patents, specialized in US, EP, PCT, PAJ, INPADOC legal and family status, China abstracts, Korea abstracts

• Results

■ Genealogy tree, tables, charts

Wisdomain - Strengths

• Strong citation analysis capability

► Backward and forward citations, more than one nesting

► Collateral citation analysis

► Citation alerts

• Genealogy Tree

► Good in competitive analysis and licensing strategy planning

• Graphic view of the search results
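Backward and forward citation analysis with more than one level of nesting is a bounded graph walk. The sketch below is a generic illustration, not Wisdomain's algorithm; the patent identifiers, the `CITES` mapping, and the function names are hypothetical.

```python
from collections import deque

# Hypothetical citation data: patent -> patents it cites (backward citations).
CITES = {
    "P3": ["P2"],
    "P2": ["P1"],
    "P4": ["P2"],
    "P5": ["P4"],
}

def _walk(start, depth, neighbors):
    """Breadth-first walk that records the nesting level at which each patent is reached."""
    seen, queue = {}, deque([(start, 0)])
    while queue:
        node, level = queue.popleft()
        if level == depth:
            continue  # do not expand beyond the requested nesting depth
        for nxt in neighbors(node):
            if nxt not in seen and nxt != start:
                seen[nxt] = level + 1
                queue.append((nxt, level + 1))
    return seen

def backward(patent, depth):
    """Patents cited by `patent`, directly or through up to `depth` levels."""
    return _walk(patent, depth, lambda p: CITES.get(p, []))

def forward(patent, depth):
    """Patents that cite `patent`, directly or through up to `depth` levels."""
    return _walk(patent, depth, lambda p: [q for q, refs in CITES.items() if p in refs])

print(backward("P3", 2))  # {'P2': 1, 'P1': 2}
print(forward("P2", 2))   # {'P3': 1, 'P4': 1, 'P5': 2}
```

The level stored with each patent is what a genealogy or citation tree displays as generations away from the subject patent.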

Figure: Collateral Citation. The subject patent was applied for in 1990 and issued in 1993; the interval between is its pending period. Collateral citation identifies similar patents sharing the same pending period with the subject patent. In the example, 7 collateral patents, including key collateral patents, are identified based on indirect citation relations.
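One way to read the collateral citation idea is sketched below. The slide does not define "indirect citation relations", so this example assumes it means sharing at least one cited reference with the subject patent without citing it or being cited by it; that interpretation, the `Patent` class, and the sample data are all assumptions. Only the 1990 application and 1993 issue years come from the figure.

```python
from dataclasses import dataclass, field

@dataclass
class Patent:
    number: str
    applied: int   # application year
    issued: int    # issue year
    cites: set = field(default_factory=set)

def pending_overlap(a, b):
    """True if the two pending periods (applied..issued) share at least one year."""
    return a.applied <= b.issued and b.applied <= a.issued

def collateral(subject, candidates):
    """Candidates pending alongside the subject that share a cited reference
    with it but neither cite it nor are cited by it (assumed definition)."""
    result = []
    for p in candidates:
        direct = subject.number in p.cites or p.number in subject.cites
        shared = bool(subject.cites & p.cites)
        if pending_overlap(subject, p) and shared and not direct:
            result.append(p.number)
    return result

subject = Patent("S", 1990, 1993, {"R1", "R2"})
candidates = [
    Patent("A", 1991, 1994, {"R1"}),       # overlaps and shares R1: collateral
    Patent("B", 1995, 1997, {"R1"}),       # applied after the subject issued
    Patent("C", 1989, 1992, {"R9"}),       # overlaps but shares no reference
    Patent("D", 1992, 1995, {"S", "R2"}),  # cites the subject directly
]
print(collateral(subject, candidates))  # ['A']
```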

Summary

Vendor Name | Strength | Potential User Groups

ClearForest | Extraction modules | Business Intelligence

GoldFire | Sophisticated semantic analysis tool | R&D scientists

Inxight | Extraction & Federated Search | R&D Informatics

OmniViz | Interactive visualization | R&D scientists

Temis | Extraction using Specialized Skill Cartridges | R&D scientists, Business Intelligence

Quosa | Full-text retrieval & mgmt | R&D scientists

RefViz | Bibliographic data post-processing | R&D scientists, Information Professionals

VantagePoint | Analytical tool box for technology or company assessment | Information Professionals, Business Intelligence

Aureka | Patent mapping, clustering & citation analysis | Legal/Patent Dept., R&D scientists, Information Professionals, Strategic Planning, Business Intelligence

M-CAM | Patent uniqueness & enforcement analysis | Business Intelligence, Legal/Patent Dept., Information Professionals

PatAnalyst | Powerful full-text user interface with display features | Information Professionals, R&D scientists

Wisdomain | Strong collateral citation analysis | R&D scientists, Information Professionals

Path Forward

■ Phase II

► Pilot selected tools

► Identify potential client groups and interview representative clients

Closing Remarks

Acknowledgements

Peter Mattei Aureka

Thomas Klose ClearForest

Shelley Pavlek GoldFire/Invention Machine

Joanne Freeman Inxight

Marlene Khouri M-CAM

Heahyun Yoo OmniViz

Tony Medina PatAnalyst

Michael Rogers Quosa

Karen Stesis RefViz

Tisha Zawisky Temis

Lou Ann DiNallo VantagePoint

Mary Talmadge-Grebenar Wisdomain

Joseph Bezek

Claudia Powers

Ramesh Durvasula (Informatics)

Ronald Stoner (Mead Johnson)

Questions
