Augmentations of Text Mining for Semantic Knowledge Discovery

Christopher J. O. Baker, University of New Brunswick, Canada


Post on 04-Sep-2020



Christopher J. O. Baker

University of New Brunswick, Canada

Augmentations of Text Mining for Semantic Knowledge Discovery


In the future ….

• Users will be involved in the design of information systems

• Publishers will charge users for value-added search (who will build such search systems?)

• Users will search across semantically integrated data sources and data types (how to facilitate system creation / adoption?). Maybe they won’t know it!

• Knowledge driven systems - rapidly built / deployed with the engagement of domain experts in a knowledge engineering team


Literature-driven, Ontology-centric Knowledge Integration and Navigation

[Figure: from 500 documents, blogs, newsfeeds to browse down to 50 sentences to read. Labels: Text Mining, Ontology Population, Ontology, Reasoning, Visual Query, Content delivery using expressive semantics]


Outline

• Application Domains
– Lipids
– Mutations
– Contact Centre

• Platforms

• Knowledge Discovery
– Knowledgator
– Semantic Assistant with Firefox
– SADI

• Scoring Candidate Object Property Assertions


Lipid Ontology

Lipid Hierarchy

Concept Definitions

DL Axioms

Graph fragment


Subject                               No.
Total No. of Classes                  715
Primitive Class                       449
Defined Class                         266
No. Lipid Classes                     428
No. Lipid Classes w/t DL axioms       400
Total No. of Restrictions             901
Total No. of Properties               41

DL Expressivity: ALCHIQ(D)

> Implementation: OWL-DL

> Uses LIPIDMAPS systematic nomenclature

> Lipid instance: LIPIDMAPS systematic name

> Depth: 8 levels

> Domain Knowledge vs information system metadata


Domain Ontology vs Mixed Metadata:

a literature specification


Ontology Population Workflow

• Ontology based information retrieval applies NLP to link documents to existing ontologies

• Ontology-driven NLP - NLP that actively uses ontological resources for NLP tasks

• Ontological NLP - ontologies used as a knowledge base for NLP tasks while also exporting the results of NLP analyses into an ontology that can then [be used for] subsequent semantic queries to the ontology using description logic reasoners and A-box reasoning

• Ontology based NLP - the results of NLP are exported to another ontology, using external resources for text processing

Witte et al. 2007


Text Mining

• Concept Instance Generation from full text
– Named entity recognition (gazetteer based)
– Dictionary based matching of text tokens to domain specific vocabularies, i.e. (LipidBank, Lipidmaps, KEGG, IUPAC) and curated Swissprot terms and disease ontology of CGM
– Normalization and grounding to canonical names

• Relation Detection - Role Assertions:
– Co-occurrence and rule-based relation detection of binary pairs from which knowledgebase instances are generated.

Primary set of binary interactions mined from text:
– Lipid-Protein, Lipid-Disease, Protein-Disease
– Domain specific library of curated biological relations.


Mined Interactions


Mined Interactions in the ovarian cancer literature

C. Baker; R. Kanagasabai; W. Ang; H. Low; M. Wenk; A. Fernandis; M. Choolani; K. Narasimhan. Mining to Find the Lipid Interaction Networks Involved in Ovarian Cancers. American Medical Informatics Association, 2009 Summit on Translational Bioinformatics, March 15-17, San Francisco, California, 2009, AMIA-0154-T2009.


Knowledgebase Instantiation

1) Rule based identification of sentences containing target keywords
2) Instantiation, using the JENA API (http://jena.sourceforge.net/) for this purpose.

Target keywords found in sentences are instantiated to the corresponding ontology class:

• Lipid / Protein / Disease instances are instantiated to the respective ontology classes (as tagged by the gazetteer)
• Binary pairs are instantiated to the respective Object Properties as role assertions
• Sentences are instantiated to the respective Datatype properties.

For each lipid identified in a sentence the corresponding data are instantiated to the ontology from Lipid Data Warehouse records, requiring no further text processing:

• Lipid - LIPIDMAPS Systematic Name and its associated
• Lipid - IUPAC Name
• Lipid - synonyms
• Lipid - Database ID.


Knowledgebase Instantiation

[Figure labels: Lipid Instance, Lipid Class, Protein Instance]

Rule Based Sentence Processing

<Lipid> AND <Protein> AND LipidProteinInteraction-TriggerWord, e.g. "interact", "bind", "mediate"

<Lipid> AND <Disease> AND LipidDiseaseInteraction-TriggerWord, e.g. "involve", "cause"
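The two sentence rules on this slide can be sketched in a few lines of Python. This is only an illustration of the co-occurrence-plus-trigger-word idea, not the pipeline used in the work: the gazetteer entries and the function names are hypothetical, and only the trigger words are taken from the slide.

```python
import re

# Hypothetical gazetteer entries (illustrative only); trigger words are from the slide.
LIPIDS = {"sphingosine-1-phosphate", "ceramide"}
PROTEINS = {"akt", "edg1"}
DISEASES = {"ovarian cancer"}
LIPID_PROTEIN_TRIGGERS = ("interact", "bind", "mediate")
LIPID_DISEASE_TRIGGERS = ("involve", "cause")


def find_terms(sentence, gazetteer):
    """Return the gazetteer terms that occur in the sentence (case-insensitive)."""
    lowered = sentence.lower()
    return sorted(t for t in gazetteer
                  if re.search(r"\b" + re.escape(t) + r"\b", lowered))


def has_trigger(sentence, triggers):
    """True if a trigger stem starts any word, so 'binds' matches 'bind'."""
    lowered = sentence.lower()
    return any(re.search(r"\b" + re.escape(t), lowered) for t in triggers)


def extract_pairs(sentence):
    """Apply both rules and return (relation, lipid, partner) tuples."""
    pairs = []
    lipids = find_terms(sentence, LIPIDS)
    if has_trigger(sentence, LIPID_PROTEIN_TRIGGERS):
        pairs += [("Lipid-Protein", lipid, protein)
                  for lipid in lipids
                  for protein in find_terms(sentence, PROTEINS)]
    if has_trigger(sentence, LIPID_DISEASE_TRIGGERS):
        pairs += [("Lipid-Disease", lipid, disease)
                  for lipid in lipids
                  for disease in find_terms(sentence, DISEASES)]
    return pairs
```

A sentence that mentions a lipid and a protein but no trigger word yields no pair, which is consistent with rule filtering reducing the number of co-occurrence sentences.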


Knowledge Integration and Query

[Figure labels: User input query; Search Engine; Web content or Full text papers; NLP tagging; docs tagged with relevant name entities; Ontology instantiation; “Instantiated ontology”; Knowledge Navigation vehicle; User; Output for end user]

Papers identified: 262
• 121 papers with no lipid protein relations
• 141 papers contributed to ontology instantiation
• 186 lipid names
• 528 protein names

After normalisation and grounding:
• 92 Lipidmaps systematic names
• 52 IUPAC names, 412 exact synonyms, 6 broad synonyms, 319 protein names
• Cross link to 59 Lipidbank entries

Sentences:
• Co-occurrence before rules: 1356 sentences; after rules: 683 interaction sentences
• 92 Lipidmaps names instantiated to 35 classes (2.6 lipids per class)

Instantiation Time: 22 seconds

Baker CJ, Kanagasabai R, Ang WT, Veeramani A, Low HS, and Wenk MR. Towards ontology-driven navigation of the lipid bibliosphere. BMC Bioinformatics. 2008;9 Suppl 1:S5.



Pathway Discovery from Documents


Text Mining Systems for Mutations

2004 – 2009

• MuteXt (Protein Point Mutation) (P=87.9% R=85.8%)

(P=49.3% R= 64.5%)

• MEMA (Regex DNA / Protein, HUGO) (P=75% R=98%)

• MutationFinder (Regex) + rules (P=98%, R=81%)

• ProMiner SNPExtraction and normalization / grounding (P=78%, R=67%)

• mSTRAP RegEx plus protein or organism name, (P=94.5% R=79.6%)

• mSTRAP Grounding / Normalization to db (P=91.8% R=80.9%)

• VTag: (CRF approach) in special context of cancer, no mapping to database

• OSIRIS: Query expansion (for all SNPs of a found gene: PubMed query); slow, limited to results of PubMed search engine (P=99% R=82%)


Mutation Sentences

“As expected, complete loss in activity of W109L and sustained activity of F151W were observed.”

“In order to further understand the catalytic mechanism we constructed an Asp-124->Asn mutant enzyme.”

“Haloalkane dehalogenase (DhlA) from Xanthobacter autotrophicus GJ10 hydrolyses terminally chlorinated and brominated n-alkanes to the corresponding alcohols.”

“DhlA shows only a small decrease in activity when Trp-125 is replaced with phenylalanine.”

Legend: Mutation Description, Protein name, Gene name, Organism name


Mutation Extraction System Modules

• Named entity recognition

– Protein and gene names

– Organism names

– Mutation descriptions

• Named entity disambiguation

– Protein grounding

– Mutation grounding


Mutation Grounding

1. Mutation extraction system modules

2. Mutation grounding method

3. Mutation grounding results

4. Access to grounded mutations


Grounding Definitions

• Protein grounding

– Assign the correct UniProt id to each detected

protein entity.

• Mutation grounding

– Verify and, if necessary, positionally correct each

mutation location to match its corresponding

protein's sequence as obtained from UniProt.


Mutation Grounding


Entity Grounding (Resources)

1. Protein and gene names, part 1
Gazetteer list / one word names / case sensitive
source: SwissProt (stop words removed / filtered through whatizit)

2. Protein and gene names, part 2
Gazetteer list / more than one word / case insensitive
source: SwissProt (stop words removed / filtered through whatizit)

3. Organism names
Gazetteer list / case insensitive / source: SwissProt

4. Mutation initializer (JAPE)
passes the document content to mutationfinder, updates the mutation gazetteer accordingly and adds the normalized (wNm) format as a feature to each entry
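To make the normalized (wNm) format concrete, here is a minimal sketch that reads it as wild-type residue, position, mutant residue in one-letter codes. It is not MutationFinder and covers only two surface forms seen in the example sentences earlier in the deck; the regular expressions and the function name are assumptions made for illustration.

```python
import re

# Standard three-letter to one-letter amino acid codes.
THREE_TO_ONE = {
    "ala": "A", "arg": "R", "asn": "N", "asp": "D", "cys": "C",
    "gln": "Q", "glu": "E", "gly": "G", "his": "H", "ile": "I",
    "leu": "L", "lys": "K", "met": "M", "phe": "F", "pro": "P",
    "ser": "S", "thr": "T", "trp": "W", "tyr": "Y", "val": "V",
}
ONE_LETTER = "ACDEFGHIKLMNPQRSTVWY"

# Compact form such as W109L.
SHORT = re.compile(r"\b([%s])(\d+)([%s])\b" % (ONE_LETTER, ONE_LETTER))
# Three-letter form such as Asp-124->Asn.
LONG = re.compile(r"\b([A-Z][a-z]{2})-?(\d+)\s*->\s*([A-Z][a-z]{2})\b")


def normalize_mutations(text):
    """Return point mutations found in text as wNm strings."""
    found = [m.group(1) + m.group(2) + m.group(3) for m in SHORT.finditer(text)]
    for m in LONG.finditer(text):
        wild = THREE_TO_ONE.get(m.group(1).lower())
        mutant = THREE_TO_ONE.get(m.group(3).lower())
        if wild and mutant:  # skip matches that are not amino acid names
            found.append(wild + m.group(2) + mutant)
    return found
```

Free-text descriptions such as "Trp-125 is replaced with phenylalanine" need additional patterns and are deliberately left out of this sketch.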


Entity Grounding (Resources)

5. Mutation descriptions

gazetteer list created/updated by mutation initializer.

6. Tokenizer and Sentence splitter

7. Mutation grounder (JAPE)

implements the methods presented at DILS 2010.

8. Mutation impact extractor, part 1 (JAPE)

i) directionality extractor.

ii) possible molecular function extractor (activity, binding)


Entity Grounding (Resources)

9. POS tagger
Only considers sentences containing possible molecular functions. Prerequisite for MunPex.

10. MunPex
Noun phrase extractor. Prerequisite for molecular function grounder.

11. Mutation impact extractor, part 2 (JAPE)
i) molecular function grounder
ii) kinetic variable extractor
iii) subsentence splitter (divides sentences containing more than one molecular function and/or kinetic variable)
iv) impact extractor (co-occurrence rules in InCoB paper)
v) mutant-impact relation detector


Protein grounding (partial)

• Co-dependence: Protein / Mutation Grounding

1. Retrieve all gene and protein mentions
2. Retrieve related accession numbers
3. Filter with organism names
4. Use sequences related to most occurring ACs for mutation grounding
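The four steps above amount to counting accession numbers (ACs) over all name mentions and keeping those consistent with the organisms found in the document. A minimal sketch follows, assuming two lookup tables that a real system would derive from SwissProt; the table contents and the AC identifiers here are invented placeholders.

```python
from collections import Counter

# Hypothetical lookup tables; the accession identifiers are placeholders, not real ACs.
NAME_TO_ACS = {
    "dhla": ["AC_A", "AC_C"],
    "haloalkane dehalogenase": ["AC_A", "AC_B"],
}
AC_TO_ORGANISM = {
    "AC_A": "xanthobacter autotrophicus",
    "AC_B": "sphingomonas paucimobilis",
    "AC_C": "escherichia coli",
}


def rank_accessions(mentions, organisms):
    """Count ACs over all name mentions, keeping only ACs whose organism
    was also mentioned in the document; most frequent first."""
    counts = Counter()
    for name in mentions:
        for ac in NAME_TO_ACS.get(name.lower(), []):
            if AC_TO_ORGANISM.get(ac) in organisms:
                counts[ac] += 1
    return counts.most_common()
```

The sequences of the top-ranked ACs would then be passed on as candidate sequences for mutation grounding.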


Mutation Grounding

1. Retrieve and normalize mutations

2. For each candidate sequence
   1. For each pair of mutations
      1. Make regexp w1.(N2-N1)w2
      2. Match regexp to sequence
      3. Check remaining residues at corrected positions.

3. Ground proteins and mutations to the same AC / sequence
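The pairwise regular-expression step can be sketched as follows. This is an interpretation of the slide rather than the published implementation: the gap in `w1.(N2-N1)w2` is read here as N2 - N1 - 1 arbitrary residues between the two wild-type residues, positions are 1-based, and the function name and return shape are assumptions.

```python
import re
from itertools import combinations


def ground_mutations(sequence, mutations):
    """Return (number of grounded mutations, offset) for the best offset,
    or (0, None) if no pair of mutations matches the sequence.

    mutations: list of (wild_type_residue, position_as_reported_in_text).
    """
    best = (0, None)
    ordered = sorted(mutations, key=lambda m: m[1])
    for (w1, n1), (w2, n2) in combinations(ordered, 2):
        if n1 == n2:
            continue
        # Two wild-type residues separated by the expected number of residues;
        # the lookahead lets overlapping matches be found.
        pattern = re.compile("(?=%s.{%d}%s)" % (w1, n2 - n1 - 1, w2))
        for match in pattern.finditer(sequence):
            offset = (match.start() + 1) - n1
            # Check every wild-type residue at its corrected position.
            grounded = sum(
                1 for w, n in mutations
                if 0 < n + offset <= len(sequence) and sequence[n + offset - 1] == w
            )
            if best[1] is None or (grounded, -abs(offset)) > (best[0], -abs(best[1])):
                best = (grounded, offset)
    return best
```

For example, if a paper numbers residues two positions higher than the database sequence, the returned offset is -2 and every mutation whose wild-type residue is confirmed at the shifted position counts as grounded.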


Mutation Grounding Example

[Figure sequence over six slides: candidate sequences 1, 2 and 3 shown next to the candidate mutations. Steps shown: compute regular expression; match with sequence 1; extend match A; extend match B; match with sequence 2 & 3]


Mutation Grounding Example

Candidate sequences    # Mutations / Offset

[Figure: candidate sequences 1, 2 and 3 with the values 4 / -1, 4 / 5 and 2 / 0]

Choose best candidate sequence:

1. Most grounded mutations

2. Least absolute offset
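The two selection criteria translate directly into a sort key. In this sketch the pairing of the slide's values with particular sequence identifiers is hypothetical, as are the identifiers themselves.

```python
def choose_best(candidates):
    """Pick the candidate with the most grounded mutations; ties are broken
    by the smallest absolute offset.

    candidates: dict mapping a sequence id to (grounded_count, offset).
    """
    return max(candidates,
               key=lambda seq_id: (candidates[seq_id][0], -abs(candidates[seq_id][1])))


# Hypothetical assignment of the slide's values to three candidate sequences.
example = {"seq1": (4, -1), "seq2": (4, 5), "seq3": (2, 0)}
best = choose_best(example)
```

With these values, the first two candidates tie on grounded mutations and the smaller absolute offset decides between them.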


Grounding Evaluation Corpora

• COSMIC

– Catalogue Of Somatic Mutations In Cancer

– PIK3CA, FGFR3, MEN1

– 63 documents

• Haloalkane dehalogenases

– Protein engineering literature

– 13 documents


Mutation Grounding Evaluation


Access to Grounded Mutations

• Export results to an RDF triple store

• Make both the pipeline and triple store available through semantic web services (SADI)

• Make use of the semantic assistants architecture to present knowledge directly when browsing PubMed

http://sadiframework.org/


Access to Grounded Mutations and Impacts


SPARQL Queries


• SADI simply comprises a set of standards-compliant conventions and suggested best-practices for data representation and exchange between Web Services that fully utilizes Semantic Web technologies.

• SADI mandates the inclusion of a single required annotation in the Web Service metadata that describes the biological relationship ("predicate") that is created between the input and output data of that Service


Impact Direction Term List


Impact Classification


Mutation Ontology


Semantic Desktop Assistant

René Witte and Thomas Gitzinger. Semantic Assistants – User-Centric Natural Language Processing Services for Desktop Clients. 3rd Asian Semantic Web Conference (ASWC 2008), February 2–5, 2009, Bangkok, Thailand. Springer LNCS 5367, pp. 360–374. (Acceptance rate: 31%)

1) Firefox

2) GreaseMonkey Plugin

3) Install Lipid SA service (JavaScript / Tomcat)


Online Annotation workflow


Lipid Ontology

Lipid Hierarchy

Concept Definitions

DL Axioms

Graph fragment

http://www.lipidprofiles.com/LipidOntology/LiPrO-02042009.owl

Total No. of Classes          715
No. of Lipid Classes          492
Primitive Lipid Classes       107
Defined Lipid Classes         268
Total No. Restrictions        901
Total No. Properties          41
DL Expressivity               ALCHIQ(D)


Organic Group

Total no. of simple organic groups = 95 (2009)

[Figure: Lipid and Organic_Group linked by hasPart / PartOf. Labels: Organic group from Chemical Ontology; Extensions required to support lipid characterization]


Lipid Axioms


Acknowledgements

• NUS Office of Life Science (R-183-000-607-712),

• ARF (R-183-000-160-112),

• BMRC A*STAR (R-183-000-134-305),

• Singapore NRF under CRP award No. 2007-04,

• NBIF, New Brunswick, Canada.

• NSERC, Discovery grant. Canada

• Quebec - New Brunswick University Co-operation in Advanced Education - Research Program, Government of New Brunswick, Canada

• NSERC Discovery Grant, Canada 2009 - Baker CJO


Algorithm to populate Telecom domain OWL-DL ontology with A-box object properties derived from Technical Support Documents

Kouznetsov A (1), Shoebottom B (2), Baker CJO (1)

(1) University of New Brunswick, Saint John, Canada
(2) Innovatia, Inc, Saint John, Canada


Telecom Project:

GATE Pipeline and Resources

• ANNIE Pipeline:
– Gazetteer
– Sentence Splitter
– JAPE Transducer

• Resources:
– XML documents
– Telecom Gazetteer List
– T-box Ontology (optional for GATE pipeline)


Recognised “Entities”

• Named Entities
• Sentences
• Original Markup
– Document Segmentation
  • Paragraphs,
  • Sections,
  • Headers, etc.
• JAPE rules for combining entities as triples


Using JAPE Rules to Build Triples

• Instantiation of Classes
  • Telecom Named Entities (Chassy 107 is individual of Chassis)
  • Sentence Individuals (Sen.210 is individual of Sentence Class)
  • Literature Specification Individuals (Parag55 is a Paragraph)

• Datatype Properties
  • Has Text of Sentence – value dfffkclscmdscmsmvcmdsmv
  • Has Release Number – value 10101.1

• Object Properties
  • Telecom Entity occurs in Sentence
  • Sentence belongs to Literature Specification Entity
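The three kinds of assertion listed on this slide can be pictured as plain subject-predicate-object triples. The sketch below is Python rather than JAPE, and the predicate names (`rdf:type`, `hasTextOfSentence`, `occursInSentence`, `belongsTo`) are hypothetical stand-ins for whatever identifiers the ontology actually uses.

```python
def build_triples(entity, entity_class, sentence_id, sentence_text,
                  segment_id, segment_class):
    """Return triples for one named entity found in one sentence."""
    return [
        # Instantiation of classes
        (entity, "rdf:type", entity_class),
        (sentence_id, "rdf:type", "Sentence"),
        (segment_id, "rdf:type", segment_class),
        # Datatype property
        (sentence_id, "hasTextOfSentence", sentence_text),
        # Object properties
        (entity, "occursInSentence", sentence_id),
        (sentence_id, "belongsTo", segment_id),
    ]


# Individuals named as in the slide; the sentence text is a placeholder.
triples = build_triples("Chassy107", "Chassis", "Sen.210",
                        "placeholder sentence text", "Parag55", "Paragraph")
```

Keeping the sentence and its enclosing segment as individuals is what lets a later query trace an entity back to the text that mentions it.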


Telecom Text Processing Framework

• GATE pipeline

1. Preprocessing (Java)

2. Text Processing (JAPE/GATE)

3. Populating Ontology with Literature Related Predicates. Options: (A) GATE/JAPE/Ontology tool plugin; (B) OWLAPI

4. Telecom Assertions Scoring (Java)

5. Populating with Object Property Assertions (Java/OWLAPI)

• Output: Ontology A-box


Aim

• Accurate extraction of relations between the named entities and their population as object properties between A-box individuals in an OWL-DL ontology.

[Figure: T-Box: Domain Class Man, Object Property hasSister, Range Class Woman. A-Box: Domain Instance Samuel, Range Instance Mary, with the assertion between them marked “?”]


Methodology

• Ontology-based information retrieval applies Natural Language Processing (NLP) to link
– text segments,
– named entities and
– relations between named entities
to existing ontologies.

• Algorithm leverages a customized gazetteer list, including lists specific to object property synonyms

• Score A-box property candidates by using functions of distance between co-occurring terms.

• A-box property prediction and population based on these scores (Thresholds, Fuzzy approach)
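The last bullet mentions thresholds and a fuzzy approach without giving details. One way to picture the decision step, assuming scores already normalized to [0, 1], is shown below; the threshold values, the labels and the linear membership ramp are all hypothetical choices for illustration.

```python
def decide(scores, accept=0.7, reject=0.3):
    """Map each candidate's score to a decision and a fuzzy membership degree.

    scores: dict mapping a candidate id to a score in [0, 1].
    accept / reject: hypothetical thresholds, not values from the slides.
    """
    decisions = {}
    for candidate, score in scores.items():
        if score >= accept:
            label = "assert"    # populate the A-box property
        elif score <= reject:
            label = "discard"
        else:
            label = "review"    # leave to a human in a semi-automatic pipeline
        # Linear ramp between the two thresholds, clipped to [0, 1].
        membership = min(1.0, max(0.0, (score - reject) / (accept - reject)))
        decisions[candidate] = (label, round(membership, 2))
    return decisions
```

Keeping a middle band between the two thresholds fits a semi-automatic workflow, where borderline candidates are passed to a domain expert instead of being asserted or dropped outright.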


Semi-Automatic Ontology populating pipeline

[Figure: pipeline stages Preprocessing, Text Segments Processing, Ontology Population and Using Ontology. Labels: Source Documents (XML); Term List (Excel); Synonyms Lists; Text Segments Separation; Tables; Sentences; Other Text Segments; Named Entities; Ontology unpopulated (OWL); Single Relations; Multi Relations; Populated Ontology; Connecting Resources; Ontology Reasoning; Visualizing; Visual Queries]


Populating Ontology

[Figure: three frameworks. Relation Framework: A-box candidate extraction, Candidate, Reasoning, Ontology. Scoring Framework: Co-occurrence Based Scores generator, Scores. Decision Framework: Decision Module, Tres / Fuzzy, Labelled Data. One part of the diagram is marked “Focus”]


Generation of Scores

Relation Collection

• Framework to process Relation Objects

Relation Object

• Identified as: Domain Class : Domain Instance : Object Property : Range Class : Range Instance

• Integrates object property with:
  • all types of related text fragments
  • ontology objects
  • and score processing intermediate and final results
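A Relation Object as described here is essentially a record with a five-part identifier plus attached evidence and scores. A minimal sketch as a Python dataclass follows; the field names and the colon-joined key format are assumptions, and the example values reuse the Man / hasSister / Woman illustration from the Aim slide.

```python
from dataclasses import dataclass, field


@dataclass
class RelationObject:
    domain_class: str
    domain_instance: str
    object_property: str
    range_class: str
    range_instance: str
    fragments: list = field(default_factory=list)  # related text fragments
    scores: dict = field(default_factory=dict)     # intermediate and final scores

    @property
    def key(self):
        """Identifier built in the order given on the slide."""
        return ":".join([self.domain_class, self.domain_instance,
                         self.object_property, self.range_class,
                         self.range_instance])


candidate = RelationObject("Man", "Samuel", "hasSister", "Woman", "Mary")
candidate.fragments.append("Samuel visited his sister Mary.")  # invented example sentence
candidate.scores["final"] = 0.8                                 # invented example score
```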


Co-occurrence Based Score Generator

[Figure labels: A-box Candidate; All related content; Relations Framework; Relation Object; Synonyms List; Co-occurrence Based Scores Generator with Tokenizer, Gazetteer, Fragment Processor, Score Calculator and Integrator; Scores]


Scores Generator: Details

Score Calculator:

• Score calculation for text fragments associated with the Relation.

• Current version based on distance between co-occurring entities and number of text fragments with co-occurrence

• Involves Text Fragments Processor and Integrator
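The slide names the two inputs to the score (distance between co-occurring entities, and the number of fragments with co-occurrence) but not the function that combines them. The sketch below uses an inverse-distance score, 1 / (1 + token distance), purely as an assumed example of such a function; the function names are also hypothetical.

```python
def fragment_score(tokens, domain_term, range_term):
    """Score one tokenized fragment: 1 / (1 + distance) for the closest
    co-occurrence of the two terms, or 0.0 if either term is absent."""
    domain_positions = [i for i, t in enumerate(tokens) if t == domain_term]
    range_positions = [i for i, t in enumerate(tokens) if t == range_term]
    if not domain_positions or not range_positions:
        return 0.0
    distance = min(abs(d - r) for d in domain_positions for r in range_positions)
    return 1.0 / (1 + distance)


def relation_score(fragments, domain_term, range_term):
    """Integrate fragment scores into (sum of scores, number of fragments
    in which both terms co-occur)."""
    scores = [fragment_score(f, domain_term, range_term) for f in fragments]
    return sum(scores), sum(1 for s in scores if s > 0)
```

Returning the co-occurrence count next to the summed score keeps both signals available to a downstream decision step.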


Score Generation for Multiple Formats

Technical documentation contains knowledge displayed in multiple formats, each requiring different processing subroutines:

• Table Processing

• Sentence Processing

• Other segments

Page 74: Augmentations of Text Mining for Semantic Knowledge Discovery … · data sources and data types (how to facilitate system creation / adoption). Maybe they won’t ... impact extractor

Extensible Data Model

[Diagram: data model hierarchy]

Corpus

– Document (Doc ID)

– Document Segment: either a Table Segment or a Text Segment

Table Segment

– Data Cell (ID, Content)

– Row Header (ID, Content)

– Column Header (ID, Content)

– Table Header (ID, Content)

Text Segment

– Sentence (ID, Content)

Options: Sections, Paragraphs, Bullet lists, Headings
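The hierarchy above can be sketched with plain data classes. This is an illustration only: all class and field names are hypothetical, and the helper `segment_text` is an assumption about how a scorer might read either segment type through one interface.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Element:
    """Any leaf of the model: an ID plus its textual content."""
    id: str
    content: str


@dataclass
class TextSegment:
    sentence: Element


@dataclass
class TableSegment:
    data_cell: Element
    row_header: Element
    column_header: Element
    table_header: Element


Segment = Union[TextSegment, TableSegment]


@dataclass
class Document:
    doc_id: str
    segments: List[Segment] = field(default_factory=list)


@dataclass
class Corpus:
    documents: List[Document] = field(default_factory=list)


def segment_text(segment: Segment) -> str:
    """Return the text a scorer would see for either segment type."""
    if isinstance(segment, TextSegment):
        return segment.sentence.content
    parts = (segment.table_header, segment.row_header,
             segment.column_header, segment.data_cell)
    return " ".join(p.content for p in parts)
```

Adding a further segment type (for example bullet lists) would mean adding one more class to the `Segment` union and one more branch in `segment_text`.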

Page 75: Augmentations of Text Mining for Semantic Knowledge Discovery … · data sources and data types (how to facilitate system creation / adoption). Maybe they won’t ... impact extractor

A-Box Prop. Population

[Diagram: Ontology processing of the T-Box Obj. Properties (102) and Text Mining of the Corpus with the Gazetteer List produce the A-Box property Candidates.]

• A-Box Obj. Properties: 399 candidates

– Properties with co-occurrence of domain and range Individuals: 143

– Properties with occurrence of domain or range Individuals: 256

Page 76: Augmentations of Text Mining for Semantic Knowledge Discovery … · data sources and data types (how to facilitate system creation / adoption). Maybe they won’t ... impact extractor

Evidence for A-box Obj. Property Candidates

[Diagram: A-Box scoring of the current A-box Object Property Candidate draws on two pools of evidence. Each pool holds Text Segments (Sentence: ID, Content) and Table Segments (Data Cell, Row Header, Column Header, Table Header; each with ID and Content).]

• Evidence for Current A-box (occurrence of Domain or Range)

• Evidence for Current A-box (co-occurrence of Domain and Range)

Page 77: Augmentations of Text Mining for Semantic Knowledge Discovery … · data sources and data types (how to facilitate system creation / adoption). Maybe they won’t ... impact extractor

Table Segments: Primary Scoring

[Diagram: A-Box Scoring relates the fields of a Table Segment (Data Cell, Row Header, Column Header, Table Header; each with ID and Content) to the Domain, Property, and Range of the current A-box Object Property Candidate.]

Page 78: Augmentations of Text Mining for Semantic Knowledge Discovery … · data sources and data types (how to facilitate system creation / adoption). Maybe they won’t ... impact extractor

Table Segments: Secondary Scoring

[Diagram: A-Box scoring relates the fields of a Table Segment (Data Cell, Row Header, Column Header, Table Header; each with ID and Content) to the Domain, Property, and Range of the current A-box Object Property Candidate.]

Page 79: Augmentations of Text Mining for Semantic Knowledge Discovery … · data sources and data types (how to facilitate system creation / adoption). Maybe they won’t ... impact extractor

Sentence Scoring Process

• A-box object property score for a sentence:

SentenceScore = 1/(distance + 1) + Bonus

• Integrated object property score over all related sentences:

IntegratedScore = SUM(SentenceScore)

• Combine the Integrated Score with the Table Scores

• Normalized object property score:

NormalizedScore = IntegratedScore / Norm

Page 80: Augmentations of Text Mining for Semantic Knowledge Discovery … · data sources and data types (how to facilitate system creation / adoption). Maybe they won’t ... impact extractor

Sentence scoring: Score = 1/(distance + 1) + Bonus

Legend: D = Domain Synonym, R = Range Synonym, P = Object Property Synonym

[Diagram: three token-position layouts, one per case]

• Case 1 (D … R, no P): Distance = 4, Bonus = 0, Score = 1/(4+1) + 0 = 0.2

• Case 2 (D … R … P): Distance = 6, Bonus = 3, Score = 1/(6+1) + 3 = 3.14

• Case 3 (D … P … R): Distance = 4, Bonus = 10, Score = 1/(4+1) + 10 = 10.2
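The per-sentence formula and the integration step can be written down directly. The sketch below takes the distance and bonus as given inputs; the function names are hypothetical, and how the bonus value is chosen for each case is not modelled here.

```python
from typing import Iterable


def sentence_score(distance: int, bonus: float = 0.0) -> float:
    """Score = 1/(distance + 1) + Bonus, as stated on the slide."""
    if distance < 0:
        raise ValueError("distance must be non-negative")
    return 1.0 / (distance + 1) + bonus


def integrated_score(sentence_scores: Iterable[float]) -> float:
    """Integrated score: the sum of the scores of all related sentences."""
    return sum(sentence_scores)


# The three worked cases from the slide, rounded to two decimals.
case_scores = [
    round(sentence_score(4, 0), 2),   # D and R only
    round(sentence_score(6, 3), 2),   # property synonym with bonus 3
    round(sentence_score(4, 10), 2),  # property synonym with bonus 10
]
```

Because the distance term never exceeds 1, any non-zero bonus dominates the score, so sentences that contain a property synonym rank above sentences with mere co-occurrence.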

Page 81: Augmentations of Text Mining for Semantic Knowledge Discovery … · data sources and data types (how to facilitate system creation / adoption). Maybe they won’t ... impact extractor

Example Sentence Type 1

Telecommunications_Chassis:Chassis:hasChassis_Components:Telecommunications_Chassis_Power_Supply:Power_Supply

Property Synonyms:

•have

•has

Domain Synonyms:

•chassis

•switch chassis

Range Synonyms:

•Power Supply

•transformer

[Diagram: generic token layout D … R, with no property synonym]

sentence after cleaning:

In a chassis that includes two power supplies in a non redundant power configuration, you must start both restrictions dual power supplies power supply units within 2 seconds of each other.

Final Score=0.05, Best Bonus=0.0, Final Distance=19

Additional synonym lists shown on the slide:

• Property: has

• Domain: switch chassis, 8000 series, Chassis, CO chassis

• Range: transformer, power supply, power module, Power supply
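The Final Distance of 19 in this example is consistent with counting the tokens that lie strictly between the domain mention and the range mention. The sketch below reproduces that figure under two assumptions that are not confirmed by the slides: tokens are lower-cased alphanumeric runs, and a synonym matches only on exact tokens (so "power supplies" does not match "power supply"). All function names are hypothetical.

```python
import re
from typing import List, Optional, Tuple


def tokenize(text: str) -> List[str]:
    """Lower-case the text and keep alphanumeric runs as tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def find_mentions(tokens: List[str], synonyms: List[str]) -> List[Tuple[int, int]]:
    """Return (start, end) token indexes of every exact synonym match."""
    spans = []
    for synonym in synonyms:
        syn_tokens = tokenize(synonym)
        n = len(syn_tokens)
        if n == 0:
            continue
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] == syn_tokens:
                spans.append((i, i + n - 1))
    return spans


def min_distance(tokens: List[str], domain_syns: List[str],
                 range_syns: List[str]) -> Optional[int]:
    """Smallest number of tokens between a domain and a range mention."""
    best = None
    for d_start, d_end in find_mentions(tokens, domain_syns):
        for r_start, r_end in find_mentions(tokens, range_syns):
            if d_end < r_start:
                gap = r_start - d_end - 1
            elif r_end < d_start:
                gap = d_start - r_end - 1
            else:
                continue  # overlapping mentions carry no distance
            if best is None or gap < best:
                best = gap
    return best


sentence = ("In a chassis that includes two power supplies in a non redundant "
            "power configuration, you must start both restrictions dual power "
            "supplies power supply units within 2 seconds of each other.")
distance = min_distance(tokenize(sentence),
                        ["chassis", "switch chassis"],
                        ["Power Supply", "transformer"])
score = 1.0 / (distance + 1)  # no property synonym found, so Bonus = 0
```

With these assumptions the only domain mention is "chassis" (third token) and the only range mention is "power supply" before "units", which leaves 19 tokens in between and a score of 0.05.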

Page 82: Augmentations of Text Mining for Semantic Knowledge Discovery … · data sources and data types (how to facilitate system creation / adoption). Maybe they won’t ... impact extractor

Example Sentence Type 3

Telecommunications_Chassis_Power_Supply:Power_Supply:isPart_of_Chassis:Telecommunications_Chassis:Chassis

Property Synonyms:

•used in

•include

Domain Synonyms:

•Power Supply

•transformer

Range Synonyms:

•chassis

•switch chassis

[Diagram: generic token layout D … P … R, with the property synonym between domain and range]

sentence after cleaning:

In a chassis that includes two power supplies in a non redundant power configuration, you must start both restrictions dual power supplies power supply units within 2 seconds of each other.

Final Score=10.05

Best Bonus=10.0 Final Distance=19

Additional synonym lists shown on the slide:

• Property: include

• Domain: transformer, power supply, power module, Power supply

• Range: switch chassis, 8000 series, Chassis, CO chassis

Page 83: Augmentations of Text Mining for Semantic Knowledge Discovery … · data sources and data types (how to facilitate system creation / adoption). Maybe they won’t ... impact extractor

Bonus Calculation

Bonus = Bonus Constant * Number of tokens in property

[Diagram: two token-position layouts with D = Domain Synonym, R = Range Synonym, P = Object Property Synonym]

• Two-token property: Distance = 6, Bonus Constant = 10, Tokens in Property = 2, Score = 1/(6+1) + 2*10 = 20.14

• One-token property: Distance = 6, Bonus Constant = 10, Tokens in Property = 1, Score = 1/(6+1) + 1*10 = 10.14

Sentence Example: Device X does not support Device Y

Object Property    Number of Tokens    Obtained Score
Support            1                   1/(3+1) + 1*10 = 10.25
Not Support        2                   1/(3+1) + 2*10 = 20.25 (selected)
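The effect of the token-count bonus on the "Device X does not support Device Y" example can be checked with a few lines. This is a sketch: the function names are hypothetical, the bonus constant of 10 is the value used on the slide, and picking the highest-scoring property is an assumption about how the selected row is chosen.

```python
from typing import Dict, Iterable, Tuple

BONUS_CONSTANT = 10  # value used in the slide's examples


def property_bonus(property_synonym: str,
                   bonus_constant: float = BONUS_CONSTANT) -> float:
    """Bonus = Bonus Constant * number of tokens in the property synonym."""
    return bonus_constant * len(property_synonym.split())


def best_property(candidates: Iterable[str],
                  distance: int) -> Tuple[str, Dict[str, float]]:
    """Score each matching property synonym and return the top one."""
    scores = {p: 1.0 / (distance + 1) + property_bonus(p) for p in candidates}
    winner = max(scores, key=scores.get)
    return winner, scores


# Three tokens ("does not support") separate Device X and Device Y.
winner, scores = best_property(["support", "not support"], distance=3)
```

Scaling the bonus by token count lets the longer, more specific match ("not support") outrank the shorter one ("support") that it contains, which keeps a negated statement from being read as a positive one.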

Page 84: Augmentations of Text Mining for Semantic Knowledge Discovery … · data sources and data types (how to facilitate system creation / adoption). Maybe they won’t ... impact extractor

Normalization

• Norm coefficient for an A-box object property:

Norm = Log(1.0 + NSD * NSR)

NSD – number of sentences in which the Domain occurred

NSR – number of sentences in which the Range occurred
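A short sketch of the normalization step, combining the norm coefficient with the integrated sentence score. The slide does not state the base of the logarithm, so the natural logarithm is assumed here; the function names and the zero-norm fallback are likewise assumptions. Table scores are left out.

```python
import math
from typing import Iterable


def norm_coefficient(nsd: int, nsr: int) -> float:
    """Norm = Log(1.0 + NSD * NSR); natural log is an assumption."""
    return math.log(1.0 + nsd * nsr)


def normalized_score(sentence_scores: Iterable[float],
                     nsd: int, nsr: int) -> float:
    """NormalizedScore = IntegratedScore / Norm."""
    integrated = sum(sentence_scores)
    norm = norm_coefficient(nsd, nsr)
    if norm == 0.0:
        # Domain or range never occurred; returning 0.0 is an assumption.
        return 0.0
    return integrated / norm
```

Dividing by a term that grows with how often the domain and range appear dampens the advantage of very frequent entities, which would otherwise accumulate high integrated scores simply by appearing in many sentences.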

Page 85: Augmentations of Text Mining for Semantic Knowledge Discovery … · data sources and data types (how to facilitate system creation / adoption). Maybe they won’t ... impact extractor

Gold Standard and Evaluation Framework

[Diagram: end-to-end pipeline. Recoverable labels:]

• Inputs: Source Documents (XML), Term List (Excel), Synonyms Lists, unpopulated Ontology (OWL) / T-Box Ontology

• Preprocessing: XML processing, Text Segments Separation (Sentences, Tables, Bullet Lists), Segments Processing

• Ontology Population: Named Entities, Single Relations, Multi Relations; Populate Ontology; A-Box Ontology; Populated Ontology

• Using Ontology: Reasoning, Visualizing, Visual Queries, Connecting Resources

• Knowledge Engineer

• Prediction Evaluation Framework: Import Labels, Golden Standard Database, Evaluate Predicted Properties / Update DB, Evaluation Report

Page 86: Augmentations of Text Mining for Semantic Knowledge Discovery … · data sources and data types (how to facilitate system creation / adoption). Maybe they won’t ... impact extractor

Thresholds: Decision Boundary

• All scores for each A-box property candidate are aggregated over the eligible sources of evidence for the A-box in question

• Threshold in use

• Trade off - Recall vs. Precision
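The recall versus precision trade-off can be illustrated by applying different thresholds to aggregated candidate scores and comparing against gold-standard labels. The candidate names, scores, and gold set below are made-up toy data, and the function name is hypothetical.

```python
from typing import Dict, Set, Tuple


def precision_recall(scores: Dict[str, float], gold: Set[str],
                     threshold: float) -> Tuple[float, float]:
    """Positive-class precision and recall when accepting score >= threshold."""
    predicted = {cand for cand, s in scores.items() if s >= threshold}
    true_pos = len(predicted & gold)
    # With no predictions there are no false positives; 1.0 is a convention.
    precision = true_pos / len(predicted) if predicted else 1.0
    recall = true_pos / len(gold) if gold else 0.0
    return precision, recall


# Toy data (hypothetical): aggregated scores and gold-standard positives.
toy_scores = {"a": 20.25, "b": 10.05, "c": 0.2, "d": 0.05}
toy_gold = {"a", "b", "c"}

strict = precision_recall(toy_scores, toy_gold, threshold=10.0)
lenient = precision_recall(toy_scores, toy_gold, threshold=0.0)
```

On this toy data the strict threshold accepts only "a" and "b" (precision 1.0, recall 2/3), while the lenient threshold accepts everything (precision 0.75, recall 1.0).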

Page 87: Augmentations of Text Mining for Semantic Knowledge Discovery … · data sources and data types (how to facilitate system creation / adoption). Maybe they won’t ... impact extractor

Results for Tables: Baseline result

Focus on Positive class Recall and Positive class Precision

• Class of interest (Positive class)

• Recall =0.80

• Precision=0.85

Page 88: Augmentations of Text Mining for Semantic Knowledge Discovery … · data sources and data types (how to facilitate system creation / adoption). Maybe they won’t ... impact extractor

Results for Tables: Continued

Focus on Positive class Precision

• Class of interest (Positive class)

• Recall =0.25

• Precision=1.0

Page 89: Augmentations of Text Mining for Semantic Knowledge Discovery … · data sources and data types (how to facilitate system creation / adoption). Maybe they won’t ... impact extractor

Results for Tables: Continued

Focus on Positive class Recall

• Class of interest (Positive class)

• Recall =1.0

• Precision=0.775

Page 90: Augmentations of Text Mining for Semantic Knowledge Discovery … · data sources and data types (how to facilitate system creation / adoption). Maybe they won’t ... impact extractor

Results for Sentences

Focus on Positive class Precision

• Class of interest (Positive class)

• Recall =0.14

• Precision=1.0

Page 91: Augmentations of Text Mining for Semantic Knowledge Discovery … · data sources and data types (how to facilitate system creation / adoption). Maybe they won’t ... impact extractor

Results for Sentences and Tables

Focus on Positive class Precision

• Class of interest (Positive class)

• Recall =0.4

• Precision=1.0

• Synergistic effect of using Sentences and Tables together (at Precision=1.0):

Recall (sentences)= 0.14

Recall (tables)= 0.25

Recall (sentences & tables)= 0.4

Page 92: Augmentations of Text Mining for Semantic Knowledge Discovery … · data sources and data types (how to facilitate system creation / adoption). Maybe they won’t ... impact extractor

Gold Standard Corpus and Evaluation Framework

[Diagram: end-to-end pipeline. Recoverable labels:]

• Inputs: Source Documents (XML), Term List (Excel), Synonyms Lists, unpopulated Ontology (OWL) / T-Box Ontology

• Preprocessing: Text Segments Separation (Sentences, Tables, Bullet Lists), Text Segments Processing

• Ontology Population: Named Entities, Single Relations, Multi Relations; Populate Ontology; A-Box Ontology; Populated Ontology

• Using Ontology: Reasoning, Visualizing, Visual Queries, Connecting Resources

• Knowledge Engineer

• Prediction Evaluation Framework: Import Labels, Golden Standard Database, Evaluate Predicted Properties / Update DB, Evaluation Report

Page 93: Augmentations of Text Mining for Semantic Knowledge Discovery … · data sources and data types (how to facilitate system creation / adoption). Maybe they won’t ... impact extractor

Semantic Knowledge Discovery: Contact Centre

[Diagram: Contact Centre Agents]

• GATE NLP Framework

• OWL-DL Ontology

• OWL API

• Custom Telecom Gazetteers

• Pellet Reasoner

• TopBraid Composer

• TopBraid Live/Ensemble

Page 94: Augmentations of Text Mining for Semantic Knowledge Discovery … · data sources and data types (how to facilitate system creation / adoption). Maybe they won’t ... impact extractor

Contact Center Environment

• Tier 1:

– Information gathering/validation

– Initial problem solving

– Requires highly precise information

– Needs simple-to-use user interface

• Tier 2:

– Problem escalation or information not found

– Requires high information recall

– Requires advanced search capabilities

Page 95: Augmentations of Text Mining for Semantic Knowledge Discovery … · data sources and data types (how to facilitate system creation / adoption). Maybe they won’t ... impact extractor

Visual Query over Telecom KB

Visual Query: Network Routing Server has a Configuring and Enabling Procedure

TopBraid Ensemble

Page 96: Augmentations of Text Mining for Semantic Knowledge Discovery … · data sources and data types (how to facilitate system creation / adoption). Maybe they won’t ... impact extractor

Pilot Study: Positive Results

[Chart: y-axis "% Change Compared to Old Search Toolset"]

• Tier 1 found the right information with less need for escalation

– Now able to find documents 90% of the time (old toolset 75%)

• Tier 2 has more tasks and toolset features to learn – longer learning curve

Page 97: Augmentations of Text Mining for Semantic Knowledge Discovery … · data sources and data types (how to facilitate system creation / adoption). Maybe they won’t ... impact extractor

Acknowledgements

• Atlantic Innovation Fund of the Atlantic Canada Opportunities Agency

• Project Team

– Christopher JO Baker (1): Primary Investigator

– Bradley Shoebottom (2): Knowledge Engineer

– Alex Kouznetzov (1): Text Mining Engineer

– Michael Doyle (2): Network Infrastructure

• Testing Team

– Karen Lewis (2): Information Architect

– Innovatia Technical support team

• Dearran Townes, Amanda Chase, Darrell Flynn, Gregg Knight, Corey Harris, Andrew Madsen

(1) UNBSJ, (2) Innovatia