
Christopher J. O. Baker

University of New Brunswick, Canada

Augmentations of Text Mining for Semantic Knowledge Discovery

In the future ….

• Users will be involved in the design of information systems

• Publishers will charge users for value added search:

(who will build such search systems)

• Users will search across semantically integrated data sources and data types (how to facilitate system creation / adoption). Maybe they won’t know it!

• Knowledge driven systems - rapidly built / deployed with the engagement of domain experts in a knowledge engineering team

Literature-driven, Ontology-centric

Knowledge Integration and Navigation

Ontology

Visual Query

50 sentences

Reasoning

Ontology

Population

Content delivery using expressive semantics

Text Mining

Ontology

500 documents,

blogs, newsfeeds

to browse

50 sentences

to read

Outline

• Application Domains

– Lipids

– Mutations

– Contact Centre

• Platforms

• Knowledge Discovery

– Knowledgator

– Semantic Assistant with Firefox

– SADI

• Scoring Candidate Object Property Assertions

Lipid Ontology

Lipid Hierarchy

Concept Definitions

DL Axioms

Graph fragment

Subject No.

Total No. of Classes 715

Primitive Class 449

Defined Class 266

No. Lipid Classes 428

No. Lipid Classes w/t DL axioms 400

Total No. of Restrictions 901

Total No. of Properties 41

DL Expressivity: ALCHIQ(D)

> Implementation: OWL-DL

> Uses LIPIDMAPS systematic nomenclature

> Lipid instance: LIPIDMAPS systematic name

> Depth: 8 levels

> Domain Knowledge vs information system metadata

Domain Ontology vs Mixed Metadata: a literature specification

Ontology Population Workflow

• Ontology based information retrieval applies NLP to link documents to existing ontologies

• Ontology-driven NLP - NLP that actively uses ontological resources for NLP tasks

• Ontological NLP - ontologies used as a knowledge base for NLP tasks while also exporting the results of NLP analyses into an ontology that can then be used for subsequent semantic queries to the ontology using description logic reasoners and A-box reasoning

• Ontology based NLP - the results of NLP are exported to another ontology, using external resources for text processing

Witte et al. 2007

Text Mining

• Concept Instance Generation from full text

– Named entity recognition (gazetteer based)

– Dictionary based matching of text tokens to domain specific

vocabularies i.e. (LipidBank, Lipidmaps, KEGG, IUPAC) and

curated Swissprot terms and disease ontology of CGM

– Normalization and grounding to canonical names

• Relation Detection - Role Assertions:

– Co-occurrence and Rule-based relation detection of binary

pairs from which knowledgebase instances are generated.

Primary set of binary interactions mined from text:

– Lipid-Protein, Lipid-Disease, Protein-Disease

– Domain specific library of curated biological relations.

Mined Interactions in the ovarian cancer literature

C. Baker; R. Kanagasabai; W. Ang; H. Low; M. Wenk; A. Fernandis; M. Choolani; K. Narasimhan. Mining to Find the Lipid Interaction Networks Involved in Ovarian Cancers. American Medical Informatics Association, 2009 Summit on Translational Bioinformatics, March 15-17, San Francisco, California, 2009, AMIA-0154-T2009.

Knowledgebase Instantiation

1) Rule based identification of sentences containing target keywords

2) Instantiation with the JENA API (http://jena.sourceforge.net/)

Target keywords found in sentences are instantiated to the corresponding ontology class

• Lipid / Protein / Disease instances are instantiated to the respective ontology classes (as tagged by the gazetteer)

• Binary pairs are instantiated to the respective Object Properties as role assertions

• Sentences are instantiated to the respective Datatype properties.

For each lipid identified in a sentence the corresponding data are instantiated to the ontology from Lipid Data Warehouse records, requiring no further text processing.

• Lipid - LIPIDMAPS Systematic Name and its associated

• Lipid - IUPAC Name,

• Lipid - synonyms

• Lipid - Database ID.

Knowledgebase Instantiation

[Figure: instantiated knowledgebase fragment. Labels: Lipid Instance, Lipid Class, Protein Instance]

Rule Based Sentence Processing

<Lipid> AND <Protein> AND LipidProteinInteraction-TriggerWord, e.g. "interact", "bind", "mediate"

<Lipid> AND <Disease> AND LipidDiseaseInteraction-TriggerWord, e.g. "involve", "cause"
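The rule-based sentence processing above can be sketched in a few lines. This is only an illustration, assuming tiny hypothetical gazetteers (`LIPIDS`, `PROTEINS`) and simple substring matching; the actual pipeline uses GATE gazetteers and curated vocabularies, and the function names here are made up for the example.

```python
import re

# Hypothetical toy gazetteers; the real lists come from LIPIDMAPS, Swiss-Prot, etc.
LIPIDS = {"sphingosine-1-phosphate", "ceramide"}
PROTEINS = {"EDG1", "p53"}
# Trigger words quoted on the slide for lipid-protein interactions.
TRIGGERS = {"interact", "bind", "mediate"}


def find_terms(sentence, vocabulary):
    """Return vocabulary terms that occur in the sentence (case-insensitive)."""
    lowered = sentence.lower()
    return sorted(term for term in vocabulary if term.lower() in lowered)


def lipid_protein_pairs(sentence):
    """Apply the rule <Lipid> AND <Protein> AND trigger word to one sentence."""
    lowered = sentence.lower()
    # Triggers are matched as word prefixes so "binds" and "binding" also count.
    has_trigger = any(re.search(r"\b" + re.escape(t), lowered) for t in TRIGGERS)
    if not has_trigger:
        return []
    lipids = find_terms(sentence, LIPIDS)
    proteins = find_terms(sentence, PROTEINS)
    return [(lipid, protein) for lipid in lipids for protein in proteins]


pairs = lipid_protein_pairs("Sphingosine-1-phosphate binds EDG1 in endothelial cells.")
```

Each returned pair corresponds to one candidate role assertion; a sentence with co-occurring entities but no trigger word yields nothing, which is the filtering step that reduces co-occurrence sentences to interaction sentences.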

Knowledge Integration and Query

[Figure: workflow diagram. Labels: User input query; Search Engine; Web content or Full text papers; NLP tagging; docs tagged with relevant name entities; Ontology instantiation; Knowledge Navigation vehicle; Output for end user]

Papers identified: 262

121 papers with no lipid protein relations

141 papers contributed to ontology instantiation

186 lipid names

528 protein names

Baker CJ, Kanagasabai R, Ang WT, Veeramani A, Low HS, and Wenk MR. Towards ontology-driven navigation of the lipid bibliosphere. BMC Bioinformatics. 2008;9 Suppl 1:S5.

After normalisation and grounding:

92 Lipidmaps systematic names

52 IUPAC names, 412 exact synonyms, 6 broad synonyms, 319 protein names

Cross link to 59 Lipidbank entries

Sentences:

Co-occurrence before rules 1356 Sentences, After rules 683 Interaction sentences

92 Lipidmaps names instantiated to 35 classes (2.6 lipids per class)

Instantiation Time: 22 seconds

“Instantiated ontology”


Pathway Discovery from Documents

Text Mining Systems for Mutations

2004 – 2009

• MuteXt (Protein Point Mutation) (P=87.9% R=85.8%)

(P=49.3% R= 64.5%)

• MEMA (Regex DNA / Protein, HUGO) (P=75% R=98%)

• MutationFinder (Regex) + rules (P=98%, R=81%)

• ProMiner SNPExtraction and normalization / grounding (P=78%, R=67%)

• mSTRAP RegEx plus protein or organism name, (P=94.5% R=79.6%)

• mSTRAP Grounding / Normalization to db (P=91.8% R=80.9%)

• VTag: (CRF approach) in special context of cancer, no mapping to database

• OSIRIS: Query expansion (for all SNPs of a found gene: PubMed query); slow, limited to results of PubMed search engine (P=99% R=82%)

Mutation Sentences

“As expected, complete loss in activity of W109L and sustained

activity of F151W were observed.”

“In order to further understand the catalytic mechanism we constructed an Asp-124->Asn mutant enzyme.”

“DhlA shows only a small decrease in activity when Trp-125 is replaced with phenylalanine.”

“Haloalkane dehalogenase (DhlA) from Xanthobacter autotrophicus GJ10 hydrolyses terminally chlorinated and brominated n-alkanes to the corresponding alcohols.”

Legend: Mutation Description, Protein name, Gene name, Organism name
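As an illustration of regex-based mutation detection on sentences like these, the sketch below recognises two of the surface forms shown (W109L and Asp-124->Asn) and normalises both to the wNm format. The patterns are simplified assumptions written for this example, not the rule set of MutationFinder or any other system listed above; the prose form ("Trp-125 is replaced with phenylalanine") is deliberately left out.

```python
import re

# Three-letter to one-letter amino acid codes.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}
AA1 = "ACDEFGHIKLMNPQRSTVWY"
AA3 = "|".join(THREE_TO_ONE)

# Form 1: one-letter codes, e.g. W109L.
ONE_LETTER = re.compile(rf"\b([{AA1}])(\d+)([{AA1}])\b")
# Form 2: three-letter codes with an arrow, e.g. Asp-124->Asn.
THREE_LETTER = re.compile(rf"\b({AA3})-?(\d+)\s*->\s*({AA3})\b")


def extract_mutations(text):
    """Return point mutations found in text, normalised to wNm (e.g. D124N)."""
    found = []
    for wild, pos, mutant in ONE_LETTER.findall(text):
        found.append(f"{wild}{pos}{mutant}")
    for wild, pos, mutant in THREE_LETTER.findall(text):
        found.append(f"{THREE_TO_ONE[wild]}{pos}{THREE_TO_ONE[mutant]}")
    return found
```

Normalising every mention to one format is what makes the later grounding step possible, since positions and wild-type residues can then be compared directly against a protein sequence.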

Mutation Extraction System Modules

• Named entity recognition

– Protein and gene names

– Organism names

– Mutation descriptions

• Named entity disambiguation

– Protein grounding

– Mutation grounding

Mutation Grounding

1. Mutation extraction system modules

2. Mutation grounding method

3. Mutation grounding results

4. Access to grounded mutations

Grounding Definitions

• Protein grounding

– Assign the correct UniProt id to each detected

protein entity.

• Mutation grounding

– Verify and, if necessary, positionally correct each

mutation location to match its corresponding

protein's sequence as obtained from UniProt.

Mutation Grounding

Entity Grounding (Resources)

1. Protein and gene names, part 1: gazetteer list / one word names / case sensitive / source: SwissProt (stop words removed / filtered through whatizit)

2. Protein and gene names, part 2: gazetteer list / more than one word / case insensitive / source: SwissProt (stop words removed / filtered through whatizit)

3. Organism names: gazetteer list / case insensitive / source: SwissProt

4. Mutation initializer (JAPE): passes the document content to MutationFinder, updates the mutation gazetteer accordingly and adds the normalized (wNm) format as a feature to each entry

5. Mutation descriptions: gazetteer list created/updated by mutation initializer.

6. Tokenizer and Sentence splitter

7. Mutation grounder (JAPE): implements the methods presented at DILS 2010.

8. Mutation impact extractor, part 1 (JAPE): i) directionality extractor; ii) possible molecular function extractor (activity, binding)

9. POS tagger: only considers sentences containing possible molecular functions. Prerequisite for MunPex.

10. MunPex: noun phrase extractor. Prerequisite for molecular function grounder.

11. Mutation impact extractor, part 2 (JAPE): i) molecular function grounder; ii) kinetic variable extractor; iii) subsentence splitter (divides sentences containing more than one molecular function and/or kinetic variable); iv) impact extractor (co-occurrence rules in InCoB paper); v) mutant-impact relation detector.

Protein grounding (partial)

• Co-dependence: Protein / Mutation Grounding

1. Retrieve all gene- and protein mentions.

2. Retrieve related accession numbers

3. Filter with organism names

4. Use sequences related to most occurring ACs

for mutation grounding

Mutation Grounding

1. Retrieve and normalize mutations

2. For each candidate sequence

1. For each pair of mutations

1. Make regexp w1.(N2-N1)w2

2. Match regexp to sequence

3. Check remaining residues at corrected positions.

3. Ground proteins and mutations to the

same AC / sequence
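The pairwise regular-expression step can be sketched as follows. This is a minimal reading of the outline above, not the published implementation: it assumes that for wild-type residues w1 at position N1 and w2 at position N2 the pattern is w1, then N2-N1-1 arbitrary residues, then w2, and that the offset is the difference between the matched and the reported position. The function name and return format are hypothetical.

```python
import re
from itertools import combinations


def ground_mutations(sequence, mutations):
    """Try to ground mutations on one candidate sequence.

    mutations: list of (wild_type_residue, reported_position), 1-based.
    Returns (grounded_count, offset) for the best offset, or (0, None).
    """
    best = (0, None)
    ordered = sorted(mutations, key=lambda m: m[1])
    for (w1, n1), (w2, n2) in combinations(ordered, 2):
        gap = n2 - n1 - 1
        if gap < 0:
            continue  # two mutations reported at the same position
        # Lookahead so that overlapping candidate matches are all visited.
        pattern = re.compile(rf"(?={re.escape(w1)}.{{{gap}}}{re.escape(w2)})")
        for match in pattern.finditer(sequence):
            offset = (match.start() + 1) - n1
            # Check all residues (including the pair) at corrected positions.
            grounded = sum(
                1
                for w, n in mutations
                if 1 <= n + offset <= len(sequence) and sequence[n + offset - 1] == w
            )
            if (
                best[1] is None
                or grounded > best[0]
                or (grounded == best[0] and abs(offset) < abs(best[1]))
            ):
                best = (grounded, offset)
    return best
```

For example, with the toy sequence `MKTAWLLDFGHW` and reported wild-type residues W7, D10 and H13, all three residues line up once positions are shifted by -2.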

Mutation Grounding Example

[Figure: step-by-step grounding of candidate mutations against three candidate sequences. Panels: Candidate sequences and Candidate mutations; Compute regular expression (from mutations A and B); Match with sequence 1; Extend match A; Extend match B; Match with sequence 2 & 3]

Candidate sequences: # Mutations / Offset

4 / -1

4 / 5

2 / 0

Choose best candidate sequence:

1. Most grounded mutations

2. Least absolute offset
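The two selection criteria can be expressed as a single sort key. In this sketch the sequence labels `a`, `b` and `c` are hypothetical, because the mapping of the three "# Mutations / Offset" values to specific sequences is not recoverable from the slide.

```python
def choose_best(candidates):
    """Pick the candidate with most grounded mutations, then least |offset|.

    candidates: dict mapping a sequence label to (grounded_count, offset).
    """
    return max(
        candidates,
        key=lambda label: (candidates[label][0], -abs(candidates[label][1])),
    )


# Values from the slide, attached to hypothetical labels.
best = choose_best({"a": (4, -1), "b": (4, 5), "c": (2, 0)})
```

Two candidates ground four mutations each, so the tie is broken by the smaller absolute offset.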

Grounding Evaluation Corpora

• COSMIC

– Catalogue Of Somatic Mutations In Cancer

– PIK3CA, FGFR3, MEN1

– 63 documents

• Haloalkane dehalogenases

– Protein engineering literature

– 13 documents

Mutation Grounding Evaluation

Access to Grounded Mutations

• Export results to an RDF triple store

• Make both the pipeline and triple store available through semantic web services (SADI)

• Make use of the semantic assistants

architecture to present knowledge directly

when browsing PubMed

http://sadiframework.org/

Access to Grounded

Mutations and Impacts

SPARQL Queries

• SADI simply comprises a set of standards-compliant conventions and suggested best-practices for data representation and exchange between Web Services that fully utilizes Semantic Web technologies.

• SADI mandates the inclusion of a single required annotation in the Web Service metadata that describes the biological relationship ("predicate") that is created between the input and output data of that Service

Impact Direction Term List

Impact Classification

Mutation Ontology

Semantic Desktop Assistant

René Witte and Thomas Gitzinger.

Semantic Assistants – User-Centric Natural Language Processing Services for Desktop Clients.

3rd Asian Semantic Web Conference (ASWC 2008), February 2–5, 2009, Bangkok, Thailand.

Springer LNCS 5367, pp. 360–374. (Acceptance rate: 31%)

1) Firefox

2) GreaseMonkey Plugin

3) Install Lipid SA service (Java Script / Tomcat)

Online Annotation workflow

Lipid Ontology

Lipid Hierarchy

Concept Definitions

DL Axioms

Graph fragment

http://www.lipidprofiles.com/LipidOntology/LiPrO-02042009.owl

Total No. of Classes 715

No. of Lipid Classes 492

Primitive Lipid Classes 107

Defined Lipid Classes 268

Total No. Restrictions 901

Total No. Properties 41

DL Expressivity ALCHIQ(D)

Organic Group

Total no. of simple organic group = 95 (2009)

hasPart

Lipid Organic_Group

PartOf

Extensions

required to

support lipid

characterization

Organic group

From Chemical

Ontology

Lipid Axioms

Acknowledgements

• NUS Office of Life Science (R-183-000-607-712),

• ARF (R-183-000-160-112),

• BMRC A*STAR (R-183-000-134-305),

• Singapore NRF under CRP award No. 2007-04,

• NBIF, New Brunswick, Canada.

• NSERC, Discovery grant. Canada

• Quebec - New Brunswick University Co-operation in Advanced Education - Research Program, Government of New Brunswick, Canada

• NSERC Discovery Grant, Canada 2009 - Baker CJO

Algorithm to populate Telecom domain OWL-DL ontology with A-box object properties derived from Technical Support Documents

Kouznetsov A (1), Shoebottom B (2), Baker CJO (1)

(1) University of New Brunswick, Saint John, Canada

(2) Innovatia, Inc, Saint John, Canada

Telecom Project:

GATE Pipeline and Resources

• ANNIE Pipeline:

– Gazetteer

– Sentence Splitter

– JAPE Transducer

• Resources:

– XML documents

– Telecom Gazetteer List

– T-box Ontology (optional for GATE pipeline)

Recognised “Entities”

• Named Entities

• Sentences

• Original Markup

– Document Segmentation

• Paragraphs,

• Sections,

• Headers, etc.

• JAPE rules for combining entities as triples

Using JAPE Rules to Build Triples

• Instantiation of Classes

• Telecom Named Entities (Chassy 107 is an individual of Chassis)

• Sentence Individuals (Sen.210 is an individual of Sentence Class)

• Literature Specification Individuals (Parag55 is a Paragraph)

• Datatype Properties

• Has Text of Sentence – value dfffkclscmdscmsmvcmdsmv

• Has Release Number – value 10101.1

• Object Properties

• Telecom Entity occurs in Sentence

• Sentence belongs to Literature Specification Entity
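The three kinds of assertions listed above (class instantiation, datatype properties, object properties) can be pictured as plain subject-predicate-object triples. The sketch below is a Python stand-in for what the JAPE rules produce; the predicate names (`hasTextOfSentence`, `occursIn`, `belongsTo`) and the individual name `Chassis_107` are illustrative assumptions, not identifiers from the project ontology.

```python
def build_triples(sentence_id, sentence_text, entities, paragraph_id):
    """Build triples for one sentence.

    entities: list of (individual_name, class_name) recognised in the sentence.
    """
    triples = [
        # Instantiation of classes
        (sentence_id, "rdf:type", "Sentence"),
        (paragraph_id, "rdf:type", "Paragraph"),
        # Datatype property: the literal text of the sentence
        (sentence_id, "hasTextOfSentence", sentence_text),
        # Object property: sentence belongs to a literature specification entity
        (sentence_id, "belongsTo", paragraph_id),
    ]
    for individual, class_name in entities:
        triples.append((individual, "rdf:type", class_name))
        # Object property: telecom entity occurs in sentence
        triples.append((individual, "occursIn", sentence_id))
    return triples


triples = build_triples(
    "Sen.210", "The chassis has two power supplies.",
    [("Chassis_107", "Chassis")], "Parag55",
)
```

Keeping the sentence and paragraph as individuals in their own right is what later allows every object property candidate to be traced back to its textual evidence.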

Telecom Text Processing Framework

• GATE pipeline

1. Preprocessing (Java)

2. Text Processing (JAPE/GATE)

3. Populating Ontology with Literature Related Predicates. Options: (A) GATE/JAPE/Ontology tool plugin; (B) OWLAPI

4. Telecom Assertions Scoring (Java)

5. Populating with Object Property Assertions (Java/OWLAPI)

• Output: Ontology A-box

Aim

• Accurate extraction of relations between the named entities and population as object properties between A-box individuals in an OWL-DL ontology.

T-Box: Domain Class Man, Object Property hasSister, Range Class Woman

A-Box: Domain Instance Samuel, Object Property hasSister (?), Range Instance Mary

Methodology

• Ontology-based information retrieval applies Natural Language Processing (NLP) to link

– text segments,

– named entities and

– relations between named entities to existing ontologies.

• Algorithm leverages a customized gazetteer list, including lists specific to object property synonyms

• Score A-box property candidates by using functions of distance between co-occurring terms.

• A-box Property prediction and population based on these scores (Thresholds, Fuzzy approach)

Semi-Automatic Ontology populating pipeline

[Figure: pipeline diagram. Labels: Source Documents (XML); Preprocessing; Text Segments Separation (Sentences, Tables, Other Text Segments); Text Segments Processing; Named Entities; Synonyms Lists; Term List (Excel); Connecting Resources; Ontology unpopulated (OWL); Ontology Population (Single Relations, Multi Relations); Populated Ontology; Using Ontology (Reasoning, Visualizing, Visual Queries)]

Populating Ontology

[Figure: architecture diagram. Labels: Relation Framework (A-box candidate extraction, Candidate, Reasoning, Ontology); Scoring Framework (Co-occurrence Based Scores generator, Scores); Decision Framework (Decision Module, Thresholds / Fuzzy); Focus; Labelled Data]

Generation of Scores

Relation Collection

• Framework to process Relation Objects

Relation Object

• Identified as: Domain Class : Domain Instance : Object Property : Range Class : Range Instance

• Integrates object property with:

• all types of related text fragments

• ontology objects

• and score processing intermediate and final results

Co-occurrence Based Score Generator

[Figure: component diagram. Labels: Relations Framework (Relation Object, A-box Candidate, all related content); Synonyms List; Co-occurrence Based Scores Generator (Tokenizer, Gazetteer, Fragment Processor, Score Calculator, Integrator); Scores]

Scores Generator: Details

Score Calculator:

• Score calculation for text fragments associated with the Relation.

• Current version based on distance between co-occurring entities and number of text fragments with co-occurrence

• Involves Text Fragments Processor and Integrator

Score Generation for Multiple Formats

Technical documentation contains knowledge displayed in multiple formats, each requiring different processing subroutines:

• Table Processing

• Sentence Processing

• Other segments

Extensible Data Model

• Corpus

– Document (Doc ID)

– Document Segment: Table Segment or Text Segment

• Table Segment

– Data Cell (ID, Content)

– Row Header (ID, Content)

– Column Header (ID, Content)

– Table Header (ID, Content)

• Text Segment

– Sentence (ID, Content)

– Options: Sections, Paragraphs, Bullet lists, Headings
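One way to picture this data model is as a small set of nested record types. The sketch below uses Python dataclasses; the class and field names are assumptions chosen to mirror the slide labels and are not taken from the project's Java code.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Item:
    """Anything with an ID and textual content (cell, header, sentence)."""
    id: str
    content: str


@dataclass
class TableSegment:
    table_header: Item
    column_headers: List[Item] = field(default_factory=list)
    row_headers: List[Item] = field(default_factory=list)
    data_cells: List[Item] = field(default_factory=list)


@dataclass
class TextSegment:
    sentences: List[Item] = field(default_factory=list)


@dataclass
class Document:
    doc_id: str
    table_segments: List[TableSegment] = field(default_factory=list)
    text_segments: List[TextSegment] = field(default_factory=list)


@dataclass
class Corpus:
    documents: List[Document] = field(default_factory=list)


def all_contents(document):
    """Flatten every piece of text in a document, whatever its segment type."""
    contents = []
    for table in document.table_segments:
        items = [table.table_header] + table.column_headers
        items += table.row_headers + table.data_cells
        contents.extend(item.content for item in items)
    for segment in document.text_segments:
        contents.extend(sentence.content for sentence in segment.sentences)
    return contents
```

Because every leaf carries the same ID and Content pair, a new segment type (for example bullet lists) can be added without changing the code that scores evidence.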

A-Box Prop. Population

[Figure: flow diagram. Labels: Corpus; Gazetteer List; Text Mining; Ontology processing; T-Box Obj. Properties (102); A-Box property Candidates]

A-Box Obj. Properties (399)

• Properties with co-occurrence of domain and range Individuals (143)

• Properties with occurrence of domain or range Individuals (256)

Evidence for A-box Obj. Property candidates

Evidence for Current A-box (occurrence of Domain or Range)

[Figure: the current A-box candidate linked to multiple Text Segments (Sentence: ID, Content) and Table Segments (Data Cell, Row Header, Column Header, Table Header: ID, Content)]

A-Box scoring: Current A-box Object Property Candidate

Evidence for Current A-box (co-occurrence of Domain and Range)

[Figure: the current A-box candidate linked to multiple Text Segments (Sentence: ID, Content) and Table Segments (Data Cell, Row Header, Column Header, Table Header: ID, Content)]

Table Segments: Primary Scoring

[Figure: A-Box scoring of the current A-box Object Property Candidate (Domain, Property, Range) against a Table Segment (Data Cell, Row Header, Column Header, Table Header: ID, Content)]

Table Segments: Secondary Scoring

[Figure: A-Box scoring of the current A-box Object Property Candidate (Domain, Property, Range) against a Table Segment (Data Cell, Row Header, Column Header, Table Header: ID, Content)]

Sentence Scoring Process

• A-box Object property Score for sentence

SentenceScore = 1/(distance+1) + Bonus

• Integrated Object property Score over all related sentences

IntegratedScore = SUM(SentenceScore)

• Summarize Integrated Score with Table Scores

• Normalized Object property Score

NormalizedScore = IntegratedScore/Norm
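The first two formulas translate directly into code. This is a minimal sketch; the function names are chosen for the example, and how distance is counted between the domain and range synonyms is taken as given.

```python
def sentence_score(distance, bonus=0.0):
    """SentenceScore = 1/(distance+1) + Bonus."""
    return 1.0 / (distance + 1) + bonus


def integrated_score(evidence):
    """IntegratedScore = SUM(SentenceScore) over all related sentences.

    evidence: iterable of (distance, bonus) pairs, one per sentence.
    """
    return sum(sentence_score(distance, bonus) for distance, bonus in evidence)
```

With a distance of 4 and no bonus, `sentence_score(4)` gives 0.2; a candidate supported by several sentences accumulates one such term per sentence.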

Sentence scoring: Score = 1/(distance+1) + Bonus

[Figure: three example token sequences marking the positions of D, R and P]

Distance: 4, Bonus = 0, Score = 1/(4+1)+0 = 0.2

Legend: D = Domain Synonym, R = Range Synonym, P = Object Property Synonym

Example Sentence Type 1

Telecommunications_Chassis:Chassis:hasChassis_Components:Telecommunications_Chassis_Power_Supply:Power_Supply

Property Synonyms:

•have

•has

Domain Synonyms:

•chassis

•switch chassis

Range Synonyms:

•Power Supply

•transformer

[Figure: token sequence with D and R marked]

sentence after cleaning:

In a chassis that includes two power supplies in a non redundant power configuration, you must start both restrictions dual power supplies power supply units within 2 seconds of each other.

Final Score=0.05, Best Bonus=0.0, Final Distance=19

•has

•switch chassis

•8000 series

•Chassis

•CO chassis

•transformer

•power supply

•power module

•Power supply

Example Sentence Type 3

Telecommunications_Chassis_Power_Supply:Power_Supply:isPart_of_Chassis:Telecommunications_Chassis:Chassis

Property Synonyms:

•used in

•include

Domain Synonyms:

•Power Supply

•transformer

Range Synonyms:

•chassis

•switch chassis

[Figure: token sequence with D, P and R marked]

sentence after cleaning:

In a chassis that includes two power supplies in a non redundant power configuration, you must start both restrictions dual power supplies power supply units within 2 seconds of each other.

Final Score=10.05, Best Bonus=10.0, Final Distance=19

•include

•transformer

•power supply

•power module

•Power supply

•switch chassis

•8000 series

•Chassis

•CO chassis

Bonus Calculation

Bonus = Bonus Constant * Number of tokens in property

[Figure: two example token sequences marking the positions of D, P and R]

Distance: 6, Bonus Constant = 10, Tokens in Property = 2, Score = 1/(6+1)+2*10 = 20.14

Distance: 6, Bonus Constant = 10, Tokens in Property = 1, Score = 1/(6+1)+1*10 = 10.14

Sentence Example: Device X does not support Device Y

Object Property / Tokens Number / Obtained Score

Support / 1 / 1/(3+1)+1*10 = 10.25

Not Support / 2 / 1/(3+1)+2*10 = 20.25 (selected)
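The effect of the bonus on property selection can be reproduced with a short sketch. It assumes the property synonym has already been matched in the sentence and that tokens are whitespace-separated; `BONUS_CONSTANT` takes the value 10 used in the examples, and the function names are hypothetical.

```python
BONUS_CONSTANT = 10  # value used in the worked examples


def score_with_property(distance, property_synonym):
    """Score = 1/(distance+1) + BonusConstant * number of tokens in property."""
    tokens = len(property_synonym.split())
    return 1.0 / (distance + 1) + BONUS_CONSTANT * tokens


def best_property(distance, matched_synonyms):
    """Among property synonyms matched in a sentence, keep the top scorer."""
    return max(matched_synonyms, key=lambda s: score_with_property(distance, s))


winner = best_property(3, ["support", "not support"])
```

Because the bonus grows with the number of tokens, the longer and more specific synonym wins, so "Device X does not support Device Y" is not mistaken for a positive "support" assertion.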

Normalization

• Norm coefficient for A-box object property

Norm = Log(1.0 + NSD * NSR)

NSD – Number of sentences in which the Domain occurred

NSR – Number of sentences in which the Range occurred
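A small sketch of the normalization step follows. The base of the logarithm is not stated on the slide, so the natural logarithm is an assumption here, as is returning 0.0 when the norm is zero (domain or range never occurred).

```python
import math


def norm_coefficient(nsd, nsr):
    """Norm = Log(1.0 + NSD * NSR); natural log assumed."""
    return math.log(1.0 + nsd * nsr)


def normalized_score(integrated, nsd, nsr):
    """Divide an integrated score by the norm, guarding against a zero norm."""
    norm = norm_coefficient(nsd, nsr)
    return integrated / norm if norm > 0 else 0.0
```

The norm grows with how often the domain and range individuals occur at all, so frequently mentioned entities need proportionally more co-occurrence evidence to reach the same normalized score.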

Gold Standard and Evaluation Framework

[Figure: the semi-automatic ontology population pipeline (Source Documents XML; Preprocessing; Text Segments Separation into Sentences, Tables, Bullet Lists; Text Segments Processing; Named Entities; Synonyms Lists; Term List (Excel); Connecting Resources; Ontology unpopulated (OWL); Ontology Population with Single Relations and Multi Relations; Populated Ontology; Using Ontology: Reasoning, Visualizing, Visual Queries) extended with a Prediction Evaluation Framework. Labels: Knowledge Engineer; T-Box Ontology; A-Box Ontology; Populate Ontology; Import labels; Labels; Golden Standard Database; Evaluate predicted Properties / Update DB; Evaluation Report]

Thresholds: Decision Boundary

• All scores for each A-box property candidate are combined, based on the eligible sources of evidence for the A-box in question

• Threshold in use

• Trade off - Recall vs. Precision
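The trade-off controlled by the threshold can be made concrete with a few lines of code. The scores and gold labels below are invented toy data, not project results, and reporting precision as 1.0 when nothing is predicted is a convention chosen for this sketch.

```python
def precision_recall(scores, labels, threshold):
    """Accept candidates scoring at or above threshold; compare to gold labels."""
    predicted = [score >= threshold for score in scores]
    tp = sum(1 for p, gold in zip(predicted, labels) if p and gold)
    fp = sum(1 for p, gold in zip(predicted, labels) if p and not gold)
    fn = sum(1 for p, gold in zip(predicted, labels) if not p and gold)
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy candidate scores with gold-standard labels (hypothetical).
toy_scores = [0.9, 0.8, 0.4, 0.3, 0.1]
toy_labels = [True, True, False, True, False]
strict = precision_recall(toy_scores, toy_labels, 0.5)
lenient = precision_recall(toy_scores, toy_labels, 0.2)
```

A strict threshold accepts only the strongest candidates (high precision, lower recall), while a lenient one recovers every true assertion at the cost of some false ones.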

Results for Tables: Baseline result

Focus on Positive class Recall and Positive class Precision

• Class of interest (Positive class)

• Recall = 0.80

• Precision = 0.85

Results for Tables: Continued

Focus on Positive class Precision

• Recall = 0.25

• Precision = 1.0

Results for Tables: Continued

Focus on Positive class Recall

• Recall = 1.0

• Precision = 0.775

Results for Sentences

• Recall = 0.14

• Precision = 1.0

Results for Sentences and Tables

• Class of interest (Positive class)

• Recall = 0.4

• Precision = 1.0

• Synergistic effect of using Sentences and Tables (wrt Precision = 1.0):

Recall (sentences) = 0.14

Recall (tables) = 0.25

Recall (sentences & tables) = 0.4

Gold Standard Corpus and Evaluation Framework

[Figure: ontology population pipeline with Prediction Evaluation Framework. Labels: Source Documents XML; Preprocessing; Text Segments Separation (Sentences, Tables, Bullet Lists); Text Segments Processing; Named Entities; Synonyms Lists; Term List (Excel); Connecting Resources; Ontology unpopulated (OWL); Ontology Population (Single Relations, Multi Relations); Populated Ontology; Using Ontology (Reasoning, Visualizing, Visual Queries); Knowledge Engineer; T-Box Ontology; A-Box Ontology; Populate Ontology; Import Labels; Golden Standard Database; Evaluate Predicted Properties / Update DB; Evaluation Report]

Semantic Knowledge Discovery:

Contact Centre

Contact Centre

Agents

• GATE NLP - Framework

• OWL-DL Ontology

• OWL API

• Custom Telecom Gazetteers

• Pellet Reasoner

• TopBraid Composer

• TopBraid Live/Ensemble

Contact Center Environment

• Tier 1:

– Information gathering/validation

– Initial problem solving

– Requires highly precise information

– Needs simple-to-use user interface

• Tier 2:

– Problem escalation or information not found

– Requires high information recall

– Requires advanced search capabilities

Visual Query over Telecom KB

Visual Query: Network Routing Server has a Configuring and Enabling Procedure

TopBraid Ensemble

Pilot Study: Positive Results

[Figure: bar charts; y-axis: % Change Compared to Old Search Toolset]

• Tier 1 found the right information with less need for escalation

– Now able to find documents 90% of the time (old toolset 75%)

• Tier 2 has more tasks and toolset features to learn – longer learning curve

Acknowledgements

• Atlantic Innovation Fund of the Atlantic Canada Opportunities Agency

• Project Team

– Christopher JO Baker (1): Primary Investigator

– Bradley Shoebottom (2): Knowledge Engineer

– Alex Kouznetsov (1): Text Mining Engineer

– Michael Doyle (2): Network Infrastructure

• Testing Team

– Karen Lewis (2): Information Architect

– Innovatia Technical support team

• Dearran Townes, Amanda Chase, Darrell Flynn, Gregg Knight, Corey Harris, Andrew Madsen

(1) UNBSJ, (2) Innovatia