Christopher J. O. Baker
University of New Brunswick, Canada
Augmentations of Text Mining for Semantic Knowledge Discovery
In the future ….
• Users will be involved in the design of information systems
• Publishers will charge users for value added search:
(who will build such search systems)
• Users will search across semantically integrated data sources and data types (how to facilitate system creation / adoption). Maybe they won’t know it!
• Knowledge driven systems - rapidly built / deployed with the engagement of domain experts in a knowledge engineering team
Literature-driven, Ontology-centric Knowledge Integration and Navigation
[Figure: workflow around a central Ontology linking Text Mining, Ontology Population, Reasoning and Visual Query. Content delivery using expressive semantics reduces 500 documents, blogs and newsfeeds to browse to 50 sentences to read.]
Outline
• Application Domains
– Lipids
– Mutations
– Contact Centre
• Platforms
• Knowledge Discovery
– Knowledgator
– Semantic Assistant with Firefox
– SADI
• Scoring Candidate Object Property Assertions
Lipid Ontology
Lipid Hierarchy
Concept Definitions
DL Axioms
Graph fragment
Subject                             No.
Total No. of Classes                715
Primitive Class                     449
Defined Class                       266
No. Lipid Classes                   428
No. Lipid Classes with DL axioms    400
Total No. of Restrictions           901
Total No. of Properties             41
DL Expressivity: ALCHIQ(D)
> Implementation: OWL-DL
> Uses LIPIDMAPS systematic nomenclature
> Lipid instance:
LIPIDMAPS systematic name
> Depth: 8 levels
> Domain Knowledge vs information system metadata
Domain Ontology vs Mixed Metadata:
a literature specification
Ontology Population Workflow
• Ontology-based information retrieval applies NLP to link documents to existing ontologies
• Ontology-driven NLP - NLP that actively uses ontological resources for NLP tasks
• Ontological NLP - ontologies used as a knowledge base for NLP tasks while also exporting the results of NLP analyses into an ontology that can then support subsequent semantic queries to the ontology using description logic reasoners and A-box reasoning
• Ontology-based NLP - the results of NLP are exported to another ontology, using external resources for text processing
Witte et al. 2007
Text Mining
• Concept Instance Generation from full text
– Named entity recognition (gazetteer based)
– Dictionary based matching of text tokens to domain specific
vocabularies i.e. (LipidBank, Lipidmaps, KEGG, IUPAC) and
curated Swissprot terms and disease ontology of CGM
– Normalization and grounding to canonical names
• Relation Detection - Role Assertions:
– Co-occurrence and Rule-based relation detection of binary
pairs from which knowledgebase instances are generated.
Primary set of binary interactions mined from text:
– Lipid-Protein, Lipid-Disease, Protein-Disease
– Domain specific library of curated biological relations.
Mined Interactions in the ovarian cancer literature
C. Baker; R. Kanagasabai; W. Ang; H. Low; M. Wenk; A. Fernandis; M. Choolani; K. Narasimhan. Mining to Find the Lipid Interaction Networks Involved in Ovarian Cancers. American Medical Informatics Association, 2009 Summit on Translational Bioinformatics, March 15-17, San Francisco, California, 2009, AMIA-0154-T2009.
Knowledgebase Instantiation
1) Rule-based identification of sentences containing target keywords
2) Instantiation with the JENA API (http://jena.sourceforge.net/)
Target keywords found in sentences are instantiated to the corresponding ontology class
• Lipid / Protein / Disease instances are instantiated to the respective ontology classes (as tagged by the gazetteer)
• Binary pairs are instantiated to the respective Object Properties as role assertions
• Sentences are instantiated to the respective Datatype properties.
For each lipid identified in a sentence, the corresponding data are instantiated to the ontology from Lipid Data Warehouse records, requiring no further text processing.
• Lipid - LIPIDMAPS Systematic Name and its associated
• Lipid - IUPAC Name
• Lipid - synonyms
• Lipid - Database ID
Knowledgebase Instantiation
Lipid Instance
Lipid Class Protein
Instance
Rule Based Sentence Processing
<Lipid> AND <Protein> AND LipidProteinInteraction-TriggerWord, e.g. "interact", "bind", "mediate"
<Lipid> AND <Disease> AND LipidDiseaseInteraction-TriggerWord, e.g. "involve", "cause"
Lipid Instance
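The two rules for rule-based sentence processing can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tiny gazetteers, the relation labels "Lipid-Protein" and "Lipid-Disease" used as tuple tags, and the function names are all hypothetical, and trigger words are matched as simple word stems rather than with the system's curated relation library.

```python
import re

# Hypothetical mini-gazetteers for illustration only; the real system uses
# curated vocabularies (LipidBank, LIPIDMAPS, KEGG, IUPAC, Swiss-Prot, ...).
LIPIDS = {"ceramide", "sphingosine"}
PROTEINS = {"pkc", "akt"}
DISEASES = {"ovarian cancer"}
LIPID_PROTEIN_TRIGGERS = ("interact", "bind", "mediate")
LIPID_DISEASE_TRIGGERS = ("involve", "cause")


def find_terms(sentence, vocabulary):
    """Return the vocabulary terms that occur as whole words in the sentence."""
    text = sentence.lower()
    return sorted(t for t in vocabulary
                  if re.search(r"\b" + re.escape(t) + r"\b", text))


def has_trigger(sentence, triggers):
    """True if a word in the sentence starts with a trigger stem (e.g. 'binds')."""
    text = sentence.lower()
    return any(re.search(r"\b" + re.escape(t), text) for t in triggers)


def extract_pairs(sentence):
    """Apply both rules and return (lipid, relation, partner) triples."""
    lipids = find_terms(sentence, LIPIDS)
    pairs = []
    # Rule 1: <Lipid> AND <Protein> AND a lipid-protein trigger word
    if has_trigger(sentence, LIPID_PROTEIN_TRIGGERS):
        pairs += [(l, "Lipid-Protein", p)
                  for l in lipids for p in find_terms(sentence, PROTEINS)]
    # Rule 2: <Lipid> AND <Disease> AND a lipid-disease trigger word
    if has_trigger(sentence, LIPID_DISEASE_TRIGGERS):
        pairs += [(l, "Lipid-Disease", d)
                  for l in lipids for d in find_terms(sentence, DISEASES)]
    return pairs
```

A sentence with co-occurring entities but no trigger word yields no pair, which mirrors the drop from co-occurrence sentences to interaction sentences once the rules are applied.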
Knowledge Integration and Query
[Figure: a user input query goes to a search engine over web content or full text papers; the retrieved docs are tagged with relevant named entities by NLP tagging and used for ontology instantiation; the instantiated ontology feeds a Knowledge Navigation vehicle that produces the output for the end user.]
Papers identified: 262
121 papers with no lipid protein relations
141 papers contributed to ontology instantiation
186 lipid names
528 protein names
Baker CJ, Kanagasabai R, Ang WT, Veeramani A, Low HS, and Wenk MR. Towards ontology-driven navigation of the lipid bibliosphere. BMC Bioinformatics. 2008;9 Suppl 1:S5.
After normalisation and grounding:
92 Lipidmaps systematic names
52 IUPAC names, 412 exact synonyms, 6 broad synonyms, 319 protein names
Cross link to 59 Lipidbank entries
Sentences:
Co-occurrence before rules 1356 Sentences, After rules 683 Interaction sentences
92 Lipidmaps names instantiated to 35 classes (2.6 lipids per class)
Instantiation Time: 22 seconds
“Instantiated ontology”
Pathway Discovery from Documents
Text Mining Systems for Mutations
2004 – 2009
• MuteXt (Protein Point Mutation) (P=87.9% R=85.8%)
(P=49.3% R= 64.5%)
• MEMA (Regex DNA / Protein, HUGO) (P=75% R=98%)
• MutationFinder (Regex) + rules (P=98%, R=81%)
• ProMiner SNPExtraction and normalization / grounding (P=78%, R=67%)
• mSTRAP RegEx plus protein or organism name, (P=94.5% R=79.6%)
• mSTRAP Grounding / Normalization to db (P=91.8% R=80.9%)
• VTag: (CRF approach) in special context of cancer, no mapping to database
• OSIRIS: query expansion (for all SNPs of a found gene: PubMed query); slow, limited to results of PubMed search engine (P=99% R=82%)
Mutation Sentences
“As expected, complete loss in activity of W109L and sustained
activity of F151W were observed.”
”In order to further understand the catalytic mechanism we
constructed an Asp-124->Asn mutant enzyme.”
"Haloalkane dehalogenase (DhlA) from Xanthobacter
autotrophicus GJI0 hydrolyses terminally chlorinated and
brominated n-alkanes to the corresponding alcohols."
Mutation Description Protein name Gene name Organism name
“DhlA shows only a small decrease in activity when Trp-125 is
replaced with phenylalanine.”
Mutation Extraction System Modules
• Named entity recognition
– Protein and gene names
– Organism names
– Mutation descriptions
• Named entity disambiguation
– Protein grounding
– Mutation grounding
Mutation Grounding
1. Mutation extraction system modules
2. Mutation grounding method
3. Mutation grounding results
4. Access to grounded mutations
Grounding Definitions
• Protein grounding
– Assign the correct UniProt id to each detected
protein entity.
• Mutation grounding
– Verify and, if necessary, positionally correct each
mutation location to match its corresponding
protein's sequence as obtained from UniProt.
Mutation Grounding
Entity Grounding (Resources)
1. Protein and gene names, part 1
Gazetteer list / one word names / case sensitive
source: SwissProt (stop words removed / filtered through whatizit)
2. Protein and gene names, part 2
Gazetteer list / more than one word / case insensitive
source: SwissProt (stop words removed / filtered through whatizit)
3. Organism names
Gazetteer list / case insensitive / source: SwissProt
4. Mutation initializer (JAPE)
passes the document content to MutationFinder, updates the mutation gazetteer accordingly and adds the normalized (wNm) format as a feature to each entry
Entity Grounding (Resources)
5. Mutation descriptions
gazetteer list created/updated by mutation initializer.
6. Tokenizer and Sentence splitter
7. Mutation grounder (JAPE)
implements the methods presented at DILS 2010.
8. Mutation impact extractor, part 1 (JAPE)
i) directionality extractor.
ii) possible molecular function extractor (activity,binding )
Entity Grounding (Resources)
9. POS tagger
Only considers sentences containing possible molecular functions. Prerequisite for MunPex.
10. MunPex
Noun phrase extractor. Prerequisite for molecular function grounder.
11. Mutation impact extractor, part 2 (JAPE)
i) molecular function grounder
ii) kinetic variable extractor
iii) subsentence splitter (divides sentences containing more than one molecular function and/or kinetic variable)
iv) impact extractor (co-occurrence rules in InCoB paper)
v) mutant-impact relation detector
Protein grounding (partial)
• Co-dependence: Protein / Mutation Grounding
1. Retrieve all gene- and protein mentions.
2. Retrieve related accession numbers
3. Filter with organism names
4. Use sequences related to most occurring ACs
for mutation grounding
Mutation Grounding
1. Retrieve and normalize mutations
2. For each candidate sequence
1. For each pair of mutations
1. Make regexp w1.(N2-N1)w2
2. Match regexp to sequence
3. Check remaining residues at corrected positions.
3. Ground proteins and mutations to the
same AC / sequence
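The grounding steps can be illustrated with a short Python sketch. It is not the published implementation: the function names are hypothetical, mutations are assumed to be already normalized to the wNm form (wild-type residue, position, mutant residue), and the slide's shorthand w1.(N2-N1)w2 is read here as "w1, then N2-N1-1 arbitrary residues, then w2", which is an assumption.

```python
import re
from itertools import combinations


def parse_mutation(text):
    """Parse the normalized wNm form, e.g. 'W109L' -> ('W', 109, 'L')."""
    m = re.fullmatch(r"([A-Z])(\d+)([A-Z])", text)
    if m is None:
        raise ValueError(f"not a wNm mutation: {text!r}")
    return m.group(1), int(m.group(2)), m.group(3)


def candidate_offsets(sequence, mutations):
    """Offsets implied by matching every pair of mutations to the sequence."""
    offsets = set()
    ordered = sorted(mutations, key=lambda mut: mut[1])
    for (w1, n1, _), (w2, n2, _) in combinations(ordered, 2):
        gap = n2 - n1 - 1
        if gap < 0:
            continue  # two mutations reported at the same position
        # Lookahead so that overlapping matches are all found.
        pattern = re.compile(f"(?={w1}.{{{gap}}}{w2})")
        for match in pattern.finditer(sequence):
            offsets.add(match.start() + 1 - n1)  # 1-based position minus N1
    return offsets


def ground(sequence, mutations):
    """Return (number of grounded mutations, offset); (0, None) if no match."""
    best = (0, None)
    for offset in sorted(candidate_offsets(sequence, mutations)):
        # Check all wild-type residues at the corrected positions.
        grounded = sum(
            1 for w, n, _ in mutations
            if 1 <= n + offset <= len(sequence) and sequence[n + offset - 1] == w
        )
        if best[1] is None or (grounded, -abs(offset)) > (best[0], -abs(best[1])):
            best = (grounded, offset)
    return best


def choose_sequence(candidates, mutations):
    """Pick the accession with most grounded mutations, then least |offset|."""
    scored = {ac: ground(seq, mutations) for ac, seq in candidates.items()}
    scored = {ac: r for ac, r in scored.items() if r[1] is not None}
    if not scored:
        return None
    best = max(scored, key=lambda ac: (scored[ac][0], -abs(scored[ac][1])))
    return best, scored[best]
```

Because the regular expression is built from pairs, this sketch cannot ground a document that mentions only a single mutation; a fuller implementation would need a fallback for that case.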
Mutation Grounding Example
[Figure sequence: candidate sequences 1, 2 and 3 are matched against the candidate mutations. Steps shown: compute regular expression; match with sequence 1; extend match A; extend match B; match with sequences 2 & 3. Each candidate sequence is then annotated with # mutations / offset (values shown: 4 / -1, 4 / 5, 2 / 0).]
Choose best candidate sequence:
1. Most grounded mutations
2. Least absolute offset
Grounding Evaluation Corpora
• COSMIC
– Catalogue Of Somatic Mutations In Cancer
– PIK3CA, FGFR3, MEN1
– 63 documents
• Haloalkane dehalogenases
– Protein engineering literature
– 13 documents
Mutation Grounding Evaluation
Access to Grounded Mutations
• Export results to an RDF triple store
• Make both the pipeline and triple store
available through semantic web services
(SADI)
• Make use of the semantic assistants
architecture to present knowledge directly
when browsing PubMed
http://sadiframework.org/
Access to Grounded
Mutations and Impacts
SPARQL Queries
• SADI simply comprises a set of standards-compliant conventions and suggested best-practices for data representation and exchange between Web Services that fully utilizes Semantic Web technologies.
• SADI mandates the inclusion of a single required annotation in the Web Service metadata that describes the biological relationship ("predicate") that is created between the input and output data of that Service
Impact Direction Term List
Impact Classification
Mutation Ontology
Semantic Desktop Assistant
René Witte and Thomas Gitzinger.
Semantic Assistants – User-Centric Natural Language Processing Services for Desktop Clients.
3rd Asian Semantic Web Conference (ASWC 2008), February 2–5, 2009, Bangkok, Thailand.
Springer LNCS 5367, pp. 360–374. (Acceptance rate: 31%)
1) Firefox
2) GreaseMonkey Plugin
3) Install Lipid SA service (Java Script / Tomcat)
Online Annotation workflow
Lipid Ontology
Lipid Hierarchy
Concept Definitions
DL Axioms
Graph fragment
http://www.lipidprofiles.com/LipidOntology/LiPrO-02042009.owl
Total No. of Classes 715
No. of Lipid Classes 492
Primitive Lipid Classes 107
Defined Lipid Classes 268
Total No. Restrictions 901
Total No. Properties 41
DL Expressivity ALCHIQ(D)
Organic Group
Total no. of simple organic group = 95 (2009)
hasPart
Lipid Organic_Group
PartOf
Extensions
required to
support lipid
characterization
Organic group
From Chemical
Ontology
Lipid Axioms
Acknowledgements
• NUS Office of Life Science (R-183-000-607-712),
• ARF (R-183-000-160-112),
• BMRC A*STAR (R-183-000-134-305),
• Singapore NRF under CRP award No. 2007-04,
• NBIF, New Brunswick, Canada.
• NSERC, Discovery grant. Canada
• Quebec - New Brunswick University Co-operation in Advanced Education - Research Program, Government of New Brunswick, Canada
• NSERC Discovery Grant, Canada 2009 - Baker CJO
Algorithm to populate Telecom domain OWL-DL ontology with A-
box object properties derived from Technical Support Documents
1Kouznetsov A, 2Shoebottom B, 1Baker CJO
1 University of New Brunswick, Saint John, Canada
2 Innovatia, Inc, Saint John, Canada
Telecom Project:
GATE Pipeline and Resources
• ANNIE Pipeline:
– Gazetteer
– Sentence Splitter
– JAPE Transducer
• Resources:
– XML documents
– Telecom Gazetteer List
– T-box Ontology (optional for GATE pipeline)
Recognised “Entities”
• Named Entities
• Sentences
• Original Markup
– Document Segmentation
• Paragraphs,
• Sections,
• Headers, etc.
• JAPE rules for combining entities as triples
Using JAPE Rules to Build Triples
• Instantiation of Classes
• Telecom Named Entities (Chassy 107 is individual of Chassis)
• Sentence Individuals (Sen.210 is individual of Sentence Class)
• Literature Specification Individuals (Parag55 is a Paragraph)
• Datatype Properties
• Has Text of Sentence – value dfffkclscmdscmsmvcmdsmv
• Has Release Number – value 10101.1
• Object Properties
• Telecom Entity occurs in Sentence
• Sentence belongs to Literature Specification Entity
Telecom Text Processing Framework
• GATE pipeline
1. Preprocessing (Java)
2. Text Processing (Jape/Gate)
3. Populating Ontology with Literature Related Predicates. Options: (A) GATE/JAPE/Ontology tool plugin; (B) OWLAPI
4. Telecom Assertions Scoring (Java)
5. Populating with Object Property Assertions (Java/OWLAPI)
• Output: Ontology A-box
Aim
• Accurate extraction of relations between the named entities and their population as object properties between A-box individuals in an OWL-DL ontology.
[Figure: T-Box: Domain Class Man, Object Property hasSister, Range Class Woman. A-Box: Domain Instance Samuel, Range Instance Mary; whether hasSister holds between them is the open question (?).]
Methodology
• Ontology-based information retrieval applies Natural Language Processing (NLP) to link
– text segments,
– named entities and
– relations between named entities to existing ontologies.
• Algorithm leverages a customized gazetteer list, including lists specific to object property synonyms
• Score A-box property candidates by using functions of distance between co-occurring terms.
• A-box Property prediction and population based on these scores (Thresholds, Fuzzy approach)
Semi-Automatic Ontology populating pipeline
[Figure: pipeline stages: Source Documents (XML); Preprocessing; Text Segments Separation (Sentences, Tables, Other Text Segments); Text Segments Processing; Named Entities; Ontology Population (Single Relations, Multi Relations); Populated Ontology; Using Ontology (Reasoning, Visualizing, Visual Queries). Connecting Resources: Term List (Excel), Synonyms Lists, unpopulated Ontology (OWL).]
Populating Ontology
[Figure: three frameworks around the Ontology. Relation Framework: A-box candidate extraction and candidate reasoning. Scoring Framework: co-occurrence based scores generator. Decision Framework: decision module using thresholds / fuzzy scores and labelled data. The Scoring Framework is marked as the focus.]
Generation of Scores
Relation Collection
• Framework to process Relation Objects
Relation Object
• Identified as: Domain Class : Domain Instance : Object Property : Range Class : Range Instance
• Integrates object property with:
– all types of related text fragments
– ontology objects
– score processing intermediate and final results
Co-occurrence Based Score Generator
[Figure: the Relations Framework passes a Relation Object (the A-box candidate with all related content) and the Synonyms List to the Co-occurrence Based Scores Generator, whose components are Tokenizer, Gazetteer, Fragment Processor, Score Calculator and Integrator; its output is Scores.]
Scores Generator: Details
Score Calculator:
• Score calculation for text fragments associated
with the Relation.
• Current version based on distance between co-occurring entities and number of text fragments with co-occurrence
• Involves Text Fragments Processor and
Integrator
Score Generation for Multiple Formats
Technical documentation contains knowledge displayed in multiple formats, each requiring different processing subroutines:
• Table Processing
• Sentence Processing
• Other segments
Extensible Data Model
Corpus
  Document (Doc ID)
    Document Segment
      Table Segment
        Data Cell (ID, Content)
        Row Header (ID, Content)
        Column Header (ID, Content)
        Table Header (ID, Content)
      Text Segment
        Sentence (ID, Content)
        Options: Sections, Paragraphs, Bullet lists, Headings
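One way to picture the data model is as a set of nested record types. The Python dataclasses below are a hypothetical rendering: the class and field names are chosen for illustration and are not taken from the project's Java code.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Element:
    """Smallest unit of evidence: an ID plus its text content."""
    id: str
    content: str


@dataclass
class TextSegment:
    """A text segment holding sentences (other options: paragraphs, headings)."""
    sentences: List[Element] = field(default_factory=list)


@dataclass
class TableSegment:
    """A table segment with its header and the cells grouped by role."""
    table_header: Element
    column_headers: List[Element] = field(default_factory=list)
    row_headers: List[Element] = field(default_factory=list)
    data_cells: List[Element] = field(default_factory=list)


@dataclass
class Document:
    doc_id: str
    segments: List[Union[TextSegment, TableSegment]] = field(default_factory=list)


@dataclass
class Corpus:
    documents: List[Document] = field(default_factory=list)


def count_sentences(corpus: Corpus) -> int:
    """Count sentences across all text segments of all documents."""
    return sum(
        len(seg.sentences)
        for doc in corpus.documents
        for seg in doc.segments
        if isinstance(seg, TextSegment)
    )
```

Keeping tables and text as separate segment types lets each format be scored by its own subroutine, and a new segment type can be added without touching the existing ones.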
A-Box Prop. Population
[Figure: ontology processing yields the T-Box Obj. Properties (102); text mining of the corpus with the Gazetteer List yields the A-Box property candidates: A-Box Obj. Properties (399), split into properties with co-occurrence of domain and range individuals (143) and properties with occurrence of domain or range individuals (256).]
Evidence for A-box Obj. Property candidates
[Figure: evidence for the current A-box candidate (occurrence of Domain or Range): several Text Segments (Sentence: ID, Content) and Table Segments (Data Cell, Row Header, Column Header, Table Header: each with ID, Content).]
A-Box scoring: Current A-box Object Property Candidate
[Figure: evidence for the current A-box candidate (co-occurrence of Domain and Range): the same Text Segment and Table Segment structures, restricted to segments in which both domain and range occur.]
Table Segments: Primary Scoring
[Figure: A-Box scoring of the current A-box object property candidate (Domain, Property, Range) against one Table Segment: Data Cell, Row Header, Column Header and Table Header, each with ID and Content.]
Table Segments: Secondary Scoring
[Figure: same Table Segment layout and current A-box object property candidate (Domain, Property, Range), showing the secondary scoring case.]
Sentence Scoring Process
• A-box object property score for a sentence
SentenceScore = 1/(distance+1) + Bonus
• Integrated object property score over all related sentences
IntegratedScore = SUM(SentenceScore)
• Summarize Integrated Score with Table Scores
• Normalized object property score
NormalizedScore = IntegratedScore/Norm
Sentence scoring: Score = 1/(distance+1) + Bonus
D = Domain Synonym, R = Range Synonym, P = Object Property Synonym
Type 1 (D and R, no P): Distance: 4, Bonus = 0, Score = 1/(4+1)+0 = 0.2
Type 2 (D and R, P outside the D-R span): Distance: 6, Bonus = 3, Score = 1/(6+1)+3 = 3.14
Type 3 (P between D and R): Distance: 4, Bonus = 10, Score = 1/(4+1)+10 = 10.2
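The sentence score is simple enough to check by hand. A minimal Python sketch, assuming the distance between the domain and range synonyms has already been computed by the pipeline (the function name is illustrative):

```python
def sentence_score(distance, bonus=0.0):
    """Score = 1/(distance+1) + Bonus for one sentence.

    distance: token distance between the domain and range synonyms
    bonus:    reward granted when an object property synonym is found
    """
    if distance < 0:
        raise ValueError("distance must be non-negative")
    return 1.0 / (distance + 1) + bonus


# The three worked cases from the slide, rounded to two decimals.
examples = [
    round(sentence_score(4, 0), 2),
    round(sentence_score(6, 3), 2),
    round(sentence_score(4, 10), 2),
]
```

Since the distance term never exceeds 1, any non-zero bonus dominates the score; the distance mainly ranks sentences that carry the same bonus.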
Example Sentence Type 1
Telecommunications_Chassis:Chassis:hasChassis_Components:Telecommunications_Chassis_Power_Supply:Power_Supply
Property Synonyms:
• have
• has
Domain Synonyms:
• chassis
• switch chassis
Range Synonyms:
• Power Supply
• transformer
Sentence after cleaning:
In a chassis that includes two power supplies in a non redundant power configuration, you must start both restrictions dual power supplies power supply units within 2 seconds of each other.
Final Score = 0.05, Best Bonus = 0.0, Final Distance = 19
Synonyms shown for matching: has (property); switch chassis, 8000 series, Chassis, CO chassis (domain); transformer, power supply, power module, Power supply (range)
Example Sentence Type 3
Telecommunications_Chassis_Power_Supply:Power_Supply:isPart_of_Chassis:Telecommunications_Chassis:Chassis
Property Synonyms:
• used in
• include
Domain Synonyms:
• Power Supply
• transformer
Range Synonyms:
• chassis
• switch chassis
Sentence after cleaning:
In a chassis that includes two power supplies in a non redundant power configuration, you must start both restrictions dual power supplies power supply units within 2 seconds of each other.
Final Score = 10.05, Best Bonus = 10.0, Final Distance = 19
Synonyms shown for matching: include (property); transformer, power supply, power module, Power supply (domain); switch chassis, 8000 series, Chassis, CO chassis (range)
Bonus Calculation
Bonus = Bonus Constant * Number of tokens in property
Distance: 6, Bonus Constant = 10, Tokens in Property = 2, Score = 1/(6+1)+2*10 = 20.14
Distance: 6, Bonus Constant = 10, Tokens in Property = 1, Score = 1/(6+1)+1*10 = 10.14
Sentence Example: Device X does not support Device Y
Object Property    Tokens Number    Obtained Score
Support            1                1/(3+1)+1*10 = 10.25
Not Support        2                1/(3+1)+2*10 = 20.25 (selected)
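The effect of the token-count bonus can be reproduced with a short Python sketch. The helper names are hypothetical, and the bonus constant of 10 is taken from the worked example rather than from any documented default.

```python
BONUS_CONSTANT = 10.0  # value used in the worked example


def property_bonus(property_synonym, bonus_constant=BONUS_CONSTANT):
    """Bonus = bonus constant * number of tokens in the property synonym."""
    return bonus_constant * len(property_synonym.split())


def score_with_property(distance, property_synonym):
    """Sentence score for a matched object property synonym."""
    return 1.0 / (distance + 1) + property_bonus(property_synonym)


def best_property(distance, matched_synonyms):
    """Pick the matched property synonym that yields the highest score."""
    return max(matched_synonyms, key=lambda s: score_with_property(distance, s))
```

For "Device X does not support Device Y", both "support" and "not support" match; the longer synonym receives twice the bonus, so the negated property is selected instead of its opposite.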
Normalization
• Norm coefficient for A-box object property
Norm = Log(1.0 + NSD * NSR)
NSD – Number of sentences in which the Domain occurred
NSR – Number of sentences in which the Range occurred
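A minimal sketch of the normalization step in Python. Two points are assumptions, since the slides do not specify them: the logarithm is taken as the natural log, and the table score is simply added to the summed sentence scores.

```python
import math


def norm_coefficient(nsd, nsr):
    """Norm = log(1.0 + NSD * NSR); natural log is an assumption."""
    return math.log(1.0 + nsd * nsr)


def normalized_score(sentence_scores, nsd, nsr, table_score=0.0):
    """Integrated score (sentences plus tables) divided by the norm.

    nsd: number of sentences in which the domain occurred
    nsr: number of sentences in which the range occurred
    """
    integrated = sum(sentence_scores) + table_score
    norm = norm_coefficient(nsd, nsr)
    if norm == 0.0:
        return 0.0  # domain or range never occurred: nothing to normalize
    return integrated / norm
```

Dividing by a quantity that grows with the occurrence counts of domain and range damps the scores of very frequent entities, which would otherwise accumulate a high integrated score through sheer frequency.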
Gold Standard and Evaluation Framework
[Figure: the semi-automatic ontology population pipeline (Source Documents in XML; Preprocessing; Text Segments Separation into Sentences, Tables and Bullet Lists; Text Segments Processing; Named Entities; Ontology Population with Single and Multi Relations; Populated Ontology; Using Ontology for Reasoning, Visualizing and Visual Queries; Connecting Resources: Term List (Excel), Synonyms Lists, unpopulated Ontology (OWL)) extended with a Prediction Evaluation Framework: a Knowledge Engineer supplies labels based on the T-Box Ontology; labels are imported into a Golden Standard Database; predicted properties from the A-Box Ontology are evaluated against it, the database is updated, and an Evaluation Report is produced.]
Thresholds: Decision Boundary
• All scores for each A-box property candidate are summarized based on eligible sources of evidence for the A-box in question
• Threshold in use
• Trade off - Recall vs. Precision
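The recall versus precision trade-off of a score threshold can be illustrated with a small Python sketch. The candidate names, scores and gold labels used in the usage note are invented for illustration and are not project data.

```python
def predict(scores, threshold):
    """Accept every A-box candidate whose summarized score reaches the threshold."""
    return {candidate for candidate, score in scores.items() if score >= threshold}


def precision_recall(predicted, gold):
    """Precision and recall of the positive class against gold-standard labels."""
    true_positives = len(predicted & gold)
    # Convention chosen here: report 0.0 when a denominator is empty.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


def sweep(scores, gold, thresholds):
    """Evaluate several thresholds to expose the trade-off."""
    return {t: precision_recall(predict(scores, t), gold) for t in thresholds}
```

With invented scores {"a": 5.0, "b": 2.0, "c": 0.5, "d": 0.1} and gold set {"a", "b", "c"}, a threshold of 1.0 accepts only "a" and "b" (precision 1.0, recall 2/3), while a threshold of 0.05 accepts all four candidates (precision 0.75, recall 1.0).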
Results for Tables: Baseline result
Focus on Positive class Recall and Positive
class Precision
• Class of interest (Positive class)
• Recall =0.80
• Precision=0.85
Results for Tables: Continued
Focus on Positive class Precision
• Class of interest (Positive class)
• Recall =0.25
• Precision=1.0
Results for Tables: Continued
Focus on Positive class Recall
• Class of interest (Positive class)
• Recall =1.0
• Precision=0.775
Results for Sentences
Focus on Positive class Precision
• Class of interest (Positive class)
• Recall =0.14
• Precision=1.0
Results for Sentences and Tables
Focus on Positive class Precision
• Class of interest (Positive class)
• Recall =0.4
• Precision=1.0
• Synergetic effect of using Sentences and Tables (wrt Precision=1.0):
Recall (sentences)= 0.14
Recall (tables)= 0.25
Recall (sentences & tables)= 0.4
Gold Standard Corpus and Evaluation Framework
[Figure: the ontology population pipeline (Source Documents in XML, Preprocessing, Text Segments Separation and Processing, Named Entities, Ontology Population, Populated Ontology, Using Ontology) together with the Prediction Evaluation Framework: Knowledge Engineer, T-Box Ontology, Labels, Import Labels, Golden Standard Database, A-Box Ontology, Evaluate Predicted Properties / Update DB, Evaluation Report.]
Semantic Knowledge Discovery:
Contact Centre
Contact Centre
Agents
• GATE NLP - Framework
• OWL-DL Ontology
• OWL API
• Custom Telecom Gazetteers
• Pellet Reasoner
• TopBraid Composer
• TopBraid Live/Ensemble
Contact Center Environment
• Tier 1:
– Information gathering/validation
– Initial problem solving
– Requires highly precise information
– Needs simple-to-use user interface
• Tier 2:
– Problem escalation or information not found
– Requires high information recall
– Requires advanced search capabilities
Visual Query over Telecom KB
Visual Query: Network Routing Server has a Configuring and Enabling Procedure
TopBraid Ensemble
Pilot Study: Positive Results
[Figure: two bar charts; axis label: % Change Compared to Old Search Toolset]
• Tier 1 found the right information with less need for escalation
– Now able to find documents 90% of the time (old toolset 75%)
• Tier 2 has more tasks and toolset features to learn – longer learning curve
Acknowledgements
• Atlantic Innovation Fund of the Atlantic Canada Opportunities Agency
• Project Team
– Christopher JO Baker1: Primary Investigator
– Bradley Shoebottom2: Knowledge Engineer
– Alex Kouznetsov1: Text Mining Engineer
– Michael Doyle2: Network Infrastructure
• Testing Team
– Karen Lewis2: Information Architect
– Innovatia Technical support team
• Dearran Townes, Amanda Chase, Darrell Flynn, Gregg Knight, Corey Harris, Andrew Madsen
1 UNBSJ, 2 Innovatia