Knowledge Assembly at Scale with Semantic and Probabilistic Techniques

Szymon Klarman

Department of Computer Science, Brunel University London

Connected Data London 2016

Scientific publishing deluge

50 million papers published since 1665

2.5 million papers published last year

publication output doubling every 9 years

Effects:

narrowing of science and scholarship – we cite a small pool of mostly recent papers

narrowing of expertise

the "publish or perish" principle affects the quality of results

Big Mechanism

Reading, Assembly, Explanation

Challenges

• ambiguity and vagueness of natural language

• general quality and reliability of the sources

• the inaccuracy of the information extraction tools

• the typical "Vs" of big data: volume, variety, volatility, velocity

• inconsistent, inconclusive or non-reproducible results

• gaps, omissions, contextual assumptions

In vitro curcumin downregulated the expression of Bcl-2, and Bcl-XL and upregulated the expression of p53, Bax, Bak, PUMA, Noxa, and Bim at mRNA and protein levels in prostate cancer cells [14].

Figure: evidence is turned into knowledge through extraction, reconciliation, filtering, aggregation, and model formation.

Knowledge assembly is the process of reconstructing complex knowledge from contextually asserted atomic statements and data fragments (evidence).

Knowledge assembly

Example: "[…] A can associate with B […]" is assembled into <A binding B>.

Probabilistic knowledge assembly

Figure: evidence, extraction, assembly, (probabilistic) knowledge, probabilistic inference, learning, model updates, expert input.

In the Probabilistic Knowledge Assembly (PANDA) framework, evidence with all its contextual information is part of the knowledge base, which enables a continuous update-assembly loop.


• "A can associate with B": extraction accuracy = 0.7

• published in: "Molecular Cancer"

• <A binding B> is supported to degree 0.7

• evidence contradicts the model to degree 0.7

• <A binding B> is experimentally confirmed


Linked data resources

ontologies:

• biomedical (GO, BioPAX, MI)

• uncertainty (UNO)

• information/document/provenance description (IAO, PROV-O, VoID, Dublin Core)

(linked) open data via SPARQL endpoints and APIs:

• PubMed

• journal rankings (SCImago)

• bioinformatics databases (UniProt, ChEBI, HGNC)

unique identifiers:

• biochemical entities

• journals / articles
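As a rough illustration of resolving an identifier through a SPARQL endpoint, the sketch below builds a lookup request for the public UniProt endpoint. The query shape, the function names, and the choice of the up:mnemonic property are assumptions made for this example, not something taken from the slides; no network call is made.

```python
from urllib.parse import urlencode

# Public UniProt SPARQL endpoint. The query below is an illustrative assumption.
UNIPROT_ENDPOINT = "https://sparql.uniprot.org/sparql"


def build_lookup_query(mnemonic: str) -> str:
    """Return a SPARQL query that looks up a protein IRI by its UniProt mnemonic."""
    return (
        "PREFIX up: <http://purl.uniprot.org/core/>\n"
        "SELECT ?protein WHERE {\n"
        f'  ?protein a up:Protein ; up:mnemonic "{mnemonic}" .\n'
        "}"
    )


def build_request_url(endpoint: str, query: str) -> str:
    """Encode the query as a SPARQL-protocol GET URL (the `query` parameter).

    The result format would normally be requested with an Accept header,
    e.g. application/sparql-results+json, when the request is actually sent.
    """
    return endpoint + "?" + urlencode({"query": query})


query = build_lookup_query("GRB2_MOUSE")
print(build_request_url(UNIPROT_ENDPOINT, query))
```

The same pattern would apply to any other endpoint in the list; only the prefixes and the graph pattern change.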

Knowledge graph: data model

Figure: node types Event, Biochemical entity / Event, Molecular interaction, Statement, Article, Journal, Textual evidence, Truth value, Uncertainty level; relations represents, is extracted from, published in, has participant, type, has evidence, has truth value, has uncertainty (of type X).
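One way to picture the data model is as a handful of record types. The sketch below is an assumption-laden reading of the diagram: the class and field names are invented for illustration, and attaching the uncertainty values to the statement is a guess, since the flattened figure does not show which node carries them.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Article:
    article_id: str                  # e.g. a PubMed Central identifier
    journal: Optional[str] = None    # "published in", when known


@dataclass(frozen=True)
class Event:
    event_type: str                  # e.g. "Binding", a kind of molecular interaction
    participants: Tuple[str, ...]    # identifiers of biochemical entities


@dataclass
class Statement:
    represents: Event                # the event the statement is about
    extracted_from: Article          # provenance of the statement
    textual_evidence: str            # the sentence the statement came from
    truth_value: bool                # whether the text asserts or denies the event
    # "has uncertainty (of type X)": keyed by uncertainty type (assumed placement)
    uncertainty: Dict[str, float] = field(default_factory=dict)


s1 = Statement(
    represents=Event("Binding", ("GRB2_MOUSE", "GAB1_MOUSE")),
    extracted_from=Article("PMC123456"),
    textual_evidence="In addition, GRB2 can associate with GAB1",
    truth_value=True,
    uncertainty={"extraction": 0.8, "provenance": 0.7},
)
print(s1)
```

In an RDF setting each field would be a property and each record a resource; plain records are used here only to keep the example short.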

Knowledge graph: example

Source text: "[...] In addition, GRB2 can associate with GAB1 [...]"

Figure: statement_1 (type Statement) has textual evidence "In addition, GRB2 can associate with GAB1", extraction prob 0.8, and truth value True; it is extracted from PMC123456 (type Article). Provenance prob: 0.7.

Knowledge graph: example

Source text: "[...] In addition, GRB2 can associate with GAB1 [...]"

Figure: statement_1 represents the event "GRB2 binding GAB1" (type Binding, a subclass of Event), which has participant A GRB2_MOUSE and participant B GAB1_MOUSE (both of type Protein). statement_1 (type Statement) has textual evidence, extraction prob 0.8, and truth value True; it is extracted from PMC123456 (type Article). Provenance prob: 0.7.

Source text: "[...] In addition, GRB2 can associate with GAB1 [...]"

Figure: a second statement, statement_..99 (type Statement), represents the same event "GRB2 binding GAB1" with truth value False. Its textual evidence is "GRB2 does not interact directly with GAB1", extracted from PMC654321 (type Article), with extraction prob 0.6 and provenance prob 0.7. statement_1 (truth value True, extraction prob 0.8, provenance prob 0.7, extracted from PMC123456) still represents the event.
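The conflict in this example can be detected mechanically once the graph is held as triples. The sketch below is a minimal illustration, not the framework's implementation: the triple layout and predicate names are simplified assumptions, and "statement_N" is a placeholder for the second statement, whose identifier is elided in the figure.

```python
from collections import defaultdict

# (subject, predicate, object) triples loosely transcribed from the example.
# "statement_N" is a placeholder name, not an identifier from the source.
TRIPLES = [
    ("statement_1", "represents", "GRB2_binding_GAB1"),
    ("statement_1", "truth_value", True),
    ("statement_1", "extracted_from", "PMC123456"),
    ("statement_N", "represents", "GRB2_binding_GAB1"),
    ("statement_N", "truth_value", False),
    ("statement_N", "extracted_from", "PMC654321"),
]


def contested_events(triples):
    """Return events that have at least one True and one False statement."""
    event_of = {s: o for s, p, o in triples if p == "represents"}
    truth_values = defaultdict(set)
    for s, p, o in triples:
        if p == "truth_value" and s in event_of:
            truth_values[event_of[s]].add(o)
    return sorted(e for e, vals in truth_values.items() if vals == {True, False})


print(contested_events(TRIPLES))
```

Against a triple store the same check would be a SPARQL query over the represents and truth-value properties; the Python version only shows the shape of the question.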


So what can we really say about the truth of events?

Support aggregation

event = <A binding B>

Figure: positive support, negative support, and inconsistency of the event, on a scale from 0 to 1, for the statement sets {s1}, {s1, s2}, and {s1, s2, s3}.

Statement            Extraction accuracy  Provenance uncertainty
S1 = event is true   0.8                  0.7
S2 = event is false  0.8                  0.7
S3 = event is false  0.9                  0.6
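To make the idea of aggregating support concrete, here is a small sketch under loudly stated assumptions: each statement is weighted by extraction accuracy times the provenance value (treated here as a reliability), statements of the same polarity are combined with a noisy-OR, and inconsistency is taken as the product of positive and negative support. These rules are illustrative choices, not the formulas behind the slide's chart, and the numbers they produce are not claimed to match it.

```python
from math import prod

# (truth value, extraction accuracy, provenance value) per statement, from the table.
STATEMENTS = {
    "s1": (True, 0.8, 0.7),
    "s2": (False, 0.8, 0.7),
    "s3": (False, 0.9, 0.6),
}


def support(names, polarity):
    """Noisy-OR over the statements in `names` that assert the given polarity.

    Assumption: a statement counts with weight extraction * provenance,
    and statements are independent of each other.
    """
    weights = [
        acc * prov
        for truth, acc, prov in (STATEMENTS[n] for n in names)
        if truth is polarity
    ]
    return 1.0 - prod(1.0 - w for w in weights)


for subset in (["s1"], ["s1", "s2"], ["s1", "s2", "s3"]):
    pos = support(subset, True)
    neg = support(subset, False)
    # Assumed inconsistency measure: both polarities supported at once.
    print(subset, round(pos, 4), round(neg, 4), round(pos * neg, 4))
```

Under these assumptions positive support stays fixed once s1 is in, while negative support and inconsistency grow as s2 and s3 are added.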

Total uncertainty aggregation

Figure: nodes Doc_1, Doc_2, Doc..., Stat_1, Stat_2, Stat..., Positive support, Negative support, Event likelihood; labels Provenance uncertainty, Extraction accuracy, Textual uncertainty, Document part weight.

Probabilistic model (~Bayes net) over linked data, expressed via probabilistic logic programming (ProbLog).
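A probabilistic logic program of this kind assigns probabilities to independent facts and computes the probability of a query as the total weight of the possible worlds in which it holds. The sketch below imitates that semantics by brute-force enumeration in plain Python; it does not use ProbLog itself, and the fact names, probabilities, and rule are hypothetical. In ProbLog syntax the facts would look like 0.7::reliable(doc_1).

```python
from itertools import product


def query_probability(facts, query):
    """Sum the weight of every possible world in which `query(world)` holds.

    facts: dict mapping a fact name to the probability that it is true
           (facts are assumed independent).
    query: function from a world (dict name -> bool) to bool.
    """
    names = list(facts)
    total = 0.0
    for values in product([True, False], repeat=len(names)):
        world = dict(zip(names, values))
        weight = 1.0
        for name in names:
            weight *= facts[name] if world[name] else 1.0 - facts[name]
        if query(world):
            total += weight
    return total


# Hypothetical setup: two statements extracted from the same document.
FACTS = {"reliable_doc_1": 0.7, "correct_stat_1": 0.8, "correct_stat_2": 0.9}


def supported(world):
    # The event is supported if the document is reliable and at least one
    # of its statements was extracted correctly.
    return world["reliable_doc_1"] and (
        world["correct_stat_1"] or world["correct_stat_2"]
    )


print(round(query_probability(FACTS, supported), 4))
```

Because both statements share one document, their support is correlated; enumerating worlds accounts for that, whereas combining the two statements as if they came from independent sources would overstate the support. Enumeration is exponential in the number of facts, so it only serves to show the semantics.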

Expert input

Figure: nodes Extraction Accuracy, Provenance Uncertainty, Total Uncertainty, Experimental Confirmation.

Experimental Confirmation  T    F    -
                           0.9  0.1  0.5

Molecule  Interaction          Gene          Total Uncertainty Before Experiment  Experimental Confirmation  Total Uncertainty After Experiment
curcumin  negative regulation  BCL2_MOUSE    0.3941                               TRUE                       0.7489
curcumin  positive regulation  P53_HUMAN     0.3924                               FALSE                      0.1569
curcumin  negative regulation  Q9H014_HUMAN  0.3929                               -                          0.3929
...       ...                  ...           ...                                  ...                        ...

Big Mechanism technology

We need generic solutions for extracting Big Mechanisms and making them available to computational agents.

The Probabilistic Knowledge Assembly framework (semantics + probabilistic reasoning) offers:

• a powerful framework for scalable and flexible knowledge assembly tasks

• a uniform knowledge representation model and data access interface based on generic tools and technologies (particularly W3C standards)

• declarative formalisms that facilitate provenance tracking

• a continuous update-assembly loop for dynamic environments


Thank you!
