Early detection of cancer using NLP / Limor Lahiani

Upload: geektimecoil

Post on 14-Apr-2017

TRANSCRIPT

Page 1: Early detection of cancer using NLP / Limor Lahiani

Early Cancer Diagnosis using NLP to analyze biomedical literature

image: inside.miroculus.com

Page 2: Early detection of cancer using NLP / Limor Lahiani

@LimorLah
https://il.linkedin.com/in/limorl
catalystcode
http://limorl.com

Limor Lahiani
SDE Manager, DX Partner Catalyst, Microsoft

Page 3: Early detection of cancer using NLP / Limor Lahiani

Partner.

Generalize.

Share.

Page 4: Early detection of cancer using NLP / Limor Lahiani

image: miroculus.com/open

DISEASE DECODED

a simple blood test to detect disease at the molecular level

Page 5: Early detection of cancer using NLP / Limor Lahiani

image: miroculus.com/introducing-loom

Page 6: Early detection of cancer using NLP / Limor Lahiani

mir-1245a, mir-146a, mir-17, mir-210, mir-24-2

BRCA2

image credit: miroculus.com

Page 7: Early detection of cancer using NLP / Limor Lahiani

which microRNAs are related to which genes?

Page 8: Early detection of cancer using NLP / Limor Lahiani

miRNA, genes, diseases

Page 9: Early detection of cancer using NLP / Limor Lahiani
Page 10: Early detection of cancer using NLP / Limor Lahiani

Relation Extraction (corpus-to-graph)

corpus

scheduler, delta querying, doc processing, classifying relations, graph API

entity extraction, relation classifier

legend: domain-specific, generic

Page 11: Early detection of cancer using NLP / Limor Lahiani

microRNA-gene relation classifier

Page 12: Early detection of cancer using NLP / Limor Lahiani

Machine Learning 101

designing algorithms for inferring unknowns from knowns

supervised learning
Given known labeled data, find a function that maps inputs to labels
classification: spam detection, handwriting
regression: stock prediction, demand forecasting

unsupervised learning
Given unlabeled data, find patterns or explain key features in the data
clustering: similar profiles, genetic clustering
dimension reduction: matrix factorization for collaborative filtering
anomaly detection: fraud detection, fault detection

semi-supervised learning
active learning

Page 13: Early detection of cancer using NLP / Limor Lahiani

relation extraction classifier

given a sentence which contains a microRNA and a gene, determine whether the microRNA is related to the gene

Page 14: Early detection of cancer using NLP / Limor Lahiani

positive example

We report here the involvement of miR-146a and miR-146b-5p that bind to the same site in the 3'UTR of BRCA1 and down-regulate its expression as demonstrated using reporter assays. PubMed #21472990

BRCA1

mir-146a

mir-146b

Page 15: Early detection of cancer using NLP / Limor Lahiani

non-positive example

"The biological effects of miR-132 were assessed in CRC cell lines using the transwell assay" PubMed #24914372

Page 16: Early detection of cancer using NLP / Limor Lahiani

training data → feature extraction → ML model

break to sentences (TextBlob)

extract entities (GNAT)

positive + non-positive samples

distant supervision

Page 17: Early detection of cancer using NLP / Limor Lahiani

distant supervision / positive unlabeled learning

unstructured: sentences
Up-regulation of mirna-1245 targets BRCA2
mirna-342 regulates BRCA1 expression
…
We didn’t find correlation between mirna-200 and BRCA1
We tested for mirna-100 and BRCA1

structured: known relations db
mirna-1245 BRCA2
mirna-342 BRCA1
… …

distant supervision

Up-regulation of mirna-1245 targets BRCA2 POSITIVE
mirna-342 regulates BRCA1 expression POSITIVE
… …
We didn’t find correlation between mirna-200 and BRCA1 NON_POSITIVE
We assessed mirna-100 and BRCA1 NON_POSITIVE
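The labeling step on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the project's code: it assumes each sentence already comes with one extracted (miRNA, gene) pair, and the names `Candidate`, `label_candidates`, and `known_relations` are hypothetical. A pair found in the known-relations set yields POSITIVE; anything else is NON_POSITIVE, which means "not known to be related", not "known to be unrelated".

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    """A sentence plus one (miRNA, gene) pair extracted from it (hypothetical type)."""
    sentence: str
    mirna: str
    gene: str


def label_candidates(candidates, known_relations):
    """Label a candidate POSITIVE when its pair appears in the known-relations set."""
    labeled = []
    for c in candidates:
        pair = (c.mirna.lower(), c.gene.lower())
        label = "POSITIVE" if pair in known_relations else "NON_POSITIVE"
        labeled.append((c.sentence, label))
    return labeled


# Stand-in for the structured "known relations db" from the slide.
known_relations = {("mirna-1245", "brca2"), ("mirna-342", "brca1")}

candidates = [
    Candidate("Up-regulation of mirna-1245 targets BRCA2", "mirna-1245", "BRCA2"),
    Candidate("mirna-342 regulates BRCA1 expression", "mirna-342", "BRCA1"),
    Candidate("We didn't find correlation between mirna-200 and BRCA1", "mirna-200", "BRCA1"),
    Candidate("We assessed mirna-100 and BRCA1", "mirna-100", "BRCA1"),
]

labeled = label_candidates(candidates, known_relations)
for sentence, label in labeled:
    print(label, "|", sentence)
```

Because the labels come from a database lookup and not from reading the sentence, some POSITIVE sentences will not actually state the relation; that noise is the usual trade-off of distant supervision.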

Page 18: Early detection of cancer using NLP / Limor Lahiani

training data → feature extraction → ML model

break to sentences (TextBlob)

extract entities (GNAT)

positive + non-positive samples

distant supervision

entity replacement

tokenizing (nltk)

mirna-335 was found to regulate BRCA1

ENTITYM was found to regulate ENTITYG

high levels of expression of miRNA-335 and miRNA-342 were found together with low levels of BRCA1

high levels of expression of ENTITYM and OTHER_ENTITY were found together with low levels of ENTITYG

high levels of expression of OTHER_ENTITY and ENTITYM were found together with low levels of ENTITYG
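Entity replacement as shown on this slide could look roughly like the sketch below. It assumes the entity mentions are already known (the slides use GNAT for that) and uses plain whitespace splitting as a stand-in for the nltk tokenizer; the function name `replace_entities` is hypothetical. One masked sample is produced per (miRNA, gene) pair in the sentence.

```python
from itertools import product


def replace_entities(tokens, mirnas, genes):
    """Return one masked sentence per (miRNA, gene) pair found in the sentence.

    The pair under consideration becomes ENTITYM / ENTITYG; every other
    known entity mention becomes OTHER_ENTITY.
    """
    samples = []
    for mirna, gene in product(mirnas, genes):
        masked = []
        for tok in tokens:
            if tok == mirna:
                masked.append("ENTITYM")
            elif tok == gene:
                masked.append("ENTITYG")
            elif tok in mirnas or tok in genes:
                masked.append("OTHER_ENTITY")
            else:
                masked.append(tok)
        samples.append(" ".join(masked))
    return samples


sentence = ("high levels of expression of miRNA-335 and miRNA-342 "
            "were found together with low levels of BRCA1")
samples = replace_entities(sentence.split(), ["miRNA-335", "miRNA-342"], ["BRCA1"])
for s in samples:
    print(s)
```

Masking the entity names keeps the classifier from memorizing specific miRNA or gene identifiers, so it has to learn from the wording around them.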

Page 19: Early detection of cancer using NLP / Limor Lahiani

training data → feature extraction → ML model

break to sentences (TextBlob)

extract entities (GNAT)

positive + non-positive samples

distant supervision

entity replacement

tokenizing (nltk)

trimming

mirna-335 was found to regulate BRCA1

ENTITY1 was found to regulate ENTITY2

We report here the involvement of ENTITYM that bind to the same site in the ENTITYG and down-regulate its expression as demonstrated using reporter assays.

We report here the involvement of ENTITYM that bind to the same site in the ENTITYG and down-regulate its expression as demonstrated using reporter assays.

cleaning, stemming, & normalizing

Page 20: Early detection of cancer using NLP / Limor Lahiani

training data → feature extraction → ML model

break to sentences (TextBlob)

extract entities (GNAT)

positive + non-positive samples

distant supervision

entity replacement

tokenizing (nltk)

trimming

cleaning, stemming, & normalizing

bag-of-words (scikit-learn): 1-gram, 2-gram, 3-gram

syntactic-based (spacy.io): part-of-speech tagging, dependency parse tree

word embedding (doc2vec, gensim): word vector representation

king – man + woman = queen
paris – france + spain = madrid

ENTITYM was found to regulate ENTITYG

[1, 1, 1, 1, 1]

Page 21: Early detection of cancer using NLP / Limor Lahiani

training data → feature extraction → ML model

break to sentences (TextBlob)

extract entities (GNAT)

positive + non-positive samples

distant supervision

entity replacement

tokenizing (nltk)

trimming

cleaning, stemming, & normalizing

bag-of-words (scikit-learn): 1-gram, 2-gram, 3-gram

syntactic-based (spacy.io): part-of-speech tagging, dependency parse tree

word embedding (doc2vec, gensim): word vector representation

split to 75% training, 25% evaluation

F1 score evaluation for all feature combinations

$$\text{precision} = \frac{\text{truePositives}}{\text{predictedPositives}}$$

$$\text{recall} = \frac{\text{truePositives}}{\text{allPositives}}$$
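The evaluation setup on this slide can be sketched in plain Python. The snippet below is an illustration under assumptions: the function names, the fixed random seed, and the toy labels are hypothetical, and the F1 score is computed with its standard definition as the harmonic mean of precision and recall, 2 · precision · recall / (precision + recall).

```python
import random


def split_train_eval(samples, train_fraction=0.75, seed=0):
    """Shuffle the samples and split them into training and evaluation sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def precision_recall_f1(y_true, y_pred, positive="POSITIVE"):
    """Compute precision, recall and F1 for the positive class."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    predicted_pos = sum(1 for p in y_pred if p == positive)
    all_pos = sum(1 for t in y_true if t == positive)
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / all_pos if all_pos else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


train, evaluation = split_train_eval(range(100))

# Toy labels, made up for the example.
y_true = ["POSITIVE", "POSITIVE", "NON_POSITIVE", "NON_POSITIVE"]
y_pred = ["POSITIVE", "NON_POSITIVE", "POSITIVE", "NON_POSITIVE"]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(len(train), len(evaluation), precision, recall, f1)
```

Running the same split and metric for every feature combination gives directly comparable scores, which is how a results table of F1 per feature set can be produced.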

Page 22: Early detection of cancer using NLP / Limor Lahiani

features                              f1-score
BOW 1-3 grams                         0.87
BOW 1-3 grams + POS Tags 3-gram       0.87
BOW 1-3 grams + Doc2Vec               0.87
BOW 1-gram                            0.8
BOW 2-gram                            0.85
BOW 3-gram                            0.83
Doc2Vec                               0.65
POS Tags 3-gram                       0.62

final results

build on others: research existing academic work

try a simple approach first, before Deep Learning

Page 23: Early detection of cancer using NLP / Limor Lahiani

sharing is caring

CatalystCode/corpus-to-graph-pipeline
CatalystCode/corpus-to-graph-ml
https://aka.ms/dxdevblog

Page 24: Early detection of cancer using NLP / Limor Lahiani

from information to intelligence

image: Social_Network_Analysis_Visualization

Page 25: Early detection of cancer using NLP / Limor Lahiani

questions?

Page 26: Early detection of cancer using NLP / Limor Lahiani

thanks ;)