Using text mining to inform genetic variant interpretation

Karin Verspoor, Department of Computing and Information [email protected]

So you’re a medical doctor …

• With a very sick patient
• You can’t work out what’s going on
• You suspect a rare disease
• You order a DNA analysis (whole exome or genome)
• And find a genetic mutation

What does it mean?

Clinical interpretation of variants

[Figure: Sample Data Flow. A pipeline diagram spanning three areas (Wet Lab, Bioinformatics, Clinical Informatics), with each step marked as manual or automatic. Labelled steps: Patient Sample, Histology, DNA Extract, PCR, Sequencing, Alignment, Variant Calling, Annotation, DB Load, Filtering, Known Variant? (Yes/No), Curation, Variant Normalisation, Document Assembly, Report Editing and Signoff, Publish Report, Patient Clinical Report. Data stores: Peter Mac Mutation DB; External and Locus Specific DBs.]

Image courtesy Kenneth Doig, Peter Mac. “PipeCleaner for your NGS Pipeline” HISA Big Data 2013.

What’s a mutation?

• Genomic variation: alteration in a sequence
  – hereditary (germ-line) mutations
  – acquired (somatic) mutations

• Examples of variation
  – SNP (single nucleotide polymorphism)
  – Protein mutation
  – insertions, deletions, duplications, inversions, ...

• Types of variations
  – DNA variations that have no adverse effects on our cells and occur frequently in the population are called polymorphisms
  – DNA variations that do affect the function of the protein made from a gene and occur less often are called mutations

The Challenge: Interpreting variants

§ Identifying variation is becoming easier; interpreting it remains difficult

• Which changes are due to normal individual variation?

• Which are associated with a phenotype of interest?

Interpreting variation through context

• Analysis of functional significance of variants
  – Predicted impact of mutations
  – Conservation analysis
  – Allele frequencies from large genomic databases

• Existing knowledge captured in structured sources
  – UniProt site-specific protein annotations
  – The Cancer Genome Atlas genomic characterisation data
  – Disease-specific variant databases, e.g. COSMIC and InSiGHT

• Techniques for annotating variants
  – Data aggregation from multiple sources
  – Data integration and inference to reveal shared pathways

Why biomedical text mining?

Exponential knowledge growth

• ~1550 peer-reviewed gene-related databases in NAR online Mol Bio collection

• Over 25 million PubMed entries (>2,000/day)

• Breakdown of disciplinary boundaries makes more of it relevant to each of us

[Figure: Exponential growth in size of PubMed. Publications per year (y-axis, 0 to 1,200,000) against year (x-axis, 1914 to 2014).]

Structured resources are not enough: Literature is the primary repository of knowledge

[Figure: Number of Swiss-Prot proteins (y-axis, 0 to 120,000) over time (x-axis, 1/02 to 1/07). Two series: proteins missing a FUNCTION comment and proteins gaining a FUNCTION comment.]

“Manual curation is not sufficient for annotation of genomic databases” Baumgartner et al., ISMB 2007

“Our entire understanding of biology and medicine is really contained in the published literature. And since people write in natural language, if you can’t get computers to turn that information into databases and computable information, you’re falling behind.”
-- Russ Altman, MD PhD, Stanford University

Recovery of variants from the literature using text mining

Study:

Jimeno Yepes A, Verspoor K. (2014) Literature mining of genetic variants for curation: Quantifying the importance of supplementary material. Database: The Journal of Biological Databases and Curation, bau003. doi:10.1093/database/bau003 [PMID:24520105]

Study: Recall of curated variants through the application of text mining

• Given a curated resource of genetic variants,
• with explicit links to the source literature for each variant,
• and a mutation extraction tool with demonstrated good performance on intrinsic evaluation

… how many variants can text mining recover?

InSiGHT
Gene:
Variant: p.Lys286Gln
Lit. Reference: Takahashi et al. 2007

Motivations

• Assess real-world applicability of text mining tools for supporting analysis of genetic variants

• Speed up curation of mutation databases

Two databases

• InSiGHT, Human Variome Project
  – MLH1, MSH2, MSH6 and PMS2 linked to Lynch syndrome (germline mutations)

• COSMIC, Sanger Institute
  – Somatic mutations linked to cancer

Database   PMIDs associated to mutations   Total mutation count   Average mutations per article   Std Dev
InSiGHT    809                             7022                   8.68                            18.55
COSMIC     7898                            198864                 25.18                           521.18

Literature mutation extraction

• Many tools exist to perform mutation annotation
  – MutationMiner, MutationFinder, EMU, tmVar, SETH, ...

• Research shows that they have high precision and recall on MEDLINE abstracts (> 90% F1)

• There are also tools to do named entity extraction of genes, diseases, body parts …

Jimeno Yepes A, Verspoor K. (2014) Mutation extraction tools can be combined for robust recognition of genetic variants in the literature. F1000Research 2014, 3:18. doi: 10.12688/f1000research.3-18.v2 [PMID:25285203]

How to extract mutations from text?

• Essentially a named entity recognition task.
• Early attempts focused on SNPs and protein mutations (amino acid residues).
• e.g., MutationFinder [1] patterns (simplified):

(?P<wt_res>AminoAcid)(?P<pos>[1-9][0-9]*)(?P<mut_res>AminoAcid)

Examples: Gly17Ser, Ser97Pro

• where AminoAcid is:
(CYS|ILE|SER|GLN|MET|ASN|PRO|LYS|ASP|THR|PHE|ALA|GLY|HIS|LEU|ARG|TRP|VAL|GLU|TYR)|(GLUTAMINE|GLUTAMIC ACID|LEUCINE|VALINE|ISOLEUCINE|LYSINE|ALANINE|GLYCINE|ASPARTATE|METHIONINE|THREONINE|HISTIDINE|ASPARTIC ACID|ARGININE|ASPARAGINE|TRYPTOPHAN|PROLINE|PHENYLALANINE|CYSTEINE|SERINE|GLUTAMATE|TYROSINE)

[1] http://mutationfinder.sourceforge.net/
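
The simplified pattern above can be tried out directly in Python. The sketch below is only an illustration of that one pattern, not MutationFinder itself: the word boundaries, the case-insensitive flag and the function name `find_mutations` are assumptions added for the demo.

```python
import re

# Three-letter codes and full names, copied from the AminoAcid alternation.
THREE_LETTER = (
    "CYS|ILE|SER|GLN|MET|ASN|PRO|LYS|ASP|THR|PHE|ALA|GLY|HIS|LEU|ARG|"
    "TRP|VAL|GLU|TYR"
)
FULL_NAME = (
    "GLUTAMINE|GLUTAMIC ACID|LEUCINE|VALINE|ISOLEUCINE|LYSINE|ALANINE|"
    "GLYCINE|ASPARTATE|METHIONINE|THREONINE|HISTIDINE|ASPARTIC ACID|"
    "ARGININE|ASPARAGINE|TRYPTOPHAN|PROLINE|PHENYLALANINE|CYSTEINE|SERINE|"
    "GLUTAMATE|TYROSINE"
)
AMINO_ACID = f"(?:{THREE_LETTER}|{FULL_NAME})"

# Assumption: \b anchors and IGNORECASE are additions for this demo; they are
# not part of the simplified pattern shown on the slide.
MUTATION = re.compile(
    rf"\b(?P<wt_res>{AMINO_ACID})(?P<pos>[1-9][0-9]*)(?P<mut_res>{AMINO_ACID})\b",
    re.IGNORECASE,
)


def find_mutations(text):
    """Return (wild-type residue, position, mutant residue) triples."""
    return [
        (m["wt_res"], int(m["pos"]), m["mut_res"])
        for m in MUTATION.finditer(text)
    ]


print(find_mutations("The Gly17Ser and Ser97Pro variants were reported."))
```

A single pattern like this only catches compact mentions; the other mention styles discussed in this deck need additional patterns.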

Human Genome Variation Society nomenclature (excerpt)

• Pattern-based approach to identifying genetic variants
  – dbSNP identifiers and standard HGVS nomenclature (e.g. SETH https://rockt.github.io/SETH)
  – natural language expressions of mutations
    o This missense mutation converts a highly conserved glycine (Gly17 of neurophysin) to a valine residue.
    o Killer of prune (Kpn) is a mutation in the awd gene which substitutes Ser for Pro at position 97 and causes dominant lethality in individuals that do not have a functional prune gene.
    o … where cysteines at positions 6, 42, 48, 90 and 393 were replaced by serine.
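
To make the contrast between standard nomenclature and natural language concrete, here is a minimal sketch with two deliberately narrow patterns, one for dbSNP rs identifiers and one for HGVS-style protein substitutions. Both patterns, the function name `find_standard_mentions` and the identifier `rs123456` are assumptions made up for illustration; SETH implements a far fuller grammar.

```python
import re

# Assumption: narrow demo patterns, not the grammar used by SETH.
DBSNP = re.compile(r"\brs[0-9]+\b")
HGVS_PROTEIN_SUB = re.compile(
    r"\bp\.[A-Z][a-z]{2}[1-9][0-9]*[A-Z][a-z]{2}\b"
)


def find_standard_mentions(text):
    """Return dbSNP identifiers and HGVS protein substitutions found in text."""
    return {
        "dbsnp": DBSNP.findall(text),
        "hgvs_protein": HGVS_PROTEIN_SUB.findall(text),
    }


# Standard nomenclature is picked up (rs123456 is a made-up identifier).
print(find_standard_mentions("Carriers of rs123456 and p.Lys618Ala were compared."))

# The natural language example from the slide yields nothing with these patterns.
print(find_standard_mentions(
    "This missense mutation converts a highly conserved glycine "
    "(Gly17 of neurophysin) to a valine residue."
))
```

The second call returns empty lists, which is the gap that natural language patterns are meant to close.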

Extraction of mutations from text

Extractor of Mutations (Kann Lab)

Study text sources

• PubMed
  – 22M citations; title and abstract

• PubMed Central
  – full text
  – 512k available from PMC-Open Access

• Publisher site crawling
  – Availability depends on license
  – HTML pages can be noisy
    o C676T–>Arg226Stop vs. C676TâArg226Stop
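
One plausible way such noise arises is an encoding mismatch. The sketch below assumes, purely for illustration, that the page used an arrow character (U+2192) stored as UTF-8 and that the crawler decoded the bytes as Latin-1; the actual pages may have been damaged differently.

```python
# Assumption: UTF-8 bytes decoded with a single-byte encoding (Latin-1 here).
original = "C676T\u2192Arg226Stop"            # arrow written as U+2192
garbled = original.encode("utf-8").decode("latin-1")

# The arrow's first byte (0xE2) now shows up as "â"; the two bytes after it
# become invisible control characters.
print(repr(garbled))

# When the wrong decoding is known, the damage can be undone by reversing it.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired == original)
```

A mutation pattern applied to the garbled string no longer sees a clean separator between the two mentions, so cleaning up encodings before extraction is worth the effort.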

Extraction with EMU over our data

• EMU: extracts mutations from text and links them to co-occurring genes

• Normalize all mutation mentions to HGVS format
  – Format used in COSMIC and InSiGHT

• Match {gene, HGVS variant, PMID} to curated data
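
The matching step can be sketched as a set comparison over {gene, variant, PMID} triples. The records below are hypothetical, and the function names `recall` and `recall_common` are assumptions for illustration; the study's own matching code is not shown in these slides.

```python
def _key(triple, use_gene):
    gene, variant, pmid = triple
    return (gene, variant, pmid) if use_gene else (variant, pmid)


def recall(extracted, curated, use_gene=True):
    """Fraction of curated triples recovered by extraction."""
    curated_keys = {_key(t, use_gene) for t in curated}
    extracted_keys = {_key(t, use_gene) for t in extracted}
    return len(curated_keys & extracted_keys) / len(curated_keys)


def recall_common(extracted, curated, use_gene=True):
    """Recall restricted to articles with at least one extracted mutation."""
    pmids = {pmid for _, _, pmid in extracted}
    common = [t for t in curated if t[2] in pmids]
    return recall(extracted, common, use_gene)


# Hypothetical curated and extracted records (PMIDs are made up).
curated = [
    ("MLH1", "p.Lys618Ala", 111),
    ("MSH2", "c.482_483del", 111),
    ("MSH6", "p.Gly17Ser", 222),
    ("PMS2", "p.Ser97Pro", 333),
]
extracted = [
    ("MLH1", "p.Lys618Ala", 111),   # exact match
    ("MLH1", "p.Gly17Ser", 222),    # right variant and article, wrong gene
]

print(recall(extracted, curated))                  # gene must match
print(recall(extracted, curated, use_gene=False))  # gene ignored
print(recall_common(extracted, curated))           # only articles with extractions
```

Ignoring the gene recovers the second record, and restricting to articles with at least one extraction shrinks the denominator.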

Results: Abstracts and Full Text

NG = No Gene (ignoring gene in match)

Common/Cmn = PMIDs in common between database and corpus subset (recall with respect to articles for which the mutation entity recogniser had at least one positive extraction)

Set                Cmn art   Match mutation   Recall   Recall NG   Mutations common   Recall common   Recall Cmn NG
COSMIC Abs         2200      1884             0.0095   0.0122      12,940             0.1456          0.1875
COSMIC FT          2071      3656             0.0184   0.0215      104,756            0.0349          0.0408
COSMIC Abs + FT    3738      4754             0.0239   0.0289      114,279            0.0416          0.0503
InSiGHT Abs        195       230              0.0328   0.045       1233               0.1865          0.2562
InSiGHT FT         150       404              0.0575   0.0612      1626               0.2484          0.2644
InSiGHT Abs + FT   295       588              0.0837   0.0961      2657               0.2213          0.254
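
The recall columns appear consistent with simple ratios: matched mutations divided by the database's total mutation count for "Recall", and divided by the mutations in common articles for "Recall common". The short check below reproduces two rows under that reading; the helper name `recalls` is an assumption, and the figures are copied from the tables in this deck.

```python
# Total mutation counts per database, from the "Two databases" table.
TOTAL_MUTATIONS = {"COSMIC": 198864, "InSiGHT": 7022}


def recalls(database, matched, mutations_common):
    """Return (overall recall, recall within common articles), 4 decimals."""
    overall = matched / TOTAL_MUTATIONS[database]
    within_common = matched / mutations_common
    return round(overall, 4), round(within_common, 4)


print(recalls("COSMIC", 1884, 12940))   # COSMIC Abs row
print(recalls("InSiGHT", 230, 1233))    # InSiGHT Abs row
```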

High Throughput vs non-High Throughput

Set             Cmn art   Match mutation   Recall   Recall NG   Recall common   Recall Cmn NG
HT abstract     1650      1357             0.0072   0.0096      0.1209          0.1608
HT full text    1545      2719             0.0145   0.0172      0.027           0.0319
HT Abs + FT     2608      3501             0.0187   0.0231      0.032           0.0395
NHT abstract    550       530              0.0461   0.0543      0.3055          0.3597
NHT full text   526       937              0.0815   0.0915      0.235           0.2639
NHT Abs + FT    841       1259             0.109    0.1243      0.2538          0.2895

Group        PMIDs   Count     Average mutation   SD       Mutation recall
COSMIC       7898    198,864   25.18              521.27   100.00%
COSMIC-HT    6266    187,367   29.9               584.82   94.22%
COSMIC-NHT   1632    11,497    7.04               38.05    5.78%

Considering tables and Supplementary material

• Subset from COSMIC and InSiGHT available as PubMed Central Open Access articles

• Supplementary material: MS Word, PDF, MS Excel, PPT, images, …

                     InSiGHT                           COSMIC
Set                  Articles   Matched   Recall (%)   Articles   Matched   Recall (%)
Abstracts            13         1         0.4          563        140       0.41
XML Full Text (FT)   9          20        7.94         487        694       2.05
PDF FT (PDFFT)       4          7         2.78         76         23        0.07
Tables               8          18        7.14         394        466       1.38
FT+PDFFT+Tables      13         44        17.46        563        929       2.75
Supp. Mat.           1          88        34.92        138        17015     50.59
All                  13         115       45.63        563        17896     52.92

Recall still only 50%: Where are the rest?

• Expressed in semi-structured data sources
  – do not necessarily follow standard nomenclature predictably
  – data spread unpredictably across columns (Wong et al. 2009)

• Different reference position in text than database
  – curator correction or normalized to different build

• Nomenclature variation
  – c.482_483delGA vs c.482_483del2

• Linguistic expression of mutations
  – deletion of exon 3
  – C>T mutation at nucleotide 2131
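
The nomenclature variation case lends itself to a small normalisation step before matching. The sketch below handles only one situation, coding-sequence deletions written with either the deleted bases or their count, and reduces both to the bare `del` form; the pattern and the function name `normalise_deletion` are assumptions for illustration, not the normalisation used in the study.

```python
import re

# Assumption: covers only c.<start>[_<end>]del followed by bases or a count.
DELETION = re.compile(r"^(c\.[0-9]+(?:_[0-9]+)?del)(?:[ACGT]+|[0-9]+)?$")


def normalise_deletion(variant):
    """Strip trailing bases or counts so equivalent deletions compare equal."""
    m = DELETION.match(variant)
    return m.group(1) if m else variant


a = normalise_deletion("c.482_483delGA")
b = normalise_deletion("c.482_483del2")
print(a, b, a == b)
```

Variants that do not fit the pattern are returned unchanged, so the function can be applied to a mixed list without side effects.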

Information in tables (spreadsheets, etc.) is expressed differently than in narrative text

Gene listed in column heading

Non-standard nomenclature: “Del exon 7”
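
A minimal sketch of why tables need different handling: when the gene sits in the column heading, each cell has to be paired with its header before any matching can happen. The table contents, the `Patient` column and the function name `variants_from_table` are hypothetical.

```python
import csv
import io

# Hypothetical supplementary table with genes as column headings.
raw = """Patient,MLH1,MSH2
P1,c.482_483delGA,
P2,,Del exon 7
"""


def variants_from_table(text, id_column="Patient"):
    """Pair each non-empty cell with the gene named in its column heading."""
    found = []
    for row in csv.DictReader(io.StringIO(text)):
        for column, cell in row.items():
            if column != id_column and cell and cell.strip():
                found.append((column, cell.strip()))
    return found


print(variants_from_table(raw))
```

The second pair carries the non-standard mention "Del exon 7", which would still need mapping to standard nomenclature.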

Text mining over semi-structured data?

• Access ?
• Variability (!)
  – File formats
  – How connected to the main text?
• Semantics (?!)
  – How to make sense of the data?
  – How to map to standardized nomenclature?

… processing supplementary material will require new strategies. Some technical solutions. Some research.

Extraction of gene-disease-mutation relations

Study:

Verspoor KM, Heo GH, Kang KY, Song M. (in press) Extraction of fine-grained semantic relations for the Human Variome. BMC Medical Informatics and Decision Making.

Variant interpretation using literature

• Evidence of prior significance of variants
• Evidence of established connection of the variant to specific patient cohorts
• Use alone or in combination with other evidence

• We aim to extract the relations that connect genes, diseases and mutations

• Specific objective of the work: relation extraction over the Variome Corpus

gene-mutation-disease-phenotype relations

• Variome Annotation Schema
  – A schema defining entities and relations of interest to curation of genetic variants
• Variome Corpus
  – A corpus of full text articles annotated according to the Variome Annotation Schema
  – To be used as training and evaluation data for text mining tools for extracting genetic variation information from the published literature

http://www.opennicta.com.au/home/health/variome

The Variome Corpus

10 full-text publications related to colorectal cancer

Entities:
• Gene
• Mutation
• Disease
• Body part
• Cohort/Patient
• Size
• Age
• Gender
• Ethnicity or Geo Location
• Characteristic

Relations:
• Gene-has-Mutation
• Cohort/Patient-has-Mutation
• Mutation-relatedto-Disease
• Disease-relatedto-Gene
• Disease-relatedto-BodyPart
• Mutation-has-Size
• Cohort/Patient-has-Age
• Cohort/Patient-has-Gender
• Cohort/Patient-has-EthnicityLoc
• Cohort/Patient-has-Disease
• Cohort-has-size

Verspoor K, Jimeno Yepes A, Cavedon L, McIntosh T, Herten-Crabb A, Thomas Z, Plazzer JP. (2013) Annotating the Biomedical Literature for the Human Variome. Database: The Journal of Biological Databases and Curation, bat019.

§ 43k words
§ Double-annotated
§ IAA varies
§ .88-.92 F for entities
§ Relations much lower; reconciled manually

The Variome Corpus annotation

• Recognise genetic variants

• Named entity recognition for gene names
  – Supervised learning for recognizing characteristics and contexts
  – Combined with dictionaries to support normalisation

• Associating variants to genes
  – Simple co-occurrence
  – Combined with sequence verification
  – Machine learning for relation classification (PKDE4J)
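
Simple co-occurrence, the first of the association strategies above, can be sketched in a few lines: every gene is paired with every variant mentioned in the same sentence. The sentence splitter, the gene list, the variant pattern and the function name `cooccurrence_pairs` are all assumptions for illustration.

```python
import re
from itertools import product

# Assumption: a narrow HGVS-style protein substitution pattern for the demo.
VARIANT = re.compile(r"p\.[A-Z][a-z]{2}[0-9]+[A-Z][a-z]{2}")


def cooccurrence_pairs(text, genes, variant_pattern=VARIANT):
    """Pair every gene with every variant mentioned in the same sentence."""
    pairs = []
    # Assumption: naive splitting on ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        found_genes = [
            g for g in genes if re.search(rf"\b{re.escape(g)}\b", sentence)
        ]
        found_variants = variant_pattern.findall(sentence)
        pairs.extend(product(found_genes, found_variants))
    return pairs


text = "The MLH1 variant p.Lys618Ala was found in 17 patients. MSH2 was unaffected."
print(cooccurrence_pairs(text, ["MLH1", "MSH2"]))
```

Co-occurrence over-generates as soon as a sentence mentions several genes, which is one motivation for the sequence verification and relation classification steps listed alongside it.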

Extraction of mutation relations from text

Information Extraction, Structuring text

From:
A subset of colorectal tumour DNA samples from 17 patients carrying the p.Lys618Ala variant …

To:
T60 body-part 1307 1317 colorectal
T7 disease 1318 1324 tumour
R17_m relatedTo Arg1:T60 Arg2:T7    (colorectal relatedTo tumour)
T61_merge size 1342 1344 17
T24 cohort-patient 1345 1353 patients
R46_2 has Arg1:T24 Arg2:T7    (patients has tumour)
T62 mutation 1367 1378 p.Lys618Ala
R18_m has Arg1:T24 Arg2:T61_merge    (patients has 17) = (patient group size 17)
R19_m has Arg1:T24 Arg2:T62    (patients has p.Lys618Ala)
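
The structured output above resembles a standoff annotation format: T lines hold entities with character offsets, and R lines hold relations between entity identifiers. The sketch below parses such lines and rebuilds the readable triples; it assumes whitespace-separated fields as they appear on the slide, and the function names are hypothetical.

```python
def parse_standoff(lines):
    """Split standoff-style lines into entities (T...) and relations (R...)."""
    entities, relations = {}, []
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        ident = fields[0]
        if ident.startswith("T"):
            # T<id> <type> <start> <end> <covered text...>
            entities[ident] = {
                "type": fields[1],
                "start": int(fields[2]),
                "end": int(fields[3]),
                "text": " ".join(fields[4:]),
            }
        elif ident.startswith("R"):
            # R<id> <relation> Arg1:<id> Arg2:<id>
            args = dict(f.split(":", 1) for f in fields[2:4])
            relations.append((fields[1], args["Arg1"], args["Arg2"]))
    return entities, relations


def readable(entities, relations):
    """Replace entity identifiers with the text they cover."""
    return [(entities[a]["text"], rel, entities[b]["text"]) for rel, a, b in relations]


lines = [
    "T60 body-part 1307 1317 colorectal",
    "T7 disease 1318 1324 tumour",
    "R17_m relatedTo Arg1:T60 Arg2:T7",
    "T61_merge size 1342 1344 17",
    "T24 cohort-patient 1345 1353 patients",
    "R46_2 has Arg1:T24 Arg2:T7",
    "T62 mutation 1367 1378 p.Lys618Ala",
    "R18_m has Arg1:T24 Arg2:T61_merge",
    "R19_m has Arg1:T24 Arg2:T62",
]
entities, relations = parse_standoff(lines)
print(readable(entities, relations))
```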

PKDE4J: Yonsei University IE system

• PKDE4J
  – Extensible, flexible text mining system for public knowledge discovery
  – Entity and relation extraction from unstructured text data
  – Extension of Stanford CoreNLP (Manning et al., 2014)
  – http://informatics.yonsei.ac.kr/pkde4j

• Differentiation of PKDE4J
  – Configurable system
    o Dictionary based entity extraction
    o Extensible system
    o Wide range of relation extraction tasks, supported by an extensible rule engine based on dependency parsing
  – Accurate performance
    o PKDE4J outperforms many other competing algorithms for both entity and relation extraction

PKDE4J: Yonsei University IE system

• PKDE4J’s two major pipelines
  – Entity Extraction: target entities based on dictionaries, by extending Stanford CoreNLP
  – Relation Extraction: relationships among entities, based on dependency tree based rules

PKDE4J – Named Entity Recognition

• Extension of Stanford CoreNLP
• Three major submodules: Pre-Processing, Dictionary loading, Entity annotation

• Flexible configuration (number and format of dictionaries)
• Trie data structure
• Abbreviation resolution
• Tokenization: Stanford PTBTokenizer
• Sentence splitting, POS tagging, Lemmatization: Stanford CoreNLP
• String normalization: Special characters processing
• N-gram matching: Apache Lucene ShingleWrapper
• Approximate string matching: Soft-TFIDF
• Regex NER (Rule-based): Stanford CoreNLP
• Candidate entities filtering: POS filtering, Stopword removal
• Labeling: B/I/O format, Entity type
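
Two of the ingredients listed above, a trie for dictionary lookup and B/I/O labelling, can be illustrated together. This is a generic sketch of the technique with greedy longest matching over lower-cased tokens, not PKDE4J's implementation; the dictionary entries and function names are hypothetical.

```python
def build_trie(dictionary):
    """Build a token-level trie; dictionary maps phrase -> entity type."""
    root = {}
    for phrase, label in dictionary.items():
        node = root
        for token in phrase.lower().split():
            node = node.setdefault(token, {})
        node["$label"] = label          # marks the end of a complete phrase
    return root


def tag_bio(tokens, trie):
    """Greedy longest-match lookup, emitting B-/I-/O labels per token."""
    labels = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        node, end, label = trie, None, None
        for j in range(i, len(tokens)):
            node = node.get(tokens[j].lower())
            if node is None:
                break
            if "$label" in node:
                end, label = j, node["$label"]   # remember longest match so far
        if end is None:
            i += 1
            continue
        labels[i] = f"B-{label}"
        for k in range(i + 1, end + 1):
            labels[k] = f"I-{label}"
        i = end + 1
    return labels


trie = build_trie({
    "MLH1": "gene",
    "Lynch syndrome": "disease",
    "colorectal cancer": "disease",
})
tokens = "MLH1 mutations cause Lynch syndrome".split()
print(list(zip(tokens, tag_bio(tokens, trie))))
```

A trie keeps lookup cost proportional to the length of the phrase being matched rather than to the size of the dictionary, which matters once dictionaries hold many thousands of names.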

PKDE4J – Relation Extraction

• Based on dependency parse (grammatical structure) based rules

• To extract a relation

Step 1: Identify the verbs in a sentence

Category   Number of Verbs   Type         Verb Example
Positive   68                Increase     Lead, Contribute, Rise
                             Transmit     Shift, Move, Migrate
                             Substitute   Supplement, Alter
Negative   54                Decrease     Decline, Diffuse, Down-regulate
                             Remove       Deplete, Abrogate, Disassociate
Neutral    111               Contain      Possess, Constitute, Include
                             Modify       Methylate, Modulate, Normalize
                             Method       Bleach, Centrifuge, Spin
                             Report       Evaluate, Analyze, Examine
Plain      165               Plain        Return, Switch, Balance
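
Step 1 amounts to a lookup from verb to category and type. The sketch below encodes only the example verbs shown in the table (the full lists of 68, 54, 111 and 165 verbs are not given in these slides) and assumes the tokens are already lemmatised; the names `VERB_TYPES` and `classify_verbs` are hypothetical.

```python
# Example verbs from the table only; the complete verb lists are not shown.
VERB_TYPES = {
    "Positive": {
        "Increase": ["lead", "contribute", "rise"],
        "Transmit": ["shift", "move", "migrate"],
        "Substitute": ["supplement", "alter"],
    },
    "Negative": {
        "Decrease": ["decline", "diffuse", "down-regulate"],
        "Remove": ["deplete", "abrogate", "disassociate"],
    },
    "Neutral": {
        "Contain": ["possess", "constitute", "include"],
        "Modify": ["methylate", "modulate", "normalize"],
        "Method": ["bleach", "centrifuge", "spin"],
        "Report": ["evaluate", "analyze", "examine"],
    },
    "Plain": {"Plain": ["return", "switch", "balance"]},
}

# Flatten to verb -> (category, type) for constant-time lookup.
LOOKUP = {
    verb: (category, verb_type)
    for category, types in VERB_TYPES.items()
    for verb_type, verbs in types.items()
    for verb in verbs
}


def classify_verbs(lemmas):
    """Return (lemma, category, type) for every lemma found in the lookup."""
    return [(t, *LOOKUP[t.lower()]) for t in lemmas if t.lower() in LOOKUP]


print(classify_verbs(["mutation", "deplete", "and", "alter", "function"]))
```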

PKDE4J - RE

Step 2: Check structure of sentence
• Syntactic rules based on deep parsing
  – A dependency tree encodes grammatical relations between the words in a sentence.
  – The tree denotes syntactic dependencies between two entities.
  – Need to spot the portion of the parse tree that is useful, i.e. pertinent to the location of the entities in the sentence.

PKDE4J - RE

• Rule Extraction
  – Use Strategy design pattern
  – Capture predefined rules (17 strategies)

① Verb in dependency path
② No verb in dependency path
③ Detect nominalization
④ Weak nominalization
⑤ Negation
⑥ Tense (active / passive)
⑦ Contain clause
⑧ Clause distance
⑨ Negation clause
⑩ Number intervening entities
⑪ Entities in between
⑫ Surface distance
⑬ Entity counts
⑭ Same head
⑮ Entity order
⑯ Full tree path
⑰ Path length
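
The Strategy pattern here means each rule is an interchangeable unit applied to the same candidate entity pair. The sketch below shows the idea with four of the listed strategies, approximated on surface tokens because no dependency parser is used; it is not PKDE4J's rule engine, and the `Candidate` class, the negation word list and all function names are assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Candidate:
    tokens: List[str]             # tokenised sentence
    e1: int                       # token index of the first entity
    e2: int                       # token index of the second entity
    entity_positions: List[int]   # token indices of all entities in the sentence


def surface_distance(c: Candidate) -> int:
    return abs(c.e2 - c.e1)


def entity_order(c: Candidate) -> str:
    return "e1_first" if c.e1 < c.e2 else "e2_first"


def intervening_entities(c: Candidate) -> int:
    lo, hi = sorted((c.e1, c.e2))
    return sum(1 for p in c.entity_positions if lo < p < hi)


def negation(c: Candidate) -> bool:
    # Assumption: a tiny cue list checked between the two entities.
    lo, hi = sorted((c.e1, c.e2))
    return any(t.lower() in {"not", "no"} for t in c.tokens[lo:hi])


# Each strategy is interchangeable; adding a rule means adding one entry.
STRATEGIES: Dict[str, Callable[[Candidate], object]] = {
    "surface_distance": surface_distance,
    "entity_order": entity_order,
    "intervening_entities": intervening_entities,
    "negation": negation,
}


def apply_strategies(c: Candidate) -> Dict[str, object]:
    return {name: strategy(c) for name, strategy in STRATEGIES.items()}


candidate = Candidate(
    tokens="patients with tumour did not carry p.Lys618Ala".split(),
    e1=0,
    e2=6,
    entity_positions=[0, 2, 6],
)
print(apply_strategies(candidate))
```

Keeping each rule behind the same call signature lets rules be switched on or off per task without touching the code that applies them.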

Evaluation: PKDE4J over Variome Corpus

• Experimental set-up
  – Data split
  – Features?
  – 10-fold cross-validation

• Focus on relations: Used gold standard entities

• Baseline co-occurrence system

Results of the evaluation

Relation Extraction results for relations with at least 100 examples in the corpus.

Observations

• By applying text mining we can transform the literature from an unstructured, difficult to use resource, to a structured resource.

• We can build systems that can recognise core biological entities in the published literature.

• With this, the information is more accessible
  – Formalised and normalised in a database
  – Directly query-able

• and can be used to facilitate more computation:
  – Information retrieval in terms of entities
  – Predictive modeling and hypothesis generation

Conclusions

• Variants are relatively easy to recognise in the literature, when the recommended nomenclature is followed (so please use it!).

• The relations between variants and other entities are harder to extract, but still we can do a reasonable job.

• A lot of information sits in ancillary files associated with the literature (which poses some challenges for automated systems).

The literature can be effectively mined to identify variant-related information to assist biocuration and clinical interpretation of variants.
