using text mining to inform genetic variant interpretation
![Page 1: Using text mining to inform genetic variant interpretation](https://reader031.vdocuments.net/reader031/viewer/2022020411/5875ae791a28ab8b618b571f/html5/thumbnails/1.jpg)
Using text mining to inform genetic variant interpretation
Karin Verspoor
Department of Computing and Information [email protected]
![Page 2: Using text mining to inform genetic variant interpretation](https://reader031.vdocuments.net/reader031/viewer/2022020411/5875ae791a28ab8b618b571f/html5/thumbnails/2.jpg)
So you’re a medical doctor …
• With a very sick patient
• You can’t work out what’s going on
• You suspect a rare disease
• You order a DNA analysis (whole exome or genome)
• And find a genetic mutation
What does it mean?
![Page 3: Using text mining to inform genetic variant interpretation](https://reader031.vdocuments.net/reader031/viewer/2022020411/5875ae791a28ab8b618b571f/html5/thumbnails/3.jpg)
Clinical interpretation of variants
[Figure: Sample Data Flow (19 Apr 2013, BigData). A flowchart across three lanes (Wet Lab, Bioinformatics, Clinical Informatics), with each box marked as a Manual Step or an Automatic Step. Boxes: Patient Sample, Histology, DNA Extract, PCR, Sequencing, Alignment, Variant Calling, Annotation, DB Load, Filtering, Known Variant? (Yes/No), Curation, Variant Normalisation, Document Assembly, Report Editing and Signoff, Publish Report, Patient Clinical Report. Data stores: Peter Mac Mutation DB, External and Locus Specific DBs.]
Image courtesy Kenneth Doig, Peter Mac. “PipeCleaner for your NGS Pipeline” HISA Big Data 2013.
![Page 4: Using text mining to inform genetic variant interpretation](https://reader031.vdocuments.net/reader031/viewer/2022020411/5875ae791a28ab8b618b571f/html5/thumbnails/4.jpg)
What’s a mutation?
• Genomic variation: alteration in a sequence
  – hereditary (germ-line) mutations
  – acquired (somatic) mutations
• Examples of variation
  – SNP (single nucleotide polymorphism)
  – Protein mutation
  – insertions, deletions, duplications, inversions, . . .
• Types of variations
  – DNA variations that have no adverse effects on our cells and occur frequently in the population are called polymorphisms
  – DNA variations that do affect the function of the protein made from a gene and occur less often are called mutations
![Page 5: Using text mining to inform genetic variant interpretation](https://reader031.vdocuments.net/reader031/viewer/2022020411/5875ae791a28ab8b618b571f/html5/thumbnails/5.jpg)
The Challenge: Interpreting variants
• Identifying variation is becoming easier; interpreting it remains difficult
• Which changes are due to normal individual variation?
• Which are associated with a phenotype of interest?
![Page 6: Using text mining to inform genetic variant interpretation](https://reader031.vdocuments.net/reader031/viewer/2022020411/5875ae791a28ab8b618b571f/html5/thumbnails/6.jpg)
Interpreting variation through context
• Analysis of functional significance of variants
  – Predicted impact of mutations
  – Conservation analysis
  – Allele frequencies from large genomic databases
• Existing knowledge captured in structured sources
  – UniProt site-specific protein annotations
  – The Cancer Genome Atlas genomic characterisation data
  – Disease-specific variant databases, e.g. COSMIC and InSiGHT
• Techniques for annotating variants
  – Data aggregation from multiple sources
  – Data integration and inference to reveal shared pathways
![Page 7: Using text mining to inform genetic variant interpretation](https://reader031.vdocuments.net/reader031/viewer/2022020411/5875ae791a28ab8b618b571f/html5/thumbnails/7.jpg)
Exponential knowledge growth
• ~1550 peer-reviewed gene-related databases in NAR online Mol Bio collection
• Over 25 million PubMed entries (>2,000/day)
• Breakdown of disciplinary boundaries makes more of it relevant to each of us
![Page 8: Using text mining to inform genetic variant interpretation](https://reader031.vdocuments.net/reader031/viewer/2022020411/5875ae791a28ab8b618b571f/html5/thumbnails/8.jpg)
Why biomedical text mining?
[Figure: Exponential growth in size of PubMed. Publications per year (axis 0 to 1,200,000) plotted by year, 1914 to 2014.]
![Page 9: Using text mining to inform genetic variant interpretation](https://reader031.vdocuments.net/reader031/viewer/2022020411/5875ae791a28ab8b618b571f/html5/thumbnails/9.jpg)
Structured resources are not enough: Literature is the primary repository of knowledge
[Figure: Number of Swiss-Prot proteins (axis 0 to 120,000) from 1/02 to 1/07, in two series: proteins missing a FUNCTION comment and proteins gaining a FUNCTION comment.]
“Manual curation is not sufficient for annotation of genomic databases” Baumgartner et al., ISMB 2007
“Our entire understanding of biology and medicine is really contained in the published literature. And since people write in natural language, if you can’t get computers to turn that information into databases and computable information, you’re falling behind.”
-- Russ Altman, MD PhD, Stanford University
![Page 10: Using text mining to inform genetic variant interpretation](https://reader031.vdocuments.net/reader031/viewer/2022020411/5875ae791a28ab8b618b571f/html5/thumbnails/10.jpg)
Recovery of variants from the literature using text mining
Study:
Jimeno Yepes A, Verspoor K. (2014) Literature mining of genetic variants for curation: Quantifying the importance of supplementary material. Database: The Journal of Biological Databases and Curation, bau003. doi:10.1093/database/bau003 [PMID:24520105]
![Page 11: Using text mining to inform genetic variant interpretation](https://reader031.vdocuments.net/reader031/viewer/2022020411/5875ae791a28ab8b618b571f/html5/thumbnails/11.jpg)
Study: Recall of curated variants through the application of text mining
• Given a curated resource of genetic variants,
• with explicit links to the source literature for each variant,
• and a mutation extraction tool with demonstrated good performance on intrinsic evaluation
… how many variants can text mining recover?
![Page 12: Using text mining to inform genetic variant interpretation](https://reader031.vdocuments.net/reader031/viewer/2022020411/5875ae791a28ab8b618b571f/html5/thumbnails/12.jpg)
InSiGHT
Gene:
Variant: p.Lys286Gln
Lit. Reference: Takahashi et al 2007
![Page 13: Using text mining to inform genetic variant interpretation](https://reader031.vdocuments.net/reader031/viewer/2022020411/5875ae791a28ab8b618b571f/html5/thumbnails/13.jpg)
Motivations
• Assess real-world applicability of text mining tools for supporting analysis of genetic variants
• Speed up curation of mutation databases
![Page 14: Using text mining to inform genetic variant interpretation](https://reader031.vdocuments.net/reader031/viewer/2022020411/5875ae791a28ab8b618b571f/html5/thumbnails/14.jpg)
Two databases
• InSiGHT, Human Variome Project
  – MLH1, MSH2, MSH6 and PMS2 linked to Lynch syndrome (germline mutations)
• COSMIC, Sanger Institute
  – Somatic mutations linked to cancer

| Database | PMIDs associated to mutations | Total mutation count | Average mutations per article | Std Dev |
|---|---|---|---|---|
| InSiGHT | 809 | 7022 | 8.68 | 18.55 |
| COSMIC | 7898 | 198864 | 25.18 | 521.18 |
![Page 15: Using text mining to inform genetic variant interpretation](https://reader031.vdocuments.net/reader031/viewer/2022020411/5875ae791a28ab8b618b571f/html5/thumbnails/15.jpg)
Literature mutation extraction
• Many tools exist to perform mutation annotation
  – MutationMiner, MutationFinder, EMU, tmVar, SETH, ...
• Research shows that they have high precision and recall on MEDLINE abstracts (> 90% F1)
• There are also tools to do named entity extraction of genes, diseases, body parts …
Jimeno Yepes A, Verspoor K. (2014) Mutation extraction tools can be combined for robust recognition of genetic variants in the literature. F1000Research 2014, 3:18. doi: 10.12688/f1000research.3-18.v2 [PMID:25285203]
![Page 16: Using text mining to inform genetic variant interpretation](https://reader031.vdocuments.net/reader031/viewer/2022020411/5875ae791a28ab8b618b571f/html5/thumbnails/16.jpg)
How to extract mutations from text?
• Essentially a named entity recognition task.
• Early attempts focused on SNPs and protein mutations (amino acid residues).
• e.g., MutationFinder¹ patterns (simplified):
  (?P<wt_res>AminoAcid)(?P<pos>[1-9][0-9]*)(?P<mut_res>AminoAcid)
  Examples: Gly17Ser, Ser97Pro
• where AminoAcid is:
  (CYS|ILE|SER|GLN|MET|ASN|PRO|LYS|ASP|THR|PHE|ALA|GLY|HIS|LEU|ARG|TRP|VAL|GLU|TYR)|(GLUTAMINE|GLUTAMIC ACID|LEUCINE|VALINE|ISOLEUCINE|LYSINE|ALANINE|GLYCINE|ASPARTATE|METHIONINE|THREONINE|HISTIDINE|ASPARTIC ACID|ARGININE|ASPARAGINE|TRYPTOPHAN|PROLINE|PHENYLALANINE|CYSTEINE|SERINE|GLUTAMATE|TYROSINE)
¹ http://mutationfinder.sourceforge.net/
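The simplified pattern above can be tried out directly. The following is a minimal Python sketch, assuming only the three-letter codes from the AminoAcid list; the function name `find_point_mutations` is made up for illustration, and the real MutationFinder rule set is considerably larger.

```python
import re

# Three-letter amino acid codes only (an assumption for brevity); the slide's
# AminoAcid alternation also includes full residue names.
AMINO_ACID = (
    "CYS|ILE|SER|GLN|MET|ASN|PRO|LYS|ASP|THR|PHE|ALA|GLY|HIS|LEU|ARG|"
    "TRP|VAL|GLU|TYR"
)

# Wild-type residue, 1-based position, mutant residue, e.g. Gly17Ser.
MUTATION = re.compile(
    rf"\b(?P<wt_res>{AMINO_ACID})(?P<pos>[1-9][0-9]*)(?P<mut_res>{AMINO_ACID})\b",
    re.IGNORECASE,
)


def find_point_mutations(text):
    """Return (wild-type, position, mutant) triples found in the text."""
    return [
        (
            m.group("wt_res").capitalize(),
            int(m.group("pos")),
            m.group("mut_res").capitalize(),
        )
        for m in MUTATION.finditer(text)
    ]


print(find_point_mutations("Both Gly17Ser and Ser97Pro were reported."))
# [('Gly', 17, 'Ser'), ('Ser', 97, 'Pro')]
```

Named groups keep the three parts of a mention separate, which makes later normalisation to a standard format straightforward.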
![Page 17: Using text mining to inform genetic variant interpretation](https://reader031.vdocuments.net/reader031/viewer/2022020411/5875ae791a28ab8b618b571f/html5/thumbnails/17.jpg)
Human Genome Variation Society nomenclature (excerpt)
![Page 18: Using text mining to inform genetic variant interpretation](https://reader031.vdocuments.net/reader031/viewer/2022020411/5875ae791a28ab8b618b571f/html5/thumbnails/18.jpg)
Extraction of mutations from text
• Pattern-based approach to identifying genetic variants
  – dbSNP identifiers and standard HGVS nomenclature (e.g. SETH https://rockt.github.io/SETH)
  – natural language expressions of mutations
    o This missense mutation converts a highly conserved glycine (Gly17 of neurophysin) to a valine residue.
    o Killer of prune (Kpn) is a mutation in the awd gene which substitutes Ser for Pro at position 97 and causes dominant lethality in individuals that do not have a functional prune gene.
    o … where cysteines at positions 6, 42, 48, 90 and 393 were replaced by serine.
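For the standardised forms (dbSNP identifiers and HGVS-style mentions), a few regular expressions already go a long way. The sketch below is an illustration only: the three patterns are deliberately narrow assumptions, not the grammar SETH implements, and the test sentence is made up.

```python
import re

# Deliberately narrow, illustrative patterns (assumptions, not SETH's grammar).
RSID = re.compile(r"\brs[0-9]+\b")                                # dbSNP identifier
HGVS_PROTEIN = re.compile(r"\bp\.[A-Z][a-z]{2}[0-9]+(?:[A-Z][a-z]{2}|\*)")
HGVS_CODING_SUB = re.compile(r"\bc\.[0-9]+[ACGT]>[ACGT]\b")      # coding substitution


def find_standard_mentions(text):
    """Collect dbSNP and simple HGVS mentions, grouped by kind."""
    return {
        "rsid": RSID.findall(text),
        "protein": HGVS_PROTEIN.findall(text),
        "coding": HGVS_CODING_SUB.findall(text),
    }


# Made-up sentence for illustration only.
sentence = "The variant rs12345 (c.2131C>T, p.Lys618Ala) was found."
print(find_standard_mentions(sentence))
```

Natural language expressions such as "replaced by serine" need separate, looser patterns, which is where most of the engineering effort in these tools goes.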
![Page 19: Using text mining to inform genetic variant interpretation](https://reader031.vdocuments.net/reader031/viewer/2022020411/5875ae791a28ab8b618b571f/html5/thumbnails/19.jpg)
Extractor of Mutations (Kann Lab)
![Page 20: Using text mining to inform genetic variant interpretation](https://reader031.vdocuments.net/reader031/viewer/2022020411/5875ae791a28ab8b618b571f/html5/thumbnails/20.jpg)
Study text sources
• PubMed
  – 22M citations; title and abstract
• PubMed Central
  – full text
  – 512k available from PMC-Open Access
• Publisher site crawling
  – Availability depends on license
  – HTML pages can be noisy
    • C676T–>Arg226Stop vs. C676TâArg226Stop
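The garbled example is what a UTF-8 arrow looks like when the page is read as Latin-1. A small Python sketch, assuming the original character was a rightwards arrow (U+2192) and that the page was mis-decoded exactly once; the helper names are hypothetical.

```python
def garble(text):
    """Simulate a page saved as UTF-8 but read as Latin-1."""
    return text.encode("utf-8").decode("latin-1")


def repair(text):
    """Undo that misreading; return the input unchanged if it does not apply."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


clean = "C676T\u2192Arg226Stop"  # U+2192 is a rightwards arrow (assumed original)
noisy = garble(clean)            # 'â' followed by two invisible control characters

print(repr(noisy))
print(repair(noisy) == clean)
```

Running a repair step like this before mutation extraction keeps the patterns from silently missing mentions that contain arrows, Greek letters or other non-ASCII symbols.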
![Page 21: Using text mining to inform genetic variant interpretation](https://reader031.vdocuments.net/reader031/viewer/2022020411/5875ae791a28ab8b618b571f/html5/thumbnails/21.jpg)
Extraction with EMU over our data
• EMU: Extracts mutations from text and links them to co-occurring genes
• Normalize all mutation mentions to HGVS format
  – Format used in COSMIC and InSiGHT
• Match {gene, HGVS variant, PMID} to curated data
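The matching step can be pictured as set intersection over triples. This is a minimal sketch, not the study's code: the `recall` function, the gene labels and the PMID placeholders are all hypothetical, and the variant strings are used purely as example keys.

```python
def recall(curated, extracted, ignore_gene=False):
    """Fraction of curated (gene, variant, pmid) triples that were extracted."""
    if ignore_gene:
        # "No Gene" setting: compare on (variant, pmid) only.
        curated = {(variant, pmid) for _, variant, pmid in curated}
        extracted = {(variant, pmid) for _, variant, pmid in extracted}
    if not curated:
        return 0.0
    return len(curated & extracted) / len(curated)


# Hypothetical records for illustration; not taken from InSiGHT or COSMIC.
curated = {
    ("GENE1", "p.Lys618Ala", "PMID_A"),
    ("GENE2", "c.482_483del", "PMID_A"),
    ("GENE3", "p.Gly17Ser", "PMID_B"),
}
extracted = {
    ("GENE1", "p.Lys618Ala", "PMID_A"),
    ("GENE9", "c.482_483del", "PMID_A"),  # right variant, wrong gene
}

print(recall(curated, extracted))                    # strict match
print(recall(curated, extracted, ignore_gene=True))  # gene ignored
```

Ignoring the gene can only raise recall, since it forgives cases where the variant was found but linked to the wrong co-occurring gene.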
![Page 22: Using text mining to inform genetic variant interpretation](https://reader031.vdocuments.net/reader031/viewer/2022020411/5875ae791a28ab8b618b571f/html5/thumbnails/22.jpg)
Results: Abstracts and Full Text

NG = No Gene (ignoring gene in match)

Common/Cmn = PMIDs in common between database and corpus subset (recall with respect to articles for which mutation entity recogniser had at least one positive extraction)

| Set | Cmn art | Match mutation | Recall | Recall NG | Mutations common | Recall common | Recall Cmn NG |
|---|---|---|---|---|---|---|---|
| COSMIC Abs | 2200 | 1884 | 0.0095 | 0.0122 | 12,940 | 0.1456 | 0.1875 |
| COSMIC FT | 2071 | 3656 | 0.0184 | 0.0215 | 104,756 | 0.0349 | 0.0408 |
| COSMIC Abs + FT | 3738 | 4754 | 0.0239 | 0.0289 | 114,279 | 0.0416 | 0.0503 |
| InSiGHT Abs | 195 | 230 | 0.0328 | 0.045 | 1233 | 0.1865 | 0.2562 |
| InSiGHT FT | 150 | 404 | 0.0575 | 0.0612 | 1626 | 0.2484 | 0.2644 |
| InSiGHT Abs + FT | 295 | 588 | 0.0837 | 0.0961 | 2657 | 0.2213 | 0.254 |
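The two recall columns appear to differ only in their denominator. The Python sketch below assumes that Recall is matched mutations over the database's total mutation count (198864 for COSMIC, 7022 for InSiGHT, from the "Two databases" slide) and that Recall common is matched mutations over "Mutations common"; under that assumption the abstract rows reproduce.

```python
# Figures copied from the slides; the dictionary layout is an assumption.
rows = {
    "COSMIC Abs": {"matched": 1884, "total": 198864, "in_common": 12940},
    "InSiGHT Abs": {"matched": 230, "total": 7022, "in_common": 1233},
}


def overall_recall(row):
    """Matched mutations over every curated mutation in the database."""
    return row["matched"] / row["total"]


def common_recall(row):
    """Matched mutations over curated mutations in the articles in common."""
    return row["matched"] / row["in_common"]


for name, row in rows.items():
    print(name, round(overall_recall(row), 4), round(common_recall(row), 4))
```

The gap between the two numbers shows how much of the low overall recall comes from articles the extractor never produced a hit for.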
![Page 23: Using text mining to inform genetic variant interpretation](https://reader031.vdocuments.net/reader031/viewer/2022020411/5875ae791a28ab8b618b571f/html5/thumbnails/23.jpg)
High Throughput vs non-High Throughput
| Set | Cmn art | Match mutation | Recall | Recall NG | Recall common | Recall Cmn NG |
|---|---|---|---|---|---|---|
| HT abstract | 1650 | 1357 | 0.0072 | 0.0096 | 0.1209 | 0.1608 |
| HT full text | 1545 | 2719 | 0.0145 | 0.0172 | 0.027 | 0.0319 |
| HT Abs + FT | 2608 | 3501 | 0.0187 | 0.0231 | 0.032 | 0.0395 |
| NHT abstract | 550 | 530 | 0.0461 | 0.0543 | 0.3055 | 0.3597 |
| NHT full text | 526 | 937 | 0.0815 | 0.0915 | 0.235 | 0.2639 |
| NHT Abs + FT | 841 | 1259 | 0.109 | 0.1243 | 0.2538 | 0.2895 |

| Group | PMIDs | Count | Average mutation | SD | Mutation recall |
|---|---|---|---|---|---|
| COSMIC | 7898 | 198864 | 25.18 | 521.27 | 100.00% |
| COSMIC-HT | 6266 | 187367 | 29.9 | 584.82 | 94.22% |
| COSMIC-NHT | 1632 | 11497 | 7.04 | 38.05 | 5.78% |
![Page 24: Using text mining to inform genetic variant interpretation](https://reader031.vdocuments.net/reader031/viewer/2022020411/5875ae791a28ab8b618b571f/html5/thumbnails/24.jpg)
Considering tables and Supplementary material
• Subset from COSMIC and InSiGHT available as PubMed Central Open Access articles
• Supplementary material: MS Word, PDF, MS Excel, PPT, images, …
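| Set | InSiGHT Articles | InSiGHT Matched | InSiGHT Recall (%) | COSMIC Articles | COSMIC Matched | COSMIC Recall (%) |
|---|---|---|---|---|---|---|
| Abstracts | 13 | 1 | 0.4 | 563 | 140 | 0.41 |
| XML Full Text (FT) | 9 | 20 | 7.94 | 487 | 694 | 2.05 |
| PDF FT (PDFFT) | 4 | 7 | 2.78 | 76 | 23 | 0.07 |
| Tables | 8 | 18 | 7.14 | 394 | 466 | 1.38 |
| FT+PDFFT+Tables | 13 | 44 | 17.46 | 563 | 929 | 2.75 |
| Supp. Mat. | 1 | 88 | 34.92 | 138 | 17015 | 50.59 |
| All | 13 | 115 | 45.63 | 563 | 17896 | 52.92 |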
![Page 25: Using text mining to inform genetic variant interpretation](https://reader031.vdocuments.net/reader031/viewer/2022020411/5875ae791a28ab8b618b571f/html5/thumbnails/25.jpg)
Recall still only 50%: Where are the rest?
• Expressed in semi-structured data sources
  – do not necessarily follow standard nomenclature more predictably
  – data spread unpredictably across columns (Wong et al. 2009)
• Different reference position in text than database
  – curator correction or normalized to different build
• Nomenclature variation
  – c.482_483delGA vs c.482_483del2
• Linguistic expression of mutations
  – deletion of exon 3
  – C>T mutation at nucleotide 2131
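Some of the nomenclature variation can be removed by mapping variants to one canonical string before comparison. A minimal Python sketch, assuming that dropping the deleted bases or the deleted-base count is an acceptable canonical form for simple deletions; the function name is hypothetical and only this one pattern is handled.

```python
import re

# A deletion at one position or a range, optionally followed by the deleted
# bases (delGA) or by their count (del2).
DELETION = re.compile(
    r"^(?P<prefix>[cg]\.[0-9]+(?:_[0-9]+)?del)(?:[ACGT]+|[0-9]+)?$"
)


def normalise_deletion(variant):
    """Map equivalent spellings of a simple deletion to one canonical form."""
    match = DELETION.match(variant)
    return match.group("prefix") if match else variant


print(normalise_deletion("c.482_483delGA"))
print(normalise_deletion("c.482_483del2"))
```

After this step both spellings compare equal, so a text-mined mention and a curated entry that differ only in style are counted as a match.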
![Page 26: Using text mining to inform genetic variant interpretation](https://reader031.vdocuments.net/reader031/viewer/2022020411/5875ae791a28ab8b618b571f/html5/thumbnails/26.jpg)
Information in tables (spreadsheets, etc.) is expressed differently than in narrative text
![Page 27: Using text mining to inform genetic variant interpretation](https://reader031.vdocuments.net/reader031/viewer/2022020411/5875ae791a28ab8b618b571f/html5/thumbnails/27.jpg)
Gene listed in column heading
Non-standard nomenclature: “Del exon 7”
![Page 28: Using text mining to inform genetic variant interpretation](https://reader031.vdocuments.net/reader031/viewer/2022020411/5875ae791a28ab8b618b571f/html5/thumbnails/28.jpg)
Text mining over semi-structured data?
• Access ?
• Variability (!)
  – File formats
  – How connected to the main text?
• Semantics (?!)
  – How to make sense of the data?
  – How to map to standardized nomenclature?
… processing supplementary material will require new strategies. Some technical solutions. Some research.
![Page 29: Using text mining to inform genetic variant interpretation](https://reader031.vdocuments.net/reader031/viewer/2022020411/5875ae791a28ab8b618b571f/html5/thumbnails/29.jpg)
Extraction of gene-disease-mutation relations
Study:
Verspoor KM, Heo GH, Kang KY, Song M. (in press) Extraction of fine-grained semantic relations for the Human Variome. BMC Medical Informatics and Decision Making.
![Page 30: Using text mining to inform genetic variant interpretation](https://reader031.vdocuments.net/reader031/viewer/2022020411/5875ae791a28ab8b618b571f/html5/thumbnails/30.jpg)
Variant interpretation using literature
• Evidence of prior significance of variants
• Evidence of established connection of the variant to specific patient cohorts
• Use alone or in combination with other evidence
• We aim to extract the relations that connect genes, diseases and mutations
• Specific Objective of the work: relation extraction over the Variome Corpus
![Page 31: Using text mining to inform genetic variant interpretation](https://reader031.vdocuments.net/reader031/viewer/2022020411/5875ae791a28ab8b618b571f/html5/thumbnails/31.jpg)
gene-mutation-disease-phenotype relations
• Variome Annotation Schema
  – a schema defining entities and relations of interest to curation of genetic variants
• Variome Corpus
  – A corpus of full text articles annotated according to the Variome Annotation Schema
  – To be used as training and evaluation data for text mining tools for extracting genetic variation information from the published literature
http://www.opennicta.com.au/home/health/variome
![Page 32: Using text mining to inform genetic variant interpretation](https://reader031.vdocuments.net/reader031/viewer/2022020411/5875ae791a28ab8b618b571f/html5/thumbnails/32.jpg)
The Variome Corpus
10 full-text publications related to colorectal cancer

| Entities | Relations |
|---|---|
| Gene | Gene-has-Mutation |
| Mutation | Cohort/Patient-has-Mutation |
| Disease | Mutation-relatedto-Disease |
| Body part | Disease-relatedto-Gene |
| Cohort/Patient | Disease-relatedto-BodyPart |
| Size | Mutation-has-Size |
| Age | Cohort/Patient-has-Age |
| Gender | Cohort/Patient-has-Gender |
| Ethnicity or Geo Location | Cohort/Patient-has-EthnicityLoc |
| Characteristic | Cohort/Patient-has-Disease |
| | Cohort-has-size |

Verspoor K, Jimeno Yepes A, Cavedon L, McIntosh T, Herten-Crabb A, Thomas Z, Plazzer JP. (2013) Annotating the Biomedical Literature for the Human Variome. Database: The Journal of Biological Databases and Curation, bat019.
• 43k words
• Double-annotated
• IAA varies
• .88-.92 F for entities
• Relations much lower; reconciled manually
![Page 33: Using text mining to inform genetic variant interpretation](https://reader031.vdocuments.net/reader031/viewer/2022020411/5875ae791a28ab8b618b571f/html5/thumbnails/33.jpg)
The Variome Corpus annotation
![Page 34: Using text mining to inform genetic variant interpretation](https://reader031.vdocuments.net/reader031/viewer/2022020411/5875ae791a28ab8b618b571f/html5/thumbnails/34.jpg)
Extraction of mutation relations from text
• Recognise genetic variants
• Named entity recognition for gene names
  – Supervised learning for recognizing characteristics and contexts
  – Combined with dictionaries to support normalisation
• Associating variants to genes
  – Simple co-occurrence
  – Combined with sequence verification
  – Machine learning for relation classification (PKDE4J)
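The "simple co-occurrence" option can be sketched in a few lines of Python. This is an illustration under stated assumptions: the gene dictionary holds only the four genes named earlier in the deck, variants are limited to one HGVS protein pattern, sentences are split naively on end punctuation, and the example text is made up.

```python
import re

GENES = {"MLH1", "MSH2", "MSH6", "PMS2"}  # tiny dictionary, for illustration
PROTEIN_VARIANT = re.compile(r"\bp\.[A-Z][a-z]{2}[0-9]+[A-Z][a-z]{2}\b")
TOKEN = re.compile(r"[A-Za-z0-9]+")


def cooccurring_pairs(text):
    """Pair every gene with every variant mentioned in the same sentence."""
    pairs = []
    # Naive sentence split: end punctuation followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        genes = [tok for tok in TOKEN.findall(sentence) if tok in GENES]
        variants = PROTEIN_VARIANT.findall(sentence)
        pairs.extend((gene, variant) for gene in genes for variant in variants)
    return pairs


# Made-up text for illustration only.
text = "The p.Lys618Ala variant in MLH1 was seen in two families. MSH2 was not tested."
print(cooccurring_pairs(text))
```

Co-occurrence over-generates when a sentence names several genes, which is why the slide lists sequence verification and relation classification as refinements.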
![Page 35: Using text mining to inform genetic variant interpretation](https://reader031.vdocuments.net/reader031/viewer/2022020411/5875ae791a28ab8b618b571f/html5/thumbnails/35.jpg)
Information Extraction, Structuring text
From:
A subset of colorectal tumour DNA samples from 17 patients carrying the p.Lys618Ala variant …

To:
T60 body-part 1307 1317 colorectal
T7 disease 1318 1324 tumour
R17_m relatedTo Arg1:T60 Arg2:T7
  (colorectal relatedTo tumour)
T61_merge size 1342 1344 17
T24 cohort-patient 1345 1353 patients
R46_2 has Arg1:T24 Arg2:T7
  (patients has tumour)
T62 mutation 1367 1378 p.Lys618Ala
R18_m has Arg1:T24 Arg2:T61_merge
  (patients has 17) = (patient group size 17)
R19_m has Arg1:T24 Arg2:T62
  (patients has p.Lys618Ala)
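Once the text is in this standoff form it is easy to load into a program. The following Python sketch parses the annotation lines shown on the slide, assuming whitespace-separated fields as they appear here (annotation tools often use tabs); the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Entity:
    id: str
    type: str
    start: int
    end: int
    text: str


@dataclass
class Relation:
    id: str
    type: str
    arg1: str
    arg2: str


def parse_annotations(lines):
    """Split standoff lines into entities (T...) and relations (R...)."""
    entities, relations = {}, []
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if fields[0].startswith("T"):
            ident, etype, start, end = fields[:4]
            entities[ident] = Entity(ident, etype, int(start), int(end), " ".join(fields[4:]))
        elif fields[0].startswith("R"):
            ident, rtype, arg1, arg2 = fields[:4]
            relations.append(Relation(ident, rtype, arg1.split(":", 1)[1], arg2.split(":", 1)[1]))
    return entities, relations


def readable(entities, relations):
    """Render each relation as (argument text, relation type, argument text)."""
    return [(entities[r.arg1].text, r.type, entities[r.arg2].text) for r in relations]


lines = [
    "T60 body-part 1307 1317 colorectal",
    "T7 disease 1318 1324 tumour",
    "R17_m relatedTo Arg1:T60 Arg2:T7",
    "T61_merge size 1342 1344 17",
    "T24 cohort-patient 1345 1353 patients",
    "R46_2 has Arg1:T24 Arg2:T7",
    "T62 mutation 1367 1378 p.Lys618Ala",
    "R18_m has Arg1:T24 Arg2:T61_merge",
    "R19_m has Arg1:T24 Arg2:T62",
]
entities, relations = parse_annotations(lines)
for triple in readable(entities, relations):
    print(triple)
```

The printed triples correspond to the parenthesised glosses on the slide, e.g. patients has p.Lys618Ala.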
![Page 36: Using text mining to inform genetic variant interpretation](https://reader031.vdocuments.net/reader031/viewer/2022020411/5875ae791a28ab8b618b571f/html5/thumbnails/36.jpg)
PKDE4J: Yonsei University IE system
• PKDE4J
  – Extensible, flexible text mining system for public knowledge discovery
  – Entity and relation extraction from the unstructured text data
  – Extension of Stanford CoreNLP (Manning et al., 2014)
  – http://informatics.yonsei.ac.kr/pkde4j
• Differentiation of PKDE4J
  – Configurable system
    • Dictionary based entity extraction
    • Extensible system
    • Wide range of relation extraction tasks developing an extensible rule engine based on dependency parsing
  – Accurate performance
    • PKDE4J outperforms many other competing algorithms for both entity and relation extraction
![Page 37: Using text mining to inform genetic variant interpretation](https://reader031.vdocuments.net/reader031/viewer/2022020411/5875ae791a28ab8b618b571f/html5/thumbnails/37.jpg)
PKDE4J: Yonsei University IE system
• PKDE4J’s two major pipelines
  – Entity Extraction: target entities based on dictionaries by extending Stanford CoreNLP
  – Relation Extraction: relationships among entities based on dependency-tree-based rules
![Page 38: Using text mining to inform genetic variant interpretation](https://reader031.vdocuments.net/reader031/viewer/2022020411/5875ae791a28ab8b618b571f/html5/thumbnails/38.jpg)
PKDE4J – Named Entity Recognition
![Page 39: Using text mining to inform genetic variant interpretation](https://reader031.vdocuments.net/reader031/viewer/2022020411/5875ae791a28ab8b618b571f/html5/thumbnails/39.jpg)
PKDE4J – Named Entity Recognition
• Extension of Stanford CoreNLP
• Three major submodules: Pre-Processing, Dictionary loading, Entity annotation

Pre-Processing
• Abbreviation resolution
• Tokenization: Stanford PTBTokenizer
• Sentence splitting, POS tagging, Lemmatization: Stanford CoreNLP
• String normalization: Special characters processing

Dictionary loading
• Flexible configuration (number and format of dictionaries)
• Trie data structure

Entity annotation
• N-gram matching: Apache Lucene ShingleWrapper
• Approximate string matching: Soft-TFIDF
• Regex NER (Rule-based): Stanford CoreNLP
• Candidate entities filtering: POS filtering, Stopword removal
• Labeling: B/I/O format, Entity type
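The combination of a trie-backed dictionary and B/I/O labelling can be illustrated in Python. This is a sketch of the general technique, not PKDE4J's implementation (which is Java and builds on Stanford CoreNLP); the dictionary entries, the `$type` marker key and the function names are assumptions.

```python
def build_trie(dictionary):
    """Build a token-level trie from a phrase -> entity type mapping."""
    root = {}
    for phrase, entity_type in dictionary.items():
        node = root
        for token in phrase.lower().split():
            node = node.setdefault(token, {})
        node["$type"] = entity_type  # marks the end of a dictionary phrase
    return root


def tag_bio(tokens, trie):
    """Label tokens with B-/I-/O tags using longest dictionary match."""
    labels = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        node, j, best = trie, i, None
        # Walk the trie as far as the tokens allow, remembering the longest hit.
        while j < len(tokens) and tokens[j].lower() in node:
            node = node[tokens[j].lower()]
            j += 1
            if "$type" in node:
                best = (j, node["$type"])
        if best:
            end, entity_type = best
            labels[i] = "B-" + entity_type
            for k in range(i + 1, end):
                labels[k] = "I-" + entity_type
            i = end
        else:
            i += 1
    return labels


# Hypothetical dictionary and sentence, for illustration only.
trie = build_trie({"colorectal cancer": "disease", "colorectal": "body-part", "MLH1": "gene"})
tokens = "MLH1 mutations in colorectal cancer patients".split()
print(list(zip(tokens, tag_bio(tokens, trie))))
```

Longest match means "colorectal cancer" is labelled as one disease mention instead of a body part followed by an untagged word.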
![Page 40: Using text mining to inform genetic variant interpretation](https://reader031.vdocuments.net/reader031/viewer/2022020411/5875ae791a28ab8b618b571f/html5/thumbnails/40.jpg)
PKDE4J – Relation Extraction
![Page 41: Using text mining to inform genetic variant interpretation](https://reader031.vdocuments.net/reader031/viewer/2022020411/5875ae791a28ab8b618b571f/html5/thumbnails/41.jpg)
PKDE4J – Relation Extraction
• Based on rules over the dependency parse (grammatical structure)

• To extract a relation

Step 1: Identify the verbs in a sentence

| Category | Number of Verbs | Type | Verb Example |
|---|---|---|---|
| Positive | 68 | Increase | Lead, Contribute, Rise |
| | | Transmit | Shift, Move, Migrate |
| | | Substitute | Supplement, Alter |
| Negative | 54 | Decrease | Decline, Diffuse, Down-regulate |
| | | Remove | Deplete, Abrogate, Disassociate |
| Neutral | 111 | Contain | Possess, Constitute, Include |
| | | Modify | Methylate, Modulate, Normalize |
| | | Method | Bleach, Centrifuge, Spin |
| | | Report | Evaluate, Analyze, Examine |
| Plain | 165 | Plain | Return, Switch, Balance |
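Step 1 amounts to a lexicon lookup once verbs have been lemmatised. A minimal Python sketch using only the example verbs listed in the table (the full lexicon has 398 verbs according to the counts above); the function name is hypothetical and the input is assumed to be lemmas already.

```python
# Example verbs from the table only; the real lexicon is much larger.
VERB_TABLE = {
    ("Positive", "Increase"): ["lead", "contribute", "rise"],
    ("Positive", "Transmit"): ["shift", "move", "migrate"],
    ("Positive", "Substitute"): ["supplement", "alter"],
    ("Negative", "Decrease"): ["decline", "diffuse", "down-regulate"],
    ("Negative", "Remove"): ["deplete", "abrogate", "disassociate"],
    ("Neutral", "Contain"): ["possess", "constitute", "include"],
    ("Neutral", "Modify"): ["methylate", "modulate", "normalize"],
    ("Neutral", "Method"): ["bleach", "centrifuge", "spin"],
    ("Neutral", "Report"): ["evaluate", "analyze", "examine"],
    ("Plain", "Plain"): ["return", "switch", "balance"],
}

# Invert the table: verb lemma -> (category, type).
VERB_LOOKUP = {verb: key for key, verbs in VERB_TABLE.items() for verb in verbs}


def categorise_verbs(lemmas):
    """Return (lemma, category, type) for every lemma found in the lexicon."""
    return [
        (lemma, *VERB_LOOKUP[lemma.lower()])
        for lemma in lemmas
        if lemma.lower() in VERB_LOOKUP
    ]


print(categorise_verbs(["deplete", "include", "walk"]))
```

Lemmas that are not in the lexicon, such as "walk" here, are simply skipped.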
![Page 42: Using text mining to inform genetic variant interpretation](https://reader031.vdocuments.net/reader031/viewer/2022020411/5875ae791a28ab8b618b571f/html5/thumbnails/42.jpg)
PKDE4J - RE
Step 2: Check structure of sentence
• Syntactic rules based on deep parsing
  – Dependency tree encodes grammatical relations between words in a sentence.
  – The tree denotes syntactic dependencies between two entities.
  – Need to spot the portion of the parse tree that is useful, i.e. pertinent to the location of entities in a sentence.
![Page 43: Using text mining to inform genetic variant interpretation](https://reader031.vdocuments.net/reader031/viewer/2022020411/5875ae791a28ab8b618b571f/html5/thumbnails/43.jpg)
PKDE4J - RE
• Rule Extraction
  – Use Strategy design pattern
  – Capture predefined rules (17 strategies)

① Verb in dependency path
② No verb in dependency path
③ Detect nominalization
④ Weak nominalization
⑤ Negation
⑥ Tense (active / passive)
⑦ Contain clause
⑧ Clause distance
⑨ Negation clause
⑩ Number intervening entities
⑪ Entities in between
⑫ Surface distance
⑬ Entity counts
⑭ Same head
⑮ Entity order
⑯ Full tree path
⑰ Path length
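The Strategy design pattern means each rule is a small interchangeable object behind a common interface. The Python sketch below illustrates the pattern with three surface-level stand-ins for strategies ⑤, ⑫ and ⑮; it is not PKDE4J's code (which is Java and works on dependency trees), and all class names, the negation cue list and the single-token entity indices are assumptions.

```python
from abc import ABC, abstractmethod


class FeatureStrategy(ABC):
    """Common interface: one strategy computes one feature for an entity pair."""

    name = "feature"

    @abstractmethod
    def extract(self, tokens, e1, e2):
        """e1 and e2 are token indices of two single-token entities."""


class SurfaceDistance(FeatureStrategy):
    name = "surface_distance"

    def extract(self, tokens, e1, e2):
        return abs(e2 - e1) - 1  # number of tokens between the entities


class EntityOrder(FeatureStrategy):
    name = "entity_order"

    def extract(self, tokens, e1, e2):
        return "e1_first" if e1 < e2 else "e2_first"


class Negation(FeatureStrategy):
    name = "negation"
    CUES = {"not", "no", "never"}  # assumed cue list

    def extract(self, tokens, e1, e2):
        lo, hi = sorted((e1, e2))
        return any(tok.lower() in self.CUES for tok in tokens[lo + 1:hi])


def run_strategies(strategies, tokens, e1, e2):
    """Apply every configured strategy and collect the features by name."""
    return {s.name: s.extract(tokens, e1, e2) for s in strategies}


tokens = "patients did not carry p.Lys618Ala".split()
features = run_strategies([SurfaceDistance(), EntityOrder(), Negation()], tokens, 0, 4)
print(features)
```

New rules are added by writing another subclass and putting an instance in the list, without touching the code that runs them.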
![Page 44: Using text mining to inform genetic variant interpretation](https://reader031.vdocuments.net/reader031/viewer/2022020411/5875ae791a28ab8b618b571f/html5/thumbnails/44.jpg)
Evaluation: PKDE4J over Variome Corpus
• Experimental set-up
  – Data split
  – Features?
  – 10-fold cross-validation
• Focus on relations: Used gold standard entities
• Baseline co-occurrence system
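The evaluation set-up can be sketched without any library. In the Python sketch below, the round-robin fold assignment is an assumption (the slides do not say how the data were split), and precision, recall and F1 are computed over sets of relation identifiers; the function names are hypothetical.

```python
def k_fold_indices(n_items, k=10):
    """Yield (train, test) index lists for k folds, assigned round-robin."""
    folds = [list(range(i, n_items, k)) for i in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, test


def precision_recall_f1(gold, predicted):
    """Score a set of predicted relations against the gold standard set."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


splits = list(k_fold_indices(25, k=10))
print(len(splits), splits[0][1])

# Toy relation identifiers for illustration.
print(precision_recall_f1({1, 2, 3, 4}, {3, 4, 5}))
```

The same scoring function can be applied to the co-occurrence baseline and to the rule-based system, so both are judged on identical folds.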
![Page 45: Using text mining to inform genetic variant interpretation](https://reader031.vdocuments.net/reader031/viewer/2022020411/5875ae791a28ab8b618b571f/html5/thumbnails/45.jpg)
Results of the evaluation
Relation Extraction results for relations with at least 100 examples in the corpus.
![Page 46: Using text mining to inform genetic variant interpretation](https://reader031.vdocuments.net/reader031/viewer/2022020411/5875ae791a28ab8b618b571f/html5/thumbnails/46.jpg)
Observations
• By applying text mining we can transform the literature from an unstructured, difficult to use resource, to a structured resource.
• We can build systems that can recognise core biological entities in the published literature.
• With this, the information is more accessible
  – Formalised and normalised in a database
  – Directly query-able
• and can be used to facilitate more computation:
  – Information retrieval in terms of entities
  – Predictive modeling and hypothesis generation
![Page 47: Using text mining to inform genetic variant interpretation](https://reader031.vdocuments.net/reader031/viewer/2022020411/5875ae791a28ab8b618b571f/html5/thumbnails/47.jpg)
Conclusions
• Variants are relatively easy to recognise in the literature, when the recommended nomenclature is followed (so please use it!).
• The relations between variants and other entities are harder to extract, but still we can do a reasonable job.
• There is lots of information that is in ancillary files associated to the literature (with some challenges for automated systems).
The literature can be effectively mined to identify variant-related information to assist biocuration and clinical interpretation of variants.
![Page 48: Using text mining to inform genetic variant interpretation](https://reader031.vdocuments.net/reader031/viewer/2022020411/5875ae791a28ab8b618b571f/html5/thumbnails/48.jpg)
© Copyright The University of Melbourne 2016