![Page 1: Machine-learning in building bioinformatics databases for infectious diseases](https://reader034.vdocuments.net/reader034/viewer/2022042718/56814b68550346895db85707/html5/thumbnails/1.jpg)
Machine-learning in building bioinformatics databases for infectious diseases
Victor Tong, Institute for Infocomm Research, A*STAR, Singapore
ASEAN-China International Bioinformatics Workshop 2008, 17 Apr 2008
![Page 2: Machine-learning in building bioinformatics databases for infectious diseases](https://reader034.vdocuments.net/reader034/viewer/2022042718/56814b68550346895db85707/html5/thumbnails/2.jpg)
Overview
Definitions and background
Architectures of existing immunological databases
Machine-learning for biological databases
Conclusion
![Page 3: Machine-learning in building bioinformatics databases for infectious diseases](https://reader034.vdocuments.net/reader034/viewer/2022042718/56814b68550346895db85707/html5/thumbnails/3.jpg)
Biology produces more data than we can process
- >3000 HLA alleles
- 10^7 to 10^15 different T-cell receptors
- 10^11 linear 9-mer epitopes
- Post-translationally spliced epitopes
Data are stored in databases, literature, laboratory records, clinical records, …
A major issue: turning data into knowledge
The information centric world
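The order of magnitude quoted for linear 9-mer epitopes can be checked with a short calculation. This is only a sketch, and it assumes the figure counts every possible sequence of length 9 over the 20 standard amino acids, which the slide does not state explicitly.

```python
# Assumption: the epitope count is the number of distinct length-9
# sequences that can be built from the 20 standard amino acids.
NUM_AMINO_ACIDS = 20
PEPTIDE_LENGTH = 9

linear_9mers = NUM_AMINO_ACIDS ** PEPTIDE_LENGTH

print(f"{linear_9mers:,}")    # 512,000,000,000
print(f"{linear_9mers:.2e}")  # 5.12e+11
```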
![Page 4: Machine-learning in building bioinformatics databases for infectious diseases](https://reader034.vdocuments.net/reader034/viewer/2022042718/56814b68550346895db85707/html5/thumbnails/4.jpg)
Impractical to do manual curation
- ≥ 16 million PubMed abstracts
- ~80K immunology-related references
Large amounts of data that are difficult to interpret
- Protein-protein interaction extraction from text
Bioinformatics: systematic construction and updating of databases
Use of bioinformatics
![Page 5: Machine-learning in building bioinformatics databases for infectious diseases](https://reader034.vdocuments.net/reader034/viewer/2022042718/56814b68550346895db85707/html5/thumbnails/5.jpg)
Ad hoc bioinformatics
Biological system
Computational analysis
Biological interpretation
![Page 6: Machine-learning in building bioinformatics databases for infectious diseases](https://reader034.vdocuments.net/reader034/viewer/2022042718/56814b68550346895db85707/html5/thumbnails/6.jpg)
More systematic use of bioinformatics
Biological system
Computational analysis
Biological interpretation
Formal description
Mathematical problem
Conversion of results
![Page 7: Machine-learning in building bioinformatics databases for infectious diseases](https://reader034.vdocuments.net/reader034/viewer/2022042718/56814b68550346895db85707/html5/thumbnails/7.jpg)
Knowledge discovery from databases is the process of automated extraction of useful information or knowledge from individual or multiple databases
![Page 8: Machine-learning in building bioinformatics databases for infectious diseases](https://reader034.vdocuments.net/reader034/viewer/2022042718/56814b68550346895db85707/html5/thumbnails/8.jpg)
1) Data explosion
Current databases:
- Volume of data increasing exponentially
- GenBank, SWISS-PROT, IMGT, PubMed, etc.
New databases:
- Growth in numbers
- Increase in size
- More complex
Biologists:
- Maintain a personal data bank
- Information relevant to their research
- Define objectives for data mining and analysis
![Page 9: Machine-learning in building bioinformatics databases for infectious diseases](https://reader034.vdocuments.net/reader034/viewer/2022042718/56814b68550346895db85707/html5/thumbnails/9.jpg)
2) Data quality
Nature of biological data:
- Fuzzy and complex
- Varying interpretations
Problems with raw data:
- Inconsistent
- Inaccurate
- Redundant
- Irrelevant
- Incomplete
- Incorrect
Data cleaning:
- Limit on the percentage error that can be tolerated in the data
- Prevent propagation of errors to our databases
- Prevent depreciation of data quality
![Page 10: Machine-learning in building bioinformatics databases for infectious diseases](https://reader034.vdocuments.net/reader034/viewer/2022042718/56814b68550346895db85707/html5/thumbnails/10.jpg)
3) Database creation and maintenance
Software tools and programming efforts:
- Data collection
- Constructing databases
- Integrating data mining tools
- Updating the databases
Nature of the databases:
- Short lifespan
- Hard to maintain
![Page 11: Machine-learning in building bioinformatics databases for infectious diseases](https://reader034.vdocuments.net/reader034/viewer/2022042718/56814b68550346895db85707/html5/thumbnails/11.jpg)
4) Data integration
Disparities in data sources:
- Data structures
- Data formats
- Views
- Search mechanisms
- Location
![Page 12: Machine-learning in building bioinformatics databases for infectious diseases](https://reader034.vdocuments.net/reader034/viewer/2022042718/56814b68550346895db85707/html5/thumbnails/12.jpg)
![Page 13: Machine-learning in building bioinformatics databases for infectious diseases](https://reader034.vdocuments.net/reader034/viewer/2022042718/56814b68550346895db85707/html5/thumbnails/13.jpg)
Web-resources for immune epitope information
Immune Epitope Database and Analysis Resource (IEDB)
Contains B-cell epitopes, T-cell epitopes and MHC ligands for humans, non-human primates, rodents, and other animal species.
URL: http://www.immuneepitope.org
The international ImMunoGeneTics information system (IMGT)
Specializes in Ig, T-cell receptors, MHC, Ig superfamily, MHC superfamily, and related proteins of the immune system of human and other vertebrate species.
URL: http://imgt.cines.fr/
SYFPEITHI
Contains ~3,500 T-cell epitopes, MHC ligands and peptide motifs for humans and rodents.
URL: http://www.syfpeithi.de/
![Page 14: Machine-learning in building bioinformatics databases for infectious diseases](https://reader034.vdocuments.net/reader034/viewer/2022042718/56814b68550346895db85707/html5/thumbnails/14.jpg)
Web-resources for immune epitope information
MHCBN
Contains T-cell epitopes, TAP ligands, MHC binding peptides and MHC non-binding peptides for humans and rodents.
URL: http://www.imtech.res.in/raghava/mhcbn/
MPID-T
Contains 3D structural information of 187 T-cell receptors, MHCs and interacting epitopes for humans and rodents, spanning 40 alleles.
URL: http://surya.bic.nus.edu.sg/mpidt/
AntiJen/JenPep
Contains T-cell epitopes, MHC ligands, TAP ligands and B-cell epitopes.
URL: http://www.jenner.ac.uk/antijen/
![Page 15: Machine-learning in building bioinformatics databases for infectious diseases](https://reader034.vdocuments.net/reader034/viewer/2022042718/56814b68550346895db85707/html5/thumbnails/15.jpg)
The IEDB class diagram
![Page 16: Machine-learning in building bioinformatics databases for infectious diseases](https://reader034.vdocuments.net/reader034/viewer/2022042718/56814b68550346895db85707/html5/thumbnails/16.jpg)
Relationships between an epitope & contexts
![Page 17: Machine-learning in building bioinformatics databases for infectious diseases](https://reader034.vdocuments.net/reader034/viewer/2022042718/56814b68550346895db85707/html5/thumbnails/17.jpg)
![Page 18: Machine-learning in building bioinformatics databases for infectious diseases](https://reader034.vdocuments.net/reader034/viewer/2022042718/56814b68550346895db85707/html5/thumbnails/18.jpg)
![Page 19: Machine-learning in building bioinformatics databases for infectious diseases](https://reader034.vdocuments.net/reader034/viewer/2022042718/56814b68550346895db85707/html5/thumbnails/19.jpg)
![Page 20: Machine-learning in building bioinformatics databases for infectious diseases](https://reader034.vdocuments.net/reader034/viewer/2022042718/56814b68550346895db85707/html5/thumbnails/20.jpg)
Naïve Bayes classifiers
Attribute values are conditionally independent given the target value
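Goal: to assign a new instance the most probable target value v_target, given its attribute values <a1, a2, …, an>
The target class may be defined as:
$$v_{\text{target}} = \underset{v_j}{\arg\max}\; P(v_j) \prod_{i=1}^{n} P(a_i \mid v_j)$$
where vj ranges over the candidate target values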
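The decision rule above can be sketched in a few lines of Python. This is an illustrative toy, not the classifier used in the cited study: the function names, the tiny training set, and the use of add-one (Laplace) smoothing and log-probabilities are all assumptions made here for the example.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(documents, labels):
    """Collect the counts needed to estimate P(v_j) and P(a_i | v_j)."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in zip(documents, labels):
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_counts, word_counts, vocabulary


def classify(tokens, model):
    """Return the target value v_j maximising P(v_j) * prod_i P(a_i | v_j)."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        # Work in log space so that long documents do not underflow.
        score = math.log(n_docs / total_docs)
        total_words = sum(word_counts[label].values())
        for token in tokens:
            if token not in vocabulary:
                continue  # words never seen in training are ignored
            # Add-one smoothing (an assumption, not stated in the slides).
            p = (word_counts[label][token] + 1) / (total_words + len(vocabulary))
            score += math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy corpus of tokenised abstracts.
docs = [
    ["peptide", "binds", "mhc", "allele"],
    ["t-cell", "epitope", "mhc", "binding"],
    ["crystal", "structure", "of", "enzyme"],
    ["enzyme", "kinetics", "assay"],
]
labels = ["relevant", "relevant", "irrelevant", "irrelevant"]

model = train_naive_bayes(docs, labels)
print(classify(["mhc", "epitope"], model))   # relevant
print(classify(["enzyme", "assay"], model))  # irrelevant
```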
![Page 21: Machine-learning in building bioinformatics databases for infectious diseases](https://reader034.vdocuments.net/reader034/viewer/2022042718/56814b68550346895db85707/html5/thumbnails/21.jpg)
Comparison of popular text classification algorithms
Dataset
- 20,910 PubMed abstracts
- 181,299 unique words
AROC
- NBC: 0.838
- ANN: 0.831
- SVM: 0.825
- DT: 0.809
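Assuming AROC here denotes the area under the receiver operating characteristic curve, it can be computed directly from classifier scores as the fraction of positive/negative pairs that are ranked correctly. The sketch below uses made-up scores and a hypothetical function name; it is not the evaluation code from the cited paper.

```python
def aroc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Counts how often a positive example (label 1) is scored above a
    negative example (label 0); ties count as half a win.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up scores for four abstracts; 1 = relevant, 0 = irrelevant.
example_scores = [0.9, 0.8, 0.4, 0.3]
example_labels = [1, 0, 1, 0]
print(aroc(example_scores, example_labels))  # 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is the scale on which the four classifiers above are compared.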
Wang et al., BMC Bioinformatics 2007, 8:269
![Page 22: Machine-learning in building bioinformatics databases for infectious diseases](https://reader034.vdocuments.net/reader034/viewer/2022042718/56814b68550346895db85707/html5/thumbnails/22.jpg)
Feature selection (FS)
Data source
- PubMed abstracts
- Medical Subject Headings (MeSH): the National Library of Medicine's controlled vocabulary used for indexing articles and for cataloging books and other holdings
- Publication title
- Author(s)
- etc.
![Page 23: Machine-learning in building bioinformatics databases for infectious diseases](https://reader034.vdocuments.net/reader034/viewer/2022042718/56814b68550346895db85707/html5/thumbnails/23.jpg)
Feature selection (FS)
Algorithms
- Document frequency (DF): ranks features based on the number of abstracts they appear in
- Information gain (IG): measures the number of bits of information obtained for category prediction based on the presence or absence of a feature in a document
$$IG(u) = -\sum_{i=1}^{m} P(c_i)\log P(c_i) + P(u)\sum_{i=1}^{m} P(c_i \mid u)\log P(c_i \mid u) + P(\bar{u})\sum_{i=1}^{m} P(c_i \mid \bar{u})\log P(c_i \mid \bar{u})$$
where u is the feature of interest, ū denotes its absence, and ci (i = 1, …, m) denotes the set of categories the documents belong to
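Both ranking criteria can be sketched on a toy corpus. The code below is a minimal illustration under stated assumptions: documents are represented as sets of words, probabilities are estimated as simple relative frequencies, logarithms are taken to base 2 so that the result is in bits, and the function names and example data are hypothetical.

```python
import math


def document_frequency(documents, feature):
    """Number of documents in which the feature occurs."""
    return sum(1 for doc in documents if feature in doc)


def _plogp(p):
    """p * log2(p), with the usual convention that 0 * log(0) = 0."""
    return p * math.log2(p) if p > 0 else 0.0


def information_gain(documents, labels, feature):
    """IG(u): class entropy minus the entropy remaining once the
    presence or absence of the feature is known."""
    n = len(documents)
    categories = set(labels)
    with_u = [lab for doc, lab in zip(documents, labels) if feature in doc]
    without_u = [lab for doc, lab in zip(documents, labels) if feature not in doc]

    # First term: -sum_i P(c_i) log P(c_i)
    ig = -sum(_plogp(labels.count(c) / n) for c in categories)
    # Second term: P(u) sum_i P(c_i | u) log P(c_i | u)
    if with_u:
        ig += (len(with_u) / n) * sum(
            _plogp(with_u.count(c) / len(with_u)) for c in categories
        )
    # Third term: the same sum for documents lacking the feature
    if without_u:
        ig += (len(without_u) / n) * sum(
            _plogp(without_u.count(c) / len(without_u)) for c in categories
        )
    return ig


# Hypothetical toy corpus: each abstract is a set of words.
toy_docs = [
    {"peptide", "mhc", "the"},
    {"epitope", "mhc"},
    {"enzyme", "the"},
    {"kinetics", "assay"},
]
toy_labels = ["relevant", "relevant", "irrelevant", "irrelevant"]

print(document_frequency(toy_docs, "mhc"))            # 2
print(information_gain(toy_docs, toy_labels, "mhc"))  # 1.0
print(information_gain(toy_docs, toy_labels, "the"))  # 0.0
```

In this toy corpus "mhc" separates the two categories perfectly and so yields one full bit, while "the" occurs equally in both and yields none, even though the two words have the same document frequency.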
![Page 24: Machine-learning in building bioinformatics databases for infectious diseases](https://reader034.vdocuments.net/reader034/viewer/2022042718/56814b68550346895db85707/html5/thumbnails/24.jpg)
Feature condensation (FC)
Stemming
- To reduce words to their common root, e.g. “binding, binds, bind” to “bind”
- Porter stemmer: AROC = 0.846 to AROC = 0.842
- Domain-specific vocabulary may be reduced to unsuitable terms
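The idea of stemming, and the way it can damage specialised vocabulary, can be seen with a deliberately crude suffix stripper. This is not the Porter algorithm, which applies a much larger and more careful rule set; `toy_stem` and its two suffixes are assumptions chosen only to keep the example short.

```python
def toy_stem(word, suffixes=("ing", "s")):
    """Strip the first matching suffix, keeping a stem of at least 3 letters.

    A toy illustration only; the Porter stemmer is far more elaborate.
    """
    lowered = word.lower()
    for suffix in suffixes:
        if lowered.endswith(suffix) and len(lowered) - len(suffix) >= 3:
            return lowered[: -len(suffix)]
    return lowered


# The three variants from the slide collapse to a single feature.
print([toy_stem(w) for w in ["binding", "binds", "bind"]])  # ['bind', 'bind', 'bind']

# The same rules mangle a term that should have been left alone.
print(toy_stem("virus"))  # viru
```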
![Page 25: Machine-learning in building bioinformatics databases for infectious diseases](https://reader034.vdocuments.net/reader034/viewer/2022042718/56814b68550346895db85707/html5/thumbnails/25.jpg)
Feature extraction (FE)
Rules to capture immune-related expressions and group them together
- Reduction of feature space (i.e. no. of unique words)
- Enrichment of information content
- Better performance?
![Page 26: Machine-learning in building bioinformatics databases for infectious diseases](https://reader034.vdocuments.net/reader034/viewer/2022042718/56814b68550346895db85707/html5/thumbnails/26.jpg)
Feature extraction (FE)
Examples:
- Sequence length: identify sequence length and replace with “~range<50~” or “~range>50~”, depending on whether the sequence to be mapped spans fewer or more than 50 amino acids
- MHC alleles: identify MHC alleles and replace with “~mhc_allele~”
- Protein sequences: identify sequences as a) exclusively containing characters representing the 20 aa, b) in upper case, with length > threshold, and replace with “~sequence~”
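Rules of this kind are naturally expressed as regular expressions. The sketch below is a guess at how such rules might look, not the rule set from the cited work: the patterns, the minimum sequence length of 6, the restriction to HLA-style allele names, and the handling of a length of exactly 50 are all assumptions made for illustration.

```python
import re

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes of the 20 amino acids
MIN_SEQUENCE_LENGTH = 6               # hypothetical threshold

# Upper-case runs made up exclusively of amino-acid letters.
SEQUENCE_RE = re.compile(rf"\b[{AMINO_ACIDS}]{{{MIN_SEQUENCE_LENGTH},}}\b")
# HLA-style allele names such as HLA-A*0201 or HLA-DRB1*04:01 (assumed format).
MHC_ALLELE_RE = re.compile(r"\bHLA-[A-Z]{1,3}[0-9]?\*[0-9]{2,4}(?::[0-9]{2,3})*\b")
# Length mentions such as "9-mer", "15 aa" or "252 residue".
LENGTH_RE = re.compile(
    r"\b(\d+)[\s-]?(?:mers?|aa|amino acids?|residues?)\b", re.IGNORECASE
)


def _length_token(match):
    # Assumption: a length of exactly 50 falls into the ">50" bucket.
    return "~range<50~" if int(match.group(1)) < 50 else "~range>50~"


def extract_features(text):
    """Replace immune-related expressions with generic placeholder tokens."""
    text = LENGTH_RE.sub(_length_token, text)
    text = MHC_ALLELE_RE.sub("~mhc_allele~", text)
    text = SEQUENCE_RE.sub("~sequence~", text)
    return text


sentence = "The 9-mer GILGFVFTL binds HLA-A*0201 from a 252 residue protein"
print(extract_features(sentence))
# The ~range<50~ ~sequence~ binds ~mhc_allele~ from a ~range>50~ protein
```

Many different peptides, alleles and lengths collapse onto three tokens, which is what shrinks the feature space while keeping the information that such an expression was present.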
![Page 27: Machine-learning in building bioinformatics databases for infectious diseases](https://reader034.vdocuments.net/reader034/viewer/2022042718/56814b68550346895db85707/html5/thumbnails/27.jpg)
Performance comparison
Wang et al., BMC Bioinformatics 2007, 8:269
![Page 28: Machine-learning in building bioinformatics databases for infectious diseases](https://reader034.vdocuments.net/reader034/viewer/2022042718/56814b68550346895db85707/html5/thumbnails/28.jpg)
![Page 29: Machine-learning in building bioinformatics databases for infectious diseases](https://reader034.vdocuments.net/reader034/viewer/2022042718/56814b68550346895db85707/html5/thumbnails/29.jpg)
Conclusion
Machine-learning algorithms enable a systematic approach to database construction and facilitate scientific discovery
Their application must be performed with due care and must be scientifically and technically sound