Prof. Carolina Ruiz
Computer Science Department
Bioinformatics and Computational Biology Program
WPI
WELCOME TO
BCB4003/CS4803
BCB503/CS583
BIOLOGICAL AND BIOMEDICAL DATABASE MINING
WHY THIS COURSE?
Biological and Biomedical Research Problems
Central dogma: DNA → (transcription) → RNA → (translation) → Protein
• Genome (1980’s-1990’s): sequencing, sequence analysis, …
• Transcriptome (mid 1990’s-2000’s): gene expression, DNA/RNA microarrays
• Proteome (1990’s-2000’s): protein structure, protein-protein interactions, protein pathways
• Biological Function (2000’s)
• Applications (2000’s): organism-organism interactions, organism-environment interactions, genome-wide association studies, cancer therapies, drug development
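The central dogma on this slide can be illustrated with a minimal Python sketch. It assumes the input is the coding (sense) DNA strand and uses a deliberately partial codon table; the function names `transcribe` and `translate` and the example sequence are illustrative assumptions, not part of the course material.

```python
# Partial codon table for illustration only; the standard genetic code has 64 codons.
CODON_TABLE = {
    "AUG": "M",  # methionine, also the usual start codon
    "UUU": "F",  # phenylalanine
    "GGC": "G",  # glycine
    "GCU": "A",  # alanine
    "UAA": "*", "UAG": "*", "UGA": "*",  # stop codons
}

def transcribe(coding_dna: str) -> str:
    """Transcription: return the mRNA for a coding-strand DNA sequence (T becomes U)."""
    return coding_dna.upper().replace("T", "U")

def translate(mrna: str) -> str:
    """Translation: read the mRNA codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        # "?" marks a codon that is missing from the partial table above
        amino_acid = CODON_TABLE.get(mrna[i:i + 3], "?")
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

mrna = transcribe("ATGTTTGGCTAA")   # made-up example sequence
protein = translate(mrna)
print(mrna, protein)
```

A real analysis would use a complete codon table (for example from a library such as Biopython) and would handle reading frames and the template strand explicitly.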
THIS ALL HAS GENERATED …
• Data
  • Massive datasets and databases of sequence, gene, gene expression, protein, biological function, clinical information, …
• Text
  • Annotations in data sources, abstracts (e.g., Medline), research articles, medical literature (e.g., PubMed, NCBI Bookshelf, Google Scholar), patient records, …
• Ontologies
  • Descriptions of terms and their relationships (e.g., Gene Ontology)
CURRENT CHALLENGES
• To make sense of and put to use all this information.
• How? Computational tools and techniques are needed to help humans integrate, summarize, understand, and take advantage of the accumulated information
  • Data mining
  • Text mining
  • Data and text mining together
WHAT IS DATA [TEXT] MINING? OR MORE GENERALLY, KNOWLEDGE DISCOVERY IN DATABASES (KDD)
“Non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data [text]” (Fayyad et al., 1996)
• Raw Data [Text] → Data [Text] Mining → Patterns
  • Analytical Patterns (rules, decision trees)
  • Statistical Patterns (data distribution)
  • Visual Patterns
Fayyad, U., Piatetsky-Shapiro, G., and Smyth, P. "From Data Mining to Knowledge Discovery in Databases". AI Magazine, pp. 37-54. Fall 1996.
DATA MINING METHODS IN BIOINFORMATICS
• Clustering
• Sequence Mining
• Bayesian Methods
• Expectation Maximization (EM)
• Gibbs Sampling
• Hidden Markov Models
• Kernel methods
• Support Vector Machines
TEXT MINING IN BIOINFORMATICS
• Document indexing
• Information retrieval
• Lexical analysis (Sentence tokenization, Word tokenization, Stemming, Stop word removal)
• Semantic analysis
• Query processing
• Text classification
• Text clustering
• Text summarization
• (Semi-) Automatic curation of literature repositories
• Knowledge discovery from text, hypothesis generation
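The lexical analysis steps listed above (sentence tokenization, word tokenization, stop word removal, stemming) can be sketched with the Python standard library alone. This is a rough illustration under stated assumptions: the stop word list is a tiny made-up sample, the stemmer is crude suffix stripping rather than a real algorithm such as Porter's, and all function names are hypothetical.

```python
import re

STOP_WORDS = {"the", "a", "an", "of", "in", "is", "are", "and", "to", "by"}  # tiny illustrative list
SUFFIXES = ("ing", "ed", "es", "s")  # crude suffix stripping, not the Porter stemmer

def sentence_tokenize(text):
    """Split after ., ! or ? followed by whitespace (naive; breaks on abbreviations)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def word_tokenize(sentence):
    """Lowercase, then keep runs of letters and digits, allowing internal hyphens."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", sentence.lower())

def stem(word):
    """Strip the first matching suffix if at least three characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def lexical_analysis(text):
    """Return one list of stemmed, non-stop-word tokens per sentence."""
    return [
        [stem(w) for w in word_tokenize(s) if w not in STOP_WORDS]
        for s in sentence_tokenize(text)
    ]

# Made-up example text
tokens = lexical_analysis("BRCA1 is expressed in breast tissue. Mutations are linked to cancers.")
print(tokens)
```

Biomedical text usually needs more careful tokenization than this (gene names, chemical formulas, abbreviations), so a dedicated NLP toolkit would normally replace each of these functions.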
DATA/TEXT MINING PROCESS (KDD)
• information sources
• data management (databases, data warehouses) → data
• data “pre”-processing (noisy/missing data, feature selection) → cleaned data
• data analysis / data mining (analytical, statistical, visual) → models
• model/pattern evaluation (quantitative, qualitative) → “good” model
• model/patterns deployment on new data (prediction, decision support)
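The stages of the KDD process above can be walked through end to end in a toy Python sketch. Everything here is an assumption made for illustration: the (expression level, label) data is invented, the "model" is a single threshold halfway between the two class means, and the 0.75 acceptance cutoff is arbitrary.

```python
from statistics import mean

def preprocess(rows):
    """Data pre-processing: replace missing values (None) with the mean of the observed ones."""
    fill = mean(x for x, _ in rows if x is not None)
    return [(fill if x is None else x, label) for x, label in rows]

def mine(rows):
    """Data mining: learn a threshold halfway between the two class means."""
    pos = mean(x for x, label in rows if label == 1)
    neg = mean(x for x, label in rows if label == 0)
    return (pos + neg) / 2

def predict(threshold, x):
    """Deployment: label a new observation with the learned threshold."""
    return 1 if x > threshold else 0

def evaluate(threshold, rows):
    """Quantitative evaluation: fraction of rows labeled correctly."""
    return sum(predict(threshold, x) == label for x, label in rows) / len(rows)

# Made-up (expression level, label) pairs; label 1 is a hypothetical "disease" class
train = [(1.0, 0), (2.0, 0), (None, 0), (8.0, 1), (9.0, 1)]
held_out = [(1.5, 0), (8.5, 1), (3.0, 0), (7.0, 1)]
new_data = [0.5, 9.5]

cleaned = preprocess(train)            # cleaned data
threshold = mine(cleaned)              # model
accuracy = evaluate(threshold, held_out)
# Deploy only if the model counts as "good" under the arbitrary cutoff
new_predictions = [predict(threshold, x) for x in new_data] if accuracy >= 0.75 else []
print(threshold, accuracy, new_predictions)
```

In practice each stage is far richer (feature selection, qualitative review of patterns by domain experts, iteration back to earlier stages), but the data flow between stages follows the same shape.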
PUTTING IT ALL TOGETHER …
• Data / Text / Information Integration
  • Mining over data and text combined
• Visualization
• Other real-world issues
  • Developing tools and techniques that are efficient, scalable, and user friendly
INTERDISCIPLINARY: TECHNIQUES COME FROM MULTIPLE FIELDS
• Biology and Biomedicine
  • Contributes domain knowledge
• Machine Learning (AI)
  • Contributes (semi-)automatic induction of empirical laws from observations & experimentation
• Statistics
  • Contributes language, framework, and techniques
• Pattern Recognition
  • Contributes pattern extraction and pattern matching techniques
• Natural Language Processing (AI) / Computational Linguistics
  • Contributes text analysis techniques
• Databases
  • Contributes efficient data storage, data cleansing, and data access techniques
• Data Visualization
  • Contributes visual data displays and data exploration
• High Performance Computing
  • Contributes techniques for efficiently handling complexity
• Signal Processing
• Image Processing …
QUESTIONS?
* Images in this presentation were downloaded from Google images