TRANSCRIPT
Copyright © 2004 Oracle Corporation
Oracle Life Sciences eSeminar
Oracle Data Mining for Life Sciences Problems
http://conference.oracle.com
Meeting Place:
US Toll Free: 1-888-967-2253
US Only: 1-650-607-2253
Asia/Pacific: +61 2 8817 6100
Europe/M. East/Africa: +44 118 924 9000
Meeting ID #: 407709
Meeting Password: 407709
Charlie Berger, Sr. Director of Product Management, Life Sciences and Data Mining
Pablo Tamayo, Consulting Member of Technical Staff, Data Mining Technologies
Pat Hoffman, Senior Principal Consultant, Oracle Consulting
Oracle Life Science Platform
1. Access distributed data
   Gateways, External Tables, SQL Loader, Streams, Oracle Gateway to Lion SRS, etc.
2. Integrate a variety of data types
   XML DB, Intermedia, Text, etc.
3. Manage vast quantities of data
   RAC, Partitioning, Grid, etc.
4. Collaborate securely
   Collaboration Suite, iFS (Oracle FilesOnline), Portal, Security, etc.
5. Find patterns and insights
   Data Mining, BLAST, Statistics, Text, etc.
Genomics
Proteomics
Pathways
Cheminformatics
Clinical
Example Data Mining Applications
Life Sciences examples
• Leukemia AML/ALL (Golub et al.)
• NCI-60 ChemoSensitivity data
Database Marketing
• Target doctors likely to prescribe new drug(s)
• Target “best” patients
Discovery/Development
• Discover target genes and proteins
• Identify promising leads for new drugs
• Medline literature mining
• Pharmacovigilance
Health Care
• Predicting medical outcomes
  • Diabetes
  • Pneumonia
  • Response to treatment
• Fraud detection
Oracle Data Mining Algorithms & Example Applications
Attribute Importance
• Identify most influential attributes for a target attribute
  • Factors associated with a disease
  • Promising leads
Classification and Prediction
• Predict the most likely candidates:
  • Doctors who prescribe a new drug
  • Patients who respond to a treatment
Regression
• Predict a numeric value
  • Predict the size by which a tumor will be reduced
[Figure: attributes A1 A2 A3 A4 A5 A6 A7]
Oracle Data Mining Algorithms & Example Applications
Clustering
• Find naturally occurring groups
  • Gene clusters
  • Find disease subgroups
  • Distinguish normal from non-normal behavior
Association Rules
• Find co-occurring items
  • Suggest interactions
Feature Extraction
• Reduce a large dataset into representative new attributes
• Useful for clustering and text mining
[Figure: extracted features F1 F2 F3 F4]
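To make the Association Rules idea of "co-occurring items" concrete, here is a minimal Python sketch that computes the support of item pairs and the confidence of a rule. It is an illustration only, not the ODM algorithm; the transactions, item names, and function names are all hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions: each set holds items observed together.
transactions = [
    {"drug_A", "drug_B", "rash"},
    {"drug_A", "drug_B"},
    {"drug_A", "rash"},
    {"drug_B"},
]

def pair_support(transactions):
    """Fraction of transactions in which each pair of items co-occurs."""
    counts = Counter()
    for items in transactions:
        for pair in combinations(sorted(items), 2):
            counts[pair] += 1
    n = len(transactions)
    return {pair: c / n for pair, c in counts.items()}

def confidence(transactions, antecedent, consequent):
    """Share of transactions containing the antecedent that also contain the consequent."""
    with_antecedent = [t for t in transactions if antecedent in t]
    return sum(consequent in t for t in with_antecedent) / len(with_antecedent)

support = pair_support(transactions)
conf_ab = confidence(transactions, "drug_A", "drug_B")
```

A pair with high support and a rule with high confidence is the kind of pattern that might suggest an interaction worth investigating.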
Oracle Data Mining Algorithms & Example Applications
Text Mining
• Combine data and text for better models
  • Add unstructured text (e.g., physician’s notes) to structured data (e.g., age, weight, height) to predict outcomes
• Classify and cluster documents
• Combined with Oracle Text to develop advanced text mining applications, e.g., Medline
BLAST
• Sequence matching and alignment
  • Find genes and proteins that are “similar”
ATGCAATGCCAGGATTTCCA
CTGCAAGGCCAGGAAGTTCCAATGCGTTGCCAC…ATTTCCAGGC..TGCAATGCCAGGATGACCAATGCAATGTTAGGACCTCCA
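As a rough illustration of what "similar" can mean for sequences, the sketch below finds short exact words (k-mers) shared by two DNA strings, which is the kind of seed that BLAST-style searches start from before extending and scoring an alignment. It is not BLAST and not Oracle's implementation; the function name, the word length of 8, and the second sequence are assumptions chosen for illustration.

```python
def shared_kmers(query, subject, k=8):
    """Return (query_pos, subject_pos, word) for every exact k-mer match."""
    # Index every k-mer of the subject by its start positions.
    index = {}
    for j in range(len(subject) - k + 1):
        index.setdefault(subject[j:j + k], []).append(j)
    # Look up every k-mer of the query in that index.
    hits = []
    for i in range(len(query) - k + 1):
        word = query[i:i + k]
        for j in index.get(word, []):
            hits.append((i, j, word))
    return hits

query = "ATGCAATGCCAGGATTTCCA"    # first sequence shown on the slide
subject = "TTTGCAATGCCAGGATGACC"  # hypothetical database sequence
hits = shared_kmers(query, subject)
```

Hits that fall on the same diagonal (constant difference between the two positions) belong to one contiguous matching region, which a real alignment tool would then extend and score.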
5. Discover Patterns and Insights
Deductive Analysis: answer complex questions about the relationships in genomic, clinical and pharmacological data
Inductive Analysis: find relationships for classification, class discovery and prediction
Life Sciences data
• Pharmacological databases
• Proteomics database
• Clinical databases
• Functional genomic databases
METAspectrum 60.1 (metagroup.com). Copyright © 2004 META Group, Inc. All rights reserved.
Demo scenarios
• Gene expression analysis
• Chemosensitivity analysis
• Clinical data analysis
• Clinical data analysis with text mining
• Medline text mining
Oracle Data Miner
• A data miner uses Oracle Data Miner to build, evaluate, and apply ODM models
• Mining Activity Guide
  • Wizard-based approach
• Generate Java and SQL code to “operationalize” applications
• Integrate “insights” into other applications
Oracle 10g SVM Classification of Multiple Tumor Types
Multiple examples of tumor tissue (public data from Whitehead/MIT)
DNA Microarray Data
Oracle Data Mining
Actual\Predicted: BR PR LU CO LY BL ML UT LE RE PA OV MS BR
(nonzero counts per actual class, left to right)
BREAST-BR        1 1
PROSTATE-PR      1 1
LUNG-LU          1 2
COLON-CO         3
LYMPHOMA-LY      6
BLADDER-BL       1 2
MELANOMA-ML      1 1
UTERUS-UT        2
LEUKEMIA-LE      1 5
RENAL-RE         3
PANCREAS-PA      1 2
OVARY-OV         1 2
MESOTHELIOMA-MS  3
BRAIN-BR         4
78.25% accuracy
Green=Correct Red=Errors
We load the data for multiple cancer types into the Oracle database: 16,063 genes, 144 cancer patients, and 10 samples per class.
We then mine the data using Support Vector Machines and create the confusion matrix.
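A confusion matrix and its accuracy can be computed directly from the actual and predicted labels of a test set. The sketch below does this in plain Python; the label lists are made-up toy values, not the multi-tumor results, and the helper names are assumptions.

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count how often each (actual, predicted) label pair occurs."""
    return Counter(zip(actual, predicted))

def accuracy(matrix):
    """Fraction of cases on the diagonal, where actual == predicted."""
    total = sum(matrix.values())
    correct = sum(n for (a, p), n in matrix.items() if a == p)
    return correct / total

# Toy labels for illustration only (hypothetical, not the slide's data).
actual = ["LY", "LY", "LE", "LE", "CO", "BR"]
predicted = ["LY", "LE", "LE", "LE", "CO", "LU"]

cm = confusion_matrix(actual, predicted)
acc = accuracy(cm)
```

Off-diagonal entries such as `cm[("LY", "LE")]` show which classes the model confuses with one another.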
Classification of Multiple Tumor Types
• Multiple examples of 14 tumor types
• Training set: 144 samples. Test set: 46 samples
• Microarray gene expression profiles for 7,129 genes (features)
• Problem: how well can a model distinguish between multiple tumor types?
• Dataset composition:

Tumor Class       # Train  # Test    Tumor Class         # Train  # Test
Breast (BR)             8       3    Uterus (UT)               8       2
Prostate (PR)           8       2    Leukemia (LE)            24       6
Lung (LU)               8       3    Renal (RE)                8       3
Colorectal (CO)         8       5    Pancreas (PA)             8       3
Lymphoma (LY)          16       6    Ovary (OV)                8       3
Bladder (BL)            8       3    Mesothelioma (MS)         8       3
Melanoma (ML)           8       2    Brain (BR)               16       4
Supervised Classification SVM Methodology
Input: Multi-Tumor Dataset

Step                             Oracle Task        Labels used
1. Read into RDBMS as Table      SQLLDR
2. Data Preparation (Scaling)    SQL query
3. Build SVM Model (Training)    ODM Model Build    Tumor Labels (Train)
4. Evaluate Model on Test Set    ODM Model Apply    Tumor Labels (Test)

Output: Prediction Results
Gene Expression Data Table and Rescaling
• The datasets were downloaded from the web site and stored in flat files before being loaded into the Oracle database.
• The data was loaded using SQLLDR to create a fact table of the following format:
column   type
sid      NUMBER
gene     VARCHAR2(30)
expr     NUMBER
• Rescaling: the values were divided by a constant (10000) to make them small numbers near 1 (to keep the dot products between all samples in the dataset inside the [-1, 1] range).
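A minimal sketch of the rescaling step, assuming the fact table rows arrive as (sid, gene, expr) tuples. It pivots the rows into one vector per sample, divides every value by the constant 10000, and computes the dot product between two samples. The sample rows and gene names are invented for illustration; in the workflow described here the same arithmetic would be done by a SQL query inside the database.

```python
SCALE = 10000.0  # rescaling constant quoted on the slide

# Hypothetical rows in the (sid, gene, expr) fact-table layout.
rows = [
    (1, "GENE_A", 2500.0), (1, "GENE_B", -500.0), (1, "GENE_C", 4000.0),
    (2, "GENE_A", 1000.0), (2, "GENE_B", 3000.0), (2, "GENE_C", -2000.0),
]

def pivot_and_rescale(rows, scale=SCALE):
    """Group rows by sample id and divide each expression value by scale."""
    samples = {}
    for sid, gene, expr in rows:
        samples.setdefault(sid, {})[gene] = expr / scale
    return samples

def dot(u, v):
    """Dot product over the genes that both samples have in common."""
    return sum(u[g] * v[g] for g in u.keys() & v.keys())

samples = pivot_and_rescale(rows)
d12 = dot(samples[1], samples[2])
```

With these toy values the rescaled dot product stays well inside the [-1, 1] range that the rescaling is meant to preserve.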
Example of ODM PL/SQL Build and Apply Commands
BEGIN
  -- Build (train) the SVM classification model on the training table.
  DBMS_DATA_MINING.build(
    model_name           => 'SVM_model',
    function             => DBMS_DATA_MINING.classification,
    data_table_name      => 'multitumor_train',
    settings_table_name  => 'svm_settings',
    case_id_column_name  => 'id',
    target_column_name   => 'class');

  -- Apply the model to the test table and store the predictions.
  DBMS_DATA_MINING.apply(
    model_name           => 'SVM_model',
    data_table_name      => 'multitumor_test',
    case_id_column_name  => 'id',
    result_table_name    => 'multitumor_apply_result');
END;
/
Algorithm Settings for Support Vector Machines
Setting                 Description
svms_kernel_function    Kernel: svms_linear (for Linear Kernel) or svms_gaussian (for Gaussian Kernel)
svms_target_type        Target type for SVM, either svms_multi_target or svms_single_target
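For readers unfamiliar with the two kernel choices, the sketch below writes out a linear kernel and a Gaussian kernel in plain Python. The Gaussian form used here, exp(-||x - y||^2 / (2 * sigma^2)), is one common textbook parametrization; how ODM parametrizes its Gaussian kernel is not stated in these slides, so `sigma`, the vectors, and the function names are assumptions.

```python
import math

def linear_kernel(x, y):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))

def gaussian_kernel(x, y, sigma=1.0):
    """exp(-||x - y||^2 / (2 * sigma^2)); equals 1.0 when x == y."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

# Hypothetical rescaled expression vectors for two samples.
x = [0.25, -0.05, 0.40]
y = [0.10, 0.30, -0.20]

k_lin = linear_kernel(x, y)
k_gauss = gaussian_kernel(x, y)
```

The linear kernel compares samples by their dot product, while the Gaussian kernel turns the distance between samples into a similarity between 0 and 1.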
SVM Results
• Entire methodology implemented in Oracle RDBMS and ODM.
• The SVM model works with all 7,129 input features (genes) and does not require feature selection.
• The SVM model is relatively fast: 9 minutes training time on a 500 MHz Netra.
• The SVM is very accurate for multi-tumor molecular classification: 78.25% accuracy, comparable to the results published in the Ramaswamy et al. PNAS 2001 paper, which also reported k-NN = 63% and Weighted Voting = 46% accuracy.
Oracle 10g SVM Classification of Multiple Tumor Types
Oracle Data Mining’s SVM models classify the multi-class tumor problem with 78.25% accuracy.
Benefits of Oracle’s Approach
Oracle Data Mining Feature and Benefit
Platform for Data Mining Applications
• Eliminates data movement and security exposure
• Fastest: Data → Information
Wide range of data mining algorithms
• Supports most data mining problems
Runs on multiple platforms
• Applications may be developed and deployed
Built on Oracle Technology
• Grid, RAC, integrated BI, …
• SQL & PL/SQL available
• Leverage existing skills
InforSense Oracle Edition
[Figure: workflow covering Data Acquisition, Data Analysis, Multi-Search Discovery; legend: Oracle Component, InforSense Component, Web Services]
(Unified environment + heterogeneous components) enable complex process
InforSense/Oracle Integrated Advanced Analytics
InforSense analytics
Domain specific tools
Workflows
Warehousing
Deployment
Oracle Analytics
• Oracle Data Mining
• Oracle Text Mining
• Oracle Life Science
External Data
• Files
• XML
• SRS
Third-party analytics
Non-clinical Data (NIH)
Healthcare Data (HTB)
Value of InforSense Integration
Candidate Drug: LY317615
1 Candidate Drug
1 Deployable Model
Reusable Process
2 Weeks
8 Methodologies
100 Components (Nodes)
1 Analyst
1 Workflow
• Heterogeneous components
• Complex process encapsulation
• Smooth integration of web services
• Rapid build and build for reuse
• Adaptive to different users
• Leverage Oracle components
InforSense – the Power of Integration