Introduction to Bioinformatics: Mining Your Data, by Gerry Lushington, Lushington in Silico (modeling / informatics consultant)

Upload: gerald-lushington

Post on 01-Jul-2015


DESCRIPTION

This presentation, prepared by Gerry Lushington, is a friendly introduction to the basics of data mining, as applied to biological problems. The intended audience is students and scientific researchers from a non-computational background.

TRANSCRIPT

Page 1: Introduction to Data Mining / Bioinformatics

Introduction to Bioinformatics: Mining Your Data

Gerry Lushington
Lushington in Silico

modeling / informatics consultant

2/28/2012 Intro to Data Mining / GHL 1

Page 2: Introduction to Data Mining / Bioinformatics


What is Data Mining?

Use of computational methods to perceive trends in data that can be used to explain or predict important outcomes or properties

Applicable across many disciplines:

• Molecular bioinformatics
• Medical informatics
• Health informatics
• Biodiversity informatics

Page 3: Introduction to Data Mining / Bioinformatics

Example Applications:

Find relationships between:

Convenient Observables

a) Relative gene expression data
b) Relative protein abundance data
c) Relative lipid & metabolite profiles
d) Glycosylation variants
e) SNPs, alleles
f) Cellular traits
g) Organism traits
h) Behavioral traits
i) Case history

vs.

Important Outcomes

1. Disease susceptibility
2. Drug efficacy
3. Toxin susceptibility
4. Immunity
5. Genetic disorders
6. Microbial virulence
7. Species adaptive success
8. Species complementarity

Page 4: Introduction to Data Mining / Bioinformatics


Goals for this lecture:

Focus on Data Mining: how to approach your data and use it to understand biology

Overview of available techniques

Understanding model validation

Try to think about data you’ve seen: what techniques might be useful?

Don’t worry about grasping everything: K-INBRE Bioinformatics Core is here to help!

Page 5: Introduction to Data Mining / Bioinformatics

Basic Data Mining:

Find relationships between:

a) Easy-to-measure properties vs.
b) Important (but harder to measure) outcomes or attributes

Use relationships to understand the conceptual basis for outcomes in b)

Use relationships to predict outcomes in new cases where outcome has not yet been measured

Page 6: Introduction to Data Mining / Bioinformatics

Basic Data Mining: simple measurables

Page 7: Introduction to Data Mining / Bioinformatics


Basic Data Mining: general observation

Unhappy Happy

Page 8: Introduction to Data Mining / Bioinformatics


Basic Data Mining: relationship (#1)

Unhappy Happy

Rule: blue = happy; red = unhappy. Accuracy = 12/20 = 60%

Page 9: Introduction to Data Mining / Bioinformatics


Basic Data Mining: relationship (#2)

Unhappy Happy

Rule: blue or big red = happy; little red = unhappy. Accuracy = 17/20 = 85%
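The two relationships can be compared in a few lines of code. The slide's picture did not survive in the transcript, so the 20 records below are invented purely so that the counts match the quoted 12/20 and 17/20; the names `SAMPLES`, `rule_1`, `rule_2` and `accuracy` are likewise hypothetical.

```python
# Hypothetical toy records (color, size, mood), invented for illustration only.
SAMPLES = (
    [("blue", "big", "happy")] * 7
    + [("blue", "small", "unhappy")] * 1
    + [("red", "big", "happy")] * 6
    + [("red", "big", "unhappy")] * 1
    + [("red", "small", "unhappy")] * 4
    + [("red", "small", "happy")] * 1
)


def rule_1(color, size):
    """Relationship #1: blue = happy, red = unhappy (size is ignored)."""
    return "happy" if color == "blue" else "unhappy"


def rule_2(color, size):
    """Relationship #2: blue or big red = happy, little red = unhappy."""
    if color == "blue" or size == "big":
        return "happy"
    return "unhappy"


def accuracy(rule, samples):
    """Fraction of samples whose observed mood matches the rule's prediction."""
    hits = sum(1 for color, size, mood in samples if rule(color, size) == mood)
    return hits / len(samples)


print(accuracy(rule_1, SAMPLES), accuracy(rule_2, SAMPLES))
```

Adding a second observable (size) to the rule is what lifts the accuracy in this toy set; the same idea carries over to real measurements.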

Page 10: Introduction to Data Mining / Bioinformatics

Data Mining: procedure

1. Data Acquisition
2. Data Preprocessing
3. Feature Selection
4. Classification
5. Validation
6. Prediction & Iteration

Page 11: Introduction to Data Mining / Bioinformatics

Data Mining: procedure (step 1: Data Acquisition)

Key issues include:

a) format conversion from instrument
b) any necessary mathematical manipulations (e.g., Density = M/V)

Peak heights?

Peak positions?
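A minimal sketch of both acquisition issues, assuming the instrument exports comma-separated text. The column names, the sample values and the helper names `load_records` and `add_density` are hypothetical; only the formula Density = M/V comes from the slide.

```python
import csv
import io

# Hypothetical instrument export; a real file would be read with open(path).
RAW_EXPORT = """sample_id,mass_g,volume_ml
S1,12.0,10.0
S2,8.1,9.0
S3,15.0,12.0
"""


def load_records(text):
    """Format conversion: CSV text -> list of dicts with numeric fields as floats."""
    records = []
    for row in csv.DictReader(io.StringIO(text)):
        records.append({
            "sample_id": row["sample_id"],
            "mass_g": float(row["mass_g"]),
            "volume_ml": float(row["volume_ml"]),
        })
    return records


def add_density(records):
    """Mathematical manipulation: attach Density = M / V to every record."""
    for rec in records:
        rec["density_g_per_ml"] = rec["mass_g"] / rec["volume_ml"]
    return records


records = add_density(load_records(RAW_EXPORT))
for rec in records:
    print(rec["sample_id"], round(rec["density_g_per_ml"], 3))
```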

Page 12: Introduction to Data Mining / Bioinformatics

Data Mining: procedure (step 2: Data Preprocessing)

Key issues include:

a) Normalization to account for experimental bias
b) Statistical detection of flagrant outliers

Page 13: Introduction to Data Mining / Bioinformatics

Data Mining: procedure (step 2: Data Preprocessing, continued)

[Figure: four groups of measurements, each with a control (C) and samples 1, 2, 3]

Use controls to scale data
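Scaling by controls can be sketched as dividing each reading by the control measured in the same batch. This is one common reading of "use controls to scale data", not necessarily the exact procedure behind the slide; the batch names and readings are hypothetical.

```python
# Hypothetical raw readings: each batch has one control (C) and three samples.
BATCHES = {
    "batch_1": {"C": 100.0, "1": 150.0, "2": 80.0, "3": 120.0},
    "batch_2": {"C": 200.0, "1": 310.0, "2": 150.0, "3": 260.0},
}


def scale_by_control(batch, control_key="C"):
    """Divide every reading in a batch by that batch's control reading."""
    control = batch[control_key]
    if control == 0:
        raise ValueError("control reading is zero; cannot scale")
    return {name: value / control for name, value in batch.items()}


normalized = {name: scale_by_control(batch) for name, batch in BATCHES.items()}
print(normalized)
```

After scaling, every control equals 1.0, so readings from batches with different overall signal levels become directly comparable.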

Page 14: Introduction to Data Mining / Bioinformatics

Data Mining: procedure (step 2: Data Preprocessing, continued)

Key issues include:

a) Normalization to account for experimental bias
b) Statistical detection of flagrant outliers

Subjective (requires experience and/or domain knowledge)
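One widely used statistical screen for flagrant outliers is the modified z-score based on the median and the median absolute deviation (MAD). The sketch below is an assumption about how such a screen might look, not the method used in the lecture; the cutoff of 3.5 and the constant 0.6745 are conventional choices, and the readings are made up. Whether a flagged point should actually be discarded remains the subjective, domain-dependent decision noted on the slide.

```python
import statistics


def flag_outliers(values, cutoff=3.5):
    """Return indices whose modified z-score exceeds the cutoff.

    Median and MAD are used because they are distorted far less by the
    outliers themselves than the mean and standard deviation are.
    """
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        return []  # no spread to compare against
    return [
        i for i, v in enumerate(values)
        if 0.6745 * abs(v - med) / mad > cutoff
    ]


# Hypothetical replicate readings with one flagrant value
readings = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 25.0, 9.7]
print(flag_outliers(readings))
```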

Page 15: Introduction to Data Mining / Bioinformatics

Data Mining: procedure (step 3: Feature Selection)

Which of the many measurable properties relate to the outcome of interest?

a) Intrinsic information content
b) Redundancy relative to other properties
c) Correlation with target attribute
d) Iterative model training

Page 16: Introduction to Data Mining / Bioinformatics

Data Mining: procedure (step 3: Feature Selection, continued)

[Figure: four panels, each showing properties 1-4; two properties are crossed out (x)]

Page 17: Introduction to Data Mining / Bioinformatics

Data Mining: procedure (step 3: Feature Selection, continued)

[Figure: four panels, each showing properties 1-4; one property is crossed out (x)]

Page 18: Introduction to Data Mining / Bioinformatics

Data Mining: procedure (step 3: Feature Selection, continued)

[Figure: four panels, each showing properties 1-4, with one property crossed out (x), plus a fifth panel labeled 1-4]
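The first three criteria (information content, redundancy, correlation with the target) can be sketched as simple filters. The thresholds `min_var` and `max_redundancy`, the property names p1 to p4 and all values are hypothetical, and variance and Pearson correlation are stand-ins chosen for this sketch rather than measures named in the lecture.

```python
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either series is constant."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / (sxx * syy) ** 0.5


def filter_features(features, target, min_var=1e-6, max_redundancy=0.95):
    """Keep informative, non-redundant features, ranked by |correlation with target|."""
    # a) drop features with (almost) no variation: they carry no information
    informative = {n: v for n, v in features.items()
                   if statistics.pvariance(v) > min_var}
    # c) rank the rest by strength of correlation with the target attribute
    ranked = sorted(informative,
                    key=lambda n: abs(pearson(informative[n], target)),
                    reverse=True)
    # b) skip features that nearly duplicate one that was already kept
    kept = []
    for name in ranked:
        if all(abs(pearson(informative[name], informative[k])) < max_redundancy
               for k in kept):
            kept.append(name)
    return kept


# Hypothetical measurements of four properties over six samples
FEATURES = {
    "p1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "p2": [5.0, 5.0, 5.0, 5.0, 5.0, 5.0],    # constant: no information
    "p3": [2.1, 4.0, 6.1, 8.0, 10.1, 12.0],  # roughly 2 * p1: redundant
    "p4": [3.0, 1.0, 4.0, 1.0, 5.0, 9.0],
}
TARGET = [0, 0, 0, 1, 1, 1]
selected = filter_features(FEATURES, TARGET)
print(selected)
```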

Page 19: Introduction to Data Mining / Bioinformatics

Data Mining: procedure (step 3: Feature Selection, continued)

Iterative model training:

• Train preliminary models based on random sets of properties

• Evaluate models according to correlative or predictive performance

• Experiment with promising sets, adding or deleting descriptors to gauge impact on performance
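The three bullets can be sketched as a small search loop: score random property subsets, keep the best, then try adding or deleting one property at a time. The scoring model here (a nearest-centroid classifier with leave-one-out accuracy) is an assumption made to keep the example self-contained, and the data, function names and parameters are hypothetical.

```python
import random


def loo_accuracy(rows, labels, subset):
    """Leave-one-out accuracy of a nearest-centroid classifier on the chosen columns."""
    hits = 0
    for i in range(len(rows)):
        sums, counts = {}, {}
        for j, (row, lab) in enumerate(zip(rows, labels)):
            if j == i:
                continue  # hold out sample i
            acc = sums.setdefault(lab, [0.0] * len(subset))
            for k, col in enumerate(subset):
                acc[k] += row[col]
            counts[lab] = counts.get(lab, 0) + 1

        def sq_dist(lab):
            return sum((rows[i][col] - sums[lab][k] / counts[lab]) ** 2
                       for k, col in enumerate(subset))

        if min(sums, key=sq_dist) == labels[i]:
            hits += 1
    return hits / len(rows)


def iterative_selection(rows, labels, n_random=10, subset_size=2, seed=0):
    """Random starting subsets, then add/delete moves while the score improves."""
    rng = random.Random(seed)
    n_cols = len(rows[0])
    starts = [tuple(sorted(rng.sample(range(n_cols), subset_size)))
              for _ in range(n_random)]
    best = max(starts, key=lambda s: loo_accuracy(rows, labels, s))
    best_score = loo_accuracy(rows, labels, best)
    improved = True
    while improved:
        improved = False
        # neighbours: add one unused column, or delete one used column
        neighbours = [tuple(sorted(set(best) | {c}))
                      for c in range(n_cols) if c not in best]
        neighbours += [tuple(c2 for c2 in best if c2 != c)
                       for c in best if len(best) > 1]
        for cand in neighbours:
            score = loo_accuracy(rows, labels, cand)
            if score > best_score:
                best, best_score, improved = cand, score, True
                break
    return best, best_score


# Hypothetical data: column 0 separates the classes, columns 1-3 are noise
ROWS = [
    [1.0, 5.2, 0.3, 7.1], [1.2, 4.8, 0.9, 6.5],
    [0.9, 5.5, 0.1, 7.9], [1.1, 4.9, 0.7, 6.8],
    [3.0, 5.1, 0.2, 7.3], [3.2, 5.3, 0.8, 6.6],
    [2.9, 4.7, 0.4, 7.7], [3.1, 5.0, 0.6, 6.9],
]
LABELS = ["resistant"] * 4 + ["susceptible"] * 4
best_subset, best_score = iterative_selection(ROWS, LABELS)
print(best_subset, best_score)
```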

Page 20: Introduction to Data Mining / Bioinformatics

Data Mining: procedure (step 4: Classification)

Predict which sample will have which outcome?

a) Correlative methods
b) Distance-based clustering
c) Boundary detection
d) Rule learning
e) Weighted probability

Page 21: Introduction to Data Mining / Bioinformatics

Data Mining: procedure (step 4: Classification, continued)

[Figure: plot with axes x and y]

Page 22: Introduction to Data Mining / Bioinformatics

Data Mining: procedure (step 4: Classification, continued)

[Figure: plot with axes x and y, annotated with -n, +n, YES and NO]
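Assuming these x/y slides illustrate the correlative approach, one possible sketch is an ordinary least-squares line used to predict y from x, followed by a YES/NO call against a cutoff. The training pairs, the cutoff and the function names are hypothetical, and the exact meaning of the -n/+n annotation in the figure is not recoverable from the transcript.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict_outcome(x, slope, intercept, cutoff):
    """Predict y from x, then call YES if the prediction reaches the cutoff."""
    y_hat = slope * x + intercept
    return y_hat, ("YES" if y_hat >= cutoff else "NO")


# Hypothetical training pairs lying exactly on y = 2x + 1
XS = [1.0, 2.0, 3.0, 4.0]
YS = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(XS, YS)
print(predict_outcome(5.0, slope, intercept, cutoff=10.0))
```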

Page 23: Introduction to Data Mining / Bioinformatics

Data Mining: procedure (step 4: Classification, continued)

[Figure: scatter plot with axes x1 and x2]

Page 24: Introduction to Data Mining / Bioinformatics

Data Mining: procedure (step 4: Classification, continued)

[Figure: scatter plot with axes x1 and x2; groups of points labeled y1, y2, y3, y4]

y1 = resistant to types I & II diabetes
y2 = susceptible to types I & II
y3 = susceptible only to type II
y4 = susceptible only to type I
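A distance-based assignment can be sketched with a nearest-centroid rule: each labeled group is summarized by its mean position in the (x1, x2) plane, and a new sample takes the label of the closest centroid. The coordinates are hypothetical, and nearest-centroid is only one of several distance-based methods the slide could be referring to.

```python
import math


def centroids(points, labels):
    """Mean (x1, x2) position of each labeled group."""
    sums = {}
    for (x1, x2), lab in zip(points, labels):
        s = sums.setdefault(lab, [0.0, 0.0, 0])
        s[0] += x1
        s[1] += x2
        s[2] += 1
    return {lab: (s[0] / s[2], s[1] / s[2]) for lab, s in sums.items()}


def classify(point, cents):
    """Assign the label of the closest centroid (Euclidean distance)."""
    return min(cents, key=lambda lab: math.dist(point, cents[lab]))


# Hypothetical training samples forming four groups y1..y4
POINTS = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.2, 4.8),
          (1.0, 5.0), (0.8, 5.2), (5.0, 1.0), (4.8, 1.2)]
GROUPS = ["y1", "y1", "y2", "y2", "y3", "y3", "y4", "y4"]
cents = centroids(POINTS, GROUPS)
print(classify((4.5, 4.6), cents))
```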

Page 25: Introduction to Data Mining / Bioinformatics

Data Mining: procedure (step 4: Classification, continued)

[Figure: scatter plot with axes x1 and x2; regions labeled "Susceptible to type I" and "Resistant to type I"]
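Assuming this slide illustrates boundary detection, a perceptron is one simple way to find a separating line between two groups. The lecture does not name a specific algorithm, so the perceptron, the training points and the parameter values below are all assumptions for illustration.

```python
def train_perceptron(points, labels, epochs=1000, rate=0.1):
    """Learn (w1, w2, bias) of a separating line w1*x1 + w2*x2 + bias = 0.

    labels are +1 (susceptible) or -1 (resistant). Training stops early
    once a full pass makes no mistakes.
    """
    w1 = w2 = bias = 0.0
    for _ in range(epochs):
        errors = 0
        for (x1, x2), y in zip(points, labels):
            if y * (w1 * x1 + w2 * x2 + bias) <= 0:  # wrong side, or on the line
                w1 += rate * y * x1
                w2 += rate * y * x2
                bias += rate * y
                errors += 1
        if errors == 0:
            break
    return w1, w2, bias


def side(point, weights):
    """Report which side of the learned boundary a point falls on."""
    w1, w2, bias = weights
    score = w1 * point[0] + w2 * point[1] + bias
    return "susceptible" if score > 0 else "resistant"


# Hypothetical, linearly separable training samples
POINTS = [(3, 3), (4, 3), (3, 4), (4, 4), (1, 1), (0, 1), (1, 0), (0, 0)]
LABELS = [1, 1, 1, 1, -1, -1, -1, -1]
weights = train_perceptron(POINTS, LABELS)
print(weights)
```

A perceptron only settles on a boundary when the two groups are linearly separable; overlapping groups need a method that tolerates misclassified points.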

Page 26: Introduction to Data Mining / Bioinformatics

Data Mining: procedure (step 4: Classification, continued)

[Figure: scatter plot with axes x1 and x2 and thresholds a, b, c; regions labeled "Susceptible to type I" and "Resistant to type I"]

If x1 < c and x2 > a then resistant
Else if x1 > c and x2 > b then resistant
Else susceptible

E = 9
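The rule on this slide translates directly into code. The threshold values, the labeled samples and the helper names are hypothetical; reading "E" as the number of misclassified samples is also an assumption. The last helper shows the "learning" part in its simplest form: trying candidate values of c and keeping the one with the fewest errors.

```python
def classify_by_rules(x1, x2, a, b, c):
    """Apply the two-branch threshold rule from the slide."""
    if x1 < c and x2 > a:
        return "resistant"
    elif x1 > c and x2 > b:
        return "resistant"
    return "susceptible"


def count_errors(samples, a, b, c):
    """Number of labeled samples (x1, x2, label) the rule gets wrong."""
    return sum(1 for x1, x2, label in samples
               if classify_by_rules(x1, x2, a, b, c) != label)


def best_threshold_c(samples, a, b, candidates):
    """Pick the candidate c with the fewest errors, keeping a and b fixed."""
    return min(candidates, key=lambda c: count_errors(samples, a, b, c))


# Hypothetical labeled samples and thresholds
SAMPLES = [
    (1.0, 3.0, "resistant"),
    (1.0, 1.0, "susceptible"),
    (5.0, 5.0, "resistant"),
    (5.0, 3.0, "susceptible"),
    (2.0, 2.5, "susceptible"),
]
print(count_errors(SAMPLES, a=2.0, b=4.0, c=3.0))
print(best_threshold_c(SAMPLES, a=2.0, b=4.0, candidates=[1.5, 3.0, 4.5]))
```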

Page 27: Introduction to Data Mining / Bioinformatics

Data Mining: procedure (step 4: Classification, continued)

[Figure: three one-dimensional splits into Susc. and Resistant: x1 with threshold a, x2 with threshold b, and the combination Fx1 - Gx2 with threshold c]

If Fx1 - Gx2 < c then resistant
Else susceptible
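The weighted rule can be sketched as follows. The weights F and G, the cutoff c and the four samples are hypothetical, chosen so that neither x1 alone nor x2 alone separates the classes while the combination F*x1 - G*x2 does.

```python
def weighted_score(x1, x2, F, G):
    """Weighted combination of the two properties, as on the slide."""
    return F * x1 - G * x2


def classify_weighted(x1, x2, F, G, c):
    """If F*x1 - G*x2 < c then resistant, else susceptible."""
    return "resistant" if weighted_score(x1, x2, F, G) < c else "susceptible"


# Hypothetical samples: x1 values (1, 4 vs. 3, 6) and x2 values (3, 6 vs. 1, 4)
# interleave between the classes, so no single-property threshold works.
SAMPLES = [
    (1.0, 3.0, "resistant"),
    (4.0, 6.0, "resistant"),
    (3.0, 1.0, "susceptible"),
    (6.0, 4.0, "susceptible"),
]
F, G, C = 1.0, 1.0, 0.0
for x1, x2, label in SAMPLES:
    print(x1, x2, label, classify_weighted(x1, x2, F, G, C))
```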

Page 28: Introduction to Data Mining / Bioinformatics

Data Mining: procedure (step 5: Validation)

Define criteria and tests to prove model validity

a) Accuracy
b) Sensitivity vs. Specificity
c) Receiver Operating Characteristic (ROC) plot
d) Cross-validation

Page 29: Introduction to Data Mining / Bioinformatics

Data Mining: procedure (step 5: Validation, continued)

[Figure: scatter plot with axes x1 and x2; classes Susc. (Pos.) and Resistant (Neg.)]

Accuracy = (TP + TN) / (TP + TN + FP + FN) = 142 / 154

Page 30: Introduction to Data Mining / Bioinformatics

Data Mining: procedure (step 5: Validation, continued)

[Figure: scatter plot with axes x1 and x2; classes Susc. (Pos.) and Resistant (Neg.)]

Sensitivity = TP / (TP + FN) = 67 / 72

FPR = FP / (TN + FP) = 6 / 81

Note: Specificity = 1 - FPR
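These measures all follow from the four confusion-matrix counts. The counts below are the ones implied by the fractions 67/72 and 6/81 (TP = 67, FN = 5, FP = 6, TN = 75); they sum to 153 samples, one fewer than the denominator quoted on the accuracy slide, so treat the accuracy computed here as an illustration rather than a reproduction of that number. The function name is hypothetical.

```python
def confusion_metrics(tp, fn, fp, tn):
    """Accuracy, sensitivity, false positive rate and specificity from counts."""
    total = tp + fn + fp + tn
    fpr = fp / (fp + tn)
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),
        "fpr": fpr,
        "specificity": 1 - fpr,
    }


# Counts implied by Sensitivity = 67/72 and FPR = 6/81
metrics = confusion_metrics(tp=67, fn=5, fp=6, tn=75)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```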

Page 31: Introduction to Data Mining / Bioinformatics

Data Mining: procedure (step 5: Validation, continued)

[Figure: scatter plot with axes x1 and x2; classes Susc. (Pos.) and Resistant (Neg.); model stringency varied from less to more]

Sensitivity = TP / (TP + FN) = 69 / 72

FPR = FP / (TN + FP) = 19 / 81

Note: Specificity = 1 - FPR

Page 32: Introduction to Data Mining / Bioinformatics

Data Mining: procedure (step 5: Validation, continued)

[Figure: ROC plot of sensitivity (Sens) against false positive rate (FPR)]

Page 33: Introduction to Data Mining / Bioinformatics

Data Mining: procedure (step 5: Validation, continued)

[Figure: ROC plot of sensitivity (Sens) against false positive rate (FPR)]

Area under the curve is an excellent measure of model performance

1.0: perfect model
0.5: random
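An ROC curve can be sketched by sweeping the decision threshold (the "stringency") over a model's scores and recording one (FPR, sensitivity) point per threshold; the area under it follows from the trapezoid rule. The scores and labels are hypothetical, and the function names are chosen for this sketch.

```python
def roc_points(scores, labels):
    """(FPR, sensitivity) pairs obtained by sweeping the decision threshold.

    scores: higher means more likely positive; labels: 1 = positive, 0 = negative.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]  # strictest threshold: nothing is called positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Trapezoidal area under the ROC curve."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


# Hypothetical model scores for two positives and two negatives
print(auc(roc_points([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])))  # perfect ranking
print(auc(roc_points([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])))  # one pair misordered
```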

Page 34: Introduction to Data Mining / Bioinformatics

Data Mining: procedure (step 5: Validation, continued)

Predictions are imperfect due to:

• Imperfect Algorithms
• Imperfect Data

Page 35: Introduction to Data Mining / Bioinformatics

Cross-Validation:

• Carefully monitor features that are useful across different independent data subsets

• This can be accomplished with N-fold cross-validation:

• Best feature selection and classification algorithms will yield best consistent performance across independent trials

• Best features will be consistently important across trials

[Figure: the data split into Train and Test portions for Trial 1 through Trial 5]

Model performance = mean predictive performance over 5 trials
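N-fold cross-validation can be sketched as below with N = 5: the samples are dealt into five disjoint test folds, a model is trained on the remaining four fifths each time, and the five test accuracies are averaged. The toy "midpoint" model, the data and all function names are hypothetical; any training and prediction functions with the same shape could be substituted.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle sample indices and deal them into k disjoint test folds."""
    order = list(range(n_samples))
    random.Random(seed).shuffle(order)
    return [order[i::k] for i in range(k)]


def cross_validate(rows, labels, train_fn, predict_fn, k=5, seed=0):
    """Return (mean test accuracy, per-trial accuracies) over k trials."""
    scores = []
    for test_idx in k_fold_indices(len(rows), k, seed):
        held_out = set(test_idx)
        train_rows = [r for i, r in enumerate(rows) if i not in held_out]
        train_labels = [l for i, l in enumerate(labels) if i not in held_out]
        model = train_fn(train_rows, train_labels)
        hits = sum(1 for i in test_idx if predict_fn(model, rows[i]) == labels[i])
        scores.append(hits / len(test_idx))
    return sum(scores) / len(scores), scores


def train_midpoint(rows, labels):
    """Toy model: cutoff halfway between the two class means of one property."""
    pos = [r for r, l in zip(rows, labels) if l == 1]
    neg = [r for r, l in zip(rows, labels) if l == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def predict_midpoint(cutoff, row):
    return 1 if row > cutoff else 0


# Hypothetical single-property data with well separated classes
ROWS = [1.0, 1.2, 0.8, 1.1, 0.9, 3.0, 3.2, 2.8, 3.1, 2.9]
LABELS = [0] * 5 + [1] * 5
mean_score, trial_scores = cross_validate(ROWS, LABELS, train_midpoint, predict_midpoint)
print(mean_score, trial_scores)
```

Because each sample is tested exactly once by a model that never saw it, the mean over trials estimates performance on new data rather than fit to the training data.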


Page 36: Introduction to Data Mining / Bioinformatics

Data Mining: procedure (step 6: Prediction & Iteration)

Analysis is only useful if it is used; it only improves if it is tested

a) Good validation requires successful new predictions
b) Imperfect predictions can lead to method refinement and greater understanding

Page 37: Introduction to Data Mining / Bioinformatics

Questions?

Lushington in Silico

Geraldlushington3117 at aol.com

Geraldlushington.org
