Introduction to Data Mining / Bioinformatics


DESCRIPTION

This presentation, prepared by Gerry Lushington, is a friendly introduction to the basics of data mining, as applied to biological problems. The intended audience is students and scientific researchers from a non-computational background.

TRANSCRIPT

Introduction to Bioinformatics: Mining Your Data

Gerry Lushington, Lushington in Silico

modeling / informatics consultant

2/28/2012 · Intro to Data Mining / GHL


What is Data Mining?

Use of computational methods to perceive trends in data that can be used to explain or predict important outcomes or properties

Applicable across many disciplines:

Molecular bioinformatics

Medical Informatics

Health Informatics

Biodiversity informatics


Example Applications: find relationships between convenient observables and important outcomes.

Convenient observables:
a) Relative gene expression data
b) Relative protein abundance data
c) Relative lipid & metabolite profiles
d) Glycosylation variants
e) SNPs, alleles
f) Cellular traits
g) Organism traits
h) Behavioral traits
i) Case history

Important outcomes:
1. Disease susceptibility
2. Drug efficacy
3. Toxin susceptibility
4. Immunity
5. Genetic disorders
6. Microbial virulence
7. Species adaptive success
8. Species complementarity


Goals for this lecture:

Focus on Data Mining: how to approach your data and use it to understand biology

Overview of available techniques

Understanding model validation

Try to think about data you’ve seen: what techniques might be useful?

Don’t worry about grasping everything: the K-INBRE Bioinformatics Core is here to help!


Basic Data Mining: find relationships between
a) easy-to-measure properties vs.
b) important (but harder to measure) outcomes or attributes

Use relationships to understand the conceptual basis for outcomes in b)

Use relationships to predict outcomes in new cases where outcome has not yet been measured


Basic Data Mining: simple measurables


Basic Data Mining: general observation

[Figure: cartoon samples grouped as Unhappy vs. Happy]

Basic Data Mining: relationship (#1)

[Figure: the same samples, classified by color]

Blue = happy; Red = unhappy. Accuracy = 12/20 = 60%

Basic Data Mining: relationship (#2)

[Figure: the same samples, classified by color and size]

Blue + BIG Red = happy; little Red = unhappy. Accuracy = 17/20 = 85%

Data Mining: procedure

1. Data Acquisition
2. Data Preprocessing
3. Feature Selection
4. Classification
5. Validation
6. Prediction & Iteration


Data Mining procedure, step 1: Data Acquisition

Key issues include:

a) format conversion from instrument
b) any necessary mathematical manipulations (e.g., Density = M/V)

[Figure: raw instrument trace annotated "Peak heights?" and "Peak positions?"]

Data Mining procedure, step 2: Data Preprocessing

Key issues include:

a) Normalization to account for experimental bias
b) Statistical detection of flagrant outliers


Use controls to scale data

[Figure: bar charts for a control well (C) and samples 1–3 across four runs, before and after scaling to the control]
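The control-based scaling shown above can be sketched in Python (the deck itself contains no code; the plate layout and readings below are invented for illustration):

```python
# Sketch: scale each run's measurements by its control well so that runs
# with different overall intensity become comparable. The layout (a
# control "C" plus samples s1-s3) follows the slide; values are invented.

def normalize_by_control(plate):
    """Divide every reading on a plate by that plate's control reading."""
    control = plate["C"]
    return {well: value / control for well, value in plate.items()}

plates = [
    {"C": 2.0, "s1": 4.0, "s2": 6.0, "s3": 2.0},   # bright run
    {"C": 1.0, "s1": 2.0, "s2": 3.0, "s3": 1.0},   # dim run, same biology
]

normalized = [normalize_by_control(p) for p in plates]
# After scaling, both runs report identical relative abundances.
```

After normalization the experimental bias between runs disappears, which is exactly what lets downstream steps compare samples across plates.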

Data Mining procedure, step 2: Data Preprocessing (continued)

Outlier detection is subjective: it requires experience and/or domain knowledge.


Data Mining procedure, step 3: Feature Selection

Which out of many measurable properties relate to outcome of interest?

a) Intrinsic information content
b) Redundancy relative to other properties
c) Correlation with target attribute
d) Iterative model training
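Criterion c), correlation with the target attribute, can be sketched as a simple Pearson-correlation ranking; the feature names and values below are invented:

```python
# Sketch of criterion (c): rank candidate features by the magnitude of
# their Pearson correlation with the target attribute. Data are invented.

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

target = [1.0, 2.0, 3.0, 4.0, 5.0]
features = {
    "gene_a": [1.1, 2.0, 2.9, 4.2, 5.1],   # tracks the target
    "gene_b": [3.0, 1.0, 4.0, 1.5, 2.0],   # essentially noise
}

# Highest |correlation| first: these features are the best candidates.
ranked = sorted(features, key=lambda f: abs(pearson(features[f], target)),
                reverse=True)
```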


[Figures: bar charts of candidate features 1–4 across sample groups, with rejected features marked ×, illustrating criteria a)–c)]

Data Mining procedure, step 3: Feature Selection — iterative model training

• Train preliminary models based on random sets of properties

• Evaluate models according to correlative or predictive performance

• Experiment with promising sets, adding or deleting descriptors to gauge the impact on performance
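These steps can be sketched as a greedy forward-selection loop. This is one possible reading of the slide, not the author's exact method; the scoring function below is a toy stand-in for cross-validated model performance:

```python
# Sketch of iterative feature selection: greedily add whichever feature
# improves the score, stop when nothing helps. The scorer is hypothetical;
# in practice it would be a trained model's predictive performance.
import random

def forward_select(features, score, seed=0):
    random.seed(seed)                 # slides suggest random starting sets
    chosen, best = [], score([])
    remaining = list(features)
    improved = True
    while improved and remaining:
        improved = False
        random.shuffle(remaining)
        for f in list(remaining):
            trial = score(chosen + [f])
            if trial > best:          # keep the feature only if it helps
                best = trial
                chosen.append(f)
                remaining.remove(f)
                improved = True
    return chosen, best

# Toy scorer: "a" and "b" are genuinely useful, "noise" slightly hurts.
useful = {"a": 0.4, "b": 0.3}
def toy_score(subset):
    return sum(useful.get(f, -0.05) for f in subset)

selected, performance = forward_select(["a", "b", "noise"], toy_score)
```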


Data Mining procedure, step 4: Classification

Goal: predict which sample will have which outcome.

a) Correlative methods
b) Distance-based clustering
c) Boundary detection
d) Rule learning
e) Weighted probability


Classification: correlative methods

[Figure: scatter of outcome y vs. feature x]

Classification: correlative methods (continued)

[Figure: samples in the (x, y) plane split into YES / NO classes by a cutoff on y, with margin ±n]

Classification: distance-based clustering

[Figure: scatter of samples in the (x1, x2) plane]

[Figure: distance-based clustering in the (x1, x2) plane, grouping samples into clusters y1–y4]

y1 = resistant to types I & II diabetes
y2 = susceptible to types I & II
y3 = susceptible only to type II
y4 = susceptible only to type I
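Distance-based clustering as illustrated above can be sketched with a minimal k-means; the points and cluster count are invented, and in practice the cluster labels (y1–y4) would be interpreted after the fact:

```python
# Sketch of distance-based clustering: a minimal k-means in the (x1, x2)
# plane. Points and starting centers are invented for illustration.

def kmeans(points, centers, iters=10):
    for _ in range(iters):
        # Assign each point to its nearest center (squared distance).
        groups = [[] for _ in centers]
        for p in points:
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            groups[d.index(min(d))].append(p)
        # Move each center to the mean of its group.
        centers = [
            tuple(sum(v) / len(g) for v in zip(*g)) if g else c
            for g, c in zip(groups, centers)
        ]
    return centers, groups

points = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (5.2, 4.9)]
centers, groups = kmeans(points, centers=[(0.0, 0.0), (5.0, 5.0)])
# Two tight clusters emerge, one near the origin and one near (5, 5).
```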

Classification: boundary detection

[Figure: a decision boundary in the (x1, x2) plane separating samples susceptible to type I from resistant samples]

Classification: rule learning

If x1 < c and x2 > a then resistant
Else if x1 > c and x2 > b then resistant
Else susceptible

[Figure: thresholds a, b, c partitioning the (x1, x2) plane into susceptible and resistant regions; E = 9]
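The decision rules on this slide translate directly into code; the threshold values A, B, C below are invented, since the slide only draws them:

```python
# The slide's learned rules, transcribed directly. The thresholds are the
# cutoffs a, b, c drawn in the figure; these numeric values are invented.
A, B, C = 2.0, 3.0, 5.0

def classify(x1, x2):
    if x1 < C and x2 > A:
        return "resistant"
    elif x1 > C and x2 > B:
        return "resistant"
    else:
        return "susceptible"
```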

Classification: weighted probability

[Figure: class distributions along x1 (cutoff a), along x2 (cutoff b), and along the combined score Fx1 − Gx2 (cutoff c); the combined score separates Susceptible from Resistant more cleanly than either axis alone]

If Fx1 − Gx2 < c then resistant
Else susceptible
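The weighted-probability rule likewise translates directly; the weights F, G and cutoff C below are invented placeholders:

```python
# The slide's weighted rule: project (x1, x2) onto a single score
# F*x1 - G*x2 and threshold it at c. Weights and cutoff are invented.
F, G, C = 1.0, 2.0, -1.0

def classify(x1, x2):
    return "resistant" if F * x1 - G * x2 < C else "susceptible"
```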

Data Mining procedure, step 5: Validation

Define criteria and tests to establish model validity

a) Accuracy
b) Sensitivity vs. Specificity
c) Receiver Operating Characteristic (ROC) plot
d) Cross-validation


Validation: accuracy

[Figure: susceptible (positive) and resistant (negative) regions in the (x1, x2) plane]

Accuracy = (TP + TN) / (TP + TN + FP + FN) = 142 / 154

Validation: sensitivity vs. specificity

[Figure: susceptible (positive) and resistant (negative) regions in the (x1, x2) plane]

Sensitivity = TP / (TP + FN) = 67 / 72

FPR = FP / (TN + FP) = 6 / 81

Note: Specificity = 1 − FPR
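These metrics can be computed from the four confusion-matrix counts. The counts below are chosen to reproduce this slide's sensitivity (67/72) and FPR (6/81); they are a consistent guess, since the slide reports only the ratios:

```python
# Validation metrics from a confusion matrix. Counts are inferred from
# the slide's ratios: TP + FN = 72 and TN + FP = 81.
def metrics(tp, tn, fp, fn):
    return {
        "accuracy":    (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),
        "fpr":         fp / (tn + fp),
        "specificity": tn / (tn + fp),   # = 1 - FPR, as the slide notes
    }

m = metrics(tp=67, tn=75, fp=6, fn=5)
```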

Validation: varying model stringency

[Figure: the same data with a less stringent decision boundary]

Sensitivity = TP / (TP + FN) = 69 / 72

FPR = FP / (TN + FP) = 19 / 81

Note: Specificity = 1 − FPR

A less stringent boundary catches more true positives (69 vs. 67) at the cost of more false positives (19 vs. 6).

Validation: ROC plot

[Figure: ROC curve — sensitivity vs. FPR]

Validation: ROC plot (continued)

[Figure: ROC curve with the area under it shaded]

The area under the curve (AUC) is an excellent measure of model performance: 1.0 = perfect model; 0.5 = random.
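An ROC curve and its AUC can be sketched by sweeping a score threshold (the "model stringency" of the previous slides); the scores and labels below are invented:

```python
# Sketch: build the ROC curve by ranking samples by score, then compute
# the AUC by the trapezoid rule. Assumes both classes are present.

def roc_auc(scores, labels):
    """Trapezoidal area under the (FPR, sensitivity) curve."""
    pairs = sorted(zip(scores, labels), reverse=True)  # most confident first
    pos = sum(labels)
    neg = len(labels) - pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, label in pairs:            # lower the threshold one sample at a time
        if label:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

# A perfect ranker puts every positive above every negative: AUC = 1.0.
auc = roc_auc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
```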

Validation: cross-validation

Predictions are imperfect due to:
• Imperfect algorithms
• Imperfect data

Cross-Validation:

• Carefully monitor features that are useful across different independent data subsets

• This can be accomplished with N-fold cross-validation:

• Best feature selection and classification algorithms will yield best consistent performance across independent trials

• Best features will be consistently important across trials

[Figure: 5-fold cross-validation — in each of Trials 1–5, a different fifth of the data is held out for testing and the rest is used for training]

Model performance = mean predictive performance over the 5 trials
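The 5-trial scheme can be sketched as generic k-fold cross-validation; `train` and `evaluate` below are toy stand-ins, not the deck's actual methods:

```python
# Sketch of N-fold cross-validation: each trial holds out one fold for
# testing and trains on the rest; performance is the mean over trials.

def k_fold_cv(samples, k, train, evaluate):
    folds = [samples[i::k] for i in range(k)]      # k disjoint subsets
    scores = []
    for i in range(k):
        test = folds[i]
        train_set = [s for j, f in enumerate(folds) if j != i for s in f]
        model = train(train_set)
        scores.append(evaluate(model, test))
    return sum(scores) / k

# Toy stand-ins: the "model" is just the training mean; evaluation is
# the negative absolute error against the test mean.
def train(data):
    return sum(data) / len(data)

def evaluate(model, test):
    return -abs(model - sum(test) / len(test))

perf = k_fold_cv(list(range(10)), k=5, train=train, evaluate=evaluate)
```

Consistent scores across the trials, as the bullets above note, are what distinguish genuinely useful features and algorithms from ones that merely fit one data split.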


Data Mining procedure, step 6: Prediction & Iteration

An analysis is only useful if it is used, and it only improves if it is tested.

a) Good validation requires successful new predictions
b) Imperfect predictions can lead to method refinement and greater understanding

Questions?

Lushington in Silico

Geraldlushington3117 at aol.com

Geraldlushington.org
