introduction to data mining / bioinformatics

Introduction to Bioinformatics: Mining Your Data

Gerry Lushington, Lushington in Silico

modeling / informatics consultant

2/28/2012 | Intro to Data Mining / GHL


What is Data Mining?

Use of computational methods to perceive trends in data that can be used to explain or predict important outcomes or properties

Applicable across many disciplines:

Molecular bioinformatics

Medical Informatics

Health Informatics

Biodiversity informatics


Example Applications:

Find relationships between:

Convenient Observables
a) Relative gene expression data
b) Relative protein abundance data
c) Relative lipid & metabolite profiles
d) Glycosylation variants
e) SNPs, alleles
f) Cellular traits
g) Organism traits
h) Behavioral traits
i) Case history

vs.

Important Outcomes
1. Disease susceptibility
2. Drug efficacy
3. Toxin susceptibility
4. Immunity
5. Genetic disorders
6. Microbial virulence
7. Species adaptive success
8. Species complementarity


Goals for this lecture:

Focus on Data Mining: how to approach your data and use it to understand biology

Overview of available techniques

Understanding model validation

Try to think about data you’ve seen: what techniques might be useful?

Don’t worry about grasping everything: the K-INBRE Bioinformatics Core is here to help!


Basic Data Mining:

Find relationships between:
a) Easy-to-measure properties vs.
b) Important (but harder to measure) outcomes or attributes

Use relationships to understand the conceptual basis for outcomes in b)

Use relationships to predict outcomes in new cases where outcome has not yet been measured


Basic Data Mining: simple measurables


Basic Data Mining: general observation

[Figure: samples sorted into Unhappy and Happy groups]


Basic Data Mining: relationship (#1)

[Figure: samples sorted into Unhappy and Happy groups]

Rule: blue = happy; red = unhappy
Accuracy = 12/20 = 60%


Basic Data Mining: relationship (#2)

[Figure: samples sorted into Unhappy and Happy groups]

Rule: blue or big red = happy; little red = unhappy
Accuracy = 17/20 = 85%
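The two rules can be compared on a small table of samples. The sketch below uses invented toy data (the original figure is not available), with counts chosen only so that the two rules reproduce the 12/20 and 17/20 accuracies quoted on the slides. The names `samples`, `rule_1`, `rule_2` and `accuracy` are illustrative assumptions.

```python
# Hypothetical toy data: (color, size, is_happy). The counts are invented so
# that the two rules give the accuracies quoted on the slides.
samples = (
    [("blue", "big", True)] * 7
    + [("blue", "little", False)] * 2
    + [("red", "big", True)] * 6
    + [("red", "big", False)] * 1
    + [("red", "little", False)] * 4
)


def rule_1(color, size):
    """Relationship #1: blue = happy, red = unhappy."""
    return color == "blue"


def rule_2(color, size):
    """Relationship #2: blue or big red = happy, little red = unhappy."""
    return color == "blue" or size == "big"


def accuracy(rule, data):
    """Fraction of samples whose predicted outcome matches the observed one."""
    correct = sum(rule(color, size) == happy for color, size, happy in data)
    return correct / len(data)


print(accuracy(rule_1, samples))  # 12 of 20 correct
print(accuracy(rule_2, samples))  # 17 of 20 correct
```

Adding a second easy-to-measure property (size) is what lifts the accuracy here, which is the point of the second relationship.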


Data Mining: procedure

1. Data Acquisition
2. Data Preprocessing
3. Feature Selection
4. Classification
5. Validation
6. Prediction & Iteration



Data Mining: 1. Data Acquisition

Key issues include:

a) Format conversion from instrument
b) Any necessary mathematical manipulations (e.g., Density = M/V)

Peak heights?

Peak positions?




Data Mining: 2. Data Preprocessing

Key issues include:

a) Normalization to account for experimental bias
b) Statistical detection of flagrant outliers




Key issues include:


[Figure: four groups of measurements, each with a control (C) and samples 1, 2, 3]

Use controls to scale data
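One simple way to use controls for scaling is to divide every reading in a batch by that batch's control reading. The sketch below assumes each batch is a dictionary with a control entry "C" and samples "1" to "3"; the readings and the function name `scale_by_control` are hypothetical.

```python
def scale_by_control(batch):
    """Divide each sample reading by the control reading of the same batch."""
    control = batch["C"]
    if control == 0:
        raise ValueError("control reading must be non-zero")
    return {name: value / control for name, value in batch.items() if name != "C"}


# Hypothetical raw readings: the second batch reads twice as high overall,
# for example because of an instrument or loading bias.
batches = [
    {"C": 2.0, "1": 4.0, "2": 6.0, "3": 1.0},
    {"C": 4.0, "1": 8.0, "2": 12.0, "3": 2.0},
]

scaled = [scale_by_control(batch) for batch in batches]
print(scaled)
```

After scaling, the two batches agree, so the remaining differences between samples are more likely to reflect biology than experimental bias.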




Key issues include:


Subjective (requires experience and/or domain knowledge)
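Statistical outlier detection can be sketched with a robust score based on the median and the median absolute deviation (MAD). This is one common choice, not necessarily the method intended in the lecture, and the cutoff of 3.5 is a conventional assumption: picking it is exactly the kind of judgment that calls for experience and domain knowledge. The data and the name `flag_outliers` are hypothetical.

```python
import statistics


def flag_outliers(values, threshold=3.5):
    """Flag values whose modified z-score (median/MAD based) exceeds threshold."""
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        # At least half of the values are identical; nothing can be flagged
        # with this score.
        return [False] * len(values)
    # 0.6745 rescales the MAD so the score is comparable to a z-score
    # for normally distributed data.
    return [abs(0.6745 * (v - median) / mad) > threshold for v in values]


# Hypothetical replicate measurements with one flagrant outlier.
readings = [9.8, 10.1, 10.0, 9.9, 10.2, 25.0]
print(flag_outliers(readings))
```

The median and MAD are used instead of the mean and standard deviation because a single extreme value barely moves them.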




Data Mining: 3. Feature Selection

Which out of many measurable properties relate to the outcome of interest?

a) Intrinsic information content
b) Redundancy relative to other properties
c) Correlation with target attribute
d) Iterative model training
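Criteria a) to c) can be sketched as a simple filter: drop properties that do not vary or barely track the target, then drop properties that nearly duplicate one already kept. The Pearson correlation, both cutoffs, and all data below are illustrative assumptions rather than the lecture's own procedure.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a constant property carries no information
    return cov / (sd_x * sd_y)


def select_features(features, target, min_target_r=0.5, max_mutual_r=0.9):
    """Keep properties that track the target and are not redundant."""
    ranked = sorted(
        features,
        key=lambda name: abs(pearson(features[name], target)),
        reverse=True,
    )
    kept = []
    for name in ranked:
        if abs(pearson(features[name], target)) < min_target_r:
            continue  # too weakly related to the outcome
        if any(abs(pearson(features[name], features[k])) > max_mutual_r for k in kept):
            continue  # nearly duplicates a property already kept
        kept.append(name)
    return kept


# Hypothetical data: six samples, outcome 0 or 1, four measured properties.
target = [0, 0, 0, 1, 1, 1]
features = {
    "p1": [1.0, 1.2, 0.9, 3.1, 2.9, 3.0],  # tracks the outcome
    "p2": [2.0, 2.4, 1.8, 6.2, 5.8, 6.0],  # p1 doubled: redundant
    "p3": [5.0, 5.0, 5.0, 5.0, 5.0, 5.0],  # constant: no information
    "p4": [0.3, 0.9, 0.1, 0.8, 0.2, 0.6],  # weakly related to the outcome
}
kept = select_features(features, target)
print(kept)  # only one of the two redundant properties survives
```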






[Figures: properties 1 to 4 plotted for four sample groups; x marks flag individual properties]






• Train preliminary models based on random sets of properties

• Evaluate models according to correlative or predictive performance

• Experiment with promising sets, adding or deleting descriptors to gauge the impact on performance
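The three bullets above can be sketched as a small search loop. As assumptions for illustration, the model is a nearest-centroid classifier scored on its own training data, the refinement step only tries deleting descriptors, and the data are invented so that only column 0 carries signal. A real study would score each model on held-out data instead.

```python
import random


def centroid_accuracy(rows, labels, cols):
    """Training-set accuracy of a nearest-centroid model using only `cols`."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(label, []).append([row[c] for c in cols])
    centroids = {
        label: [sum(values) / len(values) for values in zip(*points)]
        for label, points in groups.items()
    }

    def predict(row):
        point = [row[c] for c in cols]
        return min(
            centroids,
            key=lambda label: sum((p - q) ** 2 for p, q in zip(point, centroids[label])),
        )

    return sum(predict(row) == label for row, label in zip(rows, labels)) / len(rows)


def search_subsets(rows, labels, n_trials=20, seed=0):
    """Try random descriptor sets, then prune the best one."""
    rng = random.Random(seed)
    n_cols = len(rows[0])
    best_cols, best_score = None, -1.0
    for _ in range(n_trials):
        size = rng.randint(1, n_cols)
        cols = sorted(rng.sample(range(n_cols), size))
        score = centroid_accuracy(rows, labels, cols)
        if score > best_score:
            best_cols, best_score = cols, score
    # Refinement: delete descriptors one at a time while performance holds.
    improved = True
    while improved and len(best_cols) > 1:
        improved = False
        for col in list(best_cols):
            trial = [c for c in best_cols if c != col]
            score = centroid_accuracy(rows, labels, trial)
            if score >= best_score:
                best_cols, best_score, improved = trial, score, True
                break
    return best_cols, best_score


# Hypothetical data: column 0 separates the classes, columns 1 and 2 are noise.
rows = [
    (0.0, 0.2, 0.9), (0.5, 0.8, 0.1), (1.0, 0.5, 0.4),
    (10.0, 0.3, 0.6), (10.5, 0.7, 0.2), (11.0, 0.5, 0.5),
]
labels = ["A", "A", "A", "B", "B", "B"]
cols, score = search_subsets(rows, labels)
print(cols, score)
```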




Data Mining: 4. Classification

Which sample will have which outcome?

a) Correlative methods
b) Distance-based clustering
c) Boundary detection
d) Rule learning
e) Weighted probability






[Figures: outcome y plotted against property x, with y divided into NO and YES classes]






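[Figure: samples plotted by properties x1 and x2]

[Figure: the same plot with samples grouped into four clusters, y1 to y4]

y1 = resistant to types I & II diabetes
y2 = susceptible to types I & II
y3 = susceptible only to type II
y4 = susceptible only to type I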
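Once samples fall into groups in the (x1, x2) plane, a new sample can be assigned to the group whose center is closest. The sketch below uses a nearest-centroid classifier, one simple distance-based approach and not necessarily the method behind the slide. The coordinates are invented, and only the group names y1 to y4 come from the slide.

```python
import math


def fit_centroids(points, labels):
    """Average the (x1, x2) coordinates of the training samples in each group."""
    sums = {}
    for (x1, x2), label in zip(points, labels):
        total_x1, total_x2, count = sums.get(label, (0.0, 0.0, 0))
        sums[label] = (total_x1 + x1, total_x2 + x2, count + 1)
    return {
        label: (total_x1 / count, total_x2 / count)
        for label, (total_x1, total_x2, count) in sums.items()
    }


def classify(point, centroids):
    """Assign the point to the group with the nearest centroid."""
    return min(centroids, key=lambda label: math.dist(point, centroids[label]))


# Hypothetical training samples: three per group, one group per corner.
points = [
    (1, 1), (1, 2), (2, 1),   # y1
    (8, 8), (8, 9), (9, 8),   # y2
    (1, 8), (2, 9), (1, 9),   # y3
    (8, 1), (9, 2), (9, 1),   # y4
]
labels = ["y1"] * 3 + ["y2"] * 3 + ["y3"] * 3 + ["y4"] * 3

centroids = fit_centroids(points, labels)
print(classify((1.5, 1.5), centroids))
print(classify((8.5, 1.0), centroids))
```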






[Figure: x1 vs. x2 plot divided into two regions: susceptible to type I and resistant to type I]






[Figure: x1 vs. x2 plot with thresholds a and b on x2 and c on x1; regions labeled susceptible to type I and resistant to type I]

If x1 < c and x2 > a then resistant
Else if x1 > c and x2 > b then resistant
Else susceptible

E = 9
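The rule on this slide translates directly into code. The sketch below assumes that E counts the samples the rule misclassifies, which the slide does not state. The threshold values and samples are invented, so the error count printed here is unrelated to the slide's E = 9.

```python
def rule_classifier(x1, x2, a, b, c):
    """Apply the slide's two-branch rule; a sample with x1 == c is susceptible."""
    if x1 < c and x2 > a:
        return "resistant"
    elif x1 > c and x2 > b:
        return "resistant"
    return "susceptible"


def count_errors(samples, a, b, c):
    """Number of samples whose predicted class differs from the observed one."""
    return sum(rule_classifier(x1, x2, a, b, c) != label for x1, x2, label in samples)


# Hypothetical thresholds and labeled samples (x1, x2, observed class).
a, b, c = 2.0, 5.0, 4.0
samples = [
    (1.0, 3.0, "resistant"),
    (1.0, 1.0, "susceptible"),
    (6.0, 6.0, "resistant"),
    (6.0, 3.0, "susceptible"),
    (6.0, 3.5, "resistant"),  # the rule gets this one wrong
]
print(count_errors(samples, a, b, c))
```

Rule learning then amounts to searching for the thresholds a, b and c that minimize this error count.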






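[Figure: three one-dimensional separations of susceptible vs. resistant samples: along x1 (threshold a), along x2 (threshold b), and along the weighted combination Fx1 - Gx2 (threshold c)]

If Fx1 - Gx2 < c then resistant
Else susceptible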
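A weighted combination can separate classes that neither property separates alone. The sketch below applies the slide's rule with invented weights, threshold and samples. The values F = G = 1 and c = 0 are assumptions for illustration; in practice they would be fitted to training data.

```python
def weighted_score(x1, x2, F, G):
    """Weighted combination of the two properties, as on the slide."""
    return F * x1 - G * x2


def classify(x1, x2, F, G, c):
    """Resistant if the weighted score falls below the threshold c."""
    return "resistant" if weighted_score(x1, x2, F, G) < c else "susceptible"


# Hypothetical samples: the x1 ranges of the two classes overlap, and so do
# the x2 ranges, but the difference x1 - x2 separates them cleanly.
resistant = [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)]
susceptible = [(2.0, 1.0), (4.0, 3.0), (6.0, 5.0)]

F, G, c = 1.0, 1.0, 0.0
for x1, x2 in resistant + susceptible:
    print((x1, x2), classify(x1, x2, F, G, c))
```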




Data Mining: 5. Validation

Define criteria and tests to prove model validity

a) Accuracy
b) Sensitivity vs. specificity
c) Receiver Operating Characteristic (ROC) plot
d) Cross-validation






[Figure: x1 vs. x2 plot with susceptible (positive) and resistant (negative) regions]

Accuracy = (TP + TN) / (TP + TN + FP + FN) = 142 / 153






[Figure: x1 vs. x2 plot with susceptible (positive) and resistant (negative) regions]

Sensitivity = TP / (TP + FN) = 67 / 72

FPR = FP / (TN + FP) = 6 / 81

Note: Specificity = 1 - FPR
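These definitions can be checked with the counts implied by this slide: 67 of 72 susceptible samples are true positives, and 6 of 81 resistant samples are false positives. The sketch uses exact fractions so the results can be compared directly with the slide. The function names are illustrative.

```python
from fractions import Fraction


def sensitivity(tp, fn):
    """Fraction of true positives among all actual positives."""
    return Fraction(tp, tp + fn)


def false_positive_rate(fp, tn):
    """Fraction of false positives among all actual negatives."""
    return Fraction(fp, fp + tn)


# Counts implied by the slide: 72 susceptible (positive), 81 resistant (negative).
tp, fn = 67, 72 - 67
fp, tn = 6, 81 - 6

sens = sensitivity(tp, fn)
fpr = false_positive_rate(fp, tn)
spec = 1 - fpr  # specificity = TN / (TN + FP)

print(f"sensitivity = {sens} ({float(sens):.3f})")
print(f"FPR         = {fpr} ({float(fpr):.3f})")
print(f"specificity = {spec} ({float(spec):.3f})")
```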






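[Figure: x1 vs. x2 plot with susceptible (positive) and resistant (negative) regions; the decision boundary shifts as model stringency is varied from less to more]

Sensitivity = TP / (TP + FN) = 69 / 72

FPR = FP / (TN + FP) = 19 / 81

Note: Specificity = 1 - FPR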






[Figure: ROC plot of sensitivity (Sens) vs. false positive rate (FPR)]

Area under the curve is an excellent measure of model performance

1.0: perfect model
0.5: random
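An ROC curve can be traced by sweeping the decision threshold, which is one way to vary model stringency, and recording sensitivity and FPR at each step. The area under the curve then follows from the trapezoid rule. The sketch below assumes each sample has a numeric score where higher means more likely positive, and that no two scores are tied. All scores and labels are invented.

```python
def roc_points(scores, labels):
    """Return (FPR, sensitivity) pairs as the decision threshold is lowered.

    labels are 1 (positive) or 0 (negative); tied scores are not handled.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]  # strictest threshold: nothing called positive
    tp = fp = 0
    for _, label in sorted(zip(scores, labels), reverse=True):
        if label:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def area_under_curve(points):
    """Trapezoid-rule area under a curve given as (x, y) points sorted by x."""
    return sum(
        (x_next - x) * (y + y_next) / 2
        for (x, y), (x_next, y_next) in zip(points, points[1:])
    )


labels = [1, 1, 0, 0]
perfect_scores = [0.9, 0.8, 0.3, 0.1]    # every positive outranks every negative
imperfect_scores = [0.9, 0.4, 0.6, 0.1]  # one negative outranks one positive

print(area_under_curve(roc_points(perfect_scores, labels)))    # 1.0
print(area_under_curve(roc_points(imperfect_scores, labels)))  # 0.75
```

The second value equals the fraction of positive/negative pairs that the scores rank in the right order (3 of 4), a common way to interpret the area under the curve.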






Predictions are imperfect due to:
• Imperfect algorithms
• Imperfect data

Cross-Validation:

• Carefully monitor features that are useful across different independent data subsets

• This can be accomplished with N-fold cross-validation:

• Best feature selection and classification algorithms will yield best consistent performance across independent trials

• Best features will be consistently important across trials

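[Figure: 5-fold cross-validation; in each of Trials 1 to 5 a different subset of the data is held out for testing (Test) and the rest is used for training (Train)]

Model performance = mean predictive performance over 5 trials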
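N-fold cross-validation can be sketched in a few lines: split the samples into N folds, train on N - 1 of them, test on the fold left out, and average the scores. As assumptions for illustration, the model is a one-dimensional threshold classifier and the data are invented. The folds are taken in order, so real data should be shuffled first.

```python
def k_fold_indices(n_samples, n_folds=5):
    """Yield (train, test) index lists; every sample is tested exactly once."""
    indices = list(range(n_samples))
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


def cross_validate(samples, labels, train_fn, score_fn, n_folds=5):
    """Mean test score over the folds."""
    scores = []
    for train, test in k_fold_indices(len(samples), n_folds):
        model = train_fn([samples[i] for i in train], [labels[i] for i in train])
        scores.append(score_fn(model, [samples[i] for i in test], [labels[i] for i in test]))
    return sum(scores) / len(scores)


def train_threshold(values, labels):
    """Toy model: threshold halfway between the two class means."""
    mean_0 = sum(v for v, l in zip(values, labels) if l == 0) / labels.count(0)
    mean_1 = sum(v for v, l in zip(values, labels) if l == 1) / labels.count(1)
    return (mean_0 + mean_1) / 2


def score_threshold(threshold, values, labels):
    """Accuracy of predicting class 1 when the value exceeds the threshold."""
    return sum((v > threshold) == bool(l) for v, l in zip(values, labels)) / len(values)


# Hypothetical measurements, interleaved so each fold contains both classes.
samples = [1.0, 9.0, 2.0, 8.0, 1.5, 8.5, 2.5, 7.5, 1.2, 9.2]
labels = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]

print(cross_validate(samples, labels, train_threshold, score_threshold))
```

Because each trial tests on data the model never saw during training, the mean score is a fairer estimate of predictive performance than training-set accuracy.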





Data Mining: 6. Prediction & Iteration

Analysis is only useful if it is used; it only improves if it is tested

a) Good validation requires successful new predictions
b) Imperfect predictions can lead to method refinement and greater understanding

Questions?

Lushington in Silico

Geraldlushington3117 at aol.com

Geraldlushington.org

