Page 1: Object Orie’d Data Analysis, Last Time Organizational Matters

Object Orie’d Data Analysis, Last Time

• Organizational Matters: http://www.stat-or.unc.edu/webspace/courses/marron/UNCstor891OODA-2007/Stor891-07Home.html

• What is OODA?

• Visualization by Projection

• Object Space & Feature Space

• Curves as Data

• Data Representation Issues

• PCA visualization

Page 2

Data Object Conceptualization

Object Space: Curves, Images, Shapes, Trees

Feature Space: ℝ^d, Manifolds, Tree Space

Page 3

Functional Data Analysis, Toy EG I

Page 4

Easy way to do these analyses

Matlab software (user friendly?) available: http://www.stat.unc.edu/postscript/papers/marron/Matlab7Software/

Download & put in Matlab Path:
• General
• Smoothing

Look first at:
• curvdatSM.m
• scatplotSM.m

Page 5

Easy way to do these analyses

Next time: spend some time going through these, as many students seem to want to use them

Page 6

Time Series of Curves

• Again a “Set of Curves”

• But now Time Order is Important!

• An approach:

Use color to code for time (color scale runs from Start to End)

Page 7

Time Series Toy E.g.

Explore Question:

“Is Horizontal Motion Linear Variation?”

Example: Set of time shifted Gaussian densities

View: Code time with colors as above

Page 8

T. S. Toy E.g., Raw Data

Page 9

T. S. Toy E.g., PCA View

PCA gives “Modes of Variation”

But there are Many…

Intuitively Useful???

Like “harmonics”?

Isn’t there only 1 mode of variation?

Answer comes in 2-d scatterplots

Page 10

T. S. Toy E.g., PCA Scatterplot

Page 11

T. S. Toy E.g., PCA Scatterplot

• Where is the Point Cloud?

• Lies along a 1-d curve in ℝ^d

• So actually have 1-d mode of variation

• But a non-linear mode of variation

• Poorly captured by PCA (linear method)

• Will study more later
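A minimal sketch of the toy example, assuming a grid, a shift range and a unit standard deviation chosen purely for illustration (none of these values come from the slides). It puts time-shifted Gaussian densities in the columns of a data matrix, runs PCA through an SVD of the mean-centered matrix, and prints how the variance spreads over several components even though only one parameter (the shift) varies.

```python
import numpy as np

# Hypothetical toy setup: n Gaussian densities on a common grid, shifted over "time".
grid = np.linspace(-10.0, 10.0, 201)     # d = 201 grid points (assumed)
shifts = np.linspace(-4.0, 4.0, 41)      # n = 41 time points (assumed)

def gaussian_density(x, mu, sigma=1.0):
    """Gaussian density with mean mu and standard deviation sigma, evaluated at x."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

# Data matrix with columns as data vectors (d x n).
X = np.column_stack([gaussian_density(grid, mu) for mu in shifts])

# PCA via SVD of the data after subtracting the mean curve.
Xc = X - X.mean(axis=1, keepdims=True)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
var_explained = s ** 2 / np.sum(s ** 2)
scores = s[:, None] * Vt                 # row k holds the PC(k+1) scores of the n curves

print("share of variance, first 5 PCs:", np.round(var_explained[:5], 3))
```

Plotting `scores[0]` against `scores[1]`, colored by position in `shifts`, would give the kind of 2-d scatterplot in which the point cloud traces a single curve.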

Page 12

Chemo-metric Time Series

• Mass Spectrometry Measurements

• On an Aging Substance, called “Estane”

• Made over Logarithmic Time Grid, n = 60

• Each is a Spectrum

• What about Time Evolution?

• Approach: PCA & Time Coloring

Page 13

Chemo-metric Time Series

Joint Work w/ E. Kober & J. Wendelberger

Los Alamos National Lab

Four Experimental Conditions:

1. Control

2. Aged 59 days in Dry Air

3. Aged 27 days in Humid Air

4. Aged 59 days in Humid Air

Page 14

Chemo-metric Time Series, HA 27

Page 15

Chemo-metric Time Series, HA 27

Raw Data:
• All 60 spectra essentially the same
• “Scale” of mean is much bigger than variation about mean
• Hard to see structure of all 1600 freq’s

Centered Data:
• Now can see different spectra
• Since mean subtracted off
• Note much smaller vertical axis

Page 16

Chemo-metric Time Series, HA 27

Page 17

Chemo-metric Time Series, HA 27

Data zoomed to “important” freq’s:

Raw Data:
• Now see slight differences
• Smoother “natural looking” spectra

Centered Data:
• Differences in spectra more clear
• Maybe now have “real structure”

Scale is important

Page 18

Chemo-metric Time Series, HA 27

Page 19

Chemo-metric Time Series, HA 27

Use of Time Order Coloring:

Raw Data:
• Can see a little ordering, not much

Centered Data:
• Clear time ordering
• Shifting peaks? (compare to Raw)

PC1:
• Almost everything?

PC1 Residuals:
• Data nearly linear (same scale import’nt)
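The four views named on this slide (raw, centered, PC1, PC1 residuals) can be sketched as one decomposition. The data below are a synthetic stand-in, not the Estane spectra: the peak shapes, noise level and variable names are all assumptions made for illustration. Only the sizes (1600 frequencies, 60 spectra) are taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 1600, 60                          # frequencies x spectra, sizes as on the slides
freq = np.linspace(0.0, 1.0, d)
time = np.linspace(0.0, 1.0, n)

# Synthetic spectra (assumed): a large common mean plus a small time-driven change.
mean_spectrum = 100.0 * np.exp(-0.5 * ((freq - 0.5) / 0.05) ** 2)
drift = np.exp(-0.5 * ((freq - 0.55) / 0.05) ** 2)
X = mean_spectrum[:, None] + np.outer(drift, time) + 0.01 * rng.standard_normal((d, n))

# Centered data: subtract the mean spectrum (mean over the n columns).
mean_vec = X.mean(axis=1, keepdims=True)
Xc = X - mean_vec

# PC1 view: best rank-one approximation of the centered data; the rest is the residual.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
pc1_part = s[0] * np.outer(U[:, 0], Vt[0])
pc1_resid = Xc - pc1_part

for name, M in [("raw", X), ("centered", Xc), ("PC1", pc1_part), ("PC1 residual", pc1_resid)]:
    print(f"{name:>12}: vertical range = {M.max() - M.min():.3f}")
```

The printed ranges differ by orders of magnitude between views, which is one way to see why the slides stress plotting the views on comparable scales.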

Page 20

Chemo-metric Time Series, Control

Page 21

Chemo-metric Time Series, Control

PCA View

• Clear systematic structure

• Time ordering very important

• Reminiscent of Toy Example

• A clear 1-d curve in Feature Space

• Physical Explanation?

Page 22

Toy Data Explanations

Simple Chemical Reaction Model:
• Subst. 1 transforms into Subst. 2
• Note: linear path in Feature Space

Page 23

Toy Data Explanations

Richer Chemical Reaction Model:
• Subst. 1 → Subst. 2 → Subst. 3
• Curved path in Feat. Sp.
• 2 Reactions ⇒ Curve lies in 2-dim’al subsp.

Page 24

Toy Data Explanations

Another Chemical Reaction Model:
• Subst. 1 → Subst. 2 & Subst. 5 → Subst. 6
• Curved path in Feat. Sp.
• 2 Reactions ⇒ Curve lies in 2-dim’al subsp.

Page 25

Toy Data Explanations

More Complex Chemical Reaction Model:
• 1 → 2 → 3 → 4
• Curved path in Feat. Sp. (lives in 3-d)
• 3 Reactions ⇒ Curve lies in 3-dim’al subsp.

Page 26

Toy Data Explanations

Even More Complex Chemical Reaction Model:
• 1 → 2 → 3 → 4 → 5
• Curved path in Feat. Sp. (lives in 4-d)
• 4 Reactions ⇒ Curve lies in 4-dim’al subsp.
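The link between the number of reactions and the dimension of the path can be checked numerically. The sketch below assumes first-order kinetics with distinct rate constants, random pure-substance spectra and a logarithmic time grid. These modelling choices and all function names are hypothetical, since the slides do not specify the toy model in this detail. Because the concentrations always sum to one, a chain of m substances (m - 1 reactions) gives mean-centered spectra of rank m - 1.

```python
import numpy as np

def chain_concentrations(rates, times):
    """Concentrations of a first-order chain 1 -> 2 -> ... -> m, starting from pure substance 1."""
    m = len(rates) + 1
    K = np.zeros((m, m))
    for j, k in enumerate(rates):
        K[j, j] -= k        # substance at index j is consumed at rate k ...
        K[j + 1, j] += k    # ... and turns into the next substance in the chain
    evals, evecs = np.linalg.eig(K)      # distinct rates assumed, so K is diagonalizable
    c0 = np.zeros(m)
    c0[0] = 1.0
    coef = np.linalg.solve(evecs, c0)
    # c(t) = V exp(Lambda t) V^{-1} c0, evaluated for all times at once (m x n).
    return np.real(evecs @ (np.exp(np.outer(evals, times)) * coef[:, None]))

def path_dimension(rates, d=50, n=60, seed=0):
    """Numerical dimension of the subspace spanned by the centered mixture spectra."""
    rng = np.random.default_rng(seed)
    times = np.logspace(-2, 2, n)                # logarithmic time grid (assumed range)
    C = chain_concentrations(rates, times)       # m x n concentrations
    spectra = rng.random((d, C.shape[0]))        # hypothetical pure-substance spectra
    X = spectra @ C                              # d x n observed spectra, columns are data
    Xc = X - X.mean(axis=1, keepdims=True)
    s = np.linalg.svd(Xc, compute_uv=False)
    return int(np.sum(s > 1e-8 * s[0]))

for rates in ([1.0], [1.0, 0.3], [1.0, 0.3, 0.1], [1.0, 0.3, 0.1, 0.03]):
    print(len(rates), "reaction(s): path spans", path_dimension(rates), "dimension(s)")
```

With real data the singular values do not drop to zero, so counting reactions means separating small components from noise, which is the open problem raised for the Control data.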

Page 27

Chemo-metric Time Series, Control

Page 28

Chemo-metric Time Series, Control

Suggestions from Toy Examples:

• Clearly 3 reactions under way

• Maybe a 4th???

• Hard to distinguish from noise?

• Interesting statistical open problem!

Page 29

Chemo-metric Time Series

What about the other experiments? Recall:

1. Control

2. Aged 59 days in Dry Air

3. Aged 27 days in Humid Air

4. Aged 59 days in Humid Air

Above results were “cherry picked” to best make points

What about the other cases???

Page 30

Scatterplot Matrix, Control

Above E.g., maybe ~4d curve ⇒ ~4 reactions

Page 31

Scatterplot Matrix, Da59

PC2 is “bleeding of CO2”, discussed below

Page 32

Scatterplot Matrix, Ha27

Only “3-d + noise”? ⇒ Only 3 reactions

Page 33

Scatterplot Matrix, Ha59

Harder to judge???

Page 34

Object Space View, Control

Terrible discretization effect, despite ~4d …

Page 35

Object Space View, Da59

OK, except strange at beginning (CO2 …)

Page 36

Object Space View, Ha27

Strong structure in PC1 Resid (d < 2)

Page 37

Object Space View, Ha59

Lots at beginning, OK since “oldest”

Page 38

Problem with Da59

What about strange behavior for DA59?

Recall:

PC2 showed “really different behavior at start”

Chemists’ comments:

Ignore this, should have started measuring later…

Page 39

Problem with Da59

But still fun to look at broader spectra

Page 40

Chemo-metric T. S. Joint View

• Throw them all together as big population

• Take Point Cloud View

Page 41

Chemo-metric T. S. Joint View

Page 42

Chemo-metric T. S. Joint View

• Throw them all together as big population
• Take Point Cloud View
• Note 4d space of interest, driven by:
– 4 clusters (3d)
– PC1 of chemical reaction (1-d)
• But these don’t appear as the 4 PCs
– Chem. PC1 “spread over PC2,3,4”
– Essentially a “rotation of interesting dir’ns”
– How to “unrotate”???

Page 43

Chemo-metric T. S. Joint View

Interesting Variation:

Remove cluster means

Allows clear comparison of within curve variation

Page 44

Chemo-metric T. S. Joint View (- mean)

Page 45

Chemo-metric T. S. Joint View

Interesting Variation:

Remove cluster means

Allows clear comparison of within curve variation:
• PC1 versus others are quite revealing (note different “rotations”)
• Others don’t show so much
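Removing cluster means can be sketched as follows. The data are hypothetical: four groups stand in for the four experimental conditions, each with its own mean plus a shared within-group direction. The helper name `first_pc_share` is made up for this illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_per = 20, 60
labels = np.repeat(np.arange(4), n_per)          # 4 conditions (hypothetical labels 0..3)

# Hypothetical data: condition-specific means plus one shared within-condition direction.
cluster_means = 5.0 * rng.standard_normal((d, 4))
direction = rng.standard_normal(d)
t = np.tile(np.linspace(-1.0, 1.0, n_per), 4)
X = cluster_means[:, labels] + np.outer(direction, t) + 0.05 * rng.standard_normal((d, labels.size))

def first_pc_share(M):
    """Share of total variance captured by PC1 of the column-centered matrix M."""
    Mc = M - M.mean(axis=1, keepdims=True)
    s = np.linalg.svd(Mc, compute_uv=False)
    return s[0] ** 2 / np.sum(s ** 2)

# Remove each cluster's own mean instead of only the overall mean.
X_within = X.copy()
for g in np.unique(labels):
    cols = labels == g
    X_within[:, cols] -= X[:, cols].mean(axis=1, keepdims=True)

print("PC1 share, joint view:           ", round(first_pc_share(X), 3))
print("PC1 share, cluster means removed:", round(first_pc_share(X_within), 3))
```

In this synthetic setting the between-cluster differences dominate the joint PCA, while the within-curve direction becomes the leading component once the cluster means are gone.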

Page 46

Demography Data

Joint Work with: Andres Alonso

Univ. Carlos III, Madrid

• Mortality, as a function of age

• “Chance of dying”, for Males, in Spain

• of each 1-year age group

• Curves are years

• 1908 - 2002

• PCA of the family of curves

Page 47

Demography Data

• PCA of the family of curves for Males
• Babies & elderly “most mortal” (Raw)
• All getting better over time (Raw & PC1)
• Except 1918 - Influenza Pandemic
• (see Color Scale)
• Middle age most mortal (PC2):
– 1918
– Late 1930s - Spanish Civil War
– 1980 – 1994 (then better) auto wrecks
• Decade Rounding (several places)

Page 48

Demography Data

• PCA for Females in Spain

• Most aspects similar

• (see Color Scale)

• No War Changes
– Steady improvement until 70s (PC2)
– When auto accidents kicked in

Page 49

Demography Data

• PCA for Males in Switzerland

• Most aspects similar

• No decade rounding (better records)

• 1918 Flu – Different Color (PC2)

• (see Color Scale)

• No War Changes
– Steady improvement until 70s (PC2)
– When auto accidents kicked in

Page 50

Demography Data

• Dual PCA

• Idea: Rows and Columns trade places

• Terminology: from optimization

• Insights come from studying “primal” & “dual” problems

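Page 51

Primal / Dual PCA

Consider “Data Matrix”

$$
\begin{pmatrix}
x_{11} & \cdots & x_{1i} & \cdots & x_{1n} \\
\vdots &        & \vdots &        & \vdots \\
x_{j1} & \cdots & x_{ji} & \cdots & x_{jn} \\
\vdots &        & \vdots &        & \vdots \\
x_{d1} & \cdots & x_{di} & \cdots & x_{dn}
\end{pmatrix}
$$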

Page 52

Primal / Dual PCA

Consider “Data Matrix” ($d \times n$, entries $x_{ji}$ for rows $j = 1, \dots, d$ and columns $i = 1, \dots, n$)

Primal Analysis:

Columns are data vectors

Page 53

Primal / Dual PCA

Consider “Data Matrix” ($d \times n$, entries $x_{ji}$ for rows $j = 1, \dots, d$ and columns $i = 1, \dots, n$)

Dual Analysis:

Rows are data vectors

Page 54

Demography Data

• Dual PCA

• Idea: Rows and Columns trade places

• Demographic Primal View:

• Curves are Years, Coord’s are Ages

• Demographic Dual View:

• Curves are Ages, Coord’s are Years

Dual PCA View, Spanish Males

Page 55

Demography Data

Dual PCA View, Spanish Males

• Old people have const. mortality (raw)

• But improvement for rest (raw)

• Bad for 1918 (flu) & Spanish Civil War, but generally improving (mean)

• Improves for ages 1-6, then worse (PC1)

• Big Improvement for young (PC2)

• (Age Color Key)

Page 56

Primal / Dual PCA

Reference:

Gabriel, K. R. (1971) The biplot display of matrices with application to principal component analysis, Biometrika, 58, 453-467.

• Will study more later

• “Centering” is a critical issue
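A small numerical sketch of why centering matters when rows and columns trade places, using a random matrix as a hypothetical stand-in for a $d \times n$ data matrix. Without centering, the primal and dual analyses are two readings of one SVD and share their nonzero eigenvalues. With centering, each analysis subtracts a different mean, so the results no longer coincide.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 5, 8
X = rng.standard_normal((d, n))                  # hypothetical d x n data matrix

# Without centering, primal and dual analyses share one SVD: X = U diag(s) V^T.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
primal_evals = np.linalg.eigvalsh(X @ X.T)[::-1]         # columns as data vectors (d x d)
dual_evals = np.linalg.eigvalsh(X.T @ X)[::-1][:d]       # rows as data vectors (n x n), top d
print(np.allclose(primal_evals, s ** 2), np.allclose(dual_evals, s ** 2))

# With centering, the two analyses subtract different means and no longer agree.
X_primal = X - X.mean(axis=1, keepdims=True)     # subtract the mean column (mean data vector)
X_dual = X - X.mean(axis=0, keepdims=True)       # subtract the mean row
s_primal = np.linalg.svd(X_primal, compute_uv=False)
s_dual = np.linalg.svd(X_dual, compute_uv=False)
print("primal singular values:", np.round(s_primal, 3))
print("dual singular values:  ", np.round(s_dual, 3))
```

In the demography setting, the primal centering subtracts the mean mortality curve over years, while the dual centering subtracts the mean over ages.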

Page 57

Yeast Cell Cycle Data

• “Gene Expression” – Micro-array data

• Data (after major preprocessing): Expression “level” of:

• thousands of genes (d ~ 1,000s)

• but only dozens of “cases” (n ~ 10s)

• Interesting statistical issue:

High Dimension Low Sample Size data

(HDLSS)

Page 58

Yeast Cell Cycle Data

Data from:

Spellman, P. T., Sherlock, G., Zhang, M.Q., Iyer, V.R., Anders, K., Eisen, M.B., Brown, P.O., Botstein, D. and Futcher, B. (1998), “Comprehensive Identification of Cell Cycle-regulated Genes of the Yeast Saccharomyces cerevisiae by Microarray Hybridization”, Molecular Biology of the Cell, 9, 3273-3297.

Page 59

Yeast Cell Cycle Data

Analysis here is from:

Zhao, X., Marron, J.S. and Wells, M.T. (2004) The Functional Data View of Longitudinal Data, Statistica Sinica, 14, 789-808

Page 60

Yeast Cell Cycle Data

• Lab experiment:

• Chemically “synchronize cell cycles”, of yeast cells

• Do cDNA micro-arrays over time

• Used 18 time points, over “about 2 cell cycles”

• Studied 4,489 genes (whole genome)

• Time series view of data:

4,489 time series of length 18

• Functional Data View:

4,489 “curves”

Page 61

Yeast Cell Cycle Data, FDA View

Central question: Which genes are “periodic” over 2 cell cycles?
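One simple way to make “periodic over 2 cell cycles” concrete is to fit a single sinusoid with two cycles over the 18 time points and ask how much of a gene's variation it explains. This is only an illustrative score under the assumption of equally spaced time points. It is not claimed to be the analysis of Zhao, Marron and Wells (2004), and the function name and synthetic genes are hypothetical.

```python
import numpy as np

def periodicity_score(series, n_cycles=2.0):
    """Fraction of variance explained by a sinusoid with n_cycles over the series (R^2)."""
    y = np.asarray(series, dtype=float)
    n = y.size
    phase = 2.0 * np.pi * n_cycles * np.arange(n) / n    # equally spaced times assumed
    design = np.column_stack([np.ones(n), np.cos(phase), np.sin(phase)])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    fitted = design @ coef
    total = np.sum((y - y.mean()) ** 2)
    resid = np.sum((y - fitted) ** 2)
    return 1.0 - resid / total if total > 0 else 0.0

# Two synthetic "genes" of length 18: one periodic over 2 cycles, one pure noise.
rng = np.random.default_rng(3)
phase18 = 2.0 * np.pi * 2.0 * np.arange(18) / 18
periodic_gene = np.cos(phase18 - 0.7) + 0.1 * rng.standard_normal(18)
noise_gene = rng.standard_normal(18)

print("periodic gene score:", round(periodicity_score(periodic_gene), 3))
print("noise gene score:   ", round(periodicity_score(noise_gene), 3))
```

Applied to each of the 4,489 curves, such a score would give a ranking of genes by how well a two-cycle sinusoid fits them. Choosing a cutoff is a separate question.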