![Page 1: Object Orie’d Data Analysis, Last Time •Clustering –Quantify with Cluster Index –Simple 1-d examples –Local mininizers –Impact of outliers •SigClust –When](https://reader036.vdocuments.net/reader036/viewer/2022062303/551a687d550346b52d8b4bdb/html5/thumbnails/1.jpg)
Object Oriented Data Analysis, Last Time
• Clustering
– Quantify with Cluster Index
– Simple 1-d examples
– Local minimizers
– Impact of outliers
• SigClust
– When are clusters really there?
– Gaussian Null Distribution
– Which Gaussian for HDLSS settings?
![Page 2: Object Orie’d Data Analysis, Last Time •Clustering –Quantify with Cluster Index –Simple 1-d examples –Local mininizers –Impact of outliers •SigClust –When](https://reader036.vdocuments.net/reader036/viewer/2022062303/551a687d550346b52d8b4bdb/html5/thumbnails/2.jpg)
2-means Clustering
![Page 3: Object Orie’d Data Analysis, Last Time •Clustering –Quantify with Cluster Index –Simple 1-d examples –Local mininizers –Impact of outliers •SigClust –When](https://reader036.vdocuments.net/reader036/viewer/2022062303/551a687d550346b52d8b4bdb/html5/thumbnails/3.jpg)
2-means Clustering
![Page 4: Object Orie’d Data Analysis, Last Time •Clustering –Quantify with Cluster Index –Simple 1-d examples –Local mininizers –Impact of outliers •SigClust –When](https://reader036.vdocuments.net/reader036/viewer/2022062303/551a687d550346b52d8b4bdb/html5/thumbnails/4.jpg)
2-means Clustering
![Page 5: Object Orie’d Data Analysis, Last Time •Clustering –Quantify with Cluster Index –Simple 1-d examples –Local mininizers –Impact of outliers •SigClust –When](https://reader036.vdocuments.net/reader036/viewer/2022062303/551a687d550346b52d8b4bdb/html5/thumbnails/5.jpg)
2-means Clustering
![Page 6: Object Orie’d Data Analysis, Last Time •Clustering –Quantify with Cluster Index –Simple 1-d examples –Local mininizers –Impact of outliers •SigClust –When](https://reader036.vdocuments.net/reader036/viewer/2022062303/551a687d550346b52d8b4bdb/html5/thumbnails/6.jpg)
SigClust
• Statistical Significance of Clusters
• in HDLSS Data
• When is a cluster “really there”?
From Liu, et al. (2007)
![Page 7: Object Orie’d Data Analysis, Last Time •Clustering –Quantify with Cluster Index –Simple 1-d examples –Local mininizers –Impact of outliers •SigClust –When](https://reader036.vdocuments.net/reader036/viewer/2022062303/551a687d550346b52d8b4bdb/html5/thumbnails/7.jpg)
Interesting Statistical Problem
For HDLSS data:
When clusters seem to appear
E.g. found by clustering method
How do we know they are really there?
Question asked by Neil Hayes
Define appropriate statistical significance?
Can we calculate it?
![Page 8: Object Orie’d Data Analysis, Last Time •Clustering –Quantify with Cluster Index –Simple 1-d examples –Local mininizers –Impact of outliers •SigClust –When](https://reader036.vdocuments.net/reader036/viewer/2022062303/551a687d550346b52d8b4bdb/html5/thumbnails/8.jpg)
Simple Gaussian Example
Clearly only 1 Cluster in this Example
But Extreme Relabelling looks different
Extreme T-stat strongly significant
Indicates 2 clusters in data
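The effect described on this slide can be reproduced with a small sketch. It assumes 1-d data, reads "extreme relabelling" as splitting the sorted sample at its median, and uses a Welch t statistic; the helper name `two_sample_t` and the sample size are illustrative choices, not taken from the slides.

```python
import math
import random
import statistics

def two_sample_t(a, b):
    """Welch two-sample t statistic for the difference in means."""
    va, vb = statistics.variance(a), statistics.variance(b)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va / len(a) + vb / len(b))

rng = random.Random(0)
x = sorted(rng.gauss(0.0, 1.0) for _ in range(200))  # a single Gaussian cluster

# Assumed "extreme relabelling": lower half is class 1, upper half is class 2.
lower, upper = x[:100], x[100:]
t_extreme = two_sample_t(upper, lower)

# For comparison: a random relabelling of the same data.
shuffled = x[:]
rng.shuffle(shuffled)
t_random = two_sample_t(shuffled[:100], shuffled[100:])

print(f"t after extreme relabelling: {t_extreme:.1f}")
print(f"t after random relabelling:  {t_random:.1f}")
```

The extreme t statistic is large even though only one cluster is present, which is why a naive two-sample test cannot be applied to labels produced by the clustering itself.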
![Page 9: Object Orie’d Data Analysis, Last Time •Clustering –Quantify with Cluster Index –Simple 1-d examples –Local mininizers –Impact of outliers •SigClust –When](https://reader036.vdocuments.net/reader036/viewer/2022062303/551a687d550346b52d8b4bdb/html5/thumbnails/9.jpg)
Statistical Significance of Clusters
Basis of SigClust Approach:
What defines: A Cluster?
A Gaussian distribution (Sarle & Kuo 1993)
So define SigClust test based on:
2-means cluster index (measure) as statistic
Gaussian null distribution
Currently compute by simulation
Possible to do this analytically???
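A minimal sketch of the "compute by simulation" step, restricted to 1-d data as an assumption to keep it short. In 1-d the best 2-means split is a threshold on the sorted data, and the cluster index (within-class sum of squares over total sum of squares) does not change under shifts or rescaling, so simulating from N(0,1) is enough here. The function names and the choice of 200 simulations are hypothetical.

```python
import random

def two_means_ci_1d(values):
    """Smallest 2-means cluster index over all threshold splits of 1-d data."""
    x = sorted(values)
    n = len(x)
    total_sum = sum(x)
    mean = total_sum / n
    total_ss = sum((v - mean) ** 2 for v in x)
    best_within = total_ss
    left_sum = 0.0
    for m in range(1, n):  # left class is x[:m], right class is x[m:]
        left_sum += x[m - 1]
        right_sum = total_sum - left_sum
        # Within-class SS = total SS - between-class SS.
        between = (m * (left_sum / m - mean) ** 2
                   + (n - m) * (right_sum / (n - m) - mean) ** 2)
        best_within = min(best_within, total_ss - between)
    return best_within / total_ss

def sigclust_pvalue_1d(values, n_sim=200, seed=0):
    """Fraction of simulated Gaussian samples with a cluster index at least as small."""
    rng = random.Random(seed)
    n = len(values)
    observed = two_means_ci_1d(values)
    null = [two_means_ci_1d([rng.gauss(0.0, 1.0) for _ in range(n)])
            for _ in range(n_sim)]
    return sum(ci <= observed for ci in null) / n_sim

rng = random.Random(1)
bimodal = ([rng.gauss(-3.0, 1.0) for _ in range(50)]
           + [rng.gauss(3.0, 1.0) for _ in range(50)])
p_bimodal = sigclust_pvalue_1d(bimodal)
print("empirical p-value for two well separated groups:", p_bimodal)
```

A small cluster index means strong clustering, so the empirical p-value counts simulated indices that are at least as small as the observed one.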
![Page 10: Object Orie’d Data Analysis, Last Time •Clustering –Quantify with Cluster Index –Simple 1-d examples –Local mininizers –Impact of outliers •SigClust –When](https://reader036.vdocuments.net/reader036/viewer/2022062303/551a687d550346b52d8b4bdb/html5/thumbnails/10.jpg)
SigClust Statistic – 2-Means Cluster Index
Measure of non-Gaussianity:
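2-means Cluster Index:
$$CI = \frac{\sum_{k=1}^{2} \sum_{j \in C_k} \|x_j - \bar{x}_k\|^2}{\sum_{j=1}^{n} \|x_j - \bar{x}\|^2}$$
with Class Index Sets $C_1, C_2$ and Class Means $\bar{x}_k$ (overall mean $\bar{x}$)
“Within Class Var’n” / “Total Var’n”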
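As a sketch of the definition (within-class sum of squares divided by total sum of squares about the overall mean), the cluster index for a given labelling can be computed as below. The function name `cluster_index` and the toy points are illustrative assumptions.

```python
def cluster_index(points, labels):
    """Within-class sum of squares divided by total sum of squares.

    points: list of equal-length lists of floats; labels: one class label per point.
    """
    def mean(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]

    def sum_sq(rows, center):
        return sum(sum((v - c) ** 2 for v, c in zip(row, center)) for row in rows)

    total = sum_sq(points, mean(points))
    within = 0.0
    for k in set(labels):
        members = [p for p, lab in zip(points, labels) if lab == k]
        within += sum_sq(members, mean(members))
    return within / total

points = [[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [12.0, 0.0]]
ci_good = cluster_index(points, [0, 0, 1, 1])  # labels match the two groups
ci_bad = cluster_index(points, [0, 1, 0, 1])   # labels cut across the groups
print(ci_good, ci_bad)
```

A labelling that matches the visible groups gives an index near 0, while a labelling that cuts across them gives an index near 1.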
![Page 11: Object Orie’d Data Analysis, Last Time •Clustering –Quantify with Cluster Index –Simple 1-d examples –Local mininizers –Impact of outliers •SigClust –When](https://reader036.vdocuments.net/reader036/viewer/2022062303/551a687d550346b52d8b4bdb/html5/thumbnails/11.jpg)
SigClust Gaussian null distribut’n
Estimated Mean (of Gaussian dist’n)?
1st Key Idea: Can ignore this
By appealing to shift invariance of CI
When data are (rigidly) shifted
CI remains the same
So enough to simulate with mean 0
Other uses of invariance ideas?
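The shift invariance can be checked in one line. Writing $\bar{x}_k$ for the mean of class $C_k$ and $\bar{x}$ for the overall mean (notation assumed here), shifting every data point by the same vector $a$ shifts both means by $a$, so every difference in the cluster index is unchanged:

```latex
\begin{align*}
x_j \mapsto x_j + a
  &\;\Longrightarrow\;
  \bar{x}_k \mapsto \bar{x}_k + a, \qquad \bar{x} \mapsto \bar{x} + a, \\
\|(x_j + a) - (\bar{x}_k + a)\|^2 &= \|x_j - \bar{x}_k\|^2, \\
\|(x_j + a) - (\bar{x} + a)\|^2 &= \|x_j - \bar{x}\|^2 .
\end{align*}
```

Numerator and denominator of CI are therefore both unchanged, for any labelling.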
![Page 12: Object Orie’d Data Analysis, Last Time •Clustering –Quantify with Cluster Index –Simple 1-d examples –Local mininizers –Impact of outliers •SigClust –When](https://reader036.vdocuments.net/reader036/viewer/2022062303/551a687d550346b52d8b4bdb/html5/thumbnails/12.jpg)
SigClust Gaussian null distribut’n
Challenge: how to estimate cov. matrix?
Number of parameters: $\frac{d(d+1)}{2}$
E.g. Perou 500 data:
Dimension $d = 9674$
so $\frac{d(d+1)}{2} = 46{,}797{,}975$
But Sample Size $n = 533$
Impossible in HDLSS settings????
Way around this problem?
![Page 13: Object Orie’d Data Analysis, Last Time •Clustering –Quantify with Cluster Index –Simple 1-d examples –Local mininizers –Impact of outliers •SigClust –When](https://reader036.vdocuments.net/reader036/viewer/2022062303/551a687d550346b52d8b4bdb/html5/thumbnails/13.jpg)
SigClust Gaussian null distribut’n
2nd Key Idea: Mod Out Rotations
Replace full Cov. by diagonal matrix
As done in PCA eigen-analysis: $\Sigma = M D M^t$
But then “not like data”???
OK, since k-means clustering (i.e. CI) is rotation invariant
(assuming e.g. Euclidean Distance)
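A short derivation of why rotating is harmless, under the assumption that $M$ is the orthogonal eigenvector matrix ($M^t M = M M^t = I$) and $D$ the diagonal eigenvalue matrix of the covariance $\Sigma$. The symbols $y_j$ and $\bar{y}_k$ for the rotated data and class means are introduced only for this sketch.

```latex
\begin{align*}
y_j &= M^t x_j, \qquad \bar{y}_k = M^t \bar{x}_k, \\
\|y_j - \bar{y}_k\|^2
  &= (x_j - \bar{x}_k)^t M M^t (x_j - \bar{x}_k)
   = \|x_j - \bar{x}_k\|^2, \\
\operatorname{Cov}(M^t X) &= M^t \Sigma M = M^t M D M^t M = D .
\end{align*}
```

Euclidean distances, and hence CI, are the same for the rotated data, while the rotated data have the diagonal covariance $D$.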
![Page 14: Object Orie’d Data Analysis, Last Time •Clustering –Quantify with Cluster Index –Simple 1-d examples –Local mininizers –Impact of outliers •SigClust –When](https://reader036.vdocuments.net/reader036/viewer/2022062303/551a687d550346b52d8b4bdb/html5/thumbnails/14.jpg)
SigClust Gaussian null distribut’n
2nd Key Idea: Mod Out Rotations
Only need to estimate diagonal matrix
But still have HDLSS problems?
E.g. Perou 500 data:
Dimension $d = 9674$
Sample Size $n = 533$
Still need to estimate $d = 9674$ param’s
![Page 15: Object Orie’d Data Analysis, Last Time •Clustering –Quantify with Cluster Index –Simple 1-d examples –Local mininizers –Impact of outliers •SigClust –When](https://reader036.vdocuments.net/reader036/viewer/2022062303/551a687d550346b52d8b4bdb/html5/thumbnails/15.jpg)
SigClust Gaussian null distribut’n
3rd Key Idea: Factor Analysis Model
Model Covariance as: Biology + Noise
$\Sigma = \Sigma_B + \sigma_N^2 I$
Where
$\Sigma_B$ is “fairly low dimensional”
$\sigma_N^2$ is estimated from background noise
![Page 16: Object Orie’d Data Analysis, Last Time •Clustering –Quantify with Cluster Index –Simple 1-d examples –Local mininizers –Impact of outliers •SigClust –When](https://reader036.vdocuments.net/reader036/viewer/2022062303/551a687d550346b52d8b4bdb/html5/thumbnails/16.jpg)
SigClust Gaussian null distribut’n
Estimation of Background Noise $\sigma_N^2$:
Reasonable model (for each gene):
Expression = Signal + Noise
“noise” is roughly Gaussian
“noise” terms essentially independent
(across genes)
![Page 17: Object Orie’d Data Analysis, Last Time •Clustering –Quantify with Cluster Index –Simple 1-d examples –Local mininizers –Impact of outliers •SigClust –When](https://reader036.vdocuments.net/reader036/viewer/2022062303/551a687d550346b52d8b4bdb/html5/thumbnails/17.jpg)
SigClust Gaussian null distribut’n
Estimation of Background Noise $\sigma_N^2$:
Model OK, since
data come from
light intensities
at colored spots
![Page 18: Object Orie’d Data Analysis, Last Time •Clustering –Quantify with Cluster Index –Simple 1-d examples –Local mininizers –Impact of outliers •SigClust –When](https://reader036.vdocuments.net/reader036/viewer/2022062303/551a687d550346b52d8b4bdb/html5/thumbnails/18.jpg)
SigClust Gaussian null distribut’n
Estimation of Background Noise $\sigma_N^2$:
For all expression values (as numbers)
Use robust estimate of scale
Median Absolute Deviation (MAD)
(from the median)
Rescale to put on same scale as s. d.:
$\hat{\sigma}_N = \frac{MAD_{data}}{MAD_{N(0,1)}}$
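A minimal sketch of this rescaled MAD estimate. It uses the fact that the MAD of N(0,1) equals its 0.75 quantile (about 0.6745). The simulated "expression values", the contamination level, and the function names are illustrative assumptions.

```python
import random
import statistics

def mad(values):
    """Median absolute deviation from the median."""
    med = statistics.median(values)
    return statistics.median(abs(v - med) for v in values)

# MAD of the standard normal N(0, 1): its 0.75 quantile, about 0.6745.
MAD_STD_NORMAL = statistics.NormalDist().inv_cdf(0.75)

def noise_sd_estimate(values):
    """MAD of the data, rescaled to be comparable to a standard deviation."""
    return mad(values) / MAD_STD_NORMAL

rng = random.Random(1)
noise = [rng.gauss(0.0, 2.0) for _ in range(5000)]  # background noise, s.d. 2
contaminated = noise + [50.0] * 100                 # plus a few large "signal" values

sd_hat = noise_sd_estimate(contaminated)
sd_naive = statistics.stdev(contaminated)
print(f"rescaled MAD: {sd_hat:.2f}, ordinary s.d.: {sd_naive:.2f}")
```

The rescaled MAD stays close to the noise level, while the ordinary standard deviation is inflated by the large values, which is the reason for preferring a robust scale estimate here.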
![Page 19: Object Orie’d Data Analysis, Last Time •Clustering –Quantify with Cluster Index –Simple 1-d examples –Local mininizers –Impact of outliers •SigClust –When](https://reader036.vdocuments.net/reader036/viewer/2022062303/551a687d550346b52d8b4bdb/html5/thumbnails/19.jpg)
SigClust Estimation of Background Noise
![Page 20: Object Orie’d Data Analysis, Last Time •Clustering –Quantify with Cluster Index –Simple 1-d examples –Local mininizers –Impact of outliers •SigClust –When](https://reader036.vdocuments.net/reader036/viewer/2022062303/551a687d550346b52d8b4bdb/html5/thumbnails/20.jpg)
Q-Q plots
An aside:
Fitting probability distributions to data
• Does Gaussian distribution “fit”???
• If not, why not?
• Fit in some part of the distribution?
(e.g. in the middle only?)
![Page 21: Object Orie’d Data Analysis, Last Time •Clustering –Quantify with Cluster Index –Simple 1-d examples –Local mininizers –Impact of outliers •SigClust –When](https://reader036.vdocuments.net/reader036/viewer/2022062303/551a687d550346b52d8b4bdb/html5/thumbnails/21.jpg)
Q-Q plots
Approaches to:
Fitting probability distributions to data
• Histograms
• Kernel Density Estimates
Drawbacks: often not best view
(for determining goodness of fit)
![Page 22: Object Orie’d Data Analysis, Last Time •Clustering –Quantify with Cluster Index –Simple 1-d examples –Local mininizers –Impact of outliers •SigClust –When](https://reader036.vdocuments.net/reader036/viewer/2022062303/551a687d550346b52d8b4bdb/html5/thumbnails/22.jpg)
Q-Q plots
Simple Toy Example, non-Gaussian!
![Page 23: Object Orie’d Data Analysis, Last Time •Clustering –Quantify with Cluster Index –Simple 1-d examples –Local mininizers –Impact of outliers •SigClust –When](https://reader036.vdocuments.net/reader036/viewer/2022062303/551a687d550346b52d8b4bdb/html5/thumbnails/23.jpg)
Q-Q plots
Simple Toy Example, non-Gaussian(?)
![Page 24: Object Orie’d Data Analysis, Last Time •Clustering –Quantify with Cluster Index –Simple 1-d examples –Local mininizers –Impact of outliers •SigClust –When](https://reader036.vdocuments.net/reader036/viewer/2022062303/551a687d550346b52d8b4bdb/html5/thumbnails/24.jpg)
Q-Q plots
Simple Toy Example, Gaussian
![Page 25: Object Orie’d Data Analysis, Last Time •Clustering –Quantify with Cluster Index –Simple 1-d examples –Local mininizers –Impact of outliers •SigClust –When](https://reader036.vdocuments.net/reader036/viewer/2022062303/551a687d550346b52d8b4bdb/html5/thumbnails/25.jpg)
Q-Q plots
Simple Toy Example, Gaussian?
![Page 26: Object Orie’d Data Analysis, Last Time •Clustering –Quantify with Cluster Index –Simple 1-d examples –Local mininizers –Impact of outliers •SigClust –When](https://reader036.vdocuments.net/reader036/viewer/2022062303/551a687d550346b52d8b4bdb/html5/thumbnails/26.jpg)
Q-Q plots
Notes:
• Bimodal: can see non-Gaussianity with histogram
• Other cases: hard to see
• Conclude:
Histogram poor at assessing Gauss’ity
Kernel density estimate any better?
![Page 27: Object Orie’d Data Analysis, Last Time •Clustering –Quantify with Cluster Index –Simple 1-d examples –Local mininizers –Impact of outliers •SigClust –When](https://reader036.vdocuments.net/reader036/viewer/2022062303/551a687d550346b52d8b4bdb/html5/thumbnails/27.jpg)
Q-Q plots
Kernel Density Estimate, non-Gaussian!
![Page 28: Object Orie’d Data Analysis, Last Time •Clustering –Quantify with Cluster Index –Simple 1-d examples –Local mininizers –Impact of outliers •SigClust –When](https://reader036.vdocuments.net/reader036/viewer/2022062303/551a687d550346b52d8b4bdb/html5/thumbnails/28.jpg)
Q-Q plots
Kernel Density Estimate, Gaussian
![Page 29: Object Orie’d Data Analysis, Last Time •Clustering –Quantify with Cluster Index –Simple 1-d examples –Local mininizers –Impact of outliers •SigClust –When](https://reader036.vdocuments.net/reader036/viewer/2022062303/551a687d550346b52d8b4bdb/html5/thumbnails/29.jpg)
Q-Q plots
KDE (undersmoothed), Gaussian
![Page 30: Object Orie’d Data Analysis, Last Time •Clustering –Quantify with Cluster Index –Simple 1-d examples –Local mininizers –Impact of outliers •SigClust –When](https://reader036.vdocuments.net/reader036/viewer/2022062303/551a687d550346b52d8b4bdb/html5/thumbnails/30.jpg)
Q-Q plots
KDE (oversmoothed), Gaussian
![Page 31: Object Orie’d Data Analysis, Last Time •Clustering –Quantify with Cluster Index –Simple 1-d examples –Local mininizers –Impact of outliers •SigClust –When](https://reader036.vdocuments.net/reader036/viewer/2022062303/551a687d550346b52d8b4bdb/html5/thumbnails/31.jpg)
Q-Q plots
Kernel Density Estimate, Gaussian
![Page 32: Object Orie’d Data Analysis, Last Time •Clustering –Quantify with Cluster Index –Simple 1-d examples –Local mininizers –Impact of outliers •SigClust –When](https://reader036.vdocuments.net/reader036/viewer/2022062303/551a687d550346b52d8b4bdb/html5/thumbnails/32.jpg)
Q-Q plots
Kernel Density Estimate, Gaussian?
![Page 33: Object Orie’d Data Analysis, Last Time •Clustering –Quantify with Cluster Index –Simple 1-d examples –Local mininizers –Impact of outliers •SigClust –When](https://reader036.vdocuments.net/reader036/viewer/2022062303/551a687d550346b52d8b4bdb/html5/thumbnails/33.jpg)
Q-Q plots
Histogram poor at assessing Gauss’ity
Kernel density estimate any better?
• Unfortunately doesn’t seem to be
• Really need a better viewpoint
Interesting to compare to:
• Gaussian Distribution
• Fit by Maximum Likelihood (avg. & s.d.)
![Page 34: Object Orie’d Data Analysis, Last Time •Clustering –Quantify with Cluster Index –Simple 1-d examples –Local mininizers –Impact of outliers •SigClust –When](https://reader036.vdocuments.net/reader036/viewer/2022062303/551a687d550346b52d8b4bdb/html5/thumbnails/34.jpg)
Q-Q plots
KDE vs. Max. Lik. Gaussian Fit, Gaussian?
![Page 35: Object Orie’d Data Analysis, Last Time •Clustering –Quantify with Cluster Index –Simple 1-d examples –Local mininizers –Impact of outliers •SigClust –When](https://reader036.vdocuments.net/reader036/viewer/2022062303/551a687d550346b52d8b4bdb/html5/thumbnails/35.jpg)
Q-Q plots
KDE vs. Max. Lik. Gaussian Fit, Gaussian?
• Looks OK?
• Many might think so…
• Strange feature:
– Peak higher than Gaussian fit
– Usually lower, due to smoothing bias
– Suggests non-Gaussian?
• Dare to conclude non-Gaussian?
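The peak comparison on this slide can be sketched as follows. The data are drawn from the mixture 0.7 N(0,1) + 0.3 N(0,0.5²), which the slides later give for the "Gaussian?" example (reading the second argument as a variance). The Gaussian kernel, the rule-of-thumb bandwidth 1.06·s.d.·n^(-1/5), the evaluation grid, and the function names are assumptions of this sketch, not details taken from the slides.

```python
import math
import random
import statistics

def gaussian_kde(data, x, h):
    """Kernel density estimate at x with a Gaussian kernel of bandwidth h."""
    norm = len(data) * h * math.sqrt(2.0 * math.pi)
    return sum(math.exp(-0.5 * ((x - v) / h) ** 2) for v in data) / norm

def peak_comparison(data, grid_size=41):
    mu = statistics.fmean(data)
    sd = statistics.pstdev(data)         # maximum-likelihood s.d. (divides by n)
    h = 1.06 * sd * len(data) ** (-0.2)  # assumed rule-of-thumb bandwidth
    grid = [mu + sd * (-4.0 + 8.0 * i / (grid_size - 1)) for i in range(grid_size)]
    kde_peak = max(gaussian_kde(data, x, h) for x in grid)
    fit_peak = 1.0 / (sd * math.sqrt(2.0 * math.pi))  # Gaussian density at its mean
    return kde_peak, fit_peak

rng = random.Random(2)

def draw_scale_mixture():
    """One draw from 0.7 N(0, 1) + 0.3 N(0, 0.5^2)."""
    return rng.gauss(0.0, 1.0 if rng.random() < 0.7 else 0.5)

data = [draw_scale_mixture() for _ in range(10000)]
kde_peak, fit_peak = peak_comparison(data)
print(f"KDE peak: {kde_peak:.3f}, Gaussian-fit peak: {fit_peak:.3f}")
```

Smoothing pulls a KDE peak down, so a KDE peak that still sits above the fitted Gaussian peak hints at non-Gaussianity, although the visual difference is modest.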
![Page 36: Object Orie’d Data Analysis, Last Time •Clustering –Quantify with Cluster Index –Simple 1-d examples –Local mininizers –Impact of outliers •SigClust –When](https://reader036.vdocuments.net/reader036/viewer/2022062303/551a687d550346b52d8b4bdb/html5/thumbnails/36.jpg)
Q-Q plots
KDE vs. Max. Lik. Gaussian Fit, Gaussian
![Page 37: Object Orie’d Data Analysis, Last Time •Clustering –Quantify with Cluster Index –Simple 1-d examples –Local mininizers –Impact of outliers •SigClust –When](https://reader036.vdocuments.net/reader036/viewer/2022062303/551a687d550346b52d8b4bdb/html5/thumbnails/37.jpg)
Q-Q plots
KDE vs. Max. Lik. Gaussian Fit, Gaussian
• Substantially more noise
– Because of smaller sample size
– n is only 1000 …
• Peak is lower than Gaussian fit
– Consistent with Gaussianity
• Weak view for assessing Gaussianity
![Page 38: Object Orie’d Data Analysis, Last Time •Clustering –Quantify with Cluster Index –Simple 1-d examples –Local mininizers –Impact of outliers •SigClust –When](https://reader036.vdocuments.net/reader036/viewer/2022062303/551a687d550346b52d8b4bdb/html5/thumbnails/38.jpg)
Q-Q plots
KDE vs. Max. Lik. Gauss., non-Gaussian(?)
![Page 39: Object Orie’d Data Analysis, Last Time •Clustering –Quantify with Cluster Index –Simple 1-d examples –Local mininizers –Impact of outliers •SigClust –When](https://reader036.vdocuments.net/reader036/viewer/2022062303/551a687d550346b52d8b4bdb/html5/thumbnails/39.jpg)
Q-Q plots
KDE vs. Max. Lik. Gau’n Fit, non-Gaussian(?)
• Still have large noise
• But peak clearly way too high
• Seems can conclude non-Gaussian???
![Page 40: Object Orie’d Data Analysis, Last Time •Clustering –Quantify with Cluster Index –Simple 1-d examples –Local mininizers –Impact of outliers •SigClust –When](https://reader036.vdocuments.net/reader036/viewer/2022062303/551a687d550346b52d8b4bdb/html5/thumbnails/40.jpg)
Q-Q plots
• Conclusion:
KDE poor for assessing Gaussianity
How about a SiZer approach?
![Page 41: Object Orie’d Data Analysis, Last Time •Clustering –Quantify with Cluster Index –Simple 1-d examples –Local mininizers –Impact of outliers •SigClust –When](https://reader036.vdocuments.net/reader036/viewer/2022062303/551a687d550346b52d8b4bdb/html5/thumbnails/41.jpg)
Q-Q plots
SiZer Analysis, non-Gaussian(?)
![Page 42: Object Orie’d Data Analysis, Last Time •Clustering –Quantify with Cluster Index –Simple 1-d examples –Local mininizers –Impact of outliers •SigClust –When](https://reader036.vdocuments.net/reader036/viewer/2022062303/551a687d550346b52d8b4bdb/html5/thumbnails/42.jpg)
Q-Q plots
SiZer Analysis, non-Gaussian(?)
• Can only find one mode
• Consistent with Gaussianity
• But no guarantee
• Multi-modal implies non-Gaussianity
• But converse is not true
• SiZer is good at finding multi-modality
• SiZer is poor at checking Gaussianity
![Page 43: Object Orie’d Data Analysis, Last Time •Clustering –Quantify with Cluster Index –Simple 1-d examples –Local mininizers –Impact of outliers •SigClust –When](https://reader036.vdocuments.net/reader036/viewer/2022062303/551a687d550346b52d8b4bdb/html5/thumbnails/43.jpg)
Q-Q plots
Standard approach to checking Gaussianity
• QQ – plots
Background:
Graphical Goodness of Fit
Fisher (1983)
![Page 44: Object Orie’d Data Analysis, Last Time •Clustering –Quantify with Cluster Index –Simple 1-d examples –Local mininizers –Impact of outliers •SigClust –When](https://reader036.vdocuments.net/reader036/viewer/2022062303/551a687d550346b52d8b4bdb/html5/thumbnails/44.jpg)
Q-Q plots
Background: Graphical Goodness of Fit
Basis:
Cumulative Distribution Function (CDF)
$F(x) = P\{X \le x\}$
Probability quantile notation:
$p = F(q)$, $q = F^{-1}(p)$
for "probability" $p$ and "quantile" $q$
![Page 45: Object Orie’d Data Analysis, Last Time •Clustering –Quantify with Cluster Index –Simple 1-d examples –Local mininizers –Impact of outliers •SigClust –When](https://reader036.vdocuments.net/reader036/viewer/2022062303/551a687d550346b52d8b4bdb/html5/thumbnails/45.jpg)
Q-Q plots
Probability quantile notation:
$p = F(q)$, $q = F^{-1}(p)$
for "probability" $p$ and "quantile" $q$
Thus $F^{-1}$ is called the quantile function
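As a small illustration of the probability/quantile notation, the sketch below uses the standard normal as the theoretical distribution (an arbitrary choice for the example) and Python's `statistics.NormalDist`, whose `cdf` and `inv_cdf` methods play the roles of $F$ and $F^{-1}$.

```python
from statistics import NormalDist

F = NormalDist(mu=0.0, sigma=1.0)  # theoretical distribution, here N(0, 1)

q = 1.0
p = F.cdf(q)           # p = F(q) = P{X <= q}
q_back = F.inv_cdf(p)  # q = F^{-1}(p), the quantile function

median = F.inv_cdf(0.5)           # the 0.5 quantile
upper_quartile = F.inv_cdf(0.75)  # the 0.75 quantile

print(f"p = F({q}) = {p:.4f}, F^-1(p) = {q_back:.4f}")
print(f"median = {median:.4f}, upper quartile = {upper_quartile:.4f}")
```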
![Page 46: Object Orie’d Data Analysis, Last Time •Clustering –Quantify with Cluster Index –Simple 1-d examples –Local mininizers –Impact of outliers •SigClust –When](https://reader036.vdocuments.net/reader036/viewer/2022062303/551a687d550346b52d8b4bdb/html5/thumbnails/46.jpg)
Q-Q plots
Two types of CDF:
1. Theoretical: $p = F(q) = P\{X \le q\}$
2. Empirical, based on data $X_1, \dots, X_n$: $\hat{p} = \hat{F}(q) = \frac{\#\{i : X_i \le q\}}{n}$
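The empirical CDF is simply the fraction of data points at or below $q$. A minimal sketch, with a hypothetical helper name `empirical_cdf` and a toy data set:

```python
from bisect import bisect_right

def empirical_cdf(data):
    """Return F_hat with F_hat(q) = #{i : X_i <= q} / n."""
    xs = sorted(data)
    n = len(xs)
    # bisect_right counts how many sorted values are <= q.
    return lambda q: bisect_right(xs, q) / n

F_hat = empirical_cdf([3.1, 0.4, 2.2, 2.2, 5.0])
print(F_hat(0.0), F_hat(2.2), F_hat(3.0), F_hat(5.0))
```

The result is a step function that jumps by 1/n at each data point (by more where values are tied).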
![Page 47: Object Orie’d Data Analysis, Last Time •Clustering –Quantify with Cluster Index –Simple 1-d examples –Local mininizers –Impact of outliers •SigClust –When](https://reader036.vdocuments.net/reader036/viewer/2022062303/551a687d550346b52d8b4bdb/html5/thumbnails/47.jpg)
Q-Q plots
Direct Visualizations:
1. Empirical CDF plot:
plot $\hat{p} = \frac{i}{n}$, $i = 1, \dots, n$
vs. grid of $\hat{q}$ (sorted data) values
2. Quantile plot (inverse):
plot $\hat{q}$
vs. $\hat{p}$
![Page 48: Object Orie’d Data Analysis, Last Time •Clustering –Quantify with Cluster Index –Simple 1-d examples –Local mininizers –Impact of outliers •SigClust –When](https://reader036.vdocuments.net/reader036/viewer/2022062303/551a687d550346b52d8b4bdb/html5/thumbnails/48.jpg)
Q-Q plots
Comparison Visualizations:
(compare a theoretical with an empirical)
3. P-P plot:
plot $\hat{p}$ vs. $p$
for a grid of $q$ values
4. Q-Q plot:
plot $\hat{q}$ vs. $q$
for a grid of $p$ values
![Page 49: Object Orie’d Data Analysis, Last Time •Clustering –Quantify with Cluster Index –Simple 1-d examples –Local mininizers –Impact of outliers •SigClust –When](https://reader036.vdocuments.net/reader036/viewer/2022062303/551a687d550346b52d8b4bdb/html5/thumbnails/49.jpg)
Q-Q plots
Illustrative graphic (toy data set):
![Page 50: Object Orie’d Data Analysis, Last Time •Clustering –Quantify with Cluster Index –Simple 1-d examples –Local mininizers –Impact of outliers •SigClust –When](https://reader036.vdocuments.net/reader036/viewer/2022062303/551a687d550346b52d8b4bdb/html5/thumbnails/50.jpg)
Q-Q plots
Empirical Quantiles (sorted data points)
(Figure labels: $\hat{q}_1, \hat{q}_2, \hat{q}_3, \hat{q}_4, \hat{q}_5$)
![Page 51: Object Orie’d Data Analysis, Last Time •Clustering –Quantify with Cluster Index –Simple 1-d examples –Local mininizers –Impact of outliers •SigClust –When](https://reader036.vdocuments.net/reader036/viewer/2022062303/551a687d550346b52d8b4bdb/html5/thumbnails/51.jpg)
Q-Q plots
Corresponding (matched) Theoretical Quantiles
(Figure labels: empirical quantiles $\hat{q}_1, \dots, \hat{q}_5$; theoretical quantiles $q_1, \dots, q_5$; probability $p$)
![Page 52: Object Orie’d Data Analysis, Last Time •Clustering –Quantify with Cluster Index –Simple 1-d examples –Local mininizers –Impact of outliers •SigClust –When](https://reader036.vdocuments.net/reader036/viewer/2022062303/551a687d550346b52d8b4bdb/html5/thumbnails/52.jpg)
Q-Q plots
Illustrative graphic (toy data set):
Main goal of Q-Q Plot:
Display how well quantiles compare
$\hat{q}_i$ vs. $q_i$, $i = 1, \dots, n$
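A sketch of how the matched pairs for a Gaussian Q-Q plot could be computed. Two assumptions are made here: the theoretical distribution is the Gaussian fitted by average and s.d., and the plotting positions are $(i - 0.5)/n$ rather than $i/n$, because $p = 1$ would give an infinite Gaussian quantile. The function name `qq_points` is hypothetical.

```python
from statistics import NormalDist, fmean, pstdev

def qq_points(data):
    """Pairs (theoretical quantile q_i, empirical quantile q_hat_i) for a Gaussian fit."""
    xs = sorted(data)                        # empirical quantiles q_hat_i
    n = len(xs)
    fit = NormalDist(fmean(xs), pstdev(xs))  # Gaussian fit by average and s.d.
    # Assumed plotting positions (i - 0.5) / n, i = 1, ..., n.
    theo = [fit.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theo, xs))

pairs = qq_points([4.0, 1.0, 3.0, 5.0, 2.0])
for q_theo, q_emp in pairs:
    print(f"theoretical {q_theo:6.3f}   empirical {q_emp:6.3f}")
```

Plotting the second coordinate against the first gives the Q-Q curve. Points close to the 45 degree line indicate agreement between empirical and theoretical quantiles.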
![Page 53: Object Orie’d Data Analysis, Last Time •Clustering –Quantify with Cluster Index –Simple 1-d examples –Local mininizers –Impact of outliers •SigClust –When](https://reader036.vdocuments.net/reader036/viewer/2022062303/551a687d550346b52d8b4bdb/html5/thumbnails/53.jpg)
Q-Q plots
Illustrative graphic (toy data set):
(Figure labels: empirical quantiles $\hat{q}_1, \dots, \hat{q}_5$; theoretical quantiles $q_1, \dots, q_5$)
![Page 54: Object Orie’d Data Analysis, Last Time •Clustering –Quantify with Cluster Index –Simple 1-d examples –Local mininizers –Impact of outliers •SigClust –When](https://reader036.vdocuments.net/reader036/viewer/2022062303/551a687d550346b52d8b4bdb/html5/thumbnails/54.jpg)
Q-Q plots
Illustrative graphic (toy data set):
(Figure labels: empirical quantiles $\hat{q}_1, \dots, \hat{q}_5$; theoretical quantiles $q_1, \dots, q_5$)
![Page 55: Object Orie’d Data Analysis, Last Time •Clustering –Quantify with Cluster Index –Simple 1-d examples –Local mininizers –Impact of outliers •SigClust –When](https://reader036.vdocuments.net/reader036/viewer/2022062303/551a687d550346b52d8b4bdb/html5/thumbnails/55.jpg)
Q-Q plots
Illustrative graphic (toy data set):
(Figure labels: empirical quantiles $\hat{q}_1, \dots, \hat{q}_5$; theoretical quantiles $q_1, \dots, q_5$)
![Page 56: Object Orie’d Data Analysis, Last Time •Clustering –Quantify with Cluster Index –Simple 1-d examples –Local mininizers –Impact of outliers •SigClust –When](https://reader036.vdocuments.net/reader036/viewer/2022062303/551a687d550346b52d8b4bdb/html5/thumbnails/56.jpg)
Q-Q plots
Illustrative graphic (toy data set):
(Figure labels: empirical quantiles $\hat{q}_1, \dots, \hat{q}_5$; theoretical quantiles $q_1, \dots, q_5$)
![Page 57: Object Orie’d Data Analysis, Last Time •Clustering –Quantify with Cluster Index –Simple 1-d examples –Local mininizers –Impact of outliers •SigClust –When](https://reader036.vdocuments.net/reader036/viewer/2022062303/551a687d550346b52d8b4bdb/html5/thumbnails/57.jpg)
Q-Q plots
Illustrative graphic (toy data set):
(Figure labels: empirical quantiles $\hat{q}_1, \dots, \hat{q}_5$; theoretical quantiles $q_1, \dots, q_5$)
![Page 58: Object Orie’d Data Analysis, Last Time •Clustering –Quantify with Cluster Index –Simple 1-d examples –Local mininizers –Impact of outliers •SigClust –When](https://reader036.vdocuments.net/reader036/viewer/2022062303/551a687d550346b52d8b4bdb/html5/thumbnails/58.jpg)
Q-Q plots
Illustrative graphic (toy data set):
![Page 59: Object Orie’d Data Analysis, Last Time •Clustering –Quantify with Cluster Index –Simple 1-d examples –Local mininizers –Impact of outliers •SigClust –When](https://reader036.vdocuments.net/reader036/viewer/2022062303/551a687d550346b52d8b4bdb/html5/thumbnails/59.jpg)
Q-Q plots
Illustrative graphic (toy data set):
Empirical Qs near Theoretical Qs
when
Q-Q curve is near 45° line
(general use of Q-Q plots)
![Page 60: Object Orie’d Data Analysis, Last Time •Clustering –Quantify with Cluster Index –Simple 1-d examples –Local mininizers –Impact of outliers •SigClust –When](https://reader036.vdocuments.net/reader036/viewer/2022062303/551a687d550346b52d8b4bdb/html5/thumbnails/60.jpg)
Q-Q plots
non-Gaussian! departures from line?
![Page 61: Object Orie’d Data Analysis, Last Time •Clustering –Quantify with Cluster Index –Simple 1-d examples –Local mininizers –Impact of outliers •SigClust –When](https://reader036.vdocuments.net/reader036/viewer/2022062303/551a687d550346b52d8b4bdb/html5/thumbnails/61.jpg)
Q-Q plots
non-Gaussian! departures from line?
• Seems different from line?
• 2 modes turn into wiggles?
• Less strong feature
• Been proposed to study modality
• But density view + SiZer is
much better for finding modality
![Page 62: Object Orie’d Data Analysis, Last Time •Clustering –Quantify with Cluster Index –Simple 1-d examples –Local mininizers –Impact of outliers •SigClust –When](https://reader036.vdocuments.net/reader036/viewer/2022062303/551a687d550346b52d8b4bdb/html5/thumbnails/62.jpg)
Q-Q plots
non-Gaussian (?) departures from line?
![Page 63: Object Orie’d Data Analysis, Last Time •Clustering –Quantify with Cluster Index –Simple 1-d examples –Local mininizers –Impact of outliers •SigClust –When](https://reader036.vdocuments.net/reader036/viewer/2022062303/551a687d550346b52d8b4bdb/html5/thumbnails/63.jpg)
Q-Q plots
non-Gaussian (?) departures from line?
• Seems different from line?
• Harder to say this time?
• What is signal & what is noise?
• Need to understand sampling variation
![Page 64: Object Orie’d Data Analysis, Last Time •Clustering –Quantify with Cluster Index –Simple 1-d examples –Local mininizers –Impact of outliers •SigClust –When](https://reader036.vdocuments.net/reader036/viewer/2022062303/551a687d550346b52d8b4bdb/html5/thumbnails/64.jpg)
Q-Q plots
Gaussian? departures from line?
![Page 65: Object Orie’d Data Analysis, Last Time •Clustering –Quantify with Cluster Index –Simple 1-d examples –Local mininizers –Impact of outliers •SigClust –When](https://reader036.vdocuments.net/reader036/viewer/2022062303/551a687d550346b52d8b4bdb/html5/thumbnails/65.jpg)
Q-Q plots
Gaussian? departures from line?
• Looks much like the line?
• Wiggles all random variation?
• But there are n = 10,000 data points…
• How to assess signal & noise?
• Need to understand sampling variation
![Page 66: Object Orie’d Data Analysis, Last Time •Clustering –Quantify with Cluster Index –Simple 1-d examples –Local mininizers –Impact of outliers •SigClust –When](https://reader036.vdocuments.net/reader036/viewer/2022062303/551a687d550346b52d8b4bdb/html5/thumbnails/66.jpg)
Q-Q plots
Need to understand sampling variation
• Approach: Q-Q envelope plot
– Simulate from Theoretical Dist’n
– Samples of same size
– About 100 samples gives
“good visual impression”
– Overlay resulting 100 QQ-curves
– To visually convey natural sampling variation
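The envelope steps above can be sketched as follows. The slides overlay about 100 simulated Q-Q curves; this sketch instead summarises them by their pointwise minimum and maximum, and takes the Gaussian fitted by average and s.d. as the theoretical distribution. Both choices, along with the function names and the bimodal test sample, are assumptions made for illustration.

```python
import random
from statistics import fmean, pstdev

def qq_envelope(data, n_sim=100, seed=0):
    """Pointwise min/max of sorted samples simulated from the Gaussian fit."""
    rng = random.Random(seed)
    n = len(data)
    mu, sd = fmean(data), pstdev(data)
    lower = [float("inf")] * n
    upper = [float("-inf")] * n
    for _ in range(n_sim):
        sim = sorted(rng.gauss(mu, sd) for _ in range(n))  # same sample size as data
        lower = [min(a, b) for a, b in zip(lower, sim)]
        upper = [max(a, b) for a, b in zip(upper, sim)]
    return lower, upper

def count_outside(data, lower, upper):
    """Number of sorted data points falling outside the simulated envelope."""
    xs = sorted(data)
    return sum(not (lo <= x <= hi) for x, lo, hi in zip(xs, lower, upper))

rng = random.Random(3)
bimodal = [rng.gauss(-1.5 if rng.random() < 0.5 else 1.5, 0.75) for _ in range(1000)]
lower, upper = qq_envelope(bimodal)
n_outside = count_outside(bimodal, lower, upper)
print(f"{n_outside} of {len(bimodal)} sorted data points lie outside the envelope")
```

Empirical quantiles that leave the band show departures larger than natural sampling variation would explain.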
![Page 67: Object Orie’d Data Analysis, Last Time •Clustering –Quantify with Cluster Index –Simple 1-d examples –Local mininizers –Impact of outliers •SigClust –When](https://reader036.vdocuments.net/reader036/viewer/2022062303/551a687d550346b52d8b4bdb/html5/thumbnails/67.jpg)
Q-Q plots
non-Gaussian! departures from line?
![Page 68: Object Orie’d Data Analysis, Last Time •Clustering –Quantify with Cluster Index –Simple 1-d examples –Local mininizers –Impact of outliers •SigClust –When](https://reader036.vdocuments.net/reader036/viewer/2022062303/551a687d550346b52d8b4bdb/html5/thumbnails/68.jpg)
Q-Q plots
non-Gaussian! departures from line?
• Envelope Plot shows:
• Departures are significant
• Clear these data are not Gaussian
• Q-Q plot gives clear indication
![Page 69: Object Orie’d Data Analysis, Last Time •Clustering –Quantify with Cluster Index –Simple 1-d examples –Local mininizers –Impact of outliers •SigClust –When](https://reader036.vdocuments.net/reader036/viewer/2022062303/551a687d550346b52d8b4bdb/html5/thumbnails/69.jpg)
Q-Q plots
non-Gaussian (?) departures from line?
![Page 70: Object Orie’d Data Analysis, Last Time •Clustering –Quantify with Cluster Index –Simple 1-d examples –Local mininizers –Impact of outliers •SigClust –When](https://reader036.vdocuments.net/reader036/viewer/2022062303/551a687d550346b52d8b4bdb/html5/thumbnails/70.jpg)
Q-Q plots
non-Gaussian (?) departures from line?
• Envelope Plot shows:
• Departures are significant
• Clear these data are not Gaussian
• Recall not so clear from e.g. histogram
• Q-Q plot gives clear indication
• Envelope plot reflects sampling variation
![Page 71: Object Orie’d Data Analysis, Last Time •Clustering –Quantify with Cluster Index –Simple 1-d examples –Local mininizers –Impact of outliers •SigClust –When](https://reader036.vdocuments.net/reader036/viewer/2022062303/551a687d550346b52d8b4bdb/html5/thumbnails/71.jpg)
Q-Q plots
Gaussian? departures from line?
![Page 72: Object Orie’d Data Analysis, Last Time •Clustering –Quantify with Cluster Index –Simple 1-d examples –Local mininizers –Impact of outliers •SigClust –When](https://reader036.vdocuments.net/reader036/viewer/2022062303/551a687d550346b52d8b4bdb/html5/thumbnails/72.jpg)
Q-Q plots
Gaussian? departures from line?
• Harder to see
• But clearly there
• Conclude non-Gaussian
• Really needed n = 10,000 data points…
(why bigger sample size was used)
• Envelope plot reflects sampling variation
![Page 73: Object Orie’d Data Analysis, Last Time •Clustering –Quantify with Cluster Index –Simple 1-d examples –Local mininizers –Impact of outliers •SigClust –When](https://reader036.vdocuments.net/reader036/viewer/2022062303/551a687d550346b52d8b4bdb/html5/thumbnails/73.jpg)
Q-Q plots
What were these distributions?
• Non-Gaussian!
– $0.5\,N(-1.5, 0.75^2) + 0.5\,N(1.5, 0.75^2)$
• Non-Gaussian (?)
– $0.4\,N(0, 1) + 0.3\,N(0, 0.5^2) + 0.3\,N(0, 0.25^2)$
• Gaussian
• Gaussian?
– $0.7\,N(0, 1) + 0.3\,N(0, 0.5^2)$
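For readers who want to regenerate similar toy data, here is a sketch of sampling from these Gaussian mixtures. It reads $N(\mu, \sigma^2)$ with the second argument as a variance, so 0.75² means a standard deviation of 0.75. The plain "Gaussian" case is omitted because its parameters are not given. The dictionary keys and function names are illustrative.

```python
import random

# Each component is (weight, mean, standard deviation).
MIXTURES = {
    "non-Gaussian!": [(0.5, -1.5, 0.75), (0.5, 1.5, 0.75)],
    "non-Gaussian(?)": [(0.4, 0.0, 1.0), (0.3, 0.0, 0.5), (0.3, 0.0, 0.25)],
    "Gaussian?": [(0.7, 0.0, 1.0), (0.3, 0.0, 0.5)],
}

def mixture_variance(components):
    """Exact variance of a Gaussian mixture."""
    mean = sum(w * m for w, m, _ in components)
    return sum(w * (s ** 2 + (m - mean) ** 2) for w, m, s in components)

def sample_mixture(components, n, rng):
    """Draw n values: pick a component by weight, then draw from that Gaussian."""
    weights = [w for w, _, _ in components]
    picks = rng.choices(components, weights=weights, k=n)
    return [rng.gauss(m, s) for _, m, s in picks]

rng = random.Random(4)
samples = {name: sample_mixture(comp, 1000, rng) for name, comp in MIXTURES.items()}
for name, comp in MIXTURES.items():
    print(f"{name}: variance {mixture_variance(comp):.5f}")
```

The two scale mixtures are unimodal and symmetric, which is why histograms and kernel density estimates struggle to separate them from a Gaussian.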
![Page 74: Object Orie’d Data Analysis, Last Time •Clustering –Quantify with Cluster Index –Simple 1-d examples –Local mininizers –Impact of outliers •SigClust –When](https://reader036.vdocuments.net/reader036/viewer/2022062303/551a687d550346b52d8b4bdb/html5/thumbnails/74.jpg)
Q-Q plots
Non-Gaussian! $0.5\,N(-1.5, 0.75^2) + 0.5\,N(1.5, 0.75^2)$
![Page 75: Object Orie’d Data Analysis, Last Time •Clustering –Quantify with Cluster Index –Simple 1-d examples –Local mininizers –Impact of outliers •SigClust –When](https://reader036.vdocuments.net/reader036/viewer/2022062303/551a687d550346b52d8b4bdb/html5/thumbnails/75.jpg)
Q-Q plots
Non-Gaussian (?) $0.4\,N(0, 1) + 0.3\,N(0, 0.5^2) + 0.3\,N(0, 0.25^2)$
![Page 76: Object Orie’d Data Analysis, Last Time •Clustering –Quantify with Cluster Index –Simple 1-d examples –Local mininizers –Impact of outliers •SigClust –When](https://reader036.vdocuments.net/reader036/viewer/2022062303/551a687d550346b52d8b4bdb/html5/thumbnails/76.jpg)
Q-Q plots
Gaussian
![Page 77: Object Orie’d Data Analysis, Last Time •Clustering –Quantify with Cluster Index –Simple 1-d examples –Local mininizers –Impact of outliers •SigClust –When](https://reader036.vdocuments.net/reader036/viewer/2022062303/551a687d550346b52d8b4bdb/html5/thumbnails/77.jpg)
Q-Q plots
Gaussian? $0.7\,N(0, 1) + 0.3\,N(0, 0.5^2)$