1 peter fox data analytics – itws-4963/itws-6965 week 2a, february 3, 2015, lally 102 data and...
TRANSCRIPT
![Page 1: 1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 2a, February 3, 2015, LALLY 102 Data and Information Resources, Role of Hypothesis, Synthesis and](https://reader035.vdocuments.net/reader035/viewer/2022062500/5697bff81a28abf838cbf3d8/html5/thumbnails/1.jpg)
1
Peter Fox
Data Analytics – ITWS-4963/ITWS-6965
Week 2a, February 3, 2015, LALLY 102
Data and Information Resources, Role of Hypothesis, Synthesis
and Model Choices
![Page 2: 1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 2a, February 3, 2015, LALLY 102 Data and Information Resources, Role of Hypothesis, Synthesis and](https://reader035.vdocuments.net/reader035/viewer/2022062500/5697bff81a28abf838cbf3d8/html5/thumbnails/2.jpg)
Contents• Data sources
– Cyber– Human
• “Munging”
• Exploring– Distributions…– Summaries– Visualization
• Testing and evaluating the results (beginning) 2
![Page 3: 1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 2a, February 3, 2015, LALLY 102 Data and Information Resources, Role of Hypothesis, Synthesis and](https://reader035.vdocuments.net/reader035/viewer/2022062500/5697bff81a28abf838cbf3d8/html5/thumbnails/3.jpg)
Lower layers in the Analytics Stack
3
![Page 4: 1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 2a, February 3, 2015, LALLY 102 Data and Information Resources, Role of Hypothesis, Synthesis and](https://reader035.vdocuments.net/reader035/viewer/2022062500/5697bff81a28abf838cbf3d8/html5/thumbnails/4.jpg)
“Cyber Data” …
4
![Page 5: 1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 2a, February 3, 2015, LALLY 102 Data and Information Resources, Role of Hypothesis, Synthesis and](https://reader035.vdocuments.net/reader035/viewer/2022062500/5697bff81a28abf838cbf3d8/html5/thumbnails/5.jpg)
“Human Data” …
5
![Page 6: 1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 2a, February 3, 2015, LALLY 102 Data and Information Resources, Role of Hypothesis, Synthesis and](https://reader035.vdocuments.net/reader035/viewer/2022062500/5697bff81a28abf838cbf3d8/html5/thumbnails/6.jpg)
Data Prepared for Analysis = Munging
• Missing values, null values, etc.• E.g. in the EPI_data – they use “--”
– Most data applications provide built ins for these higher-order functions – in R “NA” is used and functions such as is.na(var), etc. provide powerful filtering options (we’ll cover these on Friday)
• Of course, different variables often are missing “different” values
• In R – higher-order functions such as: Reduce, Filter, Map, Find, Position and Negate will become your enemies and then your friends: http://www.johnmyleswhite.com/notebook/2010/09/23/higher-order-functions-in-r/
6
![Page 7: 1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 2a, February 3, 2015, LALLY 102 Data and Information Resources, Role of Hypothesis, Synthesis and](https://reader035.vdocuments.net/reader035/viewer/2022062500/5697bff81a28abf838cbf3d8/html5/thumbnails/7.jpg)
Getting started – summarize data• Summary statistic
– Ranges, “hinges”– Tukey’s five numbers
• Look for a distribution match
• Tests…for…– Normality – shapiro-wilks – returns a statistic (W!)
and a p-value – what is the null hypothesis here?
> shapiro.test(EPI_data$EPI)
Shapiro-Wilk normality test
data: EPI_data$EPI
W = 0.9866, p-value = 0.1188 7
![Page 8: 1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 2a, February 3, 2015, LALLY 102 Data and Information Resources, Role of Hypothesis, Synthesis and](https://reader035.vdocuments.net/reader035/viewer/2022062500/5697bff81a28abf838cbf3d8/html5/thumbnails/8.jpg)
Accept or Reject?• Reject the null hypothesis if the p-value is
less than the level of significance.
• You will fail to reject the null hypothesis if the p-value is greater than or equal to the level of significance.
• Typical significance 0.05 (!)8
![Page 9: 1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 2a, February 3, 2015, LALLY 102 Data and Information Resources, Role of Hypothesis, Synthesis and](https://reader035.vdocuments.net/reader035/viewer/2022062500/5697bff81a28abf838cbf3d8/html5/thumbnails/9.jpg)
Another variable in EPI> shapiro.test(EPI_data$DALY)
Shapiro-Wilk normality test
data: EPI_data$DALY
W = 0.9365, p-value = 1.891e-07
Accept or reject?
9
![Page 10: 1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 2a, February 3, 2015, LALLY 102 Data and Information Resources, Role of Hypothesis, Synthesis and](https://reader035.vdocuments.net/reader035/viewer/2022062500/5697bff81a28abf838cbf3d8/html5/thumbnails/10.jpg)
Distribution tests• Binomial, …. most distributions have tests
• Wilcoxon (Mann-Whitney)– Comparing populations – versus to a distribution
• Kolmogorov-Smirnov (KS)
• …
• It got out of control when people realized they can name the test after themselves, v. someone else… 10
![Page 11: 1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 2a, February 3, 2015, LALLY 102 Data and Information Resources, Role of Hypothesis, Synthesis and](https://reader035.vdocuments.net/reader035/viewer/2022062500/5697bff81a28abf838cbf3d8/html5/thumbnails/11.jpg)
Getting started – look at the data• Visually
– What is the improvement in the understanding of the data as compared to the situation without visualization?
– Which visualization techniques are suitable for one's data?
• Scatter plot diagrams• Box plots (min, 1st quartile, median, 3rd quartile, max)• Stem and leaf plots• Frequency plots• Group Frequency Distributions plot• Cumulative Frequency plots• Distribution plots
11
![Page 12: 1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 2a, February 3, 2015, LALLY 102 Data and Information Resources, Role of Hypothesis, Synthesis and](https://reader035.vdocuments.net/reader035/viewer/2022062500/5697bff81a28abf838cbf3d8/html5/thumbnails/12.jpg)
Why visualization?• Reducing amount of data, quantization
• Patterns
• Features
• Events
• Trends
• Irregularities
• Leading to presentation of data, i.e. information products
• Exit points for analysis12
![Page 13: 1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 2a, February 3, 2015, LALLY 102 Data and Information Resources, Role of Hypothesis, Synthesis and](https://reader035.vdocuments.net/reader035/viewer/2022062500/5697bff81a28abf838cbf3d8/html5/thumbnails/13.jpg)
Exploring the distribution> summary(EPI) # stats
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
32.10 48.60 59.20 58.37 67.60 93.50 68
> boxplot(EPI)
> fivenum(EPI,na.rm=TRUE)[1] 32.1 48.6 59.2 67.6 93.5
Tukey: min, lower hinge, median, upper hinge, max
13
![Page 14: 1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 2a, February 3, 2015, LALLY 102 Data and Information Resources, Role of Hypothesis, Synthesis and](https://reader035.vdocuments.net/reader035/viewer/2022062500/5697bff81a28abf838cbf3d8/html5/thumbnails/14.jpg)
Stem and leaf plot> stem(EPI) # like-a histogram The decimal point is 1 digit(s) to the right of the | - but the scale of the stem is 10… watch carefully..
3 | 234
3 | 66889
4 | 00011112222223344444
4 | 5555677788888999
5 | 0000111111111244444
5 | 55666677778888999999
6 | 000001111111222333344444
6 | 5555666666677778888889999999
7 | 000111233333334
7 | 5567888
8 | 11
8 | 669
9 | 414
![Page 15: 1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 2a, February 3, 2015, LALLY 102 Data and Information Resources, Role of Hypothesis, Synthesis and](https://reader035.vdocuments.net/reader035/viewer/2022062500/5697bff81a28abf838cbf3d8/html5/thumbnails/15.jpg)
Grouped Frequency Distribution aka binning> hist(EPI) #defaults
15
![Page 16: 1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 2a, February 3, 2015, LALLY 102 Data and Information Resources, Role of Hypothesis, Synthesis and](https://reader035.vdocuments.net/reader035/viewer/2022062500/5697bff81a28abf838cbf3d8/html5/thumbnails/16.jpg)
Distributions• Shape
• Character
• Parameter(s)
• Which one fits?
16
![Page 17: 1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 2a, February 3, 2015, LALLY 102 Data and Information Resources, Role of Hypothesis, Synthesis and](https://reader035.vdocuments.net/reader035/viewer/2022062500/5697bff81a28abf838cbf3d8/html5/thumbnails/17.jpg)
17
> hist(EPI, seq(30., 95., 1.0), prob=TRUE)
> lines (density(EPI,na.rm=TRUE,bw=1.))
> rug(EPI)or> lines (density(EPI,na.rm=TRUE,bw=“SJ”))
![Page 18: 1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 2a, February 3, 2015, LALLY 102 Data and Information Resources, Role of Hypothesis, Synthesis and](https://reader035.vdocuments.net/reader035/viewer/2022062500/5697bff81a28abf838cbf3d8/html5/thumbnails/18.jpg)
18
> hist(EPI, seq(30., 95., 1.0), prob=TRUE)
> lines (density(EPI,na.rm=TRUE,bw=“SJ”))
![Page 19: 1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 2a, February 3, 2015, LALLY 102 Data and Information Resources, Role of Hypothesis, Synthesis and](https://reader035.vdocuments.net/reader035/viewer/2022062500/5697bff81a28abf838cbf3d8/html5/thumbnails/19.jpg)
Why are histograms so unsatisfying?
19
![Page 20: 1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 2a, February 3, 2015, LALLY 102 Data and Information Resources, Role of Hypothesis, Synthesis and](https://reader035.vdocuments.net/reader035/viewer/2022062500/5697bff81a28abf838cbf3d8/html5/thumbnails/20.jpg)
> xn<-seq(30,95,1)
> qn<-dnorm(xn,mean=63, sd=5,log=FALSE)
> lines(xn,qn)
> lines(xn,.4*qn)
> ln<-dnorm(xn,mean=44, sd=5,log=FALSE)
> lines(xn,.26*ln)
20
![Page 21: 1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 2a, February 3, 2015, LALLY 102 Data and Information Resources, Role of Hypothesis, Synthesis and](https://reader035.vdocuments.net/reader035/viewer/2022062500/5697bff81a28abf838cbf3d8/html5/thumbnails/21.jpg)
Exploring the distribution> summary(DALY) # stats
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.00 37.19 60.35 53.94 71.97 91.50 39
> fivenum(DALY,na.rm=TRUE)[1] 0.000 36.955 60.350 72.320 91.500
21EPI DALY
![Page 22: 1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 2a, February 3, 2015, LALLY 102 Data and Information Resources, Role of Hypothesis, Synthesis and](https://reader035.vdocuments.net/reader035/viewer/2022062500/5697bff81a28abf838cbf3d8/html5/thumbnails/22.jpg)
Stem and leaf plot> stem(DALY) # The decimal point is 1 digit(s) to the right of the |
0 | 0000111244
0 | 567899
1 | 0234
1 | 56688
2 | 000123
2 | 5667889
3 | 00001134
3 | 5678899
4 | 00011223444
4 | 555799
5 | 12223344
5 | 556667788999999
6 | 0000011111222233334444
6 | 6666666677788889999
7 | 00000000223333444
7 | 66888999
8 | 1113333333
8 | 555557777777777799999
9 | 22
22
![Page 23: 1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 2a, February 3, 2015, LALLY 102 Data and Information Resources, Role of Hypothesis, Synthesis and](https://reader035.vdocuments.net/reader035/viewer/2022062500/5697bff81a28abf838cbf3d8/html5/thumbnails/23.jpg)
Beyond histograms• Cumulative distribution function: probability that a
real-valued random variable X with a given probability distribution will be found at a value less than or equal to x.
> plot(ecdf(EPI), do.points=FALSE, verticals=TRUE) 23
![Page 24: 1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 2a, February 3, 2015, LALLY 102 Data and Information Resources, Role of Hypothesis, Synthesis and](https://reader035.vdocuments.net/reader035/viewer/2022062500/5697bff81a28abf838cbf3d8/html5/thumbnails/24.jpg)
Beyond histograms• Quantile ~ inverse cumulative density function –
points taken at regular intervals from the CDF, e.g. 2-quantiles=median, 4-quantiles=quartiles
• Quantile-Quantile (versus default=normal dist.)> par(pty="s")
> qqnorm(EPI); qqline(EPI)
24
![Page 25: 1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 2a, February 3, 2015, LALLY 102 Data and Information Resources, Role of Hypothesis, Synthesis and](https://reader035.vdocuments.net/reader035/viewer/2022062500/5697bff81a28abf838cbf3d8/html5/thumbnails/25.jpg)
Beyond histograms
• Simulated data from t-distribution (random):> x <- rt(250, df = 5)
> qqnorm(x); qqline(x)
25
![Page 26: 1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 2a, February 3, 2015, LALLY 102 Data and Information Resources, Role of Hypothesis, Synthesis and](https://reader035.vdocuments.net/reader035/viewer/2022062500/5697bff81a28abf838cbf3d8/html5/thumbnails/26.jpg)
Beyond histograms
• Q-Q plot against the generating distribution: x<-seq(30,95,1)> qqplot(qt(ppoints(250), df = 5), x, xlab = "Q-Q plot for t dsn")
> qqline(x)
26
![Page 27: 1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 2a, February 3, 2015, LALLY 102 Data and Information Resources, Role of Hypothesis, Synthesis and](https://reader035.vdocuments.net/reader035/viewer/2022062500/5697bff81a28abf838cbf3d8/html5/thumbnails/27.jpg)
But if you are not sure it is normal
> wilcox.test(EPI,DALY)
Wilcoxon rank sum test with continuity correction
data: EPI and DALY
W = 15970, p-value = 0.7386
alternative hypothesis: true location shift is not equal to 0
27
![Page 28: 1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 2a, February 3, 2015, LALLY 102 Data and Information Resources, Role of Hypothesis, Synthesis and](https://reader035.vdocuments.net/reader035/viewer/2022062500/5697bff81a28abf838cbf3d8/html5/thumbnails/28.jpg)
Comparing the CDFs> plot(ecdf(EPI), do.points=FALSE, verticals=TRUE)
> plot(ecdf(DALY), do.points=FALSE, verticals=TRUE, add=TRUE)
28
![Page 29: 1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 2a, February 3, 2015, LALLY 102 Data and Information Resources, Role of Hypothesis, Synthesis and](https://reader035.vdocuments.net/reader035/viewer/2022062500/5697bff81a28abf838cbf3d8/html5/thumbnails/29.jpg)
29
![Page 30: 1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 2a, February 3, 2015, LALLY 102 Data and Information Resources, Role of Hypothesis, Synthesis and](https://reader035.vdocuments.net/reader035/viewer/2022062500/5697bff81a28abf838cbf3d8/html5/thumbnails/30.jpg)
30
![Page 31: 1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 2a, February 3, 2015, LALLY 102 Data and Information Resources, Role of Hypothesis, Synthesis and](https://reader035.vdocuments.net/reader035/viewer/2022062500/5697bff81a28abf838cbf3d8/html5/thumbnails/31.jpg)
31
![Page 32: 1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 2a, February 3, 2015, LALLY 102 Data and Information Resources, Role of Hypothesis, Synthesis and](https://reader035.vdocuments.net/reader035/viewer/2022062500/5697bff81a28abf838cbf3d8/html5/thumbnails/32.jpg)
32
![Page 33: 1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 2a, February 3, 2015, LALLY 102 Data and Information Resources, Role of Hypothesis, Synthesis and](https://reader035.vdocuments.net/reader035/viewer/2022062500/5697bff81a28abf838cbf3d8/html5/thumbnails/33.jpg)
More munging• Bad values, outliers, corrupted entries,
thresholds …
• Noise reduction – low-pass filtering, binning
• Modal filtering
• REMEMBER: when you munge you MUST record what you did (and why) and save copies of pre- and post- operations… 33
![Page 34: 1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 2a, February 3, 2015, LALLY 102 Data and Information Resources, Role of Hypothesis, Synthesis and](https://reader035.vdocuments.net/reader035/viewer/2022062500/5697bff81a28abf838cbf3d8/html5/thumbnails/34.jpg)
34
![Page 35: 1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 2a, February 3, 2015, LALLY 102 Data and Information Resources, Role of Hypothesis, Synthesis and](https://reader035.vdocuments.net/reader035/viewer/2022062500/5697bff81a28abf838cbf3d8/html5/thumbnails/35.jpg)
35
![Page 36: 1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 2a, February 3, 2015, LALLY 102 Data and Information Resources, Role of Hypothesis, Synthesis and](https://reader035.vdocuments.net/reader035/viewer/2022062500/5697bff81a28abf838cbf3d8/html5/thumbnails/36.jpg)
Populations within populations• In the EPI example:
– Geographic regions (GEO_subregion)– EPI_regions– Eco-regions (EDC v. LEDC – know what that is?)– Primary industry(ies)– Climate region
• What would you do to start exploring?
36
![Page 37: 1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 2a, February 3, 2015, LALLY 102 Data and Information Resources, Role of Hypothesis, Synthesis and](https://reader035.vdocuments.net/reader035/viewer/2022062500/5697bff81a28abf838cbf3d8/html5/thumbnails/37.jpg)
37
![Page 38: 1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 2a, February 3, 2015, LALLY 102 Data and Information Resources, Role of Hypothesis, Synthesis and](https://reader035.vdocuments.net/reader035/viewer/2022062500/5697bff81a28abf838cbf3d8/html5/thumbnails/38.jpg)
38
Or, a twist – n=1 but many attributes?
The item of interest in relation to its attributes
![Page 39: 1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 2a, February 3, 2015, LALLY 102 Data and Information Resources, Role of Hypothesis, Synthesis and](https://reader035.vdocuments.net/reader035/viewer/2022062500/5697bff81a28abf838cbf3d8/html5/thumbnails/39.jpg)
Summary: explore• Going from preliminary to initial analysis…
• Determining if there is one or more common distributions involved – i.e. parametric statistics (assumes or asserts a probability distribution)
• Fitting that distribution -> provides a model!
• Or NOT– A hybrid or
– Non-parametric (statistics) approaches are needed – more on this to come
39
![Page 40: 1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 2a, February 3, 2015, LALLY 102 Data and Information Resources, Role of Hypothesis, Synthesis and](https://reader035.vdocuments.net/reader035/viewer/2022062500/5697bff81a28abf838cbf3d8/html5/thumbnails/40.jpg)
Goodness of fit• And, we cannot take the models at face
value, we must assess how fit they may be:– Chi-Square – One-sided and two-sided Kolmogorov-Smirnov
tests– Lilliefors tests– Ansari-Bradley tests– Jarque-Bera tests
• Just a preview…40
![Page 41: 1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 2a, February 3, 2015, LALLY 102 Data and Information Resources, Role of Hypothesis, Synthesis and](https://reader035.vdocuments.net/reader035/viewer/2022062500/5697bff81a28abf838cbf3d8/html5/thumbnails/41.jpg)
41
Summary
• Cyber and Human data; quality, uncertainty and bias – you will often spend a lot of time with the data
• Distributions – the common and not-so common ones and how cyber and human data can have distinct distributions
• How simple statistical distributions can mislead us• Populations and samples and how inferential
statistics will lead us to model choices (no we have not actually done that yet in detail)
• Munging toward exploratory analysis• Toward models!
![Page 42: 1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 2a, February 3, 2015, LALLY 102 Data and Information Resources, Role of Hypothesis, Synthesis and](https://reader035.vdocuments.net/reader035/viewer/2022062500/5697bff81a28abf838cbf3d8/html5/thumbnails/42.jpg)
How are the software installs going?
• R
• Data exercises?
– You can try some of the examples from today on the EPI dataset
• More on Friday… and other datasets.
42
![Page 43: 1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 2a, February 3, 2015, LALLY 102 Data and Information Resources, Role of Hypothesis, Synthesis and](https://reader035.vdocuments.net/reader035/viewer/2022062500/5697bff81a28abf838cbf3d8/html5/thumbnails/43.jpg)
The DATUM project• This class is one included in the DATUM
project for assessment and evaluation
• Here’s what that means…
– Friday – intro and survey
– End of semester – another survey
43