1
DATA DESCRIPTION
2
Units
Unit: the entity we are studying (called a subject if it is a human being).
Each unit/subject has certain parameters, e.g., a student (subject) has an age, weight, height, home address, number of units taken, and so on.
3
Variables
These parameters are called variables. In statistics, variables are stored in columns, each variable occupying one column.
4
Cross-sectional and time-series analyses
In a cross-sectional analysis a unit/subject will be the entity you are studying. For example, if you study the housing market in San Diego, a unit will be a house, and variables will be price, size, age, etc., of a house.
In a time-series analysis the unit is a time unit, say, hour, day, month, etc.
5
Data Types
Nominal data: male/female, colors
Ordinal data: excellent/good/bad
Interval data: temperature, GMAT scores
Ratio data: distance to school, price
6
Two forms
GRAPHICAL form
NUMERICAL SUMMARY form
7
Graphical forms
Sequence plots
Histograms (frequency distributions)
Scatter plots
8
Sequence plots
Used to describe a time series.
The horizontal axis is always related to the sequence in which data were collected.
The vertical axis is the value of the variable.
9
Example: sequence plot
[Figure: sequence plot of the S&P-500. Horizontal axis: Index (10 to 40). Vertical axis: S&P-500 (430 to 470).]
10
Histograms I
A histogram (frequency distribution) shows how many values are in a certain range.
It is used for cross-sectional analysis.
The potential observation values are divided into groups (called classes).
The number of observations falling into each class is called the frequency.
When we say an observation falls into a class, we mean its value is greater than or equal to the lower bound but less than the upper bound of the class.
11
Example: histogram
A commercial bank is studying the time a customer spends in line. They recorded waiting times (in minutes) of 28 customers:
5.9 7.6 5.3 9.7 1.6 3.5 7.4
4.0 1.6 7.3 8.2 8.4 6.5 8.9
1.1 8.6 4.3 1.2 3.3 2.1 8.4
1.1 6.7 5.0 4.5 9.4 6.3 6.4
12
Example: histogram
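The class frequencies for the waiting-time data can be tallied with a short Python sketch. The class bounds 1, 3, 5, 7, 9, 11 are an assumption taken from the cut-off values used in the cumulative distribution example; the names `waiting_times`, `bounds`, and `class_frequencies` are illustrative only.

```python
# Waiting times (minutes) of the 28 bank customers.
waiting_times = [
    5.9, 7.6, 5.3, 9.7, 1.6, 3.5, 7.4,
    4.0, 1.6, 7.3, 8.2, 8.4, 6.5, 8.9,
    1.1, 8.6, 4.3, 1.2, 3.3, 2.1, 8.4,
    1.1, 6.7, 5.0, 4.5, 9.4, 6.3, 6.4,
]

# Assumed class bounds: [1,3), [3,5), [5,7), [7,9), [9,11).
bounds = [1, 3, 5, 7, 9, 11]


def class_frequencies(values, bounds):
    """Count the values v with lower <= v < upper for each pair of adjacent bounds."""
    return [
        sum(1 for v in values if lo <= v < hi)
        for lo, hi in zip(bounds, bounds[1:])
    ]


frequencies = class_frequencies(waiting_times, bounds)

# Print a crude text histogram: one star per observation in the class.
for (lo, hi), f in zip(zip(bounds, bounds[1:]), frequencies):
    print(f"[{lo:2d}, {hi:2d}): {f:2d} " + "*" * f)
```

Each observation is counted in exactly one class because the lower bound is inclusive and the upper bound is exclusive.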
13
Histogram II
The relative frequency distribution depicts the ratio of the frequency and the total number of observations.
The cumulative distribution depicts the percentage of observations that are less than a specific value.
14
Example: relative frequency distribution
A “relative frequency” distribution plots the fraction (or percentage) of observations in each class instead of the actual number. For this problem, the relative frequency of the first class is 6/28=0.214. The remaining relative frequencies are 0.179, 0.250, 0.286 and 0.071. A graph similar to the above one can then be plotted.
15
Example: cumulative distribution
In the previous example, the fraction of observations that are less than 3 minutes is 0.214, less than 5 is 0.214+0.179=0.393, less than 7 is 0.214+0.179+0.25=0.643, less than 9 is 0.214+0.179+0.25+0.286=0.929, and less than 11 is 1.0.
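The relative and cumulative figures above can be reproduced with a minimal Python sketch. Only the first class count (6) is stated in the slides; the other counts are inferred from the quoted relative frequencies, and the variable names are illustrative.

```python
from itertools import accumulate

# Class counts for [1,3), [3,5), [5,7), [7,9), [9,11).
# Assumption: counts other than the first are inferred from the
# relative frequencies quoted in the slides.
frequencies = [6, 5, 7, 8, 2]
upper_bounds = [3, 5, 7, 9, 11]
n = sum(frequencies)  # total number of observations

# Relative frequency: class count divided by the total.
relative = [f / n for f in frequencies]

# Cumulative distribution: running total of the counts divided by the total.
cumulative = [c / n for c in accumulate(frequencies)]

for ub, r, c in zip(upper_bounds, relative, cumulative):
    print(f"less than {ub:2d}: relative {r:.3f}, cumulative {c:.3f}")
```

Accumulating the integer counts before dividing avoids the small rounding differences that arise when already-rounded relative frequencies are added.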
16
Example: cumulative distribution
17
Histogram III
The summation of all the relative frequencies is always 1.
The cumulative distribution is non-decreasing.
The last value of the cumulative distribution is always 1.
A cumulative distribution can be derived from the corresponding relative frequency distribution, and vice versa.
18
Probability
A random variable is a variable whose values cannot be predetermined but are governed by some random mechanism.
Although we cannot predict precisely the value of a random variable, we might be able to tell the probability of a random variable being in a certain interval.
The relative frequency can be interpreted as (an estimate of) the probability of a random variable falling in the corresponding class.
In this sense, the relative frequency distribution is also the probability distribution.
19
Scatter plots
A scatter plot shows the relationship between two variables.
20
Example: scatter plot
The following are the height and foot size measurements of 8 men arbitrarily selected from students in the cafeteria. Heights and foot sizes are in centimeters.
Man     1     2     3     4     5     6     7     8
Height  155   160   149   175   182   145   177   164
Foot    23.3  21.8  22.1  26.3  28.0  20.7  25.3  24.9
21
Example: scatter plot
[Figure: scatter plot. Horizontal axis: Foot size, cm (20 to 28). Vertical axis: Height, cm (130 to 190).]
22
Numerical Summary Forms
Central locations: mean, median, and mode.
Dispersion: standard deviation and variance.
Correlation.
23
Mean
The mean (average) is the sum of the observations divided by the number of observations.
27 22 26 24 27 20 23 24 18 32
Sum = (27 + 22 + 26 + 24 + 27 + 20 + 23 + 24 + 18 + 32) = 243
Mean = 243/10 = 24.3
24
Median
Median is the value of the central observation (the one in the middle), when the observations are listed in ascending or descending order.
When there is an even number of values, the median is given by the average of the middle two values.
When there is an odd number of values, the median is given by the middle number.
25
Example: median
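18 20 22 23 24 24 26 27 27 32
There are 10 values, so the median is the average of the 5th and 6th values: Median = (24 + 24)/2 = 24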
26
Compare mean and median
The median is less sensitive to outliers than the mean. Check the mean and median for the following two data sets:
18 20 22 23 24 24 26 27 27 32
18 20 22 23 24 24 26 27 27 320
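A quick way to carry out this check is the following Python sketch, which uses the standard library's `statistics` module; the variable names are illustrative.

```python
from statistics import mean, median

original = [18, 20, 22, 23, 24, 24, 26, 27, 27, 32]
with_outlier = [18, 20, 22, 23, 24, 24, 26, 27, 27, 320]  # 32 replaced by 320

for name, data in [("original", original), ("with outlier", with_outlier)]:
    print(f"{name}: mean = {mean(data)}, median = {median(data)}")
```

The single outlier moves the mean from 24.3 to 53.1, while the median stays at 24.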
27
Mode
Mode is the most frequently occurring value(s).
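As a small illustration, the sketch below finds the mode(s) of the ten-value sample used in the mean example. It relies on `statistics.multimode` (Python 3.8+), which returns every value tied for the highest count.

```python
from statistics import multimode

sample = [27, 22, 26, 24, 27, 20, 23, 24, 18, 32]

# multimode returns all values that share the highest frequency.
modes = sorted(multimode(sample))
print("mode(s):", modes)
```

Here both 24 and 27 occur twice, so the sample has two modes.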
28
Symmetry and skew
A frequency distribution in which the area to the left of the mean is a mirror image of the area to the right is called a symmetrical distribution.
A distribution that has a longer tail on the right hand side than on the left is called positively skewed or skewed to the right. A distribution that has a longer tail on the left is called negatively skewed.
If a distribution is positively skewed, the mean exceeds the median. For a negatively skewed distribution, the mean is less than the median.
29
Range
The range is the difference between the maximum and minimum values of the observations.
30
Standard deviation and variance
The standard deviation is used to describe the dispersion of the data.
The variance is the squared standard deviation.
31
Calculation of S.D.
Calculate the mean;
calculate the deviations;
calculate the squares of the deviations and sum them up;
divide the sum by n-1 and take the square root.
32
Example: S.D.
Sample        27     22     26     24     27     20     23     24     18     32
Deviation     2.7   -2.3    1.7   -0.3    2.7   -4.3   -1.3   -0.3   -6.3    7.7
Sq. of dev.   7.29   5.29   2.89   0.09   7.29  18.49   1.69   0.09  39.69  59.29
Sum of squares = 7.29 + 5.29 + ... + 59.29 = 142.1
Std. Dev. = $\sqrt{142.1 / 9} = \sqrt{15.79} \approx 3.97$
33
$$\text{std dev} = \sqrt{\frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}}$$
34
Empirical rules
If the distribution is symmetrical and bell-shaped:
Approximately 68% of the observations will be within one standard deviation of the mean.
Approximately 95% of the observations will be within two standard deviations of the mean.
Approximately 99.7% of the observations will be within three standard deviations of the mean.
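These percentages can be checked numerically under the assumption that "symmetrical and bell-shaped" means a normal distribution. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8+); the helper name `fraction_within` is illustrative.

```python
from statistics import NormalDist

# Assumption: the bell-shaped distribution is modelled as a standard normal.
standard_normal = NormalDist(mu=0.0, sigma=1.0)


def fraction_within(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {fraction_within(k):.4f}")
```

Because any normal distribution can be rescaled to the standard one, the same fractions hold for every mean and standard deviation.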
35
Percentiles
The 75th percentile is the value such that 75% of the numbers are less than or equal to this value and the remaining 25% are larger than this value.
The k-th percentile is the value such that k% of the numbers are less than or equal to this value and the remaining (100-k)% are larger than this value.
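One way to turn this definition into a calculation is the "nearest-rank" convention sketched below. This is an assumption: several percentile conventions exist, and software packages may interpolate between values and give slightly different answers. The function name `percentile_nearest_rank` is illustrative.

```python
from math import ceil


def percentile_nearest_rank(values, k):
    """Return the smallest value with at least k% of the data less than or equal to it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < k <= 100:
        raise ValueError("k must be in the interval (0, 100]")
    ordered = sorted(values)
    rank = ceil(k * len(ordered) / 100)  # 1-based position in the sorted list
    return ordered[rank - 1]


sample = [27, 22, 26, 24, 27, 20, 23, 24, 18, 32]
for k in (25, 50, 75):
    print(f"{k}th percentile: {percentile_nearest_rank(sample, k)}")
```

With tied values the "remaining (100-k)% are larger" part of the definition can only hold approximately, which is one reason the conventions differ.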
36
Correlation coefficient
The correlation coefficient measures how closely two variables are (linearly) related to each other. It has a value between -1 and +1.
A positive value indicates a positive linear relationship; a negative value indicates a negative linear relationship.
If two variables are not linearly related, the correlation coefficient will be close to zero; if they are closely related, the correlation coefficient will be close to 1 or -1.
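As an illustration, the sketch below computes the correlation coefficient for the height and foot-size data from the scatter plot example. The slides do not give a formula, so the standard Pearson formula is assumed here; the function name `correlation` is illustrative.

```python
from math import sqrt

heights = [155, 160, 149, 175, 182, 145, 177, 164]             # cm
foot_sizes = [23.3, 21.8, 22.1, 26.3, 28.0, 20.7, 25.3, 24.9]  # cm


def correlation(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations, and sums of squared deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


r = correlation(heights, foot_sizes)
print(f"correlation between height and foot size: {r:.3f}")
```

The result is close to +1, consistent with the upward-sloping pattern expected for these two measurements.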