
1

DATA DESCRIPTION

2

Units

Unit: the entity we are studying; called a subject if it is a human being.

Each unit/subject has certain parameters; e.g., a student (subject) has an age, weight, height, home address, number of units taken, and so on.

3

Variables

These parameters are called variables. In statistics, variables are stored in columns, with each variable occupying one column.

4

Cross-sectional and time-series analyses

In a cross-sectional analysis a unit/subject will be the entity you are studying. For example, if you study the housing market in San Diego, a unit will be a house, and variables will be price, size, age, etc., of a house.

In a time-series analysis the unit is a time unit, say, hour, day, month, etc.

5

Data Types

Nominal data: male/female, colors

Ordinal data: excellent/good/bad

Interval data: temperature, GMAT scores

Ratio data: distance to school, price

6

Two forms

GRAPHICAL form

NUMERICAL SUMMARY form

7

Graphical forms

Sequence plots

Histograms (frequency distributions)

Scatter plots

8

Sequence plots

Used to describe a time series.

The horizontal axis is always related to the sequence in which the data were collected.

The vertical axis is the value of the variable.

9

Example: sequence plot

[Figure: sequence plot of the S&P 500 (vertical axis, ticks from 430 to 470) against Index (horizontal axis, ticks from 10 to 40)]

10

Histograms I

A histogram (frequency distribution) shows how many values are in a certain range.

It is used for cross-sectional analysis.

The potential observation values are divided into groups (called classes).

The number of observations falling into each class is called the frequency.

When we say an observation falls into a class, we mean its value is greater than or equal to the lower bound but less than the upper bound of the class.

11

Example: histogram

A commercial bank is studying the time a customer spends in line. They recorded waiting times (in minutes) of 28 customers:

5.9 7.6 5.3 9.7 1.6 3.5 7.4

4.0 1.6 7.3 8.2 8.4 6.5 8.9

1.1 8.6 4.3 1.2 3.3 2.1 8.4

1.1 6.7 5.0 4.5 9.4 6.3 6.4

12

Example: histogram
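
The tally behind a histogram of these waiting times can be sketched in Python. This is a minimal illustration, not the original slide's figure: the class boundaries (width 2, starting at 1 minute) and the names `waiting_times`, `lower_bounds`, and `frequencies` are assumptions chosen for the example.

```python
# Waiting times (minutes) of the 28 customers from the bank example.
waiting_times = [
    5.9, 7.6, 5.3, 9.7, 1.6, 3.5, 7.4,
    4.0, 1.6, 7.3, 8.2, 8.4, 6.5, 8.9,
    1.1, 8.6, 4.3, 1.2, 3.3, 2.1, 8.4,
    1.1, 6.7, 5.0, 4.5, 9.4, 6.3, 6.4,
]

# Assumed classes: [1, 3), [3, 5), [5, 7), [7, 9), [9, 11).
lower_bounds = [1, 3, 5, 7, 9]
class_width = 2

# An observation falls into a class if lower <= value < upper.
frequencies = []
for lower in lower_bounds:
    upper = lower + class_width
    frequencies.append(sum(1 for t in waiting_times if lower <= t < upper))

# Print a simple text histogram, one row per class.
for lower, count in zip(lower_bounds, frequencies):
    print(f"[{lower}, {lower + class_width}): {count:2d} {'*' * count}")
```

With these assumed classes, every observation lands in exactly one class, so the frequencies add up to 28.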

13

Histogram II

The relative frequency distribution depicts the ratio of the frequency to the total number of observations.

The cumulative distribution depicts the proportion (or percentage) of observations that are less than a specific value.

14

Example: relative frequency distribution

A “relative frequency” distribution plots the fraction (or percentage) of observations in each class instead of the actual number. For this problem, the relative frequency of the first class is 6/28=0.214. The remaining relative frequencies are 0.179, 0.250, 0.286 and 0.071. A graph similar to the above one can then be plotted.

15

Example: cumulative distribution

In the previous example, the proportion of observations that are less than 3 minutes is 0.214; less than 5 minutes, 0.214 + 0.179 = 0.393; less than 7 minutes, 0.214 + 0.179 + 0.25 = 0.643; less than 9 minutes, 0.214 + 0.179 + 0.25 + 0.286 = 0.929; and less than 11 minutes, 1.0.
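
The relative and cumulative figures above can be reproduced with a short Python sketch. The class counts 6, 5, 7, 8, 2 are an assumption here: only the first (6 of 28) is stated directly, and the rest are inferred from the quoted relative frequencies. The variable names are illustrative.

```python
from itertools import accumulate

# Assumed class counts, inferred from the relative frequencies in the example.
frequencies = [6, 5, 7, 8, 2]
upper_bounds = [3, 5, 7, 9, 11]  # "less than" cut points, in minutes

total = sum(frequencies)

# Relative frequency: class count divided by the total number of observations.
relative = [f / total for f in frequencies]

# Cumulative distribution: running sum of the relative frequencies.
cumulative = list(accumulate(relative))

for upper, rel, cum in zip(upper_bounds, relative, cumulative):
    print(f"less than {upper:2d}: relative {rel:.3f}, cumulative {cum:.3f}")
```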

16

Example: cumulative distribution

17

Histogram III

The summation of all the relative frequencies is always 1.

The cumulative distribution is non-decreasing.

The last value of the cumulative distribution is always 1.

A cumulative distribution can be derived from the corresponding relative frequency distribution, and vice versa.
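
As a sketch of the "vice versa" direction, the relative frequencies can be recovered from a cumulative distribution by taking successive differences. The numbers are the rounded cumulative values from the waiting-time example; rounding to three decimals is an assumption made to match that example's precision.

```python
# Cumulative values from the waiting-time example (rounded to 3 decimals).
cumulative = [0.214, 0.393, 0.643, 0.929, 1.0]

# The first relative frequency equals the first cumulative value;
# each later one is the difference between neighbouring cumulative values.
relative = [cumulative[0]] + [
    round(cumulative[i] - cumulative[i - 1], 3)
    for i in range(1, len(cumulative))
]

# A cumulative distribution never decreases from one class to the next.
is_non_decreasing = all(a <= b for a, b in zip(cumulative, cumulative[1:]))

print(relative)
print(is_non_decreasing)
```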

18

Probability

A random variable is a variable whose values cannot be predetermined but are governed by some random mechanism.

Although we cannot predict precisely the value of a random variable, we might be able to tell the probability of the random variable being in a certain interval.

The relative frequency can also be read as (an estimate of) the probability of a random variable falling in the corresponding class.

In the same sense, the relative frequency distribution serves as (an estimate of) the probability distribution.

19

Scatter plots

A scatter plot shows the relationship between two variables.

20

Example: scatter plot

The following are the height and foot size measurements of 8 men arbitrarily selected from students in the cafeteria. Heights and foot sizes are in centimeters.

Man             1      2      3      4      5      6      7      8
Height (cm)     155    160    149    175    182    145    177    164
Foot size (cm)  23.3   21.8   22.1   26.3   28.0   20.7   25.3   24.9

21

Example: scatter plot

[Figure: scatter plot of Height, cm (vertical axis, ticks from 130 to 190) against Foot size, cm (horizontal axis, ticks from 20 to 28)]

22

Numerical Summary Forms

Central locations: mean, median, and mode.

Dispersion: standard deviation and variance.

Correlation.

23

Mean

The mean (average) is the sum of the observations divided by the number of observations.

27 22 26 24 27 20 23 24 18 32

Sum = (27 + 22 + 26 + 24 + 27 + 20 + 23 + 24 + 18 + 32) = 243

Mean = 243/10 = 24.3

24

Median

Median is the value of the central observation (the one in the middle), when the observations are listed in ascending or descending order.

When there is an even number of values, the median is given by the average of the middle two values.

When there is an odd number of values, the median is given by the middle number.

25

Example: median

18 20 22 23 24 24 26 27 27 32

There are ten values, so the median is the average of the 5th and 6th values: (24 + 24)/2 = 24.

26

Compare mean and median

The median is less sensitive to outliers than the mean. Check the mean and median for the following two data sets:

18 20 22 23 24 24 26 27 27 32

18 20 22 23 24 24 26 27 27 320
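
The comparison can be checked with a few lines of Python using the standard `statistics` module. The variable names are illustrative only.

```python
from statistics import mean, median

original = [18, 20, 22, 23, 24, 24, 26, 27, 27, 32]
with_outlier = [18, 20, 22, 23, 24, 24, 26, 27, 27, 320]

# The single outlier (320 instead of 32) pulls the mean upward,
# while the median depends only on the middle two values.
for name, data in [("original", original), ("with outlier", with_outlier)]:
    print(f"{name}: mean = {mean(data):.1f}, median = {median(data):.1f}")
```

The mean moves from 24.3 to 53.1, while the median stays at 24 in both data sets.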

27

Mode

Mode is the most frequently occurring value(s).
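
Because more than one value can be tied for most frequent, a data set may have several modes. A small Python sketch, assuming the ten observations from the mean example are reused here for illustration:

```python
from statistics import multimode

# The ten observations from the mean example (reused as an assumption).
sample = [27, 22, 26, 24, 27, 20, 23, 24, 18, 32]

# multimode returns every value tied for the highest count.
modes = multimode(sample)
print(sorted(modes))
```

In this sample both 24 and 27 occur twice and every other value occurs once, so there are two modes.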

28

Symmetry and skew

A frequency distribution in which the area to the left of the mean is a mirror image of the area to the right is called a symmetrical distribution.

A distribution that has a longer tail on the right hand side than on the left is called positively skewed or skewed to the right. A distribution that has a longer tail on the left is called negatively skewed.

If a distribution is positively skewed, the mean typically exceeds the median. For a negatively skewed distribution, the mean is typically less than the median.

29

Range

The range is the difference between the maximum and minimum values of the observations.

30

Standard deviation and variance

The standard deviation is used to describe the dispersion of the data.

The variance is the squared standard deviation.

31

Calculation of S.D.

Calculate the mean.

Calculate the deviations.

Calculate the squares of the deviations and sum them up.

Divide the sum by n-1 and take the square root.
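
The steps above can be sketched in Python as follows. The sample is the ten observations used in the mean example; the variable names are illustrative, and the last line only cross-checks the result against the standard library's `statistics.stdev`.

```python
from math import sqrt
from statistics import stdev

sample = [27, 22, 26, 24, 27, 20, 23, 24, 18, 32]
n = len(sample)

# Step 1: calculate the mean.
sample_mean = sum(sample) / n

# Step 2: calculate the deviations from the mean.
deviations = [x - sample_mean for x in sample]

# Step 3: square the deviations and sum them up.
sum_of_squares = sum(d ** 2 for d in deviations)

# Step 4: divide by n - 1 and take the square root.
variance = sum_of_squares / (n - 1)
std_dev = sqrt(variance)

print(f"mean = {sample_mean:.1f}, sum of squares = {sum_of_squares:.1f}")
print(f"variance = {variance:.2f}, std dev = {std_dev:.2f}")
print(f"statistics.stdev agrees: {abs(std_dev - stdev(sample)) < 1e-9}")
```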

32

Example: S.D.

Sample       27     22     26     24     27     20     23     24     18     32
Deviation    2.7    -2.3   1.7    -0.3   2.7    -4.3   -1.3   -0.3   -6.3   7.7
Sq. of Dev.  7.29   5.29   2.89   0.09   7.29   18.49  1.69   0.09   39.69  59.29

Sum of squares = 7.29 + 5.29 + ... + 59.29 = 142.1

Std. Dev. = $\sqrt{142.1 / 9} = \sqrt{15.79} \approx 3.97$

33

$$\text{std dev} = \sqrt{\frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}}$$

34

Empirical rules

If the distribution is symmetrical and bell-shaped,

Approximately 68% of the observations will be within plus and minus one standard deviation of the mean.

Approximately 95% of the observations will be within two standard deviations of the mean.

Approximately 99.7% of the observations will be within three standard deviations of the mean.
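
As a rough illustration, the share of observations within one, two, and three standard deviations can be counted directly. The sketch below reuses the ten observations from the standard deviation example; this is an assumption for illustration only, since a sample of ten is small and not necessarily bell-shaped, so the shares need not match 68%, 95%, and 99.7%. The helper `share_within` is hypothetical.

```python
from statistics import mean, stdev

sample = [27, 22, 26, 24, 27, 20, 23, 24, 18, 32]
m = mean(sample)
s = stdev(sample)  # sample standard deviation (divides by n - 1)


def share_within(k):
    """Fraction of observations within k standard deviations of the mean."""
    inside = sum(1 for x in sample if abs(x - m) <= k * s)
    return inside / len(sample)


for k in (1, 2, 3):
    print(f"within {k} std dev: {share_within(k):.0%}")
```

Here 7 of the 10 values fall within one standard deviation of the mean, and all 10 fall within two.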

35

Percentiles

The 75th percentile is the value such that 75% of the numbers are less than or equal to this value and the remaining 25% are larger than this value.

The k-th percentile is the value such that k% of the numbers are less than or equal to this value and the remaining (100 - k)% are larger than this value.
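
Several conventions exist for computing percentiles from a finite sample. The sketch below uses the nearest-rank convention, which is an assumption and not necessarily the one intended here: it returns the smallest observation such that at least k% of the values are less than or equal to it. The function name `percentile_nearest_rank` is hypothetical.

```python
def percentile_nearest_rank(values, k):
    """Nearest-rank k-th percentile for an integer k between 1 and 100."""
    ordered = sorted(values)
    n = len(ordered)
    # Integer ceiling of k * n / 100, avoiding floating-point rounding.
    rank = max(1, (k * n + 99) // 100)
    return ordered[rank - 1]


sample = [27, 22, 26, 24, 27, 20, 23, 24, 18, 32]
p75 = percentile_nearest_rank(sample, 75)
p50 = percentile_nearest_rank(sample, 50)
print(p75, p50)
```

With tied values the share at or below the result can exceed k%: here 9 of the 10 observations are less than or equal to the 75th percentile value of 27.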

36

Correlation coefficient

The correlation coefficient measures how closely two variables are (linearly) related to each other. It has a value between -1 and +1.

A positive value indicates a positive linear relationship; a negative value indicates a negative linear relationship.

If two variables are not linearly related, the correlation coefficient will be close to zero; if they are closely related, the correlation coefficient will be close to 1 or -1.
