1 data description. 2 units l unit: entity we are studying, subject if human being l each...

1 DATA DESCRIPTION

Upload: beverly-doyle

Post on 28-Dec-2015


TRANSCRIPT

DATA DESCRIPTION

Units

Unit: the entity we are studying; called a subject if it is a human being.

Each unit/subject has certain parameters; e.g., a student (subject) has an age, weight, height, home address, number of units taken, and so on.

Variables

These parameters are called variables. In statistics, variables are stored in columns, each variable occupying one column.

Cross-sectional and time-series analyses

In a cross-sectional analysis a unit/subject will be the entity you are studying. For example, if you study the housing market in San Diego, a unit will be a house, and variables will be price, size, age, etc., of a house.

In a time-series analysis the unit is a time unit, say, hour, day, month, etc.

Data Types

Nominal data: male/female, colors

Ordinal data: excellent/good/bad

Interval data: temperature, GMAT scores

Ratio data: distance to school, price

Two forms

GRAPHICAL form

NUMERICAL SUMMARY form

Graphical forms

Sequence plots

Histograms (frequency distributions)

Scatter plots

Sequence plots

Used to describe a time series.

The horizontal axis is always related to the sequence in which data were collected.

The vertical axis is the value of the variable.

Example: sequence plot

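[Figure: sequence plot of the S&P-500 (vertical axis, ticks from 430 to 470) against the observation index (horizontal axis, ticks from 10 to 40).]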

Histograms I

A histogram (frequency distribution) shows how many values are in a certain range.

It is used for cross-sectional analysis.

The potential observation values are divided into groups (called classes).

The number of observations falling into each class is called the frequency.

When we say an observation falls into a class, we mean its value is greater than or equal to the lower bound but less than the upper bound of the class.

Example: histogram

A commercial bank is studying the time a customer spends in line. They recorded waiting times (in minutes) of 28 customers:

5.9 7.6 5.3 9.7 1.6 3.5 7.4

4.0 1.6 7.3 8.2 8.4 6.5 8.9

1.1 8.6 4.3 1.2 3.3 2.1 8.4

1.1 6.7 5.0 4.5 9.4 6.3 6.4
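The class rule (lower bound included, upper bound excluded) can be tried on these 28 waiting times. The following is a minimal sketch, not the slide author's method. The class boundaries 1, 3, 5, 7, 9, 11 are an assumption inferred from the cumulative-distribution example later in the deck, and the names `class_bounds` and `class_frequencies` are hypothetical.

```python
# Waiting times (minutes) of the 28 customers, copied from the slide.
waiting_times = [
    5.9, 7.6, 5.3, 9.7, 1.6, 3.5, 7.4,
    4.0, 1.6, 7.3, 8.2, 8.4, 6.5, 8.9,
    1.1, 8.6, 4.3, 1.2, 3.3, 2.1, 8.4,
    1.1, 6.7, 5.0, 4.5, 9.4, 6.3, 6.4,
]

# Assumed class boundaries: [1, 3), [3, 5), [5, 7), [7, 9), [9, 11).
class_bounds = [1, 3, 5, 7, 9, 11]


def class_frequencies(values, bounds):
    """Count how many values fall into each class [lower, upper)."""
    counts = [0] * (len(bounds) - 1)
    for value in values:
        for i in range(len(counts)):
            # Lower bound is inclusive, upper bound is exclusive.
            if bounds[i] <= value < bounds[i + 1]:
                counts[i] += 1
                break
    return counts


frequencies = class_frequencies(waiting_times, class_bounds)
for lower, upper, count in zip(class_bounds, class_bounds[1:], frequencies):
    print(f"[{lower}, {upper}): {count}")
```

Note that the observation 5.0 is counted in the class [5, 7), not in [3, 5), because the upper bound of a class is excluded.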

Example: histogram

Histogram II

The relative frequency distribution depicts, for each class, the ratio of the class frequency to the total number of observations.

The cumulative distribution depicts the fraction (or percentage) of observations that are less than a specific value.

Example: relative frequency distribution

A “relative frequency” distribution plots the fraction (or percentage) of observations in each class instead of the actual number. For this problem, the relative frequency of the first class is 6/28=0.214. The remaining relative frequencies are 0.179, 0.250, 0.286 and 0.071. A graph similar to the above one can then be plotted.
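The arithmetic can be checked with a short sketch. The class counts 6, 5, 7, 8, 2 are an assumption: they come from tallying the 28 waiting times into two-minute classes starting at 1 minute, which reproduces the relative frequencies quoted on the slide.

```python
# Class counts for the waiting-time example, assuming the classes
# [1, 3), [3, 5), [5, 7), [7, 9), [9, 11) (tallied from the 28 observations).
frequencies = [6, 5, 7, 8, 2]

total = sum(frequencies)  # total number of observations

# Relative frequency = class frequency / total number of observations.
relative_frequencies = [count / total for count in frequencies]

print([round(r, 3) for r in relative_frequencies])
# [0.214, 0.179, 0.25, 0.286, 0.071]
```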

Example: cumulative distribution

In the previous example, the fraction of observations that are less than 3 minutes is 0.214 (21.4%); less than 5 minutes, 0.214 + 0.179 = 0.393; less than 7 minutes, 0.214 + 0.179 + 0.25 = 0.643; less than 9 minutes, 0.214 + 0.179 + 0.25 + 0.286 = 0.929; and less than 11 minutes, 1.0.

Example: cumulative distribution

Histogram III

The summation of all the relative frequencies is always 1.

The cumulative distribution is non-decreasing.

The last value of the cumulative distribution is always 1.

A cumulative distribution can be derived from the corresponding relative frequency distribution, and vice versa.
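As an illustration of the last point, the sketch below converts in both directions using the rounded relative frequencies of the waiting-time example. A running sum gives the cumulative distribution, and successive differences give the relative frequencies back. The variable names are hypothetical.

```python
from itertools import accumulate

# Rounded relative frequencies from the waiting-time example.
relative = [0.214, 0.179, 0.250, 0.286, 0.071]

# Relative -> cumulative: running sum of the relative frequencies.
cumulative = list(accumulate(relative))

# Cumulative -> relative: difference between consecutive cumulative values.
recovered = [cumulative[0]] + [
    later - earlier for earlier, later in zip(cumulative, cumulative[1:])
]

print([round(c, 3) for c in cumulative])  # [0.214, 0.393, 0.643, 0.929, 1.0]
print([round(r, 3) for r in recovered])   # [0.214, 0.179, 0.25, 0.286, 0.071]
```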

Probability

A random variable is a variable whose values cannot be predetermined but are governed by some random mechanism.

Although we cannot predict the value of a random variable precisely, we might be able to tell the probability of the random variable being in a certain interval.

The relative frequency can be read as an empirical estimate of the probability that the random variable falls in the corresponding class.

In this sense, the relative frequency distribution serves as an empirical probability distribution.

Scatter plots

A scatter plot shows the relationship between two variables.

Example: scatter plot

The following are the height and foot size measurements of 8 men arbitrarily selected from students in the cafeteria. Heights and foot sizes are in centimeters.

Man              1      2      3      4      5      6      7      8
Height (cm)    155    160    149    175    182    145    177    164
Foot size (cm)  23.3   21.8   22.1   26.3   28.0   20.7   25.3   24.9

Example: scatter plot

[Figure: scatter plot of height, cm (axis ticks from 130 to 190) against foot size, cm (axis ticks from 20 to 28).]

Numerical Summary Forms

Central locations: mean, median, and mode.

Dispersion: standard deviation and variance.

Correlation.

Mean

The mean (average) is the sum of the observations divided by the number of observations.

27 22 26 24 27 20 23 24 18 32

Sum = (27 + 22 + 26 + 24 + 27 + 20 + 23 + 24 + 18 + 32) = 243

Mean = 243/10 = 24.3

Median

The median is the value of the central observation (the one in the middle) when the observations are listed in ascending or descending order.

When there is an even number of values, the median is given by the average of the middle two values.

When there is an odd number of values, the median is given by the middle number.

Example: median

18 20 22 23 24 24 26 27 27 32
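The ten values above are already sorted, so the even-count rule applies: the median is the average of the 5th and 6th values, (24 + 24)/2 = 24. A minimal sketch of both rules follows; the function name `median` is just an illustrative choice.

```python
def median(values):
    """Median by the two rules on the previous slide."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd number of values: take the middle one.
        return ordered[middle]
    # Even number of values: average the two middle ones.
    return (ordered[middle - 1] + ordered[middle]) / 2


sample = [18, 20, 22, 23, 24, 24, 26, 27, 27, 32]
print(median(sample))  # 24.0
```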

Compare mean and median

The median is less sensitive to outliers than the mean. Check the mean and median for the following two data sets:

18 20 22 23 24 24 26 27 27 32

18 20 22 23 24 24 26 27 27 320
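The check can be done by hand or with Python's standard `statistics` module, as in this short sketch. Replacing 32 by 320 raises the mean from 24.3 to 53.1, while the median stays at 24.

```python
from statistics import mean, median

original = [18, 20, 22, 23, 24, 24, 26, 27, 27, 32]
with_outlier = [18, 20, 22, 23, 24, 24, 26, 27, 27, 320]

# The mean uses every value, so one extreme value pulls it upward.
print(mean(original), mean(with_outlier))      # 24.3 53.1

# The median depends only on the middle of the ordered list.
print(median(original), median(with_outlier))  # 24.0 24.0
```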

Mode

The mode is the most frequently occurring value (or values, if several are tied).
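As an illustration, the sample used on the Mean slide has two modes, because 24 and 27 each occur twice. The helper `modes` below is a hypothetical name for a small sketch that returns every tied value.

```python
from collections import Counter


def modes(values):
    """Return all values that occur most often, in ascending order."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(value for value, count in counts.items() if count == highest)


sample = [27, 22, 26, 24, 27, 20, 23, 24, 18, 32]
print(modes(sample))  # [24, 27]
```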

Symmetry and skew

A frequency distribution in which the area to the left of the mean is a mirror image of the area to the right is called a symmetrical distribution.

A distribution that has a longer tail on the right hand side than on the left is called positively skewed or skewed to the right. A distribution that has a longer tail on the left is called negatively skewed.

If a distribution is positively skewed, the mean typically exceeds the median. For a negatively skewed distribution, the mean is typically less than the median.

Range

The range is the difference between the maximum and minimum values of the observations.

Standard deviation and variance

The standard deviation is used to describe the dispersion of the data.

The variance is the squared standard deviation.

Calculation of S.D.

Calculate the mean;

calculate the deviations;

calculate the squares of the deviations and sum them up;

divide the sum by n-1 and take the square root.
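The four steps can be sketched directly in code. This is an illustrative sketch applied to the ten-value sample from the Mean slide; the function name `sample_std_dev` is hypothetical.

```python
import math


def sample_std_dev(values):
    """Sample standard deviation, following the four steps above."""
    n = len(values)
    mean = sum(values) / n                           # step 1: mean
    deviations = [x - mean for x in values]          # step 2: deviations
    sum_of_squares = sum(d * d for d in deviations)  # step 3: squares, summed
    return math.sqrt(sum_of_squares / (n - 1))       # step 4: divide by n-1, root


sample = [27, 22, 26, 24, 27, 20, 23, 24, 18, 32]
print(round(sample_std_dev(sample), 2))  # 3.97
```

The standard library function `statistics.stdev` uses the same n-1 divisor and should agree with this result.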

Example: S.D.

Sample        27     22     26     24     27     20     23     24     18     32
Deviation      2.7   -2.3    1.7   -0.3    2.7   -4.3   -1.3   -0.3   -6.3    7.7
Sq. of dev.    7.29   5.29   2.89   0.09   7.29  18.5    1.69   0.09  39.7   59.3

Sum of squares = 7.29 + 5.29 + ... + 59.3 = 142.1

Std. Dev. = $\sqrt{142.1 / 9} = \sqrt{15.79} = 3.97$

$$\text{std. dev.} = \sqrt{\frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}}$$

Empirical rules

If the distribution is symmetrical and bell-shaped:

Approximately 68% of the observations will be within one standard deviation of the mean.

Approximately 95% of the observations will be within two standard deviations of the mean.

Approximately 99.7% of the observations will be within three standard deviations of the mean.
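These percentages can be checked on simulated data. The sketch below is an illustration only: the 100,000 values are drawn from a normal distribution with mean 0 and standard deviation 1 (an assumption made here, not data from the slides), and the seed is fixed so the run is repeatable.

```python
import random
import statistics

random.seed(0)  # fixed seed so repeated runs give the same numbers

# Simulated bell-shaped data (assumed normal, mean 0, standard deviation 1).
data = [random.gauss(0, 1) for _ in range(100_000)]
center = statistics.fmean(data)
spread = statistics.stdev(data)


def fraction_within(values, center, spread, k):
    """Fraction of values within k standard deviations of the mean."""
    inside = sum(1 for x in values if abs(x - center) <= k * spread)
    return inside / len(values)


fractions = {k: fraction_within(data, center, spread, k) for k in (1, 2, 3)}
for k, fraction in fractions.items():
    print(f"within {k} standard deviation(s): {fraction:.3f}")
```

The printed fractions should come out close to 0.68, 0.95 and 0.997. On data that are skewed or not bell-shaped, the rule need not hold.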

Percentiles

The 75th percentile is the value such that 75% of the numbers are less than or equal to this value and the remaining 25% are larger than this value.

The k-th percentile is the value such that k% of the numbers are less than or equal to this value and the remaining (100-k)% are larger than this value.
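Several conventions exist for computing percentiles from a finite sample, and the slides do not say which one is intended. The sketch below uses the nearest-rank convention as an assumption: it returns the smallest observation such that at least k% of the values are less than or equal to it. The function name `percentile` is hypothetical.

```python
def percentile(values, k):
    """Nearest-rank k-th percentile for an integer k with 0 < k <= 100."""
    if not (isinstance(k, int) and 0 < k <= 100):
        raise ValueError("k must be an integer between 1 and 100")
    ordered = sorted(values)
    # Integer form of ceil(k * n / 100), avoiding floating-point rounding.
    rank = -(-k * len(ordered) // 100)
    return ordered[rank - 1]


sample = [27, 22, 26, 24, 27, 20, 23, 24, 18, 32]
print(percentile(sample, 75))  # 27
print(percentile(sample, 50))  # 24
```

Other conventions interpolate between neighbouring observations and can give slightly different values on small samples.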

Correlation coefficient

The correlation coefficient measures how closely two variables are (linearly) related to each other. It has a value between -1 and +1.

Its sign distinguishes positive from negative linear relationships.

If two variables are not linearly related, the correlation coefficient will be close to zero; if they are closely linearly related, the correlation coefficient will be close to 1 or -1.
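The slides do not give a formula. As an assumption, the sketch below uses the standard Pearson correlation coefficient and applies it to the height and foot-size measurements from the scatter plot example. The function name `correlation` is hypothetical.

```python
import math


def correlation(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Measurements of the 8 men from the scatter plot example (cm).
heights = [155, 160, 149, 175, 182, 145, 177, 164]
foot_sizes = [23.3, 21.8, 22.1, 26.3, 28.0, 20.7, 25.3, 24.9]

r = correlation(heights, foot_sizes)
print(round(r, 2))  # 0.93
```

A value this close to +1 is consistent with a strong positive linear relationship between height and foot size in this small sample.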