SESSION 5 & 6

Last Update 23rd February 2011

Introduction to Statistics

Lecturer: Florian Boehlandt

University: University of Stellenbosch Business School

Domain: http://www.hedge-fund-analysis.net/pages/vega.php

Learning Objectives Part 1

1. Sampling (Random Sampling)

2. Sampling Error

3. Nonsampling Error

Sampling

• Why? Cost!

• The sample proportions are used as an estimate for the population proportions

• Examples:

– Nielsen ratings (1,000 television viewers)

– Quality Management (destroy items?)

Terminology

• Target Population: the population about which statisticians want to draw inferences

• Sampled Population: The actual population from which the sample is taken

• The sample statistic is a good estimator of the population parameter if target population = sampled population

Terminology

• Self-selected samples are almost always biased, because individuals who choose to participate tend to be more keenly interested in the issue than non-participants (SLOP = self-selected opinion poll)

Sampling Plan

• A simple random sample is a sample selected in such a way that every possible sample with the same number of observations is equally likely to be chosen.

• A stratified random sample is obtained by separating the population into mutually exclusive sets (strata), and then drawing simple random samples from each stratum.

Sampling Plan

• A cluster sample is a simple random sample of groups or clusters of elements

Simple Random Sampling

• Concept: Raffle. Each element of the chosen population is assigned a unique number and then ‘drawn from a hat’

+ Social security numbers

+ Student numbers

– Telephone numbers

• A random number table / random number generator (Excel: RAND) can be used to select sample numbers.
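The 'drawn from a hat' idea can be sketched in Python, whose standard `random` module stands in here for Excel's RAND. The ID range 1 to 1,000, the sample size of 50 and the function name `simple_random_sample` are assumptions made for illustration only.

```python
import random

def simple_random_sample(ids, sample_size, seed=None):
    """Draw sample_size distinct IDs without replacement.

    random.sample makes every subset of the given size equally likely,
    which matches the definition of a simple random sample.
    """
    rng = random.Random(seed)  # a seed makes the draw reproducible
    return rng.sample(list(ids), sample_size)

# Hypothetical population: elements numbered 1..1000
population_ids = range(1, 1001)
sample = simple_random_sample(population_ids, 50, seed=2011)
print(sorted(sample)[:5])
```

Sampling without replacement ensures that no element is drawn twice, just as a raffle ticket leaves the hat once it has been drawn.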

Simple Random Sampling

• Example Tax Returns (Keller 2006: p. 148)

Stratified Random Sampling

• Concept: Increase the amount of information about the population

• Examples of criteria separating the population into strata:

– Gender

– Age

– Occupation

– Household Income

Stratified Random Sampling

• Example Proposed Tax Increase:

1. Draw random samples from four income groups according to their proportions in the population

2. Make adjustments before making inferences about the entire population

Stratum   Income ‘000s   Population %   Sample
1         Under 25       25%            250
2         25-49          40%            400
3         50-75          30%            300
4         Over 75        5%             50
Total                                   1,000
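The allocation in the table can be reproduced with a short Python sketch. The population shares and the total of 1,000 come from the table; the function names and the synthetic population of 20,000 labelled people are assumptions for illustration. Step 2 (adjusting before making inferences) is not covered here.

```python
import random

def proportional_allocation(population_pct, total_sample):
    """Sample size per stratum, proportional to its population share in percent."""
    return {name: total_sample * pct // 100 for name, pct in population_pct.items()}

def stratified_sample(strata, allocation, seed=None):
    """Draw a simple random sample of the allocated size within each stratum."""
    rng = random.Random(seed)
    return {name: rng.sample(members, allocation[name])
            for name, members in strata.items()}

# Population shares from the table (income in '000s)
population_pct = {"Under 25": 25, "25-49": 40, "50-75": 30, "Over 75": 5}
allocation = proportional_allocation(population_pct, 1000)
print(allocation)

# Hypothetical population of 20,000 people, labelled by stratum and running number
strata = {name: [f"{name}#{i}" for i in range(200 * pct)]
          for name, pct in population_pct.items()}
drawn = stratified_sample(strata, allocation, seed=1)
```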

Systematic Sampling

• Concept: sample members are chosen in a regular manner working progressively through the list

• Example Vega students:

500 students from Vega’s 8,500 enrolled students: 8,500 / 500 = 17. Thus, every 17th student would be selected
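A minimal Python sketch of the 'every 17th student' rule follows. The student numbers 1 to 8,500 are hypothetical, and the random starting point within the first 17 students is an assumption that the slide does not state.

```python
import random

def systematic_sample(population, sample_size, seed=None):
    """Select every k-th element, where k = population size // sample size."""
    step = len(population) // sample_size        # 8,500 // 500 = 17
    start = random.Random(seed).randrange(step)  # random start (assumption)
    return population[start::step][:sample_size]

students = list(range(1, 8501))  # hypothetical student numbers 1..8500
chosen = systematic_sample(students, 500, seed=5)
print(chosen[:3])
```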

Cluster Sampling

• Concept: Useful when it is difficult or costly to develop a complete list of population members (i.e. making it difficult to draw a simple random sample) or when the population elements are widely dispersed (geographically)

• Example: Each block within a city represents a cluster. A sample of clusters could then be selected and every household within these clusters is questioned (consider: what happens to the sampling error, and what role does the sample size play?)
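The city-block example can be sketched in Python as follows. The city of 40 blocks with 25 households each, the choice of 4 blocks and all names are hypothetical.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Pick n_clusters whole clusters at random and keep every element in them."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: clusters[name] for name in chosen}

# Hypothetical city: 40 blocks with 25 households each
city = {
    f"block-{b:02d}": [f"block-{b:02d}/house-{h}" for h in range(1, 26)]
    for b in range(1, 41)
}
surveyed = cluster_sample(city, 4, seed=3)
households = [h for block in surveyed.values() for h in block]
print(len(households))
```

Only a list of blocks is needed to draw the sample; a complete list of households is required only for the selected blocks.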

Sampling Error

• Sampling error refers to the differences between the sample and the population that exist because of the observations that happened to be selected for the sample. The value of the sample mean will deviate from the population mean simply by chance

• The difference between the true (unknown) value of the population mean μ and its estimate (the sample mean x-bar) is the sampling error

• Sampling error arises by chance alone; the only way to reduce its expected size is to increase the sample size n
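The effect of the sample size on the sampling error can be illustrated with a small simulation. The population below is synthetic (10,000 values drawn from a normal distribution with mean 50 and standard deviation 10), and the function name and the number of repetitions are assumptions chosen for the sketch.

```python
import random
import statistics

def mean_abs_sampling_error(population, n, repetitions=200, seed=0):
    """Average absolute gap between the sample mean and the population mean."""
    rng = random.Random(seed)
    mu = statistics.mean(population)
    errors = [
        abs(statistics.mean(rng.sample(population, n)) - mu)
        for _ in range(repetitions)
    ]
    return statistics.mean(errors)

# Synthetic population (assumption): 10,000 values around 50
gen = random.Random(1)
population = [gen.gauss(50, 10) for _ in range(10_000)]

small = mean_abs_sampling_error(population, 10)
large = mean_abs_sampling_error(population, 1000)
print(round(small, 2), round(large, 2))
```

The sample means for n = 1,000 should lie much closer to the population mean, on average, than those for n = 10.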

Nonsampling Error

• Nonsampling errors are due to mistakes made in the acquisition of data or due to the sample observations being selected improperly

• Nonsampling errors are more serious than sampling errors, because taking a larger sample won’t diminish the size, or possibility of occurrence, of this error

Types of Nonsampling Error

• Errors in data acquisition: incorrect measurements/responses, inaccurate recording

• Nonresponse error: bias introduced when responses are not obtained from some members of the sample, so that the responses collected are not representative of the target population; a typical problem of self-administered surveys

• Selection bias: Some members of the target population cannot possibly be included in the sample (e.g. members have no phone)

Learning Objectives Part 2

4. Frequency Tables

5. Histograms

6. Class Intervals and Width

Frequency Tables – Data Types

• Nominal Data / Ordinal Data → Categories → Count the number of times each category of the variable occurs → Frequency Distribution → Bar Chart

• Interval Data → Class Intervals → Count the number of observations that fall into each of a series of intervals → Frequency Distribution → Histogram

Frequency Tables – Data Types

• There are times when a data set contains a large number of values (even when the data type is nominal) that would result in a table with too many rows to be convenient. We can overcome this problem by grouping the data into fewer categories or classes and then compiling a grouped frequency distribution.

Frequency Tables – Data Types

• Ungrouped Data → Categories → Count the number of times each category of the variable occurs → Frequency Distribution → Bar Chart

• Grouped Data → Class Intervals → Count the number of observations that fall into each of a series of intervals → Frequency Distribution → Histogram

Frequency Tables – Data Types

• Example 1: Coffee refills

Data type nominal; data ungrouped → Categories

• Example 2: Class marks out of 100

Data type nominal; BUT: data may be grouped → Class intervals (approximately interval)

• Example 3: Waiting times at supermarket cashiers

Data type interval → Class intervals

Number of Categories

Nominal / not grouped:

1. Determine maximum and minimum observation

2. Define categories including all distinct (integer) observations in between

Example tossing two dice:

Min: 2

Max: 12

Other possible outcomes: 3 4 5 6 7 8 9 10 11

(all outcomes accounted for)
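The two steps for ungrouped data can be sketched in Python. The list of dice totals below is invented for illustration, and `frequency_table` is a hypothetical helper name.

```python
from collections import Counter

def frequency_table(observations):
    """Count each integer category from the minimum to the maximum observation."""
    counts = Counter(observations)
    low, high = min(observations), max(observations)
    # Categories with no observations are kept with a frequency of 0
    return {value: counts.get(value, 0) for value in range(low, high + 1)}

# Hypothetical totals from tossing two dice
totals = [7, 2, 12, 7, 5, 9, 7, 3, 8, 6, 10, 8]
table = frequency_table(totals)
for value, frequency in table.items():
    print(value, frequency)
```

Keeping empty categories in the table ensures that all outcomes between the minimum and maximum are accounted for.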

Number of Class Intervals

Interval or grouped data:

The more observations there are, the larger the number of class intervals required.

Sturges’ Formula:

Number of class intervals = 1 + 3.3 log10(n) OR

Number of class intervals = 1 + 1.4 ln(n)

Example n = 50:

Number of class intervals = 1 + 3.3 log10(50) = 1 + 3.3 * 1.70 = 6.61 ≈ 7

Number of class intervals = 1 + 1.4 ln(50) = 1 + 1.4 * 3.91 = 6.48 ≈ 6
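Both versions of the formula can be checked with a few lines of Python. The function names are assumptions; the constants 3.3 and 1.4 are taken from the slide.

```python
import math

def sturges_log10(n):
    """Number of class intervals using the base-10 logarithm."""
    return 1 + 3.3 * math.log10(n)

def sturges_ln(n):
    """Number of class intervals using the natural logarithm."""
    return 1 + 1.4 * math.log(n)

for n in (50, 100, 1000):
    print(n, round(sturges_log10(n), 2), round(sturges_ln(n), 2))
```

The two versions can round to different whole numbers because 1.4 is a rounded value of 3.3 / ln(10) ≈ 1.43.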

Excursion Logarithms

• The logarithm of a number to a given base is the exponent to which the base must be raised in order to produce that number. (Example: 10^1.70 ≈ 50)

• The natural logarithm is the logarithm to the base e, where e is an irrational constant approximately equal to 2.718. The natural logarithm of a number x (written as ln(x)) is the power to which e would have to be raised to equal x. (Example: e^3.91 ≈ 50)

• The mathematical constant e (Euler’s number) is the unique real number such that the derivative d/dx (the slope of the tangent line) of the function f(x) = e^x at the point x = 0 is equal to 1. The function f(x) = e^x is called the exponential function.
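These statements can be verified numerically in Python. This is a sketch only: the central-difference helper `slope_at` and its step size are assumptions chosen for illustration.

```python
import math

x = 50
log10_x = math.log10(x)  # exponent such that 10 ** log10_x equals 50
ln_x = math.log(x)       # exponent such that e ** ln_x equals 50
print(round(log10_x, 2), round(ln_x, 2))

def slope_at(f, point, h=1e-6):
    """Central-difference approximation of the derivative of f at point."""
    return (f(point + h) - f(point - h)) / (2 * h)

# Slope of the tangent line to f(x) = e^x at x = 0
slope = slope_at(math.exp, 0.0)
print(round(slope, 6))
```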

Class Interval Width

Class width:

1. Subtract the smallest observation from the largest observation

2. Divide by number of classes (Sturges)

3. Round class width to convenient value

4. Select a lower limit so that the first class interval contains the smallest observation. Determine all other intervals consecutively by adding (multiples of) the class width
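The four steps can be sketched in Python under a few assumptions: the marks below are invented, 'convenient value' is interpreted as rounding the width up to the next whole number, the lower limit is the smallest observation rounded down, and each interval includes its lower limit but not its upper limit.

```python
import math

def class_intervals(data, n_classes):
    """Return the class width and a list of (lower, upper) class intervals."""
    smallest, largest = min(data), max(data)
    # Steps 1-3: range divided by the number of classes, rounded up (assumption)
    width = max(1, math.ceil((largest - smallest) / n_classes))
    # Step 4: the first interval starts at or below the smallest observation
    lower = math.floor(smallest)
    intervals = []
    # Keep adding intervals until the largest observation is covered
    while not intervals or intervals[-1][1] <= largest:
        start = lower + len(intervals) * width
        intervals.append((start, start + width))
    return width, intervals

# Hypothetical marks; 6 classes chosen for illustration
marks = [3, 17, 22, 25, 31, 38, 44, 49, 58, 12]
width, intervals = class_intervals(marks, 6)
print(width, intervals)
```

Rounding the width up rather than down keeps the number of intervals from exceeding the planned number in most cases.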

Definitions

• Class Mark or Class Midpoint: Obtained by adding the lower class limit to the upper class limit and dividing by two (used for the frequency polygon)

• Width of a class interval or class length: The difference between the upper class limit and the lower class limit. Usually, all classes are of equal width / length (Sturges)

• Class Boundaries: The class limits are stated in such a way that there is no overlap between classes

Definitions

Class Boundaries: The class limits are stated in such a way that there is no overlap between classes. Limits are stated in this manner so that there cannot be any doubt as to which class a certain value (observation) is to be allocated. Since data is often rounded, the true class limits are not the same as the stated class limits.

Example:

Weights recorded to the nearest kilogram

Stated class interval: 60 – 62

True class interval or class boundaries: 59.5 – 62.5
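The weight example can be reproduced with a short Python sketch. The function names are hypothetical, and the `precision` parameter (the unit to which the data were rounded, here 1 kilogram) is an assumption introduced for the sketch.

```python
def true_class_boundaries(lower_limit, upper_limit, precision=1.0):
    """Widen the stated limits by half the rounding unit on each side."""
    half_unit = precision / 2
    return lower_limit - half_unit, upper_limit + half_unit

def class_mark(lower_limit, upper_limit):
    """Class midpoint: lower limit plus upper limit, divided by two."""
    return (lower_limit + upper_limit) / 2

# Weights recorded to the nearest kilogram, stated class interval 60 - 62
print(true_class_boundaries(60, 62))  # (59.5, 62.5)
print(class_mark(60, 62))             # 61.0
```

A recorded weight of 60 kg may stand for any true weight from 59.5 kg up to 60.5 kg, which is why the true boundaries extend half a unit beyond the stated limits.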