
SESSION 5 & 6

Last Update 23rd February 2011

Introduction to Statistics

Lecturer: Florian Boehlandt

University: University of Stellenbosch Business School

Domain: http://www.hedge-fund-analysis.net/pages/vega.php

http://www.hedge-fund-analysis.net/





Learning Objectives Part 1

1. Sampling (Random Sampling)

2. Sampling Error

3. Nonsampling Error

Sampling

• Why? Cost!

• The sample proportions are used as an estimate for the population proportions

• Examples:

– Nielsen ratings (1,000 television viewers)

– Quality Management (destroy items?)

Terminology

• Target Population: the population about which statisticians want to draw inferences

• Sampled Population: The actual population from which the sample is taken

• The sample statistic is a good estimator of the population parameter if target population = sampled population

Terminology

• Self-selected samples are almost always biased, because the individuals who choose to participate are more keenly interested in the issue than non-participants (SLOP = self-selected opinion poll)

Sampling Plan

• A simple random sample is a sample selected in such a way that every possible sample with the same number of observations is equally likely to be chosen.

• A stratified random sample is obtained by separating the population into mutually exclusive sets (strata), and then drawing simple random samples from each stratum.

Sampling Plan

• A cluster sample is a simple random sample of groups or clusters of elements

Simple Random Sampling

• Concept: raffle. Each element of the chosen population is assigned a unique number, and the numbers are then ‘drawn from a hat’

+ Social security numbers

+ Student numbers

– Telephone numbers

• A random number table or a random number generator (Excel: RAND) can be used to select sample numbers.
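The ‘drawn from a hat’ idea can be sketched with a random number generator. This is a minimal illustration, not part of the original slides: the function name, the seed and the ID range 1 to 8,500 are assumptions made for the example.

```python
import random


def simple_random_sample(population_ids, n, seed=None):
    """Draw n distinct IDs so that every subset of size n is equally likely."""
    rng = random.Random(seed)  # a seed only makes the draw repeatable
    return rng.sample(list(population_ids), n)


# Hypothetical population: unique ID numbers 1..8500 (assumed for illustration)
id_numbers = range(1, 8501)
sample = simple_random_sample(id_numbers, 500, seed=42)
print(len(sample), "IDs drawn, all distinct:", len(set(sample)) == len(sample))
```

Sampling without replacement (as `random.sample` does) matches the raffle picture: a number that has been drawn is not put back into the hat.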

Simple Random Sampling

• Example Tax Returns (Keller 2006: p. 148)

Stratified Random Sampling

• Concept: Increase the amount of information about the population

• Examples of criteria separating the population into strata:

– Gender

– Age

– Occupation

– Household Income

Stratified Random Sampling

• Example Proposed Tax Increase:

1. Draw random samples from four income groups according to their proportions in the population

2. Make adjustments before making inferences about the entire population

Stratum   Income (‘000s)   Population %   Sample
1         Under 25         25%            250
2         25-49            40%            400
3         50-75            30%            300
4         Over 75          5%             50
Total                                     1,000
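The sample sizes in the table follow from allocating the total sample of 1,000 in proportion to each stratum's share of the population. A short sketch of that arithmetic (the function name is hypothetical; the shares and the total come from the table):

```python
def proportional_allocation(shares, total_sample):
    """Allocate the total sample across strata in proportion to population share."""
    return {stratum: round(share * total_sample) for stratum, share in shares.items()}


# Population shares of the four income strata, in thousands
shares = {"Under 25": 0.25, "25-49": 0.40, "50-75": 0.30, "Over 75": 0.05}
allocation = proportional_allocation(shares, 1000)

for stratum, size in allocation.items():
    print(f"{stratum:>8}: {size}")
```

With other shares or totals, rounding can make the stratum sizes add up to slightly more or less than the total, so the result should be checked before sampling.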

Systematic Sampling

• Concept: sample members are chosen in a regular manner working progressively through the list

• Example Vega students:

To select 500 students from Vega’s 8,500 enrolled students: 8,500 / 500 = 17. Thus, every 17th student on the list would be selected
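A sketch of the Vega example in code. The random starting position within the first 17 students is an assumption added here (the slide only states the interval of 17), and the student list is a hypothetical stand-in for the enrolment list.

```python
import random


def systematic_sample(items, n, seed=None):
    """Select every k-th item, where k = len(items) // n."""
    k = len(items) // n                      # sampling interval, 8500 // 500 = 17
    start = random.Random(seed).randrange(k)  # assumed: random start in 0..k-1
    return [items[start + i * k] for i in range(n)]


students = list(range(1, 8501))  # hypothetical enrolment list, numbered 1..8500
chosen = systematic_sample(students, 500, seed=7)
print(len(chosen), "students selected, interval", chosen[1] - chosen[0])
```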

Cluster Sampling

• Concept: Useful when it is difficult or costly to develop a complete list of population members (i.e. making it difficult to draw a simple random sample) or when the population elements are widely dispersed (geographically)

• Example: Each block within a city represents a cluster. A sample of clusters could then be selected and every household within these clusters is questioned (sampling error? sample size)
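The city-block example can be sketched as follows: draw a simple random sample of blocks, then question every household in the chosen blocks. The block and household names are invented for illustration only.

```python
import random


def cluster_sample(clusters, n_clusters, seed=None):
    """Pick whole clusters at random and keep every element inside them."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)  # random sample of cluster labels
    return {label: clusters[label] for label in chosen}


# Hypothetical city: block label -> households living on that block
city = {
    "Block A": ["A1", "A2", "A3"],
    "Block B": ["B1", "B2"],
    "Block C": ["C1", "C2", "C3", "C4"],
    "Block D": ["D1"],
}
selected = cluster_sample(city, 2, seed=3)
households = [h for block in selected.values() for h in block]
print(sorted(selected), households)
```

Only a list of blocks is needed up front, not a list of all households. The number of households questioned depends on which blocks happen to be drawn.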

Sampling Error

• Sampling error refers to the differences between the sample and the population that exist because of the observations that happened to be selected for the sample. The value of the sample mean will deviate from the population mean simply by chance

• The difference between the true (unknown) value of the population mean μ and its estimate (the sample mean x-bar) is the sampling error

• The only way to reduce the expected size of the sampling error is to increase the sample size n
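A small simulation can illustrate how the sampling error behaves as n grows. Everything here is assumed for the illustration: the population is artificial (10,000 normally distributed values), and the function name, seeds and number of repetitions are arbitrary choices.

```python
import random
import statistics


def mean_abs_sampling_error(population, n, repetitions=200, seed=0):
    """Average |x-bar - mu| over repeated simple random samples of size n."""
    rng = random.Random(seed)
    mu = statistics.fmean(population)  # known here only because the population is simulated
    errors = [
        abs(statistics.fmean(rng.sample(population, n)) - mu)
        for _ in range(repetitions)
    ]
    return statistics.fmean(errors)


# Artificial population (assumption): 10,000 values with mean near 50
gen = random.Random(1)
population = [gen.gauss(50, 10) for _ in range(10_000)]

small = mean_abs_sampling_error(population, 10)
large = mean_abs_sampling_error(population, 1000)
print(f"n = 10:   average error {small:.2f}")
print(f"n = 1000: average error {large:.2f}")
```

Any single sample can still land far from μ by chance. It is the typical size of the error that shrinks as n increases.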

Nonsampling Error

• Nonsampling errors are due to mistakes made in the acquisition of data or due to the sample observations being selected improperly

• Nonsampling errors are more serious than sampling errors, because taking a larger sample won’t diminish the size, or possibility of occurrence, of this error

Types of Nonsampling Error

• Errors in data acquisition: incorrect measurements/responses, inaccurate recording

• Nonresponse error: refers to bias introduced when responses are not obtained from some members of the sample, so that the sample is no longer representative of the target population; common in self-administered surveys

• Selection bias: Some members of the target population cannot possibly be included in the sample (e.g. members have no phone)

Learning Objectives Part 2

4. Frequency Tables

5. Histograms

6. Class Intervals and Width

Frequency Tables – Data Types

• Nominal data / ordinal data → Categories: count the number of times each category of the variable occurs → Frequency distribution → Bar chart

• Interval data → Class intervals: count the number of observations that fall into each of a series of intervals → Frequency distribution → Histogram


• There are times when a data set contains a large number of values (even when the data type is nominal) that would result in a table with too many rows to be convenient. We can overcome this problem by grouping the data into fewer categories or classes and then compiling a grouped frequency distribution.


• Ungrouped data → Categories: count the number of times each category of the variable occurs → Frequency distribution → Bar chart

• Grouped data → Class intervals: count the number of observations that fall into each of a series of intervals → Frequency distribution → Histogram


• Example 1: Coffee refills

Data type nominal; data ungrouped → Categories

• Example 2: Class marks out of 100

Data type nominal; BUT: data may be grouped → Class intervals (approximately interval)

• Example 3: Waiting times at supermarket cashiers

Data type interval → Class intervals
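For interval data such as waiting times, a grouped frequency distribution counts how many observations fall into each class interval. A minimal sketch, assuming invented waiting times in minutes and classes that include the lower limit but not the upper limit:

```python
def grouped_frequency(observations, lower, width, n_classes):
    """Count observations per class interval [lower + i*width, lower + (i+1)*width)."""
    counts = [0] * n_classes
    for x in observations:
        index = int((x - lower) // width)
        if not 0 <= index < n_classes:
            raise ValueError(f"observation {x} lies outside the class intervals")
        counts[index] += 1
    return [
        ((lower + i * width, lower + (i + 1) * width), count)
        for i, count in enumerate(counts)
    ]


# Hypothetical waiting times in minutes (not from the slides)
waiting_times = [0.5, 1.2, 2.8, 3.1, 3.9, 4.4, 6.0, 7.5]
table = grouped_frequency(waiting_times, lower=0, width=2, n_classes=4)
for (low, high), count in table:
    print(f"{low} to under {high}: {count}")
```

These counts are the bar heights of the corresponding histogram.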

Number of Categories

Nominal / not grouped:

1. Determine maximum and minimum observation

2. Define categories including all distinct (integer) observations in between

Example tossing two dice:

Min: 2

Max: 12

Other possible outcomes: 3 4 5 6 7 8 9 10 11

(all outcomes accounted for)
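The two steps for ungrouped data can be sketched in code: find the minimum and maximum, then create one category for every integer in between and count. As illustrative data, the example uses each of the 36 combinations of two dice exactly once (an assumption, real tosses would give different counts).

```python
from collections import Counter
from itertools import product


def frequency_table(observations):
    """One category per integer from min to max, including categories with count 0."""
    counts = Counter(observations)
    low, high = min(observations), max(observations)
    return {value: counts.get(value, 0) for value in range(low, high + 1)}


# Illustrative data: the sum of two dice for each of the 36 combinations
sums = [a + b for a, b in product(range(1, 7), repeat=2)]
table = frequency_table(sums)
for value, count in table.items():
    print(value, count)
```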

Number of Class Intervals

Interval or grouped data:

The more observations there are, the larger the number of class intervals required. Sturges’ formula:

Number of class intervals = 1 + 3.3 log10(n) OR

Number of class intervals = 1 + 1.4 ln(n)

Example n = 50: Number of class intervals = 1 + 3.3 log10 (50) = 1 + 3.3 * 1.70 = 6.61 ≈ 7

Number of class intervals = 1 + 1.4 ln(50) = 1 + 1.4 * 3.91 = 6.48 ≈ 6
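Both versions of the formula can be checked numerically for n = 50. The function names below are hypothetical. Note that the two versions round to different class counts here because 1.4 is itself a rounded coefficient (3.3 / ln(10) is about 1.43).

```python
import math


def sturges_log10(n):
    """Number of class intervals via 1 + 3.3 * log10(n)."""
    return 1 + 3.3 * math.log10(n)


def sturges_ln(n):
    """Number of class intervals via 1 + 1.4 * ln(n)."""
    return 1 + 1.4 * math.log(n)


n = 50
print(round(sturges_log10(n), 2), "rounds to", round(sturges_log10(n)))
print(round(sturges_ln(n), 2), "rounds to", round(sturges_ln(n)))
```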

Excursion Logarithms

• The logarithm of a number to a given base is the exponent to which the base must be raised in order to produce that number. (Example: 10^1.70 ≈ 50)

• The natural logarithm is the logarithm to the base e, where e is an irrational constant approximately equal to 2.718. The natural logarithm of a number x (written as ln(x)) is the power to which e would have to be raised to equal x. (Example: e^3.91 ≈ 50)

• The mathematical constant e (Euler’s number) is the unique real number such that the value of the derivative d/dx (slope of the tangent line) of the function f(x) = e^x at the point x = 0 is equal to 1. The function f(x) = e^x is called the exponential function.
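The two worked examples, and the slope property of e^x at x = 0, can be verified numerically. The step size h used for the slope estimate is an arbitrary choice for this sketch.

```python
import math

n = 50
log10_n = math.log10(n)  # exponent to which 10 must be raised to give 50
ln_n = math.log(n)       # exponent to which e must be raised to give 50

print(round(log10_n, 2), round(10 ** log10_n, 6))
print(round(ln_n, 2), round(math.exp(ln_n), 6))

# Slope of f(x) = e^x at x = 0, estimated with a central difference
h = 1e-6
slope_at_zero = (math.exp(h) - math.exp(-h)) / (2 * h)
print(round(slope_at_zero, 6))
```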

Class Interval Width

Class width:

1. Subtract the smallest observation from the largest observation

2. Divide by number of classes (Sturges)

3. Round class width to convenient value

4. Select a lower limit so that the first class interval contains the smallest observation. Determine all other intervals consecutively by adding multiples of the class width
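The four steps can be sketched as a small function. Several choices here are assumptions rather than part of the slides: the "convenient value" is taken to be the raw width rounded up to a multiple of `round_to`, the lower limit is the smallest observation rounded down, each class includes its lower limit but not its upper limit, and the observations 2.3 and 43.1 are invented.

```python
import math


def class_limits(smallest, largest, n_classes, round_to=1):
    """Return (class width, list of (lower, upper) class limits)."""
    raw_width = (largest - smallest) / n_classes          # steps 1 and 2
    width = math.ceil(raw_width / round_to) * round_to    # step 3 (assumed: round up)
    lower = math.floor(smallest / round_to) * round_to    # step 4: first lower limit
    limits = []
    while lower <= largest:                               # add classes until the largest fits
        limits.append((lower, lower + width))
        lower += width
    return width, limits


# Hypothetical data range: smallest 2.3, largest 43.1, 7 classes (e.g. from Sturges)
width, limits = class_limits(2.3, 43.1, n_classes=7)
print("class width:", width)
for low, high in limits:
    print(f"{low} to under {high}")
```

Because of the rounding in steps 3 and 4, the number of classes produced can differ by one from the number suggested by Sturges' formula for other data ranges.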

Definitions

• Class Mark or Class Midpoint: The sum of the lower class limit and the upper class limit, divided by two (used to plot the frequency polygon)

• Width of a class interval or class length: The difference between the upper class limit and the lower class limit. Usually, all classes are of equal width / length (Sturges)

• Class Boundaries: The class limits are stated in such a way that there is no overlap between classes

Definitions

Class Boundaries: The class limits are stated in such a way that there is no overlap between classes. Limits are stated in this manner so that there cannot be any doubt as to which class a certain value (observation) is to be allocated. Since data is often rounded, the true class limits are not the same as the stated class limits.

Example:

Weights recorded to the nearest kilogram

Stated class interval: 60 – 62

True class interval or class boundaries: 59.5 – 62.5
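The weight example generalises as follows: when data are rounded to the nearest unit, the true class boundaries lie half a unit below the stated lower limit and half a unit above the stated upper limit. The function names are hypothetical. The class mark is included for comparison.

```python
def true_boundaries(stated_lower, stated_upper, unit=1.0):
    """Class boundaries for data rounded to the nearest `unit`."""
    half = unit / 2
    return stated_lower - half, stated_upper + half


def class_mark(lower, upper):
    """Class midpoint: (lower limit + upper limit) / 2."""
    return (lower + upper) / 2


# Weights recorded to the nearest kilogram, stated class interval 60 to 62
print(true_boundaries(60, 62))  # → (59.5, 62.5)
print(class_mark(60, 62))       # → 61.0
```

The midpoint is the same whether it is computed from the stated limits or from the true boundaries.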
