ch1. introduction - kocwcontents.kocw.net/kocw/document/2015/gachon/kimnamhyoung...ch1. introduction...

Ch1. Introduction

Namhyoung Kim

Dept. of Applied Statistics

Gachon University


1.1 Categorical Response Data

• A categorical variable has a measurement scale consisting of a set of categories.

• For example, political philosophy may be measured as “liberal”, “moderate”, or “conservative”.

• Categorical scales are commonly used in the social and health sciences for measuring attitudes, opinions, and responses.

• They also arise in the behavioral sciences, public health, zoology, education, marketing, engineering sciences, and industrial quality control.

Response/Explanatory Variable

• Response variable (dependent variable, or Y variable)

• Explanatory variable (independent variable, or X variable)

• The subject of this course is the analysis of categorical response variables.

• The explanatory variables can be categorical or continuous.

Nominal/Ordinal Scale

• Categorical variables have two main types of measurement scales:

– Ordinal variables: ordered scales, such as attitude toward something, appraisal of a company’s inventory level, response to a medical treatment, and frequency of feeling symptoms of anxiety

– Nominal variables: unordered scales, such as religious affiliation, primary mode of transportation to work, favorite type of music, and favorite place to shop

Nominal/Ordinal Scale

• Methods designed for ordinal variables cannot be used with nominal variables, because they rely on the ordering of the categories.

• Methods designed for nominal variables can be used with either nominal or ordinal variables, but they ignore the ordering of the categories, which can cause a serious loss of power.

Problems

• 1.1 In the following examples, identify the response variable and the explanatory variables.

– a. Attitude toward gun control (favor, oppose), Gender (female, male), Mother’s education (high school, college)

– b. Heart disease (yes, no), Blood pressure, Cholesterol level

– c. Race (white, nonwhite), Religion (Catholic, Jewish, Protestant), Vote for president (Democrat, Republican, Other), Annual income

– d. Marital status (married, single, divorced, widowed), Quality of life (excellent, good, fair, poor)

Problems

• 1.2 Which scale of measurement, nominal or ordinal, is most appropriate for the following variables?

– a. Political party affiliation (Democrat, Republican, unaffiliated).

– b. Highest degree obtained (none, high school, bachelor’s, master’s, doctorate).

– c. Patient condition (good, fair, serious, critical).

– d. Hospital location (London, Boston, Madison, Rochester, Toronto).

– e. Favorite beverage (beer, juice, milk, soft drink, wine, other).

– f. How often feel depressed (never, occasionally, often, always).

1.2 Probability Distributions for Categorical Data

• Key distributions for categorical data:

– the binomial distribution

– the multinomial distribution

Binomial Distribution

• n independent and identical trials with two possible outcomes, “success” and “failure”

• Identical trials: the probability of success is the same for each trial

• Independent trials: the response outcomes are independent random variables

• Trials that satisfy both conditions are called Bernoulli trials.

Binomial Distribution

• Let $Y$ denote the number of successes out of the $n$ trials, with $\pi$ the probability of success for a given trial.

• The probability of outcome $y$ for $Y$ equals

$$P(y) = \frac{n!}{y!\,(n-y)!}\,\pi^{y}(1-\pi)^{n-y}, \qquad y = 0, 1, 2, \ldots, n$$

• For fixed $n$, it becomes more skewed as $\pi$ moves toward 0 or 1.

• For fixed $\pi$, it becomes more bell-shaped as $n$ increases.

• When $n$ is large, it can be approximated by a normal distribution with $\mu = n\pi$ and $\sigma = \sqrt{n\pi(1-\pi)}$.

Binomial Distribution

• Table 1.1. Binomial distribution with $n = 10$ and $\pi = 0.20$, 0.50, and 0.80. The distribution is symmetric when $\pi = 0.50$.

  y    P(y), π=0.20    P(y), π=0.50    P(y), π=0.80
  0       0.107           0.001           0.000
  1       0.268           0.010           0.000
  2       0.302           0.044           0.000
  3       0.201           0.117           0.001
  4       0.088           0.205           0.006
  5       0.026           0.246           0.026
  6       0.006           0.205           0.088
  7       0.001           0.117           0.201
  8       0.000           0.044           0.302
  9       0.000           0.010           0.268
 10       0.000           0.001           0.107
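The binomial formula can be evaluated directly. The following Python sketch tabulates P(y) for the same n and π values as Table 1.1 and computes the mean and standard deviation used by the normal approximation. The function names `binomial_pmf` and `normal_approx_params` are illustrative choices, not part of the course material.

```python
from math import comb, sqrt


def binomial_pmf(y, n, pi):
    """Return P(Y = y) for a binomial count with n trials and success probability pi."""
    return comb(n, y) * pi**y * (1 - pi) ** (n - y)


def normal_approx_params(n, pi):
    """Return (mu, sigma) of the normal distribution that approximates binomial(n, pi)."""
    return n * pi, sqrt(n * pi * (1 - pi))


n = 10
pis = (0.20, 0.50, 0.80)

# One row per outcome y, one column per value of pi.
for y in range(n + 1):
    row = [binomial_pmf(y, n, pi) for pi in pis]
    print(f"{y:2d}  " + "  ".join(f"{prob:.3f}" for prob in row))

# Mean and standard deviation for the symmetric case pi = 0.50.
mu, sigma = normal_approx_params(n, 0.50)
print(f"mu = {mu:.1f}, sigma = {sigma:.3f}")
```

Each column sums to 1, and the π = 0.80 column is the π = 0.20 column read in reverse order, since a success with probability 0.80 is a failure with probability 0.20.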

Multinomial Distribution

• Multinomial trials have more than two possible outcomes.

• Let $c$ denote the number of outcome categories.

• For $n$ independent observations, let $\pi_j$ denote the probability of category $j$, where $\sum_j \pi_j = 1$. The multinomial probability that $n_1$ fall in category 1, $n_2$ fall in category 2, …, $n_c$ fall in category $c$ equals

$$P(n_1, n_2, \ldots, n_c) = \left(\frac{n!}{n_1!\,n_2!\cdots n_c!}\right)\pi_1^{n_1}\pi_2^{n_2}\cdots\pi_c^{n_c}$$
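As an illustration, the multinomial probability can be computed with the Python standard library. This is a minimal sketch: the function name `multinomial_pmf` and the three-category example (counts 2, 1, 1 with probabilities 0.5, 0.3, 0.2) are hypothetical and chosen only to show the arithmetic.

```python
from math import factorial, prod


def multinomial_pmf(counts, probs):
    """Return P(n_1, ..., n_c) for category counts and category probabilities."""
    if len(counts) != len(probs):
        raise ValueError("counts and probs must have the same length")
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("category probabilities must sum to 1")

    n = sum(counts)
    # Multinomial coefficient n! / (n_1! n_2! ... n_c!), kept as an exact integer.
    coefficient = factorial(n)
    for n_j in counts:
        coefficient //= factorial(n_j)

    return coefficient * prod(p**n_j for p, n_j in zip(probs, counts))


# Hypothetical example: n = 4 observations in c = 3 categories.
example = multinomial_pmf((2, 1, 1), (0.5, 0.3, 0.2))
print(f"P(2, 1, 1) = {example:.4f}")
```

With c = 2 categories the function reduces to the binomial probability, for example `multinomial_pmf((3, 7), (0.2, 0.8))` equals the binomial P(3) for n = 10 and π = 0.20.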

1.3 Statistical Inference for a Proportion

• In practice, the parameter values for the binomial and multinomial distributions are unknown.

• Using sample data, we estimate the parameters.

• In statistics, maximum likelihood estimation (MLE) is a method of estimating the parameters of a statistical model.


Likelihood Function

• The probability of the observed data, expressed as a function of the parameter, is called the likelihood function.

• For example, in $n = 10$ trials, suppose a binomial count equals $y = 0$.

• From the binomial formula with parameter $\pi$, the probability of this outcome equals

$$P(0) = \frac{10!}{0!\,10!}\,\pi^{0}(1-\pi)^{10} = (1-\pi)^{10}$$
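To see the likelihood as a function of the parameter, the sketch below evaluates it on a coarse grid of π values for the outcome y = 0 in n = 10 trials. The function name `binomial_likelihood` and the grid spacing of 0.1 are assumptions made for illustration.

```python
from math import comb


def binomial_likelihood(pi, y, n):
    """Probability of observing y successes in n trials, viewed as a function of pi."""
    return comb(n, y) * pi**y * (1 - pi) ** (n - y)


# Evaluate the likelihood for y = 0, n = 10 at pi = 0.0, 0.1, ..., 1.0.
grid = [i / 10 for i in range(11)]
values = [binomial_likelihood(pi, y=0, n=10) for pi in grid]

for pi, value in zip(grid, values):
    print(f"pi = {pi:.1f}  likelihood = {value:.4f}")
```

For this outcome the likelihood equals (1 − π)^10, so the printed values decrease steadily from 1 at π = 0 to 0 at π = 1.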

Maximum Likelihood Estimation (MLE)

• The maximum likelihood (ML) estimate of a parameter is the parameter value for which the probability of the observed data takes its greatest value.

Maximum Likelihood Estimation (MLE)

• In general, for the binomial outcome of $y$ successes in $n$ trials, the ML estimate of $\pi$ equals $p = y/n$ (the sample proportion of successes for the $n$ trials).

• The ML estimate is often denoted by the parameter symbol with a ^ (a “hat”) over it, as in $\hat{\pi}$.
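The closed-form estimate y/n can be checked numerically. The sketch below finds the maximizer of the binomial likelihood by brute-force search over a grid of π values and compares it with y/n. The helper names and the example of y = 6 successes in n = 10 trials are hypothetical, and grid search is used only for illustration, not as a recommended estimation method.

```python
from math import comb


def binomial_likelihood(pi, y, n):
    """Probability of observing y successes in n trials, viewed as a function of pi."""
    return comb(n, y) * pi**y * (1 - pi) ** (n - y)


def ml_estimate_grid(y, n, steps=1000):
    """Return the grid point in [0, 1] at which the binomial likelihood is largest."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda pi: binomial_likelihood(pi, y, n))


# Hypothetical data: y = 6 successes in n = 10 trials.
y, n = 6, 10
pi_hat_grid = ml_estimate_grid(y, n)
pi_hat_closed_form = y / n

print(f"grid search: {pi_hat_grid:.3f}, closed form y/n: {pi_hat_closed_form:.3f}")
```

For y = 0 in n = 10 trials the same search returns 0, the value at which (1 − π)^10 is greatest.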

Significance Test About a Binomial Proportion

• The ML estimator for the parameter $\pi$ is the sample proportion, $p$.

• The sampling distribution of the sample proportion $p$ has mean and standard error

$$E(p) = \pi, \qquad \sigma(p) = \sqrt{\frac{\pi(1-\pi)}{n}}$$

• The sampling distribution of $p$ is approximately normal for large $n$.

Significance Test About a Binomial Proportion

• Null hypothesis $H_0: \pi = \pi_0$

• The test statistic is

$$z = \frac{p - \pi_0}{\sqrt{\pi_0(1-\pi_0)/n}}$$

• For large samples, the null sampling distribution of the $z$ test statistic is the standard normal.

Example: Survey Results on Legalizing Abortion

• Let $\pi$ denote the proportion of the American adult population that responds “yes” to the question: “Please tell me whether or not you think it should be possible for a pregnant woman to obtain a legal abortion if she is married and does not want any more children.”

Example: Survey Results on Legalizing Abortion

• Of 893 respondents to this question, 400 replied “yes” and 493 replied “no”.

• $p = 400/893 = 0.448$

• $H_0: \pi = 0.50$, $H_a: \pi \neq 0.50$

• The test statistic is

$$z = \frac{0.448 - 0.50}{\sqrt{(0.50)(0.50)/893}} = -3.1$$

• The two-sided P-value is 0.002.
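The calculation for this example can be reproduced with the Python standard library. In the sketch below, the function name `proportion_z_test` is an illustrative choice; the data (400 “yes” responses out of 893) and the null value 0.50 come from the example, and the P-value uses the large-sample standard normal approximation.

```python
from math import sqrt
from statistics import NormalDist


def proportion_z_test(successes, n, pi0):
    """Large-sample z test of H0: pi = pi0 against the two-sided alternative."""
    p = successes / n
    # Standard error evaluated at the null value pi0, not at the sample proportion.
    se0 = sqrt(pi0 * (1 - pi0) / n)
    z = (p - pi0) / se0
    # Two-sided P-value: probability in both tails of the standard normal.
    p_value = 2 * NormalDist().cdf(-abs(z))
    return p, z, p_value


p, z, p_value = proportion_z_test(successes=400, n=893, pi0=0.50)
print(f"p = {p:.3f}, z = {z:.2f}, two-sided P-value = {p_value:.4f}")
```

Rounded as in the example, this gives p = 0.448, z = −3.1, and a two-sided P-value of 0.002.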

Confidence Intervals for a Binomial Proportion

• A $100(1-\alpha)\%$ confidence interval for $\pi$ is

$$p \pm z_{\alpha/2}\,SE, \qquad \text{where } SE = \sqrt{p(1-p)/n}$$

• Here $z_{\alpha/2}$ denotes the standard normal percentile having right-tail probability equal to $\alpha/2$.

• This interval works poorly when $\pi$ is not close to 0.50, unless $n$ is very large.
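As a sketch of how this interval is computed, the code below applies the formula to the survey counts from the earlier example (400 “yes” responses out of 893). The function name `wald_interval` and the 95% confidence level are assumptions made for illustration; the slides themselves do not report an interval for these data.

```python
from math import sqrt
from statistics import NormalDist


def wald_interval(successes, n, confidence=0.95):
    """Return (lower, upper) for the interval p +/- z_{alpha/2} * SE."""
    p = successes / n
    alpha = 1 - confidence
    # Standard normal percentile with right-tail probability alpha / 2.
    z = NormalDist().inv_cdf(1 - alpha / 2)
    # Standard error estimated at the sample proportion p.
    se = sqrt(p * (1 - p) / n)
    return p - z * se, p + z * se


lower, upper = wald_interval(successes=400, n=893)
print(f"95% interval: ({lower:.3f}, {upper:.3f})")
```

Because the standard error is evaluated at p, the interval collapses to a single point when p is 0 or 1, which is one symptom of its poor behavior for proportions far from 0.50.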

Confidence Intervals for a Binomial Proportion

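• A better way to construct confidence intervals uses a duality with significance tests.

• For given $p$ and $n$, the $\pi_0$ values that have test statistic value $z = \pm z_{\alpha/2}$ are the solutions to the equation

$$\frac{|p - \pi_0|}{\sqrt{\pi_0(1-\pi_0)/n}} = z_{\alpha/2}$$

for $\pi_0$.

• The equation is quadratic in $\pi_0$, and its two solutions are the endpoints of the confidence interval.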
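Squaring both sides of the equation gives a quadratic in $\pi_0$, namely $(1 + z_{\alpha/2}^2/n)\,\pi_0^2 - (2p + z_{\alpha/2}^2/n)\,\pi_0 + p^2 = 0$, which can be solved with the quadratic formula. The sketch below does this for the survey counts from the earlier example. The function name `score_interval` and the 95% confidence level are illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist


def score_interval(successes, n, confidence=0.95):
    """Return the two pi0 values that solve |p - pi0| / sqrt(pi0 (1 - pi0) / n) = z."""
    p = successes / n
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)

    # Coefficients of a * pi0^2 + b * pi0 + c = 0, obtained by squaring the equation.
    a = 1 + z**2 / n
    b = -(2 * p + z**2 / n)
    c = p**2

    root = sqrt(b**2 - 4 * a * c)
    return (-b - root) / (2 * a), (-b + root) / (2 * a)


lower, upper = score_interval(successes=400, n=893)
print(f"95% interval: ({lower:.3f}, {upper:.3f})")
```

Unlike the interval based on p ± z·SE, these endpoints always lie between 0 and 1, even when the sample proportion is 0 or 1.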
