statistics for linguistics students michaelmas 2004 week 1 bettina braun


Upload: derick-reed

Post on 26-Dec-2015


Page 1: Statistics for Linguistics Students Michaelmas 2004 Week 1 Bettina Braun

Statistics for Linguistics Students

Michaelmas 2004

Week 1

Bettina Braun


Why calculate statistics?

• Describe and summarise the data

• E.g. examination results (out of 100)

22 98 40 45 16 31 77 78

55 45 61 91 87 45 54 66

75 87 88 49 64 76 58 61 …

• Questions: What is the average mark? How spread out are the scores? What are the lowest and highest marks? How do the results compare with others (e.g. last year's)?
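The questions above can be answered with a few lines of code. This is a minimal sketch that uses only the 24 marks visible on the slide (the slide's list continues with "…", so the figures describe this excerpt, not the full data set); the variable names are illustrative.

```python
import statistics

# The 24 marks visible on the slide (the original list continues with "…").
marks = [22, 98, 40, 45, 16, 31, 77, 78,
         55, 45, 61, 91, 87, 45, 54, 66,
         75, 87, 88, 49, 64, 76, 58, 61]

average = statistics.mean(marks)           # average mark
lowest, highest = min(marks), max(marks)   # lowest and highest marks
spread = highest - lowest                  # crude measure of spread (range)

print(f"n = {len(marks)}, average = {average:.2f}")
print(f"lowest = {lowest}, highest = {highest}, range = {spread}")
```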


Population vs. Sample

• Population: total universe of all possible observations. Populations can be finite or infinite, real or theoretical
– The IQ of all adult men in Britain
– The outcome of an infinite number of flips of a coin

• Descriptive statistics of a population are called parameters


Population vs. Sample (cont’d)

• Sample: Subset of observations drawn from a given population
– The IQ scores of 100 adult men in Britain
– The outcome of 50 flips of a coin

• Descriptive statistics from a sample are called statistics

• Note: In experimental research it is important to draw a representative, random sample that is not biased
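The difference between a parameter and a statistic can be illustrated with the coin example. The sketch below assumes a fair coin (so the population parameter is 0.5 by assumption) and simulates one random sample of 50 flips; the seed value is arbitrary and only makes the run repeatable.

```python
import random

rng = random.Random(2004)  # arbitrary fixed seed so the sketch is repeatable

# Theoretical population: infinitely many flips of a fair coin.
# Its parameter (true proportion of heads) is 0.5 by assumption.
population_proportion = 0.5

# One random sample of 50 flips drawn from that population (True = heads).
sample = [rng.random() < population_proportion for _ in range(50)]

# The statistic: proportion of heads observed in this particular sample.
sample_proportion = sum(sample) / len(sample)

print(f"parameter = {population_proportion}, statistic = {sample_proportion}")
```

A different seed gives a different sample and usually a slightly different statistic, while the parameter stays fixed.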


Histograms: Frequency distribution of each event

[Figure: histogram of VAR00001; x-axis from 0 to 80, y-axis: Frequency (0 to 25). Mean = 54.6207, Std. Dev. = 16.9673, N = 87]

Data: Tutorial1.sav
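A frequency distribution of this kind can be built by counting how many scores fall into each bin. Tutorial1.sav is not reproduced here, so this sketch uses the 24 exam marks listed earlier in the deck as stand-in data; the bin width of 20 is an assumption chosen to match the axis ticks.

```python
from collections import Counter

# Stand-in data: the 24 exam marks shown earlier (not Tutorial1.sav).
marks = [22, 98, 40, 45, 16, 31, 77, 78,
         55, 45, 61, 91, 87, 45, 54, 66,
         75, 87, 88, 49, 64, 76, 58, 61]

bin_width = 20  # assumed bin width

# Map each mark to the lower edge of its bin, e.g. 45 -> 40, then count.
frequencies = Counter((mark // bin_width) * bin_width for mark in marks)

# Print a simple text histogram, one row per bin.
for lower in sorted(frequencies):
    count = frequencies[lower]
    print(f"{lower:3d}-{lower + bin_width - 1:3d} | {'#' * count} ({count})")
```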


Central tendency: mode and median

• Mode: Most frequent mark (Note: there may be multiple modes)

• Median: score from the middle of the list when ordered from lowest to highest. Cuts data into halves (doesn’t take account of values of all scores but only of the scores in middle position).
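As a small illustration of both definitions, the sketch below uses Python's standard `statistics` module on a made-up list of six marks (hypothetical data, chosen so that two modes exist).

```python
import statistics

# Hypothetical marks, chosen so that two values tie for most frequent.
marks = [55, 60, 55, 72, 60, 48]

# multimode returns every most-frequent value, in order of first appearance.
modes = statistics.multimode(marks)

# With an even number of scores, the median is the average of the two
# middle scores of the ordered list: 48 55 55 60 60 72 -> (55 + 60) / 2.
median = statistics.median(marks)

print("modes:", modes)
print("median:", median)
```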

Statistics: mark

N (Valid)     87
N (Missing)   0
Mean          54.62
Median        56.00
Mode          55 (a)

(a) Multiple modes exist. The smallest value is shown.


Central tendency: mean

• Mean: sum of scores divided by the number of scores

Note on notation: Greek letters are often used for population parameters, Roman letters for statistics (properties of a sample)
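The slide's own formula did not survive in this transcript. In the usual textbook notation (an assumption about what the slide showed), with N scores x_1, …, x_N, the definition reads:

```latex
\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i
\qquad \text{(sample mean, Roman letter)}

\mu = \frac{1}{N}\sum_{i=1}^{N} x_i
\qquad \text{(population mean, Greek letter, for a finite population of } N \text{ scores)}
```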


Comparing measures of “central tendency”

• Mode:
– Quick if we have a frequency distribution
– Possible with categorical data

• Median:
– Good estimate if we have abnormally large or small values (e.g. max aircraft speeds of 450km/h, 480km/h, 500km/h, 530km/h, 600km/h, and 1100km/h)
– Only influenced by values in the middle of ordered data

• Mean:
– Every score is taken into account
– Some interesting properties
– Most widely used
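The aircraft example can be checked directly. This sketch computes both measures for the six speeds given above and shows how the single very fast aircraft pulls the mean upwards while leaving the median untouched.

```python
import statistics

# Maximum aircraft speeds from the slide, in km/h.
speeds = [450, 480, 500, 530, 600, 1100]

mean_speed = statistics.mean(speeds)      # uses every score, including 1100
median_speed = statistics.median(speeds)  # average of the two middle scores

print(f"mean = {mean_speed} km/h, median = {median_speed} km/h")

# Without the extreme value the two measures are much closer together.
typical = speeds[:-1]
print(statistics.mean(typical), statistics.median(typical))
```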


Types of variables

• Interval (scale): differences between consecutive numbers are equal (e.g. time, speed, distances). Precise measurements

• Ordinal: assignments of ranks that represent position along some ordered dimension (e.g. ranking people wrt their speed, 1 = fastest, 4 = slowest). No equal intervals

• Categorical (nominal): numbers serve only as labels for categories (e.g. brown = 1, blue = 2, green = 3)

• Question: on which type of data can we calculate a meaningful “central tendency”?
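One common way to approach this question is sketched below, using hypothetical data for each variable type: the mode only needs labels, the median needs an ordering, and the mean additionally needs equal intervals. The data values and variable names are invented for illustration.

```python
import statistics

# Hypothetical data for each type of variable.
eye_colours = ["brown", "blue", "brown", "green", "brown"]  # categorical
acceptability_ratings = [2, 4, 4, 5, 3]                      # ordinal (1-5 scale)
utterance_times = [3.1, 3.4, 3.6, 4.3]                       # interval (seconds)

# Mode: counts labels, so it is meaningful even for categorical data.
most_common_colour = statistics.mode(eye_colours)

# Median: needs only an ordering, so it is meaningful for ordinal data.
middle_rating = statistics.median(acceptability_ratings)

# Mean: adds and divides values, which presupposes equal intervals.
mean_time = statistics.mean(utterance_times)

print(most_common_colour, middle_rating, round(mean_time, 2))
```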


Spread of distributions: why?


Spread of distributions: range and quartiles

• Small spread is often desirable as it indicates a high proportion of identical or very similar scores

• Large spread indicates large differences between individual scores

• Range: difference between highest and lowest score – rather crude measure

• Quartiles: cut the ordered data into quarters (second quartile = median)
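Range and quartiles can be sketched with the standard `statistics` module, again using the 24 exam marks listed earlier in the deck as example data. Note that several conventions exist for interpolating quartiles; `statistics.quantiles` uses its default ("exclusive") method here, so other software such as SPSS may report slightly different first and third quartiles.

```python
import statistics

# Example data: the 24 exam marks shown earlier in the deck.
marks = [22, 98, 40, 45, 16, 31, 77, 78,
         55, 45, 61, 91, 87, 45, 54, 66,
         75, 87, 88, 49, 64, 76, 58, 61]

# Range: difference between highest and lowest score.
score_range = max(marks) - min(marks)

# Quartiles: three cut points that split the ordered data into quarters.
q1, q2, q3 = statistics.quantiles(marks, n=4)

print(f"range = {score_range}")
print(f"Q1 = {q1}, Q2 (median) = {q2}, Q3 = {q3}")
```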


Median, quartiles, and outliers

[Figure: boxplot of var0001; y-axis from 0 to 100. Labelled parts, from top to bottom:]

* Extreme value (more than 3 box lengths above or below the box)

o Outlier (more than 1.5 box lengths above or below the box)

Largest value which is not an outlier

Upper quartile, median, lower quartile (the box; its length is the interquartile range)

Smallest value which is not an outlier

tutorial1.sav: simple bp, sep. var
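The boxplot rules for outliers and extreme values can be sketched as a small function. The data are hypothetical, the function name `classify_outliers` is invented for illustration, and the quartiles come from `statistics.quantiles`, which may differ slightly from the hinges SPSS uses to draw the box.

```python
import statistics

def classify_outliers(values):
    """Return (outliers, extremes) according to the boxplot rules."""
    lower_q, _median, upper_q = statistics.quantiles(values, n=4)
    box_length = upper_q - lower_q  # interquartile range

    def distance_from_box(value):
        # How far a value lies above or below the box (0 if inside it).
        if value > upper_q:
            return value - upper_q
        if value < lower_q:
            return lower_q - value
        return 0.0

    outliers, extremes = [], []
    for value in values:
        distance = distance_from_box(value)
        if distance > 3 * box_length:
            extremes.append(value)      # more than 3 box lengths away
        elif distance > 1.5 * box_length:
            outliers.append(value)      # more than 1.5 box lengths away
    return outliers, extremes

# Hypothetical scores with one mild and one extreme high value.
scores = [50, 52, 55, 57, 60, 61, 63, 65, 66, 68, 70, 90, 120]
outliers, extremes = classify_outliers(scores)
print("outliers:", outliers, "extreme values:", extremes)
```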


Spread of the population: variance measures

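• Variance: mean of the squared deviations from the mean

Variance = \sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} (when estimated from a sample: s^2 = \frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N - 1})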

• Standard deviation: square root of variance
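A minimal sketch of both measures, written out step by step so the "squared deviations from the mean" are visible. The scores are hypothetical, and the `sample` switch (dividing by N - 1 when estimating from a sample, by N for a whole population) follows the usual textbook convention, which is an assumption about what the slide's formula showed.

```python
import math

def variance(scores, sample=False):
    """Average squared deviation from the mean (N - 1 divisor for samples)."""
    mean = sum(scores) / len(scores)
    squared_deviations = [(x - mean) ** 2 for x in scores]
    divisor = len(scores) - 1 if sample else len(scores)
    return sum(squared_deviations) / divisor

# Hypothetical scores with mean 5.
scores = [2, 4, 4, 4, 5, 5, 7, 9]

population_variance = variance(scores)           # divides by N
population_sd = math.sqrt(population_variance)   # square root of variance
sample_variance = variance(scores, sample=True)  # divides by N - 1

print(population_variance, population_sd, round(sample_variance, 3))
```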


Normal distribution (Gaussian distribution)

• Example: IQ scores, mean=100, sd=16

Mean = Median = Mode
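The IQ example can be explored with `statistics.NormalDist` from the Python standard library (Python 3.8 or later). The sketch confirms that mean, median and mode coincide and computes the proportion of scores within one standard deviation of the mean, i.e. between 84 and 116.

```python
from statistics import NormalDist

# IQ scores modelled as a normal distribution with mean 100 and sd 16.
iq = NormalDist(mu=100, sigma=16)

# For a normal distribution the three measures of central tendency coincide.
print(iq.mean, iq.median, iq.mode)

# Proportion of the population scoring between 84 and 116 (mean +/- 1 sd).
within_one_sd = iq.cdf(116) - iq.cdf(84)
print(f"{within_one_sd:.3f}")
```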


Skewed distributions and measures of central tendency


Bimodal distributions


Normal distribution (Gaussian distribution)

• Example: IQ scores, mean=100, sd=16

Mean = Median = Mode


z-scores

• Z-score: deviation of given score from the mean in terms of standard deviations
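Written as a formula, this definition is z = (score - mean) / standard deviation. A minimal sketch using the IQ example from the deck (mean 100, sd 16); the function name `z_score` is illustrative.

```python
def z_score(score, mean, sd):
    """Deviation of a score from the mean, in units of standard deviations."""
    return (score - mean) / sd

# IQ example: mean = 100, sd = 16.
print(z_score(132, 100, 16))  # two standard deviations above the mean
print(z_score(92, 100, 16))   # half a standard deviation below the mean
```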


How likely is a given event?

• Example: time to utter a particular sentence: mean = 3.45s and sd = 0.84s

• Questions:
– What proportion of the population of utterance times will fall below 3s?
– What proportion would lie between 3s and 4s?
– What is the time value below which we will find 1% of the data?
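These three questions can be approached as follows, under two assumptions that the slide does not state explicitly: that 3.45s is the mean utterance time, and that utterance times are normally distributed. The sketch uses `statistics.NormalDist` from the standard library.

```python
from statistics import NormalDist

# Assumption: utterance times ~ Normal(mean = 3.45 s, sd = 0.84 s).
utterance_time = NormalDist(mu=3.45, sigma=0.84)

# 1. Proportion of utterance times below 3 s.
below_3 = utterance_time.cdf(3.0)

# 2. Proportion of utterance times between 3 s and 4 s.
between_3_and_4 = utterance_time.cdf(4.0) - utterance_time.cdf(3.0)

# 3. Time value below which 1% of the data fall (the 1st percentile).
first_percentile = utterance_time.inv_cdf(0.01)

print(f"P(t < 3 s)       = {below_3:.3f}")
print(f"P(3 s < t < 4 s) = {between_3_and_4:.3f}")
print(f"1% cut-off       = {first_percentile:.2f} s")
```

The same answers can be read from a table of the standard normal distribution after converting 3s and 4s to z-scores.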