Post on 21-Dec-2015

ST3905

Lecturer : Supratik Roy

Email : s.roy@ucc.ie

(Unix) : supratik@stat.ucc.ie

Phone: ext. 3626

What do we want to do?

1. What is statistics?

2. Describing information : summarization, visual and non-visual representation

3. Drawing conclusions from information : managing uncertainty and incompleteness of information

Describing Information

1. Why summarization of information?

2. Visual representation (aka graphical Descriptive Statistics)

3. Non-visual representation (numerical measures)

4. Classical techniques vs modern IT

Stem and Leaf Plot

Decimal point is 2 places to the right of the colon

0 : 8

1 : 000011122233333333333344444

1 : 55555566666677777778888888899999999999

2 : 0000000111111111111222222233333333444444444

2 : 555556666666666777778889999999999999999

3 : 000000001111112222333333333444

3 : 55555555666667777777888888899999999

4 : 0122234

4 : 55555678888889

5 : 111111134

5 : 555667

6 : 44

6 : 7
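A minimal sketch of how such a display can be built, assuming non-negative integer data. The function name `stem_and_leaf`, its `leaf_unit` parameter and the sample values are hypothetical; the data behind the slide is not available. Unlike the slide, the sketch does not split each stem into two lines (leaves 0-4 and 5-9).

```python
from collections import defaultdict

def stem_and_leaf(values, leaf_unit=1):
    """Return the lines of a simple stem-and-leaf display.

    Each value is truncated to a multiple of leaf_unit; the last retained
    digit is the leaf and the digits before it form the stem.
    """
    stems = defaultdict(list)
    for v in sorted(values):
        scaled = v // leaf_unit              # drop digits below the leaf unit
        stems[scaled // 10].append(scaled % 10)
    lines = []
    for stem in range(min(stems), max(stems) + 1):
        leaves = "".join(str(leaf) for leaf in stems.get(stem, []))
        lines.append(f"{stem} : {leaves}")
    return lines

# Hypothetical data, chosen only to illustrate the layout.
sample = [80, 104, 112, 131, 158, 167, 199, 203, 215, 215, 248, 302]
display = stem_and_leaf(sample, leaf_unit=10)
for line in display:
    print(line)
```

With `leaf_unit=10` the stem counts hundreds, so the decimal point sits 2 places to the right of the colon, as in the display above.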

Pie-Chart

[Figure: pie chart with slices labelled diffgeom, complex, algebra, reals, statistics]

DotChart

[Figure: dot chart; axis ticks at 10, 20, 30; groups: Child Care, Health Services, Community Centers, Family & Youth, Other; labels within each group: Old Suburb, Coast County, New Suburb]

Histogram

[Figure: histogram titled "50 samples from a t distribution with 5 d.f."; horizontal axis my.sample, ticks from -4 to 2; vertical axis ticks 0, 5, 10, 15]

Histogram-Categorical

[Figure: bar histogram of state.region with categories Northeast, South, North Central, West; vertical axis ticks 0, 5, 10, 15]

Rules for Histograms

1. Height of Rectangle proportional to frequency of class

2. No. of classes proportional to sqrt(total no. of observations) [not a hard and fast rule]

3. In case of categorical data, keep rectangle widths identical, and base of rectangles separate.

4. Best, if possible, let the software do it.

Data

[1] -0.053626486 -0.828128399 0.214910482 0.346570399

[5] -0.849316517 0.001077376 0.736191791 1.417540397

[9] -2.382332275 -2.699019949 -0.111907192 1.384903284

[13] 2.113286699 -1.828108272 -1.108280724 0.131883612

[17] -0.394494473 0.829806888 0.023178033 0.019839537

[21] -0.346280222 -0.251981108 1.159853307 -0.249501904

[25] -1.342704742 -2.012653224 -1.535503208 0.869806233

[29] -1.313495887 -0.244408426 -0.998886998 -1.446769605

[33] 1.224528053 -0.410163230 0.032230907 -0.137297112

[37] -2.717620031 -0.728570438 0.034697116 2.202863874

[41] -0.170794163 0.353651680 -0.673296374 3.136364814

[45] -1.260108638 -0.367334893 -0.652217259 -0.301847039

[49] 0.315180215 0.190766333

Tabulation

Class    Tally                 freq
-3,-2    ////                     4
-2,-1    //// //                  7
-1,0     //// //// //// ///      18
0,1      //// //// ////          14
1,2      ////                     4
2,3      //                       2
3,4      /                        1
Total                            50
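The tabulation can be reproduced in a few lines. This is a sketch, assuming the table was built from the 50 values on the Data slide with unit-width classes that are closed on the left (no value falls on a class boundary, so that choice does not affect the counts). The helper name `tabulate` is hypothetical.

```python
import math

# The 50 observations listed on the Data slide.
data = [
    -0.053626486, -0.828128399, 0.214910482, 0.346570399,
    -0.849316517, 0.001077376, 0.736191791, 1.417540397,
    -2.382332275, -2.699019949, -0.111907192, 1.384903284,
    2.113286699, -1.828108272, -1.108280724, 0.131883612,
    -0.394494473, 0.829806888, 0.023178033, 0.019839537,
    -0.346280222, -0.251981108, 1.159853307, -0.249501904,
    -1.342704742, -2.012653224, -1.535503208, 0.869806233,
    -1.313495887, -0.244408426, -0.998886998, -1.446769605,
    1.224528053, -0.410163230, 0.032230907, -0.137297112,
    -2.717620031, -0.728570438, 0.034697116, 2.202863874,
    -0.170794163, 0.353651680, -0.673296374, 3.136364814,
    -1.260108638, -0.367334893, -0.652217259, -0.301847039,
    0.315180215, 0.190766333,
]

def tabulate(values, lower, width, n_classes):
    """Count the values in each class [lower + i*width, lower + (i+1)*width)."""
    counts = [0] * n_classes
    for v in values:
        i = math.floor((v - lower) / width)
        if 0 <= i < n_classes:
            counts[i] += 1
    return counts

# Rule of thumb from the histogram rules: about sqrt(n) classes (sqrt(50) is roughly 7).
n_classes = math.floor(math.sqrt(len(data)))
counts = tabulate(data, lower=-3.0, width=1.0, n_classes=n_classes)

for i, c in enumerate(counts):
    lo, hi = -3 + i, -3 + i + 1
    print(f"{lo:>2},{hi:<2} {c:>3}")
print("Total", sum(counts))
```

The counts agree with the freq column of the table.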

Box-Plot - I

[Figure: box plot; axis ticks at 200, 400, 600, 800]

Box Plot – II

[Figure: side-by-side box plots for the groups 18-24, 25-34, 35-44, 45-54, 55-64, 65+; vertical axis ticks from 1 to 7]

Box Plot – III

[Figure: box plots titled "NJ Pick-it Lottery (5/22/75-3/16/76)"; horizontal axis "Leading Digit of Winning Numbers", 0 to 9; vertical axis "Payoff", ticks at 200, 400, 600, 800]

Non-Visual (numerical measures)

1. Pictures vs. quantitative measures

2. Criteria for selection of a measure – purpose of study

3. Qualities that a measure should have

4. We live in an uncertain world – chances of error

Measures of Location

1. Mean :

2. Mode

3. Median

Location : mean, median

Algebra test scores

No.     1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
Score  43 50 41 69 52 38 51 54 43 47 54 51 70 58 44 54 52 32 42 70

No.    21 22 23 24 25
Score  50 49 56 59 38

Mean = 50.68

10% trimmed mean of scores = 50.33333

Median = 51
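The three figures can be checked with a short sketch. The trimming rule used here (drop floor(n x trim) observations from each end, i.e. 2 of the 25 scores per tail) is an assumption; it does reproduce the value quoted above. The helper name `trimmed_mean` is hypothetical.

```python
import statistics

# The 25 algebra test scores from the slide.
scores = [43, 50, 41, 69, 52, 38, 51, 54, 43, 47, 54, 51, 70, 58, 44,
          54, 52, 32, 42, 70, 50, 49, 56, 59, 38]

def trimmed_mean(values, trim):
    """Mean after dropping floor(n * trim) observations from each end."""
    ordered = sorted(values)
    k = int(len(ordered) * trim)          # observations cut from each tail
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

mean_score = statistics.mean(scores)
median_score = statistics.median(scores)
trimmed_score = trimmed_mean(scores, 0.10)

print(mean_score, round(trimmed_score, 5), median_score)
```

The trimmed mean discards the two lowest scores (32, 38) and the two highest (70, 70), which is why it sits between the plain mean and the median here.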

Location : Non-classical

An M-estimate of location is a solution mu of the equation:

sum(psi( (y-mu)/s )) = 0.

Data set : car.miles

(bisquare) 204.5395

(Huber’s ) 204.2571
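As an illustration of the defining equation, here is a sketch of a Huber M-estimate computed by iterative reweighting. Everything beyond the equation itself is an assumption: the tuning constant c = 1.345, the choice of s as the normalised MAD held fixed, the function names, and the data (car.miles is not reproduced here). It is not claimed to be the routine that produced the numbers above.

```python
import statistics

def huber_psi(u, c=1.345):
    """Huber's psi: identity in the middle, clipped to [-c, c] in the tails."""
    return max(-c, min(c, u))

def mad_scale(y):
    """Normalised median absolute deviation, used as the fixed scale s."""
    med = statistics.median(y)
    return 1.4826 * statistics.median([abs(v - med) for v in y])

def huber_location(y, c=1.345, tol=1e-10, max_iter=200):
    """Solve sum(psi((y - mu) / s)) = 0 for mu by iterative reweighting."""
    mu = statistics.median(y)
    s = mad_scale(y)
    if s == 0:
        return mu
    for _ in range(max_iter):
        weights = []
        for v in y:
            u = (v - mu) / s
            # weight = psi(u) / u, taken as 1 when u is 0
            weights.append(1.0 if u == 0 else huber_psi(u, c) / u)
        new_mu = sum(w * v for w, v in zip(weights, y)) / sum(weights)
        if abs(new_mu - mu) < tol:
            return new_mu
        mu = new_mu
    return mu

# Hypothetical data with one gross outlier.
y = [201, 203, 204, 205, 206, 207, 260]
mu_hat = huber_location(y)
residual = sum(huber_psi((v - mu_hat) / mad_scale(y)) for v in y)
print(mu_hat, statistics.mean(y), residual)
```

The outlier pulls the ordinary mean well above the bulk of the data, while the M-estimate stays with the bulk because psi caps the influence of any single observation.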

Tabular method of computing

Class   freq   Class-midpt   Rel. freq   r.f x midpt
-3,-2      4      -2.5          0.08        -0.20
-2,-1      7      -1.5          0.14        -0.21
-1,0      18      -0.5          0.36        -0.18
0,1       14       0.5          0.28         0.14
1,2        4       1.5          0.08         0.12
2,3        2       2.5          0.04         0.10
3,4        1       3.5          0.02         0.07
Total     50                                -0.16

Tabular method of computing

A = -0.5, d = 1

Class   freq   Class-midpt (x)   x' = (x-A)/d   Rel. freq   r.f x x'
-3,-2      4       -2.5              -2            0.08       -0.16
-2,-1      7       -1.5              -1            0.14       -0.14
-1,0      18       -0.5               0            0.36        0
0,1       14        0.5               1            0.28        0.28
1,2        4        1.5               2            0.08        0.16
2,3        2        2.5               3            0.04        0.12
3,4        1        3.5               4            0.02        0.08
Total     50                                                   0.34
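Both tabular routes to the mean can be sketched as follows, using the class midpoints and frequencies from the tables. The back-transformation mean = A + d x (coded mean) is the standard step that connects the coded total 0.34 to the direct total -0.16; the variable names are illustrative.

```python
# Class midpoints and frequencies from the tabulation.
midpoints = [-2.5, -1.5, -0.5, 0.5, 1.5, 2.5, 3.5]
freqs = [4, 7, 18, 14, 4, 2, 1]

n = sum(freqs)
rel_freqs = [f / n for f in freqs]

# Direct method: sum of (relative frequency x midpoint).
direct_mean = sum(rf * x for rf, x in zip(rel_freqs, midpoints))

# Coded method: x' = (x - A) / d, then mean = A + d * mean(x').
A, d = -0.5, 1.0
coded = [(x - A) / d for x in midpoints]
coded_mean = sum(rf * xc for rf, xc in zip(rel_freqs, coded))
mean_from_coded = A + d * coded_mean

print(round(direct_mean, 2), round(coded_mean, 2), round(mean_from_coded, 2))
```

Coding keeps the hand arithmetic to small integers; the two methods give the same mean.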

Measures of Scale (aka Dispersion)

1. Variance (unbiased) : sum((x-mean(x))^2)/(N-1)

2. Variance (biased) : sum((x-mean(x))^2)/(N)

3. Standard Deviation : sqrt( variance)
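A small sketch of the three definitions, written directly from the formulas above and compared with Python's `statistics` module. The data values are hypothetical, and the function name `variance` is illustrative.

```python
import math
import statistics

def variance(x, unbiased=True):
    """sum((x - mean(x))^2) divided by N - 1 (unbiased) or N (biased)."""
    n = len(x)
    m = sum(x) / n
    ss = sum((v - m) ** 2 for v in x)
    return ss / (n - 1) if unbiased else ss / n

x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # hypothetical data

var_unbiased = variance(x)                  # divisor N - 1
var_biased = variance(x, unbiased=False)    # divisor N
sd = math.sqrt(var_unbiased)                # standard deviation

print(var_unbiased, var_biased, sd)
```

The two variances differ by the factor (N-1)/N, so the distinction matters mainly for small samples.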

Tabular method of computing

A = -0.5, d = 1

Class   Class-midpt (x)   x' = (x-A)/d   x'^2   Rel. freq   r.f x x'^2
-3,-2       -2.5              -2            4      0.08        0.32
-2,-1       -1.5              -1            1      0.14        0.14
-1,0        -0.5               0            0      0.36        0
0,1          0.5               1            1      0.28        0.28
1,2          1.5               2            4      0.08        0.32
2,3          2.5               3            9      0.04        0.36
3,4          3.5               4           16      0.02        0.32
Total                                                          1.74
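The table stops at the total 1.74, which is the mean of the squared coded values. A sketch of the remaining step, assuming the intended result is the biased (divisor N) variance of the grouped data obtained from E(x'^2) - [E(x')]^2 and scaled back by d^2:

```python
# Class midpoints and relative frequencies from the table.
midpoints = [-2.5, -1.5, -0.5, 0.5, 1.5, 2.5, 3.5]
rel_freqs = [0.08, 0.14, 0.36, 0.28, 0.08, 0.04, 0.02]

A, d = -0.5, 1.0
coded = [(x - A) / d for x in midpoints]

mean_coded = sum(rf * xc for rf, xc in zip(rel_freqs, coded))          # sum of r.f x x'
mean_sq_coded = sum(rf * xc ** 2 for rf, xc in zip(rel_freqs, coded))  # sum of r.f x x'^2

var_coded = mean_sq_coded - mean_coded ** 2   # variance of the coded values
var_x = d ** 2 * var_coded                    # shifting by A does not change the variance

print(round(mean_sq_coded, 2), round(var_x, 4))
```

Only the scale factor d enters the back-transformation; the origin A drops out.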

Robust measures of scale

1. The MAD scale estimate generally has very small bias compared with other scale estimators when there is "contamination" in the data.

2. Tau-estimates and A-estimates also have 50% breakdown, but are more efficient for Gaussian data.

3. The A-estimate that scale.a computes is redescending, so it is inappropriate if it is necessary that the scale estimate always increase as the size of a datapoint is increased. However, the A-estimate is very good if all of the contamination is far from the "good" data.

Comparison of scale measures

MAD(corn.yield) = 4.15128

scale.tau(corn.yield) = 4.027753

scale.a(corn.yield) = 4.040902

var(corn.yield) = 19.04191

sqrt(var(corn.yield)) = 4.363703

N.B. To really compare you have to compare for various probability distributions as well as various sample sizes.
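To illustrate why robust scale estimates matter, the sketch below compares the MAD with the standard deviation on a small sample before and after one value is corrupted. The data are hypothetical (corn.yield is not reproduced here), and the constant 1.4826, which makes the MAD comparable to the standard deviation for Gaussian data, is the conventional choice rather than something stated on the slide.

```python
import statistics

def mad_scale(x, constant=1.4826):
    """Median absolute deviation about the median, rescaled for Gaussian data."""
    med = statistics.median(x)
    return constant * statistics.median([abs(v - med) for v in x])

clean = [28.0, 30.0, 31.0, 32.0, 33.0, 34.0, 36.0]   # hypothetical sample
contaminated = clean[:-1] + [360.0]                  # one gross recording error

mad_clean, mad_bad = mad_scale(clean), mad_scale(contaminated)
sd_clean, sd_bad = statistics.stdev(clean), statistics.stdev(contaminated)

print(mad_clean, mad_bad)
print(sd_clean, sd_bad)
```

In this example the MAD is unchanged by the contamination while the standard deviation grows by more than an order of magnitude. As the N.B. says, one sample proves little; a fair comparison repeats this over many samples, distributions and sample sizes.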

Probability

1. Concept of an Experiment on Random observables

2. Sets and Events, Random variables, Probability

(a) Set of all basic outcomes = Sample space = S

(b) A subset of S (a single outcome or a union of outcomes) = An event

(A singleton event = simple event, else compound)

(c) A numerical function that associates each outcome in S with a number(s) = Random Variable

(d) A map from the set of events E into [0,1] obeying certain rules = Probability

Examples of Probability

Consider the toss of a single coin :

1. A single throw : Only two possible outcomes – Head or Tail

2. Two consecutive throws : Four possible outcomes – (Head, Head), (Head, Tail), (Tail, Head), (Tail, Tail)

3. Unbiased coin : P(Head turns up) = 0.5

4. Define R.V. X to be X(Head)=1, X(Tail)=0. P(X=1)=0.5, P(X=0)=0.5.
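The coin example can be written out as a sketch: the sample spaces are enumerated, each basic outcome of the unbiased coin gets equal probability, and the random variable X is a plain mapping from outcomes to numbers. The variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

outcomes = ("Head", "Tail")

one_throw = list(outcomes)                       # sample space for a single throw
two_throws = list(product(outcomes, repeat=2))   # sample space for two consecutive throws

# Unbiased coin: every basic outcome is equally likely.
p_single = {o: Fraction(1, len(one_throw)) for o in one_throw}

# Random variable X: X(Head) = 1, X(Tail) = 0.
X = {"Head": 1, "Tail": 0}

# Distribution of X: add up the probabilities of the outcomes mapped to each value.
pmf_X = {}
for outcome, p in p_single.items():
    pmf_X[X[outcome]] = pmf_X.get(X[outcome], Fraction(0)) + p

print(two_throws)
print(pmf_X)
```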

Axioms of Probability

1. $0 \le P(A) \le 1$ for any event A

2. $P[A \cup B] = P[A] + P[B]$ if A, B are disjoint sets/events

3. $P[S] = 1$

Basic Formulae-I

1. $P[A'] = 1 - P[A]$

2. $P[A \cap B] = 0$ if A, B are disjoint

3. $P[A \cup B] = P[A] + P[B] - P[A \cap B]$

4. $P[A \cup B \cup C] = P[A] + P[B] + P[C] - P[A \cap B] - P[A \cap C] - P[B \cap C] + P[A \cap B \cap C]$
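These formulae can be checked numerically on a small equally-likely sample space. The sketch below assumes one roll of a fair die; the events A, B, C are hypothetical choices, and exact fractions are used so the comparisons are not affected by rounding.

```python
from fractions import Fraction

S = frozenset(range(1, 7))          # sample space: one roll of a fair die

def P(event):
    """Probability of an event when all outcomes in S are equally likely."""
    return Fraction(len(event & S), len(S))

A = frozenset({2, 4, 6})            # even number
B = frozenset({4, 5, 6})            # at least four
C = frozenset({1, 2})               # at most two

complement_ok = P(S - A) == 1 - P(A)

union2_lhs = P(A | B)
union2_rhs = P(A) + P(B) - P(A & B)

union3_lhs = P(A | B | C)
union3_rhs = (P(A) + P(B) + P(C)
              - P(A & B) - P(A & C) - P(B & C)
              + P(A & B & C))

print(complement_ok, union2_lhs, union2_rhs, union3_lhs, union3_rhs)
```

Here `|` is set union and `&` is set intersection, so each line mirrors the corresponding formula.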

Basic Formulae - II

1. Counting Principle : For an ordered sequence to be formed by taking one element from each of N groups $G_1, G_2, \dots, G_N$ with sizes $k_1, k_2, \dots, k_N$, the total no. of sequences that can be formed is $k_1 \times k_2 \times \dots \times k_N$.

2. An ordered sequence of k objects taken from a set of n distinct objects is called a Permutation of size k of the objects, and is denoted by $P_{k,n}$.

3. For any positive integer m, m! is read as “m-factorial” and defined by $m! = m(m-1)(m-2) \cdots 3 \cdot 2 \cdot 1$.

4. Any unordered subset of size k from a set of n distinct objects is called a Combination, denoted $C_{k,n}$.
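The definitions can be checked by brute-force enumeration. In this sketch the objects, n, k and the group sizes are hypothetical, and the closed forms n!/(n-k)! and n!/[k!(n-k)!] are the standard expressions for the number of permutations and combinations.

```python
import math
from itertools import combinations, permutations

items = "ABCDE"                     # n = 5 distinct objects
n, k = len(items), 3

# Enumerate every ordered and every unordered selection of size k.
n_perm = len(list(permutations(items, k)))
n_comb = len(list(combinations(items, k)))

# Closed forms in terms of factorials.
perm_formula = math.factorial(n) // math.factorial(n - k)
comb_formula = math.factorial(n) // (math.factorial(k) * math.factorial(n - k))

# Counting principle: one element from each of three groups of sizes 3, 2 and 4.
group_sizes = [3, 2, 4]
n_sequences = math.prod(group_sizes)

print(n_perm, perm_formula, n_comb, comb_formula, n_sequences)
```

Each combination of size k corresponds to k! permutations, which is why the two counts differ by exactly that factor.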

Basic Formulae-III

1. $P_{k,n} = n!/(n-k)!$

2. $C_{k,n} = n!/[k!(n-k)!]$

3. For any two events A and B with P(B) > 0, the Conditional Probability of A given (that) B (has occurred) is defined by $P(A|B) = P(A \cap B)/P(B)$ [taken as 0 if P(B) = 0]

4. Let A, B be disjoint with $A \cup B = S$, and let C be any event with P[C] > 0. Then $P(C) = P(C|A)P(A) + P(C|B)P(B)$ [Law of Total Probability]

5. Let A, B be disjoint with $A \cup B = S$, and let C be any event with P[C] > 0. Then $P(A|C) = P(C|A)P(A)/[P(C|A)P(A) + P(C|B)P(B)]$ [Bayes Theorem]
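A worked example of the last two formulae, with invented numbers: A and B are taken to be disjoint events that together make up the whole sample space (for instance "has the condition" and "does not"), and C is a third event (for instance "test is positive"). All probabilities below are hypothetical.

```python
from fractions import Fraction

# Hypothetical inputs.
p_A = Fraction(1, 100)              # P(A)
p_B = 1 - p_A                       # P(B); A and B partition the sample space
p_C_given_A = Fraction(95, 100)     # P(C|A)
p_C_given_B = Fraction(10, 100)     # P(C|B)

# Law of Total Probability.
p_C = p_C_given_A * p_A + p_C_given_B * p_B

# Bayes Theorem.
p_A_given_C = p_C_given_A * p_A / p_C

print(p_C, p_A_given_C, float(p_A_given_C))
```

Even though C is far more likely under A than under B, P(A|C) stays below 10% in this example because A itself is rare.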

Random Variables - Discrete

1. A discrete set is a set such that either it is finite or there exists a one-to-one map from the set into the set of Natural numbers.

2. A discrete random variable is a r.v. which takes values in a discrete set consisting of numbers.

3. The probability distribution or probability mass function (pmf) of a discrete r.v. X is defined for every number x by $p(x) = P(X = x) = P(\{s \in S : X(s) = x\})$. [P[X=x] is read “the probability that the r.v. X assumes the value x”.] Note that $p(x) \ge 0$ and the sum of p(x) over all possible x is 1.

Cumulative Distribution Function

1. The Cumulative distribution function (cdf) F(x) of a discrete r.v. X with pmf p(x) is defined for every number x by $F(x) = P(X \le x) = \sum_{y : y \le x} p(y)$

2. For any number x, F(x) is the probability that the observed value of X will be at most x.

3. For any two numbers a, b with $a \le b$, $P(a \le X \le b) = F(b) - F(a-)$ where a- represents the largest possible X value that is strictly less than a.
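A sketch of a pmf and its cdf for a concrete discrete r.v., assuming X is the number of heads in two tosses of an unbiased coin (an illustrative choice). The function names `cdf` and `prob_between` are hypothetical.

```python
from fractions import Fraction
from itertools import product

# Build the pmf of X = number of heads in two tosses (1 = Head, 0 = Tail).
pmf = {}
for tosses in product((0, 1), repeat=2):
    x = sum(tosses)
    pmf[x] = pmf.get(x, Fraction(0)) + Fraction(1, 4)

def cdf(x):
    """F(x) = P(X <= x): sum of p(y) over all possible values y <= x."""
    return sum((p for y, p in pmf.items() if y <= x), Fraction(0))

def prob_between(a, b):
    """P(a <= X <= b) = F(b) - F(a-), a- being the largest X value below a."""
    below = [y for y in pmf if y < a]
    f_a_minus = cdf(max(below)) if below else Fraction(0)
    return cdf(b) - f_a_minus

print(pmf)
print(cdf(1), cdf(0.5), prob_between(1, 2))
```

Using F(a-) rather than F(a) keeps the probability mass at a itself inside the interval, which matters for discrete variables.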

Operations on RV’s

1. Expectation of a RV

2. Expectations of functions of RV’s

3. Special Cases : Moments, Covariance

Expected Values of Random Variables

1. Let X be a discrete r.v. with set of possible values D and pmf p(x). The expected value or mean value of X, denoted by E(X) or $\mu_X$, is $E(X) = \mu_X = \sum_{x \in D} x \cdot p(x)$

2. Note that E(X) may not always exist. Consider $p(x) = k/x^2$
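To see why the expectation fails to exist in this example, assume (the slide does not state the support) that X takes the values 1, 2, 3, ... Then the pmf can be normalised, but the defining sum for E(X) is a multiple of the harmonic series and diverges:

```latex
p(x) = \frac{k}{x^2}, \quad x = 1, 2, 3, \dots, \qquad
k = \Bigl( \sum_{x=1}^{\infty} \frac{1}{x^2} \Bigr)^{-1} = \frac{6}{\pi^2},
\qquad
\sum_{x=1}^{\infty} x \, p(x) = k \sum_{x=1}^{\infty} \frac{1}{x} = \infty .
```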

Expected Values of functions of Random Variables

1. Let X be a discrete r.v. with set of possible values D and pmf p(x). The expected value or mean value of f(X), denoted by E(f(X)) or $\mu_{f(X)}$, is $E(f(X)) = \sum_{x \in D} f(x) \cdot p(x)$

2. Example : Variance. $\mathrm{Var}(X) = V(X) = E[(X - E(X))^2] = E(X^2) - [E(X)]^2$
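The expectation of a function of X, and the two equivalent forms of the variance, can be sketched as below. The pmf is an illustrative choice (number of heads in two tosses of an unbiased coin), and the helper name `expect` is hypothetical.

```python
from fractions import Fraction

# Illustrative pmf: X = number of heads in two tosses of an unbiased coin.
pmf = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}

def expect(f, pmf):
    """E(f(X)) = sum over x in D of f(x) * p(x)."""
    return sum(f(x) * p for x, p in pmf.items())

mean_X = expect(lambda x: x, pmf)                          # E(X)
var_def = expect(lambda x: (x - mean_X) ** 2, pmf)         # E[(X - E(X))^2]
var_short = expect(lambda x: x ** 2, pmf) - mean_X ** 2    # E(X^2) - [E(X)]^2

print(mean_X, var_def, var_short)
```

Both forms give the same variance; the second is usually quicker by hand because it avoids subtracting the mean from every value.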

Random Variables - Continuous

Joint distribution of >1 RV’s

Gaussian or Normal Distribution

Sample as Random Observables

Parametric Inference

Tests of Hypothesis

Hypothesis Tests for Normal Population