Post on 21-Dec-2015

ST3905

Lecturer : Supratik Roy

Email : s.roy@ucc.ie

(Unix) : supratik@stat.ucc.ie

Phone: ext. 3626

What do we want to do?

1. What is statistics?

2. Describing Information :

3. Summarization, Visual and non-Visual representation

4. Drawing conclusion from information :

5. Managing uncertainty and incompleteness of information

Describing Information

1. Why summarization of information?

2. Visual representation (aka graphical Descriptive Statistics)

3. Non-visual representation (numerical measures)

4. Classical techniques vs modern IT

Stem and Leaf Plot

Decimal point is 2 places to the right of the colon

0 : 8

1 : 000011122233333333333344444

1 : 55555566666677777778888888899999999999

2 : 0000000111111111111222222233333333444444444

2 : 555556666666666777778889999999999999999

3 : 000000001111112222333333333444

3 : 55555555666667777777888888899999999

4 : 0122234

4 : 55555678888889

5 : 111111134

5 : 555667

6 : 44

6 : 7
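A display of this kind can be built in a few lines. The following is a minimal Python sketch, not the software that produced the slide; the function name `stem_and_leaf` and the sample values are hypothetical. It assumes non-negative values, uses the hundreds as stems and the tens as leaves (so the decimal point is 2 places to the right of the colon), splits each stem into a low line (leaves 0-4) and a high line (leaves 5-9), and simply omits lines that would be empty.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return the lines of a split-stem stem-and-leaf display.

    Stem = hundreds, leaf = tens digit; the units digit is truncated.
    Each stem gets up to two lines: leaves 0-4, then leaves 5-9.
    """
    rows = defaultdict(list)
    for v in sorted(values):
        stem = int(v) // 100
        leaf = (int(v) % 100) // 10
        half = 0 if leaf < 5 else 1   # 0 = low line, 1 = high line
        rows[(stem, half)].append(str(leaf))
    return [f"{stem} : {''.join(leaves)}"
            for (stem, _half), leaves in sorted(rows.items())]

# Hypothetical data, chosen only to illustrate the layout.
sample = [83, 104, 118, 152, 157, 230, 261, 674]
for line in stem_and_leaf(sample):
    print(line)
```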

Pie-Chart

[Figure: pie chart with sectors diffgeom, complex, algebra, reals, statistics]

DotChart

[Figure: dot chart, horizontal axis from 10 to 30. Groups: Child Care, Health Services, Community Centers, Family & Youth, Other. Within each group: Old Suburb, Coast County, New Suburb]

Histogram

[Figure: histogram of my.sample, 50 samples from a t distribution with 5 d.f.; horizontal axis ticks from -4 to 2, frequency axis from 0 to 15]

Histogram-Categorical

[Figure: histogram of state.region with categories Northeast, South, North Central, West; frequency axis from 0 to 15]

Rules for Histograms

1. Height of Rectangle proportional to frequency of class

2. No. of classes proportional to sqrt(total no. of observations) [not a hard and fast rule]

3. In case of categorical data, keep rectangle widths identical, and base of rectangles separate.

4. Best, if possible, let the software do it.

Data

[1] -0.053626486 -0.828128399 0.214910482 0.346570399

[5] -0.849316517 0.001077376 0.736191791 1.417540397

[9] -2.382332275 -2.699019949 -0.111907192 1.384903284

[13] 2.113286699 -1.828108272 -1.108280724 0.131883612

[17] -0.394494473 0.829806888 0.023178033 0.019839537

[21] -0.346280222 -0.251981108 1.159853307 -0.249501904

[25] -1.342704742 -2.012653224 -1.535503208 0.869806233

[29] -1.313495887 -0.244408426 -0.998886998 -1.446769605

[33] 1.224528053 -0.410163230 0.032230907 -0.137297112

[37] -2.717620031 -0.728570438 0.034697116 2.202863874

[41] -0.170794163 0.353651680 -0.673296374 3.136364814

[45] -1.260108638 -0.367334893 -0.652217259 -0.301847039

[49] 0.315180215 0.190766333

Tabulation

Class    Tally                  Freq
-3,-2    ////                      4
-2,-1    //// //                   7
-1,0     //// //// //// ///       18
0,1      //// //// ////           14
1,2      ////                      4
2,3      //                        2
3,4      /                         1
Total                             50
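The tabulation can be checked by machine. The sketch below is a minimal Python illustration, assuming classes of width 1 that are closed on the left (a value equal to a class boundary goes to the upper class); the names `my_sample`, `WIDTH` and `tabulate` are chosen here for illustration. It also evaluates the square-root rule of thumb for the number of classes, taking the constant of proportionality to be 1.

```python
import math
from collections import Counter

# The 50 values listed on the "Data" slide.
my_sample = [
    -0.053626486, -0.828128399, 0.214910482, 0.346570399,
    -0.849316517, 0.001077376, 0.736191791, 1.417540397,
    -2.382332275, -2.699019949, -0.111907192, 1.384903284,
    2.113286699, -1.828108272, -1.108280724, 0.131883612,
    -0.394494473, 0.829806888, 0.023178033, 0.019839537,
    -0.346280222, -0.251981108, 1.159853307, -0.249501904,
    -1.342704742, -2.012653224, -1.535503208, 0.869806233,
    -1.313495887, -0.244408426, -0.998886998, -1.446769605,
    1.224528053, -0.410163230, 0.032230907, -0.137297112,
    -2.717620031, -0.728570438, 0.034697116, 2.202863874,
    -0.170794163, 0.353651680, -0.673296374, 3.136364814,
    -1.260108638, -0.367334893, -0.652217259, -0.301847039,
    0.315180215, 0.190766333,
]

WIDTH = 1.0  # class width used in the tabulation

def tabulate(values, width=WIDTH):
    """Count values per class [lo, lo + width); returns {lo: frequency}."""
    counts = Counter(math.floor(v / width) * width for v in values)
    return dict(sorted(counts.items()))

freq = tabulate(my_sample)
for lo, f in freq.items():
    print(f"{lo:g},{lo + WIDTH:g}  {f}")

# Rule of thumb: number of classes roughly sqrt(number of observations).
n_classes = round(math.sqrt(len(my_sample)))
print("suggested number of classes:", n_classes)
```

For these data the rule of thumb suggests about 7 classes, which is the number of classes used in the tabulation.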

Box-Plot - I

[Figure: box plot; axis ticks at 200, 400, 600, 800]

Box Plot – II

[Figure: side-by-side box plots for the groups 18-24, 25-34, 35-44, 45-54, 55-64, 65+; vertical axis ticks from 1 to 7]

Box Plot – III

[Figure: NJ Pick-it Lottery (5/22/75-3/16/76). Side-by-side box plots of Payoff (axis ticks at 200, 400, 600, 800) against Leading Digit of Winning Numbers (0 to 9)]

Non-Visual (numerical measures)

1. Pictures vs. quantitative measures

2. Criteria for selection of a measure – purpose of study

3. Qualities that a measure should have

4. We live in an uncertain world – chances of error

Measures of Location

1. Mean

2. Mode

3. Median

Location : mean, median

Algebra test scores

No.    1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
Score 43 50 41 69 52 38 51 54 43 47 54 51 70 58 44 54 52 32 42 70

No.   21 22 23 24 25
Score 50 49 56 59 38

Mean = 50.68

10% trimmed mean of scores = 50.33333

Median = 51
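The three figures above can be reproduced from the 25 scores. The following Python sketch is only an illustration; the helper `trimmed_mean` is written here and assumes the convention that a 10% trimmed mean drops floor(0.10 x n) observations from each end of the sorted data (2 from each end when n = 25).

```python
import math
import statistics

scores = [43, 50, 41, 69, 52, 38, 51, 54, 43, 47, 54, 51, 70,
          58, 44, 54, 52, 32, 42, 70, 50, 49, 56, 59, 38]

def trimmed_mean(values, trim):
    """Mean after dropping floor(n * trim) values from each end."""
    ordered = sorted(values)
    k = math.floor(len(ordered) * trim)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

mean = statistics.mean(scores)
tmean = trimmed_mean(scores, 0.10)
median = statistics.median(scores)
print(mean, round(tmean, 5), median)
```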

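Location : Non-classical

An M-estimate of location is a solution mu of the equation:

sum(psi( (y-mu)/s )) = 0,

where psi is a chosen score function (e.g. Huber's or the bisquare) and s is an estimate of scale.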

Data set : car.miles

(bisquare) 204.5395

(Huber’s ) 204.2571
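To make the defining equation concrete, here is a minimal Python sketch of a Huber M-estimate computed by iteratively reweighted averaging. It is not the routine that produced the car.miles figures, and that data set is not reproduced in these slides, so the algebra test scores are reused instead. The function names, the tuning constant c = 1.345, the use of the normalised MAD as a fixed scale s, and the stopping rule are all assumptions of this sketch; it also assumes the MAD is non-zero.

```python
import statistics

def huber_location(values, c=1.345, tol=1e-10, max_iter=200):
    """Huber M-estimate of location with the scale fixed at the normalised MAD.

    Solves sum(psi((y - mu) / s)) = 0 with psi(u) = max(-c, min(c, u)).
    """
    med = statistics.median(values)
    s = 1.4826 * statistics.median([abs(y - med) for y in values])
    mu = med
    for _ in range(max_iter):
        # Weight w = psi(u) / u: 1 inside [-c, c], c / |u| outside.
        weights = [1.0 if abs(y - mu) <= c * s else c * s / abs(y - mu)
                   for y in values]
        new_mu = sum(w * y for w, y in zip(weights, values)) / sum(weights)
        if abs(new_mu - mu) < tol:
            return new_mu, s
        mu = new_mu
    return mu, s

def psi_sum(values, mu, s, c=1.345):
    """Left-hand side of the estimating equation for Huber's psi."""
    return sum(max(-c, min(c, (y - mu) / s)) for y in values)

scores = [43, 50, 41, 69, 52, 38, 51, 54, 43, 47, 54, 51, 70,
          58, 44, 54, 52, 32, 42, 70, 50, 49, 56, 59, 38]
mu_hat, s_hat = huber_location(scores)
print(mu_hat, psi_sum(scores, mu_hat, s_hat))
```

The second printed number should be essentially zero, which is the check that mu_hat solves the equation.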

Tabular method of computing

Class    Freq   Class midpt   Rel. freq   Rel. freq x midpt
-3,-2       4      -2.5          0.08          -0.20
-2,-1       7      -1.5          0.14          -0.21
-1,0       18      -0.5          0.36          -0.18
0,1        14       0.5          0.28           0.14
1,2         4       1.5          0.08           0.12
2,3         2       2.5          0.04           0.10
3,4         1       3.5          0.02           0.07
Total      50                                  -0.16

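Tabular method of computing

A = -0.5, d = 1 (class width), x' = (x-A)/d

Class    Freq   Class midpt (x)   x'   Rel. freq   Rel. freq x x'
-3,-2       4       -2.5          -2      0.08        -0.16
-2,-1       7       -1.5          -1      0.14        -0.14
-1,0       18       -0.5           0      0.36         0
0,1        14        0.5           1      0.28         0.28
1,2         4        1.5           2      0.08         0.16
2,3         2        2.5           3      0.04         0.12
3,4         1        3.5           4      0.02         0.08
Total      50                                          0.34

Mean = A + d x 0.34 = -0.5 + 0.34 = -0.16, in agreement with the direct computation.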
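Both tabular computations of the mean can be written out directly. This Python sketch uses the class midpoints and frequencies from the tabulation; the variable names are illustrative, and d = 1 is taken to be the class width.

```python
# Class midpoints and frequencies from the tabulation.
midpoints = [-2.5, -1.5, -0.5, 0.5, 1.5, 2.5, 3.5]
freqs = [4, 7, 18, 14, 4, 2, 1]
n = sum(freqs)
rel_freqs = [f / n for f in freqs]

# Direct method: sum of (rel. freq x midpoint).
mean_direct = sum(rf * x for rf, x in zip(rel_freqs, midpoints))

# Coded method: x' = (x - A) / d, then mean = A + d * sum(rel. freq x x').
A, d = -0.5, 1.0
coded = [(x - A) / d for x in midpoints]
coded_sum = sum(rf * xc for rf, xc in zip(rel_freqs, coded))
mean_coded = A + d * coded_sum

print(round(mean_direct, 2), round(coded_sum, 2), round(mean_coded, 2))
```

The coded method only shifts and rescales the midpoints, so both routes must give the same mean; the point of coding is that the hand arithmetic uses small integers.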

Measures of Scale (aka Dispersion)

1. Variance (unbiased) : sum((x-mean(x))^2)/(N-1)

2. Variance (biased) : sum((x-mean(x))^2)/(N)

3. Standard Deviation : sqrt( variance)
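The two variance formulas differ only in the divisor. The Python sketch below evaluates both on a small hypothetical data set and compares them with the standard library's `statistics.variance` (divisor N-1) and `statistics.pvariance` (divisor N).

```python
import math
import statistics

x = [2, 4, 4, 4, 5, 5, 7, 9]              # hypothetical data with mean 5
n = len(x)
m = sum(x) / n
ss = sum((xi - m) ** 2 for xi in x)        # sum of squared deviations

var_unbiased = ss / (n - 1)                # divisor N-1
var_biased = ss / n                        # divisor N
sd_unbiased = math.sqrt(var_unbiased)
sd_biased = math.sqrt(var_biased)

print(var_unbiased, var_biased, sd_unbiased, sd_biased)
```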

Tabular method of computing

A = -0.5, d = 1 (class width), x' = (x-A)/d

Class    Class midpt (x)   x'   x'^2   Rel. freq   Rel. freq x x'^2
-3,-2        -2.5          -2     4       0.08         0.32
-2,-1        -1.5          -1     1       0.14         0.14
-1,0         -0.5           0     0       0.36         0
0,1           0.5           1     1       0.28         0.28
1,2           1.5           2     4       0.08         0.32
2,3           2.5           3     9       0.04         0.36
3,4           3.5           4    16       0.02         0.32
Total                                                  1.74
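The table stops at the sum 1.74. A natural final step, shown here as a Python sketch under the assumption that the table is meant to feed the identity variance = d^2 x (mean of x'^2 - (mean of x')^2), gives the biased (divisor N) variance of the grouped data. Variable names are illustrative.

```python
A, d = -0.5, 1.0
midpoints = [-2.5, -1.5, -0.5, 0.5, 1.5, 2.5, 3.5]
freqs = [4, 7, 18, 14, 4, 2, 1]
n = sum(freqs)
rel_freqs = [f / n for f in freqs]
coded = [(x - A) / d for x in midpoints]

m1 = sum(rf * xc for rf, xc in zip(rel_freqs, coded))        # sum of r.f x x'
m2 = sum(rf * xc ** 2 for rf, xc in zip(rel_freqs, coded))   # sum of r.f x x'^2

var_grouped = d ** 2 * (m2 - m1 ** 2)   # biased variance, divisor N
sd_grouped = var_grouped ** 0.5
print(round(m1, 2), round(m2, 2), round(var_grouped, 4), round(sd_grouped, 4))
```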

Robust measures of scale

1. The MAD scale estimate generally has very small bias compared with other scale estimators when there is "contamination" in the data.

2. Tau-estimates and A-estimates also have 50% breakdown, but are more efficient for Gaussian data.

3. The A-estimate that scale.a computes is redescending, so it is inappropriate if it is necessary that the scale estimate always increase as the size of a datapoint is increased. However, the A-estimate is very good if all of the contamination is far from the "good" data.

Comparison of scale measures

MAD(corn.yield) = 4.15128

scale.tau(corn.yield) = 4.027753

scale.a(corn.yield) = 4.040902

var(corn.yield) = 19.04191

sqrt(var(corn.yield)) = 4.363703

N.B. To really compare you have to compare for various probability distributions as well as various sample sizes.
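The corn.yield data are not reproduced in these slides, so the figures above cannot be recomputed here. As a small illustration of why the MAD is called robust, the Python sketch below compares it with the standard deviation on a hypothetical sample before and after one value is replaced by a gross outlier. The function `mad` is written for this sketch; the constant 1.4826 is the usual factor that makes the MAD comparable to the standard deviation for Gaussian data.

```python
import statistics

def mad(values, constant=1.4826):
    """Median absolute deviation about the median, scaled by `constant`."""
    med = statistics.median(values)
    return constant * statistics.median([abs(v - med) for v in values])

clean = [1, 2, 3, 4, 5]            # hypothetical "good" data
contaminated = [1, 2, 3, 4, 500]   # same data with one gross outlier

print(mad(clean), statistics.stdev(clean))
print(mad(contaminated), statistics.stdev(contaminated))
```

The MAD is unchanged by the outlier, while the standard deviation grows by more than a factor of 100.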

Probability

1. Concept of an Experiment on Random observables

2. Sets and Events, Random variables, Probability

(a) Set of all basic outcomes = Sample space = S

(b) An element of S or a union of elements in S = An event

(A singleton event = simple event, else compound)

(c) A numerical function that associates each outcome in S with a number(s) = Random Variable

(d) A map from the collection of events E into [0,1] obeying certain rules = Probability

Examples of Probability

Consider toss of single coin :

1. A single throw : Only two possible outcomes – Head or Tail

2. Two consecutive throws : Four possible outcomes – (Head, Head), (Head, Tail), (Tail, Head), (Tail, Tail)

3. Unbiased coin : P(Head turns up) = 0.5

4. Define R.V. X to be X(Head)=1, X(Tail)=0. P(X=1)=0.5, P(X=0)=0.5.
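The coin examples can be enumerated explicitly. This Python sketch assumes an unbiased coin and independent throws, so that all four outcomes of two throws are equally likely; the helper `P` and the random variable `X` (number of heads) are illustrative choices.

```python
from fractions import Fraction
from itertools import product

# Sample space S for two consecutive throws.
outcomes = list(product(["Head", "Tail"], repeat=2))
prob = {s: Fraction(1, len(outcomes)) for s in outcomes}

def P(event):
    """Probability of an event, given as a collection of outcomes."""
    return sum(prob[s] for s in event)

at_least_one_head = [s for s in outcomes if "Head" in s]

# A random variable on S: X(s) = number of heads in outcome s.
X = {s: s.count("Head") for s in outcomes}
p_x1 = P([s for s in outcomes if X[s] == 1])

print(outcomes)
print(P(at_least_one_head), p_x1)
```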

Axioms of Probability

1. $0 \le P(A) \le 1$ for any event A

2. $P[A \cup B] = P[A] + P[B]$ if A, B are disjoint sets/events

3. $P[S] = 1$

Basic Formulae-I

1. $P[A'] = 1 - P[A]$

2. $P[A \cap B] = 0$ if A, B are disjoint

3. $P[A \cup B] = P[A] + P[B] - P[A \cap B]$

4. $P[A \cup B \cup C] = P[A] + P[B] + P[C] - P[A \cap B] - P[A \cap C] - P[B \cap C] + P[A \cap B \cap C]$
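These formulae can be checked on a small finite sample space. The Python sketch below uses a hypothetical example, one roll of a fair die with three arbitrarily chosen events, and exact fractions so that the two sides can be compared for equality.

```python
from fractions import Fraction

S = set(range(1, 7))   # one roll of a fair die (hypothetical example)

def P(event):
    """Probability of an event under equally likely outcomes."""
    return Fraction(len(event), len(S))

A = {2, 4, 6}
B = {4, 5, 6}
C = {3, 6}

# Complement rule.
complement_ok = P(S - A) == 1 - P(A)

# Two events.
two_ok = P(A | B) == P(A) + P(B) - P(A & B)

# Three events.
lhs = P(A | B | C)
rhs = (P(A) + P(B) + P(C)
       - P(A & B) - P(A & C) - P(B & C)
       + P(A & B & C))
print(complement_ok, two_ok, lhs, rhs)
```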

Basic Formulae - II

1. Counting Principle : For an ordered sequence to be formed by taking one element from each of N groups G1,G2,….GN with sizes k1,k2,….kN, the total no. of sequences that can be formed is k1 x k2 x … x kN.

2. An ordered sequence of k objects taken from a set of n distinct objects is called a Permutation of size k of the objects, and is denoted by Pk,n.

3. For any positive integer m, m! is read as “m-factorial” and defined by m!=m(m-1)(m-2)…3.2.1

4. Any unordered subset of size k from a set of n distinct objects is called a Combination, denoted Ck,n.
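The definitions can be checked by brute-force enumeration. This Python sketch lists all permutations and combinations of size k = 3 from a hypothetical set of n = 5 objects and compares the counts with `math.perm` and `math.comb` (available from Python 3.8); the group sizes used for the counting principle are also hypothetical.

```python
import math
from itertools import combinations, permutations, product

objects = ["a", "b", "c", "d", "e"]   # n = 5 distinct objects
n, k = len(objects), 3

n_perm = len(list(permutations(objects, k)))   # ordered sequences of size k
n_comb = len(list(combinations(objects, k)))   # unordered subsets of size k
print(n_perm, math.perm(n, k))
print(n_comb, math.comb(n, k))

# Counting principle: one element from each of three groups of sizes 2, 3, 4.
groups = [range(2), range(3), range(4)]
n_seq = len(list(product(*groups)))
print(n_seq, math.prod(len(g) for g in groups))
```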

Basic Formulae-III

1. Pk,n = n!/(n-k)!

2. Ck,n = n!/[k!(n-k)!]

3. For any two events A and B, the Conditional Probability of A given that B has occurred is defined by $P(A|B) = P(A \cap B)/P(B)$ when P(B)>0 [and is taken to be 0 if P(B)=0]

4. Let A,B be disjoint events with $A \cup B = S$, and let C be any event. Then P(C)=P(C|A)P(A)+P(C|B)P(B) [Law of Total Probability]

5. Let A,B be disjoint events with $A \cup B = S$, and let C be any event with P[C]>0. Then P(A|C)=P(C|A)P(A)/[P(C|A)P(A)+P(C|B)P(B)]. [Bayes Theorem]
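A short numerical illustration of the last two formulae, as a Python sketch. All probabilities below are hypothetical, and the sketch assumes that A and B are disjoint and together make up the whole sample space, so that P(A) + P(B) = 1.

```python
from fractions import Fraction

# Hypothetical inputs: A and B are disjoint and exhaust the sample space.
p_A, p_B = Fraction(3, 10), Fraction(7, 10)
p_C_given_A = Fraction(1, 2)
p_C_given_B = Fraction(1, 10)

# Law of Total Probability.
p_C = p_C_given_A * p_A + p_C_given_B * p_B

# Bayes Theorem.
p_A_given_C = p_C_given_A * p_A / p_C
p_B_given_C = p_C_given_B * p_B / p_C

print(p_C, p_A_given_C, p_B_given_C)
```

Because A and B exhaust the sample space, the two conditional probabilities given C add up to 1.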

Random Variables - Discrete

1. A discrete set is a set that is either finite or can be mapped one-to-one into the set of Natural numbers.

2. A discrete random variable is a r.v. which takes values in a discrete set consisting of numbers.

3. The probability distribution or probability mass function (pmf) of a discrete r.v. X is defined for every number x by $p(x) = P(X = x) = P(\{s \in S : X(s) = x\})$. [P[X=x] is read “the probability that the r.v. X assumes the value x”.] Note that $p(x) \ge 0$ and the sum of p(x) over all possible x is 1.

Cumulative Distribution Function

1. The Cumulative distribution function (cdf) F(x) of a discrete r.v. X with pmf p(x) is defined for every number x by $F(x) = P(X \le x) = \sum_{y : y \le x} p(y)$

2. For any number x, F(x) is the probability that the observed value of X will be at most x.

3. For any two numbers a,b with $a \le b$, $P(a \le X \le b) = F(b) - F(a-)$, where a- represents the largest possible X value that is strictly less than a.
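The relationship between pmf and cdf can be sketched in Python as follows. The pmf used is that of the number of heads in two throws of an unbiased coin; the function names `cdf` and `prob_between` are illustrative.

```python
from fractions import Fraction

# pmf of X = number of heads in two throws of an unbiased coin.
pmf = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}

def cdf(x):
    """F(x) = P(X <= x): sum of p(y) over all possible values y <= x."""
    return sum((p for y, p in pmf.items() if y <= x), Fraction(0))

def prob_between(a, b):
    """P(a <= X <= b) = F(b) - F(a-)."""
    below_a = [y for y in pmf if y < a]
    # a- is the largest possible value strictly below a; F(a-) = 0 if none.
    f_a_minus = cdf(max(below_a)) if below_a else Fraction(0)
    return cdf(b) - f_a_minus

print(cdf(0), cdf(1), cdf(1.5), cdf(2))
print(prob_between(1, 2))
```

Note that F is a step function: cdf(1.5) equals cdf(1) because X takes no value between 1 and 1.5.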

Operations on RV’s

1. Expectation of a RV

2. Expectations of functions of RV’s

3. Special Cases : Moments, Covariance

Expected Values of Random Variables

1. Let X be a discrete r.v. with set of possible values D and pmf p(x). The expected value or mean value of X, denoted by E(X) or $\mu_X$, is $E(X) = \mu_X = \sum_{x \in D} x \, p(x)$

2. Note that E(X) may not always exist. Consider $p(x) = k/x^2$ for x = 1, 2, 3, ...; then $\sum_x x \, p(x) = k \sum_x 1/x$, which diverges.

Expected Values of functions of Random Variables

1. Let X be a discrete r.v. with set of possible values D and pmf p(x). The expected value or mean value of f(X), denoted by E(f(X)) or $\mu_{f(X)}$, is $E(f(X)) = \sum_{x \in D} f(x) \, p(x)$

2. Example : Variance. $Var(X) = V(X) = E[(X - E(X))^2] = E(X^2) - [E(X)]^2$
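As a worked example of these definitions, the Python sketch below computes the mean and variance of a hypothetical discrete r.v., the score on one roll of a fair die, and checks that the definition of the variance and the shortcut formula agree. The helper `expect` is written for this sketch.

```python
from fractions import Fraction

# pmf of the score on one roll of a fair die (hypothetical example).
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

def expect(f, pmf):
    """E(f(X)) = sum over x in D of f(x) * p(x)."""
    return sum(f(x) * p for x, p in pmf.items())

mean = expect(lambda x: x, pmf)
var_definition = expect(lambda x: (x - mean) ** 2, pmf)
var_shortcut = expect(lambda x: x ** 2, pmf) - mean ** 2

print(mean, var_definition, var_shortcut)
```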

Random Variables - Continuous

Joint distribution of >1 RV’s

Gaussian or Normal Distribution

Sample as Random Observables

Parametric Inference

Tests of Hypothesis

Hypothesis Tests for Normal Population
