practical statistics - university of arizonaircamera.as.arizona.edu/astr_518/sep-2-numstat.pdf ·...

Practical Statistics

• Lecture 3 (Sep. 2): Read: W&J Ch. 4-5

- Correlation

- Hypothesis Testing

• Lecture 4 (Sep. 4)

- Principal Component Analysis

• Lecture 5 (Sep. 9): Read: W&J Ch. 6

- Parameter Estimation

- Bayesian Analysis

- Rejecting Outliers

- Bootstrap + Jack-knife

• Lecture 6 (Sep. 11) Read: W&J Ch. 7

- Random Numbers

- Monte Carlo Modeling

• Lecture 7 (Sep. 16):

- Markov Chain MC

• Lecture 8 (Sep. 18): Read: W&J Ch. 9

- Fourier Techniques

- Filtering

- Unevenly Sampled Data

Review: Process of Decision Making


Ask a Question

Take Data

Reduce Data

Derive Statistics describing data

Does the Statistic answer your question?

Probability Distribution

Error Analysis

Publish!

No

Reflect on what is needed

Yes

Hypothesis Testing

Simulation

Review: The Binomial distribution

• You are observing something that has a probability, p, of occurring in a single observation.

• You observe it M times.

• Want chance of obtaining n successes. For one particular sequence of observations the probability is:

$$P_1(n) = p^n (1-p)^{M-n}$$

• There are many sequences which yield n successes:

$$P(n) = \frac{M!}{n!\,(M-n)!}\, p^n (1-p)^{M-n} = \binom{M}{n} p^n (1-p)^{M-n}$$
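The binomial probability can be evaluated directly. The sketch below is a minimal illustration using only the Python standard library; the function name `binomial_pmf` and the example values (M = 10, p = 0.5) are assumptions chosen for illustration, not part of the lecture.

```python
from math import comb

def binomial_pmf(n, M, p):
    """Probability of exactly n successes in M independent observations."""
    # comb(M, n) counts the sequences; p^n (1-p)^(M-n) is the chance of each one
    return comb(M, n) * p**n * (1 - p)**(M - n)

# Illustrative values only: 10 observations, each with p = 0.5
probs = [binomial_pmf(n, 10, 0.5) for n in range(11)]
print(probs[5])    # chance of exactly 5 successes
print(sum(probs))  # the probabilities over all n should sum to 1
```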

Example: The importance of the null result

You are reading a telescope proposal to observe stars with transition disks for binarity. They aim to disprove that transition disks are caused by a stellar mass companion.

The proposal requests time for 20 objects. The stated goal is to prove that, in general, transition disks are not due to a stellar companion.


Is this a reasonable sample?

If there is only time to do 10, should you give them time?
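One way to frame these questions is with the binomial distribution: if a fraction f of transition disks really had stellar companions, the chance of finding none in M targets is (1-f)^M. The sketch below assumes every companion that exists would be detected, and the 95% confidence level is a hypothetical choice; both function names are made up for this example.

```python
def prob_no_detections(f, M):
    """Chance of zero companions in M targets if the true fraction is f."""
    return (1.0 - f) ** M

def upper_limit_on_fraction(M, confidence=0.95):
    """Largest companion fraction still consistent with a null result.

    Solves (1 - f)^M = 1 - confidence for f. Assumes perfect detection
    efficiency, which a real survey would not have.
    """
    return 1.0 - (1.0 - confidence) ** (1.0 / M)

for M in (20, 10):
    print(M, "targets: f <", round(upper_limit_on_fraction(M), 3))
```

Under these assumptions a null result from 20 objects limits the companion fraction to roughly 14%, while 10 objects only limits it to roughly 26%.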

Example 2: Counting Statistics

I am at the Kuiper 1.5 m telescope on Mt. Bigelow, getting ready to measure a transit of a planet in front of a star. I expect the drop in brightness to be 1% of the star’s total flux.

I take a 1 s exposure and measure 200 digital units (DU) for the star. The camera’s user manual tells me to expect 5 e-/DU, so the exposure collected 200 × 5 = 1000 photoelectrons.

How long should I expose for each frame to get good quality data?
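A rough estimate follows from counting statistics: N photoelectrons carry a fractional noise of 1/sqrt(N). The sketch below assumes photon noise is the only noise source (no read noise, sky background, scintillation, or saturation limit), and the per-frame signal-to-noise target is a hypothetical parameter the observer must choose.

```python
import math

def fractional_noise(rate_e_per_s, t_exp):
    """Poisson fractional noise, 1/sqrt(N), for N = rate * t_exp photoelectrons."""
    return 1.0 / math.sqrt(rate_e_per_s * t_exp)

def exposure_time(rate_e_per_s, depth, snr):
    """Exposure time at which the transit depth equals snr times the noise."""
    # depth / snr = 1 / sqrt(rate * t)  =>  t = (snr / depth)^2 / rate
    return (snr / depth) ** 2 / rate_e_per_s

rate = 200 * 5 / 1.0   # 200 DU x 5 e-/DU in a 1 s exposure
print(fractional_noise(rate, 1.0))    # about 3%: a 1% drop is lost in the noise
print(exposure_time(rate, 0.01, 3))   # seconds per frame for a 3-sigma depth (assumed target)
```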

Example 3: Recursive Bayesian

If we have a coin we suspect is double headed, how many flips would it take us to be reasonably confident it really is double headed?

What posterior probability criterion should we choose?

Assume we adopt the prior belief that there is only a 1% chance. . .

Assume we have seen two coins, one double-headed and the other normal. We don’t know which is being used. What is the prior?
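The recursive update can be sketched as follows. This is an illustration under stated assumptions: the 1% prior is read as the chance that the coin is double headed, the two-coin case is given a prior of 0.5, every flip is assumed to come up heads, and the 0.99 threshold is a hypothetical choice of criterion.

```python
def update(prior, p_heads_if_double=1.0, p_heads_if_fair=0.5):
    """Posterior probability of 'double headed' after observing one head."""
    numerator = prior * p_heads_if_double
    return numerator / (numerator + (1.0 - prior) * p_heads_if_fair)

def flips_needed(prior, threshold=0.99):
    """Number of consecutive heads before the posterior reaches the threshold."""
    belief, n = prior, 0
    while belief < threshold:
        belief = update(belief)   # yesterday's posterior is today's prior
        n += 1
    return n

print(flips_needed(0.01))  # sceptical prior (assumed meaning of the 1% case)
print(flips_needed(0.5))   # one of two coins, chosen at random
```

Each head doubles the odds in favour of the double-headed coin, so the prior only shifts the required number of flips by a fixed offset.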

Correlation

• Often the first approach to analyzing data is to look for correlations in various parameters.

- May or may not be physically motivated.

- Understand experimental effects first (be skeptical).

- Be careful of “subclusters” of points.

- Correlation is not (necessarily) causation (remain skeptical).

A mass-separation correlation?


Are people born early in the year better hockey players?

See “Outliers” book by Malcolm Gladwell

Correlation coefficient

• The correlation coefficient for two parameters, x and y, is defined as the covariance between parameters over the scatter in the distribution for each parameter:

$$\rho = \frac{\mathrm{covariance}(x, y)}{\sigma_x \sigma_y}$$

• The correlation coefficient can be estimated directly from the data:

$$r = \frac{\sum_i (X_i - \langle X \rangle)(Y_i - \langle Y \rangle)}{\sqrt{\sum_i (X_i - \langle X \rangle)^2 \, \sum_i (Y_i - \langle Y \rangle)^2}}$$
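The estimator r translates almost line for line into code. The function name `pearson_r` and the sample values are illustrative assumptions.

```python
import math

def pearson_r(x, y):
    """Sample correlation coefficient r of two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))  # perfectly correlated
print(pearson_r([1, 2, 3], [1, 3, 2]))        # partially correlated
```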

Probability of correlation

• For a bivariate Gaussian distribution, Bayes’ theorem can be used to estimate the probability of correlation:

$$\mathrm{prob}(\rho \mid \mathrm{data}) \propto \frac{(1-\rho^2)^{(N-1)/2}}{(1-\rho r)^{N-3/2}} \left(1 + \frac{1}{N-1/2}\,\frac{1+\rho r}{8} + \dots\right)$$

• This is often useful for comparing correlations or giving relative chances on the correlation of data.
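A minimal numerical sketch of this distribution: evaluate the unnormalized posterior on a grid of ρ and sum the part with ρ > 0. The series is truncated after the first-order term shown on the slide, and the grid limits, grid size, and function names are assumptions made for this example.

```python
def jeffreys_posterior(rho, r, N):
    """Unnormalized prob(rho | data), series truncated after the first-order term."""
    lead = (1 - rho**2) ** ((N - 1) / 2) / (1 - rho * r) ** (N - 1.5)
    return lead * (1 + (1 + rho * r) / (8 * (N - 0.5)))

def prob_positive(r, N, n_grid=2000):
    """Probability that rho > 0, by summing the posterior on a uniform grid."""
    # An even n_grid on a symmetric grid keeps rho = 0 off the grid
    step = 2 * 0.999 / (n_grid - 1)
    grid = [-0.999 + i * step for i in range(n_grid)]
    post = [jeffreys_posterior(rho, r, N) for rho in grid]
    positive = sum(p for rho, p in zip(grid, post) if rho > 0)
    return positive / sum(post)

for r in (0.25, 0.5, 0.75):
    print(r, prob_positive(r, N=20))
```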

Use of Jeffreys’ correlation distribution

W&J Figure 4.5

Probability of a positive correlation

Curves shown for r = 0.75, r = 0.5, and r = 0.25 (W&J Figure 4.6)

What if we see a correlation?

• It’s common (but dangerous!) to just fit a line to the data:

“Anscombe’s quartet” illustrates the potential pitfalls of line fitting

Principal Component Analysis

• If we have N objects, n measured variables (x_j, j = 1, ..., n) for each object then:

- We want a minimum number of variables that are independent.

- These variables will be linear combinations of the observed variables:

$$\xi_i = \sum_{j=1}^{n} a_{ij} x_j$$

The goal is to define the new variables to minimize the residual variance in the data

Geometrical view of PCA

• Iterative approach of finding the component with maximum variance.
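The iterative view can be sketched with power iteration on the covariance matrix: find the direction of maximum variance, subtract the variance it explains (deflation), and repeat. This is one possible implementation, not necessarily the one used in the lecture; the function names, the start vector, the iteration count, and the sample data are all assumptions.

```python
import math

def covariance_matrix(rows):
    """Sample covariance of N objects (rows) by n variables (columns)."""
    n_obj, n_var = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n_obj for j in range(n_var)]
    centered = [[r[j] - means[j] for j in range(n_var)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) / (n_obj - 1)
             for j in range(n_var)] for i in range(n_var)]

def leading_component(cov, n_iter=500):
    """Power iteration: (variance, unit vector) along the maximum-variance axis."""
    n = len(cov)
    start = [float(j + 1) for j in range(n)]           # arbitrary start vector
    norm = math.sqrt(sum(x * x for x in start))
    v = [x / norm for x in start]
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm < 1e-15:                               # no variance left
            break
        v = [x / norm for x in w]
    variance = sum(v[i] * cov[i][j] * v[j] for i in range(n) for j in range(n))
    return variance, v

def principal_components(rows):
    """All components, ordered by decreasing variance."""
    cov = covariance_matrix(rows)
    n = len(cov)
    components = []
    for _ in range(n):
        variance, v = leading_component(cov)
        components.append((variance, v))
        # Deflate: remove the variance already explained by this component
        cov = [[cov[i][j] - variance * v[i] * v[j] for j in range(n)]
               for i in range(n)]
    return components

# Illustrative data: two strongly correlated variables
data = [(1.0, 1.0), (2.0, 2.1), (3.0, 2.9), (4.0, 4.2), (5.0, 4.8)]
for variance, vector in principal_components(data):
    print(round(variance, 4), [round(a, 3) for a in vector])
```

The rows of coefficients returned here play the role of a_ij: each component is a linear combination of the observed variables.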


PCA manipulation


Statistics for Hypothesis Testing

• Hypothesis testing uses some metric to determine whether two data sets, or a data set and a model, are distinct.

• Typically, the problem is set up so that the hypothesis is that the data sets are consistent (the null hypothesis).

• A probability is calculated that a value at least as extreme as the one found would be obtained with another sample if the null hypothesis were true.

• Based on the required level of confidence, the hypothesis is rejected or accepted.

Parametric Tests

• Often, the most intuitive way to understand our data is to choose the parameter of interest (say the mean) and compare it to a model.

• Alternatively, we might be comparing two data sets by asking whether the differences in a statistic are meaningful.

• These general tests are called “Parametric tests”.

• They can use frequentist approaches to accept or reject the hypothesis.

• They can use Bayesian approaches to calculate probabilities of different results.

Are two data sets drawn from the same distribution?

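• The “t” statistic quantifies the likelihood that the means are the same.

• The “F” statistic quantifies the likelihood that the variances of two data sets are the same.

• Consider two data sets, x and y, with n and m data points:

$$S_x = \frac{\sum_i (x_i - \bar{x})^2}{n}, \qquad S_y = \frac{\sum_i (y_i - \bar{y})^2}{m}$$

$$s^2 = \frac{n S_x + m S_y}{n + m - 2}$$

$$t = \frac{\bar{x} - \bar{y}}{s \sqrt{1/m + 1/n}}$$

$$F = \frac{\sum_i (x_i - \bar{x})^2 / (n-1)}{\sum_i (y_i - \bar{y})^2 / (m-1)}$$

$$\nu = m + n - 2$$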

Student's t test

• Calculate the t statistic. A perfect agreement is t=0.

$$s^2 = \frac{n S_x + m S_y}{n + m - 2}, \qquad t = \frac{\bar{x} - \bar{y}}{s \sqrt{1/m + 1/n}}$$

• Evaluate the probability for t>value.
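Both steps can be sketched with the standard library alone. The tail probability below integrates the Student's t density numerically with Simpson's rule rather than calling a statistics package; that choice, the function names, and the sample data are assumptions made to keep the example self-contained.

```python
import math

def t_statistic(x, y):
    """Pooled-variance t statistic and degrees of freedom for two samples."""
    n, m = len(x), len(y)
    xbar, ybar = sum(x) / n, sum(y) / m
    ss_x = sum((xi - xbar) ** 2 for xi in x)   # equals n * S_x
    ss_y = sum((yi - ybar) ** 2 for yi in y)   # equals m * S_y
    nu = n + m - 2
    s = math.sqrt((ss_x + ss_y) / nu)
    return (xbar - ybar) / (s * math.sqrt(1 / m + 1 / n)), nu

def t_pdf(t, nu):
    """Student's t probability density with nu degrees of freedom."""
    log_norm = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
                - 0.5 * math.log(nu * math.pi))
    return math.exp(log_norm - (nu + 1) / 2 * math.log(1 + t * t / nu))

def t_tail_probability(t, nu, steps=2000):
    """One-sided P(T > |t|): 0.5 minus the Simpson integral of the pdf on [0, |t|]."""
    b = abs(t)
    h = b / steps                      # steps must be even for Simpson's rule
    total = t_pdf(0.0, nu) + t_pdf(b, nu)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, nu)
    return 0.5 - total * h / 3

# Illustrative data only
t, nu = t_statistic([1, 2, 3, 4, 5], [2, 3, 4, 5, 6])
print(t, nu, 2 * t_tail_probability(t, nu))   # two-sided probability
```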

F test

• Calculate the F statistic.

$$F = \frac{\sum_i (x_i - \bar{x})^2 / (n-1)}{\sum_i (y_i - \bar{y})^2 / (m-1)}$$

• Calculate the probability that F>value.
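A standard-library sketch of the F test follows. The tail probability uses the identity P(F' > F) = I_x(d2/2, d1/2) with x = d2/(d2 + d1 F), where I is the regularized incomplete beta function, evaluated here by Simpson's rule. The simple integrator assumes at least three points in each sample, and all function names and data are assumptions for illustration.

```python
import math

def f_statistic(x, y):
    """Ratio of sample variances, with the two degrees of freedom."""
    n, m = len(x), len(y)
    xbar, ybar = sum(x) / n, sum(y) / m
    var_x = sum((xi - xbar) ** 2 for xi in x) / (n - 1)
    var_y = sum((yi - ybar) ** 2 for yi in y) / (m - 1)
    return var_x / var_y, n - 1, m - 1

def regularized_incomplete_beta(x, a, b, steps=4000):
    """I_x(a, b) by Simpson's rule; only valid for a >= 1 and b >= 1."""
    log_beta = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    def integrand(t):
        return t ** (a - 1) * (1 - t) ** (b - 1)
    h = x / steps                      # steps must be even for Simpson's rule
    total = integrand(0.0) + integrand(x)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * integrand(k * h)
    return total * h / 3 / math.exp(log_beta)

def f_tail_probability(F, d1, d2):
    """P(F' > F) for an F distribution with (d1, d2) degrees of freedom."""
    return regularized_incomplete_beta(d2 / (d2 + d1 * F), d2 / 2, d1 / 2)

# Illustrative data only
F, d1, d2 = f_statistic([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
print(F, d1, d2, f_tail_probability(F, d1, d2))
```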

Student-t Test Example

Imagine we are observing a sample of stars with known hot Jupiters.

A fraction of these are observed to have stellar companions.

A fraction of the sample have orbits that are significantly misaligned with the stellar spin axis.

There appears to be a connection between these.

What is the chance that the samples with and without stellar companions are drawn from the same distribution?

Student-t Test Example

There are 27 degrees of freedom in this example (29 observations and two means to calculate).

The mean of the misaligned sample is 0.77 detections/star.

The mean of the aligned sample is 0.25 detections/star.

The calculated t statistic is t = 2.0.

For 27 degrees of freedom this corresponds to a two-sided probability of roughly 5% (about 3% one-sided) of obtaining a difference this large if the two samples are drawn from the same distribution.

Non-Parametric Tests

If we don’t know the underlying distribution, or have small number statistics, there are still tests that can be used to accept or reject a hypothesis.

Non-parametric tests still make some assumption about the data: usually this is something related to the data following counting statistics, or the binomial distribution (randomness assumed, in the appropriate form).

The Kolmogorov-Smirnov Test

• Calculate the cumulative distribution function for your model, C_model(x).

• Calculate the cumulative distribution function for your data, C_data(x).

• Find the maximum of |C_model(x) - C_data(x)|.

• The variables, x, must be continuous to use the K-S test.
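The steps above can be sketched as follows. The significance uses the asymptotic Kolmogorov series with a commonly quoted small-sample correction to the effective sample size; that approximation, the function names, and the uniform model used in the example are assumptions, not something specified in the lecture.

```python
import math

def ks_statistic(data, model_cdf):
    """Maximum distance between the empirical CDF and the model CDF."""
    xs = sorted(data)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        c = model_cdf(x)
        # The empirical CDF steps from i/n to (i+1)/n at x; check both sides
        d = max(d, abs(c - i / n), abs((i + 1) / n - c))
    return d

def ks_probability(d, n, terms=100):
    """Approximate chance of a distance >= d if the data follow the model."""
    lam = (math.sqrt(n) + 0.12 + 0.11 / math.sqrt(n)) * d
    if lam < 0.2:          # series converges slowly here; probability is ~1
        return 1.0
    total = sum((-1) ** (j - 1) * math.exp(-2 * j * j * lam * lam)
                for j in range(1, terms + 1))
    return min(1.0, max(0.0, 2 * total))

# Illustrative model: uniform distribution on [0, 1]
uniform_cdf = lambda x: min(1.0, max(0.0, x))
good = [0.1, 0.3, 0.5, 0.7, 0.9]
bad = [0.8, 0.85, 0.9, 0.95, 0.99]
for sample in (good, bad):
    d = ks_statistic(sample, uniform_cdf)
    print(d, ks_probability(d, len(sample)))
```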

K-S test example

Right panel is the CDF of known single radial velocity planets (solid line).

If we model this as a mixture of single planets, and double planets (which mimic a single eccentric planet) the correct mixture is ~50%, constrained by the K-S test.

Chi-squared test

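The chi-squared statistic can be used to compare any model to a data set:

$$\chi^2 = \sum_{i=1}^{N} \frac{(E_i - O_i)^2}{E_i}$$

where O_i is the observed count and E_i the count expected from the model in bin i.

• Assumes variation in data is due to counting statistics

• Data must be binned so that E_i is reasonable for the model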
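A minimal sketch of the statistic and its tail probability, using only the standard library. The probability comes from a series expansion of the regularized lower incomplete gamma function, which is adequate for moderate values of chi-squared; the function names, the number of degrees of freedom passed in, and the binned counts are assumptions for illustration.

```python
import math

def chi_squared(observed, expected):
    """Sum over bins of (E_i - O_i)^2 / E_i."""
    return sum((e - o) ** 2 / e for o, e in zip(observed, expected))

def chi2_tail_probability(chi2, dof, terms=500):
    """P(X > chi2) for a chi-squared distribution with dof degrees of freedom."""
    a, x = dof / 2, chi2 / 2
    if x <= 0:
        return 1.0
    # Series: gamma(a, x) = e^-x x^a * sum_n x^n / (a (a+1) ... (a+n))
    term = 1.0 / a
    total = term
    for n in range(1, terms):
        term *= x / (a + n)
        total += term
    lower = total * math.exp(-x + a * math.log(x) - math.lgamma(a))
    return 1.0 - lower

# Illustrative counts in four bins; dof = bins - 1 assumes the model
# was only normalized to the total, with no other fitted parameters
observed = [18, 25, 31, 26]
expected = [25.0, 25.0, 25.0, 25.0]
chi2 = chi_squared(observed, expected)
print(chi2, chi2_tail_probability(chi2, dof=3))
```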

General Picture:

Correlation -> Hypothesis Testing -> Model Fitting -> Parameter Estimation.

• Is there a correlation?

• Is it consistent with an assumed distribution?

• Does the assumed model fit the data?

• What parameters can we derive for the model with what uncertainty?
