introduction to probability and statisticsfilippi/teaching/dtc2014/lecture_filippi1.pdf ·...

44
Introduction to Probability and Statistics Sarah Filippi Departmental Lecturer, University of Oxford [email protected] 22/10/2014

Upload: trandiep

Post on 21-May-2018

219 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: Introduction to Probability and Statisticsfilippi/Teaching/dtc2014/lecture_filippi1.pdf · Introduction to Probability and Statistics Sarah Filippi Departmental Lecturer, University

Introduction to Probability and Statistics

Sarah FilippiDepartmental Lecturer, University of Oxford

[email protected]

22/10/2014

Page 2: Introduction to Probability and Statisticsfilippi/Teaching/dtc2014/lecture_filippi1.pdf · Introduction to Probability and Statistics Sarah Filippi Departmental Lecturer, University

Objectives of Course

I To teach underlying theory and applications of statistical dataanalysis

I To emphasis the importance of exploratory data analysis, priorto any formal statistical testing or modelling

I To define some useful summary statistics (numbers) that helpcharacterise features of data sets, such as location and spread

I To introduce probability as a mathematical framework forreasoning with uncertainty

I To demonstrate the use of MatLab to implement thesetechniques

Page 3: Introduction to Probability and Statisticsfilippi/Teaching/dtc2014/lecture_filippi1.pdf · Introduction to Probability and Statistics Sarah Filippi Departmental Lecturer, University

What is Statistics ?

I Statistics is the science and art of data analysisI from observational studiesI from planned experiments

I Statistics is concerned with the collection, analysis andinterpretation of data

I It is the science of the scientific method

Page 4: Introduction to Probability and Statisticsfilippi/Teaching/dtc2014/lecture_filippi1.pdf · Introduction to Probability and Statistics Sarah Filippi Departmental Lecturer, University

Branches of statistics

I Statistics covers a range of areas from data collection(optimal design of experiments) to the construction ofstochastic (empirical) models

I It is a broad churchI Some areas of note include

I graphical displays of quantitative dataI stochastic modelling of systemsI forecasting algorithms

Page 5: Introduction to Probability and Statisticsfilippi/Teaching/dtc2014/lecture_filippi1.pdf · Introduction to Probability and Statistics Sarah Filippi Departmental Lecturer, University

Uncertainty

At the heart of statistics is the rigourous treatment of uncertaintyor random variation as characterised via probability

I probability provides a formal system for inference andinductive logic

I allows for coherent accumulation of evidence supporting orrefuting a hypothesis of interest

Page 6: Introduction to Probability and Statisticsfilippi/Teaching/dtc2014/lecture_filippi1.pdf · Introduction to Probability and Statistics Sarah Filippi Departmental Lecturer, University

My use of statistics

I I work on a range of topics in computational statistics focusedon understanding biological processes and their relation todisease.

I Using Bayesian statistics and stochastic models.

I I did a PhD in reinforcement learning, a field at theintersection of statistics, machine learning and artificialintelligence.

Page 7: Introduction to Probability and Statisticsfilippi/Teaching/dtc2014/lecture_filippi1.pdf · Introduction to Probability and Statistics Sarah Filippi Departmental Lecturer, University

First steps

I The starting point of ALL good statistical data analysis beginswith graphical plots and summary statistics of the data

I ALWAYS, ALWAYS, ALWAYS, PLOT YOUR DATA!!!

I Why?

Page 8: Introduction to Probability and Statisticsfilippi/Teaching/dtc2014/lecture_filippi1.pdf · Introduction to Probability and Statistics Sarah Filippi Departmental Lecturer, University

First steps

I The starting point of ALL good statistical data analysis beginswith graphical plots and summary statistics of the data

I ALWAYS, ALWAYS, ALWAYS, PLOT YOUR DATA!!!

I Why?

Graphics reveal data, communicate complex ideas anddependencies with clarity, precision and efficiency

Page 9: Introduction to Probability and Statisticsfilippi/Teaching/dtc2014/lecture_filippi1.pdf · Introduction to Probability and Statistics Sarah Filippi Departmental Lecturer, University

Graphical Excellence

Excellent graphics:

I show the data

I induce the viewer to think about the substance

I avoid bias

I make large complex data sets coherent

I encourage data exploration and debate

Page 10: Introduction to Probability and Statisticsfilippi/Teaching/dtc2014/lecture_filippi1.pdf · Introduction to Probability and Statistics Sarah Filippi Departmental Lecturer, University

Types of data

Categorical/ Qualitative Quantitative

Nominal Ordinal

Binary non-Binary

Discrete (Counts) Continuous

Eye colour

Socioeconomic class

Height, weight

Gender

IQ

ethnicityhighest level

of educational attainment

femur length

number of children

age

Likert scale: 1=Strongly disagree;

2=Disagree; 3=Neutral; 4=Agree;

5=Strongly agree.

Smoker (yes/no)

Page 11: Introduction to Probability and Statisticsfilippi/Teaching/dtc2014/lecture_filippi1.pdf · Introduction to Probability and Statistics Sarah Filippi Departmental Lecturer, University

Bar chart for categorical data

Sex of babies born in one day on an Australian hospital

Frequency

Girl Boy

04

812

1620

24

Page 12: Introduction to Probability and Statisticsfilippi/Teaching/dtc2014/lecture_filippi1.pdf · Introduction to Probability and Statistics Sarah Filippi Departmental Lecturer, University

Histograms for quantitative data

Birth Weight (g)

Freq

uenc

y

1500 2000 2500 3000 3500 4000 4500

05

1015

200

510

1520

Page 13: Introduction to Probability and Statisticsfilippi/Teaching/dtc2014/lecture_filippi1.pdf · Introduction to Probability and Statistics Sarah Filippi Departmental Lecturer, University

How to make a histogram?

1. Find range of the data

2. Choose bin size

3. Count up observation in each bin

4. Plot boxes

Page 14: Introduction to Probability and Statisticsfilippi/Teaching/dtc2014/lecture_filippi1.pdf · Introduction to Probability and Statistics Sarah Filippi Departmental Lecturer, University

Effect of bin size

Too few categories

Birth Weight (g)

Freq

uenc

y

1500 2500 3500 4500

05

1015

2025

3035

Too many categories

Birth Weight (g)

Freq

uenc

y1500 2500 3500 4500

01

23

45

67

Page 15: Introduction to Probability and Statisticsfilippi/Teaching/dtc2014/lecture_filippi1.pdf · Introduction to Probability and Statistics Sarah Filippi Departmental Lecturer, University

Histograms: frequency vs density

Birth Weight (g)

Freq

uenc

y

1500 2000 2500 3000 3500 4000 4500

05

1015

200

510

1520

Weight (g)

Den

sity

(bab

ies/

g)

1500 2000 2500 3000 3500 4000 45000

0.01

0.02

0.03

0.04

Page 16: Introduction to Probability and Statisticsfilippi/Teaching/dtc2014/lecture_filippi1.pdf · Introduction to Probability and Statistics Sarah Filippi Departmental Lecturer, University

Typical shapes of histograms

I bell shapes

I left skewed shapes

I right skewed shapes

I bimodal

Page 17: Introduction to Probability and Statisticsfilippi/Teaching/dtc2014/lecture_filippi1.pdf · Introduction to Probability and Statistics Sarah Filippi Departmental Lecturer, University

Bell shape

Serum cholesterol (Mg/100 ml)

Density

0.000

0.005

0.010

0.015

0.020

100 120 140 160 180 200 220 240 260 280 300

Page 18: Introduction to Probability and Statisticsfilippi/Teaching/dtc2014/lecture_filippi1.pdf · Introduction to Probability and Statistics Sarah Filippi Departmental Lecturer, University

Right skewed

Histogram of California household income 1999

income in $thousands

Density

0 50 100 150 200 250 300

0.000

0.004

0.008

0.012

Page 19: Introduction to Probability and Statisticsfilippi/Teaching/dtc2014/lecture_filippi1.pdf · Introduction to Probability and Statistics Sarah Filippi Departmental Lecturer, University

Bimodal

Percentage acetylation of isoniazid

Frequency

010

2030

4050

25 30 35 40 45 50 55 60 65 70 75 80 85 90 95

Page 20: Introduction to Probability and Statisticsfilippi/Teaching/dtc2014/lecture_filippi1.pdf · Introduction to Probability and Statistics Sarah Filippi Departmental Lecturer, University

Summarizing data

I Summary statistics are numbers (univariate statistics) thatcharacterise certain (important) features of your data

Definition: A ”statistic” is a number derived from apopulation; formally, it is a function on a set of random

variable(s).

I You can think about summary statistics as data reduction;condensing your n data points to a small number of pertinentfeatures

I In data analysis two common attributes of interest are thelocation and spread of the data

Page 21: Introduction to Probability and Statisticsfilippi/Teaching/dtc2014/lecture_filippi1.pdf · Introduction to Probability and Statistics Sarah Filippi Departmental Lecturer, University

Summary statistics

I It is vital at this point to introduce the notion of an unknownunderlying population, characteristics of which you are tryingto estimate using a sample taken from the population.

I For example, you would like to know the average weight of aWelshman. There IS a true value for this (but you and I willnever know it). Because we don’t know it we estimate it usinga data set of observations taken from the population.

I This is one of the most important concepts in Statistics.

I Think carefully about this

Page 22: Introduction to Probability and Statisticsfilippi/Teaching/dtc2014/lecture_filippi1.pdf · Introduction to Probability and Statistics Sarah Filippi Departmental Lecturer, University

Location: mean

I The mean is the most commonly used measure of location.

I The true unknown population mean is estimated by the”average” value in your data

x̄ =1

n

n∑i=1

xi .

I The mean is also known as the central moment; whichmeasures the expectation of your data.

I It defines the centre of mass (measure) of your data.

Page 23: Introduction to Probability and Statisticsfilippi/Teaching/dtc2014/lecture_filippi1.pdf · Introduction to Probability and Statistics Sarah Filippi Departmental Lecturer, University

Location: mean

I The mean value is often not representative of an actual value

I ”The average Welshman has less than two legs”I What is the average (expected) value of the throw of a dice?

I The mean value can be distorted by outliersI in 1984 U. Virginia’s Dept. Communication announced that

starting graduate salary was $55, 000 in 1984I however, class contained 7ft 4inch, Ralph Sampson, leading

N.B.A. center for Houston Rockets

Page 24: Introduction to Probability and Statisticsfilippi/Teaching/dtc2014/lecture_filippi1.pdf · Introduction to Probability and Statistics Sarah Filippi Departmental Lecturer, University

Location: mean

I The mean value is often not representative of an actual valueI ”The average Welshman has less than two legs”I What is the average (expected) value of the throw of a dice?

I The mean value can be distorted by outliersI in 1984 U. Virginia’s Dept. Communication announced that

starting graduate salary was $55, 000 in 1984I however, class contained 7ft 4inch, Ralph Sampson, leading

N.B.A. center for Houston Rockets

Page 25: Introduction to Probability and Statisticsfilippi/Teaching/dtc2014/lecture_filippi1.pdf · Introduction to Probability and Statistics Sarah Filippi Departmental Lecturer, University

Location: median

I The median provides an alternative measure of location

I The true unknown population median is estimated by thecentral value of the data

I That is, by first sorting the data (by value) and then reportingthe middle value (or the average of the two middle values ifthe number of data points are even)

I The median provides a useful complement to the mean as it ismore robust.That is, it is less sensitive to extreme values,

I Compare the mean and median values of the following dataset:

(2.3, 5.2, 6.4, 7.8, 987234.3)

Page 26: Introduction to Probability and Statisticsfilippi/Teaching/dtc2014/lecture_filippi1.pdf · Introduction to Probability and Statistics Sarah Filippi Departmental Lecturer, University

Location

Symmetric

10 0 10 20 30

mean = median = mode

Positive Skew

0 5 10 15 20 25 30

meanmedianmode

Negative Skew

0 5 10 15 20 25 30

meanmedianmode

Page 27: Introduction to Probability and Statisticsfilippi/Teaching/dtc2014/lecture_filippi1.pdf · Introduction to Probability and Statistics Sarah Filippi Departmental Lecturer, University

Spread: standard deviation

I The standard deviation measures spread of the data aroundthe mean

I The true unknown population standard deviation is estimatedby

sd(x) =

√√√√ 1

n − 1

n∑i=1

(xi − x̄)2

I The variance is simply the standard deviation squared

I The standard deviation is more useful than the variance as ithas the same units of measurement as the data (rather thanon squared units for the variance)

Page 28: Introduction to Probability and Statisticsfilippi/Teaching/dtc2014/lecture_filippi1.pdf · Introduction to Probability and Statistics Sarah Filippi Departmental Lecturer, University

Spread: quantiles

I An alternative measure of spread is provide by the quantiles

I The quantiles are estimated by sorting the data by value andthen reading off the values at particular points

I The 0% quantile is the minimum value; the 100% quantile isthe maximum value; the 50% quantile is the median. Ingeneral the A% quantile is the A/100 sorted value

I The inter-quartile range is defined as the range 75% − 25%quantile

I The inter-quartile range is a robust measure of spread as itdiscards the top and bottom 25% of the data

Page 29: Introduction to Probability and Statisticsfilippi/Teaching/dtc2014/lecture_filippi1.pdf · Introduction to Probability and Statistics Sarah Filippi Departmental Lecturer, University

Boxplots

I Introduced by John Tukey, a founder of graphical statistics

I Middle bar denotes the median

I Rectangle enclose the Inter-Quartile Range (IQR)note: position of median bar within IQR reveals skewness inthe distribution

I Outliers are drawn as points; defined as greater than 1.5 IQRfrom rectangle

I ”Whiskers” extend out to furthest points which are notoutliers

Page 30: Introduction to Probability and Statisticsfilippi/Teaching/dtc2014/lecture_filippi1.pdf · Introduction to Probability and Statistics Sarah Filippi Departmental Lecturer, University

●●

●●●●

(0,100] (100,1000] (1000,1e+04] (1e+04,1e+05]

050

100

150

GDP

Infant mortality

Page 31: Introduction to Probability and Statisticsfilippi/Teaching/dtc2014/lecture_filippi1.pdf · Introduction to Probability and Statistics Sarah Filippi Departmental Lecturer, University

0 10 20 30 400

1000

2000

3000

4000

5000

0 10 20 30 400

1000

2000

3000

4000

5000

6000

7000

Page 32: Introduction to Probability and Statisticsfilippi/Teaching/dtc2014/lecture_filippi1.pdf · Introduction to Probability and Statistics Sarah Filippi Departmental Lecturer, University

Principle of good graphical display

Good graphs should.....

I Display information in an unbiased and uncluttered manner

I Have a clear purpose

I Reveal interesting and pertinent features of the datadistribution- and highlight outliers

I Encourage the viewer to think about the data generatingmechanisms (abductive inference)

Page 33: Introduction to Probability and Statisticsfilippi/Teaching/dtc2014/lecture_filippi1.pdf · Introduction to Probability and Statistics Sarah Filippi Departmental Lecturer, University

Pie Chart

Rules for using Pie Charts:

Never use them!

Pie charts are rarely justified. If you must use them, do so onlywhen the total is fixed (and meaningful) and never (ever!) use 3dpie charts as they’re worse still !!What’s the matter with pie charts:

I people are not good at interpreting areas

I zero is often a very meaningful number but gets lost in a piechart

I very hard to compare two pie charts

I loose overall scale (size) of the underlying figures

Barplots are usually a much better choice

Page 34: Introduction to Probability and Statisticsfilippi/Teaching/dtc2014/lecture_filippi1.pdf · Introduction to Probability and Statistics Sarah Filippi Departmental Lecturer, University

Pie Chart

Rules for using Pie Charts:

Never use them!

Pie charts are rarely justified. If you must use them, do so onlywhen the total is fixed (and meaningful) and never (ever!) use 3dpie charts as they’re worse still !!What’s the matter with pie charts:

I people are not good at interpreting areas

I zero is often a very meaningful number but gets lost in a piechart

I very hard to compare two pie charts

I loose overall scale (size) of the underlying figures

Barplots are usually a much better choice

Page 35: Introduction to Probability and Statisticsfilippi/Teaching/dtc2014/lecture_filippi1.pdf · Introduction to Probability and Statistics Sarah Filippi Departmental Lecturer, University

Pie Chart

Page 36: Introduction to Probability and Statisticsfilippi/Teaching/dtc2014/lecture_filippi1.pdf · Introduction to Probability and Statistics Sarah Filippi Departmental Lecturer, University

Pie Chart

Page 37: Introduction to Probability and Statisticsfilippi/Teaching/dtc2014/lecture_filippi1.pdf · Introduction to Probability and Statistics Sarah Filippi Departmental Lecturer, University

Barplot

I Barplots are useful for comparing single figures across groups

I Barplots should not(!) be used to compare distributions ofdata across groups

I Most of the information is lost in this case

Page 38: Introduction to Probability and Statisticsfilippi/Teaching/dtc2014/lecture_filippi1.pdf · Introduction to Probability and Statistics Sarah Filippi Departmental Lecturer, University

Barplot

Page 39: Introduction to Probability and Statisticsfilippi/Teaching/dtc2014/lecture_filippi1.pdf · Introduction to Probability and Statistics Sarah Filippi Departmental Lecturer, University

Barplot

Page 40: Introduction to Probability and Statisticsfilippi/Teaching/dtc2014/lecture_filippi1.pdf · Introduction to Probability and Statistics Sarah Filippi Departmental Lecturer, University

Barplot

I Note how the information is revealed by the boxplot whereasinteresting features are lost in the barplot of the same data

*** Aside: you will have noted that the 100m times is a poorexample as the data follows a time trend (i.e. not

exchangeable) ***

I A graph is an opportunity to present, with clear purpose, thedata in the most informative way

I Don’t waste this opportunity

Page 41: Introduction to Probability and Statisticsfilippi/Teaching/dtc2014/lecture_filippi1.pdf · Introduction to Probability and Statistics Sarah Filippi Departmental Lecturer, University

Scatter plot

I Scatterplots can be used to reveal associations betweenvariables

I Also useful to breakdown association into categories usingconditional scatterplots (see the examples using scatter andplotmatrix)

Page 42: Introduction to Probability and Statisticsfilippi/Teaching/dtc2014/lecture_filippi1.pdf · Introduction to Probability and Statistics Sarah Filippi Departmental Lecturer, University

Scatter plot

Page 43: Introduction to Probability and Statisticsfilippi/Teaching/dtc2014/lecture_filippi1.pdf · Introduction to Probability and Statistics Sarah Filippi Departmental Lecturer, University

Summary

I Type of data

I Representing data with graphs: bar charts, histograms,boxplots, scatter plots

I Summarising data: location, spread

I Graphical displays enhance our understanding of the state ofnature

I Good graphs serve a well defined purpose, and present data inan uncluttered and unbiased manner

I Good graphs reveal pertinent features and invite us to thinkabout the causal mechanisms that underly the data

Page 44: Introduction to Probability and Statisticsfilippi/Teaching/dtc2014/lecture_filippi1.pdf · Introduction to Probability and Statistics Sarah Filippi Departmental Lecturer, University

Some references

I Rice, J. A. ”Mathematical Statistics and Data Analysis”.(1995). Duxbury - standard undergrad text

I Box, G. E. P. et al. ”Statistics for Experimenters”. (1978).Wiley - a classic text!

I Gilks, W et al. eds. ”Markov chain Monte Carlo in Practice”.(1998) - a great collection of papers illustrating the power ofmodern computational statistics

I Hastie, T. et al. ”The elements of statistical learning”.(2001). - statistical machine learning, pattern recognition anddata mining

I Tufte, E. ”The Visual Display of Quantitative Information”.2nd ed. (2001). Graphics Press - Surely the greatest bookever written on graphical display of data