introduction to probability and statisticsfilippi/teaching/dtc2014/lecture_filippi1.pdf ·...

Introduction to Probability and Statistics

Sarah Filippi, Departmental Lecturer, University of Oxford

[email protected]

22/10/2014

Objectives of Course

- To teach the underlying theory and applications of statistical data analysis
- To emphasise the importance of exploratory data analysis, prior to any formal statistical testing or modelling
- To define some useful summary statistics (numbers) that help characterise features of data sets, such as location and spread
- To introduce probability as a mathematical framework for reasoning with uncertainty
- To demonstrate the use of MATLAB to implement these techniques

What is Statistics?

- Statistics is the science and art of data analysis
  - from observational studies
  - from planned experiments
- Statistics is concerned with the collection, analysis and interpretation of data
- It is the science of the scientific method

Branches of statistics

- Statistics covers a range of areas, from data collection (optimal design of experiments) to the construction of stochastic (empirical) models
- It is a broad church
- Some areas of note include
  - graphical displays of quantitative data
  - stochastic modelling of systems
  - forecasting algorithms

Uncertainty

At the heart of statistics is the rigorous treatment of uncertainty, or random variation, as characterised via probability

- Probability provides a formal system for inference and inductive logic
- It allows for the coherent accumulation of evidence supporting or refuting a hypothesis of interest

My use of statistics

- I work on a range of topics in computational statistics focused on understanding biological processes and their relation to disease.
- I use Bayesian statistics and stochastic models.
- I did a PhD in reinforcement learning, a field at the intersection of statistics, machine learning and artificial intelligence.

First steps

- ALL good statistical data analysis begins with graphical plots and summary statistics of the data
- ALWAYS, ALWAYS, ALWAYS, PLOT YOUR DATA!!!
- Why?

Graphics reveal data, and communicate complex ideas and dependencies with clarity, precision and efficiency

Graphical Excellence

Excellent graphics:

- show the data
- induce the viewer to think about the substance
- avoid bias
- make large complex data sets coherent
- encourage data exploration and debate

Types of data

Categorical / Qualitative
  Nominal
    Binary
    Non-binary
  Ordinal
Quantitative
  Discrete (counts)
  Continuous

Examples placed in the diagram: eye colour; socioeconomic class; height, weight; gender; IQ; ethnicity; highest level of educational attainment; femur length; number of children; age; Likert scale (1 = Strongly disagree; 2 = Disagree; 3 = Neutral; 4 = Agree; 5 = Strongly agree); smoker (yes/no).

Bar chart for categorical data

[Figure: bar chart of the sex of babies born in one day in an Australian hospital. Categories: Girl, Boy. Vertical axis: Frequency, 0 to 24.]

Histograms for quantitative data

[Figure: histogram of birth weight. Horizontal axis: Birth Weight (g), 1500 to 4500. Vertical axis: Frequency, 0 to 20.]

How to make a histogram?

1. Find the range of the data
2. Choose the bin size
3. Count the observations in each bin
4. Plot a box for each bin, with height given by its count
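
The four steps can be sketched in a few lines of code. This is a minimal illustration in Python rather than the MATLAB used in the course; the function name `histogram_counts`, the choice to align bins to multiples of the bin width, and the birth weights themselves are all assumptions made for the example.

```python
import math

def histogram_counts(data, bin_width):
    """Count observations per bin; each bin is [left, left + bin_width)."""
    lo, hi = min(data), max(data)                    # 1. range of the data
    start = math.floor(lo / bin_width) * bin_width   # align the first bin edge
    n_bins = int((hi - start) // bin_width) + 1      # 2. bins implied by the chosen width
    counts = [0] * n_bins
    for x in data:                                   # 3. count observations in each bin
        counts[int((x - start) // bin_width)] += 1
    edges = [start + i * bin_width for i in range(n_bins + 1)]
    return edges, counts

# Hypothetical birth weights in grams (made up for illustration)
weights = [1800, 2450, 2900, 3100, 3150, 3300, 3400, 3550, 3600, 4100]
edges, counts = histogram_counts(weights, bin_width=500)

# 4. "plot" the boxes as rows of text, one row per bin
for left, count in zip(edges, counts):
    print(f"{left:>5}-{left + 500:<5} {'#' * count}")
```

Rerunning with a different `bin_width` shows how strongly the picture depends on step 2.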

Effect of bin size

[Figure: two histograms of Birth Weight (g) against Frequency, both over 1500 to 4500 g. Panel "Too few categories": a few wide bins, frequency axis 0 to 35. Panel "Too many categories": many narrow bins, frequency axis 0 to 7.]

Histograms: frequency vs density

[Figure: two histograms of the same birth weight data over 1500 to 4500 g. One panel plots Frequency (0 to 20) against Birth Weight (g); the other plots Density in babies/g (0 to 0.04) against Weight (g).]
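
The two vertical scales are related by the bin width. The sketch below assumes, from the axis unit babies/g, that density here means the count divided by the bin width, so that the area of each box equals the number of observations in it. The counts and the helper name `to_density` are hypothetical.

```python
def to_density(counts, bin_width):
    """Convert bin counts to a density in observations per unit of x."""
    return [c / bin_width for c in counts]

counts = [1, 1, 1, 4, 2, 1]   # hypothetical counts for bins 500 g wide
bin_width = 500
density = to_density(counts, bin_width)

# The area of each box (height x width) recovers the count in that bin
areas = [d * bin_width for d in density]
total = sum(areas)            # equals the number of observations

# Dividing by the total as well gives boxes whose areas sum to 1
normalised = [d / total for d in density]
```

On the density scale, histograms drawn with different bin widths remain comparable, which is not true of raw frequencies.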

Typical shapes of histograms

- bell shapes
- left skewed shapes
- right skewed shapes
- bimodal

Bell shape

[Figure: density histogram of serum cholesterol (mg/100 ml). Horizontal axis: 100 to 300. Vertical axis: Density, 0 to 0.020.]

Right skewed

[Figure: histogram of California household income, 1999. Horizontal axis: income in $thousands, 0 to 300. Vertical axis: Density, 0 to 0.012.]

Bimodal

[Figure: histogram of percentage acetylation of isoniazid. Horizontal axis: 25 to 95. Vertical axis: Frequency, 0 to 50.]

Summarizing data

- Summary statistics are numbers (univariate statistics) that characterise certain (important) features of your data

Definition: A "statistic" is a number derived from a sample of data; formally, it is a function of a set of random variable(s).

- You can think of summary statistics as data reduction: condensing your n data points to a small number of pertinent features
- In data analysis, two common attributes of interest are the location and spread of the data

Summary statistics

- It is vital at this point to introduce the notion of an unknown underlying population, characteristics of which you are trying to estimate using a sample taken from the population.
- For example, you would like to know the average weight of a Welshman. There IS a true value for this (but you and I will never know it). Because we don’t know it, we estimate it using a data set of observations taken from the population.
- This is one of the most important concepts in Statistics.
- Think carefully about this.
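
A small simulation can make the population/sample distinction concrete. The sketch below is only an illustration: the population is simulated (a normal distribution with mean 80 kg and standard deviation 12 kg, both invented), and in practice the population mean it prints would not be available.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical population of weights in kg (simulated, not real data)
population = [rng.gauss(80, 12) for _ in range(100_000)]
true_mean = statistics.fmean(population)   # normally unknown to the analyst

# What we actually observe: a modest random sample from the population
sample = rng.sample(population, 200)
estimate = statistics.fmean(sample)

print(f"population mean: {true_mean:.2f}")
print(f"sample estimate: {estimate:.2f}")
```

The estimate is close to, but not equal to, the true value, and a different sample would give a different estimate.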

Location: mean

- The mean is the most commonly used measure of location.
- The true unknown population mean is estimated by the "average" value in your data:

\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i . \]

- The mean is also known as the first moment; it estimates the expectation of the distribution your data come from.
- It defines the centre of mass (measure) of your data.

Location: mean

- The mean value is often not representative of an actual value
  - "The average Welshman has less than two legs"
  - What is the average (expected) value of the throw of a die?
- The mean value can be distorted by outliers
  - in 1984 the University of Virginia’s Department of Communication announced that the average starting salary of its graduates was $55,000
  - however, the class contained the 7 ft 4 in Ralph Sampson, leading N.B.A. center for the Houston Rockets
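
The die question can be worked through directly. This short sketch assumes a fair six-sided die and uses exact fractions so that no rounding is involved.

```python
from fractions import Fraction

faces = [1, 2, 3, 4, 5, 6]

# Each face of a fair die has probability 1/6
expected = sum(Fraction(face, 6) for face in faces)

print(expected)           # 7/2
print(expected in faces)  # False
```

The expected value, 3.5, is not a value the die can ever show, which is the sense in which a mean need not be representative of an actual value.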

Location: median

- The median provides an alternative measure of location
- The true unknown population median is estimated by the central value of the data
- That is, by first sorting the data (by value) and then reporting the middle value (or the average of the two middle values if the number of data points is even)
- The median provides a useful complement to the mean as it is more robust; that is, it is less sensitive to extreme values
- Compare the mean and median values of the following data set:

(2.3, 5.2, 6.4, 7.8, 987234.3)
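
The comparison can be checked numerically. A minimal sketch using Python's standard `statistics` module (the course itself uses MATLAB):

```python
import statistics

data = [2.3, 5.2, 6.4, 7.8, 987234.3]

mean = statistics.mean(data)      # pulled far upwards by the single extreme value
median = statistics.median(data)  # middle of the five sorted values

print(mean)    # about 197451.2
print(median)  # 6.4
```

One extreme observation moves the mean by five orders of magnitude, while the median stays among the typical values.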

Location

[Figure: three distributions with the mean, median and mode marked. Symmetric: mean = median = mode. Positive Skew and Negative Skew: the mean, median and mode fall at different positions.]

Spread: standard deviation

- The standard deviation measures the spread of the data around the mean
- The true unknown population standard deviation is estimated by

\[ \mathrm{sd}(x) = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2} \]

- The variance is simply the standard deviation squared
- The standard deviation is often more useful than the variance as it has the same units of measurement as the data (the variance is in squared units)
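
The estimator can be written out term by term. In this sketch the function name `sample_sd` and the data are hypothetical; the result is compared against `statistics.stdev`, which uses the same n − 1 divisor.

```python
import math
import statistics

def sample_sd(xs):
    """Standard deviation with the n - 1 divisor."""
    n = len(xs)
    x_bar = sum(xs) / n                                      # sample mean
    squared_deviations = sum((x - x_bar) ** 2 for x in xs)   # sum of (x_i - x_bar)^2
    return math.sqrt(squared_deviations / (n - 1))

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical measurements
sd = sample_sd(data)
variance = sd ** 2                # in squared units of the data

print(sd, statistics.stdev(data))
```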

Spread: quantiles

- An alternative measure of spread is provided by the quantiles
- The quantiles are estimated by sorting the data by value and then reading off the values at particular points
- The 0% quantile is the minimum value; the 100% quantile is the maximum value; the 50% quantile is the median. In general, the A% quantile is the value found a fraction A/100 of the way through the sorted data
- The inter-quartile range is defined as the 75% quantile minus the 25% quantile
- The inter-quartile range is a robust measure of spread as it discards the top and bottom 25% of the data
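
Reading off a quantile from sorted data can be sketched as follows. Several conventions exist for positions that fall between two data points; this example assumes linear interpolation between neighbouring sorted values, and the function name `quantile` and the data are hypothetical.

```python
def quantile(data, a):
    """A% quantile, interpolating linearly between sorted values."""
    xs = sorted(data)
    pos = (len(xs) - 1) * a / 100        # fractional index into the sorted data
    lower = int(pos)
    upper = min(lower + 1, len(xs) - 1)  # stay inside the list at a = 100
    frac = pos - lower
    return xs[lower] + frac * (xs[upper] - xs[lower])

data = [7, 1, 9, 3, 5, 2, 8, 4, 6]   # hypothetical observations

minimum = quantile(data, 0)
median = quantile(data, 50)
maximum = quantile(data, 100)
iqr = quantile(data, 75) - quantile(data, 25)   # inter-quartile range
```

Statistical packages may use slightly different interpolation rules, so quartiles can differ a little between tools on small data sets.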

Boxplots

- Introduced by John Tukey, a founder of graphical statistics
- Middle bar denotes the median
- Rectangle encloses the Inter-Quartile Range (IQR). Note: the position of the median bar within the IQR reveals skewness in the distribution
- Outliers are drawn as points; defined as points more than 1.5 × IQR beyond the edges of the rectangle
- "Whiskers" extend out to the furthest points which are not outliers
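
The numbers behind a boxplot can be computed without drawing anything. The sketch below is one possible reading of the rules above: the helper `boxplot_stats` and the data are hypothetical, and quartiles are taken from `statistics.quantiles` with the "inclusive" method, which is only one of several conventions.

```python
import statistics

def boxplot_stats(data):
    """Median, box, whiskers and outliers using the 1.5 x IQR rule."""
    xs = sorted(data)
    q1, med, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [x for x in xs if low_fence <= x <= high_fence]
    return {
        "median": med,                            # middle bar
        "box": (q1, q3),                          # rectangle enclosing the IQR
        "whiskers": (inside[0], inside[-1]),      # furthest non-outlying points
        "outliers": [x for x in xs if x < low_fence or x > high_fence],
    }

stats = boxplot_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])
print(stats)
```

Here the value 30 lies beyond the upper fence of 14.5, so it is reported as an outlier and the upper whisker stops at 9.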

[Figure: boxplots of infant mortality (vertical axis, 0 to 150) by GDP group: (0,100], (100,1000], (1000,1e+04], (1e+04,1e+05]. Outliers are drawn as points.]

Principles of good graphical display

Good graphs should:

- Display information in an unbiased and uncluttered manner
- Have a clear purpose
- Reveal interesting and pertinent features of the data distribution, and highlight outliers
- Encourage the viewer to think about the data generating mechanisms (abductive inference)

Pie Chart

Rules for using Pie Charts:

Never use them!

Pie charts are rarely justified. If you must use them, do so only when the total is fixed (and meaningful), and never (ever!) use 3D pie charts, as they’re worse still!

What’s the matter with pie charts:

- people are not good at interpreting areas
- zero is often a very meaningful number but gets lost in a pie chart
- it is very hard to compare two pie charts
- you lose the overall scale (size) of the underlying figures

Barplots are usually a much better choice

Barplot

- Barplots are useful for comparing single figures across groups
- Barplots should not(!) be used to compare distributions of data across groups
- Most of the information is lost in this case

Barplot

- Note how the information is revealed by the boxplot, whereas interesting features are lost in the barplot of the same data

*** Aside: you will have noted that the 100m times are a poor example, as the data follow a time trend (i.e. they are not exchangeable) ***

- A graph is an opportunity to present, with clear purpose, the data in the most informative way
- Don’t waste this opportunity

Scatter plot

- Scatterplots can be used to reveal associations between variables
- They are also useful for breaking down an association into categories using conditional scatterplots (see the examples using scatter and plotmatrix)

Summary

- Types of data
- Representing data with graphs: bar charts, histograms, boxplots, scatter plots
- Summarising data: location, spread
- Graphical displays enhance our understanding of the state of nature
- Good graphs serve a well-defined purpose, and present data in an uncluttered and unbiased manner
- Good graphs reveal pertinent features and invite us to think about the causal mechanisms that underlie the data

Some references

- Rice, J. A. "Mathematical Statistics and Data Analysis" (1995). Duxbury - standard undergrad text
- Box, G. E. P. et al. "Statistics for Experimenters" (1978). Wiley - a classic text!
- Gilks, W. et al., eds. "Markov chain Monte Carlo in Practice" (1998) - a great collection of papers illustrating the power of modern computational statistics
- Hastie, T. et al. "The Elements of Statistical Learning" (2001) - statistical machine learning, pattern recognition and data mining
- Tufte, E. "The Visual Display of Quantitative Information", 2nd ed. (2001). Graphics Press - surely the greatest book ever written on graphical display of data
