introduction to statistics rss6 2014
Introduction to Statistics
Amr Albanna, MD, MSc
Content
• Scales of Measurement
– Categorical Variables
– Numerical Variables
• Displays of Categorical Data
– Frequencies
– Bar Graph
– Pie Chart
• Numerical Measures of Central Tendency
– Mean
– Median
– Mode
• Numerical Measures of Spread
• Association
• Correlation
• Regression
Scales of Measurement
• Categorical Variables:
– Nominal: Categorical variable with no order (e.g. blood type: A, B, AB or O).
– Ordinal: Categorical variable with an order (e.g. pain: "none", "mild", "moderate" or "severe").
• Numerical Variables:
– Interval: Quantitative data where differences are meaningful but ratios are not (e.g. calendar years: the difference between 2009 and 2010 is one year, but the ratio 2010/2009 has no meaning).
– Ratio: Quantitative data where ratios are meaningful (e.g. weight: 200 lbs is twice as heavy as 100 lbs).
Categorical Variables
• Displays of Categorical Data
– Frequencies
– Bar Graph
– Pie Chart
Categorical Variables
Sex      Frequency   Proportion
Male     609         0.61
Female   391         0.39
Total    1000        1.00
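As a minimal sketch of how the proportion column follows from the frequencies, the snippet below divides each count by the total. The dictionary and variable names are illustrative assumptions, not part of the slides.

```python
# Frequencies taken from the table; proportions are derived from them.
counts = {"Male": 609, "Female": 391}

total = sum(counts.values())
proportions = {category: count / total for category, count in counts.items()}

# Print a small frequency table: category, frequency, proportion.
for category, count in counts.items():
    print(f"{category:<8}{count:>6}{proportions[category]:>8.2f}")
print(f"{'Total':<8}{total:>6}{sum(proportions.values()):>8.2f}")
```

The proportions always sum to 1, which is a quick check that no category was dropped.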
[Figure: bar graph and pie chart of frequency by sex (Male, Female); bar graph frequency axis from 0 to 700]
Numerical Variables
• Central Tendency
• Numerical Spread
Measures of Central Tendency
• The 3 M's
– Mean
– Median
– Mode
Sample Mean
The sample mean, $\bar{x}$, is the sum of all values in the sample divided by the total number of observations, $n$, in the sample.
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
Example: Sample Mean
• Mean systolic blood pressure
Scenario 1:
Mean = (120 + 135 + 115 + 110 + 105 + 140)/6 = 725/6 ≈ 120.8
Subjects BP
1 120 (x1)
2 135 (x2)
3 115 (x3)
4 110 (x4)
5 105 (x5)
6 140 (x6)
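The Scenario 1 calculation can be reproduced with a short sketch that applies the definition directly (sum of the values divided by n). The variable names are illustrative.

```python
# Systolic blood pressure readings for the six subjects in Scenario 1.
bp = [120, 135, 115, 110, 105, 140]

n = len(bp)                # number of observations
sample_mean = sum(bp) / n  # sum of all values divided by n

print(f"n = {n}, sum = {sum(bp)}, mean = {sample_mean:.2f}")
# prints: n = 6, sum = 725, mean = 120.83
```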
Sample Mean
• The mean is affected by extreme observations and is not a resistant measure.
Scenario 2:
Mean = (120 + 135 + 115 + 110 + 105 + 140 + 280)/7 = 1005/7 ≈ 143.6
Subjects BP
1 120 (x1)
2 135 (x2)
3 115 (x3)
4 110 (x4)
5 105 (x5)
6 140 (x6)
7 280 (x7)
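To make the lack of resistance concrete, the following sketch compares the mean of the two scenarios using Python's standard `statistics` module. The names `scenario_1`, `scenario_2` and `shift` are illustrative.

```python
from statistics import mean

scenario_1 = [120, 135, 115, 110, 105, 140]
scenario_2 = scenario_1 + [280]  # same subjects plus one extreme reading

mean_1 = mean(scenario_1)
mean_2 = mean(scenario_2)
shift = mean_2 - mean_1  # how far the single extreme value moves the mean

print(f"Scenario 1 mean: {mean_1:.2f}")
print(f"Scenario 2 mean: {mean_2:.2f}")
print(f"Shift caused by one observation: {shift:.2f}")
```

A single reading of 280 raises the mean by more than 20 mmHg, even though the other six values are unchanged.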
Median
• The sample median, M, is the number such that "half" the values in the sample are smaller and the other "half" are larger.
• Use the following steps to find M.
– Sort the data (arrange in increasing order).
– Is the size of the data set n even or odd?
– If odd: M = value in the exact middle.
– If even: M = the average of the two middle numbers.
Example: Sample Median
• Median systolic BP:
Scenario 1: 120, 135, 115, 110, 105, 140
Sorted: 105, 110, 115, 120, 135, 140
Median = (115 + 120)/2 = 117.5
Scenario 2: 120, 135, 115, 110, 105, 140, 280
Sorted: 105, 110, 115, 120, 135, 140, 280
Median = 120
• The median is barely affected by extreme observations (117.5 versus 120 here) and is a resistant measure.
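The steps for finding M can be sketched as a small function. Note that the data must be sorted before the middle value(s) are picked; the function name `sample_median` is an illustrative choice.

```python
def sample_median(values):
    """Return the median following the sort / odd-even steps."""
    ordered = sorted(values)      # step 1: arrange in increasing order
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:                # odd n: value in the exact middle
        return ordered[middle]
    # even n: average of the two middle values
    return (ordered[middle - 1] + ordered[middle]) / 2


scenario_1 = [120, 135, 115, 110, 105, 140]
scenario_2 = scenario_1 + [280]

median_1 = sample_median(scenario_1)
median_2 = sample_median(scenario_2)
print(median_1, median_2)  # prints: 117.5 120
```

Adding the extreme reading of 280 moves the median only from 117.5 to 120.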
Mode
• The sample mode is the value that occurs most frequently in the sample (a data set can have more than one mode).
• This is the only measure of center which can also be used for categorical data.
• The population mode is the value at which the population distribution reaches its highest point.
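As a brief illustration, `statistics.multimode` from the Python standard library returns every most-frequent value, so it works for categorical data and for data with more than one mode. Both samples below are hypothetical and were made up for this sketch.

```python
from statistics import multimode

# Hypothetical categorical sample (blood types).
blood_types = ["A", "O", "O", "B", "A", "O", "AB"]
# Hypothetical numerical sample with two modes.
readings = [110, 120, 120, 135, 135, 140]

blood_type_modes = multimode(blood_types)
reading_modes = multimode(readings)

print(blood_type_modes)  # prints: ['O']
print(reading_modes)     # prints: [120, 135]
```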
Symmetric Data Distribution
[Figure: histogram of a symmetric distribution; x-axis: Value (10 to 50), y-axis: Frequency (0 to 6)]
Rightward Skewness of Data
[Figure: histogram of a right-skewed distribution; x-axis: Value (10 to 50), y-axis: Frequency (0 to 6); marked from left to right: Mode, Median, Mean]
Leftward Skewness of Data
[Figure: histogram of a left-skewed distribution; x-axis: Value (10 to 50), y-axis: Frequency (0 to 6); marked from left to right: Mean, Median, Mode]
Numerical Measures of Spread
• Range
• Sample Variance
• Interquartile Range (IQR)
Range: The range of the data set is the difference between the highest value and the lowest value.
– Range = highest value - lowest value
– Easy to compute BUT ignores a great deal of information.
– The range depends only on the two most extreme observations, so it is not a resistant measure.
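A minimal sketch of the range, applied to the blood pressure readings used earlier in the deck; `data_range` is an illustrative helper name.

```python
def data_range(values):
    """Range = highest value - lowest value."""
    return max(values) - min(values)


scenario_1 = [120, 135, 115, 110, 105, 140]
scenario_2 = scenario_1 + [280]  # one extreme reading added

range_1 = data_range(scenario_1)
range_2 = data_range(scenario_2)
print(range_1, range_2)  # prints: 35 175
```

One extreme observation multiplies the range by five, which shows why it is not resistant.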
• Variance: equal to the sum of squared deviations from the sample mean divided by n - 1, where n is the number of observations in the sample.
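The definition can be followed step by step in a short sketch: compute the mean, square each deviation from it, sum them, and divide by n - 1. The result is checked against `statistics.variance`, which uses the same n - 1 divisor. Variable names are illustrative.

```python
from statistics import variance

bp = [120, 135, 115, 110, 105, 140]  # Scenario 1 readings

n = len(bp)
x_bar = sum(bp) / n                                  # sample mean
squared_deviations = [(x - x_bar) ** 2 for x in bp]  # (x_i - mean)^2
sample_variance = sum(squared_deviations) / (n - 1)  # divide by n - 1

print(f"variance (by hand):   {sample_variance:.2f}")
print(f"variance (statistics): {variance(bp):.2f}")
```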
• Percentile: The pth percentile of a distribution is the value at or below which p% of the observations fall.
• The most commonly used percentiles are the quartiles.
1st quartile Q1 = 25th percentile.
2nd quartile Q2 = 50th percentile.
3rd quartile Q3 = 75th percentile.
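Quartiles can be computed with `statistics.quantiles` from the Python standard library. Several interpolation conventions exist, and they can give slightly different quartiles for small samples; the `method="inclusive"` choice below is an assumption made for this sketch, not something the slides specify.

```python
from statistics import quantiles

bp = [120, 135, 115, 110, 105, 140]  # Scenario 1 readings

# n=4 splits the data into four parts and returns the three cut points.
q1, q2, q3 = quantiles(bp, n=4, method="inclusive")

print(f"Q1 (25th percentile): {q1}")
print(f"Q2 (50th percentile): {q2}")
print(f"Q3 (75th percentile): {q3}")
```

Q2 is the median of the sample.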
Interquartile Range (IQR)
A simple measure of spread is the interquartile range (IQR), which gives the range covered by the middle half of the data:
IQR = Q3 - Q1
The IQR is a resistant measure of spread.
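The resistance of the IQR can be illustrated by adding one extreme reading to the blood pressure sample. This is a sketch only: the helper `iqr` is hypothetical, and the quartile convention (`method="inclusive"`) is an assumption, since the slides do not state one.

```python
from statistics import quantiles


def iqr(values):
    """IQR = Q3 - Q1, using the 'inclusive' quartile convention."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1


scenario_1 = [120, 135, 115, 110, 105, 140]
scenario_2 = scenario_1 + [280]  # one extreme reading added

iqr_1 = iqr(scenario_1)
iqr_2 = iqr(scenario_2)
print(iqr_1, iqr_2)  # prints: 20.0 25.0
```

The extreme value of 280 changes the IQR only from 20 to 25.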
Outliers: extreme observations that fall well outside the overall pattern of the distribution.
• An outlier may be the result of:
– A recording error
– An observation from a different population
– An unusually extreme observation (biological diversity)
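The slides do not give a numerical rule for flagging outliers. One widely used convention, assumed here purely for illustration, is to flag values more than 1.5 × IQR below Q1 or above Q3. The helper name `find_outliers` and the quartile method are likewise assumptions.

```python
from statistics import quantiles


def find_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (assumed convention)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    lower_fence = q1 - k * spread
    upper_fence = q3 + k * spread
    return [x for x in values if x < lower_fence or x > upper_fence]


bp = [120, 135, 115, 110, 105, 140, 280]  # Scenario 2 readings
outliers = find_outliers(bp)
print(outliers)  # prints: [280]
```

A flagged value is a prompt to investigate its cause, not an automatic reason to delete it.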
Association Between Variables
• Explanatory (exposure) variable “X”
• Response (outcome) variable “Y”
Measurement of Correlation
Correlation is NOT Association
Regression