www.biostat.ir
In the name of God, the Most Gracious, the Most Merciful
Biostatistics Academic
Preview
Descriptive Statistics

What Is Statistics?
• Statistics is the science of describing or making inferences about the world from a sample of data.
• Descriptive statistics are numerical estimates that organize, summarize, or present the data.
• Inferential statistics is the process of inferring from a sample to the population.

Statistics has two major branches:
• Descriptive statistics
• Inferential statistics

Two Types of Statistics
Descriptive statistics
• Used to summarize, organize, and simplify data
• What was the average height?
• What were the highest and lowest scores?
• What is the most common response to a question?
Inferential statistics
• Techniques that allow us to study samples and then make generalizations about the populations from which they were selected
• Are 5th grade boys taller than 5th grade girls?
• Is a treatment suitable?

Population and Samples
• The population under study is the set of all individuals of interest for the research.
• The part of the population for which we collect measurements is called the sample.
• The number of individuals in a sample is denoted by n.

Variables

Definitions
• Variable: a characteristic that changes or varies over time and/or across the different subjects under consideration.
• Changing over time: blood pressure, height, weight
• Changing across a population: gender, race

Types of Variables
Data: Variables
• Quantitative (numeric): discrete or continuous
• Qualitative (categorical): nominal or ordinal

Types of Variables: Definitions
• Quantitative variables (numeric): measure a numerical quantity or amount on each experimental unit.
• Qualitative variables (categorical): measure a non-numeric quality or characteristic on each experimental unit by classifying each subject into a category.

Types of Variables: Quantitative Variables
• Discrete variables: can take values only from a list of possible values, e.g. number of tooth brushings per day.
• Continuous variables: can assume the infinitely many values corresponding to the points on a line interval, e.g. weight, height.

Types of Variables: Categorical Variables
• Nominal: unordered categories, e.g. race, gender
• Ordinal: ordered categories, e.g. Likert scales (disagree, neutral, agree), income categories

Types of Variables
• A discrete variable has gaps between its values. For example, the number of tooth brushings per day is a discrete variable.
• A continuous variable has no gaps between its values. All values or fractions of values have meaning. Age is an example of a continuous variable.

Levels of Measurement
The level of measurement reflects the type of information measured and helps determine which descriptive statistics and which statistical tests can be used.

Four Levels of Measurement
• Nominal: lowest level; categories with no rank
• Ordinal: second lowest; ranked categories
• Interval: next to highest; ranked categories with known units between rankings
• Ratio: highest level; ranked categories with known intervals and an absolute zero

Scales of Measurement
Variable                              Scale
Temperature                           Interval
Men/Women                             Nominal
Good/Better/Best                      Ordinal
Weight                                Ratio
Republicans/Democrats/Independents    Nominal
Volume                                Ratio
IQ                                    Interval
Not at all/A little/A lot             Ordinal

Descriptive Statistics
Qualitative
• Summary measures: frequency, relative frequency, percentage
• Displays: tables, pie charts, bar graphs
Quantitative
• Summary measures: measures of central tendency, measures of spread, five-number summary
• Displays: tables, histograms, box plots, bar charts, line charts

Descriptive Measures
• Central tendency measures: computed to give a "center" around which the measurements in the data are distributed.
• Relative standing measures: describe the relative position of a specific measurement in the data.
• Variation or variability measures: describe the "data spread", or how far the measurements are from the center.

Measures of Central Tendency
• Mean: the sum of all measurements in the data divided by the number of measurements.
• Median: a number such that at most half of the measurements are below it and at most half of the measurements are above it.
• Mode: the most frequent measurement in the data.

Summary Statistics: Measures of Central Tendency (Location)
• Mean: the mean of a data set is the sum of the observations divided by the number of observations.
  Population mean: $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$ (N = population size)
  Sample mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ (n = sample size)
• Median: the median of a data set is the "middle value".
  For an odd number of observations, the median is the observation exactly in the middle of the ordered list.
  For an even number of observations, the median is the mean of the two middle observations in the ordered list.
• Mode: the mode is the most frequently occurring data value.
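
The three measures of location can be checked on a small data set. The sketch below uses Python's standard `statistics` module; the data values are hypothetical and are not taken from the slides.

```python
from statistics import mean, median, mode

# Hypothetical sample (not from the slides): tooth brushings per day for 7 people.
brushings = [1, 2, 2, 3, 2, 1, 3]

sample_mean = mean(brushings)      # sum of the observations / number of observations
sample_median = median(brushings)  # middle value of the ordered list (odd n)
sample_mode = mode(brushings)      # most frequently occurring value

# With an even number of observations the median is the mean
# of the two middle values of the ordered list.
even_median = median([1, 2, 3, 4])

print(sample_mean, sample_median, sample_mode, even_median)
```

Here the ordered list is 1, 1, 2, 2, 2, 3, 3, so all three measures coincide.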

Skewness
The skewness of a distribution can be assessed by comparing the relative positions of the mean, median, and mode.
• Symmetrical distribution: mean = median = mode
• Distribution skewed right: the median lies between the mode and the mean, and the mode is less than the mean
• Distribution skewed left: the median lies between the mode and the mean, and the mode is greater than the mean

Figure: Relative positions of the mean and median for (a) right-skewed, (b) symmetric, and (c) left-skewed distributions.
Note: The mean is pulled toward the long tail of a skewed distribution and is sensitive to outliers. If the data are markedly skewed, it is better to report the median as the measure of location.

Frequency Distributions and Histograms
Figure: Histograms for symmetric and skewed distributions.

Normal Curves
Figure: Normal curves with the same mean but different standard deviations.

Further Notes
• When the mean is greater than the median, the data distribution is skewed to the right.
• When the median is greater than the mean, the data distribution is skewed to the left.
• When the mean and median are very close to each other, the data distribution is approximately symmetric.
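
The rule of thumb above can be sketched as a small function. This is only an illustration: the function name `skew_direction`, the tolerance parameter `tol`, and the data are assumptions introduced here, and the rule itself is a rough guide rather than a formal test.

```python
from statistics import mean, median

def skew_direction(values, tol=0.0):
    """Classify skew by comparing the mean with the median.

    `tol` (hypothetical) is how far apart the two may be
    while still being treated as "very close".
    """
    m, med = mean(values), median(values)
    if m - med > tol:
        return "right"
    if med - m > tol:
        return "left"
    return "approximately symmetric"

# Hypothetical samples, not from the slides.
right = skew_direction([1, 2, 2, 3, 20])     # mean 5.6 > median 2
left = skew_direction([1, 18, 19, 19, 20])   # mean 15.4 < median 19
sym = skew_direction([1, 2, 3, 4, 5])        # mean 3 == median 3
```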

Summary Statistics: Measures of Spread (Scale)
• Variance: the average of the squared deviations of each sample value from the sample mean, except that the sum of the squared deviations is divided by n - 1 instead of the sample size n.
  $s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$
• Standard deviation: the square root of the sample variance.
  $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}$
• Range: the difference between the maximum and minimum values in the sample.
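
A minimal sketch of the three measures of spread, written directly from the definitions above (division by n - 1). The function name and the data are hypothetical.

```python
from math import sqrt

def sample_variance(xs):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    return sum((x - x_bar) ** 2 for x in xs) / (n - 1)

# Hypothetical sample, not from the slides.
data = [2, 4, 4, 4, 5, 5, 7, 9]

s2 = sample_variance(data)          # sample variance
s = sqrt(s2)                        # sample standard deviation
data_range = max(data) - min(data)  # range = maximum - minimum

print(s2, s, data_range)
```

For this sample the mean is 5 and the squared deviations sum to 32, so the variance is 32/7.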

Summary Statistics: Measures of Spread (Scale)
• We can describe the spread of a distribution by using percentiles.
• The pth percentile of a distribution is the value such that p percent of the observations fall at or below it. Median = 50th percentile.
• Quartiles divide the data into four equal parts.
  First quartile (Q1): 25% of observations are below Q1 and 75% are above Q1
  Second quartile (Q2): 50% of observations are below Q2 and 50% are above Q2
  Third quartile (Q3): 75% of observations are below Q3 and 25% are above Q3

Quartiles
Figure: Q1, Q2, and Q3 divide the ordered data into four parts, each containing 25% of the observations.

Five-Number Summary
• Minimum
• Lower quartile Q1 = 25th percentile
• Median = 50th percentile
• Upper quartile Q3 = 75th percentile
• Maximum
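
The five values can be computed with the standard library, as sketched below. Several conventions exist for interpolating quartiles and they can give slightly different answers; the choice of `method="inclusive"` here is an assumption, not something the slides specify. The data are hypothetical.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    # n=4 asks for the three cut points that split the data into quarters.
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return (min(values), q1, q2, q3, max(values))

# Hypothetical sample, not from the slides.
five = five_number_summary([1, 2, 3, 4, 5, 6, 7, 8, 9])
print(five)
```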

Graphical Display of Numerical Variables (Histogram)
Class interval    Frequency
20-under 30       6
30-under 40       18
40-under 50       11
50-under 60       11
60-under 70       3
70-under 80       1
Figure: Histogram of the frequency table above (x-axis: Years, 0 to 80; y-axis: Frequency, 0 to 20).
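
A frequency table like the one above is built by counting how many observations fall in each class interval. The sketch below shows one way to do this; the function name, its parameters, and the raw ages are hypothetical (the slides give only the finished table, not the underlying data).

```python
from collections import Counter

def frequency_table(values, start, width, n_classes):
    """Count values in classes [start, start+width), [start+width, start+2*width), ..."""
    upper = start + width * n_classes
    counts = Counter((v - start) // width for v in values if start <= v < upper)
    return [(start + k * width, start + (k + 1) * width, counts.get(k, 0))
            for k in range(n_classes)]

# Hypothetical raw ages in years, not from the slides.
ages = [22, 25, 31, 34, 38, 39, 41, 47, 52, 58, 63, 71]

table = frequency_table(ages, start=20, width=10, n_classes=6)
for lo, hi, freq in table:
    print(f"{lo}-under {hi}: {freq}")
```

Each class includes its lower limit and excludes its upper limit, which matches the "20-under 30" labelling.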

Frequency Distributions and Histograms
Figure: A histogram of the compressive strength data with 17 bins.
Figure: A histogram of the compressive strength data with nine bins.
Figure: Histogram of compressive strength data.

Graphical Display of Numerical Variables (Box Plot)
Figure: Box plot marking the minimum, Q1, Q2 (median), Q3, and maximum.
Figure: Box plots of negatively skewed (S < 0), symmetric, not skewed (S = 0), and positively skewed (S > 0) distributions.

Univariate Statistics (Categorical Variables)
• Summary measures: count = frequency; percent = frequency / total sample size × 100
• The distribution of a categorical variable lists the categories and gives either a count or a percent of individuals who fall in each category.
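
As a small illustration of counts and percents, the sketch below tabulates a categorical variable. The responses are hypothetical and not taken from the slides.

```python
from collections import Counter

# Hypothetical Likert-type responses, not from the slides.
responses = ["agree", "neutral", "agree", "disagree",
             "agree", "neutral", "agree", "agree"]

counts = Counter(responses)     # count = frequency of each category
total = sum(counts.values())    # total sample size
percents = {cat: 100 * c / total for cat, c in counts.items()}

for cat in counts:
    print(f"{cat}: {counts[cat]} ({percents[cat]:.1f}%)")
```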

Displaying Categorical Variables
Rank     Cause of death     Frequency (%)
1        Heart disease      710,760 (43%)
2        Cancer             553,091 (33%)
3        Stroke             167,661 (11%)
4        CLRD               122,009 (7%)
5        Accidents          97,900 (6%)
Total    All five causes    1,651,421
Figure: Charts of deaths by cause (heart, cancer, stroke, CLRD, accident); vertical axis 0 to 60.

Response and Explanatory Variables
• Response variable: the variable that we intend to model, that is, the one we intend to explain through statistical modeling.
• Explanatory variable: the variable or variables that may be used to model the response variable; their values may be related to the response variable.

Bivariate Relationships
• An extension of univariate descriptive statistics
• Used to detect evidence of association in the sample
• Two variables are said to be associated if the distribution of one variable differs across groups or values defined by the other variable

Bivariate Relationships
• Two quantitative variables: scatter plot, side-by-side stem-and-leaf plots
• Two qualitative variables: tables, bar charts
• One quantitative and one qualitative variable: side-by-side box plots, bar chart

Two Quantitative Variables: Correlation
A correlation is a relationship between two variables. What type of relationship exists between the two variables, and is the correlation significant?
x: explanatory (independent) variable    y: response (dependent) variable
Hours of training                        Number of accidents
Shoe size                                Height
Cigarettes smoked per day                Lung capacity
Height                                   IQ

Scatter Plots and Types of Correlation
• Negative correlation: as x increases, y decreases. Example: x = hours of training, y = number of accidents.
• Positive correlation: as x increases, y increases. Example: x = SAT score, y = GPA.
• No linear correlation. Example: x = height, y = IQ.

Correlation Coefficient
A measure of the strength and direction of a linear relationship between two variables:
$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{n\sum x^2 - (\sum x)^2}\,\sqrt{n\sum y^2 - (\sum y)^2}}$
• The range of r is from -1 to 1.
• If r is close to 1, there is a strong positive correlation.
• If r is close to -1, there is a strong negative correlation.
• If r is close to 0, there is no linear correlation.
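
The computational formula for r can be written out term by term, as in the sketch below. The function name and the data are hypothetical; the data were chosen to be perfectly linear and decreasing so that r comes out at -1.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Correlation coefficient from the sums n, Σx, Σy, Σxy, Σx², Σy²."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    numerator = n * sxy - sx * sy
    denominator = sqrt(n * sxx - sx ** 2) * sqrt(n * syy - sy ** 2)
    return numerator / denominator

# Hypothetical data, not from the slides.
hours_of_training = [1, 2, 3, 4, 5]
accidents = [10, 8, 6, 4, 2]   # falls by 2 for every extra hour

r = pearson_r(hours_of_training, accidents)
print(round(r, 6))
```

Note that the denominator is zero when either variable is constant, in which case r is undefined; the sketch does not handle that case.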

Positive and Negative Correlation
1. If two variables x and y are positively correlated, this means that large values of x are associated with large values of y, and small values of x are associated with small values of y.
2. If two variables x and y are negatively correlated, this means that large values of x are associated with small values of y, and small values of x are associated with large values of y.

Figure: Positive correlation.
Figure: Negative correlation.

Two Qualitative Variables (Contingency Tables)
• Categorical data are usually displayed using a contingency table, which shows the frequency of each combination of categories observed in the data.
• The rows correspond to the categories of the explanatory variable.
• The columns correspond to the categories of the response variable.

Example: Aspirin and Heart Attacks
• Explanatory variable = drug received: placebo, aspirin
• Response variable = heart attack status: yes, no

Contingency Table: Heart Attack Example
           Heart attack    No heart attack    Total
Aspirin    104             10,933             11,037
Placebo    189             10,845             11,034
Total      293             21,778             22,071
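
One way to read this table is to compare the proportion of heart attacks within each row. The sketch below uses the counts from the table; the dictionary layout and variable names are illustrative, and the ratio of the two proportions at the end is an extra summary that the slides do not compute.

```python
# Counts taken from the contingency table above.
table = {
    "aspirin": {"heart_attack": 104, "no_heart_attack": 10933},
    "placebo": {"heart_attack": 189, "no_heart_attack": 10845},
}

# Proportion with a heart attack within each group (row).
risk = {group: row["heart_attack"] / sum(row.values())
        for group, row in table.items()}

# Ratio of the two proportions (not part of the slides).
ratio = risk["aspirin"] / risk["placebo"]

for group, p in risk.items():
    print(f"{group}: {100 * p:.2f}% had a heart attack")
```

The proportions differ between the two rows, which is what the earlier definition of association describes: the distribution of the response differs across groups defined by the explanatory variable.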

Two Qualitative Variables
Marijuana use in college: x = parental use, y = student use
Student use    Both    Neither    One    Total
Never          17      141        68     226
Occasional     11      54         44     109
Regular        19      40         51     110
Total          47      235        163    445
Figure: Bar chart of student use (never, occasional, regular) by parental use (both, neither, one); vertical axis 0 to 60.

One Quantitative, One Qualitative
Figure: Box plot of age by low birth weight (lbw = 0, 1; age axis 10 to 50).
Figure: Mean age by low birth weight (groups yes and no; plotted means 22.31 and 23.66; vertical axis 21.5 to 24).

Trivariate Relationships
• An extension of bivariate descriptive statistics
• We focus on description that helps us decide about the role variables might play in the ultimate statistical analyses
• Identify variables that can increase the precision of the data analysis used to assess associations between two other variables

Confounding and Effect Modification
• A factor Z is said to confound a relationship between a risk factor X and an outcome Y if it is not an effect modifier and the unadjusted strength of the relationship between X and Y differs from the common strength of the relationship between X and Y within each level of Z.
• A factor Z is said to be an effect modifier of a relationship between a risk factor X and an outcome measure Y if the strength of the relationship between X and Y varies among the levels of Z.

Example: Confounding
• In our low birth weight data, suppose we wish to investigate the association between race and low birth weight.
• Our ability to detect this association might be affected by:
  smoking status being associated with low birth weight
  smoking status being associated with race
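
The definition of confounding can be illustrated with made-up counts. In the sketch below, the counts, the labels X, Y, Z, and the use of the risk ratio as the measure of "strength" are all assumptions chosen for illustration; they are not the low birth weight data. Within each level of Z the risk ratio is the same, yet the unadjusted (pooled) risk ratio is different.

```python
def risk_ratio(cases_exposed, n_exposed, cases_unexposed, n_unexposed):
    """Risk of the outcome in the exposed divided by risk in the unexposed."""
    return (cases_exposed / n_exposed) / (cases_unexposed / n_unexposed)

# Hypothetical counts per level of Z:
# (cases among exposed, exposed total, cases among unexposed, unexposed total)
strata = {
    "Z=1": (30, 100, 6, 20),    # risk 0.30 in both exposure groups
    "Z=0": (2, 20, 10, 100),    # risk 0.10 in both exposure groups
}

# Strength of the X-Y relationship within each level of Z.
stratum_rr = {z: risk_ratio(*counts) for z, counts in strata.items()}

# Unadjusted strength: pool the counts over Z, ignoring it.
pooled = [sum(column) for column in zip(*strata.values())]
crude_rr = risk_ratio(*pooled)

print(stratum_rr, crude_rr)
```

Here Z is associated with both the exposure (most exposed subjects have Z=1) and the outcome (risk is higher when Z=1), which mirrors the two smoking-status conditions listed in the example.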

Multiple Models
• Allow one to calculate the association between an explanatory variable and an outcome of interest, after controlling for potential confounders.
• Allow one to assess the association between an outcome and multiple explanatory variables of interest.

Time Sequence Plots
• A time series or time sequence is a data set in which the observations are recorded in the order in which they occur.
• A time series plot is a graph in which the vertical axis denotes the observed value of the variable (say x) and the horizontal axis denotes the time (which could be minutes, days, years, etc.).
• When measurements are plotted as a time series, we often see trends, cycles, or other broad features of the data.
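
Trends and cycles can also be seen numerically. The sketch below uses a moving average, a smoothing technique that the slides do not mention and that is introduced here only as an illustration; the quarterly sales figures are hypothetical. Averaging over four quarters removes the within-year cycle and leaves the trend.

```python
def moving_average(series, window):
    """Average of each run of `window` consecutive observations, in time order."""
    return [sum(series[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(series))]

# Hypothetical quarterly sales over two years, in the order recorded.
sales = [10, 14, 8, 12, 12, 16, 10, 14]

trend = moving_average(sales, window=4)
print(trend)
```

The raw series rises and falls within each year (a cycle), while the smoothed values increase steadily (a trend).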

Time Sequence Plots
Figure: Company sales by year (a) and by quarter (b).

Tests Comparing Differences Between Two or More Groups
Test                             Dependent variable                     Independent variable
Paired (dependent t-test)        Interval/ratio (pre and post tests)    Nominal
Unpaired (independent t-test)    Interval/ratio                         Nominal (2 groups)
ANOVA F-test                     Interval/ratio                         Nominal (>2 groups)
Chi-square (nonparametric)       Nominal (dichotomous)                  Nominal
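
The table above can be read as a simple decision rule. The sketch below encodes only the four rows shown; the function name and its parameters are hypothetical, and real test selection also depends on assumptions (sample size, distribution shape) that the table does not cover.

```python
def choose_comparison_test(dependent, n_groups=2, paired=False):
    """Pick a test from the table, given the dependent variable's level.

    dependent: "interval/ratio" or "nominal"
    n_groups:  number of groups defined by the nominal independent variable
    paired:    True for pre/post measurements on the same subjects
    """
    if n_groups < 2:
        raise ValueError("need at least two groups to compare")
    if dependent == "nominal":
        return "chi-square"
    if dependent == "interval/ratio":
        if paired:
            return "paired t-test"
        return "independent t-test" if n_groups == 2 else "ANOVA F-test"
    raise ValueError(f"level not covered by the table: {dependent!r}")

print(choose_comparison_test("interval/ratio", n_groups=3))
```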

Tests Demonstrating Association Between Two Groups
Test                               Dependent variable    Independent variable
Spearman rho                       Ordinal               Ordinal
Mann-Whitney U (non-parametric)    Ordinal               Nominal
Pearson's r                        Interval/ratio        Interval/ratio

Tests Demonstrating Association Between Two Groups, Controlling for a Third Variable
Test                   Dependent variable    Independent variable
Logistic regression    Nominal               Nominal
Linear regression      Interval/ratio        Interval/ratio
Pearson partial r      Interval/ratio        Interval/ratio
Kendall's partial r    Ordinal               Ordinal