6.descriptive stats

Upload: boobalan-dhanabalan

Post on 05-Apr-2018


  • 8/2/2019 6.Descriptive Stats

    1/37

    Descriptive Statistics

    Purpose of descriptive statistics

    Frequency distributions

    Measures of central tendency

    Measures of dispersion

    Gary Geisler Simmons College LIS 403 Spring, 2004

    Statistics as a Tool for LIS Research

    Importance of statistics in research

    Summarize observations to provide answers to research questions and hypotheses

    Make general conclusions based on specific study observations

    Objectively evaluate reliability of study conclusions

    Statistics as a Tool for LIS Research

    Main purposes of statistics in research

    Describe central point in a set of data/observations

    Describe how broad, diversified, or variable the data in a set is

    Indicate whether specific features of a set of data are related, and how closely they are related

    Indicate probability of features of data being influenced by factors other than simply chance

    Statistics as a Tool for LIS Research

    Two main types or branches of statistics

    Descriptive statistics

    Characterizing or summarizing data set

    Presenting data in charts and tables to clarify characteristics

    No inference, just describing a particular group of observations

    Inferential statistics

    Using sample data to make generalizations (inferences) or estimates about a population

    Statements made in terms of probability

    Statistics as a Tool for LIS Research

    Descriptive and inferential statistics not mutually exclusive

    Overlap in what can be called descriptive and what can be called inferential

    Intent is important:

    Group of observations intended to describe an event: descriptive

    Group of observations collected from a sample and intended to predict what a larger population is like: inferential

    Statistics as a Tool for LIS Research

    Choosing statistical methods

    Type of data collected largely determines choice of statistical analysis techniques

    Decisions about how and what type of data is collected will determine the specific statistical tests that can be performed to analyze the data

    Data collected should determine statistical tests used, not the other way around

    But consideration of how you want to analyze data should be done as part of research design to ensure study can produce the type of conclusions you want to make

    Descriptive Statistics

    Commonly used in LIS research

    Cannot test causal relationships

    Primary strength is describing and summarizing data:

    Describing data in terms of frequency distributions

    Describing most typical value in data set - measures of central tendency

    Describing variability of data - measures of dispersion

    Frequency Distributions

    Describing data in terms of frequency distributions

    Counts of totals by value or category for each measured variable

    Can be presented as absolute totals, cumulative totals, percentages, grouped totals

    Often a first step in statistical analysis of data

    Usually presented in tables or charts (histogram, bar graph, etc.)

    [Figure: bar chart of books checked out (vertical axis, 0 to 80) by age group (0-10, 11-20, 21-40, 41-60, 61+)]
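As a rough illustration of the presentations listed above (absolute totals, cumulative totals, percentages), the sketch below tabulates a frequency distribution. The checkout records are made-up values, not data from the slides, and the names `checkouts`, `order`, and `table` are only illustrative.

```python
from collections import Counter

# Hypothetical observations: the age group recorded for each checkout.
checkouts = ["0-10", "11-20", "11-20", "21-40", "21-40", "21-40", "41-60", "61+"]
order = ["0-10", "11-20", "21-40", "41-60", "61+"]

counts = Counter(checkouts)
total = sum(counts.values())

cumulative = 0
table = []  # rows of (group, absolute total, cumulative total, percentage)
for group in order:
    cumulative += counts[group]
    percent = 100 * counts[group] / total
    table.append((group, counts[group], cumulative, percent))
    print(f"{group:>6}  {counts[group]:>3}  {cumulative:>3}  {percent:5.1f}%")
```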

    Measures of Central Tendency

    Describing most typical value in data set - measures of central tendency

    Mean is often referred to as average, though average can be any of these measures of central tendency:

    Mean (arithmetic average)

    Median

    Mode

    Measures of Central Tendency

    Mean

    Most popular statistic for summarizing data

    Can be used for interval or ratio data

    Based on all observations of the data set

    Arithmetic average of a set of observations

    Example: mean of 5, 10, and 30 is 15, since 45/3 = 15

    Mean of a set of numbers can be a number not in set

    Example: mean of 1, 2, 3, and 4 is 2.5, since 10/4 = 2.5
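The two examples above can be checked with a few lines of Python. This is only a sketch using the standard library's `statistics.mean`; the slides do not prescribe any particular tool.

```python
from statistics import mean

print(mean([5, 10, 30]))    # 45 / 3
print(mean([1, 2, 3, 4]))   # 10 / 4, a value that is not in the set
```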

    Measures of Central Tendency

    Median

    Value that is above the lower one-half and below the upper one-half of the values: the middle value of a set of observations when they have been arranged in order

    Can be used for ordinal, interval or ratio data

    Most central measure of a distribution

    Every data set has a median that is unique

    Calculated differently for sets with an odd number of observations than for sets with an even number of observations

    Example: median of the five observations 1, 3, 15, 16, and 17 = 15

    Example: median of the six observations 1, 2, 3, 5, 8, and 9 = 4 (the average of the two middle values, 3 and 5)
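A minimal sketch of the odd and even cases, using the two example sets above and the standard library's `statistics.median` (the variable names are illustrative):

```python
from statistics import median

odd = [1, 3, 15, 16, 17]
even = [1, 2, 3, 5, 8, 9]

print(median(odd))   # the single middle (third) value
print(median(even))  # the average of the two middle values, 3 and 5
```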

    Measures of Central Tendency

    Mode

    Can be used for any type of data

    Most frequently occurring value among a set of observations

    Examples:

    Mode of the observations 1, 2, 2, 3, 4, 5 = 2

    Set of observations 1, 2, 3, 4, 5 has no mode

    Set of observations 1, 2, 3, 3, 4, 5, 5 has no single mode, but can be considered to have two modes, or is bi-modal
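The three examples above can be reproduced with a small helper. This is a sketch only: `modes` is a hypothetical function written for this illustration, and it follows the slide's convention that a set in which no value repeats has no mode (the standard library's `statistics.multimode` would instead return every value in that case).

```python
from collections import Counter

def modes(values):
    """Return the most frequent value(s); an empty list if nothing repeats."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1:
        return []  # slide convention: no value occurs more than once, so no mode
    return sorted(v for v, c in counts.items() if c == highest)

print(modes([1, 2, 2, 3, 4, 5]))
print(modes([1, 2, 3, 4, 5]))
print(modes([1, 2, 3, 3, 4, 5, 5]))
```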

    Measures of Central Tendency

    Advantages of mean

    Always exists

    Is unique

    Can always be calculated by a simple formula

    Disadvantages of mean

    Mean value for a data set is not necessarily one of the values of the data set

    Sensitive to extreme scores, either high or low

    Easily distorted by extremely large or extremely small values among the set of observations

    Example: mean of 1, 2, and 1,000,000 is 333,334.33

    Measures of Central Tendency

    Advantages of median

    Not affected by extreme scores

    Useful way of describing sets of observations that are skewed by including extremely large or small values

    Disadvantages of median

    Median is not necessarily one of the values of the data set

    Defined differently for odd and even numbers of observations
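To make the contrast concrete, the sketch below applies both measures to the set 1, 2, and 1,000,000 used earlier as the example of a distorted mean. It assumes nothing beyond the standard library.

```python
from statistics import mean, median

data = [1, 2, 1_000_000]

print(round(mean(data), 2))  # pulled far toward the extreme value
print(median(data))          # stays at the middle value
```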

    Measures of Central Tendency

    Advantages of mode

    Can be used with any scale of measurement

    If a set of observations has a mode, the mode usefully characterizes the set

    For example, set of observations noting result of rolling two dice will have a mode of 7

    Disadvantages of mode

    Many sets of observations lack a mode because no observed value occurs more than once

    Other sets of observations may have several different most frequent values

    Doesn't characterize set beyond most frequently occurring value
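The two-dice example can be checked by enumerating all 36 equally likely outcomes rather than by rolling. This is a sketch of the theoretical distribution; an actual sample of rolls would only tend toward a mode of 7.

```python
from collections import Counter
from itertools import product

# All 36 equally likely outcomes of rolling two six-sided dice.
sums = Counter(a + b for a, b in product(range(1, 7), repeat=2))

most_common_sum, ways = sums.most_common(1)[0]
print(most_common_sum, ways)  # the most frequent total and how many outcomes give it
```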

    Measures of Central Tendency

    Calculating mean

    Age   Frequency
    13    3
    14    4
    15    6
    16    8
    17    4
    18    3
    19    3

    Measures of Central Tendency

    Calculating mean

    Age   Frequency   Age x Frequency
    13    3           39
    14    4           56
    15    6           90
    16    8           128
    17    4           68
    18    3           54
    19    3           57

    Sum of X = 492, N = 31

    Mean = 492/31 = 15.87
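The calculation above can be sketched as follows, using the age and frequency values from the table. The dictionary layout and variable names are just one convenient choice.

```python
# Age -> frequency, as tabulated on the slide.
frequency = {13: 3, 14: 4, 15: 6, 16: 8, 17: 4, 18: 3, 19: 3}

sum_x = sum(age * count for age, count in frequency.items())  # Sum of X
n = sum(frequency.values())                                   # N
mean_age = sum_x / n

print(sum_x, n, round(mean_age, 2))
```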

    Measures of Central Tendency

    Calculating mode

    Age   Frequency
    13    3
    14    4
    15    6
    16    8
    17    4
    18    3
    19    3

    Mode = 16

    Measures of Central Tendency

    Calculating median

    Non-grouped data

    Age   Frequency   Positions in sorted data
    13    3           1 - 3
    14    4           4 - 7
    15    6           8 - 13
    16    8           14 - 21
    17    4           22 - 25
    18    3           26 - 28
    19    3           29 - 31

    N = 31 so midpoint is 16th value

    Median = 16
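One way to sketch the lookup above is to walk the cumulative positions until the midpoint falls inside an age's range. The code assumes an odd number of observations, as in this example, and the variable names are illustrative.

```python
frequency = {13: 3, 14: 4, 15: 6, 16: 8, 17: 4, 18: 3, 19: 3}

n = sum(frequency.values())
midpoint = (n + 1) // 2  # 16th value when n = 31 (odd n assumed here)

cumulative = 0
for age in sorted(frequency):
    first = cumulative + 1          # first position occupied by this age
    cumulative += frequency[age]    # last position occupied by this age
    if first <= midpoint <= cumulative:
        median_age = age
        break

print(midpoint, median_age)
```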

    Measures of Central Tendency

    Calculating median

    Grouped data:

    Each value is somewhere within each age range

    Values are assumed to be equally distributed within range

    Age   Frequency   Positions in sorted data
    13    3           1 - 3
    14    4           4 - 7
    15    6           8 - 13
    16    8           14 - 21
    17    4           22 - 25
    18    3           26 - 28
    19    3           29 - 31

    N = 31 so midpoint is 16th value

    Position   Assumed value within age 16
    14         16.06
    15         16.19
    16         16.31
    17         16.44
    18         16.56
    19         16.69
    20         16.81
    21         16.94

    Median = 16.31
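The grouped calculation can be sketched as below. It assumes, consistent with the values listed above, that age 16 covers the interval from 16 up to 17 and that the eight observations in that group sit at the centres of eight equal slots; the `width` parameter and variable names are assumptions of this sketch.

```python
frequency = {13: 3, 14: 4, 15: 6, 16: 8, 17: 4, 18: 3, 19: 3}
width = 1  # assumption: each age covers an interval one year wide

n = sum(frequency.values())
midpoint = (n + 1) / 2  # 16th value for n = 31

below = 0  # number of observations in lower age groups
for age in sorted(frequency):
    count = frequency[age]
    if below < midpoint <= below + count:
        # Spread the group's observations evenly across the interval and
        # take the centre of the slot that holds the midpoint observation.
        median_age = age + (midpoint - below - 0.5) / count * width
        break
    below += count

print(round(median_age, 2))
```

For this data the result matches the common textbook form, lower bound + ((N/2 - cumulative frequency below) / group frequency) x width, since 16 + (15.5 - 13)/8 gives the same value.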

    Measures of Central Tendency

    Mean = 15.87

    Mode = 16

    Median = 16.31

    Measures of Central Tendency

    Normal distribution

    Normal curve, bell-shaped curve, Gaussian distribution

    Many types of data are normally distributed in a population

    Histogram of data approximates a bell-shaped, symmetrical curve

    Concentration of scores in the middle, with fewer and fewer scores as you approach extremes

    Example: heights of people in a population are normally distributed

    Measures of Central Tendency

    Skewness

    Not all sets of data will exhibit properties of a normal distribution

    Some data sets are asymmetrical around a central point

    Majority of scores are closer to one extreme or the other: skewed distribution

    In a skewed distribution, the mean does not equal the median

    Measures of Central Tendency

    Positively skewed distribution, tail goes to the right - median is less than the mean

    Example: Annual income of population

    Negatively skewed distribution, tail goes to the left - mean is less than the median
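A small sketch of both cases follows. The income figures are invented for illustration (in thousands) and are not taken from the slides; mirroring them simply moves the long tail to the left.

```python
from statistics import mean, median

# Hypothetical annual incomes (in thousands); one large value forms the right tail.
incomes = [20, 25, 30, 35, 40, 200]
print(round(mean(incomes), 2), median(incomes))    # mean above median

# Mirroring the data moves the tail to the left.
mirrored = [-x for x in incomes]
print(round(mean(mirrored), 2), median(mirrored))  # mean below median
```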

    Measures of Central Tendency

    Special case of skewness: J-Curve

    Extreme skewness

    Proposed by Allport to describe conforming behavior in groups of people

    Large majority of scores fall at end representing socially acceptable behavior, small minority represent deviation from norm

    Example: amount of time drivers who park in No Parking zone stay there

    [Figure: bar chart, vertical axis 0 to 100, categories < 5, 5 to 10, 10 to 15, 15 to 20, 20 to 25, > 25]

    Measures of Central Tendency

    Determining when a distribution is skewed too much to be considered normal

    General rule of thumb: values beyond 2 standard errors of skewness (ses) are probably significantly skewed

    $\text{ses} = \sqrt{6/N}$, or use ses statistic from software (SPSS, for example) output

    Example: if sample size = 30 and skewness statistic is .9814:

    $\text{ses} = \sqrt{6/30} = \sqrt{.20} = .4472$

    $2\,\text{ses} = .4472 \times 2 = .8944$

    skewness statistic of .9814 is beyond 2 ses, so is significantly skewed

    Other factors (histograms, normal probability plots, type of test to be used) should influence decision, depending on exact circumstances of analysis
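The rule of thumb above can be sketched in a few lines. The function name `standard_error_of_skewness` is hypothetical, and the skewness statistic itself is taken as given (for example from software output) rather than computed here.

```python
import math

def standard_error_of_skewness(n):
    """Rule-of-thumb approximation from the slide: sqrt(6 / N)."""
    return math.sqrt(6 / n)

ses = standard_error_of_skewness(30)
skewness = 0.9814  # skewness statistic from the slide's example

print(round(ses, 4), round(2 * ses, 4), abs(skewness) > 2 * ses)
```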

    Measures of Central Tendency

    Kurtosis - amount of peakedness or flatness of the distribution

    Mesokurtic - normal

    Leptokurtic - peaked, many scores around middle

    Platykurtic - flat, many scores dispersed from middle

    Non-normal kurtosis determined by similar process to skewness

    Non-normal kurtosis only a concern with some statistical tests

    Measures of Central Tendency

    Selecting appropriate measure of central tendency

    Interactive selection at Selecting Statistics by William M.K. Trochim: http://trochim.human.cornell.edu/selstat/ssstart.htm

    Rules below can be bent, depending on situation

    Data characteristics                             Measure
    Unimodal, ratio or interval data, skewed         median
    Unimodal, ratio or interval data, not skewed     mean
    Unimodal, ordinal                                median
    Unimodal, nominal                                mode
    Bi-modal or multi-modal distribution             mode
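The rules above could be encoded as a small lookup. This is only a sketch: `suggest_measure` and its argument values are hypothetical names chosen for the illustration, and, as the slide notes, the rules can be bent depending on the situation.

```python
def suggest_measure(modality, scale, skewed=False):
    """Encode the slide's rules of thumb for choosing a measure."""
    if modality != "unimodal":
        return "mode"    # bi-modal or multi-modal distribution
    if scale == "nominal":
        return "mode"
    if scale == "ordinal":
        return "median"
    if scale in ("interval", "ratio"):
        return "median" if skewed else "mean"
    raise ValueError(f"unknown scale: {scale}")

print(suggest_measure("unimodal", "ratio", skewed=True))
print(suggest_measure("unimodal", "interval"))
```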

    Measures of Dispersion

    Variability is a fundamental characteristic of most data sets, but is not addressed by measures of central tendency

    Measures of central tendency are not enough to accurately describe a data set

    Also need to be able to describe the variability or dispersion of the data

    Dispersion: scatteredness or fluctuation of scores around average score

    Several types of measures of dispersion

    Range

    Standard deviation

    Variance

    Measures of Dispersion

    Interquartile range

    Simplified version: ignore the top and bottom 25% after sorting

    Difference between the remaining largest and smallest numbers is interquartile range

    Addresses the problem of outliers

    Other methods of calculating interquartile range are slightly more complicated but take into account more data
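A sketch of the simplified version described above follows. The data are hypothetical, `simplified_iqr` is a name invented for this illustration, and rounding the 25% cut down to a whole number of values is an assumption the slide does not specify.

```python
def simplified_iqr(values):
    """Range of what remains after dropping the top and bottom 25%."""
    ordered = sorted(values)
    k = len(ordered) // 4  # assumption: round the 25% cut down to whole values
    middle = ordered[k:len(ordered) - k]
    return middle[-1] - middle[0]

data = [1, 2, 3, 4, 5, 6, 7, 100]  # hypothetical, with one outlier

print(max(data) - min(data), simplified_iqr(data))  # plain range vs. simplified IQR
```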

    Measures of Dispersion

    Standard deviation

    Measures the variability or the degree of dispersion of the data set

    Square root of the average squared deviations from the mean

    Roughly speaking, standard deviation is the average distance between the individual observations and the center of the set of observations

    Measures of Dispersion

    Calculating standard deviation

    1. Subtract the sample/population mean from each observation and square the result

    2. Add squared distances

    3. Divide sum by n - 1 (sample) or N (population) (adjusted mean of squared distances)

    4. Take square root of mean squared distances

    SD of sample: $s = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n - 1}}$

    SD of population: $\sigma = \sqrt{\dfrac{\sum (x - \mu)^2}{N}}$
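The four steps can be sketched directly in Python. The helper `standard_deviation` is a hypothetical name for this illustration; the standard library's `statistics.stdev` and `statistics.pstdev` compute the same sample and population quantities and are imported here only for comparison.

```python
import math
from statistics import pstdev, stdev

def standard_deviation(values, sample=True):
    m = sum(values) / len(values)
    squared = [(x - m) ** 2 for x in values]   # step 1: squared distances from the mean
    total = sum(squared)                       # step 2: add squared distances
    divisor = len(values) - 1 if sample else len(values)
    return math.sqrt(total / divisor)          # steps 3 and 4: divide, then square root

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

print(round(standard_deviation(data), 2))                # sample SD (divide by n - 1)
print(round(standard_deviation(data, sample=False), 2))  # population SD (divide by N)
```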

    Measures of Dispersion

    Variance

    Square of standard deviation

    Not used for descriptive statistics, but is important for specific inferential statistics tests

    Variance of sample: $s^2 = \dfrac{\sum (x - \bar{x})^2}{n - 1}$

    Variance of population: $\sigma^2 = \dfrac{\sum (x - \mu)^2}{N}$

    Measures of Dispersion

    Advantages of range as measure of dispersion

    Very simple to calculate

    Provides a meaningful characteristic of a set of observations (total spread of the observations)

    Disadvantages of range as measure of dispersion

    Extreme values distort range

    Only measures the total spread; tells us nothing about the pattern of data distribution

    Examples:

    Data set 1, 2, 3, 4, 5, 6, 7, 8, 9 has a range of 8

    Data set 1, 9, 9, 9, 9, 9, 9, 9, 9 also has range of 8, though clearly less scattered

    Measures of Dispersion

    Advantages of standard deviation as measure of dispersion

    Can always be calculated

    Meaningful characteristic of a set of observations; takes every observation into account to express the scatteredness of observations

    Examples:

    Set of observations 1, 2, 3, 4, 5, 6, 7, 8, 9 has a standard deviation s = 2.74

    Set of observations 1, 9, 9, 9, 9, 9, 9, 9, 9 has a standard deviation s = 2.67

    Range doesn't distinguish difference in scatteredness of sets, but standard deviation does

    Disadvantage of standard deviation as measure of dispersion is that it is more complicated to calculate, though not for computers
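As the last point suggests, the comparison is a short computation. The sketch below uses the two example sets from the slides and the standard library's sample standard deviation; the set labels are illustrative.

```python
from statistics import stdev

set_a = [1, 2, 3, 4, 5, 6, 7, 8, 9]
set_b = [1, 9, 9, 9, 9, 9, 9, 9, 9]

for name, values in (("A", set_a), ("B", set_b)):
    spread = max(values) - min(values)              # range
    print(name, spread, round(stdev(values), 2))    # range is equal, s differs
```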