ch01 2statistics intro

Upload: andrea-jeromia

Post on 05-Apr-2018


  • 7/31/2019 Ch01 2Statistics Intro

    1/74

    Introduction to Statistics

    Chapter 1


    Statistics: Introduction

    Definition

    Data types, Scales of measurement

    Basic concepts

    Tabulation & Graphs

    Measures of central tendency


    Definition

    Statistics is a science which deals with the collection, classification, presentation (tabulation, graphs), analysis and interpretation of numerical data.


    Collection of data

    Primary data: data collected directly from the original source, for example a census or a sample survey using a questionnaire.

    Secondary data: data taken indirectly from another source, for example magazines, newspapers, NGOs, etc.


    Classification of data

    Qualitative data describe a quality (attribute) of the individual. Examples: Eye Color, First Name, Favorite Movie, Political Party.

    Quantitative data are those for which the value has numerical meaning. You can do mathematical operations on the values of quantitative variables (like taking an average). Examples: a person's Height, Weight, Income, Resting Pulse Rate, Blood Alcohol Level.


    Categorical data

    Geographical

    Chronological


    Quantitative data

    Discrete variables are usually obtained by counting. There are a finite or countable number of choices available with discrete data. You can't have 2.63 people in the room.

    Continuous variables are usually obtained by measuring. Length, weight, and time are all examples of continuous variables. They can be defined in intervals.


    Levels of Measurement

    nominal

    ordinal

    interval

    Ratio


    Levels of Measurement contd.

    Nominal

    Only names are meaningful.

    Can be qualitative only.

    Data values serve as labels.

    Labels have no meaningful order.

    Examples: Blood Type, Breed of Dog


    Levels of Measurement contd.

    Ordinal

    Adds an order to the names.

    It can be qualitative or quantitative.

    Data values serve as labels but the labels have a natural meaningful order.

    Differences between values, however, are meaningless.

    Examples: Statistics Grade, Basketball Rankings


    Levels of Measurement contd.

    Interval

    Data values are numerical

    They have a meaningful order

    Differences between data values are meaningful

    The ratio of two data values, however, is meaningless. This occurs when zero is an arbitrary measurement rather than actually indicating nothing

    Examples: Temperature, Year of Birth


    Levels of Measurement contd.

    Ratio

    Data values are numerical, have order, and both differences and ratios of values are meaningful.

    Zero measurement indicates absence of the quantity being measured.

    Examples: Weight, Volume, Number of Children


    Population vs Sample

    Population - data set consisting of all outcomes, measurements, or responses of interest (the group of observations under study).

    Sample - data set which is a subset of the population data set.


    Types of Sampling

    There are 5 types of sampling methods:

    Random sampling

    Examples: telephone polling using random telephone numbers; drawing names out of a hat.

    Systematic sampling

    Examples: choosing a sample of registered voters by taking every 25th voter from the county registration roll; testing every 300th product from the assembly line.


    Types of Sampling contd.

    Convenience sampling

    Examples: Internet polls, mail-in customer surveys.

    Cluster sampling is accomplished by dividing the population into groups, usually geographically. These groups are called clusters or blocks. The clusters are randomly selected, and every element in the selected clusters is used. Example: randomly choose 10 polling stations in a city and exit poll all voters at those stations.


    Types of Sampling contd.

    Stratified sampling also divides the population into groups, called strata, and then a sample is taken from each of these strata using either random, systematic, or convenience sampling. Examples: choosing 100 men and 100 women for a sample; stratifying a population of 1000 employees by income level and then choosing a sample of low, middle, and high income individuals.
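    The sampling methods listed on these slides can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the fixed seeds and the voter list are hypothetical and do not come from the slides. Convenience sampling is omitted because it involves no selection rule.

```python
import random

def simple_random_sample(population, k, seed=0):
    """Pick k items so that every item is equally likely to be chosen."""
    return random.Random(seed).sample(population, k)

def systematic_sample(population, step, start=0):
    """Pick every `step`-th item, beginning at index `start`."""
    return population[start::step]

def cluster_sample(clusters, n_clusters, seed=0):
    """Randomly pick whole clusters and keep every element in them."""
    chosen = random.Random(seed).sample(sorted(clusters), n_clusters)
    return [item for name in chosen for item in clusters[name]]

def stratified_sample(strata, k_per_stratum, seed=0):
    """Take a simple random sample of size k from each stratum."""
    rng = random.Random(seed)
    return {name: rng.sample(members, k_per_stratum)
            for name, members in strata.items()}

# Hypothetical registration roll: voter IDs 1..1000.
voters = list(range(1, 1001))
every_25th = systematic_sample(voters, step=25, start=24)  # voters 25, 50, ...
```

    The stratified version draws the same number from each stratum, which matches the "100 men and 100 women" example; proportional allocation would be an equally reasonable choice.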


    Parameter and Statistic

    Parameters are statistical measures associated with populations, and statistics are those associated with samples.

    Parameters are usually denoted using Greek letters (mu, sigma) while statistics are usually denoted using Roman letters (x̄, s).
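    As a rough illustration of the distinction, the sketch below computes two parameters from a whole population and the matching statistics from a sample. The population is simulated and entirely hypothetical; only the standard-library `random` and `statistics` modules are used.

```python
import random
import statistics

# Hypothetical population of 1000 measurements (illustration only).
rng = random.Random(42)
population = [rng.gauss(50, 10) for _ in range(1000)]

# Parameters describe the whole population.
mu = statistics.mean(population)         # population mean
sigma = statistics.pstdev(population)    # population standard deviation

# Statistics are computed from a sample and estimate the parameters.
sample = rng.sample(population, 50)
x_bar = statistics.mean(sample)          # sample mean, estimates mu
s = statistics.stdev(sample)             # sample standard deviation, estimates sigma

print(f"mu={mu:.2f} sigma={sigma:.2f} x_bar={x_bar:.2f} s={s:.2f}")
```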


    Descriptive and Inferential Statistics

    We compute statistics and use them to estimate parameters. The computation is the first part of the statistics course, called Descriptive Statistics, and the estimation is the second part, called Inferential Statistics.


    Presentation of Data

    Tabulation

    Frequency distribution

    Grouped

    Ungrouped

    Graphs


    Frequency distribution

    According to Croxton & Cowden, a frequency distribution is a statistical table which shows the set of all distinct values of the variable arranged in order of magnitude, either individually or in groups, with their corresponding frequencies side by side.


    Ungrouped Frequency of a Discrete Variable:

    Suppose that a survey of 150 houses was conducted and the number of rooms in each house is recorded as below:

    5 4 4 6 3 2 2 6 6 2 6 3 3 4 5 6 3 2 2 5 3
    1 4 5 1 5 1 4 3 2 5 1 5 3 2 2 4 2 2 4 4 6
    3 2 4 2 3 2 4 6 3 3 2 6 4 1 4 4 5 2 4 1 4
    2 1 5 1 3 3 2 5 6 1 3 1 5 3 4 3 1 1 4 1 1
    2 2 1 5 2 3 5 3 4 3 1 1 4 1 1 2 2 1 5 2 3
    6 3 5 2 2 3 3 3 3 4 5 1 6 2 1 2 1 1 6 5 2
    1 1 5 6 4 2 2 3 3 3 4 3 2 1 5 2 3 1 1 4 6
    4 6 2


    Counting of frequency using tally marks

    No. of Rooms (X)   Tally Marks                          Frequency
    1                  llll llll llll llll llll llll        29
    2                  llll llll llll llll llll llll lll    33
    3                  llll llll llll llll llll llll        30
    4                  llll llll llll llll llll             24
    5                  llll llll llll llll                  19
    6                  llll llll llll                       15
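    Counting frequencies by tally can be mirrored with `collections.Counter`. The sketch below uses only the first 21 of the 150 recorded values to stay short, so its counts are not the ones in the table; the tally rendering (groups of five strokes) is an assumption about how the original marks were drawn.

```python
from collections import Counter

# First 21 of the 150 recorded values, used here only to keep the sketch short.
rooms = [5, 4, 4, 6, 3, 2, 2, 6, 6, 2, 6, 3, 3, 4, 5, 6, 3, 2, 2, 5, 3]

frequency = Counter(rooms)

print("Rooms  Tally                Frequency")
for value in sorted(frequency):
    count = frequency[value]
    # Full groups of five strokes, followed by one partial group if needed.
    groups = ["|||||"] * (count // 5)
    if count % 5:
        groups.append("|" * (count % 5))
    print(f"{value:<6} {' '.join(groups):<20} {count}")
```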


    Grouped Frequency Distributions - Guidelines for classes

    Number of classes should be between 5 and 20.

    The classes must be mutually exclusive.

    The classes must be all inclusive or exhaustive.

    The classes must be continuous.

    The classes must be equal in width.


    Creating a Grouped Frequency Distribution

    Compute the Range = Maximum - Minimum (from observations).

    Select the number of classes desired. This is usually between 5 and 20.

    Or use Sturges' formula: $n = 1 + 3.322 \log_{10} N$, which gives the necessary number of classes for any continuous frequency distribution, where N denotes the total number of observations.

    Find the class width by dividing the range by the number of classes and rounding up.

    Pick a suitable starting point less than or equal to the minimum value. You will be able to cover "the class width times the number of classes" values. Your starting point is the lower limit of the first class. Continue to add the class width to this lower limit to get the rest of the lower limits.
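    The steps above can be sketched as two small helpers. The function names are hypothetical, and rounding the Sturges value up to a whole number is an assumption: the slide gives the formula but not a rounding rule.

```python
import math

def sturges_classes(n_obs):
    """Number of classes from Sturges' formula (rounded up: an assumption)."""
    return math.ceil(1 + 3.322 * math.log10(n_obs))

def class_lower_limits(data, n_classes, start=None):
    """Return (class width, lower limits) following the steps above."""
    data_range = max(data) - min(data)
    width = math.ceil(data_range / n_classes)   # round the width up
    if start is None:
        start = min(data)                       # any value <= minimum works
    return width, [start + i * width for i in range(n_classes)]

# Example: minimum 12, maximum 58, five classes starting at 10.
width, lowers = class_lower_limits([12, 30, 58], n_classes=5, start=10)
```

    Here `width` is 10 and `lowers` is `[10, 20, 30, 40, 50]`; for 150 observations `sturges_classes` suggests 9 classes.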


    Tabulating Numerical Data: Frequency Distributions

    Sort Raw Data in Ascending Order: 12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58

    Find Range: 58 - 12 = 46

    Select Number of Classes: 5 (usually between 5 and 15)

    Compute Class Interval (width): 10 (46/5 then round up)

    Determine Class Boundaries (limits): 10, 20, 30, 40, 50, 60

    Compute Class Midpoints: 15, 25, 35, 45, 55

    Count Observations & Assign to Classes


    Frequency Distributions, Relative Frequency Distributions and Percentage Distributions

    Class              Frequency   Relative Frequency   Percentage
    10 but under 20    3           .15                  15
    20 but under 30    6           .30                  30
    30 but under 40    5           .25                  25
    40 but under 50    4           .20                  20
    50 but under 60    2           .10                  10
    Total              20          1                    100

    Data in ordered array:

    12, 13, 17, 21, 24, 24, 26, 27, 27, 30, 32, 35, 37, 38, 41, 43, 44, 46, 53, 58
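    The table can be reproduced directly from the ordered array. This is a minimal sketch; the variable names are illustrative, and the "lower but under upper" rule is implemented as `lo <= x < hi`.

```python
data = [12, 13, 17, 21, 24, 24, 26, 27, 27, 30,
        32, 35, 37, 38, 41, 43, 44, 46, 53, 58]
boundaries = [10, 20, 30, 40, 50, 60]

# Count the values falling in each class "lower but under upper".
classes = list(zip(boundaries[:-1], boundaries[1:]))
frequencies = [sum(lo <= x < hi for x in data) for lo, hi in classes]
relative = [f / len(data) for f in frequencies]
percentages = [100 * r for r in relative]
midpoints = [(lo + hi) / 2 for lo, hi in classes]

for (lo, hi), f, r, p in zip(classes, frequencies, relative, percentages):
    print(f"{lo} but under {hi}: {f:>2}  {r:.2f}  {p:.0f}")
```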


    Graphing Numerical Data: The Histogram

    [Figure: histogram of the data in the ordered array. Horizontal axis: class midpoints (5, 15, 25, 35, 45, 55, More), with bars drawn between class boundaries; vertical axis: frequency (bar heights 0, 3, 6, 5, 4, 2, 0). No gaps between bars.]


    Graphing Numerical Data: The Frequency Polygon

    [Figure: frequency polygon of the same data. Horizontal axis: class midpoints (5, 15, 25, 35, 45, 55, More); vertical axis: frequency (0 to 7).]


    Tabulating Numerical Data: Cumulative Frequency

    Lower Limit   Cumulative Frequency   Cumulative % Frequency
    10            0                      0
    20            3                      15
    30            9                      45
    40            14                     70
    50            18                     90
    60            20                     100
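    The cumulative columns are running totals of the class frequencies. A minimal sketch using `itertools.accumulate`, with the class frequencies 3, 6, 5, 4, 2 taken from the frequency distribution of the same data:

```python
from itertools import accumulate

boundaries = [10, 20, 30, 40, 50, 60]
frequencies = [3, 6, 5, 4, 2]          # class frequencies for the same data

# "Less than" cumulative frequency at each boundary, starting from 0.
cumulative = [0] + list(accumulate(frequencies))
cumulative_pct = [100 * c / cumulative[-1] for c in cumulative]

for limit, c, pct in zip(boundaries, cumulative, cumulative_pct):
    print(f"{limit:>3} {c:>3} {pct:>5.0f}")
```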


    [Figure: ogive (cumulative frequency curve), plotted against class boundaries (upper limits), not midpoints.]


    Example

    Using the data given below, construct a 'more than' cumulative frequency table and draw the ogive (more than ogive).


    Locate median on Ogive

    The median is the value on the X axis of the point corresponding to N/2 on the Y axis. (This point is at the intersection of the less than and more than ogives.)


    Tabulating and Graphing Univariate

    Categorical Data

    Categorical Data

    Tabulating Data

    The Summary Table

    Graphing Data

    Pie Charts

    Pareto Diagram

    Bar Charts


    Summary Table (for an Investor's Portfolio)

    Investment Category   Amount (in thousands $)   Percentage
    Stocks                46.5                      42.27
    Bonds                 32                        29.09
    CD                    15.5                      14.09
    Savings               16                        14.55
    Total                 110                       100

    Variables are Categorical.
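    The percentage column is each amount divided by the total. A short sketch that recomputes it from the amounts in the table (the dictionary name is illustrative):

```python
amounts = {"Stocks": 46.5, "Bonds": 32.0, "CD": 15.5, "Savings": 16.0}  # thousands of $

total = sum(amounts.values())
percentages = {name: round(100 * amount / total, 2)
               for name, amount in amounts.items()}

for name, pct in percentages.items():
    print(f"{name:<8} {amounts[name]:>5} {pct:>6.2f}")
```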


    Graphing Univariate Categorical Data

    [Figures: bar chart, pie chart and Pareto diagram of the investor's portfolio by category (Stocks, Bonds, Savings, CD).]


    Pie Chart (for an Investor's Portfolio)

    [Figure: pie chart of amount invested in K$. Stocks 42%, Bonds 29%, Savings 15%, CD 14%. Percentages are rounded to the nearest percent.]


    Pareto Diagram

    [Figure: Pareto diagram for the categories Stocks, Bonds, Savings, CD. The axis for the bar chart (0% to 45%) shows the % invested in each category; the axis for the line graph (0% to 100%) shows the cumulative % invested.]


    Tabulating and Graphing Bivariate Categorical Data

    Side by Side Charts

    [Figure: side-by-side bar chart "Comparing Investors" for Investor A, Investor B and Investor C across Stocks, Bonds, CD and Savings; horizontal axis from 0 to 60.]


    Errors in Presenting Data

    Using Chart Junk

    No Relative Basis in Comparing Data between Groups

    Compressing the Vertical Axis

    No Zero Point on the Vertical Axis


    Chart Junk

    [Figure: a bad and a good presentation of the same series, both titled "Minimum Wage". Data labels: 1960: $1.00, 1970: $1.60, 1980: $3.10, 1990: $3.80. Chart axes: $ (0 to 4) against year (1960, 1970, 1980, 1990).]


    No Relative Basis

    [Figure: a bad and a good presentation, both captioned "A's received by students", for FR, SO, JR and SR. One chart plots raw frequency (Freq., axis 0 to 300); the other plots percentage (%, axis up to 30).]

    FR = Freshmen, SO = Sophomore, JR = Junior, SR = Senior


    Compressing Vertical Axis

    [Figure: a bad and a good presentation of "Quarterly Sales" ($) for Q1 to Q4. One chart uses a vertical axis from 0 to 50; the other uses a vertical axis from 0 to 200.]


    Chapter 2

    Measures of Central Tendency


    Measures of Central Tendency

    An average is a value which represents the whole data, or a central value of the data. The arithmetic mean, median, and mode are commonly used measures of central tendency.


    Measures of Central Tendency

    Arithmetic Mean (AM)

    Population Mean: $\mu = \frac{\sum X}{N}$

    Sample Mean: $\bar{X} = \frac{\sum X}{n}$

    Frequency Distribution: $\bar{X} = \frac{\sum f X}{\sum f}$


    Simple problems on AM

    For Raw data

    $\bar{X} = \frac{\sum X}{N}$, N = total number of observations

    Find the mean of 5, 34, 32, 34, 21, 23, 34, 31, 30, 35

    Answer = 27.9
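    The answer can be checked by summing the ten observations and dividing by their count:

```python
values = [5, 34, 32, 34, 21, 23, 34, 31, 30, 35]

total = sum(values)          # 279
mean = total / len(values)   # 279 / 10
print(mean)
```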


    Grouped frequency distribution

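    $\bar{X} = A + h \cdot \frac{\sum f d}{\sum f}$, where $\sum f = N$, A is the assumed mean, h is the class width and $d = (X - A)/h$ for each class midpoint X.

    Marks:             0-10   10-20   20-30   30-40   40-50   50-60
    No. of students:   12     18      27      20      17      6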
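    A sketch of the step-deviation calculation for the marks table on this slide. Taking A = 25 (the midpoint of the 20-30 class) is an arbitrary choice made here for illustration; any class midpoint gives the same mean, which the direct formula confirms.

```python
classes = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50), (50, 60)]
freqs = [12, 18, 27, 20, 17, 6]

h = 10          # class width
A = 25          # assumed mean: midpoint of the 20-30 class (arbitrary choice)

midpoints = [(lo + hi) / 2 for lo, hi in classes]
d = [(x - A) / h for x in midpoints]             # step deviations
sum_fd = sum(f * di for f, di in zip(freqs, d))
mean = A + h * sum_fd / sum(freqs)

# Cross-check with the direct formula: sum(f * x) / sum(f).
direct = sum(f * x for f, x in zip(freqs, midpoints)) / sum(freqs)
print(mean, direct)
```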


    Missing frequency

    Marks:             0-10   10-20   20-30   30-40   40-50
    No. of students:   5      10      25      -----   5

    Find the missing frequency if the arithmetic mean is 25. (Ans: 10)
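    One way to reach the stated answer is to write the mean as $(\sum fX) / (\sum f)$ with the unknown frequency f and solve for it. The sketch below does this with class midpoints; the variable names are illustrative.

```python
midpoints = [5, 15, 25, 35, 45]
freqs = [5, 10, 25, None, 5]   # None marks the missing frequency
target_mean = 25

known = [(f, x) for f, x in zip(freqs, midpoints) if f is not None]
x_missing = midpoints[freqs.index(None)]

n_known = sum(f for f, _ in known)            # total of the known frequencies
sum_fx_known = sum(f * x for f, x in known)   # sum of f * x over known classes

# Solve (sum_fx_known + f * x_missing) / (n_known + f) = target_mean for f.
missing = (target_mean * n_known - sum_fx_known) / (x_missing - target_mean)
print(missing)
```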


    Combined Series

    Let X1 and X2 be two series of data, where

    the X1 series has size n1 and AM $\bar{X}_1$,

    the X2 series has size n2 and AM $\bar{X}_2$.

    Then the AM of the combined series of size n1 + n2 is given by

    $\bar{X}_{12} = \frac{n_1 \bar{X}_1 + n_2 \bar{X}_2}{n_1 + n_2}$

    This formula can be generalized to k series.


    AM of Combined Series

    Q. The average honey obtained from 5 beehives is 30 kg. The average of 3 of them is 31 kg. What is the average of the remaining two?

    Q. The mean wage of 200 workers working in three shifts in a factory is Rs. 520. The average of 90 workers working in the first shift is Rs. 500. The average of 60 workers working in the second shift is Rs. 450. What is the average of workers working in the third shift? (Rs. 640)
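    Both questions rearrange the combined-mean formula to recover the mean of the one group that is not given. A minimal sketch, with hypothetical helper names:

```python
def combined_mean(sizes, means):
    """AM of the combined series: sum(n_i * mean_i) / sum(n_i)."""
    return sum(n * m for n, m in zip(sizes, means)) / sum(sizes)

def remaining_mean(total_n, total_mean, sizes, means):
    """Mean of the one group not listed, given the overall mean."""
    n_left = total_n - sum(sizes)
    known_total = sum(n * m for n, m in zip(sizes, means))
    return (total_n * total_mean - known_total) / n_left

honey = remaining_mean(5, 30, [3], [31])                        # kg
third_shift = remaining_mean(200, 520, [90, 60], [500, 450])    # Rs.
print(honey, third_shift)
```

    This gives 28.5 kg for the remaining two beehives and Rs. 640 for the third shift, the latter matching the answer on the slide.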


    Median: A Positional Average

    The data must be ranked (sorted in ascending order) first.

    The median is the number in the middle.

    To find the depth (position) of the median, there are several formulas that could be used; the one that we will use is:

    Depth of median = 0.5 * (n + 1) = (n+1)/2


    How to Find Median

    Raw data: the median is the number in the middle position.

    If the sample size is odd, there is one value at the center, which is the median. With N observations it is at the (N+1)/2 th position.

    If the sample size is even, the median is the midpoint between the two middle numbers of the data. This is also referred to as the (N+1)/2 th position, which is then a fractional position (for example 4.5 for N = 8, meaning midway between the 4th and 5th values).

    In the case of a frequency distribution, we first need to find the cumulative frequencies (less than type).


    Examples

    a) The marks of 9 students are 7, 9, 8, 5, 7, 7, 6, 8 and 9.

    b) The weights of 8 children (in kg) are 12, 11.5, 13, 13.5, 10.5, 14, 12, 15.

    Ungrouped frequency distribution

    c) Calculate the median weight of the group of children.

    Weights           30   31   32   33   34   35   36   37
    No. of children   8    12   15   25   20   12   5    2
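    For the raw-data examples a) and b), the depth rule (n + 1)/2 can be sketched as follows and compared with the standard library's `statistics.median`. The helper name is hypothetical.

```python
import math
import statistics

marks = [7, 9, 8, 5, 7, 7, 6, 8, 9]
weights = [12, 11.5, 13, 13.5, 10.5, 14, 12, 15]

def median_by_depth(values):
    """Median located at depth (n + 1) / 2 in the sorted data."""
    ordered = sorted(values)
    depth = (len(ordered) + 1) / 2
    lower = ordered[math.floor(depth) - 1]   # 1-based position floor(depth)
    upper = ordered[math.ceil(depth) - 1]    # 1-based position ceil(depth)
    return (lower + upper) / 2               # same value twice when n is odd

print(median_by_depth(marks), median_by_depth(weights))
```

    With an odd n the two positions coincide, so the function returns the single middle value; with an even n it returns the midpoint of the two middle values.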


    Calculation of median for a frequency distribution (ungrouped or grouped)

    First arrange the values in ascending order.

    Find the cumulative frequencies (less than type).

    Find (N+1)/2 and locate the first cumulative frequency in the table that is equal to or greater than (N+1)/2; the corresponding value of the variable is the median (or the median class, for grouped frequency data).
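    Applied to the children's weights from example c), the procedure can be sketched like this. The sketch returns a single table value, so it assumes the median position does not fall exactly between two different values.

```python
from itertools import accumulate

weights = [30, 31, 32, 33, 34, 35, 36, 37]
children = [8, 12, 15, 25, 20, 12, 5, 2]

cumulative = list(accumulate(children))      # less-than cumulative frequencies
position = (cumulative[-1] + 1) / 2          # (N + 1) / 2

# First value whose cumulative frequency reaches the median position.
median = next(w for w, cf in zip(weights, cumulative) if cf >= position)
print(position, median)
```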


    Examples

    Grouped frequency distribution

    d) Calculate the median weight for the following group of persons.

    Weights          50-55   55-60   60-65   65-70   70-75   75-80
    No. of persons   8       10      25      35      15      7

    Missing frequency

    e) Given that the median sales of shops were Rs. 2,400, find the missing frequency.

    Sales in 100s of Rs.   0-10   10-20   20-30   30-40   40-50
    No. of shops           5      25      -       18      7
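    The transcript does not show a formula for the grouped median, so the sketch below assumes the usual textbook interpolation, Median = L + ((N/2 - cf) / f) * h, where L is the lower boundary of the median class, cf the cumulative frequency before it, f its frequency and h its width. The function name and the use of N/2 (rather than (N+1)/2) are assumptions.

```python
def grouped_median(boundaries, freqs):
    """Interpolated median: L + ((N/2 - cf) / f) * h within the median class."""
    total = sum(freqs)
    if total <= 0:
        raise ValueError("frequencies must sum to a positive total")
    half = total / 2
    cf_before = 0
    for lower, upper, f in zip(boundaries[:-1], boundaries[1:], freqs):
        if f > 0 and cf_before + f >= half:
            return lower + (half - cf_before) / f * (upper - lower)
        cf_before += f

# Example d): weights of 100 persons.
median_weight = grouped_median([50, 55, 60, 65, 70, 75, 80], [8, 10, 25, 35, 15, 7])

# Example e): a trial value of 25 for the missing frequency gives a median of 24
# (hundreds of Rs.), i.e. Rs. 2,400, under the same assumed formula.
median_sales = grouped_median([0, 10, 20, 30, 40, 50], [5, 25, 25, 18, 7])
print(median_weight, median_sales)
```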


    Mode

    Mode is the value of the variable corresponding to the highest frequency. It can be calculated only for frequency data.

    The mode can be found just by inspection in the case of ungrouped frequency data.

    The mode can be easily located on a histogram in the case of continuous grouped frequency data. Otherwise it can be calculated using a formula after detecting the modal class.


    Mode for grouped frequency distributions

    Find the modal class, that is, the class with the maximum frequency. Locating the modal class is not enough; to find the mode the following formula is used:

    $\text{Mode} = L + \frac{f_m - f_1}{(f_m - f_1) + (f_m - f_2)} \times h$

    L is the lower class boundary of the modal class

    fm is the frequency of the modal class

    f1 is the frequency of the class previous to the modal class

    f2 is the frequency of the class next to the modal class

    h is the size (width) of the modal class


    Example:

    Following is the distribution of 100 families according to expenditure on a luxury product, in 1000 rupees. Find the mode of the following data.

    Expenditures      0-10   10-20   20-30   30-40   40-50
    No. of families   14     23      28      20      15
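    A sketch of the grouped-mode formula applied to this example. The function name is hypothetical, and treating a missing neighbouring class as frequency 0 is an assumption; the sketch also does not handle the degenerate case where fm, f1 and f2 are all equal.

```python
def grouped_mode(boundaries, freqs):
    """Mode = L + (fm - f1) / ((fm - f1) + (fm - f2)) * h for the modal class."""
    i = freqs.index(max(freqs))                       # modal class (first if tied)
    fm = freqs[i]
    f1 = freqs[i - 1] if i > 0 else 0                 # previous class, 0 if none
    f2 = freqs[i + 1] if i < len(freqs) - 1 else 0    # next class, 0 if none
    lower = boundaries[i]
    h = boundaries[i + 1] - boundaries[i]
    return lower + (fm - f1) / ((fm - f1) + (fm - f2)) * h

mode = grouped_mode([0, 10, 20, 30, 40, 50], [14, 23, 28, 20, 15])
print(round(mode, 2))
```

    Here the modal class is 20-30 with fm = 28, f1 = 23 and f2 = 20, so the mode is 20 + (5/13) × 10, about 23.85 thousand rupees.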


    When Mode is Ill-defined

    Sometimes data have more than one mode: they may be bimodal, trimodal or multimodal.

    For a non-symmetrical (moderately skewed) distribution the mode can be estimated using the following empirical formula:

    Mode = 3 median - 2 mean

    Problem: For a skewed distribution, the mean and mode are 35.4 and 32.1 respectively. Find the value of the median.

    Ans. 34.3
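    Rearranging Mode = 3 median - 2 mean gives median = (mode + 2 mean) / 3, which reproduces the answer:

```python
mean, mode = 35.4, 32.1

# From Mode = 3 * median - 2 * mean, solve for the median.
median = (mode + 2 * mean) / 3
print(round(median, 1))
```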


    Assignment

    What are the geometric mean (GM) and harmonic mean (HM)? Define.

    What are the merits, demerits and uses of median, mode, HM and GM?

    Solve 3 problems of each average from a reference book.