(6) graphical presentation 2

Upload: asclabisb

Post on 14-Apr-2018


  • 7/30/2019 (6) Graphical Presentation 2


    Applied Statistics and Computing Lab

    GRAPHICAL PRESENTATIONS 2 (REPRESENTATION OF NUMERICAL DATA)

    Indian School of Business


    Learning Goals

    Why we use graphs

    What are the various types of graphs for presenting numerical data

    Which graph to use in which scenario

    Graphical distortion


    Why use Graphs: An example

    A private insurance firm is interested in marketing its insurance products in region A.

    To target precisely, it needs to know the age distribution. Questions:

    In which age group does the highest number of people lie?

    It needs to divide the population into 4 different age groups, to sell 4 different products.

    It has the following data:


    Data on ages of 507 people

    23,21,23,26,22,27,29,37,55,53,21,19,20,18,32,20,28,19,23,33,40,28,24,36,23,29,34,31,34,42,45,23,46,26,30,25,20,37,24,36,28,29,23,23,25,24,37,42,30,28,29,39,26,

    20,21,20,19,20,40,25,45,28,21,22,19,24,24,20,29,27,27,40,43,22,22,21,24,23,23,45,20,25,25,33,21,23,20,34,20,41,25,32,24,65,28,25,38,23,22,20,35,34,67,38,33,26,25,52,21,32,43,24,28,62,45,40,21,23,30,20,28,41,32,26,37,38,27,23,50,25,23,43,33,22,26,37,32,23,37,23,27,23,27,24,21,25,23,23,46,34,25,29,45,44,35,55,25,31,19,45,34,19,20,29,33,37,21,23,51,31,27,27,37,25,37,33,25,29,25,20,25,28,24,31,25,27,23,20,28,40,21,62,44,49,34,25,29,19,20,20,26,19,36,34,24,27,23,20,28,40,21,62,

    44,49,34,25,29,19,20,20,26,19,36,34,24,27,22,22,48,21,27,33,34,54,25,35,22,21,41,23,19,29,27,36,21,20,20,24,35,33,25,45,55,49,30,28,25,23,26,21,26,32,32,32,35,19,26,22,23,25,38,30,43,60,32,26,23,24,21,28,25,20,64,39,27,32,23,24,23,29,44,20,24,42,27,43,37,20,47,45,20,28,21,37,27,26,22,21,62,27,27,22,22,52,42,30,19,19,19,24,21,36,32,52,26,56,30,23,21,44,37,51,38,23,44,26,23,20,44,25,18,22,35,24,25,23,22,24,26,26,28,34,24,33,46,51,25,19,35,19,19,20,41,33,44,19,29,35,33,22,33,

    44,29,46,19,30,26,20,32,20,27,22,40,42,29,31,22,29,36,37,25,46,25,43,43,24,24,19,46,29,26,32,29,34,26,34,22,25,41,38,21,34,37,56,28,35,29,22,22,24,36,40,40,37,23,34,20,23,40,20,30,32,30,21,39,37,22,39,49,24,20,40,24,39,32,24,22,20,27,21,26,28,26,18,30,22,30,18,52,25,28,42,23,41,32,22,24,25,27,24,27,31,35,21,36,20,23,19,25,31,32,40,41,36,43,34,26,29,23,45,33,29,29,45,48,19,38,26,48,22,32,44,44,19,32,30,


    Inferences

    What can you infer from the data? Practically nothing!

    How long before you come up with answers?

    Probably the first thing you do is count the observations for each age.

    Note down each age along with the corresponding number of observations.

    That makes a frequency table for you!

    Frequency table: Just like for categorical data, a frequency table for discrete numerical data lists each possible value (either individually or grouped into intervals), the associated frequency and sometimes the corresponding relative frequency.

    Note: Age is, in theory, a continuous variable as it can assume any value. But here the variable is age in whole years, which is discrete.

    But there are 44 distinct values in your data!

    Hence a frequency table with 44 rows and one frequency column.
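
    The counting step can be sketched in R. This is a minimal illustration on the first ten ages from the data slide only, not the full data set; the variable names are chosen here for illustration.

```r
# First ten ages from the data slide (a subset, for illustration only)
age <- c(23, 21, 23, 26, 22, 27, 29, 37, 55, 53)

# table() counts how often each distinct value occurs
freq <- table(age)

# Relative frequency = frequency / number of observations
rel.freq <- freq / length(age)

# One row per distinct age, with its frequency and relative frequency
cbind(Frequency = freq, RelativeFrequency = rel.freq)
```

    With all 507 ages the same call gives one row per distinct age, which is why grouping becomes attractive.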


    So, list individually or group?

    List individually or group into intervals?

    In the ages data, there are 44 distinct values. If we list them individually, we have a table with 44 rows! Cumbersome to interpret.

    The insurance company is interested in selling 4 different products catering to the needs of 4 different age groups.

    So it is interested in 4 age categories.

    In general, depending on the need and the size of the data, decide whether to group or not (for discrete data).

    For continuous data it is necessary to group.

    How to make groups? Find the max and min. Choose a suitable class width = (max - min)/(desired number of classes), rounded up to the next integer if it is a decimal. If it is already an integer, take the next integer.
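
    The class-width rule can be sketched as follows. The minimum of 18 and maximum of 67 are assumed from a reading of the data slide, and starting the first class one year below the minimum is an assumption made here so that the limits match the 17-29, 30-42, 43-55, 56-68 grouping used in the frequency table.

```r
# Assumed from the data slide: ages in whole years range from 18 to 67
age.min <- 18
age.max <- 67
n.classes <- 4   # one class per insurance product

# Raw width, then move up to the next integer (also when it is already whole)
raw.width <- (age.max - age.min) / n.classes   # 12.25
class.width <- floor(raw.width) + 1            # 13

# Class limits, starting one year below the minimum (an assumption)
lower <- seq(age.min - 1, by = class.width, length.out = n.classes)
upper <- lower + class.width - 1
paste(lower, upper, sep = "-")
```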


    Construction of Frequency table

    Do we have the answers in a minute from this table?

    The age group 17-29 has the maximum number of people

    We also have the exact number of people in each age group

    This same data can be represented pictorially in a number of ways!

    Class Interval    Frequency
    17-29             298
    30-42             142
    43-55             56
    56-68             10
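
    A grouped table of this kind can be built with cut() and table(). The sketch below uses only the first twenty ages from the data slide, so its counts are not the ones in the table above; the break points and labels are chosen here to reproduce the same class intervals.

```r
# First twenty ages from the data slide (a subset, for illustration only)
age <- c(23, 21, 23, 26, 22, 27, 29, 37, 55, 53,
         21, 19, 20, 18, 32, 20, 28, 19, 23, 33)

# Right-closed intervals (16,29], (29,42], (42,55], (55,68] cover the
# whole-year classes 17-29, 30-42, 43-55 and 56-68
groups <- cut(age,
              breaks = c(16, 29, 42, 55, 68),
              labels = c("17-29", "30-42", "43-55", "56-68"))

# Frequency of each class interval
freq <- table(groups)
freq
```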


    Types of Graph

    Graphs for presenting Numerical data:

    Bar chart (for discrete variable)

    Histogram

    Frequency Polygon

    Ogive

    Line Diagram


    Bar Chart (Numerical Data)

    Graph of the frequency distribution

    Similar to the bar chart for categorical data

    Each frequency or relative frequency is represented by a rectangle centered over the corresponding value (or range of values for grouped data)

    Area of the rectangle is proportional to the corresponding frequency or relative frequency

    We could name the groups group 1, group 2, group 3 and group 4 and plot the corresponding frequencies, exactly like in the case of categorical data (Exercise)

    Conceptually, hence, there is no difference between the two
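
    One possible way to attempt the exercise is sketched below with barplot(), using the class frequencies from the frequency table slide. The axis labels and title are chosen here for illustration.

```r
# Class frequencies taken from the frequency table slide
freq <- c("17-29" = 298, "30-42" = 142, "43-55" = 56, "56-68" = 10)

# barplot() treats the four groups like categories and draws one bar each;
# it returns the horizontal positions of the bar centres
mids <- barplot(freq,
                xlab = "Age group",
                ylab = "Frequency",
                main = "Bar chart of grouped ages")

# The tallest bar identifies the age group with the most people
names(freq)[which.max(freq)]
```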


    Histogram (for continuous numerical data)

    Graph of the frequency distribution of continuous data

    Suppose we are given the ages of 507 people in continuous form (now age is not reported in whole years and can take any value on the real line)

    We draw a histogram instead of a bar chart

    Similar to the bar chart for numerical data except that there are no gaps between the bars

    When the class intervals are equal, the height of each rectangle represents the class frequency, so the area of the histogram is proportional to the total frequency

    If the class intervals are not equal, the height represents the relative frequency density (= relative class frequency / class width), so that the total area enclosed by the histogram = 1
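
    The unequal-interval case can be checked numerically. The sketch below uses a small made-up vector of continuous ages and made-up break points (neither is from the course data) and compares the hand-computed heights with the density that hist() reports.

```r
# Hypothetical continuous ages and unequal class limits (illustration only)
age <- c(18.2, 19.5, 21.1, 22.8, 23.4, 24.9, 26.3,
         28.7, 31.0, 35.6, 41.2, 48.9, 55.3, 66.1)
breaks <- c(17, 22, 27, 37, 67)

# Compute the histogram without drawing it
h <- hist(age, breaks = breaks, plot = FALSE)

widths   <- diff(h$breaks)              # class widths: 5, 5, 10, 30
rel.freq <- h$counts / sum(h$counts)    # relative class frequencies
density  <- rel.freq / widths           # height of each rectangle

# The heights agree with hist()'s own density, and the areas add up to 1
all.equal(density, h$density)
sum(density * widths)

# Draw the histogram on the density scale
plot(h, freq = FALSE, main = "Histogram with unequal class intervals", xlab = "age")
```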


    Inferences

    Maximum concentration is in the age group 20-25

    Gives an idea about the shape of the distribution: for example, we can say that the distribution of ages is not symmetric; it is highly right skewed (see module on Skewness and Kurtosis)

    Extent of spread or variation (see module on Dispersion)

    What is bin width? Bin width refers to the length of each class interval.

    How to choose bin width? Well, R chose a bin width for you!

    R's default choice of bins is based on Sturges' rule

    Some other rules of thumb: Doane's formula, Rice rule, Scott's rule, Freedman-Diaconis rule

    All you need to do is specify the breaks= option in R (see histogram in the R code slide)

    For more details on these rules: http://en.wikipedia.org/wiki/Histogram
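
    A few of these rules can be compared in R. The sketch below uses simulated right-skewed values as a stand-in for the 507 continuous ages (an assumption; it is not the course data). nclass.Sturges(), nclass.scott() and nclass.FD() are base R functions; the Rice rule is computed by hand as ceiling(2 * n^(1/3)).

```r
set.seed(1)
# Simulated, right-skewed stand-in for the 507 ages (not the course data)
age <- 17 + 50 * rbeta(507, 2, 5)
n <- length(age)

# Suggested number of classes under each rule
rice <- ceiling(2 * n^(1/3))
c(Sturges = nclass.Sturges(age),
  Rice    = rice,
  Scott   = nclass.scott(age),
  FD      = nclass.FD(age))

# hist() accepts a rule name in breaks=; the default is Sturges
h.default <- hist(age, plot = FALSE)
h.scott   <- hist(age, breaks = "Scott", plot = FALSE)
h.fd      <- hist(age, breaks = "FD", plot = FALSE)

# Number of bins actually used under each choice
c(default = length(h.default$counts),
  Scott   = length(h.scott$counts),
  FD      = length(h.fd$counts))
```

    hist() treats the suggested number of classes as a hint and rounds to convenient break points, so the number of bins drawn can differ from the number the rule returns.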


    Frequency Polygon (for representing continuous data)

    A frequency polygon is formed by plotting the frequencies of each class against their midpoints and joining the points by straight lines

    To get a closed polygon, we take two additional classes, one at each end, that have zero frequencies (the midpoints corresponding to these classes thus have zero frequencies)

    Basically, if superimposed on a histogram, it joins the midpoints of the tops of the rectangular bars by straight line segments

    We draw the frequency polygon for the ages data over the histogram itself
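
    The construction can be sketched using the midpoints and counts that hist() computes. The ages below are made up for illustration (not the course data); the class limits 17 to 67 in steps of 10 follow the R code slides.

```r
# Hypothetical continuous ages (illustration only)
age <- c(18.2, 19.5, 21.1, 22.8, 23.4, 24.9, 26.3,
         28.7, 31.0, 35.6, 41.2, 48.9, 55.3, 66.1)
width <- 10
breaks <- seq(17, 67, by = width)

# Class midpoints and class frequencies
h <- hist(age, breaks = breaks, plot = FALSE)

# Add one zero-frequency class at each end to close the polygon
x <- c(min(h$mids) - width, h$mids, max(h$mids) + width)
y <- c(0, h$counts, 0)

# Histogram first, then the polygon drawn over it
plot(h, xlim = c(10, 75), main = "Histogram with frequency polygon", xlab = "age")
lines(x, y, type = "b", lwd = 2)
```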


    Inferences

    But is there any additional information you can derive from a frequency polygon, over and above what the histogram gives?

    Not really! In fact the histogram gives more information: it shows the entire class intervals, while a frequency polygon only shows the midpoints. To appreciate this fully, look at a frequency polygon without the corresponding histogram.

    In the construction we have made a simplification by plotting the class frequency against the midpoint of the class interval, thereby losing more information


    Why use Frequency Polygon?

    For comparing two sets of data, the corresponding frequency polygons can be drawn on the same graph

    Drawing two histograms on the same diagram for comparison purposes is confusing

    The insurance company is looking at the profitability of investing in two regions: region A and region B.

    The region with a higher proportion of 50-plus population demands more insurance.

    The ages.both.regions.csv data gives the ages of a random sample of 507 people in each of region A and region B

    Draw two histograms on the same diagram and try to compare:


    Why use Frequency Polygon

    Q. What can you infer? Practically nothing, right!


    Why Frequency Polygon (Contd)

    Draw two frequency polygons on the same diagram and compare.

    What can you infer? Can you infer better?

    Which region should have the higher insurance demand?


    Ogives: Cumulative Frequency Curves

    Now suppose the insurance company wants answers to more particular questions:

    In region A, how many people are 50 years or more?

    In region A, how many people are 20 years or less?

    In region A, how many people are 60 years or more?

    (Similar questions for region B)

    It wants to design separate products for the age groups 20 or less, 20-50, and 50 and above, and a few additional schemes for 60-plus people

    Clearly, it needs to know the cumulative frequency for each age group!


    Ogives for Region A

    A cumulative frequency curve or ogive is obtained by plotting cumulative, rather than individual, class frequencies. There may be two types of ogives:

    A curve showing the number of observations equal to or greater than the lower class limit of each corresponding class, referred to as the more than type ogive

    A curve showing the number of observations equal to or less than the upper class limit of each corresponding class, referred to as the less than type ogive

    Successive points are joined by line segments to give the ogive
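
    Both kinds of cumulative frequency can be sketched with cumsum(). The class limits and frequencies below are the four whole-year classes from the frequency table slide, used here only to illustrate the arithmetic; the Region A ogives on the slides use narrower classes.

```r
# Class limits and frequencies from the frequency table slide
lower <- c(17, 30, 43, 56)
upper <- c(29, 42, 55, 68)
freq  <- c(298, 142, 56, 10)

# Less than type: observations at or below each upper class limit
less.than <- cumsum(freq)

# More than type: observations at or above each lower class limit
more.than <- rev(cumsum(rev(freq)))

cbind(lower, upper, freq, less.than, more.than)

# Plot both ogives on the same axes
plot(upper, less.than, type = "b", xlim = c(15, 70), ylim = c(0, sum(freq)),
     xlab = "Age", ylab = "Cumulative frequency", main = "Ogives")
lines(lower, more.than, type = "b", col = "purple")
```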


    Ogives- For Region A

    The black plot gives the less than type Ogive

    The purple plot gives the more than type Ogive

    From the diagram the insurance company readily has the answers. From the less than type ogive, observe that there are 114 people aged less than 22 (around 100 people aged less than 20) and 491 people aged less than 52 (so a little under 500 people aged less than 50)

    From the more than type ogive, we infer there are very few people above 60, something around 5

    Q. Draw the ogives for Region B and try to answer the above questions. Compare with the age distribution for region A


    Line Diagram

    The ages data is an example of cross section data.

    Use any of the above diagrams depending on the nature of the cross section data.

    But what if we are given time series data: a series of observations, one for each time point?

    For example, consider the following data:

    Year    Households with computer (%)
    1985    8.2
    1990    15
    1994    22.8
    1995    24.1
    1998    36.6
    1999    42.1
    2000    51

    Source: Falling Through The Net: Toward Digital Inclusion (U.S. Department of Commerce, October 2000)

    How to represent this graphically?

    Need to represent each value corresponding to each given year


    Line Chart

    Plot years on the horizontal axis and mark the values corresponding to each year on the vertical axis

    Join the points by line segments. We have our line graph ready!

    Think: Can we construct a histogram, ogive or bar chart with this data? Why or why not?

    A line diagram is meant for representing chronological data. It exhibits the relationship of the variable with time.


    Line Chart: Inferences

    Shows an increasing trend over the years: from 1985 to 2000, the percentage of households with computers rises consistently

    From under 10% in 1985 it has crossed 50% in 2000, an increase of over 500% (from 8.2% to 51%)

    Useful for analysing the time trend, that is, the long-term movement of time series data
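
    The size of the increase can be checked directly from the table on the Line Diagram slide. The variable names below are chosen here for illustration.

```r
# Values from the Line Diagram slide
year <- c(1985, 1990, 1994, 1995, 1998, 1999, 2000)
pct  <- c(8.2, 15, 22.8, 24.1, 36.6, 42.1, 51)

first <- pct[year == 1985]
last  <- pct[year == 2000]

# Relative increase from 1985 to 2000, in percent
pct.increase <- (last - first) / first * 100
round(pct.increase)

# TRUE when every value is higher than the one before it
all(diff(pct) > 0)
```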


    Graphical Distortion of Data

    As much as graphs can be used to summarize and represent various aspects of data succinctly, they can also be used to distort data

    The first kind of distortion is inadequate representation of data. Consider the following line graph showing the population above the poverty line of a hypothetical country A:

    Seeing this graph, we conclude that poverty has been falling in this country as the number of people above the poverty line is rising.

    [Figure: line graph of people above poverty line, 1990 to 2010; vertical axis from 0 to 1200]


    Graphical Distortion: Continued

    But this graph used inadequate information. This is the table from which the graph has been produced

    Draw a line chart showing the relative share of people above the poverty line

    We see that the relative share of people above the poverty line is actually declining, and thus the relative share of people below the poverty line is actually rising

    Our earlier conclusion, based on an inadequate representation of the data, was fallacious

    [Figure: line chart of relative share of people above poverty line, 1990 to 2010; vertical axis from 0 to 0.8]
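
    The effect can be reproduced with made-up numbers. The counts and totals below are hypothetical (the table behind the slide's graphs is not reproduced in this transcript); they are chosen only so that the absolute count rises while the relative share falls.

```r
# Hypothetical figures for illustration only (not the slide's table)
year  <- c(1990, 1995, 2000, 2005, 2010)
above <- c(400, 550, 700, 850, 1000)     # people above the poverty line
total <- c(600, 900, 1300, 1800, 2400)   # total population

# Relative share of people above the poverty line
share <- above / total
round(share, 2)

# Side by side: the count rises while the share declines
par(mfrow = c(1, 2))
plot(year, above, type = "b", ylim = c(0, 1200),
     xlab = "Year", ylab = "People above poverty line")
plot(year, share, type = "b", ylim = c(0, 0.8),
     xlab = "Year", ylab = "Relative share above poverty line")
```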


    Graphical Distortion of Data (Contd.)

    The above is just an example. There are numerous ways in which data can be misrepresented

    For example, one common misuse is distortion with scale

    With the explosion of data visualization techniques and sophisticated displays like 3-D charts, data distortion can be easier to achieve

    For more information read: http://lilt.ilstu.edu/gmklass/pos138/datadisplay/badchart.htm


    R Codes

    Histogram

    # Read the continuous ages data (the file is expected to have a column named age)

    data=read.csv('ages.continuous.csv',header=TRUE,sep=',')

    View(data)

    age=data$age

    max(age)

    colors=c("red", "bisque", "darkslategray", "violet", "orange", "blue", "pink", "cyan", "brown", "cornsilk")

    # hist draws a histogram; right=TRUE means right-closed, left-open intervals

    hist(age,right=TRUE,col=colors)

    # To specify bin widths on your own

    bins=seq(17,67,by=5)

    hist(age,right=TRUE,breaks=bins,col=colors)

    # Example of histogram with too small bin width

    bins=seq(17,67,by=2)

    hist(age,right=TRUE,breaks=bins,col=colors)

    # Example of histogram with too large bin width

    bins=seq(17,67,by=25)

    hist(age,right=TRUE,breaks=bins,col=colors)

    # Drawing a frequency polygon over a histogram

    bins=seq(17,67,by=10)

    hist(age,right=TRUE,breaks=bins,col=colors,xlim=c(10,75)) # draw the histogram

    # draw the frequency polygon: class midpoints, with a zero-frequency class added at each end

    lines(c(12,seq(22,62,by=10),72),c(0,as.vector(table(cut(age,bins))),0),lwd=2)


    R Codes

    Frequency Polygon

    # Read the ages for both regions (assumed to be the ages.both.regions.csv file named on the earlier slide)

    data=read.csv('ages.both.regions.csv',header=TRUE,sep=',')

    RegionA.age=data$RegionA.age

    RegionB.age=data$RegionB.age

    max(RegionA.age)

    min(RegionA.age)

    max(RegionB.age)

    min(RegionB.age)

    bins.A=seq(17,67,by=10)

    bins.B=seq(15,75,by=10)

    # To draw two frequency polygons on the same graph

    plot(c(12,seq(22,62,by=10),72),c(0,as.vector(table(cut(RegionA.age,bins.A))),0),type="b",main="Frequency distribution of age",xlab="age",ylab="frequency",xlim=c(10,80),ylim=c(0,270))

    lines(c(10,seq(20,70,by=10),80),c(0,as.vector(table(cut(RegionB.age,bins.B))),0),lwd=2,col="violet")

    Line Chart

    data=read.csv('Households with computer.csv',header=TRUE,sep=',')

    household.comp=data$Households.with.computer.percentage

    Year=data$Year

    # Plot the percentages against the years and join the points

    plot(Year,household.comp,type="b",col="blue",xlab="Year",ylab="Percentage of Households with Computer",xlim=c(1985,2000),ylim=c(5,65))

    title("Line Chart")


    R Codes

    Ogives

    # data is assumed to hold the ages in its first column

    min(data[,1])

    max(data[,1])

    NumberOfClasses = 10

    ClassInterval = (67 - 17)/NumberOfClasses

    ClassInterval

    ClassEnds = seq(17,67,ClassInterval)

    classes=cut(data[,1], breaks=ClassEnds)

    FrequencyDistribution = table(classes)

    CumulativeFrequencies = cumsum(FrequencyDistribution)

    cbind(CumulativeFrequencies)

    # Less than type Ogive: cumulative frequency at each upper class limit, starting from 0 at 17

    plot(ClassEnds,c(0,as.vector(CumulativeFrequencies)),type="b",xlim=c(10,70),ylim=c(0,700),main="Ogives",xlab="ClassIntervals",ylab="Cumulative Frequency of Age")

    # More than type Ogive

    cbind(FrequencyDistribution)

    Frequency=as.vector(FrequencyDistribution)

    More.than.cum.freq=cumsum(rev(Frequency))

    # ClassEnds reversed (67 down to 17), so each cumulative count is plotted at the lower limit of its class

    Upper.limit=rev(ClassEnds)

    lines(Upper.limit,c(0,More.than.cum.freq),type="b",col="violet")


    Thank you
