comparing ca

Upload: need-moar-sleep

Post on 14-Apr-2018


  • 7/27/2019 Comparing CA

    1/2

    Larry Zhang

    AP Statistics, Mr. Thill

    Case Study: Comparing Two Groups

    In this case study, I compare 500 high school students from New York and California, from whom

    various data were collected in 2013. The students' travel times to school (in minutes) were compared

    by state. Data were obtained via the CensusAtSchool random sampling form

    (http://www.amstat.org/censusatschool/RandomSampleForm.cfm).

    To the left is a dot plot of the two groups.

    Note that there was an outlier in the New York

    group of 11,000 minutes, more time than in an

    entire day, which is obviously impossible. This

    outlier was removed. Additionally, 10 students

    from California neglected to answer this question,

    as did 3 from New York, leaving a total of 334 cases

    from California and 152 from New York.
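
The cleaning step described above can be sketched in Python. This is only an illustration, not the procedure actually used for the study: the `raw_times` list is invented sample data, and the cutoff of 1,440 minutes (the length of one day) is an assumption drawn from the reasoning that a commute cannot last longer than a day.

```python
# Hypothetical raw survey answers in minutes; None marks an unanswered question.
# These values are invented for illustration and are not the study's data.
raw_times = [15, 30, None, 11000, 45, 5, None, 20]

MINUTES_PER_DAY = 24 * 60  # assumed cutoff: a commute cannot exceed one day


def clean_times(times, max_minutes=MINUTES_PER_DAY):
    """Drop unanswered entries and physically impossible travel times."""
    return [t for t in times if t is not None and 0 <= t <= max_minutes]


cleaned = clean_times(raw_times)
print(cleaned)  # [15, 30, 45, 5, 20]
```

Filtering on an explicit rule, rather than deleting rows by hand, keeps the case counts easy to re-derive if the sample is drawn again.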

    It is quite evident that the New York group

    (bottom) is more spread out and centered farther

    to the right than the California group (top). The

    average time required to commute to school for

    the California group is 20.6 minutes, and 40.2 for

    the New York group, with standard deviations of

    15.2 and 22.9 minutes respectively. The IQRs of 20 and

    22.5 minutes, though, suggest that, while the

    overall spread is large, the middle half of the

    students in each group cover approximately the same range of commute times,

    only centered at different values. However, the students who live far away from their schools

    have a longer commute in New York than they do in California.
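
The summary statistics compared above (mean, standard deviation, IQR) could be computed along these lines. This is a rough sketch using Python's standard `statistics` module; the `sample` list is invented and is not the study's data, and the quartile method used by `statistics.quantiles` may differ slightly from the one used by the software behind the figures in this report.

```python
import statistics

# Hypothetical commute times in minutes (invented, not the study's data).
sample = [5, 10, 10, 15, 15, 20, 25, 30, 45, 90]

mean = statistics.mean(sample)
median = statistics.median(sample)
sd = statistics.stdev(sample)  # sample standard deviation

# statistics.quantiles with n=4 returns the three quartile cut points.
q1, _, q3 = statistics.quantiles(sample, n=4)
iqr = q3 - q1

print(f"mean={mean}, median={median}, sd={sd:.1f}, IQR={iqr}")
```

In a sample like this one, a single very long commute pulls the mean and standard deviation upward while leaving the median and IQR largely unaffected, which is the pattern the paragraph above describes.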

    Case #24, an individual from California, had an average travel time of 28 minutes. With the

    mean and standard deviation for Californians

    being 20.6 minutes and 15.2 minutes

    respectively, this individual's z-score is

    (28 - 20.6)/15.2 = 0.487. This means that she is 0.487

    standard deviations (about 7.4 minutes) above the

    mean commute time of 20.6 minutes for all

    California resident cases in this study.
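
The z-score arithmetic above can be checked with a few lines of Python. The function name `z_score` is just a label chosen for this sketch; the numbers are the ones quoted in the text.

```python
def z_score(value, mean, sd):
    """Number of standard deviations that value lies above the mean."""
    return (value - mean) / sd


# Figures quoted in the text for Case #24 and the California group.
z = z_score(28, 20.6, 15.2)
minutes_above_mean = z * 15.2

print(round(z, 3))                   # 0.487
print(round(minutes_above_mean, 1))  # 7.4
```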

    A percentile plot of all Californians vs

    their commute time shows this individual at the

    71.1st percentile. This means that this individual

    has a commute time longer than 71.1% of all

    California student cases in this study.
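
As a rough sketch of how such a percentile is read off the data, the function below counts the share of cases with a strictly shorter commute. The `sample` list is invented for illustration, so the result does not reproduce the 71.1% figure, and "strictly less than" is one of several conventions for defining a percentile.

```python
def percentile_rank(values, x):
    """Percentage of values strictly less than x."""
    below = sum(1 for v in values if v < x)
    return 100 * below / len(values)


# Hypothetical commute times in minutes (invented, not the study's data).
sample = [5, 10, 15, 15, 20, 28, 30, 45, 60, 90]

print(percentile_rank(sample, 28))  # 50.0
```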

    Were a normal model used to approximate the California group, it would predict a percentile

    corresponding to a commute time of 28 minutes (z-score of 0.487) of normalcdf(-∞, 0.487) = 68.7%. This

    means that the normal model predicts that about 68.7% of students would have a commute time lower

    than that of the individual with a z-score of 0.487, or a commute time of 28 minutes.
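
The calculator's normalcdf result can be reproduced, as a sketch, with `NormalDist` from Python's standard `statistics` module. Only the mean, standard deviation, and commute time quoted in the text are used; the variable names are chosen for this example.

```python
from statistics import NormalDist

# Mean, standard deviation, and commute time quoted in the text.
z = (28 - 20.6) / 15.2

# Area under the standard normal curve to the left of z,
# the equivalent of normalcdf(-infinity, z) on a calculator.
p_below = NormalDist().cdf(z)

print(f"{p_below:.1%}")  # 68.7%
```

The same value results from `NormalDist(mu=20.6, sigma=15.2).cdf(28)`, which skips the explicit standardization step.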

    A normal model actually does approximate the majority of this group, to a certain degree. It

    doesn't exactly fit the group down to every case, but neither can it be thrown out as a possibility. We

    first look at the normal quantile plot.

    Looking at the entire group shows us

    immediately that a normal model is nowhere near

    appropriate. However, eliminating a few

    outliers on the edges of the group gives us the

    next graph.

    In this trimmed group, a normal model

    fits the data much better. Note that the large

    group at about 5 minutes is to be expected, as

    we generally divide shorter periods of time into 5

    and 15 minute pieces; students with a short commute time most likely psychologically

    gravitate towards answering with a simple 5

    minutes instead of trying to figure out the actual

    time. That being said, this is guesswork.

    However, it cannot be ignored that this normal

    quantile plot of the trimmed group exhibits a

    distinct normal-like behavior.
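
For readers who want to see what a normal quantile plot is built from, here is a minimal sketch. It pairs each sorted observation with the z-score a normal model would expect at that position; points falling near a straight line suggest normal-like behavior. The `sample` list is invented, and the plotting position (i - 0.5)/n is one common convention, assumed here rather than taken from the software used for the graphs in this report.

```python
from statistics import NormalDist


def normal_quantile_points(values):
    """Pair each sorted value with its expected standard normal quantile."""
    ordered = sorted(values)
    n = len(ordered)
    std_normal = NormalDist()
    # (i - 0.5) / n is an assumed plotting-position convention.
    return [
        (std_normal.inv_cdf((i - 0.5) / n), x)
        for i, x in enumerate(ordered, start=1)
    ]


# Hypothetical commute times in minutes (invented, not the study's data).
sample = [5, 5, 10, 15, 15, 20, 25, 30, 40, 45]
points = normal_quantile_points(sample)

for expected_z, minutes in points:
    print(f"{expected_z:6.2f}  {minutes}")
```

Repeated values such as the two 5-minute answers show up as short horizontal runs, similar to the cluster at about 5 minutes noted above.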

    Looking at summary statistics, we see that the mean of the California group is distinctly different

    from its median (20.6 as compared to 15). Again, though, this is most likely due to the large group of

    students with long commute times at the edge of the group. In a normal model, though, the mean is equal to the median; this isn't even close to being true in the general California group. However, looking

    at the percentiles calculated in part 2b and 3 (71.1% actual vs. 68.7% predicted by the normal model), again we see that

    a normal model (almost) accurately describes part of the graph.

    Therefore, the California group is NOT described well by a normal model in general. However, if

    some data points at the higher end of the group are trimmed off, the group is now described

    moderately well by the normal model.