Big Data: Data Analysis Boot Camp, Visualizing the Iris Dataset

Chuck Cartledge, PhD
23 September 2017


  • 1/31

    Introduction Histograms Scatter plots Box plots Outliers Hands-on Q & A Conclusion References Files

    Big Data: Data Analysis Boot Camp

    Visualizing the Iris Dataset

    Chuck Cartledge, PhD

    23 September 2017

  • 2/31

    Table of contents (1 of 1)

    1 Introduction

    2 Histograms

    3 Scatter plots

    4 Box plots

    5 Outliers

    6 Hands-on

    7 Q & A

    8 Conclusion

    9 References

    10 Files

  • 3/31

    What are we going to cover?

    We're going to:

    Visually explore the iris dataset.

    See how messy data can affect the presentation.

  • 4/31

    Background

    What is a histogram?

    Consider a series of rectangles on equal base c and whose heights are respectively the successive terms of the binomial $(p + q)^n c$, where $p + q = 1$.

    K. Pearson [2]

    Image from [2].

  • 5/31

    Background

    In layman's terms:

    1 Take the range of data and divide into equal range bins

    2 Count the number of pieces of data (frequency) in each range bin

    3 Plot the count vs. the range bin

    Image from [2].
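The three steps can be carried out by hand in R before handing the job to `hist()`. This is a minimal sketch, assuming the built-in iris data and an arbitrary choice of 8 equal-width bins; the bin count and the variable names are illustrative and do not come from the slides.

```r
# Manual histogram of iris sepal widths, following the three steps.
data(iris)
x <- iris$Sepal.Width

# Step 1: divide the range of the data into equal-width bins.
n_bins <- 8                       # assumed bin count, chosen for illustration
breaks <- seq(min(x), max(x), length.out = n_bins + 1)

# Step 2: count the observations that fall in each bin.
bins   <- cut(x, breaks = breaks, include.lowest = TRUE)
counts <- table(bins)

# Step 3: plot the count against the bin.
barplot(counts, las = 2, ylab = "Frequency")
```

`hist()` does the same binning and counting in one call and picks the number of bins itself unless `breaks` is supplied.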

  • 6/31

    Iris data

    Looking at sepal widths

    data(iris)

    hist(iris$Sepal.Width)

  • 7/31

    Iris data

    Same image.

  • 8/31

    Iris data

    An annotated look at sepal widths

    1 Compute the mean and standard deviation

    2 Add a normal (a.k.a. Gaussian) distribution curve

    3 Add 3 vertical lines

    Attached file.
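The attached script is not reproduced in this transcript. A minimal sketch of the three annotation steps might look like the following, assuming the three vertical lines mark the mean and one standard deviation on either side of it; that placement, the colours, and the variable names are assumptions.

```r
# Annotated histogram of iris sepal widths (illustrative sketch).
data(iris)
w <- iris$Sepal.Width

# Step 1: compute the mean and standard deviation.
m <- mean(w)
s <- sd(w)

# Plot densities rather than counts so the curve shares the y-axis scale.
hist(w, freq = FALSE, main = "Iris sepal width", xlab = "Sepal width (cm)")

# Step 2: overlay the normal curve with the same mean and standard deviation.
curve(dnorm(x, mean = m, sd = s), add = TRUE, col = "blue", lwd = 2)

# Step 3: add three vertical lines (assumed: mean and mean +/- 1 sd).
abline(v = c(m - s, m, m + s), col = "red", lty = c(2, 1, 2))
```

Using `freq = FALSE` matters here: with raw counts on the y-axis the density curve would sit almost flat along the bottom of the plot.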

  • 9/31

    Iris data

    Same image.

    Attached file.

  • 10/31

    Iris data

    What is a normal distribution?

    1 Ideas and base equations attributed to Carl Friedrich Gauss and Abraham de Moivre [1] (de Moivre is more general than Gauss)

    2 Based on the idea of a central value $\mu$ and a variation from that value $\sigma^2$

    3 Equation:

    $$P(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

    4 The probability of x is dependent on $\mu$ and $\sigma$
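To see the dependence on the mean and standard deviation concretely, the usual normal density can be typed in directly and compared with R's built-in `dnorm()`. This is a sketch; the helper name `normal_pdf` and the choice of sepal width as the example variable are assumptions made for illustration.

```r
# Evaluate the normal density from its formula and compare with dnorm().
normal_pdf <- function(x, mu, sigma) {
  1 / (sigma * sqrt(2 * pi)) * exp(-(x - mu)^2 / (2 * sigma^2))
}

data(iris)
mu    <- mean(iris$Sepal.Width)
sigma <- sd(iris$Sepal.Width)

x <- seq(2, 4.5, by = 0.5)
cbind(x, by_hand = normal_pdf(x, mu, sigma), built_in = dnorm(x, mu, sigma))
```

The two columns should agree to floating-point precision, which is a quick way to confirm the formula has been entered correctly.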

  • 11/31

    Iris data

    Sigmas ($\sigma$) are important

    Likelihood that a value exists within a range, based on a normal distribution:

    Range ($\pm\sigma$)    Expected population within range (%)
    0.5                    38.29
    1.0                    68.26
    1.5                    86.63
    2.0                    95.44
    2.5                    98.75
    3.0                    99.73
    3.5                    99.95
    4.0                    99.99

    https://en.wikipedia.org/wiki/68%E2%80%9395%E2%80%9399.7_rule
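These percentages can be recomputed from the normal cumulative distribution function, `pnorm()`. A minimal sketch, with illustrative variable names:

```r
# Share of a normal population within +/- k standard deviations of the mean.
k <- seq(0.5, 4, by = 0.5)
within <- pnorm(k) - pnorm(-k)
data.frame(sigmas = k, percent = round(100 * within, 2))
```

The slide's figures appear to be truncated rather than rounded to two decimals, so the last digit of some rows can differ by one from what `round()` prints.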

  • 12/31

    Iris data

    Choose any two values and see what they look like

    Part of the data exploration toolset

    Used to visually identify, or verify, correlation between attributes

    library(ggplot2)
    qplot(Petal.Width, Sepal.Width, data=iris, colour=Species, size=I(4))
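To put a number on what the scatter plot suggests, one option is the Pearson correlation coefficient. The sketch below swaps in base graphics and `cor()` for ggplot2 so that it runs without extra packages; the variable names are illustrative.

```r
data(iris)

# Base-graphics scatter plot of the same two attributes, coloured by species.
plot(iris$Petal.Width, iris$Sepal.Width,
     col = iris$Species, pch = 19,
     xlab = "Petal width (cm)", ylab = "Sepal width (cm)")
legend("topright", legend = levels(iris$Species),
       col = seq_along(levels(iris$Species)), pch = 19)

# Pearson correlation over all 150 flowers, then within each species.
r_all <- cor(iris$Petal.Width, iris$Sepal.Width)
r_by_species <- sapply(split(iris, iris$Species),
                       function(d) cor(d$Petal.Width, d$Sepal.Width))
r_all
r_by_species
```

Comparing the overall value with the per-species values is a quick check on whether grouping by species changes the picture.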

  • 13/31

    Iris data

    Sometimes you don't know which attribute to choose

    2D plots are easy to understand

    3D plots are harder to understand

    > 3D requires special training

    How to choose which attributes are interesting?

    pairs(~Sepal.Length+Sepal.Width+Petal.Length+Petal.Width,

    data=iris)
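Alongside the `pairs()` grid, a correlation matrix gives a numeric ranking of the attribute pairs. This is a sketch of one way to do that; ranking by absolute Pearson correlation is an assumption about what "interesting" means, and the variable names are illustrative.

```r
data(iris)

# Correlation matrix of the four numeric attributes.
cm <- cor(iris[, 1:4])
round(cm, 2)

# List each attribute pair once, strongest absolute correlation first.
idx <- which(upper.tri(cm), arr.ind = TRUE)
pairs_ranked <- data.frame(
  attr1 = rownames(cm)[idx[, "row"]],
  attr2 = colnames(cm)[idx[, "col"]],
  r     = cm[idx]
)
pairs_ranked <- pairs_ranked[order(-abs(pairs_ranked$r)), ]
pairs_ranked
```

A high correlation flags a pair worth plotting, but it does not say whether the pair separates the species, so the scatter plots are still needed.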

  • 14/31

    Iris data

    Sometimes there are new insights

    library(ggplot2); qplot(Petal.Width, Petal.Length, data=iris, colour=Species, size=I(4))

  • 15/31

    Iris data

    Background

    Some terminology

    Quartiles of a ranked set of data values divide the data set into four equal groups, each group comprising a quarter of the data

    Q1: splits off the lowest 25% of data from the highest 75%

    Q2: cuts the dataset in half (median)

    Q3: splits off the highest 25% of data from the lowest 75%

    IQR: Interquartile range = Q3 - Q1

    Lower fence = Q1 - 1.5 × IQR

    Upper fence = Q3 + 1.5 × IQR

    Image from [4].
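The quartiles, interquartile range, and fences can be computed directly. The sketch below assumes versicolor petal widths as the example and R's default quantile method (type 7); other quantile definitions can give slightly different quartiles. Variable names are illustrative.

```r
data(iris)
x <- iris$Petal.Width[iris$Species == "versicolor"]

q   <- quantile(x, probs = c(0.25, 0.50, 0.75))   # Q1, Q2 (median), Q3
iqr <- unname(q[3] - q[1])                        # interquartile range
lower_fence <- unname(q[1] - 1.5 * iqr)
upper_fence <- unname(q[3] + 1.5 * iqr)

q
c(IQR = iqr, lower = lower_fence, upper = upper_fence)

# Observations outside the fences are the usual box-plot outlier candidates.
x[x < lower_fence | x > upper_fence]
```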

  • 16/31

    Iris data

    Box plot visual

    Image from [4].

  • 17/31

    Iris data

    Looking at versicolor petal widths

    par(mfrow=c(2,1))

    data
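The slide's code is cut off in this transcript. As a stand-in, here is a minimal sketch of a stacked histogram and box plot for versicolor petal widths. It is not the original script: the variable name `pw`, the titles, and the panel order are all assumptions.

```r
data(iris)
pw <- iris$Petal.Width[iris$Species == "versicolor"]

old_par <- par(mfrow = c(2, 1))        # two plots stacked vertically

# Top panel: histogram with rug() added along the x-axis.
hist(pw, main = "Versicolor petal width", xlab = "Petal width (cm)")
rug(pw)

# Bottom panel: horizontal box plot of the same values.
boxplot(pw, horizontal = TRUE, xlab = "Petal width (cm)")

par(old_par)                           # restore the previous layout
```

Changing `"versicolor"` to another species name in the second line is enough to redraw both panels for that species.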

  • 18/31

    Iris data

    Same image.

  • 19/31

    Iris data

    Looking at setosa petal widths

    Change one line, and replot.

    data

  • 20/31

    Iris data

    Same image.

  • 21/31

    iris data

    What are outliers?

    A definition:

    Outliers are observations that do not follow the pattern of the majority of the data. Outliers in a multivariate point cloud can be hard to detect, especially when the dimension p exceeds 2, because then we can no longer rely on visual perception.

    Rousseeuw and Van Zomeren [3]

    The difficulty is defining a pattern in the data, and then defining what it means to not follow the pattern.
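For a single variable, one common way to make "pattern" and "not following it" concrete is the box-plot rule: anything more than 1.5 times the interquartile range beyond the box is flagged. The sketch below applies that rule to setosa petal widths with `boxplot.stats()`. It is one possible definition, chosen here for illustration, not the only one.

```r
data(iris)
pw <- iris$Petal.Width[iris$Species == "setosa"]

# boxplot.stats() flags points beyond coef * IQR from the hinges,
# the same rule boxplot() uses to draw individual outlier points.
stats <- boxplot.stats(pw, coef = 1.5)
stats$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
stats$out     # values treated as outliers under this rule
```

Raising or lowering `coef` changes which points are flagged, which shows how much the answer depends on the definition chosen.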

  • 22/31

    iris data

    Bringing things together.

    Ideas that come together into simple visualizations:

    1 Density plot (curved lines on the previous histograms)

    2 Density plots that may not be normal or Gaussian in shape

    3 Box plots showing where the bulk of the data points are

    4 Outliers are points that don't fit a pattern

    library(ggplot2)
    ggplot(iris, aes(x=Species, y=Sepal.Width)) + geom_violin(trim=FALSE) + geom_boxplot(width=0.1)

    Change the y value.
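The outline of a violin plot is a kernel density estimate. To look at one on its own and compare it with a normal curve, a base R sketch such as the following may help; the choice of sepal length and the variable names are assumptions.

```r
data(iris)
sl <- iris$Sepal.Length

# Kernel density estimate: a smoothed view of where the data actually lie.
d <- density(sl)

plot(d, main = "Iris sepal length: estimated vs. normal density",
     xlab = "Sepal length (cm)")

# Normal curve with the same mean and standard deviation, for comparison.
curve(dnorm(x, mean = mean(sl), sd = sd(sl)), add = TRUE, lty = 2)
legend("topright", legend = c("kernel density", "normal"), lty = c(1, 2))
```

Where the solid and dashed curves part ways, the data depart from a Gaussian shape.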

  • 23/31

    iris data

    Violin plot of iris Sepal Width

  • 24/31

    iris data

    Violin plot of iris Sepal Length

  • 25/31

    iris data

    Violin plot of iris Petal Width

  • 26/31

    iris data

    Violin plot of iris Petal Length

  • 27/31

    Some simple exercises to get familiar with data visualization

    1 Does a histogram of iris petal length support a normal distribution (qualitative vice quantitative)?

    2 What does the scatterplot on page 12 say about using widths as a classification criteria?

    3 Which combination of sepal and petal, lengths and widths is best?

    4 What is the purpose of the rug function on page 17?

    5 How do the other iris species' petal lengths compare to Versicolor on page 17?

    6 Create a geom_violin plot of the built-in mtcars dataset that shows the relationship between number of gears and mpg

  • 28/31

    Q & A time.

    Q: What's Dr. Presume's full name?

    A: Dr. Livingston I. Presume.

  • 29/31

    What have we covered?

    Basic ways to visualize data

    1 Histograms

    2 Scatter plots

    3 Box plots

    Outliers and how they can affect data

    Looked at iris data using basic plotting functions and a little of the ggplot library

    Next: LPAR Chapter 3, data visualization with Lattice

  • 30/31

    References (1 of 1)

    [1] Abraham De Moivre, The doctrine of chances: or, a method of calculating the probabilities of events in play, vol. 1, Chelsea Publishing Company, 1756.

    [2] Karl Pearson, Contributions to the Mathematical Theory of Evolution, Philosophical Transactions of the Royal Society of London. A 185 (1894), 71-110.

    [3] Peter J Rousseeuw and Bert C Van Zomeren, Unmasking Multivariate Outliers and Leverage Points, Journal of the American Statistical Association 85 (1990), no. 411, 633-639.

    [4] Wiki Staff, Quartile, https://en.wikipedia.org/wiki/Quartile, 2017.

  • 31/31

    Files of interest

    1 Create annotated iris histogram

    2 Calculus to derive equation for normal distribution

    3 YouTube video about deriving the normal distribution: https://www.youtube.com/watch?v=ebewBjZmZTw

    4 R library script file

    rm(list=ls())

    main

  • The Normal Distribution: A derivation from basic principles

    Dan Teague

    The North Carolina School of Science and Mathematics

    Introduction

    Students in elementary calculus, statistics, and finite mathematics classes often learn about the normal curve and how to determine probabilities of events using a table for the standard normal probability density function. The calculus students can work directly with the normal probability density function $p(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$ and use numerical integration techniques to compute probabilities without resorting to the tables. In this article, we will give a derivation of the normal probability density function suitable for students in calculus. The broad applicability of the normal distribution can be seen from the very mild assumptions made in the derivation.

    Basic Assumptions

    Consider throwing a dart at the origin of the Cartesian plane. You are aiming at the origin, but random errors in your throw will produce varying results. We assume that:

    1 the errors do not depend on the orientation of the coordinate system.

    2 errors in perpendicular directions are independent. This means that being too high doesn't alter the probability of being off to the right.

    3 large errors are less likely than small errors.

    In Figure 1, below, we can argue that, according to these assumptions, your throw is more likely to land in region A than either B or C, since region A is closer to the origin. Similarly, region B is more likely than region C. Further, you are more likely to land in region F than either D or E, since F has the larger area and the distances from the origin are approximately the same.

    Figure 1

    Determining the Shape of the Distribution

    Consider the probability of the dart falling in the vertical strip from $x$ to $x + \Delta x$. Let this probability be denoted $p(x)\,\Delta x$. Similarly, let the probability of the dart landing in the horizontal strip from $y$ to $y + \Delta y$ be $p(y)\,\Delta y$. We are interested in the characteristics of the function $p$. From our assumptions, we know that function $p$ is not constant. In fact, the function $p$ is the normal probability density function.

    Figure 2

    From the independence assumption, the probability of falling in the shaded region is $p(x)\,\Delta x \cdot p(y)\,\Delta y$. Since we assumed that the orientation doesn't matter, that any region $r$ units from the origin with area $\Delta x \cdot \Delta y$ has the same probability, we can say that

    $$p(x)\,\Delta x \cdot p(y)\,\Delta y = g(r)\,\Delta x\,\Delta y.$$

    This means that

    $$g(r) = p(x)\,p(y).$$

    Differentiating both sides of this equation with respect to $\theta$, we have

    $$0 = p(x)\,\frac{dp(y)}{d\theta} + p(y)\,\frac{dp(x)}{d\theta},$$

    since $g$ is independent of orientation, and therefore of $\theta$.

    Using $x = r\cos\theta$ and $y = r\sin\theta$, we can rewrite the derivatives above as

    $$0 = p(x)\,p'(y)\,(r\cos\theta) + p(y)\,p'(x)\,(-r\sin\theta).$$

    Rewriting again, we have $0 = p(x)\,p'(y)\,x - p(y)\,p'(x)\,y$. This differential equation can be solved by separating variables,

    $$\frac{p'(x)}{x\,p(x)} = \frac{p'(y)}{y\,p(y)}.$$

    This differential equation is true for any $x$ and $y$, and $x$ and $y$ are independent. That can only happen if the ratio defined by the differential equation is a constant, that is, if

    $$\frac{p'(x)}{x\,p(x)} = \frac{p'(y)}{y\,p(y)} = C.$$

    Solving $\frac{p'(x)}{x\,p(x)} = C$, we find that $\frac{p'(x)}{p(x)} = Cx$ and $\ln p(x) = \frac{C x^2}{2} + c$ and finally,

    $$p(x) = A\, e^{\frac{C}{2} x^2}.$$

    Since we assumed that large errors are less likely than small errors, we know that $C$ must be negative. We can rewrite our probability function as

    $$p(x) = A\, e^{-\frac{k}{2} x^2},$$

    with $k$ positive.

    This argument has given us the basic form of the normal distribution. This is the classic bell curve with maximum value at $x = 0$ and points of inflection at $x = \pm\frac{1}{\sqrt{k}}$. We now need to determine the appropriate values of $A$ and $k$.

    Determining the Coefficient A

    For $p$ to be a probability distribution, the total area under the curve must be 1. We need to adjust $A$ to ensure that the area requirement is satisfied. The integral to be evaluated is

    $$\int_{-\infty}^{\infty} A\, e^{-\frac{k}{2} x^2}\,dx.$$

    If $\int_{-\infty}^{\infty} A\, e^{-\frac{k}{2} x^2}\,dx = 1$, then $\int_{-\infty}^{\infty} e^{-\frac{k}{2} x^2}\,dx = \frac{1}{A}$. Due to the symmetry of the function, this area is twice that of $\int_{0}^{\infty} e^{-\frac{k}{2} x^2}\,dx$, so

    $$\int_{0}^{\infty} e^{-\frac{k}{2} x^2}\,dx = \frac{1}{2A}.$$

    Then,

    $$\left(\int_{0}^{\infty} e^{-\frac{k}{2} x^2}\,dx\right)\left(\int_{0}^{\infty} e^{-\frac{k}{2} y^2}\,dy\right) = \frac{1}{4A^2},$$

    since $x$ and $y$ are just dummy variables. Recall that $x$ and $y$ are also independent, so we can rewrite this product as a double integral

    $$\int_{0}^{\infty}\int_{0}^{\infty} e^{-\frac{k}{2}\left(x^2 + y^2\right)}\,dy\,dx = \frac{1}{4A^2}.$$

    (Rewriting the product of the two integrals as the double integral of the product of the integrands is a step that needs more justification than we give here, although the result is easily believed. It is straightforward to show that

    $$\left(\int_{0}^{M} f(x)\,dx\right)\left(\int_{0}^{M} g(y)\,dy\right) = \int_{0}^{M}\int_{0}^{M} f(x)\,g(y)\,dy\,dx$$

    for finite limits of integration, but the infinite limits create a significant challenge that will not be taken up.)

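    The double integral can be evaluated using polar coordinates.

    $$\int_{0}^{\infty}\int_{0}^{\infty} e^{-\frac{k}{2}\left(x^2 + y^2\right)}\,dx\,dy = \int_{0}^{\pi/2}\int_{0}^{\infty} e^{-\frac{k}{2} r^2}\,r\,dr\,d\theta.$$

    To evaluate the polar form requires a u-substitution in an improper integral. Performing the integration with respect to $r$, we have

    $$\int_{0}^{\pi/2}\int_{0}^{\infty} e^{-\frac{k}{2} r^2}\,r\,dr\,d\theta = \int_{0}^{\pi/2}\left[\frac{1}{k}\int_{-\infty}^{0} e^{u}\,du\right]d\theta = \frac{1}{k}\int_{0}^{\pi/2} d\theta = \frac{\pi}{2k}.$$

    Now we know that $\frac{1}{4A^2} = \frac{\pi}{2k}$, and so $A = \sqrt{\frac{k}{2\pi}}$. The probability distribution is

    $$p(x) = \sqrt{\frac{k}{2\pi}}\; e^{-\frac{k}{2} x^2}.$$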
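The normalising coefficient can be sanity-checked numerically. The sketch below integrates $p(x) = \sqrt{k/(2\pi)}\,e^{-kx^2/2}$ over the whole real line for a few values of $k$; the test values of `k` and the function name `p` are arbitrary choices for illustration.

```r
# Check numerically that A = sqrt(k / (2 * pi)) gives total area 1.
p <- function(x, k) sqrt(k / (2 * pi)) * exp(-k * x^2 / 2)

for (k in c(0.25, 1, 4)) {          # arbitrary positive test values of k
  area <- integrate(p, lower = -Inf, upper = Inf, k = k)$value
  cat(sprintf("k = %4.2f  area = %.6f\n", k, area))
}
```

Each reported area should be 1 up to the tolerance of the numerical integrator, whatever positive `k` is used.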

    Determining the Value of k

    A question often asked about probability distributions is "what are the mean and variance of the distribution?" Perhaps the value of $k$ has something to do with the answer to these questions. The mean, $\mu$, is defined to be the value of the integral $\int_{-\infty}^{\infty} x\,p(x)\,dx$. The variance, $\sigma^2$, is the value of the integral $\int_{-\infty}^{\infty} (x - \mu)^2\,p(x)\,dx$. Since the function $x\,p(x)$ is an odd function, we know the mean is zero. The value of the variance needs further computation.

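    To evaluate $\int_{-\infty}^{\infty} x^2\,p(x)\,dx = \sigma^2$, we proceed as before, integrating on only the positive x-axis and doubling the value. Substituting what we know of $p(x)$, we have

    $$2\sqrt{\frac{k}{2\pi}}\int_{0}^{\infty} x^2\, e^{-\frac{k}{2} x^2}\,dx = \sigma^2.$$

    The integral on the left is evaluated by parts with $u = x$ and $dv = x\,e^{-\frac{k}{2} x^2}\,dx$ to generate the expression

    $$2\sqrt{\frac{k}{2\pi}}\left[\lim_{M\to\infty}\left.-\frac{x}{k}\,e^{-\frac{k}{2} x^2}\right|_{0}^{M} + \frac{1}{k}\int_{0}^{\infty} e^{-\frac{k}{2} x^2}\,dx\right].$$

    Simplifying, we know that $\lim_{M\to\infty}\left.-\frac{x}{k}\,e^{-\frac{k}{2} x^2}\right|_{0}^{M} = 0$ and we know that $\frac{1}{k}\int_{0}^{\infty} e^{-\frac{k}{2} x^2}\,dx = \frac{1}{k}\cdot\frac{1}{2}\sqrt{\frac{2\pi}{k}}$ from our work before. So

    $$2\sqrt{\frac{k}{2\pi}}\int_{0}^{\infty} x^2\, e^{-\frac{k}{2} x^2}\,dx = 2\sqrt{\frac{k}{2\pi}}\cdot\frac{1}{k}\cdot\frac{1}{2}\sqrt{\frac{2\pi}{k}} = \frac{1}{k} = \sigma^2,$$

    so that $k = \frac{1}{\sigma^2}$.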
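The relationship between $k$ and the variance can also be checked numerically. This sketch computes the second moment of $p(x) = \sqrt{k/(2\pi)}\,e^{-kx^2/2}$ for a few values of $k$ and then compares the density with R's `dnorm()` after substituting $k = 1/\sigma^2$; the test values and function names are arbitrary choices for illustration.

```r
# Check numerically that the variance of p(x) equals 1 / k.
p <- function(x, k) sqrt(k / (2 * pi)) * exp(-k * x^2 / 2)
second_moment <- function(k) {
  integrate(function(x) x^2 * p(x, k), lower = -Inf, upper = Inf)$value
}

k <- c(0.25, 1, 4)                   # arbitrary positive test values of k
data.frame(k = k, variance = sapply(k, second_moment), one_over_k = 1 / k)

# With k = 1 / sigma^2 the derived density should match R's dnorm().
sigma <- 2
x <- seq(-6, 6, by = 1.5)
all.equal(p(x, k = 1 / sigma^2), dnorm(x, mean = 0, sd = sigma))
```

Because the mean is zero, the second moment computed here is the variance itself.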

    The Normal Probability Density Function

    Now we have the normal probability distribution derived from our 3 basic assumptions:

    $$p(x) = \frac{1}{\sigma\sqrt{2\pi}}\; e^{-\frac{1}{2}\left(\frac{x}{\sigma}\right)^2}.$$

    The general equation for the normal distribution with mean $\mu$ and standard deviation $\sigma$ is created by a simple horizontal shift of this basic distribution,

    $$p(x) = \frac{1}{\sigma\sqrt{2\pi}}\; e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}.$$

    References:

    Grossman, Stanley I., Multivariable Calculus, Linear Algebra, and Differential Equations, 2nd ed., Academic Press, 1986.

    Hamming, Richard W., The Art of Probability for Engineers and Scientists, Addison-Wesley, 1991.

    .ls.objects