Exploring and Visualizing Data: Techniques for a clearer presentation of data
Brian Vegetabile, UCI Statistics PhD Student, November 17th, 2015
1
Outline
• A Case for Data Exploration & Visualization
• Exploring & Visualizing a Single Variable
• Comparing Distributions of Data
• The Iterative Process of Creating a Graphic
• Data & Image Sources
2
A Case for Considering Data Exploration and Visualization
3
A Case for Considering Data Visualization
• Graphics can be useful to aid the presentation of technical data in the sciences
• Sometimes, though, they are created without thought for how the reader will perceive them
• Misused graphics can cause both the analyst and the reader to miss vital information in the data
• As a reader, it is also your responsibility to look for inconsistencies between technical graphics and the conclusions drawn in the text
4
Example Graphics: Two Ways of Looking at Sunspots (1)
[Figure: Yearly Sunspot Totals. x-axis: Year (1700 to 2000); y-axis: Sunspot Numbers (0 to 250)]
• Most standard graphics packages produce square plots
• A square frame ‘squishes’ the series, so detail in the data is lost to the reader
• The plot fails to communicate a key piece of information to the reader
5
Example Graphics: Two Ways of Looking at Sunspots (1)
• Changing the aspect ratio (the width of the graphic relative to its height) reveals information hidden in the previous sunspot graphic
• This is sometimes called “banking”
• Observe the steep rise in sunspot numbers and the gradual decline following a maximum
• Considering how graphics are displayed can be instrumental in communicating the maximum amount of information to a reader
[Figure: Yearly Sunspot Totals. x-axis: Year (1700 to 2000); y-axis: Sunspot Numbers (0 to 250)]
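One heuristic for choosing such an aspect ratio is to make the median absolute slope of the line segments appear at 45 degrees on the page. The sketch below is an illustration under assumptions, not the method used for the slides: the function name `bank_aspect_ratio` and the cyclic series standing in for the sunspot counts are hypothetical.

```python
import math
from statistics import median

def bank_aspect_ratio(x, y):
    """Height/width ratio at which the median absolute segment slope
    is drawn at 45 degrees on the page."""
    x_range = max(x) - min(x)
    y_range = max(y) - min(y)
    # Absolute slopes (in data units) of the segments joining consecutive points
    slopes = [
        abs((y1 - y0) / (x1 - x0))
        for x0, x1, y0, y1 in zip(x, x[1:], y, y[1:])
        if x1 != x0 and y1 != y0
    ]
    # On the page a data slope s appears as s * (height / width) * (x_range / y_range);
    # setting the median of that quantity to 1 gives the ratio below.
    return y_range / (x_range * median(slopes))

# Hypothetical cyclic series standing in for yearly sunspot counts
years = list(range(1700, 2000))
counts = [80 + 75 * math.sin(2 * math.pi * (t - 1700) / 11) for t in years]

ratio = bank_aspect_ratio(years, counts)
print(f"suggested height/width ratio: {ratio:.3f}")
```

For a rapidly oscillating series like this one the suggested ratio is far below 1, which corresponds to a plot that is much wider than it is tall.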
6
Example Graphics: Perception of the difference in Curves
• Another example of how information can be lost in the graphing process is the difference between two curves
• The distance between the curves on the right appears to decrease sharply as the independent variable increases
[Figure: two curves. x-axis: Independent Variable (0 to 5); y-axis: Response Variable (0 to 120)]
7
Example Graphics: Perception of the difference in Curves
• Once we add another graphic that plots the difference between the curves, we see that the difference is almost constant!
• Considering all possible presentations of your data is crucial not only for your own understanding of the data, but also for your readers’
[Figure: two panels. Top: two curves; x-axis: Independent Variable (0 to 5); y-axis: Response Variables (0 to 120). Bottom: x-axis: Independent Variable (0 to 5); y-axis: Difference in Response Variables (13 to 17)]
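A small numerical sketch can show why the eye is misled here. It assumes two hypothetical curves, `f` and `g`, that differ by a constant 15, and uses the gap measured perpendicular to the curve (in data units) as a rough proxy for what a reader judges; the actual visual impression also depends on the plot's aspect ratio.

```python
import math

def f(x):
    """Hypothetical lower curve."""
    return x ** 3

def g(x):
    """Hypothetical upper curve: the lower curve shifted up by a constant 15."""
    return f(x) + 15.0

def slope_f(x):
    """Derivative of the lower curve."""
    return 3 * x ** 2

xs = [i * 0.5 for i in range(11)]  # 0.0, 0.5, ..., 5.0

# Vertical difference: the quantity a separate difference plot would show
vertical = [g(x) - f(x) for x in xs]

# First-order approximation of the gap measured perpendicular to the curve
perceived = [v / math.sqrt(1 + slope_f(x) ** 2) for x, v in zip(xs, vertical)]

for x, v, p in zip(xs, vertical, perceived):
    print(f"x={x:3.1f}  vertical={v:5.1f}  perpendicular~{p:6.2f}")
```

The vertical difference stays fixed while the perpendicular gap shrinks as the curves steepen, which is why plotting the difference directly is the safer presentation.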
8
Example Graphics: Space Shuttle Challenger Analysis (2)
• January 27, 1986, the night before the space shuttle Challenger accident
• Three-hour teleconference among people at Morton Thiokol, Marshall Space Flight Center and Kennedy Space Center.
• The discussion focused on the forecast of a 31°F temperature for launch time the next morning, and the effect of low temperature on O-ring performance.
[Figure: Space Shuttle Incidents vs. Temperature Prior to Challenger. x-axis: Calculated Joint Temperature (F) (50 to 80); y-axis: Number of Incidents (0 to 3)]
9
Example Graphics: Space Shuttle Challenger Analysis (2)
• The engineers had only presented the failures and not the successes
• Based on the U configuration of points, it was concluded that there was no evidence from the historical data about a temperature effect.
[Figure: Space Shuttle Incidents vs. Temperature Prior to Challenger. x-axis: Calculated Joint Temperature (F) (50 to 80); y-axis: Number of Incidents (0 to 3)]
10
Example Graphics: Space Shuttle Challenger Analysis (2)
• Adding the successes to the graphic, we observe that the number of incidents depends on joint temperature
• The Rogers Commission concluded that "A careful analysis of the flight history of O-ring performance would have revealed the correlation of O-ring damage in low temperature"
[Figure: Space Shuttle Incidents vs. Temperature Prior to Challenger. x-axis: Calculated Joint Temperature (F) (50 to 80); y-axis: Number of Incidents (0 to 3)]
11
Example graphic: Typical Graphic from Science
• Pick up any issue of Science Magazine and you’ll find graphics similar to the one on the right.
• “…Data are means ± SEM of seven to eight mice per genotype for (B) and six mice per genotype for (C). Statistical significance was analyzed by unpaired two-tailed t test. *P < 0.05”
• This graphic is confusing since it represents the data by a “bar chart”, but the data is not categorical.
12
Exploring and Describing the Distribution of a Single Continuous Variable - Variables of One Dimension
13
Visualization of a Single Continuous Variable
• Visualizing a single variable is helpful in understanding the distribution of the data.
• Reveals insights beyond summary tables.
• See mean, median, mode, quantiles, etc.
• Many statistical tests assume certain distributions for the process that generated the data
• Student’s t-test
• Presented are techniques for assessing the distribution of a variable to aid in its summary
• Note: 100 points were simulated randomly to highlight these cases
14
Dynamite Plots for a Single Variable
• Dynamite plots are rampant throughout the sciences.
• Plotted is a dynamite plot of the simulated data
• Shows the mean as a measure of central tendency, with an error bar extending one standard deviation past the mean.
• These plots obscure major information that is hiding within the data!
[Figure: Dynamite Plot for Distribution of Data. y-axis: Value (0.0 to 1.5)]
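To make the last point concrete, the sketch below computes the two numbers a dynamite plot displays and then checks how many observations actually sit near the bar's height. The two-cluster sample is a hypothetical stand-in for the 100 simulated points, not the data behind the slides.

```python
import random
from statistics import mean, stdev

random.seed(2015)  # fixed seed so the sketch is repeatable

# Hypothetical sample: 50 points clustered near 0 and 50 clustered near 1
sample = [random.gauss(0.0, 0.2) for _ in range(50)] + \
         [random.gauss(1.0, 0.2) for _ in range(50)]

bar_height = mean(sample)               # top of the bar
whisker = bar_height + stdev(sample)    # end of the error bar

# How many observations lie close to the value the bar emphasizes?
near_mean = sum(1 for v in sample if abs(v - bar_height) < 0.1)

print(f"bar height (mean): {bar_height:.2f}")
print(f"error bar reaches: {whisker:.2f}")
print(f"points within 0.1 of the mean: {near_mean} of {len(sample)}")
```

With two clusters, the bar points at a region where few observations lie, and nothing in the bar or its whisker hints at that.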
15
Box & Whisker Plots
• To the right is a Box & Whisker plot of 100 simulated data points.
• Introduced by John Tukey in his toolkit of exploratory data analysis
• Useful for beginning to understand the data, or to supplement another plot (dot plot or histogram)
• Some packages will also highlight any outliers
[Figure: Distribution of a Variable. Box & Whisker plot; x-axis: x-value (−1.0 to 2.0)]
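The quantities a Box & Whisker plot draws can be computed directly. The sketch below uses Python's `statistics.quantiles` with the "inclusive" method; other quartile conventions exist and give slightly different values. The 1.5 × IQR rule for flagging outliers is a common convention, and the helper names and data are hypothetical.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)

def flag_outliers(values):
    """Values more than 1.5 * IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(values)
    fence = 1.5 * (q3 - q1)
    return [v for v in values if v < q1 - fence or v > q3 + fence]

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]  # hypothetical values; 30 is unusually large
print(five_number_summary(data))
print(flag_outliers(data))
```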
16
Dot plots
• Each data point is plotted along a line.
• Spread and distribution of points are now more obvious.
• Plotted with a measure of central tendency.
[Figure: Distribution of a Variable. Dot plot of the simulated data; x-axis: x-value (−1.0 to 2.0)]
17
Dot plots
• Key Concept: Central Tendency
• A central tendency is a central or typical value for a probability distribution.
• Included in the graphic is a ‘red’ line that shows the median
• The median is a more stable measure of central tendency than the mean and is less likely to be influenced by skew within the distribution of data.
[Figure: Distribution of a Variable. Dot plot of the simulated data with the median marked; x-axis: x-value (−1.0 to 2.0)]
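A quick numerical check of that stability claim, using made-up values: adding one extreme observation moves the mean far more than it moves the median.

```python
from statistics import mean, median

values = [0.2, 0.4, 0.5, 0.6, 0.9]  # hypothetical, roughly symmetric
skewed = values + [12.0]            # the same values plus one extreme observation

shift_in_mean = mean(skewed) - mean(values)
shift_in_median = median(skewed) - median(values)

print(f"mean moved by   {shift_in_mean:.3f}")
print(f"median moved by {shift_in_median:.3f}")
```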
18
Dot plots - Adjusting the Alpha Level
• Adjusting the alpha level amounts to changing how transparent each data point is
• Adds a level of “depth” to the graphic
• The plot below has an alpha level set to 0.5
• Darker areas have more points than lighter areas
[Figure: Distribution of a Variable. Dot plot with alpha set to 0.5; x-axis: x-value (−1.0 to 2.0)]
19
Dot plots - Adding Jitter
• Adding jitter amounts to adding random noise to where each data point lies on its line
• Combined with adjusting the alpha level we have a better idea of the distribution of our data points
[Figure: Distribution of a Variable. Dot plot with jitter and adjusted alpha; x-axis: x-value (−1.0 to 2.0)]
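Jitter only changes where a point is drawn across the line, never its value along the axis. A minimal sketch, assuming uniform noise and a hypothetical helper named `jitter`:

```python
import random

def jitter(n_points, width=0.2, seed=None):
    """Random offsets in [-width / 2, width / 2], one per point."""
    rng = random.Random(seed)
    return [rng.uniform(-width / 2, width / 2) for _ in range(n_points)]

x_values = [0.50, 0.50, 0.51, 0.52, 1.10, 1.10]  # hypothetical, with near-duplicates
y_offsets = jitter(len(x_values), width=0.2, seed=1)

# Each point keeps its x-value; only its position across the line is randomized
points = list(zip(x_values, y_offsets))
for x, dy in points:
    print(f"x={x:.2f}  offset={dy:+.3f}")
```

Points that would otherwise overlap exactly are now drawn apart, and a transparency (alpha) setting in the plotting library of your choice then shows where they still pile up.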
20
Histograms
• Histograms reveal even more information than box plots and dot plots!
• The simulated data were actually multi-modal
• Note: When using histograms it’s also necessary to consider bin width
[Figure: Distribution of a Variable. Histogram; x-axis: x-value (−1.0 to 2.0); y-axis: Frequency (0 to 15)]
21
Histograms - Comparing bin widths
[Figure: six histograms of the same variable (x-axis: x-value; y-axis: Frequency), with bin widths ±0.05, ±0.1, ±0.2, ±0.5, ±1, and ±5]
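The effect of bin width can be reproduced without drawing anything by counting how the same sample falls into bins of different sizes. This is a sketch under assumptions: the bimodal sample is hypothetical, and `histogram_counts` is an illustrative helper that starts its bins at multiples of the bin width.

```python
import math
import random
from collections import Counter

def histogram_counts(values, bin_width):
    """Count values per bin; bin k covers [k * bin_width, (k + 1) * bin_width)."""
    return Counter(math.floor(v / bin_width) for v in values)

random.seed(17)
# Hypothetical bimodal sample, not the data from the slides
sample = [random.gauss(0.0, 0.25) for _ in range(50)] + \
         [random.gauss(1.2, 0.25) for _ in range(50)]

for width in (0.1, 0.5, 5.0):
    counts = histogram_counts(sample, width)
    print(f"bin width {width}: {len(counts)} non-empty bins, "
          f"tallest bin holds {max(counts.values())} points")
```

Very narrow bins scatter the points over many nearly empty bins, while very wide bins merge both modes into one or two bars; the useful widths lie in between.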
22
Combining plots
• Combining plots sometimes gives a clearer picture
• Shows modality, the total number of points, and the five-number summary
[Figure: Utilizing Three Plots. Three plots of the same variable on a shared x-axis (−1.0 to 2.0); y-axis: Frequency (0 to 15)]
23
Quantile Plots - Normal QQ-Plot
• Quantile-Quantile Plots are both simple and powerful
• Many statistical tests require that the data being tested were generated by a Normal Distribution.
• Normal QQ-Plots offer a way to compare the quantiles of a sample against the theoretical quantiles of a normal distribution
[Figure: Normal Q-Q Plot. x-axis: Theoretical Quantiles (−2 to 2); y-axis: Sample Quantiles (−2 to 2)]
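The points of a normal QQ-plot can be computed as follows. This sketch assumes the plotting positions (i − 0.5)/n, which is one common convention among several; the helper name and the sample values are hypothetical.

```python
from statistics import NormalDist

def normal_qq_points(sample):
    """Pair each sorted observation with the matching standard normal quantile."""
    ordered = sorted(sample)
    n = len(ordered)
    # Plotting positions (i - 0.5) / n for i = 1, ..., n
    theoretical = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))

sample = [0.1, -1.3, 0.8, 2.1, -0.4, 0.3, -0.9, 1.2, -0.1]  # hypothetical values
for t, s in normal_qq_points(sample):
    print(f"theoretical {t:6.2f}   sample {s:6.2f}")
```

If the sample came from a normal distribution, these pairs fall close to a straight line; systematic bends or steps suggest skew, heavy tails, or several modes.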
24
Quantile Plots - QQ-Plot
• What does the sampled data look like compared with a normal distribution?
• As expected, the multi-modal data does not compare well against the normal distribution.
• This is another plot to understand the distributional characteristics of the observed data
[Figure: Normal Q-Q Plot of the simulated data. x-axis: Theoretical Quantiles (−2 to 2); y-axis: Sample Quantiles (−1.0 to 2.0)]
25
Logarithmic Transformation of a Distribution
• Again, many tests assume that the data are Normally distributed
• Many types of data, though, aren’t naturally normal on their original scale.
• It’s sometimes necessary to transform the data to a new scale that preserves the order of the data, but on which they are approximately normally distributed
• Positive-valued data such as salaries are often natural candidates for such a transformation
26
[Figure: 141 Major North American River Lengths, Obtained by USGS. Two histograms with y-axis Frequency: River Length (0 to 4000) and log(River Length) (3 to 9)]
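A short check of both properties of the log transformation: it preserves order, and it pulls in a long right tail. The values below are hypothetical positive numbers, not the USGS river data.

```python
import math
from statistics import mean, median

# Hypothetical right-skewed, strictly positive values
lengths = [210, 250, 300, 340, 420, 500, 650, 900, 1450, 3700]
log_lengths = [math.log(v) for v in lengths]

# The logarithm is strictly increasing, so the ordering of the data is unchanged
order_preserved = sorted(log_lengths) == [math.log(v) for v in sorted(lengths)]

def relative_gap(values):
    """(mean - median) as a fraction of the range: a crude indicator of skew."""
    return (mean(values) - median(values)) / (max(values) - min(values))

print(f"order preserved: {order_preserved}")
print(f"relative mean-median gap, raw scale: {relative_gap(lengths):.3f}")
print(f"relative mean-median gap, log scale: {relative_gap(log_lengths):.3f}")
```

Note that the logarithm is only defined for strictly positive values, so data containing zeros need a different treatment.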
Visualizing the Distribution of a Single Continuous Variable - Variables with More Dimensions
27
Scatterplots
• Scatter plots are essentially an analog to dot plots in multiple dimensions
[Figure: scatter plot. x-axis: Dimension 1 (−4 to 4); y-axis: Dimension 2 (−3 to 3)]
28
Scatterplots
• Similar to dot plots, adjusting alpha reveals a ‘depth’ of points
[Figure: four scatter plots of the same data with alpha set to 0.25, 0.5, 0.75, and 1. x-axis: Dimension 1 (−4 to 4); y-axis: Dimension 2 (−3 to 3)]
29
Scatterplots with Histograms
• These can be combined with additional plots to make the picture more clear
[Figure: scatter plot of Dimension 2 (−3 to 3) against Dimension 1 (−6 to 6), with a Frequency histogram of each dimension in the margins]
30
More Dimensions -> Pairs Plots
• As the number of dimensions grows, combining scatter plots and histograms in a pairs plot can be very effective
[Figure: pairs plot of Variable 1 through Variable 4, with a Frequency histogram for each variable and a scatter plot for each pair of variables]
31
Visualizing Categorical Variables
32
Categorical Data
• Categorical Data is often represented as a table of quantities.
• MLB National League East Rankings as of July 26th, 2015
Team                    Wins   Losses   Win Percentage
Washington Nationals      52       45        0.5360825
New York Mets             51       48        0.5151515
Atlanta Braves            46       52        0.4693878
Miami Marlins             41       58        0.4141414
Philadelphia Phillies     37       63        0.3700000
33
Categorical Data - Pie Charts
• Many people interested in data visualization will tell you never to use pie charts…
• Often used to show “Percent of the Whole”
• … but relative scale between variables is often lost
[Figure: pie chart with one slice each for the Washington Nationals, New York Mets, Atlanta Braves, Miami Marlins, and Philadelphia Phillies]
34
Categorical Data - Bar Charts
• One method to remedy this is to observe the data as a bar chart.
• Relative win percentage is now more clear.
• Nationals doing much better than the Phillies.
[Figure: NL East Win Percentage as of July 26th, 2015. Bar chart with one bar per team (Washington Nationals, New York Mets, Atlanta Braves, Miami Marlins, Philadelphia Phillies); y-axis: Win Percentage (0.0 to 1.0)]
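The win percentages behind the bar chart follow directly from the wins and losses in the standings table. The sketch below recomputes them and prints a rough text version of the bars; the variable names and the text rendering are illustrative only.

```python
# Wins and losses as of July 26th, 2015, copied from the standings table
standings = {
    "Washington Nationals": (52, 45),
    "New York Mets": (51, 48),
    "Atlanta Braves": (46, 52),
    "Miami Marlins": (41, 58),
    "Philadelphia Phillies": (37, 63),
}

# Win percentage = wins / games played
win_pct = {team: wins / (wins + losses) for team, (wins, losses) in standings.items()}

# One '#' per two percentage points, all bars on the same 0-to-1 scale
for team, pct in win_pct.items():
    bar = "#" * round(pct * 50)
    print(f"{team:<22} {pct:.3f} {bar}")
```

Because every bar starts at zero and shares one scale, the gap between the Nationals and the Phillies is visible at a glance.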
35
Categorical Variables - Dot and Line Charts
• Changing to a ‘dot and line plot’ yields more information. We see the relative amounts of wins compared with losses across the league.
[Figure: NL East Standings as of July 26, 2015. Dot and line chart with one row per team (Philadelphia Phillies, Miami Marlins, Atlanta Braves, New York Mets, Washington Nationals); x-axis runs from −65 to 55, with Losses to the left of zero and Wins to the right]
36
Comparing Distributions
37
Comparing Distributions
• Often we are interested in comparing more than one distribution.
• Simulated are 1000 draws from 3 separate beta distributions
[Figure: three density histograms, each on its own scale. Distribution 1: X from 0.00 to 0.30; Distribution 2: X from 0.1 to 0.4; Distribution 3: X from 0.60 to 0.72; y-axis: Density]
38
Comparing Distributions - Common Scale
• Adjusting to a common scale for each distribution allows us to see relative spreads, relative centers, etc.
[Figure: three density histograms (Distribution 1, Distribution 2, Distribution 3), each with X from 0.0 to 1.0; y-axis: Density]
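Moving to a common scale amounts to computing one set of axis limits from all of the samples and applying it to every panel. A minimal sketch, assuming three hypothetical beta samples; the shape parameters are invented for illustration and are not those used for the slides.

```python
import random

random.seed(3)

# Three hypothetical beta samples of 1000 draws each
samples = {
    "Distribution 1": [random.betavariate(2, 30) for _ in range(1000)],
    "Distribution 2": [random.betavariate(10, 40) for _ in range(1000)],
    "Distribution 3": [random.betavariate(500, 250) for _ in range(1000)],
}

# Separate scales: each panel zooms to its own data
own_limits = {name: (min(v), max(v)) for name, v in samples.items()}

# Common scale: one pair of limits shared by every panel
common_limits = (
    min(lo for lo, _ in own_limits.values()),
    max(hi for _, hi in own_limits.values()),
)

for name, (lo, hi) in own_limits.items():
    print(f"{name}: own limits ({lo:.2f}, {hi:.2f})")
print(f"common limits: ({common_limits[0]:.2f}, {common_limits[1]:.2f})")
```

Since beta draws always lie between 0 and 1, fixing the limits at (0, 1) for every panel is an equally reasonable choice.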
39
Comparing Distributions - Common Plot
• Finally moving to a common plot we see how the densities compare with each other on two common scales
[Figure: All 3 Distributions. x-axis: X (0.0 to 1.0); y-axis: Density (0 to 20)]
40
Comparing Distributions
• This can be even more dramatic in more dimensions
[Figure: two scatter plots, each on its own scale. Distribution 1: X1 from 9 to 15, Y1 from 12 to 18; Distribution 2: X2 from 0 to 10, Y2 from 5 to 25]
41
Comparing Distributions - Common Scale
• Common scales allow us to see the relative sizes of the distributions
[Figure: two scatter plots on a common scale (x from 0 to 15, y from 5 to 25). Distribution 1: Y1 against X1; Distribution 2: Y2 against X2]
42
Comparing Distributions - Common Plot
• And with a common plot we can see the relative distance between each center and assess overlap
[Figure: Distribution 1 vs. Distribution 2. x-axis: X (0 to 15); y-axis: Y (5 to 25)]
43
The Iterative Process of Creating a Graphic
44
Exploring data: Stepping through the Process
• Data simulated as an illustration using the following study:
• “Maternal exposure to childhood trauma is associated during pregnancy with placental-fetal stress physiology”, Biological Psychiatry (to appear) [3]
• Goal: Examine the hypothesis that intergenerational transmission may begin during intrauterine life via the effect of maternal childhood trauma exposure on placental-fetal stress physiology, specifically placental corticotrophin-releasing hormone (pCRH).
• Interested in examining the effects of childhood trauma exposure on placental corticotrophin-releasing hormone production over gestational age.
• This simulated data will help demonstrate the iterative design process of a graphic
45
Describing the data
• The simulated data represent a “sociodemographically-diverse cohort of 88 pregnant women.”
• Placental CRH concentrations were quantified in maternal blood collected serially over the course of gestation.
46
What does the data look like?
• What does the relationship between pCRH and gestational age look like before taking treatment effects or individual effects into consideration?
• We are interested in understanding the general behavior of pCRH across gestational age.
[Figure: Relationship between Gestational Age and pCRH. x-axis: Gestational Age (15 to 40); y-axis: pCRH (0 to 1200)]
47
Transforming the Response
• From the last plot we notice an exponential relationship
• It’s often of interest to see if this relationship is linear on a logarithmic scale in order to perform linear regression
• We’ve plotted a transformed log(pCRH) to the right
• Notice that there is a clear linear relationship on this scale
[Figure: Relationship between Gestational Age and log(pCRH). x-axis: Gestational Age (15 to 40); y-axis: log(pCRH) (3 to 7)]
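Why the log scale helps can be seen with a small least-squares sketch: if the response grows exponentially with gestational age, its logarithm is exactly linear, and an ordinary regression line recovers the growth rate. The helper `fit_line` and the noise-free data are hypothetical illustrations, not the study's data or analysis.

```python
import math

def fit_line(x, y):
    """Ordinary least squares slope and intercept for y = intercept + slope * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

# Hypothetical noise-free exponential growth: response = 5 * exp(0.15 * weeks)
weeks = [15, 20, 25, 30, 35, 40]
response = [5 * math.exp(0.15 * w) for w in weeks]

# Fit a straight line to the log of the response
slope, intercept = fit_line(weeks, [math.log(r) for r in response])
print(f"slope = {slope:.3f}, intercept = {intercept:.3f}")
```

The fitted slope is the growth rate per week and the intercept is the log of the baseline value, which is what makes the linear model on the log scale easy to interpret.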
48
Is there a difference between the groups?
• We can now begin to explore differences in the production of pCRH across gestational age between those who experienced childhood trauma and those who did not.
• … it doesn’t look like there’s much of a difference.
• Let’s investigate the possibility that the variability in slopes differs between the two groups.
[Figure: two scatter plots, Experienced Childhood Trauma and Did Not Experience Childhood Trauma. x-axis: Gestational Age (15 to 40); y-axis: log(pCRH) (2 to 8)]
49
Do the individual trajectories vary between the groups?
• Adding lines between the points for the individual trajectories allows us to see if there is variability between the two groups
• … again, it doesn’t look like there’s much of a difference.
• It appears that we’ve got 5 different collection phases across gestational age. What if we bin these together and investigate that way?
[Figure: individual trajectories in two panels, Experienced Childhood Trauma and Did Not Experience Childhood Trauma. x-axis: Gestational Age (15 to 40); y-axis: log(pCRH) (2 to 8)]
50
Grouping by Week Clusters?
• We’ve created ‘groups’ by their week clusters
• Now we can look at the distribution of points within each cluster.
• It’s hard to tell if there is a difference between these two plots with them plotted this way
• Let’s add them back to the same plot for a side by side comparison!
[Figure: two panels, Experienced Childhood Trauma and Did Not Experience Childhood Trauma. x-axis: Gestational Age Grouped Every Five Weeks (tick labels shown: <20, 20−25, 25−30); y-axis: log(pCRH) (3 to 7)]
51
Side by Side Distributions
• Comparing the distributions at each ‘week’ cluster tells us a lot more
• We now see that the median pCRH for those who experienced childhood trauma is lower than for those who did not, across gestational age
• We also see that the difference between the medians gets smaller across gestational age, suggesting an interaction between gestational age and trauma exposure in their effect on pCRH.
• Now let’s tell the whole story!
[Figure: Comparison of log(pCRH) Across Trauma. x-axis: Gestational Age Grouped Every Five Weeks (<20, 20−25, 25−30, 30−35, 35−40); y-axis: log(pCRH) (3 to 7); legend: Did Not Experience Trauma, Experienced Trauma]
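The grouping step behind this comparison can be sketched as follows: assign each observation to a five-week cluster, then take the median of log(pCRH) within each cluster and trauma group. The function names, group labels, and observations below are hypothetical; the treatment of ages of 40 weeks or more (folded into the last cluster) is an assumption.

```python
from collections import defaultdict
from statistics import median

def week_bin(gestational_age):
    """Label a gestational age (in weeks) with its five-week cluster."""
    if gestational_age < 20:
        return "<20"
    # Ages of 35 weeks or more all fall in the last cluster (an assumption)
    lower = min(int(gestational_age // 5) * 5, 35)
    return f"{lower}-{lower + 5}"

def medians_by_bin(records):
    """records: iterable of (gestational_age, log_pcrh, group) tuples."""
    grouped = defaultdict(list)
    for age, value, group in records:
        grouped[(week_bin(age), group)].append(value)
    return {key: median(vals) for key, vals in grouped.items()}

# Hypothetical observations, not values from the study
records = [
    (16.0, 3.1, "trauma"), (17.5, 3.3, "trauma"), (18.0, 3.6, "no trauma"),
    (22.0, 4.0, "trauma"), (23.0, 4.4, "no trauma"), (24.5, 4.2, "no trauma"),
]

for (cluster, group), med in sorted(medians_by_bin(records).items()):
    print(f"{cluster:>6}  {group:<10} median log(pCRH) = {med:.2f}")
```

Comparing the two medians within each cluster, and watching how their difference changes from one cluster to the next, is the numerical counterpart of reading the side-by-side plot.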
52
Telling the Whole Story: A Completed Graphic
• We can now take the graphics that we’ve created through the exploratory phase and construct a combined graphic to tell the whole story
• The two leftmost graphics highlight the individual trajectories, while the last graphic captures the temporal change in the relationship
[Figure: combined graphic with three panels. Experienced Childhood Trauma and Did Not Experience Childhood Trauma: log(pCRH) (2 to 8) against Gestational Age (15 to 40). Comparison of log(pCRH) Across Trauma: log(pCRH) (3 to 7) against Gestational Age Grouped Every Five Weeks (<20, 20−25, 25−30, 30−35, 35−40); legend: Did Not Experience Trauma, Experienced Trauma]
53
Outlining the General Strategy for the Creation of Graphics
• It’s necessary to explore your data to fully understand how it’s behaving
• The goal is to pack a large amount of quantitative information into a small region.
• Consider how a reader would perceive the graphic that you’ve presented.
• Combine graphics when needed to tell the entire story.
• Carefully study the domain area and understand when it is necessary to further investigate the data
• Graphing data should be an iterative, experimental process
54
Further Investigation
• Multidimensional Visualization techniques
• Visualizing Categorical Variables
• Visualization Techniques for combining Categorical and Continuous Variables
• Loess Smoothing for Scatter Plots
• Techniques for Time Series Data
• Techniques for Spatial Data
55
Texts/References
• Texts
• The Elements of Graphing Data - William S. Cleveland
• Visualizing Data - William S. Cleveland
• The Visual Display of Quantitative Information - Edward Tufte
• Articles
• Graphical Perception: Theory, Experimentation, and Application to the Development of Graphical Models - William Cleveland and Robert McGill
• Let’s Practice What We Preach: Turning Tables into Graphs - Andrew Gelman, Cristian Pasarica, and Rahul Dodhia
56
References
1. Cleveland, William S. The Elements of Graphing Data. Murray Hill, NJ: AT & T Bell Laboratories, 1994. Print.
2. Siddhartha R. Dalal, Edward B. Fowlkes & Bruce Hoadley (1989) Risk Analysis of the Space Shuttle: Pre-Challenger Prediction of Failure, Journal of the American Statistical Association, 84:408, 945-957, DOI: 10.1080/01621459.1989.10478858
3. Moog, N.K., Buss, C., Entringer, S., Shahbaba, V., Gillen, D., Hobel, C.J., and Wadhwa, P.D. (2015), Maternal exposure to childhood trauma is associated during pregnancy with placental-fetal stress physiology, Biological Psychiatry (to appear).
57
Data & Image Sources
• Image - Flight Patterns - http://users.design.ucla.edu/~akoblin/work/faa/
• Data - Sunspots - WDC-SILSO, Royal Observatory of Belgium, Brussels (http://www.sidc.be/silso/datafiles)
58