data visualization short course - test science · 2019-01-22 · data visualization short course 3...
TRANSCRIPT
2
3
MCOTEA Example
4
▪ Air Force Magazine Feb 2017 trends for women as a
percent of the force
Air Force Example
http://www.airforcemag.com/MagazineArchive/Magazine%20Documents/2017/February%202017/0217infographic.pdf
5
Callan Chart of Sector Performance (Quilt Chart)
https://www.callan.com/wp-content/uploads/2017/01/Callan-PeriodicTbl_KeyInd_2017.pdf
6
▪ Stephen Few is a guru in the data visualization world
▪ Let’s take his quiz on best practices at
www.perceptualedge.com
▪ Goal is to get every one wrong—0/10 is success!
One Last Warm-Up
7
▪ Appreciate the historical perspective of data visualization
▪ Know the value data visualization offers to analytics
and Big Data
▪ Understand what makes a good graphical display and
some of the common mistakes to avoid in graphical
design
▪ Be familiar with some methodologies for the data
visualization process
▪ Appreciate how to do data viz with a few common
software packages
Objectives
8
Data Visualization is Not New
Scottish political economist William Playfair in 1786
recognized superiority of graphs over tabular presentations—
published 43 time series plots and one bar chart
Developed the first pie chart in 1801 to show distribution of
Turkish Empire over Europe, Africa, and Asia
Stephen Few states we really didn’t progress much from these
original ideas until late 1970s with Princeton’s John Tukey and
his Exploratory Data Analysis (EDA)
He argues most are unaware of modern methods
9
Data Visualization is Not New
▪ Area chart using color was masterful
▪ Playfair credited with the introduction of bar charts
10
Data Visualization is Not New
11
John Tukey, Princeton, 1977
Too much emphasis on hypothesis tests as confirmatory
analysis—focus should also be on discovery
Objectives
– Suggest hypotheses of observed data
– Assess statistical test assumptions
– Support selection of appropriate methods and tools
– Serve as basis for further data collections and
experiments
If we need a short suggestion of EDA, I would suggest
that
– It is an attitude; a flexibility; and requires graph
paper and transparencies
Exploratory Data Analysis
The greatest value of a picture is when it forces us to notice what we never
expected to see…John Tukey
12
Most important aspect of data visualization is the data itself
Value goes beyond the enterprise/transactional data itself
– Unstructured data, social networks, Internet of Things
Data Visualization Fuel
If we have data, let’s look at the data. If all we have are opinions, let’s go with mine.
Jim Barksdale, Netscape
Data quality is key and dataviz
can help improve that!
Phil Simon rates organizations
on visualization framework
– Data (big or small)
– Visualization (static or
interactive)
Start small and scale
13
Data is Growing
• Big Data is overused term, but we know there is GOLD in those data mountains
• 15 Tb of Twitter daily is a lot of data generated; how much gold do we have? We are exposed to more information in a day than someone from the 15th century was over a lifetime. 90% of today’s data was created in last 2 years (IBM); 2.5 quintillion bytes per day
Graphic from IBM Research India, presented at Text Mining Workshop Jan 2014
In 2015 the number of networked devices
doubles the entire global population
Of interest: Tera, Peta,
Exa, Zetta, Yotta,
Brontabytes
14
Data Visualization Needs Credible Data!
Do not trust any statistics you did not fake yourself…Churchill
Figures don't lie, but liars do figure…Twain
15
Traits of Meaningful Data
High Volume
Historical
Consistent
Multivariate
Atomic
Clean
Clear
Dimensionally Structured
Richly Segmented
Of Known Pedigree
Reference: Now You See It by Stephen Few
Data Map and Contour Plots
are “best practices”
16
Data is the new business capital.
Data visualization: discovery of solutions that offer highly interactive
and graphical user interfaces, are built on in-memory architectures,
and are geared toward addressing business users’ unmet ease-of-
use and rapid deployment needs. These solutions typically enable
users to explore data without much training, making them accessible
by a wider range of employees than traditional business analysis
tools. SAS
Key to making “analytics” approachable is visualization
– Visual thinking is essential skill for all
– Both an art and science => craft (Berinato, Harvard Business
Review)
Data is a great but messy story; visual analytics is the master
filmmaker to bring the story to life (SAS)
Not a great term…was Shakespeare a word sequencer?
Data Visualization Definition
A picture is worth a thousand data points
17
Characteristics (Card et al, Information Visualization)
– Computer supported
– Interactive
– Visual representations of location, length, size, color, shape to allow us to see trends
– Abstract data with no physical form (e.g. human body)
Amplify cognition by assisting memory by representing data in ways our brain can easily comprehend
3 facts: Pervasiveness has raised quality expectations, Big Data is here, and the Democratization of Data
90% of data analyses required by most organizations is possible with simple data visualization methods
– Excel is getting better
– Boss wants to know why graphs in meetings are not nearly as pretty as the ones she sees on her fitness tracker (Berinato)
Data Visualization
Everyone in our business knows they need to visualize data, but it’s easy to do
poorly. We invest in it. We want to use it right while they use it wrong. Daryl Morey
18
Consider recent data on automobile fuel economy from the
EPA for 2017 year vehicles
Attributes such as make, model, mpg, class, cylinders,
transmission, valve timing etc
Downloaded from
http://www.fueleconomy.gov/feg/download.shtml
Quick exploration with Excel Pivot Tables, Tableau, and JMP
Interactive Data Visualization with Excel
19
Allows viewing of vast quantities of data quickly and efficiently
Provides better insight into the business problem through
discovery
Generates a call to action
Performs better if interactive and not static for quick
stratification, drill down, and filtering
Relies less on the IT department and empowers workers once
they have access to the data with intuitive tools
Data Visualization
www.introtopolicyinformatics.wikispaces.asu.edu
20
Data visualization methods should allow employees who are not data analysts or scientists the ability to quickly and easily explore data
Domain and business expertise critical to data understanding
More rapidly find trends, generate hypotheses, identify inconsistencies, and determine additional data support requirements
Reduce IT and analyst staff burden—everyone should be numerate
Tension growing in non-data driven organizations
Need to shorten the “kill-chain” of time data is collected until presented as actionable solution to decision makers
– Find, Fix, Track, Target, Engage, Assess (F2T2EA)
Democratization of Data Viz
Goal: Self- Service Approachable Analytics
21
Flight misery map
Interactive Data Visualization For All
Source: Sviokla, Harvard Business Review
22
Police Department: Interactive Criminal Activity
http://www.raidsonline.com/?address=San%20Antonio%20TX
23
San Francisco Police Department with JMP
Data is a sample file in JMP
Use Graph Builder to plot each crime by color
Add street map
Add filter on station
Create HTML with data file
24
San Francisco PD with JMP
A bit more interactive is the Distribution platform
Where is there a disproportionate amount of drug activity?
What days of the week correlate with runaways?
What are some safe precincts?
25
Data visualization is no longer just static charts created by IT
professionals for meetings
Even this graphic is outdated. Many are creating graphs
continuously
Democratization of Data Analytics
Source: TDWI Research, 2013
26
Huge advances in past 25 years in data collection, storage
and access; have ignored the primary tool to make information
meaningful—the human brain
We acquire more information from vision than from all other
senses combined
20 Billion neurons in brain used to form patterns from visual
information
The eye and visual cortex of brain form a massively parallel
processor that provides highest bandwidth channel into human
cognitive centers—Colin Ware, UNH
We seek patterns
The Human Side of Data Visualization
Strive for Interocular Traumatic Impact
27
Jacques Bertin’s Sémiologie Graphique in 1967
describes basic vocabulary of vision of abstract data
– Pre-attentive attributes form the core of good data
visualization methods
– Pre-attentive means without prior conscious
awareness—the things that “pop out” most
We can only “remember” at most chunks of 3
visualizations and even then for only a short period
– So don’t make comparisons difficult, for example by putting the
second item on the next chart or further down the page. Side-by-side is best.
The Human Side of Data Visualization
We have selective visual attention; we are drawn to
familiar patterns, and our working memory is limited
28
Pre-Attentive Attributes
Position
Length
Grouping
Shape
Size
Hue/Contrast
Color
Enclosure
Symmetry
29
Xan’s Pre-attentive Processing Quiz
30
Pre-attentive Processing
31
Graphic Attributes: Quantitative Scales
Better to worse: Position, Position (unaligned), Length, Slope, Angle, Area, Color Density, Color Hue
Based on “Graphical Perception: Theory, Experimentation, and Application …” by William Cleveland and Robert McGill, JASA, Sept. 1984
32
The Human Side of Data Visualization
Color is a key pre-attentive attribute
About 0.5% of females and 8% of males are color blind
– Red-Green is most common
There is a psychology to color
– Red is the color of extremes: love, violence, danger, anger, and adventure
– Yellow captures our attention more than any other color. It stands for happiness and optimism, enlightenment and creativity, sunshine and spring. Lurking in the background is the dark side of yellow: cowardice, betrayal, egoism, and madness. Furthermore, yellow is the color of caution and physical illness (jaundice, malaria, and pestilence).
http://www.colormatters.com/yellow
33
Color
Color choice may tell a very different
story.
– Measles rate of ID vs TN?
HSV
– Hue: wavelength red, yellow…
– Saturation: 1=color, 0=white
– Value: brightness, 1=bright,
0=black
– Contrast with RGB additive system
Beware of default color choices, which often
send the wrong message
– Rainbow schemes
– Intuition (red should be bad, green
good)
Consider your organization’s branding
guidelines
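The HSV description above can be tried out with Python's standard-library colorsys module, which takes hue, saturation, and value each scaled from 0 to 1 and returns additive RGB components. This is only a minimal sketch; the helper name hsv_to_rgb255 and the three sample colors are illustrative choices, not from the slides.

```python
import colorsys


def hsv_to_rgb255(h, s, v):
    """Convert an HSV triple (each component 0..1) to 8-bit RGB integers."""
    r, g, b = colorsys.hsv_to_rgb(h, s, v)
    return tuple(round(c * 255) for c in (r, g, b))


# Hue 0 is red; full saturation and full value give the pure color.
pure_red = hsv_to_rgb255(0.0, 1.0, 1.0)
# Saturation 0 (at full value) washes the color out to white.
white = hsv_to_rgb255(0.0, 0.0, 1.0)
# Value 0 is black regardless of hue or saturation.
black = hsv_to_rgb255(0.0, 1.0, 0.0)

print(pure_red, white, black)
```

Thinking in HSV makes it easier to vary one perceptual property at a time (for example, keeping hue fixed and stepping saturation) than editing raw RGB numbers.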
34
Color Psychology
Red is assertive, bold, powerful; the color of extremes: love, violence,
danger, anger, stop, and adventure
Pink is soft, tranquil, passive, feminine, health, joy
Orange is warmth, compassion, enthusiasm, fun, energy
Green is nature, balance, environment, healthy, calm, rebirth
Blue is dignified, professional, successful, loyal, positive, authoritative,
but also melancholy
Yellow captures our attention most: happiness, optimism, creativity,
sunshine and spring; dark side of yellow: cowardice, betrayal, egoism,
caution, madness, and medical illness (jaundice, malaria,..)
Purple: royalty, luxury, wisdom, inspiration, spiritual
Brown: Natural, reliable, strong, rustic, conservative, ordinary
Black: classy, formal, authority, power, death, troubles, mourning
White: pure, innocent, clean, new, simple, bland
Gray: neutral, respect, humility, stable, wise
35
Visualization Expectations
Test: http://www.youtube.com/watch?v=xAFfYLR_IRY
36
Which direction is the top middle wheel moving?
37
Are the blocks side by side or stacked?
38
Blue and Black or White and Gold?
39
Ebbinghaus Illusion
40
Visualization
41
Visualization—Spooky
42
Visualization
Jared Leto
Dallas Buyers Club, Fight Club,
Thirty Seconds to Mars and super-
stoked about data visualization
43
Visualization in Logos
44
We don’t go in order like reading—the top title may not be read
until well after the visual middle; we spend disproportionate
amounts of time in different features
We see first what stands out—peaks, valleys, intersections,
dominant colors, and outliers
We see only a few things at once—with more than 5-10 variables
or elements individual meaning begins to fade
Empirical Findings (Berinato)
We seek meaning and make
connections—we incessantly
construct narratives of the graph
consciously and subconsciously
We rely on conventions and
metaphors—red is bad, green
good, A-N-AF-M, time is on x
45
Business intelligence and predictive analytics may be viewed as
black boxes and not trustworthy; data visualization can add trust
and provide insight to these solutions
Ultimately it is all about making better business decisions
Value of Data Visualization
Source: TDWI Research, 2013
46
The two areas of data visualization:
– Explanation – tell a story to the audience
– Exploration – understand what the data is telling you
Will take into account the audience’s expectations and composition
Help you to detect relationships in data
Allows you to understand “Big Data”
▪ …of those who are most effective with Big Data, 98% use data visualization techniques
Value of Data Visualization
47
May be overhyped by media, but is here. 90% of data today
wasn’t here 2 years ago
– Transition from mainframe to client server to mobile cloud
– Extract-Transform-Load model is aging
Big Data really has not been solved by most organizations
– Resources dedicated to collecting, storing, organizing, and
cataloging;
– Exploiting Big Data through analytics and viz are behind
Web is more visual, efficient, and data-friendly (Phil Simon,
Visual Organization)
More on Big Data
48
Popular Choices in Data Visualization
Pie charts, line charts and
bar charts still have their
place, but are quickly being
replaced by more informative
and dynamic tools
49
Tabled data in files and spreadsheets are precise and
summary statistics are helpful to understand structure up to a
point.
Graphs quickly convey meaningful relationships that tell a
story and point you in the right direction to solve your problem.
Interactive visualization takes the graphical capabilities a step
further for rapid discovery and hypothesis generation.
Why Graph Data?
The greatest value of a picture is when it forces us to notice what we never
expected to see…John Tukey
50
Understand your data
Size
– Cardinality => high = unique values (acct #); low = repeats (gender)
Determine what you are trying to visualize and information conveyed
Know your audience and how they process information
Use a visual that conveys the information best, simplest, and quickest
A “good” graphic is context sensitive (Berinato)
– Who will see it?
– What do they want? What do they need?
– What could I show? What should I show?
– How will I show it?
Basic Concepts for Data Visualization
51
Enforce visual comparison
– Conclusions can be drawn by comparing data
Show causality
– A graph without causality will have no meaning
Show multivariate data
– Display data using more than two dimensions
Integrate all visual elements
– Use words, numbers and images where appropriate
Content-driven design
– Quality, relevance and integrity
Tufte’s Principles
52
Principles of Good Graphical Design
Communicate the data with clarity, precision and efficiency
Encourage the eye to compare different pieces of data
focusing on substance; intriguing and curiosity provoking
Make large data sets coherent presenting many numbers in
a small space
Reveal the data in several layers of detail
Serve a clear purpose: description, exploration, tabulation,
or decoration
Are closely integrated with statistical and verbal
descriptions of a data set
Are simple, which is much better than unnecessary
complexity
Generate the greatest number of ideas in the shortest time
with the least ink in the smallest space
Reference: The Visual Display of Quantitative Information by Edward Tufte—known as the Strunk and White of
graphics
53
Graphical Pillars for Statistical Stories
Simple
Informative and Important
Seamless
Emphasis
Clean
Reference: Now You See It by Stephen Few
Clear and concrete
Contextual
Sequential
Disclose Uncertainty and Truth
Actionable
Demonstration-R, JMP
Best graph ever?
A quick sketch is better than a long speech. Napoleon (perhaps)
54
Other Candidates for Best Graph Ever
https://commons.wikimedia.org/wiki/File:Nightingale-mortality.jpg
55
Other Candidates for Best Graph Ever
56
Other Candidates for Best Graph Ever
57
Other Candidates for Best Graph Ever
https://flowingdata.com/2015/04/02/how-we-spend-our-money-a-breakdown/
58
Other Candidates for Best Graph Ever
http://science.sciencemag.org/content/345/6196/558
Birth and Death of 150,000 Notable People Over Last 2000 Years
59
Data Visualization Best Practices
Use appropriate scales—start at 0 for bar charts and end a little
above max value. Stop at 100% when using percentiles.
Consider adding reference lines (typically for the y axis) such as the
mean, an industry standard, or at 0
Split data into meaningful sub-graphs (trellis graphics) with exactly
same scales and structure to better interpret multivariate data
Examine your data using a combination of data visualization
methods
Beware of overplotting—e.g. scatterplot that is very dense with
points, need to show the volume within each region
Could make points smaller, hollow, jittered or use heat map for
multiple observations
Source: Now You See It, Stephen Few
60
Inappropriate display choices
Too much information
Misleading axis scaling
Difficult to understand: all capital letters, too many
abbreviations or jargon, vertical text, insensitive to color,
obscure legend
Inconsistent ordering or placement
Graph is taller than it is wide
Too small in presentation
Too artistic
Bad Practices of Graphical Design
If the graphic is bad, the information will be perceived as less credible!
61
Graphs That Cry Help!
http://scienceblogs.com/goodmath/2009/03/more_stupid_graphs.php
http://www.macworld.com/article/134708/2008/07/
http://adesigndive.blogspot.com/2010/11/show-and-tell
http://www.muschealth.com/weight/graph.htm
62
Beach Ball “Graph”
1. Poor color choices
2. Distracting beach ball chartjunk
3. Different fonts throughout
4. Tall not wide
5. Out of order on x scale
6. ALL CAPITAL LETTERS
7. 3 D boxes for 2D data
63
Ink on the graph represents the data
– Maximize data ink and erase as
much non-data ink as possible
– Tufte
– Data-Ink Ratio = 1 – Proportion of
graph that can be erased
– Erase non-data ink so that the
audience is not drawn away from
the importance of the data
– Think of gridlines—how important are they and at what
frequency?
Data-Ink Ratio
64
Proportion of the graph that is dedicated to displaying data
Maximize data density and the size of the data matrix
– Include more data points
– Include more variables
Sparklines
– Demonstration in Excel with GoPro
– Demonstration in JMP with CrimeData
Data Density
65
A value to describe the relation between the size of effect
shown in a graphic and the size of effect shown in the data.
Exaggeration or changing of the scale in a graph
Lie Factor
Reference: The Visual Display of Quantitative Information by Edward Tufte
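The lie factor described above can be written as a ratio of two relative changes: the change drawn in the graphic divided by the change in the underlying data. The sketch below assumes that reading; the function names and the sample numbers (a bar drawn 0.6 and then 5.3 units long to represent values of 18 and 27.5) are illustrative only.

```python
def relative_change(first, second):
    """Fractional change from the first value to the second."""
    return (second - first) / first


def lie_factor(graphic_first, graphic_second, data_first, data_second):
    """Size of effect shown in the graphic divided by size of effect in the data."""
    shown = relative_change(graphic_first, graphic_second)
    actual = relative_change(data_first, data_second)
    return shown / actual


# Illustrative numbers: the drawn lengths grow far faster than the data do.
factor = lie_factor(0.6, 5.3, 18.0, 27.5)
print(round(factor, 1))  # a value near 1 would mean an honest graphic
```

A factor well above 1 means the graphic exaggerates the effect; well below 1 means it understates it.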
66
▪ Decorative elements that provide no data and cause
confusion
▪ Distract the viewer from valuable information
Chart Junk
67
Chart Junk—Another Point of View
Borkin et al. from Harvard and MIT conducted experiment on
what makes a graph memorable
Over 2,000 images; 400 were shown to study participants for 1
second, they then took quiz on which ones they saw.
68
Chart Junk—Another Point of View
Results showed human recognizable objects most important
for memorability
Also helpful are if visualization is “distinct”, visually dense,
colorful and has low data-ink ratio
69
How To Do Data Visualization
Scott Berinato, Good Charts, Harvard Business Review Press
Two primary questions before choosing a graphic:
– Is the information conceptual or data-driven?
– Am I declaring or exploring something?
70
How To Do Data Visualization-Plan
Scott Berinato, Good Charts, Harvard Business Review Press
Prepare (5 mins): have paper and pen, put aside data to think
about ideas, write the basics of who visualization is for and
what setting
Talk and Listen (15 mins): discuss with colleague what you’re
trying to prove or explore; capture words, phrases and
statements to summarize goals
Sketch (20 mins): Focus on keywords from above steps,
quickly sketch out multiple visuals
Prototype (20 mins): take best sketch and make it more
accurate and detailed
Fight the impulse to directly graph your data with preset options
71
How To Do Data Visualization-Create
Scott Berinato, Good Charts, Harvard Business Review Press
Focus on structure and hierarchy
– Need title (12%), subtitle (8%), visual field (75%), and data source line (5%)
Focus on design clarity (“hit the ball squarely”)
– Aggressively remove extraneous elements and let them highlight the idea
– Make sure each element has a single purpose that cannot be misinterpreted
– Use natural conventions and metaphors
Focus on design simplicity
– Minimize number of colors-gray for second level information (gridlines, etc)
– Place labels and legends close to what they describe
Goal is to make the graph more understandable—not more attractive
72
How To Do Data Visualization-Refine to Persuade
Scott Berinato, Good Charts, Harvard Business Review Press
Hone the main idea
– What am I trying to show versus I need to convince them…
– Think active words
Make it stand out
– Emphasize with color, pointers, labels, markers, …
– Isolate it by reducing other elements
Adjust what’s around it
– Add reference points and lines
– Remove elements that distract with integrity
– Create context and comparisons
How can you sell this most effectively?
73
How To Do Data Visualization-Presentation
Scott Berinato, Good Charts, Harvard Business Review Press
Show chart and stop talking
Don’t read the picture, talk about ideas not structure
Guide audience for unconventional visualization
Add context by showing reference, average or ideal graph
versus one shown
Turn off the chart when you have something important to say
Put a more detailed version in backup for them to take away
Create tension by showing parts (builds) so they speculate—
makes it more memorable; use time and reveal gradually
Bait and switch by luring into what they expect and show
different
Deconstruct and reconstruct—drill down dynamically
Does presentation hide something that would rightfully challenge idea?
74
How To Do Data Visualization-Be a Critic
Scott Berinato, Good Charts, Harvard Business Review Press
Note what you see first—what stands out?
Note the first idea that forms, then search for more
What are your likes, dislikes, and “wish I saw” items?
What 3 things would you change and why
Sketch your own version and critique yourself
It’s not the critic who counts…the credit belongs to the man who is
actually in the arena. Teddy Roosevelt
75
Plots of deterministic relationships
Univariate
Bivariate
Multivariate
Time Series
Maps
Graphs
76
Deterministic equations easily plotted across the range of
input variables
Graphing Functions
77
Univariate Categorical Plots
Simple example
but shows key
points
Evaluate the
relative
proportions: part-
to-whole.
Determine best
graphs: bar graphs
and Pareto plots,
though pie charts
are (too) often
used.
[Figure: three charts of "Percent of USAA Mbrs" (Distribution of USAA Membership Eligibility), labeled Good, Better, and Best. Categories: Family Mbr, Retired, Officer, Enlisted, Employee, Gov't Civilian, Other. Data labels: 38%, 15%, 13%, 12%, 10%, 8%, 4%. Value axis: 0% to 40%.]
78
What’s the Matter With Pie Charts
They are overused and do not hit the “pre-attentive”
attributes as hard as other methods
– Length and position are the most discriminatory
Difficult to compare 2-D areas or angles.
DON’T go to 3-D!!
Tough to decipher when a legend forces you to look back
and forth
http://annarborchronicle.com/wp-content/uploads/2012/12/DNR-StatewideByCounty.jpg
http://www.outsidethebeltway.com/wp-content/uploads/2010/01/us-states-population-pie-chart.png
79
If You Use Pie Charts
Keep the number of slices to a maximum of 4 or 5
– Actually not bad since we are so familiar with these
Add text labels for percentages and categories if
desired
Exploded Pie emphasizes a proportion; don’t explode
more than 25% of the slices
Put largest wedge at 1:00 and make progressively
smaller clockwise until 12:00
Donut charts do nut [sic] solve the problem
80
A graphical representation of the distribution of a continuous variable
Group data into similar sized bins and count frequencies in data set
Easy to infer probabilities and relative importance
Commonly occurring “shapes” are known probability distributions
(normal, uniform, exponential, Weibull, Beta…)
Bars can be omitted for a Frequency Polygon
Histograms
Excel now has a menu
option and the Analysis
ToolPak
Bin width is a critical
parameter, should
be the same for all
bins
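The binning step described above (equal-width bins, then counting how many observations fall in each) can be sketched in a few lines of plain Python. The function name and the choice of bins starting at 15 with width 5 are assumptions made for illustration; the values are the MPG Hwy figures listed on the conditional formatting slide of this deck.

```python
def equal_width_histogram(values, start, width, n_bins):
    """Count values in n_bins bins of identical width, beginning at start."""
    counts = [0] * n_bins
    for v in values:
        index = int((v - start) // width)
        if 0 <= index < n_bins:  # values outside the bin range are ignored
            counts[index] += 1
    return counts


mpg_hwy = [21, 23, 40, 34, 33, 48, 25, 23, 25, 38, 30, 26, 28, 26,
           33, 33, 31, 35, 35, 17, 17, 18, 17]

counts = equal_width_histogram(mpg_hwy, start=15, width=5, n_bins=7)

# Text rendering: one row per bin, one '#' per observation.
for i, count in enumerate(counts):
    low = 15 + 5 * i
    print(f"{low:>2}-{low + 5:<2} {'#' * count}")
```

Re-running with a different width (say 2 or 10) shows how strongly the apparent shape depends on the bin width.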
81
Excellent choice to show distributions
– Line at median, box spans the interquartile range (25th to
75th percentile), whiskers extend to 1.5 times IQR or
max/min
– Display differences between populations without
making distributional assumptions; best for multiple boxes
– IQR is a robust measure of spread (useful when checking for equal
variances)
– Excel!! Finally
Box Plots
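The quantities a box plot draws can be computed directly. The sketch below uses Python's statistics.quantiles with the "inclusive" method; quartile conventions differ slightly between Excel, JMP, and R, so treat the exact quartile values as one reasonable convention rather than the definitive one. The function name and dictionary keys are illustrative; the data are the MPG Hwy values listed on the conditional formatting slide.

```python
import statistics


def box_plot_summary(values):
    """Return the numbers behind a box plot: quartiles, whiskers, outliers."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    # Whiskers stop at the most extreme observations inside the fences.
    inside = [v for v in data if low_fence <= v <= high_fence]
    outliers = [v for v in data if v < low_fence or v > high_fence]
    return {
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        "whisker_low": inside[0], "whisker_high": inside[-1],
        "outliers": outliers,
    }


mpg_hwy = [21, 23, 40, 34, 33, 48, 25, 23, 25, 38, 30, 26, 28, 26,
           33, 33, 31, 35, 35, 17, 17, 18, 17]

summary = box_plot_summary(mpg_hwy)
print(summary)
```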
82
Best used when there are several categorical levels of a
variable
– Can quickly evaluate if the variances (size of boxes) are
approximately equal
– Can determine if the means/medians are close based on
relative positioning
– Limited inherent capability in Excel; Box Charter add-in
Box Plots
83
Intuitive way to quickly identify observations that are at the
extreme ends of the distribution
Example: 1918 US Flu Epidemic. Shown below are death
rates per 100,000 for several age groups. Not surprisingly the
babies and elderly had the highest rates. Interestingly, 25-34
year olds also had very high rates. WWI vets were returning via
crowded and infected trains.
Excel Conditional Formatting is excellent.
Heat Maps
Age      M        F        Total Population
<1       2520.5   2020.4   4540.9
1-4      712      724.2    1436.2
5-14     162.5    190.2    352.7
15-24    700.6    475.1    1175.7
25-34    1216.6   781.4    1998
35-44    691.1    406.5    1097.6
45-54    411.8    275      686.8
55-64    420.8    339.2    760
65-74    655.8    636.5    1292.3
75-84    1112.9   1239     2351.9
>84      2111.2   2320.5   4431.7
84
• A creative and best practice approach for part-to-whole graphics
• 80/20 rule—80% of Italy’s wealth held by 20% of residents
• Now an Excel chart option, and accessible via the Histogram option in the
Analysis ToolPak for continuous data
• Pareto Plots are often based on categorical levels of the input factor
Pareto Plot
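A Pareto plot is a bar chart sorted from largest to smallest category with a running cumulative percentage. The sketch below builds that table in plain Python. It reuses the seven membership percentages from the pie chart slide; pairing each percentage with a category in the order shown is an assumption, since the extracted slide lists labels and values separately.

```python
def pareto_table(counts):
    """Sort categories by size (descending) and add cumulative percent."""
    ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    total = sum(value for _, value in ordered)
    running = 0
    rows = []
    for label, value in ordered:
        running += value
        rows.append((label, value, round(100 * running / total, 1)))
    return rows


# Assumed pairing of labels and percentages (see the note above).
membership = {
    "Officer": 13, "Family Mbr": 38, "Other": 4, "Retired": 15,
    "Gov't Civilian": 8, "Enlisted": 12, "Employee": 10,
}

rows = pareto_table(membership)
for label, value, cumulative in rows:
    print(f"{label:<15} {value:>3}%  cumulative {cumulative:>5}%")
```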
85
[Pivot chart: bars for Hybrid and Non hybrid at category values 4, 6, and 8; value axis 0 to 45]
▪ Powerful Excel analytical and graphical capability
▪ Data summarization tool that sums across levels of
variables
▪ Flexible construction of tables/graphs
▪ Easy to filter, dynamic
Pivot Table and Pivot Chart
Average of MPG Hwy   Column Labels
Row Labels    Hybrid   Non hybrid    Grand Total
4             38.75    30            30.52238806
6             -        24.04615385   24.04615385
8             23       19.96610169   20.01666667
Grand Total   35.6     24.76470588   25.046875
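The pivot table above groups rows by one variable, splits columns by another, and averages a third. A minimal sketch of that grouping in plain Python follows. The vehicle records are hypothetical and are NOT the EPA data behind the slide, and reading the row labels 4, 6, 8 as cylinder counts is an assumption; the function and field names are also made up for illustration.

```python
from collections import defaultdict


def pivot_mean(rows, row_key, col_key, value_key):
    """Average value_key for every (row_key, col_key) combination present."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row[row_key], row[col_key])].append(row[value_key])
    return {key: sum(vals) / len(vals) for key, vals in groups.items()}


# Hypothetical records for illustration only.
vehicles = [
    {"cyl": 4, "type": "Hybrid", "mpg_hwy": 40},
    {"cyl": 4, "type": "Hybrid", "mpg_hwy": 38},
    {"cyl": 4, "type": "Non hybrid", "mpg_hwy": 31},
    {"cyl": 4, "type": "Non hybrid", "mpg_hwy": 29},
    {"cyl": 6, "type": "Non hybrid", "mpg_hwy": 24},
    {"cyl": 8, "type": "Non hybrid", "mpg_hwy": 20},
]

table = pivot_mean(vehicles, "cyl", "type", "mpg_hwy")
for (cyl, kind), avg in sorted(table.items()):
    print(f"{cyl} cyl, {kind}: {avg:.2f}")
```

Combinations with no data simply do not appear, which corresponds to an empty cell in the pivot table.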
86
Demonstration Univariate Graphs
▪ Excel, JMP, R
87
▪ An association exists between two variables if the
distribution of one variable changes when the level (or
values) of the other variable changes.
▪ If there is no association, the distribution of the first
variable is the same, regardless of the level of the other
variable.
Association between Variables
Anscombe’s
Quartet
All have same
mean, variance,
regression
equation and
correlation
coefficient
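The claim about Anscombe's Quartet is easy to verify numerically. The sketch below hard-codes the four published data sets and computes the mean, sample variance, and Pearson correlation of each with the standard library only; the helper pearson_r is written out by hand rather than taken from any package. The summary statistics agree to about two decimals even though the four scatterplots look nothing alike.

```python
from statistics import mean, variance

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson_r(x, y):
    """Pearson correlation coefficient computed from deviations about the means."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


stats = {
    name: (mean(x), variance(x), mean(y), variance(y), pearson_r(x, y))
    for name, (x, y) in quartet.items()
}
for name, values in stats.items():
    print(name, [round(v, 2) for v in values])
```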
88
▪ Consider the fuel efficiency data, we can quickly see
relationships between several variables
Scatterplot Matrix
89
▪ Useful to display multiple dependent variables as a function of
a single continuous independent variable
▪ Can also do part-to-whole
▪ Beware of “hiding” data
Area Plot
90
Specify two or more variables for the response and can
effectively use color for another variable
Parallel Plot, Parallel Coordinates Plot, Parallel Axis Plot
91
• Contour lines show level of a variable
• Useful in multivariate applications to find locations of min/max
response
Contour Plots—Association Between 3 Variables
92
▪ Best graphs: bar graphs, histograms, mosaic plots, and
tree maps:
• Brushing methods enable you to see the distribution
of a categorical variable conditioned on the setting of
another.
• Mosaic plots and tree maps are especially good with
multiple levels of a categorical variable; unfortunately
Excel does not have these graphs.
▪ For two (or more) categorical variables, the relationship is
measured by association or dependence, not correlation.
Association – Categorical Variables
93
Association – Categorical Variables
Titanic Example
▪ Mosaic plots allow you to determine if survival is dependent on what class you were in and also if gender makes a difference
▪ Excel not set up for Treemaps, but MS has add-in
94
▪ Brushing or dynamic linking is a great interactive way to
explore relationships in your data by evaluating how a
subset of observations based on the level of one variable
behaves in another.
▪ We see the cross hatched values correspond to the 8
cylinder vehicles that also have more HP and less MPG
Association – Brushing
95
▪ Researchers have been looking into the science of data
visualization
▪ Rensink 2010 applied Weber’s Law to scatterplots: “a noticeable
change in stimulus is a constant ratio of the original
stimulus”
▪ Think of a match lit in a pitch black room versus a lighted room
▪ First time method is available to calculate graph effectiveness
▪ Good-scatterplots in positive correlation direction
▪ Okay-parallel coordinate plots, scatterplot negative direction,
stacked area
▪ Bad-stacked bar, radar plots
Which is Best
96
▪ Profilers and Contour Plots
▪ F-18 Central Composite Design
Multiple Response Optimization
Modeling Visualization
97
▪ Excel, JMP, R
▪ Anscombe’s Quartet
Demonstration for Multivariate Graphs
98
Time Series- Book Sales
99
Amazon.com Data Mining Example
Peter Lawrence’s The Making of a Fly
Bargain at $23,698,655.93 + $3.99 shipping
Two wholesalers’ algorithms went awry
– 17 used at $35, but for new n=2 copies
– Bordee=1.27 X Profnath; Profnath=.998 Bordee
– Algorithm quickly goes out of control—Apr 18 $23M
– Apr 19 Profnath =$106 and Bordee=$135
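The runaway price follows from the two rules quoted above: each round, Profnath sets its price to 0.998 times Bordee's, then Bordee sets its price to 1.27 times Profnath's, so the pair grows by a factor of about 1.27 × 0.998 ≈ 1.2675 per round. The sketch below simulates that loop. The starting price of $100 and the assumption of one repricing round per step are hypothetical; the slide does not give the starting price or the exact timing.

```python
def simulate_repricing(start_price, target, undercut=0.998, markup=1.27):
    """Apply the two pricing rules in turn until Bordee's price passes target."""
    bordee = start_price
    history = []
    while bordee <= target:
        profnath = undercut * bordee   # Profnath slightly undercuts Bordee
        bordee = markup * profnath     # Bordee marks up over Profnath
        history.append((profnath, bordee))
    return history


# Hypothetical starting price; the multipliers come from the slide.
history = simulate_repricing(start_price=100.0, target=23_000_000.0)

print(len(history), "repricing rounds to pass $23M")
print(f"final prices: Profnath ${history[-1][0]:,.2f}, Bordee ${history[-1][1]:,.2f}")
```

A line plot of either price on a log scale would show a straight line, the signature of unchecked exponential growth that a simple time series chart could have flagged early.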
100
Time Series
Look for six basic patterns: overall trend, variability, rate of
change, covariation, cycles, and exceptions/outliers.
Best graphs are line graphs, overlay plots, and bar graphs;
many other displays are possible over time (for example,
box plots and bubble plots).
Run charts with control limits are the cornerstone of process control methods.
75% of graphs are time series
101
Time Series in Excel
Decent capability for customizable line graphs
New for 2016 was Waterfall Charts
Demonstration
102
Time Series: Google Trends
Is interest in golf waning? What does this mean for Under Armour?
103
JMP Output Google Trends
104
Time Series
Line Plots—use to display patterns, trends, cycles, and
exceptions
Bar Graphs—use to emphasize or compare values (e.g.Budget
versus Actual)
Dot Plots—use if have irregular time intervals. In Excel, just
delete line in line plot with markers
Heat Maps for high volume of data to find exceptions and cycles
Box Plots for analyzing distribution (mean and variance)
changes over time
Animation may be useful to see changes over time
105
Time Series Line Chart shows comparison, variability, cycles, trend
105
[Figure: line chart, "Lord Abbett Monthly Price Change Compared to S&P 500"; series: S&P and Lord Abbett; y-axis from -0.25 to 0.2; x-axis monthly from 9/1/2003 to 11/1/2005]
[Figure: stacked column chart, "Stacked Column Chart for Lord Abbett vs S&P 500"; series: S&P and Lord Abbett; y-axis from -0.25 to 0.2]
Stacked Column shows the differences better
Both use Excel defaults apart from title and legend size
Time series analysis methods exist to forecast
106
▪ A small, intense, simple, word-sized graphic with typographic resolution
▪ Typically in a cell after the last value in an ordered series (e.g., time)
▪ Can be everywhere a word or number can be: embedded in a sentence, table, headline, map, spreadsheet, graphic
▪ Introduced in Excel 2010; add-in available for Excel 2007
Sparklines
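The same word-sized idea can be sketched outside Excel with Unicode block characters. This is only an illustration of the concept, not how Excel draws sparklines; the function name is hypothetical.

```python
BARS = "▁▂▃▄▅▆▇█"  # eight block heights, lowest to highest


def sparkline(values):
    """Map each value to one of eight block characters, scaled to the series range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return BARS[0] * len(values)  # flat series: draw a flat line
    scale = (len(BARS) - 1) / (hi - lo)
    return "".join(BARS[round((v - lo) * scale)] for v in values)


if __name__ == "__main__":
    monthly_sales = [12, 15, 14, 18, 22, 21, 25, 30]  # made-up values
    print("Sales", sparkline(monthly_sales), monthly_sales[-1])
```

The result is a plain string, so it can sit in a sentence or table cell next to the last value of the series.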
107
▪ Use to highlight specific observations that exceed a certain value (e.g., 30 MPG), fall within a range, and so forth
▪ Top or bottom 10%, 10 values, above the mean, below the mean
▪ Data bars to show where each observation ranks
▪ Icon sets for dashboard-type graphics
Conditional Formatting—Beyond Heat Maps
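The rules above reduce to simple tests on each cell. A minimal Python sketch, using the 23 MPG Hwy values from the slide's example column; the function names are hypothetical, and Excel's own rule definitions (for example, how ties are handled) may differ.

```python
from statistics import mean

# MPG Hwy values from the slide's example column
MPG_HWY = [21, 23, 40, 34, 33, 48, 25, 23, 25, 38, 30, 26,
           28, 26, 33, 33, 31, 35, 35, 17, 17, 18, 17]


def exceeds(values, threshold):
    """Flag values strictly above a threshold (e.g., 30 MPG)."""
    return [v > threshold for v in values]


def top_n(values, n):
    """Flag the n largest values (ties at the cutoff are all flagged)."""
    cutoff = sorted(values, reverse=True)[n - 1]
    return [v >= cutoff for v in values]


def above_mean(values):
    """Flag values above the column mean."""
    avg = mean(values)
    return [v > avg for v in values]


def data_bar(value, max_value, width=20):
    """Text data bar whose length is proportional to the value."""
    return "█" * round(width * value / max_value)


if __name__ == "__main__":
    for v, flagged in zip(MPG_HWY, exceeds(MPG_HWY, 30)):
        marker = "*" if flagged else " "
        print(f"{v:3d} {marker} {data_bar(v, max(MPG_HWY))}")
```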
[Figure: the "MPG Hwy" column shown four times, each with a different conditional formatting rule applied]
MPG Hwy: 21, 23, 40, 34, 33, 48, 25, 23, 25, 38, 30, 26, 28, 26, 33, 33, 31, 35, 35, 17, 17, 18, 17
108
Maps
Big push in recent years has been geospatial or mapping
Many software options
Maps can be extremely useful, but they do limit our pre-attentive attributes; use with caution
Still very good for visual discovery and analytics
Think beyond geography of what a map is!
109
▪ NFL 2015 season predictions from Nate Silver’s
fivethirtyeight.com. Data viz software is D3.js
Alluvial Plots
http://www.brightpointinc.com/2015-nfl-predictions/
110
Visualization of Text Data
Consider the Pareto of word counts from an article this year on cnn.com
Can you get the general idea of what might have happened from these frequencies alone?
Note we’ve “stemmed” words => hors = horse, horses, horsing, horse’s, … and, of course, hors
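A small sketch of how such a Pareto of stemmed word counts could be produced. The suffix-stripping rule here is a deliberately crude stand-in, not the stemmer used for the slide (which is not named), and the function names are hypothetical.

```python
import re
from collections import Counter

SUFFIXES = ("ing", "es", "s", "e")  # crude illustrative list, not a real stemmer


def crude_stem(word, min_stem=4):
    """Strip a possessive and at most one common suffix, keeping at least min_stem letters."""
    if word.endswith("'s"):
        word = word[:-2]
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


def word_pareto(text, top=10):
    """Return (stem, count) pairs sorted from most to least frequent."""
    tokens = (t.strip("'") for t in re.findall(r"[a-z']+", text.lower()))
    counts = Counter(crude_stem(t) for t in tokens if t)
    return counts.most_common(top)


if __name__ == "__main__":
    sample = "The horse won. Horses love horsing around; the horse's owner cheered."
    for stem, count in word_pareto(sample, top=5):
        print(f"{stem:10s} {'#' * count} {count}")
```

In practice common "stop words" such as "the" would usually be dropped before counting.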
111
Word Cloud
http://www.cnn.com/2015/06/06/us/belmont-stakes-american-pharoah/
112
Word Clouds
www.wordle.net is a very fun (free) site to paste in your text
and make your own word clouds
113
Text Groupings From Eigenvectors
Statistical methods can tame the unstructured text to find
words that cluster together and common themes
114
Sentiment Analysis
In many applications, such as with online product reviews,
we would like to know whether the customer base has a
positive, neutral, or negative attitude about the product or
service
We can count the number of “positive” and “negative” words
using a generic list of terms; it may be useful to have
custom “positive” and “negative” word lists.
Combine this with Twitter and other social media, and you
have real-time feedback: “opinion mining”
The method simply cross-tabulates the document term matrix (DTM)
with the Harvard list of positive and negative sentiment words
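The counting approach can be sketched in a few lines of Python. The two word lists below are tiny hypothetical stand-ins, not the Harvard list, and the function names are illustrative only.

```python
import re

# Hypothetical mini-lexicons for illustration; a real analysis would load a full list
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "poor", "hate", "terrible", "broken"}


def sentiment_counts(text, positive=POSITIVE, negative=NEGATIVE):
    """Return (positive_count, negative_count) for the words in text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    pos = sum(t in positive for t in tokens)
    neg = sum(t in negative for t in tokens)
    return pos, neg


def classify(text):
    """Label text as positive, negative, or neutral by net word count."""
    pos, neg = sentiment_counts(text)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    reviews = ["Great product, I love it",
               "Terrible, arrived broken",
               "It arrived on Tuesday"]
    for review in reviews:
        print(classify(review), "|", review)
```

Simple counts like these ignore negation and sarcasm ("not good"), which is one reason custom word lists can help.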
115
Sentiment Analysis
Look at Bible by Book and Chapter
116
Sentiment Analysis Demonstration in JMP
117
Sentiment Analysis Demonstration in JMP
118
Correlation of Word Pairs from DTM
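A minimal sketch of the idea, assuming a document term matrix (DTM) with one row per document and one column per word: the correlation of a word pair is the Pearson correlation of their two count columns. The documents and function names are hypothetical.

```python
from collections import Counter
from math import sqrt


def build_dtm(documents):
    """Return (vocabulary, matrix) where matrix[i][j] counts word j in document i."""
    counts = [Counter(doc.lower().split()) for doc in documents]
    vocab = sorted(set().union(*counts))
    return vocab, [[c[term] for term in vocab] for c in counts]


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def term_correlation(vocab, matrix, word_a, word_b):
    """Correlate the DTM columns for two words."""
    ia, ib = vocab.index(word_a), vocab.index(word_b)
    return pearson([row[ia] for row in matrix], [row[ib] for row in matrix])


if __name__ == "__main__":
    docs = ["engine failure on takeoff", "engine failure during climb",
            "smooth landing on runway", "hard landing on runway"]
    vocab, dtm = build_dtm(docs)
    print(term_correlation(vocab, dtm, "engine", "failure"))
    print(term_correlation(vocab, dtm, "engine", "landing"))
```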
119
What Words Are Associated with Fatal?
Different word frequencies
[Figure panels: Not Fatal | Fatal]
120
What Words Are Associated with Fatal?
Crosstab the structured variable Fatal with word counts
121
Tree Model for Fatal on NTSB DTM
• Classification tree groups observations based on the presence or absence of a word
• If “land” is in the write-up, a fatality is very unlikely unless “mountain” is too
• If “stall/spin” is in the write-up, a fatality is very likely
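One step of such a tree can be sketched as picking the word whose presence or absence best separates fatal from non-fatal reports. The sketch below uses Gini impurity as the split criterion, which is an assumption (the software used for the slide may use a different criterion), and the seven toy reports are invented, not NTSB data.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_split_word(docs, labels):
    """Return (word, impurity_reduction) for the best presence/absence split."""
    vocab = sorted(set().union(*docs))
    parent = gini(labels)
    best = None
    for word in vocab:
        present = [y for d, y in zip(docs, labels) if word in d]
        absent = [y for d, y in zip(docs, labels) if word not in d]
        weighted = (len(present) * gini(present)
                    + len(absent) * gini(absent)) / len(labels)
        gain = parent - weighted
        if best is None or gain > best[1]:
            best = (word, gain)
    return best


# Invented toy write-ups (as word sets) and fatal flags, for illustration only
TOY_DOCS = [{"land", "runway"}, {"land", "gear"}, {"land", "mountain"},
            {"stall", "spin"}, {"stall", "turn"}, {"land", "taxi"},
            {"taxi", "gear"}]
TOY_FATAL = [0, 0, 1, 1, 1, 0, 0]

if __name__ == "__main__":
    print(best_split_word(TOY_DOCS, TOY_FATAL))
```

A full tree repeats this search within each resulting group, which is how a second split such as “mountain” under “land” can appear.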
122
Identifying Topics Via Latent Semantic Analysis
• Factor loadings from Singular Value Decomposition (SVD) of the document term matrix
– Creates U (document) and V (term or words) reduced rank matrices
– Fortunately, they are linked so we can go back and forth between the two
• Plotting first two eigenvectors of V can show most dominant themes
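A minimal sketch of this decomposition, assuming NumPy is available. The toy document term matrix is invented so that two themes are obvious; the function name is hypothetical, and no centering or term weighting (such as TF-IDF) is applied, which real analyses often add.

```python
import numpy as np

# Invented toy DTM: rows are documents, columns are terms
TOY_TERMS = ["horse", "race", "engine", "stall"]
TOY_DTM = [[3, 2, 0, 0],
           [2, 3, 0, 0],
           [0, 0, 2, 1],
           [0, 0, 1, 2]]


def lsa_coordinates(dtm, k=2):
    """Return (singular_values, doc_coords, term_coords) for the top k dimensions."""
    X = np.asarray(dtm, dtype=float)
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    doc_coords = U[:, :k] * s[:k]    # documents in the reduced space
    term_coords = Vt[:k].T * s[:k]   # terms in the same reduced space
    return s[:k], doc_coords, term_coords


if __name__ == "__main__":
    s, doc_xy, term_xy = lsa_coordinates(TOY_DTM)
    print("Singular values:", s)
    for term, (x, y) in zip(TOY_TERMS, term_xy):
        print(f"{term:8s} {x:7.3f} {y:7.3f}")  # scatter-plot these to see themes
```

The link between the two matrices shows up directly: the document coordinates equal the DTM multiplied by the unscaled term vectors, so one can move from words to documents and back.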
123
Other Text Visualizations
• Arc Diagram of Les Miserables
http://gastonsanchez.com/software/les_miserables_arcdiagram.pdf
124
Text Visualization
Text analytics is not going away
Word clouds and frequency counts are helpful
Document Term Matrix is the key to finding relationships
between words
Visualizing Singular Value Decomposition of DTM allows
you to find topics, quantify unstructured data, and cluster
both words and documents
Sentiment analysis visualization methods help gauge
overall preferences and emotions
125
Dashboards!
Avalanches in Tableau
126
Executive-level summary graphs showing key metrics; 4 stages:
▪ NOTICE: get the eye to move to the right place
▪ FOCUS: quickly understand the insights
▪ INVESTIGATE: intuitive way to drill down and explore
▪ ACT: right insight at the right time to take the right action
Accessible via mobile devices
Dashboards
“Dashboards have become a popular means to present critical business information at a glance, but few do so effectively. Huge investments are made in Information Technology to produce actionable information, only to have it robbed of meaning at the very last stage of the process: the presentation of insights to those responsible for making decisions. When designed well, dashboards engage the power of visual perception to communicate a dense collection of information in an instant with exceptional clarity.” (Stephen Few)
127
Survey on Dashboard Effectiveness
Metrics of a good dashboard
Most organizations have room to improve—especially with
unstructured data
TDWI Research Survey 2013
128
Stephen Few’s Common Mistake 1: Exceeding the
Boundaries of a Single Screen
More difficult for mind to recall information that is no longer visible
Seeing everything on one screen allows for quicker and easier
comparisons, which lead to quicker insights
People often think information that they must scroll to see is of
less importance than what is directly in front of them
http://www.helpsystems.com/sites/default/files/article/SQ_APRIL_14-key-performance-indicator-vertical-dashboard.gif
129
Common Mistake 2: Supplying Inadequate Context
for the Data
Meaningful context is key to understanding the information
presented
Context should be incorporated in a way that does not
distract the reader from the key message
Context should only be included when it adds real value to
avoid crowding and distraction
http://www.excelcharts.com/blog/wp-content/uploads/2008/03/dundas-gauges1.png
130
Common Mistake 3: Displaying Excessive Detail or
Precision
Too much detail slows reader without providing benefit
http://www.funkylab.com/post/KPI-What-Where-Why-and-how-many
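One way to avoid excessive precision on a dashboard is to round large values to a compact form before display. A small hedged sketch; the function name, thresholds, and one-decimal choice are illustrative assumptions, not a rule from the source.

```python
def format_compact(value, currency="$"):
    """Format a number with one decimal and a K/M/B suffix for dashboard display."""
    for threshold, suffix in ((1e9, "B"), (1e6, "M"), (1e3, "K")):
        if abs(value) >= threshold:
            return f"{currency}{value / threshold:.1f}{suffix}"
    return f"{currency}{value:.0f}"


if __name__ == "__main__":
    for amount in (23_698_655.93, 4_512.75, 87.4):
        print(f"{amount:>15,.2f}  ->  {format_compact(amount)}")
```

The full-precision value can still be kept for drill-down; only the at-a-glance display is shortened.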
131
Common Mistake 4: Expressing Measures Indirectly
Must know what is being measured and in what units
Must find the measure that conveys the meaning most effectively
Find the message needed by viewer, then select best measure to
support message
132
Common Mistake 5: Choosing Inappropriate Display
Media
Pie charts don’t display quantitative data effectively
Humans can’t compare 2-D areas effectively
Linear displays such as bar graphs convey information more effectively
Common mistake in all quantitative data presentations
http://www.danielpradilla.info/blog/wp-content/uploads/2012/11/pie_charts_vs_bar_charts_2.png
133
Common Mistake 6: Introducing Meaningless Variety
People typically don’t like using the same type of chart or graph more than once on a dashboard. This often is detrimental to the dashboard.
Always use the display medium that is most effective even if the dashboard already uses that display medium.
Wherever appropriate, consistency in means of display allows readers to use the same strategy in interpreting information, which saves time.
http://sourceforge.net/p/art/wiki/Images/attachment/dashboard-example.png
134
Common Mistake 7: Using Poorly Designed
Display Media
Components of the dashboard must be designed to
communicate clearly and efficiently
Most graphs are designed poorly
Legends force reader’s eyes to go back and forth, wasting
time
http://www.perceptualedge.com/blog/wp-content/uploads/2009/02/sas-revenue-graph.jpg
135
Common Mistake 8: Encoding Quantitative Data
Inaccurately
Sometimes design errors result in graphs representing
values inaccurately
http://2.bp.blogspot.com/_z6QlRCOBLgo/TD3vOzHzA2I/AAAAAAAAABw/Bz0rOm4Ijns/s1600/standard_vs_diffable+(1).png
136
Common Mistake 9: Arranging Information Poorly
Dashboards must be well organized, with data appropriately placed based on importance and proper viewing sequence, and framed within a visual design that segregates information into meaningful groups – Stephen Few
Make the dashboard look good but most importantly, arrange the information in a manner that fits its use
Make important information stand out
Data that needs to be compared should be arranged and visually designed to foster comparisons
http://flylib.com/books/en/2.412.1.25/1/
137
Common Mistake 10: Highlighting Information
Ineffectively or Not at All
Viewer’s eye should be directed to the most critical
information
Not all information is of equal importance
http://www.brightedge.com/sites/default/files/LG_02_Customizable_Dashboards.jpg
138
Common Mistake 11: Cluttering the Display with
Visual Effects
Backgrounds, artistry, and decorations only distract from
the important information presented
http://4.bp.blogspot.com/-WSVMiug9SS4/URFgmO0yYxI/AAAAAAAAAug/AbVYKgTd-wU/s1600/Snap4.png
139
Common Mistake 12: Misusing or Overusing Color
Color should not be used haphazardly
Hot colors demand attention
Cool colors do not demand attention
Contrasts call attention
Same color creates a relationship between two displays on a dashboard
http://1.bp.blogspot.com/-zT3nda4OvLU/UZyS2mFmn6I/AAAAAAAAEEA/aMemsFmvxxc/s1600/Main+DB.png
140
Common Mistake 13: Designing an Unattractive
Visual Display
Ugly dashboards make the viewer want to look away,
making him or her less inclined to understand all of the
information presented
http://www.dashboardzone.com/wp-content/uploads/2008/04/image-76.jpg
141
Tableau Software
Strictly data visualization software; most popular in industry
– Stock symbol: DATA
Connects to standard data sources, proprietary databases,
and big data sources such as Hadoop, Teradata, and Google BigQuery
Highly interactive, pretty powerful, and can quickly make
graphs
https://public.tableau.com/s/gallery/good-value-mbas
Goals:
– Make data understandable
– Manage large data streams
– Promote data discovery
– Help business decisions
142
Tableau-Hurricanes
Bring in Excel file, add rows (lat), cols (long), color (basin),
label (name)
Change marks (line), path (ISO time/hour), size (wind), color
(name), filter (basin)
Animate by pages(ISO time(day)), check Show History
143
Sports Analytics
1. Yankees 2012 HR waterfall chart in Tableau (running total, rev)
2. Spurs 2014-15 performance stats using Graph Builder/Col Sw
144
Sports Analytics
Hockey greats scatterplot by position
145
R Statistical Programming Language
Definitely not strictly data visualization software; most used
open source stats software
It can certainly do data visualization, though it may require some
proficiency in the language first
Decent capability from main packages and libraries
ggplot2 seems to have the largest following and the most capabilities
– Grammar of graphics; good defaults; layered, customizable, static results
– Partner package ggvis enables web-based interactive graphics
Web-based with Shiny and Markdown; interactive htmlwidgets,
plotly, d3, googleVis, and many more packages
Decent review of interactive graphics and data wrangling packages in a
March 2017 Computerworld post
146
ggplot2
148
Hans Rosling
http://www.gapminder.org/tools
Great TED talks!
149
Thank you. Thank you very much.
Questions?