using visualizations to monitor changes and harvest insights from a global-scale logging...
DESCRIPTION
Slides from my talk at the IEEE Conference on Visual Analytics Science and Technology (VAST) 2014 in Paris, France.
ABSTRACT
Logging user activities is essential to data analysis for internet products and services. Twitter has built a unified logging infrastructure that captures user activities across all clients it owns, making it one of the largest datasets in the organization. This paper describes challenges and opportunities in applying information visualization to log analysis at this massive scale, and shows how various visualization techniques can be adapted to help data scientists extract insights. In particular, we focus on two scenarios: (1) monitoring and exploring a large collection of log events, and (2) performing visual funnel analysis on log data with tens of thousands of event types. Two interactive visualizations were developed for these purposes: we discuss design choices and the implementation of these systems, along with case studies of how they are being used in day-to-day operations at Twitter.
TRANSCRIPT
Using visualizations to monitor changes and harvest insights from log data at Twitter
Krist Wongsuphasawat (@kristw) & Jimmy Lin (@lintool)
Logging user activities & data analysis
[Diagram labels: Users, Use, Twitter, Write, Log data in Hadoop, Engineers, Instrument, Product Managers, Curious]
What is being logged?
activities
tweet from home timeline on twitter.com
tweet from search page on iPhone
sign up
log in
retweet
etc.
Organize?
log event a.k.a. “client event” [Lee et al. 2012]
1) User ID 2) Timestamp 3) Event name 4) Event detail
client : page : section : component : element : action
web : home : timeline : tweet_box : button : tweet
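The six-part event name above can be split into named fields. The following is a minimal sketch, assuming the name arrives as a colon-separated string; `ClientEventName` and `parse_event_name` are hypothetical names chosen for illustration, not part of Twitter's infrastructure.

```python
from typing import NamedTuple


class ClientEventName(NamedTuple):
    """Hypothetical container for the six levels of a client event name."""
    client: str
    page: str
    section: str
    component: str
    element: str
    action: str


def parse_event_name(name: str) -> ClientEventName:
    """Split 'client:page:section:component:element:action' into fields."""
    parts = [part.strip() for part in name.split(":")]
    if len(parts) != 6:
        raise ValueError(f"expected 6 components, got {len(parts)}: {name!r}")
    return ClientEventName(*parts)


event = parse_event_name("web : home : timeline : tweet_box : button : tweet")
print(event.client, event.page, event.action)
```

Unused levels appear as `-` on the slides, so they are kept as ordinary string values here rather than converted to `None`.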
Log data
bigger than Tweet data
[Diagram, built up over several slides. Labels: Users, Use, Twitter, Write, Log data in Hadoop, Engineers, Instrument, Product Managers, Curious, Ask, Data Scientists, Find, Clean, Analyze, Monitor. Markers 1 and 2 appear on the final diagram.]
Part I Find & Monitor Client Events
Motivation
Log data in Hadoop
billions of rows
Aggregate
Client event collection
10,000+ event types
date      client  page  section  comp.  elem.  action      count
20141011  web     home  home     -      -      impression  100
20141011  web     home  wtf      -      -      click       20
(wtf = Who-to-Follow)
Engineers & Data Scientists
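The aggregation step (raw log rows in, one count per date and event name out) might look like the sketch below. It is an in-memory stand-in for what the slides describe as a Hadoop job; the row layout follows the four fields listed earlier, and the `YYYYMMDDhhmmss` timestamp format and the sample rows are assumptions made for illustration.

```python
from collections import Counter


def aggregate_daily_counts(rows):
    """Count events per (date, event name).

    rows: iterable of (user_id, timestamp, event_name, detail) tuples,
    where timestamp is assumed to be a 'YYYYMMDDhhmmss' string.
    """
    counts = Counter()
    for _user_id, timestamp, event_name, _detail in rows:
        date = timestamp[:8]  # keep only YYYYMMDD
        counts[(date, event_name)] += 1
    return counts


# Hypothetical rows, for illustration only.
rows = [
    (1, "20141011093000", "web:home:home:-:-:impression", {}),
    (2, "20141011093500", "web:home:home:-:-:impression", {}),
    (1, "20141011094000", "web:home:wtf:-:-:click", {}),
]
for (date, name), count in sorted(aggregate_daily_counts(rows).items()):
    print(date, name, count)
```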
Find
Search: client, page, section, component, element, action
section? component? element?
Query: web home * * * impression
Results:
web : home : home : - : - : impression
web : home : wtf : - : - : impression
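A field-by-field wildcard search over the client event collection can be sketched as follows. This assumes a query of six space-separated patterns, one per level, with `*` matching any value; `search_events` is a hypothetical helper, not the search tool shown in the talk.

```python
from fnmatch import fnmatchcase


def search_events(event_names, query):
    """Return event names whose six fields match the six query patterns."""
    patterns = query.split()
    if len(patterns) != 6:
        raise ValueError("query needs one pattern per level (6 in total)")
    results = []
    for name in event_names:
        fields = name.split(":")
        if len(fields) == 6 and all(
            fnmatchcase(field, pattern)
            for field, pattern in zip(fields, patterns)
        ):
            results.append(name)
    return results


collection = [
    "web:home:home:-:-:impression",
    "web:home:wtf:-:-:impression",
    "web:home:wtf:-:-:click",
    "iphone:home:-:-:-:impression",
]
print(search_events(collection, "web home * * * impression"))
```

A search like this only helps someone who already knows which values to type for each level, which is the limitation the following slides raise.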
search can be better
10,000+ event types
not everybody knows
What are all sections under web:home?
one graph / event x 10,000
Client event collection
Engineers & Data Scientists
Goals
• Search for client events
• Explore client event collection
• Monitor changes
Related work
• Session analysis [Lam et al. 2007, Shen et al. 2013]
• Monitor network logs, not user activity logs [Ghoniem et al. 2013]
Design
Engineers & Data Scientists
See
Client event collection
Interactions: search box => filter (narrow down)
How to visualize?
client : page : section : component : element : action
Client event hierarchy
iphone:home:-:-:-:impression
iphone:home:-:tweet:tweet:click
[Tree: iphone > home > - branches into (- > - > impression) and (tweet > tweet > click)]
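Turning flat event names into the hierarchy shown on this slide amounts to building a prefix tree. A minimal sketch using nested dictionaries (the representation is an assumption; the slides do not say how the tool stores the tree):

```python
def build_hierarchy(event_names):
    """Build a nested dict where each level of the event name is a tree level."""
    root = {}
    for name in event_names:
        node = root
        for part in name.split(":"):
            # Reuse the child if it exists, otherwise create it.
            node = node.setdefault(part, {})
    return root


tree = build_hierarchy([
    "iphone:home:-:-:-:impression",
    "iphone:home:-:tweet:tweet:click",
])
# The two events share the path iphone > home > - and then branch.
print(sorted(tree["iphone"]["home"]["-"]))
```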
Detect changes
TODAY compared to 7 DAYS AGO
[Two copies of the hierarchy: iphone > home > - branches into (- > - > impression) and (tweet > tweet > click)]
Calculate changes
DIFF
+5% +5% +5%
+10% +10% +10%
-5% -5% -5%
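The change calculation can be sketched as a percentage difference per tree node between today's counts and the counts from 7 days ago. Summing leaf counts up into every ancestor is an assumption about how inner nodes get their values, and the sample counts are hypothetical.

```python
from collections import Counter


def rollup(counts):
    """Add each event's count to every prefix (ancestor node) of its name."""
    totals = Counter()
    for name, n in counts.items():
        parts = tuple(name.split(":"))
        for depth in range(1, len(parts) + 1):
            totals[parts[:depth]] += n
    return totals


def percent_changes(today, week_ago):
    """Percent change per node; None when there is no baseline to compare to."""
    now, before = rollup(today), rollup(week_ago)
    changes = {}
    for node in now.keys() | before.keys():
        old, new = before.get(node, 0), now.get(node, 0)
        changes[node] = None if old == 0 else 100.0 * (new - old) / old
    return changes


# Hypothetical counts, for illustration only.
today = {"iphone:home:-:-:-:impression": 110, "iphone:home:-:tweet:tweet:click": 95}
week_ago = {"iphone:home:-:-:-:impression": 100, "iphone:home:-:tweet:tweet:click": 100}
for node, change in sorted(percent_changes(today, week_ago).items()):
    print(":".join(node), change)
```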
Display changes
[Hierarchy with changes displayed on it: iphone > home > - branches into (- > - > impression) and (tweet > tweet > click)]
Map of the Market [Wattenberg 1999], StemView [Guerra-Gomez et al. 2013]
Demo Scribe Radar
Twitter for Banana
Deployment
• Since Dec 2013
• 500 unique users, 10 users / day
• No training
Users: PMs, Data Scientists, Engineers
Use cases
• Search
• Monitor
• See effects after major product launch
read the paper :)
Part II Analysis
Count page visits
home page: banana : home : - : - : - : impression
Funnel
home page > profile page
Funnel analysis
home page: banana : home : - : - : - : impression
profile page: banana : profile : - : - : - : impression
search page: banana : search : - : - : - : impression
home page > profile page: 1 job, 1 hour
adding search page: 2 jobs, 2 hours
Specify all funnels manually! n jobs, n hours
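A single manually specified funnel, as in the home page to profile page case, can be counted per session as sketched below. This is an in-memory illustration with hypothetical sessions, not the Hadoop job from the talk, and it assumes the steps must occur in order but need not be adjacent.

```python
def funnel_counts(sessions, steps):
    """For each funnel step, count sessions that reached it in order."""
    reached = [0] * len(steps)
    for events in sessions:
        position = 0  # index of the next step this session has to hit
        for event in events:
            if position < len(steps) and event == steps[position]:
                reached[position] += 1
                position += 1
    return reached


HOME = "banana:home:-:-:-:impression"
PROFILE = "banana:profile:-:-:-:impression"
SEARCH = "banana:search:-:-:-:impression"

# Hypothetical sessions, for illustration only.
sessions = [
    [HOME, PROFILE],
    [HOME, SEARCH],
    [PROFILE],
    [HOME, HOME, PROFILE],
]
print(funnel_counts(sessions, [HOME, PROFILE]))
```

Each additional funnel means another call with a different `steps` list, which mirrors the "n jobs, n hours" problem on the slide.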
Goal
home page: banana : home : - : - : - : impression
… ……
1 job => all funnels, visualized
Related work
• Visualize an overview of event sequences [Wongsuphasawat et al. 2011, Monroe et al. 2013, …]
• Big data? eBay checkout sequences [Shen et al. 2013]
One funnel at a time: Checkout > Payment > Confirm > Success
LifeFlow [CHI2011] (simplified)
User sessions
Session#1: start > A > B > end
Session#2: start > A > B > end
Session#3: start > A > C > end
Session#4: start > A > end
Aggregate
4 sessions
[Animation: the four sessions are merged step by step into one tree. start > A, then A branches into B, C and end; B and C each lead to end.]
Aggregate
4,000,000 sessions
[Same tree: start > A, then A branches into C, B and end]
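The aggregation of sessions into one tree can be sketched as a prefix tree with a count on every node. This is a simplified illustration of the idea on the slides, not the LifeFlow implementation; the `Node` class is hypothetical, and an explicit `"end"` event is appended so that sessions of different lengths stay distinguishable.

```python
class Node:
    """One node of the aggregated tree: how many sessions passed through it."""

    def __init__(self):
        self.count = 0
        self.children = {}


def aggregate_sessions(sessions):
    """Merge event sequences into a prefix tree; the root stands for 'start'."""
    root = Node()
    for events in sessions:
        node = root
        node.count += 1
        for event in list(events) + ["end"]:
            node = node.children.setdefault(event, Node())
            node.count += 1
    return root


# The four sessions from the slides.
root = aggregate_sessions([["A", "B"], ["A", "B"], ["A", "C"], ["A"]])
a = root.children["A"]
print(a.count, {event: child.count for event, child in a.children.items()})
```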
try with sample data (~millions of sessions, 10,000+ event types)
original paper (100,000 sessions, ~10 event types)
not meaningful!
small slice of data but huge file
How to make it work?
Reduce # of unique sequences
1. Reduce event types
   10,000 types
   select: tweet, sign up, log out
   merge: tweet from home timeline, tweet from search page, tweet … = tweet
2. Reduce sequence length
   session: 1000 events
   alignment: visit home page
   window size & direction: 10 events after
(1 and 2: Ask users for input)
3. More aggregation on Hadoop
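The two user-driven reductions (select and merge event types, then cut each session to a window around an alignment event) can be sketched as one function. The names `reduce_session` and `merge_map` are hypothetical, aligning on the first occurrence of the alignment event is an assumption, and the short event names stand in for full client event names.

```python
def reduce_session(events, merge_map, alignment, window_size, direction="after"):
    """Reduce one session's event sequence.

    merge_map: raw event name -> merged name; events missing from it are dropped.
    alignment: merged event name to align on (first occurrence is used).
    Returns the alignment event plus up to window_size events after (or before)
    it, or None if the session never contains the alignment event.
    """
    # 1. Reduce event types: select (drop unmapped) and merge (rename).
    kept = [merge_map[e] for e in events if e in merge_map]
    # 2. Reduce sequence length: window around the alignment event.
    if alignment not in kept:
        return None
    i = kept.index(alignment)
    if direction == "after":
        return kept[i : i + 1 + window_size]
    return kept[max(0, i - window_size) : i + 1]


# Hypothetical mapping and session, for illustration only.
merge_map = {
    "home_impression": "visit home page",
    "tweet_from_home": "tweet",
    "tweet_from_search": "tweet",
    "logout": "log out",
}
session = ["other", "home_impression", "tweet_from_home", "other",
           "tweet_from_search", "logout"]
print(reduce_session(session, merge_map, "visit home page", window_size=2))
```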
Collapse events
e.g. tweet, tweet, tweet, … = tweet
Sequence (before): ABBBCCCC ABBCC ABC ABCCCC ABCD ABCCCD ABCCE ABCDF ABCDG ABCDH
Sequence (after): ABC ABC ABC ABC ABCD ABCD ABCE ABCDF ABCDG ABCDH
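Collapsing runs of the same event into a single event is a one-liner with `itertools.groupby`. A minimal sketch, using single letters as event names in the same way the slides do:

```python
from itertools import groupby


def collapse(sequence):
    """Replace each run of identical consecutive events with one event."""
    return [event for event, _run in groupby(sequence)]


for raw in ["ABBBCCCC", "ABBCC", "ABCCCD", "ABCCE"]:
    print(raw, "".join(collapse(raw)))
```

Only adjacent repeats are merged, so a sequence such as A, B, A keeps both occurrences of A.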
Group & Count
Sequence  Count
ABC       2000
ABCD      80
ABCE      20
ABCDF     1
ABCDG     1
ABCDH     1
ABCDI     1
ABCDJK    1
ABCDJL    1
rare sequences (count < threshold)
Truncate
Replace last event with x (…)
Sequence  Count
ABC       2000
ABCD      80
ABCE      20
ABCDx     1
ABCDx     1
ABCDx     1
ABCDx     1
ABCDJx    1
ABCDJx    1
Group & Count
Sequence  Count
ABC       2000
ABCD      80
ABCE      20
ABCDx     4
ABCDJx    2
Truncate more
Sequence  Count
ABC       2000
ABCD      80
ABCE      20
ABCDx     4
ABCDx     2
Group & Count
Sequence  Count
ABC       2000
ABCD      80
ABCE      20
ABCDx     6
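The truncate, group and count loop can be sketched as below. It repeats until nothing changes, replacing the last real event of every rare sequence with a placeholder `x` and re-grouping. The slides do not give the threshold; a value of 3 is assumed here because it reproduces the counts shown, and the function names are hypothetical.

```python
from collections import Counter

PLACEHOLDER = "x"  # stands for "…" (something rare happened here)


def truncate_once(seq):
    """Replace the last real event with the placeholder."""
    body = seq[:-1] if seq and seq[-1] == PLACEHOLDER else seq
    return body[:-1] + (PLACEHOLDER,)


def truncate_rare(counts, threshold):
    """Truncate sequences with count < threshold, re-group, repeat until stable."""
    counts = Counter(counts)
    while True:
        merged = Counter()
        for seq, n in counts.items():
            key = truncate_once(seq) if n < threshold else seq
            merged[key] += n
        if merged == counts:
            return merged
        counts = merged


raw = {"ABC": 2000, "ABCD": 80, "ABCE": 20, "ABCDF": 1, "ABCDG": 1,
       "ABCDH": 1, "ABCDI": 1, "ABCDJK": 1, "ABCDJL": 1}
result = truncate_rare({tuple(s): n for s, n in raw.items()}, threshold=3)
for seq, n in sorted(result.items(), key=lambda item: -item[1]):
    print("".join(seq), n)
```

The trade-off is the one named later in the talk: detail about rare paths is given up so that the number of patterns stays small enough to visualize.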
Final process
1. Define set of events
2. Pick alignment, direction and window size
3. Run Hadoop job (with more aggregation)
4. Wait for it… (2+ hrs)
5. Visualize
gazillion patterns (TBs)
~100,000 patterns (10MB)
Demo Flying Sessions
Deployment
• Since Jan 2013
• Fewer users, but more in-depth ad-hoc analysis
• Initial meeting to provide support
Case studies
• What did users do when they visit Twitter? (in demo)
• Where did users give up in the sign up process?
  click on “sign up” > fill personal info > import address book > etc.
• more in the paper
read the paper :)
Conclusions & Future work
• Large-scale User Activity Logs + Visual Analytics
• Find, Monitor & Explore + Anomaly detection & automatic alert
• Funnel Analysis + More interactivity & data / reduce wait time / latency study?
• Used in day-to-day operations at Twitter
• Generalize to smaller systems
Challenge: big data => aggregate & sacrifice => small data => visualize & interact
Acknowledgement
• Data Scientists & Engineers @Twitter: Linus Lee, Chuang Liu
• Feedback from reviewers, Ben Shneiderman & Catherine Plaisant
Twitter is looking for a data analyst (contractor).
Questions?
Thank you
[email protected] / @kristw