Guerrilla Analytics: 7 Principles for Agile Analytics (Predictive Analytics World 2015)
TRANSCRIPT
#GuerrillaAnalytics http://guerrilla-analytics.net
Guerrilla Analytics: 7 Principles for Agile Analytics
ENDA RIDGE, PHD
Copyright Enda Ridge 2015
2. What You Will Learn
• Why you must identify and mitigate disruptions in projects
• How the Guerrilla Analytics Principles help
• Case study on the Guerrilla Analytics Principles in action
How this will help you
• Data Scientists: you need a defensive Guerrilla Analytics mindset. Without it you will be overwhelmed by the highly iterative nature of predictive analytics
• Managers and Directors: you need a Guerrilla Analytics capability for a high performing team
3. What I’ve Learned
• PhD ‘Design of Experiments for Tuning Algorithms’
• Boutique Consultancy
• Forensic Data Analytics
• Senior Manager Professional Services
• Head of Algorithms
Timeline: 2004, 2008, 2010, 2012, 2015
No matter the industry, teams are always plagued by the same problem:
time is wasted in the confusion and chaos of highly iterative Data Science.
4. Teams Need ‘Guerrilla Analytics’
Data: Extraction • Receipt • Loading
Analytics: Transform • Algorithms • Consolidate
Insight: Reporting • Work Products
Disruption
5. 7 Guerrilla Analytics Principles
Principle 1: Space is cheap, confusion is expensive
Principle 2: Prefer simple, visual project structures and conventions
Principle 3: Prefer automation
Principle 4: Maintain Data Provenance
Principle 5: Version control changes
Principle 6: Consolidate team knowledge
Principle 7: Prefer code that runs from start to finish
6. Case Study
7. Case Study: Business Problem
Situation: A pharma organization’s programme to improve its Identity Access Management (IAM). IAM ensures that IT access privileges are granted according to one interpretation of policy.
Objective: identify ‘permission roles’ that group up common IT permissions
Benefits:
• IT efficiency. Assign roles instead of individual permissions
• Staff and systems are properly authenticated and audited
• Ensure company data is not at risk of being misused
• Avoid regulatory non-compliance
8. Case Study: Data Science Problem
System    User  Permission
System01  Chaz  Email
System01  Chaz  Network
System01  Dave  Email
System02  Chaz  Emailing
System02  Chaz  Sharepoint
System02  Dave  Sharepoint
System02  Meg   Email
System02  Meg   Sharepoint
System02  Meg   Network
…         …     …
• Find common subsets of permissions
• These are ‘permission roles’ for Identity Access Management
• 70 systems
• Thousands of permissions
• Users can access several systems
• All systems are different
• Team is mobilized and ready to review permissions
9. Case Study: Approach
System    User  Permission
System01  Chaz  Email
System01  Chaz  Network
System01  Dave  Email
System02  Chaz  Emailing
System02  Chaz  Sharepoint
System02  Dave  Sharepoint
System02  Meg   Email
System02  Meg   Sharepoint
System02  Meg   Network
…         …     …
User  Permission
Chaz  Email
Chaz  Sharepoint
Chaz  Network
Dave  Email
Dave  Sharepoint
Meg   Email
Meg   Sharepoint
Meg   Network
…     …
Seems like a popular group
10. Case Study: Approach
User   Permission
Chaz   Email
Chaz   Sharepoint
Chaz   Network
Dave   Email
Dave   Sharepoint
Meg    Email
Meg    Sharepoint
Meg    Network
Sarah  Email
Sarah  Sharepoint
Sarah  Network
…      …
Or is it this bigger group?
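The approach on the two slides above (collapse the System/User/Permission rows to one permission set per user, then look for sets shared by many users) can be sketched in Python. This is a minimal illustration and not the method used on the project. The `ALIASES` lookup that treats "Emailing" as "Email", the function names, and the "System03" placeholder for Sarah (whose source system is not shown on the slides) are all assumptions.

```python
from collections import defaultdict
from itertools import combinations

# Rows copied from the slides, plus Sarah from the second Approach slide.
# Sarah's system is not shown, so "System03" is a placeholder.
ROWS = [
    ("System01", "Chaz", "Email"),
    ("System01", "Chaz", "Network"),
    ("System01", "Dave", "Email"),
    ("System02", "Chaz", "Emailing"),
    ("System02", "Chaz", "Sharepoint"),
    ("System02", "Dave", "Sharepoint"),
    ("System02", "Meg", "Email"),
    ("System02", "Meg", "Sharepoint"),
    ("System02", "Meg", "Network"),
    ("System03", "Sarah", "Email"),
    ("System03", "Sarah", "Sharepoint"),
    ("System03", "Sarah", "Network"),
]

# Hypothetical clean-up rule: systems may name the same permission differently.
ALIASES = {"Emailing": "Email"}


def permissions_by_user(rows):
    """Collapse (system, user, permission) rows into one permission set per user."""
    result = defaultdict(set)
    for _system, user, permission in rows:
        result[user].add(ALIASES.get(permission, permission))
    return dict(result)


def itemset_support(user_perms, min_size=2):
    """Count how many users hold every permission in each candidate itemset."""
    support = defaultdict(int)
    for perms in user_perms.values():
        for size in range(min_size, len(perms) + 1):
            for itemset in combinations(sorted(perms), size):
                support[frozenset(itemset)] += 1
    return dict(support)


user_perms = permissions_by_user(ROWS)
support = itemset_support(user_perms)
```

Enumerating every combination like this only works on toy data. With thousands of permissions, an Apriori or FP-growth style algorithm that prunes infrequent itemsets would be needed.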
11. Data
Data: Extraction • Receipt • Loading
Analytics: Transform • Algorithms • Consolidate
Insight: Reporting • Work Products
12. Data Receipt: Situation
2015-10-01.log
EMAIL_Server.csv
EMAIL_Server.csv2
IAM from Joe.log
2015-10-05.log
Security logs.log
2015-10-07.log
…
• Multiple files from 70 different systems
• No consistency
• Delivered at different points in time
• Refreshed at irregular intervals
Disruption
13. Data Receipt: Guerrilla Analytics
Data
D001 • 2015-10-01.log
D002 • EMAIL_Server.csv
D003 • EMAIL_Server.csv2
D004 • IAM from Joe.log
D005 • 2015-10-05.log
…
Principle 1: Space is cheap, confusion is expensive
Principle 2: Prefer simple, visual project structures and conventions
Principle 4: Maintain Data Provenance
• Robust to multiple data deliveries
• Robust to random file names and customer inconsistencies
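The data receipt convention above (every delivery gets its own D-numbered folder and keeps its original name) could be automated along these lines. This is a sketch under assumptions: the `receive` function, the `data_log.csv` log file and the `source` field are illustrative names that do not come from the talk, and the demo runs in a temporary directory.

```python
import csv
import shutil
import tempfile
from pathlib import Path


def receive(delivery: Path, data_root: Path, source: str) -> str:
    """Copy a delivered file, unchanged, into the next D### folder and log it."""
    data_root.mkdir(parents=True, exist_ok=True)
    existing = [p for p in data_root.iterdir() if p.is_dir() and p.name.startswith("D")]
    data_id = f"D{len(existing) + 1:03d}"
    target = data_root / data_id
    target.mkdir()
    shutil.copy2(delivery, target / delivery.name)  # keep the original file name
    with (data_root / "data_log.csv").open("a", newline="") as log:
        csv.writer(log).writerow([data_id, delivery.name, source])
    return data_id


# Demo in a temporary directory so nothing real is touched.
workspace = Path(tempfile.mkdtemp())
inbox = workspace / "inbox"
inbox.mkdir()
names = ["2015-10-01.log", "EMAIL_Server.csv", "EMAIL_Server.csv2"]
ids = []
for name in names:
    (inbox / name).write_text("raw content\n")
    ids.append(receive(inbox / name, workspace / "data", source="customer"))
```

Because the ID comes from the order of receipt and not from the file name, a second delivery of a file with the same name simply lands in a new folder.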
14. Data Loading: Situation
Raw Schema
2015-10-01.log
EMAIL_Server.csv
EMAIL_Server.csv2
IAM from Joe.log
2015-10-05.log
Security logs.log
2015-10-07.log
…
• Files loaded all over the analytics environment
• Files renamed
• Files moved
• Files ‘archived’
• Raw files edited
Disruption
15. Data Loading: Guerrilla Analytics
Raw Schema
D001 2015-10-01.log
D002 EMAIL_Server.csv
D003 EMAIL_Server.csv2
D004 IAM from Joe.log
D005 2015-10-05.log
D006 Security logs.log
D007 2015-10-07.log
…
Principle 1: Space is cheap, confusion is expensive
  • Keep everything
Principle 2: Prefer simple, visual project structures and conventions
  • One place for raw data
Principle 4: Maintain Data Provenance
  • Don’t rename, move, modify in any way
• Robust to crazy inconsistent files
• Force code to explicitly use data IDs
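One way to "force code to explicitly use data IDs" is to carry the ID into the name of the raw table. The sketch below is an assumption-laden illustration: an in-memory SQLite database stands in for the raw schema, the `raw_catalog` table and `load_raw` function are hypothetical names, and the file content is invented for the demo.

```python
import csv
import io
import sqlite3


def load_raw(conn, data_id, file_name, text):
    """Load a delimited file as-is into a raw table named after its data ID."""
    rows = list(csv.reader(io.StringIO(text)))
    header, body = rows[0], rows[1:]
    table = f"raw_{data_id}"
    columns = ", ".join(f'"{name}" TEXT' for name in header)  # everything stays text
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    placeholders = ", ".join("?" for _ in header)
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', body)
    conn.execute(
        "INSERT INTO raw_catalog VALUES (?, ?, ?)", (data_id, file_name, len(body))
    )
    return table


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_catalog (data_id TEXT, file_name TEXT, row_count INTEGER)")

# Invented file content, for illustration only.
sample = "IDent,Usr,sys,PTY\n3477,Charlie,Email4.5,Read\n4598,Snoopy,Email4.5,Read; send\n"
table = load_raw(conn, "D002", "EMAIL_Server.csv", sample)

# Downstream code has to name the data ID explicitly.
rows = conn.execute('SELECT "Usr", "PTY" FROM "raw_D002" ORDER BY "IDent"').fetchall()
```

Raw field names and values are loaded untouched. Renaming and casting belong to later, separately tracked steps.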
16. Analytics
Data: Extraction • Receipt • Loading
Analytics: Transform • Algorithms • Consolidate
Insight: Reporting • Work Products
17. Transformation: Situation
Lots of renaming
IDent  Usr      sys       PTY
3477   Charlie  Email4.5  Read
4598   Snoopy   Email4.5  Read; send
…      …        …         …
• 70 different systems
• Unhelpful field names
• Evolving understanding of correct fields
id     user     system    permission
3477   Charlie  Email4.5  Read
4598   Snoopy   Email4.5  Read; send
…      …        …         …
Disruption
18. Transformation: Guerrilla Analytics
Principles in Action
Principle 3: Prefer automation
Principle 4: Maintain Data Provenance
Principle 5: Version control changes
Principle 6: Consolidate team knowledge
• Robust to evolving names and inconsistencies
• Data provenance of field names
IDent  Usr      sys       PTY
3477   Charlie  Email4.5  Read
4598   Snoopy   Email4.5  Read; send
…      …        …         …
id     user     system    permission
3477   Charlie  Email4.5  Read
4598   Snoopy   Email4.5  Read; send
…      …        …         …
dataset  from   to
Sys1     IDent  id
Sys1     Usr    user
…        …      …
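The dataset/from/to mapping table can drive the renaming automatically, so that a change in understanding becomes a one-row change to a version-controlled mapping. The sketch below is illustrative only: the slide shows just the first two mapping rows, so the `sys` and `PTY` rows are inferred from the before and after tables, and `rename_fields` is a hypothetical helper.

```python
# Mapping rows in the slide's layout: (dataset, from, to).
FIELD_MAPPING = [
    ("Sys1", "IDent", "id"),
    ("Sys1", "Usr", "user"),
    ("Sys1", "sys", "system"),      # inferred from the before/after tables
    ("Sys1", "PTY", "permission"),  # inferred from the before/after tables
]


def rename_fields(dataset, records, mapping=FIELD_MAPPING):
    """Rename record keys using the mapping rows for one dataset."""
    lookup = {src: dst for ds, src, dst in mapping if ds == dataset}
    unmapped = {key for record in records for key in record} - set(lookup)
    if unmapped:
        # Fail loudly so a new or changed field cannot slip through unnoticed.
        raise KeyError(f"No mapping for {dataset} fields: {sorted(unmapped)}")
    return [{lookup[key]: value for key, value in record.items()} for record in records]


raw = [
    {"IDent": "3477", "Usr": "Charlie", "sys": "Email4.5", "PTY": "Read"},
    {"IDent": "4598", "Usr": "Snoopy", "sys": "Email4.5", "PTY": "Read; send"},
]
clean = rename_fields("Sys1", raw)
```

Keeping the mapping as data, not as hand-edited column names, is what gives the field names their provenance: every clean name can be traced back to a raw name and a dataset.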
19. Algorithm: Situation
1: Choose data • Apply mapping
2: Cast • Index
3: Reshape & Join • Apply Rules • Tidy
4: Apply Algorithm • Check Output
Disruption
• Where do my outputs go?
• How to iteratively develop code/rules etc?
• Different algorithm parameters
• Different algorithms
• How do I iterate with the broader team and customer?
20. Work Products: Guerrilla Analytics
Principles in action
Work Products
WP001
  010_Reshape.sql
  020_Apply_Rules.sql
  030_Algorithm.py
  050_Reports.py
  050_Report.ppt
WP002
WP003
…
Principle 1: Space is cheap, confusion is expensive
  • Keep everything
Principle 2: Prefer simple, visual project structures and conventions
  • One place for each output
Principle 4: Maintain Data Provenance
  • Code, plots, reports etc in one place
• Robust to multiple iterative work products
• Scalable to team of any size
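The numbered file names in a work product folder make the run order visible at a glance, and a small helper can read that order back. The following is a sketch under assumptions: `execution_order` and `next_work_product` are hypothetical helpers, and treating only `.sql` and `.py` files as runnable code is a choice made for this example.

```python
import re
import tempfile
from pathlib import Path

PREFIX = re.compile(r"^(\d{3})_")


def execution_order(work_product: Path):
    """Return the numbered code files of a work product in run order."""
    numbered = []
    for path in work_product.iterdir():
        match = PREFIX.match(path.name)
        if match and path.suffix in {".sql", ".py"}:
            numbered.append((int(match.group(1)), path.name))
    return [name for _number, name in sorted(numbered)]


def next_work_product(root: Path) -> Path:
    """Create the next WP### folder; existing folders are never reused."""
    taken = [int(p.name[2:]) for p in root.iterdir() if re.fullmatch(r"WP\d{3}", p.name)]
    folder = root / f"WP{max(taken, default=0) + 1:03d}"
    folder.mkdir()
    return folder


# Demo in a temporary directory using the file names from the slide.
root = Path(tempfile.mkdtemp())
wp = next_work_product(root)
for name in ["030_Algorithm.py", "010_Reshape.sql", "050_Report.ppt",
             "020_Apply_Rules.sql", "050_Reports.py"]:
    (wp / name).write_text("")
order = execution_order(wp)
second = next_work_product(root)
```

Each iteration gets a fresh folder, so earlier outputs are never overwritten, in line with "space is cheap, confusion is expensive".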
21. Early result
[Chart: permission groups ABC, AB, ABD, EF, E, A, BD, G, ZY, WZY against an axis running from 50 to 100]
• Taking too long to cover users
• Still too many permission groups; suspect data quality
• Could tweak the itemset mining algorithms
• Need to iterate and improve
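The transcript does not say exactly how the chart was computed. One plausible reading is a cumulative coverage curve: rank candidate permission groups by how many users they account for, then accumulate the percentage of users covered. The sketch below follows that reading. The users, the single-letter permissions and the rule that a user counts as covered only by an exact match are all invented for illustration.

```python
from collections import Counter


def cumulative_coverage(user_perms):
    """Rank exact permission sets by popularity and accumulate % of users covered."""
    counts = Counter(frozenset(perms) for perms in user_perms.values())
    total = len(user_perms)
    covered = 0
    curve = []
    for role, users in counts.most_common():
        covered += users
        curve.append(("".join(sorted(role)), round(100 * covered / total, 1)))
    return curve


# Invented users and permissions (A, B, C, ...), purely for illustration.
USER_PERMS = {
    "u1": {"A", "B", "C"}, "u2": {"A", "B", "C"}, "u3": {"A", "B", "C"},
    "u4": {"A", "B", "C"}, "u5": {"A", "B", "C"},
    "u6": {"A", "B"}, "u7": {"A", "B"}, "u8": {"A", "B"},
    "u9": {"E", "F"},
    "u10": {"G"},
}
curve = cumulative_coverage(USER_PERMS)

# Number of groups needed to reach a chosen coverage target (80% here).
roles_needed = next(i for i, (_role, pct) in enumerate(curve, start=1) if pct >= 80)
```

A long tail of small groups at the right of such a curve is the kind of signal that would prompt a look at data quality before tuning the algorithm.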
22. Iteration
23. Analysis: Situation
Data: Latest data • Latest mapping
Analysis: Tidy data format • Apply itemset mining
Insight: ?
Disruption
• Wasted effort in repetition
• Risk of inconsistency in repetitions
• Need clear view of how understanding has evolved
24. Guerrilla Analytics: Consolidate
1: Choose data • Apply mapping
2: Cast • Index
3: Reshape & Join • Apply Rules • Tidy
4: Published Interface Datasets
Analysis 1
Analysis 2
…
25. Guerrilla Analytics: Consolidate
• Build tool automation
• Version controlled code
26. Guerrilla Analytics: Consolidate
• Build tool automation
• Version controlled code
Principle 3: Prefer automation
Principle 4: Maintain Data Provenance
Principle 5: Version control changes
Principle 6: Consolidate team knowledge
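The consolidation idea (one automated build that runs the shared steps from start to finish and publishes interface datasets that every analysis reads) might look like this in miniature. It is a sketch and not the project's build tool: the step functions, the sample rows and the hard-coded `BUILD_VERSION` are assumptions, and only some of the steps named on the slides are represented.

```python
BUILD_VERSION = "2.2"  # illustrative constant; a real build would take it from version control


def choose_data(_):
    # Step 1: pick the data deliveries for this build (invented sample rows).
    return [{"Usr": "Charlie", "PTY": "Read"}, {"Usr": "Snoopy", "PTY": "Read; send"}]


def apply_mapping(rows):
    # Step 1: rename raw fields to the agreed names.
    names = {"Usr": "user", "PTY": "permission"}
    return [{names[k]: v for k, v in row.items()} for row in rows]


def tidy(rows):
    # Step 3: one row per user and permission.
    return [
        {"user": row["user"], "permission": part.strip().lower()}
        for row in rows
        for part in row["permission"].split(";")
    ]


STEPS = [choose_data, apply_mapping, tidy]


def build(steps=STEPS):
    """Run every step in order, start to finish, and return the published dataset."""
    data = None
    for step in steps:
        data = step(data)
    return {"build_version": BUILD_VERSION, "interface_dataset": data}


def analysis_1(published):
    # Analyses read only the published interface dataset, never the raw data.
    return sorted({row["permission"] for row in published["interface_dataset"]})


published = build()
permissions = analysis_1(published)
```

Because every analysis starts from the same published dataset, a fix to a mapping or rule is made once in the build and not repeated in each analysis.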
27. Reporting
Data: Extraction • Receipt • Loading
Analytics: Transform • Algorithms • Consolidate
Insight: Reporting • Work Products
28. Iterative Analysis
[Chart: permission groups ABC, AB, ABD, EF, E, etc against an axis running from 50 to 100]
• Data cleaning and algorithm tuning give better results
• Clear version of ‘consolidated knowledge’
• Clear work products for each iteration
29. Reporting
30. Reporting: Situation
We analysed the latest data, applying an itemset mining algorithm to recommend permission roles. Results suggest an optimal cut-off of 3 permission roles to cover 80% of user activities. The remaining users should be reviewed in light of….
[Chart: permission groups ABC, AB, ABD, EF, E, etc against an axis running from 50 to 100]
31. Reporting: Situation
• Which latest data? Which systems?
• Which algorithm? Which parameters?
• Which business rules?
• What recommendations?
• How is it different from last iteration?
Disruption
32. Guerrilla Analytics: Project Structure
Files
Project
  data
    D001
    D002
    D010
    …
  …
  work prod
    WP_001
    WP_002
33. Guerrilla Analytics: Project Structure
Files
Project
  data
    D001
    D002
    D010
    …
  …
  work prod
    WP_001
    WP_002
Data Science environment
Project
  data
    D001
    D002
    …
  build
    clean_data
    algo_input
  work prod
    WP_001
    WP_002
34. Reporting: Guerrilla Analytics
• Which latest data? Which rules? Which systems? Build version 2.2
• Which algorithm parameters? What recommendations? Work product 042
• How is it different from last iteration? Work product 031 versus 042
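The answers on this slide amount to stamping every report with its build version and work product ID. A small record type is one way to carry that stamp around. In the sketch below, the `ReportProvenance` class and its field names are hypothetical, while the values 2.2, 042 and 031 are the ones quoted on the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReportProvenance:
    build_version: str   # answers: which data, rules and systems?
    work_product: str    # answers: which algorithm parameters and recommendations?
    compared_with: str   # previous work product, for "what changed?"

    def footer(self) -> str:
        """One line that can be appended to a report or slide."""
        return (
            f"Source: build version {self.build_version}, "
            f"work product {self.work_product} "
            f"(previous iteration: work product {self.compared_with})"
        )


stamp = ReportProvenance(build_version="2.2", work_product="042", compared_with="031")
line = stamp.footer()
```

With the IDs on the report itself, a reader can go from any claim back to the folder and build that produced it.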
35. Guerrilla Analytics Success
• Coped with multiple inconsistent data deliveries
• Robust to evolving business rules and moving target of live systems
• Quick turnaround of different algorithms while closing out permission roles in a live system
• Project delivered in weeks rather than months
36. Guerrilla Analytics Capability
Agility
  3. Guerrilla Analytics Mindset
  2. Supporting Tools
  1. Simple Conventions
37. Guerrilla Analytics Capability
Agility
  3. Guerrilla Analytics Mindset
    • 7 Guerrilla Analytics Principles
    • 100+ practice tips
    • Data Science patterns
  2. Supporting Tools
    • Build Tools
    • Tracking
    • Version control
  1. Simple Conventions
    • Data receipt
    • Data load
    • Tidy Data format
    • …
38. Summing up
• Agility means delivering despite disruptions
• High performing agile teams have capability to mitigate disruptions
• 7 Guerrilla Analytics Principles for defensive Data Science
• Guerrilla Analytics Principles in action across
  • Data receipt
  • Data load
  • Iterative work products
  • Consolidation
  • Reporting
@Enda_Ridge
http://guerrilla-analytics.net