lecture on data science in a data-driven culture
Data-Driven Culture: DATA-DRIVEN and DATA-SCIENCE
Johan Himberg / Reaktor 29.2.2016
WHY
Why “data-driven”
Brynjolfsson et al. (2011) on data-driven decision making:
Survey data on the business practices and IT investments of 179 large, publicly traded companies.
Firms that emphasise “data-driven decision making” have output and productivity that is 5-6% higher than what would be expected given their other investments and IT usage.
The relationship also appears in asset utilisation, return on equity and market value.
WHY
Data Science in business
Business acumen: what for
Operations research: optimal decisions and actions
Probability theory: how to handle uncertainties
Analytics: insights and machine learning from data
Computer science: how to implement all that
Data Science & analytics: BASICS
BASICS
Some dimensions
1. Business case
2. Analytical task
   1. Active - Passive system
   2. Informative - Operative aim
   3. Modelling (model selection and fitting)
   4. Data: structure, amount, velocity, and source
REAKTOR / JOHAN HIMBERG, FEBRUARY 2016
Data Science & analytics: BUSINESS CASES
Beware of empty “data-speak”
A quote from my colleague Janne Sinkkonen, from a presentation at a University of Helsinki machine learning course:
“Data-speak” hides the processes behind data. What creates the data? What is done with the results?
The goal is not “data analysis”
Define your goal and setup without using the word ‘data’.
BUSINESS CASE
Information business
Sell audiences: Google, Facebook, media, …
Sell information: credit rating, car register, …
BUSINESS CASE
Operations
Create beneficial events
• marketing: targeting, cross-sell, up-sell, conversion
• find right product/service to sell or buy, find a good doctor, expert etc.
Avoid non-beneficial events
• churn, people leaving, waste, credit loss, fraud, …
• system failures, …
Optimize
• customer value, work force, schedules, prices, discounts, stocks
• relevancy for customer
• production quality, speed
Rationalise
• process efficiency, lead times, handle complexity, search time, …
Understand: customer & product base, transactions, or processes
• internally: ERP, CRM, HR, sales systems, production, …
• externally: location, routes, weather, demographics, estates, …
BUSINESS CASE
Strategic
Efficiency and competition
• React faster, streamlined decision making, risk awareness
• Financial efficiency
• Innovations
Well-informed strategic decisions
• Understanding customer groups’ needs for product and service development
• Understanding and predicting world events, economics, demographics, …
• React to market fluctuation or changes in financial environment
Internal and external image and culture
• Transparency, learning as a part of company culture
• Customer satisfaction, personalisation, brand
VIRTUES
Example
Netflix
"The goal of our ranking system is to find the best possible ordering of a set of items for a member, within a specific context, in real-time. ... Our business objective is to maximize member satisfaction and month-to-month subscription retention, which correlates well with maximizing consumption of video content."
- 2012 Xavier Amatriain and Justin Basilico, Personalization Science and Engineering
Data Science & analytics: TASKS & RISKS
BASICS
Informative - Operative
Informative (for understanding)
Analysis results that help people understand things and help management make decisions: reports, predictions, what-if analyses, simulations, visualisations, …
Operative
An automated system that makes decisions based on rules or models, or results that are directly operative even if not automated.
BASICS
Active - Passive
Active
You make an “intervention” and gather evidence in tests designed to reveal an effect.
Example: A/B testing.
Passive
Data is just collected, captured “as it happens”: customer transactions, sales, web-browsing, tweets
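The A/B testing example can be sketched as a comparison of two conversion rates. This is only an illustration under stated assumptions: the visitor and conversion counts are hypothetical, and a two-proportion z-test is one common choice of test, not necessarily the one used in the lecture.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (difference in conversion rate, z statistic, two-sided p-value)."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that A and B convert equally.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_b - p_a, z, p_value

# Hypothetical counts: 120 of 2400 visitors converted on A, 156 of 2400 on B.
diff, z, p_value = two_proportion_z_test(120, 2400, 156, 2400)
print(f"uplift={diff:.4f}, z={z:.2f}, p={p_value:.4f}")
```

The point of the active setup is that the two groups differ only by the intervention, so the difference can be read as its effect.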
BASICS
Use cases
Axes: Passive - Active, Informative - Operative
Descriptive: What has happened?
• Customer profiles
• Customer segmentation
• Shopping cart analysis
Diagnostic: Why did it happen?
• Marketing impact analysis
• Price elasticity analysis
• Web design testing
Predictive: What will happen?
• Up-sell/cross-sell
• New customer acquisition
• Churn prediction
• Life-time value prediction
• Demography prediction
Prescriptive: What should I do?
• Marketing impact optimisation
• Recommendation system in a dynamic environment
Data Science & analytics: RISKS & PROBLEMS
RISKS / PROBLEMS
Issues by analytics use case
Axes: Passive - Active, Informative - Operative
Descriptive
• isolated / ad hoc reports
• isolated ad hoc decisions
• feedback loop (report - decision - effect)
• ignoring statistics
• analysts as sql-monkeys
• UI / visualization
Diagnostic
• statistical skills
• testing and organisation
• correlation vs. causality
• requires lots of communication
Predictive
• what to predict: how to quantify the target
• access to historical data
• quantifying and understanding the risk(s)
• prediction accuracy validation for future
Prescriptive
• what to optimize?
• complex software system
• technical feedback loop
• co-op between “human” and “artificial intelligence”
• monitoring
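The predictive issue "prediction accuracy validation for future" can be illustrated by splitting data by time instead of at random. This is a minimal sketch: the records are hypothetical, and a trivial majority-class baseline stands in for a real model.

```python
from datetime import date

def time_based_split(rows, cutoff):
    """Split rows into (train, test): rows before `cutoff` train, the rest validate."""
    train = [r for r in rows if r["date"] < cutoff]
    test = [r for r in rows if r["date"] >= cutoff]
    return train, test

# Hypothetical monthly records: did the customer churn during that month?
rows = [
    {"date": date(2015, 10, 1), "churned": 0},
    {"date": date(2015, 11, 1), "churned": 1},
    {"date": date(2015, 12, 1), "churned": 0},
    {"date": date(2016, 1, 1), "churned": 1},
    {"date": date(2016, 2, 1), "churned": 1},
]
train, test = time_based_split(rows, cutoff=date(2016, 1, 1))

# Trivial baseline "model": predict the majority class seen in training.
train_rate = sum(r["churned"] for r in train) / len(train)
prediction = 1 if train_rate >= 0.5 else 0

# Accuracy is measured only on the later period the model never saw.
accuracy = sum(r["churned"] == prediction for r in test) / len(test)
print(len(train), len(test), accuracy)
```

In this toy data the baseline that looks reasonable on history fails on the later months, which is the kind of risk a random split would hide.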
RISKS / PROBLEMS
Examples…
• Focusing on wrong things
  • not recognising the analytics use cases
  • “data first”: long time from investment to benefits
  • not starting from the beef: actions and decisions
  • thinking only IT solutions and products
  • careful examination and validation of the algorithms, but not setting targets and risks according to the business target
• Organisation
  • silos: communication through hierarchy
  • no access to data, internal politics
  • technical details decided by business people
  • business criteria set by technical people
RISKS / PROBLEMS
…more examples
• Underestimating complexity (time & scope)
  • both software and analytics to be built simultaneously
  • the time and effort needed for “data wrangling”
  • the time used for UIs and visualisations
  • the feedback loop
• Unrealistic expectations (quality)
  • on analytical systems in general (they are not that intelligent); rules needed
  • a product, a model, an algorithm, a data scientist solves all the problems
  • risks and targets cannot always be defined properly right away
  • there is no guarantee on accuracy on a particular case before trying
VIRTUES
Culture that helps to handle risk
Wise: Solve the right problems with analytics!
Determined: aim at specific, concrete things
Curious: be ready to divert, seek evidence
Bayesian: understand uncertainties and risks
Truthful: don’t bend results to match wishes, it’s data science
Courageous: act on evidence
Active and Agile: test, don’t just observe; inspect - adapt - learn
Transparent and Helpful: co-operate from end-to-end, don’t silo
Culture that helps to handle risk: WISE - DETERMINED - CURIOUS
VIRTUES
Aim at the right things
Netflix prize competition (2006-2008)
Who gets the best RMSE (root mean squared error) on true user likings?
BUT
"The goal of our ranking system is to find the best possible ordering of a set of items for a member, within a specific context, in real-time. ... Our business objective is to maximize member satisfaction and month-to-month subscription retention, which correlates well with maximizing consumption of video content. We therefore optimize our algorithms to give the highest scores to titles that a member is most likely to play and enjoy."
"Netflix Prize objective ... is just one of the many components of an effective recommendation system ... We also need to take into account factors such as context, title popularity … Supporting all the different contexts in which we want to make recommendations requires a range of algorithms that are tuned to the needs of those contexts."
- 2012 Xavier Amatriain and Justin Basilico, Personalization Science and Engineering
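For reference, the RMSE criterion named for the Netflix prize can be computed as in the sketch below. The ratings are hypothetical, and the function only illustrates the metric itself; it says nothing about ranking quality or retention, which is the slide's point.

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between two equal-length sequences of ratings."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical star ratings for four titles.
actual = [5, 3, 4, 1]
predicted = [4, 3, 5, 3]
score = rmse(predicted, actual)
print(f"RMSE = {score:.3f}")
```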
VIRTUES
Curiosity
Always aim at something specific … but be open-minded and curious
Example: Röntgen and Fleming (Nobel laureates)
• their most famous findings were “accidental”, but
• they were skilled scientists doing disciplined research for some other aim
Explore occasionally “from data to insights”. But not aimlessly.
If you find something interesting, make a disciplined analysis, preferably a test.
Culture that helps to handle risk: BAYESIAN - TRUTHFUL
VIRTUES
Understanding probabilities
The main ingredients of data science!
Making decisions based on data analysis requires the concepts of risk and probability.
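One way to make "risk and probability" concrete is to compare decisions by their expected value. The sketch below is illustrative only: the churn probabilities, customer value and offer cost are hypothetical numbers, not figures from the lecture.

```python
def expected_value(outcomes):
    """Probability-weighted average of (probability, payoff) pairs."""
    total_probability = sum(p for p, _ in outcomes)
    if abs(total_probability - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(p * payoff for p, payoff in outcomes)

# Hypothetical retention decision for one customer worth 100 if retained.
# Without an offer: 20% churn risk. With an offer costing 5: 10% churn risk.
ev_no_offer = expected_value([(0.8, 100), (0.2, 0)])
ev_offer = expected_value([(0.9, 100 - 5), (0.1, -5)])

best = "offer" if ev_offer > ev_no_offer else "no offer"
print(ev_no_offer, ev_offer, best)
```

The decision still carries risk for any single customer; the expected value only says which choice pays off on average under the assumed probabilities.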
Culture that helps to handle risk: COURAGE
Courage
“Data driven means that progress in an activity is compelled by data rather than by intuition or personal experience. It is often labeled as the business jargon for what scientists call evidence based decision making.”
- Wikipedia 2016-02-24
“I take risks, sometimes patients die. But not taking risks causes more patients to die, so I guess my biggest problem is I've been cursed with the ability to do the math.”
- Fictional character Dr. House in the Fox television series “House”
Culture that helps to handle risk: HELPFUL - TRANSPARENT - AGILE
Agile - Transparent
Doing data-driven work and data science in any organisation model boils down to
“Involve everyone along the information path”
Agile development - Team decides details
Start from
• concrete actions that can be optimized
• decisions they require, and
• how to measure the effects properly
Remember the feedback loop!
Develop constantly
Lecture @AaltoBIZ, Johan Himberg, 2015
Think & plan from deployment to data
Pick an aim!
Business drivers: aim 1, aim 2, aim 3, aim 4, aim 5
Action: optimize, decide, deploy
Data: big, small, open, local, web, meta, …
Information: report, visualize, model
For example
• Automated decisions; recommendation, targeting
• Simulation
• prescriptive, predictive modelling
For example
• documentation on meaning of the data
• KPIs, profiles, segments, factors, DW dashboards
• descriptive, diagnostic, predictive modelling
For example
• source integrations
• Extract - Load - Transform
• Metadata
• modelling for cleansing & consistency
modelling: what are the actions, what are the insights
wrangling: what data means
testing: what is the impact
Data-Driven is inherently iterative and benefits from agility. Data and processes are often not as assumed. Be curious, keep a backlog, inspect, adapt.
Action, Data, Information
Business drivers: aim 1, start from here!, aim 3, aim 4, aim 5
For example
• Business: we need to optimise customer retention
• Marketing: we could start with a special offer by SMS
• Data Scientist: we’ll set up test & control groups!
For example
• Solution expert: Field ZPOR means revenue per unit and it is calculated based on …
• Customer transactions are not in the Data Warehouse, they’re aggregated on a monthly level. Let’s get daily data from system Z
For example
• Now we have transactions for 1M users for 1 yr, fields a, b, c, d, e, …
• …
modelling: what are the actions, what are the insights
wrangling: what data means
testing: what is the impact
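The data scientist's suggestion to set up test and control groups can be sketched as a random split followed by a comparison of retention rates. Everything here is an assumption for illustration: the function names, the 20% control share, the fixed seed and the retention counts are hypothetical.

```python
import random

def assign_groups(customer_ids, control_share=0.2, seed=42):
    """Randomly split customers into (test, control) groups."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(customer_ids)
    rng.shuffle(shuffled)
    n_control = round(len(shuffled) * control_share)
    return shuffled[n_control:], shuffled[:n_control]

def retention_uplift(retained_test, n_test, retained_control, n_control):
    """Difference in retention rate between the test and the control group."""
    return retained_test / n_test - retained_control / n_control

test_group, control_group = assign_groups(range(1000))

# Hypothetical outcome after sending the SMS offer to the test group only.
uplift = retention_uplift(
    retained_test=720, n_test=len(test_group),
    retained_control=170, n_control=len(control_group),
)
print(len(test_group), len(control_group), round(uplift, 3))
```

Random assignment is what lets the difference be attributed to the offer rather than to how the customers were picked.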
Execute based on model, collect data
THE LOOP: results
Action, Data, Information
Business drivers: aim 1, aim 2, aim 3, aim 4, aim 5
For example
• deploy campaign, collect responses
For example
• calibrate & apply model
For example
• get data for modeling
• store results
modelling: what are the actions, what are the insights
wrangling: what data means
testing: what is the impact
Information path focused backlog
Action, Data, Information
Business drivers: aim 1, aim 2, aim 3, aim 4, aim 5
Backlog example
• test & control group handling in marketing automation
• involve N.N. in the process
Backlog example
• define new information source
• look for a new data source for determining income on zip code areas
• correct documentation
• automation of the campaign modelling
Backlog example
• better system configuration & architecture
• automation of the campaign process …
• new data: record information on all campaigns
modelling: what are the actions, what are the insights
wrangling: what data means
testing: what is the impact
Don’t silo
• A change of culture; information (not data) is everybody’s business, just as money is
• One data scientist can’t excel at all of this:
  • PO / Technical Account Manager
  • Business specialist
  • Solution owner / process owner
  • Data Steward
  • Developer
  • Visualization / UX expert
Data Scientists’ special role
• Data scientists’ main tasks are in methods, but also in the processes and machinery of
  • making evidence-based decisions (automated if possible)
  • finding out confidence in the outcome (by active tests if possible)
  • getting insights based on models and data
• Data scientists often act as a “glue”.
Culture that helps to handle risk: TECHNOLOGY
Technology
• Different analytical tasks need different tools. One has to integrate different systems. Remember that you need a feedback loop!
• Prefer systems
  • that give mass-access to historical, transactional data on individual level instead of just aggregates (avoid being “blinded by averages”)
  • from which you’ll get the data, transformations, and results out to another system (avoid being a “data hostage”)
  • where you see what the analytics actually does, at least on a modular level (avoid being a “method hostage”). Prefer being able to see the actual implementation (open source)
• Pick a product when you know the task, your needs, and the product quality.
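Being “blinded by averages” can be shown with a small example: two customer segments with the same mean spend but very different individual-level behaviour. The spend figures are hypothetical and chosen only to make the contrast visible.

```python
from statistics import mean

# Hypothetical monthly spend per customer in two segments with the same average.
segment_a = [50, 50, 50, 50]
segment_b = [0, 0, 0, 200]

def summarize(spend):
    """Aggregate view (mean) next to an individual-level view (share of inactive customers)."""
    inactive_share = sum(1 for s in spend if s == 0) / len(spend)
    return {"mean": mean(spend), "inactive_share": inactive_share}

summary_a = summarize(segment_a)
summary_b = summarize(segment_b)
print(summary_a, summary_b)
```

An aggregate-only system would report the two segments as identical; only individual-level data shows that most of the second segment is inactive.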
References
• Brynjolfsson, Erik and Hitt, Lorin M. and Kim, Heekyung Hellen, Strength in Numbers: How Does Data-Driven Decisionmaking Affect Firm Performance? (April 22, 2011). Available at SSRN: http://ssrn.com/abstract=1819486 or http://dx.doi.org/10.2139/ssrn.1819486
• Netflix case: http://techblog.netflix.com/2012/04/netflix-recommendations-beyond-5-stars.html
• Big Data landscape: http://mattturck.com/2016/02/01/big-data-landscape/#more-917
• Data science skills
  • http://www.oralytics.com/2012/06/data-science-is-multidisciplinary.html
  • http://www.oralytics.com/2013/03/type-i-and-type-ii-data-scientists.html
www.reaktor.com