Early Lessons Learned in Applying Big Data To TV Advertising
IAB ITV for Agencies Day
Dave Morgan, CEO, Simulmedia
About Us
Who We Are: We are a New York based start-up. We are venture backed by Avalon Ventures, Union Square Ventures and Time Warner.
Where We Have Been: Our 35 person team has veterans of:
What We Believe: Television is still the most powerful advertising medium in the world. While addressability will come, we’re not waiting for it. We’ve taken a few strategies we learned from the Internet and are applying them to linear TV advertising, today.
How We Do It: Through partnerships with major data providers, we have assembled the world’s largest set of actionable television data.
How We Make Money: We sell television advertising. With inventory in over 106 million US households, we can cost-effectively extend reach into high-value target audiences across virtually any advertiser category. We use big data and science to do this.
Why Did We Leave The Web?
Television remains the dominant consumer medium
(a) Nielsen US TV Viewing Audience. Traditional Live-Only TV based on average monthly viewing during 1Q2011. Internet and Online Video based on average monthly consumption during July 2011. Video on Demand based on consumption during May 2011.
TV Spend Is Increasing
Source: MAGNAGLOBAL
Audience Is Fragmenting
Source: Nielsen via TVbythenumbers.com
Campaign Reach Is Declining
Source: Simulmedia analysis of data from SQAD, Nielsen and TVB
Impossible for measurement and planning tools to keep pace
Highly Confidential
Big Data
Big Data Is Driving Growth
“We are on the cusp of a tremendous wave of innovation, productivity and growth, as well as new modes of competition and value-capture – all driven by Big Data.” - McKinsey Global Institute, May 2011
“For CMOs, Big Data is a very big deal.” - Alfredo Gangotena, CMO, Mastercard, July 2011
Size Is Relative
1 byte x 1000 = 1 kilobyte…x 1000 = 1 megabyte…x 1000 = 1 gigabyte…x 1000 = 1 terabyte…x 1000 = 1 petabyte…x 1000 = 1 exabyte
Telegram = 100 bytes
Page of an Encyclopedia = 100 kilobytes
Pickup truck bed full of paper = 1 gigabyte
Entire print collection of the Library of Congress = 10 terabytes
All hard drives produced in 1995 = 20 petabytes
All printed material = 200 petabytes
Data © 1997-2011, James S. Huggins http://www.jamesshuggins.com/h/tek1/how_big.htm
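The x 1000 ladder above can be sketched as a small helper that expresses a raw byte count in the largest fitting unit. This is only an illustration: the function name `humanize` is hypothetical, and it uses the decimal (x 1000) steps shown on the slide rather than binary (x 1024) steps.

```python
# Unit names in the order given on the slide (each step is x 1000).
UNITS = ["byte", "kilobyte", "megabyte", "gigabyte",
         "terabyte", "petabyte", "exabyte"]


def humanize(n_bytes: int) -> str:
    """Express a byte count in the largest unit on the x 1000 ladder."""
    value = float(n_bytes)
    unit = 0
    # Climb one rung at a time until the value drops below 1000
    # or we run out of unit names.
    while value >= 1000 and unit < len(UNITS) - 1:
        value /= 1000
        unit += 1
    suffix = "" if value == 1 else "s"
    return f"{value:g} {UNITS[unit]}{suffix}"


if __name__ == "__main__":
    # Sizes quoted in the examples above.
    for size in (100, 100_000, 10**9, 10 * 1000**4, 200 * 1000**5):
        print(humanize(size))
```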
But Big Data Is More Than Size
What happened? Why did it happen? | BIG DATA: What’s going to happen next?
Time: Past | Future
Focus: Reporting | Prediction
Supports: Human decisions | Machine decisions
Data: Structured, Aggregated | Unstructured, Unaggregated
Human Skills: Dashboards, Excel | Discovery, Visualization, Statistics & Physics
Accelerating The Push To Big Data
Hadoop, cloud computing, Facebook, Yahoo, quants, BitTorrent, machine learning, Stanford, Large Hadron Collider, Wal-Mart, text processing, Amazon S3 & EC2, open source intelligence, NoSQL, social media, Google, commodity hardware, Hive, fraud detection, trading desks, MapReduce, natural language processing
What Can It Mean For TV Advertising?
Big data drove the rise of web & search advertising
• Accumulation of high volume of direct measurement of media consumption
• Better predictions about consumer interests
• Real time return path
• Automation
• Interim step for addressability
• More diligence around consumer privacy
• Media buyers and sellers rethinking their approach to audience packaging, campaign planning, technology, data assembly and people
Post Modern Architecture
Have we reached the limits of classic data storage architecture?
Data Warehouses
• Yahoo!: 700 TB [1]
• Australian Bureau of Statistics: 250 TB [1]
• AT&T: 250 TB [1]
• Nielsen: 45 TB [1]
• Adidas: 13 TB [1]
• Wal-Mart: 1 PB [2]
Data Lakes
• Facebook: 30 PB [3] (7x compression)
• Yahoo: 22 PB [4]
• Google: ???
[1] Oracle F1Q10 Earnings Call, September 16, 2009, Transcript
[2] Stair, Principles of Information Systems, 2009, p 181
[3] Dhruba Borthakur, Facebook, December 2010, http://www.facebook.com/note.php?note_id=468211193919
[4] Simulmedia estimate
Our Idea of Big Data
Set Top Boxes
• 17+ million boxes
• Completely anonymous viewing
• Live
• DVR
• VOD
• Pay channels
Program
• 3 different sets of schedule data
• Proprietary metadata
Public
• US census
• Military
• Business
Ad Occurrence
• What ads ran?
• Where did they run?
Client Proprietary
• Business Development Indices (BDI)
• Commercial Development Indices (CDI)
• Regional sales data
Nielsen Ratings
• All Minute Respondent Level Data (AMRLD)
Bringing the data set together in a single platform
Our (comparatively modest) data set:
• 200 TB (approx. 7x compression)
• 113,858,592 daily events
• Approximately 402,301 weekly ads
• Double capacity every 6 months
…And we don’t load every data point across all data sets, yet
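To make "double capacity every 6 months" concrete, here is a minimal sketch of the implied growth curve. It assumes smooth exponential growth from the 200 TB figure quoted above; the function name and the assumption that doubling is continuous rather than stepwise are illustrative, not from the presentation.

```python
def projected_capacity_tb(start_tb: float, months: float,
                          doubling_months: float = 6.0) -> float:
    """Capacity after `months`, if capacity doubles every `doubling_months`."""
    return start_tb * 2 ** (months / doubling_months)


if __name__ == "__main__":
    # Starting point from the slide: 200 TB.
    for months in (0, 6, 12, 18, 24):
        print(months, "months:", projected_capacity_tb(200, months), "TB")
```

Under these assumptions the 200 TB data set would pass one petabyte a little after the 12 month mark (800 TB at 12 months, 1,600 TB at 18).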
Rethinking Media Data Architecture
Commodity Hardware
• No clouds allowed (ISO compliance)
• Expect hardware failure
Open Source Software
• Learn from those who have done it
• Participate in the Open Source community
Write Your Own Software
• ELT (Extract, Load, Transform)
• Meddle
• Machine learning
Science
• Advanced statistical techniques
• Experimentation
Applying big data to television required us to rethink what our technical architecture should be
Some Wrinkles In The Matrix
No standards for set top boxes
Channel mapping
Time synchronization
On/off rules
….
Consult the sages
Build the team
The People We Needed
• New core skills for everyone in the company
• Pattern recognition
• Visualization
• Technology
• Experimentation
• Where do you find hard to find tech skills?
• You don’t find them. You make them.
• A dedicated Science team
• Non traditional researchers (Brain imaging, bioinformatics, economic modeling, genetics)
• People who watch a lot of television
A different approach required different skill sets
10 Lessons We’ve Learned
Some Things To Know, First
• Live viewing unless otherwise noted
• Time shifting lessons is a whole other presentation
• Time shifting + live viewing lessons is a whole other other presentation
• Video on demand is a whole other other other presentation
• We name names and provide numbers where clients and data partners permit
• Client confidentiality is important to us
• None of this work would’ve been possible without the help of our clients and partners
Read me… This box will contain important information about the graphs on each page.
60% of TV Viewers Watch 90% of TV
Networks with relatively more lighter viewer impressions (lower ratio):
OXYGEN 7.4
WE 7.6
PLANET GREEN 7.7
OVATION 7.8
STYLE 7.8
MTV2 7.8
SUNDANCE 7.9
IFC 7.9
Networks with relatively fewer lighter viewer impressions (higher ratio):
TCM 13.6
HALLMARK 13.7
ADSWIM 14.0
NICKNITE 14.3
CNBC 15.7
FOX NEWS 18.0
Axis labels: Lower rated networks, Higher rated networks
Where The Other 40% Are
Vertical: Ratio of Heavy Viewers to light viewer impressions. Horizontal: Low rated to Highly rated networks Call outs: Ratio is the number of Heavier Viewer impressions you would deliver to reach a Lighter Viewer on a given network Sources: Nielsen & Simulmedia’s a7
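The call-out ratio described above (heavier viewer impressions delivered per lighter viewer impression) can be sketched as follows. The network names and impression counts in the usage example are hypothetical placeholders, not figures from the presentation, and the function names are illustrative.

```python
def heavy_per_light(heavy_impressions: float, light_impressions: float) -> float:
    """Heavier-viewer impressions delivered for each lighter-viewer impression."""
    if light_impressions <= 0:
        raise ValueError("need at least one lighter-viewer impression")
    return heavy_impressions / light_impressions


def rank_networks(impressions):
    """Sort networks from most to least efficient at reaching lighter viewers.

    `impressions` maps a network name to a (heavy, light) impression pair.
    A lower ratio means fewer heavy-viewer impressions are "spent" per
    lighter viewer reached.
    """
    ratios = [(name, heavy_per_light(heavy, light))
              for name, (heavy, light) in impressions.items()]
    return sorted(ratios, key=lambda pair: pair[1])


if __name__ == "__main__":
    # Hypothetical counts chosen only to illustrate the calculation.
    sample = {"NET_A": (1800, 100), "NET_B": (740, 100)}
    for name, ratio in rank_networks(sample):
        print(name, ratio)
```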
To capture light viewers, media planning and measurement tools must quickly apply new methods to emerging data sets
Quality Control Is A Full Time Job
When Data Goes Missing
Automation of error checking/quality control is essential
Reuse the data to solve other problems
Occasionally observe missing data
Three choices:
• Pick up the phone
• Estimate missing fields
• Work around the missing data
Source: Simulmedia’s a7
Time series of SYFY network. 10645 observations from 2010.02.28 at 7:00pm Eastern to 2010.10.14 at 12:30pm Eastern
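Automated error checking of a time series like this one could start with a simple gap detector. The sketch below is an assumption about how such a check might look, not Simulmedia's method: the function name is hypothetical, and the 30 minute step in the usage example is an illustrative choice since the presentation does not state the observation interval.

```python
from datetime import datetime, timedelta


def find_gaps(timestamps, step):
    """Return (last_seen, next_seen, missing_count) for each hole in a series.

    `timestamps` are observation times; `step` is the expected spacing
    between consecutive observations (a timedelta).
    """
    ordered = sorted(timestamps)
    gaps = []
    for earlier, later in zip(ordered, ordered[1:]):
        # Number of expected observations that never arrived.
        missing = (later - earlier) // step - 1
        if missing > 0:
            gaps.append((earlier, later, missing))
    return gaps


if __name__ == "__main__":
    half_hour = timedelta(minutes=30)  # assumed interval, for illustration
    observed = [
        datetime(2010, 2, 28, 19, 0),
        datetime(2010, 2, 28, 19, 30),
        datetime(2010, 2, 28, 21, 0),  # 20:00 and 20:30 are missing
    ]
    for gap in find_gaps(observed, half_hour):
        print(gap)
```

Each reported gap can then be routed to one of the three choices above: ask the provider, estimate the missing fields, or work around them.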
More Data Really Is Better
Disambiguation: The Madonna Problem
Pop icon? OR religious icon?
The Revolution of Simple Methods
More data beats better algorithms.
Given an order of magnitude more data, the worst performing algorithm outperforms the best performing algorithm.
Simple algorithms at very large scale can help better predict audience movement.
Peter Norvig | Internet Scale Data Analysis | June 21, 2010
Original graph sourced from: Banko & Brill, 2001. Mitigating the paucity-of-data problem: exploring the effect of training corpus size on classifier performance for natural language processing
Packaging Reach
Peter Norvig | Internet Scale Data Analysis | June 21, 2010
Very large data sets better predict TV audience movements
The Cost Of More Data
• All data online. All the time.
• Less expensive hardware
• Extremely flexible
• All data online. All the time.
• More expensive talent
• Physicists & statisticians ain’t cheap
• Hard to find programmers
• Not everything meets your needs
• Evolving technologies in mission critical functions
More data drives better results but there are costs
The Data Isn’t Biased Just Because It Comes From A Set Top Box
Applying Simple Methods At Scale
Sources: Nielsen & Simulmedia’s a7
Regression analysis of Nielsen Household Cume Rating against Simulmedia’s a7 cume rating. 20 Primetime Network shows with HAWAII FIVE-0. Fall 2010.
High correlation of a7 measures and Nielsen estimates.
Either bias is insignificant or Nielsen data and our data share the same bias.
Multiple methods yield similar results
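For readers who want to reproduce this kind of comparison, the r² of a simple linear regression between two rating series equals the squared Pearson correlation. The sketch below computes it from scratch; the function name and the sample numbers are hypothetical and are not the Nielsen or a7 figures behind the chart.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length series.

    For a simple (one predictor) linear regression this is the same
    quantity as the regression's coefficient of determination.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length (at least 2 points)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("a constant series has no defined correlation")
    return (sxy * sxy) / (sxx * syy)


if __name__ == "__main__":
    # Hypothetical cume ratings for a handful of shows.
    panel_ratings = [1.0, 2.0, 3.0, 4.0]
    set_top_ratings = [3.0, 5.0, 7.0, 9.0]
    print(r_squared(panel_ratings, set_top_ratings))
```

A high r² shows that the two measures move together; as the slide notes, it cannot by itself distinguish "no bias" from "the same bias in both".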
And Then We Kept Going
Two samples
1. Sample 1: Fall 2010: 20 Primetime broadcast series launches + promos
2. Sample 2: Jan 2011: 15 Primetime cable series premieres + promos (Plus one multi-season/year primetime broadcast premiere + promos)
• Hand selected programs
• Mix of genres
• Mix of new vs. returning shows
How we sliced it
• Entire a7 data set
• Cross correlated individual data sets contained in a7 aggregate data set
• Aggregate cross geographies (DMA to DMA)
Observations
• Sample 1 average r² > 0.85
• Sample 2 average r² > 0.93
We measured program Tune-In, Spot Tune-In, Campaign Reach, Campaign Rating using multiple slices of our data set using two different sample sets and time frames
Addressability Is Here
Closing The Loop On Program Promotion
Sources: Simulmedia’s a7
Spring 2010 broadcast premiere promotion. Horizontal: Left to right moves back in time. 0 is the premiere time. Vertical: Conversion rate is measured in percent. Size of the bubble represents total conversions for a given spot.
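The presentation does not define how conversion rate is computed. One plausible reading, assumed here purely for illustration, is the percentage of anonymous set top boxes exposed to a given promo spot that later tuned in to the premiere. The function name and the box identifiers below are hypothetical.

```python
def conversion_rate(exposed_ids, tuned_in_ids) -> float:
    """Percent of boxes exposed to a promo spot that tuned in to the premiere."""
    exposed = set(exposed_ids)
    if not exposed:
        return 0.0
    # Only boxes that saw the spot AND watched the premiere count as conversions.
    converted = exposed & set(tuned_in_ids)
    return 100.0 * len(converted) / len(exposed)


if __name__ == "__main__":
    # Hypothetical anonymous box identifiers.
    saw_promo = ["box1", "box2", "box3", "box4"]
    watched_premiere = ["box2", "box4", "box9"]
    print(conversion_rate(saw_promo, watched_premiere))
```

Computed per spot, this would give the vertical position of each bubble, with the count of converted boxes giving its size.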
Long held beliefs and rules of thumb in planning may or may not be supported by data
TV marketers now have more options for show promotion
Closing The Loop
Nielsen’s Ratings Are Good (Surprisingly Good)
Time Series: Broadcast: CBS
Sources: Nielsen & Simulmedia’s a7
Hour by hour time series Mar 20 to April 8, 2011. Z score plots with Nielsen estimates in red. Simulmedia measurements in blue. Where Nielsen provided no estimate, estimates were imputed using Multiple Imputation (Rubin (1987))
60 networks. High correlation between Nielsen large sample measurement and a7 measures
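Z scores put two series measured on different scales (panel estimates and set top box counts) onto a common axis so their shapes can be compared. A minimal sketch of that standardization follows; it uses the population standard deviation as an assumption, the sample values are hypothetical, and it does not attempt the multiple imputation step mentioned in the caption.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardize a series to mean 0 and (population) standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("a constant series cannot be standardized")
    return [(v - mu) / sigma for v in values]


if __name__ == "__main__":
    # Hypothetical hourly audience figures on two very different scales.
    panel_estimate = [1.2, 1.5, 2.1, 1.8]
    box_count = [120_000, 151_000, 208_000, 183_000]
    print(z_scores(panel_estimate))
    print(z_scores(box_count))
```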
Time Series: Broadcast: Fox
Time Series: Broadcast: ABC
Time Series: Cable: Investigation Discovery
Time Series: Cable: Golf
Time Series: Cable: Bravo
Time Series: Cable: ESPN2
Time Series: Cable: Speed
Sources and chart notes for each: same as for the CBS time series.
…but…
When You Look Closer
Sources: Nielsen & Simulmedia’s a7
Hour by hour time series Mar 20 to April 8, 2011. Z score plots with Nielsen estimates in red. Simulmedia measurements in blue. Where Nielsen provided no estimate, estimates were imputed using Multiple Imputation (Rubin (1987))
High Frequency Time Series: ABC Family
Sources: Nielsen & Simulmedia’s a7
Sample graph from High Frequency (second and minute level) Time Series Analysis of 45 networks on January 19th 2011. Panels: Nielsen sample (minute by minute) and Simulmedia a7 sample (second by second to minute).
Volatility in dayparts, low rated networks, demographics…. Unrated networks “don’t exist.” Did NOT look at local.
Women Are More Different Than Men
Gender Driven Geographic Variation
Viewing by zip code across markets is more varied among women than among men in the same zip codes
Panels: Women 18-54, Men 18-54
Fraction of view time for ages 18-54 as a fraction of view time for all TV viewers: week 2 vs. the same fraction for week 1 (last two weeks in January). Three markets: Philadelphia (blue), Atlanta (red) and Chicago (green). Each point represents a zip code in one of these markets. Source: Simulmedia’s a7
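One way to quantify "more varied" in a chart like this is to compute each zip code's demographic share of view time for both weeks and then measure how widely the week-to-week changes are spread. This is an illustrative assumption about a possible metric, not the analysis behind the slide; the function names, zip codes and values are hypothetical.

```python
from statistics import pstdev


def demo_share(demo_view_seconds: float, all_view_seconds: float) -> float:
    """View time of one demographic as a fraction of all view time in a zip code."""
    if all_view_seconds <= 0:
        raise ValueError("zip code has no recorded view time")
    return demo_view_seconds / all_view_seconds


def week_over_week_spread(week1_share, week2_share) -> float:
    """Standard deviation of the week 1 to week 2 change in share, across zip codes.

    Both arguments map zip code -> share; only zip codes present in both
    weeks are compared. A larger value means less stable viewing.
    """
    common = sorted(set(week1_share) & set(week2_share))
    if not common:
        raise ValueError("no zip codes in common between the two weeks")
    changes = [week2_share[z] - week1_share[z] for z in common]
    return pstdev(changes)


if __name__ == "__main__":
    # Hypothetical shares for two zip codes.
    week1 = {"19103": demo_share(1800, 3600), "30303": demo_share(1440, 3600)}
    week2 = {"19103": demo_share(2160, 3600), "30303": demo_share(1080, 3600)}
    print(week_over_week_spread(week1, week2))
```

Running the same calculation separately for the women 18-54 and men 18-54 shares would allow the two spreads to be compared directly.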
Gender Driven Geographic Variation
Planning tactics for female targeted campaigns should differ from those for male targeted campaigns
PS…Also a good case for geo based creative versioning
Privacy Matters
Privacy By Design
• All marketing data companies need to care
• Make consumer privacy protection part of the business from the beginning
• Anonymous, aggregated data only
• No personal data or data that can be related to particular individuals or devices
• Broad marketing segmentations, not profiling
• No sensitive data
Don’t be creepy
Mass Reach Is Indiscriminate
Fragmentation Effects On Frequency
Source: Nielsen & Simulmedia’s a7
Each segment was above 70% reach but the frequency distribution was nearly identical
Percent of audience reached for major animated motion picture campaign 2011. Two weeks prior to release. Each stacked bar is a different audience segment. Each color within the stacked bar represents the frequency of ad views for each segment.
Fragmentation Effects On Frequency
Source: Nielsen & Simulmedia’s a7
Fragmentation is affecting all high reach campaigns.
Percent of audience reached for insurance advertisers September to October 2010. Approximately 8000 ads. Each stacked bar is a different audience segment. Each color within the stacked bar represents the frequency of ad views for each segment.
The TV advertising market can’t continue to support this
Fragmentation Effects On Frequency
40% Of The Audience Is Getting 85% Of The Impressions
Fragmentation Rears Its Head Again
Source: Nielsen & Simulmedia’s a7
Total US Television Audience
Average Frequency Per Quintile: 0.0, 1.4, 4.3, 9.1, 24.8
% of Total Impressions Per Quintile: 0.0%, 3.6%, 10.8%, 23.0%, 62.6%
Campaign impressions increasingly concentrated against heavy viewers.
Percent of audience reached for a different major animated motion picture campaign 2011. Two weeks prior to release. The stacked bar represents quintiles. Blue labels are average frequency per respective quintile. Red labels are % of total campaign impressions by respective quintile.
Fragmentation Effects on Frequency
Advertisers won’t continue to support this
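The "40% of the audience gets 85% of the impressions" headline can be checked against the quintile figures above. Because quintiles are equal-sized groups, each quintile's share of impressions is proportional to its average frequency. The sketch below assumes the five frequencies are listed from lightest to heaviest quintile; the function names are illustrative.

```python
def impression_shares(avg_frequency):
    """Percent of total impressions per group, for equal-sized groups."""
    total = sum(avg_frequency)
    if total <= 0:
        raise ValueError("no impressions delivered")
    return [100.0 * f / total for f in avg_frequency]


def top_share(avg_frequency, top_n: int) -> float:
    """Combined impression share of the `top_n` heaviest groups (listed last)."""
    return sum(impression_shares(avg_frequency)[-top_n:])


if __name__ == "__main__":
    # Average frequency per quintile, as quoted above (lightest to heaviest).
    frequencies = [0.0, 1.4, 4.3, 9.1, 24.8]
    print([round(s, 1) for s in impression_shares(frequencies)])
    print(round(top_share(frequencies, 2), 1))
```

With these inputs the top two quintiles (40% of the audience) account for roughly 85.6% of impressions, in line with the headline. The per-quintile shares derived this way differ from the quoted percentages by about a tenth of a point in places, presumably because the quoted frequencies are rounded.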
What Happens Next?
Choices
• If fragmentation is causing declining campaign reach and frequency imbalances, marketers must make choices.
• Reduce reach
• Do nothing
• Use other channels
• Stabilize or improve reach
• Re-aggregate audiences using big data
What do you think?
About Our Science Team
Krishna Balasubramanian, Chief Scientist
• Previously: Chief Scientist, Tacoda. Chief Scientist, Real Media.
• Doctoral Candidate, Physics. (Condensed Matter Physics) The Ohio State University
• MS, Computer & Information Systems. The Ohio State University
• MSc, Physics. Indian Institute of Technology, Kanpur
Yuliya Torosjan, Scientist
• Previously: Clinical Research (Brain Imaging), Mount Sinai College of Medicine
• MA, Statistics. Columbia University
• BSE, Computer Science & Engineering. University of Pennsylvania
• BA, Psychology. University of Pennsylvania
Mario Morales, Scientist
• Previously: Lecturer, Bioinformatics, New York University. Senior Consultant, Weiser LLP.
• MS, Statistics. Hunter College
• MS, Bioinformatics. New York University
Dr. Sidd Mukherjee, Scientist
• Previously: Visiting Scholar (Atomic Scattering experiments), The Ohio State University
• Post doctoral research, Heat capacity of Helium-4. Pennsylvania State University
• PhD, Physics. (Thesis: Measurements of Diffuse and Specular Scattering of 4He Atoms from 4He Films), Ohio State University
• MS, Computer & Information Systems. The Ohio State University
• BSc, Physics & Mathematics. University of Bombay