citi bike - texas a&monline.stat.tamu.edu/dist/analytics/capstone/tl2.pdf · 2017-06-08 ·...
Citi Bike Modeling the Relationship between
Earned Media Activity and Service Engagement
Allyson Hugley
TAMU Analytics 2017
March 2017
Table of Contents
Executive Summary
Data & Data Sources
Model Development
Conclusion & Impact
WIP – 2/20/2017
Executive Summary
Executive Summary: PR Industry Challenge
INDUSTRY SITUATION
The PR industry is under increasing scrutiny to use
more sophisticated performance analytics.
Use of modeling techniques is hindered by lack of
access to business outcome data (e.g., sales data).
BUSINESS QUESTION
With access to business outcome data, can models
be developed to quantify the contribution of PR
activities to business outcomes?
Executive Summary: Citi Bike Project
PROJECT OVERVIEW
Test the potential for developing models to evaluate
the impact of earned media
Citi Bike was identified as suitable for model
development activities
• Outcome data availability
• News coverage data availability
AGENCY BUSINESS VALUE
This project was designed to advance thinking around
media performance model development
Executive Summary: Project Focus
BRAND SITUATION
Citi Bike is a privately owned public bicycle sharing system that
serves parts of New York City.
It is the largest bike sharing program in the United States.
It is sponsored by Citigroup and designed to carry the Citibank logo.
It is estimated that in the first year of operations the bank netted
$4.4 million worth of earned media.
However, no relationship between earned media (i.e., news
coverage) and use of Citi Bike services has been established.
BUSINESS QUESTIONS
What substantive role, if any, does earned media play in driving
subscriptions to and use of Citi Bike services in New York?
Which modeling techniques are most appropriate for quantifying
and forecasting earned media impact?
Executive Summary: Key Findings
Oct 2014 – Sept 2016
22,440,823 TRIPS
75,713 ANNUAL SUBSCRIPTIONS
4,399 ONLINE NEWS ARTICLES
1,458,800,934 IMPRESSIONS
Earned media output variables (impressions) were found to have a relationship to service
usage when both time series and regression modeling techniques were employed.
Data & Data Sources
Citi Bike MECE Tree
VARIABLES
  OUTCOME VARIABLES
    PRIMARY
      DAILY TRIPS (USE)
      ANNUAL SUBSCRIPTIONS
  PREDICTOR VARIABLES
    ONLINE NEWS
      TOTAL DAILY ARTICLES (#)
      TOTAL DAILY IMPRESSIONS (#)
      DAILY ARTICLES BY SENTIMENT: POSITIVE, NEUTRAL, NEGATIVE
      DAILY IMPRESSIONS BY SENTIMENT: POSITIVE, NEUTRAL, NEGATIVE
    WEATHER CONDITIONS
      PRECIPITATION (IN)
      SNOWFALL (IN)
      SNOWDEPTH
      MAX TEMPERATURE
      MIN TEMPERATURE
      AVG TEMPERATURE
Data Sources and Aggregation Process
DATA SOURCES
• Citi Bike Transaction Data (business outcomes): https://www.citibikenyc.com/system-data
• Sysomos Online News Data (earned media coverage): https://sysomos.com/
• SimilarWeb News Source Site Traffic Data (impressions): https://www.similarweb.com/
• National Oceanic and Atmospheric Administration - NOAA (weather): http://www.noaa.gov/
DATA AGGREGATION
Citi Bike daily ridership and membership data are released quarterly.
Files for Oct 2014 – Sept 2016 were downloaded and integrated into a
single data set.
Earned media articles for Oct 2014 – Sept 2016 (automatically scored
for sentiment) were obtained from Sysomos media monitoring service.
Each article was manually appended with impressions data from
SimilarWeb, aggregated by date and appended to the Citi Bike
ridership/membership file.
Weather data (e.g., precipitation, temperature) from NOAA was
integrated into the ridership/membership data file based on date fields.
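The aggregation steps above can be sketched as a date-keyed join. This is a minimal illustration in plain Python, not the workflow actually used for the project: the field names, the toy records, and the choice to assign zero articles and zero impressions to days without coverage are all assumptions.

```python
from collections import defaultdict

def aggregate_articles_by_date(articles):
    """Count articles and sum impressions per date.

    `articles` is an iterable of (date, impressions) pairs, one per article.
    """
    daily = defaultdict(lambda: {"articles": 0, "impressions": 0})
    for date, impressions in articles:
        daily[date]["articles"] += 1
        daily[date]["impressions"] += impressions
    return dict(daily)

def merge_on_date(ridership, daily_media, weather):
    """Left-join media and weather onto the ridership rows, keyed by date."""
    merged = []
    for row in ridership:
        date = row["date"]
        # Assumption: a day with no articles contributes zero impressions.
        media = daily_media.get(date, {"articles": 0, "impressions": 0})
        merged.append({**row, **media, **weather.get(date, {})})
    return merged

# Hypothetical toy records, not values from the study.
articles = [("2014-10-01", 1200), ("2014-10-01", 800), ("2014-10-02", 500)]
ridership = [
    {"date": "2014-10-01", "trips": 30000},
    {"date": "2014-10-02", "trips": 28000},
    {"date": "2014-10-03", "trips": 31000},
]
weather = {
    "2014-10-01": {"prcp": 0.0, "tmax": 70},
    "2014-10-02": {"prcp": 0.4, "tmax": 65},
    "2014-10-03": {"prcp": 0.0, "tmax": 68},
}

merged = merge_on_date(ridership, aggregate_articles_by_date(articles), weather)
```

Keeping the ridership file as the left side of the join preserves one row per day even when no news coverage exists for that date.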
Data Review and Cleaning
EXPLORATORY DATA ANALYSIS
SAS Enterprise Miner was used to perform exploratory data analysis to check for missing values and
data consistency issues.
• No missing values were identified for variables critical to the modeling work
• All values fell within acceptable ranges
• No unusual data points were identified
Data Collection Modifications
EXPAND DATA INPUTS
Variables that could be explored in future media impact research include:
• Earned media quality/engagement – inclusion of multi-media, news source tier, page views
• Paid media/advertising impressions, spend, format (e.g., video, display ad)
• Bike availability – number of bikes available for use
• Transportation option data – buses, taxis, subways in use/available daily
• Discounts and promotions
Model Development
Modeling Techniques Employed
MULTIVARIATE ANALYSIS (JMP)
Principal Component Analysis: Understand underlying structures in the data set
BASIC FIT ANALYSIS (JMP)
Bivariate Fit Model: Understand the relationship between earned media impressions and Citi Bike usage
TIME SERIES MODELING (JMP/SAS)
Seasonal ARIMA: Understand factors influencing service use, including media and weather
REGRESSION MODEL WITH AUTOCORRELATED ERRORS (JMP/SAS)
Regression with ARMA errors (AR(1)): Understand the strength of predictor variables, including media outputs (impressions), on service engagement (use)
Multivariate Analysis - Principal Component Analysis
PCA ANALYSIS
This analytic technique was used to identify initial structures in the data:
• Weather – temperature events (Prin1 and Prin2)
• Negative media outcomes contributed to the structure (Prin3)
• Precipitation events – rain and snow (Prin4)
Data Transformation and Basic Fit Model
FIT LOG USAGE BY LOG
IMPRESSIONS
Data for daily trip volume and daily
impressions were log transformed and a
simple bi-variate fit analysis was
executed to determine the potential
relationship between these variables.
Outcomes suggest that every 10%
increase in impressions is associated
with a 1.3% increase in Citi Bike service
usage.
Fit statistics also suggested that earned
media impressions alone accounted for a
small amount of change in service use.
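The interpretation of a log-log fit can be illustrated with a short sketch. In such a fit the slope acts as an elasticity, so a p% change in impressions is associated with a change of ((1 + p/100)^slope − 1) × 100 percent in usage. The slope of 0.135 below is an assumed value chosen to be consistent with the stated "10% → 1.3%" result; it is not a figure reported in the deck, and the data are synthetic.

```python
import math

def fit_log_log(x, y):
    """Ordinary least squares of log(y) on log(x); returns (intercept, slope)."""
    lx = [math.log(v) for v in x]
    ly = [math.log(v) for v in y]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    sxy = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    sxx = sum((a - mx) ** 2 for a in lx)
    slope = sxy / sxx
    return my - slope * mx, slope

def pct_change_in_y(slope, pct_change_in_x):
    """Percent change in y associated with a percent change in x."""
    return ((1 + pct_change_in_x / 100) ** slope - 1) * 100

# Synthetic data generated from y = 50 * x**0.135 (assumed, for illustration).
impressions = [1e5, 5e5, 1e6, 5e6, 1e7]
trips = [50 * v ** 0.135 for v in impressions]

intercept, slope = fit_log_log(impressions, trips)
effect_of_10_pct = pct_change_in_y(slope, 10)
```

Because the synthetic data follow the assumed power law exactly, the fitted slope recovers 0.135 and the implied effect of a 10% increase rounds to 1.3%.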
Time Series Modeling
OUTCOME SELECTION - DAILY USAGE
The time series modeling was limited to analysis and forecast modeling for daily service use (trips in
past 24 hours). A valid time series model based on subscription data was not achieved; the data
were not stationary.
The daily trip data required differencing to account for trends and seasonality as a first step.
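As a rough illustration of the differencing step, the sketch below applies one regular difference (lag 1) and one seasonal difference (lag 7), which corresponds to d = 1 and D = 1 with a weekly period in the seasonal ARIMA models that follow. The series is hypothetical: a linear trend plus a fixed weekly pattern, chosen so that the effect of each difference is easy to see.

```python
def difference(series, lag=1):
    """Return series[t] - series[t - lag] for every t where both exist."""
    return [series[i] - series[i - lag] for i in range(lag, len(series))]

# Hypothetical daily series: linear trend plus a repeating weekly pattern.
weekly_pattern = [0, 5, 3, 4, 6, -8, -10]
series = [100 + 2 * t + weekly_pattern[t % 7] for t in range(28)]

detrended = difference(series, lag=1)      # removes the linear trend
stationary = difference(detrended, lag=7)  # removes the weekly pattern
```

For this idealized series the twice-differenced values are all zero; real trip data would leave a noisy remainder to be modeled by the AR and MA terms.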
TIME SERIES MODELING
MODEL SELECTION
Three valid models (stationary, invertible, parsimonious) were analyzed further, accounting for
outliers and level shifts using SAS.
• Seasonal ARIMA (1,1,1)(0,1,1)7
• Seasonal ARIMA (1,1,2)(0,1,1)7
• Seasonal ARIMA (1,1,2)(0,1,2)7 (best model)
MODEL VALIDATION - SEASONAL ARIMA (1,1,2)(0,1,2)7 IN JMP
Daily use, model 3: Seasonal ARIMA (1,1,2)(0,1,2)7
Valid – stable/invertible, parsimonious
SEASONAL ARIMA (1,1,2)(0,1,2)7 - MODEL VALIDATION IN SAS
JMP initial SBC: 14740.817
SAS: 59 outliers/level shifts identified (8% of observations)
SBC: 14364.05 (improved vs. SAS Model 1 (14374.67) and SAS Model 2 (14372.31), accounting for outliers and level shifts)
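The model comparison above rests on the Schwarz Bayesian Criterion, where a lower value indicates a better trade-off between fit and complexity. The sketch below states the standard definition, SBC = −2·ln(L) + k·ln(n), and picks the minimum among the three SAS values quoted on the slide. The labels "Model 1" through "Model 3" follow the slide's wording; the helper function and its arguments are illustrative, since the deck does not report the log-likelihoods.

```python
import math

def sbc(log_likelihood, n_params, n_obs):
    """Schwarz Bayesian Criterion: -2 * log-likelihood + k * ln(n)."""
    return -2.0 * log_likelihood + n_params * math.log(n_obs)

# SBC values as reported for the SAS fits with outliers and level shifts.
reported_sbc = {
    "Model 1": 14374.67,
    "Model 2": 14372.31,
    "Model 3": 14364.05,  # Seasonal ARIMA (1,1,2)(0,1,2)7
}

best_model = min(reported_sbc, key=reported_sbc.get)
```

The differences between the three values are small, which is consistent with the slide describing all three models as valid candidates.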
SEASONAL ARIMA (1,1,2)(0,1,2)7 - FURTHER MODEL DEVELOPMENT AND VALIDATION IN SAS
Weather variables added (3): PRCP, SNWD, TEMP_MAX
JMP initial SBC: 14740.817
SAS: 74 outliers/level shifts identified (10.2% of observations, with seven rows withheld for forecasting)
SBC: 13920.45 (improved over the initial model excluding weather data, SBC: 14364.05; white noise also
significantly lowered)
**Media variables were tested, but ultimately excluded from the model due to insignificance**
SEASONAL ARIMA (1,1,2)(0,1,2)7 -- ANALYSIS OF OUTLIERS AND LEVEL SHIFTS (74)
Outliers and level shifts tended to be associated with weather events:
Heavy rain, >1” per day
Snow and snow events (e.g., 2016 blizzard)
Cascading weather events, declining temperatures, rising temperatures, rain events that span several days
Holiday events were also consistently associated with outliers and level shifts
• Holidays and holiday periods were associated with low level outliers and level-shifts – Christmas, New Year’s,
Thanksgiving, Good Friday
SEASONAL ARIMA (1,1,2)(0,1,2)7 -- ANALYSIS OF ESTIMATES
For each unit increase in precipitation (inches), usage of Citi Bike fell by approximately 5,169 users
For each unit increase in snow depth (inches), usage of Citi Bike fell by about 468 users
Each unit increase in daily max temperature resulted in 218 additional users
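To make the estimates concrete, the sketch below combines the three reported coefficients into an implied change in daily usage for a given change in weather. It treats the effects as additive and linear, which is a simplification of how regressors enter a differenced seasonal ARIMA model; the scenario values and the key names are hypothetical, and the temperature unit is assumed to be degrees Fahrenheit.

```python
# Reported estimates: change in daily usage per one-unit increase.
EFFECT_PER_UNIT = {
    "prcp_in": -5169,       # per inch of precipitation
    "snow_depth_in": -468,  # per inch of snow depth
    "tmax": 218,            # per degree of daily max temperature
}

def implied_usage_change(weather_delta):
    """Sum of coefficient * change for each weather variable supplied."""
    return sum(EFFECT_PER_UNIT[name] * delta
               for name, delta in weather_delta.items())

# Hypothetical scenario: half an inch more rain and a 10-degree cooler high.
change = implied_usage_change({"prcp_in": 0.5, "snow_depth_in": 0, "tmax": -10})
```

Under these assumptions the scenario implies roughly 4,765 fewer daily uses, with the rain and the temperature drop contributing similar amounts.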
SEASONAL ARIMA (1,1,2)(0,1,2)7 -- FORECAST ANALYSIS
The forecast and actual service use values were relatively close, with the
forecast usage range being just under 15,000 (14,690).
Forecast estimates generally fell within range, with forecast usage for the seven holdout cases
averaging about 6,469 above actual levels.
REGRESSION MODEL WITH AUTOCORRELATION ERRORS
AGGREGATE AND LOG TRANSFORM VARIABLES TO ACHIEVE STABILITY
Aggregated daily media and usage data into weekly intervals
Log transformed the summed data to achieve stability
Ran time series for each to confirm stability (no consistent increases in values over time)
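The aggregation and transformation steps above might look like the following sketch. Grouping by ISO calendar week is an assumption (the deck does not say how weeks were defined), and the daily values are hypothetical.

```python
import math
from collections import defaultdict
from datetime import date, timedelta

def weekly_log_totals(daily):
    """Sum daily values by ISO (year, week), then take the natural log.

    `daily` maps datetime.date -> positive numeric value.
    """
    totals = defaultdict(float)
    for day, value in daily.items():
        iso = day.isocalendar()  # (ISO year, ISO week, ISO weekday)
        totals[(iso[0], iso[1])] += value
    return {week: math.log(total) for week, total in totals.items()}

# Hypothetical data: two full weeks starting on Monday 2014-10-06,
# with 1,000 trips per day in the first week and 2,000 in the second.
start = date(2014, 10, 6)
daily_trips = {start + timedelta(days=i): 1000 * (1 + i // 7) for i in range(14)}

weekly_log_trips = weekly_log_totals(daily_trips)
```

The same function can be applied to daily impressions so that both series are on a weekly, log-transformed scale before lags are examined.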
USE CROSS CORRELATION FUNCTION TO DETERMINE SIGNIFICANT LAGS
Lag 6 was identified as the most significant lag for use in the model to represent earned media
outputs (impressions)
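Lag selection with a cross-correlation function can be sketched as follows. This simplified version computes a Pearson correlation between the media series shifted back by k weeks and the usage series, then picks the lag with the largest absolute value. It is not the SAS or JMP procedure, which may normalize differently and typically prewhitens the series first. The data are synthetic and built so that usage copies impressions six weeks later.

```python
import math
import random

def cross_correlation(x, y, lag):
    """Pearson correlation between x[t - lag] and y[t], for lag >= 0."""
    xs = x[:len(x) - lag]
    ys = y[lag:]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sx = math.sqrt(sum((a - mx) ** 2 for a in xs))
    sy = math.sqrt(sum((b - my) ** 2 for b in ys))
    return cov / (sx * sy)

def best_lag(x, y, max_lag):
    """Lag in 0..max_lag with the largest absolute cross-correlation."""
    return max(range(max_lag + 1),
               key=lambda k: abs(cross_correlation(x, y, k)))

# Synthetic weekly series: y repeats x with a six-week delay (assumed).
rng = random.Random(0)
log_impressions = [rng.random() for _ in range(60)]
log_trips = [0.0] * 6 + log_impressions[:-6]

selected_lag = best_lag(log_impressions, log_trips, max_lag=10)
```

With real data the peak is rarely this clean, so the correlations at neighboring lags are worth inspecting before settling on a single lag.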
DEVELOP A REGRESSION MODEL AND IDENTIFY TIME SERIES MODEL FOR ERRORS
A valid model was developed in JMP using variables for weather, earned media (including lags), and
time of year (e.g., First Week of the Year).
A time series model was then fit to the regression residuals to identify the time series model for the
errors (AR(1)).
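One simple way to gauge an AR(1) structure in regression residuals is the lag-1 autocorrelation, which is the Yule-Walker estimate of the AR(1) coefficient. The sketch below is illustrative only: the residuals are simulated with an assumed coefficient of 0.6 rather than taken from the project, and JMP and SAS use their own estimation methods.

```python
import random

def ar1_coefficient(residuals):
    """Lag-1 autocorrelation of the residuals (Yule-Walker AR(1) estimate)."""
    n = len(residuals)
    mean = sum(residuals) / n
    centered = [r - mean for r in residuals]
    numerator = sum(centered[t] * centered[t - 1] for t in range(1, n))
    denominator = sum(c * c for c in centered)
    return numerator / denominator

# Simulated residuals following e[t] = 0.6 * e[t-1] + noise (assumed values).
rng = random.Random(1)
phi_true = 0.6
simulated = [0.0]
for _ in range(1999):
    simulated.append(phi_true * simulated[-1] + rng.gauss(0.0, 1.0))

phi_hat = ar1_coefficient(simulated)
```

An estimate well away from zero suggests that ordinary regression standard errors would be unreliable, which is the motivation for refitting with autocorrelated errors.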
REFIT THE REGRESSION MODEL IN SAS, WITH ARMA ERRORS AND ACCOUNTING
FOR OUTLIERS AND LEVEL SHIFTS
AR(1), IDENTIFY VAR = Residual_Log_Sum_Trips_Past_24_
CROSSCORR= (MEAN_SNWD_ MEAN_TEMP_MAX_ XLAG6 YEAR AO14 AO13 AO68 AO15
AO72 LS53 AO67 AO66)
MODEL ESTIMATES
AR(1), IDENTIFY VAR = Residual_Log_Sum_Trips_Past_24_
CROSSCORR= (MEAN_SNWD_ MEAN_TEMP_MAX_ XLAG6 YEAR AO14 AO13 AO68 AO15
AO72 LS53 AO67 AO66)
ANALYSIS OF OUTLIERS, LEVEL SHIFTS, PREDICTORS
Holiday periods (Christmas, New Year’s) and weather events (blizzard of 2016) were associated with
significant declines in service use.
Expansion to the outer boroughs (Bedford-Stuyvesant, Brooklyn) and Jersey City was associated
with a significant level shift (LS53).
For every 1% increase in snow depth (inches) or temperature (degrees Fahrenheit), the volume of service is
predicted to decrease by 5.22% and 0.82%, respectively.
Every 1% increase in earned media impressions (lagged six weeks) is predicted to increase service use by 2.4%.
OUTLIERS AND LEVEL SHIFTS
Variable  Wk/Yr    Time Period/Event             Type              %Change
AO13      51/2014  Holiday Season                Additive Outlier  -55.34%
AO14      52/2014  Holiday Season                Additive Outlier  -73.07%
AO15      1/2015   Post New Year                 Additive Outlier  -131.69%
LS53      39/2015  Bedford-Stuyvesant Expansion  Level Shift       35.31%
AO66      52/2015  Holiday Season                Additive Outlier  -46.38%
AO67      53/2015  Holiday Season                Additive Outlier  -57.58%
AO68      1/2016   Holiday Season                Additive Outlier  -163.27%
AO72      5/2016   Jan 2016 Blizzard             Additive Outlier  -45.63%

PREDICTOR VARIABLES
Variable             Description                       %Change
MEAN_SNWD            Average Snow Depth                -5.22%
MEAN_TEMP_MAX        Average Max Temperature           -0.82%
xLAG6 (IMPRESSIONS)  Impressions Exposure; 6 week lag  2.40%
YEAR                 Calendar Year (2014, 2015, 2016)  0.02%
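Two of the %Change values for the additive outliers are below -100%, which cannot be read as a literal percentage decline in trips. One possible reading, offered here as an assumption and not confirmed by the source, is that the column reports 100 × the coefficient on the log scale. Under that reading, the implied percentage change in the original units is 100 × (exp(b) − 1), as sketched below for a few of the listed values.

```python
import math

def log_coef_to_pct(coef):
    """Percent change in the outcome implied by a coefficient on log(outcome)."""
    return (math.exp(coef) - 1) * 100

# %Change values from the table, interpreted (by assumption) as
# 100 * log-scale coefficient.
reported_pct = {"AO15": -131.69, "AO68": -163.27, "LS53": 35.31}

implied_pct = {name: log_coef_to_pct(value / 100)
               for name, value in reported_pct.items()}
```

Under this interpretation the post-New Year outliers correspond to declines of roughly 73% and 80%, and the expansion level shift to an increase of roughly 42%.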
Conclusions & Impact
CONCLUSIONS
PROPOSED BUSINESS SOLUTIONS
More sophisticated modeling techniques can be
applied to media relations and client outcome data.
However, earned media activities (news coverage
impressions) are likely to have weaker
associations with business outcomes than
environmental factors (e.g., weather, economic
conditions).
NEXT ACTIONS
Leverage this analysis to advocate for education and
the acquisition of additional data sets within Weber
Shandwick to better account for environmental factors
when developing models.
Identify opportunities for application of modeling
techniques to advance client work.
PROJECT IMPACT
MAJOR CHALLENGES
Time required to manually code and aggregate the
media data at daily and weekly levels.
KEY INSIGHTS/LEARNING
Implementation of modeling techniques at scale
would require significant resources to support media
data coding and aggregation, or some process
automation would need to be developed.
The impact of earned media is likely to be small
relative to other factors, so we need to be prepared to
message that effectively to clients.
IMPACT ON WORK/ORGANIZATION
This work establishes a foundation for furthering
discussions around the types of data and skill sets
required to develop valid models to evaluate the
impact of earned media on business outcomes.
MS PROGRAM IMPACT
IMPACT FROM MS ANALYTICS PROGRAM
Experience with a range of modeling techniques and tools has
broadened my perspective on approaches to evaluating
communications performance.
PROFESSIONAL DEVELOPMENT GAINED
Exposure to the practical application of a range of tools,
techniques and coding languages to solve business problems.
Foundation in modeling methods to inform and advance
discussions with data vendors and platform partners.
Insight into the tools and skill sets specific to data modeling that
should be incorporated into the agency’s recruiting and
professional development plans.
Improved understanding of quality controls and validation
processes that should be incorporated into the agency’s
burgeoning modeling capabilities.