Approaching Big Data: Lesson Plan
TRANSCRIPT
Agenda
What is Big Data?
• Some Definitions
• Mixed Methods Approach
Champions League & World Cup Case Study
• Process
• Results and Usage
• Pitfalls and Learnings
Moving Forward
• Data Approach Flow
• Caveats
• Organization and Communication
What is Big Data?
So many different definitions… nobody quite agrees… except that it’s definitely a buzzword
What is Big Data?
It is generally agreed that big data is messy and complex. That is both an opportunity and a challenge for us to innovate.
“an all-encompassing term for any collection of data sets so large and complex that it becomes difficult to process using on-hand data management tools or
traditional data processing applications.”
“Big data is a buzzword, or catch-phrase, used to describe a massive volume of both structured and unstructured data that is
so large that it's difficult to process using traditional database and software techniques. In most enterprise scenarios the data is too big or it moves
too fast or it exceeds current processing capacity. Big data has the potential to help companies improve operations and
make faster, more intelligent decisions.”
“Volume, Variety, Velocity, Variability, Complexity”
Quotes from:
http://www.forbes.com/sites/gilpress/2014/09/03/12-big-data-definitions-whats-yours/2/
http://www.webopedia.com/TERM/B/big_data.html
http://en.wikipedia.org/wiki/Big_data
… for leveraging engagement at least.
Determine Right Questions and Goals for Data
Interdisciplinary Approach
Iterative Refinement
“Combining the what (quantitative) with the why (qualitative) can be exponentially powerful. It is also critical to our ability to take all our
clickstream data and truly analyze it, to find insights that drive meaningful website changes that will improve our customers’
experiences.” – Avinash Kaushik
Answer: Mixed Methods and Innovation
Quote from: Web Analytics: An Hour a Day by Avinash Kaushik
Sports Fan and Engagement Study
Overall Goals for HAVAS
• to identify and define communities of sports fans based around passion points (A)
• to analyze fan interactions with those passions (B)
• to position HAVAS Sports & Entertainment to more effectively advise brands on how to meaningfully engage with sports fans by leveraging passion-based communities (C)
Big Data Research Objectives
External for Havas
• Discover a mixed methodology framework for sports and entertainment fan engagement
Internal for Lab
• Justify our fan logic topology in relation to Twitter conversations through natural language processing
Initial Data Collection Steps
1) Modify the data collection process to fit live soccer events, using the Champions League as a test run
2) Establish a methodology for seeding the initial pool of users, keywords, and hashtags
3) Analyze tweets and how they fit into logics of engagement
4) Establish a methodology for gaining insight from Twitter conversations
“Analyzing Big Data is a BIG JOB with Many People” – Jake
Inputs & Equipment
Keywords, hashtags, and user clusters in a .txt seed file
Dedicated server system collecting information
Engineering
Run and modify Python script
Register Public Screening API
Parse for results
Live Viewing Team
Team to watch game and look for patterns
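A minimal sketch of how a seed file of keywords, hashtags, and handles could drive collection. The file layout (one term per line, `#` for hashtags, `@` for handles) and the function names `load_seeds` and `matches_seed` are assumptions for illustration; the deck does not describe the actual script.

```python
def load_seeds(lines):
    """Split seed lines into hashtags, user handles, and plain keywords.

    Assumed layout: one term per line, '#' marks a hashtag, '@' marks a handle.
    """
    seeds = {"hashtags": set(), "users": set(), "keywords": set()}
    for raw in lines:
        term = raw.strip().lower()
        if not term:
            continue  # skip blank lines
        if term.startswith("#"):
            seeds["hashtags"].add(term)
        elif term.startswith("@"):
            seeds["users"].add(term)
        else:
            seeds["keywords"].add(term)
    return seeds


def matches_seed(text, author, seeds):
    """Return True if a tweet is written by, or mentions, anything in the seed pool."""
    lowered = text.lower()
    # Strip trailing punctuation so "#uclfinal!" still matches "#uclfinal".
    tokens = {tok.strip(".,!?:;\"'") for tok in lowered.split()}
    if "@" + author.lower() in seeds["users"]:
        return True
    if tokens & (seeds["hashtags"] | seeds["users"]):
        return True
    return any(keyword in lowered for keyword in seeds["keywords"])


# Hypothetical seed lines; in practice these would be read from the .txt file.
seeds = load_seeds(["#UCLfinal", "@ChampionsLeague", "penalty", ""])
print(matches_seed("Watching the #UCLfinal!", "fan2", seeds))
```

New seeds spotted by the live viewing team can simply be appended to the file before the next run.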
Data Collection Process
Engineering & Team: Tech and Data Set-Up
Engineer: Run Script with Seed File
Team: Watch Event for Patterns and Additional Seeds
Team: Decide Data to Analyze
Engineer: Parse Data into User-Friendly Format
Team: Look at Data and Prepare for Next Event
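The "parse data into user-friendly format" step could look roughly like the sketch below, which turns one-JSON-object-per-line tweet records into CSV that the team can open in a spreadsheet. The field names (`created_at`, `user.screen_name`, `text`) and the helper name `tweets_to_csv` are assumptions about what the collector saved.

```python
import csv
import io
import json


def tweets_to_csv(json_lines):
    """Convert JSON-lines tweet records into CSV text, skipping unreadable lines."""
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["created_at", "screen_name", "text"])
    for line in json_lines:
        line = line.strip()
        if not line:
            continue
        try:
            tweet = json.loads(line)
        except json.JSONDecodeError:
            continue  # a partial write from the collector; ignore it
        writer.writerow([
            tweet.get("created_at", ""),
            tweet.get("user", {}).get("screen_name", ""),
            " ".join(tweet.get("text", "").split()),  # collapse newlines and extra spaces
        ])
    return out.getvalue()


# Made-up sample records for illustration only.
sample = [
    '{"created_at": "t1", "user": {"screen_name": "fan1"}, "text": "GOAL!\\nWhat a game"}',
    '{"created_at": "t2", "user": {"screen_n',  # truncated line
]
csv_text = tweets_to_csv(sample)
print(csv_text)
```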
Headliners
Official Organization Handles
Official Team Handles
Official Hashtags
Sponsors
Team Names
Key Terms
Key Players
Sponsors
Sponsors often promote official hashtags during sporting events to cross-promote their brand and the event.
Supporting Characters
Superfans - Fans with unusual followings on Twitter
Sports Commentators - ESPN commentators and the like
Prominent Bloggers - Blogs or bloggers with a large following on certain teams
Initial Data Seed Scoping Caveats
• Twitter caps the Public API at a couple of thousand tweets per second
• Tweets received through the Public API do not appear to be affected by location-based factors the way individual user feeds are
• Twitter chunks these tweets using an undisclosed algorithm, according to what it deems important
• The number of tweets scraped renders these factors nominal in terms of large-scale user behavior
What kind of tweets, or tone in tweets, fit into the logics of engagement? *Informed by survey and ethnography
Entertainment, Immersion, Social Connection, Identification, Mastery, Pride, Play, Advocacy
Operational Process
Plan for World Cup & Modeling with Beacon Capabilities
See how conversations analyzed from a big data perspective fit and build on the logics of engagement model
Determine what data frameworks worked in capturing useful information
Initial qualitative look at data
Big Data Basic Methods of Analysis
Textual
• Text processing of tweets and plotting using algorithms into agglomerative clusters (aka cool visuals)
• Frequency of terms, associations, and word clouds fall under here
• Goal: Find the texts that spurred the most conversation
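As a rough illustration of the term-frequency part of textual analysis, the sketch below counts words and hashtags across a handful of tweets. The stopword list and sample tweets are made up, and `term_frequencies` is a hypothetical helper, not the project's script; clustering and word clouds would build on counts like these.

```python
import re
from collections import Counter

# A tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"a", "an", "the", "by", "what", "that", "are", "is", "of", "to"}


def term_frequencies(tweets):
    """Count terms (including #hashtags and @mentions) across tweet texts."""
    counts = Counter()
    for text in tweets:
        text = re.sub(r"https?://\S+", " ", text.lower())  # drop links
        for token in re.findall(r"[#@]?\w+", text):
            if token not in STOPWORDS and len(token) > 1:
                counts[token] += 1
    return counts


tweets = [
    "GOAL! What a goal by Germany #WorldCup",
    "Germany are flying #WorldCup http://t.co/x",
    "The goal that changed the game",
]
counts = term_frequencies(tweets)
print(counts.most_common(3))
```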
Networks
• A way to visually see social connection data
• Understand forms of bonds and the connections between individual data points worth exploring
• Goal: Detecting communities (our clusters, brands)
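One very simple way to approach the network idea is to treat mentions and retweets as links between accounts and then group accounts that are connected. The sketch below uses plain connected components as a stand-in for community detection; real community detection would use a stronger algorithm, and the function names and sample tweets are assumptions.

```python
import re
from collections import defaultdict, deque


def build_mention_graph(tweets):
    """Build an undirected graph linking each author to the accounts they mention."""
    graph = defaultdict(set)
    for author, text in tweets:
        a = author.lower()
        graph[a]  # make sure authors with no mentions still appear as nodes
        for mentioned in re.findall(r"@(\w+)", text):
            m = mentioned.lower()
            if m != a:
                graph[a].add(m)
                graph[m].add(a)
    return graph


def connected_components(graph):
    """Group accounts that are linked directly or indirectly (breadth-first search)."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        seen.add(start)
        queue, component = deque([start]), set()
        while queue:
            node = queue.popleft()
            component.add(node)
            for neighbor in graph[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        components.append(component)
    return sorted(components, key=len, reverse=True)


# Made-up sample tweets as (author, text) pairs.
tweets = [
    ("fan1", "RT @FCBayern: Mia san mia"),
    ("fan2", "@fan1 what a match"),
    ("fan3", "Go @Argentina!"),
    ("loner", "no mentions here"),
]
components = connected_components(build_mention_graph(tweets))
print(components)
```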
Sentiment
• Toolkits (such as Hootsuite) that measure “sentiment” using positive and negative language
• Can be used to see if an initiative performed well
• Goal: Measure success of a campaign at different times
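To make "positive and negative language" concrete, here is a toy lexicon-based scorer that averages sentiment per time window. It is only a sketch: the word lists are invented, the window labels are hypothetical, and it is not how any particular toolkit computes sentiment.

```python
import re

# Invented word lists for illustration; real lexicons are far larger.
POSITIVE = {"great", "love", "amazing", "win", "brilliant"}
NEGATIVE = {"terrible", "hate", "awful", "lose", "boring"}


def sentiment_score(text):
    """Positive word count minus negative word count for one tweet."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)


def average_sentiment_by_window(tweets):
    """Average score per time window, given (window_label, text) pairs."""
    totals = {}
    for window, text in tweets:
        score_sum, n = totals.get(window, (0, 0))
        totals[window] = (score_sum + sentiment_score(text), n + 1)
    return {window: score_sum / n for window, (score_sum, n) in totals.items()}


tweets = [
    ("first half", "Love this campaign, amazing ad"),
    ("first half", "boring match"),
    ("second half", "terrible, awful ad"),
]
by_window = average_sentiment_by_window(tweets)
print(by_window)
```

Comparing windows before, during, and after a campaign moment is one way to read "success at different times".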
Big Data Low-Hanging Fruit - Topline
RT Author Screenname    Count
FIFAWorldCup            76172
9GAG                    37459
DFB_Team_EN             21247
BBCSport                19564
FCBayern                14782
FTBpro                  13409
_Snape_                 11371
benparr                 10616
TheTweetOfGod           9435
espn                    7465
Queen_UK                7174
thereaIbanksy           7113
sulsultm3               6646
damnitstrue             6603
asshaaban               6513
SportsCenter            6470
fifaworldcup_es         6365
LicDice_                6361
FIFAworldcup_e          6241
DFB_Team                6114
Argentina               5964
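A topline table like this one can be produced by counting who gets retweeted. The sketch below assumes retweets appear in the collected text in the classic "RT @screenname: ..." form; `top_retweeted_authors` is a hypothetical helper and the sample tweets are made up.

```python
import re
from collections import Counter

# Classic retweet convention: the text starts with "RT @screenname".
RT_PATTERN = re.compile(r"^RT @(\w+)")


def top_retweeted_authors(texts, n=5):
    """Return the n most retweeted screen names with their retweet counts."""
    counts = Counter()
    for text in texts:
        match = RT_PATTERN.match(text)
        if match:
            counts[match.group(1)] += 1
    return counts.most_common(n)


sample = [
    "RT @FIFAWorldCup: GOAL!",
    "RT @FIFAWorldCup: Full time",
    "RT @espn: Wow",
    "Original tweet mentioning @espn",
]
top = top_retweeted_authors(sample)
print(top)
```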
Initial Idealized Approach
1) Fan Handles
2) Game Data
3) Brand Data
Integrate insights with Ethnographic and Survey Data for final deliverables
• Survey Twitter Handles
– See if their online behavior matches survey logics
– What does the content they’re sharing look like
– Trends by cluster, gender, other data points
• Match Data
– Look for clusters of behavior to events in games
– See popularity of brand campaigns and behavioral response to brand stories
– Gain insight from bursts of activity and real-time marketing
– See what are characteristics of influencers
• Brand Data
– Identify how these strategies were executed in online conversations and responses
– Identify types of interactions/content/other markers around brands on Twitter
– Do influential brands mean consistent users interacting across brands? Why are people interacting in this way? How can we categorize these interactions according to our logic clusters?
– Was the content agile?
– See how users responded by the logics to different types of content
– Look for differences in fan response and fan-initiated behavior to the brands
Questions and Hypothesis
What We Planned To Do
• Steps
  • Define interesting WC fan moments and brand moments
  • Examine moments in time and certain brand campaigns
  • Investigate possible Natural Language Processing tools
  • Formulated Questions
• Timeline
  • Created a timeline assigning roles to each person
• Deliverables
  • TBD, likely looking at clusters of behavior around brand campaigns.
  • Sentiment analysis may tie in here
Ethnographic Report - What did people say about the brand or the logics they used?
Survey Data - For the brand logic used, what is the intensity and who are the clusters?
Big Data - How did audiences respond online to actions by the brand?
Approaching with Mixed Methods
Exercise: Group Datasets
Figure out what insight you might be able to get from each piece of data and how you would apply mixed methods.
The Future of Social Media Analytics
“We will be moving beyond key-word based queries into machine-learning algorithms. Influencers whom I have [spoken] with echo similar ideas about the increasing use and refine[ment] of latent semantic indexing (or some variant of it) and other machine-learning algorithms in order to improve social listening, automatic categorization of content, and the ability to take action on data” - Marshall Sponder
The Dashboard Build Process
Pulled 250 Retweeted Tweets with Verification from BigSheets
Coded Tweets According to Logic for Testing Data
Built Dictionary According to Sample Tweets, Ethnography, Survey
Created Natural Language Processing and Machine Learning Algorithms
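The dictionary step can be pictured as a keyword lookup per logic of engagement. The sketch below is a deliberately simple stand-in for the project's NLP and machine-learning models: the dictionary terms are invented, only three of the logics are shown, and `classify_by_logic` is a hypothetical name.

```python
import re

# Invented dictionary terms for illustration; the real dictionary was built
# from sample tweets, ethnography, and survey data.
LOGIC_DICTIONARY = {
    "Pride": {"proud", "our", "nation"},
    "Mastery": {"stats", "tactics", "formation"},
    "Social Connection": {"friends", "together", "family"},
}


def classify_by_logic(text, dictionary=LOGIC_DICTIONARY):
    """Return the logic whose dictionary terms overlap most with the tweet, or None."""
    tokens = set(re.findall(r"[a-z']+", text.lower()))
    scores = {logic: len(tokens & terms) for logic, terms in dictionary.items()}
    best = max(scores, key=scores.get)  # ties resolve to the first logic listed
    return best if scores[best] > 0 else None


print(classify_by_logic("So proud of our nation tonight"))
print(classify_by_logic("Watching together with friends and family"))
```

A trained model would replace the overlap count with learned weights, using the hand-coded tweets as training data.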
Fan Engagement Dashboard Prototype
Model
Technology
Collaboration
Innovation
jStart Beacon
Custom-Built Twitter Collection Web App
jStart BigSheets
Leveraging Engagement Framework
Annenberg Innovation Lab Fan Engagement Dashboard built through collaboration and mixed methods learning.
67% accuracy in classifying tweets by Logic of Engagement, leading to actionable insight and business intelligence for Leveraging Fan Engagement.
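An accuracy figure like this is the share of tweets where the model's logic matches the human coder's label. The sketch below shows that calculation on three made-up labels; it does not reproduce the project's data, and `accuracy` and `confusions` are hypothetical helper names.

```python
from collections import Counter


def accuracy(predicted, actual):
    """Fraction of predictions that match the hand-coded labels."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty label lists of equal length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


def confusions(predicted, actual):
    """Count (hand-coded label, predicted label) pairs where the model was wrong."""
    return Counter((a, p) for p, a in zip(predicted, actual) if p != a)


# Made-up labels for illustration only.
hand_coded = ["Pride", "Play", "Mastery"]
model_output = ["Pride", "Play", "Pride"]

score = accuracy(model_output, hand_coded)
errors = confusions(model_output, hand_coded)
print(f"{score:.0%}", errors)
```

Looking at which logics get confused with each other is what guides the next round of dictionary and model refinement.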
The Process End-to-End
Collecting and Managing Data
Data Back Up
Data Clean Up
Run Models
Gain Insights
Refine Models
Learn Actionable Insights
Communicate Insights (Reports, Infographic Blueprints)
Create Initial Dictionary for Natural Language Processing
Annotate/Code Tweets for Training Data for Machine Learning
Created Dashboard
Improve on Design
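The data clean-up step, as one possible sketch: drop duplicate tweets, decode HTML entities, strip links, and tidy whitespace before any model runs. The `(tweet_id, text)` record shape and the name `clean_tweets` are assumptions for illustration.

```python
import html
import re


def clean_tweets(tweets):
    """De-duplicate by tweet id and normalize text; drop tweets left empty."""
    seen, cleaned = set(), []
    for tweet_id, text in tweets:
        if tweet_id in seen:
            continue  # the collector saw this tweet more than once
        seen.add(tweet_id)
        text = html.unescape(text)               # "&amp;" -> "&"
        text = re.sub(r"https?://\S+", "", text)  # remove links
        text = " ".join(text.split())             # collapse whitespace and newlines
        if text:
            cleaned.append((tweet_id, text))
    return cleaned


# Made-up raw records for illustration only.
raw = [
    (1, "GOAL!!  &amp; what a game http://t.co/abc"),
    (1, "GOAL!!  &amp; what a game http://t.co/abc"),
    (2, "   "),
    (3, "Line\nbreak"),
]
cleaned = clean_tweets(raw)
print(cleaned)
```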
Moving Forward
Your Challenge
• Your data will be different client-to-client
• Twitter is just the beginning
• You will get to be creative and work on collaborative cross-functional teams to dive into the data
• *This will be both rewarding and potentially difficult
Tasks Ahead
• Begin thinking about what you can learn from data to help our sponsors reach their goals
• Start thinking about how your fans behave in your approach to figuring out what questions to ask the data
Most Basic Steps
Determine Goals
Capture Data
Curate Data
Merge Datasets and Bring Together Methodologies if Necessary
Additional Data Processing to Usable Form
Deliver Insight to the Client
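For the "merge datasets" step, one common pattern is to join survey responses to Twitter activity by handle. The sketch below assumes survey rows carry a `handle` and a `cluster` field and tweets are `(author, text)` pairs; these shapes and the name `merge_survey_with_tweets` are hypothetical.

```python
from collections import Counter


def merge_survey_with_tweets(survey_rows, tweets):
    """Attach each survey respondent's tweet count, matching handles case-insensitively."""
    tweet_counts = Counter(author.lower() for author, _ in tweets)
    merged = []
    for row in survey_rows:
        handle = row["handle"].lstrip("@").lower()
        merged.append({**row, "tweet_count": tweet_counts.get(handle, 0)})
    return merged


# Made-up records for illustration only.
survey = [
    {"handle": "@FanA", "cluster": "Pride"},
    {"handle": "fanB", "cluster": "Play"},
]
tweets = [("fana", "Come on!"), ("FanA", "Yes!"), ("other", "Hmm")]

merged = merge_survey_with_tweets(survey, tweets)
print(merged)
```

With the two sources joined, survey clusters can be compared against what the same fans actually did online.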
Bumps in the Road Ahead
• Privacy Issues and Respecting the Fans
• Company layers and politics – releasing data from companies is fraught with back and forth
• Getting data into a usable form
• Assumptions were wrong or have to be redefined – it’s ok to fail fast – but be ready to keep moving
• Working in cross-functional groups
Image from: CapGemini http://www.capgemini.com/sites/default/files/technology-blog/files/2012/09/big-data-vendors.jpg
Bring it Together
Draw connections between the data sets and how they could relate to the eight logics and situational triggers.
“While social media data are always interesting in themselves (at least, for an analyst), when business owners are able to combine data and layer them
efficiently, the information will become more useful and actionable.” – Marshall Sponder