cafarella auto bigdata - university of michigan · cafarella_auto_bigdata.pptx author: mike...
Big Data and Automotive IT
Michael Cafarella
University of Michigan
September 11, 2013
A Banner Year for Big Data
Big Data
• Who knows what it means anymore?
• Associated with:
– Google, Facebook, Twitter
– Hadoop, MapReduce
– Cluster computing, cloud computing
– Machine learning, predictive analytics, data science
– Magazine covers
• For a large range of tasks, data availability is no longer a serious constraint
Data + Statistics = Predictors
• Web pages + user clicks → Google web search
• Movie views + user ratings → Netflix recommendations
• Tweets about illness → Disease outbreak estimates
• Cameras + laser scanners → Self-driving cars
• Cell phone sales records → Customer churn prediction
• Statistics grew up “data-poor”
• Old techniques are now very effective thanks to data
• Enabled by the Web, cheap disks, cheap sensors
• Google was among the first to see it coming
Agenda
• A Sample Big Data Task • Possible Tasks in Automotive IT
Tweets for Macroeconomic Prediction
• Why use Twitter?
• Tweets contain valuable information freely provided by the Tweeter in real time
– Quick and cheap relative to surveys
– Better at capturing turning points
– Permit retrospective analysis because beliefs and actions are “archived”
• Let’s try “unemployment”
The Data
• Tweets are short timestamped messages
• Explicit metadata: author, geography, time
• Implicit metadata: gender, age, many others
• Roughly 1B every 2 days
• More than 15% of online American adults use Twitter
Processing Pipeline
• How to turn raw text into predictions?
1. Obtain ~13B Tweets in 2011-2013 (compressed ~5 TB)
2. Enumerate and count all unique k-grams in data
3. Group counts by week, build all (k-gram, signal) pairs
4. Choose unemployment-related ones
5. Use signals to build model to predict new claims
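Examples of unemployment-related k-grams: “I need a job”, “I got fired”, etc.
Example counts (date, k-gram, count):
• 10/17, i need a job, 491
• 12/15, i love you, 5092
• 1/28, justin bieber, 940,291
Example pair: (I need a job, signal)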
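Step 3 of the pipeline can be sketched in a few lines of Python. This is a minimal illustration only. The function names, the Monday-based week convention, and the toy occurrences are assumptions, not details from the talk. It assumes k-grams have already been extracted (step 2) and arrive as (date, k-gram) records.

```python
from collections import Counter
from datetime import date, timedelta


def week_start(day: date) -> date:
    """Return the Monday of the week containing `day` (assumed convention)."""
    return day - timedelta(days=day.weekday())


def weekly_counts(occurrences) -> Counter:
    """Count k-gram occurrences keyed by (week, k-gram)."""
    counts = Counter()
    for day, gram in occurrences:
        counts[(week_start(day), gram)] += 1
    return counts


def signal(counts: Counter, gram: str) -> list:
    """Return one k-gram's weekly series as sorted (week, count) pairs."""
    return sorted((week, n) for (week, g), n in counts.items() if g == gram)


# Hypothetical toy records; the real corpus is ~13B Tweets.
occurrences = [
    (date(2011, 10, 17), "i need a job"),
    (date(2011, 10, 19), "i need a job"),
    (date(2011, 10, 25), "i need a job"),
    (date(2011, 12, 15), "i love you"),
]
counts = weekly_counts(occurrences)

# One (k-gram, signal) pair, as in step 3.
pair = ("i need a job", signal(counts, "i need a job"))
```

At corpus scale the same grouping would run as a distributed job rather than in one in-memory Counter, but the key structure, (week, k-gram) mapped to a count, stays the same.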
Deriving Signals
Each signal is derived from counts of k-grams
• Any consecutive sequence of k or fewer words
• Tweet of N words yields ~kN k-grams
• We used k=4 (enough for “I lost my job”)
“I teach at the University of Michigan”
• 1: I, teach, at, the, University, of, Michigan
• 2: “I teach”, “teach at”, “at the”, “the University”, …
• 3: “I teach at”, “teach at the”, “at the University”, …
Our Tweet corpus contains 2.55 billion unique 4-grams in English that appear at least three times
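The k-gram definition above can be made concrete with a short Python sketch. The function names and the whitespace tokenizer are assumptions for illustration, and the talk does not describe its tokenization. The `min_count` filter mirrors the "appear at least three times" threshold.

```python
from collections import Counter


def kgrams(text: str, k: int = 4) -> list:
    """List every run of 1..k consecutive words in `text`."""
    words = text.split()  # naive whitespace tokenizer (an assumption)
    return [
        " ".join(words[i:i + n])
        for n in range(1, k + 1)
        for i in range(len(words) - n + 1)
    ]


def frequent_kgrams(tweets, k: int = 4, min_count: int = 3) -> Counter:
    """Count k-grams over all tweets, keeping those seen at least `min_count` times."""
    counts = Counter()
    for text in tweets:
        counts.update(kgrams(text.lower(), k))
    return Counter({g: c for g, c in counts.items() if c >= min_count})


grams = kgrams("I teach at the University of Michigan")
# N = 7 words, k = 4: 7 + 6 + 5 + 4 = 22 k-grams,
# a little under the ~kN = 28 estimate because longer runs have fewer start positions.
print(len(grams))

# Hypothetical toy corpus to exercise the frequency threshold.
frequent = frequent_kgrams(["i lost my job"] * 3 + ["i love you"])
```

The exact count for a tweet of N ≥ k words is kN − k(k−1)/2, which approaches kN as tweets get longer.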
Choosing Signals
• Too many to examine by hand
• Good signals may not be obvious
– Lysol flu
• Obvious signals may not be good
– unemployment benefits
• Automated methods would be great, but very difficult. Our research focuses on this problem
• For now, manually formulate plausible ones
– I lost my job, I need a job, I want to work
Experiments
Signals
Category: Terms
• Search signals: find a job, looking for a job, looking for work, need a job
• Lost job signals: canned, laid off, fired (get fired, got fired, be fired, fired from, was fired, been fired, fired lol, being fired, just fired)
• Unemployment signal: unemployment
• Exclude “benefits,” “fired up,” others
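One possible reading of the include/exclude rule, sketched in Python. The talk does not say how exclusions were applied, so the matching logic here is an assumption. The term lists are abbreviated from the Signals slide, and `is_lost_job_tweet` is a hypothetical name.

```python
import re

# Abbreviated from the "Lost job signals" terms on the slide.
LOST_JOB_TERMS = ["canned", "laid off", "got fired", "was fired", "been fired", "fired from"]
# Phrases the slide says to exclude.
EXCLUDE_TERMS = ["fired up", "benefits"]


def phrase_pattern(phrases):
    """Compile a case-insensitive pattern matching any phrase on word boundaries."""
    alternatives = "|".join(re.escape(p) for p in phrases)
    return re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)


INCLUDE = phrase_pattern(LOST_JOB_TERMS)
EXCLUDE = phrase_pattern(EXCLUDE_TERMS)


def is_lost_job_tweet(text: str) -> bool:
    """True if the text contains a lost-job term and no excluded phrase."""
    return bool(INCLUDE.search(text)) and not EXCLUDE.search(text)


for text in ["I just got fired", "I got fired up for tonight", "Laid off today"]:
    print(text, "->", is_lost_job_tweet(text))
```

Dropping a whole tweet on any excluded phrase is a blunt choice. An alternative is to subtract the excluded k-gram counts from the signal's counts.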
Figure: Initial Claims (SA) versus Twitter Index, Learn log(2). Y-axis: thousands, weekly (320 to 460). X-axis: 2011 to 2013. Series: Initial Claims, Twitter Index.
Do Twitter Signals Carry Incremental Information?
• Panel of economists predict unemployment, make mistakes
• Can we predict economists’ surprise?
– If Twitter adds nothing new, should be impossible
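The surprise test can be illustrated with a one-variable least-squares fit. This is a sketch under stated assumptions: the talk does not specify the model, surprise is taken here to be actual claims minus the economists' consensus forecast, and all numbers below are synthetic.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x; returns (slope, intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(x, y, slope, intercept):
    """Share of the variance in y explained by the fitted line."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Synthetic weekly values, for illustration only.
actual_claims = [365000, 372000, 350000, 381000, 359000]
consensus = [360000, 370000, 355000, 370000, 362000]
twitter_index = [1.2, 0.9, -0.8, 2.1, -0.4]

# Surprise = what the economists missed.
surprise = [a - c for a, c in zip(actual_claims, consensus)]

slope, intercept = fit_line(twitter_index, surprise)
r2 = r_squared(twitter_index, surprise, slope, intercept)
```

If the Twitter index carried no incremental information, the fit should explain little of the surprise on held-out weeks. An in-sample fit like this one cannot settle that by itself.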
Creating Measures of Labor Market Flows using Social Media
Figure: Initial Claims for Unemployment Benefits, Revised Data. Y-axis: -30000 to 40000. X-axis: months J A S O N D J F M A M J J, starting 2012. Series: Surprise, Predicted with Twitter.
Finding Novel Applications: Some Rules of Thumb
1. Data is the critical resource, often overlooked
– Great data makes a middling analyst look good
– The reverse isn’t true
2. Look for “data exhaust” to exploit
– Sales records, transaction logs, phone logs
3. Datasets are synergistic
– Weather data is boring
– Weather + repair data is compelling
4. Resource optimization often pays off quickly
5. Novel services possible, yield bigger impact
Resource Optimization
• Predict demand for models & colors (possibly prior to manufacture)
– Can be localized to states, probably counties
– Especially useful for dealer inventory management
– Also possible for parts, components, accessories
• Predict service issues
– Manufacturer warranty liability
– Daily load on service staff
• Predict buyer-specific propensity to purchase
– (See Charles Duhigg, NYTimes, 2/19/2012)
Novel Services
• Auto Owners
– Better prediction => accurate contract pricing
– Better service and “Refuel now!” warnings
– Next-purchase recommendations
• Traffic and Infrastructure
– Traffic prediction (e.g., “Tell me when to leave work”)
– Street-specific maintenance and salting
– Intersection-specific accident prediction
– Find fun drives