HP Discover: Real-Time Insights from Big Data
DESCRIPTION
Slides from HP Discover Europe, 10-12 December 2013, covering systems architecture, use cases, and real-time interactive visualization.
TRANSCRIPT
Billions of Rows, Millions of Insights
Right Now
Developing a Landscape for Real Time Information
Spil Games: A leader in online gaming
• 180 million monthly and 12 million daily players
• >50 websites, localized in 15 languages
• A rich source of data about traffic, content, and consumers
• Battling changing consumer expectations on content delivery (the Netflix effect)
Big data created big paradigm shifts
Traditional data:
• Highly consistent
• Highly connectable
• Inflexible
• Slow
Big data:
• Open
• Adaptive/Evolving
• Inconsistent
You always need both
Traditionally, we define data based on what we expect
With big data, we capture first and define later
Capture → Explore → Define → Apply + Track
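The capture-first, define-later workflow above can be sketched in a few lines. This is a minimal illustration, not Spil's actual pipeline; the event fields (`user`, `game`, `ms_played`) are hypothetical:

```python
import json

# Capture: store raw events verbatim, with no schema imposed up front.
raw_events = [
    json.dumps({"user": "u1", "game": "bubble", "ms_played": 4200}),
    json.dumps({"user": "u2", "country": "NL", "game": "mahjong"}),
]

# Explore: inspect which fields actually occur in the captured data.
fields = set()
for line in raw_events:
    fields.update(json.loads(line).keys())

# Define: pick a schema after exploration; fields absent in an event become None.
schema = ["user", "game", "ms_played"]
rows = [tuple(json.loads(line).get(col) for col in schema) for line in raw_events]

print(sorted(fields))  # every field seen in the raw capture
print(rows)            # structured rows, ready to load and track
```

The point of the sketch: the raw capture loses nothing (the unexpected `country` field is still there to explore), while the defined schema can evolve after the fact.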
What is big data?
Big data also brings new challenges: the four Vs
• VELOCITY
• VARIETY
• VERACITY
• VALUE: the only V that matters
Velocity: What is real time?
Traditional ETL:
• Once a day
• Once a week
• Delayed
"Real Time":
• Faster than human perception
• <200 milliseconds
"In Time": Information is available fast enough to influence decisions
• While in the shop/on the site (minutes)
• While the query runs (seconds)
• While the page loads (milliseconds)
The Velocity Continuum
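The continuum can be made concrete with a small classifier. The tier boundaries below are illustrative, taken loosely from the slide's examples (<200 ms for perception; page/query/visit timescales for "in time"):

```python
def velocity_tier(latency_ms: float) -> str:
    """Classify a data-delivery latency on the velocity continuum
    (thresholds are illustrative, not from the talk)."""
    if latency_ms < 200:
        return "real time (faster than human perception)"
    if latency_ms < 1_000:
        return "in time: while the page loads"
    if latency_ms < 60_000:
        return "in time: while the query runs"
    if latency_ms < 3_600_000:
        return "in time: while the visitor is on the site"
    return "traditional ETL (delayed)"

print(velocity_tier(150))         # real time
print(velocity_tier(5_000))       # in time: while the query runs
print(velocity_tier(86_400_000))  # a daily batch: traditional ETL
```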
How big data drives value at Spil
Informing Decisions:
• Day-to-day business reporting
• Analytical reporting for self-service analysis
• Business analytics for advising decisions
• Descriptive models to explain our business
• Customer Lifetime Value
• Marketing ROI
Making Decisions:
• Customer content recommendations
• Email campaign targeting
• Site learning and optimization
• System monitoring and alerting
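As one concrete "making decisions" example, content recommendations can start as simply as co-occurrence counting. This is a toy sketch with hypothetical game names, not Spil's recommender:

```python
from collections import Counter
from itertools import combinations

# Hypothetical play histories: which games each player touched.
histories = [
    ["bubble", "mahjong", "solitaire"],
    ["bubble", "mahjong"],
    ["mahjong", "solitaire"],
]

# Count how often each ordered pair of games co-occurs in one player's history.
co_occurrence = Counter()
for games in histories:
    for a, b in combinations(sorted(set(games)), 2):
        co_occurrence[(a, b)] += 1
        co_occurrence[(b, a)] += 1

def recommend(game: str, n: int = 2) -> list[str]:
    """Games most often played alongside `game`."""
    scored = [(cnt, other) for (g, other), cnt in co_occurrence.items() if g == game]
    return [other for cnt, other in sorted(scored, reverse=True)[:n]]

print(recommend("bubble"))  # ['mahjong', 'solitaire']
```

In production the same counts would be computed over billions of events in Map/Reduce and the scores pushed to the production databases, but the decision logic is this simple at its core.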
The pieces needed for a big data stack

Unstructured data intake:
• Scalable
• Schemaless or adaptive schema
• Resilient

Unstructured data storage:
• Cheap
• Flexible schema
• Easy management

Structured data storage:
• High query performance
• Denormalized
• Scalable; high concurrency

SELECT A, B, SUM(C) FROM X GROUP BY 1, 2

Human interface layer:
• Highly flexible
• Simple to use
• In-tool metadata

Predictive analytics tools:
• Not memory constrained
• Flexible inputs/outputs
• Easy iteration
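A toy end-to-end wiring of these pieces, with SQLite standing in for the structured store (the talk's warehouse is Vertica) and JSON lines standing in for the unstructured intake; field names are invented for illustration:

```python
import json
import sqlite3

# Unstructured intake + storage: raw JSON lines, schemaless.
raw = [
    '{"country": "NL", "game": "bubble", "plays": 3}',
    '{"country": "NL", "game": "mahjong", "plays": 1}',
    '{"country": "DE", "game": "bubble", "plays": 2}',
]

# Structured storage: a denormalized table (SQLite stands in for the warehouse).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE plays (country TEXT, game TEXT, plays INTEGER)")
db.executemany(
    "INSERT INTO plays VALUES (:country, :game, :plays)",
    [json.loads(line) for line in raw],
)

# Human interface layer: the kind of aggregate query shown on the slide,
# grouping by column ordinals 1 and 2.
rows = db.execute(
    "SELECT country, game, SUM(plays) FROM plays GROUP BY 1, 2 ORDER BY 1, 2"
).fetchall()
print(rows)  # [('DE', 'bubble', 2), ('NL', 'bubble', 3), ('NL', 'mahjong', 1)]
```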
The nuts and bolts of our big data tech
Why we chose our tech
• Affordable
• Highly available and resilient
• Extremely fast development due to SQL
• Excellent query performance = lazy optimization
• Right price
• Easy (and fun!) development
• Excellent library availability
• Industry standard for Map/Reduce
• Cheap storage for the "data lake"
• Easy integration with existing tech
How much data do we handle?
Ingestion:
• Through Map/Reduce: 1.4 billion events/day (200 million rows/day into DWH)
• Through ETL: 100-200 million rows/day into DWH
Persistence:
• Map/Reduce: 20 billion rows
• Vertica: 50 billion rows
• Long-term storage: all of 2013's events
Usage:
• Predictive models: >500 million scores per day
• ETLs to production DBs: >10 models
• Reporting: 150 dashboards, 80 data sources
• Queries: >2,000 per day
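For a sense of scale, the daily volumes translate into these average per-second rates (back-of-the-envelope only; real traffic is bursty, so peaks are far higher):

```python
# Back-of-the-envelope rates implied by the volumes above.
events_per_day = 1_400_000_000   # events/day through Map/Reduce
rows_per_day = 200_000_000       # rows/day landing in the DWH
seconds_per_day = 24 * 60 * 60

avg_events_per_second = events_per_day / seconds_per_day
avg_rows_per_second = rows_per_day / seconds_per_day

print(f"{avg_events_per_second:,.0f} events/second on average")  # 16,204
print(f"{avg_rows_per_second:,.0f} rows/second into the DWH")    # 2,315
```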
What it drives for us every day
Demographic Prediction
Multivariate Testing/Site Optimization
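A core building block of multivariate testing is stable variant assignment: the same visitor must see the same combination of factor levels on every visit. A common way to get that without storing state is hash-based bucketing; the factors and levels below are hypothetical:

```python
import hashlib

# Hypothetical multivariate test: two factors, each with two levels (a 2x2 design).
factors = {
    "button_color": ["green", "orange"],
    "thumbnail_size": ["small", "large"],
}

def assign(user_id: str, factor: str, levels: list[str]) -> str:
    """Deterministically bucket a user into one level of a factor,
    so the same visitor always sees the same variant."""
    digest = hashlib.sha256(f"{factor}:{user_id}".encode()).hexdigest()
    return levels[int(digest, 16) % len(levels)]

variant = {f: assign("user-42", f, levels) for f, levels in factors.items()}
print(variant)  # one cell of the 2x2 design, stable for this user
```

Hashing the factor name together with the user id keeps assignments independent across factors, which is what lets one test measure main effects of each factor separately.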
Q&A + Demo