Let’s introduce Amazon KinesisInaugural meetup of the
Amazon Kinesis - London User Group
This evening
• Introducing Amazon Kinesis, Ian Meyers, AWS
• Pizza and drinks break
• Kinesis and Snowplow, Alex Dean, Snowplow Analytics
• Drinks
• All courtesy of our hosts:
Introducing Amazon Kinesis
Snowplow and Kinesis
1. Snowplow – who we are
2. Why are we excited about Kinesis?
3. Adding Kinesis support to Snowplow
4. Live demo!
5. Questions
Snowplow – who we are
Today, Snowplow is primarily an open source web analytics platform
Website / webappSnowplow: data pipeline
Collect Transform and enrich
Amazon Redshift /
PostgreSQL
Amazon S3
• Your granular, event-level and customer-level data, in your own data warehouse
• Connect any analytics tool to your data• Join your web analytics data with any other data set
Snowplow was born out of our frustration with traditional web analytics tools…• Limited set of reports that don’t answer business questions
• Traffic levels by source• Conversion levels• Bounce rates• Pages / visit
• Web analytics tools don’t understand the entities that matter to business• Customers, intentions, behaviours, articles, videos, authors,
subjects, services… • …vs pages, conversions, goals, clicks, transactions
• Web analytics tools are siloed• Hard to integrate with other data sets incl. digital (marketing
spend, ad server data), customer data (CRM), financial data (cost of goods, customer lifetime value)
…and out of the opportunities to tame big data new technologies presented
These tools make it possible to capture, transform, store and analyse all your granular, event-level data, to you can perform any analysis
Snowplow is composed of a set of loosely coupled subsystems, architected to be robust and scalable
1. Trackers 2. Collectors 3. Enrich 4. Storage 5. AnalyticsA B C D
A D Standardised data protocols
Generate event data
Examples:• Javascript
tracker• Python /
Lua / No-JS / Arduino tracker
Receive data from trackers and log it to S3
Examples:• Cloudfront
collector• Clojure
collector for Amazon EB
Clean and enrich raw data
Built on Scalding / Cascading / Hadoop and powered by Amazon EMR
Store data ready for analysis
Examples:• Amazon
Redshift• PostgreSQL• Amazon S3
• Batch-based• Normally run overnight; sometimes
every 4-6 hours
Why are we excited about Kinesis?
A quick history lesson: the three eras of business data processing
1. The classic era, 1996+
2. The hybrid era, 2005+
3. The unified era, 2013+
For more see http://snowplowanalytics.com/blog/2014/01/20/the-three-eras-of-business-data-processing/
The classic era, 1996+
OWN DATA CENTER
Data warehouse
HIGH LATENCY
Point-to-point connections
WIDE DATA COVERAGE
CMS
Silo
CRM
Local loop Local loop
NARROW DATA SILOES LOW LATENCY LOCAL LOOPS
E-comm
SiloLocal loop
Management reporting
ERP
SiloLocal loop
Silo
Nightly batch ETL process
FULL DATA HISTORY
The hybrid era, 2005+
CLOUD VENDOR / OWN DATA CENTER
Search
SiloLocal loop
LOW LATENCY LOCAL LOOPS
E-comm
SiloLocal loop
CRM
Local loop
SAAS VENDOR #2
Email marketing
Local loop
ERP
SiloLocal loop
CMS
SiloLocal loop
SAAS VENDOR #1
NARROW DATA SILOES
Stream processing
Productrec’s
Micro-batch processing
Systems monitoring
Batch processing
Data warehouse
Management reporting
Batch processing
Ad hoc analytics
Hadoop
SAAS VENDOR #3
Web analytics
Local loop
Local loop Local loop
LOW LATENCY LOW LATENCY
HIGH LATENCY HIGH LATENCY
APIs
Bulk exports
The unified era, 2013+CLOUD VENDOR / OWN DATA CENTER
Search
Silo
SOME LOW LATENCY LOCAL LOOPS
E-comm
Silo
CRM
SAAS VENDOR #2
Email marketing
ERP
Silo
CMS
Silo
SAAS VENDOR #1
NARROW DATA SILOES
Streaming APIs / web hooks
Unified log
LOW LATENCY WIDE DATA
COVERAGE
Archiving
Hadoop
< WIDE DATA
COVERAGE >
< FULL DATA
HISTORY >
FEW DAYS’ DATA HISTORY
Systems monitoring
Eventstream
HIGH LATENCY LOW LATENCY
Product rec’sAd hoc analytics
Management reporting
Fraud detection
Churn prevention
APIs
CLOUD VENDOR / OWN DATA CENTER
Search
Silo
SOME LOW LATENCY LOCAL LOOPS
E-comm
Silo
CRM
SAAS VENDOR #2
Email marketing
ERP
Silo
CMS
Silo
SAAS VENDOR #1
NARROW DATA SILOES
Streaming APIs / web hooks
Unified log
Archiving
Hadoop
< WIDE DATA
COVERAGE >
< FULL DATA
HISTORY >
Systems monitoring
Eventstream
HIGH LATENCY LOW LATENCY
Product rec’sAd hoc analytics
Management reporting
Fraud detection
Churn prevention
APIs
The unified log is Kinesis (or Kafka)
CLOUD VENDOR / OWN DATA CENTER
Search
Silo
SOME LOW LATENCY LOCAL LOOPS
E-comm
Silo
CRM
SAAS VENDOR #2
Email marketing
ERP
Silo
CMS
Silo
SAAS VENDOR #1
NARROW DATA SILOES
Streaming APIs / web hooks
Unified log
Archiving
Hadoop
< WIDE DATA
COVERAGE >
< FULL DATA
HISTORY >
Systems monitoring
Eventstream
HIGH LATENCY LOW LATENCY
Product rec’sAd hoc analytics
Management reporting
Fraud detection
Churn prevention
APIs
Can we implement Snowplow on top of Kinesis?
Adding Kinesis support to Snowplow
Where we are heading with our Kinesis architecture
Scala Stream Collector
Raw event stream
Enrich Kinesis app
Bad raw events stream
Enriched event
stream
S3
Redshift
S3 sink Kinesis app
Redshift sink Kinesis
app
Snowplow Trackers
We took an important first step in our last release…
hadoop-etl
Record-level enrichment functionality
scala-common-enrich
scala-hadoop-enrich scala-kinesis-enrich
0.8.12pre-0.8.12
… and the next release should get us much closer
Scala Stream Collector
Raw event stream
Enrich Kinesis app
Bad raw events stream
Enriched event
stream
S3
Redshift
S3 sink Kinesis app
Redshift sink Kinesis app
Snowplow Trackers
Live demo!
Questions?
http://snowplowanalytics.comhttps://github.com/snowplow/snowplow
@snowplowdata
And finally…
Huge thanks to our hosts!