Growing on Amazon Web Services Abhishek Sinha Amazon Web Services @abysinha
Our Journey Today
Growth Hacking
Growth hacking is a marketing technique developed by technology startups which uses creativity, analytical thinking, and social metrics to sell products and gain exposure.
“At Airbnb, we look into all possible ways to improve our product and user experience. Often times this involves lots of analytics behind the scenes.”
Learn and Iterate: Hypothesis → MVP → Learn
“Hosts with professional photography will get more business. And hosts will sign up for professional photography as a service.”
Build an MVP – 20 Photographers
Saw the proverbial “Hockey Stick”
Airbnb then scaled the idea:
• Professional photography services
• Increased the requirements for photo quality
• Watermarked photos for authenticity
• Key metric tracked – “Shoots per month”
• April 2012 – 5,000 shoots per month
• Growth can sometimes come from unexpected areas
Our Journey Today
BUILD-MEASURE-LEARN: The fundamental activity of a startup is to turn ideas into products, measure how customers respond, and then learn whether to pivot or persevere. All successful startup processes should be geared to accelerate that feedback loop.
In a startup, the purpose of analytics is to iterate to product/market fit before the money runs out. – Lean Analytics by Alistair Croll and Ben Yoskovitz
Our Journey Today
Lean Metrics
What do these metrics look like? Depends upon what stage your startup is at.
And what is your favorite analytics framework?
Dave McClure – Pirate Metrics
Source: http://www.slideshare.net/dmc500hats/startup-metrics-for-pirates-long-version
Lean Analytics Stages – Credits: Alistair Croll and Ben Yoskovitz
One Metric that Matters: f(stage, business) = metric that matters
bit.ly/BigLeanTable – Credits: Alistair Croll and Ben Yoskovitz
Example – E-commerce: Stage → Metrics
• Empathy: How do buyers become aware of the need? How do they try to find the solution? What pain do they encounter as a result? What are their demographics and tech profiles?
• Stickiness: Conversion, shopping cart size. Acquisition: cost of finding new buyers. Loyalty: percent of buyers who return in 90 days.
• Virality: Acquisition mode: customer acquisition cost, volume of sharing. Loyalty mode: ability to reactivate, volume of buyers who return.
• Revenue: Transaction value, revenue per customer, ratio of acquisition cost to LTV, direct sales metrics.
• Scale: Affiliates, channels, white-label product ratings, reviews, support costs, returns (RMA) and refunds, channel conflict.
Source: Bit.ly/BigLeanTable
Our Journey Today – Lean Metrics
1. What do they look like? Depends upon stage and type of startup
2. Which one should I focus on? Preferably one (bit.ly/BigLeanTable)
Where do I get these metrics from?
Logs – Uses and Types
Used for:
• Operational metrics
• Application/business-related metrics
Types:
• Operating system logs
• Web server logs
• Database logs
• CDN logs
• Application logs
User Engagement in Online Video
[Source: Conviva Viewer Experience Report – 2013]
Requirements for Gaming Company

Cost Analysis
• Data transfer: by date/time; by edge location; by date/time within an edge location; by top X URLs; by HTTP vs. HTTPS

Marketing
• Top URLs: as-is count; by content type; by edge location; by edge location and content type
• Requests served: by edge location
• Revenue: by edge location
• Top games: by age; by income; by gender

Operations
• Error rates: by top X URLs; by edge location; by edge location and content type

Revenue
• Top games: by revenue; by edge location and revenue
• Top ads: that lead to a game purchase
Requirements for Gaming Company
Cost Analysis – Data transfer (by date/time; by edge location; by date/time within an edge location; by top X URLs; by HTTP vs. HTTPS) → sources: CloudFront logs, web server logs
Available Data Sources (Gaming)

Metric → Sources
Data transfer by date/time → CloudFront logs
Data transfer by edge location → CloudFront logs
Data transfer by date/time within an edge location → CloudFront logs
Data transfer by top X URLs → CloudFront logs, web server logs
Data transfer by HTTP vs. HTTPS → CloudFront logs
Top URLs → CloudFront logs, web server logs
Top URLs by content type → CloudFront logs
Top URLs by edge location → CloudFront logs
Top URLs by edge location and content type → CloudFront logs
Error rates by top X URLs → CloudFront logs, web server logs
Error rate by edge location → CloudFront logs
Error rate by edge location and content type → CloudFront logs
Requests served by edge location → CloudFront logs
Revenue by edge location → CloudFront logs, OrdersDB, app server logs
Top games segmented by age → CloudFront logs, user profile
Top games segmented by income → CloudFront logs, user profile
Top games segmented by gender → CloudFront logs, user profile
Top games by revenue → CloudFront logs, OrdersDB
Top games by edge location and revenue → CloudFront logs, OrdersDB
Top game revenue segmented by age → CloudFront logs, OrdersDB, user profile
Our Journey Today – Lean Metrics
1. What do they look like? Depends upon stage and type of startup
2. Which one should I focus on? Preferably one (bit.ly/BigLeanTable)
3. Where do I find them? They are all hidden in your logs (so don't throw away logs to create disk space!)
How to process logs on AWS
CloudFront Access Log Format

#Version: 1.0
#Fields: date time x-edge-location sc-bytes c-ip cs-method cs(Host) cs-uri-stem sc-status cs(Referer) cs(User-Agent) cs-uri-query
2012-05-25 22:01:30 AMS1 4448 94.212.249.78 GET d1234567890213.cloudfront.net /YT0KthT/F5SOWdDPqNqQF07tiTOXqJMpfD
dlb3LMwv3/jP3/CINm/yDSy0MsRcWJN/Simutrans.exe 200 http://AtRJw2kxg0EMW.com/kZetr/YCb6AM9N2xt2 Mozilla/5.0%20(compatible;%20M\
SIE%209.0;%20Windows%20NT%206.1;%20WOW64;%20Trident/5.0) uid=100&oid=108625181
2012-05-25 22:01:30 AMS1 4952 94.212.249.78 GET d1234567890213.cloudfront.net /66IG584/CPCxY0P44BGb5ZOd3qSUrauL05\
0LOvFwaMj/eH/caw/Blob Wars-Blob And Conquer.exe 200 http://AtRJw2kxg0EMW.com/kZetr/YCb6AM9N2xt2 Mozilla/5.0%20(compatible;%20M\
SIE%209.0;%20Windows%20NT%206.1;%20WOW64;%20Trident/5.0) uid=100&oid=108625184
2012-05-25 22:01:30 AMS1 4556 78.8.5.135 GET d1234567890213.cloudfront.net /SwlufjC/xEjH3BRbXMXwmFWqzKt7od6tlW\
R3e13LhmH/V3eF/lo6g/AstroMenace.exe 200 http://AtRJw2kxg0EMW.com/AC1vg/1727EWfb7fPt Opera/9.80%20(Windows%20NT%205.1;%20U;%20pl)%2\
0Presto/2.10.229%20Version/11.60 uid=100&oid=108625189
2012-05-25 22:01:30 AMS1 47172 78.8.5.135 GET d1234567890213.cloudfront.net /Di1cXoN/TskldkSHcgkvZXQEmv5vOVR25X\
5UTisFkRq/pQa/wCjUXZb/Z1HRuGlo/Kroz.exe 200 http://AtRJw2kxg0EMW.com/AC1vg/1727EWfb7fPt Opera/9.80%20(Windows%20NT%205.1;%20U;\
%20pl)%20Presto/2.10.229%20Version/11.60 uid=100&oid=108625206
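Actual CloudFront access logs are tab-delimited (the sample above wraps each record across several lines for display). A minimal Python sketch of parsing one record into named fields using the #Fields header; the truncated record passed in here is illustrative:

```python
FIELDS = ["date", "time", "x-edge-location", "sc-bytes", "c-ip", "cs-method",
          "cs(Host)", "cs-uri-stem", "sc-status", "cs(Referer)",
          "cs(User-Agent)", "cs-uri-query"]

def parse_cf_line(line):
    """Split one tab-separated CloudFront access-log record into a dict
    keyed by the field names from the #Fields header."""
    values = line.rstrip("\n").split("\t")
    return dict(zip(FIELDS, values))

record = parse_cf_line(
    "2012-05-25\t22:01:30\tAMS1\t4448\t94.212.249.78\tGET\t"
    "d1234567890213.cloudfront.net\t/Simutrans.exe\t200\t-\t-\tuid=100"
)
```

From here, filtering on `sc-status` or summing `sc-bytes` per edge location is a one-liner.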
Sample Your Data with R

> library(ggplot2)
> sample_data <- read.delim("SampleFiles/E123ABCDEF.2012-05-25-22.NEfbhLN3", header=F)
> sample_data <- sample_data[-1:-2,]  # drop the two #Version/#Fields header rows
> View(sample_data)
> m <- ggplot(sample_data, aes(x = factor(V9)))  # V9 = sc-status
> m + geom_histogram() + scale_y_log10() + xlab('Error Codes') + ylab('log(Frequency)')
Complete RStudio Interface
Model       vCPU   Mem (GiB)   SSD Storage (GB)
r3.large      2      15          1 x 32
r3.xlarge     4      30.5        1 x 80
r3.2xlarge    8      61          1 x 160
r3.4xlarge   16     122          1 x 320
r3.8xlarge   32     244          2 x 320
Our Journey Today – Lean Metrics
1. What do they look like? Depends upon stage and type of startup
2. Which one should I focus on? Preferably one (bit.ly/BigLeanTable)
3. Where do I find them? They are all hidden in your logs (so don't throw away logs to create disk space!)
4. How do I process these logs? Simple tools like awk/sed, SQL, R
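For a quick awk-style pass over a log, a few lines of Python go a long way. A minimal sketch that tallies CloudFront status codes (the 9th tab-separated field); the sample records below are made up:

```python
from collections import Counter

def status_counts(lines):
    """Count HTTP status codes (9th tab-separated CloudFront field)."""
    counts = Counter()
    for line in lines:
        if line.startswith("#"):       # skip the #Version / #Fields header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) > 8:
            counts[fields[8]] += 1
    return counts

log = [
    "#Version: 1.0",
    "2012-05-25\t22:01:30\tAMS1\t4448\t1.2.3.4\tGET\thost\t/a.exe\t200\t-\t-\t-",
    "2012-05-25\t22:01:31\tAMS1\t512\t1.2.3.4\tGET\thost\t/b.exe\t404\t-\t-\t-",
    "2012-05-25\t22:01:32\tAMS1\t4448\t1.2.3.4\tGET\thost\t/a.exe\t200\t-\t-\t-",
]
print(status_counts(log))  # Counter({'200': 2, '404': 1})
```

The same loop scales to gzipped files with `gzip.open`, which is how CloudFront delivers its logs.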
Two approaches to scale your log processing
1. DIY
2. Use prepackaged 3rd-party software

3rd-Party Tools
• Sumo Logic
• Loggly
• Snowplow Analytics
• Papertrail
• Logstash + Kibana + Elasticsearch
• Log.io
• Treasure Data
…and many more solutions in the market with varied levels of depth.
Our Journey Today – Lean Metrics
1. What do they look like? Depends upon stage and type of startup
2. Which one should I focus on? Preferably one (bit.ly/BigLeanTable)
3. Where do I find them? They are all hidden in your logs (so don't throw away logs to create disk space!)
4. How do I process these logs? Simple tools like awk/sed, SQL, R
5. What if I have too many logs? How do I scale processing? Get a 3rd-party tool or build it yourself
DIY Scalable Log Processing Platform

Data Analytics Platform: Log shipping and aggregation → Storage → Transformation → Analysis → Visualization
Collection of data sources → Aggregation and shipping tool → Data sink
• Data sources: web servers, application servers, connected devices, mobile phones, etc.
• Aggregation and shipping tool: a scalable method to collect and aggregate (Flume, Kafka, Kinesis, a queue)
• Data sink: a reliable and durable destination or destinations
1. Run your own log collector
Your application on Amazon EC2 ships logs to Amazon S3, DynamoDB, or any other data store.
2. Use a queue
Amazon Simple Queue Service (SQS) buffers log events on their way to Amazon S3, DynamoDB, or any other data store.
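A sketch of the queue approach: SQS messages are capped at 256 KB, so a shipper typically batches log lines into payloads under that limit before each send. The function and variable names here are illustrative, not a real pipeline:

```python
import json

# SQS limits a message body to 256 KB.
MAX_BYTES = 256 * 1024

def batch_lines(lines, max_bytes=MAX_BYTES):
    """Group log lines into JSON payloads that each fit in one SQS message."""
    batches, current, size = [], [], 0
    for line in lines:
        n = len(line.encode("utf-8")) + 1
        if current and size + n > max_bytes:
            batches.append(json.dumps(current))
            current, size = [], 0
        current.append(line)
        size += n
    if current:
        batches.append(json.dumps(current))
    return batches

# Sending each batch would then be a single boto call, e.g.:
# sqs.send_message(QueueUrl=queue_url, MessageBody=batch)
```

A consumer on the other side of the queue pulls batches and appends them to the durable store of choice.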
4. Use a tool like Flume, Fluentd, Kafka, Honu, etc.
Flume or Fluentd running on EC2 ships logs to Amazon S3, HDFS, or any other data store.
Introducing Amazon Kinesis – Managed Service for Real-Time Processing of Big Data

[Architecture diagram: data sources push records to the AWS endpoint of a stream made up of shards (Shard 1, Shard 2 … Shard N) spread across Availability Zones. Consumer applications read the stream – App.1 (aggregate & de-duplicate), App.2 (metric extraction), App.3 (sliding window analysis), App.4 (machine learning) – and write results to S3, DynamoDB, Redshift, and EMR.]
Amazon Kinesis: Key Developer Benefits
• Easy Administration – Managed service for real-time streaming data collection, processing, and analysis. Simply create a new stream, set the desired level of capacity, and let the service handle the rest.
• Real-time Performance – Perform continual processing on streaming big data. Processing latencies fall to a few seconds, compared with the minutes or hours associated with batch processing.
• High Throughput, Elastic – Seamlessly scale to match your data throughput rate and volume. You can easily scale up to gigabytes per second. The service will scale up or down based on your operational or business needs.
• S3, EMR, Storm, Redshift, & DynamoDB Integration – Reliably collect, process, and transform all of your data in real time and deliver it to the AWS data stores of your choice, with connectors for S3, Redshift, and DynamoDB.
• Build Real-time Applications – Client libraries enable developers to design and operate real-time streaming data processing applications.
• Low Cost – Cost-efficient for workloads of any scale. You can get started by provisioning a small stream and pay low hourly rates only for what you use.
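Kinesis routes each record to a shard by taking the MD5 hash of its partition key and mapping it onto the shards' 128-bit hash-key ranges. A simplified sketch of that routing (equal-sized ranges assumed; `shard_for` is an illustrative name, not a Kinesis API):

```python
import hashlib

def shard_for(partition_key, num_shards):
    """Kinesis-style routing: MD5 of the partition key, mapped onto
    num_shards equal hash-key ranges in the 128-bit key space."""
    h = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    range_size = 2 ** 128 // num_shards
    return min(h // range_size, num_shards - 1)

# Records with the same partition key always land on the same shard,
# which is what preserves per-key ordering.
assert shard_for("user-42", 4) == shard_for("user-42", 4)
```

This is why the choice of partition key matters: a skewed key distribution concentrates traffic on a few shards.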
Data Analytics Platform: Log shipping and aggregation → Storage → Transformation → Analysis → Visualization
Choice of storage systems (structure and volume)
[Chart: storage options plotted by data structure (low → high) against data size (small → large) – S3, EBS, RDS, DynamoDB, and other NoSQL stores each occupy a different region of the chart.]
Courtesy: http://techblog.netflix.com/2013/01/hadoop-platform-as-service-in-cloud.html
S3 as a “single source of truth”
Data Analytics Platform: Log shipping and aggregation → Storage → Transformation → Analysis → Visualization
Hadoop-based Analysis
[Diagram: log aggregation tools feed Amazon SQS and Amazon S3; Amazon EMR processes the data and writes results to DynamoDB or any SQL or NoSQL store.]
Your choice of tools on Hadoop/EMR
[Same pipeline: log aggregation tools → Amazon SQS / Amazon S3 → Amazon EMR → DynamoDB or any SQL or NoSQL store.]
Pig for Access Logs Analysis RAW_LOG = LOAD 's3://myoutputbucket/aggregate/' AS (ts:chararray, url:chararray…);
LOGS_BASE_F = FILTER RAW_LOG BY url MATCHES '^GET /__track.*$';
LOGS_BASE_F_W_PARAM = FOREACH LOGS_BASE_F GENERATE
url,
DATE_TIME(ts, 'dd/MMM/yyyy:HH:mm:ss Z') as dt,
SUBSTRING(DATE_TIME(ts, 'dd/MMM/yyyy:HH:mm:ss Z') ,0, 10 ) as day,
…
status,
REGEX_EXTRACT(url, '^GET /([^\\?]+)', 1) AS action: chararray,
REGEX_EXTRACT(url, 'idt=([^&]+)', 1) AS idt: chararray,
REGEX_EXTRACT(url, 'idc=([^&]+)', 1) AS idc: chararray;
I1 = FILTER LOGS_BASE_F_W_PARAM by action == 'clic' or action == 'display';
LOGS_SHORT = FOREACH I1 GENERATE uuid, action, dt, day, ida, idas, act, idp, idcmp ,idc;
G1 = GROUP LOGS_SHORT BY (uuid,idc);
store G1 into 's3://mybucket/sessions/';
The script mirrors a Unix pipeline: load and filter (cat/grep), parse (awk), store (>).
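The final GROUP BY (uuid, idc) step of the Pig script, sketched in Python for comparison; the records and field values below are made up:

```python
from collections import defaultdict

def group_sessions(records):
    """Mirror the Pig GROUP BY (uuid, idc): bucket parsed log records
    by user id and content id."""
    sessions = defaultdict(list)
    for rec in records:
        sessions[(rec["uuid"], rec["idc"])].append(rec)
    return sessions

records = [
    {"uuid": "u1", "idc": "c9", "action": "display"},
    {"uuid": "u1", "idc": "c9", "action": "clic"},
    {"uuid": "u2", "idc": "c9", "action": "display"},
]
groups = group_sessions(records)
# ("u1", "c9") holds two events, ("u2", "c9") one
```

On EMR, Pig runs this same grouping as a distributed MapReduce job rather than in memory.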
Data Analytics Platform: Log shipping and aggregation → Storage → Transformation → Analysis → Visualization
Hadoop is good for:
1. Ad hoc query analysis
2. Large unstructured data sets
3. Machine learning and advanced analytics
4. Schema-less data
SQL-based processing for unstructured data
[Diagram: log aggregation tools feed Amazon SQS and Amazon S3; Amazon EMR acts as a pre-processing framework in front of Amazon Redshift, a petabyte-scale columnar data warehouse; results can also land in DynamoDB or any SQL or NoSQL store.]
You might not need pre-processing (e.g. JSON, CSV)
[Diagram: the same pipeline with EMR removed – log aggregation tools feed Amazon SQS and Amazon S3, and the data is loaded directly into Amazon Redshift, the petabyte-scale columnar data warehouse.]
COPY into Amazon Redshift
create table cf_logs
( d date, t char(8), edge char(4), bytes int, cip varchar(15),
verb char(3), distro varchar(MAX), object varchar(MAX), status int,
Referer varchar(MAX), agent varchar(MAX), qs varchar(MAX) )
copy cf_logs from 's3://big-data/logs/E123ABCDEF/'
credentials 'aws_access_key_id=<key_id>;aws_secret_access_key=<secret_key>'
IGNOREHEADER 2
GZIP
DELIMITER '\t'
DATEFORMAT 'YYYY-MM-DD'
But Data Warehouses are for Enterprises?
Amazon Redshift:
• Relational data warehouse
• Massively parallel
• Petabyte scale
• Fully managed; zero admin
• Low cost point
• Open interface
Redshift is the data warehouse done the AWS way.
Your choice of BI Tools on the cloud
[Diagram: log aggregation tools → Amazon SQS / Amazon S3 → Amazon EMR (pre-processing framework) → Amazon Redshift → your BI tool.]
Choose Your Favorite Visualization Tool
• Tableau (Windows instance)
• R
• Jaspersoft
• QlikView
• MicroStrategy
• SiSense
• …
Our Journey Today – Lean Metrics
1. What do they look like? Depends upon stage and type of startup
2. Which one should I focus on? Preferably one (bit.ly/BigLeanTable)
3. Where do I find them? They are all hidden in your logs (so don't throw away logs to create disk space!)
4. How do I process these logs? Simple tools like awk/sed, SQL, R
5. What if I have too many logs? How do I scale processing? Get a 3rd-party tool or build it yourself
6. How do I build a log analytics platform myself?
   1. Ship and aggregate your logs using Flume, Kinesis, or Fluentd and store them in S3
   2. Process them using Hadoop (EMR) or Redshift
   3. Run your own visualization tool on it
Standing on the Shoulders of Giants
“With Amazon Redshift and Tableau, anyone in the company can set up any queries they like—from how users are reacting to a feature, to growth by demographic or geography, to the impact sales efforts have had in different areas. It’s very flexible.”
“Using Amazon Elastic MapReduce, Yelp was able to save $55,000 in upfront hardware costs and get up and running in a matter of days, not months. However, most important to Yelp is the opportunity cost. ‘With AWS, our developers can now do things they couldn’t before,’ says Marin. ‘Our systems team can focus their energies on other challenges.’”
“Initially we used Amazon Redshift as a data mart for the data science team. Now, it is increasingly used for production data mart tasks such as providing our marketing department with fresh data to make informed decisions and automatically optimize our advertising,” said Cooper McGuire, Managing Director at Zalora. “Additionally, Amazon Redshift is simple to use and reliable. With one click, we can rapidly scale up or down in real time in alignment with business requirements. We have been able to eliminate significant maintenance costs and overhead associated with traditional solutions and external consultants.”
Finally, a Small Warning
Abraham Wald (1902–1950) – survivorship bias: the data you can see is not all the data there is, so be careful what you conclude from it.
In Summary
• Growth hacking = understanding your business to optimize it
• You can’t optimize what you don’t measure
• Logs are your goldmine – they contain everything you want to measure
• S3 is a good place to store all your logs because of its durability and cost
• Build an analytics platform that enables developers and analysts to gain interesting insights with the choice of tool they want
• Most important – innovation and growth will come from areas you least thought it could!
Thank You! [email protected] @abysinha