Growing on Amazon Web Services Abhishek Sinha Amazon Web Services @abysinha
Our Journey Today
Growth Hacking
Growth hacking is a marketing technique developed by technology startups which uses creativity, analytical thinking, and social metrics to sell products and gain exposure.
“At Airbnb, we look into all possible ways to improve our product and user experience. Often times this involves lots of analytics behind the scenes.”
Learn and Iterate: Hypothesis → MVP → Learn
“Hosts with professional photography will get more business. And hosts will sign up for professional photography as a service.”
Build an MVP – 20 Photographers
Saw the proverbial “Hockey Stick”
Airbnb then scaled the idea:
• Professional photography services
• Increased the requirements for photo quality
• Watermarked photos for authenticity
• Key metric tracked – “Shoots per month”
• April 2012 – 5,000 shoots per month
• Growth can sometimes come from unexpected areas
Our Journey Today
BUILD-MEASURE-LEARN: The fundamental activity of a startup is to turn ideas into products, measure how customers respond, and then learn whether to pivot or persevere. All successful startup processes should be geared to accelerate that feedback loop.
In a startup, the purpose of analytics is to iterate to product/market fit before the money runs out. – Lean Analytics by Alistair Croll and Ben Yoskovitz
Our Journey Today
Lean Metrics
What do these metrics look like? Depends upon what stage your startup is at.
And what is your favorite analytics framework?
Dave McClure – Pirate Metrics
Source: http://www.slideshare.net/dmc500hats/startup-metrics-for-pirates-long-version
Lean Analytics Stages – Credits: Alistair Croll and Ben Yoskovitz
One Metric that Matters: f(stage, business) = metric that matters
bit.ly/BigLeanTable – Credits: Alistair Croll and Ben Yoskovitz
Example – E-commerce: Stage → Metrics
• Empathy: How do buyers become aware of the need? How do they try to find the solution? What pain do they encounter as a result? What are their demographics and tech profiles?
• Stickiness: Conversion, shopping cart size. Acquisition: cost of finding new buyers. Loyalty: percent of buyers who return in 90 days.
• Virality: Acquisition mode: customer acquisition cost, volume of sharing. Loyalty mode: ability to reactivate, volume of buyers who return.
• Revenue: Transaction value, revenue per customer, ratio of acquisition cost to LTV, direct sales metrics.
• Scale: Affiliates, channels, white-label product ratings, reviews, support costs, returns (RMA) and refunds, channel conflict.
Source: Bit.ly/BigLeanTable
Our Journey Today – Lean Metrics
1. What do they look like? Depends upon stage and type of startup
2. Which one should I focus on? Preferably one (bit.ly/BigLeanTable)
Where do I get these metrics from?
Logs – Uses and Types
Used for:
• Operational metrics
• Application/business-related metrics
Types:
• Operating system logs
• Web server logs
• Database logs
• CDN logs
• Application logs
User Engagement in Online Video
[Source: Conviva Viewer Experience Report – 2013]
Requirements for Gaming Company

Cost Analysis
• Data transfer: by date/time; by edge location; by date/time within an edge location; by top X URLs; by HTTP vs. HTTPS

Marketing
• Top URLs: as-is count; by content type; by edge location; by edge location and content type
• Requests served: by edge location
• Revenue: by edge location
• Top games: by age; by income; by gender

Operations
• Error rates: by top X URLs; by edge location; by edge location and content type

Revenue
• Top games: by revenue; by edge location and revenue
• Top ads: that lead to a game purchase
Requirements for Gaming Company
Cost Analysis – Data transfer (by date/time; by edge location; by date/time within an edge location; by top X URLs; by HTTP vs. HTTPS) → sources: CloudFront logs, web server logs
Available Data Sources (Gaming)

Metric → Sources
Data transfer by date/time → CloudFront logs
Data transfer by edge location → CloudFront logs
Data transfer by date/time within an edge location → CloudFront logs
Data transfer by top X URLs → CloudFront logs, web server logs
Data transfer by HTTP vs. HTTPS → CloudFront logs
Top URLs → CloudFront logs, web server logs
Top URLs by content type → CloudFront logs
Top URLs by edge location → CloudFront logs
Top URLs by edge location and content type → CloudFront logs
Error rates by top X URLs → CloudFront logs, web server logs
Error rate by edge location → CloudFront logs
Error rate by edge location and content type → CloudFront logs
Requests served by edge location → CloudFront logs
Revenue by edge location → CloudFront logs, OrdersDB, app server logs
Top games segmented by age → CloudFront logs, user profile
Top games segmented by income → CloudFront logs, user profile
Top games segmented by gender → CloudFront logs, user profile
Top games by revenue → CloudFront logs, OrdersDB
Top games by edge location and revenue → CloudFront logs, OrdersDB
Top game revenue segmented by age → CloudFront logs, OrdersDB, user profile
Our Journey Today – Lean Metrics
1. What do they look like? Depends upon stage and type of startup
2. Which one should I focus on? Preferably one (bit.ly/BigLeanTable)
3. Where do I find them? They are all hidden in your logs (so don't throw away logs to create disk space!)
How to process logs on AWS
CloudFront Access Log Format

#Version: 1.0
#Fields: date time x-edge-location sc-bytes c-ip cs-method cs(Host) cs-uri-stem sc-status cs(Referer) cs(User-Agent) cs-uri-query
2012-05-25 22:01:30 AMS1 4448 94.212.249.78 GET d1234567890213.cloudfront.net /YT0KthT/F5SOWdDPqNqQF07tiTOXqJMpfD
dlb3LMwv3/jP3/CINm/yDSy0MsRcWJN/Simutrans.exe 200 http://AtRJw2kxg0EMW.com/kZetr/YCb6AM9N2xt2 Mozilla/5.0%20(compatible;%20M\
SIE%209.0;%20Windows%20NT%206.1;%20WOW64;%20Trident/5.0) uid=100&oid=108625181
2012-05-25 22:01:30 AMS1 4952 94.212.249.78 GET d1234567890213.cloudfront.net /66IG584/CPCxY0P44BGb5ZOd3qSUrauL05\
0LOvFwaMj/eH/caw/Blob Wars-Blob And Conquer.exe 200 http://AtRJw2kxg0EMW.com/kZetr/YCb6AM9N2xt2 Mozilla/5.0%20(compatible;%20M\
SIE%209.0;%20Windows%20NT%206.1;%20WOW64;%20Trident/5.0) uid=100&oid=108625184
2012-05-25 22:01:30 AMS1 4556 78.8.5.135 GET d1234567890213.cloudfront.net /SwlufjC/xEjH3BRbXMXwmFWqzKt7od6tlW\
R3e13LhmH/V3eF/lo6g/AstroMenace.exe 200 http://AtRJw2kxg0EMW.com/AC1vg/1727EWfb7fPt Opera/9.80%20(Windows%20NT%205.1;%20U;%20pl)%2\
0Presto/2.10.229%20Version/11.60 uid=100&oid=108625189
2012-05-25 22:01:30 AMS1 47172 78.8.5.135 GET d1234567890213.cloudfront.net /Di1cXoN/TskldkSHcgkvZXQEmv5vOVR25X\
5UTisFkRq/pQa/wCjUXZb/Z1HRuGlo/Kroz.exe 200 http://AtRJw2kxg0EMW.com/AC1vg/1727EWfb7fPt Opera/9.80%20(Windows%20NT%205.1;%20U;\
%20pl)%20Presto/2.10.229%20Version/11.60 uid=100&oid=108625206
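Actual CloudFront access logs are tab-delimited (the sample above wraps each record across several lines for display). A minimal Python sketch of parsing one record into named fields using the #Fields header; the truncated record passed in here is illustrative:

```python
FIELDS = ["date", "time", "x-edge-location", "sc-bytes", "c-ip", "cs-method",
          "cs(Host)", "cs-uri-stem", "sc-status", "cs(Referer)",
          "cs(User-Agent)", "cs-uri-query"]

def parse_cf_line(line):
    """Split one tab-separated CloudFront access-log record into a dict
    keyed by the field names from the #Fields header."""
    values = line.rstrip("\n").split("\t")
    return dict(zip(FIELDS, values))

record = parse_cf_line(
    "2012-05-25\t22:01:30\tAMS1\t4448\t94.212.249.78\tGET\t"
    "d1234567890213.cloudfront.net\t/Simutrans.exe\t200\t-\t-\tuid=100"
)
```

From here, filtering on `sc-status` or summing `sc-bytes` per edge location is a one-liner.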
Sample Your Data with R

> library(ggplot2)
> sample_data <- read.delim("SampleFiles/E123ABCDEF.2012-05-25-22.NEfbhLN3", header=F)
> sample_data <- sample_data[-1:-2,]  # drop the two #Version/#Fields header rows
> View(sample_data)
> m <- ggplot(sample_data, aes(x = factor(V9)))  # V9 = sc-status
> m + geom_histogram() + scale_y_log10() + xlab('Error Codes') + ylab('log(Frequency)')
Complete RStudio Interface
Model       vCPU   Mem (GiB)   SSD Storage (GB)
r3.large      2      15          1 x 32
r3.xlarge     4      30.5        1 x 80
r3.2xlarge    8      61          1 x 160
r3.4xlarge   16     122          1 x 320
r3.8xlarge   32     244          2 x 320
Our Journey Today – Lean Metrics
1. What do they look like? Depends upon stage and type of startup
2. Which one should I focus on? Preferably one (bit.ly/BigLeanTable)
3. Where do I find them? They are all hidden in your logs (so don't throw away logs to create disk space!)
4. How do I process these logs? Simple tools like awk/sed, SQL, R
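For a quick awk-style pass over a log, a few lines of Python go a long way. A minimal sketch that tallies CloudFront status codes (the 9th tab-separated field); the sample records below are made up:

```python
from collections import Counter

def status_counts(lines):
    """Count HTTP status codes (9th tab-separated CloudFront field)."""
    counts = Counter()
    for line in lines:
        if line.startswith("#"):       # skip the #Version / #Fields header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) > 8:
            counts[fields[8]] += 1
    return counts

log = [
    "#Version: 1.0",
    "2012-05-25\t22:01:30\tAMS1\t4448\t1.2.3.4\tGET\thost\t/a.exe\t200\t-\t-\t-",
    "2012-05-25\t22:01:31\tAMS1\t512\t1.2.3.4\tGET\thost\t/b.exe\t404\t-\t-\t-",
    "2012-05-25\t22:01:32\tAMS1\t4448\t1.2.3.4\tGET\thost\t/a.exe\t200\t-\t-\t-",
]
print(status_counts(log))  # Counter({'200': 2, '404': 1})
```

The same loop scales to gzipped files with `gzip.open`, which is how CloudFront delivers its logs.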
Two approaches to scale your log processing
1. DIY
2. Use prepackaged 3rd-party software

3rd-Party Tools
• Sumo Logic
• Loggly
• Snowplow Analytics
• Papertrail
• Logstash + Kibana + Elasticsearch
• Log.io
• Treasure Data
…and many more solutions in the market with varied levels of depth.
Our Journey Today – Lean Metrics
1. What do they look like? Depends upon stage and type of startup
2. Which one should I focus on? Preferably one (bit.ly/BigLeanTable)
3. Where do I find them? They are all hidden in your logs (so don't throw away logs to create disk space!)
4. How do I process these logs? Simple tools like awk/sed, SQL, R
5. What if I have too many logs? How do I scale processing? Get a 3rd-party tool or build it yourself
DIY Scalable Log Processing Platform

Data Analytics Platform: Log shipping and aggregation → Storage → Transformation → Analysis → Visualization
Collection of data sources → Aggregation and shipping tool → Data sink
• Data sources: web servers, application servers, connected devices, mobile phones, etc.
• Aggregation and shipping tool: a scalable method to collect and aggregate (Flume, Kafka, Kinesis, a queue)
• Data sink: a reliable and durable destination or destinations
1. Run your own log collector
Your application on Amazon EC2 ships logs to Amazon S3, DynamoDB, or any other data store.
2. Use a queue
Amazon Simple Queue Service (SQS) buffers log events on their way to Amazon S3, DynamoDB, or any other data store.
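A sketch of the queue approach: SQS messages are capped at 256 KB, so a shipper typically batches log lines into payloads under that limit before each send. The function and variable names here are illustrative, not a real pipeline:

```python
import json

# SQS limits a message body to 256 KB.
MAX_BYTES = 256 * 1024

def batch_lines(lines, max_bytes=MAX_BYTES):
    """Group log lines into JSON payloads that each fit in one SQS message."""
    batches, current, size = [], [], 0
    for line in lines:
        n = len(line.encode("utf-8")) + 1
        if current and size + n > max_bytes:
            batches.append(json.dumps(current))
            current, size = [], 0
        current.append(line)
        size += n
    if current:
        batches.append(json.dumps(current))
    return batches

# Sending each batch would then be a single boto call, e.g.:
# sqs.send_message(QueueUrl=queue_url, MessageBody=batch)
```

A consumer on the other side of the queue pulls batches and appends them to the durable store of choice.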
4. Use a tool like Flume, Fluentd, Kafka, Honu, etc.
Flume or Fluentd running on EC2 ships logs to Amazon S3, HDFS, or any other data store.
Introducing Amazon Kinesis – Managed Service for Real-Time Processing of Big Data

[Architecture diagram: data sources push records to the AWS endpoint of a stream made up of shards (Shard 1, Shard 2 … Shard N) spread across Availability Zones. Consumer applications read the stream – App.1 (aggregate & de-duplicate), App.2 (metric extraction), App.3 (sliding window analysis), App.4 (machine learning) – and write results to S3, DynamoDB, Redshift, and EMR.]
Amazon Kinesis: Key Developer Benefits
• Easy Administration – Managed service for real-time streaming data collection, processing, and analysis. Simply create a new stream, set the desired level of capacity, and let the service handle the rest.
• Real-time Performance – Perform continual processing on streaming big data. Processing latencies fall to a few seconds, compared with the minutes or hours associated with batch processing.
• High Throughput, Elastic – Seamlessly scale to match your data throughput rate and volume. You can easily scale up to gigabytes per second. The service will scale up or down based on your operational or business needs.
• S3, EMR, Storm, Redshift, & DynamoDB Integration – Reliably collect, process, and transform all of your data in real time and deliver it to the AWS data stores of your choice, with connectors for S3, Redshift, and DynamoDB.
• Build Real-time Applications – Client libraries enable developers to design and operate real-time streaming data processing applications.
• Low Cost – Cost-efficient for workloads of any scale. You can get started by provisioning a small stream and pay low hourly rates only for what you use.
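Kinesis routes each record to a shard by taking the MD5 hash of its partition key and mapping it onto the shards' 128-bit hash-key ranges. A simplified sketch of that routing (equal-sized ranges assumed; `shard_for` is an illustrative name, not a Kinesis API):

```python
import hashlib

def shard_for(partition_key, num_shards):
    """Kinesis-style routing: MD5 of the partition key, mapped onto
    num_shards equal hash-key ranges in the 128-bit key space."""
    h = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    range_size = 2 ** 128 // num_shards
    return min(h // range_size, num_shards - 1)

# Records with the same partition key always land on the same shard,
# which is what preserves per-key ordering.
assert shard_for("user-42", 4) == shard_for("user-42", 4)
```

This is why the choice of partition key matters: a skewed key distribution concentrates traffic on a few shards.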
Data Analytics Platform: Log shipping and aggregation → Storage → Transformation → Analysis → Visualization
Choice of storage systems (structure and volume)
[Chart: storage options plotted by data structure (low → high) against data size (small → large) – S3, EBS, RDS, DynamoDB, and other NoSQL stores each occupy a different region of the chart.]
Courtesy: http://techblog.netflix.com/2013/01/hadoop-platform-as-service-in-cloud.html
S3 as a “single source of truth”
Data Analytics Platform: Log shipping and aggregation → Storage → Transformation → Analysis → Visualization
Hadoop-based Analysis
[Diagram: log aggregation tools feed Amazon SQS and Amazon S3; Amazon EMR processes the data and writes results to DynamoDB or any SQL or NoSQL store.]
Your choice of tools on Hadoop/EMR
[Same pipeline: log aggregation tools → Amazon SQS / Amazon S3 → Amazon EMR → DynamoDB or any SQL or NoSQL store.]
Pig for Access Logs Analysis RAW_LOG = LOAD 's3://myoutputbucket/aggregate/' AS (ts:chararray, url:chararray…);
LOGS_BASE_F = FILTER RAW_LOG BY url MATCHES '^GET /__track.*$';
LOGS_BASE_F_W_PARAM = FOREACH LOGS_BASE_F GENERATE
url,
DATE_TIME(ts, 'dd/MMM/yyyy:HH:mm:ss Z') as dt,
SUBSTRING(DATE_TIME(ts, 'dd/MMM/yyyy:HH:mm:ss Z') ,0, 10 ) as day,
…
status,
REGEX_EXTRACT(url, '^GET /([^\\?]+)', 1) AS action: chararray,
REGEX_EXTRACT(url, 'idt=([^&]+)', 1) AS idt: chararray,
REGEX_EXTRACT(url, 'idc=([^&]+)', 1) AS idc: chararray;
I1 = FILTER LOGS_BASE_F_W_PARAM by action == 'clic' or action == 'display';
LOGS_SHORT = FOREACH I1 GENERATE uuid, action, dt, day, ida, idas, act, idp, idcmp ,idc;
G1 = GROUP LOGS_SHORT BY (uuid,idc);
store G1 into 's3://mybucket/sessions/';
The script mirrors a Unix pipeline: load and filter (cat/grep), parse (awk), store (>).
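The final GROUP BY (uuid, idc) step of the Pig script, sketched in Python for comparison; the records and field values below are made up:

```python
from collections import defaultdict

def group_sessions(records):
    """Mirror the Pig GROUP BY (uuid, idc): bucket parsed log records
    by user id and content id."""
    sessions = defaultdict(list)
    for rec in records:
        sessions[(rec["uuid"], rec["idc"])].append(rec)
    return sessions

records = [
    {"uuid": "u1", "idc": "c9", "action": "display"},
    {"uuid": "u1", "idc": "c9", "action": "clic"},
    {"uuid": "u2", "idc": "c9", "action": "display"},
]
groups = group_sessions(records)
# ("u1", "c9") holds two events, ("u2", "c9") one
```

On EMR, Pig runs this same grouping as a distributed MapReduce job rather than in memory.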
Data Analytics Platform: Log shipping and aggregation → Storage → Transformation → Analysis → Visualization
Hadoop is good for:
1. Ad hoc query analysis
2. Large unstructured data sets
3. Machine learning and advanced analytics
4. Schema-less data
SQL-based processing for unstructured data
[Diagram: log aggregation tools feed Amazon SQS and Amazon S3; Amazon EMR acts as a pre-processing framework in front of Amazon Redshift, a petabyte-scale columnar data warehouse; results can also land in DynamoDB or any SQL or NoSQL store.]
You might not need pre-processing (e.g. JSON, CSV)
[Diagram: the same pipeline with EMR removed – log aggregation tools feed Amazon SQS and Amazon S3, and the data is loaded directly into Amazon Redshift, the petabyte-scale columnar data warehouse.]
COPY into Amazon Redshift
create table cf_logs
( d date, t char(8), edge char(4), bytes int, cip varchar(15),
verb char(3), distro varchar(MAX), object varchar(MAX), status int,
Referer varchar(MAX), agent varchar(MAX), qs varchar(MAX) )
copy cf_logs from 's3://big-data/logs/E123ABCDEF/'
credentials 'aws_access_key_id=<key_id>;aws_secret_access_key=<secret_key>'
IGNOREHEADER 2
GZIP
DELIMITER '\t'
DATEFORMAT 'YYYY-MM-DD'
But Data Warehouses are for Enterprises?
Amazon Redshift:
• Relational data warehouse
• Massively parallel
• Petabyte scale
• Fully managed; zero admin
• Low cost point
• Open interface
Redshift is the data warehouse done the AWS way.
Your choice of BI Tools on the cloud
[Diagram: log aggregation tools → Amazon SQS / Amazon S3 → Amazon EMR (pre-processing framework) → Amazon Redshift → your BI tool.]
Choose Your Favorite Visualization Tool
• Tableau (Windows instance)
• R
• Jaspersoft
• QlikView
• MicroStrategy
• SiSense
• …
Our Journey Today – Lean Metrics
1. What do they look like? Depends upon stage and type of startup
2. Which one should I focus on? Preferably one (bit.ly/BigLeanTable)
3. Where do I find them? They are all hidden in your logs (so don't throw away logs to create disk space!)
4. How do I process these logs? Simple tools like awk/sed, SQL, R
5. What if I have too many logs? How do I scale processing? Get a 3rd-party tool or build it yourself
6. How do I build a log analytics platform myself?
   1. Ship and aggregate your logs using Flume, Kinesis, or Fluentd and store them in S3
   2. Process them using Hadoop (EMR) or Redshift
   3. Run your own visualization tool on it
Standing on the Shoulders of Giants
“With Amazon Redshift and Tableau, anyone in the company can set up any queries they like—from how users are reacting to a feature, to growth by demographic or geography, to the impact sales efforts have had in different areas. It’s very flexible.”
“Using Amazon Elastic MapReduce, Yelp was able to save $55,000 in upfront hardware costs and get up and running in a matter of days, not months. However, most important to Yelp is the opportunity cost. ‘With AWS, our developers can now do things they couldn’t before,’ says Marin. ‘Our systems team can focus their energies on other challenges.’”
“Initially we used Amazon Redshift as a data mart for the data science team. Now, it is increasingly used for production data mart tasks such as providing our marketing department with fresh data to make informed decisions and automatically optimize our advertising,” said Cooper McGuire, Managing Director at Zalora. “Additionally, Amazon Redshift is simple to use and reliable. With one click, we can rapidly scale up or down in real time in alignment with business requirements. We have been able to eliminate significant maintenance costs and overhead associated with traditional solutions and external consultants.”
Finally, a Small Warning
Abraham Wald (1902–1950) – survivorship bias: the data you can see is not all the data there is, so be careful what you conclude from it.
In Summary
• Growth hacking = understanding your business to optimize it
• You can’t optimize what you don’t measure
• Logs are your goldmine – they contain everything you want to measure
• S3 is a good place to store all your logs because of its durability and cost
• Build an analytics platform that enables developers and analysts to gain interesting insights with the choice of tool they want
• Most important – innovation and growth will come from areas you least thought it could!
Thank You! [email protected] @abysinha