

Copyright (c) 2018 by Amazon.com, Inc. or its affiliates.

Machine Learning for Advertising is licensed under the terms of the Amazon Software License available at https://aws.amazon.com/asl/

Machine Learning for Advertising AWS Implementation Guide

Vijay Sathish

January 2018


Contents

Overview
Cost
Architecture Overview
Implementation Considerations
Jupyter Notebook
Using Your Own Data
EMR Web Interfaces
AWS CloudFormation Template
Automated Deployment
What We’ll Cover
Step 1. Launch the Stack
Step 2: Run the Notebook
Security
Additional Resources
Appendix: Collection of Anonymous Data
Send Us Feedback
Document Revisions

About This Guide

This implementation guide discusses architectural considerations and configuration steps for deploying the Machine Learning for Advertising solution on the Amazon Web Services (AWS) Cloud. It includes links to an AWS CloudFormation template that launches, configures, and runs the AWS services required to deploy this solution using AWS best practices for security and availability.

The guide is intended for data scientists, chief data officers, and data engineers who have practical experience architecting on the AWS Cloud.


Overview

Machine learning (ML) helps Amazon Web Services (AWS) customers use historical data to predict future outcomes, which can lead to better business decisions. Machine learning techniques are core to the digital advertising industry. Advertisers can use ML algorithms to construct and refine mathematical models from their business data, and then use these models to help optimize allocation of advertising budgets across multiple channels, improve user targeting and engagement, maximize advertising revenue for publishers, and combat fraudulent web traffic.

AWS offers several machine learning services and tools tailored for a variety of use cases and levels of expertise. However, it can be a challenge to understand the mechanics of model training and tuning, identify relevant data features, and design a workflow that can perform complex extraction, transformation, and loading (ETL) activities and also scale to accommodate large datasets.

To help customers more easily implement a machine learning workflow for advertising use cases, AWS offers the Machine Learning for Advertising solution. This solution uses AWS CloudFormation to deploy a scalable, customizable machine learning architecture that leverages three components: Apache Spark ML, an open source set of standardized APIs for machine learning algorithms; Amazon EMR, a service that distributes and processes data across dynamically scalable Amazon Elastic Compute Cloud (Amazon EC2) instances; and Jupyter Notebook, an open source web application for creating and sharing live code, equations, visualizations, and narrative text.

The solution provides a framework for an end-to-end machine learning process on the AWS Cloud. It includes pre-trained models and a sample advertising dataset to demonstrate how to use machine learning algorithms to train and test models for predictive analysis in advertising. Customers can use the included models as a starting point to develop their own custom machine learning models, and customize the included notebooks for their own use case.

Cost

You are responsible for the cost of the AWS services used while running this reference deployment. As of the date of publication, the cost for running this solution with default settings in the US East (N. Virginia) Region is approximately $0.84 per hour. This includes charges for Amazon EMR, Amazon S3, and the Application Load Balancer. This estimate does not include variable charges for the NAT Gateway, or data processing and data transfer costs. Prices are subject to change. For full details, see the pricing webpage for each AWS service you will be using in this solution.
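As a rough illustration of the hourly figure above, the following sketch estimates the fixed charge for a session of a given length. The $0.84 rate comes from this guide; the function name and the example durations are assumptions chosen for illustration, and the result excludes the variable NAT Gateway, data processing, and data transfer charges.

```python
from decimal import Decimal

# Approximate hourly rate quoted in this guide for default settings in
# US East (N. Virginia) as of January 2018. Check current pricing before use.
HOURLY_RATE_USD = Decimal("0.84")


def estimate_fixed_cost(hours):
    """Return the approximate fixed cost in USD for running the stack.

    Hypothetical helper: it excludes the variable NAT Gateway, data
    processing, and data transfer charges, as the guide's estimate does.
    """
    if hours < 0:
        raise ValueError("hours must be non-negative")
    return (HOURLY_RATE_USD * Decimal(str(hours))).quantize(Decimal("0.01"))


if __name__ == "__main__":
    # Example durations (assumptions): one 8-hour session, and 30 full days.
    print(estimate_fixed_cost(8))        # 6.72
    print(estimate_fixed_cost(24 * 30))  # 604.80
```

Because the charge accrues for as long as the stack exists, deleting the stack when you finish experimenting is the simplest way to stop it.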

Architecture Overview

Deploying this solution with the default parameters builds the following environment in the AWS Cloud.

Figure 1: Machine Learning for Advertising architecture on AWS

The AWS CloudFormation template deploys an Amazon Virtual Private Cloud (Amazon VPC) network topology with one public and one private subnet, and a VPC endpoint to Amazon Simple Storage Service (Amazon S3). The private subnet hosts an Amazon EMR cluster with Apache Spark and custom Jupyter notebooks running on the master node. The public subnet hosts an Application Load Balancer (ALB) that inspects and forwards incoming user requests to the Jupyter notebooks. The template also creates an Amazon S3 bucket that includes a synthetic dataset in Google DoubleClick Campaign Manager (DCM) format. Additionally, the solution’s Jupyter notebooks include instructions that guide the user through the process of constructing additional data features for future models.

When the machine learning workflow is triggered in the Jupyter web interface, the solution ingests data from the S3 bucket into the Amazon EMR cluster and runs the Jupyter notebooks on the dataset. This involves a sequence of code that preprocesses the data, extracts features, and divides the data into training and testing sets. Machine learning algorithms process the training dataset to develop a model to identify and predict the conversion rates of certain campaigns. The solution then tests the results of the model’s predictions against the testing dataset, identifies false positives and negatives, and retrains to fine-tune the model.
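The split-and-evaluate step described above can be sketched in plain Python. This is not the solution's notebook code, which runs on Apache Spark ML; the toy records, the threshold-based stand-in "model", and the function names are assumptions made only to show how a dataset is divided and how false positives and false negatives are counted.

```python
import random


def split_train_test(rows, test_fraction=0.25, seed=42):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def confusion_counts(labels, predictions):
    """Count true/false positives and negatives for binary labels (0 or 1)."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for actual, predicted in zip(labels, predictions):
        if predicted == 1:
            counts["tp" if actual == 1 else "fp"] += 1
        else:
            counts["fn" if actual == 1 else "tn"] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical records: (ad impressions seen by a user, converted 0/1).
    data = [(1, 0), (2, 0), (3, 0), (4, 1), (5, 0), (6, 1), (7, 1), (8, 1)]
    train, test = split_train_test(data)
    # Stand-in "model": predict a conversion when impressions >= 5.
    predictions = [1 if impressions >= 5 else 0 for impressions, _ in test]
    labels = [converted for _, converted in test]
    print(len(train), len(test))
    print(confusion_counts(labels, predictions))
```

A high false-positive or false-negative count on the test set is the signal that would prompt another round of feature changes and retraining.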


Implementation Considerations

Jupyter Notebook

The Machine Learning for Advertising solution uses Jupyter Notebook, an open source web application that allows you to create and share live code, equations, visualizations, and narrative text.

The custom Jupyter notebook develops a machine learning (ML) model that can identify and predict outcomes such as user conversion and which advertising activities led to purchases. The notebook uses a default set of predefined, advertising-related features to train and test the model against the synthetic Google DoubleClick Campaign Manager (DCM) dataset. Users can modify the notebook code and features to create different ML models and related visualizations for their specific use case.

Note: Currently, Jupyter Notebook is supported only on the latest stable versions of browsers. For information on browser compatibility, see Jupyter Browser Compatibility.

Using Your Own Data

This solution includes a synthetic dataset for model-training purposes, and is intended for novice machine learning (ML) data scientists. If you choose to run the Jupyter notebooks against your own datasets, we recommend following the AWS best practices for uploading data into Amazon S3 to ensure that your data is uploaded quickly and securely. Additionally, we recommend that you enable SSL for your Application Load Balancer (ALB) in the public subnet. You can use the AWS CloudFormation template to enable the ALB certificate at any time.

To use your own datasets in Amazon S3, you must point to your data in the Jupyter notebook. Open the Adtech_Ml_Workflow.ipynb file, and replace the synthetic data S3 location in data frames 1 and 2 with the location of your data in S3. For example:

    impressions = "s3://ai-ml-workshop/impressions_full"
    activities = "s3://ai-ml-workshop/activity_full"
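Before editing the notebook, it can help to confirm that each replacement location is a well-formed S3 URI. The sketch below uses only the Python standard library; the helper name is hypothetical, and the check is purely syntactic (it does not contact Amazon S3 or confirm that the data exists).

```python
from urllib.parse import urlparse


def parse_s3_uri(uri):
    """Split an s3://bucket/key URI into (bucket, key).

    Raises ValueError if the scheme is not s3 or the bucket is missing.
    """
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


if __name__ == "__main__":
    # The two example locations shown in this guide.
    impressions = "s3://ai-ml-workshop/impressions_full"
    activities = "s3://ai-ml-workshop/activity_full"
    for location in (impressions, activities):
        print(parse_s3_uri(location))
```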

Note: Amazon EMR charges by the amount of data that is scanned, processed, and analyzed. If you have large datasets, you can reduce costs and improve performance by partitioning, compressing, or converting your data into a columnar format such as Apache Parquet.


EMR Web Interfaces

When you launch an Amazon EMR cluster in a public subnet, the master node of the cluster has a public DNS name, which allows you to create an SSH tunnel and securely access the Amazon EMR web interfaces. Because this solution deploys the Amazon EMR cluster in a private subnet, the master node will not have a public DNS name for secure SSH access. To allow you to access the Amazon EMR web interfaces, this solution deploys a bastion host with a public IP address. You must configure dynamic port forwarding to connect to the bastion host. Use the following command for secured port forwarding:

    ssh -i ~/mykeypair.pem -ND 8157 ec2-user@<BASTION_HOST_DNS>

For more information, see View Web Interfaces Hosted on Amazon EMR Clusters in the Amazon EMR Management Guide.
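If you script the tunnel, the command above can be assembled as an argument list. In this sketch the function name is hypothetical, and the key path, port, and host are the placeholders used in this guide; -N (no remote command) and -D (dynamic port forwarding) are standard OpenSSH options.

```python
import os
import shlex


def build_tunnel_command(key_path, bastion_dns, local_port=8157, user="ec2-user"):
    """Return the ssh argument list for dynamic port forwarding via the bastion host."""
    if not 1 <= local_port <= 65535:
        raise ValueError("local_port must be between 1 and 65535")
    return [
        "ssh",
        "-i", key_path,         # private key matching the Key Pair Name parameter
        "-N",                   # do not run a remote command
        "-D", str(local_port),  # open a local SOCKS proxy on this port
        f"{user}@{bastion_dns}",
    ]


if __name__ == "__main__":
    # <BASTION_HOST_DNS> is a placeholder; substitute your bastion host's DNS name.
    key_path = os.path.expanduser("~/mykeypair.pem")
    command = build_tunnel_command(key_path, "<BASTION_HOST_DNS>")
    print(shlex.join(command))
    # To open the tunnel, pass the list to subprocess.run(command)
    # after replacing the placeholder.
```

Building a list rather than a single string avoids shell quoting problems if the key path contains spaces.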

AWS CloudFormation Template

This solution uses AWS CloudFormation to automate the deployment of the Machine Learning for Advertising solution on the AWS Cloud. It includes the following CloudFormation template, which you can download before deployment:

machine-learning-advertising.template: Use this template to launch the Machine Learning for Advertising solution and all associated components. The default configuration deploys an Amazon Virtual Private Cloud (Amazon VPC) network topology, an Amazon Simple Storage Service (Amazon S3) bucket, a bastion host, an internet gateway, a VPC endpoint, and an Amazon EMR cluster, but you can also customize the template based on your specific network needs.

Automated Deployment

Before you launch the automated deployment, please review the architecture, security, and other information discussed in this guide. Follow the step-by-step instructions in this section to configure and deploy the Machine Learning for Advertising solution into your account.

Time to deploy: Approximately 45 minutes

What We’ll Cover

The procedure for deploying this architecture on AWS consists of the following steps. For detailed instructions, follow the links for each step.

Step 1. Launch the Stack

• Launch the AWS CloudFormation template into your AWS account.

• Enter values for required parameters: Availability Zones, Key Pair Name, S3 Artifact Bucket, Password

• Review the other template parameters and adjust if necessary.

Step 2. Run the Notebook

• Log in to the Jupyter web interface.

Step 1. Launch the Stack

This automated AWS CloudFormation template deploys the Machine Learning for Advertising solution on the AWS Cloud.

Note: You are responsible for the cost of the AWS services used while running this solution. See the Cost section for more details. For full details, see the pricing webpage for each AWS service you will be using in this solution.

1. Sign in to the AWS Management Console and choose the Launch Solution button to launch the machine-learning-advertising AWS CloudFormation template.

You can also download the template as a starting point for your own implementation.

2. The template is launched in the US East (N. Virginia) Region by default. To launch the solution in a different AWS Region, use the region selector in the console navigation bar.

Note: This solution uses Amazon EMR, which leverages Amazon Elastic Compute Cloud (Amazon EC2) instances that are currently available in specific AWS Regions only. Therefore, you must launch this solution in an AWS Region where all deployed Amazon EC2 instances are available. [1]

3. On the Specify Details page, assign a name to your solution stack.

4. Under Parameters, review the parameters for the template and modify them as necessary. This solution uses the following default values.

Parameter | Default | Description
Availability Zones | <Requires input> | List of Availability Zones to use for the subnets in the VPC
Number of AZs | 2 | Number of Availability Zones to use in the VPC. Note: This number must match the number of selections you made for the Availability Zones parameter.
VPC CIDR | 10.0.0.0/16 | The CIDR block for the VPC
Private Subnet 1A CIDR | 10.0.0.0/19 | The CIDR block for the private subnet 1A located in AZ1
Private Subnet 2A CIDR | 10.0.32.0/19 | The CIDR block for the private subnet 2A located in AZ2
Public Subnet 1 CIDR | 10.0.128.0/20 | The CIDR block for the public DMZ subnet 1 located in AZ1
Public Subnet 2 CIDR | 10.0.144.0/20 | The CIDR block for the public DMZ subnet 2 located in AZ2
Public Alb Acm Certificate | <Optional input> | The AWS Certificate Manager (ACM) certificate ARN for the ALB certificate. Note: This certificate must be created in the same AWS Region as the ALB and must reference the domain name you use to access the ALB.
Key Pair Name | <Requires input> | Public and private key pair, which allows you to connect securely to the bastion host and EMR master node instances. This is the key pair you created in your preferred AWS Region when you created your AWS account.
Remote Access CIDR | <Requires input> | The IP address range that can be used to SSH to the bastion host
Master | m4.xlarge | The EMR master node Amazon EC2 instance type
Core | r4.xlarge | The EMR core node Amazon EC2 instance type
Data | Yes | Choose whether to transfer adtech sample data to the solution-created Amazon S3 bucket. Note: If you select No, the notebooks will not run.
Command | <Optional input> | Additional installation packages supported, for example: --r,--julia,--torch,--ruby
S3 Artifact Bucket | <Requires input> | Unique name of the solution-created Amazon S3 bucket where your application artifacts will be stored
Bastion Host | Yes | Choose whether to deploy a bastion host. For more information, see EMR Web Interfaces.
Password | <Requires input> | Customer-created password for logging in to the Jupyter web interface. Note: Password must be a minimum of 6 characters and a maximum of 16 characters.

[1] For the most current service availability by AWS Region, see https://aws.amazon.com/about-aws/global-infrastructure/regional-product-services/

5. Choose Next.


6. On the Options page, choose Next.

7. On the Review page, review and confirm the settings. Be sure to check the box acknowledging that the template will create AWS Identity and Access Management (IAM) resources.

8. Choose Create to deploy the stack.

You can view the status of the stack in the AWS CloudFormation console in the Status column. You should see a status of CREATE_COMPLETE in approximately 45 minutes.

Note: In addition to the primary AWS Lambda functions, this solution includes the solution-helper Lambda function, which runs only during initial configuration or when resources are updated or deleted.

When running this solution, the solution-helper function is inactive. However, do not delete this function as it is necessary to manage associated resources.
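Several constraints from the parameter table in step 4 can be checked before you launch the stack. The sketch below is illustrative only: the function and argument names are assumptions rather than the template's actual parameter keys, and the checks cover just the rules stated in this guide (subnet CIDR blocks inside the VPC CIDR without overlapping, the Availability Zone count matching the selection, and the 6 to 16 character password).

```python
import ipaddress


def validate_parameters(vpc_cidr, subnet_cidrs, availability_zones,
                        number_of_azs, password):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    vpc = ipaddress.ip_network(vpc_cidr)
    subnets = [ipaddress.ip_network(cidr) for cidr in subnet_cidrs]

    # Every subnet must sit inside the VPC address range.
    for subnet in subnets:
        if not subnet.subnet_of(vpc):
            problems.append(f"{subnet} is outside the VPC CIDR {vpc}")

    # No two subnets may share addresses.
    for i, first in enumerate(subnets):
        for second in subnets[i + 1:]:
            if first.overlaps(second):
                problems.append(f"{first} overlaps {second}")

    if len(availability_zones) != number_of_azs:
        problems.append("Number of AZs must match the Availability Zones selected")
    if not 6 <= len(password) <= 16:
        problems.append("Password must be 6 to 16 characters long")
    return problems


if __name__ == "__main__":
    # CIDR defaults are from this guide; the zones and password are examples.
    print(validate_parameters(
        vpc_cidr="10.0.0.0/16",
        subnet_cidrs=["10.0.0.0/19", "10.0.32.0/19",
                      "10.0.128.0/20", "10.0.144.0/20"],
        availability_zones=["us-east-1a", "us-east-1b"],
        number_of_azs=2,
        password="example-pass",
    ))
```

With the guide's default CIDR blocks and a valid password, the returned list is empty.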

Step 2: Run the Notebook

After the AWS CloudFormation template is deployed, log in to the Jupyter web interface. You can also run the notebook using your production data to develop an ML model for your advertising features.

1. Navigate to the AWS CloudFormation stack Outputs tab.

2. Choose the link shown as the value of the Public Alb Hostname key.

Note: Once the template is deployed, you may need to allow additional time for the ALB to register with Amazon EMR.

3. Enter the password that you created in the Password parameter, and choose Login.

Note: You may receive a warning about an insecure connection. The data that is transferred is synthetic (mock) data that does not need to be secured.

If you want to use your own data (in DCM format), we recommend that you enable SSL for your Application Load Balancer (ALB) before transferring your data. For more information, see Using Your Own Data.

4. Under files, choose adtech.

5. Select the Adtech_Ml_Workflow.ipynb file.

6. Choose Cell, then select Run all.


You can now use the included Jupyter notebooks, which include pre-trained models and a sample advertising dataset, to see how machine learning algorithms train and test models for predictive analysis in advertising.

Security

When you build systems on AWS infrastructure, security responsibilities are shared between you and AWS. This shared model can reduce your operational burden as AWS operates, manages, and controls the components from the host operating system and virtualization layer down to the physical security of the facilities in which the services operate. For more information about security on AWS, visit the AWS Security Center.

Additional Resources

AWS services documentation

• AWS CloudFormation

• Amazon EMR

• AWS Lambda

• Amazon EC2

• Amazon VPC

Other resources

• Jupyter

• Apache Spark MLlib

• DoubleClick Campaign Manager


Appendix: Collection of Anonymous Data

This solution includes an option to send anonymous usage data to AWS. We use this data to better understand how customers use this solution and related services and products. When enabled, the following information is collected and sent to AWS:

• Solution ID: The AWS solution identifier

• Unique ID (UUID): Randomly generated, unique identifier for each deployment

• Timestamp: Data-collection timestamp

• Instance Data: Count of the state and type of instances that are managed by Amazon EMR in each AWS Region

Example data:

Master: {m4.xlarge:1}

Core: {r4.xlarge:1}

Note that AWS will own the data gathered via this survey. Data collection will be subject to the AWS Privacy Policy. To opt out of this feature, complete one of the following tasks:

a) Modify the AWS CloudFormation template mapping section as follows:

    "Send" : {
      "AnonymousUsage" : { "Data" : "Yes" }
    },

to

    "Send" : {
      "AnonymousUsage" : { "Data" : "No" }
    },

OR

b) After the solution has been launched, find the machine-learning-advertising function in the Lambda console and set the SEND_ANONYMOUS_DATA environment variable to No.
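Option (a) can also be applied programmatically. The sketch below assumes the template is JSON and that the "Send" mapping sits under the top-level "Mappings" key, which is where AWS CloudFormation templates define mappings. The function name and the inline sample template are hypothetical, and the real template contains many more sections.

```python
import json


def disable_anonymous_usage(template):
    """Return a copy of a parsed template with Send.AnonymousUsage.Data set to "No"."""
    # Round-trip through JSON to deep-copy the input without mutating it.
    updated = json.loads(json.dumps(template))
    try:
        updated["Mappings"]["Send"]["AnonymousUsage"]["Data"] = "No"
    except KeyError as missing:
        raise KeyError(f"template has no {missing} section to update") from None
    return updated


if __name__ == "__main__":
    # Minimal stand-in for the relevant fragment of the template (assumption).
    sample = {"Mappings": {"Send": {"AnonymousUsage": {"Data": "Yes"}}}}
    print(json.dumps(disable_anonymous_usage(sample), indent=2))
```

To edit a downloaded copy, load it with json.load, pass the result through this function, and write the output back with json.dump before launching the stack from the modified file.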


Send Us Feedback

We welcome your questions and comments. Please post your feedback on the AWS Solutions Discussion Forum.

You can visit the GitHub repository to download the templates and scripts for this solution, and to share your customizations with others.

Document Revisions

Date | Change | In sections
January 2018 | Initial release | -

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.

Notices

This document is provided for informational purposes only. It represents AWS’s current product offerings and practices as of the date of issue of this document, which are subject to change without notice. Customers are responsible for making their own independent assessment of the information in this document and any use of AWS’s products or services, each of which is provided “as is” without warranty of any kind, whether express or implied. This document does not create any warranties, representations, contractual commitments, conditions or assurances from AWS, its affiliates, suppliers or licensors. The responsibilities and liabilities of AWS to its customers are controlled by AWS agreements, and this document is not part of, nor does it modify, any agreement between AWS and its customers.

The Machine Learning for Advertising solution is licensed under the terms of the Amazon Software License available at https://aws.amazon.com/asl/.