Copyright (c) 2018 by Amazon.com, Inc. or its affiliates.
Machine Learning for Advertising is licensed under the terms of the Amazon Software License available at
https://aws.amazon.com/asl/
Machine Learning for Advertising AWS Implementation Guide
Vijay Sathish
January 2018
Amazon Web Services –Machine Learning for Advertising January 2018
Page 2 of 12
Contents
Overview
  Cost
  Architecture Overview
Implementation Considerations
  Jupyter Notebook
  Using Your Own Data
  EMR Web Interfaces
AWS CloudFormation Template
Automated Deployment
  What We’ll Cover
  Step 1. Launch the Stack
  Step 2. Run the Notebook
Security
Additional Resources
Appendix: Collection of Anonymous Data
Send Us Feedback
Document Revisions
About This Guide
This implementation guide discusses architectural considerations and configuration steps for
deploying the Machine Learning for Advertising solution on the Amazon Web Services (AWS)
Cloud. It includes links to an AWS CloudFormation template that launches, configures, and
runs the AWS services required to deploy this solution using AWS best practices for security
and availability.
The guide is intended for data scientists, chief data officers, and data engineers who have
practical experience architecting on the AWS Cloud.
Overview
Machine learning (ML) helps Amazon Web Services (AWS) customers use historical data to
predict future outcomes, which can lead to better business decisions. Machine learning
techniques are core to the digital advertising industry. Advertisers can use ML algorithms to
construct and refine mathematical models from their business data, and then use these
models to help optimize allocation of advertising budgets across multiple channels, improve
user targeting and engagement, maximize advertising revenue for publishers, and combat
fraudulent web traffic.
AWS offers several machine learning services and tools tailored for a variety of use cases and
levels of expertise. However, it can be a challenge to understand the mechanics of model
training and tuning, identify relevant data features, and design a workflow that can both
perform complex extraction, transformation, and loading (ETL) activities and scale to
accommodate large datasets.
To help customers more easily implement a machine learning workflow for advertising use
cases, AWS offers the Machine Learning for Advertising solution. This solution uses AWS
CloudFormation to deploy a scalable, customizable machine learning architecture that
leverages Apache Spark ML, an open source set of standardized APIs for machine learning
algorithms; Amazon EMR, a service that distributes and processes data across dynamically
scalable Amazon Elastic Compute Cloud (Amazon EC2) instances; and Jupyter Notebook, an
open source web application for creating and sharing live code, equations, visualizations, and
narrative text.
The solution provides a framework for an end-to-end machine learning process on the AWS
Cloud. It includes pre-trained models and a sample advertising dataset to demonstrate how
to use machine learning algorithms to test and train models for predictive analysis in
advertising. Customers can use the included models as a starting point to develop their own
custom machine learning models, and customize the included notebooks for their own use
case.
Cost
You are responsible for the cost of the AWS services used while running this reference
deployment. As of the date of publication, the cost for running this solution with default
settings in the US East (N. Virginia) Region is approximately $0.84 an hour. This includes
charges for Amazon EMR, Amazon S3, and Application Load Balancer. This estimate does
not include variable charges for the NAT Gateway, or data processing and data transfer costs.
Prices are subject to change. For full details, see the pricing webpage for each AWS service
you will be using in this solution.
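As a rough illustration of what the hourly figure implies, the following minimal Python sketch scales the approximate $0.84 per hour rate over a running period. The function name and the example durations are assumptions made for illustration. The result excludes the variable NAT Gateway, data processing, and data transfer charges mentioned in this section, and actual prices may differ.

```python
from decimal import Decimal

# Approximate hourly cost quoted in this guide for default settings
# in US East (N. Virginia); actual prices may differ.
HOURLY_RATE_USD = Decimal("0.84")


def estimate_cost(hours, hourly_rate=HOURLY_RATE_USD):
    """Return the estimated fixed cost in USD for running `hours` hours."""
    if hours < 0:
        raise ValueError("hours must be non-negative")
    return (hourly_rate * Decimal(hours)).quantize(Decimal("0.01"))


if __name__ == "__main__":
    # Hypothetical durations: one 8-hour working day and a 30-day month.
    print(estimate_cost(8))
    print(estimate_cost(24 * 30))
```

Because the cluster is billed while it runs, deleting the stack when you finish experimenting is the simplest way to keep this figure small.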
Architecture Overview
Deploying this solution with the default parameters builds the following environment in
the AWS Cloud.
Figure 1: Machine Learning for Advertising architecture on AWS
The AWS CloudFormation template deploys an Amazon Virtual Private Cloud (Amazon VPC)
network topology with one public and one private subnet, and a VPC endpoint to Amazon
Simple Storage Service (Amazon S3). The private subnet hosts an Amazon EMR cluster with
Apache Spark and custom Jupyter notebooks running on the master node. The public subnet
hosts an Application Load Balancer (ALB) that inspects and forwards incoming user requests
to the Jupyter notebooks. The template also creates an Amazon S3 bucket that includes a
synthetic dataset in Google DoubleClick Campaign Manager (DCM) format. Additionally, the
solution’s Jupyter notebooks include instructions that guide the user through the process of
constructing additional data features for future models.
When the machine learning workflow is triggered in the Jupyter web interface, the solution
ingests data from the S3 bucket into the Amazon EMR cluster and runs the Jupyter notebooks
on the dataset. This involves a sequence of code that preprocesses the data, extracts features,
and divides the data into training and testing sets. Machine learning algorithms process the
training dataset to develop a model to identify and predict the conversion rates of certain
campaigns. The solution then tests the results of the model’s predictions against the testing
dataset, identifies false positives and negatives, and retrains to fine-tune the model.
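The split-and-evaluate steps described above can be sketched in plain Python. This is a simplified illustration only, not the code in the solution's notebooks (which run on Apache Spark): the function names, the 80/20 split, and the sample rows are assumptions.

```python
import random


def split_train_test(rows, test_fraction=0.2, seed=42):
    """Shuffle rows reproducibly and split them into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for binary labels (1 = converted)."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if p == 1:
            counts["tp" if a == 1 else "fp"] += 1
        else:
            counts["fn" if a == 1 else "tn"] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical rows: (feature value, converted label).
    rows = [(i, i % 2) for i in range(10)]
    train, test = split_train_test(rows)
    print(len(train), len(test))
    print(confusion_counts([1, 0, 1, 0], [1, 1, 0, 0]))
```

The false positive and false negative counts are the quantities a retraining step would aim to reduce.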
Implementation Considerations
Jupyter Notebook
The Machine Learning for Advertising solution uses Jupyter Notebook, an open source web
application that allows you to create and share live code, equations, visualizations and
narrative text.
The custom Jupyter notebook develops a machine learning (ML) model that can identify and
predict outcomes such as user conversion and which advertising activities led to purchases. The
notebook uses a default set of predefined, advertising-related features to train and test the
model against the synthetic Google DoubleClick Campaign Manager (DCM) dataset. Users
can modify the notebook code and features to create different ML models and related
visualizations for their specific use case.
Note: Jupyter Notebook currently supports only the latest stable versions of browsers. For information on browser compatibility, see Jupyter Browser Compatibility.
Using Your Own Data
This solution includes a synthetic dataset for model-training purposes, and is intended for
novice machine learning (ML) data scientists. If you choose to use the Jupyter notebooks
against your own datasets, we recommend following the AWS best practices for uploading
data into S3 to ensure that your data is uploaded quickly and securely. Additionally, we
recommend that you enable SSL for your Application Load Balancer (ALB) in the public
subnet. You can use the AWS CloudFormation template to enable the ALB certificate at any
time.
To use your own datasets in Amazon S3, you must point to your data in the Jupyter notebook.
Navigate to the Adtech_Ml_Workflow.ipynb file, and replace the synthetic data S3
location in data frames 1 and 2 with the location of your data in S3. For example:
impressions = "s3://ai-ml-workshop/impressions_full"
activities = "s3://ai-ml-workshop/activity_full"
Note: Amazon EMR charges by the amount of data that is scanned, processed, and analyzed. If you have large datasets, you can reduce costs and improve performance by partitioning, compressing, or converting your data into a columnar format such as Apache Parquet.
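Before editing the notebook, it can help to confirm that the replacement locations are well-formed S3 URIs. The following is a minimal sketch using only the Python standard library. The helper name split_s3_uri is hypothetical and is not part of the notebook; the two locations are the examples given in this section.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split an s3:// URI into (bucket, key prefix), or raise ValueError."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not a valid S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


# Example locations from this guide; replace with your own data locations.
impressions = "s3://ai-ml-workshop/impressions_full"
activities = "s3://ai-ml-workshop/activity_full"

if __name__ == "__main__":
    for name, uri in (("impressions", impressions), ("activities", activities)):
        bucket, prefix = split_s3_uri(uri)
        print(f"{name}: bucket={bucket} prefix={prefix}")
```

This check only validates the shape of the URI; it does not confirm that the bucket exists or that the cluster has permission to read it.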
EMR Web Interfaces
When you launch an Amazon EMR cluster in a public subnet, the master node of the cluster
has a public DNS which allows you to create an SSH tunnel and securely access the Amazon
EMR web interfaces. Because this solution deploys the Amazon EMR cluster in a private
subnet, the master node will not have a public DNS for secure SSH access. To allow you to
access the Amazon EMR web interfaces, this solution deploys a bastion host with a public IP
address. You must configure dynamic port forwarding to connect to the bastion host. Use the
following command for secure port forwarding:
ssh -i ~/mykeypair.pem -ND 8157 ec2-user@<BASTION_HOST_DNS>
For more information, see View Web Interfaces Hosted on Amazon EMR Clusters in the
Amazon EMR Management Guide.
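If you script your environment setup, the port forwarding command can be assembled programmatically. The sketch below is illustrative only: the function name and the example host name are assumptions, and the key path and port are the values shown in the command in this section. The -N and -D options are written separately here, which is equivalent to the combined -ND form.

```python
def build_tunnel_command(bastion_dns, key_path="~/mykeypair.pem", local_port=8157):
    """Return the ssh argument list for dynamic port forwarding via the bastion host."""
    if not bastion_dns:
        raise ValueError("bastion_dns is required")
    return [
        "ssh",
        "-i", key_path,         # private key of the key pair given at launch
        "-N",                   # do not run a remote command
        "-D", str(local_port),  # open a local SOCKS proxy on this port
        f"ec2-user@{bastion_dns}",
    ]


if __name__ == "__main__":
    # Hypothetical bastion host DNS name; use the value from your own deployment.
    cmd = build_tunnel_command("bastion.example.com")
    # Print the command so it can be pasted into a shell, which expands "~".
    print(" ".join(cmd))
```

After the tunnel is open, a browser configured to use the local SOCKS proxy on the chosen port can reach the Amazon EMR web interfaces.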
AWS CloudFormation Template
This solution uses AWS CloudFormation to automate the deployment of the Machine
Learning for Advertising solution on the AWS Cloud. It includes the following
CloudFormation template, which you can download before deployment:
machine-learning-advertising.template: Use this template to
launch the Machine Learning for Advertising solution and all
associated components. The default configuration deploys an Amazon Virtual Private Cloud
(Amazon VPC) network topology, an Amazon Simple Storage Service (Amazon S3) bucket, a
bastion host, an internet gateway, a VPC endpoint, and an Amazon EMR cluster, but you can
also customize the template based on your specific network needs.
Automated Deployment
Before you launch the automated deployment, please review the architecture, security, and
other information discussed in this guide. Follow the step-by-step instructions in this section
to configure and deploy the Machine Learning for Advertising solution into your account.
Time to deploy: Approximately 45 mins
What We’ll Cover
The procedure for deploying this architecture on AWS consists of the following steps. For
detailed instructions, follow the links for each step.
Step 1. Launch the stack
• Launch the AWS CloudFormation template into your AWS account.
• Enter values for required parameters: Availability Zones, Key Pair Name, Remote Access
CIDR, S3 Artifact Bucket, Password
• Review the other template parameters and adjust if necessary.
Step 2. Run the Notebook
• Log into the Jupyter web interface.
Step 1. Launch the Stack
This automated AWS CloudFormation template deploys the Machine Learning for
Advertising solution on the AWS Cloud.
Note: You are responsible for the cost of the AWS services used while running this solution. See the Cost section for more details. For full details, see the pricing webpage for each AWS service you will be using in this solution.
1. Sign in to the AWS Management Console and click the Launch Solution button to launch
the machine-learning-advertising AWS CloudFormation template.
You can also download the template as a starting point for your own implementation.
2. The template is launched in the US East (N. Virginia) Region by default. To launch the
solution in a different AWS Region, use the region selector in the console navigation bar.
Note: This solution uses Amazon EMR which leverages Amazon Elastic Compute Cloud (Amazon EC2) instances, which are currently available in specific AWS Regions only. Therefore, you must launch this solution in an AWS Region where all deployed Amazon EC2 instances are available. 1
3. On the Specify Details page, assign a name to your solution stack.
4. Under Parameters, review the parameters for the template and modify them as
necessary. This solution uses the following default values.
Parameter | Default | Description
Availability Zones | <Requires input> | List of Availability Zones to use for the subnets in the VPC
Number of AZs | 2 | Number of Availability Zones to use in the VPC. Note: This number must match the number of selections you chose for the Availability Zones parameter.
VPC CIDR | 10.0.0.0/16 | The CIDR block for the VPC
Private Subnet 1A CIDR | 10.0.0.0/19 | The CIDR block for private subnet 1A located in AZ1
Private Subnet 2A CIDR | 10.0.32.0/19 | The CIDR block for private subnet 2A located in AZ2
Public Subnet 1 CIDR | 10.0.128.0/20 | The CIDR block for the public DMZ subnet 1 located in AZ1
Public Subnet 2 CIDR | 10.0.144.0/20 | The CIDR block for the public DMZ subnet 2 located in AZ2
Public Alb Acm Certificate | <Optional input> | The AWS Certificate Manager (ACM) certificate ARN for the ALB certificate. Note: This certificate must be created in the same AWS Region as the ALB and must reference the domain name you use for this solution.
Key Pair Name | <Requires input> | Public and private key pair, which allows you to connect securely to the bastion host and EMR master node instances. This is the key pair you created in your preferred AWS Region when you created your AWS account.
Remote Access CIDR | <Requires input> | The IP address range that can be used to SSH to the bastion host
Master | m4.xlarge | The EMR master node Amazon EC2 instance type
Core | r4.xlarge | The EMR core node Amazon EC2 instance type
Data | Yes | Choose whether to transfer adtech sample data to the solution-created Amazon S3 bucket. Note: If you select No, the notebooks will not run.
Command | <Optional input> | Additional installation packages supported, for example: --r,--julia,--torch,--ruby
S3 Artifact Bucket | <Requires input> | Unique name of the solution-created Amazon S3 bucket where your application artifacts will be stored
Bastion Host | Yes | Choose whether to deploy a bastion host. For more information, see EMR Web Interfaces.
Password | <Requires input> | Customer-created password for logging into the Jupyter web interface. Note: Password must be a minimum of 6 characters and a maximum of 16 characters.
1 For the most current service availability by AWS Region, see https://aws.amazon.com/about-aws/global-infrastructure/regional-product-services/
5. Choose Next.
6. On the Options page, choose Next.
7. On the Review page, review and confirm the settings. Be sure to check the box
acknowledging that the template will create AWS Identity and Access Management (IAM)
resources.
8. Choose Create to deploy the stack.
You can view the status of the stack in the AWS CloudFormation Console in the Status
column. You should see a status of CREATE_COMPLETE in approximately 45 minutes.
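Several of the parameter constraints in this step can be checked before you launch the stack. The following sketch is a hypothetical pre-flight check, not part of the solution: the function name and the sample Availability Zone names are assumptions, while the CIDR defaults, the 6 to 16 character password rule, and the requirement that the number of AZs match the selected Availability Zones come from the parameter table.

```python
import ipaddress

# Default CIDR blocks listed in the parameter table of this guide.
VPC_CIDR = "10.0.0.0/16"
SUBNET_CIDRS = ("10.0.0.0/19", "10.0.32.0/19", "10.0.128.0/20", "10.0.144.0/20")


def validate_parameters(availability_zones, number_of_azs, password,
                        vpc_cidr=VPC_CIDR, subnet_cidrs=SUBNET_CIDRS):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    if len(availability_zones) != number_of_azs:
        problems.append("Number of AZs must match the Availability Zones selected")
    if not 6 <= len(password) <= 16:
        problems.append("Password must be 6 to 16 characters long")
    vpc = ipaddress.ip_network(vpc_cidr)
    subnets = [ipaddress.ip_network(cidr) for cidr in subnet_cidrs]
    # Every subnet must sit inside the VPC range.
    for subnet in subnets:
        if not subnet.subnet_of(vpc):
            problems.append(f"{subnet} is outside the VPC range {vpc}")
    # No two subnets may share addresses.
    for i, first in enumerate(subnets):
        for second in subnets[i + 1:]:
            if first.overlaps(second):
                problems.append(f"{first} overlaps {second}")
    return problems


if __name__ == "__main__":
    # Hypothetical Availability Zone names and password.
    print(validate_parameters(["us-east-1a", "us-east-1b"], 2, "s3cretPW"))
```

AWS CloudFormation performs its own validation at launch; a check like this only surfaces obvious mistakes earlier.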
Note: In addition to the primary AWS Lambda functions, this solution includes the solution-helper Lambda function, which runs only during initial configuration or
when resources are updated or deleted.
When running this solution, the solution-helper function is inactive. However,
do not delete this function as it is necessary to manage associated resources.
Step 2. Run the Notebook
Once you have deployed the AWS CloudFormation template, you need to log into the Jupyter
web interface. Additionally, you can run the notebook using your production data to develop
an ML model for your advertising features.
1. Navigate to the AWS CloudFormation stack Outputs tab.
2. Choose the link that is the value in the Public Alb Hostname key.
Note: Once the template is deployed, you may need to allow additional time for ALB to register with Amazon EMR.
3. Enter the password that you created in the Password parameter, and choose Login.
Note: You may receive a warning about an insecure connection. The data that is transferred is synthetic (mock) data that does not need to be secured.
If you want to use your own data (in DCM format), we recommend that you enable SSL for your Application Load Balancer (ALB) before transferring your data. For more information, see Using Your Own Data.
4. Under files, choose adtech.
5. Select the Adtech_Ml_Workflow.ipynb file.
6. Choose Cell, then select Run all.
You can now use the included Jupyter notebooks, which include pre-trained
models and a sample advertising dataset that demonstrate how to use machine learning
algorithms to test and train models for predictive analysis in advertising.
Security
When you build systems on AWS infrastructure, security responsibilities are shared between
you and AWS. This shared model can reduce your operational burden as AWS operates,
manages, and controls the components from the host operating system and virtualization
layer down to the physical security of the facilities in which the services operate. For more
information about security on AWS, visit the AWS Security Center.
Additional Resources
AWS services documentation
• AWS CloudFormation
• Amazon EMR
• AWS Lambda
• Amazon EC2
• Amazon VPC
Other resources
• Jupyter
• Apache Spark MLlib
• DoubleClick Campaign Manager
Appendix: Collection of Anonymous Data
This solution includes an option to send anonymous usage data to AWS. We use this data to
better understand how customers use this solution and related services and products. When
enabled, the following information is collected and sent to AWS:
• Solution ID: The AWS solution identifier
• Unique ID (UUID): Randomly generated, unique identifier for each deployment
• Timestamp: Data-collection timestamp
• Instance Data: Count of the state and type of instances that are managed by
Amazon EMR in each AWS Region
Example data:
Master: {m4.xlarge:1}
Core: {r4.xlarge:1}
Note that AWS will own the data gathered via this survey. Data collection will be subject
to the AWS Privacy Policy. To opt out of this feature, complete one of the following tasks:
a) Modify the AWS CloudFormation template mapping section as follows:
"Send" : {
"AnonymousUsage" : { "Data" : "Yes" }
},
to
"Send" : {
"AnonymousUsage" : { "Data" : "No" }
},
OR
b) After the solution has been launched, find the machine-learning-advertising
function in the Lambda console and set the SEND_ANONYMOUS_DATA
environment variable to No.
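Option (a) can also be applied with a short script instead of a manual edit. The sketch below is an illustration under stated assumptions: it assumes the "Send" entry sits under the template's top-level "Mappings" key, and it works on a minimal stand-in dictionary rather than the real template file. The function name is hypothetical.

```python
import json


def disable_anonymous_usage(template):
    """Return a copy of the template dict with anonymous data sending set to "No"."""
    updated = json.loads(json.dumps(template))  # simple deep copy of JSON data
    # Assumption: "Send" is a mapping under the top-level "Mappings" section.
    updated["Mappings"]["Send"]["AnonymousUsage"]["Data"] = "No"
    return updated


if __name__ == "__main__":
    # Minimal stand-in for the template; the real template has many more keys.
    sample = {"Mappings": {"Send": {"AnonymousUsage": {"Data": "Yes"}}}}
    print(json.dumps(disable_anonymous_usage(sample), indent=2))
```

To use this with a downloaded template, you would load the file with json.load, apply the function, and write the result back before launching the stack.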
Send Us Feedback
We welcome your questions and comments. Please post your feedback on the AWS Solutions
Discussion Forum.
You can visit the GitHub repository to download the templates and scripts for this solution,
and to share your customizations with others.
Document Revisions
Date | Change | In sections
January 2018 | Initial release | -
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Notices
This document is provided for informational purposes only. It represents AWS’s current product offerings and
practices as of the date of issue of this document, which are subject to change without notice. Customers are
responsible for making their own independent assessment of the information in this document and any use of
AWS’s products or services, each of which is provided “as is” without warranty of any kind, whether express or
implied. This document does not create any warranties, representations, contractual commitments, conditions
or assurances from AWS, its affiliates, suppliers or licensors. The responsibilities and liabilities of AWS to its
customers are controlled by AWS agreements, and this document is not part of, nor does it modify, any
agreement between AWS and its customers.
The Machine Learning for Advertising solution is licensed under the terms of the Amazon Software License
available at https://aws.amazon.com/asl/.