Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE
TRANSCRIPT
© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE
Ryosuke Iwanaga, Solutions Architect, Amazon Web Services Japan
October 2016
Agenda
• Recommendation and DSSTNE
• Data science productivity with AWS
Note: The details shown here are not the actual Amazon case, but a general pattern
Recommendation and DSSTNE
Product Recommendations
What are people who bought items A, B, C … Z most likely to purchase next?
Input and Output
Input: Purchase history for each customer
Output: The probability of buying each product, for each customer
Machine Learning for Recommendation
Lots of algorithms:
• Matrix Factorization
• Logistic Regression
• Naïve Bayes
• etc.
=> Neural Networks
Neural Networks for Product Recommendations
[Diagram: neural network with an Input layer (10K-10M units), a Hidden layer (100-1K units), and an Output layer (10K-10M units)]
This Is A Huge Sparse Data Problem
• Uncompressed sparse data either eats a lot of memory or eats a lot of bandwidth uploading it to the GPU (see the sketch below)
• Naively running networks on uncompressed sparse data leads to lots of multiplications of zero by zero. This wastes memory, power, and time
• Product recommendation networks can have billions of parameters, which cannot fit in a single GPU
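To make the memory argument concrete, here is a minimal sketch in Scala (not DSSTNE's actual internal format) of the standard sparse trick: store only the indices of purchased items instead of a 0/1 vector over the whole catalog, so cost scales with the number of purchases rather than the catalog size.

    // A minimal sketch: each customer's purchase history as sorted item
    // indices instead of a dense 0/1 vector over the entire catalog.
    case class SparseVector(size: Int, indices: Array[Int]) {
      // A dot product with a dense weight column touches only the non-zeros,
      // so the cost is O(purchases) instead of O(catalog size).
      def dot(weights: Array[Float]): Float =
        indices.foldLeft(0.0f)((acc, i) => acc + weights(i))
    }

    object SparseDemo extends App {
      val catalogSize = 10000000 // e.g. a 10M-product catalog
      val purchases   = SparseVector(catalogSize, Array(42, 1337, 999999))
      // Dense storage: 10M floats (~40 MB) per customer.
      // Sparse storage: 3 ints (~12 bytes) per customer.
      println(s"non-zeros: ${purchases.indices.length} of ${purchases.size}")
    }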
Framework Requirements (2014)
• Efficient support for large input and output layers
• Efficient handling of sparse data (i.e. don't store zeros)
• Automagic multi-GPU support for large networks and scaling
• Avoids multiplying zero and/or by zero
• 24-hour or less training and recommendations turnaround
• Human-readable descriptions of networks
DSSTNE: Deep Sparse Scalable Tensor Network Engine*
• A neural network framework released into OSS by Amazon
• Optimized for large sparse data problems and for fully connected layers
• Extremely efficient model-parallel multi-GPU support
• 100% deterministic execution
• Full SM 3.x, 5.x, and 6.x support (Kepler or better GPUs)
• Distributed training support OOTB (~20 lines of MPI calls)
*Pronounced "Destiny"
Describes Neural Networks as JSON Objects

    {
        "Version" : 0.7,
        "Name" : "AE",
        "Kind" : "FeedForward",
        "SparsenessPenalty" : {
            "p" : 0.5,
            "beta" : 2.0
        },
        "ShuffleIndices" : false,
        "Denoising" : {
            "p" : 0.2
        },
        "ScaledMarginalCrossEntropy" : {
            "oneTarget" : 1.0,
            "zeroTarget" : 0.0,
            "oneScale" : 1.0,
            "zeroScale" : 1.0
        },
        "Layers" : [
            { "Name" : "Input", "Kind" : "Input", "N" : "auto", "DataSet" : "input", "Sparse" : true },
            { "Name" : "Hidden", "Kind" : "Hidden", "Type" : "FullyConnected", "N" : 128, "Activation" : "Sigmoid", "Sparse" : true },
            { "Name" : "Output", "Kind" : "Output", "Type" : "FullyConnected", "DataSet" : "output", "N" : "auto", "Activation" : "Sigmoid", "Sparse" : true }
        ],
        "ErrorFunction" : "ScaledMarginalCrossEntropy"
    }
Summary for DSSTNE
• Very efficient performance for sparse fully-connected NNs
• Multiple GPUs via model parallelism and data parallelism
• Declare NNs in a human-readable format (JSON definition)
• 100% deterministic execution
Data science productivity with AWS
Productivity
Agile iteration is the most important factor for productivity: design => train => predict => evaluate => design => …
Training: GPU (DSSTNE and others). Pre/post-processing: CPU.
How do we unify these different workloads? Data scientists don't want to use too many tools.
What are Containers?
• OS virtualization
• Process isolation
• Images
• Automation
[Diagram: App1 and App2, each with their own Bins/Libs, running directly on the Server, with no per-app Guest OS as in VMs]
Deep Learning meets Docker (Containers)
There are a lot of Deep Learning frameworks: DSSTNE, Caffe, Theano, TensorFlow, etc.
To compare each framework using the same input and output:
• Containerize each framework
• Just swap the container image and configuration (see the sketch below)
• No more worrying about setting up machines!
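As an illustration of the "just swap the image" point, a tiny Scala sketch; the image names and commands are hypothetical:

    // An illustrative sketch (names are hypothetical): with every framework
    // packaged as a Docker image, switching frameworks means switching a
    // string, not reinstalling a machine.
    case class DlJob(image: String, command: Seq[String])

    object SwapFrameworks extends App {
      val jobs = Map(
        "dsstne"     -> DlJob("mycompany/dsstne:latest",     Seq("train", "-c", "config.json")),
        "tensorflow" -> DlJob("mycompany/tensorflow:latest", Seq("python", "train.py")),
        "caffe"      -> DlJob("mycompany/caffe:latest",      Seq("caffe", "train", "--solver=solver.prototxt"))
      )
      // Same input/output contract, a different image per experiment.
      val chosen = jobs("dsstne")
      println(s"docker run ${chosen.image} ${chosen.command.mkString(" ")}")
    }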
Spark moves at interactive speed
[Diagram: a Spark job's DAG: RDDs A-F flow through map, filter, groupBy, and join operators, split into Stages 1-3, with cached partitions reused across stages]
• Massively parallel
• Uses DAGs instead of map-reduce for execution
• Minimizes I/O by storing data in DataFrames in memory
• Partitioning-aware to avoid network-intensive shuffle (a small pipeline sketch follows)
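A small Spark (Scala) sketch of the kind of pipeline such a DAG describes, using the same operators named on the slide (filter, join, groupBy) plus caching; the bucket paths and column names are hypothetical:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.collect_list

    object PreprocessPurchases {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("preprocess").getOrCreate()
        import spark.implicits._

        // Hypothetical datasets: (customerId, itemId, ts) and (itemId, category)
        val purchases = spark.read.parquet("s3://my-bucket/purchases/")
        val catalog   = spark.read.parquet("s3://my-bucket/catalog/")

        val recent      = purchases.filter($"ts" > "2016-01-01")   // filter
        val joined      = recent.join(catalog, "itemId")           // join
        val perCustomer = joined
          .groupBy($"customerId")                                  // groupBy
          .agg(collect_list($"itemId").as("items"))
          .cache()                                                 // cached partitions reused downstream

        perCustomer.write.parquet("s3://my-bucket/training-input/")
        spark.stop()
      }
    }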
Apache Zeppelin notebook to develop queries
Architecture
Control CPU cluster and GPU cluster
Both CPU and GPU jobs are submitted via Spark driver
CPU jobs: Normal Spark tasks running on Amazon EMR
GPU jobs: Spark submits jobs to Amazon ECS (not only DSSTNE but also other DL frameworks, via Docker)
Amazon EMR
Why EMR?
Automation | Decoupled | Elastic | Integration | Low-cost | Current
Why EMR? Automation
• EC2 provisioning
• Cluster setup
• Hadoop configuration
• Installing applications
• Job submission
• Monitoring and failure handling
Why EMR? Decoupled Architecture
• Separate compute and storage
• Resize and shut down with no data loss
• Point multiple clusters at the same data on Amazon S3
• Easily evolve infrastructure as technology evolves
• HDFS for iterative and disk-I/O-intensive workloads
• Save with Spot and Reserved Instances
Why EMR? Decouple Storage and Compute
[Diagram: Amazon S3 as the shared storage layer, fed by Amazon Kinesis (Streams, Firehose), with a shared Hive external metastore (e.g. Amazon RDS) and workload-specific clusters (different sizes, different versions) all pointing at the same data:
• Persistent cluster for interactive queries (Spark-SQL | Presto | Impala)
• Transient cluster for batch jobs (X hours nightly), adding/removing nodes
• Hadoop jobs and ETL jobs]
    CREATE EXTERNAL TABLE t_name (..)
    ...
    LOCATION 's3://bucketname/path-to-file/';
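Because the metastore is shared, the same S3-backed table is visible from any cluster. A minimal Spark sketch of querying it on EMR, assuming the table t_name from the snippet above:

    import org.apache.spark.sql.SparkSession

    object QueryExternalTable extends App {
      // enableHiveSupport() lets Spark use the shared Hive metastore on EMR,
      // so transient and persistent clusters see the same S3-backed tables.
      val spark = SparkSession.builder
        .appName("query")
        .enableHiveSupport()
        .getOrCreate()

      spark.sql("SELECT COUNT(*) FROM t_name").show()
      spark.stop()
    }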
EMR 5.0 - Applications
Amazon ECS
Amazon EC2 Container Service (ECS)
• Container Management at Any Scale
• Flexible Container Placement
• Integration with the AWS Platform
Components of Amazon ECS
Task: Actual containers running on instances
Task Definition: Definition of the containers and environment for a task (see the sketch below)
Cluster: Fleet of EC2 instances on which tasks run
Manager: Manages cluster resources and the state of tasks
Scheduler: Places tasks considering cluster status
Agent: Coordinates between EC2 instances and the Manager
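A hedged sketch of creating a Task Definition with the AWS SDK for Java, called from Scala as in the architecture above; the container name, image, and resource sizes are placeholders, not the talk's actual values:

    import com.amazonaws.services.ecs.AmazonECSClientBuilder
    import com.amazonaws.services.ecs.model.{ContainerDefinition, RegisterTaskDefinitionRequest}

    object RegisterDsstneTask extends App {
      val ecs = AmazonECSClientBuilder.defaultClient()

      // One Task Definition per Deep Learning framework; the image name
      // and resource sizes below are hypothetical.
      val container = new ContainerDefinition()
        .withName("dsstne-train")
        .withImage("mycompany/dsstne:latest")
        .withMemory(60000) // MiB reserved on the GPU instance
        .withCpu(4096)

      ecs.registerTaskDefinition(
        new RegisterTaskDefinitionRequest()
          .withFamily("dsstne-train")
          .withContainerDefinitions(container))
    }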
How Amazon ECS runs a Task
[Diagram: a Task Definition is handed to the Scheduler, which consults the Manager for Cluster state; the Agent on a chosen EC2 instance starts the Task]
Integration with Spark and ECS
Install AWS SDK for Java on the EMR cluster
Create Task Definition for each Deep Learning framework
Call the RunTask API; the ECS Scheduler will try to find enough space to run it (see the sketch below)
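A minimal sketch of that RunTask call from the Spark driver, again via the AWS SDK for Java; the cluster name, task definition family, and command are placeholders:

    import com.amazonaws.services.ecs.AmazonECSClientBuilder
    import com.amazonaws.services.ecs.model.{ContainerOverride, RunTaskRequest, TaskOverride}
    import scala.collection.JavaConverters._

    object SubmitGpuJob extends App {
      val ecs = AmazonECSClientBuilder.defaultClient()

      // Override the container's command per job, e.g. pointing DSSTNE at
      // this run's config; names and flags are placeholders.
      val overrides = new TaskOverride().withContainerOverrides(
        new ContainerOverride()
          .withName("dsstne-train")
          .withCommand(List("train", "-c", "config.json", "-b", "256", "-e", "10").asJava))

      // The ECS Scheduler picks a GPU instance with enough free resources.
      val result = ecs.runTask(
        new RunTaskRequest()
          .withCluster("gpu-cluster")
          .withTaskDefinition("dsstne-train")
          .withCount(1)
          .withOverrides(overrides))

      result.getTasks.asScala.foreach(t => println(s"Started task: ${t.getTaskArn}"))
    }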
Training: model parallel (one large model split across GPUs)
Prediction: data parallel (the same model replicated, with each GPU scoring a different batch of customers)
Why AWS?
Scalability
Fully-managed services
GPU instances
Summary
Amazon Personalization runs on AWS
Spark and Zeppelin provide a single interface for data scientists
DSSTNE helps run deep learning on huge, sparse neural networks
Use Amazon EMR for CPU and Amazon ECS for GPU. You can do it too!
Thank you!