Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE

© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Ryosuke Iwanaga, Solutions Architect, Amazon Web Services Japan

October 2016

Upload: hadoop-summit

Post on 07-Jan-2017


Page 1: Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE

© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Solutions Architect, Amazon Web Services Japan

Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE

Ryosuke Iwanaga

October 2016

Page 2: Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE

Agenda

• Recommendation and DSSTNE

• Data science productivity with AWS

Note: The details shown are not the actual Amazon case, but a general pattern

Page 3: Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE

Recommendation and DSSTNE

Page 4: Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE

Product Recommendations

What are people who bought items A, B, C … Z most likely to purchase next?

Page 5: Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE

Input and Output

Input: Purchase history for each customer

Output: Probability that each customer buys each product

Page 6: Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE

Machine Learning for Recommendation

Lots of algorithms
• Matrix Factorization
• Logistic Regression
• Naïve Bayes
• etc.

=> Neural Network

Page 7: Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE

Neural Networks for Product Recommendations

Output (10K-10M)

Input (10K-10M)

Hidden (100-1K)
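To get a feel for the layer sizes above, the following sketch counts the weights of a fully connected input -> hidden -> output network. The specific sizes picked from the slide's ranges and the 4 bytes per weight (32-bit floats) are assumptions for illustration, not figures from the talk.

```python
def fc_weight_count(n_input, n_hidden, n_output):
    """Weights in an input -> hidden -> output fully connected network (biases ignored)."""
    return n_input * n_hidden + n_hidden * n_output


def weight_gigabytes(n_weights, bytes_per_weight=4):
    """Memory for the weights alone, assuming 32-bit floats by default."""
    return n_weights * bytes_per_weight / 1e9


# Upper end of the ranges on the slide: 10M inputs, 1K hidden units, 10M outputs.
large = fc_weight_count(10_000_000, 1_000, 10_000_000)
# Lower end of the ranges: 10K inputs, 100 hidden units, 10K outputs.
small = fc_weight_count(10_000, 100, 10_000)

print(f"large: {large:,} weights, about {weight_gigabytes(large):.0f} GB")
print(f"small: {small:,} weights, about {weight_gigabytes(small):.3f} GB")
```

At the upper end this comes to 20 billion weights, roughly 80 GB as 32-bit floats, before counting activations or gradients.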

Page 8: Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE

This Is A Huge Sparse Data Problem

• Uncompressed sparse data either eats a lot of memory or it eats a lot of bandwidth uploading it to the GPU

• Naively running networks with uncompressed sparse data leads to lots of multiplications of zero by zero. This wastes memory, power, and time

• Product Recommendation Networks can have billions of parameters that cannot fit in a single GPU so summarizing...
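The cost of multiplying by zero can be illustrated with a toy input-to-hidden layer. This is a minimal pure-Python sketch, not DSSTNE's GPU kernels: the sizes, the random weights, and the helper names (`make_weights`, `hidden_dense`, `hidden_sparse`) are all hypothetical. It assumes a binary purchase vector, so the sparse version only has to add up the weight rows of the purchased items.

```python
import math
import random


def make_weights(n_items, n_hidden, seed=0):
    """Random input-to-hidden weights, one row per item (toy sizes only)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_hidden)] for _ in range(n_items)]


def hidden_dense(x, weights):
    """Naive version: multiplies every input, including all the zeros."""
    n_hidden = len(weights[0])
    z = [0.0] * n_hidden
    for i, x_i in enumerate(x):
        for j in range(n_hidden):
            z[j] += x_i * weights[i][j]
    return [1.0 / (1.0 + math.exp(-v)) for v in z]


def hidden_sparse(purchased, weights):
    """Sparse version: only the rows of purchased items are touched."""
    n_hidden = len(weights[0])
    z = [0.0] * n_hidden
    for i in purchased:
        row = weights[i]
        for j in range(n_hidden):
            z[j] += row[j]
    return [1.0 / (1.0 + math.exp(-v)) for v in z]


n_items, n_hidden = 1000, 8
weights = make_weights(n_items, n_hidden)
purchased = [3, 42, 977]          # indices of the items one customer bought
x = [0.0] * n_items               # the same customer as an uncompressed vector
for i in purchased:
    x[i] = 1.0

dense = hidden_dense(x, weights)
sparse = hidden_sparse(purchased, weights)
print(max(abs(a - b) for a, b in zip(dense, sparse)))
```

Both functions produce the same activations, but the dense one performs 1000 x 8 multiply-adds and stores 1000 floats per customer, while the sparse one performs 3 x 8 additions and stores 3 indices.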

Page 9: Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE

Framework Requirements (2014)

• Efficient support for large input and output layers

• Efficient handling of sparse data (i.e. don't store zero)

• Automagic multi-GPU support for large networks and scaling

• Avoids multiplying zero and/or by zero

• 24 hour or less training and recommendations turnaround

• Human-readable descriptions of networks

Page 10: Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE

DSSTNE: Deep Sparse Scalable Tensor Network Engine*

• A Neural Network framework released into OSS by Amazon

• Optimized for large sparse data problems and for fully connected layers

• Extremely efficient model-parallel multi-GPU support

• 100% Deterministic Execution

• Full SM 3.x, 5.x, and 6.x support (Kepler or better GPUs)

• Distributed training support OOTB (~20 lines of MPI calls)

* Pronounced "Destiny"

Page 11: Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE

Describes Neural Networks As JSON Objects

{
    "Version" : 0.7,
    "Name" : "AE",
    "Kind" : "FeedForward",
    "SparsenessPenalty" : {
        "p" : 0.5,
        "beta" : 2.0
    },
    "ShuffleIndices" : false,
    "Denoising" : {
        "p" : 0.2
    },
    "ScaledMarginalCrossEntropy" : {
        "oneTarget" : 1.0,
        "zeroTarget" : 0.0,
        "oneScale" : 1.0,
        "zeroScale" : 1.0
    },
    "Layers" : [
        { "Name" : "Input", "Kind" : "Input", "N" : "auto", "DataSet" : "input", "Sparse" : true },
        { "Name" : "Hidden", "Kind" : "Hidden", "Type" : "FullyConnected", "N" : 128, "Activation" : "Sigmoid", "Sparse" : true },
        { "Name" : "Output", "Kind" : "Output", "Type" : "FullyConnected", "DataSet" : "output", "N" : "auto", "Activation" : "Sigmoid", "Sparse" : true }
    ],
    "ErrorFunction" : "ScaledMarginalCrossEntropy"
}
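Because the definition is plain JSON, ordinary tooling can read it. The sketch below uses only Python's standard `json` module to load the definition from the slide and list its layers. It is not DSSTNE's own loader, and the helper `summarize_layers` is a hypothetical name introduced here.

```python
import json

# The network definition from the slide, embedded so the sketch is self-contained.
NETWORK_JSON = """
{
    "Version" : 0.7,
    "Name" : "AE",
    "Kind" : "FeedForward",
    "SparsenessPenalty" : { "p" : 0.5, "beta" : 2.0 },
    "ShuffleIndices" : false,
    "Denoising" : { "p" : 0.2 },
    "ScaledMarginalCrossEntropy" : {
        "oneTarget" : 1.0, "zeroTarget" : 0.0, "oneScale" : 1.0, "zeroScale" : 1.0
    },
    "Layers" : [
        { "Name" : "Input", "Kind" : "Input", "N" : "auto", "DataSet" : "input", "Sparse" : true },
        { "Name" : "Hidden", "Kind" : "Hidden", "Type" : "FullyConnected", "N" : 128, "Activation" : "Sigmoid", "Sparse" : true },
        { "Name" : "Output", "Kind" : "Output", "Type" : "FullyConnected", "DataSet" : "output", "N" : "auto", "Activation" : "Sigmoid", "Sparse" : true }
    ],
    "ErrorFunction" : "ScaledMarginalCrossEntropy"
}
"""


def summarize_layers(config):
    """Return (name, kind, size) for each layer; "auto" sizes are left as-is."""
    return [(layer["Name"], layer["Kind"], layer["N"]) for layer in config["Layers"]]


config = json.loads(NETWORK_JSON)
layers = summarize_layers(config)
for name, kind, size in layers:
    print(f"{name:<6} kind={kind:<6} N={size}")
```

The "auto" sizes stay as strings because the sketch does not try to resolve them. A check like this can catch a malformed definition before a GPU job is started.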

Page 12: Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE

Summary for DSSTNE

Very efficient performance for sparse fully-connected NNs
• Multiple GPUs via model parallelism and data parallelism

Declare NNs in a human-readable format
• JSON definition

100% Deterministic execution

Page 13: Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE

Data science productivity with AWS

Page 14: Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE

Productivity

Agile iteration is the most important factor for productivity: design => train => predict => evaluate => design => …

Training: GPU (DSSTNE and others)
Pre/Post processing: CPU

How to unify these different workloads? Data scientists don't want to use too many tools.

Page 16: Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE

What are Containers?

OS virtualization

Process isolation

Images

Automation

[Diagram: container stack with Server, Guest OS, Bins/Libs, and App1 / App2]

Page 17: Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE

Deep Learning meets Docker (Container)

A lot of Deep Learning frameworks: DSSTNE, Caffe, Theano, TensorFlow, etc.

To compare each framework using the same input and output:
• Containerize each framework
• Just swap the container image and configuration
• No more worrying about setting up machines!

Page 19: Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE

Spark moves at interactive speed

[Diagram: DAG of RDDs A to F grouped into Stages 1 to 3, with map, filter, groupBy, and join operations; legend: RDD, cached partition]

• Massively parallel

• Uses DAGs instead of map-reduce for execution

• Minimizes I/O by storing data in DataFrames in memory

• Partitioning-aware to avoid network-intensive shuffle

Page 20: Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE

Apache Zeppelin notebook to develop queries

Page 21: Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE

Architecture

Page 22: Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE

Control CPU cluster and GPU cluster

Both CPU and GPU jobs are submitted via Spark driver

CPU jobs: Normal Spark tasks running on Amazon EMR

GPU jobs: Spark submits jobs to Amazon ECS. Not only DSSTNE but also other DL frameworks can run, via Docker.

Page 23: Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE

Amazon EMR

Page 24: Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE

Why EMR?

• Automation
• Decouple
• Elastic
• Integration
• Low-cost
• Current

Page 25: Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE

Why EMR? Automation

• EC2 Provisioning
• Cluster Setup
• Hadoop Configuration
• Installing Applications
• Job submission
• Monitoring and Failure Handling

Page 26: Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE

Why EMR? Decoupled Architecture

Separate compute and storage

Resize and shut down with no data loss

Point multiple clusters at the same data on Amazon S3

Easily evolve infrastructure as technology evolves

HDFS for iterative and disk I/O intensive workloads

Save with spot and reserved instances

Page 27: Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE

Why EMR? Decouple Storage and Compute

Amazon Kinesis (Streams, Firehose)

Hadoop Jobs

Persistent Cluster – Interactive Queries (Spark-SQL | Presto | Impala)

Transient Cluster – Batch Jobs (X hours nightly) – Add/Remove Nodes

ETL Jobs

Hive External Metastore, i.e. Amazon RDS

Workload-specific clusters (Different sizes, Different versions)

Amazon S3 for Storage

create external table t_name (..) ... location 's3://bucketname/path-to-file/'

Page 28: Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE

EMR 5.0 - Applications

Page 29: Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE

Amazon ECS

Page 30: Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE

Amazon EC2 Container Service (ECS)

Container Management at Any Scale

Flexible Container Placement

Integration with the AWS Platform

Page 31: Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE

Components of Amazon ECS

Task: Actual containers running on instances

Task Definition: Definition of the containers and environment for a task

Cluster: Fleet of EC2 instances on which tasks run

Manager: Manages cluster resources and the state of tasks

Scheduler: Places tasks considering cluster status

Agent: Coordinates between EC2 instances and the Manager

Page 32: Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE

How Amazon ECS runs a Task

[Diagram: Scheduler, Manager, Cluster, Task Definition, Task, Agent]

Page 33: Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE

Integration with Spark and ECS

Install AWS SDK for Java on the EMR cluster

Create Task Definition for each Deep Learning framework

Call the RunTask API. The ECS Scheduler will try to find enough space to run the task.
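The RunTask call can be sketched as follows. The talk installs the AWS SDK for Java on the EMR cluster; this sketch uses the Python SDK (boto3) instead, purely for illustration. The cluster name, task definition name, container name, command line, and environment variable are all hypothetical. Only the request builder runs here; `submit` needs boto3, AWS credentials, and an existing ECS cluster.

```python
def build_run_task_request(cluster, task_definition, container_name, command, env=None):
    """Build keyword arguments for the ECS RunTask API (boto3 naming)."""
    override = {"name": container_name, "command": list(command)}
    if env:
        override["environment"] = [{"name": k, "value": v} for k, v in sorted(env.items())]
    return {
        "cluster": cluster,
        "taskDefinition": task_definition,
        "count": 1,
        "overrides": {"containerOverrides": [override]},
    }


def submit(request):
    """Send the request to ECS. Not called in this sketch."""
    import boto3  # imported lazily so the builder above works without the SDK

    ecs = boto3.client("ecs")
    response = ecs.run_task(**request)
    # RunTask reports tasks it could not place (for example, not enough
    # free resources in the cluster) in the "failures" list.
    return response["tasks"], response["failures"]


request = build_run_task_request(
    cluster="gpu-cluster",                    # hypothetical cluster name
    task_definition="dsstne-train",           # hypothetical, one task definition per framework
    container_name="dsstne",                  # hypothetical container name in that definition
    command=["train", "-c", "config.json"],   # hypothetical command line
    env={"INPUT_S3": "s3://bucketname/path-to-file/"},
)
print(request)
```

Keeping the request builder separate from the API call means swapping frameworks only changes the task definition name and the overrides, which matches the idea of one task definition per deep learning framework.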

Page 34: Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE

Training: Model parallel
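The transcript does not preserve the details of the model-parallel slides, so the following is only a generic illustration of the idea, not DSSTNE's partitioning scheme. It assumes the hidden-to-output weight matrix is split by output columns across devices; each device computes its slice of the output layer, and the slices are concatenated. All names and sizes are hypothetical.

```python
def matvec(h, weights):
    """Compute y[k] = sum over j of h[j] * weights[j][k]."""
    n_out = len(weights[0])
    return [sum(h[j] * weights[j][k] for j in range(len(h))) for k in range(n_out)]


def split_columns(weights, n_devices):
    """Give each device a contiguous slice of the output columns."""
    n_out = len(weights[0])
    bounds = [n_out * d // n_devices for d in range(n_devices + 1)]
    return [[row[bounds[d]:bounds[d + 1]] for row in weights] for d in range(n_devices)]


def model_parallel_matvec(h, weights, n_devices):
    """Each device holds only its slice of the weights; outputs are concatenated."""
    out = []
    for shard in split_columns(weights, n_devices):   # conceptually one GPU per shard
        out.extend(matvec(h, shard))
    return out


# Toy sizes: 4 hidden units, 10 output items, 3 devices.
weights = [[(j + 1) * 0.1 + k for k in range(10)] for j in range(4)]
h = [0.5, 0.25, 0.0, 1.0]

single = matvec(h, weights)
parallel = model_parallel_matvec(h, weights, n_devices=3)
print(single == parallel)
```

No device ever needs the full weight matrix, which is what makes this approach attractive when the parameters do not fit in a single GPU.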

Page 37: Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE

Prediction: Data parallel
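As with training, the transcript keeps only the slide title here, so this is a generic sketch of data-parallel prediction rather than the pipeline from the talk. It assumes every worker holds a full copy of a trained model and scores only its own shard of customers. The toy item-to-item weights, the customer IDs, and all function names are hypothetical.

```python
def score(purchased, weights):
    """Toy model: an item's score is the sum of weights from the purchased items."""
    n_items = len(weights[0])
    return [sum(weights[i][k] for i in purchased) for k in range(n_items)]


def top_k(purchased, weights, k):
    """Recommend the k best-scoring items the customer has not bought yet."""
    scores = score(purchased, weights)
    candidates = [item for item in range(len(scores)) if item not in purchased]
    return sorted(candidates, key=lambda item: (-scores[item], item))[:k]


def partition(customers, n_workers):
    """Round-robin split of customer IDs across workers."""
    ids = sorted(customers)
    return [ids[w::n_workers] for w in range(n_workers)]


def predict_data_parallel(customers, weights, n_workers, k=2):
    """Every worker uses the same full model and handles only its own shard."""
    results = {}
    for shard in partition(customers, n_workers):     # conceptually one task per worker
        for customer_id in shard:
            results[customer_id] = top_k(customers[customer_id], weights, k)
    return results


# Toy item-to-item weights for 5 items, and three hypothetical customers.
weights = [[0.0 if i == k else 1.0 / (1 + abs(i - k)) for k in range(5)] for i in range(5)]
customers = {"c1": [0], "c2": [4], "c3": [1, 3]}
print(predict_data_parallel(customers, weights, n_workers=2))
```

Since customers are scored independently, the result does not depend on the number of workers, so prediction can scale out by adding shards.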

Page 40: Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE

Why AWS?

Scalability

Fully-managed services

GPU instances

Page 41: Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE

Summary

Page 42: Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE

Amazon Personalization runs on AWS

Spark and Zeppelin provide a single interface for data scientists

DSSTNE helps run deep learning on huge sparse neural networks

Use Amazon EMR for CPU and Amazon ECS for GPU

You can do it!

Page 43: Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE

Thank you!