Maximizing Audience Engagement in Media Delivery (MED303) | AWS re:Invent 2013
DESCRIPTION
Providing a great media consumption experience to customers is crucial to maximizing audience engagement. To do that, it is important that you make content available for consumption anytime, anywhere, on any device, with a personalized and interactive experience. This session explores the power of big data log analytics (real-time and batched), using technologies like Spark, Shark, Kafka, Amazon Elastic MapReduce, Amazon Redshift, and other AWS services. Such analytics are useful for content personalization, recommendations, personalized dynamic ad insertion, interactivity, and streaming quality. This session also includes a discussion from Netflix, which explores personalized content search and discovery with the power of metadata.
TRANSCRIPT
© 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
Maximizing Audience Engagement in Media Delivery
Usman Shakeel, Amazon Web Services
Shobana Radhakrishnan, Engineering Manager at Netflix
November 14, 2013
Consumers Want …
• To watch content that matters to them
– From anywhere
– For free
• Content to be easily accessible
– Nicely organized according to their taste
– Instantly accessible from all different devices anywhere
• Content to be of high quality
– Without any “irrelevant” interruptions
– Personalized ads
• To multitask while watching content
– Interactivity, second screens, social media, sharing
Ultimate Customer Experience
• Ultimate content discovery
• Ultimate content delivery
Content Discovery
Content Choices Evolution
… And More
Content Discovery Evolution
… And More
Personalized Row Display
Similarity Algorithms
Unified Search
Content Delivery …
User Engagement in Online Video
[Source: Conviva Viewer Experience Report – 2013]
Personalized Ads?
Mountains of raw data …
Data Sources
• Content discovery
– Metadata
– Session logs
• Content delivery
– Video logs
– Page-click event logs
– CDN logs
– Application logs
• Computed along several high-cardinality dimensions
• Very large datasets for a specific time frame
Mountains of Raw Data …
Some numbers from Netflix:
– Over 40 million customers
– More than 50 countries and territories
– Translates to hundreds of billions of events in a short period of time
– Over 100 billion metadata operations a day
– Over 1 billion viewing hours per month
… To Useful Information ASAP
• Historical data – Batch Analysis
• Live data – Real-time Analysis
Historic vs. Real-time Analytics
100% Dynamic:
• Always computing on the fly
• Flexible but slow
• Scale is very hard
• Content delivery

100% Precomputed:
• Superfast lookups
• Rigid
• Does not cover all the use cases
• Content discovery
The Challenge
[Diagram: Mountains of raw data are ingested and streamed into real-time processing (feeding dashboards, user personalization, and the user experience) and back-end processing (feeding storage/DWH)]
Agenda
• Ultimate content discovery – how Netflix creates personalized content and the power of metadata
• Ultimate content delivery – the toolset for real-time big data processing
Content Discovery: Personalized Experience
Why Is Personalization Important?
• Key streaming metrics
– Median viewing hours
– Net subscribers
• Personalization consistently improves these
• Over 75% of what people watch comes from recommendations
What Is Video Metadata?
• Genre
• Cast
• Rating
• Contract information
• Streaming and deployment information
• Subtitles, dubbing, trailers, stills, actual content
• Hundreds of such attributes
Used For…
• User-specific choices, e.g., language, viewing behavior, taste preferences
• Recommendation algorithms
• Device-specific rendering and playback
• CDN deployment
• Billboards, trailer display, streaming
• Original programming
• Basically everything!
Our Solution
Mountains of Video Metadata → Batch Processing → Relevant, Organized Data for Consumption
• Data snapshots to Amazon S3 (~10)
• Metadata publishing engine (one EC2 instance per country) generates Amazon S3 facets (~10 per country)
• Metadata cache reads Amazon S3 periodically, serving high-availability apps deployed on EC2 (~2000 m2.4xl and m2.2xl instances)
Metadata Platform Architecture
[Diagram: Various metadata generation and entry tools put snapshot files (one per source) to Amazon S3; the offline metadata processing/publishing engine gets the snapshot files and puts facets (10 per country per cycle) to Amazon S3 (com.netflix.mpl.test.coldstart.services); playback devices, API, and algorithms on EC2 instances (m2.2xl or m2.4xl) get blobs (~7 GB files, 10 gets per instance refresh) named netflix.vms.blob.file.instancetype.region]
[Pipeline: Data entry and encoding tools → persistent storage → Amazon S3 → publishing engine → in-memory cache → apps. Metadata is entered and hourly data snapshots are taken; the publishing engine checks for snapshots, then generates and writes artifacts (efficient resource utilization, quick data propagation); apps make Java API calls against a periodically refreshed cache (low footprint, quick start-up/refresh)]
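The periodic cache refresh in this pipeline can be sketched in a few lines of Java. This is a minimal illustration only, assuming the AWS SDK for Java; the bucket and key names are invented, and the real system serves a NetflixGraph in-memory cache with circuit breakers rather than a raw byte array:

import com.amazonaws.services.s3.AmazonS3Client;
import com.amazonaws.services.s3.model.S3Object;
import com.amazonaws.util.IOUtils;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

public class MetadataCacheRefresher {
    private final AmazonS3Client s3 = new AmazonS3Client(); // default credential chain
    private final AtomicReference<byte[]> cache = new AtomicReference<>();

    public void start() {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // Check for new snapshots once an hour, mirroring the hourly snapshot cycle above
        scheduler.scheduleAtFixedRate(this::refresh, 0, 1, TimeUnit.HOURS);
    }

    private void refresh() {
        try {
            S3Object obj = s3.getObject("metadata-snapshots", "latest/blob.gz"); // hypothetical bucket and key
            cache.set(IOUtils.toByteArray(obj.getObjectContent()));
            obj.close();
        } catch (Exception e) {
            // On failure, keep serving the last good cache (circuit-breaker style)
        }
    }
}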
Initially… ~2000 EC2 instances
Target Application Scale
• File size 2 GB–15 GB
• ~10 per country (20 total)
• ~2000 instances (m2.2xl or m2.4xl) accessing these files once an hour via cache refresh from Amazon S3
• Availability goal: streaming 99.99%, sign-ups 99.9%
– 100% of metadata access in memory
– Autoscaling to efficiently manage startup time
And Then… 6000+ EC2 instances
Target Application Scale
• File size 2 GB–15 GB
• ~10 per country (500 total, up from 20)
• ~6000 instances (up from ~2000; m2.2xl or m2.4xl) accessing via cache refresh from Amazon S3
• 100% of access in memory to achieve high availability
• Autoscaling to efficiently manage startup time
Effects
• Slower file writes
• Longer publish time
• Slower startup and cache refresh
Amazon S3 Tricks That Helped
• Fewer writes
– Region-based publishing engine instead of per-country
– Blob images rather than facets
– 10 Amazon S3 writes per cycle (down from 500)
• Smaller file sizes
– Deduping moved to prewrite processing
– Compression: zipped data snapshot files from source
• Multipart writes (see the sketch below)
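With the AWS SDK for Java, multipart writes are easiest through TransferManager, which switches to multipart upload automatically for large files. A minimal sketch; the bucket, key, and file path are invented for illustration:

import com.amazonaws.services.s3.transfer.TransferManager;
import com.amazonaws.services.s3.transfer.Upload;
import java.io.File;

public class SnapshotUploader {
    public static void main(String[] args) throws InterruptedException {
        TransferManager tm = new TransferManager(); // default credential chain
        // Large files are split into parts and uploaded in parallel
        Upload upload = tm.upload("metadata-snapshots", "snapshots/source1.gz",
                new File("/tmp/source1.gz")); // hypothetical bucket, key, and file
        upload.waitForCompletion(); // blocks until every part has been uploaded
        tm.shutdownNow();
    }
}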
Results
• Significant reduction in average memory footprint
• Significant reduction in application startup times
• Shorter publish times
What We Learned
• In-memory cache (NetflixGraph) effective for high availability
• Startup time important when using autoscaling
• Use Amazon S3 best practices
• Circuit breakers
Future Architecture
• Dynamically controlled cache
• Dynamic code inclusion
• Parallelized publishing engine
Content Delivery
Toolset for real-time big-data processing
The Challenge
[Diagram: Mountains of raw data are ingested and streamed into real-time processing (feeding dashboards, user personalization, and the user experience) and back-end processing (feeding storage/DWH)]
Ingest and Stream
Amazon S3, Amazon SQS, Amazon DynamoDB, Amazon Kinesis, Kafka

Amazon Kinesis: enabling real-time ingestion and processing of streaming data
[Diagram: User data sources PUT records through an AWS VIP into Amazon Kinesis and its control plane; Amazon Kinesis-enabled applications on EC2 instances GET records – User App.1 (aggregate & de-duplicate), User App.2 (metric extraction), User App.3 (sliding window) – and write results to Amazon S3, DynamoDB, and Amazon Redshift]
A Quick Intro to Amazon Kinesis
Producers
• Generate a stream of data
• Data records from producers are put into a stream using a developer-supplied partition key, which places each record within a specific shard

Kinesis Cluster
• A managed service that captures and transports data streams
• Multiple shards; each supports 1 MB/sec
• Developer controls the number of shards; all records are stored for 24 hours
• High availability and durability via 3-way replication (across 3 Availability Zones)
• Each data record has a Kinesis-assigned sequence number

Workers
• Each shard is processed by a worker running on EC2 instances that the developer owns and controls

[Diagram: producers put records into shards S0–S4 of the Kinesis cluster; workers W1–W6 read and process the shards]
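On the producer side, putting a record into a stream with a partition key is a single call with the AWS SDK for Java. A minimal sketch; the stream name, partition key, and payload are invented for illustration:

import com.amazonaws.services.kinesis.AmazonKinesisClient;
import com.amazonaws.services.kinesis.model.PutRecordRequest;
import com.amazonaws.services.kinesis.model.PutRecordResult;
import java.nio.ByteBuffer;

public class KinesisPutExample {
    public static void main(String[] args) {
        AmazonKinesisClient kinesis = new AmazonKinesisClient(); // default credential chain

        PutRecordRequest put = new PutRecordRequest();
        put.setStreamName("video-session-logs"); // hypothetical stream name
        put.setPartitionKey("session-42"); // hashed to pick the shard that receives the record
        put.setData(ByteBuffer.wrap("BUFFER event payload".getBytes()));

        PutRecordResult result = kinesis.putRecord(put);
        // Kinesis assigns each record a sequence number within its shard
        System.out.println("Sequence number: " + result.getSequenceNumber());
    }
}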
Processing
Amazon EMR, Amazon Redshift
A Quick Intro to Storm
• Similar to a Hadoop cluster
• Topologies vs. jobs (Storm vs. Hadoop)
– A topology runs forever (unless you kill it)
storm jar all-my-code.jar backtype.storm.MyTopology arg1 arg2
• Streams – unbounded sequences of tuples
– e.g., transforming a stream of tweets into a stream of trending topics
– Spout: source of streams (e.g., connect to a log API and emit a stream of logs)
– Bolt: consumes any number of input streams, does some processing, and emits new streams (e.g., filters, unions, computations)
*Source: https://github.com/nathanmarz/storm/wiki/Tutorial
A Quick Intro to Storm
Example: get the count of ads that were clicked on and watched in a stream

LinearDRPCTopologyBuilder builder = new LinearDRPCTopologyBuilder("reach"); // Create a topology
builder.addBolt(new GetStream(), 3); // Get the stream that showed an ad. Transforms a stream of [id, ad] to [id, stream]
builder.addBolt(new GetViewers(), 12).shuffleGrouping(); // Get the viewers for ads. Transforms [id, stream] to [id, viewer]
builder.addBolt(new PartialUniquer(), 6).fieldsGrouping(new Fields("id", "viewer")); // Group the viewer stream by viewer id; unique count of subset viewers
builder.addBolt(new CountAggregator(), 2).fieldsGrouping(new Fields("id")); // Compute the aggregate count of unique viewers
*Adapted from: https://github.com/nathanmarz/storm/wiki/Tutorial
Putting it together …
[Diagram: User data sources PUT records through an AWS VIP into Amazon Kinesis and its control plane; Amazon Kinesis-enabled applications GET records – User App.1 (aggregate & de-duplicate), User App.2 (metric extraction), User App.3 (sliding window)]
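On the consuming side, each of these user apps would implement a record processor that the Kinesis Client Library invokes per shard. A minimal sketch against the KCL 1.x interfaces; the "BUFFER" filter and counting logic are invented for illustration:

import com.amazonaws.services.kinesis.clientlibrary.interfaces.IRecordProcessor;
import com.amazonaws.services.kinesis.clientlibrary.interfaces.IRecordProcessorCheckpointer;
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.ShutdownReason;
import com.amazonaws.services.kinesis.model.Record;
import java.nio.charset.StandardCharsets;
import java.util.List;

public class MetricExtractionProcessor implements IRecordProcessor {
    private String shardId;
    private long bufferEvents;

    public void initialize(String shardId) {
        this.shardId = shardId; // one processor instance per shard
    }

    public void processRecords(List<Record> records, IRecordProcessorCheckpointer checkpointer) {
        for (Record r : records) {
            String payload = StandardCharsets.UTF_8.decode(r.getData()).toString();
            if (payload.startsWith("BUFFER")) bufferEvents++; // e.g., count buffering events
        }
        try {
            checkpointer.checkpoint(); // record progress so a replacement worker can resume here
        } catch (Exception e) {
            // Log and retry on the next batch
        }
    }

    public void shutdown(IRecordProcessorCheckpointer checkpointer, ShutdownReason reason) {
        // Checkpoint on graceful shutdown, e.g., when a shard is merged or split
    }
}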
A Quick Intro to Spark
• Language-integrated interface in Scala
• General-purpose programming interface that can be used for interactive data mining on clusters
• Example (count buffer events from a streaming log):

val lines = spark.textFile("hdfs://...") // define a dataset from log files
val errors = lines.filter(_.startsWith("BUFFER")) // keep only buffer events
errors.persist() // persist in memory
errors.count()
errors.filter(_.contains("Stream1")).count()
errors.cache() // cache datasets in memory to speed up reuse
*Source: Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
A Quick Intro to Spark
Logistic regression performance (30 GB dataset, 80-core cluster):
• Hadoop: 127 s per iteration
• Spark: first iteration 174 s, further iterations 6 s
• Up to 20x faster than Hadoop for interactive jobs
• Scan a 1 TB dataset with 5–7 sec latency
*Source: Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
Conviva GeoReport
• Aggregations on many keys with the same WHERE clause
• 40× gain comes from:
– Not rereading unused columns or filtered records
– Avoiding repeated decompression
– In-memory storage of deserialized objects
[Chart: query time (hours)]
Back-end Storage
Amazon Redshift, Amazon S3, Amazon DynamoDB, Amazon RDS, HDFS on Amazon EMR
Batch Processing
• Amazon EMR – Hive on EMR
– Custom UDFs (user-defined functions) for data warehouse needs (a minimal sketch follows below)
• Amazon Redshift – more traditional data warehousing workloads
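A custom Hive UDF is just a Java class with an evaluate method, registered in Hive with ADD JAR and CREATE TEMPORARY FUNCTION. A minimal sketch; the tab-separated log format and the CDN field position are invented for illustration:

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// Extracts the CDN name from a delivery log line (hypothetical format)
public final class ExtractCdn extends UDF {
    public Text evaluate(Text logLine) {
        if (logLine == null) return null;
        String[] fields = logLine.toString().split("\t");
        return fields.length > 2 ? new Text(fields[2]) : null;
    }
}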
The Challenge
[Diagram: Mountains of raw data are ingested and streamed into real-time processing (feeding dashboards, user personalization, and the user experience) and back-end processing (feeding storage/DWH)]
The Solution
[Diagram: Mountains of raw data → Ingest & Stream (Amazon S3, Amazon SQS, Amazon DynamoDB, Kafka, Amazon Kinesis) → Real-time Processing (Storm, Spark, Amazon EMR, Amazon Redshift) → Back-end Processing and Storage/DWH (Amazon S3, Amazon RDS, Amazon DynamoDB, Amazon Redshift, Amazon EMR) → Audience Engagement]
What’s next …
• On-the-fly adaptive bit rate; in the future, adaptive frame rate and resolution
• Dynamic personalized ad insertions
• Session analytics for more monetization opportunities
• Social media chat
Pick up the remote
Start watching
Ultimate entertainment experience…
Please give us your feedback on this presentation.
As a thank you, we will select prize winners daily for completed surveys!
MED303