Scalable Media Processing in the Cloud (MED302) | AWS re:Invent 2013


© 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.

Scalable Media Processing

Phil Cluff, British Broadcasting Corporation

David Sayed, Amazon Web Services

November 13, 2013

Agenda

• Media workflows

• Where AWS fits

• Cloud media processing approaches

• BBC iPlayer in the cloud

Media Workflows

[Diagram: an end-to-end media workflow, from source assets (2D movie, 3D movie, featurettes, interviews, stills, archive materials) through processing to distribution outputs (networks, theatrical, DVD/BD, online, mobile apps, MSOs, archive)]

Where AWS Fits Into Media Processing

[Diagram: Amazon Web Services underpins the full media processing chain (Ingest, Index, Process, Package, Protect, QC, Auth., Track, Playback), together with Media Asset Management, and Analytics and Monetization]

Media Processing Approaches

3 Phases

Cloud Media Processing Approaches

Phase 1: Lift processing from the premises and shift to the cloud

Lift and Shift

[Diagram: an on-premises media processing operation (application, OS, and storage) is moved unchanged onto Amazon EC2 instances]

The Problem with Lift and Shift

[Diagram: the monolithic media processing operation (ingest, the atomic processing operation, post-processing, export, workflow, and parameters, together with OS and storage) runs inside a single EC2 instance]

Cloud Media Processing Approaches: Phase 2

Phase 1: Lift processing from the premises and shift to the cloud

Phase 2: Refactor and optimize to leverage cloud resources

Refactor and Optimization Opportunities

• “Deconstruct monolithic media processing operations”
– Ingest
– Atomic media processing operation
– Post-processing
– Export
– Workflow
– Parameters

Refactoring and Optimization Example

[Diagram: Amazon SWF coordinates the workflow via API calls to a fleet of EC2 instances (each with EBS), which read from a source S3 bucket and write results to an output S3 bucket]
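As a rough illustration of this pattern (not code from the talk), a workflow execution could be started with the AWS SDK for Java; the domain, workflow type, task list, and bucket names below are hypothetical placeholders:

import com.amazonaws.services.simpleworkflow.AmazonSimpleWorkflowClient;
import com.amazonaws.services.simpleworkflow.model.StartWorkflowExecutionRequest;
import com.amazonaws.services.simpleworkflow.model.TaskList;
import com.amazonaws.services.simpleworkflow.model.WorkflowType;

public class StartTranscodeWorkflow {
    public static void main(String[] args) {
        // Credentials come from the default provider chain.
        AmazonSimpleWorkflowClient swf = new AmazonSimpleWorkflowClient();

        // Hypothetical domain, workflow type, and task list names; the input
        // points the workers at the source object in S3.
        StartWorkflowExecutionRequest request = new StartWorkflowExecutionRequest()
                .withDomain("media-processing")
                .withWorkflowId("transcode-" + System.currentTimeMillis())
                .withWorkflowType(new WorkflowType().withName("TranscodeWorkflow").withVersion("1.0"))
                .withTaskList(new TaskList().withName("transcode-deciders"))
                .withInput("{\"sourceBucket\":\"source-mezzanine\",\"key\":\"programme.ts\"}");

        swf.startWorkflowExecution(request);
    }
}

The deciders and activity workers polling that task list would then drive the EC2-based processing steps shown in the diagram.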

Cloud Media Processing Approaches

Phase 1: Lift processing from the premises and shift to the cloud

Phase 2: Refactor and optimize to leverage cloud resources

Phase 3: Decomposed, modular cloud-native architecture

Decomposition and Modularization Ideas for Media Processing

• Decouple *everything* that is not part of the atomic media processing operation

• Use managed services where possible for workflow, queues, databases, etc.

• Manage
– Capacity
– Redundancy
– Latency
– Security

BBC iPlayer in the Cloud
AKA “Video Factory”

Phil Cluff

Principal Software Engineer & Team Lead

BBC Media Services

• The UK’s biggest video & audio on-demand service
– And it’s free!

• Over 7 million requests every day
– ~2% of overall consumption of BBC output

• Over 500 unique hours of content every week
– Available immediately after broadcast, for at least 7 days

• Available on over 1000 devices including
– PC, iOS, Android, Windows Phone, Smart TVs, Cable Boxes…

• Both streaming and download (iOS, Android, PC)

• 20 million app downloads to date

Sources:

BBC iPlayer Performance Pack August 2013

http://www.bbc.co.uk/blogs/internet/posts/Video-Factory

[Video: “Where Next?”]

What Is Video Factory?

• Complete in-house rebuild of ingest, transcode, and delivery workflows for BBC iPlayer

• Scalable, message-driven, cloud-based architecture

• The result of 1 year of development by ~18 engineers

And here they are!

Why Did We Build Video Factory?

• Old system
– Monolithic
– Slow
– Couldn’t cope with spikes
– Mixed ownership with third party

• Video Factory
– Highly scalable, reliable
– Completely elastic transcode resource
– Complete ownership

Why Use the Cloud?

• Background of 6 channels, spikes up to 24 channels, 6 days a week

• A perfect pattern for an elastic architecture

[Chart: off-air transcode requests for one week]

Video Factory – Architecture

• Entirely message driven
– Amazon Simple Queue Service (SQS)
– Some Amazon Simple Notification Service (SNS)
– We use lots of classic message patterns

• ~20 small components
– Singular responsibility: “Do one thing, and do it well”
– Share libraries if components do things that are alike
– Control bloat
– Components have contracts of behavior
– Easy to test

Video Factory – Workflow

[Diagram: 24 SDI broadcast video feeds pass through broadcast encoders and an RTP chunker into an Amazon S3 mezzanine store (the time-addressable media store); a playout data feed drives the live ingest logic; a transcode abstraction layer routes work to Amazon Elastic Transcoder and Elemental Cloud, producing distribution renditions in Amazon S3; downstream steps include DRM, QC, editorial clipping, and MAM; metadata and SMPTE timecode flow alongside the playout, mezzanine, and transcoded video]

Detail

• Mezzanine video capture
• Transcode abstraction
• Eventing demonstration

Mezzanine Video Capture

Mezzanine Capture

[Diagram: 24 SDI broadcast video feeds are encoded by broadcast-grade encoders as MPEG-2 Transport Stream (H.264) on RTP multicast (30 MB HD / 10 MB SD); an RTP chunker, referencing SMPTE timecode, produces transport stream chunks which a chunk uploader writes to Amazon S3; a chunk concatenator, driven by control messages, assembles them into mezzanine files (3 GB HD / 1 GB SD) in Amazon S3]

Concatenating Chunks

• Build file using Amazon S3 multipart requests (see the sketch below)
– 10 GB mezzanine file constructed in under 10 seconds

• Amazon S3 multipart APIs are very helpful
– Component only makes REST API calls
– Small instances still give very high performance

• Be careful: Amazon S3 isn’t immediately consistent when dealing with multipart-built files
– Mitigated with rollback logic in message-based applications
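A minimal sketch of this concatenation with the AWS SDK for Java: each existing chunk object becomes one part of a multipart upload via a server-side copy, so no chunk data flows through the instance. Bucket names, keys, and the chunk count are hypothetical.

import com.amazonaws.services.s3.AmazonS3Client;
import com.amazonaws.services.s3.model.*;
import java.util.ArrayList;
import java.util.List;

public class ChunkConcatenator {
    public static void main(String[] args) {
        AmazonS3Client s3 = new AmazonS3Client();
        String bucket = "mezzanine-store";               // hypothetical bucket
        String destKey = "programme-123/mezzanine.ts";   // hypothetical destination key

        // Start the multipart upload for the concatenated mezzanine file.
        String uploadId = s3.initiateMultipartUpload(
                new InitiateMultipartUploadRequest(bucket, destKey)).getUploadId();

        // Copy each chunk object in as a part. All parts except the last must
        // be at least 5 MB, which large transport stream chunks easily satisfy.
        List<PartETag> partETags = new ArrayList<PartETag>();
        for (int part = 1; part <= 10; part++) {
            CopyPartResult result = s3.copyPart(new CopyPartRequest()
                    .withSourceBucketName(bucket)
                    .withSourceKey("programme-123/chunk-" + part + ".ts")
                    .withDestinationBucketName(bucket)
                    .withDestinationKey(destKey)
                    .withUploadId(uploadId)
                    .withPartNumber(part));
            partETags.add(result.getPartETag());
        }

        // Complete the upload; S3 assembles the parts into a single object.
        s3.completeMultipartUpload(
                new CompleteMultipartUploadRequest(bucket, destKey, uploadId, partETags));
    }
}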

By Numbers – Mezzanine Capture

• 24 channels
– 6 HD, 18 SD
– 16 TB of mezzanine data every day, per capture

• 200,000 chunks every day
– And Amazon S3 has never lost one
– That’s ~2 (UK) billion RTP packets every day… per capture

• Broadcast-grade resiliency
– Several data centers / 2 copies each

Transcode Abstraction

Transcode Abstraction

• Abstract away from a single supplier
– Avoid vendor lock-in
– Choose suppliers based on performance, quality, and broadcaster-friendly feature sets
– BBC: Elemental Cloud (GPU), Amazon Elastic Transcoder, in-house for subtitles

• Smart routing & smart bundling (see the routing sketch below)
– Save money on non-time-critical transcode
– Save time & money by bundling together “like” outputs

• Hybrid cloud friendly
– Route a baseline of transcode to local encoders, and spike to cloud

• Who has the next game changer?
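As an illustration only (the class, method, and criteria below are hypothetical, not the BBC's actual code), a router in this style inspects attributes of each request and picks a backend:

// Hypothetical sketch of backend selection inside a transcode router.
public class TranscodeRouter {

    enum Backend { ELASTIC_TRANSCODER, ELEMENTAL_CLOUD, SUBTITLE_EXTRACTION }

    // Choose a backend from attributes carried on the request message.
    Backend route(TranscodeRequest request) {
        if (request.isSubtitleOnly()) {
            return Backend.SUBTITLE_EXTRACTION;   // in-house subtitle pipeline
        }
        if (request.needsGpuFeatures() || request.isTimeCritical()) {
            return Backend.ELEMENTAL_CLOUD;       // broadcaster-friendly feature set
        }
        return Backend.ELASTIC_TRANSCODER;        // cheaper for non-time-critical work
    }

    // Minimal request shape assumed for the sketch.
    interface TranscodeRequest {
        boolean isSubtitleOnly();
        boolean needsGpuFeatures();
        boolean isTimeCritical();
    }
}

The same decision point is also where bundling of “like” outputs and hybrid routing to local encoders could be applied.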

Transcode Abstraction

[Diagram: a transcode request arrives over SQS at the transcode router, which dispatches to an Amazon Elastic Transcoder backend (REST), an Elemental Cloud backend (SQS), or a subtitle extraction backend; all backends read mezzanine files from Amazon S3 and write distribution renditions back to Amazon S3]

Transcode Abstraction – Future

[Diagram: the same routing topology, extended with a placeholder for an unknown future “Backend X” alongside the Amazon Elastic Transcoder, Elemental Cloud, and subtitle extraction backends]

Example – A Simple Elastic Transcoder Backend

[Diagram: within a single SQS message transaction, the backend gets a message from the queue, unmarshals and validates the XML transcode request, initializes the transcode with Amazon Elastic Transcoder, waits for the SNS callback (delivered as an HTTP POST), and then posts an XML transcode status message]
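For the “initialize transcode” step, a minimal sketch with the AWS SDK for Java might look like the following; the pipeline ID, preset ID, and object keys are hypothetical placeholders, and in the real workflow they would come from the validated request message:

import com.amazonaws.services.elastictranscoder.AmazonElasticTranscoderClient;
import com.amazonaws.services.elastictranscoder.model.CreateJobOutput;
import com.amazonaws.services.elastictranscoder.model.CreateJobRequest;
import com.amazonaws.services.elastictranscoder.model.JobInput;

public class ElasticTranscoderBackend {
    public static void main(String[] args) {
        AmazonElasticTranscoderClient ets = new AmazonElasticTranscoderClient();

        // The pipeline ties together the input/output buckets and the SNS
        // topics that deliver the completion/error callbacks.
        CreateJobRequest job = new CreateJobRequest()
                .withPipelineId("1111111111111-abcde1")                  // hypothetical pipeline ID
                .withInput(new JobInput().withKey("mezzanine/programme-123.ts"))
                .withOutput(new CreateJobOutput()
                        .withKey("renditions/programme-123-1500k.mp4")
                        .withPresetId("1351620000001-000010"));          // hypothetical preset ID

        String jobId = ets.createJob(job).getJob().getId();
        // The component then waits for the SNS-delivered job status callback
        // before emitting its own XML transcode status message.
        System.out.println("Started Elastic Transcoder job " + jobId);
    }
}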

Example – Add Error Handling

[Diagram: the same flow, with messages that fail to unmarshal or validate routed to a bad message queue, repeated delivery failures routed to a dead letter queue, and known failure modes routed to a fail queue]

Example – Add Monitoring Eventing

[Diagram: the same flow again, now emitting monitoring events from each stage of the SQS message transaction and from the error-handling paths]

BBC Eventing Framework

• Key-value pairs pushed into Splunk
– Business-level events, e.g.:
• Message consumed
• Transcode started
– System-level events, e.g.:
• HTTP call returned status 404
• Application’s heap size
• Unhandled exception

• Fixed model for “context” data
– Identifiable workflows, grouping of events; transactions
– Saves us a LOT of time diagnosing failures
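A minimal sketch of what such an event could look like on the wire (the field names and values are illustrative, not the BBC's actual schema): each event is a flat set of key-value pairs, with shared context fields so Splunk can group events into workflows and transactions.

import java.util.LinkedHashMap;
import java.util.Map;

public class EventLogger {
    // Emit one business-level event as key=value pairs for Splunk to index.
    public static void main(String[] args) {
        Map<String, String> event = new LinkedHashMap<String, String>();
        // Context fields shared by every event in the same workflow/transaction.
        event.put("workflowId", "transcode-programme-123");   // illustrative value
        event.put("transactionId", "sqs-msg-456");            // illustrative value
        event.put("component", "elastic-transcoder-backend");
        // The event itself.
        event.put("event", "transcode_started");
        event.put("timestamp", String.valueOf(System.currentTimeMillis()));

        StringBuilder line = new StringBuilder();
        for (Map.Entry<String, String> field : event.entrySet()) {
            line.append(field.getKey()).append('=').append(field.getValue()).append(' ');
        }
        System.out.println(line.toString().trim());  // picked up by a log forwarder
    }
}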

Component Development – General Development & Architecture

• Java applications
– Run inside Apache Tomcat on m1.small EC2 instances
– Run at least 3 of everything
– Autoscale on queue depth

• Built on top of the Apache Camel framework (see the route sketch below)
– A platform for building message-driven applications
– Reliable, well-tested SQS backend
– Camel route builders use a Java DSL
– Full of messaging patterns

• Developed with Behavior-Driven Development (BDD) & Test-Driven Development (TDD)
– Cucumber

• Deployed continuously
– Many times a day, 5 days a week
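A minimal Camel route builder sketch in that style (the queue name and bean references are hypothetical; the real components also carry validation, error handling, and eventing):

import org.apache.camel.builder.RouteBuilder;

public class TranscodeRequestRoute extends RouteBuilder {
    @Override
    public void configure() throws Exception {
        // Consume transcode requests from SQS and hand them to the backend
        // bean that validates the XML and starts the transcode job.
        from("aws-sqs://transcode-request?amazonSQSClient=#sqsClient")
            .routeId("transcode-request-route")
            .log("Received transcode request: ${body}")
            .to("bean:transcodeBackend?method=initializeTranscode");
    }
}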

Error Handling Messaging Patterns

• We use several message patterns
– Bad message queue
– Dead letter queue
– Fail queue

• Key concept: never lose a message
– A message is either in-flight, done, or in an error queue somewhere

• All require human intervention for the workflow to continue
– Not necessarily a bad thing

Message Patterns – Bad Message Queue

• Wrapped in a message wrapper which contains context
• Never retried
• Very rare in production systems
• Implemented as an exception handler on the route builder

The message doesn’t unmarshal to the object it should, OR we could unmarshal the object, but it doesn’t meet our validation rules.

Message Patterns – Dead Letter Queue

• Message is an exact copy of the input message
• Retried several times before being put on the DLQ
• Can be common, even in production systems
• Implemented as a bean in the route builder for SQS

We tried processing the message a number of times, and something we weren’t expecting went wrong each time.

Message Patterns – Fail Queue

• Wrapped in a message wrapper that contains context
• Requires some level of knowledge of the system to be retried
• Often evolve from understanding the causes of DLQ’d messages
• Implemented as an exception handler on the route builder (see the sketch below)

Something I knew could go wrong went wrong.
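A minimal sketch of how the bad message and fail queue patterns could be expressed as exception handlers on a Camel route builder (the exception classes and queue names are hypothetical, and the wrapping of context around the message body is elided):

import org.apache.camel.builder.RouteBuilder;

public class ErrorHandlingRoute extends RouteBuilder {
    @Override
    public void configure() throws Exception {
        // Bad message queue: the payload could not be unmarshalled or failed
        // validation, so it is wrapped with context and never retried.
        onException(InvalidMessageException.class)
            .handled(true)
            .to("aws-sqs://transcode-request-bad-message");

        // Fail queue: a known failure mode occurred; a human decides whether
        // and how to retry.
        onException(KnownTranscodeFailureException.class)
            .handled(true)
            .to("aws-sqs://transcode-request-fail");

        from("aws-sqs://transcode-request?amazonSQSClient=#sqsClient")
            .to("bean:transcodeBackend?method=initializeTranscode");
        // Anything else is left to SQS redelivery; after the configured number
        // of attempts the message lands on the dead letter queue.
    }

    // Hypothetical exception types for the sketch.
    public static class InvalidMessageException extends RuntimeException { }
    public static class KnownTranscodeFailureException extends RuntimeException { }
}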

Demonstration – Eventing Framework

Questions?

[email protected]

[email protected] @GeneticGenesis

@dsayed

Please give us your feedback on this presentation.
As a thank you, we will select prize winners daily for completed surveys!

MED302