Lessons Learned: Building Scalable Applications with the Windows Azure Platform Simon Davies Windows Azure TSP Microsoft Corporation SVC32

Upload: paul-oneal

Post on 13-Dec-2015



Agenda

> Objectives of this session
> Thoughts on scalability in the cloud
> Real World Lessons Learned
    > Thuzi
    > RiskMetrics
> Summary
> Questions and Answers

Scalability in the Cloud

> Scalability == work / resources
> Windows Azure makes adding AND REMOVING resources dynamic
> This, along with the business model, changes things
> Capacity planning becomes dynamic
> Utilisation levels are important
> Definition of scale is different depending on application type and workload arrival characteristics

Scaling Facebook Apps in the Azure Cloud

Jim Zimmerman

CTO / Lead Developer

Thuzi.com


Who is Thuzi?

> We develop customized viral marketing solutions, utilizing a variety of technologies that engage users and measure results.

> We ensure maximum scalability through exploiting the latest virtual computing by using Microsoft's Azure Platform and Tools

Facebook Viral Application Needs

> Support for thousands of users virtually overnight … our models predicted geometric adoption

> The success of one of our clients could not be the failure for others … requirement for distinct computing environments for each Thuzi customer

> Our job is to turn social media data into real business information … must have a robust back end for reporting detailed analytics

Facebook Viral Application Needs

> Thuzi builds cool social media web apps and we don’t know much about running data centers … besides we didn’t want to purchase extra servers “just in case”

> A consistent user experience was mandatory … social media users don’t like to wait

Hosting Options

> Our own data center – Is too expensive and with unpredictable growth, hard to plan for

> Google – Didn’t have a familiar programming environment

> Amazon – Could use Windows VM’s, but did not have as many features as we wanted

> Azure - Familiar Microsoft Technologies

Technology

Outback DEMO

The Results

[Chart: Fan Growth over Time; y-axis: Fans, 0 to 400,000]

Lessons Learned

> Trace everything!
    > Errors, Debug Info
    > You will upgrade later as you start to ask questions about how your app is behaving
> Track Perf Counters
    > CPU Usage, Req/sec, memory usage
> Use Worker roles to move data from Queues to table storage and SQL Azure
    > SQL is easier to report on
    > Table storage allows more scalability
> Deployment
    > Upgrade Manually
    > When moving to production, use the VIP Swap feature
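The worker-role lesson above (drain a queue into both table storage and SQL) can be sketched in a few lines. This is a minimal illustration only, not Thuzi's code: `queue.Queue` stands in for an Azure queue, a dict keyed by (partition key, row key) stands in for table storage, and an in-memory `sqlite3` database stands in for SQL Azure. The event fields and the function names are hypothetical.

```python
import json
import queue
import sqlite3


def drain_queue(events, table, db, max_items=100):
    """Move up to max_items queued events into both stores; return the count moved."""
    moved = 0
    while moved < max_items:
        try:
            raw = events.get_nowait()
        except queue.Empty:
            break
        event = json.loads(raw)
        # Table-storage stand-in: one entity per (partition key, row key).
        table[(event["app"], event["id"])] = event
        # SQL stand-in: a flat row that is easy to aggregate in reports.
        db.execute(
            "INSERT INTO events (id, app, action) VALUES (?, ?, ?)",
            (event["id"], event["app"], event["action"]),
        )
        moved += 1
    db.commit()
    return moved


def demo():
    """Queue three hypothetical events, drain them, and report on the SQL side."""
    events = queue.Queue()
    for i in range(3):
        events.put(json.dumps({"id": str(i), "app": "demo-app", "action": "like"}))
    table = {}
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE events (id TEXT, app TEXT, action TEXT)")
    moved = drain_queue(events, table, db)
    likes = db.execute(
        "SELECT COUNT(*) FROM events WHERE action = 'like'"
    ).fetchone()[0]
    return moved, len(table), likes
```

Writing each event to both stores mirrors the trade-off on the slide: the keyed store scales out, while the SQL rows are the ones that are easy to report on.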

Tracing

> config.DiagnosticInfrastructureLogs.ScheduledTransferLogLevelFilter = LogLevel.Error;

> config.DiagnosticInfrastructureLogs.ScheduledTransferPeriod = TimeSpan.FromMinutes(5);

> config.Logs.ScheduledTransferLogLevelFilter = LogLevel.Error;

> config.Logs.ScheduledTransferPeriod = TimeSpan.FromMinutes(5);

Performance Monitoring

> var cpuUsage = new PerformanceCounterConfiguration();
> cpuUsage.CounterSpecifier = @"\Processor(_Total)\% Processor Time";
> cpuUsage.SampleRate = TimeSpan.FromSeconds(5);
> var pccMemory = new PerformanceCounterConfiguration();
> pccMemory.CounterSpecifier = @"\Memory\Available Mbytes";
> pccMemory.SampleRate = TimeSpan.FromSeconds(5);
> var requestsPerSec = new PerformanceCounterConfiguration();
> requestsPerSec.CounterSpecifier = @"\ASP.NET Applications(__Total__)\Requests/Sec";
> requestsPerSec.SampleRate = TimeSpan.FromSeconds(5);
> config.PerformanceCounters.DataSources.Add(cpuUsage);
> config.PerformanceCounters.DataSources.Add(pccMemory);
> config.PerformanceCounters.DataSources.Add(requestsPerSec);
> config.PerformanceCounters.ScheduledTransferPeriod = TimeSpan.FromMinutes(5);

Deployment

> Upload new package to staging
    > Wait for all roles to be ready
> Use VIP Swap to upgrade and deploy to production
    > Rewrites the load balancer to swap staging with production
    > If anything is wrong, you can swap back

Tools Needed

> Needed to be able to manage records in table storage for testing

> Needed to be able to download logs from table storage for tracing and perf counters

> Azure Storage Explorer (CodePlex) - Free
> Cloud Storage Studio - Cost
> Do LINQ queries against table storage to get specific info when needed.

In Summary

> Azure provides Thuzi a competitive advantage … so please don’t tell the other social media marketing companies and let us enjoy our 15 minute advantage

Building Scalable Applications using Windows Azure: RiskMetrics RiskBurst™

Rob Fraser and Phil Jacob

RiskMetrics Group

www.riskmetrics.com


www.riskmetrics.com 20

RiskMetrics RiskBurst™

RiskMetrics Group

Offers industry-leading products and services in the disciplines of risk management, corporate governance and financial research & analysis

Scaling on-premise computation to the Cloud

Integration of RiskMetrics' extensive on-premise capability with Windows Azure

We are running on 2,000 instances on Windows Azure

We have plans to use 10,000+ instances in 2010

What are RiskMetrics doing with so much computing power?

Calculation of financial risk

Simulate scenarios for the movement of market factors over time & price financial assets in those scenarios

Notoriously complex – can involve Monte Carlo² for complex asset classes of the kind that triggered the 'credit crunch'

Results in very high computational loads for RiskMetrics

Daily risk analysis load equivalent to calculating risk on 4 trillion US Stocks

Computational loads are characterised by high demand peaks

Strong growth trend in calculation complexity


Peak Load Characteristics


Growth trend in calculation complexity

[Chart: Maximum Complexity of Risk Analysis Processing Request, in Relative Equity Equivalent Units (log scale, 0 to 10), plotted from 1994 to 2008. Annotations: "Risk problem complexity has doubled every 6 months"; "Moore's Law: processor power doubles every 2 years".]
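To see why the two doubling rates on this chart diverge so sharply, the arithmetic can be worked through directly. The sketch below assumes, purely for illustration, that both rates held steadily over the whole 1994 to 2008 axis; the slide itself does not state that, and the function name is made up here.

```python
def growth_factor(years, doubling_period_years):
    """Total growth after `years` when a quantity doubles every `doubling_period_years`."""
    return 2 ** (years / doubling_period_years)


YEARS = 2008 - 1994                      # span of the chart's x-axis
complexity = growth_factor(YEARS, 0.5)   # doubles every 6 months -> 2**28
processor = growth_factor(YEARS, 2.0)    # doubles every 2 years  -> 2**7
gap = complexity / processor             # shortfall a single processor cannot cover
```

Under that assumption the gap is 2**21, roughly a factor of two million, which is the part of the workload that has to be met by adding machines rather than waiting for faster ones.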


Analytics Architecture: Large-Scale Data Dependent Processing vs. Distributable Work Packets

[Diagram components: Load Balancer; Market and Pricing Data; Velocity Scenario Cache; RiskServer (x5); Pricer (x6)]

Scenario Generation and Aggregation: These services depend on high-speed access to large-scale data stores and caches

Scenario Pricing: Work Packets are self-contained


Work Packet Example: Pricing request for a Mortgage Backed Security

Work Packet

Operation: Request price for the asset in specified market scenario

Asset
• Asset id
• Asset description
• Collateral description
• Size: 1KB

Scenario
• Interest rate points
• Swaption Volatilities
• Overrides for explicit stresses
• Size: 5KB

Response Packet

Price for asset
• Size: 1KB

Logging
• Optional diagnostics
• Exceptions
• Size: Large

Compute Time: 150ms - 30s
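The work packet and response packet listed above map naturally onto a pair of small record types. The sketch below is one possible shape, not RiskMetrics' actual schema: every class name, field name, and sample value is hypothetical, chosen only to mirror the bullets on the slide.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Asset:
    asset_id: str
    description: str
    collateral_description: str


@dataclass
class Scenario:
    interest_rate_points: Dict[str, float]
    swaption_volatilities: Dict[str, float]
    stress_overrides: Dict[str, float] = field(default_factory=dict)


@dataclass
class WorkPacket:
    """Self-contained request: everything a pricer needs travels with it."""
    operation: str
    asset: Asset
    scenario: Scenario


@dataclass
class ResponsePacket:
    price: Optional[float] = None
    diagnostics: List[str] = field(default_factory=list)   # optional logging
    exceptions: List[str] = field(default_factory=list)

    @property
    def ok(self):
        """True when a price came back and no exception was reported."""
        return self.price is not None and not self.exceptions


def example_packet():
    """Build an illustrative pricing request (all values are made up)."""
    return WorkPacket(
        operation="price",
        asset=Asset("MBS-001", "Mortgage backed security", "Residential pool"),
        scenario=Scenario(
            interest_rate_points={"1Y": 0.010, "5Y": 0.025},
            swaption_volatilities={"1Yx5Y": 0.20},
        ),
    )
```

Because the packet carries its own asset and scenario, any pricer instance can process it without reaching back to the on-premise data stores.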


Analytics Architecture: Integration of Cloud Resources?

[Diagram components: Load Balancer; Market and Pricing Data; Velocity Scenario Cache; RiskServer (x5); Pricer (x12)]

Scenario Generation and Aggregation: These services depend on high-speed access to large-scale data stores and caches

Scenario Pricing: Work Packets are self-contained


RiskBurst™ Project Timeline

March - June
• Project Conception
• Choice of Platform

July - August
• Initial MSFT Meetings
• RiskMetrics joins TAP
• TAP team actively involved in architecture decisions

September - October
• Engineering work on scaling proof of concept
• Deep-dive sessions
• Large-scale testing with test load (200-2000 nodes)
• 'Industrialisation' of architectural pattern

November - December
• Large-scale UAT using load application
• Complete work on operational integration

Q1 2010
• Run parallel with in-house solution
• Production


RiskBurst™

An architectural pattern for large scale computational applications


Architectural Pattern

Building large scale computation requires careful design

Problem: Need to avoid the Von Neumann Bottleneck

Keywords: Reason and Instrument

No changes to the application

Run on-premise on HPC Server or in cloud on Azure

Pattern has end-to-end decoupling

Horizontal scaling of decoupled components

[Diagram: three decoupled tiers, each replicated horizontally: Workload Generation; Messaging & Storage; Computational Resources & Application]


RiskBurst™ Workflow: Windows Azure & HPC Server

[Diagram components: Scenario Generator; RiskBurst™ Server (Workload Receiver; Batching and Sending; Outstanding Request Timeout Sweeper; Worker Output Monitoring); Windows Azure Input Queue(s); Windows Azure Output Queue(s). Message labels: WCF Request; WCF Response; WCF Error Response; Input Message; Output Message]


Windows Azure Storage Component Usage

[Diagram components: RiskBurst Server; Input Queues (To Do Jobs), shown as multiple Azure Queues; Input Blob Storage; Worker Role Instances, each with local storage data; Support files in Blob Storage; Output Queues (Job done), shown as Azure Queues; Output Blob Storage]


Mapping to the Azure Environment

Visual Studio 2008 Azure development SDK mimics cloud

Mix code running in dev locally, with cloud resources such as Blob storage or queues

Good for features, does not assist with scale

Existing 32-bit .NET C++/CLI application with 3 third-party libraries

Initial idea - run directly in web-role – but 32-bit(!)

Run within worker role

Preserve WCF interface – no changes whatsoever to analytics app

Only changes to existing code base are:

Retrieve Cash-flow library support files from Blob storage on demand

Some diagnostic information added


Getting to Cloud Resources: Bandwidth & Latency

Problem: Bandwidth to Azure gateway limited by Internet

Solution: pass by reference & blobs

Replace pass-by-value calls with pass-by-reference

Create key for scenario

Large, repeated objects (scenarios) pushed to blob storage

WCF call contains only key

Each of 1000 scenarios, used for all assets

Problem: Communications Latency

Within data centre, 20ms latency on WCF call through HPC SOA platform

Queues and Blob storage are off-device; engineering must respect this!

Work packet: 200ms computation

Solution: batch requests within input queues

But, more simultaneous work requests (threads outstanding on input)
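The pass-by-reference idea above can be illustrated with a short sketch: the large scenario is stored once under a key, and each request carries only that key. This is an assumption-laden illustration rather than the RiskBurst implementation. A plain dict stands in for blob storage, JSON stands in for the WCF message body, and deriving the key from a SHA-256 content hash is a choice made here, since the slide only says "create key for scenario".

```python
import hashlib
import json


def put_scenario(store, scenario):
    """Store the scenario once and return the key that requests will carry."""
    payload = json.dumps(scenario, sort_keys=True).encode("utf-8")
    key = hashlib.sha256(payload).hexdigest()
    store.setdefault(key, payload)   # uploaded once, however many assets reuse it
    return key


def make_request(asset_id, scenario_key):
    """Build the small message that crosses the limited-bandwidth link."""
    return json.dumps({"asset_id": asset_id, "scenario_key": scenario_key})


def resolve(store, request):
    """Worker side: fetch the scenario from the store using the key in the request."""
    message = json.loads(request)
    scenario = json.loads(store[message["scenario_key"]])
    return message["asset_id"], scenario
```

With 1000 scenarios shared across all assets, each scenario crosses the Internet link once and every subsequent pricing request stays a few hundred bytes.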


Utilizing Cloud Resources: Generating Load


Problem: Generating Load for Cloud Resources

Threading architecture

Workload originally generated by synchronous calls in client

Number of outstanding pricing requests = nodes x batch size

Implies large number of threads in wait states in scenario generators

Work request made asynchronous

RiskBurst™ Server Logic

Creates a balanced workload – uses a work item’s average run time

Made calls to RiskBurst™ Server asynchronous

Incoming calls create batch entry synchronously with request

Map created from message id to wait handlers

When batch full, sent on to Azure input queue

Sweeper thread gathers up output messages and uses map to associate with wait handlers

Scales well to over 1000 simultaneous requests per RiskBurst™ Server

Horizontal scale of RiskBurst™ Servers – each creates own input queue
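The server logic above (batch entries created on arrival, a map from message id to wait handle, a sweeper that releases callers) can be sketched as follows. This is a simplified model, not the RiskBurst™ Server: a Python list stands in for the Azure input queue, `threading.Event` stands in for a wait handle, and the class and method names are hypothetical. The time-out sweeping and workload balancing mentioned on the slide are left out.

```python
import threading
import uuid


class BatchingServer:
    def __init__(self, batch_size, input_queue):
        self.batch_size = batch_size
        self.input_queue = input_queue   # list stand-in for the Azure input queue
        self._lock = threading.Lock()
        self._batch = []
        self._pending = {}               # message id -> wait handle and result slot

    def submit(self, work_item):
        """Add the item to the current batch; return its id and wait handle."""
        message_id = uuid.uuid4().hex
        handle = threading.Event()
        with self._lock:
            self._pending[message_id] = {"handle": handle, "result": None}
            self._batch.append((message_id, work_item))
            if len(self._batch) >= self.batch_size:
                # Batch is full: send it on as a single queue item.
                self.input_queue.append(self._batch)
                self._batch = []
        return message_id, handle

    def sweep(self, output_messages):
        """Match output messages to waiting callers and release them."""
        for message_id, result in output_messages:
            with self._lock:
                entry = self._pending.get(message_id)
            if entry is not None:
                entry["result"] = result
                entry["handle"].set()

    def result(self, message_id):
        """Collect (and forget) the result once the wait handle has been set."""
        with self._lock:
            return self._pending.pop(message_id)["result"]
```

A caller would typically `submit`, then `handle.wait(timeout)` on its own thread or callback, while a single sweeper thread feeds `sweep` from the output queue; that keeps the number of blocked threads independent of nodes x batch size.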


Horizontal Scaling within the Cloud

Problem: Saturation behaviour of queues

Can create situation where queues are saturated, made worse by retry logic

Complexity due to varied processing time

Controller will move busy queues to independent hardware

Use exponential back-off algorithm

Batch work items for each queue read or write (using 10 work packets per queue item)

Amortizing the cost of IO against CPU time is key

Batch compute sizes need to be big enough both to occupy the CPU for long enough and not cause the swamping of the queues

Also, more items contained in queue item -> fewer queue hits

But, larger batches imply more simultaneous outstanding connections on client side

Variable run-time of assets – from 150ms – 30 seconds

Carry out processing concurrently with queue access

Pushing IO onto background threads is critical (the writes and the deletes are independent background tasks)

On-node caching within worker role to avoid queue reads
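Two of the points above lend themselves to a quick calculation: the exponential back-off schedule, and how batching amortizes queue IO against CPU time. The sketch below is illustrative only. The function names are made up, the model assumes three queue operations per batch (read, write, delete) with no overlap between IO and compute, and the 200 ms and 20 ms figures in the usage note are borrowed from other slides as plausible inputs rather than measured queue costs.

```python
def backoff_delays(base_ms, cap_ms, attempts):
    """Delay before each retry: doubles on every attempt, never exceeds the cap."""
    return [min(cap_ms, base_ms * 2 ** n) for n in range(attempts)]


def cpu_utilisation(compute_ms_per_item, io_ms_per_queue_op, batch_size):
    """Fraction of wall time spent computing when IO is not overlapped.

    Assumes each batch costs three queue operations: read, write and delete.
    """
    compute = compute_ms_per_item * batch_size
    io = 3 * io_ms_per_queue_op
    return compute / (compute + io)
```

For example, `backoff_delays(100, 2000, 6)` yields waits of 100, 200, 400, 800, 1600 and then 2000 ms. With 200 ms of compute per work packet and 20 ms per queue operation, this model puts a batch of 10 at about 97% CPU utilisation against about 77% for a batch of 1, before any gain from moving IO onto background threads.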


Exception Management in Distributed Applications

Keep it simple

Large distributed system implies need to engineer robustness to failure

Distinguish between events that are random and unpredictable and poison-message kind of failures

Do not over-engineer efficient handling of occasional exceptions

Return exceptions to client application

Client can track number of attempts to process a work item

Distinguish poison messages and give up

Parallel handling on HPC Server SOA platform

Complexity from varying message processing times

Time-outs can be caused by several long-running pricings in same job

Re-try time-outs by sending all pricings in batch independently
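The client-side policy described above (count attempts per work item, give up on poison messages, and re-send a timed-out batch as independent pricings) can be sketched like this. It is a minimal model under stated assumptions: `TimeoutFailure`, `process` and the `run_batch` callback are hypothetical names, work items are assumed to be hashable ids, and the limit of three attempts is arbitrary.

```python
class TimeoutFailure(Exception):
    """Stand-in for a job that exceeded its time-out."""


def process(batch, run_batch, max_attempts=3):
    """Return (results, poison) for a batch of work-item ids.

    run_batch(items) returns one result per item, or raises on failure.
    """
    results, poison = {}, []
    attempts = {item: 0 for item in batch}
    jobs = [list(batch)]
    while jobs:
        job = jobs.pop(0)
        for item in job:
            attempts[item] += 1
        try:
            outputs = run_batch(job)
        except Exception as error:
            split = isinstance(error, TimeoutFailure) and len(job) > 1
            retry = [item for item in job if attempts[item] < max_attempts]
            # Items that have used up their attempts are treated as poison.
            poison.extend(item for item in job if attempts[item] >= max_attempts)
            if split:
                # A timed-out batch is re-tried as independent single pricings.
                jobs.extend([item] for item in retry)
            elif retry:
                jobs.append(retry)
            continue
        results.update(zip(job, outputs))
    return results, poison
```

Splitting only on time-outs keeps the common path simple: a batch that merely contained several long-running pricings succeeds on the second pass, while a genuinely poisonous item fails alone and is dropped after its attempts run out.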


Diagnostics and Run-time Monitoring

A challenge for large scale applications, even more so for Cloud

Logging and monitoring must be switchable so as to reduce overhead

Variable level of diagnostics and logging

Requirement to filter information through decoupled architecture (on node; centralized in Azure; returned to client)

Key data for architectural pattern

Request and result queue; successful/unsuccessful read, write and delete; time taken for all operations

Empty request queue gets

Count of successful/unsuccessful work packets

% Processor Time performance counter

Cache misses

We utilized a custom-built solution during TAP

Nodes broadcast over service bus

Clients subscribe to trace messages

New diagnostic & monitoring package provides platform support
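The key data listed above (successful and unsuccessful queue operations, time taken, empty gets) and the requirement that monitoring be switchable can be sketched with a small collector. This is an illustration only, unrelated to the custom TAP solution or the Azure diagnostics package; the class name and the metric names used in the usage note are hypothetical.

```python
import time
from collections import Counter
from contextlib import contextmanager


class QueueMetrics:
    def __init__(self, enabled=True):
        self.enabled = enabled      # switch off to remove the bookkeeping overhead
        self.counts = Counter()     # e.g. "request_queue.read.ok"
        self.seconds = Counter()    # total time per operation

    def count(self, name):
        """Record a one-off event such as an empty request-queue get."""
        if self.enabled:
            self.counts[name] += 1

    @contextmanager
    def timed(self, operation):
        """Time one queue operation and record whether it succeeded."""
        if not self.enabled:
            yield
            return
        start = time.perf_counter()
        try:
            yield
        except Exception:
            self.counts[operation + ".failed"] += 1
            raise
        else:
            self.counts[operation + ".ok"] += 1
        finally:
            self.seconds[operation] += time.perf_counter() - start
```

A worker would wrap each call, for instance `with metrics.timed("request_queue.read"): ...`, and call `metrics.count("request_queue.read.empty")` when a get returns nothing; the collected counters can then be filtered on the node before anything is sent on to a central store or the client.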


Final Comments

Integrating on-premise and cloud applications


Production Services across On-Premise and Cloud

Operational Integration

Fully integrate Windows Azure capabilities with RiskMetrics Operational Infrastructure

Provisioning plus diagnostic & monitoring packages

“Outside-In” Services

Control and visibility of the services on the cloud consistent with on-premise services.

Resource View

Nodes

Queues

Blob Stores

Process View

Throughput & Performance

Traceability

Problem identification

Process linkage (intra- & inter-cloud)

Binding SLA Commitments

Operational Support Escalation


RiskBurst™ on Windows Azure

Effective architectural pattern delivers key business benefits

Elastic scaling

Enhanced services

Empowered innovation

High reliability

Improved agility

Windows Azure was an obvious choice of cloud platform

Minimize impedance mismatch between on-premise and off-premise

.NET/WCF/HPC SOA in data center extended to cloud

Configure to run in either environment

Familiar development environment

Massive scalability

View of Azure as extension of OS into Cloud

Undertake work with HPC Server Team in 2010

Ability to target either Azure-hosting WCF services or HPC Server hosted WCF services in a seamless manner

Synchronization of on-premise Velocity instance with Azure instance


Acknowledgements


Prototype Development:

Stuart Hartley (University of York, UK)

Simon Davies (TAP programme)

Production Development Team:

Rich Bower (Team Lead)

Kelly Crawford (RiskBurst Server/Client)

Simon Davies (TAP Programme)

Jonathan Blair (Microsoft Consulting)

Supporting Cast:

Alistair Beagley (DPE / Azure)

Patrick Butler Monterde (TAP Programme)

Azure Product Group (Hoi Vo, Brad Calder, Tom Fahrig, Joe Chau)

Hunter Cadzow & Analytics Development at RiskMetrics

Tom Stockdale (RiskMetrics CTO)

More Information

> SVC16 Developing Advanced Applications with Windows Azure

> SVC09 Windows Azure Tables and Queues Deep Dive

> SVC14 Windows Azure Blobs and Drives Deep Dive

> SVC08 Patterns for Building Scalable and Reliable Windows Azure Applications

> Windows Azure Platform lounge

YOUR FEEDBACK IS IMPORTANT TO US!

Please fill out session evaluation forms online at MicrosoftPDC.com


© 2009 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.