
TCP-clouds, UDP-clouds, design for fail and AWS

BY MASSIMO, ON APRIL 27TH, 2011

An entire Amazon AWS Region was recently down for four days. Everyone has got to blog something about it and this is my attempt. Just as a warning: this post may be highly controversial.

There has been a litany of tweets pontificating how applications on AWS should be deployed in a certain way to achieve the maximum level of availability and how applications need to be re-architected to properly fit into the new cloud paradigm. Basically the idea is that your application should be conceived, designed, architected, developed and deployed with failure in mind. Many call it "design for fail". That is to say: software architects and developers should never assume that any given piece of the infrastructure is reliable.
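To make the idea concrete before I argue against it: in code, "design for fail" typically means assuming any call can fail and handling that inside the application itself. A minimal sketch of the pattern (the endpoint names are made up, and this is just an illustration, not anyone's actual implementation):

```python
import time
import urllib.request

# Hypothetical replicas of the same service in two Availability Zones.
ENDPOINTS = ["http://app-az-a.example.com/health",
             "http://app-az-b.example.com/health"]

def call_with_failover(endpoints, retries=3, backoff=1.0):
    """Try each endpoint in turn, retrying with exponential backoff:
    the application, not the infrastructure, owns failure handling."""
    for attempt in range(retries):
        for url in endpoints:
            try:
                with urllib.request.urlopen(url, timeout=2) as resp:
                    return resp.read()
            except OSError:
                continue  # this replica failed; try the next one
        time.sleep(backoff * (2 ** attempt))  # back off before a new round
    raise RuntimeError("all replicas failed; degrade gracefully or give up")
```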

I beg to differ. I don't like this idea, even though some of you will think I am a bit archaic.

George Reese wrote a great blog post titled The AWS Outage: The Cloud's Shining Moment outlining the differences between the "design for fail" model and the "traditional" model. The traditional model, among other things, has high-availability and DR characteristics built right into the infrastructure, and these features are typically application-agnostic (a couple of years ago I wrote a big document on the various alternatives for HA and DR of virtual infrastructures, if you are interested). George nailed down the story very well, and the story is that there are a couple of different philosophies at play here. I don't call these two models "design for fail" and "traditional" though. I call them TCP-clouds and UDP-clouds. Let's look at a summary of the characteristics of these two protocols.
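For readers who want the protocol analogy spelled out, a minimal sketch in Python sockets (purely illustrative): TCP is connection-oriented with acknowledged, ordered delivery; UDP is connectionless and best-effort.

```python
import socket

# TCP: connection-oriented; the stack handles acknowledgements,
# retransmission and ordering. connect()/sendall() fail loudly if the
# peer is unreachable -- reliability is built into the transport.
tcp = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
tcp.settimeout(2)
tcp.connect(("example.com", 80))
tcp.sendall(b"HEAD / HTTP/1.0\r\nHost: example.com\r\n\r\n")
print(tcp.recv(1024))
tcp.close()

# UDP: connectionless, best-effort; sendto() succeeds locally whether
# or not anyone receives the datagram. If you need reliability, *you*
# build it (timeouts, retries, dedup) -- the "design for fail" model.
udp = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
udp.sendto(b"ping", ("203.0.113.1", 9999))  # no error even if host is down
udp.close()
```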

In the context of cloud resiliency, this is what that means:


AWS uses a UDP-cloud model because it doesn't guarantee reliability at the infrastructure level. AWS essentially offers an efficient distributed computing platform that doesn't have any built-in high-availability services. The notion of Availability Zones and Regions is often misunderstood, since the name may imply there is high availability built into the EC2 service. That's not the case: AWS suggests deploying in multiple Availability Zones simply to avoid concurrent failures. It's mere statistics. In other words, if you deploy your application in a given Availability Zone, there is nothing that will fail it over to another Availability Zone as part of the AWS service (RDS is a vertical example that does that for MySQL, but I am instead talking about an application-agnostic service that does that for every application, regardless of its nature).
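To illustrate the point (a sketch with today's boto3 SDK, which postdates this post; the AMI ID and AZ names are placeholders): spreading instances across Availability Zones is something you do yourself at deploy time, not something EC2 does for you.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# You choose the placement; EC2 will not fail a running instance over
# to another AZ on your behalf. Deploying one copy per AZ only lowers
# the odds of a correlated failure -- "mere statistics".
for zone in ["us-east-1a", "us-east-1b"]:
    ec2.run_instances(
        ImageId="ami-12345678",   # placeholder AMI
        InstanceType="t2.micro",
        MinCount=1,
        MaxCount=1,
        Placement={"AvailabilityZone": zone},
    )
```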

Since I am not able at the moment to write a structured thought around this complex matter, let me write down mixed and random thoughts, opinions and questions to try to make you think. I am giving you some food for thought. As for answers, call me when you find them, please.

Isn't this "design for fail" theory a step back?

What we have seen in the last decade was a trend where we were able to remove the non-functional requirements complexity from within the traditional OS and push it down into the virtual infrastructure (arguably the backbone of any IaaS cloud). This is the point I was trying to get across during this VMworld 2007 breakout session, 4 years ago. And what we are saying now is that we should put that logic back into the application (not even the Guest OS)? I thought the trend I have just described was quite successful and one of the many reasons for the success of virtualization deployments. Are we now questioning it? My idea is fairly simple, although I am open to being challenged: developers focus on functional requirements, IT focuses on non-functional requirements (which include resiliency and reliability, among other aspects). If interested, you can download the full deck here. Note I did that presentation before joining VMware so, if you think I am biased,


well, I am biased just because I bought into that school of thought long before I was on VMware's payroll.

    Excuse me? What did you say? NoSQL to whom?

In his post George suggested exploring NoSQL solutions. Not a bad idea; however, other than the risk of losing transactions that he was mentioning, I'd say 95% of the customers I have been working with so far would look at me strangely and they'd ask: "What exactly do you mean by NoSQL? Is it a bad word?". Let's be honest, folks: this is not mainstream. If we want to create a cloud for an elite of people, I am fine with that. However, I am convinced one of the key values of an IaaS infrastructure is, among others, providing a cloud-like experience (pay-as-you-go, elasticity, etc.) to traditional workloads. I am not philosophically against the idea of re-architecting applications; however, I am also convinced that, for one person thinking about writing a brand new Ruby application for a UDP-cloud leveraging NoSQL (pardon me?), there are at least 1,000 poor sysadmins trying to figure out how to live with their traditional applications.

    Can you afford a personal Chaos Monkey?

Some of the AWS customers developed tools to test the resiliency of their applications. Do you remember the good old HA and DR plans? IT people would walk into the server room to power off servers, and eventually the entire datacenter, to simulate a failure and see if their HA and DR policies were working properly. If everything was good, applications could survive the failure (more or less) transparently. This is what a Chaos Monkey tool does, but with a different perspective: these are software programs that are designed to break things randomly (on purpose) in order to see if the application itself is robust enough to survive those artificially created infrastructure issues in the cloud. In a TCP-cloud it would be the cloud provider running traditional tests to make sure the infrastructure could self-recover. In a UDP-cloud it is the developer running these Chaos Monkey tests to make sure the application can self-recover, since it's been designed for fail.
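A toy sketch of the idea, not Netflix's actual tool (boto3 again; the tag name is hypothetical): pick a random instance from your fleet, kill it, and watch whether the application survives.

```python
import random
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Find running instances belonging to the application under test.
reservations = ec2.describe_instances(
    Filters=[
        {"Name": "tag:app", "Values": ["my-webapp"]},  # hypothetical tag
        {"Name": "instance-state-name", "Values": ["running"]},
    ]
)["Reservations"]
instances = [i["InstanceId"] for r in reservations for i in r["Instances"]]

# Break something on purpose: if the app was truly designed for fail,
# users should not notice.
victim = random.choice(instances)
print(f"terminating {victim}")
ec2.terminate_instances(InstanceIds=[victim])
```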

Now, my take is that if you are Netflix or the likes of NASA and JPMorgan (these two are just examples of big organizations; I am not even sure they are on Amazon), then you may have enough motivation and business reasons to re-architect your application for a UDP-cloud and create your own Chaos Monkey to test your "design for fail" deployment. Certainly at Netflix they know what they are doing, and in fact they seem not to have been impacted by this AWS outage. But if you are these guys, do you think you have the bandwidth, knowledge and time to re-architect the application and test it for failure? This AWS forum discussion showed up during the 4-day debacle and it deserves a proper copy and paste, just in case it gets lost:

< Sorry, I could not get through in any other way. We are a monitoring company and are monitoring hundreds of cardiac patients at home. We have been unable to see their ECG signals since the 21st of April.

> Man, mission-critical systems should never be run in the cloud. Just because AWS is HIPAA certified doesn't mean it won't go down for 48+ hours in a row.

< Well, it is supposed to be reliable... Anyway, I am begging anyone from the Amazon team to contact us directly.

This is shocking, isn't it? Try to argue with them about NoSQL and "design for fail". They probably barely understand the notion of Availability Zones and Regions. Don't get me wrong, it's not these people's fault. They are not in the business of re-architecting an application to be written with reliability in mind; they are in the business of helping their patients. Sure, you can argue that it was their fault if they failed. But the net of this story is that they are not going to re-architect anything, nor write a Chaos Monkey. When they realize what happened, they will look for a TCP-cloud.

    Design for fail: philosophy or necessity?

I hope you've made it at least to this point, because this is my biggest struggle at the moment. The more I read suggestions to design applications for fail, the more I wonder whether these suggestions are tactical or strategic. In other words, are you suggesting to design for fail simply because that's the way Amazon AWS works today (but you'd rather use an Amazon TCP-cloud if that was available)? Or are you suggesting that, in any case, you should design an application for fail because you are happy to deal with a UDP-cloud and that's how every cloud should behave? Are we saying that it's strategically and philosophically better to have developers deal with application high availability and disaster tolerance because that's what makes sense to do?


Or are we saying we need to do this because that's the only option we have on Amazon AWS (today) and there is no other choice? I know it may sound like a rhetorical question, but it's actually not. Perhaps we need both models?

You don't like the noise coming from the other apartments? Buy the entire building!

This isn't related to the outage and the resiliency of the cloud, but it relates to the overall TCP-cloud vs UDP-cloud discussion. Similar to the "design for fail" thread, there is a "deploy for performance" thread going on. In a multi-tenant environment (a must-have to achieve economy of scale and elasticity) there is obviously contention of resources. In an ideal world I'd like to be able to buy virtual capacity for what I need and have a certain level of guarantee that that capacity (or at least a contracted part of it) is always available to me. There are of course circumstances where I can trade off performance and availability of capacity for a lower cost, but there are other situations where I cannot trade that off. A TCP-cloud should (ideally) be able to deliver that guarantee. A UDP-cloud works in best-effort mode and typically leverages statistical laws to fight contention. This is the statistical assumption: not all users running on a shared infrastructure will be pushing like hell at the same time (one would hope, fingers crossed).

So what do you have to do if you are running on a UDP-cloud? You keep the other people out of your garden. I think Adrian is a genius, but I don't agree with his point of view:

You cannot control who you are sharing with and some of the time you will be impacted by the other tenants, increasing variance within each EC2 instance. You can minimize the variance by running on the biggest instance type, e.g. m1.xlarge or m2.4xlarge. In this case there isn't room for another big tenant, so you get as much as possible of the disk space and network bandwidth to yourself.

[A] busy client can slow down other clients that share the same EBS service resources. EBS volumes are between 1GB and 1TB in size. If you allocate a 1TB volume, you reduce the amount of multi-tenant sharing that is going on for the resources you use, and you get more consistent performance. Netflix uses this technique; our high-traffic EBS volumes are mostly 1TB, although we don't need that much space.

If you ever see public benchmarks of AWS that only use m1.small, they are useless; it shows that the people running the benchmark either didn't know what they were doing or are deliberately trying to make some other system look better.

The last sentence is like saying that, if you buy a new apartment and then complain about the noise coming from the other apartments, it's your fault: you should have bought the entire building and enjoyed the silence! Hell, Adrian, I say no! There must be a better way.

I think there must be rules in place to keep the noise at an acceptable level, and if there is someone trying to scream all the time, someone should enforce silence without you having to buy an entire building to cook and sleep in peace. That's how it works in real life; that's how it should work in the cloud. In my opinion, at least.

In cloud terms, I'd be OK if what I was buying always delivered a contracted baseline as a guarantee and could then burst (I said burst, Beaker, not cloudburst) to higher throughput if there isn't contention. What I would NOT be OK with is no baseline at all, so that what I get is unpredictable performance at all times. BTW, note that Amazon made a step in the right direction a few weeks ago, announcing the availability of what they call dedicated instances. This is an attempt to solve the noisy-neighbors problem. However, in doing so they did trade off multi-tenancy (hence the higher cost of such a service).
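What "contracted baseline plus burst" could look like, reduced to its simplest possible form (my own conceptual sketch, not any provider's actual QoS code): a per-tenant token bucket whose refill rate is the guaranteed baseline and whose depth is the burst allowance.

```python
import time

class TenantBucket:
    """Token bucket: 'rate' is the contracted baseline (tokens/sec,
    always replenished), 'burst' is how far above baseline a tenant
    may go when the bucket has accumulated idle capacity."""

    def __init__(self, rate, burst):
        self.rate, self.burst = rate, burst
        self.tokens, self.last = burst, time.monotonic()

    def allow(self, cost=1.0):
        now = time.monotonic()
        # Refill at the guaranteed baseline rate, capped at burst depth.
        self.tokens = min(self.burst,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True   # within baseline or accumulated burst
        return False      # throttled; other tenants keep their baseline

# Each tenant gets its own bucket, so one screaming tenant cannot
# consume another tenant's contracted capacity.
tenants = {"quiet": TenantBucket(rate=100, burst=200),
           "noisy": TenantBucket(rate=100, burst=200)}
```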

For the record, I have to say that I don't think there is a single public cloud at the moment delivering such fine-grained QoS across all subsystems on rented resources. This is a generic discussion about TCP-clouds and UDP-clouds, and if you interpreted it as a vCloud vs AWS shootout, you are mistaken. In fact, I think George gave vCloud too much credit in his blog by associating it with the "traditional datacenter" model. There is a gap between what we can deliver, in terms of non-functional requirements, with a raw vSphere deployment and what we can deliver with a vCloud Director 1.x implementation. I am not hiding this by any means; in fact, you can read here (the post, but more importantly the comments) what I had to say about this.


Having said this, I believe VMware has a vision to fill that gap and create a true TCP-cloud. Last but not least, I don't see why a VMware service provider partner shouldn't be able to implement a vCloud-powered UDP-cloud if need be.

    PaaS and Design for fail?

If I struggle with IaaS clouds (and I do), go figure with PaaS clouds. To me, PaaS is all about moving the level of abstraction to a higher level. IaaS is all about hiding infrastructure details. PaaS is all about hiding infrastructure and middleware details. In a PaaS you can upload your WAR file and that's it. It's the PaaS cloud provider that is going to deal with the complexity of setting up, managing and maintaining the middleware stack that can interpret that WAR file (for example). Fundamentally, the developer should focus (even more than with IaaS) on the functional requirements of the application and let the cloud provider deal with the non-functional requirements of it. Last time I checked, HA and DR were still part of the non-functional requirements domain. Note that, ironically, it may be easier for a PaaS cloud provider to build out-of-the-box resiliency, given the nature of the interfaces they are exposing. Amazon is halfway there already with their RDS MySQL-as-a-service: they already offer automatic failover across Availability Zones, and they would just need to extend this failover support across Regions (this would have helped with the recent failure, by the way). So, if my theory is sound, that means that if you are architecting your application for PaaS you shouldn't design for fail. Upload your WARs, create a db instance on the fly, and you are done. The cloud provider will figure out how to fail over to the next server, to the next datacenter room, or to another geography, should a problem occur at any of the given levels.
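The RDS piece is visible right in the API; a minimal boto3 sketch (identifiers and credentials are placeholders), where a single MultiAZ flag asks the provider, not the application, to handle failover across Availability Zones:

```python
import boto3

rds = boto3.client("rds", region_name="us-east-1")

# MultiAZ=True tells the service to keep a standby replica in another
# Availability Zone and fail over automatically -- resiliency delivered
# as a platform service, invisible to the application.
rds.create_db_instance(
    DBInstanceIdentifier="mydb",      # placeholder
    Engine="mysql",
    DBInstanceClass="db.t3.micro",
    MasterUsername="admin",
    MasterUserPassword="change-me",   # placeholder
    AllocatedStorage=20,
    MultiAZ=True,
)
```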

So why isn't Amazon offering resiliency and reliability as part of their cloud services, in the end?

After all, they offer other non-functional requirements, such as automatic scaling of applications through tools like Auto Scaling. So why would Amazon offer auto-scaling services but not an automatic, agnostic, infrastructure-level recovery service across Availability Zones (or, even better, across Regions)? Guess what: it is at least two orders of magnitude easier to instantiate a new web server and add an IP to a load balancer than to implement a (reasonably performant) traditional backend database that can geographically fail over without losing transactions in case of a disaster. Dealing with stateless objects is a piece of cake. Try to deal with stateful objects if you can.
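The stateless half really is that small. A hedged sketch with boto3 and a classic ELB (all names are placeholders): launching one more web server and putting it behind the load balancer is a few calls, with no state to reconcile; compare that with geographically failing over a database without losing transactions.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
elb = boto3.client("elb", region_name="us-east-1")

# Scaling the stateless tier: boot another identical web server...
instance = ec2.run_instances(
    ImageId="ami-12345678",    # placeholder web-server AMI
    InstanceType="t2.micro",
    MinCount=1,
    MaxCount=1,
)["Instances"][0]

# ...and register it with the load balancer. Done. No transactions,
# no replication, no quorum: that is why auto-scaling shipped first.
elb.register_instances_with_load_balancer(
    LoadBalancerName="my-web-elb",   # placeholder
    Instances=[{"InstanceId": instance["InstanceId"]}],
)
```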

I am sure Amazon doesn't think that dealing with auto-scaling is something the cloud should do for developers whereas dealing with reliability and DR is something a developer should do on his/her own. What do you think? My speculation is that they are simply not there yet. As easy as that. But don't be fooled: Amazon is full of smart people and I think they are looking into this as we speak. While we are suggesting (to an elite of programmers) to design for fail, they are thinking about how to auto-recover their infrastructure from a failure (for the masses). I bet we will see more failure-recovery-across-AZs-and-Regions types of services, in one form or another, from AWS. I believe they want to implement a TCP-cloud in the long run, since the UDP-cloud is not going to serve the majority of the users out there. Mark my words. I'll have to link to this blog post once this happens and I'll have to say "I told you" (I hate this). And that is only going to be a good thing, because developers will start again to focus on functionalities, and IT (the cloud) will continue to focus on making sure those functionalities are (highly) available.

As I said, just food for thought. If you find definitive answers, please let me know.

Last but not least, this is a good time to recall the disclosure of my blog (courtesy of a big copy and paste from Sam Johnston's blog): "The views expressed on these pages are mine alone and not (necessarily) those of any current, future or former client or employer. As I reserve the right to review my position based on future evidence, they may not even reflect my own views by the time you read them. Protip: If in doubt, ask."

    Massimo.


    31 comments to TCP-clouds, UDP-clouds, design for fail and AWS

Steve Bryen
April 27, 2011 at 4:48 PM

Hey,

Nice post. I love the UDP/TCP cloud analogy.

The main issue that I see with TCP-clouds is that you can get away with running your current app on them and utilise the features that vendors such as VMware offer (HA/DR), but if you want 100% uptime you are always going to have to design for failure at the application layer.

If I hired a Chaos Monkey to go and pull out some VMware ESXi blades, customers wouldn't be very happy. Although their VMs would restart, there would be an outage (considering I am not running FT, as my machines are not compatible). However, if I designed my application for failure and ran it across different Availability Zones (blade chassis/clusters), they would be much happier. Yes, they would still know there was an outage; however, their application would have still been running.

    Just my 2 cents

    Steve

Massimo
April 27, 2011 at 10:48 PM

Steve, thanks for the comment. I agree with what you are saying. My argument is that designing an application for fail (especially transparently, with not even a brief outage) is a titanic effort, and most may be happy leveraging generic platform services.

Steve Chambers
April 27, 2011 at 5:16 PM

Resiliency... CHECK

Reliability... CHECK

Recoverability... but we can't back it up, so we can't recover it... what now?

Massimo
April 27, 2011 at 10:44 PM

Steve,

> Recoverability... but we can't back it up, so we can't recover it... what now?

Did you read the post? That's one of the reasons why I said vCloud is not yet a full TCP-cloud.

    Massimo.

Doug B
April 27, 2011 at 7:10 PM


Massimo,

Good stuff here. My thought is that this should be handled at both layers, and implementation (or not) depends on the availability needs of an application. To that end, I'll play somewhat of a Devil's Advocate.

If you're building scaling/elasticity into an app, you should probably be communicating with the platform, and it should be (made) fairly straightforward to leverage the DR/HA services provided by that platform. I would agree with you that autoscaling provided by a platform is fairly trivial for stateless workloads and significantly more complex for anything else.

If you have a simple application that does not require (auto)scaling (think of a traditional app, wrapped in a VM), I would think the (transparent) HA capabilities of the platform (a la VMware HA) would apply. With that level of HA (no application awareness), there would be a brief outage as recovery occurs, and that might be acceptable for most users/applications. I propose that, for those with higher availability needs, a modified, cloud-aware application that is designed with the cloud platform in mind may suffer brief performance degradation while it re-protects itself across availability zones or regions, but it would not go offline. I think of this like VMware FT, but at the application layer. VMware FT is an interesting feature, but the complexity involved with what it actually does behind the scenes has got to be significant. (Those developers have my respect, for sure.)

Hacking the infrastructure to prevent *any* additional work by the application developers is, in my opinion, adding unnecessary complexity to the entire stack. Is it really too much to ask that developers leverage the services provided by the platform rather than blindly compiling code that meets functional requirements while assuming it will run anywhere? Let's apply the solution to the places where it makes the most sense.

    My 2 cents,

    Doug

Massimo
April 28, 2011 at 9:52 AM

Hi Doug. Thanks for the comment. I believe what you are suggesting is pretty much in line with what VMware has been trying to pitch so far, which doesn't necessarily mean it's the right thing to do and that you should follow the dogma. It just makes sense to me. When I was writing the post I was thinking that what I was putting down was sort of neglecting the concept of DevOps. That's not actually the case since, as you point out, developers may still be able to interface with the infrastructure by subscribing to services provided by it. Obviously this cannot be done in the application code (IMO), but it needs an additional level of abstraction/wrapping that is a sort of bridge between how the application behaves (or has been engineered) and the services the infrastructure publishes. It's a long way to go, but I believe that OVF can be that bridge, where developers can describe what they need the infra to deliver. If the infra doesn't understand this metadata, it will just ignore it.

    Massimo.

DeckerEgo
April 27, 2011 at 7:12 PM

I can definitely agree that having to force software design constraints around design-to-fail is a step backward. Hardware failures happen, without a doubt, but the right infrastructure should shield the applications from outright hardware failure.

Application architecture should be centered on scalability (i.e. stateless and asynchronous architectures); infrastructure architecture should be centered on fail-safety (split-brain networks, drive failures, PSU fires).


Enrico Signoretti
April 28, 2011 at 12:52 PM

Massimo,

I strongly agree with your point of view and I would like to add my two cents. Design-for-fail/UDP-clouds have two big issues:

1) The development cost of a UDP-cloud-safe application is very high: you need a more skilled development team with reliability in their DNA, a longer development process (== time), and more resources for test/quality purposes.

2) If you develop a designed-for-fail application/service on a UDP-cloud, you need to know the underlying technology and APIs very well, with the risk of writing very closed and non-portable software. Services like AWS are sold with the promise of big savings on the infrastructure, but if I need to rethink all my applications from 0 to adapt them to AWS, the only result is to move my money from one pocket to another!

The sum of these two points reminds me of the mainframe years... are we sure we want to go back to the past? Do we need to start talking about public cloud lock-in?

    ciao,

    Enrico

LChichiarelli
April 28, 2011 at 2:02 PM

Massimo,

I really enjoyed reading your article; it introduced an interesting philosophy for understanding the implications of developing and deploying applications with failure in mind. I think you have coined two new terms, TCP-cloud and UDP-cloud, that we will hear a lot in the future.

    Luca

Gabriele
April 28, 2011 at 11:10 PM

Massimo,

I like the TCP/UDP comparison but, as a longstanding fan of UDP, I feel the need to challenge you. DNS is a fully redundant and reliable service. It works mainly over UDP and falls back by design to TCP if there is a response truncation.
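(The fallback described here is easy to demonstrate with the dnspython library; the server and name below are just examples: query over UDP first, and retry over TCP only when the response comes back with the truncation (TC) bit set.)

```python
import dns.flags
import dns.message
import dns.query

query = dns.message.make_query("example.com", "TXT")
response = dns.query.udp(query, "8.8.8.8", timeout=2)

# UDP is the fast path; if the answer did not fit in a datagram the
# server sets the TC bit, and the resolver falls back to TCP.
if response.flags & dns.flags.TC:
    response = dns.query.tcp(query, "8.8.8.8", timeout=2)
print(response.answer)
```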

Guess what? DNS is the Internet's backbone and has worked pretty well for a number of years. Why shouldn't UDP-clouds do the same? As you point out, it's about the application level.

If somebody builds an application on the Cloud the same way they do for a classic continuity-oriented infrastructure, then this person is obviously missing the point; unfortunately, 95% of the applications on the Cloud might be missing the point, to my perception. Performance variance, ephemeral instances, shared resources, design constraints: Cloud Computing (in its purest form, as Amazon sells it) is a peculiar environment that needs a rethinking of application architecture paradigms and shifts accountability onto the user. All we have to expect from IaaS services is scalability and an increasing level of openness and interoperability.


Eventually applications will become more distributed and more self-healing. It's a trend, it's happening now, and it will gradually absorb any additional cost (e.g. distributed, portable, stateless apps); such a model might become mainstream. Enrico is right just for now, but not for long.

BTW: I did not really understand the part regarding NoSQL. What did G. Reese want to say? For me it was a bit off-course; no NoSQL I know of guarantees either consistency or availability per se; they always need special attention or an IDA file-system. Regarding NoSQL awareness: I work with some customers that use NoSQL tools and every one of them knows very well the implications of the CAP theorem. It's not a big crowd, but it's encouraging.

Saluti
Gabriele B

Massimo
April 29, 2011 at 5:36 PM

Gabriele, thanks for the comment. Sure thing, you have the right to disagree/challenge.

I don't know if we can make a good parallel between DNS and applications. DNS is a bit of a weird beast. In one way it is a stateful object for which someone found a way to create a nice distributed architecture. However, it also imposes limitations and issues due to its heavy caching-dependent algorithms. But this is not the point. There are hundreds of UDP applications that run just fine; my question is: what about the other thousands of TCP applications?

To my "philosophy vs necessity" question you are basically answering "it's a philosophy", and that applications in the cloud should take into account the volatility of the resources there. We obviously disagree on that point, but that's fine. No one has the magic crystal ball to see what will happen in the future.

I am always doubtful when dealing with re-architecting / re-writing applications. Perhaps it's because I went through the Xeon vs Itanium discussions many years ago. I am not saying that re-architecting applications to fit a UDP-cloud model is going to end up like re-writing applications for Itanium (not at all); however, this is not a matter that can be over-simplified with "it is happening". My opinion.

As far as the NoSQL field experience goes, I guess it really depends on the point of view. This reminds me of a conversation I had with a partner during my tenure at IBM. During an event I said something along the lines of "we don't see RedHat/Xen a lot as far as virtualization is concerned". He approached me saying I was wrong and that this is what he was doing all day long. When I asked what his job was, it turned out to be something like "RedHat virtualization practice leader" for that partner. I made a bold statement that NoSQL-like technologies are not widespread in the field based on what I have seen. If you are very involved in that space, you may have a different perception.

Thanks for commenting.

    Massimo.

Gabriele
April 29, 2011 at 10:03 PM

Massimo, thanks for answering.

Just to point out that I too think NoSQL is NOT a mainstream technology. What I wanted to say is that those who need it know very well how to use it (as usual with early adopters) and that we find, day after day, new practical uses (complementary to common DBs).


I am not specifically dedicated to NoSQL; it's the industry which is starting to use it.

On UDP, my rationale is that such distributed architectures will become progressively easier to tackle and will find their place in the IT panorama. We have seen this with SOA: these are cases that fit greenfield projects, no oude koeien (a colorful Flemish way to refer to legacy stuff). I have always been convinced that, in perspective, a public Cloud is an enabler for distributed designs because of elasticity at relatively low investment (where "relative" means: you still need the skills; someone considers them sunk costs, I do not).

Whatever. I won't bother you any more with my bla-blas. On the specific case of AWS, after reading some sharp comments on the post of G. Reese, I took the time to peruse Amazon's SLA and their FAQs, and they failed. It's confusing how they functionally presented Availability Zones: in good faith, someone might have developed a system thinking that AWS would have guaranteed continuity across AZs. Expect them to change something in the coming days.

Massimo
April 30, 2011 at 3:20 PM

No bother at all; it's good to have a debate. Thanks for jumping in and sharing your views.

    Massimo.

Lance Berc
May 1, 2011 at 4:42 PM

It's a matter of cost and complexity versus perceived need. Leaning on facilities for availability in the underlying infrastructure greatly simplifies application life-cycles, lowering development, test, and maintenance costs and increasing business agility. Distributed systems that can pass Chaos-Monkey-style testing are very complex, the testing is very expensive, and they're generally one-off applications. DNS, cited earlier, is a good example, as is AD. Yet we see them fail too, usually when a botched complex configuration doesn't come to light until something distantly related fails. In addition, those sorts of systems also tend to require skilled priests to feed and care for the deployment. (If you think it's easy, Nominum is hiring more people to work on the next-generation BIND system. I'd be happy to forward some resumes.)

When faced with the costs associated with such systems, it's not surprising that those paying initially say there is no need for such complexity. It's only through the losses associated with failure that perceptions change. The costs to ensure real Business Continuity are currently so high that most companies require a CEO- or even Board-of-Directors-level mandate before embarking on an initiative.

What's needed is a level of software above current infrastructure and below today's applications that provides scaled distributed data access and persistence while easing development, test, and maintenance burdens. Relational database answers like GoldenGate are prohibitively expensive for many systems; NoSQL by itself isn't an answer, nor are sharding key/value stores. So I think systems built on technologies like GemFire have a very bright future, if the GemFire layer can be made general enough to support a wide variety of use cases; it has the right primitives for scale, performance, distribution, and persistence, with coherency rules one can actually understand. This is a facility above IaaS that should help make PaaS a legitimate layer to develop on.

Failure happens. People that rely on multiple in-memory copies for persistence are just delaying their day of reckoning, and in the end there is no real replacement for streaming transaction logs to tape. The grey-hairs embed this knowledge in the mantra: "Amateurs talk about backup; professionals talk about recovery."


But maybe it doesn't matter in these days when people become billionaires before making a profit; isn't occasional data loss somewhat overvalued?

    lance

Massimo
May 2, 2011 at 4:31 AM

What to say, Lance... not too bad for a 6am reply. Additional food for thought.

    Massimo.

Sony and Amazon; are you a victim or in control? | Capgemini
May 3, 2011 at 10:15 AM

[...] So clearly we should be designing our apps to fail? That's easy to say, but not so easy to square with the basic idea that we can have cheap and flexible apps for short periods. A much more radical approach as to exactly what technology we are using, and exactly what that means in terms of expectations and options, is, I believe, called for. At the root of this is the difference between TCP-based cloud services and UDP-based cloud services, a little understood topic, which in this case can be summarised as: AWS uses UDP as a basis for its clouds, and most IT departments have an expectation that the service level they will receive is that of a TCP cloud. Some people think that this is a controversial argument, but at its root is a very simple set of differences, starting with TCP being connection-oriented and UDP being connectionless. This yin-yang occurs at every level of the two approaches, and hopefully I have now interested you enough to go to the lively and interesting blog of Massimo on IT 2.0, and next-generation IT infrastructures, in which he discusses this topic. [...]

Jacques Talbot
May 3, 2011 at 10:20 AM

Massimo,

I tentatively propose that the appropriate motto is "design for some failures", in the sense that the infra should take care of some failures and the application of some other failures. Let's take Azure as a model, for a change from the Amazon obsession of the last few days. As clarified by David Chappell, there is a fundamental programming-model assumption in the Azure PaaS: "An application that follows the Windows Azure programming model must be built using roles, and it must run two or more instances of each of those roles. It must also behave correctly when any of those role instances fails."

So, for the Web and Worker roles, you MUST have a cluster of VMs, and its minimum size is 2. Moreover, the PaaS, under the cover, has the right to kill one of the instances (to patch the OS, for example). This forces the programmer to think about state in a different way, and more or less requires a reliable cache service. On the other hand, the data tier is (supposedly) reliable, and the application is not supposed to assume that data come and go too often (RTO and RPO permitting). So perhaps it looks like the Microsoft Azure philosophy is: design around web and business tier failures, trust the data tier.

    My 2 cents

Massimo
May 3, 2011 at 2:55 PM

Thanks, Jacques, for chiming in. The more I think about this, the more I am convinced that all this discussion should be around stateful services (e.g. databases).


Stateless services are a no-brainer: how long have we been using 2+ instances of the front-end, since the inception of the web? If we stick to a pure 2- or 3-tier web architecture, let's just agree we don't need a TCP-cloud for the front-end. Easy. The problem, as usual, is with the backend and with the data. I know very little about Azure and I have never actually played with it, but my feeling is that the way they handle SQL Azure is not very different from how Amazon treats RDS. As far as I understand, they are both clustered instances of a database (MS SQL Server and MySQL) whose (db) interfaces are exposed to the consumer. I believe Amazon has a clear description of the RDS implementation. I am not sure if MS has one, but my speculation is that it's an MSCS-clustered SQL database. In a way this is a TCP-cloud (a PaaS cloud, or better, a DBaaS cloud).

So in a way, if your applications adhere to these patterns (and these technologies), both Amazon and Microsoft do provide a TCP-cloud type of service (where it's needed). My rant was about those applications that do not adhere to this pattern (web app), not these technologies (MySQL, MSSQL). Not only do they not have a solution for legacy applications, but even for web applications the devil might be in the details. Forget about the bashing around stability (it's not the point); this article touches on a few reasons why, if you are used to developing/deploying a web application in house, you may find substantial differences in a PaaS cloud environment: http://www.carlosble.com/2010/11/goodbye-google-app-engine-gae/

I can imagine a great number of customers I have been working with finding it very difficult to lose, all of a sudden, full control over the OS/middleware layer that is backing their web applications. The devil is always in the details.

    Massimo.

Fabio Rapposelli
May 3, 2011 at 5:10 PM

Ciao Massimo,

As many others have said, I love the TCP/UDP cloud analogy. But my concern is that we will still have to design for some degree of failure even if the underlying architecture is already providing appropriate resiliency.

I'll give you a fairly recent example: http://www.theregister.co.uk/2011/05/01/stalled_by_a_lan_switch/

In that case, they were providing redundancy at the infrastructure level, but the fail scenario wasn't a graceful one, or one that you design your infrastructure for (well, sometimes you can't even design around similar scenarios). The problem is that if you don't design (or re-engineer, or re-host) your application for failure, it will still be tied to the highest level of protection that your infrastructure can give you, and sometimes this is simply not enough.

    Just my 0.02

    ciao,

    Fabio

Massimo
May 4, 2011 at 7:22 AM

Fabio,

Mh... I tend to disagree with this view. I am in favor of discussing whether more resiliency should be built into the application vs the infrastructure. However, I am not in favor of creating resiliency at both layers (at least not to the level that creates overlapping efforts). If I have to build resiliency into the application to overcome problems that a resilient infrastructure (TCP-cloud) may experience, then I'd just use a NON-resilient infrastructure (UDP-cloud) in the first place.


I have seen a Parallel Sysplex go down entirely, crashing all the applications running on it. Yet this doesn't mean mainframe programmers build resiliency into the application to overcome events like this. Problems happen. Sure, if you build resiliency into the app and deploy it onto a resilient infrastructure you drastically diminish the chance of an outage, but on the other hand you exponentially increase the costs.

    Massimo.

4 fallacies we're hearing in the wake of the AWS outage | Datacentre Management.org
May 5, 2011 at 11:05 PM

[...] we wish to worry less, and let a height do more. Massimo Re Ferré's fascinating post, TCP-clouds, UDP-clouds, design for fail and AWS, likewise hurdles a required knowledge that concentration architects ought to sojourn wholly [...]

Anne Mansson
May 14, 2011 at 11:40 AM

This thread was a very interesting start to my Saturday morning, thanks! Even though I don't fully grasp all of the low-level discussions, I would like to add some thoughts:

1) I think it is very crucial to expose a cost comparison between a 100%-available solution and a solution with 99.x% availability. In my opinion, most non-functional requirements in tenders/RFPs demand the 100% availability approach. This is of course something you would expect from an application, but we all know that this is complex to achieve and comes with a very hefty price tag.

I urge all IT suppliers to ask your customer (note: make sure this is someone from the business side that is responsible for picking up the bill, not an IT person) if they are willing to pay a 2-5 times higher price for the solution to support the 100% availability requirement, or if they can settle for a lower availability guarantee with a significantly lower cost. Let the client evaluate the cost/benefit based on realistic risk assumptions! OK, so AWS broke down once, but honestly, how often does this occur? What impact do these failures have on your business compared to the costs of trying to avoid them? (Is there really any 100% guarantee IRL?)

2) In the best of all worlds the programmers would care about the non-functional requirements. I would say that 90+% of programmers do not have a clue! They do not understand infrastructure, virtualization, high availability, fail-over, recovery, response times, network latency issues, etc. They are focused on functional requirements.

3) I believe there is a need for the cloud providers (public and private) to offer new innovative HA solutions using virtualization techniques, SAN replication, load-balancing networks, etc. to replace the traditional way of building HA solutions in the middleware layer using clustering. This would remove a lot of complexity in the PaaS layer. Does anyone provide this today?

    //Anne

TCP-clouds, UDP-clouds, design for fail and AWS
May 20, 2011 at 7:12 PM

[...] IT 2.0 fail, TCP-clouds, UDP-clouds, design [...]

Massimiliano
June 15, 2011 at 5:47 PM


Hello Massimo,

I think you touched a very interesting point, one that will spark (and already has!) a rich debate. The answer I would give would be: it depends on the type of application... the safest position.

Moving to the cloud computing model means moving to a new business model; as such, the applications might need a re-engineering simply for business reasons, so in such a case why not add an extra effort, make them TCP-like, and gracefully survive a UDP-cloud failure (see here an example)?

Thinking high level, I would say that multi-tiered applications might hide intricacies that could lead to faults in case of a simple porting to the cloud. On the other side, could we say that apps adhering to the SOA model are better candidates for the cloud and, as such, easier to re-engineer than non-SOA apps? Probably yes.

I might now be under the sceptical influence of my home reading, but if my core business had to move to the cloud, I would seriously consider making my apps robust and not relying too much on the cloud service, simply to avoid falling into the Platonic fallacy of immature standards, and because a Black Swan might just be lurking in there.

Again, thanks for your interesting blog.

    Massimiliano.

Massimo
June 16, 2011 at 9:47 AM

I guess that "it depends" is a good summary of this discussion.

    Massimo.

the worst still to come
October 15, 2011 at 9:22 PM

So per VMW sales: buy vCloud and you have better availability? Well, this is just lies, isn't it? If a vShield Edge device fails (any related host failure), it has no backup until ~5 minutes later; until then, all VMs on all other hosts that rely on vShield Edge fail for 5 minutes. Those applications might not come up if TCP sessions are not re-initiated. How do these facts not tell you quite the opposite about vCloud (which relies heavily on vShield Edge): that it is NOT highly available at all?

Massimo
October 16, 2011 at 10:47 PM

This sounds like the junk arguments our friend Koren Lev would use. Is that you, or a close friend that stole your junk, Koren? 5 minutes to restart an Edge appliance? Excuse me, on which planet?

Will
May 29, 2012 at 7:10 AM

I'm dumb, so can you explain this sentence? I do not know what it means. I hear LOTS of bloggers use this sentence over and over: "To me PaaS is all about moving the level of abstraction to a higher level."


Massimo
May 29, 2012 at 9:29 AM

Robert, see this: http://it20.info/2010/11/random-thoughts-and-blasphemies-around-iaas-paas-saas-and-the-cloud-contract/

Jon
February 5, 2013 at 11:42 PM

Hi Massimo,

Just wondering if I could use your TCP and UDP comparison image in an assignment I am doing (i.e., non-commercial)?

Thanks
Jon

Massimo
February 6, 2013 at 12:03 PM

    Of course you can.
