
TCP-clouds, UDP-clouds, design for fail and AWS

BY MASSIMO, ON APRIL 27TH, 2011

An entire Amazon AWS Region was recently down for four days. Everyone has got to blog something about it and this is my attempt. Just as a warning: this post may be highly controversial.

There has been a litany of tweets pontificating how applications on AWS should be deployed in a certain way to achieve the maximum level of availability and how applications need to be re-architected to properly fit into the new cloud paradigm. Basically the idea is that your application should be conceived, designed, architected, developed and deployed with failure in mind. Many call it "design for fail". That is to say: software architects and developers should never assume that any given piece of the infrastructure is reliable.
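To make the idea concrete before I argue against it: in code, "design for fail" typically means assuming any call can fail and handling that inside the application itself. A minimal sketch of the pattern (the endpoint names are made up, and this is just an illustration, not anyone's actual implementation):

```python
import time
import urllib.request

# Hypothetical replicas of the same service in two Availability Zones.
ENDPOINTS = ["http://app-az-a.example.com/health",
             "http://app-az-b.example.com/health"]

def call_with_failover(endpoints, retries=3, backoff=1.0):
    """Try each endpoint in turn, retrying with exponential backoff:
    the application, not the infrastructure, owns failure handling."""
    for attempt in range(retries):
        for url in endpoints:
            try:
                with urllib.request.urlopen(url, timeout=2) as resp:
                    return resp.read()
            except OSError:
                continue  # this replica failed; try the next one
        time.sleep(backoff * (2 ** attempt))  # back off before a new round
    raise RuntimeError("all replicas failed; degrade gracefully or give up")
```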

I beg to differ. I don't like this idea, even though some of you will think I am a bit archaic.

George Reese wrote a great blog post titled The AWS Outage: The Cloud's Shining Moment outlining the differences between the "design for fail" model and the "traditional" model. The traditional model, among other things, has high-availability and DR characteristics built right into the infrastructure, and these features are typically application-agnostic (a couple of years ago I wrote a big document on the various alternatives for HA and DR of virtual infrastructures, if you are interested). George nailed down the story very well, and the story is that there are a couple of different philosophies at play here. I don't call these two models "design for fail" and "traditional" though. I call them TCP-clouds and UDP-clouds. Let's look at a summary of the characteristics of these two protocols.
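For readers who want the protocol analogy spelled out, a minimal sketch in Python sockets (purely illustrative): TCP is connection-oriented with acknowledged, ordered delivery; UDP is connectionless and best-effort.

```python
import socket

# TCP: connection-oriented; the stack handles acknowledgements,
# retransmission and ordering. connect()/sendall() fail loudly if the
# peer is unreachable -- reliability is built into the transport.
tcp = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
tcp.settimeout(2)
tcp.connect(("example.com", 80))
tcp.sendall(b"HEAD / HTTP/1.0\r\nHost: example.com\r\n\r\n")
print(tcp.recv(1024))
tcp.close()

# UDP: connectionless, best-effort; sendto() succeeds locally whether
# or not anyone receives the datagram. If you need reliability, *you*
# build it (timeouts, retries, dedup) -- the "design for fail" model.
udp = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
udp.sendto(b"ping", ("203.0.113.1", 9999))  # no error even if host is down
udp.close()
```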

In the context of cloud resiliency, this is what that means:


AWS uses a UDP-cloud model because it doesn't guarantee reliability at the infrastructure level. AWS essentially offers an efficient distributed computing platform that doesn't have any built-in high-availability services. The notion of Availability Zones and Regions is often misunderstood, since the name may imply there is high availability built into the EC2 service. That's not the case: AWS suggests deploying in multiple Availability Zones simply to avoid concurrent failures. It's mere statistics. In other words, if you deploy your application in a given Availability Zone, there is nothing that will fail it over to another Availability Zone as part of the AWS service (RDS is a vertical example that does that for MySQL, but I am instead talking about an application-agnostic service that does that for every application, regardless of its nature).
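To illustrate the point (a sketch with today's boto3 SDK, which postdates this post; the AMI ID and AZ names are placeholders): spreading instances across Availability Zones is something you do yourself at deploy time, not something EC2 does for you.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# You choose the placement; EC2 will not fail a running instance over
# to another AZ on your behalf. Deploying one copy per AZ only lowers
# the odds of a correlated failure -- "mere statistics".
for zone in ["us-east-1a", "us-east-1b"]:
    ec2.run_instances(
        ImageId="ami-12345678",   # placeholder AMI
        InstanceType="t2.micro",
        MinCount=1,
        MaxCount=1,
        Placement={"AvailabilityZone": zone},
    )
```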

Since I am not able at the moment to write a structured thought around this complex matter, let me write down mixed and random thoughts, opinions and questions to try to make you think. I am giving you some food for thought. As for answers, call me when you find them, please.

Isn't this "design for fail" theory a step back?

What we have seen in the last decade was a trend where we were able to remove the non-functional requirements complexity from within the traditional OS and push it down into the virtual infrastructure (arguably the backbone of any IaaS cloud). This is the point I was trying to get across during this VMworld 2007 breakout session, 4 years ago. And what we are saying now is that we should put that logic back into the application (not even the Guest OS)? I thought the trend I have just described was quite successful and one of the many reasons for the success of virtualization deployments. Are we now questioning it? My idea is fairly simple, although I am open to being challenged: developers focus on functional requirements, IT focuses on non-functional requirements (which include resiliency and reliability, among other aspects). If interested, you can download the full deck here. Note I did that presentation before joining VMware so, if you think I am biased,


well, I am biased just because I bought into that school of thought long before I was on VMware's payroll.

    Excuse me? What did you say? NoSQL to whom?

In his post George suggested exploring NoSQL solutions. Not a bad idea; however, other than the risk of losing transactions that he was mentioning, I'd say 95% of the customers I have been working with so far would look at me strangely and they'd ask: "What exactly do you mean by NoSQL? Is it a bad word?". Let's be honest, folks: this is not mainstream. If we want to create a cloud for an elite of people, I am fine with that. However, I am convinced one of the key values of an IaaS infrastructure is, among others, providing a cloud-like experience (pay-as-you-go, elasticity, etc.) to traditional workloads. I am not philosophically against the idea of re-architecting applications; however, I am also convinced that, for one person thinking about writing a brand new Ruby application for a UDP-cloud leveraging NoSQL (pardon me?), there are at least 1,000 poor sysadmins trying to figure out how to live with their traditional applications.

    Can you afford a personal Chaos Monkey?

Some of the AWS customers developed tools to test the resiliency of their applications. Do you remember the good old HA and DR plans? IT people would walk into the server room to power off servers, and eventually the entire datacenter, to simulate a failure and see if their HA and DR policies were working properly. If everything was good, applications could survive the failure (more or less) transparently. This is what a Chaos Monkey tool does, but with a different perspective: these are software programs that are designed to break things randomly (on purpose) in order to see if the application itself is robust enough to survive those artificially created infrastructure issues in the cloud. In a TCP-cloud it would be the cloud provider running traditional tests to make sure the infrastructure could self-recover. In a UDP-cloud it is the developer running these Chaos Monkey tests to make sure the application can self-recover, since it's been designed for fail.
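A toy sketch of the idea, not Netflix's actual tool (boto3 again; the tag name is hypothetical): pick a random instance from your fleet, kill it, and watch whether the application survives.

```python
import random
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Find running instances belonging to the application under test.
reservations = ec2.describe_instances(
    Filters=[
        {"Name": "tag:app", "Values": ["my-webapp"]},  # hypothetical tag
        {"Name": "instance-state-name", "Values": ["running"]},
    ]
)["Reservations"]
instances = [i["InstanceId"] for r in reservations for i in r["Instances"]]

# Break something on purpose: if the app was truly designed for fail,
# users should not notice.
victim = random.choice(instances)
print(f"terminating {victim}")
ec2.terminate_instances(InstanceIds=[victim])
```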

Now, my take is that if you are Netflix or the likes of NASA and JPMorgan (these two are just examples of big organizations; I am not even sure they are on Amazon), then you may have enough motivation and business reasons to re-architect your application for a UDP-cloud and create your own Chaos Monkey to test your "design for fail" deployment. Certainly at Netflix they know what they are doing, and in fact they seem not to have been impacted by this AWS outage. But if you are these guys, do you think you have the bandwidth, knowledge and time to re-architect the application and test it for failure? This AWS forum discussion showed up during the 4-day debacle and it deserves a proper copy and paste, just in case it gets lost:

< Sorry, I could not get through in any other way. We are a monitoring company and are monitoring hundreds of cardiac patients at home. We have been unable to see their ECG signals since the 21st of April.

> Man, mission-critical systems should never be run in the cloud. Just because AWS is HIPAA certified doesn't mean it won't go down for 48+ hours in a row.

< Well, it is supposed to be reliable... Anyway, I am begging anyone from the Amazon team to contact us directly.

This is shocking, isn't it? Try to argue with them about NoSQL and "design for fail". They probably barely understand the notion of Availability Zones and Regions. Don't get me wrong, it's not these people's fault. They are not in the business of re-architecting an application to be written with reliability in mind; they are in the business of helping their patients. Sure, you can argue that it was their fault if they failed. But the net of this story is that they are not going to re-architect anything, nor write a Chaos Monkey. When they realize what happened, they will look for a TCP-cloud.

    Design for fail: philosophy or necessity?

I hope you've made it at least to this point, because this is my biggest struggle at the moment. The more I read suggestions to design applications for fail, the more I wonder whether these suggestions are tactical or strategic. In other words, are you suggesting to design for fail simply because that's the way Amazon AWS works today (but you'd rather use an Amazon TCP-cloud if that was available)? Or are you suggesting that, in any case, you should design an application for fail because you are happy to deal with a UDP-cloud and that's how every cloud should behave? Are we saying that it's strategically and philosophically better to have developers deal with application high availability and disaster tolerance because that's what makes sense to do?


Or are we saying we need to do this because that's the only option we have on Amazon AWS (today) and there is no other choice? I know it may sound like a rhetorical question, but it's actually not. Perhaps we need both models?

You don't like the noise coming from the other apartments? Buy the entire building!

This isn't related to the outage and the resiliency of the cloud, but it relates to the overall TCP-cloud vs UDP-cloud discussion. Similar to the "design for fail" thread, there is a "deploy for performance" thread going on. In a multi-tenant environment (a must-have to achieve economy of scale and elasticity) there is obviously contention of resources. In an ideal world I'd like to be able to buy virtual capacity for what I need and have a certain level of guarantee that that capacity (or at least a contracted part of it) is always available to me. There are of course circumstances where I can trade off performance and availability of capacity for a lower cost, but there are other situations where I cannot trade that off. A TCP-cloud should (ideally) be able to deliver that guarantee. A UDP-cloud works in best-effort mode and typically leverages statistical laws to fight contention. This is the statistical assumption: not all users running on a shared infrastructure will be pushing like hell at the same time (one would hope, fingers crossed).

So what do you have to do if you are running on a UDP-cloud? You keep the other people out of your garden. I think Adrian is a genius, but I don't agree with his point of view:

You cannot control who you are sharing with and some of the time you will be impacted by the other tenants, increasing variance within each EC2 instance. You can minimize the variance by running on the biggest instance type, e.g. m1.xlarge or m2.4xlarge. In this case there isn't room for another big tenant, so you get as much as possible of the disk space and network bandwidth to yourself.

[A] busy client can slow down other clients that share the same EBS service resources. EBS volumes are between 1GB and 1TB in size. If you allocate a 1TB volume, you reduce the amount of multi-tenant sharing that is going on for the resources you use, and you get more consistent performance. Netflix uses this technique; our high-traffic EBS volumes are mostly 1TB, although we don't need that much space.

If you ever see public benchmarks of AWS that only use m1.small, they are useless; it shows that the people running the benchmark either didn't know what they were doing or are deliberately trying to make some other system look better.

The last sentence is like saying that, if you buy a new apartment and then complain about the noise coming from the other apartments, it's your fault: you should have bought the entire building and enjoyed the silence! Hell, Adrian, I say no! There must be a better way.

I think there must be rules in place to keep the noise at an acceptable level, and if there is someone trying to scream all the time, someone should enforce silence without you having to buy an entire building to cook and sleep in peace. That's how it works in real life; that's how it should work in the cloud. In my opinion, at least.

In cloud terms, I'd be OK if what I was buying always delivered a contracted baseline as a guarantee and could then burst (I said burst, Beaker, not cloudburst) to higher throughput if there isn't contention. What I would NOT be OK with is no baseline at all, so that what I get is unpredictable performance at all times. BTW, note that Amazon made a step in the right direction a few weeks ago, announcing the availability of what they call dedicated instances. This is an attempt to solve the noisy-neighbors problem. However, in doing so they did trade off multi-tenancy (hence the higher cost of such a service).
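What "contracted baseline plus burst" could look like, reduced to its simplest possible form (my own conceptual sketch, not any provider's actual QoS code): a per-tenant token bucket whose refill rate is the guaranteed baseline and whose depth is the burst allowance.

```python
import time

class TenantBucket:
    """Token bucket: 'rate' is the contracted baseline (tokens/sec,
    always replenished), 'burst' is how far above baseline a tenant
    may go when the bucket has accumulated idle capacity."""

    def __init__(self, rate, burst):
        self.rate, self.burst = rate, burst
        self.tokens, self.last = burst, time.monotonic()

    def allow(self, cost=1.0):
        now = time.monotonic()
        # Refill at the guaranteed baseline rate, capped at burst depth.
        self.tokens = min(self.burst,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True   # within baseline or accumulated burst
        return False      # throttled; other tenants keep their baseline

# Each tenant gets its own bucket, so one screaming tenant cannot
# consume another tenant's contracted capacity.
tenants = {"quiet": TenantBucket(rate=100, burst=200),
           "noisy": TenantBucket(rate=100, burst=200)}
```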

For the record, I have to say that I don't think there is a single public cloud at the moment delivering such fine-grained QoS across all subsystems on rented resources. This is a generic discussion about TCP-clouds and UDP-clouds, and if you interpreted it as a vCloud vs AWS shootout, you are mistaken. In fact, I think George gave vCloud too much credit in his blog by associating it with the "traditional datacenter" model. There is a gap between what we can deliver, in terms of non-functional requirements, with a raw vSphere deployment and what we can deliver with a vCloud Director 1.x implementation. I am not hiding this by any means; in fact, you can read here (the post, but more importantly the comments) what I had to say about this.


Having said this, I believe VMware has a vision to fill that gap and create a true TCP-cloud. Last but not least, I don't see why a VMware service provider partner shouldn't be able to implement a vCloud-powered UDP-cloud if need be.

    PaaS and Design for fail?

If I struggle with IaaS clouds (and I do), go figure with PaaS clouds. To me, PaaS is all about moving the level of abstraction to a higher level. IaaS is all about hiding infrastructure details. PaaS is all about hiding infrastructure and middleware details. In a PaaS you can upload your WAR file and that's it. It's the PaaS cloud provider that is going to deal with the complexity of setting up, managing and maintaining the middleware stack that can interpret that WAR file (for example). Fundamentally, the developer should focus (even more than with IaaS) on the functional requirements of the application and let the cloud provider deal with the non-functional requirements of it. Last time I checked, HA and DR were still part of the non-functional requirements domain. Note that, ironically, it may be easier for a PaaS cloud provider to build out-of-the-box resiliency, given the nature of the interfaces they are exposing. Amazon is halfway there already with their RDS MySQL-as-a-service: they already offer automatic failover across Availability Zones, and they would just need to extend this failover support across Regions (this would have helped with the recent failure, by the way). So, if my theory is sound, that means that if you are architecting your application for PaaS you shouldn't design for fail. Upload your WARs, create a db instance on the fly, and you are done. The cloud provider will figure out how to fail over to the next server, to the next datacenter room, or to another geography, should a problem occur at any of the given levels.
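The RDS piece is visible right in the API; a minimal boto3 sketch (identifiers and credentials are placeholders), where a single MultiAZ flag asks the provider, not the application, to handle failover across Availability Zones:

```python
import boto3

rds = boto3.client("rds", region_name="us-east-1")

# MultiAZ=True tells the service to keep a standby replica in another
# Availability Zone and fail over automatically -- resiliency delivered
# as a platform service, invisible to the application.
rds.create_db_instance(
    DBInstanceIdentifier="mydb",      # placeholder
    Engine="mysql",
    DBInstanceClass="db.t3.micro",
    MasterUsername="admin",
    MasterUserPassword="change-me",   # placeholder
    AllocatedStorage=20,
    MultiAZ=True,
)
```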

So why isn't Amazon offering resiliency and reliability as part of their cloud services, in the end?

After all, they offer other non-functional requirements, such as automatic scaling of applications through tools like Auto Scaling. So why would Amazon offer auto-scaling services but not an automatic, agnostic, infrastructure-level recovery service across Availability Zones (or, even better, across Regions)? Guess what: it is at least two orders of magnitude easier to instantiate a new web server and add an IP to a load balancer than to implement a (reasonably performant) traditional backend database that can geographically fail over without losing transactions in case of a disaster. Dealing with stateless objects is a piece of cake. Try to deal with stateful objects if you can.
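The stateless half really is that small. A hedged sketch with boto3 and a classic ELB (all names are placeholders): launching one more web server and putting it behind the load balancer is a few calls, with no state to reconcile; compare that with geographically failing over a database without losing transactions.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
elb = boto3.client("elb", region_name="us-east-1")

# Scaling the stateless tier: boot another identical web server...
instance = ec2.run_instances(
    ImageId="ami-12345678",    # placeholder web-server AMI
    InstanceType="t2.micro",
    MinCount=1,
    MaxCount=1,
)["Instances"][0]

# ...and register it with the load balancer. Done. No transactions,
# no replication, no quorum: that is why auto-scaling shipped first.
elb.register_instances_with_load_balancer(
    LoadBalancerName="my-web-elb",   # placeholder
    Instances=[{"InstanceId": instance["InstanceId"]}],
)
```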

I am sure Amazon doesn't think that dealing with auto-scaling is something the cloud should do for developers whereas dealing with reliability and DR is something a developer should do on his/her own. What do you think? My speculation is that they are simply not there yet. As easy as that. But don't be fooled: Amazon is full of smart people and I think they are looking into this as we speak. While we are suggesting (to an elite of programmers) to design for fail, they are thinking about how to auto-recover their infrastructure from a failure (for the masses). I bet we will see more failure-recovery-across-AZs-and-Regions types of services, in one form or another, from AWS. I believe they want to implement a TCP-cloud in the long run, since the UDP-cloud is not going to serve the majority of the users out there. Mark my words. I'll have to link to this blog post once this happens and I'll have to say "I told you" (I hate this). And that is only going to be a good thing, because developers will start again to focus on functionalities, and IT (the cloud) will continue to focus on making sure those functionalities are (highly) available.

As I said, just food for thought. If you find definitive answers, please let me know.

Last but not least, this is a good time to recall the disclosure of my blog (courtesy of a big copy and paste from Sam Johnston's blog): "The views expressed on these pages are mine alone and not (necessarily) those of any current, future or former client or employer. As I reserve the right to review my position based on future evidence, they may not even reflect my own views by the time you read them. Protip: If in doubt, ask."

    Massimo.


    31 comments to TCP-clouds, UDP-clouds, design for fail and AWS

Steve Bryen
April 27, 2011 at 4:48 PM

Hey,

Nice post. I love the UDP/TCP cloud analogy.

The main issue that I see with TCP-clouds is that you can get away with running your current app on them and utilise the features that vendors such as VMware offer (HA/DR), but if you want 100% uptime you are always going to have to design for failure at the application layer.

If I hired a Chaos Monkey to go and pull out some VMware ESXi blades, customers wouldn't be very happy. Although their VMs would restart, there would be an outage (considering I am not running FT, as my machines are not compatible). However, if I designed my application for failure and ran it across different Availability Zones (blade chassis/clusters), they would be much happier. Yes, they would still know there was an outage; however, their application would have still been running.

    Just my 2 cents

    Steve

Massimo
April 27, 2011 at 10:48 PM

Steve, thanks for the comment. I agree with what you are saying. My argument is that designing an application for fail (especially transparently, with not even a brief outage) is a titanic effort, and most may be happy leveraging generic platform services.

Steve Chambers
April 27, 2011 at 5:16 PM

Resiliency... CHECK

Reliability... CHECK

Recoverability... but we can't back it up, so we can't recover it... what now?

Massimo
April 27, 2011 at 10:44 PM

Steve,

> Recoverability... but we can't back it up, so we can't recover it... what now?

Did you read the post? That's one of the reasons why I said vCloud is not yet a full TCP-cloud.

    Massimo.

Doug B
April 27, 2011 at 7:10 PM


Massimo,

Good stuff here. My thought is that this should be handled at both layers, and implementation (or not) depends on the availability needs of an application. To that end, I'll play somewhat of a Devil's Advocate.

If you're building scaling/elasticity into an app, you should probably be communicating with the platform, and it should be (made) fairly straightforward to leverage the DR/HA services provided by that platform. I would agree with you that autoscaling provided by a platform is fairly trivial for stateless workloads and significantly more complex for anything else.

If you have a simple application that does not require (auto)scaling (think of a traditional app, wrapped in a VM), I would think the (transparent) HA capabilities of the platform (a la VMware HA) would apply. With that level of HA (no application awareness), there would be a brief outage as recovery occurs, and that might be acceptable for most users/applications. I propose that, for those with higher availability needs, a modified, cloud-aware application that is designed with the cloud platform in mind may suffer brief performance degradation while it re-protects itself across availability zones or regions, but it would not go offline. I think of this like VMware FT, but at the application layer. VMware FT is an interesting feature, but the complexity involved with what it actually does behind the scenes has got to be significant. (Those developers have my respect, for sure.)

Hacking the infrastructure to prevent *any* additional work by the application developers is, in my opinion, adding unnecessary complexity to the entire stack. Is it really too much to ask that developers leverage the services provided by the platform rather than blindly compiling code that meets functional requirements while assuming it will run anywhere? Let's apply the solution to the places where it makes the most sense.

    My 2 cents,

    Doug

Massimo
April 28, 2011 at 9:52 AM

Hi Doug. Thanks for the comment. I believe what you are suggesting is pretty much in line with what VMware has been trying to pitch so far, which doesn't necessarily mean it's the right thing to do and that you should follow the dogma. It just makes sense to me. When I was writing the post I was thinking that what I was putting down was sort of neglecting the concept of DevOps. That's not actually the case since, as you point out, developers may still be able to interface with the infrastructure by subscribing to services provided by it. Obviously this cannot be done in the application code (IMO), but it needs an additional level of abstraction/wrapping that is a sort of bridge between how the application behaves (or has been engineered) and the services the infrastructure publishes. It's a long way to go, but I believe that OVF can be that bridge, where developers can describe what they need the infra to deliver. If the infra doesn't understand this metadata, it will just ignore it.

    Massimo.

DeckerEgo
April 27, 2011 at 7:12 PM

I can definitely agree that having to force software design constraints around design-to-fail is a step backward. Hardware failures happen, without a doubt, but the right infrastructure should shield the applications from outright hardware failure.

Application architecture should be centered on scalability (i.e. stateless and asynchronous architectures); infrastructure architecture should be centered on fail-safety (split-brain networks, drive failures, PSU fires).


Enrico Signoretti
April 28, 2011 at 12:52 PM

Massimo,

I strongly agree with your point of view and I would like to add my two cents. Design-for-fail/UDP-clouds have two big issues:

1) The development cost of a UDP-cloud-safe application is very high: you need a more skilled development team with reliability in their DNA, a longer development process (== time), and more resources for test/quality purposes.

2) If you develop a designed-for-fail application/service on a UDP-cloud, you need to know the underlying technology and APIs very well, with the risk of writing very closed and non-portable software. Services like AWS are sold with the promise of big savings on the infrastructure, but if I need to rethink all my applications from 0 to adapt them to AWS, the only result is to move my money from one pocket to another!

The sum of these two points reminds me of the mainframe years... are we sure we want to go back to the past? Do we need to start talking about public cloud lock-in?

    ciao,

    Enrico

LChichiarelli
April 28, 2011 at 2:02 PM

Massimo,

I really enjoyed reading your article; it introduced an interesting philosophy for understanding the implications of developing and deploying applications with failure in mind. I think you have coined two new terms, TCP-cloud and UDP-cloud, that we will hear a lot in the future.

    Luca

Gabriele
April 28, 2011 at 11:10 PM

Massimo,

I like the TCP/UDP comparison but, as a longstanding fan of UDP, I feel the need to challenge you. DNS is a fully redundant and reliable service. It works mainly over UDP and falls back by design to TCP if there is a response truncation.
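(The fallback described here is easy to demonstrate with the dnspython library; the server and name below are just examples: query over UDP first, and retry over TCP only when the response comes back with the truncation (TC) bit set.)

```python
import dns.flags
import dns.message
import dns.query

query = dns.message.make_query("example.com", "TXT")
response = dns.query.udp(query, "8.8.8.8", timeout=2)

# UDP is the fast path; if the answer did not fit in a datagram the
# server sets the TC bit, and the resolver falls back to TCP.
if response.flags & dns.flags.TC:
    response = dns.query.tcp(query, "8.8.8.8", timeout=2)
print(response.answer)
```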

Guess what? DNS is the Internet's backbone and has worked pretty well for a number of years. Why shouldn't UDP-clouds do the same? As you point out, it's about the application level.

If somebody builds an application on the Cloud the same way they do for a classic continuity-oriented infrastructure, then this person is obviously missing the point; unfortunately, 95% of the applications on the Cloud might be missing the point, to my perception. Performance variance, ephemeral instances, shared resources, design constraints: Cloud Computing (in its purest form, as Amazon sells it) is a peculiar environment that needs a rethinking of application architecture paradigms and shifts accountability onto the user. All we have to expect from IaaS services is scalability and an increasing level of openness and interoperability.


Eventually applications will become more distributed and more self-healing. It's a trend, it's happening now, and it will gradually absorb any additional cost (e.g. distributed, portable, stateless apps); such a model might become mainstream. Enrico is right just for now, but not for long.

BTW: I did not really understand the part regarding NoSQL. What did G. Reese want to say? For me it was a bit off-course; no NoSQL I know of guarantees either consistency or availability per se; they always need special attention or an IDA file-system. Regarding NoSQL awareness: I work with some customers that use NoSQL tools and every one of them knows very well the implications of the CAP theorem. It's not a big crowd, but it's encouraging.

Saluti
Gabriele B

Massimo
April 29, 2011 at 5:36 PM

Gabriele, thanks for the comment. Sure thing, you have the right to disagree/challenge.

I don't know if we can make a good parallel between DNS and applications. DNS is a bit of a weird beast. In one way it is a stateful object for which someone found a way to create a nice distributed architecture. However, it also imposes limitations and issues due to its heavy caching-dependent algorithms. But this is not the point. There are hundreds of UDP applications that run just fine; my question is: what about the other thousands of TCP applications?

To my "philosophy vs necessity" question you are basically answering "it's a philosophy", and that applications in the cloud should take into account the volatility of the resources there. We obviously disagree on that point, but that's fine. No one has the magic crystal ball to see what will happen in the future.

I am always doubtful when dealing with re-architecting / re-writing applications. Perhaps it's because I went through the Xeon vs Itanium discussions many years ago. I am not saying that re-architecting applications to fit a UDP-cloud model is going to end up like re-writing applications for Itanium (not at all); however, this is not a matter that can be over-simplified with "it is happening". My opinion.

As far as the NoSQL field experience goes, I guess it really depends on the point of view. This reminds me of a conversation I had with a partner during my tenure at IBM. During an event I said something along the lines of "we don't see RedHat/Xen a lot as far as virtualization is concerned". He approached me saying I was wrong and that this is what he was doing all day long. When I asked what his job was, it turned out to be something like "RedHat virtualization practice leader" for that partner. I made a bold statement that NoSQL-like technologies are not widespread in the field based on what I have seen. If you are very involved in that space, you may have a different perception.

Thanks for commenting.

    Massimo.

Gabriele
April 29, 2011 at 10:03 PM

Massimo, thanks for answering.

Just to point out that I too think NoSQL is NOT a mainstream technology. What I wanted to say is that those who need it know very well how to use it (as usual with early adopters) and that we find, day after day, new practical uses (complementary to common DBs).


I am not specifically dedicated to NoSQL; it's the industry which is starting to use it.

On UDP, my rationale is that such distributed architectures will become progressively easier to tackle and will find their place in the IT panorama. We have seen this with SOA: these are cases that fit greenfield projects, no oude koeien (a colorful Flemish way to refer to legacy stuff). I have always been convinced that, in perspective, a public Cloud is an enabler for distributed designs because of elasticity at relatively low investment (where "relative" means: you still need the skills; someone considers them sunk costs, I do not).

Whatever. I won't bother you any more with my bla-blas. On the specific case of AWS, after reading some sharp comments on the post of G. Reese, I took the time to peruse Amazon's SLA and their FAQs, and they failed. It's confusing how they functionally presented Availability Zones: in good faith, someone might have developed a system thinking that AWS would have guaranteed continuity across AZs. Expect them to change something in the coming days.

Massimo
April 30, 2011 at 3:20 PM

No bother at all; it's good to have a debate. Thanks for jumping in and sharing your views.

    Massimo.

Lance Berc
May 1, 2011 at 4:42 PM

It's a matter of cost and complexity versus perceived need. Leaning on facilities for availability in the underlying infrastructure greatly simplifies application life-cycles, lowering development, test, and maintenance costs and increasing business agility. Distributed systems that can pass Chaos-Monkey-style testing are very complex, the testing is very expensive, and they're generally one-off applications. DNS, cited earlier, is a good example, as is AD. Yet we see them fail too, usually when a botched complex configuration doesn't come to light until something distantly related fails. In addition, those sorts of systems also tend to require skilled priests to feed and care for the deployment. (If you think it's easy, Nominum is hiring more people to work on the next-generation BIND system. I'd be happy to forward some resumes.)

When faced with the costs associated with such systems, it's not surprising that those paying initially say there is no need for such complexity. It's only through the losses associated with failure that perceptions change. The costs to ensure real Business Continuity are currently so high that most companies require a CEO- or even Board-of-Directors-level mandate before embarking on an initiative.

What's needed is a level of software above current infrastructure and below today's applications that provides scaled distributed data access and persistence while easing development, test, and maintenance burdens. Relational database answers like GoldenGate are prohibitively expensive for many systems; NoSQL by itself isn't an answer, nor are sharding key/value stores. So I think systems built on technologies like GemFire have a very bright future, if the GemFire layer can be made general enough to support a wide variety of use cases; it has the right primitives for scale, performance, distribution, and persistence, with coherency rules one can actually understand. This is a facility above IaaS that should help make PaaS a legitimate layer to develop on.

Failure happens. People that rely on multiple in-memory copies for persistence are just delaying their day of reckoning, and in the end there is no real replacement for streaming transaction logs to tape. The grey-hairs embed this knowledge in the mantra: "Amateurs talk about backup; professionals talk about recovery."


But maybe it doesn't matter in these days when people become billionaires before making a profit; isn't occasional data loss somewhat overvalued?

    lance

Massimo
May 2, 2011 at 4:31 AM

What to say, Lance... not too bad for a 6am reply. Additional food for thought.

    Massimo.

Sony and Amazon; are you a victim or in control? | Capgemini
May 3, 2011 at 10:15 AM

[...] So clearly we should be designing our apps to fail? That's easy to say, but not so easy to square with the basic idea that we can have cheap and flexible apps for short periods. A much more radical approach as to exactly what technology we are using, and exactly what that means in terms of expectations and options, is, I believe, called for. At the root of this is the difference between TCP-based cloud services and UDP-based cloud services, a little understood topic, which in this case can be summarised as: AWS uses UDP as a basis for its clouds, and most IT departments have an expectation that the service level they will receive is that of a TCP cloud. Some people think that this is a controversial argument, but at its root is a very simple set of differences, starting with TCP being connection-oriented and UDP being connectionless. This yin-yang occurs at every level of the two approaches, and hopefully I have now interested you enough to go to the lively and interesting blog of Massimo on IT 2.0, and next-generation IT infrastructures, in which he discusses this topic. [...]

Jacques Talbot
May 3, 2011 at 10:20 AM

Massimo,

I tentatively propose that the appropriate motto is "design for some failures", in the sense that the infra should take care of some failures and the application of some other failures. Let's take Azure as a model, for a change from the Amazon obsession of the last few days. As clarified by David Chappell, there is a fundamental programming-model assumption in the Azure PaaS: "An application that follows the Windows Azure programming model must be built using roles, and it must run two or more instances of each of those roles. It must also behave correctly when any of those role instances fails."

So, for the Web and Worker roles, you MUST have a cluster of VMs, and its minimum size is 2. Moreover, the PaaS, under the cover, has the right to kill one of the instances (to patch the OS, for example). This forces the programmer to think about state in a different way, and more or less requires a reliable cache service. On the other hand, the data tier is (supposedly) reliable, and the application is not supposed to assume that data come and go too often (RTO and RPO permitting). So perhaps it looks like the Microsoft Azure philosophy is: design around web and business tier failures, trust the data tier.

    My 2 cents

Massimo
May 3, 2011 at 2:55 PM

Thanks, Jacques, for chiming in. The more I think about this, the more I am convinced that all this discussion should be around stateful services (e.g. databases).


Stateless services are a no-brainer: how long have we been using 2+ instances of the front-end, since the inception of the web? If we stick to a pure 2- or 3-tier web architecture, let's just agree we don't need a TCP-cloud for the front-end. Easy. The problem, as usual, is with the backend and with the data. I know very little about Azure and I have never actually played with it, but my feeling is that the way they handle SQL Azure is not very different from how Amazon treats RDS. As far as I understand, they are both clustered instances of a database (MS SQL Server and MySQL) whose (db) interfaces are exposed to the consumer. I believe Amazon has a clear description of the RDS implementation. I am not sure if MS has one, but my speculation is that it's an MSCS-clustered SQL database. In a way this is a TCP-cloud (a PaaS cloud, or better, a DBaaS cloud).

So in a way, if your applications adhere to these patterns (and these technologies), both Amazon and Microsoft do provide a TCP-cloud type of service (where it's needed). My rant was about those applications that do not adhere to this pattern (web app), not these technologies (MySQL, MSSQL). Not only do they not have a solution for legacy applications, but even for web applications the devil might be in the details. Forget about the bashing around stability (it's not the point); this article touches on a few reasons why, if you are used to developing/deploying a web application in house, you may find substantial differences in a PaaS cloud environment: http://www.carlosble.com/2010/11/goodbye-google-app-engine-gae/

I can imagine a great number of customers I have been working with finding it very difficult to lose, all of a sudden, full control over the OS/middleware layer that is backing their web applications. The devil is always in the details.

    Massimo.

Fabio Rapposelli
May 3, 2011 at 5:10 PM

Ciao Massimo,

As many others have said, I love the TCP/UDP cloud analogy. But my concern is that we will still have to design for some degree of failure even if the underlying architecture is already providing appropriate resiliency.

I'll give you a fairly recent example: http://www.theregister.co.uk/2011/05/01/stalled_by_a_lan_switch/

In that case, they were providing redundancy at the infrastructure level, but the fail scenario wasn't a graceful one, or one that you design your infrastructure for (well, sometimes you can't even design around similar scenarios). The problem is that if you don't design (or re-engineer, or re-host) your application for failure, it will still be tied to the highest level of protection that your infrastructure can give you, and sometimes this is simply not enough.

    Just my 0.02

    ciao,

    Fabio

Massimo
May 4, 2011 at 7:22 AM

Fabio,

Mh... I tend to disagree with this view. I am in favor of discussing whether more resiliency should be built into the application vs the infrastructure. However, I am not in favor of creating resiliency at both layers (at least not to the level that creates overlapping efforts). If I have to build resiliency into the application to overcome problems that a resilient infrastructure (TCP-cloud) may experience, then I'd just use a NON-resilient infrastructure (UDP-cloud) in the first place.


I have seen a Parallel Sysplex go down entirely, crashing all the applications running on it. Yet this doesn't mean mainframe programmers build resiliency into the application to overcome events like this. Problems happen. Sure, if you build resiliency into the app and deploy it onto a resilient infrastructure you drastically diminish the chance of an outage, but on the other hand you exponentially increase the costs.

    Massimo.

4 fallacies we're hearing in the wake of the AWS outage | Datacentre Management.org
May 5, 2011 at 11:05 PM

[...] we wish to worry less, and let a height do more. Massimo Re Ferré's fascinating post, TCP-clouds, UDP-clouds, design for fail and AWS, likewise hurdles a required knowledge that concentration architects ought to sojourn wholly [...]

Anne Mansson
May 14, 2011 at 11:40 AM

This thread was a very interesting start to my Saturday morning, thanks! Even though I don't fully grasp all of the low-level discussions, I would like to add some thoughts:

1) I think it is very crucial to expose a cost comparison between a 100%-available solution and a solution with 99.x% availability. In my opinion, most non-functional requirements in tenders/RFPs demand the 100% availability approach. This is of course something you would expect from an application, but we all know that this is complex to achieve and comes with a very hefty price tag.

I urge all IT suppliers to ask your customer (note: make sure this is someone from the business side that is responsible for picking up the bill, not an IT person) if they are willing to pay a 2-5 times higher price for the solution to support the 100% availability requirement, or if they can settle for a lower availability guarantee with a significantly lower cost. Let the client evaluate the cost/benefit based on realistic risk assumptions! OK, so AWS broke down once, but honestly, how often does this occur? What impact do these failures have on your business compared to the costs of trying to avoid them? (Is there really any 100% guarantee IRL?)

2) In the best of all worlds the programmers would care about the non-functional requirements. I would say that 90+% of programmers do not have a clue! They do not understand infrastructure, virtualization, high availability, fail-over, recovery, response times, network latency issues, etc. They are focused on functional requirements.

3) I believe there is a need for the cloud providers (public and private) to offer new innovative HA solutions using virtualization techniques, SAN replication, load-balancing networks, etc. to replace the traditional way of building HA solutions in the middleware layer using clustering. This would remove a lot of complexity in the PaaS layer. Does anyone provide this today?

    //Anne

TCP-clouds, UDP-clouds, design for fail and AWS
May 20, 2011 at 7:12 PM

[...] IT 2.0 fail, TCP-clouds, UDP-clouds, design [...]

Massimiliano
June 15, 2011 at 5:47 PM


Hello Massimo,

I think you touched a very interesting point, one that will spark (and already has!) a rich debate. The answer I would give would be: it depends on the type of application... the safest position.

Moving to the cloud computing model means moving to a new business model; as such, the applications might need a re-engineering simply for business reasons, so in such a case why not add an extra effort, make them TCP-like, and gracefully survive a UDP-cloud failure (see here an example)?

Thinking high level, I would say that multi-tiered applications might hide intricacies that could lead to faults in case of a simple porting to the cloud. On the other side, could we say that apps adhering to the SOA model are better candidates for the cloud and, as such, easier to re-engineer than non-SOA apps? Probably yes.

I might now be under the sceptical influence of my home reading, but if my core business had to move to the cloud, I would seriously consider making my apps robust and not relying too much on the cloud service, simply to avoid falling into the Platonic fallacy of immature standards, and because a Black Swan might just be lurking in there.

Again, thanks for your interesting blog.

    Massimiliano.

Massimo
June 16, 2011 at 9:47 AM

I guess that "it depends" is a good summary of this discussion.

    Massimo.

the worst still to come
October 15, 2011 at 9:22 PM

So per VMW sales: buy vCloud and you have better availability? Well, this is just lies, isn't it? If a vShield Edge device fails (any related host failure), it has no backup until ~5 minutes later; until then, all VMs on all other hosts that rely on vShield Edge fail for 5 minutes. Those applications might not come up if TCP sessions are not re-initiated. How do these facts not tell you quite the opposite about vCloud (which relies heavily on vShield Edge): that it is NOT highly available at all?

Massimo
October 16, 2011 at 10:47 PM

This sounds like the junk arguments our friend Koren Lev would use. Is that you, or a close friend that stole your junk, Koren? 5 minutes to restart an Edge appliance? Excuse me, on which planet?

Will
May 29, 2012 at 7:10 AM

I'm dumb, so can you explain this sentence? I do not know what it means. I hear LOTS of bloggers use this sentence over and over: "To me PaaS is all about moving the level of abstraction to a higher level."


Massimo
May 29, 2012 at 9:29 AM

Robert, see this: http://it20.info/2010/11/random-thoughts-and-blasphemies-around-iaas-paas-saas-and-the-cloud-contract/

Jon
February 5, 2013 at 11:42 PM

Hi Massimo,

Just wondering if I could use your TCP and UDP comparison image in an assignment I am doing (i.e., non-commercial)?

Thanks
Jon

Massimo
February 6, 2013 at 12:03 PM

    Of course you can.
