When I talk about cloud computing with enterprises and tell them that the cloud requires a different approach to designing applications, I get the biggest pushback from them. Most large enterprises are used to the idea that expensive, powerful hardware that seldom fails is the only way to build robustness into their IT (and so ensure business continuity), so they are appalled by the new way of designing applications for the cloud. They feel they are being forced to subscribe to a completely new paradigm in order to take advantage of the cloud. In spite of the marketing gimmicks from traditional vendors, they understand that cloud computing is more about resiliency than robustness, and that bothers many enterprise IT managers. They have real difficulty changing their mindset from “failure is not an option” to “failure is not a problem”.
I was recently watching one of Clay Shirky’s talks at Singularity University, in which he used the example of cell phone towers to explain the difference between robustness and resiliency in the crowdsourced world. He was contrasting the robustness needed for the survival of Encyclopedia Britannica with the resiliency needed in the case of Wikipedia. It got me excited to make another attempt at explaining to the enterprise community the need to shift their thinking from robustness to resiliency. In short, I want to argue that this mental shift is not a new paradigm that the cloud forces upon enterprise IT but, instead, an old, well-tested idea for dealing with scale.
The example I am going to pick is the cell phone tower example Clay Shirky used in his talk. When you look at the construction of a cell phone tower, its base is broad and built with the idea of robustness but, after a certain height, the tower is not built for robustness. Instead, it is built for resiliency against heavy winds. Ensuring robustness at such great heights is not just expensive but practically impossible. Once construction engineers understood this difficulty, they relied on the idea of resiliency to build the tall towers that have become the backbone of mission critical networks. They didn’t give up on the idea of tall towers just because they were faced with a mission critical problem. Rather, they figured out a way to make these towers resilient to winds. It is a no-brainer that the construction industry was trained to focus on robustness while building smaller towers and buildings. But when it came to taller towers and buildings (scale), they figured that it makes more sense to focus on resiliency than on robustness. In short, this not only helped save tons of money (economics) but also helped them innovate faster (agility) instead of waiting for technology to improve so that robustness could be achieved at such heights.
The point I am trying to highlight is that it doesn’t make sense to be married to the traditional IT paradigm of robustness. It worked well for legacy applications. However, for applications at scale, it doesn’t make sense to wait for infrastructure at scale that also offers the robustness of the traditional world. I am not saying we can never succeed in building robust infrastructure at scale. I am just arguing that it doesn’t make sense to wait for such an infrastructure when designing applications for resiliency can help any organization innovate faster. I am also not advocating that one should rely on hardware or data centers that fail every other hour. I am only emphasizing the need to make the mental shift, as history has already shown us that one can trust resiliency over robustness for mission critical needs.
Again, I want to emphasize that it is perfectly OK to shop around for service providers who offer some level of robustness through SLAs. Even then, it is important to understand and accept the fact that servers fail, and that designing apps for failure is the right approach for building modern applications. Whether we like it or not, legacy applications are on their way out. The globalized nature of the economy, mobile and social are pushing organizations to move away, slowly but surely, from legacy applications to modern applications. It is important for enterprise IT managers to understand and accept this fact, change their mindset quickly and start embracing “modernity” in their IT. Organizations waiting patiently for robust infrastructure at scale will eventually end up getting disrupted.
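To make “designing apps for failure” a little more concrete, here is a minimal Python sketch of one common pattern: the application assumes any single server can fail, so it tries the next replica and backs off before retrying. The names used here (`call_with_failover`, `AllReplicasFailed`, the `replicas` list) are hypothetical and only for illustration; this is a sketch under those assumptions, not a complete resiliency strategy.

```python
import random
import time
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


class AllReplicasFailed(Exception):
    """Raised when every replica failed in every round (hypothetical name)."""


def call_with_failover(
    replicas: Sequence[Callable[[], T]],
    rounds: int = 3,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Try each replica in turn; back off between rounds; give up after `rounds`.

    Each entry in `replicas` is a zero-argument callable standing in for
    "send this request to server N". A failure of one server is treated as
    a normal event, not as a fatal error.
    """
    last_error: Optional[Exception] = None
    for attempt in range(rounds):
        for replica in replicas:
            try:
                return replica()
            except Exception as exc:  # any failure: move on to the next replica
                last_error = exc
        if attempt < rounds - 1:
            # Exponential backoff with jitter before the next round,
            # so retries do not hammer servers that are recovering.
            sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.0))
    raise AllReplicasFailed(f"no replica answered after {rounds} rounds") from last_error
```

With a list such as `[primary, secondary]`, the caller gets an answer as long as at least one replica responds, and sees `AllReplicasFailed` only when all of them stay down for every round. The failure of a single server becomes “not a problem” rather than something the hardware must never allow.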
Very true. This is a major mind shift required in enterprises as they move from robust to resilient design. Since most enterprises have legacy systems, this mind shift is slow to come. Also, most enterprises look at the cloud as more of a hosting provider than as infinite computing on demand.
Not to be contrarian here, but the enterprise needs to understand the costs associated with resilience. It isn’t that resilient practices are unknown. It is that they are typically too expensive to justify for legacy systems. The new cloud-oriented IT models make resilience more accessible from a business perspective triggering the climb up the learning curve.
Rich, I didn’t advocate resiliency for legacy apps. I am only asking enterprises to think about the idea of resiliency when thinking about web scale. I do agree that the cost factor is a big consideration for legacy systems.
Increased costs have always been key in thwarting innovation and creating bottlenecks, but I do believe it is all linked: from infrastructure providers to resellers and PaaS providers, the end result breaks the bank!
One problem is that resilient scalable software is still somewhat rocket science. Enterprises are used to relying on vendors to provide the rocket science they need. Not only do enterprises need to understand how (and why) to design and build resilient scalable software services, they also need to learn how to adopt resilient ops practices.
One of the reasons these conversations are difficult is that there isn’t a shared vocabulary. You are using ‘resilient’ and ‘robust’ as if people know what they mean. Even with the cell phone story, there is no example here of what is different in the construction strategy between the two. There is a spectrum of architectures and strategies that all make trade-offs between cost, performance and failure characteristics. For the purpose of discussing resilience in IT infrastructure, resilience is not an end state that is achieved; it is a quality that is pursued.
I am not denying that. The goal of the post is to first get over the mental block and start thinking about how resiliency can be achieved. I also agree with your point that resilience is not an end state. Part of this mental shift is understanding that you are dealing with a dynamic system that is bound to fluctuate wildly.
Great blog Krish… a few points we push with our app developers and app owners…
App Owners (people who are responsible for on-premise off-the-shelf apps, old legacy code with no developers left, and SaaS-based apps) – they need to drive a change with the vendors they use to start building for resilience at the software layer. This is often not just about failure; it generally also helps with scaling horizontally and therefore with improving utilization of our capital assets (servers/storage/network/facility). On the other hand, we are starting to see more and more on-premise apps switch to the SaaS model. I think this will accelerate due to the user experience shifts we are seeing with some of the great SaaS apps and the fact that many work across devices.
App Developers (the guys writing new code) – follow specific design rules for cloud awareness, which include design for failure and important concepts like operationalizing everything (log all), stateless compute, and horizontal scaling. Disney did a good writeup as part of the ODCA and we reuse this internally: http://www.opendatacenteralliance.org/docs/DevCloudCapApp.pdf
My dream is that we can get to a PaaS platform that does this automatically for us (including active/active) but I haven’t found one yet.
Das Kamhout
Intel IT Cloud Computing Lead
@dkamhout
Great point Das.