hyperscale.at

Service-orientation-as-a-Service, SOA, PaaS, IaaS, and Economies of Autoscale

Archive

Category: cloud computing

Just got back from International CES. Nice to see some familiar faces and meet many new people! McDevOps makes computers for DevOps. The newest computer we’re working on is called Dynamic-Periphery™. Unsatisfied with the one-to-one constraint of personal computing, we decided that a workstation isn’t a personal computer. One workstation, powered by supercomputers, could be accessed by many tablets. But we couldn’t just use any tablets; we needed dynamic periphery. This means that one user may use several tablets in order to have a more tailored user experience, and be able to send their user experience to another user. Portable user experience is one of the most exciting features of cloud-based virtual desktop infrastructure. So we took it a step further and began designing specialized tablets for use with desktop supercomputing workstations. It just makes more sense in today’s software engineering, video production, and enthusiast gaming environments. After all, if DevOps culture doesn’t constrain what-could-be by what-is, then why should hardware constrain platform service software? We think it’s also better to have a consistent user experience in development and production, and we think a common yet flexible software framework (for example, prototype-friendly structured programming in Dart or open PaaS frameworks like Cloud Foundry and OpenShift) facilitates this efficiency in many software engineering practices.

Dog-fooding the Supercomputer

But honestly, localized computation is only half of the fun of cloud VDI. We really wanted to rock the portable UX over the internet globally. And that’s doable with a McDevOps microcloud account (contact me if you want an invite), whether or not you roll your own microcloud. Microcloud accounts will be free for engineers, developers, designers, and devops culturists… and in general free for anyone looking for work or something to hack on. But it’s not just a SaaS model; it’s a PaaS model from a software perspective. From the hardware perspective it’s a gateway appliance taking you through the pearly gates to supercomputing heaven in the cloud. Desktops are a heavy workload in and of themselves, especially in the aggregate. The problem with all the cloud hype in consumer electronics or “personal cloud” is that they’ve gotten away from cloud computing’s future value. The future value of cloud computing is that it offers scalability. As Dave Nielsen says, cloud computing is OSSM (On Demand, Scalable, Self-serviceable, and Measurable), and I say it’s OSSAM (adding Automation, which is implied in every letter of OSSM)… Consumer electronics manufacturers haven’t really delivered the scalability components, but rather what seems to be an overprovisioned appliance or box. The cloud is not a box, nor a puppet show, but maybe more like a vending machine. Get served.

We might be engineers or developers, but we’re often not a this-or-a-that; we’re often both. And I think in DevOps culture this is the case. I think it’s also the case that a desktop hybrid microcloud can handle heavier video production workloads much better than a beefed-up Mac (request demo), due to parallel elastic provisioning at hyperscale supporting rendering workloads, for example. And that’s just one example because rendering is just one video production workload. And when these guys get bored they hold LAN parties, which work really nicely with a desktop microcloud in your cube farm or wherever.

So think how software engineers play with supercomputers while video producers play 3-D shooters. It’s a competition, but for practical purposes the same infrastructure is used to prove the concept that collaboration is like competition on steroids… especially when you can use the same tools and share the same big data insights.

So at CES this year it really seemed as though cloud either meant wireless or SAN or NAS… but I think cloud storage is a nice low hanging fruit. Cloud persistence is the other benefit of microcloud. It’s a gateway to public utility persistence of files. So it takes the load off your tablets and keeps things locally accessible via ultra high speed bandwidth while it slowly persists remotely in heaven… eventually consistent and redundantly persistent… You can take it with you.

The CLOUD is upon you

In 2011 many were still wondering if the CLOUD really meant anything in terms of technology, dollars, and/or cents. Looking back on 2011 all I can see is a whirl of nebulocity surrounding what-is with what-could-be. Here’s what I think might change significantly in the next 12 months or so:

The CLOUD is real… WHERE’S MINE?

Ok, so we’ve seen people make money off cloud… now I want one. Go build me my own thing that makes money too. Make it look like the King and maybe the King will be forced to buy it… I mean there can’t be 3 kings can there? So now that 2012 is almost here… people are realizing the cloud isn’t just a nebulous swirl of vapor-ware… now let’s start the ASP second chance foundation. Do I need a license for that? I think there will be a lot of opportunities to abstract licenses with SaaS deliveries. Some may exploit the gimmicks that should not have been codified into the licenses in the first place. What comes around goes around, but by now the only ISVs who are likely to be affected by it are the monolithically most comprehensive solution providers who claim they invented everything. Invention by consolidation should be on the rise in 2012, by the way, I’m guessing.

ASP Second Life


Application service providers were right. Applications can often be served better warm, with human love. At minimum viability, a product contains at least one service component. Automation is great, but services contain humans and humans contain human error. Consumers love to cut out the middle-man, but once they’ve made all their man-in-the-middle attacks and all their paper dolls of sliced and diced middle-men they realize that they want service. So they go to http://asherbond.com/contact and ask for technical advice. Anyone who knows Second Life (or other virtual realities) knows that people like to design things and build things themselves. But if you’re going to build a cloud please ask yourself where the economies-of-scale exist. Now that the technology concepts have been proven in business practice many more customers are going to ask for cloud service, but what they’re really actually asking for is people (sometimes via a RESTful API).

The difference between application services and software-as-a-service is abstraction measured by a degree of multi-tenancy.

Compliance-and-Regulatory-Tunneling-and-Channeling-as-a-Service

They thought regulations and compliance “hurdles” created jobs… and they were right… in the short term… but what they might have missed is that it also creates jobs for service providers who can broker emerging technology as a service.

Business-Process-as-a-Service (#BPaaS)

What kinda cloud u talking bout? We got SaaS BPaaS and my personal favorite: GSaaS. GSaaS loves you brother. Now let me show you how to run your business. I expect to hear a lot of “what kinda PaaS” from developers and a lot of ooooo aaaah from business process practitioners… but the process consultants deserve a chance to really shine and this is it. I got my developer card revoked a couple times for saying “Cloud is SOA” but I got a new one from VeriSign and now I think developers are starting to be cool about it now that they realize that OASIS was right and that so was I since I said so too, neh. The first guy who raked my graphic depictions over the campfire did admit however, “yeah ok man.. i guess if you’re talking about REST.” So it turns out predictions in 2010 were accurate. I think service-component architecture and visual programming are going to play a role in RESTful integration as software components are service-oriented. I strongly expect scalability requirements and cloud-readiness motivators to stir the pot. Service-orientation is inevitable when technology is applied. Developers are empowered as decision makers and technical advisors, so maybe they would be interested in subscribing to business-process-as-a-service since they have more of a technical focus.

The most COMPREHENSIVE solution – brought to you by the Federated Association of Governing Consolidators

So what if you’re an investor and you buy and sell technology securities and you want some of that good old fashioned ROI. How can you make any money in this cloud biz now that the developers are taking over? Oh yeah there’s this little thing called the most COMPREHENSIVE solution. Big comprehensive, little solution. That’s right folks. The time is NOW. Buy everything. Your cloud portfolio is about to make it rain, but before you buy everything… you have to know how this stuff works and what it does. Haha just joking… now back to our regularly consolidated program… I think in 2012 we might continue to see enterprisey comprehensive solution providers trying to convince people that they are the box you can put your cloud into… or are they more of a comprehensive solution “cloud” that spans actual clouds with meaningful definitions which exist in actual physical datacenters? Who gives these large enterprisey comprehensive solution providers the authority to do this? The customer lets them get away with it because they sponsor industry events and they are often older companies who played a role in many of the technologies that end up as cloud. They equivocate between distribution models of cloud computing, for example… they might get behind the technology curve doing tons of non-emerging has-been-mature-for-a-decade-or-so SaaS business then pretend they are powering IaaS today on a public scale… when the emerging technologies are PaaS based.

DevOps as more of a cultural paradigm shift and movement and less of a title

People are going to start either killing each other based on their choice of configuration management / automation framework or they are going to start getting along more and not putting DevOps in their title unless it has Engineer at the end of it and Lead in the front of it. Designers are going to be constrained by tighter iterations and Ops are going to punch developers just because they haven’t been punched before and everyone goes through it.

Developer-as-a-customer

In the old days, developers could be divided and conquered by business managers much more easily. The days of developers having a great idea that no one understands are not over… but “I don’t understand how this stuff works” is no longer an excuse now that we have so many services available. If you don’t know how something works… just ask… only now… you don’t even have to ask how to do it, you can ask for service. If you don’t know how something works, that something might be new and valuable. Dustin said it already, but I think public offerers are going to focus more on influencing the decisions of software developers. Software developers represent change in the direction of requirements and demands… not just whatever seems wanted right now… I think developers often try to guess (like Steve Jobs R.I.P.) what people need since they’re probably going to want that eventually. I could probably guess that a pregnant mom is going to be in the market for diapers sooner or later. Hopefully sooner rather than later. Developers are in the early stages from cradle to grave. They iterate through software development and application life cycles and deliver features based on requirements. Those features become part of a common framework that can be offered more publicly. It’s not new, but software vendors love to put developers on their platforms. What’s new is that developers are not-so-divided and not-so-conquered… so they probably demand a higher degree of ubiquity in their distribution channels… so they probably demand a higher degree of interoperability in their language frameworks.

Applications are most portable when the target distribution platform is based on open-standards.

Public Platform-as-a-Service (PaaS) Top Doggery

Not everyone can be King of the Hill, but I think there’s room for a whole circle of winners in the market segment of public PaaS. We have seen 3 generations of public platform service offerings to developers:

Totally Rigidly Arcane PaaS

The first platform services with public offerings forced the developer to conform to a proprietary framework. The back end was a confidential operation delivered as a multi-tenant service to subscribers who learned how to conform to the proprietary framework. The framework may have been based on Python or Java, but constrained the developer to the platform of implementation rather than the standards of the enabling technologies within.

Still-exploiting-the-constraint PaaS

This type of platform is built secretly and operates as a proprietary service, but relies on open-source components to deliver services which are mostly compliant with open-standards. A true language is always an open-standard.

Open PaaS – as it should be

Third generation platform services are completely portable. This type of middle-ware essentially replaces the role of the “operating system” as a software component with “systems-in-operation” instantiated as objects by a framework of classes delivered as a platform of services for developers to build things on top of. The distribution model allows for services to be delivered with scalability, flexibility, interoperability, high availability and the distribution model also allows for platform portability and application interoperability by default. The evolution of service-component architecture (SCA) and visual programming may also influence the adoption of visual programming in the cloud as practical users are abstracted by service and frictionless design becomes the practice.

Next Generation PaaS+

I think of PaaS+ as a value-added platform-as-a-service which may include business processes as a service or may include additional DevOps tooling or methodologies-as-a-service (MaaS?) whatever… The framework (tool) teaches you the process. In a toolcloud you might experience something like a toolbox… for example when you’re using Gmail, you realize that Gmail is a Google approach to email… it’s not just an “email program” … so you get some agility along with the nebulocity of the cloudy SaaSfulness. So I think that the next generation PaaS+ will need to put their pluses on by adding some kind of business or other practical high level value. Some of this high level value can be delivered in the form of integration. CloudBees has moved forward with their initiative to add continuous integration via Jenkins/Hudson integrated service components in their PaaS offering. I think DevOps toolclouds will emerge via the PaaS delivery model and that, like CloudBees, other cloud service providers who have a PaaS offering may choose to offer a chocolate or strawberry new flavor of PaaS for Dev and possibly a vanilla PaaS for their long term support in production interoperability and highly available portability PaaSes. I guess Leeloo Dallas could call that one a multi-PaaS just in time to kiss Korben and save the world before New Year’s.

Predictive Monitoring and SLAs

Predictive monitoring tools will leverage Hadoop and other big data / analytics. The abstraction of data itself may become an abstract business-process-as-a-service and drive innovation in system performance as SLAs are enforced and predictive deep monitoring tools allow autonomous and dynamic autoscaling of instances in resource pools.
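Here’s a minimal sketch of what predictive autoscaling could look like, assuming a plain linear trend stands in for the heavier analytics mentioned above. The function names, the sample numbers, and the 70% target utilization are hypothetical and only there for illustration.

```python
import math


def forecast_next(samples):
    """Extrapolate one step ahead from equally spaced load samples.

    Uses an ordinary least-squares line; a real predictive monitor would
    use far richer models (assumption: a linear trend is good enough here).
    """
    n = len(samples)
    if n == 0:
        raise ValueError("need at least one sample")
    if n == 1:
        return samples[0]
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    var = sum((x - mean_x) ** 2 for x in range(n))
    slope = cov / var
    # The next sample sits at x = n.
    return mean_y + slope * (n - mean_x)


def instances_needed(predicted_load, capacity_per_instance, target_utilization=0.7):
    """Instances required to serve the predicted load while staying under
    the target utilization (a stand-in for an SLA headroom rule)."""
    usable = capacity_per_instance * target_utilization
    return max(1, math.ceil(predicted_load / usable))


if __name__ == "__main__":
    recent_requests_per_sec = [100, 200, 300, 400]  # made-up samples
    predicted = forecast_next(recent_requests_per_sec)
    print(predicted, instances_needed(predicted, capacity_per_instance=100))
```

The point is that the scaling decision is made on the predicted load rather than the current one, so instances are already in the resource pool when the demand shows up.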

Resource Pool Expansion and Utility Computing Commoditization

I think the price of public cloud will start to look like a true utility and come down quite a bit. Companies like Amazon Web Services probably would lower their prices if the demand weren’t way too high. When more IaaS vendors such as Rackspace, OpSource, Datapipe, et al. enter the space (they’re already here) and start to compete for customers, the price of raw x86 compatible IaaS should come down quite a bit and make people re-think their hybrid strategies. For now, many organizations may benefit from a flexible hybrid cloud strategy that (for example) may leverage their existing infrastructure to orchestrate public cloud services.

Security implications of Cloud Computing

Cloud computing lowers the barriers to entry by people who ordinarily could not access high performance clusters of nodes to do complex brute-force math research on your “encrypted” password… or just fire up an array of nodes and aim it at the ssh port. Nothing they couldn’t do in the old days of dark matter / botnet clouds. What IP address did that come from? A leased one in a classy datacenter. I think public cloud providers are going to become very security-savvy (actually they really are top notch in most cases). It will be interesting to see how they empower themselves from the big data + hypervisor perspective.

Rinse that CLOUD out ‘cha mouth boy!

At some point… analysts are saying that there is a “hype cycle” in which cloud word sentiment shall become stale. The word cloud will either become ultra-ubiquitous like industry insiders are saying… or it may become a bit blasé… numb from the excessive nebulocity of smoke and mirrors becoming clouds too. I think if we can refrain from partying too hard it might help. Happy New Year’s Eve. Be responsible and make backups.

Different working-groups have defined and re-defined cloud computing over the last few years. Peter Mell, Timothy Grance, Murugiah Souppaya, Lee Badger and other brilliant minds working together with the NIST (National Institute of Standards and Technology) have drafted a document characterizing cloud computing as:

On-demand self-service

A consumer can unilaterally provision computing capabilities, such as
server time and network storage, as needed automatically without requiring human
interaction with each service’s provider.

Broad network access

Capabilities are available over the network and accessed through standard
mechanisms that promote use by heterogeneous thin or thick client platforms (e.g.,
mobile phones, laptops, and PDAs).

Resource pooling

The provider’s computing resources are pooled to serve multiple consumers
using a multi-tenant model, with different physical and virtual resources dynamically
assigned and reassigned according to consumer demand. There is a sense of location
independence in that the customer generally has no control or knowledge over the exact
location of the provided resources but may be able to specify location at a higher level of
abstraction (e.g., country, state, or datacenter). Examples of resources include storage,
processing, memory, network bandwidth, and virtual machines.

Rapid elasticity

Capabilities can be rapidly and elastically provisioned, in some cases
automatically, to quickly scale out, and rapidly released to quickly scale in. To the
consumer, the capabilities available for provisioning often appear to be unlimited and can
be purchased in any quantity at any time.

Measured Service

Cloud systems automatically control and optimize resource use by leveraging
a metering capability at some level of abstraction appropriate to the type of service (e.g.,
storage, processing, bandwidth, and active user accounts). Resource usage can be
monitored, controlled, and reported, providing transparency for both the provider and
consumer of the utilized service.

After attending a few CloudCamp events I had the privilege of discussing what cloud computing is with Dave Nielsen. Here is the OSSM (pronounced awesome) CloudCamp definition of cloud computing Dave Nielsen has been presenting:

On-demand: the server is already set up and ready to be deployed
Self-service: the customer chooses what they want, when they want it
Scalable: the customer can choose how much they want and ramp up if necessary
Measurable: there’s metering/reporting so you know you are getting what you pay for

While I really couldn’t dispute the awesomeness of Dave’s definition, I challenged it. I just felt the need to add automation to the definition. Here is the OSSAM definition I came up with:

On-demand

Architecture is implemented by an operating framework that allows for rapid elasticity. This framework determines which hardware and software resources are required to meet a range of service-level agreements and subscriber (i.e. customer) expectations.

Scalable

Scalability is achieved on the back end via tight and loose coupling of hardware resources, orchestrated to meet the changing demands of different use cases. A grid may provide the computational and storage resources or a network of edge caching servers may provide content distribution. Many public cloud providers offer both. On the front end, virtual and paravirtual machinery provides subscriber-facing service nodes powered by an elastic hardware and network infrastructure layer (resource pool) of computational nodes and storage area networks on the back end. Vertical scaling is limited to the capacity of one piece of today’s best hardware, but cloud scalability means that arrays of nodes can be offered as a service layer or unit… which provides horizontal scalability, rapidly on demand.

Self-service

A multi-tenant framework must exist to provide at least two tiers of service layers. The underlying infrastructure tier represents technical operations and the top tier(s) represent one or more abstracted service layer(s) … oriented to providing services specific to an applied scope of operations (for example, a business use case for a particular department or organizational unit). Self-service also means that you have the ability to manage your own service layer(s) if that is how you, the subscriber, decide to provision your resources. For example, if someone is a systems administrator he or she may decide to provision a computer in the cloud with or without a managed operating system or with or without the management of a software library layer. Commodity virtual-machinery which is unmanaged is an example of this, but certainly a fully managed virtual machine could fall under the category of ‘self-service’ if the subscriber tells the API to provide them a managed virtual machine. This leads to my bastardization of the CloudCamp definition…

Automated

There must be some degree of automation for cloud computing to be a true “vending machine” and more than just a puppet show. When an order is placed, human resources, engine-squirrels, or monkeys must not be employed to carry out the provisioning of services… no matter how rapidly they may be able to provide services. The system must automatically, via an API, be able to mechanically provide services within the range of some kind of service-level agreement. While it’s arguable that this is part of “on-demand” services, I think it’s worth making a distinction. The importance is that on a large scale, on-demand services can’t exist without automation. Fail-over at the hardware level, for example, may not be required for cloud computing to be defined… but fail-over is a crucial piece of the puzzle if storage and compute nodes within a cluster are to provide reliable and sustainable support for complex layers of virtualization.
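To make the vending machine idea a bit more tangible, here’s a toy sketch of an order being fulfilled with no humans (or squirrels) in the loop. Everything in it is hypothetical: the `ResourcePool` class, its methods, and the `managed` flag are made up for illustration and don’t correspond to any real provider’s API.

```python
from dataclasses import dataclass, field


@dataclass
class ResourcePool:
    """A toy resource pool that fulfils orders mechanically."""
    free_nodes: int
    allocations: dict = field(default_factory=dict)

    def provision(self, subscriber, nodes, managed=False):
        """Fulfil an order automatically, or refuse it if capacity is short."""
        if nodes <= 0:
            raise ValueError("an order must request at least one node")
        if nodes > self.free_nodes:
            # Running out is a signal for operators to add hardware,
            # not a cue for someone to provision by hand.
            raise RuntimeError("insufficient capacity in the pool")
        self.free_nodes -= nodes
        order = self.allocations.setdefault(subscriber, {"nodes": 0, "managed": managed})
        order["nodes"] += nodes
        order["managed"] = managed
        return order

    def release(self, subscriber):
        """Return a subscriber's nodes to the pool (rapid scale-in)."""
        order = self.allocations.pop(subscriber, None)
        if order:
            self.free_nodes += order["nodes"]


if __name__ == "__main__":
    pool = ResourcePool(free_nodes=10)
    print(pool.provision("sysadmin", 4, managed=True))
    pool.release("sysadmin")
    print(pool.free_nodes)
```

The `managed` flag is where self-service meets automation: the subscriber decides whether they want a managed or unmanaged machine, and the API carries it out either way.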

Measurable

Metering and reporting are important not only for billing in public cloud service implementations, but also in private clouds which service enterprise departments. Measurement provides a quantitative analysis of resource utilization and allows for more efficient use of computational resources. On the “front-end” metering tells subscribers how many resources they are consuming and on the “back-end” measurement should also tell infrastructure operators when to add hardware resources to the computational / storage grid. With proper fail-over, automation and orchestration these resources should be highly available and a monitoring system should measure that availability.
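A rough sketch of both sides of measurement, assuming usage arrives as simple (subscriber, node-hours) records. The record format, the price in cents, and the 80% threshold are assumptions for illustration, not anyone’s real billing model.

```python
from collections import defaultdict


def meter(records):
    """Front end: total node-hours consumed per subscriber."""
    totals = defaultdict(int)
    for subscriber, node_hours in records:
        totals[subscriber] += node_hours
    return dict(totals)


def invoice_cents(totals, cents_per_node_hour):
    """Turn metered usage into a bill (integer cents avoid rounding surprises)."""
    return {sub: hours * cents_per_node_hour for sub, hours in totals.items()}


def should_add_hardware(totals, grid_capacity_node_hours, threshold=0.8):
    """Back end: tell operators when aggregate usage nears grid capacity."""
    return sum(totals.values()) >= threshold * grid_capacity_node_hours


if __name__ == "__main__":
    usage = [("dept-a", 10), ("dept-b", 5), ("dept-a", 3)]  # made-up records
    totals = meter(usage)
    print(totals)
    print(invoice_cents(totals, cents_per_node_hour=12))
    print(should_add_hardware(totals, grid_capacity_node_hours=20))
```

The same metered totals feed both audiences: subscribers see what they consumed, and operators see how close the grid is to needing more hardware.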

Some might refer to today as a DevOps Day… and to those who haven’t figured out their failover strategy, today might seem like the day the cloud stood still. But if you’re familiar with Internet service at large, you’ve seen it before. Network events persist, whether it be in the datacenter or in the Cloud… a sad hardship we face on shared networks such as the Internet. Remember that infrastructure services such as EC2 and DBMS services such as RDS are merely service layers on top of a data-center. Are you afraid of Cloud or data-center? Fear not, but perhaps the biggest “cloud” is the dark one powered by those who allow their computers to be compromised. If a denial-of-service attack is distributed, a provider-of-service defender should work just as hard to distribute his or her eggs… well… I guess her eggs in multiple baskets. Failover is a difficult concept for many applications, out of the box, because it requires a great deal of redundancy and synchronization. The database is perhaps the most difficult piece of the puzzle to distribute… especially if it is a relational database. Master -> Slave replication is one way to achieve not only multi-tiered horizontal scalability on demand, but also multi-regional redundancy. Take a look at the reference architecture just announced as part of a Rightscale + Zend horizontal scalability solution:

The separation of static content from dynamic content is a concept that will lead to higher efficiency and higher availability in any Cloud environment. Backups from master to slave databases may seem expensive across availability zones, but perhaps, after today, they are less expensive than we once thought.

Now let’s think about Content Distribution Networks. Static content can be cached at the edge which provides the most availability to your end users. When people think of CDN availability, they might assume “closest geographical region to the end user”… but what if your CDN was smart enough to weigh latency and system load as metrics in the load balancing determination algorithms? Do we have that? Yeah. Skeptics blame AWS / EC2 for today’s hardships, but perhaps some should be thanking them for edge-caching static content worldwide. It’s a saving grace for those who have their eggs scattered amongst 18 geographic regions.
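I can’t show how any particular CDN weighs these metrics, but a toy version of the idea might look like the sketch below. The edge names, the weights, and the scoring formula are all assumptions made up for illustration.

```python
def pick_edge(edges, latency_weight=0.5, load_weight=0.5):
    """Pick the edge location with the lowest combined score.

    edges maps a name to (latency_ms, load), where load is a 0..1 fraction.
    Latency is normalized against the slowest edge so both metrics share
    a comparable scale before weighting.
    """
    if not edges:
        raise ValueError("no edge locations to choose from")
    max_latency = max(latency for latency, _ in edges.values()) or 1

    def score(item):
        latency, load = item[1]
        return latency_weight * (latency / max_latency) + load_weight * load

    return min(edges.items(), key=score)[0]


if __name__ == "__main__":
    edges = {
        "nearby-but-slammed": (20, 0.95),   # close to the user, nearly saturated
        "farther-but-idle": (80, 0.10),     # farther away, mostly idle
    }
    print(pick_edge(edges))
    print(pick_edge(edges, latency_weight=1.0, load_weight=0.0))
```

With equal weights the idle edge wins even though it is farther away; weighting latency alone falls back to the usual closest-region behavior.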

For static content, content distribution networks often have multi-region high availability built in out-of-the-box. It’s a lot easier when dealing with static content, but with some systems architecture and database management expertise, the same caching principles can also be applied to maximize reliable delivery of both static and dynamic content.

If an application provider or platform service provisioner can separate static content from persistent data and also separate important data from not-so-important temporary / session data and deliver these types of data and content with discardable instances… fail-over can be achieved and even automated by replicating data across providers (or at least across cloud regions / availability zones). Once static content, persistent data, and temporal data have been sorted out… a redundant, meshed / multi-homed front-end server-array tier can determine (based on monitoring and availability metrics) which cloud / data-center / availability zone to distribute static and dynamic content from.
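Here’s a minimal sketch of how a front-end tier might make that determination, assuming each zone reports a health flag and an error rate. The zone names, the metric fields, and the `choose_zone` function are hypothetical.

```python
def choose_zone(zones, preferred):
    """Serve from the preferred zone while it is healthy; otherwise fail
    over to the healthy zone with the lowest error rate.

    zones maps a zone name to {"healthy": bool, "error_rate": float}.
    """
    healthy = {name: z for name, z in zones.items() if z["healthy"]}
    if not healthy:
        raise RuntimeError("no healthy zone available")
    if preferred in healthy:
        return preferred
    return min(healthy, key=lambda name: healthy[name]["error_rate"])


if __name__ == "__main__":
    # Illustrative monitoring snapshot, not real data.
    zones = {
        "east-a": {"healthy": False, "error_rate": 0.40},
        "east-b": {"healthy": True, "error_rate": 0.02},
        "west-a": {"healthy": True, "error_rate": 0.01},
    }
    print(choose_zone(zones, preferred="east-a"))
```

This only works if the instances behind each zone are discardable and the persistent data has already been replicated there, which is why the sorting of static, persistent, and temporal data has to come first.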

I think this type of architecture can be justified not only for fail-over reasons, but perhaps it can also be a way to achieve more rapidly elastic, impressive server performance.

When N. Virginia gets hit hard, it may be quite a hardship, but it shouldn’t be too hard to fail over to your other region’s slave database. If Soichiro Honda is going to tell me that success is 99% failure, then in the case of distributed, edge cached, redundant web systems architecture… perhaps success is 99% failover.

But don’t just go throwing shuriken at network-event coordinators unless your star has more than just these two points. I think a nice third point to sharpen and cut to is the reliability of monitoring systems. It’s good to be monitoring your auto-scaling processes if you’re in a situation where you scale on demand… and you also want to monitor who is demanding the computing resources. Ideally, you’re getting alarmed before your end users are. Reflexive firewalls are a good way to go, but just having good reflexes is part of wearing the agile cat’s hat in general. If you have a fast way to report trouble to the authorities charged with ownership of a compromised node attacking your system, you’re part of the solution and get a gold star.

Conversely, unnecessary reflexive post-mortem backups en masse may have been a somewhat panicked response to the network event and a contribution to the length of this outage.
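To make the monitoring point concrete, here’s a small sketch of two checks: alarming before latency becomes visible to end users, and seeing who is demanding the resources. The thresholds, the function names, and the sample addresses (taken from documentation-only ranges) are assumptions for illustration.

```python
from collections import Counter


def should_alarm(latencies_ms, user_visible_ms, headroom=0.8):
    """Alarm once the worst recent latency crosses a fraction of the level
    at which end users would start to notice."""
    return max(latencies_ms) >= headroom * user_visible_ms


def top_requesters(request_sources, limit=3):
    """Count requests per source so a sudden heavy hitter stands out."""
    return Counter(request_sources).most_common(limit)


if __name__ == "__main__":
    print(should_alarm([120, 450, 300], user_visible_ms=500))
    sources = ["192.0.2.10", "198.51.100.7", "192.0.2.10", "192.0.2.10"]
    print(top_requesters(sources, limit=1))
```

If the heavy hitter turns out to be a compromised node, that list is also what you hand to whoever owns it.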

Amazon Web Services has done an excellent job (as always) of not only describing what happened and when service is expected to be restored, but what you can do to maximize availability if your service has been adversely affected by the outage. You can access their status updates via RSS feeds directly from the AWS Service Health Dashboard at status.aws.amazon.com.

Here’s a copy of what AWS is saying about EC2 services in the N. Virginia region:

1:41 AM PDT We are currently investigating latency and error rates with EBS volumes and connectivity issues reaching EC2 instances in the US-EAST-1 region.
2:18 AM PDT We can confirm connectivity errors impacting EC2 instances and increased latencies impacting EBS volumes in multiple availability zones in the US-EAST-1 region. Increased error rates are affecting EBS CreateVolume API calls. We continue to work towards resolution.
2:49 AM PDT We are continuing to see connectivity errors impacting EC2 instances, increased latencies impacting EBS volumes in multiple availability zones in the US-EAST-1 region, and increased error rates affecting EBS CreateVolume API calls. We are also experiencing delayed launches for EBS backed EC2 instances in affected availability zones in the US-EAST-1 region. We continue to work towards resolution.
3:20 AM PDT Delayed EC2 instance launches and EBS API error rates are recovering. We’re continuing to work towards full resolution.
4:09 AM PDT EBS volume latency and API errors have recovered in one of the two impacted Availability Zones in US-EAST-1. We are continuing to work to resolve the issues in the second impacted Availability Zone. The errors, which started at 12:55AM PDT, began recovering at 2:55am PDT
5:02 AM PDT Latency has recovered for a portion of the impacted EBS volumes. We are continuing to work to resolve the remaining issues with EBS volume latency and error rates in a single Availability Zone.
6:09 AM PDT EBS API errors and volume latencies in the affected availability zone remain. We are continuing to work towards resolution.
6:59 AM PDT There has been a moderate increase in error rates for CreateVolume. This may impact the launch of new EBS-backed EC2 instances in multiple availability zones in the US-EAST-1 region. Launches of instance store AMIs are currently unaffected. We are continuing to work on resolving this issue.
7:40 AM PDT In addition to the EBS volume latencies, EBS-backed instances in the US-EAST-1 region are failing at a high rate. This is due to a high error rate for creating new volumes in this region.
8:54 AM PDT We’d like to provide additional color on what were working on right now (please note that we always know more and understand issues better after we fully recover and dive deep into the post mortem). A networking event early this morning triggered a large amount of re-mirroring of EBS volumes in US-EAST-1. This re-mirroring created a shortage of capacity in one of the US-EAST-1 Availability Zones, which impacted new EBS volume creation as well as the pace with which we could re-mirror and recover affected EBS volumes. Additionally, one of our internal control planes for EBS has become inundated such that it’s difficult to create new EBS volumes and EBS backed instances. We are working as quickly as possible to add capacity to that one Availability Zone to speed up the re-mirroring, and working to restore the control plane issue. We’re starting to see progress on these efforts, but are not there yet. We will continue to provide updates when we have them.
10:26 AM PDT We have made significant progress in stabilizing the affected EBS control plane service. EC2 API calls that do not involve EBS resources in the affected Availability Zone are now seeing significantly reduced failures and latency and are continuing to recover. We have also brought additional capacity online in the affected Availability Zone and stuck EBS volumes (those that were being remirrored) are beginning to recover. We cannot yet estimate when these volumes will be completely recovered, but we will provide an estimate as soon as we have sufficient data to estimate the recovery. We have all available resources working to restore full service functionality as soon as possible. We will continue to provide updates when we have them.
11:09 AM PDT A number of people have asked us for an ETA on when we’ll be fully recovered. We deeply understand why this is important and promise to share this information as soon as we have an estimate that we believe is close to accurate. Our high-level ballpark right now is that the ETA is a few hours. We can assure you that all-hands are on deck to recover as quickly as possible. We will update the community as we have more information.
12:30 PM PDT We have observed successful new launches of EBS backed instances for the past 15 minutes in all but one of the availability zones in the US-EAST-1 Region. The team is continuing to work to recover the unavailable EBS volumes as quickly as possible.
1:48 PM PDT A single Availability Zone in the US-EAST-1 Region continues to experience problems launching EBS backed instances or creating volumes. All other Availability Zones are operating normally. Customers with snapshots of their affected volumes can re-launch their volumes and instances in another zone. We recommend customers do not target a specific Availability Zone when launching instances. We have updated our service to avoid placing any instances in the impaired zone for untargeted requests.
6:18 PM PDT Earlier today we shared our high level ETA for a full recovery. At this point, all Availability Zones except one have been functioning normally for the past 5 hours. We have stabilized the remaining Availability Zone, but recovery is taking longer than we originally expected. We have been working hard to add the capacity that will enable us to safely re-mirror the stuck volumes. We expect to incrementally recover stuck volumes over the coming hours, but believe it will likely be several more hours until a significant number of volumes fully recover and customers are able to create new EBS-backed instances in the affected Availability Zone. We will be providing more information here as soon as we have it.

Here are a couple of things that customers can do in the short term to work around these problems. Customers having problems contacting EC2 instances or with instances stuck shutting down/stopping can launch a replacement instance without targeting a specific Availability Zone. If you have EBS volumes stuck detaching/attaching and have taken snapshots, you can create new volumes from snapshots in one of the other Availability Zones. Customers with instances and/or volumes that appear to be unavailable should not try to recover them by rebooting, stopping, or detaching, as these actions will not currently work on resources in the affected zone.

10:58 PM PDT Just a short note to let you know that the team continues to be all-hands on deck trying to add capacity to the affected Availability Zone to re-mirror stuck volumes. It’s taking us longer than we anticipated to add capacity to this fleet. When we have an updated ETA or meaningful new update, we will make sure to post it here. But, we can assure you that the team is working this hard and will do so as long as it takes to get this resolved.

Notice that the ENTIRE CLOUD has certainly not collapsed. Amazon is giving you a way to spin up instances in the other Availability Zones, which are operating as usual. Those zones are not affected by this outage and can serve as failover targets, provided you have properly implemented a redundant server architecture.
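Here's a minimal sketch of that failover idea, following Amazon's advice above to restore from snapshots into another zone. It only prints the command it would run. The function names, the snapshot ID, and the zone list are hypothetical, and it assumes you use the current AWS CLI (`aws ec2 create-volume`) rather than whatever tooling you had on hand during the outage.

```shell
#!/bin/bash
# pick_healthy_zone IMPAIRED ZONE...
# Prints the first candidate zone that is not the impaired one.
# (Hypothetical helper; the zone names are supplied by the caller.)
pick_healthy_zone() {
    local impaired="$1"
    shift
    local zone
    for zone in "$@"; do
        if [ "$zone" != "$impaired" ]; then
            echo "$zone"
            return 0
        fi
    done
    return 1   # no candidate zone other than the impaired one
}

# restore_cmd SNAPSHOT_ID IMPAIRED ZONE...
# Prints (does not run) an AWS CLI command that would create a new
# volume from the snapshot in the first healthy zone.
restore_cmd() {
    local snapshot="$1" impaired="$2"
    shift 2
    local zone
    zone=$(pick_healthy_zone "$impaired" "$@") || return 1
    echo "aws ec2 create-volume --snapshot-id $snapshot --availability-zone $zone"
}
```

For example, `restore_cmd snap-example us-east-1a us-east-1a us-east-1b us-east-1c` prints a command targeting us-east-1b. Printing instead of executing keeps a human in the loop, which seems wise while a control plane is struggling.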

Here’s a copy of what Amazon Web Services is saying about RDS services in the N. Virginia Region [ RSS ]:

1:48 AM PDT We are currently investigating connectivity and latency issues with RDS database instances in the US-EAST-1 region.
2:16 AM PDT We can confirm connectivity issues impacting RDS database instances across multiple availability zones in the US-EAST-1 region.
3:05 AM PDT We are continuing to see connectivity issues impacting some RDS database instances in multiple availability zones in the US-EAST-1 region. Some Multi AZ failovers are taking longer than expected. We continue to work towards resolution.
4:03 AM PDT We are making progress on failovers for Multi AZ instances and restore access to them. This event is also impacting RDS instance creation times in a single Availability Zone. We continue to work towards the resolution.
5:06 AM PDT IO latency issues have recovered in one of the two impacted Availability Zones in US-EAST-1. We continue to make progress on restoring access and resolving IO latency issues for remaining affected RDS database instances.
6:29 AM PDT We continue to work on restoring access to the affected Multi AZ instances and resolving the IO latency issues impacting RDS instances in the single availability zone.
8:12 AM PDT Despite the continued effort from the team to resolve the issue we have not made any meaningful progress for the affected database instances since the last update. Create and Restore requests for RDS database instances are not succeeding in US-EAST-1 region.
10:35 AM PDT We are making progress on restoring access and IO latencies for affected RDS instances. We recommend that you do not attempt to recover using Reboot or Restore database instance APIs or try to create a new user snapshot for your RDS instance – currently those requests are not being processed.
2:35 PM PDT We have restored access to the majority of RDS Multi AZ instances and continue to work on the remaining affected instances. A single Availability Zone in the US-EAST-1 region continues to experience problems for launching new RDS database instances. All other Availability Zones are operating normally. Customers with snapshots/backups of their instances in the affected Availability zone can restore them into another zone. We recommend that customers do not target a specific Availability Zone when creating or restoring new RDS database instances. We have updated our service to avoid placing any RDS instances in the impaired zone for untargeted requests.

11:42 PM PDT In line with the most recent Amazon EC2 update, we wanted to let you know that the team continues to be all-hands on deck working on the remaining database instances in the single affected Availability Zone. It’s taking us longer than we anticipated. When we have an updated ETA or meaningful new update, we will make sure to post it here. But, we can assure you that the team is working this hard and will do so as long as it takes to get this resolved.

These updates are not direct from Amazon, but merely a copy, so please subscribe to the Amazon Service Health Dashboard for more freshly updated information regarding their service (which I still insist is high quality).

- Asher Bond
It’s a long way down if your head is in the CLOUD.

I think in the early days of the commercial Internet, Cloud referred to telecommunications infrastructure that you subscribed to or didn’t know about or care about. Service-orientation is inevitable when technology is applied, because people want service when they have a different focus. Infrastructure-as-a-service (IaaS), Platform-as-a-service (PaaS), and various other technologies-as-services describe somewhat specific cloud architectures and frameworks which can be delivered with participation-in-technical-details-on-demand service level agreements. Cloud Computing (IaaS and arguably PaaS) requires automation and self-provisioning to really be actual cloud computing… and there must be a compute service running. Self-provisioning is generally implemented in a way that lets IT managers/brokers delegate provisioning authority to a department or organizational unit. Once the organizational unit can subscribe to the public or private cloud service, it can provision resources on demand and pay (or not pay) for consumed resources. Without automation or self-provisioning… it’s just not cloud computing. It might be some other kind of cloud… like software-as-a-service… but even software-as-a-service requires an autonomy component of some kind. For example, when you use Gmail… you’re using software-as-a-service. Gmail has automated the delivery of your email. SaaS runs servers in the background and guess what? Servers automate processes. But to really speak with integrity about cloud computing, I think it’s important to know that what we’re talking about is automated, self-provisioning systems that allow people to subscribe to infrastructure-as-a-service on demand. It’s not a puppet show, but more of a vending machine.

There are more characteristics of cloud computing than just automation and autonomy or other methods of self-provisioning. Cloud computing exposes APIs that allow subscribers to access computational resources on demand which serve as an abstraction layer between aggregated (and possibly also distributed) hardware and virtual (or paravirtual) machinery. A proudly provisioned Cloud usually boasts some kind of synergistic automation, monitoring, distribution, and compute aggregation at every service layer. The customer experience should be a burst-friendly, high availability, easy to use, “just-works” success story. Nobody likes when the vending machine takes money then gets jammed or you find out that the people inside your TV and radio are really there looking back at you and listening to all your secrets. So again, the Cloud isn’t a puppet show, but more of a vending machine. Get Served.

Summary

11,969 HTTP requests per second, at a mean of 84 microseconds per request across 100 concurrent connections? Yeah. Here’s what happened:


root@ip-10-161-82-11:/var/www/nginx-default# ab -n 1000000 -c100 http://localhost:80/
This is ApacheBench, Version 2.3 <$Revision: 655654 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking localhost (be patient)
Completed 100000 requests
Completed 200000 requests
Completed 300000 requests
Completed 400000 requests
Completed 500000 requests
Completed 600000 requests
Completed 700000 requests
Completed 800000 requests
Completed 900000 requests
Completed 1000000 requests
Finished 1000000 requests

Server Software:        nginx/0.7.65
Server Hostname:        localhost
Server Port:            80

Document Path:          /
Document Length:        34989 bytes

Concurrency Level:      100
Time taken for tests:   83.544 seconds
Complete requests:      1000000
Failed requests:        0
Write errors:           0
Total transferred:      35202867880 bytes
HTML transferred:       34989862555 bytes
Requests per second:    11969.72 [#/sec] (mean)
Time per request:       8.354 [ms] (mean)
Time per request:       0.084 [ms] (mean, across all concurrent requests)
Transfer rate:          411492.58 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    2   0.2      2       5
Processing:     2    7   0.7      6      15
Waiting:        1    2   0.5      2      12
Total:          5    8   0.7      8      17
WARNING: The median and mean for the processing time are not within a normal deviation
        These results are probably not that reliable.

Percentage of the requests served within a certain time (ms)
  50%      8
  66%      9
  75%      9
  80%      9
  90%      9
  95%      9
  98%      9
  99%     10
 100%     17 (longest request)
root@ip-10-161-82-11:/var/www/nginx-default# w
 08:28:26 up 25 min,  1 user,  load average: 0.63, 0.23, 0.08
USER     TTY      FROM              LOGIN@   IDLE   JCPU   PCPU WHAT
root     pts/0    c-69-181-58-125. 08:21    0.00s  0.01s  0.00s w
root@ip-10-161-82-11:/var/www/nginx-default# cat /proc/cpuinfo
processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 23
model name	: Intel(R) Xeon(R) CPU           E5430  @ 2.66GHz
stepping	: 10
cpu MHz		: 2659.998
cache size	: 6144 KB
physical id	: 0
siblings	: 1
core id		: 0
cpu cores	: 1
apicid		: 0
initial apicid	: 3
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu tsc msr pae mce cx8 apic mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 xsave lahf_lm tpr_shadow vnmi flexpriority
bogomips	: 5322.20
clflush size	: 64
cache_alignment	: 64
address sizes	: 38 bits physical, 48 bits virtual
power management:

processor	: 1
vendor_id	: GenuineIntel
cpu family	: 6
model		: 23
model name	: Intel(R) Xeon(R) CPU           E5430  @ 2.66GHz
stepping	: 10
cpu MHz		: 2659.998
cache size	: 6144 KB
physical id	: 1
siblings	: 1
core id		: 0
cpu cores	: 1
apicid		: 1
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu tsc msr pae mce cx8 apic mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 xsave lahf_lm tpr_shadow vnmi flexpriority
bogomips	: 5322.20
clflush size	: 64
cache_alignment	: 64
address sizes	: 38 bits physical, 48 bits virtual
power management:

processor	: 2
vendor_id	: GenuineIntel
cpu family	: 6
model		: 23
model name	: Intel(R) Xeon(R) CPU           E5430  @ 2.66GHz
stepping	: 10
cpu MHz		: 2659.998
cache size	: 6144 KB
physical id	: 2
siblings	: 1
core id		: 0
cpu cores	: 1
apicid		: 2
initial apicid	: 2
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu tsc msr pae mce cx8 apic mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 xsave lahf_lm tpr_shadow vnmi flexpriority
bogomips	: 5322.20
clflush size	: 64
cache_alignment	: 64
address sizes	: 38 bits physical, 48 bits virtual
power management:

processor	: 3
vendor_id	: GenuineIntel
cpu family	: 6
model		: 23
model name	: Intel(R) Xeon(R) CPU           E5430  @ 2.66GHz
stepping	: 10
cpu MHz		: 2659.998
cache size	: 6144 KB
physical id	: 3
siblings	: 1
core id		: 0
cpu cores	: 1
apicid		: 3
initial apicid	: 1
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu tsc msr pae mce cx8 apic mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 xsave lahf_lm tpr_shadow vnmi flexpriority
bogomips	: 5322.20
clflush size	: 64
cache_alignment	: 64
address sizes	: 38 bits physical, 48 bits virtual
power management:

root@ip-10-161-82-11:/var/www/nginx-default# cat /proc/meminfo
MemTotal:       15752364 kB
MemFree:        14964352 kB
Buffers:           22708 kB
Cached:           216504 kB
SwapCached:            0 kB
Active:           134052 kB
Inactive:         110996 kB
Active(anon):       6000 kB
Inactive(anon):        0 kB
Active(file):     128052 kB
Inactive(file):   110996 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:             0 kB
SwapFree:              0 kB
Dirty:                36 kB
Writeback:             0 kB
AnonPages:          5860 kB
Mapped:             5052 kB
Shmem:               164 kB
Slab:              28876 kB
SReclaimable:      12480 kB
SUnreclaim:        16396 kB
KernelStack:         872 kB
PageTables:            0 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     7876180 kB
Committed_AS:      47812 kB
VmallocTotal:   34359738367 kB
VmallocUsed:        5988 kB
VmallocChunk:   34359732359 kB
DirectMap4k:    15728640 kB
DirectMap2M:           0 kB
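If you want to check the headline numbers in the ApacheBench report above, they all follow from three inputs: complete requests, time taken, and concurrency level. This is a small sketch, not part of the original run, and `ab_metrics` is a made-up name. It recomputes requests per second and both "time per request" figures, plus the across-all-connections figure in microseconds.

```shell
#!/bin/bash
# ab_metrics REQUESTS SECONDS CONCURRENCY
# Recomputes ApacheBench's summary figures from its raw inputs.
ab_metrics() {
    awk -v n="$1" -v t="$2" -v c="$3" 'BEGIN {
        # throughput: completed requests divided by wall-clock seconds
        printf "requests_per_second %.1f\n", n / t
        # mean latency seen by one connection (concurrency * time / requests)
        printf "ms_per_request %.3f\n", c * t / n * 1000
        # mean time per request across all concurrent connections
        printf "ms_per_request_across_all %.3f\n", t / n * 1000
        # the same figure expressed in microseconds
        printf "us_per_request_across_all %.1f\n", t / n * 1000000
    }'
}
```

Running `ab_metrics 1000000 83.544 100` gives about 11969.7 requests per second, 8.354 ms per request, and 0.084 ms (roughly 83.5 microseconds) across all concurrent requests. The last decimal of the throughput can differ slightly from ab's own report because ab works from an unrounded elapsed time.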

Here’s a larger configuration running the latest stable version of nginx:


root@ip-10-166-162-224:/var/www# cat /proc/cpuinfo
processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 26
model name	: Intel(R) Xeon(R) CPU           X5550  @ 2.67GHz
stepping	: 5
cpu MHz		: 2666.760
cache size	: 8192 KB
physical id	: 0
siblings	: 1
core id		: 0
cpu cores	: 1
apicid		: 0
initial apicid	: 17
fpu		: yes
fpu_exception	: yes
cpuid level	: 11
wp		: yes
flags		: fpu tsc msr pae mce cx8 apic mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nonstop_tsc aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 popcnt lahf_lm tpr_shadow vnmi flexpriority ept vpid
bogomips	: 5335.92
clflush size	: 64
cache_alignment	: 64
address sizes	: 40 bits physical, 48 bits virtual
power management:

processor	: 1
vendor_id	: GenuineIntel
cpu family	: 6
model		: 26
model name	: Intel(R) Xeon(R) CPU           X5550  @ 2.67GHz
stepping	: 5
cpu MHz		: 2666.760
cache size	: 8192 KB
physical id	: 1
siblings	: 1
core id		: 0
cpu cores	: 1
apicid		: 1
initial apicid	: 1
fpu		: yes
fpu_exception	: yes
cpuid level	: 11
wp		: yes
flags		: fpu tsc msr pae mce cx8 apic mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nonstop_tsc aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 popcnt lahf_lm tpr_shadow vnmi flexpriority ept vpid
bogomips	: 5335.92
clflush size	: 64
cache_alignment	: 64
address sizes	: 40 bits physical, 48 bits virtual
power management:

processor	: 2
vendor_id	: GenuineIntel
cpu family	: 6
model		: 26
model name	: Intel(R) Xeon(R) CPU           X5550  @ 2.67GHz
stepping	: 5
cpu MHz		: 2666.760
cache size	: 8192 KB
physical id	: 2
siblings	: 1
core id		: 0
cpu cores	: 1
apicid		: 2
initial apicid	: 1
fpu		: yes
fpu_exception	: yes
cpuid level	: 11
wp		: yes
flags		: fpu tsc msr pae mce cx8 apic mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nonstop_tsc aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 popcnt lahf_lm tpr_shadow vnmi flexpriority ept vpid
bogomips	: 5335.92
clflush size	: 64
cache_alignment	: 64
address sizes	: 40 bits physical, 48 bits virtual
power management:

processor	: 3
vendor_id	: GenuineIntel
cpu family	: 6
model		: 26
model name	: Intel(R) Xeon(R) CPU           X5550  @ 2.67GHz
stepping	: 5
cpu MHz		: 2666.760
cache size	: 8192 KB
physical id	: 3
siblings	: 1
core id		: 0
cpu cores	: 1
apicid		: 3
initial apicid	: 1
fpu		: yes
fpu_exception	: yes
cpuid level	: 11
wp		: yes
flags		: fpu tsc msr pae mce cx8 apic mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nonstop_tsc aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 popcnt lahf_lm tpr_shadow vnmi flexpriority ept vpid
bogomips	: 5335.92
clflush size	: 64
cache_alignment	: 64
address sizes	: 40 bits physical, 48 bits virtual
power management:

processor	: 4
vendor_id	: GenuineIntel
cpu family	: 6
model		: 26
model name	: Intel(R) Xeon(R) CPU           X5550  @ 2.67GHz
stepping	: 5
cpu MHz		: 2666.760
cache size	: 8192 KB
physical id	: 4
siblings	: 1
core id		: 0
cpu cores	: 1
apicid		: 4
initial apicid	: 1
fpu		: yes
fpu_exception	: yes
cpuid level	: 11
wp		: yes
flags		: fpu tsc msr pae mce cx8 apic mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nonstop_tsc aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 popcnt lahf_lm tpr_shadow vnmi flexpriority ept vpid
bogomips	: 5335.92
clflush size	: 64
cache_alignment	: 64
address sizes	: 40 bits physical, 48 bits virtual
power management:

processor	: 5
vendor_id	: GenuineIntel
cpu family	: 6
model		: 26
model name	: Intel(R) Xeon(R) CPU           X5550  @ 2.67GHz
stepping	: 5
cpu MHz		: 2666.760
cache size	: 8192 KB
physical id	: 5
siblings	: 1
core id		: 0
cpu cores	: 1
apicid		: 5
initial apicid	: 1
fpu		: yes
fpu_exception	: yes
cpuid level	: 11
wp		: yes
flags		: fpu tsc msr pae mce cx8 apic mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nonstop_tsc aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 popcnt lahf_lm tpr_shadow vnmi flexpriority ept vpid
bogomips	: 5335.92
clflush size	: 64
cache_alignment	: 64
address sizes	: 40 bits physical, 48 bits virtual
power management:

processor	: 6
vendor_id	: GenuineIntel
cpu family	: 6
model		: 26
model name	: Intel(R) Xeon(R) CPU           X5550  @ 2.67GHz
stepping	: 5
cpu MHz		: 2666.760
cache size	: 8192 KB
physical id	: 6
siblings	: 1
core id		: 0
cpu cores	: 1
apicid		: 6
initial apicid	: 1
fpu		: yes
fpu_exception	: yes
cpuid level	: 11
wp		: yes
flags		: fpu tsc msr pae mce cx8 apic mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nonstop_tsc aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 popcnt lahf_lm tpr_shadow vnmi flexpriority ept vpid
bogomips	: 5335.92
clflush size	: 64
cache_alignment	: 64
address sizes	: 40 bits physical, 48 bits virtual
power management:

processor	: 7
vendor_id	: GenuineIntel
cpu family	: 6
model		: 26
model name	: Intel(R) Xeon(R) CPU           X5550  @ 2.67GHz
stepping	: 5
cpu MHz		: 2666.760
cache size	: 8192 KB
physical id	: 7
siblings	: 1
core id		: 0
cpu cores	: 1
apicid		: 7
initial apicid	: 1
fpu		: yes
fpu_exception	: yes
cpuid level	: 11
wp		: yes
flags		: fpu tsc msr pae mce cx8 apic mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nonstop_tsc aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 popcnt lahf_lm tpr_shadow vnmi flexpriority ept vpid
bogomips	: 5335.92
clflush size	: 64
cache_alignment	: 64
address sizes	: 40 bits physical, 48 bits virtual
power management:

root@ip-10-166-162-224:/var/www# cat /proc/meminfo
MemTotal:       71700024 kB
MemFree:        69213656 kB
Buffers:            9736 kB
Cached:           214992 kB
SwapCached:            0 kB
Active:           116788 kB
Inactive:         116540 kB
Active(anon):       8628 kB
Inactive(anon):      152 kB
Active(file):     108160 kB
Inactive(file):   116388 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:             0 kB
SwapFree:              0 kB
Dirty:                 0 kB
Writeback:             0 kB
AnonPages:          8628 kB
Mapped:             5756 kB
Shmem:               172 kB
Slab:              31148 kB
SReclaimable:      21044 kB
SUnreclaim:        10104 kB
KernelStack:        1480 kB
PageTables:            0 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    35850012 kB
Committed_AS:      62120 kB
VmallocTotal:   34359738367 kB
VmallocUsed:        6100 kB
VmallocChunk:   34359732247 kB
DirectMap4k:    71680000 kB
DirectMap2M:           0 kB

root@ip-10-166-162-224:/var/www# ls -la index.html
-rw-r--r-- 1 root root 281180 2010-10-16 09:02 index.html

root@ip-10-166-162-224:/var/www# ab -n 1000000 -c100 http://localhost:80/
This is ApacheBench, Version 2.3 <$Revision: 655654 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking localhost (be patient)
Completed 100000 requests
Completed 200000 requests
Completed 300000 requests
Completed 400000 requests
Completed 500000 requests
Completed 600000 requests
Completed 700000 requests
Completed 800000 requests
Completed 900000 requests
Completed 1000000 requests
Finished 1000000 requests

Server Software:        nginx/0.8.52
Server Hostname:        localhost
Server Port:            80

Document Path:          /
Document Length:        281180 bytes

Concurrency Level:      100
Time taken for tests:   232.069 seconds
Complete requests:      1000000
Failed requests:        0
Write errors:           0
Total transferred:      281395406970 bytes
HTML transferred:       281181405900 bytes
Requests per second:    4309.07 [#/sec] (mean)
Time per request:       23.207 [ms] (mean)
Time per request:       0.232 [ms] (mean, across all concurrent requests)
Transfer rate:          1184132.86 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    1   0.2      1       3
Processing:     8   22   0.6     22      65
Waiting:        0    1   0.5      1      46
Total:          9   23   0.6     23      65

Percentage of the requests served within a certain time (ms)
  50%     23
  66%     23
  75%     24
  80%     24
  90%     24
  95%     24
  98%     24
  99%     24
 100%     65 (longest request)


Here’s my smallest cloud instance running Apache (tested from an m1.xlarge running in the same availability zone). The results are different because this is a network test involving two nodes, so more latency is expected. Actually, there are a lot of differences in this next sample. Two concurrent connections is a very different load from 100. The request targets the /blog path on the Apache http server, but note that ab reports every response as non-2xx with a 234-byte body, so this run most likely measured a small redirect or error response rather than a full page like the one in the last sample.


This is ApacheBench, Version 2.3 <$Revision: 655654 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking asherbond.com (be patient)
Completed 10000 requests
Completed 20000 requests
Completed 30000 requests
Completed 40000 requests
Completed 50000 requests
Completed 60000 requests
Completed 70000 requests
Completed 80000 requests
Completed 90000 requests
Completed 100000 requests
Finished 100000 requests

Server Software:        Apache
Server Hostname:        asherbond.com
Server Port:            80

Document Path:          /blog
Document Length:        234 bytes

Concurrency Level:      2
Time taken for tests:   135.299 seconds
Complete requests:      100000
Failed requests:        0
Write errors:           0
Non-2xx responses:      100000
Total transferred:      46700000 bytes
HTML transferred:       23400000 bytes
Requests per second:    739.11 [#/sec] (mean)
Time per request:       2.706 [ms] (mean)
Time per request:       1.353 [ms] (mean, across all concurrent requests)
Transfer rate:          337.07 [Kbytes/sec] received

Connection Times (ms)
             min  mean[+/-sd] median   max
Connect:        1    1   9.5      1    3002
Processing:     1    1   0.4      1      33
Waiting:        1    1   0.4      1      32
Total:          2    3   9.5      3    3003

Percentage of the requests served within a certain time (ms)
 50%      3
 66%      3
 75%      3
 80%      3
 90%      3
 95%      4
 98%      4
 99%      5
100%   3003 (longest request)

Considerations

  1. This is not a comparative analysis, but rather a generally uncontrolled experiment to collect system performance data from the cloud.
  2. Service-oriented Architecture is volatile when the supporting service layers are volatile.
  3. Compute infrastructure services (even EC2 m1.* and especially t1.micro) may be volatile depending on network health and demands at a given time.
  4. Benchmarking a local loop-back may give understated performance results on computers with lower IO bandwidth.
  5. Benchmarking a local loop-back may overstate performance relative to real-world service that traverses networks suffering from high latency between client and server nodes.
  6. Some network, virtual, and paravirtual compute environments limit the number of concurrent connections during high (or even moderate) utilization.
  7. 100 concurrent connections isn’t very many, especially for Amazon Web Services.
  8. It would be interesting to see how many requests could be handled with 1000 concurrent connections.
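For anyone curious about that last consideration, here is one way the experiment could be set up. It's a hedged sketch: `ab_sweep` and the `RUN` switch are names invented here, and by default it only prints the `ab` commands so nothing gets hammered by accident.

```shell
#!/bin/bash
# ab_sweep URL REQUESTS CONCURRENCY...
# Prints one ab command per concurrency level.
# Set RUN=1 in the environment to execute them and keep only the
# summary lines that matter for comparing runs.
ab_sweep() {
    local url="$1" requests="$2"
    shift 2
    local c
    for c in "$@"; do
        if [ "${RUN:-0}" = "1" ]; then
            ab -n "$requests" -c "$c" "$url" |
                grep -E 'Concurrency Level|Requests per second|Failed requests'
        else
            echo "ab -n $requests -c $c $url"
        fi
    done
}
```

For example, `ab_sweep http://localhost:80/ 1000000 100 1000` prints the two commands, and `RUN=1 ab_sweep http://localhost:80/ 1000000 100 1000` would actually run them. At 1000 concurrent connections the open file limit (`ulimit -n`) may need to be raised on both the client and the server before the numbers mean much.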

Conclusions

  1. Bigger has the potential to be better, but taking advantage of the compute capabilities of an 8-processor configuration requires additional performance tuning for the specific application.
  2. Sometimes the purpose of data collection reveals itself after such data becomes information.
  3. Sometimes it’s fun to show what a machine is capable of, whether you’re revving the engine on a dyno or just riding through some neighborhoods…

service-oriented modeling practices aren't being practiced?

Just when you thought your app was in “the cloud” … someone showed you that you’re still on a VPS. Everything’s in the box…. DBMS, some scripting framework your web guy likes, maybe some plugins and things. You’ve got a LAMP server running on there, some Python stuff… Ruby on Rails, ruby off the rails, and your ex cube-farm neighbor Jim’s whole crazy train of UI experiments. MySQL’s SuSQL? Whose SQL is that? Want to throw more compute power at the problem? OK! Service-oriented operators are standing by!


System Requirements:

  • Linux…
  • or.. Unix if you prefer
  • not running a non *nix based operating system

Senseless promotional point system options:

  • +10 cloud points if you’re a Debian/Ubuntu user.
  • +20 cloud points if you compile your own kernels.
  • +30 cloud points if you’re from California.
  • +40 cloud points if you’re from The Bay.
  • +100 cloud points if you managed to rack up 100 cloud points just now.

Other Requirements

  • Beta Participation with a tolerance for betavailability.
  • SCRAPERS is a release candidate, so is your app, probably… anyway.

What happens inside the SCRAPER CLOUD…

  1. Normally what happens in the cloud stays in the cloud, but I will tell you anyway…
  2. Your server, VPS, appliance or application (let’s call it an app) is placed into a Scalable Cloud Response Architecture Platform Elasticator (SCRAPE).
  3. Once inside the SCRAPE, your app is replicated and privately analyzed by a service-orientation analyzer engine (SOAE)… well… it’s more like a scraper bike pedaled by a service-oriented architect.
  4. Scalability is achieved by dividing your app into persistent data and elastic process service-layers that are provisioned by an Economy of Autoscale Elasticator Engine (EoAEE!!!!!!!).
  5. Your app will continue to get comfortable in the SCRAPE, reaching more efficiency as time goes on.
  6. Once your app has been service-oriented, your app will have learned to autoscale as needed and is ejected into the production cloud environment.
  7. Any SCRAPE’d app is compatible with Amazon Machine Images as well as Ubuntu Enterprise Cloud and Eucalyptus.

Promotional Service Rates for qualified apps (beta)

  • $1.60 per hour per SCRAPE standard (supports most midsized developing and production apps)
  • $3.30 per hour per SCRAPE VIP GOLD (livin’ large for famous apps)
  • additional compute nodes “pedants” can optionally be purchased as needed for $0.025 per hour (according to the optimized load average plan)
  • Network bandwidth and additional IPs: MARKET

Ask about premium pricing here.

In a previous post, I described an experimental method of mounting S3 as a virtual file system within a cloud instance. I’m still in the process of doing spring cleaning… although fall is basically here… but anyway cleanliness is generally overrated until it comes to the idea of getting web files organized properly in the cloud. So before I take a shower this morning I think I’ll finish moving some static content into content distribution networked storage bit buckets.

#!/bin/bash
# Asher Bond 2010
# http://www.asherbond.com/blog/2010/09/23/sizeup-sh-for-cleaning-house-i-mean-cloud/
# sizeup.sh [dir]
# sizes up the present working directory or some other directory
# by summarizing the directories inside. I use this script to
# make sure my cloud compute instances are storing files properly
# in walrus and s3 filesystems or google storage instead of cluttering
# the compute instance's internal file systems

if [ "$#" -gt 0 ] ; then
        # quote the argument and bail out if the directory isn't there
        cd "$1" || exit 1
fi

find . -maxdepth 1 -type d -exec du -hs {} \;
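
sizeup.sh prints directories in whatever order find hands them back. Here's a minimal sketch of a sorted variant, assuming a find that supports -mindepth (GNU and BSD both do). The name sizeup_sorted is made up for illustration and isn't part of the original script.

```shell
#!/bin/bash
# sizeup_sorted [dir]
# hypothetical helper, not part of the original sizeup.sh:
# lists the immediate subdirectories of a directory, biggest first,
# with sizes in kilobytes
sizeup_sorted ()
{
        local dir="${1:-.}"

        # -mindepth 1 leaves out the directory itself
        # du -sk prints "<kilobytes><TAB><path>", sort -rn puts the biggest on top
        find "$dir" -mindepth 1 -maxdepth 1 -type d -exec du -sk {} + | sort -rn
}
```

Something like `sizeup_sorted /var | head -n 5` would then show the five heaviest directories under /var.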

A Content Distribution Network is often more efficient than one point of delivery such as a single or centralized web server / VPS.

Try to store static content in a CDN for better content distribution at the edge.

Please note this is risky business down here.
Here’s how a thrill seeker could try to move a whole bunch of stuff into a CDN real fast:

sizeup.sh
# oh wow what’s up here? that one is like fulla mp3s and videos
s3-mount.sh cdn.somedomain.com /mnt/cdn.somedomain.com
mv /var/some-directory/some-big-podcast-2010* /mnt/cdn.somedomain.com
echo "hey web master dude I just moved all ur files into the cdn, so update ur links." | wall
# oh no I’m still the web master…
tar -cvf backup-in-case-my-links-dont-get-updated.tar /var/some-web-site
gzip *.tar
umount fuse
s3-mount.sh backups.somedomain.com /mnt/backups.somedomain.com
mv backup-in-case-my-links-dont-get-updated.tar.gz /mnt/backups.somedomain.com
cd /var/some-web-site
find . -type f -exec perl -p -i -e 's/some-old-links/some-new-cdn-links/g' {} \;
# whoa i hope that worked… LOL!
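
For the less thrill-seeking, one way to lower the risk is to copy each file into the mounted bucket, compare the copy against the original, and only then delete the original. This is a rough sketch, assuming the bucket is already mounted as a directory. safe_move is a hypothetical name, and I haven't verified how reliably a read-back comparison behaves on a FUSE-mounted S3 bucket.

```shell
#!/bin/bash
# safe_move file dest_dir
# hypothetical helper: copy, verify, and only then remove the original
safe_move ()
{
        local src="$1"
        local dest_dir="$2"
        local dest="$dest_dir/$(basename "$src")"

        # refuse to run if the destination directory is not there
        [ -d "$dest_dir" ] || { echo "no such directory: $dest_dir" >&2; return 1; }

        # copy first, then compare byte for byte
        cp "$src" "$dest" || return 1
        if cmp -s "$src" "$dest" ; then
                rm -f "$src"
        else
                echo "copy of $src does not match, keeping the original" >&2
                return 1
        fi
}
```

Looping it over the podcast files instead of a bare mv means a failed upload leaves the originals where they were.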

If you’re wondering if there’s a difference between backups in the Service-oriented Cloud and backups that the rest of the world is familiar with… well… there are 5 key differences mostly stemming from the fact that cloud backups are service-based (surprise-surprise)… but the rest of the world is probably not familiar with backing things up anyway, so let’s continue.

In SOA, a “cloud backup” is done by taking a snapshot from outside the virtual device being backed up. It’s not really new. This can be done using Xen, Veritas, Amazon Elastic Block Store (EBS) snapshots, etc, etc…

Some people don’t like the idea of backing up an entire instance with its binaries, log files, and duplicate data. I believe that redundancy is useful if not necessary for reliable backups, so I take the big snapshots once per day or so, but I also back up smaller files more frequently for added roll-back-ability or whatever you want to call it. Here’s how I back up scripts from cloud appliances to my S3 bit buckets. Remember that a backup is only as good as your ability to restore it, and automatic backups should be tested often. You might also want to periodically delete old backups that you don’t need, but this is optional and could be hasty. Redundancy can help ensure better data integrity for backups, but it comes at the cost of disk space and some network bandwidth… and you have to keep backups safe!

#!/bin/bash
# Asher Bond 2010
# http://www.asherbond.com/blog/2010/09/15/service-oriented-backups-from-ec2-to-s3/
# backup-scripts.sh
# backup scripts every hour
# slightly tested on Debian Lenny
# put this in your /etc/cron.hourly

# no trailing slashes
local_backup_dir='/var/backups';
remote_backup_dir='/mnt/backups.asherbond.com';

# script directories to recursively back up
script_dirs='/etc'

# learn the date in rfc-3339 format
date=`date --rfc-3339=seconds | cut -d: -f1 | tr ' ' '-'`;

hostname=`hostname`;

file_prefix="$hostname-backup-scripts-";

# files look like: myhostname-backup-scripts-YYYY-MM-DD-HH.tar.gz when they're done
filename="$file_prefix$date.tar";

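# bail out if the backup directory is missing, so nothing gets
# deleted or archived in the wrong place
cd "$local_backup_dir" || exit 1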

# delete any local backups older than 7 days
echo "Deleting backups older than 7 days..."
find . -type f -ctime +7 -name "$file_prefix*.tar.gz" -exec rm -f {} \;

echo "Archiving files..."
tar -cvf $filename $script_dirs
gzip $filename

# mount s3 backup bit bucket using FUSE
# http://www.asherbond.com/blog/2010/09/14/mount-an-amazon-s3-bit-bucket-as-a-drive-in-unix-using-fuse/
/etc/asher-bond-cloud/s3-mount.sh backups.asherbond.com

# copy to remote bit bucket
echo "Copying backup to Amazon S3 bit bucket..."
cp $filename.gz $remote_backup_dir

# unmount fuse when done
echo "Dismounting from S3, what a trusty workhorse..."
sleep 30 && umount fuse

echo "FIN."

The output will look something like this:

Deleting backups older than 7 days...
Archiving files...
/etc/
/etc/mysql/
/etc/mysql/debian.cnf
/etc/mysql/my.cnf
/etc/asher-bond-cloud/
/etc/asher-bond-cloud/loadavg.py
/etc/asher-bond-cloud/backup-scripts.sh
/etc/asher-bond-cloud/s3-mount.sh
/etc/etc/etc/etc/lol
Copying backup to Amazon S3 bit bucket...
Getting object list from S3 …
Validating cache …
Setup complete
Dismounting from S3, what a trusty workhorse...
FIN.
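
Since a backup is only as good as your ability to restore it, a cheap first check is to confirm the archive is readable before trusting it. Here's a minimal sketch, assuming tar and gzip are installed. verify_backup is a hypothetical helper, not part of backup-scripts.sh. It only proves the archive can be read end to end, which is not the same thing as a real restore test.

```shell
#!/bin/bash
# verify_backup archive.tar.gz
# hypothetical helper: checks that a gzipped tar archive can be read
verify_backup ()
{
        local archive="$1"

        # -s is true only for a file that exists and is not empty
        [ -s "$archive" ] || { echo "missing or empty: $archive" >&2; return 1; }

        # gzip -t checks the compressed stream, tar -tzf walks every entry
        gzip -t "$archive" 2>/dev/null || { echo "corrupt gzip: $archive" >&2; return 1; }
        tar -tzf "$archive" > /dev/null 2>&1 || { echo "unreadable tar: $archive" >&2; return 1; }

        echo "ok: $archive"
}
```

It could be called on the local .tar.gz right after gzip finishes, and again on the copy in the mounted bucket before unmounting.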

It’s a long way down if your head is in the CLOUD.
- Asher Bond

You can put this in your startup scripts, but I just run mine when I want to upload files to S3 for content distribution or for backup. A couple of my backup scripts invoke this script, then umount fuse when they’re done backing things up to S3. It hasn’t been fully tested yet, but let me know how it works out. Remember to store your AWS credentials in a safe place where only trusted people can read them. It’s also a good idea to expire and rotate them frequently.

#!/bin/bash
# Asher Bond 2010
# http://www.asherbond.com/blog/2010/09/14/mount-an-amazon-s3-bit-bucket-as-a-drive-in-unix-using-fuse/
#
# USAGE: s3-mount.sh bucket-name [/mnt/point/optional/if/different]
#
# mount s3 using s3-simple-fuse
# This is released under GPL
# See http://code.google.com/p/s3-simple-fuse/ for s3-simple-fuse
# you might need to:
# apt-get install python-fuse
# apt-get install python-dateutil
# apt-get install python-boto

# if you don't specify a mount point it just assumes /mnt/bit-bucket-name
function s3-mount ()
{

        if [ "$#" -eq 1 ] ; then
                # no mount point given, so assume /mnt/bit-bucket-name
                mnt="/mnt/$1"
        else
                # allow for custom /mnt/points
                mnt="$2"
        fi

        # keep this safe
        aws_access_key_id='AKUIAS7YOURMOM8XUTA4GA'
        aws_secret_access_key='sa09idontask2hsdfkjh34tnotrealna5p8'

        # -p keeps mkdir quiet when the mount point already exists
        mkdir -p "$mnt"
        s3-simple-fuse "$mnt" -o AWS_ACCESS_KEY_ID=$aws_access_key_id,AWS_SECRET_ACCESS_KEY=$aws_secret_access_key,bucket=$1

}

s3-mount "$@"

# to-do: mount google storage
# http://code.google.com/apis/storage/
#function google-storage-mount ()
#{
#}
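
One way to keep the keys out of the script itself is to read them from a separate file that only its owner can read. This is a sketch under assumptions: the file path, its key=value layout, and the load_aws_credentials name are all made up for illustration, and stat -c is the GNU coreutils form (as found on Debian).

```shell
#!/bin/bash
# load_aws_credentials [file]
# hypothetical helper: sets aws_access_key_id and aws_secret_access_key
# from a file, but only if the file is private to its owner
load_aws_credentials ()
{
        local cred_file="${1:-/etc/asher-bond-cloud/aws-credentials}"   # hypothetical path
        local mode

        [ -r "$cred_file" ] || { echo "cannot read $cred_file" >&2; return 1; }

        # refuse files that group or others can read (GNU stat)
        mode=$(stat -c '%a' "$cred_file")
        case "$mode" in
                600|400) ;;
                *) echo "$cred_file has mode $mode, expected 600 or 400" >&2; return 1 ;;
        esac

        # assumed layout, one per line:
        #   aws_access_key_id=...
        #   aws_secret_access_key=...
        aws_access_key_id=$(sed -n 's/^aws_access_key_id=//p' "$cred_file")
        aws_secret_access_key=$(sed -n 's/^aws_secret_access_key=//p' "$cred_file")

        # succeed only if both values were found
        [ -n "$aws_access_key_id" ] && [ -n "$aws_secret_access_key" ]
}
```

Inside s3-mount, the two hard-coded assignments could then be swapped for a call like `load_aws_credentials || return 1`, which also makes rotating the keys a one-file change.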