On Cascading Failures and Amazon’s Elastic Block Store

This post is one in a series discussing storage architectures in the cloud. Read Network Storage in the Cloud: Delicious but Deadly and Magical Block Store: When Abstractions Fail Us for more insight.

Resilient, adjective, /riˈzilyənt/ “Able to withstand or recover quickly from difficult conditions”.

In patients with a cough, you know what commonly causes them to keep coughing? Coughing.

Nearly 4 years ago I wrote a post titled “Why EC2 isn’t yet a platform for ‘normal’ web applications” and said that the “No block storage persistence” was a feature of EC2: Making it fine for such things as batch compute on objects in S3 but likely making it difficult for people expecting to use then-state-of-the-art databases.

Their eventual solution was to provide what most people are familiar with, basically a LUN coming off of a centralized storage infrastructure. Thus the command of /mount comes back into use and one can start booting /root partitions from something other than S3. While there was the opportunity to kill centralized SAN-like storage, it was not taken.

Continue reading “On Cascading Failures and Amazon’s Elastic Block Store”

Facebook’s Open Compute: The Data Center is the New Server and the Rise of the Taiwanese Tigers

Today Facebook took the great step of openly talking about their server and datacenter designs at the level of detail where they can actually be replicated by others. Another reason why I call it “great?” Well, it’s interesting that the sourcing and design of these was done by Facebook and with Taiwanese component makers. Nothing new for many of us working in the industry, but it’s something that’s often not discussed in the press when talking about US server companies.

If you take a look at the Facebook Open Compute server page and listen to the video with Frank Frankovsky you’ll hear a few company names mentioned. Many of them might not be familiar to you. Frank is the Director of Hardware Design and Supply Chain at Facebook, and used to be at Dell DCS (the datacenter solutions group) where he was the first technologist. One last piece of trivia: He was the technologist that covered Joyent too. We’ve been lucky enough to have bought servers from him and Steve six years ago and went out for sushi when he was down here interviewing.

So who made the boxes?

Continue reading “Facebook’s Open Compute: The Data Center is the New Server and the Rise of the Taiwanese Tigers”

Comparing Virtual Machines is Like Comparing Cars: It Doesn’t Get to their Actual Utility or Value

A BMW and a Yugo are both cars. In a Yugo, “carpet” was listed as a feature. Enough said.

McCrory recently blogged a Public Cloud hourly cost comparison comparing Microsoft, Amazon, Rackspace and Joyent. I’m happy to see Joyent included in such great company but the comparisons are between “VMs.” As stated by Alistair Croll, “the VM is a convenient, dangerous unit of measure dragging the physical world into the virtual.” And as I’ve said before, the “machine,” either physical or virtual, is a poor measure for capacity planning and actual costs. Just like how a “fiber” is a poor measure of bandwidth. (Remember when the internet was the “cloud” depicted in slides?)

As a criticism, we still do a poor job of making this clear.

Continue reading “Comparing Virtual Machines is Like Comparing Cars: It Doesn’t Get to their Actual Utility or Value”

Part 3, On Joyent and Accelerators as Cloud Computing “Primitives”

In the last part of this series we ended by talking about 6 “simple” utilities that software uses on “servers”. They were

1) CPU space
2) Memory space
3) Disc space
4) Memory bus IO
5) Disc IO
6) Network IO

Along with their natural minimums (zero) and maximums.

Providing compute units that do these utilities

What we’ve always wanted to do at Joyent was provide “scalable network appliances”: online servers that just worked for given functions and were capable of both handling bursts and serving as logical lego-like building blocks for new and legacy architectures. Sometimes these appliances might contain our own software, sometimes not. They would be on a network where it would be difficult for a given piece of software to saturate the immediate parts of it.

For most workloads, a ratio of 1 CPU:4GB:2 spindles works pretty well. The faster the CPU (constrained by power) the better, memory is well … memory, and the size and speed of those spindles can be varied depending on what a node is going to be doing (one end of a workload possibility or the other). In other words, 1) CPU space, 2) Memory space and 3) Disc space aren’t terribly interesting or difficult to schedule and manage in most environments (in some ways, they’re a purchasing decision), and 4) Memory bus IO is set in silicon stone by their creators.

Which ones matter?

We’re left with disc and network IO as being the key utilities and the ability to move things on-and-off disc and in-and-out of the network comes from using more CPU. So when an application experiences a surge in activity, it typically results in more use of the CPU, disc IO and network IO. We decided the best approach was to put CPU and disc IO on a fair-share system, and standardize on an overbuilt network (10 Gbps interconnects, multi-gigabit out of the back of each physical node). Our physical servers exclusively use intel NICs in a PCIe format. This way we can have multiple gigabits per node, physical segregation and failover as options, and we can standardize on a network driver and get away from on-board NICs changing from server to server.

The fair share system overall is supposed to provide for short-term (could just be minutes) bursting needed to handle spikes in demand. Spikes that occur too fast for either hardware upgrades or even to spin up new VMs on most systems (and are often too fast to be noticed in 5-10 minute monitoring pings). This allows for you to stop thinking about the servers that you pay for as being these constrained maximums, and start thinking about a “1 CPU, 4GB of RAM” server as being a guaranteed minimum allotment.

This was at least why we stopped calling them “Grid Containers” and started calling them “Accelerators”.

Moving up the stack

The next installments are going to be talking about some kernel and userland experiences and choices, our desire to use ever more “dense” hardware, 32 and 64 bit environments, and rapid reboot and recovery times.

A Loving Cloud

Yesterday several members of Joyent’s team attended Structure ’08. Jason Hoffman was on a panel that produced some interesting debate about whether clouds should aim to be open. The story was even picked up by the Wall Street Journal’s Don Clark in an article entitled Finding A Friendly Cloud

Jason Hoffman, founder and chief technology officer of a cloud-computing specialist called Joyent, was particularly pointed in warning that Google’s App Engine could represent a lock-in to developers. It is possible to build “a loving cloud,” he argued, that would make it easier to create applications that could be easily moved among different services. Other panelists kept calling Google’s App Engine “proprietary,” which to many techies is equivalent to labeling it both evil and outdated at the same time.

Here’s video of the exchange Jason had with Google’s Christophe Bisciglia

1 Billion Page Views a Month

Here’s a video detailing how LinkedIn built an application (Bumpersticker on the Facebook platform) using Rails (and C Ruby!) that serves up more than 1 billion page views a month.

In my opinion, this ends the debate about whether Rails scales. Rails is a component, it is how the components are architected and delivered that comprises the magic. LinkedIn did amazing work taking advantage of Joyent’s technology stack including innovative ways of leveraging Joyent Accelerators and our BigIP load balancers.

Congratulations to the LinkedIn team. Great accomplishment! You can read more about the how LinkedIn scaled bumpersticker on Joyent in a post on their blog entitled Web Scalability Practices: Bumper Sticker on Rails

View the video.

Update: ZDNet writes about the story.

Amazon Web Services or Joyent Accelerators: Reprise

In the Fall of 2006, I wrote a piece On Grids, the Ambitions of Amazon and Joyent, and followed up with Why EC2 isn’t yet a platform for ‘normal’ web applications and the recognition that When you’re really pushing traffic, Amazon S3 is more expensive than a CDN.

The point of these previous articles was to put what wasn’t yet called “cloud computing” into some perspective and to contrast what Amazon was doing with what we were doing. I ventured that EC2 is fine when you’re doing batch, parallel things on data that’s sitting in S3, and that S3 is economically fine as long as you’re not externally interacting with that data to a significant degree (then the request pricing kicks in). Basically it is incorrect that each are universally applicable to all problems and goals in computing, and that they’re cost-effective. An example of a good use case is a spidering application: one launches a number of EC2 instances, crawls a bunch of sites, puts that information into S3, and then launches a number of EC2 instances to build an index of that data and further store it on S3.

Beyond point-by-point features and cost differences, I believe there are inherent philosophical, technical and directional differences between Joyent and Amazon Web Services. This is and has been our core business, and it’s a business model, in my opinion, that competes directly with hardware vendors and customer taking direct possession of hardware and racking-and-stacking it in their own datacenters.

Cloud computing is meant to be inherently “better” than what most people can do themselves.

What’s changed with S3 and EC2 since these articles?

For S3? Nothing really. There are some additional data “silo” services now. SimpleDB is out and there has been some updates to SQS, but I would say that S3 is by far the more popular of the three. The reason is simple: it’s still possible for people to do silly things when storing files on a filesystem (like put a million directories in one directory), but it’s more difficult to do things as silly with a relational database (you still can, but they’re ultimately handled within the RDMS itself, for example, bad queries).

I’m consistently amazed by how many times I have to go over the idea of hashed directory storage.

For EC2 there’s been some improvements.

Annotating the list from “Why EC2 isn’t yet a platform for “normal” web applications we get:

1. No IP address persistence. EC2 now NATs and EC2 instances are on a private network. That helps. Are you able to get permanently assigned, VLAN’ed network address space? It’s not clear to me.

2. No block storage persistence. There is now an option to mount persistent storage in a “normal” way. Presumably it’s block storage over iSCSI (there’s not many options for doing this), hopefully it’s not a formalized FUSE to S3. We’ll see how this holds up performance-wise, now there’s a bit more predictability in data stored in EC2 but experience has shown me that it only takes one really busy database to tap out storage that’s supposed to be serving 10-100 customers. Scaling I/O is still non-trivial.

3. No opportunity for hardware-based load balancing. This is still the case.

4. No vertical scaling (you get a 1.7Ghz CPU and 1 GB of RAM, that’s it). There are now larger instances but the numbers are still odd. 7.5GB of RAM? I like powers of 2 and 10 (so does computer science).

5 & 6. Creation and handling of AMIs. Experience like this is still quite common, it seems.

Structure of modern applications

The three tiers of “web”, “application” and “database” are long dead.

Applications that have to serve data out (versus just pulling in like the spidering example earlier) are now typically structured like: Load Balancers/Application Switches (I prefer the second term) <-> Cache <-> Application <-> Cache <-> Data. Web and gaming applications are exhibiting similar structures. The caching tiers are optional and either can exist as a piece of middleware or as part of the one of the sandwiching tiers. For example, you might cache as part of the application, or in memcached, or you might just be using the query cache in the database itself. And while there are tiers, there are also silos that exist under their own namespaces. You don’t store static files in a relational database, your static assets are CDN’ed and served from e.g. assets[1-4].yourdomain.com, the dynamic sites from yourdomain.com and users logged-in at login.yourdomain.com. Those are different silos.

How to scale each part and why do people have problems in the first place?

Each tier either has state or not. Web applications are over HTTP, an inherently stateless protocol. So as long as one doesn’t introduce state into the application, the application layer is stateless and “easy” to horizontally scale. However, since one is limited in the number of IP addresses one can use to get to the application, and network latency will have an impact at a point, the “front” has state. Finally, the back-end data stores have state, by definition. We end up with: stateful front (Network) <-> stateless middle <-> stateful back. So our options for scaling would be: Load Balancers/Application Switches/Networking (Vertical) <-> Cache (Horizontal or Vertical) <-> Application (Horizontal) <-> Cache (Horizontal or Vertical) <-> Data (Vertical).

The limit to horizontal scale is the network and its latency. For example, you can horizontally scale out multi-master MySQL nodes (with a small and consistent dataset), but you’ll reach a point (somewhere in the 10-20 node range on a gigabit network) where latency now significantly impacts replication time around that ring.

Developing and scaling a “web” application means that you (or someone) has to deal with networking and data management (and different types of data for that matter) if you want to be cost-effective and scalable.

The approach one takes through this stack matters: platform directions

With the view above you can see the different approaches one can take to provide a platform. Amazon started with data stores, made them accessible via APIs, offered an accessible batch compute service on top of those data stores, introduced some predictability into the compute service (by offering some normal persistence), and has yet to deal with load-balancing and traffic-direction as a service. Basically they started with the back and should be working their way to the front.

At Joyent, we had different customers, customers making the choice between staying with their own hardware, or running on Joyent Accelerators. We started with the front (great networking, application switching), persistence, we let people keep their normal backends (and made them fast) and we are working for better solutions (horizontal) for data stores. Solving data storage needs weren’t as pressing because many were already wedded to a solution like MySQL or Oracle. An example of solving problems at the outermost edge of the network would be the article, The wonders of fbref and irules serving pages from Facebook’s cache. This is an example of programming in application switches to offload 5 pages responsible for 80% of an application’s traffic.

Joyent product progression is the opposite of AWS’s. We solved load-based scale with a platform that starts with great networking, well performing Accelerators, Accelerators that are more focused to do particular tasks (e.g. a MySQL cluster). We are working on data distribution for geographic scale, and making it all easier to use and more transparent (solve the final “scale”, administrative scale).

The technology stack of choice does matter: platform technology choices

Joyent Accelerators are uniquely built on the three pillars of Solaris: ZFS, DTrace and Zones. This trio is currently only present in OpenSolaris. What you put on metal is your core “operating system”. Period. Even if you call it a hypervisor, it’s basically an OS that’s running other operating systems. We put a solid kernel on our hardware.

Accelerators are meant to be inherently more performant then a XEN-based EC2 instance per unit of hardware, and to do so within normal ratios: 1 CPU/4GB RAM, utilities available in 1,2,4,8,16,32,64 gb sized chunks. The uniqueness of DTrace adds unparalleled observability, it makes it possible for us to figure out exactly what’s going on in kernel and userland and act upon it for customers in production.

ZFS lets us wrap each accelerator in a portable dataset, and as we’ve stated many times before, it makes any “server” a “storage appliance”.

Add to this Joyent’s use of f5 BigIP load-balancers, Force10 networking fabric, and dual-processor, quad-core, 32GB RAM servers.

Open and portable: platform philosophy

At Joyent, I don’t see us having an interest in running large, monolithic “services” for production applications and services. Things need to remain modular, and breakage in a given part needs to have zero to minimal impact on customers. Production applications shouldn’t use a service like S3 to serve files, they should have access to software with the same functionality and being able to run it on their own set of Accelerators.

We want software that powers services to be open, available, and enable you to run it yourself here on Accelerators, or actually anywhere you want. We develop applications ourselves exactly like you do, we tend to open source them and this is exactly what we would want from a “vendor”. This route also minimizes request (“tick”) pricing. We don’t want to entirely replace people choices in databases, instead Accelerators have to be made to be a powerful, functional base unit for them. Want to run MySQL, PostgreSQL, Oracle, J-EAI/ejabberd, … then by all means do that. No vendor lock-in.

For both platforms, we have our work cut out for us.