On Cascading Failures and Amazon’s Elastic Block Store

This post is one in a series discussing storage architectures in the cloud. Read “Network Storage in the Cloud: Delicious but Deadly” and “Magical Block Store: When Abstractions Fail Us” for more insight.

Resilient, adjective, /riˈzilyənt/ “Able to withstand or recover quickly from difficult conditions”.

In patients with a cough, you know what commonly causes them to keep coughing? Coughing.

Nearly 4 years ago I wrote a post titled “Why EC2 isn’t yet a platform for ‘normal’ web applications” and said that “no block storage persistence” was a feature of EC2: making it fine for such things as batch compute on objects in S3, but likely making it difficult for people expecting to use then-state-of-the-art databases.

Their eventual solution was to provide what most people are familiar with: basically a LUN coming off of a centralized storage infrastructure. Thus the mount command comes back into use, and one can start booting root partitions from something other than S3. While there was the opportunity to kill centralized SAN-like storage, it was not taken.


Facebook’s Open Compute: The Data Center is the New Server and the Rise of the Taiwanese Tigers

Today Facebook took the great step of openly talking about their server and datacenter designs at the level of detail where they can actually be replicated by others. Another reason why I call it “great”? Well, it’s interesting that the sourcing and design of these was done by Facebook with Taiwanese component makers. Nothing new for many of us working in the industry, but it’s something that’s often not discussed in the press when talking about US server companies.

If you take a look at the Facebook Open Compute server page and listen to the video with Frank Frankovsky, you’ll hear a few company names mentioned. Many of them might not be familiar to you. Frank is the Director of Hardware Design and Supply Chain at Facebook, and he used to be at Dell DCS (the Data Center Solutions group), where he was the first technologist. One last piece of trivia: he was the technologist that covered Joyent too. We were lucky enough to buy servers from him and Steve six years ago, and we went out for sushi when he was down here interviewing.

So who made the boxes?


Comparing Virtual Machines is Like Comparing Cars: It Doesn’t Get to their Actual Utility or Value

A BMW and a Yugo are both cars. In a Yugo, “carpet” was listed as a feature. Enough said.

McCrory recently blogged a Public Cloud hourly cost comparison covering Microsoft, Amazon, Rackspace and Joyent. I’m happy to see Joyent included in such great company, but the comparisons are between “VMs.” As stated by Alistair Croll, “the VM is a convenient, dangerous unit of measure dragging the physical world into the virtual.” And as I’ve said before, the “machine,” either physical or virtual, is a poor measure for capacity planning and actual costs. Just like how a “fiber” is a poor measure of bandwidth. (Remember when the internet was the “cloud” depicted in slides?)

As a self-criticism: we still do a poor job of making this clear.


HTTP FTW

I was reading “The Web Is Dead. Long Live the Internet” today, along with a good response on GigaOM.

To a number of people, “the web” = HTTP. And HTTP as a protocol on the Internets has clearly won.

The fun thing to notice is that “the web” to Chris Anderson and Michael Wolff is just content delivered on a web site. Video and peer-to-peer are still, on average, occurring over what protocol?

That’s right!

HTTP.

So while content is diversifying on their web, HTTP is clearly the winner.

Part 2, On Joyent and Cloud Computing “Primitives”

In the first part of this series I made a list of some of the key underlying ideas at Joyent: we believe that a company, or even a small development team, should be able to:

  1. Participate in a multi-tenant service
  2. Have your own instantiations of this service
  3. Install (and “buy”) the software to run on your own infrastructure
  4. Get software and APIs from Joyent that allow for the integration of all of these based on business desires and policies.

And I said:

The successful future “clouds” have to be more accessible, easier to use and to operate, and every single part of the infrastructure has to be addressable via software, has to be capable of being introspected into and instrumented by software, and this addressability means that one can write policies around access, performance, privacy, security and integrity. For example, most of our customers really don’t care about the details; they care about knowing that it is capable of providing 99.99% of their end users a great experience 99.99% of the time. These concepts have to be baked in.

I continue to think that from a developer’s perspective the future is closer to the SMART platform. Ramin’s comment on an older Joyeur article about EC2 versus Accelerators is relevant here; let me quote him:

Whoever has the fewest number of steps and the fastest build/deploy time is likely to attract the most developers. Whoever can show that the operating cost scales linearly with use will have developers casting flower petals in their path 🙂

As an app developer, I don’t care that it runs on Solaris, FreeBSD, or Mac-OS. I want it to work. I want an optimized deployment workflow and a simple way to monitor and keep things running.

That all said.

In this second part of the series I wanted to start talking about “primitives”. I’m saying “start” because we’re going to be covering primitives over the next couple of posts.

I’m going to loosely define “Primitives” (now with a capital P) as all of the stuff underneath your application, your language and the specific software you’re using to store your data. So yes, we’re talking about hardware and the software that runs that hardware. Even though most Primitives are supposed to eventually be hidden from a developer, they’re generally important to the business people and to those who have to evaluate a technology platform. They are important parts of the architecture when one is talking about “access, performance, privacy, security and integrity”.

Previously, I’ve talked a bit about Accelerators (On Accelerators) and about how fundamentally we deal with 6 utilities in cloud computing.

The fermions are the utilities where things take up space:

1) CPU space
2) Memory space
3) Disc space

The bosons are the utilities where things are moving through space and time:

4) Memory bus IO
5) Disc IO
6) Network IO

All of these utilities have physical maximums dictated by the hardware, and they have a limit I’d like to call How-Likely-Are-You-To-Do-This-From-One-Machine-Or-Even-At-All.

I’ll admit at this point to a particular way of thinking. I ask “what is the thing?”, “how is it going to behave?”, “what are the minimums and maximums of this behavior?” and finally “why?”.

The minimum for us is easy. It’s zero. Software using 0% of the CPUs, 0 GB of memory, doing 0 MB/sec of disc IO and 0 Gbps of network traffic.

The maximums:

  1. Commercially available CPUs typically top out in the 3s of GHz
  2. “Normal” servers typically have <128 GB of memory in them, and the ratio of 4 GB of memory per CPU core is a common one from HPC (we use this, and it would mean that a 128 GB system would have 32 cores)
  3. Drives are available up to a terabyte in size, but as they get larger you’re making performance trade-offs. And while you can get single namespaces into the petabyte range, ones >100 TB are still irritating to manage (for either the increased fragility of a larger and larger “space”, or the variation in latencies between a lot of independent “storage nodes”).
  4. CPUs and memory talk at speeds set by the chip and hardware manufacturers. Numbers like 24 Gbps are common.
  5. Disc IO can be in the Gbps without much of an issue
  6. For a 125 kB page with 20 objects on it, a saturated 1 Gbps link will give you 86,400,000 page views per day, and in a 30-day month that is 2,592,000,000 page views (check my math). Depending on how much stuff you have going on, this basically puts you in as a top-100 web property. With the number of public websites at ~200 million (source), being in the top 200 is what … 0.0001% of the sites?
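Since I asked you to check my math, here is the arithmetic worked through. This is a sketch assuming 125 kilobytes per page view (decimal units throughout) and a link that stays fully saturated all day, which no real property achieves:

```python
# Capacity check: how many page views can a saturated 1 Gbps link serve,
# assuming 125 kB per page view?

PAGE_BYTES = 125 * 1000          # 125 kB page, all 20 objects included
LINK_BPS = 1_000_000_000         # 1 Gbps, fully saturated

pages_per_second = LINK_BPS / (PAGE_BYTES * 8)   # bits per second / bits per page
pages_per_day = pages_per_second * 86_400        # seconds in a day
pages_per_month = pages_per_day * 30             # 30-day month

print(f"{pages_per_second:,.0f} pages/sec")      # 1,000 pages/sec
print(f"{pages_per_day:,.0f} pages/day")         # 86,400,000 pages/day
print(f"{pages_per_month:,.0f} pages/month")     # 2,592,000,000 pages/month
```

In practice you would divide these numbers down for peak-to-average ratios and headroom, but even a fraction of them is a very large property.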

As something to think about and as an anchor, I remember seeing a benchmark of a “Thumper JBOD” attached to a system capable of saturating the 4×10Gbps NIC cards in the back of it. Yes the software was special, yes it was in C, and yes it was written with the explicit purpose of pushing that much data off of discs; however, think about that for a minute.

Imagine having a web property doing over 100 billion monthly page views coming off of a single “system” that you can buy for a reasonable price. Starting from there, expand that architecture, and I wonder, with the “right software” and “primitives”, where you would end up. If we changed it from a web property to a gaming or a trading application, where would you end up? What is the taxonomy of applications out there (common and uncommon), and can we come up with the best architectures for each branch and leaf on that tree?
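To put a rough number on that anchor, here is the same hypothetical page-view arithmetic applied to a box saturating 4×10 Gbps NICs (same assumed 125 kB page as before, same caveat that sustained saturation is idealized):

```python
# Sanity check of the anchor: a single system saturating 4 x 10 Gbps NICs,
# serving hypothetical 125 kB page views.
PAGE_BYTES = 125 * 1000                 # 125 kB per page view
LINK_BPS = 4 * 10_000_000_000           # 4 x 10 Gbps

pages_per_second = LINK_BPS / (PAGE_BYTES * 8)     # 40,000 pages/sec
pages_per_month = pages_per_second * 86_400 * 30   # 30-day month

print(f"{pages_per_month / 1e9:.1f} billion page views per month")  # 103.7
```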

Please think about that anchor and a taxonomy for a few days and then I’m going to get into some of the key differentiators of our Primitives and answer some of the “Why?”.

On Joyent and “Cloud Computing”, Part 1 of Many

With this post, I’m starting a series where we’re going to be much more explicit about what we’re thinking, what we’re doing, how we’re doing it and where we’re going. I’m not interested in any of it being thought of as impersonal “marketing material”, so I hope you’ll allow the occasional use of “I” and the interweaving of my perspectives and perhaps a story or two. As the CTO and one of the founders whose life has been this company, I’m going to go ahead, be bold and impose.

In May of this year, we’ll have been in business for five years, and looking back at it, it’s amazing that we’re still here. Looking back at it, it makes complete sense that we’re still here. Our idea was pretty simple: let’s take what is normally BIG infrastructure (i.e. 64 GB of RAM, 16-CPU servers, 10 Gbps networking), virtualize the entire vertical stack, and then go to the people who normally buy some servers, some switches and some storage, and sell them complete architectures. All done and ready to go.

It is commonly said that businesses only exist out of necessity, and that the best ones take away or take care of “pain”. Makes sense: doctors only exist because there is disease.

It is a pain for most companies (“Enterprises”) to expertly own and operate infrastructure that can scale up or down depending on their business needs. Infrastructure is more a garden than a monument, but it is rarely treated as one.

It is difficult to be innovative in the absence of observing tens of thousands of different applications doing tens of thousands of different things.

It is easy to make the same mistakes that others have made.

That final point is an important one, by the way. I’ll tell you why in a slightly roundabout way.

I’m a scientist by training and have always been a consumer of “large compute” and “big storage”, not just a developer of them (I can write about why that is a good thing another time). While at UCLA as a younger man, I was lucky enough to take a graduate physiology course where one of the lecturers was Jared Diamond, and later on I went to an early reading of one of his books that was coming out, Guns, Germs, and Steel. What reached out and permanently imprinted itself in my mind from that book was the Anna Karenina principle. It comes from the Tolstoy quote, “Happy families are all alike; every unhappy family is unhappy in its own way.” Simply put, it means that success is purely the absence of failure.

Success is purely the absence of failure.

Education and “products” are meant to save people and enterprises from mistakes. Both the avoidance of “known” mistakes (education) and the surviving of “unknown” mistakes (luck, will power, hustle, et al).

Internally I commonly say,

It’s fine to make mistakes as long as 1) no one has ever made that mistake before (if they have, you need to question your education or get some), 2) we’re not going to repeat this mistake (if we do, we need to question our processes, our metrics and how we communicate and educate), and 3) the mistake does not kill us.

This is all important because I believe the primary interest in “cloud computing” is not cost savings; it is participation in a larger network of companies and people like you. It is the consumption of products that

  1. Are more accessible and easy to use (usually from a developer’s perspective)
  2. Are easier to operate

However, that’s not all of it. I can say with complete certainty that the main barrier to adoption is trust. Trust, specifically, around privacy, security and integrity.

The successful future “clouds” have to be more accessible, easier to use and to operate, and every single part of the infrastructure has to be addressable via software, has to be capable of being introspected into and instrumented by software, and this addressability means that one can write policies around access, performance, privacy, security and integrity. For example, most of our customers really don’t care about the details; they care about knowing that it is capable of providing 99.99% of their end users a great experience 99.99% of the time. These concepts have to be baked in.

These are some of the underlying ideas behind everything that we’re doing. At Joyent, we believe that a company or even a small development team should be able to

  1. Participate in a multi-tenant service
  2. Have your own instantiations of this service
  3. Install (and “buy”) the software to run on your own infrastructure
  4. Get software and APIs from Joyent that allow for the integration of all of these based on business desires and policies.

Sometimes people refer to this as the “public and private cloud”. Oooook.

I happen to believe we’re capable of doing this quite well compared to most of the large players out there. We own and operate, and we’re making those APIs and software available (often open source).

Amazon Web Services is a competent owner and operator and allows you to participate in a multi-tenant service like S3, but there are no API extensions for integrity, you cannot get your own S3 separate from other people, you cannot install “S3” in your own datacenter, and of course there is no software that allows an object to move amongst these choices based on your policies.

Question: “If you have a petabyte in S3, how do you know it’s all there without downloading it all?” I could answer that question if I had a petabyte on NetApps in-house.
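As a sketch of how one could answer that question without downloading the petabyte: keep a manifest of per-object checksums at write time, then audit the store against metadata-only checksum lookups. This is an illustration of the idea, not any vendor’s actual API; the remote_checksum callable below is a hypothetical stand-in for whatever checksum the store exposes (S3’s ETag, for instance, is the object’s MD5 only for single-part uploads):

```python
import hashlib

def make_manifest(objects):
    """Record key -> MD5 hex digest at upload time."""
    return {key: hashlib.md5(data).hexdigest() for key, data in objects.items()}

def audit(manifest, remote_checksum):
    """Return keys that are missing or whose remote checksum has drifted.

    remote_checksum is a callable that fetches only the object's checksum
    (think: a metadata/HEAD request), never the object body.
    """
    bad = []
    for key, digest in manifest.items():
        if remote_checksum(key) != digest:
            bad.append(key)
    return bad

# Toy "remote store": in reality this would be metadata requests to the service.
store = {"a": b"hello", "b": b"world"}
manifest = make_manifest(store)
store["b"] = b"w0rld"                    # simulate silent corruption
damaged = audit(manifest, lambda k: hashlib.md5(store.get(k, b"")).hexdigest())
print(damaged)                           # ['b']
```

The point is that an integrity API is a primitive: without a trustworthy checksum exposed by the provider, your only audit is the full download.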

The Ciscos, Suns, HPs and IBMs (I’ll toss in the AT&Ts too) of the world want to sell you more co-lo (perhaps something best called co-habitation), more hardware, more software, less innovation and no ease of operation.

Larry Ellison says that this is all a marketing term. I think he’s wrong. We think he’s wrong, and that Joyent seems to be unique in focusing on making the above a reality.

Next week, I’ll be discussing “primitives” and “architectures”.

What is Cloud Computing?

We went to the Web 2.0 Expo and asked, “What is Cloud Computing?” YouTube has a high-definition version of What is Cloud Computing.

Check out what the following people had to say:

What do you think Cloud Computing is?