On Grids, the Ambitions of Amazon and Joyent

There’s a lot of talk lately about “grids”.

And Amazon.

The word “grid” has reappeared in marketing materials, and we’ve seen it brought up during the emergence of companies offering utility computing and storage products (or at least they want you to think that’s what they’re really offering).

There’s also definitely been a PR push from Amazon about its Amazon Web Services product line. It really started with a keynote at MIT Tech Review’s Emerging Technologies Conference (eweek coverage), then the Web 2.0 thing and the pre-expo BusinessWeek article, then here, here, here and here, and it looks set to continue with Werner Vogels speaking at the Future of Web Apps conference, an active evangelism group and more and more press.

I like all of this because it validates our own business model: we have our own applications, we provide infrastructure to others, and those others tend to be like-minded developers. And as that all gels with time, effort and development what emerges is a platform.

Simple.

Right?

A platform.

But let me take the opportunity to discuss, clarify and challenge a few things.

What is a grid? What is “The Grid”?

(Note the “a” versus “the” and “grid” versus “Grid”.)

In 1998, Ian Foster and Carl Kesselman said:

“A computational grid is a hardware and software infrastructure that provides dependable, consistent, pervasive, and inexpensive access to high-end computational capabilities.”

Then four years later in 2002, Ian Foster generalized the definition (see What is The Grid) and said that a “grid” would have to meet three criteria:

  1. Coordinate between different organizations (implying that there would be economic, social and security models and policies),
  2. Adhere to open standards and protocols, and
  3. Deliver a “non-trivial QoS”.

This provided for less of a functional and compositional definition (a grid is servers that do xyz) and more of a social and business definition. Notice that this still defines a grid, meaning that as an organization, we can say we’re providing a grid service when these criteria are met. The Grid is a larger utopian umbrella of world-encompassing computational might!

I’d also suggest reading the ideas of a grid’s structure and function as laid out in The Physiology of the Grid.

The analogous concept of a grid and The Grid in our electrical utility industry (see a good paragraph by Nick Carr) is easy to see and conceptualize, and even it still faces issues in deregulation and commoditization (perhaps you remember Enron and $1000/month electricity bills for a one bedroom apartment in San Diego, California?).

But where one comes across “grids” in computer science and IT most often is the concept of a networking grid (and beyond that a networking “mesh”). And network peering between providers is of course a reality and common.

The network and the concept of peering are a good place to bring a couple of things together (and I’ll bring them back into the conversation when we talk about Amazon).

Because the “network” is still the real bottleneck both technically (latency and speed of ethernet) and economically in The Grid.

There is also currently no such thing as Grid Peering (the automatic failover and re-distribution of stuff from one provider to another with the goal of providing that “non-trivial QoS”) but that’s why there is an “open standards and protocols” criterion in Foster’s list.

Networks and interconnects aren’t free: they are laid down by private companies that seek to make money from them, and the speed, latency and capacity of the networks are important.

The fastest connections on the internet now are really OC-192, which in ethernetland we’ll say is essentially a 10 Gbps connection (GigaOm recently reported on Infinera’s 100 Gbps ethernet connection that could carry data over a 4000 kilometer fiber network). In datacenters, the common limit is 1 Gbps, with 10 Gbps ethernet and slightly faster and lower latency InfiniBand making some appearances. Within a computer though the interconnects are significantly faster and the components are very close: for example, a Sun X4100 has “three 8.0 GB/sec HyperTransport links with 6.0 GB/sec access between processor and memory”. 6 GB/sec is 48 Gbps (1 byte = 8 bits, and GB versus Gb is byte versus bit).
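To make the bytes-versus-bits arithmetic concrete, here is a minimal Python sketch. The function and dictionary names are made up for illustration, and treating OC-192 as a flat 10 Gbps is the same rounding used in the text; the figures themselves come from the paragraph above.

```python
# Unit conversion sketch: rates quoted in gigabytes/sec (GB/s) to gigabits/sec (Gbps).
# All names are illustrative; the numbers come from the text above.

BITS_PER_BYTE = 8


def gbytes_to_gbits(gbytes_per_sec: float) -> float:
    """Convert a rate in gigabytes/sec to gigabits/sec."""
    return gbytes_per_sec * BITS_PER_BYTE


# Link speeds mentioned in the text, all expressed in Gbps.
links_gbps = {
    "datacenter ethernet (common)": 1.0,
    "OC-192, rounded to 10 Gbps": 10.0,
    "X4100 processor-to-memory (6.0 GB/sec)": gbytes_to_gbits(6.0),
    "X4100 HyperTransport link (8.0 GB/sec)": gbytes_to_gbits(8.0),
}

baseline = links_gbps["datacenter ethernet (common)"]
for name, gbps in links_gbps.items():
    # Show each link as a multiple of a single 1 Gbps server port.
    print(f"{name}: {gbps:g} Gbps ({gbps / baseline:g}x a 1 Gbps port)")
```

The point of the comparison is the ratio: what happens inside the chassis runs at dozens of times the speed of the port on the back of the server.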

The processor and memory talk to each other with a cap of 48 Gbps, and because they’re in close proximity the latency is very low. The typical connectivity out of the back of a server would be 1 Gbps with standard copper (and 10 to 40 Gbps going from ethernet up to InfiniBand), and filling up a 1 Gbps uplink to the outside world (“The Internet”) would run you approximately $50,000/month for a single (decent Tier 1) provider.
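As a back-of-the-envelope check on what that uplink means per gigabyte moved, here is a rough Python sketch. It assumes a 30-day month, decimal gigabytes (10^9 bytes) and a link that is kept saturated the whole time; those assumptions and the function name are mine, and only the $50,000/month and 1 Gbps figures come from the text.

```python
# Rough cost-per-GB estimate for a flat-rate uplink.
# Assumptions (not from the text): 30-day month, decimal GB, constant utilization.

SECONDS_PER_DAY = 86_400


def cost_per_gb(monthly_price_usd: float, link_gbps: float,
                days: int = 30, utilization: float = 1.0) -> float:
    """Dollars per gigabyte transferred over a link billed at a flat monthly price."""
    bytes_per_sec = link_gbps * 1e9 / 8 * utilization
    gb_per_month = bytes_per_sec * SECONDS_PER_DAY * days / 1e9
    return monthly_price_usd / gb_per_month


saturated = cost_per_gb(50_000, 1.0)
half_used = cost_per_gb(50_000, 1.0, utilization=0.5)

print(f"fully saturated: ${saturated:.3f}/GB")
print(f"50% utilized:    ${half_used:.3f}/GB")
```

A fully saturated link works out to roughly $0.15 per GB, and the per-GB figure doubles if the pipe is only half used, which is why utilization matters so much to anyone reselling bandwidth.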

So the limit to completely blurring out physical and geographical distinctions between computers is the network (still). It really serves then as the backplane for The Grid (from Dave Hitz of NetApp).

What if everything was normalized out? Could one have processors in one chassis and RAM in another? Could an application in one datacenter context switch to servers in another datacenter?

Yes.

Onto Amazon

In the last year, Amazon has launched a series of interesting services: a simple messaging queue, S3 storage (utility storage) and EC2 (utility computation).

They’re relatively easy for an experienced programmer to use, and you get near instant access to them when you sign up. They’re not typical though, and there is only API access to storage.

The instant access and scalability must have some limits as well, and I don’t mean physical or computational limits. I mean that EC2 and S3 would be a great target for spammers and their various cousins to hit. Imagine being able to ramp up 5000 EC2 images with the same list of emails and setup, and start cranking out emails to billions of people. Massively in parallel, you could get a lot done in a short period of time. What stops this behavior? Where are the audits? How real-time is the billing? What kinds of protections are in place against people who could use EC2 to launch a DDoS attack on another provider or company?

I’m sure that it’s covered in an acceptable use policy, but a policy doesn’t proactively stop such behavior. And as someone who runs mail systems where 90% of the traffic is spam, I get concerned.

A decent rationale from Bezos for getting into this space is seen in some of the initial reports:

“‘The reason we’re doing this is because we think we can empower developers with a new kind of Web-scale technology. And we can make a profitable business for ourselves.’”

“‘The idea of using infrastructure Web services to remove costs for other businesses is something that’s already being accomplished by efforts like S3,’ said Bezos. ‘Our goal is to build services that are incredibly easy for developers to use and also very reliable, that can return results rapidly at a low cost, and allow users to pay by the drink.’” (from eweek)

In the weeks since the MIT Emerging Technologies Conference, there were challenges that this is really a distraction for Amazon, which runs one of the largest online e-commerce sites in existence and is only just barely profitable. Amazon Web Services is a separate LLC though, so it’s already off as another business. I don’t think that Amazon gave all of its infrastructure to AWS, though, and then buys it back as a utility.

So we saw the appearance of the rationale that Amazon is simply reselling non-utilized space, memory and CPU cycles, and it begins to show up in others’ writing, with the rationale propagated by Nick Carr and Isabel Wang:

“But Bezos argues that Amazon is a natural in the emerging utility computing world. It has been honing its skills in large-scale technology operations for 11 years, and it has invested billions of dollars in its setup. Why not offer that infrastructure to others while it’s idle?” (Isabel Wang)

I think the first half of that is fine but I don’t agree with the non-utilization argument.

There is no concept of “idle” in storage: while the disk might not be accessed at a given moment, the files are definitely taking up space.

This is an important distinction, because if Amazon needed space in a crunch, they couldn’t reclaim it. Not if other people are paying to be there.

If they needed computational power in a crunch, they could reclaim it. It would just mean turning everyone else off, assuming that they’re actually selling excess. The store’s infrastructure of course has limits: an Xbox promotion brought the site down for a bit during this last Thanksgiving holiday.

Even though the SmugMug story of saving lots of money has been making the rounds, I don’t buy it. Purely because we do our own storage infrastructure and I know that we can offer storage at the same price as Amazon S3 and still make money … so?

Instead the story of webmail.us is the most compelling and best use I’ve seen so far. By “best”, I mean most appropriate for the infrastructure, and applications. They intelligently use the combination of the message bus, compute and storage, to keep the correct things near each other. I think it’s also a great example of “The Grid” having the ability to get functionality off another provider, and to have that provider give you enough compute and storage to handle what should be best handled “locally”.

Because where is the main infrastructure for webmail.us itself? It’s at Rackspace.

So is webmail.us’s use of Amazon’s web services a success for Amazon or a failure of Rackspace? Or both?

Will Amazon’s product offering mature to the point where it would make sense to run everything there? Will Rackspace and similar managed hosting companies wise up and begin to offer comparable services?

Will we begin to do Grid Peering relationships where, say, our users or Rackspace’s users could have network access from our servers to Amazon’s without incurring a bandwidth charge?

Who is hurt most by Amazon’s moves then? Nicholas Carr has this one correct: “Most of the big tech vendors—including IBM, HP and Sun—offer basic computing infrastructure as a pay-as-you go utility. But I think at the moment, they’re being outmaneuvered by Amazon Web Services,” (quote from news.com.com).

Amazon is really the biggest threat for large hardware manufacturers and could very well take away the long tail of server customers and small businesses.

As a growing startup, remember you can’t “out google google” (as suggested in the BusinessWeek article) by using someone else’s infrastructure. Doesn’t anyone realize that it’s not a coincidence that Google releases practically no information, data or software about their infrastructure, and the iterations that it’s gone through over the years?

The only argument is that you don’t need to worry about it until it’s a nice problem to have (how many googles or walmarts are there?).

Joyent happens to be in a similar space with our “Grid Accelerators”

Joyent is an infrastructure and development company that has put together a multi-site, multi-million dollar hosting setup for our own applications’ use and for the use of others. Our applications are predominantly in the Ruby on Rails framework, which we’ve been involved in since its inception via our TextDrive hosting product, and we also host a large number of sites and software written in Perl, Python, Java and even Erlang.

We’ve been selling the infrastructure pieces since the summer of 2006, and I think they have some nice key features that a lot of the competition does not.

Key features of what Joyent offers:

  • AMD and T1 SPARC Sun Fires
  • Sun Storage
  • Solaris Nevada
  • One and ten gigabit ethernet networking throughout
  • Physically separate public, private and storage networking
  • iSCSI and NAS
  • Level3’s telco grade facilities
  • High-end edge-of-network F5 load balancers
  • On-demand RAM at $50/GB/month
  • On-demand CPU at $200/CPU/month
  • On-demand Storage at $0.50/GB/month or $1/GB/month
  • On-demand Bandwidth at $0.20/GB
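
To show how the on-demand prices in the list add up, here is a small Python sketch of a monthly estimate. The unit prices are the ones listed above; the function name, the example configuration and the choice of the $0.50/GB storage tier are purely illustrative assumptions, not a real quote.

```python
# Monthly cost sketch built from the on-demand prices listed above.
# The example configuration is hypothetical.

RAM_USD_PER_GB = 50.0         # on-demand RAM, per GB per month
CPU_USD_EACH = 200.0          # on-demand CPU, per CPU per month
STORAGE_USD_PER_GB = 0.50     # lower storage tier; the other listed tier is 1.00
BANDWIDTH_USD_PER_GB = 0.20   # per GB transferred


def monthly_cost(ram_gb: float, cpus: int, storage_gb: float, transfer_gb: float,
                 storage_rate: float = STORAGE_USD_PER_GB) -> float:
    """Sum the four on-demand line items for one month."""
    return (ram_gb * RAM_USD_PER_GB
            + cpus * CPU_USD_EACH
            + storage_gb * storage_rate
            + transfer_gb * BANDWIDTH_USD_PER_GB)


# Hypothetical setup: 4 GB RAM, 2 CPUs, 100 GB storage, 500 GB transfer.
example = monthly_cost(ram_gb=4, cpus=2, storage_gb=100, transfer_gb=500)
print(f"estimated monthly cost: ${example:,.2f}")
```

For that made-up configuration the four line items are $200, $400, $50 and $100, so RAM and CPU dominate the bill long before storage or bandwidth do.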

You can see the NICs separated from each other: a public interface (4.71.165.93), a private one, and a trunked 2 Gbps connection to storage.

[z09578AA:~] admin$ ifconfig -a
e1000g0:1: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 3
inet 4.71.165.93 netmask ffffff80 broadcast 4.71.165.127
e1000g1:1: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 4
inet 10.71.165.93 netmask ffffff00 broadcast 10.71.165.255
aggr1:1: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 2
inet 172.16.165.93 netmask ffff0000 broadcast 172.16.255.255
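
Solaris prints netmasks in hex, which makes the three networks harder to read than they need to be. The following Python sketch decodes the addresses and masks from the ifconfig output above using the standard library ipaddress module. The helper name and the dictionary are my own, and labelling each interface "public" or "private" from its address range is just an illustration of the separation described in the text.

```python
import ipaddress


def parse_interface(addr: str, hex_netmask: str) -> ipaddress.IPv4Interface:
    """Build an interface object from an address and a hex netmask such as 'ffffff80'."""
    netmask = ipaddress.IPv4Address(int(hex_netmask, 16))
    return ipaddress.IPv4Interface(f"{addr}/{netmask}")


# Address and netmask pairs copied from the ifconfig output above.
interfaces = {
    "e1000g0:1": parse_interface("4.71.165.93", "ffffff80"),
    "e1000g1:1": parse_interface("10.71.165.93", "ffffff00"),
    "aggr1:1": parse_interface("172.16.165.93", "ffff0000"),
}

for name, iface in interfaces.items():
    net = iface.network
    # is_private is True for RFC 1918 space such as 10/8 and 172.16/12.
    scope = "private" if iface.ip.is_private else "public"
    print(f"{name}: {iface.ip}/{net.prefixlen} "
          f"broadcast {net.broadcast_address} ({scope})")
```

The computed broadcast addresses match what ifconfig reports: the public interface sits in a /25, while the private interface and the storage aggregate live in RFC 1918 space on a /24 and a /16.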

We have recently wrapped up a few PDF documents for the new website and I thought this article would be a great time to go ahead and release them early.

There is a solid Datasheet, a more detailed Whitepaper and a couple of use cases: a Hosting cluster, a MySQL cluster setup and an example of deploying out a Ruby on Rails application.

Let me know if you have any questions, comments or concerns about our own things, and you can buy these pieces right now on the main textdrive site.

9.5 hours until Java is open-sourced

Sun is releasing the source to Java SE, Java ME, and GlassFish in nine and a half hours, and then releasing the rest of the stuff around the spring of 2007.

Most interestingly they’re releasing it under the GPLv2 license. Floyd Marinescu has a great explanation as to ‘Why GPL?’.

Once it hit midnight eastern here I noticed a bunch of news articles coming out on the wire about it. CNET’s news.com, JavaWorld, Mercury News and the AP’s Tech Wire are covering the news, and I’m sure more are too.

If you’re interested there is a webcast with Jonathan Schwartz and Rich Green at 9:30am PST.