OK nginx is cool

For the simple reason that it’s the only static web server I’ve seen that supports Solaris’s Event Ports.

events {
    worker_connections  1024;
    use eventport;
}
I’m cutting the ton of static servers we have over to it.

If you’re interested in an x86/64 build for Solaris


Just drop it in place

$ ln -s /opt/pcre-7.0 /opt/pcre
$ ls -l /opt/
pcre -> /opt/pcre-7.0/

There’s a dot.xml file in /opt/nginx that you can use for SMF.
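If you haven’t used SMF before, importing that manifest and enabling the service looks something like the following (the manifest path is from above; the service name “nginx” is an assumption — use whatever the manifest declares):

# svccfg import /opt/nginx/dot.xml
# svcadm enable nginx
# svcs -l nginx

svcs -l shows the service state and restarter, and SMF will keep nginx running across crashes and reboots.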

(Note: there are currently some issues in the eventport implementation, so leave it commented out for now and nginx will use /dev/poll.)
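So a safe events block for the moment looks like this, with eventport commented out until those issues are sorted (with no “use” directive, nginx picks /dev/poll on Solaris on its own):

events {
    worker_connections  1024;
    # use eventport;
}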

Solaris, DTrace and Rails

We committed ourselves to Solaris as our base operating system two years ago, as Solaris was becoming OpenSolaris. We needed a solid operating system that was 32/64-bit, could manage lots of CPUs and RAM, and that we could contribute to, and we realized that three features would be a competitive advantage if we became experts in using them in production: ZFS, Zones and DTrace (the pre-existing observability tools in Solaris are quite excellent).

With time, it’s been surprising to me how that’s not clear to a lot more people, and we at Joyent have found ourselves spending a considerable amount of time simply being Solaris advocates.

So this last Sunday we (Ben and I) had a DTrace jam session at the Obvious offices with Bryan Cantrill and Jeremy over at Twitter, and were just running through an expanded way of looking at what Ruby processes are doing in production (the luxury of the processes themselves being the issue is what great load balancers give you).

(Jeremy, Bryan, Ian; the MacBook Pro in front of Bryan happens to be Ben’s, Bryan only uses Solaris on an Acer Ferrari 😉 )

Bryan happens to be a great Solaris advocate, and his biggest hammer is DTrace. So we try and get him in front of as many customers as we can, so that he can make an experiential case for why everyone with applications in production should be using Solaris. James Governor seems to agree.

We use DTrace all the time in identifying performance issues in our customers’ and in our own applications (remember, we have some of the largest and oldest Rails apps around), but it helps when its creator makes an appearance and a case for it as well. It was a bonus that Ian Murdock was there too. Ben covered his impressions (or as he said, his “verdict”) of Ian after being able to spend a Sunday afternoon with him talking about Solaris.

From just a little bit of time, and out of the box it was clear that a lot of CPU was spent in an odd place: raised exceptions would generate backtraces that were going through hundreds of frames.
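For the curious, the kind of thing that surfaces this is a simple profile probe aggregating on user stacks; a one-liner along these lines (the 997 Hz sampling rate and the execname are just illustrative) makes hot backtrace-generation code jump right out:

# dtrace -n 'profile-997 /execname == "ruby"/ { @[ustack()] = count(); }'

Let it run for a bit under load, hit Ctrl-C, and the most frequently sampled user stacks print last.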

The result was a ticket filed at dev.rubyonrails.org 16 hours ago, and David committed the changes to Rails itself 5 hours ago. Everyone benefits from something pointed to by DTrace on Solaris.

And I think this is much in line with David’s main points about communities around open source projects, and a nice example of an open source operating system and tool giving you specific insights into an open source language and web framework, and then submitting and applying fixes in the right places. Blaine, the guys at Twitter, and DHH should be proud of this example.

But this introspection into what the Ruby processes are doing is still not deep enough. DTrace tells us from the outside of an interpreted language what it’s doing in regard to tens of thousands of different “probes” throughout the operating system.

You can see the number of places DTrace is hooked in with:

$ dtrace -l | wc -l
$ dtrace -n fbt:::entry'{@[execname] = count()}'
dtrace: description 'fbt:::entry' matched 24798 probes

That’s more probes than I often know what to do with, but it’s great that they’re there.
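As a taste of what you can do with them, here’s an illustrative one-liner that aggregates the system calls made by ruby processes (swap in any execname you like):

# dtrace -n 'syscall:::entry /execname == "ruby"/ { @[probefunc] = count(); }'

Run it while your app serves traffic, hit Ctrl-C, and you get a count of every syscall your Ruby processes made, sorted with the busiest last — a fast way to spot, say, a process hammering stat() or read().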

Nearly a year ago, we worked with Bryan and the DTrace team to get as complete a set as possible of is-enabled probes into Ruby and to maintain that going forward (there have only been patches against 1.8.2 up to this point). We’ve done that internally, and now, with the increased use of Solaris by our own customers, and with DTrace showing up in what is likely the most common development platform for Rails, Mac OS X, we renewed our collaboration today.

So we’re polishing up the patches for Ruby 1.8.5 (the version we and our Accelerator customers still run in production; we’ll do 1.8.6 as well once that’s done in QA), and we’ll be sending those off to Bryan and his crew by Thursday for some vetting, and we’re aiming to release them next week.

The result is going to be a tremendous amount of insight into production Rails (and other Ruby) processes, and we hope that lots of improvements (even application-specific) can come from that.

On Accelerators

Fundamental Philosophy and Origin

Accelerators arose from two needs: a standardized stack capable of serving our own growing applications, and the appearance last year of large companies and startups needing “enterprise Rails”.

Our applications, like Strongspace and the Joyent Connector, are over two years old now; they were some of the earliest revenue-generating Rails applications and have grown to cover a significant build-out. There was a bit of learning how to do stuff that’s applicable to anyone else. Like any infrastructure, there are a few boundaries (>10 servers, >100 servers, >1000 servers, >100 TB of storage) where, when you’re crossing them, it’s critical to rigorously standardize. We found ourselves moving to one operating system (Solaris; we used BSD and Linux before), one type of server, one type of switch, one type of router, one kind of ethernet cable, one kind of storage, one kind of hard drive, one kind of DRAM and one type of hardware-based load-balancer. This introduces needed predictability, makes automated management possible, makes virtualization easier, allows for systems-level software development and makes all the parts interchangeable.

We were a little ahead of the curve in our community, and with a significant number of contracts coming our way, we productized the Accelerators last summer with two goals in mind: minimize the capital expenditures and worry around operations and facilities, and allow development teams and businesses to focus on and grow their applications.

We haven’t always done that perfectly, and we’ve made mistakes here and there, but every mistake has been noted, learned from and won’t be repeated. What we’ve found is an increasingly common position that Allan Leinwand covered well in a recent article at GigaOM, and that is the perception that infrastructure is already a full utility (it isn’t, but we’re trying to make it so) and that one “can deploy a wildly successful Web 2.0 application that serves millions of users and never know how a router, switch or load-balancer works.”

There are then a couple of key things needed beyond simply servers and network drops: load-balancers capable of handling not just significant traffic but also wide horizontal spread, and storage that is both resilient to all failures (doesn’t lose data, ever) and scalable to hundreds of terabytes for a single customer (again, the first customer with storage of that size was Joyent).

Load Balancers

We’ve been long-time users and fans of F5’s BIG-IP as a carry-over from when some of us were in “big enterprises”.

But there are real technical reasons for using them. To give you an example, a pair of the big BIG-IPs load-balances all of adobe.com, which is the place where everyone and their mother downloads Acrobat (a popular download during tax season), and they constantly run in their spec’ed range of 2-10 Gbps. That’s not a trivial amount of traffic, and there’s not a trivial number of backend servers either.

What makes them quite relevant to us is that it’s possible to put 300-400 mongrels behind a single floating IP address and watch as 20-30 connections get evenly distributed across them.

It’s possible to have the BIG-IPs force caching headers, force pipelining by making it appear that you’re using distributed assets on different hostnames, direct traffic at layer 7 so you can do things like separate out application servers for different controllers and different routes, differentially handle parts of your site (separate page views from API calls), and even load-balance rings of MySQL multi-master servers behind a single IP address.

Basically, a lot of logic can be coded into them, so they can accommodate applications that weren’t quite built to scale.

In my opinion, these capabilities are one of the most discriminating features and, all things considered, a key part of a scalable stack.

Storage


The other requirement is that there should be no worries, no loss of data and no loss in performance when any component of the storage fails or is unavailable for a time. A development team shouldn’t have to worry about catastrophic hardware failures. This is achieved with a fully redundant storage infrastructure: RAID-6 across 9-14 drives, and network volumes coming up from dual trays and dual controllers via physically separate switches, cables and network interface cards. The only concern one should have is backups to protect yourself against accidental file deletion (like a migration gone wrong), and for this there are point-in-time snapshots (from ZFS).
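As a sketch (the pool and filesystem names here are made up), protecting yourself before a risky migration is a one-liner, and so is getting back:

# zfs snapshot tank/apps/myapp@pre-migration
# zfs rollback tank/apps/myapp@pre-migration
# ls /tank/apps/myapp/.zfs/snapshot/pre-migration/

ZFS snapshots are copy-on-write and essentially free to take, so there’s no reason not to take one before every deploy; the hidden .zfs/snapshot directory also lets you pull individual files back out without rolling back the whole filesystem.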

Pay For Idle or Fair Share?

An Accelerator is not “pay for idle” (but the tools are from a pay-for-idle world), and the Solaris userland has to be BSDized or Linuxized so that there are no barriers to adoption.

In a “pay for idle” system, you’re paying for CPUs to sit when they’re doing nothing. What we do is allow people to burst up and use CPUs on a node as long as no one else needs them. When someone does need CPU time and you’re the “non-bursting party”, it’s under a Fair Share Scheduling (FSS) algorithm, and each application’s usage receives guaranteed minimums.

The problem is that when navigating an on-top-of-the-OS virtualization such as Solaris zones, tools like “top” can still see kernel statistics, and they report total CPU use. In some cases this is disconcerting and a common point of discussion, but when an application needs CPU, it gets it; in fact, with FSS, an application that hasn’t been cranking away on a CPU should get an even greater priority.

When you add the work Joyent is contributing, such as automated migrations based on CPU loads in combination with load-balancing (and quality of service being managed there), and observability tools that correctly report what one is using and has available, one is on a wonderful middle ground between pay-for-idle and an intelligent CPU scheduler that’s aware of application loads. Joyent can begin to treat a farm of processors like a single processor, each operating system like a raft of processes on a unified operating system, each user process like an event in a grander process. It all balances.
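On the Solaris side, the mechanics are straightforward; as a sketch (the zone name and share count are illustrative), you make FSS the default scheduling class and hand each zone its shares:

# dispadmin -d FSS
# prctl -n zone.cpu-shares -v 20 -r -i zone myzone

Shares are relative, not absolute: a zone holding 20 shares out of 100 total is guaranteed at least a fifth of the CPU under contention, and can burst well past that whenever the machine is otherwise idle.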

BSD Userland

We’ve been working to move to a userland that’s very familiar for people coming from FreeBSD, NetBSD and Mac OS X. There’s very little taken away in Solaris, and a lot of new tools, and in a lot of ways we offer optimized binaries in combination with the normal paths to them that one would expect.

The first place the new userland is going to show up will be in the “new” shared.

“But never forget that you can only stumble if you are moving.”

-Richard P. Carlton, Former CEO, 3M Corporation

How to completely ruin a great piece of server kit (regarding the Sun X4200 M2)

Here’s how you do it. First, you take what is considered a pinnacle of x86 server design, the glorious x4200 where every single chip has been selected for maximum reliability and performance. Like, say, the quad on-board Intel Gigabit Ethernet chips. Then, you create a new revision called the x4200 “M2” and replace the first two Gigabit Ethernet ports with fscking NVidia NForce crap. That’s it. Done. You’ve just ruined it.

The Intel GigE chip and (perhaps more importantly) driver are considered the best across all operating systems. Whether it be FreeBSD, Linux, Windows, or Solaris, the Intel driver rocks. It’s understood. It performs. It’s reliable. People go out of their way to build systems with these chips. The Nvidia “nge” driver, however, is not exactly regarded as a top notch piece of software. Yes, I’m being polite.

Perhaps even more dramatic is what changing network chips does to OS driver profiles. I’m talking Jumpstart profiles. Kickstart profiles. Ghost images. Boot disks. In a bigger shop, now you’ve got to re-qualify the machine, a process that might take weeks or months.

Come on Sun, this is the kind of thing that really pisses off sysadmins who know their hardware, and shops like ours where things are really standardized from top to bottom. We’re not going to keep wasting 2 days trying to get M2s to PXE boot (also a change). We went through the entire stack, and it still isn’t working. To make matters more irritating, the change is mentioned nowhere on sun.com, the architectural whitepapers linked from the site (about X4200 M2s) are now incorrect, and don’t get me started on the phone support. So we’re sending a stack of these back, and we’re currently lighting candles in churches, praying that the x4100s don’t get screwed up as well. I mean, can’t we just keep buying the same thing again and again and again and again, and just have it work?