Fermions, Bosons and the 6 Utilities

When I used to teach university chemistry, I’d always start with the statement:

The universe (at one level) is made of two things and two things only: fermions and bosons.

Fermions are the things that have “stuff”: they have mass and can be charged (or not). Bosons are the things that have no “stuff”: they have neither mass nor charge. Bosons in many ways are the things that move fermions. This comes from quantum mechanics, where we see that fermions have half-integer spins (+1/2 or −1/2, say) and bosons have integer spins (the photon’s is 1). This is the baseline, and the binary division is given to us by the Standard Model.

We also already had an understanding of this division: Fermions are matter, and Bosons are energy. Matter is the stuff of the universe, and energy moves matter.

A simple, appealing, mutually exclusive, yin-and-yang description of things. I don’t mind things that end up being in powers of 2 or 10, or form a nice little tree.

I like to think we have a similar division in compute utilities: things that take up space (Fermions/Matter) and things that move or are the movement of stuff (Bosons/Energy).

Conceptually, I group them as follows:

The fermions

1) CPU space
2) Memory space
3) Disc space

The bosons

4) Memory bus IO
5) Disc IO
6) Network IO

This, in my mind, forms the 6 Utilities that we must have fine-grained, differential controls and metrics on in a “cloud computer” that fairly serves many people. We have to understand the possible minimum and maximum values, and we have to figure out how to balance them all with real workloads. These are the prerequisites that we watch, measure and learn from so we can ask and answer questions such as “How do I pair together one customer that’s CPU-intensive and another that’s disc-IO-intensive and have the sum appear just like a single, well-performing, CPU- and disc-IO-intensive application?”.
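Some of the pieces exist already. On OpenSolaris, for instance, the “fermions” can be capped per zone today; here’s a minimal sketch, assuming a hypothetical zone named customerA (the exact resources available depend on the build):

# Cap the zone at 2 CPUs' worth of cycles and 4GB of physical memory
zonecfg -z customerA
zonecfg:customerA> add capped-cpu
zonecfg:customerA:capped-cpu> set ncpus=2
zonecfg:customerA:capped-cpu> end
zonecfg:customerA> add capped-memory
zonecfg:customerA:capped-memory> set physical=4g
zonecfg:customerA:capped-memory> end
zonecfg:customerA> exit

# Then watch per-zone CPU and memory consumption
prstat -Z

Note that the “bosons” are exactly where the tooling gets thin: per-zone disc and network IO controls and metrics are the hard, mostly missing part.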

The reality is that most operating systems still don’t have a complete set of tools around the 6 Utilities in terms of resource management, QoS (quality of service), virtualization, and teasing these apart in a way that serves a number of people sharing physical resources. Operating systems are still basically built for a single person using a single “computer” at a time, and there are real challenges around saying that we should just use BIG servers and divvy them up. There are even challenges around many cores and lots of RAM.

I wonder if we can in fact have a single, general-purpose operating system that serves both the single user and the “cloud”, and based on the work we’ve been doing with OpenSolaris, I’d say “No”.

Billions Served: Joyent Accelerators a Real Platform for Growth

Mark Mayo of Joyent gave a presentation yesterday evening at a Facebook developer garage in Vancouver, Canada. One statistic from his presentation really stands out. Joyent provides on-demand infrastructure for one application serving nearly one billion page views per month. One billion. Moreover, the infrastructure cost for that application is just over $10K per month. And, if the fickle desires of Facebook users turn away from this customer, they aren’t tied down to a contract. We help people scale up, and scale down.

As Rod said in an earlier post, Joyent is powering 11% of Facebook application usage and growing rapidly. Joyent can do this because the Joyeurs have built a real, open-protocol, open-standards, professional cloud computer. We have hardware load balancers and high-end routers capable of driving billions of page views across our entire network every month.

I am really proud of what the Joyeurs have accomplished. Congratulations.

P.S. The application is a Rails app. I think these facts put to bed any issues regarding Rails and scaling.

(Photo and blog quote from: Miss 604)

NetApp versus Sun, Sun versus NetApp, and Both versus Common Sense

As you might have heard, and likely read in the back-and-forth blogging of Dave Hitz (a NetApp founder) and Jonathan Schwartz (CEO of Sun Microsystems), the two are at each other’s throats. Well, not really at each other’s throats: NetApp went nuclear and Sun hit back even harder.

Basically, NetApp says that Sun’s ZFS steals its existence from NetApp’s WAFL, an earlier copy-on-write (COW) file system. Sun in return wants to ensure that NetApp cannot sell its filers, and to pull NetApp’s ability to legally use NFS.

For those of you who need reminding, Sun invented NFS (the Wikipedia page has a sufficient history), but NetApp did the best commercial appliance implementation. I forget whether NetApp handed Sun their asses in the marketplace with a superior product, or whether Sun just never delivered a filer-like appliance (I say “I forget” because while I can recall using NetApp, I can’t remember using a single offering from Sun beyond a workstation).

NetApp filers do happen to be among the best NFS appliances out there, and they combine that with iSCSI and Fibre Channel in one rig. Pretty flexible, and if you look around you’ll see that it’s pretty unique. For example, an EMC Clariion does iSCSI or Fibre Channel, and if you want it to do NFS, then you basically buy a “gateway” (just a server attached by fibre).

We at Joyent are in a bit of an odd and unique position (I think). We happen to use what I call The Holy Z2D Trinity (ZFS, Zones and DTrace), and not just that: we use ZFS on top of block storage from NetApp filers. And the team here, from myself to Mark to Ben, has written tools for filers, has managed many petabytes’ worth of them (often petabytes at a time), and has been around them since they came out.


In fact, besides my common boast that we’re one of the larger (or the largest) OpenSolaris/Solaris Nevada installations in the world, I’d venture to guess that we have more ZFS on top of NetApp filers via iSCSI than just about anyone else.

Now let’s think how we even got here in the first place.

  • NetApp developed a nice little filesystem they call WAFL.
  • NetApp contributes to FreeBSD, and FreeBSD is the core OS in some of their products (yes, please don’t try and “educate” me about a rig I’ve used for a decade; I said “some”, I know they have their own OS).
  • WAFL or something like it has presumably been ported to FreeBSD.
  • When pulling up block storage LUNs from our filers, we still need a filesystem on them. (Got that? Not everything is NFS; there still needs to be a filesystem on the server.)
  • Nothing like WAFL or a nice consistent COW filesystem was ever contributed back to FreeBSD.
  • FreeBSD instead just has UFS2, and while it has soft updates, you’ll still need to run an fsck in single-user mode for hours once you’re into the 100s of GBs, and ironically, the busier the system is, the worse that gets (again, yes, I know there’s a background fsck; try running that on a busy mail server, it’ll crash, guaranteed).

So larger drives (or LUNs) plus a shitty, non-resilient file system meant that all of FreeBSD had to go, despite everything great about it and a decade of time the team had invested in using it. We had to leave FreeBSD and go wherever we could to get a good, easy-to-use, resilient, modern file system.

That file system happened to be ZFS and the operating system was OpenSolaris. The additions of DTrace and Zones (we used Jails before) formed our three pillars.

But stop and imagine for a minute that NetApp had been kind enough to give WAFL or a similar filesystem back to the FreeBSD project. Imagine that we had something like WAFL to put on top of our block storage LUNs that were coming up from the NetApp filers.

Got that?

We didn’t want or need WAFL on a real operating system to develop a competitive storage product; we needed something WAFL-like on a real operating system in order to simply use our NetApps.

Then we wouldn’t ever have made our first step to OpenSolaris. We’d currently have a business based on FreeBSD 7 with WAFL, Jails and DTrace (hooray for that port!), and believe me, leaving FreeBSD was painful in many ways, with the salve being ZFS and DTrace.

While I think there is a good degree of posturing going on between the two companies, and it’s fascinating to see it play out in blogs, both parties are full of it and don’t quite get it.

NetApp Dave:

You should have given WAFL or a WAFL-lite file system to FreeBSD; then all of us could happily use it on top of iSCSI or Fibre Channel block storage. You would have made it the best operating system to put on top of any networked block storage, and, being smart guys, you could have figured out how to do it while making NetApp even more money and without spawning a bunch of FreeBSD storage appliance clones. The fact that you didn’t do this is why Joyent uses Solaris, and you’re responsible for the new market need that’s out there for ZFS and, via ZFS, Solaris itself.

Stop being so shortsighted and give a little, think FreeBSD 7.5 + WAFL.

We’re the poster child for OpenSolaris adoption, and the fact that we’re using it … you could have prevented it.

Sun Jonathan:

It’s great that ZFS was developed, magnificent that it was open-sourced, and I know that it was a valid and true creation of Jeff Bonwick (whom we count as a good friend). You make hardware, you created NFS, you’ve always touted the importance of the network in computing, yet I’ve always had to use NetApps to get fast and reliable NFS and iSCSI block storage (even to put my ZFS filesystems on).

Ship a decent piece of hardware + software capable of fast and reliable NFS + iSCSI. NetApp only exists because of Sun’s failures in actual product development.

“What Ifs”

If NetApp wins, there goes our beloved ZFS (yes, I understand indemnity, but I care more about continued development), and NetApp, you don’t have an alternative for us. Thanks for nothin’.

If Sun wins, there go our NetApps, and Sun, we can’t get an alternative from you. Thanks for nothin’.

And finally

I think we’ve been a good customer of both of you; when you fight, it hurts us more than anyone else.

Joyent DTrace Probes for Ruby in Apple’s Leopard

We are looking forward to installing Leopard on our Macintoshes here at Joyent this weekend. We talked in the recent episode of ps pipe grep about the effort at Sun to ensure Solaris is a good operating system for laptop computers. (Cricket sounds).

But seriously: OS X 10.5 (Leopard) ships today with DTrace probes for Ruby from Joyent, and Macintosh laptops, now more than ever, are the platform for web application development. Especially Ruby/Rails applications.

Joyent is pleased to have contributed the DTrace probes for Ruby that are used in Leopard. The same probes are available on our Accelerators. Purrrrrr.

Using DTrace on MySQL

Even though DTrace probes for MySQL haven’t been released yet, we can still get useful information from MySQL. DTrace has a pid provider, which allows us to get into any function the program is executing and see its arguments. The only drawback is that you have to go digging around in the source code to find out what you want to see. But thanks to guys like Brendan Gregg, some of the digging has already been done for us.
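To get a feel for what there is to dig into, you can ask DTrace to list every function-entry probe in a running mysqld (this assumes the process is named mysqld; expect thousands of lines):

root@ferrari:~# dtrace -ln 'pid$target:mysqld::entry' -p `pgrep -x mysqld`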

Even if we want to go digging around ourselves, it’s really not that hard; you just have to get your feet wet. And because Accelerators have DTrace probes enabled, you can take advantage of DTrace on MySQL. I will show some examples of this and how easy it is to hunt down your own functions.

First let’s start with functions that have already been dug up for us:

mysql_parse(thd, thd->query, length, &found_semicolon);

This is the function MySQL uses to parse a query, so all we have to do is trace it through the pid provider and we get to see all the queries coming through. Here arg1 is the query, and because DTrace does its work in the kernel, we must copy the string in from user land with copyinstr() before we can see it:

root@ferrari:~# dtrace -qn 'pid$target:mysqld:*mysql_parse*:entry { printf("%Y %s\n", walltimestamp, copyinstr(arg1)) }' -p `pgrep -x mysqld`
2007 Sep 27 10:04:35 select * from blah
2007 Sep 27 10:04:58 select * from tablenothere

Notice that this will show all queries, even the ones that aren’t successful. Now that we can trace queries, we can pull out good information. For example, we can see which queries are executed the most:

root@ferrari:~# dtrace -qn 'pid$target:mysqld:*mysql_parse*:entry { @queries[copyinstr(arg1)] = count() }' -p `pgrep -x mysqld`

select * from blah 5

select * from tablenothere 10

You can’t get this kind of information from MySQL unless you write some kind of script to parse through the query log. If we know that one query is being executed 1,000 times more often than the others, we could always try to get that one cached. Now let’s say we want to find out how long a query took to execute. The function mysql_execute_command does the actual execution of the queries, so all we do here is subtract the entry and return timestamps of that function. The script shown below uses this:
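For reference, here is a minimal sketch of how such a script can be built. This illustrates the entry/return-timestamp technique and is not necessarily the exact contents of exactquerytimes.d:

#!/usr/sbin/dtrace -qs
/* Remember the query text at parse time, then time execution by
   subtracting the entry and return timestamps of mysql_execute_command. */

BEGIN { printf("Tracing... Hit Ctrl-C to end.\n"); }

pid$target:mysqld:*mysql_parse*:entry
{
  self->query = copyinstr(arg1);
}

pid$target:mysqld:*mysql_execute_command*:entry
{
  self->start = timestamp;
}

pid$target:mysqld:*mysql_execute_command*:return
/self->start/
{
  /* timestamp is in nanoseconds; print seconds with two decimals */
  this->ns = timestamp - self->start;
  printf("Query: %s\n", self->query);
  printf("Time: %d.%02d\n", this->ns / 1000000000,
      (this->ns % 1000000000) / 10000000);
  self->start = 0;
  self->query = 0;
}

Running the real script against a live mysqld: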

root@ferrari:~# ./exactquerytimes.d -p `pgrep -x mysqld`
Tracing... Hit Ctrl-C to end.

Query: SELECT COUNT FROM joe_visitors where upper(vs_browser) not like 'GOOGLE' and upper(vs_browser) not like 'GOOGLE BOT' and upper(vs_browser) not like 'BOT' and upper(vs_browser) not like 'MSN' and upper(vs_browser) not like 'MSNBOT' and upper(,

Time: 2.32

On the MySQL side, it showed this query taking 2.32 seconds as well:
1 row in set (2.32 sec)

This is awesome information, because as of now MySQL doesn’t let you see a slow query that takes less than 1 second (I believe this is fixed in MySQL 5.1). So with this we can see not just slow queries but all queries, along with how long each takes to execute.

Now let’s try the same query, but with query_cache_size bumped up to 50M.

The first try (won’t hit the cache):

root@ferrari:~# ./exactquerytimes.d -p `pgrep -x mysqld`
Tracing... Hit Ctrl-C to end.

Query: SELECT COUNT FROM joe_visitors where upper(vs_browser) not like 'GOOGLE' and upper(vs_browser) not like 'GOOGLE BOT' and upper(vs_browser) not like 'BOT' and upper(vs_browser) not like 'MSN' and upper(vs_browser) not like 'MSNBOT' and upper(,

Time: 2.28

And the second try hits the cache but doesn’t show anything through DTrace, which means a query served from the cache won’t show up in mysql_parse. Right now this might not mean much, but as you learn more about how the internals of MySQL work, troubleshooting gets much easier down the road.

So far this has all been information that was provided for us. Now I will show how simple it is to search through MySQL’s source and look at functions yourself.

First we need to decide what to look for. Let’s say we want to find out every time a slow query is written to the slow query log. We download the MySQL source code from http://www.mysql.com, and then we can search through it for ‘slow query’:

root@ferrari:/export/home/derek/mysql-5.0.45# ggrep -ri 'slow query' *

This turns up only a few source files, the most obvious being log.cc, which carries the comment “Write to the slow query log.” The MySQL code is very well commented, so searching is really easy. Looking in this file, that comment sits right above the function:

Write to the slow query log.
bool MYSQL_LOG::write(THD *thd, const char *query, uint query_length,
time_t query_start_arg)

It’s obvious that this is the function that writes to the slow query log. Running the script below, which looks for functions matching LOG while a slow query is being inserted, shows this function executing with some weird characters around it (that’s the C++ compiler’s name mangling):

root@ferrari:~# dtrace -F -n 'pid$1:mysqld:*LOG*:entry {} pid$1:mysqld:*LOG*:return {}' `pgrep -x mysqld`
dtrace: description 'pid$1:mysqld:*LOG*:entry ' matched 126 probes
0 -> _ZN9MYSQL_LOG5writeEP3THD19enum_server_commandPKcz
0 -> _ZN9MYSQL_LOG5writeEP3THDPKcjl
0 <- _ZN9MYSQL_LOG5writeEP3THDPKcjl

The only bad thing about tracing MySQL through the pid provider is that these mangled names change between MySQL versions, so we can’t always trace the function ‘_ZN9MYSQL_LOG5writeEP3THDPKcjl’ if we want a script to work on other machines. We have to trace for *MYSQ*LOG*write* instead, which slowquerycounts.d uses:
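Again as a sketch only, and not necessarily the exact contents of slowquerycounts.d, the core of such a script might look like this. Note that for a C++ member function under the pid provider, arg0 is the this pointer and arg1 is the THD, so the query string is arg2:

#!/usr/sbin/dtrace -qs
/* Count writes to the slow query log, keyed by query text.
   arg0 = this, arg1 = THD *, arg2 = const char *query for this overload;
   a production script would filter out overloads where arg2 isn't a string. */

BEGIN { printf("Tracing... Hit Ctrl-C to end.\n"); }

pid$target:mysqld:*MYSQ*LOG*write*:entry
{
  @counts[copyinstr(arg2)] = count();
}

END
{
  printa("%s\nCount: %@d\n", @counts);
}

Running it while the slow query from earlier fires a few times: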

root@ferrari:~# ./slowquerycounts.d -p `pgrep -x mysqld`
Tracing... Hit Ctrl-C to end.

SELECT COUNT FROM joe_visitors where upper(vs_browser) not like 'GOOGLE' and upper(vs_browser) not like 'GOOGLE BOT' and upper(vs_browser) not like 'BOT' and upper(vs_browser) not like 'MSN' and upper(vs_browser) not like 'MSNBOT' and upper(
Count: 3

As you can see, DTrace can be very powerful even without released probes; we just have to do a little extra work. Some of the information shown can be obtained from MySQL itself, but using DTrace still provides a benefit because we don’t have to enable anything in the MySQL configuration, which might mean restarting the server. I’m providing these scripts in MySQLDTraceKit.tar.gz. Hopefully in the near future we will have real MySQL probes.

Virtualization is more than just consolidation

I was asked to co-present with an engineer from Sun at an upcoming conference in October. I asked him to do his slides and then shoot me over the presentation so I could fill in my half. I noticed that his view of virtualization and mine were very different. To put it into jargon speak, there is a difference between Redshift virtualization and Blueshift virtualization.

So what is this redshift/blueshift stuff about? Redshift is a theory created by Sun Microsystems CTO Greg Papadopoulos. Essentially, it says that there are two different classes of business: “blueshift” companies that grow with GDP and are essentially over-served by Moore’s Law (the doubling of computing power roughly every two years), and “redshift” companies that grow off the charts and are grossly under-served by Moore’s Law.

At Joyent, we have both kinds of customers, but we were really born to serve the “redshift” market. When your site is in development, you want to save on cost. But when you open up and start to explode like a Twitter or Facebook, you join the redshift elite, and suddenly cost falls away as a concern, replaced with a vicious need to scale to meet demand. When users pour into your site, the ability to scale is the difference between life and death, between getting rich and updating your resume.

Virtualization is rarely properly distinguished between these redshift and blueshift environments, but the underlying concepts are old ones: horizontal scaling (more systems) and vertical scaling (bigger systems). And here’s the rub: most vendors aim virtualization at the latter category (blueshift), vertically scaling customers under the banner of “consolidation”; buy bigger systems that do more, to fight the ills of under-utilization.

Of course, this matter of “under-utilization” is itself a tricky subject. Vendors show utilization graphs that demonstrate how most of your systems are sitting idle in racks, and that by consolidating into a larger virtualized system you can increase utilization dramatically. But is that what customers really want? Any sysadmin will tell you that if there is a load average above 1.0, users freak out. I’ve dubbed this phenomenon “pay for idle”; even in a consolidated environment, customers don’t want the system to do a lot of work.

And why? For excess capacity to meet the demands of the future. Customers want a 4-core Opteron system with 8GB of memory not because they need it, but because when Slashdot or Digg pays them a visit they might need it. To be frank, this is human nature and not restricted to computing… when you buy a truck you buy a truck that can haul the largest thing you can imagine ever needing to haul; otherwise sales of half-ton pickups wouldn’t be as high as they are, despite the fact that most are used to pick up groceries. People buy for what they may need at some point, not what they do need now.

Consolidation might increase utilization, but it doesn’t address human nature.

The answer is virtualization, but in two important forms: 1) horizontally consolidated environments, and 2) portability. That is, spread your application across several systems but in such a way that you can move any individual environment around to meet changing demands over time. Put into storage terms, RAID your application and get a vendor guarantee that you can put those disks into a bigger chassis when the need arises. These are the sorts of things we commonly attribute to disk solutions, but not to applications. Would a responsible company ever consolidate 100GB disks into a big 500GB disk? Of course not, we use a lot of small(er) disks in a RAID to enjoy the benefits of both increased protection and performance.

But consolidation is only part of the real promise of virtualization. Portability is key: the ability to duplicate (clone) environments quickly, to spread them over an increasingly large set of resources (physical servers), and to re-allocate resources seamlessly to meet the demands of a changing environment.

When you look at most VPS providers in the market today, they tend to address only the convenient aspects of this formula. Solutions based on Xen, such as EC2, allow you to create additional clones of your application environment (horizontal), but not to re-size individual environments based on need (vertical). This isn’t because they can’t, but because it’s easier not to. Furthermore, many solutions out there leave management of the most critical resource, your load balancer, to you, rather than transparently managing a robust solution on your behalf.
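For contrast, here’s a hedged sketch of what vertical re-sizing can look like with zones on OpenSolaris; the zone name customerA is hypothetical, and the exact controls available depend on the build:

# Raise the zone's CPU cap from 2 to 4 CPUs' worth, live, with no reboot
# (zone.cpu-cap is expressed in percent of a single CPU)
prctl -n zone.cpu-cap -v 400 -r -i zone customerA

# Raise its physical memory cap to 8GB on the fly
rcapadm -z customerA -m 8g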

True virtualization is about turning “servers” into “application environments” and treating those environments as building blocks that are easily duplicated, dynamically sized, and totally portable. That’s redshifting, and that’s what Joyent does best.

“Is language X scalable? I heard that it isn’t”

A nonsensical question that is rarely qualified.

To quote Theo from his fine book:

Languages aren’t slow; implementations of languages are.


Language selection and scalability have little to do with each other; architectural design and implementation strategy dictate how scalable a final product will be.

I couldn’t have put it better myself.

The point is simple.

Just because a given language may be slower (note the “er”; that makes this a relative term) at parsing XML doesn’t mean that

  1. it’s fundamentally slow (an absolute term), or
  2. as part of a web stack it can’t do 10,000 requests/second.

It simply tells you something about that language’s performance in a given situation, not about its “scalability”.

In case you’re wondering, I just got off a call where I was asked this question not about Ruby but about Java. Interesting.