NetApp versus Sun, Sun versus NetApp, and Both versus Common Sense

As you might have heard, and likely read in the back-and-forth blogging of Dave Hitz (a NetApp founder) and Jonathan Schwartz (CEO of Sun Microsystems), the two are at each other’s throats. Well, not really at each other’s throats: NetApp went nuclear and Sun hit back even harder.

Basically, NetApp says that Sun’s ZFS steals its existence from NetApp’s WAFL, an earlier copy-on-write (COW) file system. Sun, in return, wants to ensure that NetApp cannot sell its filers, and to pull NetApp’s ability to legally use NFS.

For those of you who need reminding, Sun invented NFS (the Wikipedia page has a sufficient history), but NetApp did the best commercial appliance implementation. I forget whether NetApp handed Sun their asses in the marketplace with a superior product, or whether Sun just never delivered a filer-like appliance (I say “I forget” because while I can recall using NetApp, I can’t remember using a single offering from Sun beyond a workstation).

NetApp filers happen to be among the best NFS appliances out there, and they combine that with iSCSI and Fibre Channel in one rig. Pretty flexible, and if you look around you’ll see that it’s pretty much unique. For example, an EMC Clariion does iSCSI or Fibre Channel, and if you want it to do NFS, you basically buy a “gateway” (just a server attached by fibre).

We at Joyent are in a bit of an odd and unique position (I think). We happen to use what I call The Holy Z2D Trinity (ZFS, Zones and DTrace), and not only that: we run ZFS on top of block storage from NetApp filers. And the team here, from myself to Mark to Ben, has written tools for filers, has managed many petabytes’ worth of them (often petabytes at one time), and has been around them since they came out.

Hmm.

In fact, besides my common boast that we’re one of the larger, if not the largest, OpenSolaris/Solaris Nevada installations in the world, I’d venture to guess that we have more ZFS on top of NetApp filers via iSCSI than just about anyone else.

Now let’s think how we even got here in the first place.

  • NetApp developed a nice little filesystem they call WAFL.
  • NetApp contributes to FreeBSD, and FreeBSD is the core OS in some of their products (yes, please don’t try to “educate” me on a rig I’ve used for a decade; I said “some”, I know they have their own OS).
  • WAFL or something like it has presumably been ported to FreeBSD.
  • When pulling up block storage LUNs from our filers, we still need a filesystem on them. (Got that? Not everything is NFS; there still needs to be a filesystem on the server.)
  • Nothing like WAFL or a nice consistent COW filesystem was ever contributed back to FreeBSD.
  • FreeBSD instead just has UFS2, and while it has soft updates, you’ll still need to run an fsck in single-user mode for hours once you’re in the hundreds of gigabytes, and ironically the busier the system is, the worse that gets (again, yes, I know there’s a background fsck; try running that on a busy mail server, it’ll crash, guaranteed).

So larger drives (or LUNs) plus a shitty, non-resilient file system meant that all of FreeBSD had to go, despite everything great about it and the decade of time the team had invested in using it. We had to leave FreeBSD and go wherever we could to get a good, easy-to-use, resilient, modern file system.

That file system happened to be ZFS and the operating system was OpenSolaris. The additions of DTrace and Zones (we used Jails before) formed our three pillars.

But stop and imagine for a minute that NetApp had been kind enough to give WAFL or a similar filesystem back to the FreeBSD project. Imagine that we had something like WAFL to put on top of our block storage LUNs that were coming up from the NetApp filers.

Got that?

We didn’t want or need WAFL on a real operating system to develop a competitive storage product; we needed something WAFL-like on a real operating system simply to use our NetApps.

Then we would never have made our first step to OpenSolaris. We’d currently have a business based on FreeBSD 7 with WAFL, Jails and DTrace (hooray for that port!), and believe me, leaving FreeBSD was painful in many ways, with the salve being ZFS and DTrace.

While I think there is a good degree of posturing going on between the two companies, and it’s fascinating to see it play out in blogs, both parties are full of it and don’t quite get it.

NetApp Dave:

You should have given WAFL or a WAFL-lite file system to FreeBSD, and then all of us could happily use it on top of iSCSI or Fibre Channel block storage. You would have made it the best operating system to put on top of any networked block storage; being smart guys, you could have figured out how to do it while making NetApp even more money, and without spawning a bunch of FreeBSD storage-appliance clones. The fact that you didn’t do this is why Joyent uses Solaris, and you’re responsible for the new market need that’s out there for ZFS and, via ZFS, for Solaris itself.

Stop being so shortsighted and give a little; think FreeBSD 7.5 + WAFL.

We’re the poster child for OpenSolaris adoption, and the fact that we’re using it … you could have prevented it.

Sun Jonathan:

It’s great that ZFS was developed, magnificent that it was open-sourced, and I know that it was a valid and true creation of Jeff Bonwick (whom we count as a good friend). You make hardware, you created NFS, you’ve always touted the importance of the network in computing, yet I’ve always had to use NetApps to get fast and reliable NFS and iSCSI block storage (even to put my ZFS filesystems on).

Ship a decent piece of hardware + software capable of fast and reliable NFS + iSCSI. NetApp only exists because of Sun’s failures in actual product development.

“What Ifs”

If NetApp wins, there goes our beloved ZFS (yes, I understand indemnity, but I care more about continued development), and NetApp, you don’t have an alternative for us. Thanks for nothin’.

If Sun wins, there go our NetApps, and Sun, we can’t get an alternative from you. Thanks for nothin’.

And finally

I think we’ve been a good customer of both of you. When you fight, it hurts us more than anyone else.

Joyent DTrace Probes for Ruby in Apple’s Leopard

We are looking forward to installing Leopard on our Macintoshes here at Joyent this weekend. We talked in the recent episode of ps pipe grep about the effort at Sun to ensure Solaris is a good operating system for laptop computers. (Cricket sounds).

But seriously, now that OS X 10.5 (Leopard) is shipping (today), with DTrace probes for Ruby from Joyent, Macintosh laptops, now more than ever, are the platform for web application development. Especially Ruby/Rails applications.

Joyent is pleased to have contributed the DTrace probes for Ruby that are used in Leopard. The same probes are available on our Accelerators. Purrrrrr.
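If you want a quick taste once you’re running Leopard, here’s a hedged one-liner. It assumes the probes show up under a ruby provider whose function-entry probe passes the class and method names as its first two arguments (that’s how the Joyent patch lays them out); point it at the pid of a running Ruby process (the <ruby pid> below is a placeholder) and it counts which methods get called most:

dtrace -qn 'ruby$target:::function-entry { @calls[copyinstr(arg0), copyinstr(arg1)] = count(); }' -p <ruby pid>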

Using DTrace on MySQL

Even though there aren’t DTrace probes for MySQL released yet, we can still get useful information out of MySQL. DTrace has a pid provider, which allows us to get into any function the program is executing and see its arguments. The only drawback is that you have to go digging around in the source code to find out what you want to see. But thanks to guys like Brendan Gregg, some of that digging has already been done for us.

Even if we want to go digging around ourselves, it’s really not that hard; you just have to get your feet wet. And because Accelerators have DTrace enabled, you can take advantage of using DTrace on MySQL. I will show some examples of this and how easy it is to hunt down your own functions.
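If you’re curious just how much the pid provider exposes, you can list the available entry probes for a running mysqld before tracing anything (the list is huge and the symbol names vary by build, so pipe it through head or grep):

root@ferrari:~# dtrace -ln 'pid$target:mysqld::entry' -p `pgrep -x mysqld` | head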

First let’s start with functions that have already been dug up for us:

mysql_parse(thd, thd->query, length, & found_semicolon);

This is the function MySQL uses to parse a query, so all we have to do is trace it through the pid provider and we get to see all the queries coming through. Here arg1 is the query text; because that string lives in the mysqld process’s address space, we have to copy it in with copyinstr() before DTrace, which runs in the kernel, can see it:

root@ferrari:~# dtrace -qn 'pid$target:mysqld:*mysql_parse*:entry { printf("%Y %s\n", walltimestamp, copyinstr(arg1)) }' -p `pgrep -x mysqld`
2007 Sep 27 10:04:35 select * from blah
2007 Sep 27 10:04:58 select * from tablenothere

Notice that this will show all queries, even if they aren’t successful. Now that we can trace queries, this can give us good information. For example, we can see which queries are executed the most:

root@ferrari:~# dtrace -qn 'pid$target:mysqld:*mysql_parse*:entry { @queries[copyinstr(arg1)] = count() }' -p `pgrep -x mysqld`
^C

select * from blah 5

select * from tablenothere 10

You can’t get this kind of information from MySQL unless you write some kind of script to parse through the query log. If we knew that one query was being executed 1000 times more than the others, we could always try to get it to cache. Now let’s say we want to find out how long a query took to execute. The function mysql_execute_command does the actual execution of queries, so all we do here is subtract the entry and return timestamps of that function. The exactquerytimes.d script shown below uses this:
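Here’s a minimal sketch of what such a script can look like (the real exactquerytimes.d may differ in details like output formatting): grab the query text on the way into mysql_parse, then bracket mysql_execute_command with timestamps.

#!/usr/sbin/dtrace -s
#pragma D option quiet

dtrace:::BEGIN
{
        printf("Tracing... Hit Ctrl-C to end.\n");
}

/* remember the query text on the way into the parser */
pid$target:mysqld:*mysql_parse*:entry
{
        self->query = copyinstr(arg1);
}

/* time the actual execution */
pid$target:mysqld:*mysql_execute_command*:entry
{
        self->start = timestamp;
}

pid$target:mysqld:*mysql_execute_command*:return
/self->start/
{
        this->ns = timestamp - self->start;
        printf("Query: %s\n\nTime: %d.%02d\n\n", self->query,
            this->ns / 1000000000, (this->ns % 1000000000) / 10000000);
        self->start = 0;
        self->query = 0;
}

Running it against a live mysqld looks like this: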

root@ferrari:~# ./exactquerytimes.d -p `pgrep -x mysqld`
Tracing… Hit Ctrl-C to end.

Query: SELECT COUNT FROM joe_visitors where upper(vs_browser) not like 'GOOGLE' and upper(vs_browser) not like 'GOOGLE BOT' and upper(vs_browser) not like 'BOT' and upper(vs_browser) not like 'MSN' and upper(vs_browser) not like 'MSNBOT' and upper(,

Time: 2.32

On the MySQL side, it showed this query taking 2.32 seconds as well:
1 row in set (2.32 sec).

This is awesome information, because as of now MySQL doesn’t let you see a slow query that takes less than 1 second (I believe this is fixed in MySQL 5.1). So with this we can see not just slow queries, but all queries and exactly how long they take to execute.
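And because the timestamps are already there, the same entry/return pair can just as easily give you a latency distribution instead of individual times; a hedged one-liner, using the same mysql_execute_command assumption as above:

root@ferrari:~# dtrace -qn 'pid$target:mysqld:*mysql_execute_command*:entry { self->ts = timestamp; } pid$target:mysqld:*mysql_execute_command*:return /self->ts/ { @["query time (ns)"] = quantize(timestamp - self->ts); self->ts = 0; }' -p `pgrep -x mysqld`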

Now let’s try the same query, but with query_cache_size bumped up to 50M.

The first try (won’t hit the cache):

root@ferrari:~# ./exactquerytimes.d -p `pgrep -x mysqld`
Tracing… Hit Ctrl-C to end.

Query: SELECT COUNT FROM joe_visitors where upper(vs_browser) not like 'GOOGLE' and upper(vs_browser) not like 'GOOGLE BOT' and upper(vs_browser) not like 'BOT' and upper(vs_browser) not like 'MSN' and upper(vs_browser) not like 'MSNBOT' and upper(,

Time: 2.28

And the second try hits the cache but doesn’t show anything through DTrace, which means a query served from the query cache never reaches mysql_parse. Right now this probably doesn’t mean much, but as you learn more about how MySQL’s internals work, troubleshooting becomes much easier down the road.
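If you want to watch the cache doing its job, you can go after the cache-lookup path instead. Assuming this MySQL build has a Query_cache::send_result_to_client function that returns a positive value on a cache hit (which is how the 5.0 source reads), something like this would count queries answered from the cache:

root@ferrari:~# dtrace -qn 'pid$target:mysqld:*send_result_to_client*:return /arg1 > 0/ { @hits["served from query cache"] = count(); }' -p `pgrep -x mysqld`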

So far this has all been information that was already dug up for us. Now I will show how simple it is to search through MySQL’s source and find functions yourself.

First we need to decide what to look for. Let’s say we want to see every time a slow query is written to the slow query log. We download the MySQL source code from http://www.mysql.com, and then we can search through it for 'slow query':

root@ferrari:/export/home/derek/mysql-5.0.45# ggrep -ri 'slow query' *

This turns up only a few source files, the most obvious one being log.cc, which contains the comment “Write to the slow query log.” The MySQL code is very well commented, so searching it is really easy. Looking in this file, that comment sits right above the function:

/*
  Write to the slow query log.
*/
bool MYSQL_LOG::write(THD *thd, const char *query, uint query_length,
                      time_t query_start_arg)

Clearly this is the function that writes to the slow query log. Running the one-liner below, which traces any mysqld function with LOG in its name while a slow query executes, shows the function firing, with some weird characters around its name (that’s just C++ name mangling):

root@ferrari:~# dtrace -F -n 'pid$1:mysqld:*LOG*:entry {} pid$1:mysqld:*LOG*:return {}' `pgrep -x mysqld`
dtrace: description 'pid$1:mysqld:*LOG*:entry ' matched 126 probes
CPU FUNCTION
  0  -> _ZN9MYSQL_LOG5writeEP3THD19enum_server_commandPKcz
  0  -> _ZN9MYSQL_LOG5writeEP3THDPKcjl
  0  <- _ZN9MYSQL_LOG5writeEP3THDPKcjl

The only thing bad about tracing MySQL through the pid provider is that these mangled C++ symbol names change between MySQL versions, so we can’t always trace for the function '_ZN9MYSQL_LOG5writeEP3THDPKcjl' if we want the script to work on other machines. Instead we have to trace for *MYSQ*LOG*write*, which is what slowquerycounts.d uses:
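Here’s a minimal sketch of what slowquerycounts.d can look like; the real script in the kit may differ, and note that the wildcard also matches the general-log overload of MYSQL_LOG::write(), whose second argument is not a query string, so this sketch assumes only the slow query log is being written to.

#!/usr/sbin/dtrace -s
#pragma D option quiet

dtrace:::BEGIN
{
        printf("Tracing... Hit Ctrl-C to end.\n");
}

/* MYSQL_LOG::write(THD *, const char *query, uint, time_t):
   arg1 is the query text being written to the slow query log */
pid$target:mysqld:*MYSQ*LOG*write*:entry
/arg1 != 0/
{
        @counts[copyinstr(arg1)] = count();
}

dtrace:::END
{
        printa("%s\nCount: %@d\n\n", @counts);
}

Running it while the slow query above fires a few times: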

root@ferrari:~# ./slowquerycounts.d -p `pgrep -x mysqld`
Tracing… Hit Ctrl-C to end.

SELECT COUNT FROM joe_visitors where upper(vs_browser) not like 'GOOGLE' and upper(vs_browser) not like 'GOOGLE BOT' and upper(vs_browser) not like 'BOT' and upper(vs_browser) not like 'MSN' and upper(vs_browser) not like 'MSNBOT' and upper(
Count: 3

As you can see, DTrace can be very powerful even though we don’t have probes released yet; we just have to do a little extra work. Some of the information shown can be obtained from MySQL itself, but using DTrace still provides a benefit because we don’t have to enable anything in the MySQL configuration, which could mean restarting the server. I’m providing these scripts in MySQLDTraceKit.tar.gz. Hopefully in the near future we will have real MySQL probes.