Recovery Monkey: Musings on backups, storage, tuning and more

Tue 9 Mar '10

More tales from the field: Sizing best practices – does Compellent follow them?

Note: I edited this a bit to remove some confusing pieces of info.

Another one came in. I’ll keep calling the offenders out until the craziness stops. Fellow engineers - remember that, regardless of where we work, our mission should be to help the customer out first and foremost. Then make a sale, if possible/applicable. I implore you to get your priorities straight. If it looks like you’re losing the fight, figure out what your true value is. If you have no true value, you always have the option of bombing the price. But please, don’t sell someone an under-configured system…

This time, it’s Compellent not seeming to follow basic sizing rules in a specific campaign (I’m not implying this is how all Compellent deals go down). The executive summary: In a deal I’m involved in, they seem to be proposing far fewer disks than are necessary for a specific workload, just so they are perceived as being lower in price. This is their second strike as far as I’m concerned (the first case I witnessed was Exchange sizing, where they were proposing a single shelf for a workload that needed several times the number of drives). Third strike gets you a personal visit. You will never repeat the offense after that, but it gets tiring. Education is better.

And before someone jumps on me and tells me that I don’t know how to properly size for Compellent (which I freely admit) I’ll ask you to consider the following:

There is no magic.

This is not a big NetApp FAS+PAM vs multi-engine Symmetrix V-Max discussion, where the gigantic caches will play a huge role. No – this specific case is a fight between 2 very small systems, both with very limited cache and regular ol’ 15K SAS drives. They’re not quoting SSD that could alleviate the read IOPS issue, and we’re not quoting PAM.

Ergo, this is about to get spindle-bound…

And for all the seasoned pros out there: I know you may know all this, it’s not for you, so don’t complain that it’s too basic. This post is for people new to performance sizing (and maybe some engineers :) )

Some preliminaries:

This is a Windows-only environment. So, the customer sent perfmon data for their servers over for me to analyze and recommend a box.

They’ll be running Exchange plus some databases.

From my days of doing EMC I learned some very important sizing lessons (thanks guys) that I will try to summarize here.

For instance - there is peak performance, average, and what we called “steady-state”.

In any application, there will be some very high I/O spikes from time to time. Those spikes are normal and are usually absorbed by host and array caches. This is the “peak performance”.

The trick is to figure out how long the spikes last, and see if the caches would be able to accommodate them. If a spike lasts for 30 minutes, it’s not a spike any more, but rather a real workload you need to accommodate.

If the spikes are in the range of seconds, then cache is usually enough. Depends on the magnitude of the spike, the length of the spike and the size of the cache :)
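As a rough illustration of that trade-off, here is a minimal Python sketch. It assumes a simple first-order model that does not come from any vendor's sizing tool: during a write spike, the cache has to hold whatever arrives faster than the disks can drain it. The function name, the 8KiB I/O size, the 4GiB cache and the IOPS figures are all made-up example values.

```python
def spike_fits_in_cache(spike_iops, drain_iops, io_size_bytes,
                        duration_s, cache_bytes):
    """First-order check: can the cache absorb a write spike?

    Assumed model: I/O arriving faster than the disks can drain it
    (spike_iops - drain_iops) piles up in cache for duration_s seconds.
    """
    excess_iops = max(spike_iops - drain_iops, 0)
    backlog_bytes = excess_iops * io_size_bytes * duration_s
    return backlog_bytes <= cache_bytes


GIB = 1024 ** 3

# Hypothetical small array: 4GiB of cache, disks draining 3,000 IOPS,
# and a 30,000 IOPS spike of 8KiB writes.
ten_second_spike = spike_fits_in_cache(30_000, 3_000, 8192, 10, 4 * GIB)
thirty_minute_spike = spike_fits_in_cache(30_000, 3_000, 8192, 30 * 60, 4 * GIB)

print(ten_second_spike)     # True: roughly 2GiB of backlog
print(thirty_minute_spike)  # False: hundreds of GiB of backlog
```

Under these assumptions the 10-second spike is absorbed, while the 30-minute one is simply a workload the disks have to carry.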

Then, you have your average performance. That just takes a straight arithmetic mean across all performance data points - so, for instance, if you have very long periods of inactivity at night, they will drag the average down dramatically. Short-lived spike data points won’t affect it as much since there are so few of them. So the average typically gets skewed towards the low end.

Then there’s the concept of “steady state”.

This effectively tries to get a more meaningful average of performance during normal working periods. It’s actually easy to eyeball if you’re looking at the IOPS graphs instead of letting Excel do its averaging for you.

A picture will make things clearer:

[Image: simplified example chart of IOPS samples over time]

In this simplified example chart, the vertical axis represents the IOPS and the horizontal is the individual samples over time. You can see there are very quiet periods, a brief spike, then sustained periods of activity. Without needing a degree in Statistics, one can see that the IOPS needed are about 500 in this chart. However, if you just take the average, that’s only 260, or about half! Obviously, not a small difference. But, again obviously, some extra care is required in order to figure out the real requirements instead of just calculating averages!
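To put numbers on that difference, here is a minimal Python sketch. The sample values are invented to roughly mimic the chart described above (a quiet period, one brief spike, sustained activity), and taking the median of the non-idle samples is just one assumed way of approximating what the eye does. `steady_state_iops` and `idle_threshold` are hypothetical names.

```python
from statistics import mean, median


def steady_state_iops(samples, idle_threshold):
    """Median of the samples at or above idle_threshold.

    Assumed heuristic: dropping idle samples removes the quiet periods,
    and the median keeps a brief spike from pulling the result up.
    """
    active = [s for s in samples if s >= idle_threshold]
    if not active:
        return 0.0
    return float(median(active))


# Hypothetical IOPS samples: quiet night, one brief spike, sustained load
samples = [20] * 21 + [980] + [500] * 18

plain_average = mean(samples)
steady = steady_state_iops(samples, idle_threshold=100)

print(plain_average)  # 260
print(steady)         # 500.0
```

Sizing this made-up workload from the plain average would provide about half of what the busy periods need.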

So, to summarize: it’s usually not correct to size for maximum or average since they’re both misleading (unless you’re sizing for a minimum-latency DB application – then you often size for maximums to accommodate any and all performance requirements). This is the same for every array vendor. The array and host cache accommodate some of the maximum spikes anyway, but the true average steady-state is what you’re trying to accommodate.

So, now that you know the steady-state true average the customer is seeing, the next step in estimating performance is to look at the current disk queues and service times.

I won’t go into disk queuing theory but, simply speaking, if you have a lot of outstanding I/O requests, they end up getting queued up, and the disk tries to service them ASAP but it just can’t quite catch up. You typically want to see low numbers for the queue (as in the very low single digits).

Then, there’s the response time. If the current response times are overly long (anything over 20ms for most DB/email work), then you have a problem…

What this means is that the observed steady-state workload is often constrained by the current hardware. By examining performance reports, all you are seeing is what the current system is doing.
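A quick sanity check along those lines might look like the following Python sketch. The 20ms limit is the rule of thumb mentioned above; the queue limit of 2 is an assumed reading of “very low single digits”, and the function name and sample numbers are hypothetical.

```python
def looks_hardware_constrained(avg_queue_length, avg_response_ms,
                               max_queue=2.0, max_response_ms=20.0):
    """Flag a disk whose observed IOPS are probably capped by the hardware.

    Thresholds are rules of thumb, not hard limits: max_queue is an assumed
    value, max_response_ms is the 20ms guideline for DB/email work.
    """
    return avg_queue_length > max_queue or avg_response_ms > max_response_ms


# Hypothetical perfmon averages for two volumes
print(looks_hardware_constrained(1.2, 8.0))    # False: looks healthy
print(looks_hardware_constrained(14.0, 45.0))  # True: measured IOPS likely understate demand
```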

So, the trick is to find out what performance the customer actually NEEDS, at a reasonably low ms response time with low queuing. The perfmon data is just to ensure you don’t make the performance even WORSE than they’re currently seeing! Finding out the true requirements is really the difficult part.

Finally, once you figure out the final, desired steady-state IOPS requirements, you need to translate them into your specific system, since there’s cache helping, but always some overhead to be considered. For instance, in a system that relies on RAID10/RAID5, you need to adjust for the read/write penalties of RAID. That increases the IOPS needed by nature. Again, this is the same for all array vendors – the only time there’s no I/O penalty, is if you’re doing RAID0 (= no protection).

You see, RAID5 for instance, in order to perform writes, has to do some reads as well, to calculate and write the parity. All very normal for the algorithm. Depending on the read/write mix, this extra I/O can be significant, and absolutely needs to be considered when sizing storage! RAID10 doesn’t need to read in order to write, but has to write 2 of everything, so that needs to be considered as well.
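The adjustment can be sketched in a few lines of Python. The write penalties used here are the commonly quoted rules of thumb for small random writes (4 back-end I/Os per host write on RAID5, 2 on RAID10); arrays that coalesce writes or do full-stripe writes can do better. The 70/30 read/write mix is a made-up example, not the customer's data.

```python
# Commonly quoted back-end I/Os per host write (rule-of-thumb assumption)
WRITE_PENALTY = {"RAID0": 1, "RAID10": 2, "RAID5": 4}


def backend_iops(host_iops, write_fraction, raid_level):
    """Translate host IOPS into the IOPS the disks must deliver."""
    reads = host_iops * (1.0 - write_fraction)
    writes = host_iops * write_fraction
    return reads + writes * WRITE_PENALTY[raid_level]


# Hypothetical workload: 1,000 host IOPS, 30% writes
for level in ("RAID0", "RAID10", "RAID5"):
    print(level, round(backend_iops(1000, 0.30, level)))
```

For this example mix the disks see about 1,000, 1,300 and 1,900 IOPS respectively, which is why the read/write ratio matters so much.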

You also need to figure out read vs write percentage, I/O block size distributions, random vs sequential… not rocket science, but definitely extra work in order to do right.

The last thing that needs to be taken into account is the working set. Basically, it means this:

Imagine you have a 10TB database, but you’re really only accessing about 100GB of it repeatedly and consistently. Your working set is that 100GB, not the entire 10TB DB. Which is why the more advanced arrays have ways of prioritizing/partitioning cache allocations, since you typically don’t want a big 50TB file share with 10,000 users causing cache starvation for your 10TB DB with the 100GB working set. You need to retain as much of the cache as possible for the DB, since the 50TB file share is too large and unpredictable a working set to fit in cache.

Unless you understand the true working set, you will have no idea how much cache will be able to truly help that particular workload.

Going back to the reason I wrote this post in the first place:

In this specific, small environment, the non-RAID steady-state percentile IOPS required were close to 3,000, with a working set and I/O pattern that wouldn’t fit in the cache of the small systems. Once adjusted for RAID5, the specific I/O mix demanded 50% more IOPS from the disk. The spikes were fairly high, in excess of 10x the steady-state.

Back to basics: A 15K RPM disk can provide about 220 IOPS with reasonable (<20ms) latency, so about 14 disks are needed to accommodate the pre-RAID performance with under 20ms latency. Remember – that doesn’t include spares or RAID overheads, and will not even accommodate I/O spikes. Calculating with the RAID overhead, about 21 drives are needed, at a minimum. Add a spare or two, and you’re up to 22-23 drives, bare minimum, to satisfy steady-state performance without cache starvation in this specific workload.
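The arithmetic above, written out as a small Python sketch. The 220 IOPS per drive and the 50% RAID5 uplift are the figures quoted in this post; the function name is only for illustration.

```python
import math

IOPS_PER_15K_DRIVE = 220  # per-drive figure used above, at <20ms latency


def drives_needed(host_iops, raid_uplift=0.0, spares=0):
    """Minimum drive count for a steady-state IOPS target."""
    backend_iops = host_iops * (1.0 + raid_uplift)
    return math.ceil(backend_iops / IOPS_PER_15K_DRIVE) + spares


print(drives_needed(3000))                             # 14, pre-RAID
print(drives_needed(3000, raid_uplift=0.5))            # 21, with RAID5 overhead
print(drives_needed(3000, raid_uplift=0.5, spares=2))  # 23, with two spares
```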

And, finally, the offense in question:

Compellent said that with their combo RAID1-RAID5 they only needed a single 12-drive SAS enclosure for the entire workload. Take spares out, and, best case, you’re talking about 11 drives doing I/O. Apparently, the writes happen in RAID1, and the reads as RAID5. I’m not the expert, I’m sure someone will chime in. Maybe my math is a bit off since Compellent has the funky RAID1/RAID5 mix, but there are still I/O penalties…
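For comparison, a back-of-the-envelope Python check of what 11 active drives can deliver, using the same 220 IOPS per 15K drive rule of thumb as above. It deliberately ignores cache and any Compellent-specific RAID behaviour, neither of which is modeled here.

```python
IOPS_PER_15K_DRIVE = 220       # rule of thumb from the sizing above
ACTIVE_DRIVES = 11             # 12-drive enclosure minus one spare
REQUIRED_PRE_RAID_IOPS = 3000  # steady-state requirement before RAID overhead

available = ACTIVE_DRIVES * IOPS_PER_15K_DRIVE
shortfall = REQUIRED_PRE_RAID_IOPS - available

print(available)  # 2420
print(shortfall)  # 580 IOPS short, before any RAID penalty or spike
```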

Based on the above analysis, this somehow doesn’t compute with 11 drives, half what my calculations indicate… so, my final question is:

How do Compellent engineers size for performance?

D

Wed 3 Mar '10

EMC’s incredible marketing and the FAST fairy tale (and a bit on how to reduce tiers)

I’m in MN prepping to teach a course (my signature anti-FUD extravaganza), and thought I’d get a few things off my chest that I’ve been meaning to write about for a while. Some Stravinsky to provide the vibes and I’m good to go. It’s getting really late BTW and I’m sure this will progressively get less coherent as time goes by, but I like to write my posts in one shot…

I never cease to be amazed by what’s possible with the power of great marketing/propaganda. And EMC is a company that has some of the best marketing anywhere. Other companies should take note!

Think about it: Especially on the CX, they took an auto-tiering implementation as baked as wheat that hasn’t been planted yet, and managed to create so much noise and excitement around it that many people think EMC actually invented the concept and, heavens, some even believe that the existing implementation is actually decent. Worse still, some have actually purchased it. Kudos to EMC. With the exception of some of Microsoft’s work, nobody reputable has the stones any more to release, amidst such fanfare, a product this unpolished. Talk about selling futures…

Perception is reality.

I’m an engineer by training and by trade first and foremost, and, regardless of bias, I consider the existing FAST implementation an affront. Allow me to explain, gentle reader…

The tiering concept

Some background info is in order. Most arrays of any decent size and complexity sold nowadays are configured with different kinds of disk, purely out of cost considerations. For instance, there may be 30 really fast drives where a bunch of important low-latency DBs live, another 100 pretty fast drives where most VMs and Exchange live, then 200 SATA drives for bulk storage and backups.

Don’t kid yourself: If the customer buying the aforementioned array had enough dough, they’d be getting the wunderbox with all super-fast drives inside – all the exact same kind of drives. That’s just simpler to deal with from a management standpoint and obviously the performance is stellar. Remember this point since we’ll get back to it…

Of course, not everyone is made of money, so arrays that look like the 3-tier example above are extremely common. Just enough drives of each type are purchased in order to achieve the end result.

What typically ends up happening is that, over time, some pieces of data end up in the wrong tier, for one reason or another. Maybe a DB that was super-important once now only needs to be accessed once a year; or a DB that was on SATA now has become the most frequently-accessed piece of data in the array. Or, perhaps, the importance of a DB flip-flops during a month, so it only needs to be fast maybe for month-end-processing. So now, you need to move stuff around so that what needs to be fast is shifted to the fast drives.

Pressure points and the need for passing the hot potato

But wait, there’s more…

The entire performance problem is created in the first place due to most array architectures being older than mud. In legacy array architectures, LUNs are carved out of RAID groups, typically made of relatively few disks. So, in an EMC Clariion, it’s best practice to have a 5-disk RAID5 group. You then ideally split up that group into no more than 2 LUNs and assign one to each controller.

With disks getting bigger and bigger, creating 1-2 LUNs can become exceedingly difficult – a 5-disk R5 group made with 450GB drives in a Clariion offers a bit over 1.5TB of space, which is too much for many application needs – maybe you just need 50GB here, another 300GB there… in the end, you may have 10 LUNs in that RAID group that’s supposed to have no more than 2. The new 600GB FC drives make this even worse.
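A quick Python sketch of that capacity math. It only accounts for one drive's worth of parity and the decimal-to-binary unit conversion; whatever the array itself reserves for formatting comes off on top, which is presumably how one lands at “a bit over 1.5TB”. The function name is hypothetical.

```python
def raid5_usable_tib(disks, drive_gb):
    """Usable RAID5 capacity in TiB, before array formatting overhead.

    One drive's worth of capacity goes to parity; drive_gb is the
    marketing (decimal) size of each drive.
    """
    data_bytes = (disks - 1) * drive_gb * 10**9
    return data_bytes / 2**40


print(round(raid5_usable_tib(5, 450), 2))  # 1.64
print(round(raid5_usable_tib(5, 600), 2))  # 2.18
```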

So, in summary, what ends up happening is that you split up that RAID group into too many LUNs in order to avoid waste. And that’s where your array develops a serious pressure problem.

You see, now you may have 10 different servers hitting the exact same RAID group, creating undue pressure on the 5 poor disks struggling to cope with the crazy load. I/O service times get too high, queue lengths get crazy, users get cranky.

Again – this whole problem exists exactly because legacy array architectures don’t automatically balance I/O among all drives.

But for those afflicted Paleolithic systems, wouldn’t it be nice if we could move some of those hot LUNs, non-disruptively, to other RAID groups that don’t suffer from high pressure?

That’s what EMC’s FAST for the Symmetrix and CX does. It attempts to move entire LUNs to faster tiers like SSD. Which, BTW, is something you can do manually, but FAST attempts to automate the task (kinda, depends, etc).

The current FAST pitfalls

Let’s first examine how FAST (Fully Automated Storage Tiering) is implemented, since it’s really 3 utterly different solutions, depending on whether you have Symm, CX or NS:

On the Symmetrix it’s always been there in the form of Symmetrix Optimizer, which may not have been aware of tiers but it definitely knew about migrating to less busy disks. Now you can teach it about tiers, too. But it’s not, in my mind, a new product, even if EMC would like you to believe it is. It looks to me too much like Optimizer + some new heuristics. But the Gods of Marketing managed to create unbelievable commotion about something that was an old feature. What amazes me is that nobody seems to have made the connection – maybe I’m really missing something. I’m sure someone from EMC will correct me if I’m wrong. In my experience, Optimizer, when purchased, often did more harm than good, was difficult to manage and, ultimately, was left inactive in many shops – with the beancounters lamenting the spending of precious funds on something that never quite worked that well. Oh, and it seems the current version doesn’t support thin LUNs. But of the FAST implementations on EMC gear it is the more complete version, exactly because Optimizer has been there for a long time…

On the far more popular CX platform, what happens is like a tribute to kludges everywhere. Consider this:

  1. Movement is one-way only (FC to SATA, or FC to SSD). More of a one-shot tool than continuous optimization!
  2. You need a separate PC that will crunch Navisphere Analyzer performance logs; this takes a while
  3. The PC will then provide a list of recommendations
  4. Depending on which LUNs you approve it will invoke a NaviCLI command to move the specified LUNs in the box
  5. Doesn’t support thin provisioning
  6. Not sure if it supports MetaLUNs
  7. It is NOT automatic since you have to approve the move! Ergo, it should not be sold under the name “FAST”, since the “A” stands for “Automated”. Aren’t there laws against false advertising?

On the Celerra NS platform (EMC’s NAS), one needs to purchase the Rainfinity FMA boxes, which then can move files between tiers of disk based on frequency of access. One is then limited by the scalability of the FMA – how many files can it track? How dynamically can it react to changing workloads? What if the FMA breaks? Why do I need yet more boxes to do this?

Ah, but it gets better with FASTv2! Or does it?

EMC has been upfront that FAST will become way cooler with v2. It better be, since as you can see it’s no great shakes at the moment. From what the various EMC bloggers have been posting, it seems FASTv2 will use the thin provisioning subsystem to go to a sub-LUN level of granularity.

The granularity will obviously depend on how many disks you have in the virtual provisioning pool, since a LUN (just like with MetaLUNs) will be split up so that it occupies all the disks in the pool. The bigger the pool, the better. This should provide better performance (it does with other vendors) yet EMC in their docs state the current version of virtual provisioning (at least on the CX) has higher overhead when compared to their traditional LUNs and will provide less performance. I guess that’s a subject for another day, and maybe they’ll finally revamp the architecture to fix it. Back to FASTv2:

The “busyness” of each LUN segment will be analyzed, and that segment will then move, if applicable, to another tier. Of course, how efficient that will end up being will depend on how you do I/O to the LUN in the first place! If the LUN I/O is fairly spatially uniform, then the whole thing will have to move just like FASTv1. But I guess with v2 there’s at least the potential of sub-LUN migration, for cases where a clearly delineated part of the LUN is really “hot” or “cold”. Obviously, since the chunk size will still be significantly large, expect a bunch of non-applicable data to move with the stuff that should be moved.
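To see why the I/O pattern matters, here is a toy Python sketch. It is not how FAST (or any shipping product) decides what to move; it simply calls a segment “hot” when it sees more than twice the LUN's average I/O. The per-segment counts and the factor of 2 are made-up values.

```python
def hot_segments(io_counts, factor=2.0):
    """Return indices of segments with more than factor x the average I/O."""
    average = sum(io_counts) / len(io_counts)
    return [i for i, count in enumerate(io_counts) if count > factor * average]


# Hypothetical per-segment I/O counts for two 8-segment LUNs
uniform_lun = [100] * 8
skewed_lun = [10, 10, 900, 950, 10, 10, 10, 10]

print(hot_segments(uniform_lun))  # []
print(hot_segments(skewed_lun))   # [2, 3]
```

With uniform I/O nothing stands out, so there is no sub-LUN candidate and the LUN can only be judged as a whole.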

The real problem

First, to give credit where it’s due: Compellent already has had sub-LUN moves for a long, long time. Give those guys props. They actually deserve it.

However – both the Compellent approach as well as FASTv2 and, even worse, v1, suffer from this fundamental issue:

Lack of real-time acceleration.

Think about it – performance has to be analyzed periodically, heuristics followed, then LUNs or pieces of LUNs have to be moved around. This is not something that can respond instantly to performance demands.

Consider this scenario:

You have a payroll DB that, during most of the month, does absolutely nothing. A fully automated tiering system will say “hey, nobody has touched this LUN in weeks, I better move it to SATA!”

Then crunch time comes, and the DB is on the SATA drives. Oopsie.

People complain, and the storage admin is forced to manually migrate it back to SSD.

Kinda defeats the whole purpose… unless I’m missing something the size of the Titanic.

So, you may have to write all kinds of exception rules (provided the system lets you). Some rules for most DBs, Exchange, a few apps here and there…

Soon, you’re actually in a worse state than where you began: You have the added complexity and cost of FAST, plus you have to worry about creating exception rules.
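The payroll scenario is easy to reproduce with a toy policy in Python. This is purely illustrative and not any vendor's algorithm: the 14-day demotion rule, the tier names and the `pinned` exception set are all assumptions.

```python
def choose_tier(lun, days_idle, demote_after_days=14, pinned=frozenset()):
    """Naive idle-time tiering: demote anything untouched for too long.

    pinned holds the exception rules: LUNs that must stay on the fast tier.
    """
    if lun in pinned:
        return "fast"
    return "SATA" if days_idle >= demote_after_days else "fast"


# Payroll sits idle for most of the month...
print(choose_tier("payroll_db", days_idle=25))                         # SATA
# ...unless someone remembers to write an exception rule for it.
print(choose_tier("payroll_db", days_idle=25, pinned={"payroll_db"}))  # fast
```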

Now here’s a novel idea…

What if you actually put your data in the right tier to begin with and what if, even if you didn’t, it didn’t matter too much?

For instance – normal fileshares, deep archives, large media files, backups to disk – most people would agree that those workloads should probably forever be on SATA if you’re trying to save some money. With 2TB drives, the SATA tier has become super-dense, which can be very useful for quite a few use cases.

DBs, VM OS files – should usually be on faster disk. But no need to go nuts with several tiers of fast disk, a single fast tier should be sufficient!

LUNs and other array objects should try to automatically span as many drives as possible by default without you having to tell the array to do that… that way you avoid the hot spots in the first place by design, thereby reducing or even removing the need for migrations (I can still see some very limited cases where migration would be useful).

And finally, large, intelligent cache (as in really large) to help with real-time workload demands, dynamically and as-needed, by caching tiny 4K chunks and not wasting space on gigantic pieces… with the ability to prioritize the caching if needed. Not to mention being deduplication-aware.
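As a rough picture of what real-time caching at 4K granularity means, here is a minimal LRU block cache in Python. It is a teaching sketch under simplifying assumptions (plain least-recently-used eviction, reads only, no prioritization or dedupe awareness), not a description of how any array's cache is built. `BlockCache` is a hypothetical name.

```python
from collections import OrderedDict

BLOCK_SIZE = 4096  # cache in 4K chunks


class BlockCache:
    """Tiny LRU read cache keyed by 4K block number."""

    def __init__(self, capacity_blocks):
        self.capacity = capacity_blocks
        self.blocks = OrderedDict()  # block number -> True, oldest first
        self.hits = 0
        self.misses = 0

    def read(self, offset):
        """Record a read at a byte offset; return True on a cache hit."""
        block = offset // BLOCK_SIZE
        if block in self.blocks:
            self.blocks.move_to_end(block)  # mark as most recently used
            self.hits += 1
            return True
        self.misses += 1
        self.blocks[block] = True
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # evict least recently used
        return False


cache = BlockCache(capacity_blocks=2)
for offset in (0, 100, 4096, 0, 8192, 4096):
    cache.read(offset)

print(cache.hits, cache.misses)  # 2 4
```

The point is that a block becomes “fast” the moment it is touched a second time, with no analysis window and no data migration.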

Wouldn’t that be a bit simpler to manage, more nimble and more useful in real-world scenarios? The cache will help out even the slower drives for both file and OLTP-type workloads.

Maybe life doesn’t need to be complicated after all.

It’s almost 0300 so I’d better go to bed…

D

Mon 22 Feb '10

Protecting your existing legacy storage investment with virtualization – do’s and don’ts

It’s an undeniable fact that many customers, while they would love to use the highly advanced features of modern disk arrays, have already made a big investment in legacy storage. Sure, it doesn’t have all the great features, but it’s already there, frequently there’s a lot of it, and the maintenance isn’t expiring for another year or two so it’s not economically feasible to get rid of it.

Another issue most enterprises face is data migration – whether that’s to move from old to new on the same vendor, or from vendor to vendor. No matter how you cut it, you’ll have to do it someday.

A third issue is performance on the existing gear – maybe you have a ton of legacy storage but it’s just not performing the way you’d expect.

The final issue is managing disparate arrays. Nobody really wants to do that.

There are storage virtualization products that, conceptually, try to solve some of those issues in a similar way to how VMware, Hyper-V and Xen address similar issues with servers.

The idea is that you virtualize your existing storage behind gear that will give it some extra capabilities, centralized management and thereby extend its service life and maybe even eke out some more performance out of it. Your existing hosts will typically address the storage via the virtualizing device, so obviously some assembly is required (rezoning etc).

The devices I’m aware of fall into 3 basic categories:

  1. Devices that encapsulate existing LUNs and don’t need other equipment or much reconfiguration besides dropping them in, zoning and presenting the LUNs to the hosts through them. Examples are: FalconStor NSS, IBM SVC, HDS USP-V, HP SVSP.
  2. Devices that don’t need other equipment, offer some compelling extra features but cannot encapsulate LUNs and therefore need an initial migration besides the zoning. Example: NetApp V-Series.
  3. Devices that need extensive fabric upgrades besides reconfiguration. Example: EMC Invista (I’m not sure if it needs LUN migrations, I don’t think so but I’m sure someone from EMC will chime in).

There are other differences in the devices listed above, so I created a table and highlighted the areas where there’s either the odd man out or there’s some feature not available with the others. I’m aware that the table is nowhere near complete, but as it is I doubt it will fit onto a web page nicely. If there are inaccuracies, let me know and I’ll fix it. I admit I know little about HP’s SVSP. (re-posted with some SVC edits).

 

| Feature | EMC | HP | FalconStor | HDS | IBM | NetApp |
|---|---|---|---|---|---|---|
| Thin Provisioning | N | Y (? perf impact) | Y (? perf impact) | Y (perf impact) | Y (no perf impact) | Y (no perf impact) |
| Thin Clones | N | ? | Y (perf impact) | ? | Y (perf impact) | Y (no perf impact) |
| Snapshots | N | Y (? perf impact) | Y (perf impact) | Y (perf impact) | Y (perf impact) | Y (no perf impact) |
| Also an Array | N | N | N | Y | Y (limited 4x SSD per node) | Y |
| In-Band | N | split-path | Y | Y | Y | Y |
| Deduplication | N | N | N | N | N | Y |
| Replication | N (needs RecoverPoint) | Y | Y | Y | Y | Y |
| Needs Migration | ? (prob N) | ? | Y | N | N | Y |
| NAS | N | N | N | N | N | Y |
| Needs fabric Upgrade | Y | N | N | N | N | N |
| FCoE | N | N | N | N | N | Y (also 10GbE) |
| Perf Acceleration | N | N | Y (SSD cache) | Y (huge cache, RAM) | Y (192GB large cache with 8 nodes) | Y (gigantic cache, multi-TB) |
| Can do live FC migrations | Y | Y | Y | Y | Y | N (iSCSI, NFS, CIFS only at present) |
| Needs some space on array | N | N | N | N | N | Y |

 

The design decisions are interesting.

Of the above, IBM and FalconStor take the “pure appliance” approach, using Linux servers with custom code – that’s what those boxes were designed to do from the get-go. The idea is that you either have a bunch of old arrays or you buy a bunch of new, cheap and not very capable arrays, then front them with SVC or NSS, thereby making them decent.

Since IBM and FalconStor were always designed to perform this function, they are also, in my opinion, the best-suited for tasks like migrations. Indeed, I believe one can do a “hit and run” with said boxes, i.e. do the migration then remove the boxes from the fabric, making them popular with certain PS organizations.

On the other hand, HDS and NetApp instead offer the virtualization functionality as an additional feature to their arrays – as in, “you’ll probably buy our disk but we can enhance your legacy box, too”.

EMC took a completely different approach and uses out-of-band control servers and intelligent fabric switches to perform the virtualization trickery.

It’s important to note that NetApp lacks the live migration feature, but instead offers deduplication, application-aware snaps, great replication and NAS, and is arguably the most feature-rich platform (I’m trying to not be biased as I’m writing this). The biggest caveat (a deal-breaker for some) is that it can’t encapsulate your existing LUNs – instead, you need to chop up your RAID groups into LUNs, then present them to the NetApp system, which will then need to reformat said LUNs. This process also takes away some space for extra checksum calculations and other overheads. Arguably, you can make this up (and then some) in the end after using the features on tap (sorry). But you still need to figure time to migrate your stuff over gradually.

I believe EMC offers the fewest features and the most complex implementation – you can do stuff like mirror your LUNs from box to box and do migrations, but your arrays don’t really gain any new features. I have yet to meet a customer that owns this solution. I know there are a few big ones that went that way; it’s just not very common.

Of the devices mentioned above, the SVC is probably the most commonly used, then the USP-V (IBM and HDS always argue on that point since the capability to virtualize comes with HDS boxes whereas virtualization is the only thing the SVC does), then come FalconStor and NetApp, then HP with the relative newcomer SVSP, and last EMC (Invista hasn’t been a particularly successful product for EMC).

Storage Virtualization do’s and don’ts

I’d say that you should only really consider buying a virtualization product if you have well over 10TB of older gear (over 50TB, IMHO) that is not TOO old (i.e. not older than 3-4 years). Quite frequently, if your gear is really old, refreshing it with new gear just ends up being cheaper. Of course, there’s always eBay.

I’d also recommend not buying new low-end arrays and using virtualization to make them “better”. You are introducing more complexity into the environment, and it won’t necessarily be cheaper, either (something like the SVC has licenses that cost by the TB). Just buy a decent modern array that has all the features you need and be done with it.

Furthermore – don’t get into virtualization just to migrate from your older to your newer arrays. There are other ways.

You should use common sense (imagine that). As you’re not supposed to mix drive types within RAID groups even if you can, you typically don’t want to have an application straddling 5 different arrays, all vastly different in capability, just because you can.

It’s tempting to say “I’ll create a LUN that’s striped among every single disk on 5 different arrays”. Not to say that this should never be done (I’ve RAID-0’d across Symmetrix to get enough performance, long story), but only do it if you know what you’re doing and the exact layout that you’ll end up with. Nothing spells misery like RAID0 across many LUNs in an existing RAID group… :)

Finally – figure out what features are the most important to you. If you want dedupe, NAS and tight app integration, NetApp is the ticket. If you prefer ease of migration, you may want to look at the other solutions.

The guarantees

In order to entice customers to try their stuff, HDS and NetApp have some space savings guarantees in place regarding virtualization. HDS has a flat 50% guarantee (predicated upon converting from RAID1 to RAID5 + thin provisioning) or 20% guarantee (just thin provisioning).

NetApp has the ZIP program. It’s a bit different – there’s no hard number in the savings. Rather, the customer’s data is analyzed and the customer presented with the savings % NetApp guarantees to achieve in their case. If the customer agrees and NetApp achieves the guaranteed savings, then the gear gets purchased. If the savings are not reached, then the customer gets to keep the gear free of charge (that’s right).

Such guarantee programs have been much ridiculed by the vendors that don’t offer them, but I think they show the respective companies believe in their products enough to wrap some kind of guarantee around them.

In conclusion…

Properly deployed, storage virtualization can be effective in increasing the efficiencies of legacy storage footprints lacking in functionality. Just be careful and examine your motives for virtualization before making the move. Sometimes it’s a decidedly false economy.

D

Thu 18 Feb '10

So, are there any independent bloggers? Really?

There was some weird backlash against my site and my person recently – see here and here and in the comments here. Chuck Hollis got all uppity about whether I work at NetApp (with, for) or not.

I find it interesting that this only came up when I wrote something pro-NetApp. Wasn’t even anti-EMC.

It never came up when I was extolling the virtues of RecoverPoint (which I still think is awesome). I didn’t see anyone from NetApp or any EMC competitor start questioning where I worked, where the full disclosure was etc etc. Maybe they all just assumed I worked for EMC. Well – not directly, I was selling a ton of EMC gear, which was in turn paying my mortgage, which is as good as. But, ultimately, I just like the product since, properly deployed, it can solve some real problems.

So why is NetApp the company everyone loves to hate? Is it fear? Disrespect? Lack of understanding? All of the above? But, I digress. NetApp customers love the product, and the company’s recent earnings announcement, as well as the fact we sold 1 Exabyte of enterprise storage last year, tells the real story. The People want their highly-functional, space-efficient, simple-to-use, application-aware storage, not 50 different products that are loosely integrated. Volkslagerung! Is that right, German-speaking readers? (Edit: Volksdatenspeicher seems better as “storage for the people”.)

So, I clarified things in the About page (upper left); I thought it was already clear but apparently not. Chuck is still not satisfied, so I think I’ll have to figure out a way to show some fancy animation of me in some NetApp uniform, hugging Hitz, Lau, Georgens and Mendoza and receiving my MVP award. Plus another animation showing the super-secret initiation ceremony and the extensive branding on my left buttock. Right.

What was most interesting in this ad hominem attack was that the important discussion topics were largely ignored, a very efficient tactic to lure the unsuspecting reader’s mind away from the real issues.

Which brings us to the subject of this post.

There seems to be this cute, romantic notion that there is such a thing as a truly independent blogger, and if I’m not independent, then what I say is tainted.

Well – let me break it to you and disabuse you of this notion: There ain’t no such thing as an independent blogger.

We are all biased, one way or another, about everything. Our past experiences shape our biases and the automatic stories our brains will create to explain any information we are presented with.

It doesn’t matter whether we work for a storage vendor or are customers – indeed, customers are typically among the most biased IT folks around! (storage vendor employees are usually crusty, jaded, cynical, have been around the block and typically have the dirt on multiple technologies).

I’ve been in customer meetings where I was told the customer doesn’t ever want to talk to EMC again because they treated him badly 10 years ago, or that he doesn’t want to talk to NetApp because he read in Barry’s blog that it only has 30% usable space, another that has FC queuing issues with HDS gear and wants to get rid of it at all costs, yet another that has had some controller panics with IBM gear and wants to get off of that and never touch IBM ever again, the list goes on. Those guys become zealots.

Then you have the other customer type, the one that receives Rolexes and other cool gifts in order to say whatever he’s told to say. Some actually will demand it (I’ve been in one of those meetings, too – “if you give me your watch we may have a deal”. I chose to assume he was kidding, lest I completely lose my faith in mankind).

You then have your “analyst” type that’s an independent industry “expert” – most of those guys haven’t touched the products they’re writing about, ever, and are just rehashing whatever they read in other publications or are told by their vendor drinking buddy. Yet they’re among the most trusted and read. They, too, have their personal favorite horses they’re backing…

Finally you have your VAR bloggers. People – those guys make money selling the stuff. Yes, they know the tech, but don’t exactly expect an impartial discussion… plus, they get all kinds of incentives from vendors.

So, who do you trust, when you can’t even trust yourself? Since, by definition, you are also biased, gentle reader…

I wish I could tell you. Ultimately, everyone has an agenda, whether conscious or subconscious. You just need to become shrewd enough to see through the agenda.

Maybe a good starting point is a truly intelligent, fact-based discussion bereft of ad hominem attacks?

D

Wed
10
Feb '10

More FUD busting: Deduplication – is variable-block better than fixed-block, and should you care?

Before all the variable-block aficionados go up in arms, I freely admit variable-block deduplication may overall squeeze more dedupe out of your data.

I won’t go into a laborious explanation of variable vs fixed, but, in a nutshell, fixed-block deduplication means that data is split into equal chunks, each chunk given a signature, compared to a DB and the common chunks are not stored.

Variable-block basically means the chunk size is variable, with more intelligent algorithms also having a sliding window, so that even if the content in a file is shifted, the commonality will still be discovered.
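To make the distinction concrete, here is a toy Python sketch of both approaches. It is not any vendor's actual algorithm: the chunk sizes, the sliding-window "hash" (a plain byte sum over the last few bytes) and all function names are assumptions picked to keep the example short.

```python
import hashlib


def fixed_chunks(data: bytes, size: int = 8) -> list[bytes]:
    """Fixed-block: split data into equal-size chunks (last one may be short)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def variable_chunks(data: bytes, window: int = 4, mask: int = 0x07,
                    max_size: int = 64) -> list[bytes]:
    """Variable-block: cut wherever a fingerprint of the trailing `window`
    bytes matches a pattern, so chunk boundaries follow the content."""
    chunks, start = [], 0
    for i in range(len(data)):
        length = i - start + 1
        if length < window:
            continue  # chunk is still shorter than the sliding window
        # Toy fingerprint: sum of the last `window` bytes (real systems
        # use a proper rolling hash; this is only for illustration).
        fingerprint = sum(data[i - window + 1:i + 1])
        if (fingerprint & mask) == 0 or length >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # whatever is left becomes the tail chunk
    return chunks


def dedupe(chunks: list[bytes]) -> dict[str, bytes]:
    """Give each chunk a signature and store every distinct chunk once."""
    store = {}
    for chunk in chunks:
        store.setdefault(hashlib.sha256(chunk).hexdigest(), chunk)
    return store


def shared_signatures(a: list[bytes], b: list[bytes]) -> int:
    """Count the chunk signatures two chunk lists have in common."""
    return len(dedupe(a).keys() & dedupe(b).keys())
```

Insert a single byte at the front of a file and every fixed-size chunk after it changes, so nothing matches the old copy. With content-defined boundaries, only the first chunk changes and the rest line up again at the next cut point. That is the whole "sliding window" advantage in miniature.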

With that out of the way, let’s get to the FUD part of the post.

I recently had a TLA vendor tell my customer: “NetApp deduplication is fixed-block vs our variable-block, therefore far less efficient, therefore you must be sick in the head to consider buying that stuff for primary storage!”

This is a very good example of FUD that is based on accurate facts which, in addition, focuses the customer’s mind on the tech nitty-gritty and away from the big picture (that being “primary storage” in this case).

Using the argument for a pure backup solution is actually valid. But what if the customer is not just shopping for a backup solution? Or, what if, for the same money, they could have it all?

My question is: Why do we use deduplication?

At the most basic level, deduplication will reduce the amount of data stored on a medium, enabling you to buy less of said medium yet still store quite a bit of data.

So, backups were the most obvious place to deploy deduplication. Backup-to-Disk is all the rage: what if you can store more backups on target disk with less gear? That’s pretty compelling. In that space you have of course Data Domain and the Quantum DXi as two of the more usual backup target suspects.

Another reason to deduplicate is to not only achieve more storage efficiency but also improve backup times by not even transferring over the network data that’s already been transferred. In that space there’s Avamar, PureDisk, Asigra, Evault and others.

NetApp simply came up with a few more reasons to deduplicate, not mutually exclusive with the other 2 use cases above:

  1. What if you could deduplicate your primary storage – typically the most expensive part of any storage investment – and as a result buy less?
  2. What if deduplication could actually dramatically improve your performance in some cases, while not hindering it in most cases? (the cache is deduplicated as well, more info later).
  3. What if deduplication was not limited to infrequently-accessed data but, instead, could be used for high-performance access?

For the uninitiated, NetApp is the only vendor, to date, that can offer block-level deduplication for all primary storage protocols for production data - block and file, FC, iSCSI, CIFS, NFS.

Which is a pretty big deal, as is anything useful AND exclusive.

What the FUD carefully fails to mention is that:

  1. Deduplication is free to all NetApp customers (whoever didn’t have it before can get it via a firmware upgrade for free)
  2. NetApp customers that use this free technology see primary storage savings that I’ve seen range anywhere from 10% to 95%, despite all the limitations the FUD-slingers keep mentioning
  3. It works amazingly well with virtualization and actually greatly speeds things up especially for VDI
  4. Things that would defeat NetApp dedupe will also defeat the other vendors’ dedupe (movies, compressed images, large DBs with a lot of block shuffling). There is no magic.

So, if a customer is considering a new primary storage system, like it or not, NetApp is the only game in town with deduplication across all storage protocols.

Which brings us back to whether fixed-block is less efficient than variable-block:

WHO CARES? If, even with whatever limitations it may have, NetApp dedupe can reduce your primary storage footprint by any decent percentage, you’re already ahead! Heck, even 20% savings can mean a lot of money in a large primary storage system!

Not bad for a technology given away with every NetApp system.

D

Mon
8
Feb '10

NetApp disk rebuild impact on performance (or lack thereof)

Due to the craziness in the previous blog, I decided to post an actual graph showing a NetApp system’s I/O latency while under load and a disk rebuild. It was a bakeoff vs another large storage vendor (which NetApp won).

The test was done at a large media company with over 70,000 Exchange seats. It was with no more than 84 drives, so we’re not talking about some gigantic lab queen system (I love Farley’s term). The box was set up per best practices, with aggregate size being 28 disks in this case.

(Edited at the request of EMC’s CTO to include the performance tidbit): Over 4K IOPS were hitting each aggregate (much more than the customer needed) and the system had quite a lot of steam left in it.

There were several Exchange clusters hitting the box in parallel.

All of the testing for both vendors was conducted by Microsoft personnel for the customer.  The volume names have been removed from the graph to protect the identity of the customer:

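[Graph: read and write latency over time during the disk rebuild; points 1 and 2 mark the disk pull and the end of the rebuild]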

Under a load of 8K-size I/Os at a 53:47 read/write ratio, a single disk was pulled.  Pretty realistic failure scenario: a disk breaks while the system is under production-level load. Plenty of writes, too, almost 50%.

Ok.  The fuzzy line around 6ms is the read latency.  At point 1 a disk was pulled and at point 2 the rebuild completed.  Read latency increased to 8ms during the rebuild, but dropped back down to 5ms after the rebuild completed.  The line at less than 1ms response time straight across the bottom is the write latency. Yes, it’s that good.

So - there was a tiny bit of performance degradation for the reads but I wouldn’t say that it “killed” performance as a competitor alleged.

The rebuild time is a tad faster than 30 hours as well (look at the graph :) ) but then again the box used faster, 15K drives (and smaller, 300GB vs 500GB), so before anyone complains, it’s not apples-to-apples compared to the Demartek report.

I just wanted to illustrate a real example from a real test at a real customer using a real application, and show the real effects of drive failures in a properly-implemented RAID-DP system.

The FUD-busting will continue, stay tuned…

D

Tue
2
Feb '10

Vendor FUD-slinging – at what point should legal action be taken? And who do you believe as a customer?

I’m all for a good fight, but in the storage industry it seems that all too many creative liberties are taken when competing.

Let’s assume, for a moment, that we’re talking about the car industry instead. I like cars, and I love car analogies. So we’ll use that, and it illustrates the absurdity really well.

The competitors in this example will be BMW and Mercedes. Nobody would argue that they are two of the most prominent names in luxury cars today.

BMW has the high-performance M-series. Let’s take as an example the M6 – a 500HP performance coupe. Looks nice on paper, right?

Let’s say that Mercedes has this hypothetical new marketing campaign to discredit BMW, with the following claims (I need to, again, clarify that this campaign is entirely fictitious, and used only to illustrate my point, lest I get attacked by their lawyers):

  1. Claim the M6 doesn’t really have 500HP, but more like 200HP.
  2. Claim the M6 only does 0-60 in under 5 seconds with only 5% of the gas tank filled, a 50lb driver, downhill, with a tail wind and help from nitrous.
  3. Claim that if you fill the gas tank past 50%, performance will drop so the M6 does 0-60 in more like 30 seconds. Downhill.
  4. Claim that it breaks like clockwork past 5K miles.
  5. Claim that they have one, they tested it, and it performs as they say.
  6. Claim that, since they are Mercedes, the top name in the automotive industry, you should trust them implicitly.

Imagine Mercedes, at all levels, going to market with this kind of information – official company announcements, messages from the CEO, company blogs, engineers, sales reps, dealer reps and mechanics…

Now, imagine BMW’s reaction.

How quickly do you think they’d start suing Mercedes?

How quickly would they have 10 independent authorities testing 10 different M6 cars, full of gas, in uphill courses, with overweight drivers, just to illustrate how absurd Mercedes’ claims are?

How quickly would Mercedes issue a retraction?

And, to the petrolheads among us – wouldn’t such a stunt look like Mercedes is really, really afraid of the M6? And don’t we all know better?

More to the point – do you ever see Mercedes pulling such a stunt?

Ah, but you can get away with stuff like that in the storage industry!

Unfortunately, the storage industry is rife with vendors claiming all kinds of stuff about each other. Some of it is or was true, much of it is blown all out of proportion, and some is blatant fabrication.

For instance, XIV breaking if you pull 2 disks out – as I state in a previous post, it’s possible if the right 2 drives fail within a few minutes of each other. I think it’s unacceptable, even though it’s highly unlikely to happen in real life. But I’ve seen sales campaigns against the XIV use this as the mantra, to the point that the fallacy is finally stated: “ANY 2 drive failure will bring down the system”.

Obviously this is not true and IBM can demonstrate how untrue that is. Still, it may slow down the IBM campaign.

Other fallacies are far more complicated to prove wrong, unfortunately.

An example: Pillar Data has an asinine yet highly detailed report by Demartek showing NetApp and EMC arrays having significantly lower rebuild speeds than Pillar (as if that’s the most important piece of data management, but anyway – rebuild speed hasn’t helped Pillar sales much, even if it’s true).

Anyone that knows how to configure NetApp and EMC would see that the Pillar box was correctly configured, whereas the others were intentionally made to look 4x worse (in the case of NetApp, they literally went against not just best practices but blatantly against system defaults in order to make it slower). However, some CIOs might read this and give credence to it, since they don’t know the details and don’t read past the first graph.

For EMC and NetApp to dispute this, they have to go to the trouble of configuring, properly, a similar system, and running similar tests, then writing a detailed and coherent response. It’s like wounding the enemy soldier instead of killing them: their squadmates have to help them out, wasting manpower. I get it – it’s effective in war. But is it legal in the business world?

Last but not least: EMC and HP, at the very least, have anti-NetApp reports, blogs, PPTs etc. that literally look just like the absurd Mercedes/BMW example above, sometimes worse. Some of it was true a long time ago (the famous FUD “2x + snap delta” space requirement for LUNs is really “1x + snap delta” and has been for years), some of it is pure fabrication (“it slows down to 5% of its original speed if you fill it up!”). See here for a good explanation.

Of course, again that’s like wounding the enemy soldiers: NetApp engineers have to go and defend their honor, show all kinds of reports, customer examples, etc etc. Even so, at some point many CIOs will just say “I trust EMC/HP, I’ve been buying their stuff forever, I’ll just keep buying it, it works”. The FUD is enough to make many people that were just about to consider something else, go running back to mama HP.

Should NetApp sue? I’ve seen some of the FUD examples and literally they are not just a bit wrong but magnificently, spectacularly, outrageously wrong. Is that slander? Tortious interference? Simply a mistake? I’m sure some lawyer, somewhere, knows the answer. Maybe that lawyer needs to talk to some engineers and marketing people.

Let’s flip the tables:

If NetApp went ahead and simply claimed an EMC CX4-960 can only really hold 450TB, what would EMC do?

I can only imagine the insanity that would ensue.

I’ll finish with something simple from the customer standpoint:

NetApp sold 1 Exabyte of enterprise storage last year. If it was as bad as the other (obviously worried) vendors are saying, does that mean all those customers buying it by the truckload and getting all those efficiencies and performance are stupid and wasted their money?

D

Thu
14
Jan '10

Pillar claiming their RAID5 is more reliable than RAID6? Wizardry or fiction?

Competing against Pillar at an account. One of the things they said: That their RAID5 is superior in reliability to RAID6. I wanted to put this in the public domain and, if true, invite Pillar engineers to comment here and explain how it works for all to see. If untrue, again I invite the Pillar engineers to comment and explain why it’s untrue.

The way I see it: very simply, RAID5 is N+1 protection, RAID6 is N+2. Mathematically, RAID5 is about 4,000 times more likely to lose data than a RAID6 group with the same number of data disks. Even RAID10 is about 160 times more likely to lose data than RAID6.
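For readers who want to play with the numbers, below is a rough Python sketch using the common textbook mean-time-to-data-loss (MTTDL) approximations for single- and double-parity groups. This is not necessarily the math behind the figures quoted above, and the drive MTTF and rebuild time in the usage note are invented inputs. It also ignores unrecoverable read errors during rebuild, which in practice make RAID5 look even worse.

```python
def mttdl_raid5(data_disks: int, mttf_hours: float, rebuild_hours: float) -> float:
    """Approximate MTTDL for N+1: data is lost when a second disk
    fails while the first failed disk is still rebuilding."""
    g = data_disks + 1  # total disks in the group
    return mttf_hours ** 2 / (g * (g - 1) * rebuild_hours)


def mttdl_raid6(data_disks: int, mttf_hours: float, rebuild_hours: float) -> float:
    """Approximate MTTDL for N+2: data is lost only when a third disk
    fails while two others are still rebuilding."""
    g = data_disks + 2  # total disks in the group
    return mttf_hours ** 3 / (g * (g - 1) * (g - 2) * rebuild_hours ** 2)


def raid6_advantage(data_disks: int, mttf_hours: float, rebuild_hours: float) -> float:
    """How many times longer the RAID6 group is expected to last before
    losing data, for the same number of data disks."""
    return (mttdl_raid6(data_disks, mttf_hours, rebuild_hours)
            / mttdl_raid5(data_disks, mttf_hours, rebuild_hours))
```

Calling `raid6_advantage(14, 500_000, 24)` with those made-up inputs gives a factor in the thousands. The exact multiple swings a lot with the assumed rebuild time and drive MTTF, so treat any single figure as an order-of-magnitude statement.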

The only downside to RAID6 is performance – if you want the protection of RAID6 but with extremely high performance then look at NetApp: the RAID-DP that NetApp employs by default has, in many cases, better performance than even RAID10. Oracle has several PB of DBs running on NetApp RAID-DP. Can’t be all that bad.

See here for some info…

D

Sat
9
Jan '10

What if you could dramatically improve your application testing times? What would happen to your productivity and to the company’s bottom line?

So, let’s say the DBA (or insert some other discipline) wants to do some testing for a new product (known to happen occasionally) – and the way he would really like to test is to create 20 test cases, which requires 20 copies of the main database. He would then automate the test and therefore get results very quickly.

He approaches the storage admin with the problem, only to be told this isn’t possible since there isn’t enough space on the array. The DBA goes back to his cube frustrated, and figures out some ghetto way of creating at least 1 copy of the database, which creates the following problems:

  1. He has to figure out a way to do it (takes time)
  2. He can only test 1 case at a time (time)
  3. He cannot easily compare what-if scenarios between test cases (lack of flexibility)
  4. His ghetto way of doing it may involve single 1TB disks in a workstation (lack of reliability, time)

Ultimately, the testing takes longer, is error-prone, and the DBA’s productivity level goes way down.

What if the storage admin could, instead, tell the DBA that he can even take hundreds of copies of the DB, there’s no issue doing that?

What would happen to the DBA’s productivity?    

What new ideas would he be able to come up with?

How would that affect the quality of the product?

How would that affect the company’s bottom line? Being able to go to market with improved quality and quicker than the competition?

You see, intelligent storage – intelligently deployed – can solve many more problems than just “give me some space” or “give me more performance”.

There aren’t many technologies out there that can comfortably do this, which is probably why most storage people aren’t aware of this. But an array that can create space- and performance-efficient application-consistent DB clones is the ticket. Being able to create full copies and/or virtual space-efficient copies that end up being unusably slow doesn’t count… :)
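Why can clones be nearly free in space? A toy Python model helps. This is a generic block-sharing sketch, not a description of how FlexClone or any other product is implemented; `BlockStore`, `Volume` and the block counts in the usage note are all made up for illustration, and the model never frees old blocks.

```python
import itertools


class BlockStore:
    """Physical storage: every block written gets a new ID."""

    def __init__(self):
        self.blocks = {}
        self._next_id = itertools.count()

    def put(self, data: bytes) -> int:
        block_id = next(self._next_id)
        self.blocks[block_id] = data
        return block_id


class Volume:
    """A volume is just an ordered list of block IDs in a shared store."""

    def __init__(self, store: BlockStore, block_ids):
        self.store = store
        self.block_ids = list(block_ids)

    @classmethod
    def create(cls, store: BlockStore, blocks):
        return cls(store, [store.put(b) for b in blocks])

    def clone(self) -> "Volume":
        # Copies only the list of pointers; no data blocks are duplicated.
        return Volume(self.store, self.block_ids)

    def write(self, index: int, data: bytes) -> None:
        # New data goes to a new block; other volumes keep the old one.
        self.block_ids[index] = self.store.put(data)

    def read(self, index: int) -> bytes:
        return self.store.blocks[self.block_ids[index]]
```

Create a 1,000-block "database", clone it 20 times and let each test case change 10 blocks: the store ends up holding 1,200 blocks instead of the 21,000 that 20 full copies plus the original would need. Only the changes cost space.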

The only vendor I know of that can pull this off (properly) is NetApp with their FlexClone technology. One can even use it to deploy thousands of identical VMs… there are some use cases for that, too :)

Activision (the company that makes the famous Guitar Hero game) is a good example of using this technology to rapidly accelerate development – and ended up making the Christmas deadline, which resulted in several more millions in sales. See here.

Oracle is another small company that uses this technology pervasively.

If anyone else knows of more vendors that can do this (properly) please chime in.

D


Should techies or business owners decide on technology (or both)?

It’s no secret that, in most companies, the technology folks are primarily the ones deciding on which new technologies to adopt – after all, they are the ones that understand the technology, right? Business owners explain the business problem to the technologists, and the techies take it from there – and ultimately present 2-3 different solutions that will work and the business picks the cheapest.

This could be great – if it weren’t for the fact that, like everyone, techies have their own agenda, which ends up tainting the decision process. Consider some of the following:

  • Comfort level with existing vendor (if it ain’t broke why fix it? This assumes all of the vendor’s products work equally well)
  • Job security (“why learn something new? Maybe they’ll hire someone that already knows this!”)
  • Delusions of grandeur (“I have the power!”)
  • Fear (“it sounds amazing, but what if the stuff doesn’t work?”)
  • Disbelief (“my current gear can’t do this, there’s no way this new stuff is that good!”)
  • Laziness (“you mean I have to test this new stuff? It cuts into my online gaming time!”)
  • Envy (“my buddy at this other company has this stuff, I must have something cooler/bigger!”)
  • Lack of time (“I really don’t have the time to test this new stuff!”)
  • Vendor kickbacks (we all know it happens in one form or another, and to the perennially under-paid techies, an expensive gift may be something they will never otherwise be able to afford, so it gains huge importance in their eyes)
  • The inability to grasp the real business drivers
  • The inability to think strategically
  • Being wowed by “cool” features that are of dubious business importance (see other post here)
  • Conversely, not understanding features that could be of immense business importance, that could save the company millions and increase productivity tenfold.

Of course, someone like a CIO or CTO normally acts as the bridge that spans the techie and business worlds, but that doesn’t always work (see here).

The only way around the issue is to create a new decision process for the company, one that involves all the interested parties from all departments. As complex as it may sound, this does work, and most of the time new ideas/issues get unearthed (“what do you mean my database is not backed up now?” or “what do you mean it would take 2 weeks to recover my lab environment?”).

Try it, you may be surprised at what happens!

D

Fri
4
Dec '09

Is EMC under-sizing RecoverPoint and Avamar deals to win business?

It’s been a while since I wrote anything – unlike some, I actually have a day job! Well, at least that’s my excuse.

My admiration for RecoverPoint is well known (see older post, which is referenced internally within EMC as a great pro-RecoverPoint article). It really is a good product and, next to VMware, my favorite EMC acquisition.

So it incenses me when I see a good product being misconfigured, and reminds me of Hanlon’s Razor: “Never attribute to malice that which can be adequately explained by stupidity”. You see, I’d rather chalk this up to sales not knowing what they’re doing than assume that EMC knows full well the ramifications of their decision and goes ahead and does the dirty deed anyway.

However, I’ve seen multiple cases recently where RecoverPoint and/or Avamar were most decidedly incorrectly sized to support the customer’s workload. The customer likes the price and goes for the solution, only to be in for a nasty surprise later on. Not to worry, everything can be fixed with some more boxes, licenses and hard disks! After all, it’s tough and expensive to rip the stuff out!

To start with RecoverPoint: it can be a wonderful DR tool but, like any tool, needs to be used correctly in order to be most effective. For instance, there are several aspects when designing a RecoverPoint solution:

  • One needs to take into account the sustained throughput each device can handle (minuscule when compared to the total bandwidth of a CX4 or V-Max), and add extra devices in order to comfortably sustain the throughput the customer needs – even if that means you go beyond the 2-device-per-site RecoverPoint SE maximum and into the realm of “full” RecoverPoint (which can do more than 2 appliances per site, for added performance).
  • To expand on the previous point, assume that one of the RecoverPoint devices is “gravy” and is there to fail over if another box breaks. So, you effectively don’t want to be relying on having the full complement of RecoverPoint boxes working. This is especially important in 2-box RecoverPoint SE configs. If one box breaks (and they’re plain Dell 1950 servers) then that should not be debilitating to your performance while you’re waiting for a new box.
  • Licensing is capacity-based, which also needs to be explained to the customer (including what it means price-wise if you go beyond what RecoverPoint SE will support).
  • There is an absolute ceiling for TB replicated.
  • There’s a different price depending on whether you want to do local only, remote only or both kinds of replication (CDP, CRR and CLR licenses).
  • Beware of the increased I/O on the array! When doing any kind of traffic through RecoverPoint, at the very least you get quite a bit more I/O on the “journal” (the redo log part of RecoverPoint) in addition to your main disk. If you want to also do local recovery, you could be doing as much as 3x the I/O! You see, you have to send the normal copy of the data through first, then Clariion splits off the I/Os to RecoverPoint, which then writes data to a full local mirror, then also to the journal. Obviously, the array needs enough fast disks to cope with this.
  • As a corollary to the previous point, to do CDP you need at least 2x the space plus a percentage for the journal (how much depends on the change rate and how far back in time you want to be able to go).
  • Additionally, you can’t present multiple clones of the data simultaneously, from different points in time – you have to do them one at a time. Could be important in some use cases.
  • Creating a full-speed-access snapshot of your data can take quite a while, again could be important in some cases.
  • Last but not least – RecoverPoint, while efficient, is still subject to the laws of physics, so if you are told you’ll get zero RPO/RTO over a multi-thousand-mile link, stop what you’re doing, email me and I’ll overnight you an industrial-strength cattleprod, gratis… which you can then use on the rep in question.
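The arithmetic behind a few of the bullets above can be sketched in Python. Treat it as a back-of-the-envelope aid, not an EMC sizing tool: the function names are invented, the per-appliance throughput and journal percentage have to come from the vendor and your own change rate, and the 2x figure for remote-only replication is an assumption (the 3x worst case for local recovery is the one quoted above).

```python
import math


def appliances_needed(required_mb_s: float, per_appliance_mb_s: float,
                      spares: int = 1) -> int:
    """Enough appliances to sustain the throughput, plus spare(s) so a
    single box failure does not cripple replication."""
    return math.ceil(required_mb_s / per_appliance_mb_s) + spares


def array_write_iops(host_write_iops: float, local_copy: bool = True) -> float:
    """Worst-case write load the array must absorb.
    Production write + journal write, plus the local mirror write when
    local recovery is in use (3x). The 2x remote-only case is assumed."""
    multiplier = 3 if local_copy else 2
    return host_write_iops * multiplier


def cdp_capacity_tb(production_tb: float, journal_fraction: float) -> float:
    """Production copy + full local mirror + journal sized as a
    fraction of production capacity."""
    return production_tb * 2 + production_tb * journal_fraction
```

For instance, needing 250MB/s sustained from appliances that (hypothetically) handle 100MB/s each means `appliances_needed(250, 100)` boxes per site, which is already past the 2-appliance RecoverPoint SE limit. That is exactly the kind of thing to find out before the PO goes out.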

So – all I’m saying is, ask all the right questions before sending that PO over…

Avamar is a different case altogether. It’s a dedup backup appliance that dedupes the data before it’s sent over the network. It’s very efficient at doing rapid backups over poor WAN connections. You don’t have to pay per-client fees, it supports most major OSes and applications, and is fairly easy to use. However – the original use case for the product was doing centralized backups of multiple small remote sites that are connected via poor links, and it still excels in that. Doing backups of large datasets at the datacenter, on the other hand, is not really what it was designed to do, yet I see it positioned in such a way.

I also see EMC selling really, really small Avamar configs (1-2 boxes), the hope being that dedup will be so effective that it’ll all be a wash in the end. Well – deduplication, in general, is the ultimate “it depends” solution!

Here are some considerations:

  • Not all data deduplicates equally! Make sure you run the EMC dedup estimator not just on fileserver data but also on your DBs! (DBs don’t really dedupe well, and media files and in general anything compressed dedupes even worse). Make sure you really get a good sample of your data analyzed, ideally all of it if possible.
  • If the sizer and dedup tool have only been run for plain fileserver data and that’s not what you have, don’t believe anything you see…
  • Explain your desired retentions and insist you see the Avamar sizer results. A good rule of thumb is that if your data is 5TB, then even with dedupe and compression, you’ll still need about 5TB once you factor in retention, unless you’re one of those rare cases that had tremendous duplication to begin with.
  • Make sure you understand the ramifications of not going to the RAIN grid in the first place – if you get a couple of Avamar boxes they can’t be part of the RAIN architecture, and if you lose one then the entire system is down hard. If you have RAIN, you could lose an entire node and it will be OK (kinda like RAID5 for servers) but migrating from non-RAIN to RAIN is non-trivial. Ask for the details. Ideally, even if you don’t need enough capacity to go RAIN, just buy the appliances to go RAIN but don’t buy the capacity licenses (i.e. you could buy 1TB of capacity yet have 5 nodes that theoretically can have a bunch more capacity).
  • Figure out if you want fast backups or fast recovery or both, and choose product accordingly (the fastest recovery is always replication/snapshots of primary data). Remember – usually, the desired end result is to recover, not to back up!
  • Understand exactly how Avamar can go to tape – the solution is not clean and it’s excessively slow. The product is really meant for those that want to go tapeless.
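To see why retention can eat most of what dedupe saves, here is a deliberately crude Python estimate. It is not the Avamar sizer and does not model how any product really stores data; the function name, the split into "first full" and "daily new data", and all the ratios are assumptions you would replace with results from a real assessment of your own data.

```python
def backup_capacity_tb(primary_tb: float, daily_change_rate: float,
                       retention_days: int, first_full_reduction: float,
                       daily_reduction: float) -> float:
    """Rough disk needed for deduped backups.

    first_full_reduction: reduction ratio on the initial full (2.0 = 2:1)
    daily_reduction: reduction ratio on each day's new/changed data
    """
    first_full = primary_tb / first_full_reduction
    daily_new = primary_tb * daily_change_rate / daily_reduction
    return first_full + daily_new * retention_days
```

With invented inputs of 5TB primary, 1% daily change, 60 days of retention, 2:1 on the first full and 1.2:1 on the daily changes (think databases and compressed files), `backup_capacity_tb(5, 0.01, 60, 2.0, 1.2)` comes out at roughly 5TB. Change any input and the answer moves a lot, which is the point: insist on seeing the sizing run against your actual data.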

That’s all I have for now.

D

 

Sat
15
Aug '09

Should your backups to disk consume more disk than you use for production? Seriously?

So, let’s talk about this not-so-hypothetical customer… They have:

  • A few sites
  • A lot of data per site
  • Much of the data is DBs and Multimedia
  • No replication currently
  • Can’t back up everything currently
  • No proper DR
  • Fairly significant rate of change
  • Not the fastest pipes between sites

They asked me to propose a solution that will back everything up and cross-replicate the backups between the sites. They want to move as far away from tape as possible.

After much deliberation and examination of the data and requirements (I sized the solution with best practices for the usual suspects), we concluded that backing everything up while sticking to their requirements would take about 3x the total amount of production space, and that’s including various kinds of dedupe! The culprits are the rate of change and the large amount of data with poor undedupability (that can’t possibly be a word).

So, we declined to propose a solution. I want to sell something as much as the next guy but primarily I want repeat customers and the only way to get a happy repeat customer is to not screw him the first time… And selling them 3x the space only for backups doesn’t make too much sense to me when they could be spending their money much more wisely.

I explained how it doesn’t make sense to spend that kind of money on disk that’s just for backups! After all, backups are a last resort. My list of preferred methods for recovery (from best to worst):

  1. Local and remote replication + application-aware snapshots
  2. Backups to disk
  3. Backups to tape
  4. Snot, a claw hammer, duct tape and bailing wire (sometimes actually works better than tape but anyway…)

Wouldn’t it be a slightly better idea to use maybe 2x the disk, possibly even spend less money compared to the backup-only solution, and instead:

  • Cross-replicate the production data for rapid recovery
  • Achieve full local and remote DR
  • Be able to go back in time with snapshots both locally and remotely
  • Replicate the snapshots themselves automatically
  • Still get dedupe but this time on primary storage (make the current storage last longer)
  • Not need a forklift upgrade (investment protection)
  • Reduce or eliminate tape and reliance on the backup software
  • Get even longer retention than with backups to disk
  • No pipe upgrades
  • Drastically simplify administration
  • Potentially save millions over the next few years!

We’ll see what they decide to do. There was tremendous resistance to what I and a horde of seasoned engineers believe is the proper solution, with all kinds of very reasonable excuses being voiced (“we have no time, no resources, the stakeholders don’t care” etc). However, my position on this is clear. Yes, there’s more short-term pain in order to transform the infrastructure to the utopic vision of the bullets above, but the long-term gains are staggering!

I’ll let everyone know what happened the moment I hear. This one is really interesting…

D


Wed
29
Jul '09

Ease Of Use, Backup and Recovery And Efficiency in Modern Disk Arrays – What Questions Should You Really Be Asking the Vendors?

It’s interesting how many storage vendors claim their products are easy to use and, indeed, show nice canned demos full of wizards and elves and whatnot that seem to impress most. There are also grandiose claims of magically reliable hardware and other pixie dust… Ultimately, the reality is that:

  1. Most modern arrays, as long as you’re comparing like to like (i.e. from the same class, same kind of RAID), properly configured, will be reliable enough for most uses
  2. Similarly to #1 above: aside from insane marketing cache IOPS (a certain prominent vendor quotes IOPS numbers not even from cache but from the buffers of the FC ports, how realistic is that?) performance is not crazily different between similar-class boxes with similar numbers of disks. Ultimately, cache runs out and you need to hit the spindles… (so, boxes that can contain gigantic amounts of cache such as NetApp with PAM cache boards or EMC’s V-Max with multiple engines have a leg up there)
  3. There Is No Magic
  4. Almost everyone is using the same bits internally (CPUs, disks, RAM…) – with some key enhancements here and there. Don’t let the exact hardware details cloud your judgment. A good example: Let’s say Array X has 2 CPUs at 2GHz and Array Y has 2 CPUs at 3GHz. Unless the arrays come from the same manufacturer and run the same code, it’s VERY difficult to compare. Even if the CPUs are exactly the same, it’s tough to compare. The reason? Running anything (let’s pick Oracle) on the exact SAME hardware may produce wildly different benchmarks depending on whether the OS is Solaris, Linux or Windows, the tunings employed, and whether it’s 64- or 32-bit - the variable here being the OS.
  5. It all comes down to the intelligence, efficiency and reliability of the array software

Some business-related questions to ask the vendor:

  1. How is the support? Is it outsourced or not?
  2. Is the company viable? Is it profitable? Growing? Or is it tiny, struggling and depends on a single cool feature to woo prospects?
  3. References? Are the customers loving it or is it just OK? A cool one I heard today: “since I stopped using <TLA> my blood pressure dropped”…
  4. What large companies are using the technology? It’s one thing to have a reference from a mom-and-pop shop, and another to have one from Oracle, Microsoft etc. (and have multiple PB deployed inside those large companies)
  5. How many PB is the vendor deploying daily? How many total installations are CURRENTLY UNDER SUPPORT? I don’t care how many there have been since the company’s inception, since a “since inception” number includes people that got RID of the solution.
  6. Is the vendor OK with giving a performance guarantee (i.e. based on your workload that you will get 100,000 IOPS) and giving you a 100% refund if they fail to meet the metrics?
  7. To expand on the previous item: Is the vendor OK with doing a “Right of Return” - let you return the box if it doesn’t meet some agreed-upon criteria?
  8. Is the vendor OK with doing a Proof-of-Concept?

The prospective customers should probably ask for a bit more detail – and focus on things that will be statistically more important day-to-day than cool features of debatable real-world use. Some technical questions I’d ask:

  • Can I add drives on my own? Easily? Or do I need PS?
  • What requires downtime? Why?
  • What protocols does the array support? Can I use whatever or am I locked in?
  • Do I need extra appliances to support more protocols or are they all truly built-in?
  • Can I expand the ports?
  • Can I switch a LUN so it’s presented via iSCSI instead of FC (or vice versa)?
  • How do I do stuff like add drives to a RAID group? Is it on-the-fly? Do I need to destroy the RAID group?
  • Do I need to add disks in groups or can I add 1-2 if I want?
  • How much realistic protection do the available RAID schemes afford me? And what do I give up?
  • Can I lose any 2 drives in rapid succession without losing data? (dual-drive-loss has happened to various people I know and to me twice, it’s not as rare as you’d think. I lost data…)
  • Does RAID6 result in a performance decrease?
  • What is the real usable capacity, after RAID, based on real disk capacities (base-2 not base-10) and not marketing? You see, a 1TB drive doesn’t really offer 1TB…
  • Explain all the overheads in the system if best practices are followed - in some systems, even after RAID, 10TB usable is more like 5TB usable…
  • How easy is it to have a LUN span multiple disks in multiple RAID groups for performance? Meaning, in practical terms – do I need to worry about the back-end or will the system just take care of it for me?
  • Do I need to worry about adding disks in certain multiples, especially when dealing with such spanned LUNs?
  • Can I move stuff around the array?
  • How quick is the rebuild of drives?
  • Does the array detect impending failures and fail drives before they actually fail, in order to avoid a parity rebuild?
  • Do I need to care and know a lot about the back-end in order to optimize performance?
  • Is it easy to set up replication? Do I need extra appliances? Can I use FC and/or IP?
  • What’s the replication delta? (some arrays have a pretty huge minimum chunk they need to send over, can affect RPO)
  • Is compression supported for replication?
  • Regarding replication (both local and remote): Can I set up logical LUN groups that get treated as one in order to maintain consistency?
  • Can I grow a LUN?
  • Can I shrink a LUN?
  • Can I do it all from 1 place (and when I say all I mean all the way to having the LUN visible in the OS as a Filesystem, complete with proper partition offsets) OR do I need to visit like 3 different interfaces? Most vendors focus on the creation of a LUN. Easily creating a LUN is only a small piece of the puzzle!
  • Can I multi-purpose my disks or do I need to dedicate some to NAS, some to FC?
  • Can I prioritize my I/O?
  • Can I prioritize and tune my cache?
  • Do thin provisioning and snapshots adversely impact performance?
  • How many snapshots can I keep?
  • Can I keep a snapshot for, say, a year without messing up my performance and without needing a ton of space?
  • Can I use snapshots to clone LUNs so that they can be used to rapidly provision, say, servers or VMs without occupying too much space?
  • How easily and quickly can I backup and restore?
  • What kind of application integration is available? Some vendors offer basic VSS integration for Windows, but can I, say, recover individual emails and clone DBs without needing to use my backup application? How easily and quickly?
  • What about integration with applications that aren’t on Windows and may even be custom? Is it easy to properly integrate them?
  • Can I increase the cache size if needed? By how much?
  • Can I tier my data?
  • Does it work with the primary backup apps and VMware SRM?
  • Can I get data encryption?
  • Can I get data compression for all kinds of data?
  • Can I get deduplication? And does it work for backup data only or also for my production data so I can save space?
  • What is the deduplication impact?
  • Can I script operations if I want to?
  • What kind of reporting and data gathering is available?

This is not even a comprehensive list and I’m sure everyone has their own (if you haven’t written your own list down I suggest you do!) but represents what I feel are features that are realistically valuable.
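To make the base-2 vs base-10 capacity question from the list concrete, here is a minimal Python sketch. The 1TB drive size comes from the list above; the function name is my own, and the calculation deliberately ignores RAID, spares and filesystem overhead, which only shrink the number further.

```python
def marketing_tb_to_tib(tb: float) -> float:
    """Convert decimal terabytes (10**12 bytes) to binary tebibytes (2**40 bytes)."""
    return tb * 10**12 / 2**40


one_tb = marketing_tb_to_tib(1.0)
print(f"A '1TB' drive holds about {one_tb:.3f} TiB before any RAID or system overhead")
```

Run the same conversion on the vendor’s quoted raw capacity before you start subtracting RAID and best-practice overheads.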

What do you think? Comments always welcome…

D


Sun
14
Jun '09

New ext4 vs XFS benchmarks using Fedora 11 Leonidas

What a difference a kernel rev and/or distribution make. If you recall from a previous post, I was unable to complete postmark testing on Ubuntu 9.04 using ext4, and had to recommend against ext4. Now, with the release of Fedora 11 “Leonidas”, a new kernel seems to make a big difference in performance and stability of ext4.

Some other observations before I show any numbers:

  • This is NOT the same computer as was used in the previous test, don’t use these numbers to compare between Ubuntu and Fedora. It’s a desktop with a 64-bit Athlon and 1GB RAM. I know, I know… I didn’t have access to the other box. Look at Phoronix.com for a comparison of the two.
  • The 2.6.29 kernel seems to have a much better implementation of the CFQ I/O elevator, I only noticed a slight decrease in performance using deadline instead of the increase I usually get with XFS (ext3 and ext4 have always been tuned for CFQ).
  • In this version, using my usual (and sometimes unsafe and daring) mount switches didn’t seem to make a huge difference on XFS and none in ext4 or even ext3, Fedora 11 is really a distribution that the developers want you to be able to use without much fussing.
  • On all tests, I created XFS with mkfs.xfs -f -l lazy-count=1 -l size=128m /dev/…  - this enables the 2 main (and safe) tunings that I believe everyone should follow with XFS. Kinda hard to do while installing a distribution, the Fedora 11 installer wasn’t happy about it. Ubuntu is more forgiving, it lets you boot into the LiveCD and you can manually create partitions before you let the installer do its thing. Convenient for single-root-partition installs…
  • “XFS tuned” means mounted with noatime,logbsize=256k,nobarrier (nobarrier is unsafe unless you’re on a UPS).
  • “ext3 tuned” means barrier=0,noatime,data=writeback. Used to make a big difference…
  • The same disk area was used for all tests
  • Scribefire on Firefox sucks compared to Mac- or Windows-based offline blog editors. There are some KDE-based ones but I didn’t want to download 100s of MB of KDE support infrastructure to run a 600K blog program…

Postmark numbers:

Filesystem             | Read MB/s | Write MB/s | IOPS
XFS defaults           | 4.9       | 10.34      | 215
XFS tuned              | 6.23      | 13.16      | 263
XFS noatime,logbsize   | 6.38      | 13.47      | 263
ext4 noatime           | 9.62      | 20.32      | 416
ext3 noatime           | 5.71      | 12.06      | 238
ext3 “tuned”           | 5.32      | 11.24      | 219
ext3 writeback,noatime | 4.73      | 9.98       | 192

Bonnie++ numbers:

Filesystem           | IOPS  | Block writes KB/s | Rewrite KB/s
XFS defaults         | 328.4 | 116600            | 52066
XFS tuned            | 328.6 | 119981            | 51639
XFS noatime,logbsize | 333   | 119781            | 50519
ext4 noatime         | 335.1 | 117285            | 48797
ext3 noatime         | 294.6 | 100771            | 43033

Verdict

  • Ext4 shows great promise!
  • For sheer MB/s on large files, XFS is still better by a small margin
  • If you want to be doing operations on many small files, ext4 is great
  • The reworked CFQ scheduler rocks
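
To put a number on the “many small files” verdict, here is a small Python sketch that expresses ext4’s Postmark IOPS as a percentage advantage over each other configuration. The figures are copied from the Postmark table above; the dictionary and function names are just illustrative.

```python
# Postmark IOPS copied from the table above.
postmark_iops = {
    "XFS defaults": 215,
    "XFS tuned": 263,
    "XFS noatime,logbsize": 263,
    "ext4 noatime": 416,
    "ext3 noatime": 238,
    "ext3 tuned": 219,
    "ext3 writeback,noatime": 192,
}


def advantage_over(baseline: str, candidate: str = "ext4 noatime") -> float:
    """Percentage by which the candidate's IOPS exceed the baseline's."""
    return (postmark_iops[candidate] / postmark_iops[baseline] - 1) * 100


for name in postmark_iops:
    if name != "ext4 noatime":
        print(f"ext4 noatime vs {name}: {advantage_over(name):+.0f}% IOPS")
```

Keep in mind this is one desktop with 1GB RAM, so treat the percentages as a rough indication and not as a general result.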

D

Sat
13
Jun '09

About the Data Domain acquisition – and is EMC really the best place for Data Domain?

Much has already been written about this imminent acquisition of Data Domain by either NetApp or EMC and, since opinions are like you-know-what, and I have one, here it is… if I ramble, forgive me. I have too much to say and I’m trying to be PC… I wrote and subsequently erased all kinds of stuff that could probably get me in trouble (the more you work with a company the more dirt you uncover, and I have several earth movers’ worth).

I do think that both companies waited too long to try and acquire Data Domain – frankly, it’s staggering to me that other companies that make decent products like CommVault haven’t been acquired yet (I mean, seriously, if EMC want to compete in the backup software space they should just drop Networker and buy CommVault). Consolidation is the trend…

Maybe both NetApp and EMC thought their in-house deduplication would work out for everything, maybe they thought Data Domain wouldn’t become a contender. Maybe they thought it was just a phase. Either way, the backup market is still strong, most people don’t want to move en masse to something like Avamar, not everyone needs VTL, and Data Domain does provide a very convenient way to keep using your existing backup product, make next to no changes, and get better efficiencies.

The simple truth is that EMC needed SOMETHING to combat Data Domain so they signed the agreement with Quantum and rushed the product to market. And then tried to strong-arm the resellers into forgetting about Data Domain and instead selling the new and amazing DL3D (that backfired BTW).

As far as EMC is concerned, the attempt to acquire Data Domain is a slap in the face for Quantum and all the customers that have been pitched/sold DL3D (the OEM’ed Quantum DXi product). EMC has spent quite a bit of time belittling Data Domain and instead pushing a product that has seen very limited testing (I know, I’ve been burned personally by it several times). A good example: EMC recently released a patch to allow backups done with EMC’s Networker to actually be deduplicated (talk about a reason to return a product if there ever was one – like a car that can’t go faster than 10 mph or that gets 2 mpg instead of 20 mpg). You see, there was an issue with the filter that figures out what backup app you’re using, and Networker backups were getting only plain old compression, NO deduplication. This is no secret, if anyone bothers to read the release notes of the recent patches they’ll see this info. Maybe if you’re a DL3D customer you should insist on reading the release notes if they’re not easily available? After all, you have a right to know what’s changing!

Think about this: EMC’s own backup product was not tested with DL3D. Yet EMC happily sold DL3D to customers with Networker. To me, this is a sales-driven company, not a customer-driven company.

Not to mention other crippling bugs, slow startup times (especially in the case of unclean shutdowns) and the abysmal performance which simply stems from how the product is designed – it’s spindle-happy and needs about 2 trays of drives to work well. Oh, and don’t EVER fill it beyond 80% capacity. You’re also not supposed to use it as a normal CIFS/NFS share for archiving anything like email or normal files (arguably a great place for dedup).

So, EMC knew about the DL3D issues (well, some of them, it’s not their product after all, indeed I helped them identify some of the bugs) and played coy with customers. Then, they saw NetApp making a move for Data Domain and realized that by buying Data Domain EMC could accomplish several things:

  • Minimize NetApp’s cash reserves if NetApp does in the end succeed in acquiring Data Domain (but is that necessarily a bad thing for NetApp?)
  • Remove the flailing DL3D and replace it with a product that actually works and is selling very well
  • Get a bunch of solid deduplication and consistency checking algorithms
  • Assimilate a competitor that’s been a huge thorn in EMC’s side in that space
  • Reduce the efficiency of NetApp as a competitor

But think from the customer standpoint for a minute (most of the analysts so far seem to miss the most important player here – and that’s certainly not EMC, NetApp or Data Domain, but the customer). You’ve been pitched DL3D, and now you must forget about that and all the bad things you were told about Data Domain – it’s all good now that it belongs to EMC, you’ll be taken care of. Or you can buy the DL3D if you still want it (and I don’t see EMC derailing ANY existing DL3D campaign, no matter what).

If I were a DL3D prospect/customer, I’d be worried no matter what.

Let’s talk about the best place for Data Domain to end up. As far as investors go of course, if they want to make a quick buck and run, the EMC cash offer is tantalizing. But for Data Domain employees, EMC can be a black hole and the added complexity and bureaucracy anything but fun. EMC has become almost too diversified – let’s look at just some of EMC’s storage solutions (I won’t mention the software since then it’d be a REALLY long and weird post):

  • Symmetrix
  • Clariion
  • Celerra
  • Centera
  • Atmos
  • EDL
  • DL3D
  • RecoverPoint
  • Avamar (that’s both a software solution and an appliance)

What’s interesting is that, by and large, the teams in charge of the above products don’t talk much, if at all, with each other. Talk about islands! And, when it comes to sales, EMC has internally competing groups of people that sell the above products – for instance, “NAS overlay” guys only get paid on Celerra sales, and I’ve seen them screw up campaigns that were clearly a pure Clariion play just so they could somehow get some Celerra in so they get paid. The basic EMC sales guy you meet can sell them all and indeed doesn’t care, but the people he relies on for support cannot sell them all and do care about what gets sold. It’s all very fragmented and, again, not a model that operates with the customer’s best interests always in mind. It always baffled me why EMC would allow so much fluff in their sales organization.

So, if Data Domain got absorbed, they’d probably not be enjoying all the “melting pot” advantages the EMC corporate bloggers seem so keen on advertising, and the “large startup” feel (maybe it’s like that in MA for a few chosen people – in most other locations it’s decidedly not like that). They’d just be another acquired unit, internally competing with other units, dealing with large-company politics and other inefficiencies. The EMC stock wouldn’t really become much higher than it is now, if at all. It’s been about the same for quite some time now.

Let’s examine the scenario of NetApp buying Data Domain:

  • NetApp is much more focused than EMC – indeed they have literally less than a handful of major offerings that don’t really compete with each other
  • The NetApp sales force is unified and doesn’t internally compete about what to sell
  • NetApp culture is much closer to Data Domain culture
  • It’s not good for innovation to have one company hoarding 3 dedup technologies, NetApp + Data Domain will actually push EMC more and be better for the customers
  • Data Domain could make NetApp much stronger against EMC, in turn driving NetApp’s stock price up significantly. Which, in turn, would give investors back much more than $2bn, thereby making this the better deal.

The only drawback I see (as do most writing about this) is NetApp’s relatively poor history in managing the few acquisitions they’ve made. But I believe that as long as they leave Data Domain alone and slowly try to integrate the technology in the other products it will all work out.

Hopefully all this made some sense…

D

Mon
18
May '09

Linux filesystem benchmark extravaganza - including Deadline vs CFQ schedulers and ext4 instability

I have some spare time these days so I figured I’d finally test as many filesystems on Linux as I could…

The new ext4 is an option with modern kernels so I loaded Ubuntu 9.04 and tried postmark and bonnie++ on the same partition using various filesystems and switching between the CFQ and Deadline schedulers.

Switching schedulers permanently can be achieved by changing the boot options and appending, say, elevator=deadline, but you can also switch them on the fly by running the following:

echo deadline > /sys/block/sda/queue/scheduler

You can check what’s currently selected by simply typing

cat /sys/block/sda/queue/scheduler

You’ll get back something like:

noop anticipatory [deadline] cfq

The scheduler in brackets is the currently selected one.
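
If you want to check the active scheduler from a script (say, before kicking off a benchmark run), a minimal Python sketch follows. It only parses the bracketed entry from the sysfs line shown above; the function names and the default device name are my own choices, not part of any tool.

```python
import re


def active_scheduler(scheduler_line: str) -> str:
    """Return the scheduler name that appears in square brackets."""
    match = re.search(r"\[([^\]]+)\]", scheduler_line)
    if match is None:
        raise ValueError(f"no active scheduler marked in: {scheduler_line!r}")
    return match.group(1)


def read_active_scheduler(device: str = "sda") -> str:
    """Read the same sysfs file as the cat command above for one block device."""
    with open(f"/sys/block/{device}/queue/scheduler") as f:
        return active_scheduler(f.read())


print(active_scheduler("noop anticipatory [deadline] cfq"))  # deadline
```

read_active_scheduler() needs a Linux box with that device present; active_scheduler() works on any captured output.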

Reader beware: Running postmark on ext4 locked up the system repeatedly during the transaction phase of the benchmark, using either my own compiled version or the one from the repository, so obviously there is some issue there and I cannot at this time recommend ext4; no other filesystem caused lockups. I did run bonnie++ as well since that didn’t crash with ext4.

The objective of this exercise wasn’t to show which filesystem is fastest, but rather to illustrate that, depending on what you want to do, you may want to re-examine the choice of filesystem and scheduler with your application if you’re running Linux. BTW the current recommendation for Databases and fast intelligent external arrays – and Ubuntu’s default in the server edition – is the Deadline scheduler, and not CFQ. However, all other distributions at the moment use CFQ!

So, without further ado, some benchmarks… (I’m not including the entire postmark output since it would be far too large, I just kept the most important metrics, anyone that wants the entire results is more than welcome to send me an email and I’ll hook you up).

Postmark MB/s:

Filesystem      | Read MB/s | Write MB/s | IOPS
Reiser CFQ      | 4.85      | 10.25      | 227
Reiser Deadline | 5.38      | 11.35      | 246
XFS CFQ         | 2.33      | 4.93       | 109
XFS Deadline    | 2.35      | 4.97       | 105
XFS Tuned       | 2.73      | 5.76       | 120
JFS CFQ         | 1.75      | 3.69       | 78
JFS Deadline    | 1.73      | 3.65       | 76
Ext3 CFQ        | 2.71      | 5.73       | 115
Ext3 Deadline   | 2.86      | 6.03       | 122

 

[Chart: Postmark MB/s]

Postmark IOPS:

[Chart: Postmark IOPS]

Bonnie++ write speed:

Filesystem      | IOPS | Block writes KB/s | Rewrite KB/s
Reiser CFQ      | 428  | 31657             | 18199
Reiser Deadline | 462  | 32290             | 18154
XFS CFQ         | 471  | 39901             | 18557
XFS Deadline    | 483  | 39840             | 19653
XFS Tuned       | 592  | 40604             | 20746
JFS CFQ         | 433  | 31651             | 18528
JFS Deadline    | 452  | 39106             | 18755
Ext3 CFQ        | 403  | 31108             | 17235
Ext3 Deadline   | 338  | 31803             | 17885
Ext4 CFQ        | 451  | 39265             | 18519
Ext4 Deadline   | 446  | 39257             | 18221

[Chart: Bonnie++ write speed]

Bonnie++ IOPS:

[Chart: Bonnie++ IOPS]

Observations:

The Deadline scheduler seems to be consistently better for anything that’s not ext-based! A lot of work has been done on the Linux kernel to optimize it for the ext2-3-4 filesystems, and that shows. However, depending on what you want to do, ext3 may not be the best option (I don’t know yet about ext4 for postmark-type loads but based on the bonnie++ results it’s solid).
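
For anyone who wants to see the scheduler effect as percentages, here is a short Python sketch using the Bonnie++ IOPS figures from the table above (CFQ first, Deadline second). The names are illustrative, and the Postmark numbers can be plugged in the same way if you want to compare.

```python
# Bonnie++ IOPS from the table above: (CFQ, Deadline) per filesystem.
bonnie_iops = {
    "Reiser": (428, 462),
    "XFS": (471, 483),
    "JFS": (433, 452),
    "Ext3": (403, 338),
    "Ext4": (451, 446),
}


def deadline_gain_pct(cfq: int, deadline: int) -> float:
    """Percentage change in IOPS when switching from CFQ to Deadline."""
    return (deadline - cfq) / cfq * 100


gains = {fs: deadline_gain_pct(*pair) for fs, pair in bonnie_iops.items()}
for fs, gain in gains.items():
    print(f"{fs}: {gain:+.1f}% IOPS with Deadline vs CFQ")
```

On these numbers the non-ext filesystems gain with Deadline while ext3 and ext4 lose, which is the pattern described above.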

Here’s a list of some considerations:

  • Will the filesystem host many many small files or a few large ones? Reiser still rules the “many small files” use case, by far. The rest are fairly close, and JFS seriously lags. For large files, XFS is great.
  • Do you care if the filesystem takes a long time to fsck? Ext3 still takes quite long, whereas something like XFS doesn’t. Ext4 should remedy this.
  • Do you care for something that’s still actively being maintained? In this case only ext3-4 and XFS are the options.
  • Do you want defrag tools? Choose wisely since few filesystems do (XFS and ext4).

My current overall recommendation is XFS since it’s mature and also very tunable. For reference, here’s how I got the better results for XFS (the results in the graphs for tuned XFS were with the deadline scheduler):

mkfs.xfs -f -d agcount=4 -l lazy-count=1 -l size=64m /dev/sda7

mount -o nobarrier,noatime,nodiratime,logbufs=8 /dev/sda7 /test

Don’t just follow the above blindly, normally mkfs tries to auto-adjust those (i.e. the agcount) but the important ones to look for are the log size and the mount options, especially the nobarrier and logbufs. Remember though that nobarrier is only recommended if you have battery backup.

D

Mon
27
Apr '09

So, what’s the best way to back up VMs?

Backing up VMs seems to be one of the topics nobody seems able to agree on despite a plethora of reading material on the subject… and maybe because of said plethora.

I will focus on VMware since it is the leading and prevalent virtualization method in the marketplace today (I’m sure the KVM, Xen and Hyper-V fanboys will have their 15 minutes of fame someday).

VMware has several ways for backing up VMs:

  1. Install a backup agent in the VM, just as with a normal client
  2. Back up the entire VM by installing a backup agent in the ESX console
  3. Use VCB (VMware Consolidated Backup).

     

They all have their pros and cons so the short answer to the topic is that there’s no best method, instead you’ll get the “it depends” answer. Sorry. Here’s the skinny on each method:

 

1. Install a backup agent in the VM, just as with a normal client

 

Pros:

  • Everyone understands this, since it works just like a real physical client and can do most of the same things
  • Can do incrementals
  • File-level recovery is straightforward with no confusion as to which VM owns which file
  • Advanced backup features such as DB agents work fine

 

Cons:

  • Impact on the host and network
  • Deployment just as difficult as when using the physical clients
  • Can make backup software licensing more expensive than needed
  • Bare-metal-recovery of VMs only a bit less difficult than with physical boxes

 

2. Back up the entire VM by installing a backup agent in the ESX console

 

Pros:

  • Licensing cost for backup software minimized (1 license needed per ESX server)
  • The entire VM is backed up so recovery is like Bare Metal Recovery – you’ll get the entire box back with a very high probability of success
  • Fast since the virtualization layers are bypassed

 

Cons:

  • Still significant impact on the host and network
  • Cannot restore individual files
  • Advanced backup agents won’t work (no hot backups of SQL or Exchange, for instance)
  • Backups always large since a full backup is required every time
  • Backups take long (see previous point)
  • Requires some scripting knowledge to deploy properly.

 

3. Use VCB (VMware Consolidated Backup).

 

Pros:

  • Works with most backup software
  • Almost no impact on the host or network (backups can be entirely SAN-based)
  • Reduced backup software licensing cost
  • Works with VSS in Windows to provide better backup reliability
  • Allows for incremental backups
  • Uses VM snapshots
  • No disk space used for staging of incrementals
  • Very simple DR
  • File-level backups are possible

 

Cons:

  • Cannot back up RDMs in Physical Compatibility Mode
  • Advanced functionality (file-level backups and application integration) won’t work with non-Windows VMs
  • Cannot back up clustered VMs (i.e. MSCS-clustered VMs can’t be backed up)
  • FullVM backup speed is limited to 1GB/min (a limitation of Windows’ cmd.exe; you can get around it by creating multiple threads, I guess, but you could have speed issues if the jobs are large and you cannot break them up)
  • Significant disk space needed for Holding Tank (where FullVM copies are placed)
  • Advanced backup agents will not work
  • File-level backups won’t back up the Windows registry
  • File-level recovery is complex and generally a two-step process

 

The lists could go on but as you can see there are serious wrinkles with all the approaches.
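
To see what the 1GB/min FullVM limit from the VCB cons means for a backup window, here is a rough Python sketch. The VM sizes and stream counts are hypothetical, and it assumes parallel streams scale perfectly, which real environments rarely manage.

```python
def fullvm_backup_hours(vm_size_gb: float, streams: int = 1,
                        gb_per_min_per_stream: float = 1.0) -> float:
    """Estimate FullVM backup time in hours at a fixed per-stream rate.

    Assumes the job can be split evenly across streams (perfect scaling).
    """
    if streams < 1:
        raise ValueError("need at least one stream")
    minutes = vm_size_gb / (gb_per_min_per_stream * streams)
    return minutes / 60


# Hypothetical amounts of VM data to protect in one window.
for size_gb in (100, 600, 2000):
    single = fullvm_backup_hours(size_gb)
    four = fullvm_backup_hours(size_gb, streams=4)
    print(f"{size_gb} GB: {single:.1f} h with 1 stream, {four:.1f} h with 4 streams")
```

If a single large VM cannot be broken up, the one-stream number is the one that counts.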

The problem is compounded by the fact that most modern backup software has arcane licensing schemes: for instance, pricing that depends on whether an agent is on a VM or not (CommVault), or allowing unlimited agents per ESX server as long as you buy the more expensive client license for the ESX server (NetBackup), and various permutations thereof.

Another wrinkle is Deduplication. Products that do source-based Deduplication such as EMC’s Avamar can comfortably have their agents inside the VMs or in the service console since subsequent backups take only a fraction of the time and there’s almost no space penalty. So, with Avamar one could be doing both kinds of backup (entire VM and individual files) and be covered both ways and only worrying about time and space when reading Hawking’s books… The negative is cost.

NetBackup offers another interesting twist since their implementation of VCB allows individual files to be recovered from a FullVM backup – the rationale being that you use their PureDisk Deduplication to store everything in order to reduce the expense of backup disk.

In the end, the only recommendation I can give that doesn’t depend too much on your individual circumstances is to try and do both file-level and FullVM-type backup so that you’re covered in multiple ways. Then replicate those backups, etc… you know the drill by now.

D

 

 

Thu
5
Feb '09

The true XIV fail condition finally revealed (?)

I just got this information:

For XIV to be in jeopardy you need to lose 1 drive from one of the host-facing ingest nodes AND 1 drive from the normal data nodes within a few minutes (so there’s no time to rebuild) while writing to the thing.

Have no way of confirming this but it did come from a reliable source.

A customer recently tried pulling random drives and XIV didn’t shut down and was working fine, but they were from the data nodes.

Why can’t anyone post something concrete here? I’m sure IBM won’t post since the confusion serves them well.

For what it’s worth, the customer is really happy with the simplicity of the XIV GUI.

D

Mon
5
Jan '09

So what exactly is IBM trying to do with the XIV?

By now most people dealing with storage know that IBM acquired the XIV technology. What IBM is doing now is trying to push the technology to everyone and their dog, for reasons we’ll get into…

I just hope IBM gets their storage act together since now they’re selling products made by 4-5 different vendors, with zero interoperability between them (maybe SVC is the “one ring to rule them all”?)

In a nutshell, the way XIV works is by using normal servers running Linux and the XIV “sauce” and coupling them together via an Ethernet backbone. A few of the nodes get FC cards and can become FC targets. A few more of the features:

  • Thin provisioning
  • Snaps
  • Synchronous (only) replication
  • Easy to use (there’s not much you can do with it)
  • Uses RAID-X (no global spares; instead, spare space is reserved on each drive, which makes faster rebuilds possible)
  • Only mirrored
  • A good amount of total cache per system since each server has several GB of RAM BUT the cache is NOT global (each node simply caches the data for its local disks).

IBM claims insane performance numbers with the XIV (“it will destroy DMX/USP!” — sure). But let’s take a look at how everything looks:

  • 180 maximum (or minimum) drives (you can get a half config, but I think you always get the 180 drives and license half; I might be mistaken. I believe you have to commit to buying the whole thing within 1 year)
  • Normal Linux servers do everything
  • Only SATA
  • The backbone is Ethernet, not FC or Infiniband (much, much higher latency is incurred by Ethernet vs the other technologies)

IBM claims they can sustain high speed by keeping the SATA drives from becoming bound by their low transactional performance relative to 15K FC drives or, even more so, SSDs. From what I understand (and IBM employees feel free to chime in) XIV:

  1. Ingests data using a few of the front-end nodes
  2. Tries to break up the datastream into 1MB chunks
  3. The algorithm tries to pseudo-randomly spread the 1MB chunks and mirror them among the nodes (the simple rule being that a 1MB chunk cannot have a mirror on the same server/shelf of drives!)
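
The three steps above can be sketched as a toy in Python. To be clear, this is my illustration of the idea as described, not IBM’s actual algorithm: the node count, the seed and every function name are assumptions, and a real system would also balance capacity and handle failures.

```python
import random

CHUNK_SIZE = 1024 * 1024  # 1MB chunks, as described above


def split_into_chunks(data: bytes, chunk_size: int = CHUNK_SIZE) -> list[bytes]:
    """Break an incoming data stream into fixed-size chunks (the last may be short)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def place_chunks(num_chunks: int, nodes: list[str], seed: int = 0) -> list[tuple[str, str]]:
    """Pick a (primary, mirror) node pair per chunk, pseudo-randomly.

    The only rule enforced is the one from the text: a chunk's mirror
    never lands on the same node as its primary copy.
    """
    if len(nodes) < 2:
        raise ValueError("mirroring needs at least two nodes")
    rng = random.Random(seed)
    # rng.sample returns distinct elements, so primary != mirror.
    return [tuple(rng.sample(nodes, 2)) for _ in range(num_chunks)]


nodes = [f"node{i}" for i in range(15)]          # hypothetical node count
chunks = split_into_chunks(bytes(5 * CHUNK_SIZE + 123))
layout = place_chunks(len(chunks), nodes)
for idx, (primary, mirror) in enumerate(layout):
    print(f"chunk {idx} ({len(chunks[idx])} bytes): primary={primary}, mirror={mirror}")
```

Even in the toy you can see why this spreads load nicely and why every write costs two chunk-sized writes on the back end.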

Obviously, by doing effectively as much as possible large block writes to the SATA drives and using the cache to great effect, one should be able to see the 180 SATA drives perform pretty much as fast as possible (ideally, the drives should be seeing streaming instead of random data). However (there’s always that little word…)

  1. There is no magic!
  2. If the incoming random IOPS are coming at too great a rate (OLTP scenarios), any cache can get saturated (the writes HAVE to be flushed to disk, I don’t care what array you have!) and it all boils down to the actual number of disks in the box. The box is said to do 20,000 IOPS if that happens - which I think is optimistic at 111 IOPS/drive! At any rate, 20,000 IOPS is less than what even small boxes from EMC or other vendors can do when they run out of cache. Where’s the performance advantage of XIV?
  3. The “randomization removing algorithm”, if indeed there’s such a thing in the box, will have issues with more than 1-2 servers sending it stuff
  4. See #1!
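
The 111 IOPS/drive figure in point 2 is simple division, shown below in Python along with what the box would deliver at more conservative per-drive rates. The 75 and 90 IOPS values are my own rule-of-thumb assumptions for 7.2K SATA, not measurements, and mirroring overhead on writes is ignored.

```python
TOTAL_DRIVES = 180      # full XIV config, from the text
CLAIMED_IOPS = 20_000   # cache-saturated figure quoted above

per_drive = CLAIMED_IOPS / TOTAL_DRIVES
print(f"Claimed rate works out to {per_drive:.0f} IOPS per drive")


def spindle_bound_iops(drives: int, iops_per_drive: float) -> float:
    """Aggregate random IOPS once cache is exhausted and the spindles are the limit."""
    return drives * iops_per_drive


# Hypothetical per-drive rates for 7.2K SATA under random load.
for assumed in (75, 90):
    total = spindle_bound_iops(TOTAL_DRIVES, assumed)
    print(f"At {assumed} IOPS/drive: {total:,.0f} IOPS total")
```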

Like with anything, you can only extract so much efficiency out of a given system before it blows up.

An EMC CX4-960 could be configured with 960 drives. Even assuming that not all are used due to spares etc., you are left with a system with over 5 times the number of physical disks vs an XIV, tons more capacity etc. Even if the “magic” of XIV makes it more efficient, are those XIV SATA drives really 5 times more efficient? (5 times would make it EQUAL to the 960’s performance; XIV would have to be well over 5 times more efficient than an EMC box of equivalent size to beat the 960.)
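
The drive-count argument is easy to check in Python. The sketch below treats every drive as equal (which flatters the SATA-only XIV), and the 930 figure for drives left after spares is a made-up number for illustration.

```python
XIV_DRIVES = 180
CX4_960_MAX_DRIVES = 960


def required_efficiency(competitor_drives: int, own_drives: int) -> float:
    """Per-drive efficiency multiple needed just to MATCH a box with more spindles."""
    return competitor_drives / own_drives


full = required_efficiency(CX4_960_MAX_DRIVES, XIV_DRIVES)
after_spares = required_efficiency(930, XIV_DRIVES)  # hypothetical: 30 drives held back
print(f"Fully populated: {full:.2f}x per drive to break even")
print(f"With spares held back: {after_spares:.2f}x per drive to break even")
```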

Let’s put it this way:

If my system was as efficient as IBM claims, and I had IBM’s money, I’d buy all the competitive arrays, even at several times the size of my box, and publicize all kinds of benchmarks showing just how cool my box is vs the competition. You just can’t find that info anywhere, though.

Regarding innovation: Other vendors have had similar chunklet wide striping for years now (HP EVA, 3Par, Compellent if I’m not mistaken, maybe more). 3Par for sure does hot sparing similar to an XIV (they reserve space on each drive). 3Par can also grow way bigger than XIV (over 1,000 drives).

So, if I want a box with thin provisioning, wide striping, sparing like XIV but the ability to choose among different drive types, why not just get a 3Par? What is the compelling value of XIV, short of being able to push 180 SATA drives well? Nobody has been able to answer this.

I’m just trying to understand XIV’s value prop since:

  1. It’s not faster unless you compare it to poorly architected configs
  2. It has less than 50% efficiency at best, so it’s not good for bulk storage
  3. It’s not cheap from what I’ve seen
  4. Burns a ton of power
  5. Cannot scale AT ALL
  6. Cannot tier within the box (NO drive choices besides 1TB SATA)
  7. Cannot replicate asynchronously
  8. Has no application integration
  9. No Quality of Service performance guarantees
  10. No ability to granularly configure it
  11. Is highly immature technology with a small handful of reference customers and a tiny number of installs! (I guess everyone has to start somewhere but do YOU want to be the guinea pig?)

Unless your needs are exactly what XIV provides, why would you ever buy one? Even if your space/performance needs are in the XIV neighborhood there are other far more capable storage systems out there for less money!

IBM is not stupid, or at least I hope not. So, what IBM is doing is pretty much handing out XIVs to whoever will take one. If you get one, think of yourself as a beta tester, because I can hardly believe that IBM bought the XIV IP without seeing some kind of roadmap; otherwise the purchase would be kinda stupid! If you are a beta tester, be aware that:

  • XIV cheats with benchmarks that write zeros to the disk or read from addresses that have not previously been accessed
  • XIV will be super-fast with 1-2 hosts pushing it; push it realistically, with a real number of hosts
  • Try to load up the box since if it’s not full enough you’ll get an extremely skewed view of performance - put even dummy data inside but fill it to 80% and then run benchmarks!
  • Test with your applications, not artificial benchmarks
  • Do not accept the box in your datacenter before you see a quote! In at least 3 cases that I know of IBM drops off the box without giving you even a ballpark figure. I think that’s insane.

And last, but not least: I keep hearing and reading about the following being true, I’d love IBM engineers to disprove it:

If you remove 2-3 drives from different trays simultaneously from a loaded system then you will suffer a catastrophic failure (logically makes sense looking at how the chunks get allocated but I’d love to know how it works in real life). And before someone tells me that this never happens in real life, it’s personally happened to me at least once (lost 2 drives in rapid succession) and to many other people I know that have any serious real-world experience…

D


On current smartphones

The time has come for me to get a new phone (my current one can’t keep up with the demands and the speed or lack thereof ends up frustrating me).

So I’ve been looking at the plethora of devices out there - Berries, Windows Mobile, iPhone, etc (disclaimer: I’ve been a Blackberry user for many years now).

For me, the ideal mobile device needs to:

  • Synchronize seamlessly all my Exchange stuff
  • Be able to display PDF and office docs (not necessarily edit them)
  • Be a great phone (reception, clarity)
  • Have tethering ability
  • Be fast when I multitask on it
  • Offer GPS (almost all current ones do)
  • Have a decent supply of third-party apps
  • Be able to last me for a whole day (NOT just a business day) of pretty heavy usage
  • Have not so much an intuitive OS but an OS that doesn’t get in the way
  • Let me input text very, very fast (I’m writing this on my phone now)
  • Be tough (Mil-Spec would be great)
  • The ability to play music/videos is not essential but is nice to have (all do it now)
  • Camera nice-to-have but it doesn’t need to be amazing
  • Should be able to have a decent web browser
  • Shouldn’t be ridiculously huge…

So here’s the Executive Summary if your needs coincide with mine:

- Get a Blackberry Bold

For the nitty-gritty:

We HAVE to mention the iPhone, it’s a marvel of social engineering, industrial design and amazing marketing/branding. Of course the battery life utterly sucks if you try to use it the way I’d need to but, most importantly: I cannot type on the sucker! I don’t have abnormally large sausage fingers, indeed I believe my digits are downright elegant, yet I just cannot type fast or accurately on the iPhone (this paragraph might have taken me 10 mins to write on it). So we strike that one out.

Then you have the new Blackberry Storm, also touch-screen. On this one, the entire screen is a gigantic button that you need to press in order for it to register. I found that this approach seems to make it way more accurate for me than the iPhone. The battery life and build are also great. Too bad that the hardware can’t keep up, it feels decidedly slow, more so than the iPhone. Scratch that one too.

Then we have the narrow-form-factor Berries. Can’t type on them quickly. Out they all go.

Next are the various and sundry Windows Mobile devices. Almost too much choice here, huge third-party support, some great hardware from a few vendors. But I find that the OS really gets in my way and all of them also feel amazingly slow. Battery life is no great shakes, either.

Nokia has some good ones, the E71 is my favorite, but they don’t sync that elegantly with Exchange plus the keyboard is weird. Great build, though. If you like its keyboard go for it. OS can take some getting used to…

What remains is the Blackberry Bold. Sure, size-wise it feels like holding a slipper against your head (fortunately size was never a very important criterion) but it passes almost all the other tests! It also lets you send/receive emails while on the phone, feels fast, and has an amazing keyboard. Probably because it’s slipper-sized…

Well-made but it’s so nice that it needs a decent case to get ruggedized so you keep it looking nice, in which case it’ll look more like a size 13 boot against my face and I won’t be able to see just how nice it is anyway.

Am I alone in believing that many people would gladly pay a premium for a sleek, ruggedized device that doesn’t look like a Casio G-Shock? I’d be totally OK with the silly and easily disfigured plastic chromed bits being replaced by Kevlar or rubber, a scratch-proof screen, the ability to withstand immersion for 30′, successful drop-testing from 1 story to concrete, flexible circuit boards (not the ultra-thin ribbon type, you can get boards that are almost rubbery), Mil-Spec connectors, port covers…

It’s all possible, it just adds to the cost. But I guarantee most professionals will pay $100-200 more for the ruggedized model that doesn’t need clunky cases. Ericsson, Siemens and Nokia all had standard phones (never made it to the US) that would fit the above description with the exception of the scratch-proof screen (the Ericsson one was pretty amazing - they suggested you wash it to get rid of the dirt - albeit pretty large), but they slowly stopped making them. They weren’t even much more expensive than the plain models!

The old, thick Blackberries used to be pretty tough, I dropped mine onto concrete many, many times (drop-kicked it once) and the only damage was that the vibrating thingy inside stopped working 100% of the time, a no-no among Berry addicts. It did look scratched but it wasn’t painted on so the scratches weren’t that visible.

I hear the iPhone can be tough, at least the original one. The 3G - not so sure. A colleague had his stop working after he dropped it 3 feet. It landed on its back (should be an easy knock to absorb), you can’t even see a scratch on it. Unacceptable, IMO.

D

Sat
27
Dec '08

So, how frequently do you really test DR?

It’s after 4AM, I can’t sleep since I’m in pain after a car accident and I’ve had altogether too much caffeine. I’ve already watched 3 movies. BTW, “I am Legend” - WTF! Never have I seen a decent book butchered so much! The ideas in the book were so much stronger. Seriously, go get the book and forget the movie. Sorry, Will.

Now I’m writing from The Throne Chamber once more (blessed be the Colon Drano caffeine). I’m all cramped up and can’t get up, so I thought why not post something… can’t promise it will make sense since my brain ain’t the clearest at the moment…

So - when was the last time you tested DR? Really?

If I had a penny for every time I heard the line “we back up our servers to tape but we don’t test DR, but we’re confident we’ll be up and running within 36 hours in the event of a disaster” I’d be paying Trump more money than he ever made just so he can shine my shoes, and he’d be thankful.

Let me make something clear: You need to test DR a minimum of twice a year, preferably once a quarter. Anything less and you’re just setting yourself up for failure.

Start by testing the most important machines. You probably won’t even have to artificially inject extra problems to solve (Pervy Uncle Murphy usually is right there beside you to take care of that). Marvel at how long it really takes.

If things go real peachy, did you hit your RPO and RTO? If yes, test with more machines, until you can test with the full complement of boxes your company truly needs to be up and running and making money. Document it all.

If you didn’t hit your RPO/RTO, how much did you miss them by? If it’s by a ridiculous amount, maybe the way you’re going about DR will simply not work - try replication and/or VMware…
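
To make “how much did you miss them by” concrete, here is a minimal sketch of a scorecard for one DR test run. The system names, targets and measured times are entirely hypothetical; the only point is to record target vs actual for every box you test and see the size of the miss.

```python
from dataclasses import dataclass


@dataclass
class DrTest:
    system: str
    rto_target_hours: float   # how quickly it must be back
    rto_actual_hours: float   # how long the test really took
    rpo_target_hours: float   # how much data loss is tolerable
    rpo_actual_hours: float   # how old the recovered data really was


def misses(test):
    """Hours over target for RTO and RPO; 0 means the target was met."""
    return {
        "rto_miss": max(0.0, test.rto_actual_hours - test.rto_target_hours),
        "rpo_miss": max(0.0, test.rpo_actual_hours - test.rpo_target_hours),
    }


def passed(test):
    m = misses(test)
    return m["rto_miss"] == 0 and m["rpo_miss"] == 0


# Hypothetical results from a single test run.
results = [
    DrTest("mail", 36, 52, 24, 24),
    DrTest("erp-db", 36, 30, 24, 20),
]

for t in results:
    print(t.system, "PASS" if passed(t) else f"MISS {misses(t)}")
```

Keep the output from every test; the trend over a few quarters tells you whether the approach is working or needs to change.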

Once you get good at it, start inventing scenarios. For instance:

- Pretend one of your tapes is bad. See how long your offsite vendor takes to bring you a fresh set once you figure out which barcodes you need.
- Pretend one of the critical servers can’t be recovered and you need to go back 3 weeks. How does this affect the business?
- Recover to dissimilar hardware.
- Pretend you’re dead. Are your documented procedures clear enough for your underling to follow? Are they clear enough for the janitor? The janitor’s 3-year-old kid? The kid’s parakeet? Ultimately, your DR runbooks need to be so clear that even your CEO can follow them easily, and he needs to be able to do so right out of bed, before he’s had his morning ablutions, quad-vanilla-soy-latte and his Zoloft.

Ultimately (and sorry if I’m repeating myself), you probably need to be making at least 2 tape copies, 2 copies of your backup catalog, replicating (ideally CDP) and using VMware all at the same time to have any real insurance policy against disaster.

And if you ever tell me “well, we don’t have the time to be doing DR tests” - do you really think you’ll have the time once disaster really strikes?

And, if you think that a disaster is an RGE (Resume Generating Event) then you probably are working for the wrong company and won’t get much job satisfaction there anyway.

I think I’d better get up before I lose my legs.

Nighty-night

D

Mon
22
Dec '08

My frustration with the quality and education of CIOs, CTOs, IT Directors, what have you… what caliber IT manager should you choose?

As a matter of course, I deal with all kinds of IT manager types during the course of a campaign.

Sometimes said managers are well-versed in technology. Other times they have biases, are bigoted, and so on. Which is fine, I’m more opinionated, cranky and obnoxious than most.

It agitates me to encounter IT management types that:

  • Have no technology experience
  • Have no concept of how IT relates to the business
  • Have no idea how much technology costs
  • Have no idea how much being penny wise and dollar foolish can hurt their business
  • Cannot recognize an amazing deal due to their lack of a holistic viewpoint.

However, as annoying as the above bullets are, someone with sufficient intelligence and perseverance that cares will eventually “get it” and become able to at least have a conversation about technology. No, what bothers me more are the managers that:

  1. Do not care about technology
  2. Were promoted “from within” because they either knew someone or they were just the nearest body whose temperature was higher than ambient and are also guilty of #1
  3. Have an IQ less than their shoe size (US units)
  4. Are unable to delegate
  5. Are unable to pick proper subordinates (invariably they pick people whose IQ is in the single digits)
  6. Due to their unbelievable ignorance, pathologically distrust whatever vendors tell them or (the even more irksome)
  7. Get blinded by inane and irrelevant marketing gimmicks (look, the box can do a million IOPS with 10 drives, yours is nowhere near as fast!)
  8. Just believe whatever the last vendor told them
  9. Do not value the work solid partners do for them - there are truly few people that will actually add value, instead of just wanting to take your money!

I lost a couple of deals recently because of #7 and #8. If you’re reading this, you know full well who you are. Here’s an example - would you not be pissed if:

  • You educated the customer far more than any other vendor - they freely admitted they had no idea what they’d need and indeed asked you to figure it out and suggest a way forward
  • You analyzed the performance of their environment and properly engineered a solution that will, scientifically, accommodate what they have plus a pre-stated amount of future growth without just throwing product at them
  • You analyzed their actual business needs and where they need to be and provided a plan to get there
  • You used best practices for DR and backups
  • You did it all while being less expensive than the competition, especially when considering the lack of essential features the competition suffered.

And what happens? Next thing you know, they’re picking the competitor that:

  • Is unproven (not even a handful of installs where we are)
  • Does not have useful functionality that they will need a few years down the line (VMware SRM anyone?)
  • Did not educate them - indeed, recommended plainly wrong “best practices” that could bring an iSCSI environment to its knees (interesting what you hear when a storage vendor has no idea about Ethernet, switching, port channeling etc)
  • Blinded them with things like “look we have more cache” or “our box takes more drives!” (they’ll never need them)
  • Did not do thorough (or any) performance analysis (“looking” at random perfmon data doesn’t guarantee success)
  • Cannot even do replication
  • Did not offer them snapshots or any application awareness for backups and DR

I guess I was outsold. As someone I greatly respect and like but am frequently infuriated by likes to say, “tell them what they want to hear”. Maybe I need to become more corrupt.

So what would an ideal IT higher-up look like? I know I could do the job while being drawn and quartered, let alone in my sleep. But I’d get bored. A few pointers on who you should hire:

  1. Someone with real IT experience - ideally someone that started on the operational side and migrated to management
  2. Someone that not just understands but actually likes and appreciates technology (too hard to keep up otherwise)
  3. Someone that understands the financial and business ramifications of action or inaction when it comes to IT purchases
  4. Someone that understands the value of partnerships! Indeed, someone that already has solid partnerships.
  5. Even if you have semi-competent people within, sometimes it’s better to just hire someone with real experience and not wait till the internal hire figures it out, especially if you have projects on the line
  6. Get someone that understands RPO, RTO and what those mean in financial terms
  7. Find someone that used to work in a large corporation but “just had enough” - their experience is invaluable but they’re looking to go to a smaller outfit
  8. They should be able to sell better than most salespeople that visit them!

I could go on but you get the idea.

D

Mon
15
Dec '08

Cinebench benchmarks - performance comparison between Vista 64 and Mac OS X

Been a while since I posted anything - there’s a TON of material but some of us actually do more than blog, it’s quarter/year end, and I barely have time to go to the bathroom…

But this was an easy one so I thought I’d post it real quick. Using Scribefire, a blogging plug-in for Firefox. I hate it.

Disclaimer: The machines used are not identical.

However, the CPUs supposedly are pretty close in speed (2.6 vs 2.8 GHz). Memory is the same.

Graphics are also similar but the Lenovo box has 128MB VRAM whereas the Mac has 512MB and is a faster GPU.

The contenders: Macbook Pro 2.8GHz vs Lenovo T62p (14″ model) running Vista 64, 2.6GHz.

The Mac is running a 32-bit OS (64-bitness is coming with Snow Leopard next year). It also has switchable graphics and one can choose between the on-chipset Nvidia 9400 or the discrete 9600. Typically on-board graphics are pretty crappy.

Despite the dissimilarity of the machines here are some notables:

  • Cinebench really takes off in 64-bit mode in Vista
  • OS X seems to do quite well even though it’s not 64-bit yet
  • The integrated graphics on the new Mac are awesome
  • The discrete graphics are great for a laptop
  • OS X seems to be more efficient than Vista when doing multi-CPU work, at least in this case
  • If someone is looking for a decent modern laptop they can do far worse than the new Macs, even a plain Macbook would be pretty decent

Here’s a chart of the results:

OS/Config                           1-CPU   2-CPU   GFX    Multiprocessor speedup
Macbook Pro 2.8GHz integrated GFX   3208    6051    4813   1.87
Macbook Pro 2.8GHz discrete GFX     3213    5926    6130   1.84
Lenovo 2.6GHz 32-bit                2693    4755    4264   1.77
Lenovo 2.6GHz 64-bit                3040    5367    4256   1.77

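
If you want to play with the numbers yourself, here is a small sketch that recomputes two things from the scores in the chart: the multi-CPU scaling and the 64-bit gain on the Lenovo. I am assuming the “Multiprocessor speedup” column is simply the 2-CPU score divided by the 1-CPU score; the recomputed ratio need not match Cinebench’s own reported figure to the last digit.

```python
# (1-CPU score, 2-CPU score) copied from the results chart.
scores = {
    "Macbook Pro 2.8GHz integrated GFX": (3208, 6051),
    "Macbook Pro 2.8GHz discrete GFX": (3213, 5926),
    "Lenovo 2.6GHz 32-bit": (2693, 4755),
    "Lenovo 2.6GHz 64-bit": (3040, 5367),
}


def mp_speedup(one_cpu, two_cpu):
    """Assumed definition: 2-CPU score divided by 1-CPU score."""
    return two_cpu / one_cpu


def gain_percent(baseline, improved):
    """Relative improvement of one score over another, in percent."""
    return (improved - baseline) / baseline * 100


for config, (one_cpu, two_cpu) in scores.items():
    print(f"{config}: {mp_speedup(one_cpu, two_cpu):.2f}x")

lenovo_32 = scores["Lenovo 2.6GHz 32-bit"]
lenovo_64 = scores["Lenovo 2.6GHz 64-bit"]
gain_1cpu = gain_percent(lenovo_32[0], lenovo_64[0])
gain_2cpu = gain_percent(lenovo_32[1], lenovo_64[1])
print(f"64-bit gain on the Lenovo: {gain_1cpu:.1f}% (1 CPU), {gain_2cpu:.1f}% (2 CPUs)")
```
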
Sun
2
Nov '08

Postmark on late 2008 Macbook Pro

So I’m now the proud owner of a tricked-out 2.8GHz MBP.

Naturally I’ve been tinkering with it already (only had it for 2 days). I’ve disabled swapfile encryption, for instance, and I think it makes it have teh snappy.

I compiled postmark for it with -O3 -m64 and ran the usual. Before doing so though I did disable the Spotlight indexer like this:

sudo launchctl unload /System/Library/LaunchDaemons/com.apple.metadata.mds.plist

PostMark v1.5 : 3/27/01
pm>set number 10000
pm>set transactions 20000
pm>set subdirectories 5
pm>set size 500 100000
pm>set read 4096
pm>set write 4096
pm>run

Time:
273 seconds total
256 seconds of transactions (78 per second)

Files:
20092 created (73 per second)
Creation alone: 10000 files (833 per second)
Mixed with transactions: 10092 files (39 per second)
9935 read (38 per second)
10064 appended (39 per second)
20092 deleted (73 per second)
Deletion alone: 10184 files (2036 per second)
Mixed with transactions: 9908 files (38 per second)

Data:
548.25 megabytes read (2.01 megabytes per second)
1158.00 megabytes written (4.24 megabytes per second)

I then enabled spotlight and re-ran the benchmark:

Time:
483 seconds total
468 seconds of transactions (42 per second)

Files:
20092 created (41 per second)
Creation alone: 10000 files (909 per second)
Mixed with transactions: 10092 files (21 per second)
9935 read (21 per second)
10064 appended (21 per second)
20092 deleted (41 per second)
Deletion alone: 10184 files (2546 per second)
Mixed with transactions: 9908 files (21 per second)

Data:
548.25 megabytes read (1.14 megabytes per second)
1158.00 megabytes written (2.40 megabytes per second)

Obviously spotlight is very aggressive in its indexing and tries to do it ASAP - you lose half your performance when doing metadata-intensive processing. The results though, while sucky for the specs of the box, are far, far removed from (and much better than) what an old colleague got on his beastie: http://recoverymonkey.net/wordpress/?p=62 - granted, my box is faster but it shouldn’t be THAT much faster.
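
For the record, here is a quick sketch of the math behind “you lose half your performance”, using only the transaction rates and total run times PostMark printed in the two runs above. The helper names are mine.

```python
# Figures reported by the two PostMark runs above.
TPS_SPOTLIGHT_OFF = 78      # transactions per second, indexer unloaded
TPS_SPOTLIGHT_ON = 42       # transactions per second, indexer running
TOTAL_SECONDS_OFF = 273
TOTAL_SECONDS_ON = 483


def percent_drop(before, after):
    """How much of the original rate was lost, in percent."""
    return (before - after) / before * 100


def slowdown_factor(seconds_before, seconds_after):
    """How many times longer the whole run took."""
    return seconds_after / seconds_before


tps_drop = percent_drop(TPS_SPOTLIGHT_OFF, TPS_SPOTLIGHT_ON)
slowdown = slowdown_factor(TOTAL_SECONDS_OFF, TOTAL_SECONDS_ON)

print(f"Transaction rate drop with Spotlight on: {tps_drop:.0f}%")   # 46%
print(f"Total run time: {slowdown:.2f}x longer")                     # 1.77x longer
```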

I urge my newfound Mac brethren to help out in determining the cause.

More benchmarks to follow.

D

Thu
16
Oct '08

A word of caution when setting up a deduplicating VTL

Based on some recent experiences I wanted to make people aware of some caveats with setting up a VTL with deduplication. This is specifically regarding the EMC DL3D (AKA Quantum DXi) but applies to all of them. This will be a mercifully short and to the point post. Here’s the rub:

  • Create small virtual tapes (100GB max, I’d go even smaller, obviously depends on your environment)
  • Create a bunch of virtual tape drives (you might have to create 20-30!)
  • Do NOT I repeat NOT multiplex in the backup software! It screws up the deduplication algorithm.
  • Do not compress the data before the backup
  • Do not encrypt the data
  • Be mindful of your retention policies, start gently then work your way up.
  • I’d personally not multi-stream a server at all, just so I can keep the tape utilization high. What I mean: Say you do not do multiplexing but you are multistreaming – i.e. you’re sending 10 streams from your client. This means you will need 10 tapes without multiplexing, so you’ll end up writing a tiny bit on each tape. It doesn’t take a genius to realize that you’ll end up with a ton of tapes with not much data on them, which will cause them to be appended to with more tiny amounts of data, which will in turn cause them to expire way later than you’d like.
  • If you can use the box as NAS and know how to get the throughput up there then do so, that way there’s no issue with multiple streams. My Data Domain boys are chuckling now (they always prefer to do NAS, but that also has to do with the fact that their box can’t really do VTL properly yet. Oh, the cattiness! BTW my company does sell quite a lot of their stuff).
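
To illustrate the multistreaming point in the list above, here is a minimal sketch of how little ends up on each virtual tape when one client’s backup is split into several streams without multiplexing. The backup size and tape size are hypothetical, and the model assumes each stream lands on its own tape and the data splits evenly, which is a simplification.

```python
def gb_per_tape(backup_gb, streams):
    """Without multiplexing, each stream writes to its own tape."""
    return backup_gb / streams


def tape_fill_percent(backup_gb, streams, tape_gb):
    """How full each tape is after one backup, capped at 100%."""
    return min(100.0, gb_per_tape(backup_gb, streams) * 100 / tape_gb)


BACKUP_GB = 50    # hypothetical nightly backup of one client
TAPE_GB = 100     # virtual tape size (the post suggests 100GB max)

for streams in (1, 10):
    fill = tape_fill_percent(BACKUP_GB, streams, TAPE_GB)
    print(f"{streams:>2} stream(s): {streams} tape(s) touched, each {fill:.0f}% full")
```

Ten mostly empty tapes will keep getting appended to, so none of them expires until the newest bit of data on it does.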

The same rules apply otherwise as in my previous post about tuning NetBackup for large environments.

Regarding using the DL3D/DXi as NAS: Plug in as many GigE ports as you can, but make sure your switch can do straight-up EtherChannel (not LACP). So you pretty much need to have a “proper” Cisco switch in order to get the full benefit. Then use multiple media servers. Use a separate NAS share per media server. Team the NICs on the backup servers for performance (do LACP or PAgP there, whatever works with the server’s NIC software). Then call me in the morning.

D

 

Wed
20
Aug '08

What is the value of your data? Do you have the money to implement proper DR that works? How are you deciding what kind of storage and DR strategy you’ll follow? And how does Continuous Data Protection like EMC’s RecoverPoint help?

Maybe the longest title for a post ever. And one of my longest, most rambling posts ever, it seems.

Recently we did a demo for a customer that I thought opened an interesting can of worms. Let’s set the stage – and, BTW, let it be known that I lost my train of thought multiple times writing this over multiple days so it may seem a bit incoherent (unusually, it wasn’t written in one shot).

The customer at the moment uses DASD and is looking to go to some kind of SAN for all the usual reasons. They were looking at EMC initially, then Dell told them they should look at Equallogic (imagine that). Not that there’s anything wrong with Dell or Equallogic… everything has its place.

So they get the obligatory “throw some sh1t on the wall and see what sticks” quote from Dell – literally Dell just sent them pricing on a few different models with wildly varying performance and storage capacities, apparently without rhyme or reason. I guess the rep figured they could afford at least one of the boxes.

So we start the meeting with yours truly asking the pointed questions, as is my idiom. It transpires that:

  1. Nobody looked at what their business actually does
  2. Nobody checked current and expected performance
  3. Nobody checked current and expected DR SLAs
  4. Nobody checked growth potential and patterns
  5. Nobody asked them what functionality they would like to have
  6. Nobody asked them what functionality they need to have
  7. Nobody asked how much storage they truly need
  8. Nobody asked them just how valuable their data is
  9. Nobody asked them how much money they can really spend, regardless of how valuable their data is and what they need.

So we do the dog-and-pony – and unfortunately, without really asking them anything about money, show them RecoverPoint first, which is even worse than showing a Lamborghini (or insert your favorite grail car) to someone that’s only ever used and seen badly-maintained rickshaws, to use a car analogy.

To the uninitiated, EMC’s RecoverPoint is the be-all, end-all CDP (Continuous Data Protection) product, all nicely packaged in appliance format. It used to be Kashya (which seems to mean either “hard question” or “hard problem” in Hebrew), then EMC wisely bought Kashya, and changed the name to something that makes more marketing sense. Before EMC bought them, Kashya was the favorite replication technology of several vendors that just didn’t have anything decent in-place for replication (like Pillar). Obviously, with EMC now owning Kashya, it would look very, very bad if someone tried to sell you a Pillar array and their replication system came from EMC (it comes from FalconStor now). But I digress.

RecoverPoint lets you roll your disks back and forth in time, very much like a super-fine-grained TiVo for storage. It does this by creating a space equal to the space consumed by the original data that acts as a mirror, plus the use of what is essentially a redo log (so to use it locally you need 2x the storage + redo log space). The bigger the redo log, the more you can go back in time (you could literally go back several days). Oh, and they like to call the redo log The Journal.

It works by effectively mirroring the writes so they go to their target and to RecoverPoint. You can implement the “splitter” at the host level, the array (as long as it’s a Clariion from EMC) or with certain intelligent fiber switches using SSM modules (the last option being by far the most difficult and expensive to implement).

In essence, if you want to see a different version of your data, you ask RecoverPoint to present an “image” of what the disks would look like at a specified point-in-time (which can be entirely arbitrary or you can use an application-aware “bookmark”). You can then mount the set of disks the image represents (called a consistency group) to the same server or another server and do whatever you need to do. Obviously there are numerous uses for something like that. Recovering from data corruption while losing the least amount of data is the most obvious use case but you can use it to run what-if scenarios, migrations, test patches, do backups, etc.
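
The general idea of a mirror plus a redo log can be shown with a toy model. To be loud about it: this is NOT how RecoverPoint is implemented, just a generic sketch of replaying a write journal up to a chosen point to get an “image”. The class and method names are made up for illustration.

```python
class ToyJournal:
    """Toy write journal: log every write so any earlier state can be rebuilt."""

    def __init__(self, baseline):
        self.baseline = dict(baseline)   # block address -> contents when journaling began
        self.entries = []                # (sequence number, block address, data)
        self.bookmarks = {}              # bookmark name -> sequence number

    def write(self, block, data):
        """Record a write and return its sequence number."""
        seq = len(self.entries) + 1
        self.entries.append((seq, block, data))
        return seq

    def bookmark(self, name):
        """Name the current point in time (e.g. an application-consistent point)."""
        self.bookmarks[name] = len(self.entries)

    def image_at(self, seq):
        """Rebuild what the disk looked like right after write number `seq`."""
        image = dict(self.baseline)
        for entry_seq, block, data in self.entries:
            if entry_seq > seq:
                break
            image[block] = data
        return image


journal = ToyJournal({0: "a", 1: "b"})
journal.write(0, "A")
journal.write(1, "B")
journal.bookmark("before-patch")
journal.write(0, "corrupt")

good = journal.image_at(journal.bookmarks["before-patch"])
print(good)   # {0: 'A', 1: 'B'}
```

Going back “one write” is just asking for the image at the previous sequence number, which is why per-write granularity matters so much for recovery.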

You can also use RecoverPoint to replicate data to a remote site (where you need just 1x the storage + redo log). It does its own deduplication and TCP optimizations during replication, and is amazingly efficient (far more so than any other replication scheme in my opinion). They call it CRR (Continuous Remote Replication). Obviously, you get the TiVo-like functionality at the remote side as well.
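
Here is a rough capacity sketch based on the rules of thumb above: 2x the storage plus journal for local protection, 1x plus journal at the remote site. The sizes and the write rate are hypothetical, and the “hours of history” estimate assumes the journal holds nothing but the changed data, which is a simplification and not a vendor sizing formula.

```python
def local_cdp_capacity_gb(production_gb, journal_gb):
    """Local protection: production data + a full mirror copy + the journal."""
    return 2 * production_gb + journal_gb


def remote_crr_capacity_gb(production_gb, journal_gb):
    """Remote site: one copy of the data + the journal."""
    return production_gb + journal_gb


def journal_history_hours(journal_gb, write_rate_gb_per_hour):
    """Very rough: how far back you can roll if the journal only holds changed data."""
    return journal_gb / write_rate_gb_per_hour


PRODUCTION_GB = 10_000   # hypothetical protected data set
JOURNAL_GB = 2_000       # hypothetical journal size
WRITE_RATE = 40          # hypothetical GB written per hour

print(local_cdp_capacity_gb(PRODUCTION_GB, JOURNAL_GB))    # 22000
print(remote_crr_capacity_gb(PRODUCTION_GB, JOURNAL_GB))   # 12000
print(journal_history_hours(JOURNAL_GB, WRITE_RATE))       # 50.0
```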

The kicker is the granularity of CRR/CDP. Obviously, as with anything, there can be no magic, but, given the optimizations it does, if the pipe is large enough you can do near-synchronous replication over distances previously unheard of, and get per-write granularity both locally and remotely. All without needing a WAN accelerator to help out, expensive FC-IP bridges and whatnot.

There’s one pretender that likes to take fairly frequent snapshots but even those are several minutes apart at best, can hurt performance and are limited in their ultimate number. Moreover, their recovery is nowhere near as slick, reliable and foolproof.

To wit: We did demos going back and forth a single transaction in SQL Server 2005. Trading firms love that one. The granularity was a couple of microseconds at the IOPS we were running. We recovered the DB back to entirely arbitrary points in time, always 100% successfully. Forget tapes or just having the current mirrored data!

We also showed Exchange being recovered at a remote Windows cluster. Due to Windows cluster being what it is, it had some issues with the initial version of disks it was presented. The customer exclaimed “this happened to me before during a DR exercise, it took me 18 hours to fix!!” We then simply used a different version of the data, going back a few writes. Windows was happy and Exchange started OK on the remote cluster. Total effort: the time spent clicking around the GUI asking for a different time + the time to present the data, less than a minute total. The guy was amazed at how streamlined and straightforward it all was.

It’s important to note that Exchange suffers more from those issues than other DBs since it’s not a “proper” relational DB like SQL is, the back-end DB is Jet and don’t get me started… the gist is that replicating Exchange is not always straightforward. RecoverPoint gave us the chance to easily try different versions of the Exchange data, “just in case”.

How would you do that with traditional replication technologies?

How would you do that with other so-called CDP that is nowhere near as granular? How much data would you lose? Is that competing solution even functional? Anyone remember Mendocino? They kinda tried to do something similar, the stuff wouldn’t work right in a pristine lab environment, I gave up on it. RecoverPoint actually works.

Needless to say, the customer loved the demo (they always do, never seen anyone not like RecoverPoint, it’s like crack for IT guys). It solves all their DR issues, works with their stuff, and is almost magical. Problem is, it’s also pretty expensive – to protect the amount of data that customer has they’d almost need to spend as much on RecoverPoint as on the actual storage itself.

Which brings us to the source of the problem. Of course they like the product. But for someone that is considering low-end boxes from Dell, IBM etc. this will be a huge price shock. They keep asking me to see the price, then I hear they’re looking at stuff from HDS and IBM and (no disrespect) that doesn’t make me any more confident that they can afford RecoverPoint.

Our mistake is that we didn’t at first figure out their budget. And we didn’t figure out the value of their data – maybe they don’t need the absolute best DR technology extant since it won’t cost them that much if their data isn’t there for a few hours.

The best way to justify any DR solution is to figure out how much it costs the business if you can achieve, say, 1 day of RTO and 5 hours of RPO vs 5 minutes of RTO and near-zero RPO. Meaning, what is the financial impact to the business for the longer RPO and RTO? And how does it compare to the cost of the lower RPO and RTO recovery solution?
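
A minimal sketch of that exercise, with every dollar figure and probability being hypothetical: work out what one incident costs under each RTO/RPO pair, then see whether the loss avoided over the planning period is bigger than the extra money the better solution costs. The linear cost-per-hour model is an assumption; your business may lose money very differently.

```python
def incident_cost(rto_hours, rpo_hours, cost_per_down_hour, cost_per_lost_data_hour):
    """Cost of one incident: time spent down plus work lost since the last good copy."""
    return rto_hours * cost_per_down_hour + rpo_hours * cost_per_lost_data_hour


def pays_off(basic_cost, premium_cost, incidents_per_year, years, extra_solution_cost):
    """True if the expected loss avoided exceeds the extra spend on the better solution."""
    avoided = (basic_cost - premium_cost) * incidents_per_year * years
    return avoided > extra_solution_cost


# Hypothetical business impact figures.
DOWN_HOUR = 10_000        # $ lost per hour the business is down
LOST_DATA_HOUR = 20_000   # $ lost per hour of data that has to be recreated

basic = incident_cost(24, 5, DOWN_HOUR, LOST_DATA_HOUR)        # 1 day RTO, 5 hours RPO
premium = incident_cost(5 / 60, 0, DOWN_HOUR, LOST_DATA_HOUR)  # 5 minutes RTO, near-zero RPO

print(f"Basic: ${basic:,.0f} per incident, premium: ${premium:,.0f} per incident")
print(pays_off(basic, premium, incidents_per_year=0.2, years=3, extra_solution_cost=150_000))
print(pays_off(basic, premium, incidents_per_year=0.2, years=3, extra_solution_cost=250_000))
```

With these made-up numbers the better solution is worth an extra $150K but not an extra $250K; plug in your own figures and the answer may well flip.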

The real issue with DR is that almost no company truly goes through that exercise. Almost everyone says “my data is critical and I can afford zero data loss” but nobody seems to be in touch with reality, until presented with how much it will cost to give them the zero RPO capability.

The stages one goes through in order to reach DR maturity are like the stages of grief – Denial, Anger, Bargaining, Depression, and Acceptance.

Once people see the cost, they hit the Denial stage and do a 180: “You know what, I really don’t need this data back that quickly and can afford a week of data loss!!! I’ll mail punch cards to the DR site!” – typically, this is removed from reality and is a complete knee-jerk reaction to the price.

Then comes Anger – “I can’t believe you charge this much for something essential like this! It should be free! You suck! It’s like charging a man dying of thirst for water! I’ll sue! I’ll go to the competition!”

Then they realize there’s no competition to speak of so we reach the Bargaining stage: “Guys, I’ll give you my decrepit HDS box as a trade-in. I also have a cool camera collection you can have, baseball cards, and I’ll let you have fun with my sister for a week!”

After figuring out how much money we can shave off by selling his HDS box, cameras and baseball cards on ebay and his sister to some sinister-looking guys with portable freezers (whoopsie, he did say only a week), it’s still not cheap enough. This is where Depression sets in. “I’m screwed, I’ll never get the money to do this, I’ll be out of a job and homeless! Our DR is an absolute joke! I’ll be forced to use simple asynchronous mirroring! What if I can’t bring up Exchange again? It didn’t work last time!”

The final stage is Acceptance – either you come to terms with the fact you can’t afford the gear and truly try to build the best possible alternative, or you scrounge up the money somehow by becoming realistic: “well, I’m only gonna use RecoverPoint for my Exchange and SQL box and maybe the most critical VMs, everything else will be replicated using archaic methods but at least my important apps are protected using the best there is”.

It would save everyone a lot of heartache and time if we just jump straight to the Acceptance phase where RecoverPoint is concerned:

  • Yes, it really works that well.
  • Yes, it’s that easy.
  • Yes, it’s expensive because it’s the best.
  • Yes, you might be able to afford it if you become realistic about what you need to protect.
  • Yes, you’ll have to do your homework to justify the cost. If nothing else, you’ll know how much an outage truly costs your business! Maybe your data is more important than your bosses realize. Or maybe it’s a lot LESS important than what everyone would like to think. Either way you’re ahead!
  • Yes, leasing can help make the price more palatable. Leasing is not always evil.
  • No, it won’t be free.
  • If you have no money at all why are you even bothering the vendors? Read the brochures instead.
  • If you have some money, please be upfront about exactly how much you can spend. Contrary to popular belief, not everyone is out to screw you out of all your IT budget. After all, we know you can compare our pricing to others’, so there’s no point in trying to screw anyone. Moreover, the best customers are repeat customers, and we want the best customers! Just like with cars, there’s some wiggle room, but at some point if you’re trying to get the expensive BMW you do need to have the dough.

     

Anyway, I rambled enough…

 

D

    

Mon
16
Jun '08

Massive benchmark comparison between Windows XP, Vista and 2008 Server, 32- and 64-bit

Found this while surfing and couldn’t resist posting the link. This guy did a massive array of tests on pretty much all versions of Windows that matter at the moment. The short version? If it’s performance you’re after, there’s no clear winner, since they all have their strengths. Overall, of the currently-supported OSes 2008 server seems to have the edge, as indicated by my own experiences. Attaching the results below, but here’s a link, too.

microsoft OS benchmarks

Tue
10
Jun '08

Virtualized Windows I/O performance on Xen with and without the optimized PV drivers, and versus the Linux host

One of my readers, Randall Ehren, was kind enough to provide benchmarks for Xen-virtualized Windows 2003 and XP with and without the optimized PV driver, and also compare to the underlying host. Most of the text below is copied verbatim from his correspondence with me; I just added some clarification in places.

physical machine description:
dell poweredge r200 server, 8GB ram, 2×250GB SATA 7200rpm in a RAID1

Xen host: Ubuntu 8.04 LTS Server edition running the Xen 3.2 hypervisor (this is referred to as the dom0 machine)

This server is hosting two virtual servers (1 - windows 2003 server (1GB RAM), 2 - windows xp (1GB RAM)) and I performed two postmark benchmarks - one with an out of the box windows installation (indicated by “no PV drivers”), the other with a paravirtualized disk driver (indicated by “with Xen PV 0.9.6 drivers”) whose purpose is to greatly increase disk & network performance for windows-based virtual machines running under Xen. the drivers can be found here:

 http://wiki.xensource.com/xenwiki/XenWindowsGplPv

Postmark settings:

set number 10000
set transactions 20000
set subdirectories 5
set size 500 100000
set read 4096
set write 4096

Underlying host

##
## server: ubu 8 amd64 / iron / ext3fs on LVM2


## storage: direct attached / raid1 / 2×250g 7.2k sata / idle
##

Linux vm5 2.6.24-17-xen #1 SMP Thu May 1 15:55:31 UTC 2008 x86_64 GNU/Linux

Time:

        73 seconds total
        59 seconds of transactions (338 per second)

Files:
        20092 created (275 per second)
                Creation alone: 10000 files (10000 per second)
                Mixed with transactions: 10092 files (171 per second)
        9935 read (168 per second)
        10064 appended (170 per second)
        20092 deleted (275 per second)
                Deletion alone: 10184 files (783 per second)
                Mixed with transactions: 9908 files (167 per second)

Data:
        548.25 megabytes read (7.51 megabytes per second)
        1158.00 megabytes written (15.86 megabytes per second)


## server: w2k3 32bit / xen 3.2 / ntfs (no PV drivers)


## storage: direct attached / raid1 / 2×250g 7.2k sata / idle
## xen notes: using ‘tap:aio’ file driver for windows 2003 server “drive”

Time:

        193 seconds total
        123 seconds of transactions (162 per second)

Files:
        20092 created (104 per second)
                Creation alone: 10000 files (166 per second)
                Mixed with transactions: 10092 files (82 per second)
        9935 read (80 per second)
        10064 appended (81 per second)
        20092 deleted (104 per second)
                Deletion alone: 10184 files (1018 per second)
                Mixed with transactions: 9908 files (80 per second)

Data:
        548.25 megabytes read (2.84 megabytes per second)
        1158.00 megabytes written (6.00 megabytes per second)


## server: w2k3 32bit / xen 3.2 / ntfs (with Xen PV 0.9.6 drivers)
## storage: direct attached / raid1 / 2×250g 7.2k sata / idle
## xen notes: using ‘tap:aio’ file driver for windows 2003 server “drive”

Time:

        129 seconds total
        68 seconds of transactions (294 per second)

Files:
        20092 created (155 per second)
                Creation alone: 10000 files (204 per second)
                Mixed with transactions: 10092 files (148 per second)
        9935 read (146 per second)
        10064 appended (148 per second)
        20092 deleted (155 per second)
                Deletion alone: 10184 files (848 per second)
                Mixed with transactions: 9908 files (145 per second)

Data:
        548.25 megabytes read (4.25 megabytes per second)
        1158.00 megabytes written (8.98 megabytes per second)




## server: winxp 32bit / xen 3.2 / ntfs (no PV drivers)


## storage: direct attached / raid1 / 2×250g 7.2k sata / idle
## xen notes: using ‘tap:aio’ file driver for windows xp “drive”

Time:

        336 seconds total
        274 seconds of transactions (72 per second)

Files:
        20092 created (59 per second)
                Creation alone: 10000 files (178 per second)
                Mixed with transactions: 10092 files (36 per second)
        9935 read (36 per second)
        10064 appended (36 per second)
        20092 deleted (59 per second)
                Deletion alone: 10184 files (1697 per second)
                Mixed with transactions: 9908 files (36 per second)

Data:
        548.25 megabytes read (1.63 megabytes per second)
        1158.00 megabytes written (3.45 megabytes per second)




## server: winxp 32bit / xen 3.2 / ntfs (with Xen PV 0.9.6 drivers)
## storage: direct attached / raid1 / 2×250g 7.2k sata / idle
## xen notes: using ‘tap:aio’ file driver for windows xp “drive”

Time:

        233 seconds total
        181 seconds of transactions (110 per second)

Files:
        20092 created (86 per second)
                Creation alone: 10000 files (222 per second)
                Mixed with transactions: 10092 files (55 per second)
        9935 read (54 per second)
        10064 appended (55 per second)
        20092 deleted (86 per second)
                Deletion alone: 10184 files (1454 per second)
                Mixed with transactions: 9908 files (54 per second)

Data:
        548.25 megabytes read (2.35 megabytes per second)
        1158.00 megabytes written (4.97 megabytes per second)

 

Conclusion: seems that the PV driver does help greatly with I/O performance. Of course, comparing to the performance of the underlying host the VMs suck. I’d like to see Randall run the test and use the same box to run at least 2003 in native mode and then post, that should give a great comparison between NTFS and ext3.
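For anyone who wants the ratios rather than eyeballing the output, here is a quick sketch that uses only the transactions-per-second figures from the runs above. The dictionary keys are just labels I made up for this example.

```python
# Transactions per second as reported by postmark in the runs above.
tps = {
    "host (ext3)": 338,
    "w2k3 no PV": 162,
    "w2k3 PV": 294,
    "winxp no PV": 72,
    "winxp PV": 110,
}


def speedup(after, before):
    """How many times faster 'after' is than 'before'."""
    return after / before


def fraction_of_host(name):
    """Guest throughput as a fraction of the dom0 host throughput."""
    return tps[name] / tps["host (ext3)"]


for guest in ("w2k3", "winxp"):
    gain = speedup(tps[f"{guest} PV"], tps[f"{guest} no PV"])
    share = fraction_of_host(f"{guest} PV")
    print(f"{guest}: PV drivers give {gain:.2f}x, "
          f"{share:.0%} of host transactions/sec")
```

This only looks at the transaction rate; total run time and the create/delete phases tell a slightly different story, so treat it as one slice of the picture.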

Randall/D

Wed
28
May '08

Finally, some postmark results for OSX! And how does it do versus Windows?

My colleague Ian (last name withheld to save him from the Mac zealots) compiled the postmark code on his beloved Mac and ran it with the same settings I use in general (see older blog posts, just search for postmark).

I’ve been curious for the longest time to see how OSX performs in this test, since most UNIX and -alike systems work great with it. I wanted to see if OSX would be appropriate for a high IOPS-type environment (my belief being that due to the choice of kernel and filesystem it would suck - Mach and HFS+ not being exactly ideally suited to such tasks).

This is obviously not the most scientific test but I think it is good enough to get a rough gauge.

I’m still waiting for the specifics on the Mac but it’s an older Intel-based 17″ MacBook Pro with a 2.16GHz CPU, 5400 RPM HD and 2GB RAM.

The horrendous result (I think my rusty abacus did better once):

Time:
1259 seconds total
1186 seconds of transactions (16 per second)

Files:
20092 created (15 per second)
Creation alone: 10000 files (163 per second)
Mixed with transactions: 10092 files (8 per second)
9935 read (8 per second)
10064 appended (8 per second)
20092 deleted (15 per second)
Deletion alone: 10184 files (848 per second)
Mixed with transactions: 9908 files (8 per second)

Data:
548.25 megabytes read (445.92 kilobytes per second)
1158.00 megabytes written (941.85 kilobytes per second)

To compare and contrast (and save you from searching the older posts):

On a similar-spec Thinkpad T60 running XP (1.8GHz Core Duo, 2GB RAM, 60GB 5400 RPM HD):

Time:
174 seconds total
91 seconds of transactions (219 per second)

Files:
20092 created (115 per second)
Creation alone: 10000 files (222 per second)
Mixed with transactions: 10092 files (110 per second)
9935 read (109 per second)
10064 appended (110 per second)
20092 deleted (115 per second)
Deletion alone: 10184 files (268 per second)
Mixed with transactions: 9908 files (108 per second)

Data:
548.25 megabytes read (3.15 megabytes per second)
1158.00 megabytes written (6.66 megabytes per second)

 

And on the spankin’ T61p running 2008 Server (2.6GHz Core 2 Duo, 4GB RAM, 200GB 7200 RPM HD):

Time:
110 seconds total
39 seconds of transactions (512 per second)

Files:
20092 created (182 per second)
Creation alone: 10000 files (454 per second)
Mixed with transactions: 10092 files (258 per second)
9935 read (254 per second)
10064 appended (258 per second)
20092 deleted (182 per second)
Deletion alone: 10184 files (207 per second)
Mixed with transactions: 9908 files (254 per second)

Data:
548.25 megabytes read (4.98 megabytes per second)
1158.00 megabytes written (10.53 megabytes per second)

 

Drive and CPU speed are important, but postmark results are largely a function of filesystem and cache efficiency. It’s also worth noting that postmark is in no way optimized for Windows, since it is just standard C code and indeed was meant to be run on Unix boxes. Typically, good Unix filesystems beat Windows in postmark (my record run time was under 10s on a Solaris box and DMX).

Unless something is wrong, HFS+ and/or the OSX cache are execrable for this kind of workload, which is a pity. Maybe there are better mount options? Some tuning options?

This is huge! Even if there’s some issue (disk fragmentation, for instance) the difference in sheer IOPS performance between OSX and pretty much anything else is staggering.
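If you collect a few of these runs and want to compare them without squinting, a small parser does the job. This is a minimal sketch that assumes the output looks like the text pasted above; the function name and the trimmed sample strings are mine, not part of postmark.

```python
import re

# Trimmed copies of the "Time:" sections pasted above.
OSX_RUN = """Time:
1259 seconds total
1186 seconds of transactions (16 per second)
"""

XP_RUN = """Time:
174 seconds total
91 seconds of transactions (219 per second)
"""


def parse_postmark(text):
    """Pull total time and transaction rate out of postmark's report."""
    total = re.search(r"(\d+) seconds total", text)
    tx = re.search(r"(\d+) seconds of transactions \((\d+) per second\)", text)
    if total is None or tx is None:
        raise ValueError("does not look like postmark output")
    return {
        "total_seconds": int(total.group(1)),
        "transaction_seconds": int(tx.group(1)),
        "tps": int(tx.group(2)),
    }


osx = parse_postmark(OSX_RUN)
xp = parse_postmark(XP_RUN)
print(f"XP does {xp['tps'] / osx['tps']:.1f}x the transactions/sec of OSX")
print(f"OSX takes {osx['total_seconds'] / xp['total_seconds']:.1f}x as long overall")
```

The regexes only match whole-number seconds and rates, which is all that appears in the runs posted here.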

Any Mac users out there that want to chime in and save the day please let me know, I’ll send you the source and you can compile it whichever which way. I truly hope there is some serious error here.

D

Tue
13
May '08

Lowest-impact antivirus tool I’ve ever tried

I’ve been trying out ESET’s NOD32 on my 64-bit 2008 Server box. Before that I’d tried Avast! – which has great detection but noticeably slows down my computer, even when it’s loading pre-cached and pre-checked content (easy test: load Firefox with and without Avast! several times. It’s ALWAYS much slower to load with antivirus on than without. Without Avast! it loads instantly).

So I put in NOD32 Business Edition and the performance difference is staggering. Indeed, I can’t tell the difference between having it on or off. Unless you ask for a scan of the entire box the antivirus process never even goes to 1% of CPU consumption. If you check various online tests of the different antivirus programs they do show NOD32 having some of the best performance overall (including possibly the best heuristics engine with practically zero false positives), plus it works with 2008.

Other progs (Like Kaspersky) also work well but they’re much slower. I think I’ve found my holy grail when it comes to virus protection.

The one massive drawback is that Business Edition (which is the only one that supports 2008) is ONLY sold in 5-computer packs. It’s not expensive (boils down to like $40 per box, same as Home Edition) but I don’t HAVE 5 servers, I just have my 1 laptop that runs 2008.

I asked ESET and they wouldn’t sell me a single Business license. That’s just silly. The product is priced right and is totally solid, but they won’t sell you fewer than 5 licenses. I won’t spend 200-odd bucks for one machine.

Their response was that most businesses have more than 5 computers in general, so even if they have only 1-2 servers the rest of the licenses can be used on desktops/laptops. Which makes sense but it doesn’t help me :)

The only other product I’d consider now is Avira’s Antivir (same great speed and detection rate, however it provides many more false positives) but I hear it doesn’t work on 2008.

Damn the box is fast now, I forgot how blazing 2008 feels unencumbered by other fluff :)

D

Thu
8
May '08

Retarded storage and thin-skinned people

So this is kind of a long but funny story and a rant against oversensitive people at the same time.

About a year ago, this sales guy and I go to this architecture firm since they told us they are in dire need of a better storage solution.

We meet with their admin, real nice young guy, let’s call him Mike. He explains to me how they have this old <insert few-letter-company name> clustered NAS with some JBOD behind it. They’re having performance issues, it’s not scalable, they don’t replicate it or do snaps, the list goes on about how much he hates that box. It’s just not working out.

He then mentions he wasn’t part of the decision to buy the box and he just wants to get rid of it and get something much better.

So I start explaining to him the higher-end NAS solutions, I talk about the EMC Celerra, all the things it does etc. The whole explanation takes like 2 hours since he really was unfamiliar with a lot of the basics so I started from the ground up, explained the entire concept and architecture etc.

By the end of this we’re bonding with the guy, he’s throwing some F bombs in casual conversation, all in all we’re comfortable. He tells me he finally gets it, he realizes it took him a while to see the big picture but now he totally understands the value prop. He’s excited.

I feel stoked since I like the guy and it’s not often that you get to educate someone and make them that happy. Very rewarding. So we’re joking some more and I mention how the old box is pretty much retarded when compared to the EMC box, since the EMC box does so much more it’s ridiculous.

He laughs about that and agrees, we joke some more, I promise him I’ll send him a config to look over and we leave.

On the way out he tells me how great it all was, and cautions me jokingly that it’s probably not a good idea to mention to more conservative customers that their existing storage is retarded. We laugh and part ways in a very friendly fashion. Of course I don’t normally say something like that, I only did because we were joking around and bonding and, most importantly, he told me it wasn’t his baby and that he hates it. Usually the coast is clear after something like that :)

So I send him his config, he’s getting a great deal, all very well architected. No response. I call him, no response. Eventually the rep calls him, and Mike tells the rep how he was offended that I called his storage retarded and he doesn’t want to do business with us. I thought this was the weirdest thing ever. My initial reaction is that maybe someone close to him is mentally retarded – but if that were the case, he should have shown some kind of reaction when I first mentioned the dim-wittedness of his existing storage.

But wait, there’s more.

About a year later… different gig, different rep. I get the invite to go to this place and talk about storage. They’ve had problems for years and have a really old and bad system in place and really need a replacement. I walk in, and of course it’s the same exact architecture firm! I tell the rep that this is probably a bad idea and that I should leave. I don’t have time though because Mike comes to greet us.

The moment he sees me, he’s like “sorry guys, this is not gonna happen, you just leave now so we don’t waste each other’s time”. He says that he really respects my expertise but he won’t do business with a company I work at. He doesn’t want to speak to another engineer and pretty much kicks us out. I can’t shut up any more and I tell Mike that he has really, really thin skin.

Needless to say the new sales guy is dumbfounded.

The sales guy calls Mike a day or so later and gets an explanation out of him. Mike claims he doesn’t want to deal with engineers that belittle his equipment since how do I know in what financial dire straits they were? Maybe they were forced to buy the retarded storage.

Which is fine but shows that either Mike lied throughout our entire first meeting or has an amazingly bad short-term memory.

I wish Mike all the best in his future endeavors and still stand by my original assertion: get off your retarded storage if it’s causing you problems. Even if you don’t have money there are other appliance-type solutions to be had on the cheap (or free)!

Here are some easy-to-use appliances that are quite good:

You could try all of them as virtual machines if you don’t want to dedicate hardware to them to begin with. That way you can test all of them easily. You can also roll your own with Solaris 10 or Linux, of course it requires one to know what they’re doing but it’s amazing what can be accomplished for next to zero dollars nowadays.

And Mike, if you ever read this:

Get some thicker skin. And maybe some Gingko Biloba. Moreover, if the real reason I offended you was that someone close to you is retarded – get over it, it’s just an expression!

People are just too damn sensitive these days. Just get the job done.

D

Tue
25
Mar '08

Windows Server 2008 RTM 64-bit performance versus Vista SP1 64-bit, and using 2008 as a workstation

I’ve been using Vista x64 for a while now, just so I can make use of all the memory on my machine (an über-thinkpad), and because I like shiny new things and 64-bitness and don’t want to be one-upped by smug Mac users with their feline-named OSes, mock turtlenecks and their newfound 64-bit capabilities. Of course, with the good comes some bad – Vista, while in my opinion a step forward in many ways, does take a step backward when it comes to some areas of performance and sheer resource requirements. A lot of it can be attributed to poorly-written drivers, especially any Aero GUI slowdowns with nVidia cards.

Since space was running out I bought a new hard drive (200GB Seagate 7200 RPM) and decided to install the RTM 2008 bits. If something went wrong I figured I could always either go back to my old drive or just move Vista to the new drive with some imaging utility or other, no biggie. If 2008 worked out, I’d keep it.

The reason this comparison is worthwhile is that 2008 and Vista SP1 have the same exact kernel – I checked, NTOSKRNL.EXE is the same in both OSes. One would think that the differences wouldn’t be huge and that therefore there’s no point going to 2008. Of course, there are a lot of other pieces aside from the kernel, and I think that Microsoft checks to see what OS you’re running and maybe disables certain features in the kernel accordingly – I couldn’t get the LargeSystemCache registry parameter to have any effect on Vista, for example.

Let’s compare CPU- and Graphics-benchmarks first, since those shouldn’t really be different. I used Cinebench 64-bit.

 

Vista:

Rendering (Single   CPU): 3040 CB-CPU
Rendering (Multiple CPU): 5367 CB-CPU
Multiprocessor Speedup: 1.77
Shading (OpenGL Standard)          : 4256 CB-GFX

 

2008:

Rendering (Single   CPU): 3053 CB-CPU
Rendering (Multiple CPU): 5379 CB-CPU
Multiprocessor Speedup: 1.86
Shading (OpenGL Standard)          : 4478 CB-GFX

 

Slightly better scores for 2008 it seems, but not dramatically better. Next, postmark, since I/O should be where it shines, it being a server and all:

 

Vista:

Time:

        170 seconds total
        98 seconds of transactions (204 per second)

Files:
        20092 created (118 per second)
                Creation alone: 10000 files (200 per second)
                Mixed with transactions: 10092 files (102 per second)
        9935 read (101 per second)
        10064 appended (102 per second)
        20092 deleted (118 per second)
                Deletion alone: 10184 files (462 per second)
                Mixed with transactions: 9908 files (101 per second)

Data:
        548.25 megabytes read (3.23 megabytes per second)
        1158.00 megabytes written (6.81 megabytes per second)

 

2008:

Initially I had enabled the “advanced performance” in the device manager for disk, since everyone tells you to do so in all tuning guides…

 

Time:
136 seconds total
45 seconds of transactions (444 per second)

Files:
20092 created (147 per second)
Creation alone: 10000 files (263 per second)
Mixed with transactions: 10092 files (224 per second)
9935 read (220 per second)
10064 appended (223 per second)
20092 deleted (147 per second)
Deletion alone: 10184 files (192 per second)
Mixed with transactions: 9908 files (220 per second)

Data:
548.25 megabytes read (4.03 megabytes per second)
1158.00 megabytes written (8.51 megabytes per second)

 

Much faster than Vista. I then disabled the “enable advanced performance” to see how much slower it would become:

 

Time:
110 seconds total
39 seconds of transactions (512 per second)

Files:
20092 created (182 per second)
Creation alone: 10000 files (454 per second)
Mixed with transactions: 10092 files (258 per second)
9935 read (254 per second)
10064 appended (258 per second)
20092 deleted (182 per second)
Deletion alone: 10184 files (207 per second)
Mixed with transactions: 9908 files (254 per second)

Data:
548.25 megabytes read (4.98 megabytes per second)
1158.00 megabytes written (10.53 megabytes per second)

 

Amazingly, much faster, not slower! I did some checking and this is what the setting actually does… it re-introduces an older, somewhat undesirable behavior. A bit hard to find the proper explanation, and I hope Microsoft makes what happens behind the scenes a bit more obvious. At the moment it’s quite obscure, and every guide tells you to enable it for performance. Just leave it alone. BTW the Vista score is with the setting disabled.

 

Could I have run other benchmarks like Sandra etc? Sure, but I just wanted to keep it simple and there just wasn’t enough time.

 

The next step is to run the tests on the same hardware with XP. That’s forthcoming.

 

Conclusion:

 

Seems like Microsoft did something right. Even with the 64-bit version (which naturally takes more RAM than the 32-bit one), 2008 Server takes less memory than Vista (200-300MB less at any given time in my case), runs quicker and just feels better, kinda like an unencumbered Vista. Simple things like searching a huge index in Outlook happen much faster than before. The Server Manager app is awesome, and one can try out the Hyper-V Hypervisor (BTW that, predictably, clashes with VMware and disables your power management, so beware). A server OS is in general also more secure and, over time, probably more reliable, given the workloads it’s supposed to run.

 

Can everyone run it? Should they? No, not unless you have a license for 2008 through MSDN or somesuch, otherwise it’s expensive. Some assembly is also required, and you do need to know what you’re doing. However, if you’re so inclined, you can easily get the demo version of 2008. Apparently there are clean, documented ways to increase the evaluation period (no cracks or BIOS spoofers) that I think come from Microsoft but I’m not going to list them here just in case…

 

In addition, while almost all my apps installed fine (including games and hairy driver stuff like Daemon Tools), 2 things didn’t: Bluetooth and my Logitech mouse drivers. I don’t quite use Bluetooth but I liked some of the features of my mouse (the utterly kickass Logitech VX Revolution), now it’s just like a normal mouse. I’m still keeping 2008. I’m sure other stuff will have issues, like DRM/BluRay. For people that like the Windows Sidebar: there are hacks to get it working that involve copying stuff from Vista. I think the sidebar is largely useless.

 

FYI, there are 2 notable omissions in 2008: Readyboost and Superfetch. Superfetch exists as a service but to even get it to start you have to edit the registry. I didn’t think it helped much so I disabled it again. Readyboost isn’t even an option. And the old-style boot prefetch that worked in 2003 Server doesn’t seem to be there. So it does boot a bit slower than Vista, but not much. Once you get the box up and running it’s fast though.

 

In the end, I’m leaving 2008 on my box, and that’s all that matters.

 

D

Mon
4
Feb '08

NetApp posts SPC-1 results

NetApp posted some SPC results showing their 3040 box performing pretty well in SPC-1 relative to an EMC box.

There have been rumors that performance suffers when multiple features run on a NetApp box, which kinda negates the whole value prop of NetApp (since that’s when people typically choose NetApp - they want one box to do everything).

A realistic test would be to have OTHER apps sharing the array (on other spindles), as is usually the case. Almost nobody dedicates an entire array of that size to a single app.

Have the box do CIFS, NFS, iSCSI AND FC.

Show performance over a significant period of time (another point NetApp detractors use – performance declines over time due to WAFL fragmentation).

THEN show the performance delta as each feature is enabled.

Obviously hard to do and maintain kosher SPC results but it would be a worthwhile addendum and, if successful, would shut up the NetApp detractors (since that’s a usual technique for selling against NetApp). I’d also show performance in degraded mode.

Anyone have any data on NetApp performing either way when used as a multi-role box?

A note on the EMC config and interpreting those benchmarks in general, be they SPC or SPEC or whatever: ALWAYS READ THE FULL DISCLOSURE regarding the test, don’t just look at the graph. If you’re not technical, get a techie to explain it to you.

For instance, looking at the way the EMC box was set up, I highly doubt it was done using EMC’s best practices. To wit:

  1. They didn’t maximize the write cache
  2. They seem to not have used separate spindles for the snapshot area (a differentiator since, unlike NetApp, EMC not only allows such a thing to happen but actually encourages it)
  3. They could have used MetaLUNs more instead of striping using Windows.

I’d be willing to bet dollars to nuts that the NetApp box was set up properly :)

Another thing: look at the response times in the graphs.

Like they say, “only believe 50% of the statistics you read”.

D

Thu
20
Dec '07

Ate at Delmonico’s in NYC

I was helping out a customer with some backup issues in the Wall Street area and they happened to be literally across the street from Delmonico’s.

At the end of a particularly long day I thought I’d reward myself with a nice steak, and the proximity to the steakhouse made it hard to resist.

Delmonico’s is one of those places that have been around forever. Bit stuffy inside, I didn’t opt for the wet-aged Delmonico cut but instead went for the T-Bone (dry-aged on-premises). I also had a rather excellent salad with roasted tomatoes, herbs and mozzarella.

This is not going to be one of those inspired entries – the steak just wasn’t that good. It was undercooked, underseasoned and just lacked flavor. I probably should have gone for the house’s signature cut (the famous Delmonico cut) but any decent steakhouse should have no problems making a proper T-Bone…

Maybe I’ll give it another chance. Prolly not.

D

Sun
9
Dec '07

We need more wizards!

No, I don’t mean Gandalf, I mean the software kind. And before I’m accused of being Gates’ live-in cabana boy (it’s all baseless rumors), let me clarify.

It’s a known fact that most OSes need tuning (sometimes significant) to perform well with heavy-duty applications (I’m not talking about your home web server, I’m talking about Exchange, SAP, Oracle, IIS, Apache etc. in large deployments. I acknowledge the fact that most OSes, out of the box, will work OK for anything small).

Most frequently the application documentation will have some kind of tuning guidelines telling you approximately what to do in each OS. The installer sometimes will apply some tunings for you after asking for your permission. Often, the suggested settings are woefully inadequate for truly large implementations, as with NetBackup (the Veritas-suggested tunings work for smaller environments but I have some magical kernel tunings as posted before that make it truly fly when the ridiculous is asked of it – and the difference in the parameters between my config and what Veritas suggests is huge. Oh, and some of my parameters are way smaller than what Veritas recommends. And I won’t call them Symantec, Veritas is a way cooler name anyway, look it up in a Latin-English dictionary).

Frequently, some tunings are so common that I don’t even know why they’re not in the default configuration in certain OSes. Different conversation.

The problem is, there are experts that DO know how to set up and tune the systems properly, but said experts are rarely the admins that install and administer the thing. Usually, a fair portion of those experts do work at the companies that make the OSes and apps.

The elitist among us might say, “tough, the lowly admins need to learn all this stuff, otherwise they’re not worth what they’re paid”. To which I respond with the following points:

  • Not everyone has the time to learn the arcana of several OSes and applications. Learning most of the important features is complicated enough, and some shops are truly short-staffed
  • The über-experts themselves don’t know it all: they may know how to perfectly set up Exchange but wouldn’t know how to do the same thing with Oracle. How can the basic admins be expected to have such multi-discipline expertise?
  • I firmly believe in the simplicity of the appliance computing model
  • We all have more important things to do (like taking care of the big picture) than constantly worrying about minutiae
  • The people that complain that the admins should be more intelligent are typically the people that actually enjoy dealing with the arcane; their jobs are secure anyway
  • There’s money to be made in the simplification of IT – look at Microsoft, EMC/VMware and NetApp. People like simplicity and are willing to pay for it.

Of course, many larger companies will opt for professional services to do the job, but the quality of people just varies dramatically. Just because you’re getting an expensive Veritas PS guy doesn’t mean that

  1. He knows what the hell he’s doing beyond what’s in the installation manual (you know who you are!) and (less significantly)
  2. Is even a Veritas employee, despite his badge (most vendors subcontract smaller companies).

At the moment, most OSes just apply generic formulas based on memory and/or number of CPUs, though somehow do not take into account CPU speed and load, and, indeed, the ancient formulas are a pain with today’s very large memory systems (usually you have to limit some tunables in large-memory HP-UX and Solaris boxes, otherwise some parameters get out of control).

I understand that making OSes truly self-tuning is not here yet, nor will it be for a while (64-bitness has taken away some of the pain though, at least in Windows). In the interim, there are better ways to approach the problem. My suggestion: Modernize the formulas that build the tunables and use simple AI techniques like Expert Systems. At installation time, benchmark the hardware and ask the user what will the server be running? OK, so if the answer is a web server, under what conditions? How many users? And so on. Admins are far more likely to know the answers to those questions than “how many open file handles do you think you’ll need?”

Based on the answers and the benchmark results, the system should either tell you what you want is possible, or bitch.

If the box is to be serving double-duty (or quintuple, in some cases), the wizard should check and see if the tunings will conflict and, if not, tune the whole box so that it can accommodate all the applications.
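
To make the wizard idea a bit more concrete, here is a minimal sketch of such a rule-based tuner. Everything in it is an assumption made up for illustration: the role names, the tunable names (max_open_files, app_memory_mb) and the multipliers are hypothetical, and ram_mb merely stands in for whatever the install-time benchmark would report. The point is only that the admin answers questions he can actually answer, and the box either produces a combined tuning or complains.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Workload:
    role: str            # "web" or "database" (hypothetical role names)
    expected_users: int  # the kind of question an admin can actually answer


def rules_for(w: Workload) -> Dict[str, int]:
    """Made-up rules that turn the admin's answers into tunables."""
    if w.role == "web":
        return {"max_open_files": 64 * w.expected_users,
                "app_memory_mb": 2 * w.expected_users}
    if w.role == "database":
        return {"max_open_files": 16 * w.expected_users,
                "app_memory_mb": 8 * w.expected_users}
    raise ValueError(f"no rules for role {w.role!r}")


def plan(workloads: List[Workload], ram_mb: int) -> Dict[str, int]:
    """Sum the tunables for every role on the box, or complain if they can't fit."""
    combined: Dict[str, int] = {}
    for w in workloads:
        for name, value in rules_for(w).items():
            combined[name] = combined.get(name, 0) + value
    if combined.get("app_memory_mb", 0) > ram_mb:
        raise RuntimeError(
            f"need {combined['app_memory_mb']} MB but the box has {ram_mb} MB")
    return combined
```

Calling plan([Workload("web", 1000), Workload("database", 200)], ram_mb=8192) returns one combined set of tunables for the double-duty box; the same call with ram_mb=2048 raises, which is the "or bitch" part.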

If you’re creating a filesystem, what will the intended use be? The defaults for almost all filesystems are wrong! One size fits only the people that have that size. The problem is that, once you’ve put in several TB on filesystems someone built with the default parameters, changing them is almost impossible: you have to take a backup, destroy the filesystems, rebuild them then restore the data. Which could have been avoided if, say, maybe not the OS but at least Oracle had the smarts to query the FS and figure out it’s using insufficient log and block sizes and that performance will suck. At which point it should puke and tell you “sorry, this is sub-optimal, either do such-and-such to fix it or continue anyway at your peril”. But of course you’re using raw disks for Oracle, right? Right?

Or take the example of Logical Volume Managers. They are cool, yes. They can work great. They will also let you do insane things such as create multiple LVs and stripe them, even if they’re on the same physical disk! The checks that should have been performed are so ridiculously simple it boggles the mind.
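
Just how simple is the missing check? A sketch, assuming the volume manager can hand us a mapping of logical volume to physical disk (the function name and the mapping format are hypothetical, not any real LVM's interface):

```python
from collections import Counter
from typing import Dict, List


def stripe_conflicts(stripe_members: List[str],
                     lv_to_disk: Dict[str, str]) -> List[str]:
    """Return the physical disks that back more than one member of a stripe."""
    counts = Counter(lv_to_disk[lv] for lv in stripe_members)
    return sorted(disk for disk, n in counts.items() if n > 1)
```

If the returned list is not empty, the stripe would have the same spindle fighting itself, and the tool should refuse or at least warn before creating it.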

HP kinda started doing something like this a while ago – look at the templates in SAM, you can apply 2-3 different (useless) templates based on what the box will be doing that will affect a few tunables. HP-UX is guilty of needing the most tuning of any current OS I can think of, BTW (It also pays great dividends if you know what you’re doing, I took a Superdome to 2x the I/O performance once, felt proud but it took a lot of effort and research that could have been avoided).

Seems like the intelligence that would make our lives easier is like the proverbial hot potato: always someone else’s problem.

I know it’s a tall order: the whole solution would rely on much deeper interoperability between the various components than we’re used to. But I think the end result would be worth it.

In the meantime, if you have to do it all yourself, at least use common sense and have some golden OS builds that are each good for a different use, then just replicate them as needed.

Anyway, all this is aggravating my hemorrhoids (I call them The Grapes of Wrath), better stop now.

D

 

Fri
7
Dec '07

(Very) Preliminary Windows Server 2008 impressions and Vista Multimedia Performance under battery power

Out of curiosity, I very briefly tried the new Server 2008 Release Candidate (freely available from Microsoft). I’ve been using Vista 64-bit since I need to see all the memory in my machine and, while it works mostly OK, there are some low-level scheduling issues with it – for instance, sound is really choppy on battery power, no matter what I do with the power settings, so I can’t use the thing to watch a DVD or listen to music on the plane. Many others seem to be having the same issues, despite the funky Multimedia Class Scheduler nonsense that Microsoft put in the OS that makes networking slower (great info here), even though older incarnations were not suffering from media playback issues under load. And no, if I disable the Multimedia Scheduler it does NOT work better, it actually gets worse, which means that the service is there to fix some other kludge-y issue Microsoft introduced with the scheduler or something like excessive power throttling of certain devices.

But, as usual, I digress. This is about Server 2008. What’s noteworthy is that Vista SP1 inherits the exact same kernel as Server 2008.

This will be a short entry, there are others online talking more about 2008. What I noticed:

  1. It’s light for a Windows OS. There’s no excessive bloat guys, the thing takes about 300MB of RAM with the default install, and more can be saved by trimming unnecessary services (of which there are very few).
  2. It’s fast. Under preliminary benchmarking, even the RC code (that probably has some features missing and extra debugging code) seems about as fast as 2003 after SP2 (unlike others that have been releasing benchmarks of, say, Vista SP1 in its pre-release form, I’d rather wait until the final code is out).
  3. Seems to work with most Vista drivers so, if you want to turn it into a workstation, you can. You can also install the Vista GUI if you’re so inclined with no adverse effects (aside from the ones that come with the Vista UI that is). Runs very smooth.
  4. Application compatibility is similar to that of Server 2003.
  5. The OS does NOT suffer from the same issues as Vista regarding media playback (I made sure I installed the Power Management driver and selected the same kind of PM scheme as Vista). Maybe a good omen come Vista SP1? We shall see.

The new management interfaces are nicely laid out, and selecting Roles for the server and adding or removing features as needed is very simple. It feels more like a well-integrated 2003 R3 rather than Vista.

I didn’t get to play with the new virtualization, it doesn’t seem to be in the RC code (though, reading some documentation, it seems as if it will have VMotion-like capabilities, which I will believe when I see it).

UPDATE: 12/17/07

There is no more Vista multimedia performance issue on 2 separate computers. Some patches just released by Microsoft removed the issue (plus the issue of the mouse cursor stuttering). Interestingly, the patches had no mention of fixing said issues. I thought it was a fluke but having seen this fixed on 2 different boxes (one 32-bit, one 64) I don’t think it is.

For the Vista detractors: I’d advise everyone to wait until SP1 – as with most Microsoft releases. It’s no different. They’re actually getting better, NT4 was unusable until SP3 at least… given the unreal amount of code in the system, I’m surprised it runs this well. They really need to slim it down. Supposedly, Windows 7 will be slimmer (http://apcmag.com/7668/beyond_vista_windows_7_what_we_know_so_far). However, it mostly targets the kernel and it was never the Windows kernel that was the issue (it’s actually surprisingly decent), it’s all the crud around it.

D

Thu
15
Nov '07

My opinion on the Sun/NetApp altercation: Both companies should be grateful instead of resorting to lawsuits

Since opinions are like you-know-what, and since I’m decidedly anatomically complete in that respect (some, indeed, claim all of me is composed of the implied anatomical part, so maybe that’s why I’m so opinionated), I thought I’d throw my $0.02 in the pot and not stay silent. The whole issue irks me quite a bit, actually.

Like my colleague, Rich, and I think most digerati (there’s a nice word whose time came and went, it seems), I have been following the machismo display between Sun and NetApp (see some representative comments from both sides here and here). BTW, I doubt anything will really happen with the lawsuits, and highly doubt even that money will change hands out-of-court to settle this. This is more about chest-thumping than anything else. But, in a nutshell, it seems it all started due to NetApp wanting to buy some STK patents (from before the STK acquisition), Sun not wanting to sell but instead asking for $36m to license the patents, NetApp being upset and telling Sun they infringe their WAFL patent with ZFS, then Sun telling NetApp to stop selling filers. Those guys are all nuts. I may be missing some facts (NetApp is super-cagey about what STK stuff they wanted) but they are all still nuts.

It seems people will try to patent anything these days. But going after people that you think infringed your patents can be pathetic if your story is not airtight and your goals noble – remember SCO?

I do believe in protecting one’s IP in some way – whether the best way is a patent I’m not so sure, there’s always copyright. I’m not as naïve as some open source zealots that think all patents are evil and that all software should be free. I wonder where they work and how they all make their living? Do those guys all work in places that only do open source and just give away stuff? If I develop a piece of truly cool IP that can result in me making money, rest assured I’ll try to capitalize on it.

However, I do believe that the current patent system is flawed. It’s also difficult (I think impossible) to find people technically competent enough to oversee the process. For instance (and, to cut to the chase), I would have denied NetApp the WAFL patent, since

  1. It’s a simple evolution and/or modification of existing block allocation schemes to facilitate writes (more technical info later on)
  2. There were other COW (Copy On Write) filesystems prior to NetApp, such as LFS and numerous research projects. Specifically,
  3. Daniel Phillips had done most of the COW work prior to NetApp’s patent, but had to abandon work on the tux2 filesystem due to fear of patent laws (see here). He didn’t file a patent first, since nobody that does open source development is thus inclined.

     

But where do you draw the line on what’s truly new and patentable? And what if enforcing a patent is detrimental to the common good? Should Xerox have patented the mouse? It was totally new back then. What if they’d enforced the patent and told Apple and later Microsoft that they are not allowed, no matter what, to use a mouse? Or if HG Wells patented the science fiction novel? If Hoover patented the vacuum cleaner? If RCA patented the television? You get my drift. There would be zero innovation.

I think patenting obvious stuff should just not be allowed. And, if your patent is based on prior art (regardless of whether it’s been patented), it should be summarily denied. If the patent is granted but is then proven after the fact that someone else had figured out the idea first (as in the case of Mr. Phillips), the patent should automatically be invalidated. Complex, no?

Which is why many think that patenting software should not be allowed.

In the end, some problems have only a finite number of solutions (often only one). Researchers may be working simultaneously on the problem. Eventually, only one will be first with a solution. I am opposed to penalizing the other guy simply because he used a similar algorithm to mine (especially when, mathematically, there may be zero other solutions, making every approach to solve the problem produce the same result).

Back to Sun and NetApp. The truth is, I think, pretty simple. While I have enormous respect for both companies (a bit more for Sun, due to their history and my extensive personal experiences), both companies’ major products are based on a tremendous amount of prior art (patented or not, nobody seems to have complained to either company). Truly, they stand on the shoulders of proverbial IT giants. Sun has the PR benefit of having contributed vast amounts of IP to the world, compared to NetApp (though some technologies like NFS and Java have been pretty painful, so it’s a mixed blessing).

NetApp code heavily borrows from Unix, Sun, IBM, Cisco, EMC and many others. For instance, since Data ONTAP (NetApp’s OS) can’t scale beyond 2 boxes, NetApp purchased Spinnaker – SpinOS creates a single namespace that can transcend many nodes (BTW other products such as IBRIX, Exanet and others can do the same thing really well). The current GX OS is bits from the older ONTAP on top of FreeBSD with some SpinOS bits. However, both the older 7G and the newer GX OSes are offered, since 7G does a lot more (SpinOS can be just large-scale NAS – no iSCSI or FC block device targets, even if those targets on a 7G box are just files, but I digress). Of course NetApp wants to move everyone to SpinOS, which explains NetApp’s current craze with NFS everywhere. It’s infectious, now all of a sudden once again everyone wants to use NFS – VMWare, Oracle, senile grannies running compute clusters all over the world. We get it, it’s a shared-namespace, network-based FS, and sure, you can run pretty much anything on it. People have been for decades. How quickly we forget that it really isn’t the best network-based filesystem, and that there was a reason people developed cool alternative technologies such as AFS, Coda, PVFS, the native IBRIX mode, and many others. The new CIFS that’s part of Windows Server 2008 is actually a really decent implementation, but I’ll probably get flamed by the NFS fanbois for saying so.

And how quickly people forget that it was Sun that gave us NFS, warts and all (well, v4.1 ain’t too bad but that’s a collective effort – the wonders of open source). The rather execrable CIFS, BTW, (the other main NetApp “technology”) was not invented by Microsoft but rather by IBM in 1983. IBM and Cisco invented iSCSI. Legato (now owned by EMC) played a fundamental role in developing NDMP. And I can’t even remember who first created versioning filesystems but I fondly remember my VAXes and they used to do that stuff ages before NetApp even existed (not to mention proper manly-man single-system-image clustering, but that’s a story for another day). I’m pretty sure NetApp didn’t develop Fibre Channel, either.

Cut to today: Now everyone can do snapshots, it’s almost de rigueur, and the truly cool do application-aware snaps.

Volume management is standard, too.

Filesystem expansion is everywhere.

Thin provisioning (not a fan but anyway) is becoming more and more prevalent.

iSCSI is everywhere.

So, the real ZFS issues NetApp is complaining about seem to be the “Write Anywhere” and COW parts, since those are really the only true similarities with WAFL. Seriously, like that’s what’s the most important aspect of Sun’s ZFS. Indeed, while very quick for initial writes, a write-anywhere algorithm can lead to horrific fragmentation and continuously-declining performance over time (which is why you have to defrag NetApp filers). It’s just a safe, easy and computationally cheap method for allocating blocks to minimize write time for write-heavy applications such as NFS. Possibly one of the reasons NetApp did it was because in their boxes there are no RAID controllers, there’s just a CPU or two (486’s I believe in the original boxes) that has to do EVERYTHING – RAID calcs, rebuilds, snaps, caching, etc (the back end of all NetApp gear is JBOD). Using WAFL a lot of the inefficiencies in RAID are bypassed, since it will schedule multiple writes in order to fill a RAID stripe. A more elegant approach such as extent-based allocation (like VxFS) would have been too computationally-intensive, especially for writes. Dave and his pals have a good paper on WAFL here, BTW.
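
For those who have not seen the idea in code, here is a toy copy-on-write store. To be clear, this is neither WAFL nor ZFS: the class and method names are made up, there is no RAID, no stripe filling and no freeing of blocks. It only illustrates the general principle that a write never overwrites a live block, which is what makes snapshots nearly free and also why logically adjacent blocks can end up scattered all over the place.

```python
from typing import Dict, List


class CowStore:
    """Toy copy-on-write store: live blocks are never overwritten in place."""

    def __init__(self) -> None:
        self.blocks: List[bytes] = []                  # append-only "disk"
        self.active: Dict[int, int] = {}               # logical -> physical index
        self.snapshots: Dict[str, Dict[int, int]] = {}

    def write(self, logical: int, data: bytes) -> None:
        self.blocks.append(data)                       # always take a fresh block
        self.active[logical] = len(self.blocks) - 1    # repoint the block map

    def snapshot(self, name: str) -> None:
        self.snapshots[name] = dict(self.active)       # copy the map, not the data

    def read(self, logical: int, snapshot: str = "") -> bytes:
        table = self.snapshots[snapshot] if snapshot else self.active
        return self.blocks[table[logical]]
```

Write block 0, take a snapshot, write block 0 again: the active view returns the new data, the snapshot still returns the old data, and the two versions sit in different physical blocks.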

Here’s what ZFS is: It was not meant to be a NetApp killer, it’s just a truly modern FS, with few limits, and an amalgam of all the current “cool” technologies and ideas. Snaps, thin provisioning, expansion, volume management, pools, quotas, self-healing, all in a single technology, that’s surprisingly well thought out, and easy to use even from the command line. ZFS is not the raison d’être of the Solaris OS, but merely a feature of it. Plus it does data checksumming with every write, which other filesystems don’t. Your data is exceptionally safe in ZFS. Some test results here. More features here, and it’s easy to see NetApp getting annoyed after reading that page (though they just think COW is a good idea, the other tremendous features are not in NetApp’s WAFL). Not sure if they fixed the read performance issues NetApp has with their implementation, I need to do some testing of my own.

In my opinion, the only reason NetApp became popular is because it trivialized the whole NAS aspect. Made it easy to build decent, clustered NFS/CIFS boxes without the need to know UNIX. If Sun had put a wizard-driven GUI to perform such actions in their boxes 10 years ago, NetApp might not exist today. To date, I think Sun’s management tools are pathetic, no matter how amazingly solid the underlying tech might be. There’s a GUI for ZFS but, again, that’s beside the point. Aside from initial write performance, a NetApp filer is not about WAFL, extending disk pools and whatnot, it’s about all-around ease-of-use and the sheer amount of cool features.

If NetApp wants to sue someone so badly, maybe they need to sue the Openfiler or FreeNAS developers? Or, if they want to go after someone that’s not open source, how about Open-E? That stuff sure looks much more similar to NetApp than anything made by Sun. Really cool, too. Or maybe they need to sue EMC. Those guys sure make some nice, full-featured NAS gear. Among a myriad other solutions…

Suing someone over a filesystem that’s newer and better in almost every single way than yours but uses one common (and unavoidable in the case of COW) design methodology is just plain silly… and, BTW, how did this escape the patent trolls? Another COW implementation?

And if more developers like Daniel Phillips get scared because of patent laws, then innovation will truly be stifled. The whole point of research is that you can reference other people’s ideas so you don’t always have to re-invent the wheel.

NetApp needs to innovate a bit more themselves. They developed a cool technology and have milked it to death, and even made it do things it shouldn’t (like iSCSI and FC targets, the NetApp approach is really unclean but they are trying to force their OS to do everything, whereas companies like EMC go for the more modular approach and are criticized for being “complex”).

I think I’ll stop writing now since it’s getting late. Never was one to save posts for editing later.

D

Fri
26
Oct '07

Ate at the Staghorn steakhouse in NYC

At the insistence of my colleagues (that seem to enjoy the steak posts more than the high-falutin’ technology ones) I decided to visit another NYC steakhouse.

It was raining, I didn’t feel like going further so I went to a place near the office at 2 Penn Plaza (Madison Sq. Garden).

It’s a newer place called the Staghorn on 36th, just west of 8th Ave. Really nice and modern inside, unlike most other NYC steakhouses. Almost totally empty.

The prices are a bit below other joints, probably because the cuts are not quite as colossal.

I opted for a T-bone this time and a house salad. All the cuts had the same price, BTW.

The salad had an excellent vinaigrette with a touch of oregano. I fortified it with a tiny bit of blue cheese.

The steak was truly excellent, dry-aged, with a wonderful nuttiness and caramelization, exhibiting slight undertones of hazelnut.

Not perfect though - had the cut been a bit thicker it would have been juicier, another 4-5 oz wouldn’t be too much to add. Nonetheless, a wonderful piece of beef. In the thicker parts it was amazing in tenderness, texture and flavor.

I finished with a rather good tiramisu that was a touch on the oversoaked side but very tasty.

Recommended. This place shouldn’t be as obscure.

D

Mon
15
Oct '07

Uptempo cache can get paged out! (EDIT: After all, it does NOT).

I normally don’t do retractions unless proven wrong. So, ignore the text below and read Nick’s comment.

—————————-

A warning to those who use Datacore’s Uptempo:

While it works wonderfully as long as the server doesn’t suffer a low memory condition, the memory it reserves for cache will get paged out in low-memory situations.

I found out the hard way (as usual), while running some very demanding VMs (I only have 2GB and not the best laptop, a new machine is forthcoming). The way Uptempo reserves memory is by using a specific process, Dscaddmemory or something like that (I’ve now removed it from my system so I can’t remember the exact name). If you look at Task Manager, that process has as much memory allocated to it as you’ve allocated Uptempo.

When I was running out of RAM, I noticed that the process started shrinking in size, until it was 16MB (out of 280MB). Since it looks like a normal process, Windows decided to page it out in order to reclaim RAM.

Of course, this kinda defeats the purpose. I’d rather page out everything BUT my fancy dedicated cache, the way HP-UX does it if you tell it to (story for another day but HP-UX cache tends to work better if you specify the min and max sizes as the same and not let it auto-allocate).

My real beef with Uptempo is that it didn’t try to reclaim the memory when there most obviously was enough memory for it (after it paged itself out needlessly, I had over 350MB free and plenty in the Windows cache).

It didn’t even try to reclaim the RAM after I quit VMWare and had 1.5GB free.

Obviously, either I’m missing something fundamental or some work needs to be done. Granted, any time you are forced to swap heavily cache won’t help much but they should be at least giving the memory back to the process afterwards.

Supercache never shows up as a process, it grabs the memory when the system boots (it’s one of the first things that happen) and nothing can swap it out. It’s also configurable on-the-fly, Uptempo needs a reboot for any size changes.

With 64-bit all these helper caching programs will probably become obsolete since cache is not limited to 1GB any longer. Though I’m not sure I subscribe to Vista’s Superfetch, since it does make the HD work like crazy when you first start the box and is more suited for boxes that are not shut down it seems. Once it settles down it works OK.

D

Wed
26
Sep '07

Ate at The Old Homestead in NYC

I’ve been hopped up on uppers all day (relax, just a huge amount of chocolate-covered high-test espresso beans, though the amount of caffeine was surely enough to get me disqualified from competing in any sport - every time I pee it smells like freshly-brewed coffee). Needing something to relax me, and since my bowel movements have been altogether too easy lately, I thought I’d go for steak. Two birds with one stone.

It’s been a while since my last red meat extravaganza, and, at the behest of my buddies, I tried The Old Homestead, on 14th and 9th.

The place is a bit old-fashioned, as befits most NYC steakhouses. There’s this weird old sign, stating this place is “the king of beef”.

I bumped into Odin on the way in, he was ordering takeout for the lads. We exchanged knowing nods, told him to say hi.

I was served by a decrepit waiter with a handlebar moustache, he probably was almost too old to fight when he was drafted in WWI. He had an accent so I asked him where his pith helmet was. He, in turn, recommended the 36oz ribeye, priced no more than lighter fare on the menu. Once again, I asked for an internal temperature between 145F and 150F, once again I got a blank stare. So far, only the people at Emeril’s Delmonico in Vegas have been able to respond to this request without batting an eyelid. But that is a story for another day.

I also ordered a chopped salad since I’ve been told I need some roughage. The salad was amazing, and enough for two. I ate the whole thing, not one to ignore roughage consumption guidelines.

Then the steak came.

Oh dear.

The bone wasn’t even that big. The rest was all meat and a bit of fat. This is, to date, the largest single steak I’ve had (though not, alarmingly, the largest amount of meat I’ve consumed in one sitting). And was it good! It was served with a roasted head of garlic, French style. Not quite the consistency of the steak in Flames (that was almost like good Ahi) but still awesome.

I almost couldn’t eat the whole thing. But I did, it was that good. By the end I felt like Mr. Creosote in Monty Python’s The Meaning of Life. And I did not have the “waffer thin mint”.

On the way back to the train, it was hot and, after all this food, I started sweating profusely. I passed by a funeral parlor on 14th and the proprietor eyed me appreciatively. This is not hot-weather food!

Highly recommended.

D

Thu
20
Sep '07

WAN acceleration for remote workers

The deluge of WAN accelerators from Cisco, Riverbed, Juniper, Expand, Packeteer, Bluecoat, Silverpeak etc. etc. is proving good for datacenters. Not sure how many vendors will remain viable in a year or two, but the selection at the moment is decent.

However, most of the vendors don’t address remote desktop acceleration, say for people using 3G cards on their laptops or even cable modems - sometimes the routing to corporate networks can be arcane enough that the ms of latency add up, plus most home connections are asymmetrical anyway.

So, it would be pretty cool to have a WAN accelerator in your laptop, right? Well, so far only two companies have stepped forward:

The far more established product, even if you’ve never heard of it, is AcceleNet Enterprise from ICT (Intelligent Compression Technologies, www.ictcompress.com - they were recently bought by ViaSat). ICT has been doing just this for years, with a veritable who’s who of clients (no they haven’t paid me to say this, I just think the stuff is cool). Lots of service providers use it.

ICT deploys a server that acts as a proxy, then you install an agent on your laptop. Transfers are compressed both ways.

The other vendor is known to us all - it’s Riverbed. They now have what’s called Steelhead Mobile. Effectively, it puts a Riverbed box inside your laptop. It needs a normal Steelhead to communicate with, as well as a Steelhead Mobile Controller for management. I saw pricing for the controller and it was a bit dear…

You can even adjust how much cache to give your mini-Riverbed, so if you have the space, go nuts.

Of course, you can also use this technology for servers and save money on appliance costs - I wonder if they have something that checks if you’ve installed it on a server OS, and how much CPU it takes to do its thing.

I heard somewhere Cisco is also working on something similar, unsurprisingly.

D

Fri
17
Aug '07

Processor scheduling and quanta in Windows (and a bit about Unix/Linux)

One of the more exotic and exciting IT subjects is the one of processor scheduling (if you’re not excited, read on, practical stuff to be seen later in the text). Multi-tasking OSes just give the illusion that they’re doing things in parallel - in reality, the CPUs rapidly skip from task to task using various algorithms and heuristics, making one think the processes truly are running simultaneously. The choice of scheduling algorithm can be immensely important.

Wikipedia has a nice article on schedulers in general: en.wikipedia.org/wiki/Scheduling_%28computing%29, good primer.

To cut a long story short: the processors are allowed to spend finite chunks of time (quanta) per process. Note that the quantum has nothing to do with task priority, it’s simply the amount of time the CPU will spend on the task. Every time the CPU switches to a new process, there’s what’s called a context switch (en.wikipedia.org/wiki/Context_switch), which is computationally expensive. Obviously, we need to avoid excessive context switching but still maintain the illusion of concurrency.

In Windows Server (that uses a multi-level feedback queue algorithm, FYI), the default quantum is a fixed 120ms, close to many UNIX variants (100ms) and generally accepted as a reasonably short length of time that can fool humans into believing concurrency. Compare this to the workstation-level products (Windows Vista/XP/2000 Pro) that have a variable quantum that’s much shorter and also provide a quantum (not priority) boost to the foreground process (the process in the currently active window). In the workstation products, the quantum ranges from 20-60ms typically, with the background processes always relegated to the smallest possible quantum, ensuring that the application one is currently using “feels” responsive and that no background task hampers perceived performance too much. Typically, in a box that’s used as a busy terminal server this will be the better setting to use since it will ensure that the numerous “in-focus” user processes will all get a quantum sooner rather than later.

The longer, fixed quantum of Windows Server means that fewer system resources are wasted on context switching, and that all processes have the same quantum. More total system throughput can be realized with such a scheme, and it’s more of a fair scheduler. It also explains the higher benchmark numbers when running the scheduler in “background services” mode. It’s obviously best for systems that are running a few intensive processes that can benefit from the longer quantum (and, believe it or not, games and pro audio apps run better like this).
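
A bit of back-of-the-envelope arithmetic shows why the longer quantum wastes less. The sketch below assumes CPU-bound processes that always use their full quantum, and a per-switch cost of 0.1ms that is purely a made-up figure for illustration (the real cost depends on the hardware and on how much cache has to be refilled). The function name is my own.

```python
def switch_overhead(quantum_ms: float, switch_cost_ms: float) -> float:
    """Fraction of CPU time lost to switching when every quantum is fully used."""
    return switch_cost_ms / (quantum_ms + switch_cost_ms)


# Hypothetical per-switch cost, NOT a measured figure.
ASSUMED_SWITCH_COST_MS = 0.1

for quantum in (20, 60, 120):
    lost = switch_overhead(quantum, ASSUMED_SWITCH_COST_MS)
    print(f"{quantum:>3}ms quantum: {lost:.4%} of CPU time spent switching")
```

Whatever the true switch cost is, the fraction lost shrinks as the quantum grows, which is the throughput argument in a nutshell.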

Note that I/O-bound threads (processes waiting on disk, mouse, screen and keyboard I/O) are given priority over CPU-bound threads anyway, which explains why the longer quantum doesn’t harm interactivity much. Try it - have 4 winzip/winrar/7zip sessions running concurrently. You CAN still move your mouse :) Here’s a great primer on internal windows architecture: elqui.dcsc.utfsm.cl/apuntes/guias-free/Windows.pdf. Another, deeper dive: download.microsoft.com/download/5/b/3/5b38800c-ba6e-4023-9078-6e9ce2383e65/C06X1116607.pdf.

Of course, there are ways to tune the timeslice in a more fine-grained fashion. In the registry, check out HKLM\SYSTEM\CurrentControlSet\Control\PriorityControl\Win32PrioritySeparation . Here are some explanations about how it works: www.microsoft.com/technet/prodtechnol/windows2000serv/reskit/regentry/29623.mspx?mfr=true and www.microsoft.com/mspress/books/sampchap/4354c.aspx are great.

For instance - what if you don’t care to increase the quantum on the foreground window but, instead, just want short, fixed quanta (effectively around 60ms) for all processes to improve response time on a system with a lot of processes? Setting Win32PrioritySeparation to 0x28 will take care of that.
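
Before changing anything it may be worth checking what the box is currently set to. A small read-only sketch using Python’s standard winreg module follows; the function name is my own, and on anything that isn’t Windows it simply returns None instead of failing.

```python
import sys
from typing import Optional

KEY_PATH = r"SYSTEM\CurrentControlSet\Control\PriorityControl"
VALUE_NAME = "Win32PrioritySeparation"


def read_priority_separation() -> Optional[int]:
    """Return the current Win32PrioritySeparation value, or None off Windows."""
    if sys.platform != "win32":
        return None
    import winreg  # standard library, only present on Windows

    # OpenKey defaults to read-only access, so this cannot change the setting.
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, KEY_PATH) as key:
        value, _value_type = winreg.QueryValueEx(key, VALUE_NAME)
    return value


current = read_priority_separation()
if current is not None:
    print(f"{VALUE_NAME} is currently 0x{current:02X}")
```

This only reads the value. Writing it needs administrative rights, and I would test any change on a non-production box first.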

Here’s a useful Win32PrioritySeparation chart from forums.guru3d.com/showthread.php?p=1451631#post1451631:

2A Hex = Short, Fixed, High foreground boost.
29 Hex = Short, Fixed, Medium foreground boost.
28 Hex = Short, Fixed, No foreground boost.

26 Hex = Short, Variable, High foreground boost.
25 Hex = Short, Variable, Medium foreground boost.
24 Hex = Short, Variable, No foreground boost.

1A Hex = Long, Fixed, High foreground boost.
19 Hex = Long, Fixed, Medium foreground boost.
18 Hex = Long, Fixed, No foreground boost.

16 Hex = Long, Variable, High foreground boost.
15 Hex = Long, Variable, Medium foreground boost.
14 Hex = Long, Variable, No foreground boost.
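
The chart makes more sense once you notice that each value is really three 2-bit fields packed together. Here’s a minimal Python sketch of a decoder, assuming the layout the chart implies (bits 5-4 pick the quantum length, bits 3-2 pick fixed vs variable, bits 1-0 pick the foreground boost). The function name is mine, and treating any combination the chart doesn’t list as “Default” or “Unspecified” is my assumption, not something the chart says:

```python
# Decode a Win32PrioritySeparation value into its three 2-bit fields.
# The field layout is inferred from the chart above; values the chart does
# not list are reported as "Default"/"Unspecified" (an assumption).

QUANTUM_LENGTH = {0b01: "Long", 0b10: "Short"}                  # bits 5-4
QUANTUM_TYPE = {0b01: "Variable", 0b10: "Fixed"}                # bits 3-2
FOREGROUND_BOOST = {0b00: "No", 0b01: "Medium", 0b10: "High"}   # bits 1-0


def decode_priority_separation(value: int) -> str:
    """Return a chart-style description such as 'Short, Fixed, No foreground boost'."""
    length = QUANTUM_LENGTH.get((value >> 4) & 0b11, "Default")
    kind = QUANTUM_TYPE.get((value >> 2) & 0b11, "Default")
    boost = FOREGROUND_BOOST.get(value & 0b11, "Unspecified")
    return f"{length}, {kind}, {boost} foreground boost"


if __name__ == "__main__":
    # Print a few rows in the same format as the chart.
    for value in (0x2A, 0x28, 0x26, 0x18, 0x14):
        print(f"{value:02X} Hex = {decode_priority_separation(value)}.")
```

Going the other way is just as easy: pick the three fields you want, shift them into place and OR them together.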

Here are some other pages where others have figured out the effective quanta (and remember the numbers are not in ms): blogs.msdn.com/embedded/archive/2006/03/04/543141.aspx (for embedded Windows, I have doubts about the accuracy of his calculations regarding the effective quantum but still interesting), www.microsoft.com/technet/sysinternals/information/windows2000quantums.mspx (for Windows 2000, probably still valid).

Here’s a really nice article on the effects of schedulers and I/O-bound processes on virtualization: regions.cmg.org/regions/mcmg/m102006_files/6187_Mark_Friedman_Virtualization.doc

Linux, on the other hand, has not one but several totally different CPU schedulers and I/O elevators available. Just see this page, comparing 2.6.22 with Vista’s kernel, and note how many non-standard features are available as patches: widefox.pbwiki.com/Scheduler . You can get schedulers with cool names such as genetic, anticipatory, etc. Linux used to suffer on the desktop, but with recent patches interactivity has improved tremendously, and it is now far more viable as a desktop OS. Here’s some cool info on anticipatory schedulers: www.cs.rice.edu/~ssiyer/r/antsched/. Anticipatory schedulers can help systems with slower I/O (laptops and desktops, especially) feel more interactive, and the anticipatory scheduler was the default I/O elevator for a while (CFQ is the current default for I/O, though it can cause issues for desktop users, see ubuntuforums.org/showthread.php?t=456692). A list of all the I/O elevators in the kernel: ebergen.net/wordpress/2006/01/26/io-scheduling/. Whitepapers: www.cs.ccu.edu.tw/%7Elhr89/linux-kernel/Linux%20IO%20Schedulers.pdf, www.linuxinsight.com/files/ols2004/pratt-reprint.pdf, www.linuxinsight.com/files/ols2005/seelam-reprint.pdf .
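
If you want to see which I/O elevator each of your disks is using, 2.6 kernels expose it under /sys/block/<device>/queue/scheduler, with the active one shown in square brackets. A small Python sketch, assuming that sysfs layout is present on your box (the helper name is made up for illustration):

```python
import glob


def active_elevator(scheduler_line: str) -> str:
    """Return the bracketed (active) elevator from a sysfs scheduler line."""
    for name in scheduler_line.split():
        if name.startswith("[") and name.endswith("]"):
            return name[1:-1]
    return "unknown"


if __name__ == "__main__":
    # Walk every block device that exposes a scheduler file.
    for path in sorted(glob.glob("/sys/block/*/queue/scheduler")):
        try:
            with open(path) as handle:
                line = handle.read().strip()
        except OSError:
            continue  # device went away or is not readable; skip it
        device = path.split("/")[3]
        print(f"{device}: {active_elevator(line)} (available: {line})")
```

Writing one of the listed names back to the same file (as root) switches the elevator for that device on the fly, which makes it easy to compare them under your own workload.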

Recently, Linux moved to the Completely Fair Scheduler model (www.osnews.com/story.php/18240/Linux-Switches-to-CFS-Scheduler-in-2.6.23), sparking a lot of controversy (www.osnews.com/story.php/18350/Linus-On-CFS-vs.-SD) since it’s not quite done yet (kerneltrap.org/node/14055). More info on CFS: immike.net/blog/2007/08/01/what-is-the-completely-fair-scheduler/.

Interesting benchmarks showing the effects of scheduling on Linux performance: developer.osdl.org/craiger/hackbench/, math.nmu.edu/~randy/Research/Speaches/Disk%20Scheduling%20In%20Linux.ppt.

For anyone wishing to test the various Linux schedulers’ impact on interactivity, Con Kolivas has something: members.optusnet.com.au/ckolivas/interbench/. Con’s Staircase/Deadline (SD) scheduler (lwn.net/Articles/224865/) didn’t make it to the mainline kernel, unfortunately, and a miffed Con announced he’s dropping out of kernel development. Pity, since I think he single-handedly contributed more to the advancement of Linux interactivity on the desktop than anyone else. It’s great to have the choice of schedulers depending on how you’re planning to use your system - it’s already done with the I/O elevator, let it be done with the CPU scheduler. Instead, Linus invoked his Papal-like powers and made what I consider to be an unsound decision.

The real issue with Linux though is the userland. Here’s a great paper showing issues with the userland and how it robs us of speed: ols2006.108.redhat.com/reprints/jones-reprint.pdf . A lot of the CPU and I/O scheduler design consists of workarounds for those issues. Unless one deliberately chooses a stripped-down Linux distribution, the amount of bloat in the current code is incredible.

Finally, Solaris 10 also comes with a bunch of different schedulers, which you can assign globally or on a per-process/project basis. Tons more info: www.princeton.edu/~unix/Solaris/troubleshoot/schedule.html, blogs.sun.com/andrei/date/20050131, wiki.its.queensu.ca/display/JES/Solaris+10+Containers+and+Fair+Share+Scheduling, docs.sun.com/app/docs/doc/816-0222/6m6nmlsug?l=en&a=view.

Heady reading, no?

D

Thu
9
Aug '07

Ate at AJ Maxwell’s in Manhattan

Once more, dear reader, I place my colon’s health at peril for your reading pleasure and culinary edification.

I could have gone to Via Brazil for a proper feijoada by walking a few yards from my hotel but, instead, I sacrificed variety on the altar of dedication and had another bone-in ribeye. It is my mission to eat at all the decent NYC steakhouses.

For those who don’t know me (and many who do): I don’t eat steak all the time… indeed, I consider myself a veritable gourmand (and I do know the difference between gourmand and gourmet, as do my belts).

Anyway: ordered a medium-rare ribeye. They chargrill their steaks at AJ Maxwell’s so if you don’t like them that way don’t go. If you do, the steaks are good. The meat was tender and flavorful. It looks colossal but it is (they say) just 22oz. It looked huge and was over 2in thick. Probably 22oz after cooking.

I read some reviews and typically the people that complain asked for medium or medium well. If the piece is that thick and they chargrill it, rest assured the exterior will be pretty crispy if you want medium. By the same token, getting medium rare could mean some parts are pretty rare indeed. Not the place to be if you like medium and above.

I actually thought it was better than Bobby Van’s though still not as good as Flames. However, eating once someplace is not enough of a statistical sample. It’s beef after all, not purified water. Not the easiest thing in the world to be consistent with. Hence the incredulity of most people when I tell them that I had the best steak of my life at Wollensky’s. Maybe I got lucky. Hey, at least I said Wollensky’s, not Applebee’s… it’s a legitimate steakhouse.

After a few months I’ll definitely need colonics to get rid of the barnacles.

BTW, if you just want to read about technology you can select the topics at the top of the screen so you don’t have to read about my steak-eating adventures. Or vice versa.

D

Wed
8
Aug '07

Ate at Bobby Van’s in Manhattan

After the glowing reviews of a colleague I ate at Bobby Van’s on 230 Park. It’s considered to be one of the better NYC steakhouses (there are 4 in the chain, most in NYC).

I got a bone-in ribeye and some mushrooms.

I asked for a 145°F internal temperature and the decrepit waiter looked at me like I had three heads. “What does that mean?” I said medium rare…

The steak was pretty good, if slightly overcooked, but not as flavorful as what I had at Flames. It was also a bit dry for a ribeye and totally unseasoned. Still, not a bad cut.

The mushrooms provided some lubrication.

Not a religious experience, I’ll try the Old Homestead tomorrow hopefully.

D

Mon
30
Jul '07

Just how much is your antivirus harming your I/O?

I just got a new corporate laptop, a nice, shiny T60 (OK, it’s IBM black and therefore thoroughly incapable of reflecting on any part of the spectrum).

I noticed that doing disk-intensive work was much slower than I’ve been used to. I configured it as a server (see previous posts) and that helped a bit but not as much as I’d like to.

It seems the antivirus software is checking each and every file, and takes 100% of a CPU to do so. Were this not a dual-core box it would be begging for mercy.

Taking an entire CPU is unacceptable IMO. So I ran some benchmarks - the trusty postmark once more to the rescue:

 

After tweaking as a server, antivirus running, 100% CPU utilization while bench running:

Time:
344 seconds total
230 seconds of transactions (86 per second)

Files:
20092 created (58 per second)
Creation alone: 10000 files (95 per second)
Mixed with transactions: 10092 files (43 per second)
9935 read (43 per second)
10064 appended (43 per second)
20092 deleted (58 per second)
Deletion alone: 10184 files (1131 per second)
Mixed with transactions: 9908 files (43 per second)

Data:
548.25 megabytes read (1.59 megabytes per second)
1158.00 megabytes written (3.37 megabytes per second)

 

With a more efficient antivirus program instead, variable CPU utilization (from 10%-100%):

Time:
276 seconds total
174 seconds of transactions (114 per second)

Files:
20092 created (72 per second)
Creation alone: 10000 files (123 per second)
Mixed with transactions: 10092 files (58 per second)
9935 read (57 per second)
10064 appended (57 per second)
20092 deleted (72 per second)
Deletion alone: 10184 files (484 per second)
Mixed with transactions: 9908 files (56 per second)

Data:
548.25 megabytes read (1.99 megabytes per second)
1158.00 megabytes written (4.20 megabytes per second)

 

Disabling the antivirus makes it way faster for transactions:

Time:
174 seconds total
91 seconds of transactions (219 per second)

Files:
20092 created (115 per second)
Creation alone: 10000 files (222 per second)
Mixed with transactions: 10092 files (110 per second)
9935 read (109 per second)
10064 appended (110 per second)
20092 deleted (115 per second)
Deletion alone: 10184 files (268 per second)
Mixed with transactions: 9908 files (108 per second)

Data:
548.25 megabytes read (3.15 megabytes per second)
1158.00 megabytes written (6.66 megabytes per second)

Caching with UpTempo for a nice 50% boost in performance:

Time:
121 seconds total
65 seconds of transactions (307 per second)

Files:
20092 created (166 per second)
Creation alone: 10000 files (277 per second)
Mixed with transactions: 10092 files (155 per second)
9935 read (152 per second)
10064 appended (154 per second)
20092 deleted (166 per second)
Deletion alone: 10184 files (509 per second)
Mixed with transactions: 9908 files (152 per second)

Data:
548.25 megabytes read (4.53 megabytes per second)
1158.00 megabytes written (9.57 megabytes per second)

Not tweaking the laptop as a server resulted in > 400s runtimes in the default config (sometimes 500s). FYI, the drive is a smaller, 5400 RPM jobbie, not the 200GB 7200 RPM SATA I have my eye on.
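
To put the four runs side by side, here’s a quick Python sketch that takes the total runtime and transactions-per-second figures straight from the results above and works out the change relative to the first run. The labels and the helper name are mine:

```python
# Postmark results copied from the runs above:
# (label, total seconds, transactions per second)
RUNS = [
    ("Server tweaks, original antivirus", 344, 86),
    ("More efficient antivirus", 276, 114),
    ("Antivirus disabled", 174, 219),
    ("UpTempo caching", 121, 307),
]


def percent_change(old: float, new: float) -> float:
    """Percentage change going from old to new (positive means an increase)."""
    return (new - old) / old * 100.0


if __name__ == "__main__":
    base_label, base_seconds, base_tps = RUNS[0]
    for label, seconds, tps in RUNS[1:]:
        print(
            f"{label}: runtime {percent_change(base_seconds, seconds):+.0f}%, "
            f"transactions/s {percent_change(base_tps, tps):+.0f}% vs. {base_label}"
        )
```

Swap in your own postmark numbers to compare configurations on a different box.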

One could extrapolate these results. On a bigger box the end results will differ but everything will remain relatively similar.

Obviously, antivirus is sorely needed in this day and age, but if you’re planning on doing heavy I/O be careful what antivirus program you pick and how it’s configured. Depending on the server, I’d gladly trade some protection in exchange for a bunch more performance. Or you can go Unix/Linux and not really have to bother.

I’d say setting up an antivirus program to only scan extensions that can be infected, and only on creates/modifies rather than reads, can boost performance significantly.

Interestingly, caching didn’t help much with antivirus enabled - most of the bottleneck was the antivirus since everything had to go through it first. What if this was a database/email/fileserver with heavy activity?

D

Wed
13
Jun '07

Ate at Murphy’s Style Grill, in Red Bank, NJ

Will be demonstrating Cisco’s WAAS tomorrow in NYC, so today we spent some time going through a testing protocol so we can show people different things.

After we finished we had dinner at Murphy’s in NJ. Strange place. It’s not a classy steakhouse or anything - nor does it have aspirations to be one.

The menu is, to quote Kipling, as immutable as the hills. Apparently any substitutions or deviations are swiftly and sternly stamped out, as though they signify an impending revolution that threatens all that we hold holy. Dressing on the side? Heresy! Burn!

I got the 24oz Delmonico. I was urged not to ask anything about it, lest they bring out someone to take me to the back. He also suggested generous amounts of A1.

At least it was inexpensive (about $17) and properly cooked. If you’re looking for flavor and marbling, look elsewhere. Much of it looked like solid marble, though. Had to surgically remove a good amount of gristle.

Better than the steak at Bowling Green, I have to admit.

D

Fri
8
Jun '07

This has been one of the worst trips ever - because of one of the silliest DR exercises ever

Well, aside from visiting Flames and helping fix a severe customer problem. Those were rewarding. I still haven’t pooped that steak, BTW.

I was supposed to only stay for 1 day in Manhattan, fix the issue, ba da bing. I ended up staying an extra day - had no extra clothes and no time to get anything. Washed my undies on my own and used the hair dryer over a period of hours to dry them. I learned my lesson now and will always have extra stuff with me.

So I try to go back home today and guess what - Air Traffic Control computers had a major glitch (abcnews.go.com/Business/wireStory?id=3259992) that messed up the whole country’s air travel. Thousands of flights delayed and canceled. Mine was canceled, after I spent about 10 hours in the airport. Another 2 hours in the line to simply rebook the flight since they had 3 people trying to serve hordes. And all because, at least according to the report, a system failed and the failover system didn’t have the capacity to sustain the whole load.

So, while I wait in the airport to catch a stand-by flight tomorrow morning, unbathed and frankly looking a bit menacing, I decided to vent a bit. No hotels, no cars.

Maybe this is too much conjecture and if I’m wrong please enlighten me, but let’s enumerate some of the things wrong with this picture:

  1. First things first: While it’s cool to fail over to a completely separate location, typically you want a robust local cluster first so you can fail over to another system in the original location.
  2. If the original location is SO screwed up (meaning that a local cluster has failed, which typically means something really ominous for most places) ONLY THEN do you fail over to another facility altogether.
  3. Last but not least: Whatever facility you fail over to has to have enough capacity (demonstrated during tests) to sustain enough load to let operations proceed. Ideally, for critical systems, the loss of any one site should hardly be noticeable.
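
Rule 3 is simple arithmetic, so there’s no excuse for skipping it. A rough Python sketch of the check, with entirely hypothetical site names and capacity numbers (in whatever unit you measure load): for each site, pretend it’s gone and see whether what’s left can still carry peak load.

```python
def survives_single_site_loss(site_capacity: dict, peak_load: float) -> dict:
    """For each site, report whether the remaining sites can carry peak_load without it."""
    total = sum(site_capacity.values())
    return {
        lost_site: (total - capacity) >= peak_load
        for lost_site, capacity in site_capacity.items()
    }


if __name__ == "__main__":
    # Hypothetical numbers, purely for illustration.
    sites = {"primary": 100, "secondary": 60}
    for lost_site, ok in survives_single_site_loss(sites, peak_load=80).items():
        verdict = "OK" if ok else "NOT ENOUGH CAPACITY"
        print(f"Lose {lost_site}: {verdict}")
```

The numbers you feed in should come from actual failover tests, not from the spec sheet.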

According to the report none of the aforementioned simple rules were followed. Someone made the decision to fail over to another facility, which promptly caved under the load. A cascade effect ensued.

I mean, seriously: One of the most important computer systems in the country does not have a well-thought-out and -tested DR implementation. Guys, those are rookie mistakes. Like some airports having 1 link to the outside world, or 2 links but with the same provider. Use some common sense!

So, I guess I’ll put that in the list together with using what’s tantamount to unskilled labor securing our airports instead of highly trained and well-paid personnel that’s been screened extremely intensely and actually takes pride in the job. Maybe some of those unskilled people are running the computers, it might be like the Clone Army in Star Wars. A mass of cheap, expendable labor that collectively has the IQ of my left nut (I’m not being overly harsh - my left nut is quite formidable). The armed forces heading the same way isn’t the most reassuring thought, either.

Yes, I’m upset!!!

D

Thu
7
Jun '07

ZFS in OSX

Not amazing news but an official announcement nonetheless: Saw this (www.macnn.com/articles/07/06/06/zfs.in.leopard/) and I couldn’t resist posting. This means a few things:

  1. Sun figured out how to make ZFS bootable (at least on OSX)
  2. Someone figured out how to deal with ZFS and resource forks (I can’t believe they are willing to break compatibility with so much software otherwise).

Now I just need a Mac so I can run some benchmarks before and after. I have some buddies that might oblige… finally the Macs get a decent FS.

Now if only Apple could lose the silly Mach legacy. It’s a common misconception that the kernel in OSX is FreeBSD - it ain’t. Run lmbench (www.bitmover.com/lmbench/) on different platforms and compare results such as context switching, thread creation and whatnot. Then you’ll see why OSX can’t always make a decent server OS.

D

Ate at Flames in Manhattan

I was helping a client in the Wall Street District today with some rather obscure CIFS performance issues (Opportunistic Locks anyone? Berserk BDCs causing issues? Multi-user Access DBs over WAN?)

Had to stay overnight (unplanned) so after putting in some solid hours I decided to get some steak, and NYC is the place to get decent steak.

Did some research and found out that Flames was walking distance from my hotel, so I went.

Got a T-Bone this time (usually go for strip or ribeye but the waiter insisted, even though they had far more expensive cuts on offer). Some creamed spinach and a small salad and I was set.

Flames is one of those fancy places where they cut your steak for you. At least they don’t feed you or, indeed, help you masticate.

Not that they would need to - the dry-aged steak had fantastic flavor and was reasonably tender (not the most tender but good). I wish it had been a tad less cooked but it was still great, and I devoured it in atavistic glory, almost beating the man-pelt on my chest in ecstasy. It’s been a while since I’ve had proper dry-aged beef.

The creamed spinach wasn’t too creamy or salty. The salad was just OK, I typically use salads for intestinal lubrication anyway and it served the purpose.

I did overhear some patrons asking for well done steaks; sadly, this is one of those places where they won’t try to talk you out of it. I think steakhouses should make you actually sign a waiver if you want to commit such a culinary atrocity.

I also overheard a waiter trying to sell some $100 “Kobe” steak to some ladies, telling them how they massage the cows 4 times a day. I discreetly shook my head at them and they got the message.

Anyway - long story short, strongly recommended, and don’t dare order anything beyond medium-rare.

Now back to washing and drying my Superman underoos - I had no change of clothes and I’m writing this naked. It kinda is an appropriate image for this review though…

D

Sat
2
Jun '07

IBRIX at EMC World

I’ve known about IBRIX for a while, but it was refreshing to talk to a decent techie that knew the product. They have improved it a lot over the past year.

For the uninitiated, IBRIX can be any of the following:

  1. A network-based filesystem using the IBRIX client and protocol
  2. A network-based filesystem accessed over NFS or CIFS
  3. A SAN-based parallel filesystem

The product’s claim to fame is its scalability and performance (realized by adding extra nodes “hot”). Their most famous client is probably Pixar, which replaced a ton of NetApp boxes with an IBRIX cluster and realized huge performance benefits and vastly reduced costs. I always liked cool filesystem technologies and this definitely falls under the realm of “cool”. Some highlights based on notes I took on my Blackberry during the session and questions I asked:

  • No limits on filesystem size (they have deployed single namespace filesystems several PB in size).
  • 300MB/s read, 200MB/s write per node on a small box. Bigger boxes can do 1.2GB/s per node; of course, your storage needs to be able to keep up.
  • No limit on the number of nodes.
  • Automatic rebalancing of data over time. When you add new disk you rebalance to keep things humming.
  • Dedicated IBRIX backup node, works with 3rd party backup SW, can have many backup servers for backup speed.
  • Has snapshots now (global); the lack of them was a failing of the product before.
  • No real limit on the number of files per FS.
  • Biggest file size they have tested on production is an 8TB file, no software limit.
  • Nodes use FC to access storage, clients use Ethernet.
  • Client on Windows or Linux, otherwise general NFS and CIFS. Client is fastest.
  • Your prod servers can be the IBRIX nodes, but that is very compute-intensive. They recommend the client (IP-based, bonded), or getting an 8-core box.
  • There is no single lock manager - this is the coolest thing. There is global metadata and global locking, all nodes participate equally.
  • How are node failures handled? All nodes interchangeable. All see same storage. Storage allocated to remaining servers if you lose a node.
    Can lose all but 1 server.
  • Back-end storage size per node? Unlimited.
  • Multipathing per node? Powerpath works. Can do bonded GigE up to 8 ports per.
  • How are files allocated? The file inode contains the info concerning which node it needs to go to. Round-robin allocation or preferred servers per file type. Also if server over 50% full then it’s skipped.
  • All volumes accessible by all nodes.
  • Can stripe huge files across many nodes.
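
To make the file allocation bullet a bit more concrete, here’s a toy Python sketch of the policy as I understood it from my notes: round-robin across nodes, optional preferred servers per file type, and skipping any server that’s over 50% full. This is NOT IBRIX’s actual code or API; every class, method and node name below is hypothetical:

```python
class ToyAllocator:
    """Toy model of round-robin placement that skips servers over a fullness threshold."""

    def __init__(self, nodes, preferred=None, full_threshold=0.5):
        self.nodes = list(nodes)
        self.preferred = preferred or {}   # file type -> list of preferred nodes
        self.full_threshold = full_threshold
        self._next = 0                     # round-robin counter

    def pick(self, usage, file_type=None):
        """Pick a node for a new file; usage maps node name -> fraction full (0.0-1.0)."""
        candidates = self.preferred.get(file_type, self.nodes)
        eligible = [n for n in candidates if usage[n] <= self.full_threshold]
        if not eligible:
            # Preferred servers are all too full: fall back to any node with room.
            eligible = [n for n in self.nodes if usage[n] <= self.full_threshold]
        if not eligible:
            raise RuntimeError("every node is over the fullness threshold")
        node = eligible[self._next % len(eligible)]
        self._next += 1
        return node


if __name__ == "__main__":
    usage = {"node1": 0.10, "node2": 0.60, "node3": 0.20}
    allocator = ToyAllocator(usage.keys(), preferred={"video": ["node3"]})
    print([allocator.pick(usage) for _ in range(4)])   # node2 is skipped (60% full)
    print(allocator.pick(usage, file_type="video"))    # goes to the preferred node
```

The real product keeps the placement decision in the file inode; the sketch only models how a target node might be chosen.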

I’m stoked! I can think of so many uses for this product:

  1. Data mining
  2. Digital media
  3. Oil and gas
  4. Backups

D

Ate at Trotter’s Tavern in Bowling Green, OH

I had some great customer meetings in OH this week. One meeting took me to Bowling Green, cute town.

The locals like to eat steak at Trotter’s Tavern. They only serve fist-sized and -shaped chunks of sirloin in some weird sauce that has at least some Worcestershire in it but is more tangy. No other cut choices, you get either 10 or 16 ounces and that’s it.

I asked the waitress how it was aged and got a blank stare back. I could almost read her mind: “we just defrost it in the microwave”.

Well, had it been cooked properly it might have been OK, but mine was well-done (which I hadn’t asked for). Ate it anyway, as is my idiom, but I can’t say I recommend the place. Maybe if you get the 10-ouncer and ask for medium rare it might be medium by the time you get it. It’s tough to cook a thick piece of meat properly.

At least the place is relatively inexpensive, their most expensive piece is $25 and comes with all the trimmings.

There was one weird thing though: The restroom was festooned with carvings (yes, carvings) asserting the gayness of various people.

D

Wed
23
May '07

Data Domain Update

I’m not known for retractions and I’m not posting one. I did however check out the new DD boxes and the really big ones are far more capable than the old ones.

So, the techies (hats off for enduring a half hour with me) explained to me a few things:

  1. The smallest block is 4K
  2. The highest possible performance for the biggest box is 200MB/s
  3. The biggest box can do a bit over 30TB raw
  4. They scrub the disk continuously so it’s effectively defragged (see below for caveat) - they did admit performance totally sucks over time if you don’t do it (finally vindicated!)

This is good news, since it’s obviously far bigger than the old ones.

Some issues though (based on what the techies told me):

  1. It scrubs the disk by virtue of NBU deleting the old images, then it knows what to get rid of. If your retentions are long then you will have performance problems. They suggested just dumping it all to tape and starting afresh once in a while. Which just confirms my suspicions on how the stuff truly works.
  2. Each “controller” is really a separate box. The 16 controller limit does not mean it’s a larger appliance, it’s the limit of the management software.
  3. Ergo, each controller can be a separate VTL or separate NFS mount. You cannot aggregate all your controllers in one large VTL. This sucks since if you need to do backups at 1GB/s or so, you’ll need at least 5-6 boxes, and you will have to define a separate library and drives per box. If you do NFS, you need to define 1-2 shares per box. This is a management nightmare. Make it all a single library! Copan has the same issue. I don’t know how they can do it though based on their architecture.
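
The box count is back-of-the-envelope math. A minimal Python sketch, using the 200MB/s per-box peak the techies quoted; the efficiency factor is my own assumption to account for boxes not running at their peak all the time, and it’s what pushes 5 boxes towards 6:

```python
import math


def boxes_needed(target_mb_per_s: float, per_box_mb_per_s: float = 200.0,
                 efficiency: float = 1.0) -> int:
    """Number of boxes needed to sustain target_mb_per_s.

    efficiency is the (assumed) fraction of peak throughput a box actually sustains.
    """
    return math.ceil(target_mb_per_s / (per_box_mb_per_s * efficiency))


if __name__ == "__main__":
    print(boxes_needed(1000))                    # 1GB/s at peak per-box rate
    print(boxes_needed(1000, efficiency=0.85))   # same target, boxes at 85% of peak
    print(boxes_needed(2200))                    # matching a 2.2GB/s target
```

Remember that every one of those boxes is a separate library (or a separate set of NFS shares) to define and manage.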

So, it looks to me like it may be a fit for some people, though I have no idea about the price points. If you want performance then you’ll need a ton of the boxes, and you’ll need to spend time configuring them. If 10 maxed-out boxes cost the same as (or, worse, more than) a big EMC DL4400 (that can do 2.2GB/s) then it’s not an easy sell. Especially since EMC will be adding dedupe to their VTL - plus, you won’t have to define a bunch of separate libraries. Will EMC’s dedupe be similar? No idea, but if it doesn’t impact performance then it’s pretty compelling.

Thoughts? You know the drill.

D

Storage Virtualization - is there a point?

This has been bothering me for a while, and I think I’m not alone.

Hitachi has been making great progress with their virtualization gear, as has IBM, Falconstor before them, etc.

They claim you’ll be freed from the vendors’ shackles, achieve greater utilization of your arrays, simplify administration, cure cancer etc.

Well, here’s what I think:

  1. You will instead be shackled to the virtualization provider
  2. You won’t have a clue where your stuff is
  3. If you want to retire an array you could have problems (imagine creating a LUN composed of LUNs from 3 different arrays)
  4. You STILL have to use the management interfaces of the back-end arrays, since you still have to provision the storage. Instead of provisioning to hosts you provision to the virtualizer.

 

So, what have you gained, exactly?

D

Ate at Del Frisco’s steakhouse in Orlando

Superb.

Not much fanfare, steaks wet-aged 21 days.

Got the strip. So much better than Charley’s it wasn’t even funny. Great flavor, tender, perfectly cooked. 8/10. (Charley’s claims double the aging time but their stuff just wasn’t that good.)

Sides were maybe too rich (the spinach could clog a Yak’s arteries). Bisque too thick and nowhere near my sublime experience in Savannah, GA. At least they gave me sherry to put in it, I can’t believe it’s not SOP in any place serving bisque. Heathens!

Dessert was just OK.

Something tells me (maybe it’s my impacted colon) that I should not eat steak again tonight.

D

Should EMC move to more multi-functional devices?

Here’s the deal: EMC has a lot of cool stuff. Lots of it came through acquisitions. Lots of it runs on the x86 platform, believe it or not.

At the moment one needs to buy multiple boxes from EMC to do NAS, SAN, archiving, etc.

Imagine if you instead got generic boxes (with power relative to cost; there could be a few models).

In each box you could run a Clariion, Centera, Celerra, a print server, WAAS (even though it’s Cisco it’s really a Linux box), something like Recoverpoint, and so on.

All the products could be custom Virtual Machine Appliances, possibly running on a modified ESX platform (so you can’t just run them anywhere). You’d get all the benefits of cool technologies such as VMotion and HA. You could easily add to it.

This doesn’t preclude the use of specialized hardware to accelerate certain functions, though in this age of quad-core CPUs even that may be unnecessary.

Think about it. EMC owns the IP for all that technology.

They don’t need to make less money - if anything, since all the platforms would be virtual, production would be greatly streamlined. They could even have a single type of box (say a quad quad with tons of RAM and expansion capability) as the hardware. You need more speed for NAS? Add an extra box, an extra license and load-balance a new virtual data mover.

This of course is unattainable at the moment - I don’t think VMware can provide such low latency and high throughput but maybe I’m wrong.

Such a move won’t fix the proliferation of management interfaces, but EMC could build a common interface.

Thoughts?

D

Tue
22
May '07

Netbackup best practices for ridiculously busy environments (but not exclusively).

While waiting for another EMC World session to start (this one is at “Guru” level, let’s see) I thought I might share some of my experience regarding running Netbackup on very large setups - nothing like learning through pain.

Don’t get me wrong - NBU has its marketshare for a reason. However, I want to make sure I dispel everyone’s deluded romantic notions about NBU being the be-all, end-all backup tool. It can work well, but only if you truly know its idiosyncrasies.

I can’t say I was tending the busiest NBU systems but, at one point, just one of my environments was doing about 15,000 backup jobs a day. Which is way too much - we fixed that pronto…

I won’t go too deep into each point. If anyone cares then post a comment and I will expand on it.

If you have a small shop running NBU on a single server, much of this is not for you - but there may still be a nugget or two in there… However, if you don’t at least use barcodes, I will go after you. Use tar or Windows backup, or even a rusty abacus, go to your corner and be quiet.

 

  1. Have a dedicated master server - if there are many jobs, the last thing you want is your master also being busy doing backups and vaults. It’s the half-witted brains of the operation, don’t stress it.
  2. Go way beyond the tuning recommendations in the manual - if you know what you’re doing. For instance, I have some voodoo tunings for Solaris (up to 9) that make a huge difference. Prepare for comments from Veritas (Symantec, whatever) support… “no sir it’s not like in the book sir, we can’t guarantee it will work sir…” whatever, I’ve gotten such ridiculously bad advice from their support I still cringe (and sometimes pee a little) every time I get a flashback, not to mention the endless dreams and the screaming that wake me up at night.
  3. Separate HBA ports for disk and tape. No exceptions. I don’t care what vendors say.
  4. Separate TAN (Tape Area Network), if you can swing it.
  5. Separate backup LAN. And/or Ethernet port bonding/trunking/teaming (whatever nomenclature appears in your systems). 4 gig ports per media server. 10G if you have the dough. 4 10G ports teamed and I will do the Wayne’s World “we’re not worthy” bit in front of you. Offer ends Dec 2007.
  6. Experiment with TOE cards, such as the Alacritech ones. You will get closer to full gig, though they’re expensive. Bonding is way cheaper and effective if you have many clients.
  7. Try to use port bonding that works at the switch level, too - 802.3ad is the standard, Cisco’s Etherchannel is Cisco’s. The software on the server and the setting on the switch have to jibe. Half-assed intermediate approaches are just that.
  8. Don’t use weak switches at the core. I’m tired of seeing people with Cisco 4506 switches (6509 wannabe) and 8:1 oversubscribed 48-port cards. YOU WILL HAVE PROBLEMS!!!! Do your homework, find out whether or not the switch is oversubscribed, find out the total backplane throughput, figure out the blade throughput, don’t plug everything in the same port octet if you’re going to be oversubscribed - i.e. a 4-port team going to the octet that shares 1Gbit in a 4506 will not give you 4Gbits, it will give you, at best, a thoroughly blocked 150Mbits per port, tops, with problems. Did you know that if one of the 8 ports starts out before the rest and continues pumping, the rest will NOT make the first port reduce its speed but will instead trickle along at 10Mbits sometimes? Even after the initial transfer that was fast is finished and there’s nothing else going on? As Rutger Hauer said in Blade Runner, “I have… seen things you people wouldn’t believe”. Figure THAT one out when you’re having throughput problems.
  9. Use jumbo frames if you can. Bigger is better in this case. Do your homework, there are caveats.
  10. Use the right block size for your tape devices. Windows users, beware. Patches are necessary. SP1 broke block sizes over 64K on 2003 Server.
  11. Don’t go nuts with SSO! Among the myriad things Veritas doesn’t tell you unless you know the right people is that at around 250 instances of devices you will have weird device problems (25 tape drives shared among 10 media servers would make 250 instances). The safe number is closer to 150. Ignore this at your peril. If you use VTL just make more virtual drives.
  12. Use snapshots as much as possible.
  13. If you have more than a couple of media servers, consider a VTL.
  14. If you have DBAs that insist on flushing the redo logs to tape every few seconds, get a heavy-gauge jumpstart cable and a power supply that can put out, say, 20KV, a coat hanger, and wearing nothing but a stained leather apron go to work on them until they regain their senses (or not). Good times.
  15. If the DBAs can’t be persuaded even after their various body parts have been charred by high voltage, try to send the smaller backups to disk. Do NOT send frequent backups to tape. If a job is going to take less than 10min send it to disk.
  16. As a corollary to #15, only use tape for large jobs that will actually stream your tape drives.
  17. Know what your boxes can push. Most servers, even very large ones, will be hard-pressed to push 2 LTO3 drives, let alone LTO4. FYI, I’ve gotten LTO3 to go as fast as 130MB/s, sustained. Do the math. Beat the score! I cheated, BTW.
  18. Know what expansion slots to use - not all are equal, even if they look the same.
  19. Don’t push too much backup traffic over switch ISLs. Preferably don’t push any.
  20. Be super-careful with command-line manipulation of the NBU DB. Perfectly legitimate commands will not function as you might think due to silly heuristics (or lack thereof). Stay tuned, there will be a large post outing NBU in the future. The amount of dirt I have is beyond staggering. Maybe I shouldn’t have said that, I might have to look out for contract killers or Veritas people offering payola, not sure which is preferable. I’m 5 feet tall, with a goatee, skinny and blond, by the way. You can’t miss me. I also have a pronounced limp.
  21. Beware of multiplexing. Too much and restores take forever. Too little and you can’t stream your devices. Disk is your friend. Anything beyond 4-way multiplexing on tape is not.
  22. Do not send tapes offsite only once a week. You are asking for pervy uncle Murphy to pay you a visit, and he is a known repeat sex offender. He won’t discriminate, either.
  23. If you use tapes, have 2 copies of everything.
  24. Replicate to remote sites if at all possible. Tape should be a last resort.
  25. Use VMware if at all possible. Along with #12 and #24, this helps quick recovery.
  26. Do at least 2-3 different backups of the NBU catalog. In really busy systems it’s impossible to do it after each session - there’s just no quiet time. Just have a copy on disk and 2 on tape (you can do the ones on tape inline, will create 2 at the same time, it works), then send the ones on tape to 2 different offsite locations. Have NBU email you the tape(s) barcodes it used for the catalog if you’re doing a non-standard catalog backup. Send an extra email to an externally available address. You’re not paranoid if they’re really out to get you!
  27. Can you even read from disk as fast as you can write to your backup medium? Benchmark.
  28. What’s your current network throughput if you max out all the media servers? Benchmark.
  29. Don’t use your production systems as media servers. You are inviting uncle Murphy again and he’s feeling randy.
  30. Use storage unit groups. Why on earth would you not?
  31. Cluster the master.
  32. Do NOT put media traffic through firewalls, it’s too much. ACLs on switches can work just fine.
  33. Do NOT put a dedicated media server for a subset of your boxes that are secured from the main network. If they lose access to that media server, backups fail. At any rate you’ll have to allow a few ports for the master to communicate with the media server, might as well let media server traffic through. If it seems that #32 and #33 are somewhat self-contradictory, give yourself a cigar.
  34. Simplify your life. Elaborate and numerous policies are more ways to invite uncle Murphy.
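
To make the arithmetic in #11 concrete, here is a minimal sketch. The 150 and 250 thresholds come straight from the list above; the function names are made up for illustration and are not part of any NetBackup tooling.

```python
# Illustrative only: counts SSO device instances as (shared drives x media servers),
# which is how #11 arrives at 250 for 25 drives shared among 10 media servers.
SAFE_INSTANCES = 150      # "the safe number is closer to 150"
TROUBLE_INSTANCES = 250   # "at around 250 instances ... weird device problems"

def sso_instances(shared_drives: int, media_servers: int) -> int:
    """Every media server sees every shared drive, so the instances multiply."""
    return shared_drives * media_servers

def sso_verdict(shared_drives: int, media_servers: int) -> str:
    """Classify a configuration against the two thresholds from #11."""
    n = sso_instances(shared_drives, media_servers)
    if n >= TROUBLE_INSTANCES:
        return "trouble"
    if n > SAFE_INSTANCES:
        return "risky"
    return "ok"

if __name__ == "__main__":
    print(sso_instances(25, 10), sso_verdict(25, 10))
    print(sso_instances(15, 10), sso_verdict(15, 10))
```

The point is simply that the count grows multiplicatively: adding one media server adds an instance for every shared drive.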

 

That’s all I have for now. Is there more? Tons, but I need to pee.

D


EMC World: Replication Manager and Exchange 2007

Just attended a session. Seems like the new rev of RM supports 2007 fully. They also support Recoverpoint clones (or will, later this week).

For whoever is not aware of it, EMC Replication Manager is like a front-end that manages local replicas of your salient Exchange data for the purposes of backup and restore.

Can be fiddly to set up but if you have EMC gear and Exchange, you really should look at it.

D

Mon
21
May '07

Just ate at Charley’s steakhouse in Orlando

As has been my idiom lately, I will comment on food.

Went to Charley’s steakhouse while attending EMC World.

They made a huge deal of showing off their steaks - which looked good. Wet-aged, 6 weeks for the bone-in ones, 4 weeks for the rest. Aged in-house. I prefer dry-aged but it’s hard to find outside NYC.

So I had a chunky strip, medium-rare.

Observations:

  1. Too seared on the outside, too rare on the inside (would be classified as rare in other places)
  2. Really not that tender
  3. Way too stringy
  4. Others complained theirs was too salty, mine was OK.
  5. Shoulda gone for the ribeye or porterhouse.

Escargot were OK but needed more salt and garlic.

Next time I’m getting fish, or maybe a fillet (which is too boring a cut but at least it’s hard to screw up).

D


At EMC World

Currently attending EMC World. The first day bored me to tears, I hope the rest will be more exciting (though it utterly depends on the presenters). Some of the material is too introductory, and even the advanced sessions are not that advanced.

More to follow.

D

Tue
8
May '07

I wonder when dedup will make it to the arrays

Anyone feel that deduplication is not finding its final resting place in backups and WAN accelerators?

It’s only a matter of time before the algorithms are run as a matter of choice on the array processors.

Of course, that means fewer disk sales, but also bigger/faster/more expensive processors.

Replication will also become more efficient - see EMC’s recent acquisition of Kashya (now RecoverPoint - one of its functions is dedup during replication from array to array, how long do you think it will take them to move this functionality to the array processors?)

Just some random thoughts…

D

Fri
4
May '07

Another windows tuning I forgot to mention

I use my laptop so much that I sometimes forget about some server-type tunings.

I resuscitated my hot-rod AMD box - it’s a grossly overclocked monster but only has 1GB RAM (since it’s hard to find that kind of fast RAM in bigger sizes, and using 4 sticks prohibits me from overclocking it so much). Let’s just say the CPU is running a full GHz faster than stock, and with air, not water or peltier coolers.

Anyway, since it only has 1GB RAM and I use it for Photoshop and games, I can’t really use something like Supercache or Uptempo on it.

So I tried O&O Software’s Clevercache. By far not as good as the other 2 products - however, it does a decent job of automatically managing cache so you always have enough free RAM.

Then I tried the DisablePagingExecutive registry tweak - not that obscure, tons of references around.

BTW, there is a way to stop postmark from using caching - set buffering false is the command. However, I want to see the benchmark run on a system that would run normally, not measure the raw speed of my disks. Nobody cares about that anyway, especially in the big leagues (unless the config is truly moronic, of course). Cache is everything. But I digress.

So - postmark once more.

Stock:

Time:
177 seconds total
144 seconds of transactions (138 per second)

Files:
20092 created (113 per second)
Creation alone: 10000 files (333 per second)
Mixed with transactions: 10092 files (70 per second)
9935 read (68 per second)
10064 appended (69 per second)
20092 deleted (113 per second)
Deletion alone: 10184 files (3394 per second)
Mixed with transactions: 9908 files (68 per second)

Data:
548.25 megabytes read (3.10 megabytes per second)
1158.00 megabytes written (6.54 megabytes per second)

After tuning as server with the background process, large cache and fsutil as described previously:

Time:
107 seconds total
85 seconds of transactions (235 per second)

Files:
20092 created (187 per second)
Creation alone: 10000 files (526 per second)
Mixed with transactions: 10092 files (118 per second)
9935 read (116 per second)
10064 appended (118 per second)
20092 deleted (187 per second)
Deletion alone: 10184 files (3394 per second)
Mixed with transactions: 9908 files (116 per second)

Data:
548.25 megabytes read (5.12 megabytes per second)
1158.00 megabytes written (10.82 megabytes per second)

With Clevercache:

Time:
97 seconds total
71 seconds of transactions (281 per second)

Files:
20092 created (207 per second)
Creation alone: 10000 files (454 per second)
Mixed with transactions: 10092 files (142 per second)
9935 read (139 per second)
10064 appended (141 per second)
20092 deleted (207 per second)
Deletion alone: 10184 files (2546 per second)
Mixed with transactions: 9908 files (139 per second)

Data:
548.25 megabytes read (5.65 megabytes per second)
1158.00 megabytes written (11.94 megabytes per second)

Hell, I guess I might get Clevercache for this system - sped it up a bit and manages memory consumption.

But look at this:

All the above plus using the DisablePagingExecutive registry tweak: BOOYA!

Time:
45 seconds total
28 seconds of transactions (714 per second)

Files:
20092 created (446 per second)
Creation alone: 10000 files (1111 per second)
Mixed with transactions: 10092 files (360 per second)
9935 read (354 per second)
10064 appended (359 per second)
20092 deleted (446 per second)
Deletion alone: 10184 files (1273 per second)
Mixed with transactions: 9908 files (353 per second)

Data:
548.25 megabytes read (12.18 megabytes per second)
1158.00 megabytes written (25.73 megabytes per second)

I guess the box is staying this way.
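
For anyone who wants the four runs side by side, here is a small sketch that turns the totals above into speedups relative to stock. The numbers are copied from the postmark output; the labels and the helper name are just for illustration.

```python
# Total run time (seconds) and transactions per second, copied from the runs above.
runs = [
    ("stock", 177, 138),
    ("server tuning", 107, 235),
    ("+ Clevercache", 97, 281),
    ("+ DisablePagingExecutive", 45, 714),
]

def speedups(results):
    """Return (label, time speedup, transaction-rate speedup) relative to the first run."""
    _, base_time, base_tps = results[0]
    return [(label, base_time / t, tps / base_tps) for label, t, tps in results]

if __name__ == "__main__":
    for label, time_x, tps_x in speedups(runs):
        print(f"{label:26s} {time_x:4.2f}x faster overall, {tps_x:4.2f}x transactions/s")
```

The last row works out to roughly 3.9x on total time and a bit over 5x on transaction rate compared to stock.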

More info on the registry tweak:

http://technet2.microsoft.com/windowsserver/en/library/3d3b3c16-c901-46de-8485-166a819af3ad1033.mspx?mfr=true

In a nutshell, it disables the paging of kernel and driver code, so it’s always memory-resident. Makes sense in some cases, as you can see above :)
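
For reference, the value lives under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management as a DWORD named DisablePagingExecutive, with 1 meaning "keep kernel and driver code resident". Below is a minimal sketch that only generates the text of a .reg file for it; the output file name is an arbitrary example, and importing the file is left as a manual step. As far as I know the change only takes effect after a reboot.

```python
# Builds the text of a .reg file that sets DisablePagingExecutive.
# It only generates text; nothing here touches the registry.
KEY = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management"

def reg_file_text(enabled: bool = True) -> str:
    """Return .reg file contents; enabled=True writes 1, False writes 0 (the default)."""
    value = 1 if enabled else 0
    return (
        "Windows Registry Editor Version 5.00\r\n\r\n"
        f"[{KEY}]\r\n"
        f'"DisablePagingExecutive"=dword:{value:08x}\r\n'
    )

if __name__ == "__main__":
    # Hypothetical file name; newline="" keeps the \r\n line endings as written.
    with open("disable_paging_executive.reg", "w", newline="") as fh:
        fh.write(reg_file_text())
```

Generating the file with enabled=False gives an easy way back to the default if the tweak causes trouble.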

It’s so unusual that it gave me that much of a boost, though. I’d tried it a long time ago and it wasn’t quite as dramatic, but that was on a much older system.

One would argue that postmark lied but using a stopwatch and just eyeballing the sucker it was way quicker doing the transactions.

On servers I just didn’t normally set it because I figured they had enough RAM. Maybe I should start doing it on boxes that do a lot of transactional I/O. Damn, I need to try this with Supercache.

Obviously, your mileage may vary.

WARNING: DO NOT DO THIS ON ANY MACHINE THAT NEEDS TO SUSPEND!!!

Which is why I just didn’t do it on the laptop.

D

Wed
2
May '07

Cisco WAAS benchmarks, and WAN optimizers in general

Lately I’ve been dealing with WAN accelerators a lot, with the emphasis on Cisco’s WAAS (some other, smaller players are Riverbed, Juniper, Bluecoat, Tacit/Packeteer and Silverpeak). The premise is simple and compelling: Instead of having all those servers at your edge locations, move your users’ data to the core and make accessing the data feel almost as fast as having it locally, by deploying appliances that act as proxies. At the same time, you will actually decrease the WAN utilization, enabling you to use cheaper pipes, or at least not have to upgrade, where in the past you were planning to anyway.

There are significant other benefits (massive MAPI acceleration, HTTP, ftp, and indeed any TCP-based application will be optimized). Many Microsoft protocols are especially chatty, and the WAN accelerators pretty much remove the chattiness, optimize the TCP connection (automatically resizing Send/Receive windows based on latency, for instance), LZ-compress the data, and to top it all will not transfer data blocks that have already been transferred.

At this point I need to point out that there is a lot of similarity with deduplication technologies - for example, Cisco’s DRE (Data Redundancy Elimination) is, at heart, a dedup algorithm not unlike Avamar’s or Data Domain’s. So, if a Powerpoint file has gone through the DRE cache already, and someone modifies the file and sends it over the WAN again, only the modified parts will really go through. It really works and it’s really fast (and I’m about the most jaded technophile you’re likely to meet).

The reason I’m not opposed to this use of dedup (see previous posts) is that the datasets are kept at a reasonable size. For instance, at the edge you’re typically talking about under 200GB of cache, not several TB. Doing the hash calculations is not as time-consuming with a smaller dataset and, indeed, it’s set up so that the hashes are kept in-memory. You see, the whole point of this appliance is to reduce latency, not increase it with unnecessary calculations. Compare this to the multi-TB deals of the “proper” dedup solutions used for backups…

Indeed, why the hell would you need dedup-based backup solutions if you deploy a WAN accelerator? Chances are there won’t be anything at the edge sites to back up, so the whole argument behind dedup-based backups for remote sites sort of evaporates. Dedup now only makes sense in VTLs, just so you can store a bit more.

On Dedup VTLs: Refreshingly, Quantum doesn’t quote crazy compression ratios - I’ve seen figures of about 9:1 as an average, which is still pretty good (and totally dependent on what kind of data you have). I just cringe when I see the 100:1, 1000:1 or whatever insanity Data Domain typically states. I’m still worried about the effect on restore times, but I digress. See previous posts.

Anyway, back to WAN accelerators. So how do these boxes work? All fairly similarly. Cisco’s, for instance, does 3 main kinds of optimizations: TFO, DRE and LZ. TFO stands for Transport Flow Optimization, and takes care of snd/rcv window scaling, enables large initial windows, and enables SACK and BIC TCP (the latter 2 help with packet loss).

DRE is the dedup part of the equation, as mentioned before.

LZ is simply LZ compression of data, in addition to everything else mentioned above.

Other vendors may call their features something else, but at the end there aren’t too many ways to do this. It all boils down to:

  1. Who has the best implementation speed-wise

  2. Who is the best administration-wise

  3. Who is the most stable in an enterprise setting

  4. What company has the highest chance of staying alive (like it or not, Cisco destroys the other players here)

  5. What company is committed to the product the most

  6. As a corollary to #5, what company does the most R&D for the product

Since Cisco is, by far, the largest company of any that provide WAN accelerators (indeed, they probably spend more on light bulbs per year than the net worth of the other companies combined), in my opinion they’re the obvious force to be reckoned with, not someone like Riverbed (as cool as Riverbed is, they’re too small, and will either fizzle out or get bought - though Cisco didn’t buy them, which is food for thought. If Riverbed is so great, why would Cisco simply not acquire them?)

Case in point: When Cisco bought Actona (which is the progenitor of the current WAAS product) they only really had the Windows file-caching part shipping (WAFS). It was great for CIFS but not much else. Back then, they were actually lagging compared to the other players when it came to complete application acceleration. Fast forward a mere few months: They now accelerate anything going over TCP, their WAFS portion is still there but it’s even better and more transparent, the product works with WCCP and inline cards (making deployment at the low-end easy) and is now significantly faster than the competitors. Helps to have deep pockets.

For an enterprise, here are the main benefits of going with Cisco the way I see them:

  1. Your switches and routers are probably already Cisco so you have a relationship.

  2. WAAS interfaces seamlessly with the other Cisco gear.

  3. The best way to interface a WAN accelerator is WCCP. And it was actually developed by Cisco.

  4. The Cisco appliances are tunnel-less and totally transparent (I met someone that had Riverbed everywhere - a software glitch rendered ALL WAN traffic inoperable, instead of having it go through unaccelerated which is the way it is supposed to work. He’s now looking at Cisco).

  5. WAAS appliances don’t mess with QoS you may have already set.

  6. The WAAS boxes are actually faster in almost anything compared to the competition.

And now for the inevitable benchmarks:

Depending on the latency, you can get more or less of a speed-up. For a comprehensive test see this: http://www.cisco.com/application/pdf/en/us/guest/products/ps6870/c1031/cdccont_0900aecd8054f827.pdf

Another, longer rev: http://www.cisco.com/web/CA/channels/pdf/Miercom-on-Cisco-WAAS-Riverbed-Juniper-competitive.pdf

Yes, this is on Cisco’s website but it’s kinda hard to find any performance statistics on the other players’ sites showing Cisco’s WAAS (any references to WAFS are for an obsolete product). At least this one compares truly recent codebases of Cisco, Riverbed and Juniper. For me, the most telling numbers were the ones showing how much traffic the server at the datacenter actually sees. Cisco was almost 100x better than the competition - where the other products passed several Mbits through to the server, Cisco only needed to pass 50Kbits or so.

It is kinda weird that the other vendors don’t have any public-facing benchmarks like this, don’t you think?

However, since I tend to not completely believe vendor-sponsored benchmark numbers as much as I may like the vendor in question, I ran my own.

I used NISTnet (a free WAN simulator, http://www-x.antd.nist.gov/nistnet/) to emulate latency and throughput indicative of standard telco links (i.e. a T1). The fact that the simulator is freely available and can be used by anyone is compelling since it allows testing without disrupting production networks (for the record, I also tested on a few production networks with similar results, though the latency was lower than with the simulator).

The first test scenario is that of the typical T1 connection (approx. 1.5Mbits/s or 170KB/s at best) and 40ms of round-trip delay. I tested with zero packet loss, which is not totally realistic but it makes the benchmarks even more compelling. Usually there is a little packet loss, which makes transfer speeds even worse. This is one of the most common connections to remote sites one will encounter in production environments.

The second scenario is that of a bigger pipe (3Mbit) but much higher latency (300ms), emulating a long-distance link such as a remote site in Asia over which developers do their work. I injected a 0.2% packet loss (a small number, given the distance).

It is important to note that, in the interests of simplicity and expediency, these tests are not comprehensive. A comprehensive WAAS test consists of:

  • Performance without WAAS but with latency

  • Performance with WAAS but data not already in cache (cold cache hits). Such a test shows the real-time efficiency of the TFO, DRE and LZ algorithms.

  • Performance with the data already in the cache (hot cache hits).

  • Performance with pre-positioning of fileserver data. This would be the fastest a WAAS solution would perform, almost like a local fileserver.

  • Performance without WAAS and without latency (local server). This would be the absolute fastest performance in general.

The one cold cache test I performed involved downloading a large ISO file (400MB) using HTTP over the simulated T1 link. The performance ranged from 1.5-1.8MB/s (a full 10 times faster than without WAAS) for a cold cache hit. After the file was transferred (and was therefore in cache) the performance went to 2.5MB/s. The amazing performance might have been due to a highly compressible ISO image but, nevertheless, is quite impressive. The ISO was a full Windows 2000 install CD with SP4 slipstreamed - a realistic test with realistic data, since one might conceivably want to distribute such CD images over a WAN. Frankly this went through so quickly that I keep thinking I did something wrong.

T1 results
ftp without WAAS:
ftp: 3367936 bytes received in 19.53Seconds 168.40Kbytes/sec

Very normal T1 behavior with the simulator (for a good-quality T1).

ftp with WAAS:
ftp: 3367936 bytes received in 1.34Seconds 2505.90Kbytes/sec (15x improvement).

Sending data was even faster:
ftp: 3367936 bytes sent in 0.36Seconds 9381.44Kbytes/sec.

[Image: waasT1]

High Latency/High Bandwidth results

The high latency (300ms) link, even though it had double the theoretical throughput of the T1 link, suffers significantly:

ftp without WAAS
ftp: 3367936 bytes received in 125.73Seconds 26.79Kbytes/sec.

I was surprised at how much the high latency hurt the ftp transfers. I ran the test several times with similar results.
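
One plausible explanation (an assumption, not something measured in these tests) is the classic window/RTT limit: a single TCP stream can have at most one window's worth of data in flight per round trip, no matter how fat the pipe is. The sketch below works that out; the window sizes are guesses for illustration, since the actual window used by the ftp client was not captured.

```python
def max_throughput_kbytes(window_bytes: int, rtt_ms: float) -> float:
    """Upper bound for one TCP stream: one window per round trip, in KB/s (1 KB = 1000 bytes)."""
    return window_bytes / (rtt_ms / 1000.0) / 1000.0

def implied_window_bytes(kbytes_per_sec: float, rtt_ms: float) -> float:
    """Amount of data in flight that would explain an observed rate at a given RTT."""
    return kbytes_per_sec * 1000.0 * (rtt_ms / 1000.0)

if __name__ == "__main__":
    for window in (8 * 1024, 16 * 1024, 64 * 1024):  # assumed window sizes
        print(window, round(max_throughput_kbytes(window, 300), 1), "KB/s at 300ms")
    print(round(implied_window_bytes(26.79, 300)), "bytes in flight implied by 26.79 KB/s")
```

The observed 26.79 KB/s at 300ms corresponds to about 8KB in flight per round trip. With an 8KB window the same formula gives about 205 KB/s at 40ms, which is more than a T1 can carry, so that link would be bandwidth-bound rather than latency-bound. The 0.2% packet loss would only make the high-latency case worse.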

ftp with WAAS
ftp: 3367936 bytes received in 2.16Seconds 1562.12Kbytes/sec (58x improvement).

[Image: waaslat]

I have more results with office-type apps but they will make for too big of a blog entry, not that this isn’t big. In any case, the thing works as advertised. I need to build a test Exchange server so I can see how much stuff like attachments are accelerated. Watch this space. Oh, and there’s another set of results at http://www.gotitsolutions.org/2007/05/18/cisco-waas-performance-benchmarks.html

Comments? Complaints? You know what to do.

D

Mon
30
Apr '07

On traveling lately

Been a while since I updated this blog. Too busy running around, evangelizing cool technologies, eating rich food, not exercising and spending WAY too much time in airports delayed due to bad weather. Someone needs to either:

  1. Change the rules so that planes fly even under more adverse conditions (which is, technically, possible)

  2. Improve the planes (long shot)

  3. Give testosterone shots to all involved since, on occasion, I think they’re being way too conservative. I used to work for an airline, and though many rules are sound, others just piss me off.

The other thing that needs to happen is someone needs to figure out how to deal with the middle seats in planes. At the moment they’re only comfortable for seriously emaciated people, never mind anyone normal. I look like I could wrestle a gorilla so I’m decidedly uncomfortable in middle seats, but that’s another subject. Suffice it to say that dieting would not help in my case - in the shoulder area, I’d need a meat cleaver and/or bone saw to see any lateral reduction. I know I’m not the first to complain but come on! Once I had the middle seat and to either side of me were people of similar (if not larger) stature. At the end of the flight I felt like we’d been married. Here are some suggestions that are not politically correct but I’m not known for that…

  1. Collect biometric info on all passengers (namely, weight and body dimensions, not necessarily security-related biometrics but that would be an easier way to get the info if it became a government mandate)

  2. Using the biometric info figure out where people should sit so that:

    1. Weight is balanced

    2. People of similar sizes are not sitting together

    3. Middle seats are assigned to slim people and/or

    4. Only 1 large person per 3 seats

    5. Still try to sit families together

    6. While checking in, offer the option of more comfort (not just legroom). Initially, maybe charge for it!

I think this makes sense. Really, algorithmically it’s not too bad.

D

Wed
28
Feb '07

Just ate at Keens Steakhouse in NYC

Well, just finished the meal. Steak was ordered medium-rare, arrived medium, a bit chewy (but still tasty) and not hot. I was too tired to complain and ate military style (i.e. it was gone in a minute).

The 26oz ribeye I had at Wollensky’s a couple years ago was a religious experience, comparatively. That thing needed a butterknife, at most. Sometimes staring at it hard enough was sufficient to lop off a piece.

I admit I don’t have enough of a statistical sample for either joint.

Just thought I’d share this.

D

Mon
19
Feb '07

It’s all about data classification and searching

I don’t know if this has been discussed elsewhere but I felt like I had an epiphany so there…

The way I see it, in a decade or two the most important technology regarding data will be data classification and search technologies.

Consider this: At the moment, all the rage is archiving and storage tiers. The reason is that it simply is too expensive to buy the fastest disks, and even if you do buy them they’re smaller than the slower-spinning drives.

Imagine if speed and size were not issues. I know that’s a big assumption but let’s play along for a second… (let’s just say that there are plenty of revolutionary advances in the storage space coming our way within, say, 10-20 years, that will make this concept not seem that far-fetched).

Nobody would really care any longer about storage tiers or archiving. Backups would simply consist of extra copies of everything, to be kept forever if needed, and replicated to multiple locations (this is already happening, it’s just expensive, so it’s not common). Indeed, everyone would just let all kinds of data accumulate and scrubbing would not be quite as frequent as it is now. Multiple storage islands would also be clustered seamlessly so they present a single, coherent space, compounding the problem further.

Within such a chaotic architecture, the only real problems are data classification and mining, i.e. figuring out what you have and actually getting at it. Where it is stored is not quite such an issue - nobody cares, as long as they can get to it in a timely fashion.

I can tell that OS designers are catching on. Microsoft, of all companies, wanted a next-gen filesystem for Vista/Longhorn, that would really be SQL on top of NTFS, with files stored as BLOBs. It got delayed so we didn’t get it, but they’re saying it should be out in a few years (there were issues with scalability and speed).

Let’s forget about the Microsoft-specific implementation and just think about the concept instead (I’d use something like a decent database on raw disk and not NTFS, for instance). No more real file structure as we know it - it’s just a huge database occupying the entire drive.

Think of the advantages:

  1. Far more resilient to failures
  2. Proper rollbacks in case of problems, and easy rebuilding using redo logs if need be
  3. Replication via log shipping
  4. Amazing indexing
  5. Easy expandability
  6. The potential for great performance, if done right
  7. Lots of tuning options (maybe too many for some).

With such a technology, you need a lot more metadata for each file so you can present it in different ways and also search for it efficiently. Let’s consider a simple text document - you’re trying to sell some storage, so you write a proposal for a new client. You could have metadata on:

  • Author
  • Filename
  • Client name
  • Type of document - proposal
  • Project name
  • Excerpt
  • Salesperson’s name
  • Solution keywords, such as EMC DMX with McData (sorry, Brocade) switches
  • Document revision (possibly automatically generated)

… and so on. A lot of these fields can already be found in the properties of any MS Word document.

The database would index the metadata at the very least, when the file is created, and any time the metadata changes. Searches would be possible based on any of the fields. Then, a virtual directory structure could be created:

  • Create a virtual directory with all files pertaining to that specific client (most common way people would organize it)
  • Show all the material for this specific project
  • Show all proposals that have to do with this salesperson

… and so on.
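
As a toy illustration of the idea (not of any real filesystem, and certainly not of Microsoft's design), here is a minimal sketch using an in-memory SQLite database: files are rows, metadata fields are indexed columns, and a "virtual directory" is just a saved query. The table layout, file names, client names and people are all invented for the example.

```python
import sqlite3

def build_catalog():
    """Create an in-memory catalog holding a few made-up documents."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE docs (filename TEXT, author TEXT, client TEXT, "
        "doc_type TEXT, project TEXT, salesperson TEXT, content BLOB)"
    )
    # Index the metadata fields people are most likely to search on.
    db.execute("CREATE INDEX idx_client ON docs (client)")
    db.execute("CREATE INDEX idx_sales ON docs (salesperson, doc_type)")
    db.executemany(
        "INSERT INTO docs VALUES (?, ?, ?, ?, ?, ?, ?)",
        [
            ("dmx-proposal.doc", "D", "Acme", "proposal", "SAN refresh", "Pat", b"(file bytes)"),
            ("dmx-sow.doc", "D", "Acme", "statement of work", "SAN refresh", "Pat", b"(file bytes)"),
            ("vtl-proposal.doc", "D", "Globex", "proposal", "Backup redesign", "Sam", b"(file bytes)"),
        ],
    )
    return db

def virtual_directory(db, **criteria):
    """Return filenames matching all given metadata fields, e.g. client='Acme'."""
    allowed = {"author", "client", "doc_type", "project", "salesperson"}
    if not criteria or not set(criteria) <= allowed:
        raise ValueError("unknown or missing metadata field")
    where = " AND ".join(f"{field} = ?" for field in criteria)
    rows = db.execute(
        f"SELECT filename FROM docs WHERE {where} ORDER BY filename",
        tuple(criteria.values()),
    )
    return [name for (name,) in rows]

if __name__ == "__main__":
    catalog = build_catalog()
    print(virtual_directory(catalog, client="Acme"))
    print(virtual_directory(catalog, salesperson="Pat", doc_type="proposal"))
```

The same file shows up in as many virtual directories as it has matching metadata, which is exactly what a fixed folder hierarchy can't do.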

Virtual folders exist now for Mac OSX (can be created after a Spotlight search), Vista (saved searches) and even Gnome 2.14, but the underlying engine is simply not as powerful as what I just described. Normal searches are used, and metadata is not that extensive for most files anyway (mp3 files being an exception since metadata creation is almost forced when you rip a CD).

It should be obvious by now that to enable this kind of functionality properly you need really good ways of classifying and indexing your data and actually create all the metadata that needs to be there, as automatically as possible. Future software will probably force you to create the metadata in some way, of course.

Existing software that does this classification is fairly poor, in my opinion. Please correct me if I’m wrong.

The other piece that needs to be there is extremely robust search and indexing capabilities. Some of that technology is there (google desktop and its ilk) but natural language search has to be - well, natural, but unambiguous at the same time.

I hope you can now see why I believe these technologies are important. If Google continues the way it’s going, it may well become the most important company in the next decade (some might argue it’s the most important one already).

For any sci-fi fans out there, this is a good novel that’s a bit related to the chaotic storage systems of the future: http://www.scifi.com/sfw/books/sfw7677.html

D


Some clarification on the caching

Re the previous post:

If you want to use supercache or uptempo the idea is that you take AWAY from windows/SQL/exchange cache and add to the fancy cache.

So, even on Windows Server, in the properties of “File and Printer Sharing for Microsoft Networks” (which, bizarrely enough, you reach through the properties for your network card), you could select “Maximize data throughput for network applications”. In the various apps you’d just minimize the cache (i.e. only 10MB for Exchange) and just give the rest to supercache/uptempo.

Be aware that supercache is on a PER VOLUME basis, not global (its blessing and its curse at the same time). If you have a lot of volumes maybe just cache a few key data volumes, tempdb and the pagefile partition, or use uptempo, which allows you to allocate a single global cache pool that is then shared among the volumes you choose.

For SQL, using a RAM disk for tempdb seems to work even better.

Having seen the products work wonders with only 128MB dedicated to them, and bearing in mind that most servers have 4GB RAM or more, I’d say go nuts. I’d buy 4GB of RAM and make it cache in a heartbeat.

D



On deduplication and Data Domain appliances

One subject I keep hearing about is deduplication. The idea being that you save a ton of space since a lot of your computers have identical data.
One way to do it is with an appliance-based solution such as Data Domain. Effectively, they put a little server and a cheap-but-not-cheerful, non-expandable 6TB RAID together, then charge a lot for it, claiming it can hold 90TB or whatever. Use many of them to scale.

The technology chops up incoming files into pieces. Then, the server calculates a unique numeric ID using a hash algorithm.

The ID is then associated with the block and both are stored.

If the ID of another block matches one already stored, the new block is NOT stored, but its ID is, as is the association with the rest of the blocks in the file (so that deleting a file won’t adversely affect common blocks with other files).

This is what allows dedup technologies to store a lot of data.
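
The mechanism described above can be sketched in a few lines. This is a toy, not how Data Domain or anyone else actually implements it: it assumes fixed-size 4KB chunks, SHA-256 as the hash, and plain in-memory dictionaries, and it ignores deletion and reference counting entirely.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed fixed chunk size; real products choose their own scheme

class DedupStore:
    def __init__(self):
        self.blocks = {}   # block ID (hex digest) -> block bytes, stored once
        self.recipes = {}  # file name -> ordered list of block IDs

    def put(self, name: str, data: bytes) -> int:
        """Store a file; return how many new blocks actually had to be written."""
        ids, new_blocks = [], 0
        for off in range(0, len(data), CHUNK_SIZE):
            block = data[off:off + CHUNK_SIZE]
            block_id = hashlib.sha256(block).hexdigest()
            if block_id not in self.blocks:   # unseen content: store the block
                self.blocks[block_id] = block
                new_blocks += 1
            ids.append(block_id)              # the ID is always recorded
        self.recipes[name] = ids
        return new_blocks

    def get(self, name: str) -> bytes:
        """Rebuild a file by reading its blocks back in recipe order."""
        return b"".join(self.blocks[i] for i in self.recipes[name])

if __name__ == "__main__":
    store = DedupStore()
    original = b"".join(bytes([i]) * CHUNK_SIZE for i in range(4))
    changed = original[:CHUNK_SIZE] + b"x" * CHUNK_SIZE + original[2 * CHUNK_SIZE:]
    print(store.put("monday", original), store.put("tuesday", changed))
    print(store.get("tuesday") == changed)
```

Note that get() has to walk the recipe and fetch every block individually, which is the restore-time cost discussed further down.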

Now, how much you can store depends on your data:

If you’re backing up many different unique files (like images), there will be almost no similarity, so everything will be backed up.
If you’re backing up 1000 identical windows servers (including the windows directory) then there WILL be a lot of similarity, and great efficiencies.

Now the drawbacks (and why I never bought it):

The thing relies on a weak server and a small database. As you’re backing up more and more, there will be millions (maybe billions) of IDs in the database (remember, a single file may have multiple IDs).

Imagine you have 2 billion entries.

Imagine you’re trying to back up someone’s 1GB PST, or other large file, that stays mostly the same over time (ideal dedup scenario). The file gets chopped up in, say, 100 blocks.

Each block has its ID calculated (CPU-intensive).

Then, EACH ID has to be compared with the ENTIRE database to determine whether there’s a match or not.

This can take a while, depending on what search/sort/store algorithms they use.
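
How much the algorithm matters can be sketched with some back-of-the-envelope Python. This only counts worst-case in-memory comparisons for two hypothetical index layouts (an unsorted list and a sorted one); it says nothing about what any vendor actually uses, and it ignores disk I/O, which is likely the real cost once the index no longer fits in RAM. The stand-in IDs in the demo are plain integers, not real hashes.

```python
import bisect
import math

entries = 2_000_000_000   # the 2 billion IDs imagined above
blocks = 100              # IDs to look up for the 1GB file

# Worst-case comparisons per lookup for two index layouts.
linear_scan = entries                          # unsorted list: check everything
binary_search = math.ceil(math.log2(entries))  # sorted list: halve each step

print(blocks * linear_scan)    # total comparisons with full scans
print(blocks * binary_search)  # total comparisons with a sorted index

# Small demonstration of a sorted-index lookup using the standard library.
index = sorted(range(0, 1_000_000, 7))


def contains(sorted_ids, wanted):
    pos = bisect.bisect_left(sorted_ids, wanted)
    return pos < len(sorted_ids) and sorted_ids[pos] == wanted


print(contains(index, 700), contains(index, 701))
```

So the comparison count can be tiny or enormous depending on the index, which is exactly why a vendor that can't tell you what to expect is a worry.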

I asked Data Domain about this and all they kept telling me was “try it, we can’t predict your performance”. I asked them whether they had even tested the box to see what the limits were, and they hadn’t. Hmmm.

I did find out that, at best, the thing works at 50MB/s (slower than an LTO3 tape drive), unless you use tons of them.

Now, imagine you’re trying to RECOVER your 1GB PST.

Say you try to recover from a “full” backup on the Data Domain, but that file has been living in it for a year, with new blocks being added to it along the way.

When you request the file, the Data Domain box has to synthesize it (remember, even the “full” doesn’t include the whole file). It will read the IDs needed to recreate it and put the blocks together so it can present the final file as it should have looked.

This is CPU- and disk-intensive. Takes a while.
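
The restore path can be sketched the same way. Again this is a toy under stated assumptions (512-byte chunks, SHA-256 IDs, everything in memory, made-up file contents), meant only to show that a restore is one lookup per ID followed by reassembly.

```python
import hashlib


def chunk_id(chunk):
    return hashlib.sha256(chunk).hexdigest()


# Hypothetical store contents: unique chunks by ID, plus the file's ID list.
original = b"header" + b"mail" * 1000 + b"footer"
pieces = [original[i:i + 512] for i in range(0, len(original), 512)]
chunks = {chunk_id(p): p for p in pieces}
recipe = [chunk_id(p) for p in pieces]


def restore(recipe, chunks):
    # One lookup per ID; on a real box each could be a separate disk read.
    return b"".join(chunks[cid] for cid in recipe)


restored = restore(recipe, chunks)
print(restored == original)
print(len(recipe), "lookups,", len(chunks), "unique chunks")
```

In memory this is instant. On disk, with the chunks of a year-old file scattered all over the RAID, every one of those lookups can turn into a seek.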

The whole point of doing backups to disk is to back up and restore faster and more reliably. If you’re slowing things down in order to compress your disk as much as possible, you’re doing yourself a disservice.

Don’t get me wrong, dedup tech has its place, but I just don’t like the appliance model for performance and scalability reasons.
EMC just purchased Avamar, a dedup company that does the exact same thing but lets you install the software on whatever you want.

There are also Asigra and Evault, both great backup/dedup products that can be installed on ANY server and work with ANY disk, not just the el cheapo quasi-JBOD Data Domain sells.

So, you can leverage your investment in disk and load the software on a beefy box that will actually work properly.

Another tack would be to use virtual tape. It doesn’t do dedup yet, but it will: EMC bought Avamar, and Adic (now Quantum) also acquired another dedup company and will put the stuff in their VTL, so you can get the best of both worlds. In the meantime it does compression, just like real tape.

Plus, even the cheapest EMC virtual tape box works at over 300MB/s.
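
To put the two throughput figures side by side, here is a quick Python calculation of how long a backup would take at each rate. The 10TB dataset is a made-up number for illustration, and the rates are treated as sustained, which real boxes may not manage.

```python
def hours_to_move(terabytes, mb_per_second):
    megabytes = terabytes * 1_000_000   # decimal units: 1 TB = 1,000,000 MB
    return megabytes / mb_per_second / 3600


dataset_tb = 10   # hypothetical backup size, not from any real site
for label, rate in [("dedup appliance at 50MB/s", 50), ("virtual tape at 300MB/s", 300)]:
    print(f"{label}: {hours_to_move(dataset_tb, rate):.1f} hours")
```

At 50MB/s that dataset takes more than two days to move; at 300MB/s it fits in an overnight window.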

I sort of detest the “drop at the customer site” model Data Domain (and a bunch of the smaller storage vendors) use. They expect you to put the box in and, if it works OK, to find it easier to keep it than to send it back.

Most people will keep the first thing they try (unless it fails horrifically), since they don’t want to go through the trouble of testing 5 different products (unless we’re talking about huge companies that have dedicated testing staff).

Let me know what you think…

D


Do you need a VTL or not?

I first posted this as a comment on http://www.gotitsolutions.org but this is its rightful place.

Having deployed what was, at the time, the largest VTL in the world, and subsequently numerous other VTL and ATA solutions, I think I can offer a somewhat different perspective:

It depends on the number of data movers you have and how much manual work you’re prepared to do. Oh, and speed.

Licensing for VTL is now capacity-based for most packages (at least the famous/infamous/important ones like CommVault, Networker and NetBackup, not respectively).

Also, I’d forget about using VTL features such as replication and using the VTL to write directly to tape (unless you’re retarded, insane or the backup software is running ON the VTL, as is the case now with EMC’s CDL). Just use the VTL like tape. I’ve been so vehement about this that even the very stubborn and opinionated Curtis Preston is now afraid to say otherwise with me in the room… (I shut him up REALLY effectively during one Veritas Vision session we were co-presenting a couple years ago. I like Curtis but he’s too far removed from the real world. Great presenter, though, and funny).

Even dedup features are suspect in my opinion, since they rely on hashes and searches of databases of hashes, which get progressively slower the more you store in them. Most companies selling dedup (Data Domain and Avamar, to name a couple of major names) are sorta cagey when you confront them with questions such as “I have 5 servers with 50 million files each, how well will this thing work?”

Answer is, it won’t, even for far fewer files. Just get some raw-based backup method that also indexes, such as Networker’s snapimage or NBU’s flashbackup.

Dedup also fails with very large files such as database files.
I can expand on any of the above comments if anyone cares.

But back on the data movers (Media Agents, Storage Nodes, Media Servers):

Whether you use VTL or ATA, you effectively need to divvy up the available space.

With ATA, you either allocate a fixed amount of space to each data mover, or use a cluster filesystem (such as Adic’s Stornext) to allow all data movers to see the same disk.

With VTL, the smallest quantum of space you can allocate to a data mover is, simply, a virtual tape. A virtual tape, just like a real tape, gets automatically allocated as needed.

So, imagine you have a large datacenter, with maybe 40 data movers and multiple backup masters.

Imagine you have a 64TB ATA array.

You can either:

1. Split the array into 40 chunks, and have a management nightmare
2. Deploy Stornext so all servers see a SINGLE 64TB filesystem (at an extra 3-4K per server, plus probably 50K more for maintenance, central servers and failover). Easy to deal with, but complex to deploy, and it means more software on your boxes.
3. Deploy VTL and be done with it.

For such a large environment, option #3 is the best choice, hands down.
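
The arithmetic behind options 1 and 2 is easy to sketch in Python. The figures are the rough ones quoted above; the even split in option 1 is an assumption, since real data movers rarely need equal shares.

```python
array_tb = 64
data_movers = 40

# Option 1: carve the array into one fixed slice per data mover.
slice_tb = array_tb / data_movers
print(slice_tb, "TB per data mover")

# Option 2: shared filesystem, using the rough per-server and extra costs above.
per_server_low, per_server_high = 3_000, 4_000
extras = 50_000   # maintenance, central servers, failover
low = data_movers * per_server_low + extras
high = data_movers * per_server_high + extras
print(low, "to", high)
```

So option 1 leaves each data mover with a 1.6TB island that it cannot lend to a busier neighbour, and option 2 adds somewhere between 170K and 210K on top of the disk.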

With filesystems, you have to worry about space, fragmentation, mount options, filesystem creation-time tunables, runtime tunables, esoteric kernel tunings, fancy disk layouts, and so on. If you’re weird like me and thoroughly enjoy such things, then go for it. As time goes by though, the novelty factor diminishes greatly. Been there, done that, smashed some speed records on the way.

What’s needed in the larger shops, aside from performance, is scalability, ease of use and deployment, and simplicity.

With VTL, you get all of that.

The other issue with disk is that backup vendors, while they’re getting better, impose restrictions on the # streams in/out, copy to tape and so on. No such restrictions on tape.

One issue with VTL: depending on your backup software, setting up all those new virtual drives etc. can be a pain (esp. on NBU).

For a small shop (fewer than 2 data movers), a VTL is probably overkill.

D


So who am I?

Hello everyone,

My name is Dimitris Krekoukias.

This blog used to be on another server; I moved it here, and hopefully this hosting facility will be more stable.

I resemble a silverback gorilla more than a monkey (man-pelt and all), and could probably wrestle one (and have a fair chance of winning).
I have extensive experience in the backup and recovery arena, and indeed know far more about certain products than I (or the vendors) would like to.
This blog will not be just about recovery - I have other interests, such as storage, OS design, tuning, filesystems, HPC, and other exotica. Plus a ton of non-IT-related hobbies - but that’s a story for another day.
Hopefully everyone will find this blog stimulating, controversial and, at times, annoying - in which case, tough.

D


On windows filesystem tuning and funky cache mechanisms

Edited: I just realized I must have used different postmark settings for Vista and XP. Do NOT use the following numbers to compare Vista to XP performance.

I won’t go into a diatribe on how to tune Windows - there are excellent guides on Microsoft’s and IBM’s sites, among others.

But I wanted to share some goodness based on some recent findings of mine.

First, the part that most probably know (works on XP and 2003):

From a command window do

fsutil behavior set disablelastaccess 1

This will disable access time recording, which IMO is useless unless you really do care when a file was accessed. If there isn’t much going on with your disk (or you are on some fancy EMC box with tons of cache) you won’t notice the difference, but if you have busy disks, this typically helps a bit.

On 2003, you can also increase the size of the lookaside buffer if you have many concurrent file operations:

fsutil behavior set memoryusage 2

This also works on Vista but not XP, sadly. See more here: http://technet2.microsoft.com/WindowsServer/en/library/9fcf44c8-68f4-4204-b403-0282273bc7b31033.mspx?mfr=true

Now, for the interesting part. I use a laptop that’s pretty decent (100GB 7200RPM drive, 2GB RAM). I hammer my disk since I use the laptop for vmware and other duties (music software with thousands of files, for instance).

I like postmark and iozone for measuring performance. Here’s how I configure postmark:

set number 10000

set transactions 20000

set subdirectories 5

set size 500 100000

set read 4096

set write 4096

run

This will create 10,000 files, then perform 20,000 transactions on them. The files will range from 500 bytes to 100KB in size. This is brutal on CPU, cache and disk. If you want different-sized files, just specify the min and max sizes, but be careful with the number of files: if you leave it at 10,000 and tell it to make 100GB files, better make sure you have the space.
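
Before kicking off a run, it is worth estimating how much space the initial file pool will need. A minimal Python sketch, assuming file sizes are spread evenly between the two bounds so that the average is the midpoint (an assumption about postmark's behavior, so treat the result as a ballpark):

```python
def initial_pool_bytes(files, min_size, max_size):
    # Assumes sizes are spread evenly between the bounds: mean = midpoint.
    return files * (min_size + max_size) / 2


# The settings used above: 10,000 files between 500 and 100,000 bytes.
estimate = initial_pool_bytes(10_000, 500, 100_000)
print(f"{estimate / 1_000_000:.1f} MB")

# The cautionary case: 10,000 files of 100GB each.
oops = initial_pool_bytes(10_000, 100e9, 100e9)
print(f"{oops / 1e15:.1f} PB")
```

The first case needs roughly half a gigabyte for the initial pool, before any transactions append to it. The second needs a petabyte.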

Anyway, here are some results:

Vista untweaked (10000 files and transactions, 512 byte I/O):

Time:
181 seconds total
51 seconds of transactions (196 per second)

Files:
15047 created (83 per second)
Creation alone: 10000 files (121 per second)
Mixed with transactions: 5047 files (98 per second)
4945 read (96 per second)
5055 appended (99 per second)
15047 deleted (83 per second)
Deletion alone: 10094 files (210 per second)
Mixed with transactions: 4953 files (97 per second)

Data:
257.93 megabytes read (1.43 megabytes per second)
826.79 megabytes written (4.57 megabytes per second)

Vista tweaked with fsutil as described above:

Time:
159 seconds total
51 seconds of transactions (196 per second)

Files:
15047 created (94 per second)
Creation alone: 10000 files (158 per second)
Mixed with transactions: 5047 files (98 per second)
4945 read (96 per second)
5055 appended (99 per second)
15047 deleted (94 per second)
Deletion alone: 10094 files (224 per second)
Mixed with transactions: 4953 files (97 per second)

Data:
257.93 megabytes read (1.62 megabytes per second)
826.79 megabytes written (5.20 megabytes per second)

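So it’s a bit better: total time drops from 181 to 159 seconds (about 12%), with the gain coming from file creation and deletion rather than from the transactions themselves.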

Another thing you can do is set the processor quanta to be fixed 120ms chunks (simply done by right clicking on “My Computer”, then properties, advanced, performance, settings, advanced, and setting processor scheduling to background services). Yes, I’ve had by far the best luck with XP by tuning it like a server. Your mileage may vary but this also increases postmark results a bit.

You can also play with increasing the cache: in that advanced pane again select “system cache” and, with regedit, go to HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\lanmanserver\parameters\size and make it a 3. This is all if you have XP. In 2003 it comes just like that, unless you want to run SQL, IIS or Exchange, in which case there’s a setting, “maximize throughput for network applications”. This limits cache to 512MB, and lets the apps cache on their own.

OR, you can actually spend some money and ridiculously increase performance by getting a caching product like Superspeed’s Supercache or Datacore’s Uptempo (I tried O&O Clevercache as well and was thoroughly underwhelmed).

Here are results with 20,000 transactions and 4K I/O, XP tuned just like a server:

Time:
386 seconds total
308 seconds of transactions (64 per second)

Files:
20092 created (52 per second)
Creation alone: 10000 files (142 per second)
Mixed with transactions: 10092 files (32 per second)
9935 read (32 per second)
10064 appended (32 per second)
20092 deleted (52 per second)
Deletion alone: 10184 files (1273 per second)
Mixed with transactions: 9908 files (32 per second)

Data:
548.25 megabytes read (1.42 megabytes per second)
1158.00 megabytes written (3.00 megabytes per second)

And here are results with the exact same settings but with 256MB of Supercache on that volume, lazy writes on:

Time:
196 seconds total
163 seconds of transactions (122 per second)

Files:
20092 created (102 per second)
Creation alone: 10000 files (344 per second)
Mixed with transactions: 10092 files (61 per second)
9935 read (60 per second)
10064 appended (61 per second)
20092 deleted (102 per second)
Deletion alone: 10184 files (2546 per second)
Mixed with transactions: 9908 files (60 per second)

Data:
548.25 megabytes read (2.80 megabytes per second)
1158.00 megabytes written (5.91 megabytes per second)

I am a believer. The size of the dataset far exceeded the capacity of Supercache, but it helped tremendously regardless.
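
For anyone who wants the speedup as a number, here is a small Python calculation using only the timings from the two XP runs above. The integer division at the end is just a cross-check against the transaction rates postmark reported, assuming it rounds down.

```python
baseline = {"total_s": 386, "transaction_s": 308}   # XP tuned, no Supercache
cached = {"total_s": 196, "transaction_s": 163}     # same, 256MB Supercache
transactions = 20_000

for key in baseline:
    print(key, f"{baseline[key] / cached[key]:.2f}x faster")

# Cross-check the reported transactions-per-second figures.
print(transactions // baseline["transaction_s"],
      transactions // cached["transaction_s"])
```

That is close to a 2x improvement overall, from a cache a fraction of the size of the dataset.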

Since I don’t believe all benchmarks, I also ran iozone.

File size    Record size (KB)
(KB)           4096     8192    16384
64                -        -        -
128               -        -        -
256               -        -        -
512               -        -        -
1024              -        -        -
2048              -        -        -
4096          70011        -        -
8192          29264    50257        -
16384         26229    33289    37198
32768         27578    28827    34778
65536         26982    27890    28997
131072        20901    21680    22223
262144        21769    20789    22249
524288        23076    25270    26258

The top row shows record size, the left column file size. The above is without the cache. Now with cache:

File size    Record size (KB)
(KB)           4096     8192    16384
64                -        -        -
128               -        -        -
256               -        -        -
512               -        -        -
1024              -        -        -
2048              -        -        -
4096         279746        -        -
8192         264110   262117        -
16384        250322   249355   238230
32768        233373   238932   233980
65536        204786   232418   234544
131072       234552   230336   225731
262144       164434   227792   222540
524288        35515    31533    41262

These results are for writes, in both cases. Iozone’s output is too large to include here but I’ll gladly send the entire file to anyone that wants it. I would ignore record sizes under 4K since Windows will coalesce writes to 4K and up anyway (up to 64K).

It seems that these products are worth a serious look. In most cases, significant benefits will be realized by caching the volume that holds the swapfile, even if only using 128MB. In one case I went from 124 seconds for a postmark run to 70s by caching the swap volume, even though I had ample memory and Windows shouldn’t have been using swap.

Unix is generally a bit more robust for caching and virtual memory, so you don’t need extra products. Looks like Windows needs a bit of help. Indeed, Microsoft uses Supercache on the servers that host MSN, I found out…

Anyway, you can see that for file sizes up to 256MB, Supercache kicks the Windows cache’s ass. Now remember, this is a box tuned just like a server; it was using like 1GB of cache even without Supercache. After you exceed the size of the cache by using the large 512MB test file, you still realize some benefits, as you can see.
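
To see where the benefit tails off, here is a short Python sketch that divides the cached write results by the uncached ones for the 4096 record size column of the two iozone tables above. The numbers are copied from those tables; nothing else is assumed beyond file sizes being in KB.

```python
# Write results for record size 4096, keyed by file size in KB.
without_cache = {4096: 70011, 8192: 29264, 16384: 26229, 32768: 27578,
                 65536: 26982, 131072: 20901, 262144: 21769, 524288: 23076}
with_cache = {4096: 279746, 8192: 264110, 16384: 250322, 32768: 233373,
              65536: 204786, 131072: 234552, 262144: 164434, 524288: 35515}

ratio = {size: with_cache[size] / without_cache[size] for size in without_cache}
for size, r in ratio.items():
    print(f"{size // 1024:>4} MB file: {r:.1f}x")
```

Up to the 256MB file the cached runs are several times faster, and at 512MB the gain shrinks to about 1.5x, which is still not nothing.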

Datacore’s UpTempo produced similar results. It is far less tunable, uses a unified cache (instead of a chunk per partition), is easier to configure, and can be more or less expensive: Supercache for 4 CPUs is like $1K, but half that for 2 CPUs, while UpTempo is about $700 regardless. Another difference is that UpTempo is 32-bit only at the moment.

D