Okay, I've officially been taking care of this application long enough to feel justified in making these observations. Everything I'm going to say after this paragraph is a criticism. Where I even have a conception of how to fix the situation, those will be constructive criticisms. Please don't take this as my casting aspersions on the ability or intelligence of the people who make this product, especially after having met many of them, and found them to be roundly likeable, able, and smart, though I'll happily cast some aspersions on some (but definitely not all) of the people who answer calls for customers with questions, never mind the sales people who lie in ways you can't quite label as outright dishonesty, but that's not what I'm here to do right now.
There are a lot of words here, and some of them might make sentences you don't understand. This is the only apology you're going to get for it. Sorry!
-
Multi-homed hosts. This may seem like a trivial point, especially relative to some of the items I'm going to discuss below, but it's the one that has bitten me most frequently, that I believe bites the most people, and that I have found to be the most infuriating of NetBackup's flaws. The fundamental problem is that NetBackup simultaneously encourages the use of systems with connections to multiple networks and fails to understand how those hosts, under any operating system on a TCP/IP network, function. It's long been a standard configuration to have a separate network connection on most hosts dedicated to backup traffic. That way the rather large amount of data transferred in a relatively short time–ideally, bottlenecking on the speed of the network connection–to back systems up doesn't get in the way of actual application information, which often has to keep on chugging while backups are happening, not just for the system being backed up at that time but for all the other systems on the same network. Fine.
So, how do you make sure your backup traffic works this way? Well, Unix (and thus, all other) operating systems have a way to do this on TCP/IP networks: the longest-match routing principle.[1] That is, sometimes, not actually what you want, especially if there are two equally-probable candidates based on the standard routing algorithm (where, in large part, which to use is up to the operating system, but it almost always chooses the first one it sees in the routing table). NetBackup has a response to that particular problem, the REQUIRED_INTERFACE configuration variable. This says, “Open a connection from this IP address, you rat bastard.”
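For anyone who hasn't stared at a routing table lately, here is a minimal sketch of longest-match route selection. The table, the interface names, and the `pick_route` helper are all made up for illustration; real kernels also weigh metrics and policy rules that this ignores.

```python
import ipaddress

# Hypothetical routing table: (destination network, outgoing interface).
ROUTES = [
    ("0.0.0.0/0", "eth0"),      # default route, production network
    ("10.0.0.0/8", "eth0"),     # broad production route
    ("10.20.0.0/16", "eth1"),   # dedicated backup network
]


def pick_route(dest, routes=ROUTES):
    """Return (network, interface) for the most specific route containing dest."""
    addr = ipaddress.ip_address(dest)
    candidates = [
        (ipaddress.ip_network(net), iface)
        for net, iface in routes
        if addr in ipaddress.ip_network(net)
    ]
    # max() keeps the first of several equally long prefixes, which mimics
    # the "first one it sees in the routing table" tie-break described above.
    return max(candidates, key=lambda c: c[0].prefixlen)


if __name__ == "__main__":
    for dest in ("10.20.5.7", "10.1.2.3", "192.0.2.1"):
        network, iface = pick_route(dest)
        print(dest, "->", network, "via", iface)
```

Note what the selection depends on: the destination address and nothing else. The routing table has no idea whether a given connection carries metadata or a terabyte of database blocks.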
This works just fine most of the time. I've got a backup network, and I send my backup traffic across it, great, that's my REQUIRED_INTERFACE. I've got a system that writes directly to tape drives with its big data, but its metadata (communication with the NetBackup master server) is something I would rather not have lost in the shuffle, since it's how I'll know where the real data is when I want to restore it, fine, my REQUIRED_INTERFACE is not the backup IP address, but the regular one, since that network isn't swamped with everyone's backup data.
There are, however, some trivial ways in which this Just Does Not Work. Let's say I'd like some of my backups (the really big stuff) to go straight to a tape drive attached locally, but I'd like some (relatively small, but still a noticeable quantity of data) to go across the network to another system, where it'll be written to disk and transferred off to tape in a large-chunk write later. For the first half, I want to use the production network for NetBackup, so that my metadata doesn't get lost. For the second half, I want to use the backup network for NetBackup, so that I don't flood the production network with backup data. With NetBackup you simply Can Not Do This. If you believe that this is a rare corner case and that I'm whinging without cause, I'd like to introduce you to one of my many friends, the 10 TB-heavy Relational Database Management System, and point out to you the redo/archive logs sprouting all over their faces like pimples, requiring regular, frequent (often, every five minutes or less) Oxy-scale cleansing by the backup subsystem.
Why can't NetBackup do this? Because a given system is, for NetBackup, exactly the name that you tell the system it is, modulo what the local name/address resolver says it is. You can skirt this slightly with REQUIRED_INTERFACE, but then all traffic will use that alternate. There is no way, internal to NetBackup, to differentiate between where you want metadata and where you want data to go, much less where you want some types of real data to go and where you want other types of real data to go (disregarding my previous example, think backups of installed software that anybody could pirate much more easily than by ganking your backups versus customers' CCNs you'd like to send over a hardware-enciphered link).
It's been… over ten years since I've seen a real server that had fewer than two network interfaces; nay, I'm used to ones that have four to eight, and assume that I'll get at least three out of the box without adding any hardware. Why are we still assuming that there's a one-to-one relationship between IP addresses and hosts? That's the less common case, for enterprise systems anyway, not the more common one (note that SSL has the same fucking problem, but that's a different rant). Look, I know all the IP addresses that are that system, you dumb computer, give me a way to tell you that. Or, hey, better yet, let's assume that most of your customers are stupid (judging by some of the support calls described to me by my co-worker who used to be with Symantas, they probably are)… how 'bout we just generate a unique identity for this computer based on whatever you like (pass it through even a weak hash function like SHA1, and you won't have any collisions in the foreseeable future) when the software is installed, and identify the system internally by that after first contact, rather than by its hostname/IP address. Pass that, after the TCP/IP handshake, on the wire. Sure, that's spoofable, not that IP addresses aren't, so toss in some PKI authentication (no need for enciphering the stream unless I ask for it), and you're done. This is very much a Done Thing. Look at fucking SSH. Doing this would do away with, just a rough and conservative guess here, 20% of the support calls over failed backups (when hostnames change or DNS is broken), never mind letting me have some sane control over where backup data flows when.
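To make the identity idea concrete, here is a rough sketch of what I mean. Everything in it (`generate_host_id`, `HostRegistry`, the addresses) is hypothetical, none of it is NetBackup's API, and the PKI authentication step is left out entirely.

```python
import hashlib
import os
import uuid


def generate_host_id(seed=None):
    """Create an opaque identifier once, at install time.

    The seed is an assumption: any locally unique bytes would do.
    """
    if seed is None:
        seed = uuid.uuid4().bytes + os.urandom(16)
    return hashlib.sha1(seed).hexdigest()


class HostRegistry:
    """Maps a stable host ID to every address that host has been seen at."""

    def __init__(self):
        self._addresses = {}

    def record_contact(self, host_id, address):
        # The ID arrives on the wire after the TCP handshake; the source
        # address is just one more interface that belongs to this host.
        self._addresses.setdefault(host_id, set()).add(address)

    def addresses_for(self, host_id):
        return sorted(self._addresses.get(host_id, ()))


if __name__ == "__main__":
    registry = HostRegistry()
    host_id = generate_host_id()
    registry.record_contact(host_id, "192.168.1.5")   # production interface
    registry.record_contact(host_id, "10.20.0.5")     # backup interface
    print(host_id, registry.addresses_for(host_id))
```

Rename the host or break DNS, and the registry still knows which machine it is talking to, because the key never was the hostname.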
-
Scaling. I'm willing to let a lot more go here, because scaling is not what you'd call an easy problem. Things that seem to work well enough over small data sets don't work well over large data sets and, here's the kicker that a CS degree doesn't usually teach you, vice versa. But there are some truly egregious errors in the currently-used versions of NetBackup–the newly released version 6 doesn't count; nobody who matters is using that in production for another couple of months. None of this has been fixed, though some of it has been made a lot better. (Scaling is one of those things you can't really Fix; you're optimizing for different things at different scales.) The first concern is with the NetBackup master's behavior with regard to media servers. Basically the fundamental flaw is that whether a given media server's resources (“storage units”; usually tape drives, but also hard drives dedicated to backup data, and rarely optical drives) are useable is driven by the master server. This means that it must query each media server periodically to verify that the tape drives reported the last time are still there, and update its databases of where it can send backup data appropriately. This is fine if you've got, say, three or four media servers (taking backups from small systems across the network) and about the same number of SAN media servers (for really big systems, writing their own data straight to tape). It falls flat on its face if you've got, say, ten or more media servers over which you lack direct administrative control.
Because, you see, the daemon[2], as of NetBackup 5, that schedules backup jobs to run is the same daemon that does this polling of resources on media servers. This makes a certain logical sense, in a very old-fashioned way, given that you want the thing doing the scheduling to know what resources are available. But that logic stopped making sense in the mid-80s (long before this software was written, even if you count the BackupPro days before Veritas bought it), when we chose (and have, idiotically, stuck with) a default timeout of about two minutes without initial response before we decided some other host on the Internet was inaccessible. Sure, you can change that timeout at the operating system layer, and NetBackup lets you change it even by roughly what type of communication you send (we've knocked it down to thirty seconds for media servers, though I think it should be way less than that, like maybe five; if they can't cough up a response by then, they're too lagged to do any useful backing up anyway), but no matter how short you make this timeout[3] (while still leaving enough time for a system on the other end to actually send a response), it is still way too long to hold up things like scheduling backups, for some number of media servers. By default, NetBackup initiates one of these “scan the media servers” passes every ten minutes… and the parent scheduling daemon? Until it hears back from a given host, it will not schedule backups using that host's resources. No, really, I'm not joking.
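The arithmetic is worth doing once. Below is a back-of-the-envelope sketch that assumes the polls are strictly one-after-another and that a dead media server costs the full timeout; `blocked_seconds` and the server counts are my own illustration, not measurements from a real environment.

```python
def blocked_seconds(response_times, timeout):
    """Total time a serial poll pass takes.

    response_times holds one entry per media server: seconds until it
    answers, or None if it never answers at all.
    """
    total = 0.0
    for seconds in response_times:
        if seconds is None or seconds > timeout:
            total += timeout      # we sit on the socket until we give up
        else:
            total += seconds
    return total


if __name__ == "__main__":
    # Hypothetical pass: 50 servers answer in a second, 10 never answer.
    servers = [1] * 50 + [None] * 10
    for timeout in (120, 30, 5):
        print(f"timeout {timeout:>3}s -> pass takes {blocked_seconds(servers, timeout):.0f}s")
```

Under those assumptions, ten dead servers at the two-minute default eat twenty minutes, which is twice the ten-minute interval between passes. Thirty seconds gets it under the interval, but that is still minutes during which the scheduler is waiting on hosts that were never going to answer.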
This is compounded by a resource allocation process that works as follows. Scheduling daemon sees a backup job to run. Scheduling daemon checks its cached list of resources necessary for that backup (host to be backed up is available, host that writes the backup to tape drives is available, second host has tape drives available in sufficient quantity to do the backup, there are any tapes at all available to write to). Scheduling daemon tells the host that writes the backups to start the job. Host that writes the backups contacts the host that's being backed up and tells it to start sending data. Then the host that writes the backup actually reserves the resources (tape drives, and the first tape of what may very well be more than one, but it can't know that at the time) necessary for the backup. If you are unfamiliar with the term “race condition,” let this be your canonical example.
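Here is the same sequence boiled down to a toy, for anyone who wants to watch it fail. `MediaServer`, `check_then_start`, and `reserve_at_schedule_time` are hypothetical names of mine; the second function is the obvious alternative of reserving before starting anything, included only for contrast.

```python
class MediaServer:
    """Toy media server that owns a fixed number of tape drives."""

    def __init__(self, drives):
        self.free_drives = drives

    def reserve_drive(self):
        if self.free_drives == 0:
            return False
        self.free_drives -= 1
        return True


def check_then_start(jobs, server):
    """Start every job that passes a cached check; reserve drives afterwards."""
    cached_free = server.free_drives      # snapshot from the last poll
    started = [job for job in jobs if cached_free > 0]   # cache never shrinks
    failed_after_start = [job for job in started if not server.reserve_drive()]
    return started, failed_after_start


def reserve_at_schedule_time(jobs, server):
    """Reserve first; only jobs that actually hold a drive get started."""
    started, deferred = [], []
    for job in jobs:
        (started if server.reserve_drive() else deferred).append(job)
    return started, deferred


if __name__ == "__main__":
    print(check_then_start(["db_logs", "filesystem"], MediaServer(drives=1)))
    print(reserve_at_schedule_time(["db_logs", "filesystem"], MediaServer(drives=1)))
```

With one drive and two jobs, the first approach starts both and then fails one that is already running; the second starts one and leaves the other waiting in the queue.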
So, they fixed this a bit in the new release, though I think they've stolen several new boxes from nice girls named Pandora and opened them right the fuck up in the process. Now, the bits that do the resource allocation are separate from the bits that do the scheduling. You can even put them on a different host, and you can have that host do resource allocation for more than one backup environment (if you aren't very scared by that concept, you aren't thinking hard enough about walking into some future employer where some twit decided that was a good idea, then got fired). They also made it so that the resource allocations take place at the time of scheduling, by way of the scheduling daemon asking the resource allocation daemon if, you know, it's possible to do this shit and reserving what is necessary then for this backup job's use. That last part's nice, but it just moves the race condition. Instead of “stuff's no longer available because you started the job before you asked,” we've got “stuff's no longer available because it broke/went away while you were busy starting the job.” I think I like the new way better, but I haven't really had the opportunity to see it explode all over me yet, so, hey, who knows.
This doesn't fix the fundamental problem, though, which is the lack of distribution. Why should this “resources available” checking be driven by a central system polling each of the systems with resources in succession? That simply doesn't make sense, and architectures of that design have been proven to fail at scaling for quite some time now. I've got sixty-plus perfectly good installed media servers… what's wrong with their reporting their status to my central server? Sure, there's a denial of service attack here… but if we're going to start on things that run on enterprise networks that I can DoS, this is pretty far down the list. Hell, if I were going to DoS NetBackup, I wouldn't waste my time pretending to be a server, I'd just issue a bunch of requests to list all files from forever for several hundred valid hosts and then ignore the responses while bpdbm chugged its sorry ass off on the master.
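A sketch of the push model, so it's clear how little is involved. `ResourceDirectory`, its `max_age_seconds` cutoff, and the server names are all assumptions of mine, and the transport and any authentication of the reports are left out.

```python
import time


class ResourceDirectory:
    """Master-side view built from status reports pushed by media servers."""

    def __init__(self, max_age_seconds=60.0, clock=time.monotonic):
        self._max_age = max_age_seconds
        self._clock = clock
        self._reports = {}    # server name -> (time of report, free drives)

    def report(self, server, free_drives):
        """Called whenever a media server pushes its current status."""
        self._reports[server] = (self._clock(), free_drives)

    def usable_servers(self):
        """Servers with a fresh report and at least one free drive."""
        now = self._clock()
        return sorted(
            name
            for name, (seen, free) in self._reports.items()
            if now - seen <= self._max_age and free > 0
        )


if __name__ == "__main__":
    directory = ResourceDirectory(max_age_seconds=60.0)
    directory.report("media01", free_drives=2)
    directory.report("media02", free_drives=0)
    print(directory.usable_servers())
```

A media server that goes quiet simply ages out of the list. Nothing on the master ever blocks on a socket waiting for it, so the scheduler can keep handing work to the servers that are actually talking.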
There are two more of these, where describing the problem probably takes longer than describing the fix, but this is long enough as it is. So, next time, we'll cover:
- Pathological behavior under robotic device failure.
- Unnecessarily complex server/client architecture.
[2] A “daemon” is just a process, program, whatever that runs on a server and does something not immediately apparent to the user. Like you got this far in this post without knowing that. (But, hey, congrats if you did!)
[3] … and don't even get me started on firewalls that simply drop packets on the ground rather than, say, sending an RST. Yeah, I really wanted to hang around on that socket for TWO FUCKING MINUTES that you knew full well you were never going to pass data back along, when you could have just said, “Sorry, you lose,” and initiated TCP teardown, you fucking bastard Cisco Catalyst.