
They're called users for a reason.

There's a particular segment of the “clients” at $CURRENT_EMPLOYER who are particularly demanding. They're not the only segment like this, but they're my segment to deal with. They're not bad people, really, but they've got a very warped set of expectations. It's not that they consciously expect me to work harder than them, it's just that they don't understand what limitations are imposed by the relevant technologies (and, in places, which technologies are logically their problem and which are mine).

So, I walked into this moderately innocent of some history. I got a bit of a briefing from the boss, but one of her major problems was that the guy who took care of backups for this crowd before me just did things for them, put out more than he actually had time to do, which resulted in not maintaining other things, and in aiding to establish this warped set of expectations.

The warped expectation for tonight is the concept of the Refresh.

This is a perfect example of how insidious these people are, because what they want here is completely sensible. They've multiple database systems, copies of each of which exist for production, quality assurance, and development. Sometimes, they'd like to refresh their QA or devel systems with the most current data from the prod systems, so they can test with something like the real world without breaking the real world. (Other times, they'd like to move design changes from development to production, but that doesn't figure here, really, since it's comparatively a small amount of data, and not moved this way.) They do not, however, want to take their prod systems down, or even load them more than ordinary business would, to do this. So they've concluded that the right way to deal with this is to pick up one of the scheduled backups of the prod system and restore it over top of a QA or devel system. There are some details of their procedure with which I don't totally agree, but these are essentially points of preference: the theory's sound.

Right up till that's not what was implemented in Veritas NetBackup. It's not clear to me whether they didn't ask for this clearly, the person doing the implementation didn't listen, didn't actually understand how to fulfill these wishes (they are a bit out of the ordinary way that NBU views the world, but they're fulfillable wishes even given the clumsy tools to hand), or just didn't give a rusty fuck. But what exists here is really only set up to do regular old backups and recovery. There's some hackery on top of it (including an undocumented cron job that runs on the NBU master server in production-land and rsyncs the NBU volume database for select NBU clients over onto the master server in development/DR-land, just as an example), but these are costumes, they are not designer clothes, and that shows if you look at them close. Like, say, trying to make this thing happen.

But now, all of a sudden, it devolves upon me to “do a refresh”. I look over some notes (a section, I note, ripped out of a complete set of disaster recovery documentation… a full copy of which I've yet to be furnished at the time–and which I still lack, btw) and see what's entailed. It doesn't seem like that big a deal. I need to ask for some tapes to be physically moved from our data center in one city (prod) to another of our data centers in a town out in the boonies where the terrorists don't go (never you mind the NRA), and I need to see that those tapes are placed in a tape library. That's about it.

Perception, won't you please meet Reality?

They've somehow caught the drift, from my polite but firm nature over the past couple of months (and probably more from my boss's backing of that firmness), that I won't be having anything to do with tools that are, functionally, the DBA's responsibility. Yes, brrestore is a tool used for backup and recovery… but it's a tool provided by SAP, and it's intrinsically linked to SAP, a software product with which I'm not in the habit of professing intimacy. (You wouldn't go about telling people you had genital warts, especially when you didn't, right?) So they'll do that bit: I just have to monitor the job. Beg pardon?

This is insane, since the expectation is that I will monitor their software (with which I'm unfamiliar), and their implied assumption is that I will do this manually (though that may seem insane, remember: they're DBAs, every one of them). I opt for not arguing the point (since I understand that past problems with this process have been caused by tape hardware breaking or NBU being a flaming pile of shit) and not bothering to point out that “monitor” means “five minutes of Perl” in my world.

Boss has insisted, after past difficulties, that this process never be given anything less than a five business day service level agreement, to which this lot has agreed. Let's remember a couple of points here, because they're going to come up again: “five (5) days,” “business day”. One of them comes up right now. They want to start this thing on a Friday (there is a special place reserved in my future torture cellar for IT people who start projects relying, after initiation of the process, on members of departments other than their own on Fridays). And, quite obviously, I'll be monitoring it over the weekend. Right? Right.

It's around 23:30, my local time, on Friday night, after several hours trying to help these people make their DB's restore component successfully read data off my perfectly usable tapes right there in the library, that I throw caution (and sanity) to the wind and suggest that, since this was an offline backup of their database (as in, effectively just taking unchanging and consistent files on disk and putting them on tape), I just restore the files for them. “Sure, that'll be fine. Let us know how it goes. Bye.”

I kick off a job to suck the 1878 files their program claimed it couldn't find off, they start streaming in, I point the monitoring script at it, stroke it lightly, and walk away. Vaguely twenty-four hours later, that job finished… having missed a shit-ton of files. Like, a touch over six hundred. Because here's one of the ways that NetBackup sucks boulders through a coffee stirrer: it speeds backups up through a process called “multiplexing”. This means that an individual file may be broken into multiple chunks and written to multiple tapes at the same time. This concept is neither original, nor unique. Nor is the thorny bit: getting that data back off. NBU really, really sucks at restores of multiplexed backups. As near as I can tell, if one tape drive has even a transient (as in, non-fatal) read error (Which happens a lot with tape drives! They're mechanical devices!), whatever file NBU's in the middle of is either not restored or (more insidiously) incompletely restored.

Fine, whatever, I've got a list of the files I want, including where they're supposed to be, so getting a list of exactly what's missing is just a for i in `cat list` ; do [ -f $i ] || echo $i ; done. Finding incomplete files is similarly simple, since all these DB files are owned by the DB user, but NBU runs as root, and alters ownership on the files as the last step of a restore.
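For the curious, both checks can be sketched as a pair of small shell functions. This is a minimal sketch, not the script actually used: the function names are made up for illustration, and it assumes the list holds one absolute path per line with no newlines embedded in file names. The ownership test leans on the behavior described above, namely that a file still not owned by the DB user never reached the final step of its restore.

```shell
#!/bin/sh
# Hypothetical helpers. $1 is a list file with one absolute path per line.

# Print every listed path that does not exist as a regular file.
missing_files() {
    while IFS= read -r f; do
        [ -f "$f" ] || printf '%s\n' "$f"
    done < "$1"
}

# Print every listed file that exists but is not owned by user $2.
# Assumption: the restore hands ownership to the DB user as its last step,
# so a file with any other owner was not fully restored.
not_owned_by() {
    owner=$2
    while IFS= read -r f; do
        if [ -f "$f" ]; then
            # -prune keeps find from descending; ! -user selects other owners.
            find "$f" -prune ! -user "$owner" -print
        fi
    done < "$1"
}
```

Something like `missing_files list > rerun.list` followed by `not_owned_by list dbuser >> rerun.list` would then produce the input for the next restore pass, where `dbuser` stands in for whatever account owns the database files.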

The rerun jobs (this has to be done repeatedly because NBU sucks) are puttering along on Monday when I am asked for a status by no fewer than four people, on no fewer than seven occasions. And I'm sitting there going “Um… five what nows?” But I play politics, and they end up with what they were assured they couldn't have in under five business days in three real days or… wait for it… one business day.

I think I'm done, and they go about this business of making their production database into their devel database, ask for a backup of that out of the regular schedule (which request they have finally learned to send to the people responsible for job scheduling of all types for the whole organization, since I don't even have access to kick off the appropriate job, which goes off and shuts down their DB cleanly–using a script they provided–before running the backup), and then proceed along with whatever loading and testing they need to do to ready this database for their users. And it is, at this point, that they declare that some of the data files are corrupt and demand an immediate restore from the backup they'd requested.

Ignoring the SLA on recoveries (hint: it will probably take upwards of two hours and a one-time fee of $125 to even get your tape back from Iron Mountain, kids), I go ahead and produce the file for them from that backup. Also corrupt. They ask for the same filename from the backup of their prod database that I'd originally restored for their Refresh. I bite my tongue on pointing out that, given that these are database files, there is literally zero guarantee that the data in that file back three weeks ago is the same as what Oracle (because SAP is just a gauzy undergarment over the raging tumescence of Oracle, for those who hadn't noticed) decided to put there after their changes, and produce that file. Not corrupt. (My shock is a roaring ocean at high tide in a hurricane.)

So… what this sounds like to me is them telling me that they fucked up their own data, right? Really, I'm the picture of empathy here, I promise. Which is a good thing, since now they want me to do that thing (the one with the five day SLA, remember that one?) that I did in three days… in one day. This time frame is based on a manager's need to have the system available two days later, and weakly justified by the assertion of a member of their group (who has been uninvolved up to this point, but is actually no fool) that the restore ought to take only twelve hours. Oh, no wait… let me translate that from post-managerial-lobotomy back to English: What he really said was that, when the database was a different (smaller) size, and when the tape drives used were different, and when the backup software product used was OmniBack, not NetBackup, restores took about 1.5x as long as backups, which he seems to recall was twelve hours.

This was all happening on Tuesday afternoon. They needed the DB to be available for “clients” (if I haven't made it sufficiently clear, that means “other people who work for the same employer”, as opposed to “customers” who are regular people who would like their money kept safe, their credit card charged only for things they actually purchased, stuff like that) Thursday morning, according to the original schedule. We can do that, right? No, of course not. Misguided twelve hour estimates be damned, it took twenty-four hours just to spin the tape past read heads once before, and that number just isn't going to change by force of will (and, what's more, there are going to be some missing files). The more-skilled DBA (who is, I must reiterate, refreshingly not an idiot) kicks off the brrestore… around 02:00 in my time zone, after it takes the SA staff about eight hours to unmount and remount a few partitions (apparently, it was assumed that there was a need to destroy the Veritas Volume Manager logical disks and recreate them, and some of the capacity got lost along the way, and had to be found again).

All through the next day, I'm asked for Status. “That job is still running. It is working, exactly as I said it would, and exactly as it did the last time. If this changes, I'll be sure to tell you right away. I do, you know, have some duties other than this Refresh of yours which is, when it comes down to it, not actually the most important thing in the universe.”

And, of course, it missed some files (way fewer this time; must have lubricated the tape the first time or something… if we do this one more time, it may just work!), which I kicked off another restore to pick up. “How long will that take?” I cringe at this question. I feel for the one asking it, even if I don't especially like them–in general, and certainly not for asking it–because it's an obvious question and it should be possible to have an answer… but it just isn't. There isn't any way to know, without excessive logging that NBU doesn't keep, where specific files are on tape. It'll definitely take less time than the first pass, but, “I just can't tell you how long it will take.” Lather, rinse, status.

I think I'm done now, since they checked for corrupt files (my favorite part is how they make this sound like they're doing something complex, when what they're actually doing is find . -name \*.dbf -exec db_verify {} \;) and found a few, which I happily re-re-restored, and they dubbed them valid.
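The find one-liner above runs the verifier but leaves you to read its output to work out which files failed. A rough sketch of a wrapper that prints only the failures follows. The function name is invented for illustration, and it assumes the verifier exits non-zero when a file is bad, which is an assumption about the tool rather than something the DBAs confirmed.

```shell
#!/bin/sh
# Hypothetical helper: print each *.dbf under directory $1 for which the
# verifier command (all remaining arguments) exits non-zero.
# Assumption: a non-zero exit status from the verifier means a bad file.
# File names containing newlines are not handled.
failed_verifies() {
    dir=$1
    shift
    find "$dir" -type f -name '*.dbf' | while IFS= read -r f; do
        # Discard the verifier's own chatter; only the exit status matters.
        "$@" "$f" >/dev/null 2>&1 || printf '%s\n' "$f"
    done
}
```

Called as `failed_verifies . db_verify`, it would emit a list that can be fed straight back into the next restore request.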

So… what are the chances that their database is broken again tomorrow, after they proceed to make whatever changes they were supposed to make over the course of this past week as quickly as they possibly can to only miss their due date by a day? More to the point, what are the chances that that'll somehow be my problem?
