Tuesday, June 26, 2007

The Day from Hell

I know it is cliche, but Mondays can really be hell sometimes. On Sunday, I received an email from the mail/web server at work, notifying me that one of the hard drives in the RAID-1 array (mirrored) had failed. So, I readied myself for some server work on Monday.

I pulled out the faulty HD and replaced it with a backup drive. I then started up the server and watched the handy messages go by, until the system froze at the partition check. Oh... no... I tried a few more ideas, and still no go. Both hard drives died at the same time? Come on!

Time for the backup plan. I had prepared a Debian Etch server just for this event, and it was time to press it into service. I got it up and running, accepting email and serving web pages. Unfortunately, the server was not serving its role as a gateway to the regular desktop systems behind it. Frak! Something was (not) going on with the IP masquerading functionality. I installed the ipmasq package, but all forwarded packets were being dropped by the server, so no outgoing connections were possible from the desktops. The server itself could access both subnets, but no joy for the desktops.
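For what it's worth, a masquerading gateway on Linux generally needs kernel IP forwarding switched on, a MASQUERADE rule in the nat table's POSTROUTING chain for the outward-facing interface, and a FORWARD chain that lets the traffic through. Here is a minimal Python sketch of a pre-flight check for those three things. The function names, the `eth0` interface and the sample ruleset are all made up for illustration; it only inspects text you hand it (the contents of `/proc/sys/net/ipv4/ip_forward` and the output of `iptables-save`), and it says nothing about what ipmasq itself does under the hood.

```python
def ip_forwarding_enabled(proc_text):
    """True if the contents of /proc/sys/net/ipv4/ip_forward read '1'."""
    return proc_text.strip() == "1"


def has_masquerade_rule(iptables_save_text, out_iface):
    """True if the nat table has '-A POSTROUTING ... -o <iface> ... -j MASQUERADE'."""
    in_nat = False
    for line in iptables_save_text.splitlines():
        line = line.strip()
        if line.startswith("*"):          # table header, e.g. "*nat" or "*filter"
            in_nat = (line == "*nat")
            continue
        if not in_nat or not line.startswith("-A POSTROUTING"):
            continue
        tokens = line.split()
        pairs = list(zip(tokens, tokens[1:]))   # adjacent (flag, value) pairs
        if ("-o", out_iface) in pairs and ("-j", "MASQUERADE") in pairs:
            return True
    return False


def forward_policy(iptables_save_text):
    """Default policy of the filter table's FORWARD chain, or None if not found."""
    in_filter = False
    for line in iptables_save_text.splitlines():
        line = line.strip()
        if line.startswith("*"):
            in_filter = (line == "*filter")
            continue
        if in_filter and line.startswith(":FORWARD "):
            return line.split()[1]        # e.g. ":FORWARD DROP [0:0]" -> "DROP"
    return None


# Hypothetical iptables-save output, for illustration only.
SAMPLE = """\
*nat
:PREROUTING ACCEPT [0:0]
:POSTROUTING ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
-A POSTROUTING -o eth0 -j MASQUERADE
COMMIT
*filter
:INPUT ACCEPT [0:0]
:FORWARD DROP [0:0]
:OUTPUT ACCEPT [0:0]
COMMIT
"""

if __name__ == "__main__":
    print("masquerade rule on eth0:", has_masquerade_rule(SAMPLE, "eth0"))
    print("FORWARD policy:", forward_policy(SAMPLE))
```

In the sample, the masquerade rule is present but the FORWARD policy is DROP with no accept rules, which is one way a box can reach both subnets itself and still refuse to pass the desktops' traffic.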

Damn, damn, damn! Now what? 6 hours later, and I am now seriously screwed sideways and upside down. I am sweating profusely, and am almost ready to give up. I call a local "Debian" consultant, but it turns out he's not real familiar with iptables/nat. What to do... what to do?

I finally got the idea to try the hard drive from the additional server in another box, as I kept coming back to the fact that having two hard drives fail simultaneously seemed a bit suspect. I powered up the box, and voila! She booted, and everything looked good! I moved the NICs over from the old box, placed the spare hard drive in her, and fired it up. After partitioning the new drive and marking said partitions as Linux raid autodetect, I was able to add the new disk partitions to the existing RAID array, and everything seems good so far.
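A freshly added disk takes a while to sync, so it is worth keeping an eye on `/proc/mdstat` until every array shows all of its members up. As a rough sketch, the Python below picks out arrays whose status string contains an underscore (a missing member). The function name and the sample text are hypothetical, and the parsing assumes the usual `[2/1] [U_]` style status line.

```python
import re


def degraded_arrays(mdstat_text):
    """Return {array_name: status} for md arrays with a missing member."""
    degraded = {}
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\d+)\s*:", line)
        if header:
            current = header.group(1)     # e.g. "md0"
            continue
        status = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", line)
        if status and current:
            if "_" in status.group(3):    # "_" marks a missing or failed member
                degraded[current] = status.group(3)
            current = None                # ignore later lines until the next array
    return degraded


# Hypothetical /proc/mdstat contents, for illustration only.
SAMPLE_MDSTAT = """\
Personalities : [raid1]
md0 : active raid1 sda1[0]
      104320 blocks [2/1] [U_]

md1 : active raid1 sdb2[1] sda2[0]
      78043648 blocks [2/2] [UU]

unused devices: <none>
"""

if __name__ == "__main__":
    print(degraded_arrays(SAMPLE_MDSTAT))
```

On a real box you would feed it `open("/proc/mdstat").read()` and alert on a non-empty result.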

Needless to say, I learned a few lessons:

  1. My backups of the mail from the mail server worked well. However, there was additional information that should have been backed up but wasn't. This has been remedied.
  2. Even if you go to the trouble of making a backup server, you need to test it in exactly the same kind of conditions in which you need it to function. I tested out the services and such, but not the iptables/NAT requirement (thinking it would be a breeze - ha!).
  3. I need to find a way to move my OS from one set of hardware to another. I'd rather avoid the reinstall and subsequent system build up if at all possible, but this might be too much to ask. If you have any ideas, lemme know!
  4. I *really* need to buy some new server hardware.

So, things are back to normal now. I figure I lost at least a year of my life in those eight hours, and it wasn't until T+24 hours that I actually started feeling normal again. Maybe I should look into Google hosting our email...

4 comments:

  1. Crapola! Glad to hear you found a fix!

    Is it possible that there might be some NIC MAC address based rules in the IP forwarding chains/tables?

  2. No MAC based rules in the iptables script... I'm thinking that it was probably something silly/stupid and that I was *this* close to getting the backup server working, but I was probably too whacked out to see it at the time.

    I'll revisit the backup server sometime later, I guess.

  3. I can only imagine the verbal flogging you received. :)

  4. Somewhat surprisingly, he was totally cool about it. Perhaps due to the fact that he couldn't flog me worse than I was flogging myself at the time.
