A cautionary tale: Or why you should always test upgrades

mikeshinn
Atomicorp Staff - Site Admin
Posts: 4149
Joined: Thu Feb 07, 2008 7:49 pm
Location: Chantilly, VA

A cautionary tale: Or why you should always test upgrades

Post by mikeshinn »

I just spent three days helping a distraught customer. I don't like it when someone is distraught, especially one of our customers. Nobody puts our customers in the corner (if you are old enough to get this reference, I'll buy you a scotch if you are in the DC area). I much prefer it when people can get their computers fixed and get on with life, even if it's not something related to our products. Well, we never turn away a customer in need, and who knows, maybe it was our fault!

Well, this all started when the customer did one of those full-blown upgrades of their entire server, you know the old:

yum upgrade

Hey, we all like shiny new stuff. I'm no stranger to that; I do it sometimes too, and I bet a lot of other people do as well. In general, most of the time it's probably not going to bite you. But it isn't always the best thing to do, and that's what this story is really all about. So first, the flaming wreck that we were presented with:
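If you want to at least see what you are getting into first, plain old yum can tell you. One caveat: on older CentOS/RHEL boxes the --downloadonly option comes from the yum-downloadonly plugin, so install that first:

# See what "yum upgrade" would actually touch, without changing anything
yum check-update

# Or download everything for review (and for a dry run on a test box)
yum upgrade --downloadonly --downloaddir=/var/tmp/updates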

For a whole day we couldn't log into the customer's system, and the customer couldn't sort out why either. After they gave us control of the box at the hosting center we rebooted it and were able to log in. Barely. Yes, just barely. You see, the load on the box was wildly out of control: 892. That's not a typo, 892 within two minutes of being rebooted. My hat is off to Linux for working under what I would imagine to be possibly the highest load ever seen by human eyes:

08:02:19 up 1:41, 1 user, load average: 892.65, 831.99, 571.57

I have to marvel at that; that we even had a shell and it would "respond" at all is a miracle. So a bunch of kill -9s later and we had a sorta responsive system. mysql wasn't running on the system either, and the box kept slowing back to a crawl, so we couldn't even get to that problem. So we started killing more things off and isolating processes so we could figure out what was happening.
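If you ever find yourself in the same spot, this is usually the fastest way to see who is eating the box (standard triage, nothing specific to this incident):

# List the worst CPU offenders first
ps -eo pid,ppid,pcpu,pmem,args --sort=-pcpu | head -20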

The first thing that leaped out at us was a fabulous qmail loop from hell, forking off thousands of qmail-scanners. Boffo! Well, there's a spontaneous source of load. Hey, I can appreciate a killer fork bomb when I see one, and this one was marvellous, but we use qmail-scanner too and we've never seen or heard of a monster like this happening on a box. But just in case, we ripped out qmail-scanner. Well, that helped, but it didn't fix the loop. Gotta have mail flowing too, so a little closer look and it turns out psa-qmail was in a double bounce loop-de-loop. Yep, a bug. So a quick reinstall of psa-qmail and that's sorted. qmail-scanner back on the box, and yep, all is right with qmail, but the load... it keeps blowing out of control.
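For the curious, a loop like that is easy to confirm once you know to look, because the queue only ever grows. The paths below assume a stock qmail layout under /var/qmail, which is what Plesk's psa-qmail uses:

# Count the qmail-scanner processes (the [q] keeps grep from matching itself)
ps ax | grep -c '[q]mail-scanner'

# Watch the queue; in a double bounce loop these numbers only go up
/var/qmail/bin/qmail-qstat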

So we killed off the next source, apache, which for some reason was forking like crazy itself and of course doing its own admirable job of killing the box. Alright, we thought, apache is dead for the moment and the box is marginally responsive. So on to fixing the bigger issue with mysql, hoping we could kill two birds with one stone and that these out-of-control apaches were just miserable because they couldn't talk to their friend mysql. It is a sad day when a web server can't talk to its database.

Well, that was a major upgrade too, so half the user's my.cnf settings didn't work anymore, and they didn't have logging set up properly, so mysql wasn't telling them much (dandy, that: a daemon that doesn't start and doesn't tell you why). So, time to sort all that out, and we got mysql running again and brought apache back up to get the websites running. The system slowly climbed and climbed in load until apache was just chugging away and killing this poor machine. At first we thought this box must just be the talk of the town, just that popular, and that mysql maybe wasn't tuned right. So we threw some kung fu at that and got it in tip-top form, but it was not much help for apache.
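As an aside, the silent mysql failure is worth two minutes of anyone's time: with no error log configured, a mysqld that fails to start tells you nothing. A minimal fragment in /etc/my.cnf along these lines fixes that (the log path is just an example; put it wherever you keep your logs):

[mysqld]
# Without this, a failed startup often leaves you nothing to go on
log-error = /var/log/mysqld.log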

OK, we thought, this has got to be traffic. This box is just that popular. Nope, not much traffic at all, and yet a ton of apaches were running - but they weren't pegged, just 10% of the CPU for each one. Well, that's not right, so we pulled out our good ole friend strace, and what does the first thread say? "I'm opening this PHP file over and over again in this domain".
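(For reference, that's the whole trick: attach to a few of the busy children and read what they are doing. The PID here is made up, obviously:)

# Attach to one of the busy apache children and watch its file activity
strace -p 12345 -e trace=open,read 2>&1 | head -50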

Hrmmm... well, that looks odd. That kinda looks like an infinite loop. Wait a minute, that can't be every thread, can it? Yes, this one is doing it... and this one, wow, another one! OK, that's all four of the processes we just looked at; that's not right. So on a hunch (or is this more than a hunch? What do you call a bright light shining in your eye?) we just moved that entire website's httpdocs contents to a tmp directory, rekicked apache, and what do you know, everything is peachy keen with apache! (Hopefully the customer will let us tar up the code so we can see what caused this in the PHP code; that sounds like a nasty DoS vulnerability to me!)
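The quarantine itself was nothing fancier than this (the domain is a placeholder; Plesk keeps each site's code under /var/www/vhosts/<domain>/httpdocs):

# Move the suspect site's code out of the way and bounce apache
mv /var/www/vhosts/example.com/httpdocs /tmp/httpdocs.quarantine
/etc/init.d/httpd restart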

And the box? Why, its load right now is just 0.11.

So, what's the moral of the story? Save yourself the heartache: always, always test your upgrades on a non-production machine, or have a really good plan for in-place debugging and plenty of spare time.
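Even the lazy version of that beats upgrading blind: mirror the production package set onto a spare box and rehearse the upgrade there first. Roughly (the filename is a placeholder):

# On production: record the installed package set
rpm -qa --qf '%{NAME}\n' | sort > prod-packages.txt

# On the test box: install the same set, then rehearse the upgrade there
yum -y install $(cat prod-packages.txt)
yum upgrade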

Me, I'll take a test box any day.

Happy Holidays everyone!

And watch out for the drones launched as part of the war on Christmas!
biggles
Forum Regular
Posts: 806
Joined: Tue Jul 15, 2008 2:38 pm
Location: Sweden

Re: A cautionary tale: Or why you should always test upgrades

Post by biggles »

mikeshinn wrote:I just spent three days helping a distraught customer. I don't like it when someone is distraught, especially one of our customers. Nobody puts our customers in the corner (if you are old enough to get this reference, I'll buy you a scotch if you are in the DC area).
Looking forward to that scotch when I visit DC, Mr Swayze! Thanks a lot for the inspiring story, BTW!
mikerice60
Forum User
Posts: 6
Joined: Fri Nov 30, 2007 8:46 pm

Re: A cautionary tale: Or why you should always test upgrades

Post by mikerice60 »

This sounds like my normal day every time I upgrade Plesk.

Just went from Plesk CP 10.3.1 to 10.4.4 last night and of course got the dreaded runaway qmail-scanner processes until the server maxed out its memory and locked up at about 2,400 running processes.

Many thanks to Bruen and his workaround: yum remove qmail-scanner, install Postfix, reinstall Qmail via the Plesk autoinstaller, then reinstall qmail-scanner. That worked this time, whereas previously just removing and reinstalling qmail-scanner had done the trick. A rough sketch of the sequence is below.
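(The autoinstaller path is from memory, so double-check where your Plesk version puts it, and verify each step before you run it:)

# Rough outline of the workaround; verify against your Plesk version
yum remove qmail-scanner
/usr/local/psa/admin/sbin/autoinstaller    # switch the MTA to Postfix
/usr/local/psa/admin/sbin/autoinstaller    # run again and switch back to Qmail
yum install qmail-scanner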

Here's what helped me through the mess:
http://atomicorp.com/forums/viewtopic.p ... il+scanner