High load but no obvious cause: some tips

Post by faris »

I thought I'd share an experience I had over the weekend in case it helps someone in the future.

Over the weekend, load on one of my systems mysteriously went from the normal 0.5-ish to a whopping 14. Asterisk and DNS started playing up, but everything else seemed to be working just fine.

I expected to find the culprit using "top" or "ps" but there were no processes constantly using lots of CPU or memory, nothing seemed out of control, there was no spam outbreak, and basically I did not notice anything amiss. Quite frankly I was initially stumped.
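
(By "using top or ps" I mean nothing fancier than looking at the biggest CPU and memory consumers. With GNU ps that's roughly:

ps aux --sort=-%cpu | head -n 15    # top 15 processes by CPU
ps aux --sort=-%mem | head -n 15    # top 15 processes by memory

Neither of those showed anything out of the ordinary here.)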

Then I remembered that I'd made a change to the cache location for junglediskserver (this is an auto-deduplicating file backup application that allows me to back up to S3 storage) and decided to take a closer look at what that service might be up to.

Using top and ps, limited to this process, I noticed it was occasionally bursting to 99% CPU, which for some reason I hadn't seen when looking at all processes instead of just this one. But I still couldn't figure out why it was causing so much load. I needed another tool.
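
(In case it helps anyone, "limited to this process" just means something along these lines, with junglediskserver swapped for whatever you suspect:

top -p "$(pgrep -d, junglediskserver)"                   # top, restricted to the matching PIDs
ps -C junglediskserver -o pid,%cpu,%mem,stat,start,cmd   # one-off snapshot of the same process

watch -n 1 'ps -C junglediskserver -o pid,%cpu,%mem,stat,cmd' also gives a crude rolling view if you prefer it to top.)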

For this purpose I used iotop, which can show which threads/processes are doing what sort of disk IO and whether there is a lot of iowait going on (see the "IO>" column). iotop -oaP is my favourite invocation, but you can toggle the a (accumulated) option with the "a" key while it is running.
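
For anyone who hasn't used iotop before, the flags break down as follows (check your version's man page, as options can differ slightly between releases):

iotop -o    # --only: only list processes/threads that are actually doing IO
iotop -a    # --accumulated: show total IO done since iotop started, rather than current bandwidth
iotop -P    # --processes: show whole processes instead of individual threads

so iotop -oaP simply combines all three.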

This revealed jungledisk experiencing massive iowait (99%) and doing some seemingly slow but constant disk writes.
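
If you don't have iotop installed, vmstat or iostat (from the sysstat package) will at least confirm whether iowait is the problem at a system-wide level. Something like:

vmstat 1 10       # watch the "wa" column - CPU time spent waiting for IO
iostat -x 1 10    # per-device stats - high %util and await point at a busy or slow disk

They won't name the guilty process the way iotop does, but they tell you whether IO is where to look at all.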

I shut down the service and load went down immediately. Starting the service again caused the load to go up. Hmm...
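
(Nothing clever there, just something like:

service junglediskserver stop     # then wait a minute or two and check uptime / top
service junglediskserver start

with the exact service name and init commands depending on your distro and how jungledisk was installed.)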

Luckily Jungledisk has a great online support chat service, and they suggested the most likely cause was junglediskserver re-creating a particular backup database due to the cache location move, which is known to cause high load.

So there was my culprit. But what to do about it? And why were only DNS and VoIP being affected?

I knew about "nice" and "ionice" - I use both when doing a clamdscan, as per Scott/Mike's suggested invocation when scanning large numbers of files. But I wasn't aware that it was possible to use ionice on an already running process, despite reading the man page over and over!!

But thanks to some advice from my deep-level Linux guru, it turns out you can indeed use ionice to change the IO priority of a running process.

It is simple enough. There are a couple of options.

ionice -c 3 -p <PID> (where -c 3 means use "idle" priority class, i.e. do your stuff when nothing else needs IO)

or

ionice -n <num> -p <PID> (where <num> is between 0 and 7, 7 being lowest priority and 0 being highest)

Having never used anything other than "-c 3" before (for clamdscan), I thought I'd try "-n 4" and BOOM: load went way down, iowait dropped massively, everything was happy again, and jungledisk just continued to do its database rebuild - a little slower, no doubt.
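
Putting it together, the whole fix was essentially (find the PID however you like, e.g. with pgrep junglediskserver or ps):

ionice -p <PID>           # show the current IO class/priority first
ionice -n 4 -p <PID>      # lower it to best-effort priority 4
ionice -p <PID>           # confirm the change took

On my version of ionice, giving -n without -c puts the process into the best-effort class; if yours complains, be explicit and use ionice -c 2 -n 4 -p <PID>.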

As to why DNS and VoIP were the things that were mainly affected -- my Linux guru suggested that it might be down to them using UDP as opposed to TCP, and of course once he said it, it made sense. UDP doesn't care if something gets lost, and if something is hammering the disk then time-critical stuff like VoIP will be affected, and DNS will, I guess, just not respond quickly enough. (Mind you, I'm assuming that these applications need to do some disk IO, which is being slowed down massively and therefore causing the application to respond too slowly and packets to be "lost". Maybe that's the wrong assumption.)
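
(One way to actually see the DNS symptom, rather than just guess at it, would be to time queries against the box while the disk is being hammered and again once things are back to normal, e.g.:

dig @your.server.ip somedomain.you.host | grep "Query time"

I didn't think to capture any numbers at the time, so take that as a suggestion rather than something I measured.)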

Another thing my guru suggested was to look out for processes in "D" state (waiting for IO, most likely) using top or ps - but unfortunately I'd solved the problem before specifically checking to see if anything was in such a state. This is probably a critical step, and one that I missed out on. It might have helped me find the problem more quickly.
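
For future reference - mine as much as anyone else's - a quick way to spot processes stuck in "D" (uninterruptible sleep, almost always waiting on IO) is something like:

ps -eo state,pid,user,wchan:32,cmd | grep "^D"

or simply keeping an eye on the S/STAT column in top or ps aux. Anything sitting in D for long stretches is a strong hint that it is stuck waiting on the disk (or NFS, etc.).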


***

If anyone here has any further suggestions for troubleshooting mysterious high-load situations, or improvements or comments on my suggestions, please chime in!!