spamd child 100% CPU

Forum for getting help with Project Gamera, Spamassassin, Clamav, qmail-scanner and other anti-spam tools.
faris
Long Time Forum Regular
Long Time Forum Regular
Posts: 2321
Joined: Thu Dec 09, 2004 11:19 am

spamd child 100% CPU

Unread post by faris »

I have a really weird situation on a particular server (Centos 5, Plesk 10, qmail-scanner, spamassassin, clamav).

A few minutes after restarting spamassassin, a spamd child appears to hang at 100% CPU.

Code: Select all

11052 qscand    20   0  188m  65m 2572 R 100.0  0.8   3:25.59 spamd child
Spamassassin continues to work, and other children don't hang. Only one child hangs like this.

Restarting spamassassin resolves the problem for a short time, but then it just happens again.

"spamassassin --lint" gives no errors.

This all started after reinstalling qmail-scanner following the installation of Plesk MU#49
Initially qmail-scanner was not using spamassassin at all, probably because of something related to all this. I had to reinstall it twice and manually run qmail-scanner-reconfigure a couple of times before it started using spamassassin. It also took a very long time (many minutes) for the reconfigure to complete.

So it is as though something is getting "stuck" but I don't know what.

local.cf has nothing much in it and there are no custom rules.

I see plenty of people posting about similar issues on google, but no real resolution of the issue.

Any suggestions as to where I might start looking to try to debug this?
--------------------------------
<advert>
If you want to rent a UK-based VPS that comes with friendly advice and support from a fellow ART fan, please get in touch.
</advert>
faris
Long Time Forum Regular
Long Time Forum Regular
Posts: 2321
Joined: Thu Dec 09, 2004 11:19 am

Re: spamd child 100% CPU

Unread post by faris »

Monitoring this for longer, I see that the PID actually changes after some time (5 mins, 10 mins maybe?)
So it isn't the same process that sits there and hangs at 100% forever.

But this should not be happening and I can't figure out what to do about it.

sa-update says everything is up to date, so it isn't old rules or whatever.
--------------------------------
<advert>
If you want to rent a UK-based VPS that comes with friendly advice and support from a fellow ART fan, please get in touch.
</advert>
prupert
Forum Regular
Forum Regular
Posts: 573
Joined: Tue Aug 01, 2006 2:45 pm
Location: Netherlands

Re: spamd child 100% CPU

Unread post by prupert »

What does 'strace' tell you the process is doing? Any messages in the queue that might be related to this issue?
Lemonbit Internet Dedicated Server Management
faris
Long Time Forum Regular
Long Time Forum Regular
Posts: 2321
Joined: Thu Dec 09, 2004 11:19 am

Re: spamd child 100% CPU

Unread post by faris »

I will do a strace shortly.

However, I see more now:

Code: Select all

Tue, 16 Apr 2013 13:17:58 BST:22320: error_condition: X-Qmail-Scanner-2.10st: spamd not returning valid content - try again
Tue, 16 Apr 2013 13:17:58 BST:22320: ------ Process 22320 finished. Total of 361.557767 secs
Tue, 16 Apr 2013 13:18:01 BST:22333: error_condition: X-Qmail-Scanner-2.10st: spamd not returning valid content - try again
Tue, 16 Apr 2013 13:18:01 BST:22333: ------ Process 22333 finished. Total of 363.877405 secs
When this happens, maillog fills with spamassasin clean or identified dpam messages, but no delivery to local mailbox.

Restarting spamassassin causes things to return to normal.
--------------------------------
<advert>
If you want to rent a UK-based VPS that comes with friendly advice and support from a fellow ART fan, please get in touch.
</advert>
faris
Long Time Forum Regular
Long Time Forum Regular
Posts: 2321
Joined: Thu Dec 09, 2004 11:19 am

Re: spamd child 100% CPU

Unread post by faris »

Actually how do I strace a *service*?
--------------------------------
<advert>
If you want to rent a UK-based VPS that comes with friendly advice and support from a fellow ART fan, please get in touch.
</advert>
faris
Long Time Forum Regular
Long Time Forum Regular
Posts: 2321
Joined: Thu Dec 09, 2004 11:19 am

Re: spamd child 100% CPU

Unread post by faris »

I'm just adding other oddities from the log as I go along:

Code: Select all

Tue, 16 Apr 2013 13:35:41 BST:23390: error_condition: X-Qmail-Scanner-2.10st: Requeuing: Maximum time exceeded. Something cannot handle this message. at /var/qmail/bin/qmail-scanner-queue.pl line 685.
--------------------------------
<advert>
If you want to rent a UK-based VPS that comes with friendly advice and support from a fellow ART fan, please get in touch.
</advert>
scott
Atomicorp Staff - Site Admin
Atomicorp Staff - Site Admin
Posts: 8355
Joined: Wed Dec 31, 1969 8:00 pm
Location: earth
Contact:

Re: spamd child 100% CPU

Unread post by scott »

DNS lookup maybe? Youd need to strace that child to see what its doing, and hope you dont a cryptic wait()
User avatar
mikeshinn
Atomicorp Staff - Site Admin
Atomicorp Staff - Site Admin
Posts: 4149
Joined: Thu Feb 07, 2008 7:49 pm
Location: Chantilly, VA

Re: spamd child 100% CPU

Unread post by mikeshinn »

strace eitther the parent, or the child if you know its stuck

strace -fF -p pid

Example:

[root@ac0 ~]# ps auxwww | grep spamd
qscand 1350 3.3 4.0 265684 79920 ? S 11:16 2:53 spamd child
root 1481 0.0 0.4 251332 8388 ? Ss 04:39 0:11 /usr/bin/spamd -d -x -q -c -m30 -u qscand -r /var/run/spamd.pid
qscand 2172 2.4 4.0 266176 79960 ? S 11:30 1:46 spamd child
qscand 2587 0.3 3.2 258948 64072 ? S 11:35 0:14 spamd child
qscand 2589 0.0 3.1 255780 60992 ? S 11:35 0:03 spamd child
qscand 2607 0.1 3.3 259192 65384 ? S 11:35 0:07 spamd child
qscand 2608 0.1 3.1 257476 61908 ? S 11:35 0:05 spamd child
qscand 2615 0.2 3.1 258900 61840 ? S 11:35 0:09 spamd child
root 10356 0.0 0.0 126700 820 pts/4 S+ 12:42 0:00 grep spamd

For parent:

strace -fF -p 1481

For child, lets say 2589 is killing the CPU:

strace -fF -p 2589

The downside is with the child you'll miss the cause if you strace after its stuck, but maybe you can deduce that once it unsticks.
faris
Long Time Forum Regular
Long Time Forum Regular
Posts: 2321
Joined: Thu Dec 09, 2004 11:19 am

Re: spamd child 100% CPU

Unread post by faris »

Great thanks guys.

I've not traced -- I thought I'd try something first (not good diagnostic practice, but I figred if it didn't work I could trace later)

The bayes dbs in /var/spool/qscan/.spamassassin was at 200Mb+ and there were scores and scores of smaller files in there too, bayes_toks.expire* all dated in the past few days (I don't have these on my other systems).

Now the critical thing was that bayes_expire had been set to 0, but I had removed that as part of a config file that was failing to lint.

I wonder if one of these processes was now trying to expire tokens, and although 200Mb isn't much, it is still kinda big I suppose. Maybe it was going nuts, or one of the files was slightly iffy?

More to the point, on clearing the bayes db and putting bayes_expire back to 0, I'm not seeing the stuck process. It normally pops up within 10 mins and it has been some time since I did this.

Unfortunately the server isn't as busy at this time of day, so if it is load-related it may come back tomorrow.
At least I know how to trace it now, and hope it isn't one of those ones that have no obvious cause.
--------------------------------
<advert>
If you want to rent a UK-based VPS that comes with friendly advice and support from a fellow ART fan, please get in touch.
</advert>
prupert
Forum Regular
Forum Regular
Posts: 573
Joined: Tue Aug 01, 2006 2:45 pm
Location: Netherlands

Re: spamd child 100% CPU

Unread post by prupert »

faris wrote:I wonder if one of these processes was now trying to expire tokens, and although 200Mb isn't much, it is still kinda big I suppose.
It's not particularly big, although we mostly see them being smaller, we have servers running perfectly fine with much larger bayes files. The file contains some of the AI magic database of SpamAssassin that significantly helps to determine wether or not mail is spam. Removing it means throwing away learned data. The bigger your bayes database, the bigger RAM usage will be.
Lemonbit Internet Dedicated Server Management
faris
Long Time Forum Regular
Long Time Forum Regular
Posts: 2321
Joined: Thu Dec 09, 2004 11:19 am

Re: spamd child 100% CPU

Unread post by faris »

I'm going to cross my fingers that removing it has solved the problem. There's still no sign of a stuck process.
I do have a backup of it anyway.
--------------------------------
<advert>
If you want to rent a UK-based VPS that comes with friendly advice and support from a fellow ART fan, please get in touch.
</advert>
faris
Long Time Forum Regular
Long Time Forum Regular
Posts: 2321
Joined: Thu Dec 09, 2004 11:19 am

Re: spamd child 100% CPU

Unread post by faris »

I'm pleased to say that killing the bayes really does seem to have resolved the problem.
Not only that, but typical message scantime (according to qmail-scanner) has dropped from about 10 seconds to about 1 second, and average system load (according to top) has dropped by around 50%. I hope this isn't a case of "famous last words" but it has now been more than 12 hours without an issue.

Let me also explain why I rushed ahead to kill the bayes before bothering to strace the process:
If you do a Google search on this issue, along with scores of unanswered posts relating to the matter, if you look hard enough (which I did after initially finding nothing) you'll see one or two posts here and there where the problem has been found to be a result of a problem of one sort of another in the bayes database. And since I had been stressing like anything over this issue all day, I wanted to immediately see if my fix was going to work or not

In addition, the only configuration change between working and hanging was the bayes expire setting.

And finally, no spamassassin installation on any of my systems has any of the bays_toks.expiredXXXXX I saw on the affected system. (If anybody is bored, please have a look at your /var/spool/qmail/.spamassassin directory to see if you have any of these or not - I'd be very interested to know).

Searching on Google, these files are the result of a bayes expire process taking too long and being killed. Given that the default spamd timout is 300 seconds, and given that it was immediately respawning, I'm as sure as I can be, without stracing, that my diagnosis was right.

What I have not figured out how to do is to run sa-update --force-expire for the qscand user. The "-u username" option doesn't seem to work - it keeps trying to expire root's bayes. I guess I'll have to use the trick breun showed me to run something as a user that doesn't have shell.
--------------------------------
<advert>
If you want to rent a UK-based VPS that comes with friendly advice and support from a fellow ART fan, please get in touch.
</advert>
Post Reply