Recommendations on setting up Bayesian spam filtering?

**whoopingboy** » Thu Dec 23, 2004 3:43 pm

We want to start using the Bayesian spam filtering feature in SpamAssassin. Any recommendations on how to set it up on gateway servers? Should we create a local address on the gateway to collect our spam and run sa-learn against it, or should we just try it out of the box and wait for it to catch up?

Also, does anyone know if it works with MS SQL, or should we just use MySQL?

**scott** » Fri Dec 24, 2004 12:10 am

First off, it will learn on its own. By default, mail scoring above 12 is autolearned as spam, and mail scoring below 0.1 is autolearned as ham. What I use on a PSA box (and SquirrelMail will do this) is two IMAP folders, Learn Spam and Learn Ham. Then I run a nightly sa-learn cron job against those folders to train the database.
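The two-folder nightly training described above might look roughly like the following sketch. The Maildir paths and the `DRY_RUN` switch are assumptions made for illustration and are not part of the original setup; `--spam`, `--ham`, `--no-sync` and `--sync` are standard sa-learn options. As written, the script only prints the commands it would run.

```shell
#!/bin/sh
# Nightly Bayes training sketch.
# ASSUMPTION: the paths below are hypothetical Maildir++ locations; adjust them
# to wherever your IMAP server stores the "Learn Spam" and "Learn Ham" folders.
MAILDIR="${MAILDIR:-$HOME/Maildir}"
SPAM_DIR="${SPAM_DIR:-$MAILDIR/.Learn Spam/cur}"
HAM_DIR="${HAM_DIR:-$MAILDIR/.Learn Ham/cur}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands; set to 0 to run them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$@"
    else
        "$@"
    fi
}

train() {
    # --no-sync defers the journal merge while both folders are scanned;
    # the final --sync then updates the Bayes database once.
    run sa-learn --no-sync --spam "$SPAM_DIR"
    run sa-learn --no-sync --ham "$HAM_DIR"
    run sa-learn --sync
}

train
```

A crontab entry such as `30 3 * * * DRY_RUN=0 /path/to/train-bayes.sh` (hypothetical path) would run it every night at 03:30.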

**Botham** » Wed Feb 14, 2007 7:25 pm

I have set up the two folders as suggested and it is working very well.

Over a few months the SPAM folder now has over 5000 messages in it.

My question is: Should I leave all of the emails in the SPAM box and have SA read through them each time I run the command (takes a long time) or can I remove them so the process is less resource intensive?

And if I clear the SPAM folder will SpamAssassin remember all of the patterns that it has previously learnt or will these patterns be removed on the next run?

**Griffith** » Thu Feb 15, 2007 5:49 am

Scott: I've had problems with SA not learning ham. It does learn spam, but not ham. Even when the score is below zero it does not learn the message as ham. I've added over 200 hams manually to get Bayes running. Any ideas what to do?

**martin_68** » Thu Feb 15, 2007 10:48 am

my configuration:

bayes_auto_learn_threshold_nonspam 1.0
bayes_auto_learn_threshold_spam 5.0

**scott** » Thu Feb 15, 2007 2:03 pm

Bayes will automatically expire older tokens from its database, UNLESS the total number of tokens exceeds 100,000 (I think). This got me more than once: it means that if you train your DB on a big pool of spam/ham, then it's never going to expire tokens. Bayes gets bigger and bigger, and slower and slower, as time goes by. It's really noticeable when you're storing via MySQL.

The Bayes advice I've read so far is to not learn old mail as spam; they recommend that you get rid of the older stuff, so that is what I do. There's still value in that old spam in my opinion, though. I'd start by submitting it with the --report option to the various razor/dcc/pyzor servers. I run several spam traps for just this reason, with the ultimate goal of setting up some dcc/razor/pyzor servers to be shared among Project Gamera servers. The guts are all there with the dcc and pyzor rpms now, I just haven't gotten them all glued together yet.
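For the housekeeping side, here is a minimal sketch of checking the database size, forcing an expiry attempt, and reporting old spam before it is deleted. sa-learn records which messages it has already learned, and the learned tokens live in the Bayes database rather than in the mail folder, so clearing the folder after a training run should not make it forget them. `OLD_SPAM_DIR` and the `DRY_RUN` switch are assumptions for illustration; `sa-learn --dump magic`, `sa-learn --force-expire` and `spamassassin --report` are existing commands. Note that, per the spamassassin man page, `--report` also feeds the message to the Bayes learner, which may not be wanted for very old mail. The size ceiling for automatic expiry is set with `bayes_expiry_max_db_size`; check the documentation for your version for its default.

```shell
#!/bin/sh
# Bayes housekeeping sketch.
# ASSUMPTION: OLD_SPAM_DIR is a hypothetical folder holding one message per file.
OLD_SPAM_DIR="${OLD_SPAM_DIR:-$HOME/Maildir/.Learn Spam/cur}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands; set to 0 to run them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$@"
    else
        "$@"
    fi
}

# Show database counters (nspam, nham, ntokens, expiry timestamps).
bayes_status() {
    run sa-learn --dump magic
}

# Ask SpamAssassin to attempt a token expiry run now.
bayes_expire() {
    run sa-learn --force-expire
}

# Report each stored message; spamassassin --report reads one message on stdin.
report_spam() {
    for msg in "$OLD_SPAM_DIR"/*; do
        [ -f "$msg" ] || continue
        if [ "$DRY_RUN" = "1" ]; then
            echo "spamassassin --report < $msg"
        else
            spamassassin --report < "$msg"
        fi
    done
}
```

Calling `bayes_status` before and after `bayes_expire` gives a rough idea of whether the token count is actually shrinking.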

Second question: if it's not autolearning ham at all, then I'd talk to the SpamAssassin folks about it. There's something wrong with your system; if you've installed any CPAN modules, that'd be the first thing I'd look at.
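Before escalating, it may help to see what the autolearner has actually been deciding. The sketch below tallies the `autolearn=` field that SpamAssassin writes into its default `X-Spam-Status` header. The function name and the one-file-per-message folder layout are assumptions for illustration. One thing worth checking in the SpamAssassin documentation: as far as it describes the mechanism, the auto-learn decision uses a score that leaves out Bayes rule hits, so a message whose displayed score is negative mainly because of BAYES_00 may still not qualify as ham.

```shell
#!/bin/sh
# Tally autolearn outcomes across a folder of delivered messages.
# ASSUMPTION: $1 is a directory with one message per file (Maildir style).
# This is a rough count: it matches the token anywhere in a file, so a
# forwarded message quoting another X-Spam-Status header is counted too.
autolearn_summary() {
    grep -rhoE 'autolearn=[a-z]+' "$1" | sort | uniq -c
}

# Example (hypothetical path):
#   autolearn_summary "$HOME/Maildir/cur"
```

If nearly every clean message shows `autolearn=no` and none show `autolearn=ham`, that points at the thresholds or the scoring rather than at the Bayes database itself.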