Recommendations on setting up spam bayesian filtering?

Forum for getting help with Project Gamera, Spamassassin, Clamav, qmail-scanner and other anti-spam tools.
whoopingboy

We want to start using the Bayesian filtering feature in SpamAssassin. Any recommendations on how to set it up on gateway servers? Should we create a local address on the gateway to collect our spam and run sa-learn against it, or should we just try it out of the box and wait for it to catch up?

Also, does anyone know if it works with MS SQL, or should we just use MySQL?
scott
Atomicorp Staff - Site Admin
Posts: 8355
Joined: Wed Dec 31, 1969 8:00 pm
Location: earth

First off, it will learn on its own. By default, mail scoring above 12 is autolearned as spam, and mail scoring below 0.1 is autolearned as ham. What I use on a PSA box (and SquirrelMail will do this) is two IMAP folders, Learn Spam and Learn Ham. Then I run a nightly sa-learn cron job against those folders to train the database.
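
For anyone wanting to try the two-folder approach, here is a minimal sketch of what the nightly job could look like. It is not scott's actual script. It assumes Maildir-style storage where each IMAP folder is a directory of message files; the Maildir layout, the function name train_bayes, and the SA_LEARN override variable are made up for illustration. The only real commands involved are sa-learn --spam and sa-learn --ham.

```shell
#!/bin/sh
# Sketch of a nightly Bayes training job (illustrative, adjust to your setup).
# Assumption: sa-learn is on PATH. SA_LEARN can be overridden for a dry run.
SA_LEARN="${SA_LEARN:-sa-learn}"

# train_bayes MAILDIR
#   MAILDIR is a hypothetical path to the mailbox that holds the two folders.
train_bayes() {
    maildir="$1"
    # Assumed Maildir++ layout: the IMAP folder "Learn Spam" lives in
    # "$maildir/.Learn Spam", with read messages under cur/.
    $SA_LEARN --spam "$maildir/.Learn Spam/cur" || return 1
    $SA_LEARN --ham  "$maildir/.Learn Ham/cur"  || return 1
}
```

Saved as a script that ends with a call such as train_bayes /path/to/Maildir, this could be run from cron once a night. Setting SA_LEARN=echo first prints the commands instead of running them, which is a cheap way to check the folder paths.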
Botham
Forum User
Posts: 24
Joined: Sun Jun 18, 2006 2:34 am

I have set up the two folders as suggested and it is working very well.

Over a few months the SPAM folder now has over 5000 messages in it.

My question is: Should I leave all of the emails in the SPAM box and have SA read through them each time I run the command (takes a long time) or can I remove them so the process is less resource intensive?

And if I clear the SPAM folder will SpamAssassin remember all of the patterns that it has previously learnt or will these patterns be removed on the next run?
Griffith
Forum User
Posts: 95
Joined: Tue Dec 07, 2004 1:32 pm

Scott: I've had problems with SA not learning ham. It does learn spam, but not ham; even when the score is below zero, the message is not learned as ham. I've added over 200 hams manually to get Bayes running. Any ideas what to do? :)
martin_68
Forum User
Posts: 9
Joined: Sat Jan 27, 2007 5:42 pm

My configuration:

bayes_auto_learn_threshold_nonspam 1.0
bayes_auto_learn_threshold_spam 5.0
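
To see whether thresholds like these actually change what gets autolearned, one option is to tally the autolearn= results that SpamAssassin records. This is a rough sketch, assuming the results appear as autolearn=ham, autolearn=spam, autolearn=no and so on in the X-Spam-Status header or in the spamd log; the function name count_autolearn and the file you point it at are placeholders.

```shell
#!/bin/sh
# count_autolearn FILE
#   Prints each autolearn=... value found in FILE with its number of
#   occurrences, one per line, e.g. "autolearn=ham 12".
#   FILE is a placeholder: a mail log or an mbox with X-Spam-Status headers.
count_autolearn() {
    # grep -o prints only the matching part of each line
    # (available in GNU and BSD grep).
    grep -o 'autolearn=[a-z]*' "$1" | sort | uniq -c | awk '{print $2, $1}'
}
```

If autolearn=ham never shows up while autolearn=spam does, that points at the ham side specifically rather than at Bayes as a whole.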
scott
Atomicorp Staff - Site Admin
Posts: 8355
Joined: Wed Dec 31, 1969 8:00 pm
Location: earth

Bayes will automatically expire older tokens from its database, UNLESS the total number of tokens exceeds 100,000 (I think). This got me more than once: it means that if you train your DB on a big pool of spam/ham, it's never going to expire tokens. Bayes gets bigger and bigger, and slower and slower, as time goes by. It's really noticeable when you're storing via MySQL.

The Bayes advice I've read so far is not to learn old mail as spam; they recommend that you get rid of the older stuff, so that is what I do. There's still value in that old spam, in my opinion. I'd start by submitting it with the --report option to the various razor/dcc/pyzor servers. I run several spam traps for just this reason, with the ultimate goal of setting up some dcc/razor/pyzor servers to be shared among Project Gamera servers. The guts are all there with the dcc and pyzor RPMs now; I just haven't gotten them all glued together yet.
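
On the token count mentioned above, a quick way to see how big the database has grown is sa-learn --dump magic. The sketch below pulls out just the token total. It assumes the usual dump layout, where the "non-token data: ntokens" line carries the count in the third column; the function name bayes_ntokens and the SA_LEARN override are illustrative only.

```shell
#!/bin/sh
# Assumption: sa-learn is on PATH. SA_LEARN can be overridden for testing.
SA_LEARN="${SA_LEARN:-sa-learn}"

# bayes_ntokens
#   Prints the number of tokens currently stored in the Bayes database.
bayes_ntokens() {
    # Assumed line format from "sa-learn --dump magic":
    #   0.000   0   <count>   0   non-token data: ntokens
    $SA_LEARN --dump magic | awk '/non-token data: ntokens/ {print $3}'
}
```

The size at which expiry is attempted is governed by the bayes_expiry_max_db_size setting, and sa-learn --force-expire asks for an expiry run on demand, so comparing the number printed here before and after such a run shows whether expiry is doing anything.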

Second question: if it's not autolearning ham at all, then I'd talk to the SpamAssassin folks about it. There's something wrong with your system; if you've installed any CPAN modules, that'd be the first thing I'd look at.