Gamera and Training Bayesian filter via email

**hotgazpacho** » Wed Jan 26, 2005 12:24 pm

Does Gamera offer an interface that lets clients report messages that get through the filters as spam? I was thinking of something along the lines of permitting users to forward such mail to an account, like spam@example.com, which would take the forwarded message and submit it to SpamAssassin for training purposes.

This seems the most appropriate method to me, since mail coming in to Gamera doesn't usually have a system account (rather, it is forwarded to another host for delivery), so users cannot train SpamAssassin in the usual way.

Update:
I've done some digging and found a problem with forwarding messages: the email client adds its own headers, which has a very high potential to taint the Bayesian engine against your users. Also, not all email clients include all of the original headers.

I have found one approach that may be worth investigating, outlined here:

http://lists.gnu.org/archive/html/spama ... 00015.html

Essentially, you store all incoming emails in a database. Then, when someone forwards a message to spam@example.com, it triggers a script that checks the database for the message (based on various criteria, including headers, content, timestamp, message ID, etc.) and, if the message is found, feeds the stored copy to sa-learn. The same could be done for notspam@example.com.
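To make the idea concrete, here is a rough sketch of the lookup step, not a finished tool. It assumes the store is keyed by Message-ID (only one of the criteria mentioned above) and uses SQLite purely as a stand-in for whatever database you pick. The table layout and the function names `open_store`, `store_incoming`, `find_original` and `train` are all made up for illustration, and feeding sa-learn on stdin is my assumption about how you would invoke it.

```python
import re
import sqlite3
import subprocess

# Anything that looks like <local@domain>. Real Message-IDs can be messier,
# and plain addresses in angle brackets also match; those simply miss the lookup.
MSGID_RE = re.compile(rb"<[^<>\s@]+@[^<>\s@]+>")


def open_store(path=":memory:"):
    """Open the message store (hypothetical schema)."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS messages ("
        "message_id TEXT PRIMARY KEY, raw BLOB NOT NULL, received_at INTEGER NOT NULL)"
    )
    return db


def store_incoming(db, message_id, raw, received_at):
    """Keep a pristine copy of a message accepted for delivery."""
    db.execute(
        "INSERT OR REPLACE INTO messages VALUES (?, ?, ?)",
        (message_id, raw, received_at),
    )
    db.commit()


def find_original(db, forwarded_raw):
    """Return the stored copy whose Message-ID appears anywhere in the forward."""
    for candidate in MSGID_RE.findall(forwarded_raw):
        row = db.execute(
            "SELECT raw FROM messages WHERE message_id = ?",
            (candidate.decode("ascii", "replace"),),
        ).fetchone()
        if row:
            return bytes(row[0])
    return None


def train(raw, as_spam, runner=subprocess.run):
    """Pipe the clean copy to sa-learn (assumes sa-learn is on PATH and reads stdin)."""
    flag = "--spam" if as_spam else "--ham"
    return runner(["sa-learn", flag], input=raw, check=True)
```

The spam@ handler would call `find_original` on the forwarded mail and, if it gets something back, `train(original, as_spam=True)`; the notspam@ handler does the same with `as_spam=False`. Clients that drop the original Message-ID when forwarding would need one of the other matching criteria.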

Of course, you'd have to set some reasonable limits on the database... purging messages older than a certain age, purging messages after they have been processed, etc.

I'll keep digging, but I'd be interested to hear input from others.

**scott** » Wed Jan 26, 2005 9:46 pm

Sort of. SquirrelMail has some things in it, which you can tie into a PG server the same way you tie it into PSA. There are also some other tools out there, like a remote version of sa-learn with IMAP support.

**hotgazpacho** » Wed Jan 26, 2005 11:12 pm

Thanks, Scott.

I looked at the remote shared IMAP folders solution and found it inadequate for my needs. It requires all clients to either access their mailboxes over IMAP or use POP3 and leave mail on the server. I cannot force my clients to use IMAP (as some may use email clients that don't support IMAP), nor can I force them to leave POP3 mail (specifically, mail they wish to report as spam or ham) on the server.

I think what I am going to do is store a copy of each mail accepted for delivery by my Gamera server in a MySQL database. I'll also set up spam and ham reporting accounts. Then, when a client sends mail to the spam@example.com account or the notspam@example.com account, the server will extract as much of the original message as possible and try to find a match for it in the database. If the match is within a configurable threshold, the original message will be pulled from the database and fed to sa-learn.

Storing it in a database means that we always have (or, rather, can reconstruct) a clean copy of the original message, whether or not the forwarded message from the user contains all the headers.

To combat spammers attempting to poison the Bayes filter, a unique SHA-1 hash could be generated using data from the original message and a secret on the Gamera server. This hash would be added as both a header and a footer in the email (to ensure that it gets transmitted in the forwarded message, regardless of the client) before the message is delivered or a copy is saved in the database. When the mail is sent for retraining, the hash will be searched for. If it is found in both the database and the forwarded message, the mail MUST have traversed my server at some point, so the original message is reconstructed from the database (sans hash) and fed to sa-learn. Otherwise, it is silently rejected (and possibly logged).
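For the "configurable threshold" part, something like the following is what I have in mind. It is only a sketch: it uses Python's `difflib` as the similarity measure (my choice for illustration, nothing Gamera or SpamAssassin provides), and `normalize`, `best_match` and the default of 0.8 are placeholders I made up.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Drop quote markers and collapse whitespace so forwarded text compares fairly."""
    lines = [line.lstrip("> ").strip() for line in text.splitlines()]
    return " ".join(" ".join(lines).split()).lower()


def best_match(forwarded_body, candidates, threshold=0.8):
    """Return (message_id, score) for the closest stored body, or (None, score)
    when nothing reaches the threshold. `candidates` maps message_id -> body text."""
    target = normalize(forwarded_body)
    best_id, best_score = None, 0.0
    for message_id, body in candidates.items():
        score = SequenceMatcher(None, target, normalize(body), autojunk=False).ratio()
        if score > best_score:
            best_id, best_score = message_id, score
    if best_score >= threshold:
        return best_id, best_score
    return None, best_score
```

Comparing against every stored body gets slow quickly, so in practice the candidate set would first be narrowed by recipient and a time window before scoring.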
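Roughly, the tagging could look like the sketch below. Note that it uses HMAC-SHA1 rather than a bare SHA-1 over message plus secret, which is a swap on my part because HMAC is the standard way to key a hash. The header name `X-Gamera-Train-Tag`, the footer format and the function names are all invented for illustration, and `lookup` stands in for the database query.

```python
import hashlib
import hmac
import re

TAG_HEADER = "X-Gamera-Train-Tag"  # hypothetical header name
TAG_RE = re.compile(rb"\b[0-9a-f]{40}\b")  # a SHA-1 digest in hex


def make_tag(secret, message_id, raw):
    """Keyed digest over the Message-ID and the pristine message bytes."""
    data = message_id.encode("utf-8") + b"\0" + raw
    return hmac.new(secret, data, hashlib.sha1).hexdigest()


def stamp(raw, tag):
    """Add the tag as a leading header and a trailing footer.
    Naive: ignores MIME structure, so the footer lands after the last part."""
    header = f"{TAG_HEADER}: {tag}\r\n".encode("ascii")
    footer = f"\r\n-- \r\ntrain-tag: {tag}\r\n".encode("ascii")
    return header + raw + footer


def verify_forward(secret, forwarded_raw, lookup):
    """Return the stored original if the forward carries a tag we issued.
    `lookup(tag)` returns (message_id, raw) or None."""
    for candidate in set(TAG_RE.findall(forwarded_raw)):
        tag = candidate.decode("ascii")
        stored = lookup(tag)
        if stored is None:
            continue
        message_id, raw = stored
        # Recompute rather than trusting the stored row alone.
        if hmac.compare_digest(make_tag(secret, message_id, raw), tag):
            return raw
    return None
```

The copy saved in the database is the unstamped one, so what reaches sa-learn never contains the tag. A forward with no valid tag returns `None`, which is the "silently rejected" case.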

I'll also need a mechanism to purge messages stored in the database after X days.
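The purge itself should be simple enough to run from cron. A minimal sketch, again with SQLite standing in for MySQL and a made-up `messages` table that records when each message was received as a Unix timestamp:

```python
import sqlite3
import time

# Hypothetical schema; received_at is seconds since the epoch.
SCHEMA = (
    "CREATE TABLE IF NOT EXISTS messages ("
    "message_id TEXT PRIMARY KEY, raw BLOB NOT NULL, received_at INTEGER NOT NULL)"
)


def purge_old(db, max_age_days, now=None):
    """Delete stored messages older than max_age_days; return how many were removed."""
    now = time.time() if now is None else now
    cutoff = int(now - max_age_days * 86400)
    cur = db.execute("DELETE FROM messages WHERE received_at < ?", (cutoff,))
    db.commit()
    return cur.rowcount
```

Messages that have already been used for training could be deleted at that point instead of waiting for the age limit.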

What do you think?

**scott** » Thu Jan 27, 2005 12:31 am

This is sort of where I'm heading with the next generation of PG, which is based on Postfix, where the quarantine is in MySQL. Ham training is very problematic using this method, however, since at best you only get false positives to train on. You need at least 200 messages in the ham classifier for Bayes to even function, and it doesn't get particularly useful until it gets up to about 2000 messages. Training in SA is automatic, based on the score threshold. By default, messages scoring 12 or more are autolearned as spam, and those scoring 0.1 or below as ham. The trick with sa-learn is not to use it on spam, since you get that for free with autolearning; it's to use it on ham. And you need a lot of it. Poisoning isn't an issue at all when, for example, you have a 2000-message ham corpus. The question is how to train that quickly. For my users, I've found the best way to do it is with learning folders in IMAP, or an abstraction layer (like PSA does, or a plugin to the MUA), so the user doesn't have to do more than drag and drop messages into a folder.

The quarantine piece of this really hasn't played into learning at all in my user testing. Most people just want spam to not show up in their mailbox, but still want the ability to look at it (even through a web interface) at their leisure. Generally I get this with business clients.