Gamera and Training Bayesian filter via email
Posted: Wed Jan 26, 2005 12:24 pm
Does Gamera offer an interface that lets clients report spam that manages to get through the filters? I was thinking of something along the lines of letting users forward such mail to an account, like spam@example.com, which would take the forwarded message and submit it to SpamAssassin for training purposes.
This seems the most appropriate method to me, since the recipients of mail coming in to Gamera don't usually have a system account on it (rather, the mail is forwarded to another host for delivery), so users cannot train SpamAssassin in the usual way.
Update:
I've done some digging and found a problem with forwarding messages: the email client adds its own headers, which has a very high potential to taint the Bayesian engine against your users. Also, not all email clients include all the original headers.
I have found one approach that may be worth investigating, outlined here:
http://lists.gnu.org/archive/html/spama ... 00015.html
Essentially, you store all incoming emails in a database. Then, when someone forwards a message to spam@example.com, it triggers a script that searches the database for the original message (based on various criteria, including headers, content, timestamp, message ID, etc.). If the original is found, the stored copy, rather than the forwarded one, is fed to sa-learn. The same could be done for notspam@example.com.
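To make the idea a bit more concrete, here is a rough sketch of the lookup step in Python. It is not taken from the linked post and is not anything Gamera ships. The SQLite table layout and the function names (open_store, store_incoming, find_original, learn) are all made up for illustration. It only tries two of the criteria: the Message-ID of an original forwarded as an attachment, then a plain Subject match as a weak fallback. A real script would want the stricter checks (content, timestamp) as well.

```python
import email
import re
import sqlite3
import subprocess

# Strips one or more leading "Fw:" / "Fwd:" prefixes that mail clients add.
FWD_PREFIX = re.compile(r"^(?:\s*fwd?\s*:\s*)+", re.IGNORECASE)


def open_store(path=":memory:"):
    """Open (or create) the hypothetical message store."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS messages ("
        " message_id TEXT PRIMARY KEY,"
        " subject TEXT,"
        " received_at REAL,"
        " raw BLOB,"
        " processed INTEGER DEFAULT 0)"
    )
    return conn


def store_incoming(conn, raw, received_at):
    """Save an untouched copy of an incoming message (raw bytes)."""
    msg = email.message_from_bytes(raw)
    conn.execute(
        "INSERT OR REPLACE INTO messages"
        " (message_id, subject, received_at, raw) VALUES (?, ?, ?, ?)",
        (
            (msg["Message-ID"] or "").strip(),
            (msg["Subject"] or "").strip(),
            received_at,
            raw,
        ),
    )
    conn.commit()


def find_original(conn, forwarded_raw):
    """Return the stored raw message matching a forwarded report, or None."""
    fwd = email.message_from_bytes(forwarded_raw)

    # 1. If the client forwarded the original as an attachment, its
    #    headers survive, so the original Message-ID can be used.
    for part in fwd.walk():
        if part.get_content_type() == "message/rfc822":
            inner = part.get_payload(0)
            row = conn.execute(
                "SELECT raw FROM messages WHERE message_id = ?",
                ((inner["Message-ID"] or "").strip(),),
            ).fetchone()
            if row:
                return row[0]

    # 2. Weak fallback for inline forwards: match on the Subject with the
    #    forwarding prefix removed, newest message first.
    subject = FWD_PREFIX.sub("", fwd["Subject"] or "").strip()
    if subject:
        row = conn.execute(
            "SELECT raw FROM messages WHERE subject = ?"
            " ORDER BY received_at DESC LIMIT 1",
            (subject,),
        ).fetchone()
        if row:
            return row[0]
    return None


def learn(raw, as_spam=True):
    """Feed the stored original to sa-learn on stdin (must be on PATH)."""
    flag = "--spam" if as_spam else "--ham"
    subprocess.run(["sa-learn", flag], input=raw, check=True)
```

The script behind spam@example.com would call find_original() on the forwarded report and, if it returns something, pass the result to learn(); the notspam@example.com script would do the same with as_spam=False.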
Of course, you'd have to set some reasonable limits on the database... purging messages older than a certain age, purging messages after they have been processed, etc.
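For the purging, something like the following might do as a starting point. Again this is only a sketch: the messages table, its received_at and processed columns, and the 14-day window are assumptions of mine, not settings that Gamera or SpamAssassin define.

```python
import sqlite3
import time

MAX_AGE_SECONDS = 14 * 24 * 3600  # assumed retention window, tune to taste


def create_table(conn):
    """Create the hypothetical message store if it does not exist yet."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS messages ("
        " message_id TEXT PRIMARY KEY,"
        " subject TEXT,"
        " received_at REAL,"
        " raw BLOB,"
        " processed INTEGER DEFAULT 0)"
    )


def purge(conn, now=None, max_age=MAX_AGE_SECONDS):
    """Delete messages already fed to sa-learn or older than max_age.

    Returns the number of rows removed.
    """
    now = time.time() if now is None else now
    cur = conn.execute(
        "DELETE FROM messages WHERE processed = 1 OR received_at < ?",
        (now - max_age,),
    )
    conn.commit()
    return cur.rowcount
```

Run from cron, this keeps the table from growing without bound. The training script would need to set processed = 1 on a row once sa-learn has accepted it.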
I'll keep digging, but I'd be interested to hear input from others.