Corpus testing and symbol rescoring

This document is a work in progress!

This requires rspamd >= 1.7

This document details the procedure for running rspamd against a corpus of hand-classified ham and spam mail to produce ham and spam logs. These logs are then used to generate the symbol scores that give the best accuracy.

Why?

When developing new spam or non-spam rules, it's difficult to know what to score rules at, or even if the rule you've just written is any good at all.

So, to aid development of new rules, we have to test them against a corpus of known-good non-spam messages and known-bad spam messages. Once this is done, we can see exactly how many spam vs. non-spam messages a new rule hits.

This also allows us to inspect which non-spam messages are being hit (if any) and to modify the rule so that it hits fewer of them.
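As a rough illustration of the kind of report described above, here is a minimal Python sketch. It assumes the scan results are already in memory as (class, filename, symbols) tuples. The symbol names, file names and the rule_report helper are all hypothetical and are not part of rspamd.

```python
from collections import Counter

# Hypothetical scan results: (class, filename, set of symbols that fired).
# All names below are made up for illustration.
results = [
    ("spam", "spam/0001.eml", {"MY_NEW_RULE", "OTHER_RULE"}),
    ("spam", "spam/0002.eml", {"MY_NEW_RULE"}),
    ("ham", "ham/0001.eml", {"OTHER_RULE"}),
    ("ham", "ham/0002.eml", {"MY_NEW_RULE"}),
    ("ham", "ham/0003.eml", set()),
]


def rule_report(results, symbol):
    """Count how often `symbol` fired on each class and list the ham files it hit."""
    totals = Counter(cls for cls, _, _ in results)
    hits = Counter(cls for cls, _, syms in results if symbol in syms)
    ham_files = sorted(
        name for cls, name, syms in results if cls == "ham" and symbol in syms
    )
    return {
        "spam_hits": hits["spam"],
        "ham_hits": hits["ham"],
        "spam_rate": hits["spam"] / totals["spam"] if totals["spam"] else 0.0,
        "ham_rate": hits["ham"] / totals["ham"] if totals["ham"] else 0.0,
        "ham_files": ham_files,  # non-spam messages to inspect by hand
    }


report = rule_report(results, "MY_NEW_RULE")
print(report)
```

The ham_files entry is the list of non-spam messages worth opening by hand before tightening the rule.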

Once this is done we can then run a rescore to calculate the optimum scores for all rules/symbols to minimise false-positives and maximise true-positives for each class of mail.
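To give a feel for what a rescore does, the sketch below adjusts per-symbol scores with a simple perceptron-style update until every sample lands on the correct side of a fixed spam threshold. This is only an illustration under stated assumptions and is not necessarily the algorithm rspamd uses. The rescore function, the threshold, the learning rate and the symbol names are all hypothetical.

```python
def rescore(samples, threshold=1.0, epochs=100, lr=0.5):
    """Perceptron-style sketch: learn one score per symbol.

    samples: list of (is_spam, set_of_symbols) pairs.
    A message is predicted spam when the sum of its symbol scores
    reaches `threshold`. All parameters are illustrative assumptions.
    """
    scores = {sym: 0.0 for _, syms in samples for sym in syms}
    for _ in range(epochs):
        errors = 0
        for is_spam, syms in samples:
            total = sum(scores[s] for s in syms)
            predicted_spam = total >= threshold
            if predicted_spam != is_spam:
                errors += 1
                # Raise scores on a missed spam, lower them on a false positive.
                delta = lr if is_spam else -lr
                for s in syms:
                    scores[s] += delta
        if errors == 0:
            break  # every sample is classified correctly
    return scores


# Hypothetical training data: (is_spam, symbols that fired).
samples = [
    (True, {"SYM_A"}),
    (True, {"SYM_A", "SYM_B"}),
    (False, {"SYM_B"}),
    (False, {"SYM_C"}),
]
scores = rescore(samples)
print(scores)
```

If the data cannot be separated this way, the loop simply stops after the given number of epochs, so the result should be checked against a held-out part of the corpus.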

How you can help

To make rspamd as accurate as possible, we need as much hand-classified mail as possible. This includes:

Messages reported by users as false-negatives (i.e. spam that isn't being correctly identified by rspamd). We have to be careful here, as many users class any message that they didn't want as spam, even if they agreed to receive it.
Messages reported by users as false-positives (i.e. messages that users wanted, but that were classified as spam by rspamd).
Non-spam messages.

For these three items, non-English messages are also highly desirable.

Domains that you no longer use that we can re-purpose into spamtrap domains.

Obviously, the above raises privacy issues, particularly concerning false-positives and non-spam messages. You can still help by collating these messages on your own server, running the corpus testing on your own hardware and then submitting your logs to us. The logs only contain: <class> <score> <action> <symbols> <scantime> <filename>, so submitting them does not expose any message content.
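For anyone who wants to inspect their own logs before submitting them, here is a minimal Python sketch that splits one log line into the six fields listed above. The exact on-disk layout is an assumption: the fields are taken to be whitespace-separated, the action a single token, and the symbols a non-empty comma-separated list. Check this against the files your own run produces. The LogEntry type, the parse_log_line function and the sample line are hypothetical.

```python
from typing import List, NamedTuple


class LogEntry(NamedTuple):
    cls: str            # <class>
    score: float        # <score>
    action: str         # <action>
    symbols: List[str]  # <symbols>
    scantime: float     # <scantime>
    filename: str       # <filename>


def parse_log_line(line: str) -> LogEntry:
    """Parse one log line, assuming whitespace-separated fields.

    maxsplit=5 keeps any spaces inside the trailing filename intact.
    """
    cls, score, action, symbols, scantime, filename = line.rstrip("\n").split(None, 5)
    return LogEntry(
        cls, float(score), action, symbols.split(","), float(scantime), filename
    )


# Made-up sample line in the assumed layout.
entry = parse_log_line("SPAM 17.50 reject SYM_A,SYM_B 0.12 corpus/spam/0001.eml")
print(entry)
```

Note that the filename is the only field that says anything about the message itself, so it is worth checking that your file names do not carry anything private.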

Running your own corpus test and generating logs

TODO

If you can help with any of the following, then get in touch!

Set up a repository from which corpus_test users who submit logs to us can pull experimental rules.
Set up nightly corpus testing and automatic symbol rescoring with score updates pushed out via rspamd_update.
Create a procedure for rspamd users to upload their ham and spam logs for inclusion in nightly corpus testing and rescoring.
Publish the nightly results of the corpus testing to rspamd.com and track the accuracy over time.