From a2706116f96b82ce9bbf86b252babb8b5c98a3e9 Mon Sep 17 00:00:00 2001
From: Vsevolod Stakhov
Date: Wed, 13 Jan 2016 07:55:24 +0000
Subject: Add documentation for new statistics

---
 doc/markdown/configuration/statistic.md | 152 +++++++++++++++++++++++++++++++-
 1 file changed, 150 insertions(+), 2 deletions(-)

diff --git a/doc/markdown/configuration/statistic.md b/doc/markdown/configuration/statistic.md
index 49d36eeb8..c7f811331 100644
--- a/doc/markdown/configuration/statistic.md
+++ b/doc/markdown/configuration/statistic.md
@@ -9,6 +9,8 @@ base on the following factors:
 - the probability of a specific token to be spam or ham (which means efficiently count of a token's occurences in spam and ham messages)
 - the probability of a specific token to appear in a message (which efficiently means frequency of a token divided by a number of tokens in a message)
 
+## Statistics Architecture
+
 However, rspamd uses more advanced techniques to combine probabilities, such as sparsed bigramms (OSB) and inverse chi-square distribution.
 The key idea of `OSB` algorithm is to use not merely single words as tokens but combinations of words weighted by theirs positions.
 This schema is displayed in the following picture:
@@ -22,6 +24,8 @@ Statistical tokens are stored in statfiles which, in turn, are mapped to specifi
 ![Statistics architecture](https://rspamd.com/img/rspamd-schemes.005.png "Rspamd statistics architecture")
 
+## Statistics Configuration
+
 Starting from rspamd 1.0, we propose to use `sqlite3` as backed and `osb` as tokenizer. That also enables additional features, such as tokens normalization and metainformation in statistics. The following configuration demonstrates the recommended statistics configuration:
@@ -52,7 +56,151 @@ classifier {
 It is also possible to organize per-user statistics using sqlite3 backend. However, you should ensure that rspamd is called at the finally delivery stage (e.g. LDA mode) to avoid multi-recipients messages.
 In case of a multi-recipient message, rspamd would just use the
-first recipient for user-based statistics which might be inappropriate for your configuration (however, rspamd merely uses SMTP recipients, not MIME ones and prefer
+first recipient for user-based statistics which might be inappropriate for your configuration (however, rspamd preferes SMTP recipients over MIME ones and prioritize
 the special LDA header called `Deliver-To` that can be appended by `-d` options for `rspamc`). To enable per-user statistics, just add `users_enabled = true` property to the **classifier** configuration. You can use per-user and per-language statistics simulataneously. For both types of spearation, rspamd also
-looks to the default language and default user's statistics allowing to have the common set of tokens shared for all users/languages.
\ No newline at end of file
+looks to the default language and default user's statistics allowing to have the common set of tokens shared for all users/languages.
+
+## Using lua scripts for `per_user` classifier
+
+It is also possible to create custom lua scripts to use customized user or language for a specific task. Here is an example
+of such a script for extracting domain names from recipients organizing thus per-domain statistics:
+
+~~~nginx
+    classifier {
+        tokenizer {
+            name = "osb";
+        }
+        name = "bayes2";
+        min_tokens = 11;
+        backend = "sqlite3";
+        per_language = true;
+        per_user = < `10` in this case)
+* `autolearn = "return function(task) ... end"`: use the following lua function to detect if autolearn is needed (function should return 'ham' if learn as ham is needed and string 'spam' if learn as spam is needed, if no learn is needed then a function can return anything including `nil`)
+
+Redis backend is highly recommended for autolearning purposes since it's the only backend with high concurrency level when multiple writers are properly synchronized.
\ No newline at end of file
--
cgit v1.2.3