-rw-r--r-- doc/markdown/configuration/statistic.md 53
1 files changed, 44 insertions, 9 deletions
diff --git a/doc/markdown/configuration/statistic.md b/doc/markdown/configuration/statistic.md
index 1f46b4f10..49d36eeb8 100644
--- a/doc/markdown/configuration/statistic.md
+++ b/doc/markdown/configuration/statistic.md
@@ -2,22 +2,57 @@
## Introduction
+Rspamd uses statistics to determine the `class` of a message: either spam or ham. The overall algorithm is based on Bayes' theorem,
+which defines how probabilities are combined. In general, it gives the probability that a message belongs to the specified class (namely, `spam` or `ham`)
+based on the following factors:
+
+- the probability that a specific token is spam or ham (which effectively means a count of the token's occurrences in spam and ham messages)
+- the probability that a specific token appears in a message (which effectively means the frequency of the token divided by the number of tokens in the message)
+
+However, rspamd uses more advanced techniques to combine probabilities, such as orthogonal sparse bigrams (OSB) and the inverse chi-square distribution.
+The key idea of the `OSB` algorithm is to use not merely single words as tokens but combinations of words weighted by their positions.
+This scheme is shown in the following picture:
+
+![OSB algorithm](https://rspamd.com/img/rspamd-schemes.004.png "Rspamd OSB scheme")
+
+The main disadvantage is that the number of tokens is multiplied by the size of the window. Rspamd uses a window of 5 tokens, which means that
+the number of tokens is about 5 times larger than the number of words.
+
+Statistical tokens are stored in statfiles which, in turn, are mapped to specific backends. This architecture is displayed in the following image:
+
+![Statistics architecture](https://rspamd.com/img/rspamd-schemes.005.png "Rspamd statistics architecture")
+
+Starting from rspamd 1.0, we recommend using `sqlite3` as the backend and `osb` as the tokenizer. This also enables additional features, such as token normalization and
+meta-information in statistics. The following example demonstrates the recommended statistics configuration:
+
~~~nginx
classifier {
type = "bayes";
- tokenizer = "osb-text";
- metric = "default";
- min_tokens = 10;
- max_tokens = 1000;
+ tokenizer {
+ name = "osb";
+ }
+ cache {
+ path = "${DBDIR}/learn_cache.sqlite";
+ }
+ min_tokens = 11;
+ backend = "sqlite3";
+ languages_enabled = true;
statfile {
symbol = "BAYES_HAM";
- size = 50Mb;
- path = "$DBDIR/bayes.ham";
+ path = "${DBDIR}/bayes.ham.sqlite";
+ spam = false;
}
statfile {
symbol = "BAYES_SPAM";
- size = 50Mb;
- path = "$DBDIR/bayes.spam";
+ path = "${DBDIR}/bayes.spam.sqlite";
+ spam = true;
}
}
-~~~
\ No newline at end of file
+~~~
+
+It is also possible to organize per-user statistics using the sqlite3 backend. However, you should ensure that rspamd is called at the
+final delivery stage (e.g. LDA mode) to avoid multi-recipient messages. For a multi-recipient message, rspamd just uses the
+first recipient for user-based statistics, which might be inappropriate for your configuration (note that rspamd uses only SMTP recipients, not MIME ones, and prefers
+the special LDA header called `Deliver-To`, which can be appended by the `-d` option of `rspamc`). To enable per-user statistics, just add the `users_enabled = true` property
+to the **classifier** configuration. You can use per-user and per-language statistics simultaneously. For both types of separation, rspamd also
+looks at the default language's and default user's statistics, which allows a common set of tokens to be shared by all users/languages.
\ No newline at end of file
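+
+The per-user switch described above might look as follows. This is a minimal sketch that assumes the recommended configuration shown earlier: only the `users_enabled` line is new, and the `cache` and `statfile` sections are left out for brevity and should be kept unchanged.
+
+~~~nginx
+classifier {
+  type = "bayes";
+  tokenizer {
+    name = "osb";
+  }
+  backend = "sqlite3";
+  # per-language statistics, as in the recommended configuration
+  languages_enabled = true;
+  # per-user statistics; may be combined with languages_enabled
+  users_enabled = true;
+  # the cache and statfile sections stay as in the recommended configuration
+}
+~~~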