Diffstat (limited to 'doc/markdown/configuration/statistic.md')
-rw-r--r-- | doc/markdown/configuration/statistic.md | 24 |
1 files changed, 12 insertions, 12 deletions
diff --git a/doc/markdown/configuration/statistic.md b/doc/markdown/configuration/statistic.md
index 18e870652..26b2b70e7 100644
--- a/doc/markdown/configuration/statistic.md
+++ b/doc/markdown/configuration/statistic.md
@@ -2,7 +2,7 @@

 ## Introduction

-Statistics is used by rspamd to define the `class` of message: either spam or ham. The overall algorithm is based on Bayesian theorem
+Statistics is used by Rspamd to define the `class` of message: either spam or ham. The overall algorithm is based on Bayesian theorem
 that defines probabilities combination. In general, it defines the probability of that a message belongs to the specified class (namely, `spam` or `ham`)
 base on the following factors:

@@ -11,13 +11,13 @@ base on the following factors:

 ## Statistics Architecture

-However, rspamd uses more advanced techniques to combine probabilities, such as sparsed bigramms (OSB) and inverse chi-square distribution.
+However, Rspamd uses more advanced techniques to combine probabilities, such as sparsed bigramms (OSB) and inverse chi-square distribution.
 The key idea of `OSB` algorithm is to use not merely single words as tokens but combinations of words weighted by theirs positions.
 This schema is displayed in the following picture:

 ![OSB algorithm](https://rspamd.com/img/rspamd-schemes.004.png "Rspamd OSB scheme")

-The main disadvantage is the amount of tokens which is multiplied by size of window. In rspamd, we use a window of 5 tokens that means that
+The main disadvantage is the amount of tokens which is multiplied by size of window. In Rspamd, we use a window of 5 tokens that means that
 the number of tokens is about 5 times larger than the amount of words.
 Statistical tokens are stored in statfiles which, in turn, are mapped to specific backends.
 This architecture is displayed in the following image:
@@ -26,7 +26,7 @@ Statistical tokens are stored in statfiles which, in turn, are mapped to specifi

 ## Statistics Configuration

-Starting from rspamd 1.0, we propose to use `sqlite3` as backed and `osb` as tokenizer. That also enables additional features, such as tokens normalization and
+Starting from Rspamd 1.0, we propose to use `sqlite3` as backed and `osb` as tokenizer. That also enables additional features, such as tokens normalization and
 metainformation in statistics. The following configuration demonstrates the recommended statistics configuration:

 ~~~ucl
@@ -63,11 +63,11 @@ classifier "bayes" {
 }
 ~~~

-It is also possible to organize per-user statistics using sqlite3 backend. However, you should ensure that rspamd is called at the
-finally delivery stage (e.g. LDA mode) to avoid multi-recipients messages. In case of a multi-recipient message, rspamd would just use the
-first recipient for user-based statistics which might be inappropriate for your configuration (however, rspamd preferes SMTP recipients over MIME ones and prioritize
+It is also possible to organize per-user statistics using sqlite3 backend. However, you should ensure that Rspamd is called at the
+finally delivery stage (e.g. LDA mode) to avoid multi-recipients messages. In case of a multi-recipient message, Rspamd would just use the
+first recipient for user-based statistics which might be inappropriate for your configuration (however, Rspamd preferes SMTP recipients over MIME ones and prioritize
 the special LDA header called `Deliver-To` that can be appended by `-d` options for `rspamc`). To enable per-user statistics, just add `users_enabled = true` property
-to the **classifier** configuration. You can use per-user and per-language statistics simulataneously. For both types of spearation, rspamd also
+to the **classifier** configuration. You can use per-user and per-language statistics simulataneously. For both types of spearation, Rspamd also
 looks to the default language and default user's statistics allowing to have the common set of tokens shared for all users/languages.

 ## Using lua scripts for `per_user` classifier
@@ -115,10 +115,10 @@ EOD

 ## Applying per-user and per-language statistics

-From version 1.1, rspamd uses independent statistics for users and joint statistics for languages. That means the following:
+From version 1.1, Rspamd uses independent statistics for users and joint statistics for languages. That means the following:

-* If `per_user` is enabled then rspamd looks for users statistics **only**
-* If `per_language` is enabled then rspamd looks for language specific statistics **plus** language independent statistics
+* If `per_user` is enabled then Rspamd looks for users statistics **only**
+* If `per_language` is enabled then Rspamd looks for language specific statistics **plus** language independent statistics

 It is different from 1.0 version where the second approach was used for both cases.

@@ -215,7 +215,7 @@ Where the last number is priority used to distinguish master from slave.

 ## Autolearning

-From version 1.1, rspamd supports autolearning for statfiles. Autolearning is applied after all rules are processed (including statistics) if and only if the same symbol has not been inserted. E.g. a message won't be learned as spam if `BAYES_SPAM` is already in the results of checking.
+From version 1.1, Rspamd supports autolearning for statfiles. Autolearning is applied after all rules are processed (including statistics) if and only if the same symbol has not been inserted. E.g. a message won't be learned as spam if `BAYES_SPAM` is already in the results of checking.

 There are 3 possibilities to specify autolearning: