author | Vsevolod Stakhov <vsevolod@highsecure.ru> | 2013-12-30 01:29:42 +0000 |
---|---|---|
committer | Vsevolod Stakhov <vsevolod@highsecure.ru> | 2013-12-30 01:29:42 +0000 |
commit | 9ebfb824efb4c1fd5325c9451669bf9c5cb5c544 | |
tree | 14de4b42e04fb49c9a9313f4c72b14934365b06f /doc/markdown/architecture | |
parent | 2c3794e8f11308be3d53e3761810bb2e9901970b | |
More documentation.
Diffstat (limited to 'doc/markdown/architecture')
-rw-r--r-- | doc/markdown/architecture/index.md | 32 |
1 files changed, 28 insertions, 4 deletions
diff --git a/doc/markdown/architecture/index.md b/doc/markdown/architecture/index.md
index 2b1503933..33669f2d5 100644
--- a/doc/markdown/architecture/index.md
+++ b/doc/markdown/architecture/index.md
@@ -3,7 +3,7 @@
 ## Introduction
 
 Rspamd is a universal spam filtering system based on event-driven processing
-model. It means that rspamd is intented not to block anywhere in the code. To
+model. It means that rspamd is intended not to block anywhere in the code. To
 process messages rspamd uses a set of so called `rules`. Each `rule` is a symbolic
 name associated with some message property. For example, we can define the following
 rules:
@@ -13,7 +13,7 @@ rules:
 - FORGED_OUTLOOK_MID - message ID seems to be forged for Outlook MUA.
 
 Rules are defined by [modules](../modules/). So far, if there is a module that
-performs SPF checks it may define several rules accroding to SPF policy:
+performs SPF checks it may define several rules according to SPF policy:
 
 - SPF_ALLOW - a sender is allowed to send messages for this domain;
 - SPF_DENY - a sender is denied by SPF policy;
@@ -49,8 +49,9 @@ means the opposite.
 
 ### Rules scheduler
 
-To avoid unnecessary checks rspamd uses scheduler of rules for each message. This
-scheduler is rather naive and it performs the following logic:
+To avoid unnecessary checks rspamd uses scheduler of rules for each message. So far,
+if a message is considered as `definite spam` then further checks are not performed.
+This scheduler is rather naive and it performs the following logic:
 
 - select negative rules *before* positive ones to prevent false positives;
 - prefer rules with the following characteristics:
@@ -91,3 +92,26 @@ a resulting symbol with weight from 0 to 5.0. To distribute values in the proper
 way, rspamd usually uses some sort of Sigma function to provide fair distribution curve.
 Nevertheless, the most of rspamd rules uses static weights with the exception of fuzzy rules.
 
+
+## Statistic
+
+Rspamd uses statistic algorithms to precise the final score of a message. Currently,
+the only algorithm defined is OSB-Bayes. You may find the concrete details of this
+algorithm in the following [paper](http://osbf-lua.luaforge.net/papers/osbf-eddc.pdf).
+Rspamd uses window size of 5 words in its classification. During classification procedure,
+rspamd split a message to a set of tokens.
+
+Tokens are separated by punctiation or space characters. Short tokens (less than 3 symbols) are ignored. For each token rspamd
+calculates two non-cryptographic hashes used subsequently as indices. All these tokens
+are stored in memory-mapped files called `statistic files` (or `statfiles`). Each statfile
+is a set of token chains, indexed by the first hash. A new token may be inserted to some
+chain, and if this chain is full then rspamd tries to expire less significant tokens to
+insert a new one. It is possible to obtain the current state of tokens by running
+
+    rspamc stat`
+
+command that asks controller for free and used tokens in each statfile.
+Please note that if a statfile is close to be completely filled then during subsequent
+learning you will loose existing data. Therefore, it is recommended to increase size for
+such statfiles.
+
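The tokenization and statfile layout described in the added `Statistic` section can be illustrated with a minimal Python sketch. It is not rspamd's implementation. The names `tokenize`, `osb_features`, `hash_pair` and `ToyStatfile` are hypothetical, `zlib.crc32` and `zlib.adler32` are only stand-ins for the two non-cryptographic hashes (the patch does not say which ones rspamd uses), hashing token pairs rather than single tokens is an assumption based on the general OSB idea, and "less significant" is assumed here to mean "lowest hit count".

```python
import re
import zlib

WINDOW_SIZE = 5    # window size stated in the patch
MIN_TOKEN_LEN = 3  # tokens shorter than 3 symbols are ignored


def tokenize(text):
    """Split on punctuation and whitespace, dropping short tokens."""
    return [t for t in re.split(r"[\W_]+", text) if len(t) >= MIN_TOKEN_LEN]


def osb_features(tokens, window=WINDOW_SIZE):
    """Pair each token with up to window-1 preceding tokens, keeping the distance."""
    feats = []
    for i, tok in enumerate(tokens):
        for dist in range(1, window):
            if i - dist < 0:
                break
            feats.append((tokens[i - dist], dist, tok))
    return feats


def hash_pair(feature):
    """Two stand-in non-cryptographic hashes of one feature (assumption)."""
    data = "{}|{}|{}".format(*feature).encode("utf-8")
    return zlib.crc32(data), zlib.adler32(data)


class ToyStatfile:
    """In-memory stand-in for a statfile: fixed-size chains picked by the first hash."""

    def __init__(self, n_chains=4, chain_len=2):
        self.n_chains = n_chains
        self.chain_len = chain_len
        self.chains = [[] for _ in range(n_chains)]  # entries are [h2, count]

    def learn(self, feature):
        h1, h2 = hash_pair(feature)
        chain = self.chains[h1 % self.n_chains]
        for entry in chain:
            if entry[0] == h2:
                entry[1] += 1
                return
        if len(chain) >= self.chain_len:
            # Chain is full: expire the entry with the lowest count (assumed policy).
            chain.remove(min(chain, key=lambda e: e[1]))
        chain.append([h2, 1])

    def used(self):
        return sum(len(c) for c in self.chains)

    def free(self):
        return self.n_chains * self.chain_len - self.used()


if __name__ == "__main__":
    statfile = ToyStatfile()
    for feat in osb_features(tokenize("Buy cheap pills now, it is free!")):
        statfile.learn(feat)
    print("used:", statfile.used(), "free:", statfile.free())
```

Because the toy statfile is deliberately tiny, learning even one short message evicts earlier entries. This is the data loss the patch warns about for statfiles that are close to full, and the reason it recommends increasing their size.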