author Vsevolod Stakhov <vsevolod@highsecure.ru> 2016-04-19 15:56:41 +0100
committer Vsevolod Stakhov <vsevolod@highsecure.ru> 2016-04-19 15:56:41 +0100
commit 7e4b88f7bc3ccccbf462a62de865beff388deae6
tree 55b57231071f28385c5eef37c3680dd1d667dead /doc
parent a897287fbc34772d549cfa0d2aa31e23533edbe5
[Doc] Improve classifiers documentation
Diffstat (limited to 'doc')
-rw-r--r-- doc/markdown/configuration/statistic.md 129
1 files changed, 75 insertions, 54 deletions
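
Reviewer note: the patch below documents two thresholds, `min_tokens` ("minimum number of words required for statistics processing") and `min_learns` ("minimum learn count for both spam and ham classes to perform classification"). As a reading aid, here is a minimal Python sketch of how those two checks could combine. It is not rspamd's implementation: `ClassifierLimits` and `should_classify` are hypothetical names, and the inclusive comparisons are an assumption, since the patch does not state whether the limits are inclusive.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClassifierLimits:
    """Hypothetical container; field names mirror the options in the patch."""
    min_tokens: int = 11   # value used in the documented examples
    min_learns: int = 200  # value used in the documented examples


def should_classify(token_count: int, spam_learns: int, ham_learns: int,
                    limits: ClassifierLimits = ClassifierLimits()) -> bool:
    """Return True when a message passes both documented thresholds.

    Assumption: both comparisons are inclusive (>=).
    """
    # Too few tokens: the message is skipped for statistics processing.
    if token_count < limits.min_tokens:
        return False
    # The learn count must be reached by BOTH the spam and the ham class.
    return (spam_learns >= limits.min_learns
            and ham_learns >= limits.min_learns)
```

Under these assumptions, a classifier that has learned plenty of spam but fewer than 200 ham messages would not classify at all, which matches the "both spam and ham classes" wording in the added comment.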
diff --git a/doc/markdown/configuration/statistic.md b/doc/markdown/configuration/statistic.md
index f314a31a6..18e870652 100644
--- a/doc/markdown/configuration/statistic.md
+++ b/doc/markdown/configuration/statistic.md
@@ -17,7 +17,7 @@ This schema is displayed in the following picture:
![OSB algorithm](https://rspamd.com/img/rspamd-schemes.004.png "Rspamd OSB scheme")
-The main disadvantage is the amount of tokens which is multiplied by size of window. In rspamd, we use a window of 5 tokens that means that
+The main disadvantage is the amount of tokens which is multiplied by size of window. In rspamd, we use a window of 5 tokens that means that
the number of tokens is about 5 times larger than the amount of words.
Statistical tokens are stored in statfiles which, in turn, are mapped to specific backends. This architecture is displayed in the following image:
@@ -30,15 +30,24 @@ Starting from rspamd 1.0, we propose to use `sqlite3` as backed and `osb` as tok
metainformation in statistics. The following configuration demonstrates the recommended statistics configuration:
~~~ucl
-classifier {
- type = "bayes";
+# Classifier's algorithm is BAYES
+classifier "bayes" {
tokenizer {
name = "osb";
}
+
+ # Unique name used to learn the specific classifier
+ name = "common_bayes";
+
cache {
path = "${DBDIR}/learn_cache.sqlite";
}
+
+ # Minimum number of words required for statistics processing
min_tokens = 11;
+ # Minimum learn count for both spam and ham classes to perform classification
+ min_learns = 200;
+
backend = "sqlite3";
languages_enabled = true;
statfile {
@@ -67,15 +76,19 @@ It is also possible to create custom lua scripts to use customized user or langu
of such a script for extracting domain names from recipients organizing thus per-domain statistics:
~~~ucl
- classifier {
- tokenizer {
- name = "osb";
- }
- name = "bayes2";
- min_tokens = 11;
- backend = "sqlite3";
- per_language = true;
- per_user = <<EOD
+classifier "bayes" {
+ tokenizer {
+ name = "osb";
+ }
+
+ name = "bayes2";
+
+ min_tokens = 11;
+ min_learns = 200;
+
+ backend = "sqlite3";
+ per_language = true;
+ per_user = <<EOD
return function(task)
local rcpt = task:get_recipients(1)
@@ -89,15 +102,15 @@ end
return nil
end
EOD
- statfile {
- path = "/tmp/bayes2.spam.sqlite";
- symbol = "BAYES_SPAM2";
- }
- statfile {
- path = "/tmp/bayes2.ham.sqlite";
- symbol = "BAYES_HAM2";
- }
+ statfile {
+ path = "/tmp/bayes2.spam.sqlite";
+ symbol = "BAYES_SPAM2";
+ }
+ statfile {
+ path = "/tmp/bayes2.ham.sqlite";
+ symbol = "BAYES_HAM2";
}
+}
~~~
## Applying per-user and per-language statistics
@@ -114,42 +127,48 @@ It is different from 1.0 version where the second approach was used for both cas
Rspamd can learn and check multiple classifiers for a single message. This might be useful, for example, if you have common and per-user statistics. It is even possible to use the same statfiles for these purposes. Classifiers **might** have the same symbols (though this is not recommended), and they should have a **unique** `name` attribute that is used for learning. Here is an example of such a configuration:
~~~ucl
- classifier {
- tokenizer {
- name = "osb";
- }
- name = "bayes_user";
- min_tokens = 11;
- backend = "sqlite3";
- per_language = true;
- per_user = true;
- statfile {
- path = "/tmp/bayes.spam.sqlite";
- symbol = "BAYES_SPAM_USER";
- }
- statfile {
- path = "/tmp/bayes.ham.sqlite";
- symbol = "BAYES_HAM_USER";
- }
+classifier "bayes" {
+ tokenizer {
+ name = "osb";
}
- classifier {
- tokenizer {
- name = "osb";
- }
- name = "bayes";
- min_tokens = 11;
- backend = "sqlite3";
- per_language = true;
- statfile {
- path = "/tmp/bayes.spam.sqlite";
- symbol = "BAYES_SPAM";
- }
- statfile {
- path = "/tmp/bayes.ham.sqlite";
- symbol = "BAYES_HAM";
- }
+ name = "users";
+ min_tokens = 11;
+ min_learns = 200;
+ backend = "sqlite3";
+ per_language = true;
+ per_user = true;
+
+ statfile {
+ path = "/tmp/bayes.spam.sqlite";
+ symbol = "BAYES_SPAM_USER";
+ }
+ statfile {
+ path = "/tmp/bayes.ham.sqlite";
+ symbol = "BAYES_HAM_USER";
}
+}
+
+classifier "bayes" {
+ tokenizer {
+ name = "osb";
+ }
+
+ name = "common";
+ min_tokens = 11;
+ min_learns = 200;
+ backend = "sqlite3";
+ per_language = true;
+
+ statfile {
+ path = "/tmp/bayes.spam.sqlite";
+ symbol = "BAYES_SPAM";
+ }
+ statfile {
+ path = "/tmp/bayes.ham.sqlite";
+ symbol = "BAYES_HAM";
+ }
+}
~~~
To learn specific classifier, you can use `-c` option for `rspamc` (or `Classifier` HTTP header):
@@ -162,12 +181,14 @@ To learn specific classifier, you can use `-c` option for `rspamc` (or `Classifi
From version 1.1, it is also possible to specify redis as a backend for statistics and for the cache of learned messages. Redis is recommended for clustered configurations because it allows simultaneous learning and checks and is also very fast. To set up redis, use the `redis` backend for a classifier (the cache is set to the same servers accordingly).
~~~ucl
- classifier {
+ classifier "bayes" {
tokenizer {
name = "osb";
}
+
name = "bayes";
min_tokens = 11;
+ min_learns = 200;
backend = "redis";
servers = "localhost:6379";
#write_servers = "localhost:6379"; # If needed another servers for learning