From 6d44c6b5ce1e47674248dd074c77d847b3b8f8f2 Mon Sep 17 00:00:00 2001
From: Vsevolod Stakhov
Date: Mon, 17 May 2010 18:47:36 +0400
Subject: [PATCH] * Fix default config file * Add chapters about configuration of modules, classifiers and about rspamc protocol

---
 doc/rspamd.texi   | 349 ++++++++++++++++++++++++++++++++++++++++++++--
 rspamd.xml.sample |  22 ++-
 2 files changed, 345 insertions(+), 26 deletions(-)

diff --git a/doc/rspamd.texi b/doc/rspamd.texi
index 125c01a3e..16292ea70 100644
--- a/doc/rspamd.texi
+++ b/doc/rspamd.texi
@@ -621,9 +621,9 @@ hashes. These types of workers has some common parameters:
 @multitable @columnfractions .2 .8
 @headitem Parameter @tab Mean
-@item type
+@item @emph{}
 @tab Type of worker (normal, controller, lmtp or fuzzy)
-@item bind_socket
+@item @emph{}
 @tab Socket credits to bind this worker to. Inet and unix sockets are supported:
 @example
 localhost:11333
@@ -636,12 +636,12 @@ available inet interfaces:
 *:11333
 @end example
 @noindent
-@item count
+@item @emph{}
 @tab Number of worker processes of this type. By default this number is equialent to number of logical processors in system.
-@item maxfiles
+@item @emph{}
 @tab Maximum number of file descriptors available to this worker process.
-@item maxcore
+@item @emph{}
 @tab Maximum size of core file that would be dumped in cause of critical errors (in mega/kilo/giga bytes).
 @end multitable
@@ -650,25 +650,25 @@ Also each of workers types can have specific parameters:
 @itemize @bullet
 @item Normal worker:
 @itemize @bullet
-@item @var{custom_filters} - path to dynamically loaded plugins that would do real
+@item @var{} - path to dynamically loaded plugins that would do real
 check of incoming messages. These modules are described further.
-@item @var{mime} - if this parameter is "no" than this worker assumes that incoming
+@item @var{} - if this parameter is "no" than this worker assumes that incoming
 messages are in non-mime format (e.g.
forum's messages) and standart mime headers are added to them.
 @end itemize
 @item Controller worker:
 @itemize @bullet
-@item @var{password} - a password that would be used to access to contorller's
+@item @var{} - a password that would be used to access to contorller's
 privilleged commands.
 @end itemize
 @item Fuzzy worker:
 @itemize @bullet
-@item @var{hashfile} - a path to file where fuzzy hashes would be permamently stored.
-@item @var{use_judy} - if libJudy is present in system use it for faster storage.
-@item @var{frequent_score} - if judy is not turned on use this score to place hashes
+@item @var{} - a path to file where fuzzy hashes would be permamently stored.
+@item @var{} - if libJudy is present in system use it for faster storage.
+@item @var{} - if judy is not turned on use this score to place hashes
 with score that is more than this value to special faster list (this is designed to increase lookup speed for frequent hashes).
-@item @var{expire} - time to expire of fuzzy hashes after their placement in storage.
+@item @var{} - time to expire of fuzzy hashes after their placement in storage.
 @end itemize
 @end itemize
@@ -694,5 +694,330 @@ controller's commands and parameters for fuzzy storage. Default config provides
 reasonable values of this parameters (except password of course), so for basic configuration you may just replace controller's password to more secure one.
+@section Classifiers configuration.
+
+@subsection Common classifiers options.
+
+Each classifier has a mandatory option @var{type} that defines the internal algorithm
+that is used for classification. Currently only @code{winnow} is supported. You can
+read a theoretical description of the algorithm here:
+@url{http://www.siefkes.net/papers/winnow-spam.pdf}
+
+The common classifier configuration consists of base classifier parameters and
+definitions of two (or more than two) statfiles.
During classification rspamd
+checks each statfile in the classifier and selects those that have a higher
+probability/weight than the others. If all statfiles have zero weight, the classifier
+does not add any symbols. The common classifier options are:
+@multitable @columnfractions .2 .8
+@headitem Tag @tab Meaning
+@item @var{}
+@tab Tokenizer used to extract tokens from messages. Currently only the @emph{osb}
+tokenizer is supported.
+@item @var{}
+@tab Metric into which this classifier inserts its symbol.
+@end multitable
+
+The option @var{min_tokens} is also supported; it specifies the minimum number of tokens to
+work with (this is useful to avoid classifying short messages, since statistics
+are practically useless for a small number of tokens). Here is an example of a base
+classifier config:
+@example
+
+ osb-text
+ default
+
+
+ ...
+
+
+@end example
+
+@subsection Statfiles options.
+
+The most common statfile options are @var{symbol} and @var{size}. The first one defines
+which symbol is inserted if this statfile has the maximal weight inside the
+classifier, and @var{size} defines the statfile size on disk and in memory. Note that
+statfiles are mapped directly into memory, so you should pay attention to the
+parameter @var{statfile_pool_size} of the main section, which defines the maximum amount
+of memory for mapping statistic files. Also note that statistic files are
+of constant size: if you define a 100 megabyte statfile it would occupy 100
+megabytes of disk space and 100 megabytes of memory when it is used (mapped).
+Each statfile is indexed by tokens and contains so-called "token chains". This
+mechanism is described further, but note that each statfile has the parameter
+"free tokens" that defines how much space is available for new tokens. If
+a statfile has no free space, the least used tokens are removed from the
+statfile.
+
+Here is a list of common statfile options:
+@multitable @columnfractions .2 .8
+@headitem Tag @tab Meaning
+@item @var{}
+@tab Defines the symbol to insert for this statfile.
+@item @var{}
+@tab Size of this statfile in bytes (kilo/mega/giga bytes).
+@item @var{}
+@tab Filesystem path to the statistic file.
+@item @var{}
+@tab Defines the weight normalization structure. Can be a lua function name or
+an internal normalizer. The internal normalizer is defined in the format
+"internal:" where max_weight is a fractional number that limits the
+maximum weight of this statfile's symbol (this is the so-called dynamic weight).
+@item @var{}
+@tab Defines binlog affinity: master or slave. This option is used for statfile
+binary sync, which is described further.
+@item @var{}
+@tab Defines the credentials of the binlog master for this statfile.
+@item @var{}
+@tab Defines the rotate time for the binlog.
+@end multitable
+
+Internal normalization of statfile weight works in this way:
+@itemize @bullet
+@item @math{R_{score} = 1} when @math{W_{statfile} < 1}
+@item @math{R_{score} = W_{statfile} ^ 2} when @math{1 < W_{statfile} < max / 2}
+@item @math{R_{score} = W_{statfile}} when @math{max / 2 < W_{statfile} < max}
+@item @math{R_{score} = max} when @math{W_{statfile} > max}
+@end itemize
+
+The final result weight would be: @math{weight = R_{score} * W_{factor}}.
+Here is a sample classifier configuration with two statfiles that can be used for
+spam/ham classification:
+
+@example
+
+ -1.00
+ 1.00
+...
+
+
+
+
+ osb-text
+ default
+
+
+ WINNOW_HAM
+ 100M
+ /var/run/rspamd/data.ham
+ internal:3
+
+
+ WINNOW_SPAM
+ 100M
+ /var/run/rspamd/data.spam
+ internal:3
+
+
+
+@end example
+@noindent
+In this sample we define a classifier that contains two statfiles:
+@emph{WINNOW_SPAM} and @emph{WINNOW_HAM}. Each statfile is 100 megabytes in size
+(so together they occupy 200 megabytes while classifying). Each statfile also has a maximum
+weight of 3, so with these factors (-1 for WINNOW_HAM and 1 for WINNOW_SPAM) the
+resulting weight of the symbols would be 0..3 for @emph{WINNOW_SPAM} and 0..-3 for
+@emph{WINNOW_HAM}.
+
+@section Modules config.
+
+@subsection Lua modules loading.
+To load custom Lua modules, use the @emph{} section:
+@example
+
+ /usr/local/etc/rspamd/plugins/lua
+
+@end example
+@noindent
+Each @emph{} directive defines a path to Lua modules. If this is a
+directory, all @code{*.lua} files inside that directory are loaded. If
+this is a file, it is loaded directly.
+
+@subsection Modules configuration.
+Each module can have its own config section (this is true not only for internal
+modules but also for Lua modules). Such a section is called @emph{} with
+the mandatory attribute @emph{"name"}. Each module can be configured by
+@emph{