\input texinfo
@settitle "Rspamd Spam Filtering System"
@titlepage

@title Rspamd Spam Filtering System
@subtitle A User's Guide for Rspamd

@author Vsevolod Stakhov


@end titlepage
@contents

@chapter Rspamd purposes and features.

@node introduction
@section Introduction.
The rspamd filtering system was created as a replacement for the popular
@code{spamassassin}
spamd and is designed to be a fast, modular and easily extensible system. The
rspamd core is written in @code{C} using an event-driven paradigm. Plugins for
rspamd can be written in @code{lua}. Rspamd is designed to process connections
completely asynchronously and not to block anywhere in its code. The spam
filtering system consists of several processes, among them:
@itemize @bullet
@item Main process
@item Worker processes
@item Controller process
@item Other processes
@end itemize
The main process manages all other processes, accepts signals from the OS (for
example SIGHUP) and respawns processes of any type if they die. Worker processes
do all the work of filtering e-mail (or HTML messages when rspamd is used as a
non-MIME filter). The controller process is designed to manage rspamd itself
(for example, to get statistics or to train rspamd). Other processes can do
different jobs; currently implemented are an @code{LMTP} worker, which
implements the @code{LMTP} protocol for filtering mail, and a fuzzy hash
storage server.

@node features
@section Features.
The main features of rspamd are:
@itemize @bullet
@item Completely asynchronous filtering that allows a large number of
simultaneous connections.
@item Easily extensible architecture that can be extended by plugins written in
@code{lua} and by dynamically loaded plugins written in @code{c}.
@item Ability to work in a cluster: rspamd is able to synchronize statfiles,
load lists dynamically via HTTP, and use distributed fuzzy hash storage.
@item Advanced statistics: rspamd is now shipped with the winnow-osb classifier,
which provides more accurate statistics than traditional bayesian algorithms
based on single words.
@item Internal optimizer: rspamd first checks the rules that were matched most
often, so during huge spam storms it works very fast, as it checks only the
rules that @emph{can} match and skips all others.
@item Ability to manage the whole cluster by using the controller process.
@item Compatibility with the existing @code{spamassassin} SPAMC protocol.
@item Extended @code{RSPAMC} protocol that allows passing additional data
from the SMTP dialog to the rspamd filter.
@item Internal support of IMAP in the rspamc client for automated learning.
@item Internal support of many anti-spam technologies, among them
@code{SPF} and @code{SURBL}.
@item Active support and development of new features.
@end itemize

@chapter Installation of rspamd.

@node obtaining
@section Obtaining rspamd.

The main rspamd site is @url{http://rspamd.sourceforge.net/, sourceforge}. There
you can obtain the source code package as well as pre-built packages for
different operating systems and architectures. You can also use the
@url{http://mercurial.selenic.com, mercurial} SCM to access the rspamd
development repository, which can be found here:
@url{http://rspamd.hg.sourceforge.net:8000/hgroot/rspamd/rspamd}. Rspamd is
shipped with all modules and a sample config by default, but there are some
requirements for building and running it.

@node requirements
@section Requirements.

To build rspamd from sources you need the @code{CMake} build system. CMake is a
very nice build system and I decided to use it instead of GNU autotools.
CMake can be obtained here: @url{http://cmake.org}. Rspamd also uses gmime and
glib for MIME parsing and many other purposes (note that you are NOT required
to install any GUI libraries: neither glib nor gmime is a GUI library). Gmime
and glib can be obtained from the gnome site: @url{http://ftp.gnome.org/}. For
plugins and the configuration system you also need the lua interpreter and
libraries. They can be obtained from the @url{http://lua.org, official lua
site}. The rspamc client also requires the @code{perl} interpreter, which can be
installed from @url{http://www.perl.org}.

@node building
@section Building and Installation.

The build process of rspamd is rather simple:
@itemize @bullet
@item Configure the rspamd build environment using cmake:
@example
$ cmake .
...
-- Configuring done
-- Generating done
-- Build files have been written to: /home/cebka/rspamd
@end example
@noindent
To set special configuration options you can use ccmake:
@example
$ ccmake .
 CMAKE_BUILD_TYPE
 CMAKE_INSTALL_PREFIX             /usr/local
 DEBUG_MODE                       ON
 ENABLE_GPERF_TOOLS               OFF
 ENABLE_OPTIMIZATION              OFF
 ENABLE_PERL                      OFF
 ENABLE_PROFILING                 OFF
 ENABLE_REDIRECTOR                OFF
 ENABLE_STATIC                    OFF
@end example
@noindent
These options allow building rspamd statically (note that in this case
dynamically loaded plugins are @strong{NOT} supported), linking rspamd with
google performance tools for benchmarking, and setting some other build
flags.
@item Build rspamd sources:
@example
$ make
[  6%] Built target rspamd_lua
[ 11%] Built target rspamd_json
[ 12%] Built target rspamd_evdns
[ 12%] Built target perlmodule
[ 58%] Built target rspamd
[ 76%] Built target test/rspamd-test
[ 85%] Built target utils/expression-parser
[ 94%] Built target utils/url-extracter
[ 97%] Built target rspamd_ipmark
[100%] Built target rspamd_regmark
@end example
@noindent
@item Install rspamd (as superuser):
@example
# make install
Install the project...
...
@end example
@noindent
@end itemize

After installation you will have several new files installed:
@itemize @bullet

@item Binaries:
@itemize @bullet
@item PREFIX/bin/rspamd - main rspamd executable
@item PREFIX/bin/rspamc - rspamd client program
@end itemize
@item Sample configuration files and rules:
@itemize @bullet
@item PREFIX/etc/rspamd.xml.sample - sample main config file
@item PREFIX/etc/rspamd/lua/*.lua - rspamd rules
@end itemize
@item Lua plugins:
@itemize @bullet
@item PREFIX/etc/rspamd/plugins/lua/*.lua - lua plugins
@end itemize

@end itemize
On @code{FreeBSD} systems a start script for running rspamd is also installed in
@emph{PREFIX/etc/rc.d/rspamd.sh}.

@node running
@section Running rspamd.

Rspamd can be started by running the main rspamd executable,
@code{PREFIX/bin/rspamd}. There are several command-line options that can be
passed to rspamd. All of them can be displayed by passing the @option{--help}
argument:
@example
$ rspamd --help
Usage:
  rspamd [OPTION...] - run rspamd daemon

Summary:
  Rspamd daemon version 0.3.0

Help Options:
  -?, --help               Show help options

Application Options:
  -t, --config-test        Do config test and exit
  -f, --no-fork            Do not daemonize main process
  -c, --config             Specify config file
  -u, --user               User to run rspamd as
  -g, --group              Group to run rspamd as
  -p, --pid                Path to pidfile
  -V, --dump-vars          Print all rspamd variables and exit
  -C, --dump-cache         Dump symbols cache stats and exit
  -X, --convert-config     Convert old style of config to xml one
@end example
@noindent

All options are optional: by default rspamd tries to read the
@code{PREFIX/etc/rspamd.xml} config file and runs as a daemon. There is also a
test mode that can be turned on by passing the @option{-t} argument. In test
mode rspamd reads the config file and checks its syntax; if the config file is
OK the exit code is zero, otherwise it is non-zero. Test mode is useful for
testing a new config file without restarting rspamd. With the @option{-V} and
@option{-C} arguments it is possible to dump the variables or the symbols cache
data, respectively. The symbols cache dump can be used to determine which
symbols are most frequent, which are slowest, and to see the real order of
rules inside rspamd. The @option{-X} option can be used to convert an old-style
(pre 0.3.0) config to an xml one:
@example
$ rspamd -c ./rspamd.conf -X ./rspamd.xml
@end example
@noindent
After this command the new xml config is written to the rspamd.xml file.

@node signals
@section Managing rspamd with signals.
First of all, it is important to note that all user signals should be sent to
the rspamd main process and not to its children (for child processes these
signals may have other meanings). There are two ways to determine which process
is the main one:
@itemize @bullet
@item by reading the pidfile:
@example
$ cat pidfile
@end example
@noindent
@item by getting process info:
@example
$ ps auxwww | grep rspamd
nobody 28378  0.0  0.2 49744  9424   rspamd: main process (rspamd)
nobody 64082  0.0  0.2 50784  9520   rspamd: worker process (rspamd)
nobody 64083  0.0  0.3 51792 11036   rspamd: worker process (rspamd)
nobody 64084  0.0  2.7 158288 114200 rspamd: controller process (rspamd)
nobody 64085  0.0  1.8 116304 75228  rspamd: fuzzy storage (rspamd)

$ ps auxwww | grep rspamd | grep main
nobody 28378  0.0  0.2 49744  9424   rspamd: main process (rspamd)
@end example
@noindent
@end itemize

After getting the pid of the main process it is possible to manage rspamd with
signals:
@itemize @bullet
@item SIGHUP - restart rspamd: reread the config file, start new workers (as
well as the controller and other processes), stop accepting connections in old
workers, and reopen all log files. Note that old workers are terminated after
one minute, which should allow them to process all pending requests. All new
requests to rspamd are processed by the newly started workers.
@item SIGTERM - terminate the rspamd system.
@end itemize

These signals may be used in start scripts, as is done in the @code{FreeBSD}
start script. Restarting rspamd is done gracefully: no connections are dropped,
and if the new config is syntactically incorrect the old config is used.
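
The signals described above can also be sent from a script. The following is a
minimal Python sketch, not part of rspamd: it assumes that the pidfile contains
the pid of the main process as plain decimal text, and the function names are
hypothetical.
@example
```python
import os
import signal


def read_main_pid(pidfile_path):
    """Return the pid stored in the pidfile as an integer."""
    with open(pidfile_path) as pidfile:
        return int(pidfile.read().strip())


def reload_rspamd(pidfile_path):
    """Send SIGHUP to the main process so that it rereads its config."""
    os.kill(read_main_pid(pidfile_path), signal.SIGHUP)


def stop_rspamd(pidfile_path):
    """Send SIGTERM to the main process to terminate rspamd."""
    os.kill(read_main_pid(pidfile_path), signal.SIGTERM)
```
@end example
@noindent
Reading the pid from the pidfile rather than from the process list avoids
sending a signal to a worker by mistake.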

@chapter Configuring rspamd.

@node principles
@section Principles of work.

We need to define several terms to explain rspamd configuration. Rspamd
operates with @strong{rules}; each rule defines some actions that should be
performed on a message to obtain a result. The result is called a
@strong{symbol}, a symbolic representation of a rule. For example, if we have a
rule that checks a DNS record for a url contained in a message, we may insert
the resulting symbol if this DNS record is found. Each symbol has several
attributes:
@itemize @bullet
@item name - symbolic name of the symbol (usually uppercase, e.g. MIME_HTML_ONLY)
@item weight - numeric weight of this symbol (this means how important this rule
is), may be negative
@item options - list of symbolic options that define additional information
about processing this rule
@end itemize

Weights of symbols are called @strong{factors}. When a symbol is inserted it is
also possible to define an additional multiplier for its factor. This can be
used for rules that have dynamic weights, for example statistical rules (the
higher the probability, the higher the weight must be).

All symbols and corresponding rules are combined into @strong{metrics}. A metric
defines a group of symbols that serve a common purpose. Each metric has a
maximum weight: if the sum of all rules' results (symbols) is greater than this
limit, the message is considered spam in this metric. The default metric is
called @emph{default}, and rules that do not explicitly specify a metric insert
their results into this default metric.

Let's look at how this technique works:
@enumerate 1
@item First of all, when rspamd starts, each module (lua, internal or external
dynamic module) can register symbols in any defined metric. After this step
rspamd has a cache of symbols for each metric. This cache can be saved to a file
to speed up the process of optimizing the order in which symbols are called.
@item Rspamd gets a message from a client, parses it with the MIME parser and
does other parsing jobs such as extracting text parts and urls and stripping
html tags.
@item For each metric rspamd looks into the metric's cache and selects rules to
check according to their order (this order depends on the frequency of the
symbol, its weight and its execution time).
@item Rspamd calls the rules of the metric as long as the total weight of
symbols in the metric is less than its limit.
@item If the total weight of symbols exceeds the limit, the processing of rules
is stopped and the message is counted as spam in this metric.
@end enumerate
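
The last two steps can be illustrated with a short Python sketch. It is only a
model of the description above and not rspamd's actual code: the function name,
the tuple layout and the convention that a rule returns its multiplier (zero
when it does not match) are assumptions made for illustration.
@example
```python
def process_metric(rules, limit):
    """Run rules in the given order until the metric limit is exceeded.

    rules is a list of (symbol, factor, check) tuples, where check is a
    callable that returns a multiplier (0 means the rule did not match).
    Returns (total_weight, inserted_symbols, is_spam).
    """
    total = 0.0
    inserted = []
    for symbol, factor, check in rules:
        if total > limit:
            break  # limit exceeded: remaining rules are not called
        multiplier = check()
        if multiplier:
            # The symbol's weight is its factor times the multiplier.
            total += factor * multiplier
            inserted.append(symbol)
    return total, inserted, total > limit
```
@end example
@noindent
With a limit of 6 and two matching rules whose factors are 1.5 and 5, the total
becomes 6.5 after the second rule, so any later rules are skipped and the
message is counted as spam in this metric.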

After processing rules rspamd also performs a statistical check of the message.
The rspamd statistics module is presented as a set of @strong{classifiers}. Each
classifier defines an algorithm for statistical checks of messages. A classifier
definition also contains definitions of @strong{statistic files} (or
@strong{statfiles} for short). Each statfile contains a number of patterns that
are extracted from messages. These patterns are put into statfiles during the
learning process. A short example: you define a classifier that contains two
statfiles, @emph{ham} and @emph{spam}. Then you find 10000 messages that are
spam and 10000 messages that are ham, and you train rspamd with these messages.
After this process the @emph{ham} statfile contains patterns from ham messages
and the @emph{spam} statfile contains patterns from spam messages. When you then
check a message against these statfiles, a message that looks like spam gets a
higher probability/weight in the @emph{spam} statfile than in the @emph{ham}
statfile, and the classifier inserts the symbol of the @emph{spam} statfile and
calculates how closely the message resembles the patterns contained in the
@emph{spam} statfile. But rspamd does not limit you to one classifier or two
statfiles. It is possible to define any number of classifiers and any number of
statfiles inside a classifier. This can be useful for personal statistics or for
specific spam patterns. Note that each classifier can insert only one symbol:
the symbol of the statfile with the maximum weight/probability. Also note that
the statfile check is always done after all rules, so statistics can
@strong{correct} the result of rules.
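
The relation between a classifier and its statfiles can be modelled with a toy
Python sketch. Note that this is deliberately @strong{not} the winnow-osb
algorithm: here a pattern is simply a word and the weight is the fraction of
message words found in a statfile. The class and method names are hypothetical.
@example
```python
class ToyClassifier:
    """Toy stand-in for a classifier; it is NOT the winnow-osb algorithm."""

    def __init__(self, statfile_names):
        # One set of learned patterns (here: plain words) per statfile.
        self.statfiles = dict((name, set()) for name in statfile_names)

    def learn(self, statfile_name, message):
        """Put the patterns of a message into the given statfile."""
        self.statfiles[statfile_name].update(message.lower().split())

    def classify(self, message):
        """Return (statfile_name, weight) for the best matching statfile."""
        words = message.lower().split()
        best = None
        for name, patterns in self.statfiles.items():
            hits = sum(1 for word in words if word in patterns)
            weight = hits / len(words) if words else 0.0
            if best is None or weight > best[1]:
                best = (name, weight)
        return best
```
@end example
@noindent
Whatever the number of statfiles, @code{classify} returns a single result,
which mirrors the rule that a classifier inserts only the symbol of the
statfile with the maximum weight.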

Now some words about @strong{modules}. All rspamd rules are contained in
modules. Modules can be internal (like SURBL, SPF, fuzzy check, email and
others) or external, written in the @code{lua} language. In fact there is no
difference in the way the rules of these modules are called:
@enumerate 1
@item Rspamd loads the config and loads the specified modules.
@item Rspamd calls the init function of each module, passing configuration
arguments.
@item Each module examines the configuration arguments and registers its rules
(or does not register them, depending on configuration) in rspamd metrics (or in
a single metric).
@item While processing metrics rspamd calls the registered callbacks for the
module's rules.
@item These rules may insert results into the metric.
@end enumerate
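
The steps above can be sketched in Python as follows. This is only a model of
the registration flow under stated assumptions: the @code{Metric} class, the
module init function, its option names and the symbol name are all hypothetical
and do not correspond to the real rspamd or lua API.
@example
```python
class Metric:
    def __init__(self, name):
        self.name = name
        self.callbacks = []  # (symbol, callback) pairs registered by modules
        self.results = []    # symbols inserted while processing a message

    def register(self, symbol, callback):
        self.callbacks.append((symbol, callback))


def init_hypothetical_module(metric, options):
    """Module init: examine the options and register rules accordingly."""
    if options.get("enabled", True):
        word = options.get("word", "pills")
        metric.register("HYPOTHETICAL_WORD",
                        lambda message: word in message.lower())


def process_message(metric, message):
    """Call every registered callback; matching rules insert their symbol."""
    metric.results = []
    for symbol, callback in metric.callbacks:
        if callback(message):
            metric.results.append(symbol)
    return metric.results
```
@end example
@noindent
The metric only sees callbacks, so it does not matter in which language a
module that registered them was written.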

So there is no actual difference between lua and internal modules: both just
provide callbacks for processing messages. Inside a callback it is also possible
to change the state of message processing. For example, this can be done when it
is required to make a DNS or other network request and wait for the result. So
modules can pause message processing while waiting for some event. This is true
for lua modules as well.

@node config structure
@section Rspamd config file structure.

The rspamd config file is placed in PREFIX/etc/rspamd.xml by default. You can
specify another location by passing the @option{-c} option to rspamd. The rspamd
config file contains configuration parameters in XML format. XML was selected
because it is rather simple to edit manually, simple to generate automatically,
and suitable for dynamic configuration. I've decided to move the rules logic out
of the XML file to keep it small and simple. So rules are defined in the
@code{lua} language and rspamd parameters are defined in the xml file
(rspamd.xml). Configuration rules are included by the @strong{<lua>} tag, which
has a @strong{src} attribute that defines the relative path to the lua file
(relative to the location of rspamd.xml):
@example
<lua src="rspamd/lua/rspamd.lua">fake</lua>
@end example
@noindent
Note that it is not currently possible to have empty tags. I hope this
restriction will be removed in the future. The rspamd xml config consists of
several sections:
@itemize @bullet
@item Main section - section where main config parameters are placed.
@item Workers section - section where workers are described.
@item Classifiers section - section where you define your classification logic.
@item Modules section - a set of sections that describe module rules (in fact
these rules should be in lua code).
@item Factors section - a section where you can set numeric values for symbols.
@item Logging section - a section that describes rspamd logging.
@item Views section - a section that defines rspamd views.
@end itemize

So the common structure of rspamd.xml can be described this way:
@example
<?xml version="1.0" encoding="utf-8"?>
<rspamd>
 <!-- Main section directives -->
 ...
 <!-- Workers directives -->
 <worker>
  ...
 </worker>
 ...
 <!-- Classifiers directives -->
 <classifier>
  ...
 </classifier>
 ...
 <!-- Factors -->
 <factors>
  <factor name="MIME_HTML_ONLY">1.1</factor>
  ...
 </factors>
 <!-- Logging section -->
 <logging>
  <type>console</type>
  <level>info</level>
  ...
 </logging>
 <!-- Views section -->
 <view>
  ...
 </view>
 ...
 <!-- Modules settings -->
 <module name="regexp">
  <option name="test">test</option>
  ...
 </module>
 ...
</rspamd>
@end example

Each of these sections is described in detail below.

@node config atoms
@section Rspamd configuration atoms.

There are several primitive types of rspamd configuration parameters:
@itemize @bullet
@item String - a plain string that defines an option.
@item Number - integer or fractional number (e.g.: 10 or -1.5).
@item Time - amount of time in milliseconds; may have one of these suffixes:
@itemize @bullet
@item @emph{s} - for seconds (e.g. @emph{10s});
@item @emph{m} - for minutes (e.g. @emph{10m});
@item @emph{h} - for hours (e.g. @emph{10h});
@item @emph{d} - for days (e.g. @emph{10d});
@end itemize
@item Size - a numeric representation of size, like a number, but may have a suffix:
@itemize @bullet
@item @emph{k} - 'kilo' - number * 1024 (e.g. @emph{10k});
@item @emph{m} - 'mega' - number * 1024 * 1024 (e.g. @emph{10m});
@item @emph{g} - 'giga' - number * 1024 * 1024 * 1024 (e.g. @emph{1g});
@end itemize
@noindent
Size atoms are used, for example, for memory limits.
@item Lists - path to a dynamic rspamd list (e.g. @emph{http://some.host/some/path}).
@end itemize
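
How time and size atoms map to numbers can be shown with a small Python sketch.
It follows the suffix tables above; the function names are hypothetical, and
accepting fractional values before a suffix is an assumption rather than
documented rspamd behaviour.
@example
```python
# Multipliers for time atoms, expressed in milliseconds.
TIME_SUFFIXES = dict(s=1000, m=60 * 1000, h=60 * 60 * 1000,
                     d=24 * 60 * 60 * 1000)
# Multipliers for size atoms, expressed in bytes.
SIZE_SUFFIXES = dict(k=1024, m=1024 ** 2, g=1024 ** 3)


def _parse_with_suffix(text, suffixes):
    text = text.strip()
    multiplier = 1
    if text and text[-1] in suffixes:
        multiplier = suffixes[text[-1]]
        text = text[:-1]
    return float(text) * multiplier


def parse_time(text):
    """Return a time atom in milliseconds (a bare number is already ms)."""
    return _parse_with_suffix(text, TIME_SUFFIXES)


def parse_size(text):
    """Return a size atom in bytes."""
    return int(_parse_with_suffix(text, SIZE_SUFFIXES))
```
@end example
@noindent
Note that the suffix @emph{m} means minutes for a time atom but 'mega' for a
size atom, so the type of the parameter decides how it is read.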

While practically all atoms are rather trivial to understand, rspamd lists may
cause some confusion. Lists are widely used in rspamd for data that can change
often, for example white or black lists, lists of ip addresses, or lists of
domains. For such purposes it is possible to use files that are fetched either
from the local filesystem (e.g. @code{file:///var/run/rspamd/whitelist}) or
over HTTP (e.g. @code{http://some.host/some/path/list.txt}). Rspamd constantly
watches these files for changes; when using HTTP it also sets the
@emph{If-Modified-Since} header and checks for a @emph{Not modified} reply. So
there is no overhead when lists are not modified, which allows storing huge
lists and distributing them over HTTP. Monitoring of lists is done with some
random delay (jitter), so if you have many rspamd servers in a cluster that
monitor a single list, they check or download it at slightly different times.
The two most common list formats are the @emph{IP list} and the @emph{domains
list}. An IP list consists of ip addresses in dot notation (e.g.
@code{192.168.1.1}) or ip/network pairs in CIDR notation (e.g.
@code{172.16.0.0/16}). Items in lists are separated by newlines. Lines
that begin with the @emph{#} symbol are considered comments and are ignored
while parsing. A domains list is very similar to an ip list, with the difference
that it contains domain names.
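
The IP list format and the jittered polling can be sketched in Python with the
standard @code{ipaddress} module. This is an illustration, not rspamd's parser:
the function names are hypothetical, skipping blank lines is an assumption, and
the polling interval values are placeholders.
@example
```python
import ipaddress
import random


def parse_ip_list(text):
    """Parse an IP list: one address or CIDR network per line, # comments."""
    networks = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # A bare address becomes a single-host network (/32 for IPv4).
        networks.append(ipaddress.ip_network(line, strict=False))
    return networks


def ip_in_list(address, networks):
    """Return True if the address belongs to any entry of the list."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in networks)


def next_check_delay(base_seconds, jitter_seconds):
    """Return the base polling interval plus a random jitter."""
    return base_seconds + random.uniform(0, jitter_seconds)
```
@end example
@noindent
Because every server adds its own random jitter, servers that watch the same
list do not all request it at the same moment.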

@section Main rspamd configuration section.

The main rspamd configuration section contains several definitions that
determine the main parameters of rspamd, for example the path to the pidfile,
the temporary directory, lua includes, several limits, etc. Here is a list of
these directives with explanations:

@multitable @columnfractions .2 .8
@headitem Tag @tab Meaning

@item @var{<tempdir>}
@tab Defines the temporary directory for rspamd. The default is to use the
@env{TEMP} environment variable or @code{/tmp}.

@item @var{<pidfile>}
@tab Path to the rspamd pidfile, where the pid of the main process is stored.
The pidfile is used to manage rspamd from start scripts.

@item @var{<statfile_pool_size>}
@tab Limit of the statfile pool size: the total number of bytes that can be used
for mapping statistic files. Rspamd uses an LRU scheme and unmaps the least
recently used statfile when this limit is reached. It makes sense to set this
variable equal to the total size of all statfiles, but it can be smaller in the
case of dynamic statfiles (for per-user statistics).

@item @var{<filters>}
@tab List of enabled internal filters. Items in this list can be separated by
spaces, semicolons or commas. If an internal filter is not specified in this
list it is not loaded or enabled.

@item @var{<raw_mode>}
@tab Boolean flag that specifies whether rspamd should try to convert all
messages to UTF8 or not. If @var{raw_mode} is enabled all messages are
processed @emph{as is} and are not converted. Raw mode is faster than utf mode
but it may confuse statistics and regular expressions.

@item @var{<lua>}
@tab Defines the path to a lua file that should be loaded for configuration. The
path to this file is defined in the @strong{src} attribute. Text inside the tag
is required but is not parsed (this is a stupid limitation of the parser's
design).
@end multitable
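
The LRU behaviour described for @var{<statfile_pool_size>} can be modelled with
a toy Python class. It only tracks names and sizes and does not map any files;
the class and method names are hypothetical, and the real pool may differ in
details such as how it handles a statfile larger than the whole limit.
@example
```python
from collections import OrderedDict


class StatfilePool:
    """Toy LRU pool that tracks mapped statfiles by their size in bytes."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.mapped = OrderedDict()  # name -> size, least recently used first

    def used_bytes(self):
        return sum(self.mapped.values())

    def access(self, name, size):
        """Map a statfile or mark it as recently used; return evicted names."""
        evicted = []
        if name in self.mapped:
            self.mapped.move_to_end(name)
            return evicted
        # Unmap least recently used statfiles until the new one fits.
        while self.mapped and self.used_bytes() + size > self.max_bytes:
            oldest, _ = self.mapped.popitem(last=False)
            evicted.append(oldest)
        self.mapped[name] = size
        return evicted
```
@end example
@noindent
If the limit equals the total size of all statfiles nothing is ever evicted,
which is why that value is the usual choice when statfiles are not dynamic.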

@section Rspamd logging configuration.

Rspamd has a number of logging variants. First of all, there are three types of
logs supported by rspamd: console logging (log messages are simply written to
the console), file logging (log messages are written to a file) and logging via
syslog. It is also possible to limit logging to a specific level:
@itemize @bullet
@item error - log only critical errors
@item warning - log errors and warnings
@item info - log all non-debug messages
@item debug - log everything, including debug messages (huge amount of logging)
@end itemize
It is also possible to turn on debug messages for specific ip addresses. This
ability is useful for testing.

Some logging types have their own mandatory parameter: the log facility for
syslog (read the @emph{syslog (3)} manual page for details about facilities) and
the log file for file logging. File logging may also be buffered for speed. To
reduce logging noise rspamd detects sequences of identical log messages and
replaces them with the total number of repeats:
@example
#81123(fuzzy): May 11 19:41:54 rspamd file_log_function: Last message repeated 155 times
#81123(fuzzy): May 11 19:41:54 rspamd process_write_command: fuzzy hash was successfully added
@end example
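
The idea of collapsing repeated messages can be sketched in Python as follows.
This is a simplified model and not rspamd's logger: the function name is
hypothetical, and the sketch works on a finished list of messages, whereas a
real logger has to do this while messages arrive.
@example
```python
def collapse_repeats(messages):
    """Collapse runs of identical consecutive messages.

    The first message of a run is kept; further repeats are replaced by a
    single "Last message repeated N times" line.
    """
    output = []
    previous = None
    repeats = 0
    for message in messages:
        if message == previous:
            repeats += 1
            continue
        if repeats:
            output.append("Last message repeated %d times" % repeats)
        output.append(message)
        previous = message
        repeats = 0
    if repeats:
        output.append("Last message repeated %d times" % repeats)
    return output
```
@end example
@noindent
The repeat count is written only when a different message arrives (or at the
end of input), which matches the order of the two log lines shown above.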

Here is a summary of logging parameters:


@multitable @columnfractions .2 .8
@headitem Tag @tab Meaning
@item @var{<type>}
@tab Defines the logging type (file, console or syslog). For the file and syslog
types a mandatory attribute must be present:
@itemize @bullet
@item @emph{filename} - path to log file for file logging type;
@item @emph{facility} - syslog logging facility.
@end itemize

@item @var{<level>}
@tab Defines the logging level (error, warning, info or debug).

@item @var{<log_buffer>}
@tab For file and console logging, defines the size of the buffer in bytes
(kilo, mega or giga bytes) that is used for logging output.

@item @var{<log_urls>}
@tab Flag that defines whether all urls in a message are logged. Useful for
testing.

@item @var{<debug_ip>}
@tab List that contains ip addresses for which debugging is turned on. For
more information about ip lists see @ref{config atoms}.
@end multitable

@bye