
* Add modules documentation

tags/0.3.0
Vsevolod Stakhov 14 years ago
parent
commit
39e5de3dee
1 changed files with 469 additions and 0 deletions
doc/rspamd.texi

@@ -1366,4 +1366,473 @@ servers rspamd would select upstream by hash of fuzzy hash). Also storage can
contain several lists identified by number. Each hash has its own weight that
allows to set up dynamic rules that add different score from different hashes.

@chapter Rspamd modules.

@section Introduction.

This chapter describes the modules that are shipped with rspamd. Here you can
find details about module configuration, how the modules work, and tips for
making spam filtering effective. The first sections describe internal modules
written in C: regexp (regular expressions), surbl (black lists for URLs),
fuzzy_check (checks for fuzzy hashes), chartable (checks for character sets in
messages) and emails (checks for blacklisted email addresses in messages).
Modules can be configured either in lua or in the config file itself.

@subsection Lua configuration.
You may use lua to set configuration options for modules. With lua you can
write rather complex rules that contain not only plain text values but also
lua functions that are called while processing messages. To load a lua
configuration file, add the following line to rspamd.xml:
@example
<lua src="/usr/local/etc/rspamd/lua/my.lua">fake</lua>
@end example
@noindent
It is possible to load several scripts this way. Inside a lua file a global
table named @var{config} is defined. This table should contain configuration
options for modules, indexed by module name. This can be written this way:
@example
config['module_name'] = @{@}
local mconfig = config['module_name']

mconfig['option_name'] = 'option value'

local a = 'aa'
local b = 'bb'

mconfig['other_option'] = string.format('%s, %s', a, b)
@end example
@noindent
In this simple example we define a new element of the @var{config} table that
is associated with the module named 'module_name'. We assign an empty table
(@code{@{@}}) to it and bind it to the local variable mconfig. Then we set some
elements of this table, which is equivalent to setting module options like this:
@example
option_name = option_value
other_option = aa, bb
@end example
@noindent
You may also assign functions to elements of module tables. Such a function
should accept one argument, the worker task object, and return a result
specific to that option: a number, a string or a boolean. This can be shown
with a simple example:
@example
local function test (task)
    if task:get_ip() == '127.0.0.1' then
        return 1
    else
        return 0
    end
end

mconfig['some_option'] = test
@end example
In this example we assign to the module option 'some_option' a function that
checks the message's ip and returns 1 if that ip is '127.0.0.1' and 0 otherwise.

So using lua for configuration helps both to write complex rules and to
structure them: you can place options for specific modules into separate files
and load them with the lua function @code{dofile} (or add another @code{<lua>}
tag to rspamd.xml).
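
As a sketch of such a layout, a top-level script can pull in one file per
module. The directory and the file names below are hypothetical, and each
included file is assumed to fill its own part of the @var{config} table:
@example
-- my.lua: loaded from rspamd.xml by a <lua> tag
-- (the directory and file names are illustrative assumptions)
local confdir = '/usr/local/etc/rspamd/lua/'

dofile(confdir .. 'regexp.lua')  -- rules for the regexp module
dofile(confdir .. 'surbl.lua')   -- options for the surbl module
@end example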

@subsection XML configuration.

Options for rspamd modules can be set in the xml file too. This is suitable for
simple and/or temporary rules. Complex rules are possible there as well, but
this is not recommended, as it makes the xml file hard to read and edit. Here
is a simple example of module config options:
@example
<module name="module_name">
<option name="option_name">option_value</option>
<option name="other_option">aa, bb</option>
</module>
@end example
@noindent
Note that you need to encode xml entities: for example, @code{&} must be
written as @code{&amp;}. Also only utf8 encoding is allowed. In the sample
rspamd configuration all modules except the regexp module are configured via
xml, as they have only simple settings, while the regexp module has rules that
are sometimes rather complex.
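
If option values for the xml file are generated by a script, the encoding can
be done by a small helper. The following is only a sketch in plain lua; the
function name @code{xml_escape} is hypothetical and is not part of rspamd:
@example
-- Escape the characters that XML treats as markup.
-- '&' must be replaced first, otherwise the entities produced
-- for '<' and '>' would be escaped a second time.
local function xml_escape(s)
  s = s:gsub('&', '&amp;')
  s = s:gsub('<', '&lt;')
  s = s:gsub('>', '&gt;')
  return s
end

print(xml_escape('rule1 & !rule2'))  -- rule1 &amp; !rule2
@end example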

@section Regexp module.

@subsection Introduction.
Regexp module is one of the most important rspamd modules. It loads regular
expressions and filters messages according to them. It is also possible to
combine regexps into logical expressions to create complex filtering rules.
The following logical operators are allowed:
@itemize @bullet
@item & - logical @strong{AND} function
@item | - logical @strong{OR} function
@item ! - logical @strong{NOT} function
@end itemize
It is also possible to use brackets to set priorities in expressions. Regexp
module operates with @emph{regexp items} that can be combined with logical
operators into logical @emph{regexp expressions}. Each expression is associated
with its symbol, and if the expression evaluates to true for a message, the
symbol is inserted. Note that rspamd optimizes logical expressions internally
(for example, in the expression 'rule1 & rule2' rule2 is not evaluated if
rule1 is false) and keeps an internal regexp cache (so if rule1 and rule2 have
common items, these are evaluated only once). Take this into consideration if
you need to optimize the speed of your rules.
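
The effect of both optimizations can be illustrated with a standalone lua
sketch. It does not use any rspamd code: the @code{item} wrapper, the cache
table and the two sample matchers are assumptions made only for this
illustration, and lua's own @code{and}/@code{or} operators play the role of
@code{&} and @code{|}:
@example
-- Standalone sketch: count how many items are really evaluated.
local evaluations = 0
local cache = @{@}   -- results of items for the current message

-- Wrap a matcher so that its result is computed once and then reused.
local function item(name, matcher)
  return function (text)
    if cache[name] == nil then
      evaluations = evaluations + 1
      cache[name] = matcher(text)
    end
    return cache[name]
  end
end

local rule1 = item('rule1', function (t)
  return t:find('evil', 1, true) ~= nil
end)
local rule2 = item('rule2', function (t)
  return t:find('free', 1, true) ~= nil
end)

local text = 'an ordinary message'
local both = rule1(text) and rule2(text)  -- rule1 is false, rule2 is skipped
local any = rule1(text) or rule2(text)    -- rule1 is taken from the cache

print(evaluations)  -- 2
@end example
@noindent
Without the cache and the short-circuit evaluation the two expressions would
need four evaluations instead of two.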

@subsection Regular expressions.
Rspamd uses perl compatible regular expressions. You may read about perl regular
expression syntax here: @url{http://perldoc.perl.org/perlre.html}. In rspamd
regular expressions must be enclosed in slashes:
@example
/^\\d+$/
@end example
@noindent
If the '/' symbol has to appear inside a regular expression, it must be escaped:
@example
/^\\/\\w+$/
@end example
@noindent
After the last slash it is possible to place regular expression modifiers:
@multitable @columnfractions 0.1 0.9
@headitem Modifier @tab Meaning
@item @strong{i} @tab Ignore case for this expression.
@item @strong{m} @tab Treat this expression as multiline.
@item @strong{s} @tab Let @emph{.} match all characters including newline
characters (should be used with the @strong{m} flag).
@item @strong{x} @tab Treat this expression as an extended regexp.
@item @strong{u} @tab Perform ungreedy matches.
@item @strong{o} @tab Optimize the regular expression.
@item @strong{r} @tab Treat this expression as @emph{raw} (this is relevant for
the utf8 mode of rspamd).
@item @strong{H} @tab Search for the expression in the message's headers.
@item @strong{X} @tab Search for the expression in the message's raw headers
(without mime decoding).
@item @strong{M} @tab Search for the expression in the whole message (must be
used carefully, as @strong{the whole message} is checked with this expression).
@item @strong{P} @tab Search for the expression in all text parts.
@item @strong{U} @tab Search for the expression in all urls.
@end multitable

You can combine flags with each other:
@example
/^some text$/iP
@end example
@noindent
Every regexp must have a type: H, X, M, P or U, as rspamd has to know where to
search for the specified pattern. Header regexps (H and X) have a special
syntax if you need to check a specific header, for example the @emph{From}
header:
@example
From=/^evil.*$/Hi
@end example
@noindent
If the header name is not specified, all headers are matched. Raw header
matching is useful for searching for mime specific headers like MIME-Version.
The problem is that gmime, which is used for mime parsing, adds some headers
implicitly, for example @emph{MIME-Version}, so you should match them using raw
headers. Also, if a header's value is encoded (base64 or quoted-printable
encoding), you can search the decoded version using the H modifier and the raw
version using the X modifier. This is useful for finding bad encoding types or
unnecessary encoding.
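
When rules are written in lua, regexp items are ordinary strings, so they can
be assembled from their parts. The helper below is a sketch only: the name
@code{regexp_item} is hypothetical, and the pattern is assumed to contain no
slashes that are already escaped:
@example
-- Build a regexp item from header name, pattern and flags.
-- header may be nil for items that are not bound to a header.
local function regexp_item(header, pattern, flags)
  -- a literal '/' inside the pattern has to be escaped
  local escaped = pattern:gsub('/', '\\/')
  local item = '/' .. escaped .. '/' .. flags
  if header then
    item = header .. '=' .. item
  end
  return item
end

print(regexp_item('From', '^evil.*$', 'Hi'))  -- From=/^evil.*$/Hi
print(regexp_item(nil, '^some text$', 'iP'))  -- /^some text$/iP
@end example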

@subsection Internal function.
Rspamd provides several internal functions that simplify message processing.
You can use internal functions as items in logical expressions because, like
regular expressions, they return a logical value (true or false). Here is the
list of internal functions with their arguments:
@multitable @columnfractions 0.3 0.2 0.5
@headitem Function @tab Arguments @tab Description
@item header_exists
@tab header name
@tab Returns true if specified header exists.

@item compare_parts_distance
@tab number
@tab If the message has two parts (text/plain and text/html), compares how much
they differ (html parts are compared with tags stripped). The difference is a
number in percent (0 means identical parts and 100 means totally different
parts). If the difference is greater than the given number, this function
returns true.

@item compare_transfer_encoding
@tab string
@tab Compares header Content-Transfer-Encoding with specified string.

@item content_type_compare_param
@tab param_name, param_value
@tab Compares specified parameter of Content-Type header with regexp or certain
string:
@example
content_type_compare_param(Charset, /windows-\d+/)
content_type_compare_param(Charset, ascii)
@end example
@noindent

@item content_type_has_param
@tab param_name
@tab Returns true if content-type has specified parameter.

@item content_type_is_subtype
@tab subtype_name
@tab Returns true if content-type is of the specified subtype (for example, for
text/plain the subtype is 'plain').

@item content_type_is_type
@tab type_name
@tab Returns true if content-type is of the specified type (for example, for
text/plain the type is 'text'):
@example
content_type_is_type(text)
content_type_is_subtype(/?.html/)
@end example
@noindent

@item regexp_match_number
@tab number,[regexps list]
@tab Returns true if the specified number of regexps match for this message.
This can be used for making rules when you do not know which regexps should
match, but the symbol should be inserted if 2 of them match. For example:
@example
regexp_match_number(2, /^some evil text.*$/Pi, From=/^hacker.*$/H, header_exists(Subject))
@end example
@noindent
@item has_only_html_part
@tab nothing
@tab Returns true when the message has only an HTML part.

@item compare_recipients_distance
@tab number
@tab Like compare_parts_distance, calculates the difference between recipients.
The number is used as the minimum difference in percent. Note that this
function checks the distance only when there are more than 5 recipients in the
message.

@item is_recipients_sorted
@tab nothing
@tab Returns true if the recipients list is sorted. This function also works
only for more than 5 recipients.

@item is_html_balanced
@tab nothing
@tab Returns true when all HTML tags in message are balanced.

@item has_html_tag
@tab tag_name
@tab Returns true if tag 'tag_name' exists in message.

@end multitable

These internal functions could easily be implemented in lua, but I decided to
make them built-in as they are widely used in our rules. This list may be
extended in the future.
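
To show the idea behind @code{regexp_match_number}, here is a plain lua sketch
that works on results that are already computed. It assumes that the function
is true when @emph{at least} the given number of items match; the name
@code{match_number} is hypothetical and this is not the built-in
implementation:
@example
-- Return true when at least `needed' of the given results are true.
local function match_number(needed, ...)
  local matched = 0
  for i = 1, select('#', ...) do
    if select(i, ...) then
      matched = matched + 1
    end
  end
  return matched >= needed
end

print(match_number(2, true, false, true))   -- true
print(match_number(2, true, false, false))  -- false
@end example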

@subsection Conclusion.
Rspamd regexp module is a powerful tool for matching different patterns in
messages. You may use logical expressions of regexps and internal rspamd
functions to make rules. Rspamd is shipped with many rules for the regexp
module (most of them are taken from spamassassin rules, as rspamd was originally
a replacement for spamassassin), so you can look at them in the
ETCDIR/rspamd/lua/regexp directory. There are many built-in rules with detailed
comments. Also note that if you add a logical rule to the XML file, you need to
escape all XML entities (like the @emph{&} operator). When you build complex
rules from many parts, do not forget to put brackets around the parts inside
the expression, as otherwise you cannot predict the order of checks. Rspamd
regexp module has internal logical optimization and a regexp cache, so you may
use an identical regexp many times: it is matched only once. In a logical
expression you may optimize performance by putting the regexp that is likely
TRUE first in an @emph{OR} expression and the regexp that is likely FALSE first
in an @emph{AND} expression. A number of internal functions can simplify
complex expressions and help to build common filters. Lua functions can be
used in rules as well (they should return a boolean value).

@section SURBL module.

Surbl module is designed for checking urls via blacklists. You may read about
surbls at @url{http://www.surbl.org}. Here is the sequence of operations that is
done by surbl module:
@enumerate 1
@item Extract all urls from the message and get the domain of each url.
@item Check each domain against a special list called '2tld': extract 3
components for domains from that list and 2 components for domains that are
not listed:
@example
http://virtual.somehost.domain.com/some_path
-> somehost.domain.com if domain.com is in 2tld list
-> domain.com if not in 2tld
@end example
@noindent
@item Remove duplicates from the list of domains.
@item For each registered surbl make a dns request of the form
@emph{domain.surbl_name}.
@item Get the result and insert the symbol if that name resolves.
@item It is possible to examine bits in the returned IP address and insert a
different symbol for each bit that is set in the result.
@end enumerate
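
The domain extraction step of the sequence above can be sketched in plain lua.
This is an illustration and not the code of the module: the function name
@code{surbl_domain} is hypothetical, the host name is assumed to be already
extracted from the url, and the '2tld' list is represented as a lua table
used as a set:
@example
-- two_tld is a set of domains: two_tld['domain.com'] = true
local function surbl_domain(host, two_tld)
  local labels = @{@}
  for label in host:gmatch('[^.]+') do
    labels[#labels + 1] = label
  end
  local n = #labels
  if n < 3 then
    return host
  end
  local base = labels[n - 1] .. '.' .. labels[n]
  if two_tld[base] then
    -- listed in 2tld: keep 3 components
    return labels[n - 2] .. '.' .. base
  end
  return base
end

local host = 'virtual.somehost.domain.com'
print(surbl_domain(host, @{ ['domain.com'] = true @}))
-- somehost.domain.com
print(surbl_domain(host, @{@}))
-- domain.com

-- name that would be resolved for a surbl called multi.surbl.org
print(surbl_domain(host, @{@}) .. '.multi.surbl.org')
-- domain.com.multi.surbl.org
@end example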
All DNS requests are done asynchronously, so you need not worry about blocking.
SURBL module has several configuration options:
@itemize @bullet
@item @emph{metric} - metric to insert symbol to.
@item @emph{2tld} - list of domains for which 3 components of the domain name
are extracted.
@item @emph{max_urls} - maximum number of urls to check.
@item @emph{whitelist} - map of domains for which surbl checks would not be performed.
@item @emph{suffix} - a name of surbl. It is possible to add several suffixes:
@example
suffix_RAMBLER_URIBL = insecure-bl.rambler.ru
or in xml:
<param name="suffix_RAMBLER_URIBL">insecure-bl.rambler.ru</param>
@end example
@noindent
It is possible to add %b to symbol name for checking specific bits:
@example
suffix_%b_SURBL_MULTI = multi.surbl.org
then you may define replacements for %b in the symbol name for each bit in the result:
bit_2 = SC -> sc.surbl.org
bit_4 = WS -> ws.surbl.org
bit_8 = PH -> ph.surbl.org
bit_16 = OB -> ob.surbl.org
bit_32 = AB -> ab.surbl.org
bit_64 = JP -> jp.surbl.org
@end example
@noindent
So we make one DNS request and check for specific lists by checking bits in
the resulting ip. This is described on the surbl page:
@url{http://www.surbl.org/lists.html#multi}. Note that the resulting symbol
would NOT contain %b, as it is replaced by the bit name. Also, if several bits
are set, several corresponding symbols are added.
@end itemize
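
How one reply is turned into several symbols can be sketched in plain lua. The
sketch assumes that the bits are carried in the last octet of the returned
address and that the symbol template is the option name without the
@code{suffix_} prefix; the function name @code{surbl_symbols} is hypothetical:
@example
-- bit value -> replacement for %b, as in the bit_N options above
local bit_names = @{ [2] = 'SC', [4] = 'WS', [8] = 'PH',
                    [16] = 'OB', [32] = 'AB', [64] = 'JP' @}
local bit_order = @{ 2, 4, 8, 16, 32, 64 @}

local function surbl_symbols(template, octet)
  local symbols = @{@}
  for _, bit in ipairs(bit_order) do
    -- test a single bit with plain arithmetic
    if math.floor(octet / bit) % 2 == 1 then
      symbols[#symbols + 1] = (template:gsub('%%b', bit_names[bit]))
    end
  end
  return symbols
end

-- a reply of 127.0.0.10 has bits 2 and 8 set
print(table.concat(surbl_symbols('%b_SURBL_MULTI', 10), ', '))
-- SC_SURBL_MULTI, PH_SURBL_MULTI
@end example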

Also surbl module can use a redirector - a special daemon that resolves
redirects. It uses HTTP/1.0 for requests: it accepts a url and returns the
resolved result. The redirector is shipped with rspamd but is not enabled by
default. You may enable it at the configure stage, but note that it requires
many perl modules to work. Rspamd redirector is described in detail later.
Here are the surbl options for working with the redirector:
@itemize @bullet
@item @emph{redirector}: address of redirector (in format host:port)
@item @emph{redirector_connect_timeout} (seconds): redirector connect timeout (default: 1s)
@item @emph{redirector_read_timeout} (seconds): timeout for reading data (default: 5s)
@item @emph{redirector_hosts_map} (map string): map that contains domains to check with redirector
@end itemize

So surbl module is an easy way to check a message's urls, and it may be used in
every configuration, as it filters a rather large amount of email spam and scam.

@section SPF module.

SPF module is designed to check the spf records of sender domains. SPF records
are placed in TXT DNS records of domains that have spf enabled. You may read
about SPF at @url{http://en.wikipedia.org/wiki/Sender_Policy_Framework}.
There are 3 possible results of an spf check for a domain:
@itemize @bullet
@item ALLOW - this ip is allowed to send messages for this domain
@item FAIL - this ip is @strong{not} allowed to send messages for this domain
@item SOFTFAIL - it is unknown whether this ip is allowed to send mail for this
domain
@end itemize
SPF supports different mechanisms for checking: dns subrequests, macros,
includes, blacklists. Rspamd supports most of them. Also, for security reasons
there are internal limits on DNS subrequests and on the recursion depth of
inclusions. SPF module supports only a few options:
@itemize @bullet
@item @emph{metric} (string): metric to insert symbol (default: 'default')
@item @emph{symbol_allow} (string): symbol to insert (default: 'R_SPF_ALLOW')
@item @emph{symbol_fail} (string): symbol to insert (default: 'R_SPF_FAIL')
@item @emph{symbol_softfail} (string): symbol to insert (default: 'R_SPF_SOFTFAIL')
@end itemize

@section Chartable module.

Chartable is a simple module that detects different charsets in a message. This
module is aimed at protecting from emails that contain characters from
different character sets that look like each other. Chartable module works
differently in raw and utf modes: in utf mode it detects characters from
different unicode tables, and in raw mode it only distinguishes ASCII and
non-ASCII characters. Configuration of this module is very simple:
@itemize @bullet
@item @emph{metric} (string): metric to insert symbol (default: 'default')
@item @emph{symbol} (string): symbol to insert (default: 'R_BAD_CHARSET')
@item @emph{threshold} (double): value that is used as the threshold for the
expression
@math{N_{charset-changes} / N_{chars}}
(e.g. if the threshold is 0.1, then a charset change should occur more often
than once per 10 characters), default: 0.1
@end itemize
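
The threshold expression can be illustrated with a plain lua sketch for raw
mode. This is a simplification and not the code of the module: every byte is
counted as a character, including spaces and punctuation, and the function
name @code{charset_change_ratio} is hypothetical:
@example
-- Ratio of ASCII <-> non-ASCII switches to the number of characters.
local function charset_change_ratio(text)
  local changes, chars, prev = 0, 0, nil
  for i = 1, #text do
    local is_ascii = text:byte(i) < 128
    if prev ~= nil and is_ascii ~= prev then
      changes = changes + 1
    end
    prev = is_ascii
    chars = chars + 1
  end
  if chars == 0 then
    return 0
  end
  return changes / chars
end

local threshold = 0.1
-- plain ASCII text: no changes at all
print(charset_change_ratio('hello') > threshold)       -- false
-- alternating bytes: 3 changes in 4 characters
print(charset_change_ratio('a\200b\200') > threshold)  -- true
@end example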

@section Fuzzy check module.

Fuzzy check module provides a client for rspamd fuzzy storage. Fuzzy check can
work with a cluster of rspamd fuzzy storages, and the specific storage is
selected by a hash of the message's fuzzy hash. The available configuration
options are:
@itemize @bullet
@item @emph{metric} (string): metric to insert symbol (default: 'default')
@item @emph{symbol} (string): symbol to insert (default: 'R_FUZZY')
@item @emph{max_score} (double): maximum score to which weights of hashes are
normalized (default: 0 - no normalization)
@item @emph{fuzzy_map} (string): a string that contains a map in the format
@{ fuzzy_key => [ symbol, weight ] @}, where fuzzy_key is the number of a fuzzy
list. The string itself should be in the format
1:R_FUZZY_SAMPLE1:10,2:R_FUZZY_SAMPLE2:1 etc, where the first number is the
fuzzy key, the second field is the symbol to insert and the third is the weight
for normalization
@item @emph{min_length} (integer): minimum length (in characters) of a text part
to be checked for fuzzy hash (default: 0 - no limit)
@item @emph{whitelist} (map string): map of ip addresses that should not be checked
with this module
@item @emph{servers} (string): list of fuzzy servers in format
"server1:port,server2:port" - these servers would be used for checking and
storing fuzzy hashes
@end itemize
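
The format of the @emph{fuzzy_map} string can be made clearer by a plain lua
sketch that splits it into a table. The function name @code{parse_fuzzy_map}
and the field names of the result are hypothetical; the module parses this
option itself, so this is only an illustration of the format:
@example
-- Turn 'key:SYMBOL:weight,key:SYMBOL:weight' into a lua table
-- indexed by the number of the fuzzy list.
local function parse_fuzzy_map(s)
  local map = @{@}
  for key, symbol, weight in s:gmatch('(%d+):([^:,]+):([^,]+)') do
    map[tonumber(key)] = @{ symbol = symbol, weight = tonumber(weight) @}
  end
  return map
end

local map = parse_fuzzy_map('1:R_FUZZY_SAMPLE1:10,2:R_FUZZY_SAMPLE2:1')
print(map[1].symbol)  -- R_FUZZY_SAMPLE1
print(map[2].symbol)  -- R_FUZZY_SAMPLE2
@end example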

@section Forged recipients.

Forged recipients is a lua module that compares recipients provided in the smtp
dialog with recipients from the @emph{To:} header. It is also possible to
compare the @emph{From:} header with the SMTP from. You may set the
@strong{symbol_rcpt} option to define the symbol that is inserted when
recipients differ and @strong{symbol_sender} for the case when senders differ.

@section Maillist.

Maillist is a module that detects whether a message was sent by one of the
popular mailing list systems (ezmlm, mailman and subscribe.ru are among the
supported systems). The module has only one option, @strong{symbol}, that
defines the symbol that is inserted if the message was sent via a mailing list.

@section Once received.

This lua module checks the received headers of a message and inserts a symbol
if only one received header is present (that usually signals that the mail was
sent directly to our MTA). It is also possible to insert a @emph{strict} symbol
that indicates that the host from which we received the message is either
unresolvable or matches bad patterns (like 'dynamic', 'broadband' etc) that
are typical for widely used botnets. Configuration options are:
@itemize @bullet
@item @emph{symbol}: symbol to insert for messages with one received header.
@item @emph{symbol_strict}: symbol to insert for messages with one received
header and containing bad patterns or an unresolvable sender.
@item @emph{bad_host}: defines a pattern that is counted as "bad".
@item @emph{good_host}: defines a pattern that is counted as "good" (no strict
symbol is inserted); note that a "good" pattern has priority over a "bad" one.
@end itemize
You can define several "good" and "bad" patterns for this module.

@section Received rbl.

Received rbl module checks all received headers and makes dns requests to IP
black lists. This can be used for checking whether an email was transferred by
some blacklisted gateway. Here are the available options:
@itemize @bullet
@item @emph{symbol}: symbol to insert if message contains blacklisted received
headers
@item @emph{rbl}: the name of an rbl to check; it is possible to define a
specific symbol for this rbl by adding the symbol name after a colon:
@example
rbl = pbl.spamhaus.org:RECEIVED_PBL
@end example
@end itemize
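
The following plain lua sketch shows how a value of the @emph{rbl} option can
be split and how a lookup name may be built. It assumes the usual convention
of IP black lists, where the octets of the address are reversed and the name
of the list is appended; this convention is not taken from the module itself.
The function names and the default symbol RECEIVED_RBL are hypothetical:
@example
-- Split 'zone:SYMBOL'; use default_symbol when no symbol is given.
local function parse_rbl(option, default_symbol)
  local zone, symbol = option:match('^([^:]+):(.+)$')
  if zone then
    return zone, symbol
  end
  return option, default_symbol
end

-- Build the dns name for an IPv4 address (reversed octets + zone).
local function rbl_query(ip, zone)
  local a, b, c, d = ip:match('^(%d+)%.(%d+)%.(%d+)%.(%d+)$')
  if not a then
    return nil
  end
  return d .. '.' .. c .. '.' .. b .. '.' .. a .. '.' .. zone
end

local zone, symbol = parse_rbl('pbl.spamhaus.org:RECEIVED_PBL', 'RECEIVED_RBL')
print(zone)                          -- pbl.spamhaus.org
print(symbol)                        -- RECEIVED_PBL
print(rbl_query('192.0.2.1', zone))  -- 1.2.0.192.pbl.spamhaus.org
@end example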

@section Conclusion.

Rspamd is shipped with a number of modules that provide basic functionality
for checking emails. You can add custom rules for the regexp module and set the
available parameters of other modules. You may also write your own modules (in
C or Lua), but this is described later in this documentation. You may set
configuration options for modules from lua or from xml, depending on their
complexity. Internal modules are enabled and disabled by the @strong{filters}
configuration option. Lua modules are loaded and usually can be disabled by
removing their configuration section from the xml file or by removing the
corresponding line from the @strong{modules} section.

@bye
