author: Vsevolod Stakhov <vsevolod@highsecure.ru> 2015-12-17 11:35:43 +0000
committer: Vsevolod Stakhov <vsevolod@highsecure.ru> 2015-12-17 11:35:43 +0000
commit: 8235c96ab94763128191bed32a4311c36a022a28
tree: 9d883f7e438a3de008800727dc055857ee816d36 /doc/markdown/tutorials
parent: 2b50cf4941cd3dd9496ba92c3728f7800a013544
Add writing rules tutorial
Diffstat (limited to 'doc/markdown/tutorials')
-rw-r--r-- doc/markdown/tutorials/writing_rules.md | 428
1 file changed, 428 insertions, 0 deletions
diff --git a/doc/markdown/tutorials/writing_rules.md b/doc/markdown/tutorials/writing_rules.md
new file mode 100644
index 000000000..1ce6d933e
--- /dev/null
+++ b/doc/markdown/tutorials/writing_rules.md
@@ -0,0 +1,428 @@
+# Writing rspamd rules
+
+In this tutorial, I describe how to create new rules for rspamd, both lua and regexp ones.
+
+## Introduction
+
+Rules are an essential part of a spam filtering system, and rspamd ships with a set of prepared rules. However, if you run your
+own system, you might want to write your own rules to catch more spam or to lower the false positive rate. Rules are usually
+written in the `lua` language, where you specify both custom logic and generic regular expressions.
+
+## Configuration files
+
+Since rspamd ships with built-in rules, it is a good idea to store your custom rules and configuration in separate files
+to avoid clashes with the pre-built rules, which might change from version to version. There are several options for this purpose:
+
+- Local rules in lua should be stored in the file named `${CONFDIR}/rspamd.local.lua` where `${CONFDIR}` is the directory where your configuration files are placed (e.g. `/etc/rspamd` or `/usr/local/etc/rspamd` for some systems)
+- Local configuration that **adds** options to rspamd should be placed in `${CONFDIR}/rspamd.conf.local`
+- Local configuration that **overrides** the default settings should be placed in `${CONFDIR}/rspamd.conf.override`
+
+The local lua configuration can be used both to override and to extend the built-in rules:
+
+rspamd.lua:
+
+~~~lua
+config['regexp']['symbol'] = '/some_re/'
+~~~
+
+rspamd.local.lua:
+
+~~~lua
+config['regexp']['symbol1'] = '/other_re/' -- add 'symbol1' key to the table
+config['regexp']['symbol'] = '/override_re/' -- replace regexp for 'symbol'
+~~~
+
+For the configuration files, the following example shows how `rspamd.conf.local` is merged:
+
+rspamd.conf:
+
+~~~nginx
+var1 = "value1";
+
+section "name" {
+ var2 = "value2";
+}
+~~~
+
+rspamd.conf.local:
+
+~~~nginx
+var1 = "value2";
+
+section "name" {
+ var3 = "value3";
+}
+~~~
+
+Resulting config:
+
+~~~nginx
+var1 = "value1";
+var1 = "value2";
+
+section "name" {
+ var2 = "value2";
+}
+section "name" {
+ var3 = "value3";
+}
+~~~
+
+Override example:
+
+rspamd.conf:
+
+~~~nginx
+var1 = "value1";
+
+section "name" {
+ var2 = "value2";
+}
+~~~
+
+rspamd.conf.override:
+
+~~~nginx
+var1 = "value2";
+
+section "name" {
+ var3 = "value3";
+}
+~~~
+
+Resulting config:
+
+~~~nginx
+var1 = "value2";
+
+# Note that var2 is removed completely
+section "name" {
+ var3 = "value3";
+}
+~~~
+
+Combining the `override` and `local` configs should be enough to resolve complicated cases without needing a Turing-complete language to distinguish them.
+
+## Writing rules
+
+There are two main types of rules that are normally defined by rspamd:
+
+- `Lua` rules: pieces of code in the lua programming language that work with the messages being processed
+- `Regexp` rules: regular expressions and combinations of regular expressions that match specific patterns
+
+Lua rules are useful for complex tasks: asking DNS, querying redis or HTTP, examining task-specific details. Regexp rules are useful because they are
+heavily optimized by rspamd (especially when `hyperscan` is enabled) and can match custom patterns in headers, urls, text parts and even the whole message body.
+
+### Rules weights
+
+Rule weights are usually defined in `metrics`, which contain the following data:
+
+- score triggers for different actions
+- symbols scores
+- symbols descriptions
+- symbol group definitions:
+ + symbols in group
+ + description of groups
+ + joint group score limit
+
+For built-in rules, scores are placed in the file `${CONFDIR}/metrics.conf`. For your own rules, you have two ways to define scores:
+
+1. Define scores in `rspamd.conf.local` as follows:
+
+~~~nginx
+metric "default" {
+  symbol "MY_SYMBOL" {
+    description = "my cool rule";
+    score = 1.5;
+  }
+}
+~~~
+
+2. Define scores directly in lua when describing the symbol:
+
+~~~lua
+-- regexp rule
+config['regexp']['MY_SYMBOL'] = {
+  re = '/a/M & From=/blah/',
+  score = 1.5,
+  description = 'my cool rule',
+  group = 'my symbols'
+}
+
+-- lua rule
+rspamd_config.MY_LUA_SYMBOL = {
+  callback = function(task)
+    -- Do something
+    return true
+  end, -- the comma is required: `callback` is a table field like the others
+  score = -1.5,
+  description = 'another cool rule',
+  group = 'my symbols'
+}
+~~~
+
+## Regexp rules
+
+Regexp rules are executed by the `regexp` module of rspamd. You can find a detailed description of the regexp syntax in [the module documentation](../modules/regexp.md).
+In this tutorial, I only give some performance considerations for regular expressions:
+
+* Prefer lightweight regexps, such as header or url regexps, to heavy ones, such as mime or body regexps
+* If you need to match some text in the message's content, prefer `mime` regexps, as they are executed on text content only
+* If you **really** need to match the whole message, then you might also consider the [trie](../modules/trie.md) module, as it is significantly faster
+* Avoid complex regexps, avoid backtracking, avoid negative lookaheads `(?!)`, avoid capturing groups (replace them with `(?:)`), avoid potentially empty patterns, e.g. `/^.*$/`
+
+Following these guidelines lets you create rules that are fast but still effective. To add regexp rules, use the `config` global table, which is defined in any lua file used by rspamd:
+
+~~~lua
+config['regexp'] = {} -- Remove all regexp rules (including internal ones)
+local reconf = config['regexp'] -- Create alias for regexp configs
+
+local re1 = 'From=/foo@/H' -- Mind local here
+local re2 = '/blah/P'
+
+reconf['SYMBOL'] = {
+  re = string.format('(%s) && !(%s)', re1, re2), -- use string.format to create expression
+  score = 1.2,
+  description = 'some description',
+
+  condition = function(task) -- run this rule only if some condition is satisfied
+    return true
+  end,
+}
+~~~
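+
+To see what `string.format` actually produces for the `re` field, here is a minimal standalone sketch in plain lua. The helper name `combine_re` is hypothetical and is not part of rspamd; it only builds the expression string that the `regexp` module would later parse:
+
+~~~lua
+-- Hypothetical helper, not an rspamd API: build "A and not B" from two regexp atoms
+local function combine_re(must_match, must_not_match)
+  return string.format('(%s) && !(%s)', must_match, must_not_match)
+end
+
+local re1 = 'From=/foo@/H'
+local re2 = '/blah/P'
+
+print(combine_re(re1, re2)) -- prints: (From=/foo@/H) && !(/blah/P)
+~~~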
+
+## Lua rules
+
+Lua rules are more powerful than regexp ones, but they are not as heavily optimized and can cause performance issues if written incorrectly. All lua rules
+accept a special parameter called `task`, which represents the message being scanned.
+
+### Return values
+
+Each lua rule can return `0` or `false`, meaning that the rule has not matched, or `true` if the symbol should be inserted.
+In fact, you can return any positive or negative number, which is multiplied by the rule's score. For example, if the rule score is
+`1.2`, then when your function returns `1` the symbol will have score `1.2`, and when your function returns `2.0` the symbol will have score `2.4`.
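+
+The arithmetic above can be sketched in plain lua. This is only a toy model of the multiplication described here, not rspamd's internal code, and the function name `effective_score` is made up for illustration:
+
+~~~lua
+-- Toy model (assumption): map a callback's return value to the resulting symbol score
+local function effective_score(rule_score, ret)
+  if ret == true then ret = 1 end  -- `true` inserts the symbol with its plain score
+  if ret == false then ret = 0 end -- `false` means the rule has not matched
+  return rule_score * ret
+end
+
+print(effective_score(1.2, 1))   -- 1.2
+print(effective_score(1.2, 2.0)) -- 2.4
+~~~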
+
+### Rules conditions
+
+Like regexp rules, lua rules can have conditions, for example:
+
+~~~lua
+rspamd_config.SYMBOL = {
+  callback = function(task)
+    return 1
+  end,
+  score = 1.2,
+  description = 'some description',
+
+  condition = function(task) -- run this rule only if some condition is satisfied
+    return true
+  end,
+}
+~~~
+
+### Useful task manipulations
+
+There are a number of methods in [task](../lua/task.md) objects. For example, you can get the text parts of a message:
+
+~~~lua
+rspamd_config.HTML_MESSAGE = {
+  callback = function(task)
+    local parts = task:get_text_parts()
+
+    if parts then
+      for i,p in ipairs(parts) do
+        if p:is_html() then
+          return 1
+        end
+      end
+    end
+
+    return 0
+  end,
+  score = -0.1,
+  description = 'HTML included in message',
+}
+~~~
+
+You can get HTML information:
+
+~~~lua
+local function check_html_image(task, min, max)
+  local tp = task:get_text_parts()
+
+  -- Guard against messages without text parts
+  if not tp then
+    return false
+  end
+
+  for _,p in ipairs(tp) do
+    if p:is_html() then
+      local hc = p:get_html()
+      local len = p:get_length()
+
+      if len >= min and len < max then
+        local images = hc:get_images()
+        if images then
+          for _,i in ipairs(images) do
+            if i['embedded'] then
+              return true
+            end
+          end
+        end
+      end
+    end
+  end
+
+  -- No matching image found
+  return false
+end
+
+rspamd_config.HTML_SHORT_LINK_IMG_1 = {
+  callback = function(task)
+    return check_html_image(task, 0, 1024)
+  end,
+  score = 3.0,
+  group = 'html',
+  description = 'Short html part (0..1K) with a link to an image'
+}
+~~~
+
+You can get message headers with full information passed:
+
+~~~lua
+
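+local rspamd_regexp = require "rspamd_regexp"
+-- Assumption: `subject_re` was not defined in the original snippet; this illustrative
+-- pattern strips reply/forward prefixes such as "Re:" and captures the rest
+local subject_re = rspamd_regexp.create('/^(?:(?:Re|Fwd|Fw):\\s*)+(.+)$/i')
+
+rspamd_config.SUBJ_ALL_CAPS = {
+  callback = function(task)
+    local util = require "rspamd_util"
+    local sbj = task:get_header('Subject')
+
+    if sbj then
+      -- With captures enabled, [1][2] is the first capture group of the first match
+      local stripped_subject = subject_re:search(sbj, false, true)
+      if stripped_subject and stripped_subject[1] and stripped_subject[1][2] then
+        sbj = stripped_subject[1][2]
+      end
+
+      if util.is_uppercase(sbj) then
+        return true
+      end
+    end
+
+    return false
+  end,
+  score = 3.0,
+  group = 'headers',
+  description = 'All capital letters in subject'
+}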
+~~~
+
+You can also access HTTP headers, urls and other useful properties of rspamd tasks. Moreover, you can use
+the global convenience modules exported by rspamd, such as [rspamd_util](../lua/util.md) or [rspamd_logger](../lua/logger.md), by requiring them in your rules:
+
+~~~lua
+rspamd_config.SUBJ_ALL_CAPS = {
+ callback = function(task)
+ local util = require "rspamd_util"
+ local logger = require "rspamd_logger"
+ ...
+ end,
+}
+~~~
+
+## Asynchronous actions
+
+For asynchronous actions, such as redis access or DNS checks, it is recommended to use
+dedicated callbacks, called symbol handlers. The difference from generic lua rules is that
+dedicated callbacks are not obliged to return a value; instead, they use the method `task:insert_result(symbol, weight)` to
+indicate a match. All lua plugins are implemented as symbol handlers. Here is a simple example of a symbol handler that checks DNS:
+
+~~~lua
+rspamd_config:register_symbol('SOME_SYMBOL', 1.0,
+  function(task)
+    local to_resolve = 'google.com'
+    local logger = require "rspamd_logger"
+
+    -- Called asynchronously when the DNS reply arrives
+    local dns_cb = function(resolver, to_resolve, results, err)
+      if results then
+        logger.infox(task, '<%1> host: [%2] resolved for symbol: %3',
+          task:get_message_id(), to_resolve, 'SOME_SYMBOL')
+        -- Insert the same symbol that was registered above
+        task:insert_result('SOME_SYMBOL', 1)
+      end
+    end
+    task:get_resolver():resolve_a({
+      task = task,
+      name = to_resolve,
+      callback = dns_cb})
+  end)
+~~~
+
+You can also set the desired score and description if you'd like:
+
+~~~lua
+rspamd_config:set_metric_symbol('SOME_SYMBOL', 1.2, 'some description')
+-- Table version (`rule` and `symbol` are assumed to be defined by the surrounding plugin code)
+if rule['score'] then
+  if not rule['group'] then
+    rule['group'] = 'whitelist'
+  end
+  rule['name'] = symbol
+  rspamd_config:set_metric_symbol(rule)
+end
+~~~
+
+## Difference between `config` and `rspamd_config`
+
+It might be confusing that there are two variables with similar names and meanings. This is a historical
+artifact of rspamd that previously served a purpose. Currently, however, `rspamd_config` represents
+an object that can do many things:
+
+* Get configuration options:
+
+~~~lua
+rspamd_config:get_all_opts('section')
+~~~
+
+* Add maps:
+
+~~~lua
+rule['map'] = rspamd_config:add_kv_map(rule['domains'],
+ "Whitelist map for " .. symbol)
+~~~
+
+* Register callbacks for symbols:
+
+~~~lua
+rspamd_config:register_symbol('SOME_SYMBOL', 1.0, some_functions)
+~~~
+
+* Register lua rules (note that `__newindex` metamethod is actually used here):
+
+~~~lua
+rspamd_config.SYMBOL = {...}
+~~~
+
+* Register composites, prefilters, postfilters and so on
+
+In contrast, the `config` global is extremely simple: it is just a plain table of configuration options, exactly the same
+as those defined in `rspamd.conf` (and `rspamd.conf.local` or `rspamd.conf.override`). However, you can also use lua tables and even functions for some
+options. For example, the `regexp` module can also accept a `callback` argument:
+
+~~~lua
+config['regexp']['SYMBOL'] = {
+ callback = function(task) ... end,
+ ...
+}
+~~~
+
+However, this syntax is discouraged and is preserved mostly for compatibility reasons.
+
+## Configuration applying order
+
+It might not be obvious, but configuration options are applied and replaced in a strict order:
+
+1. `rspamd.conf` and `rspamd.conf.local` are processed
+2. `rspamd.conf.override` is processed, and it **overrides** anything parsed in the previous step
+3. **Lua** rules are loaded, and they can override everything from the previous steps, with the important exception of rule scores, which are **NOT** overridden if the corresponding symbol is also defined in some `metric` section
+4. **Dynamic** configuration, normally defined by the webui, is loaded, and it can override rule scores or action scores from the previous steps
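+
+For rule scores specifically, the precedence above can be modelled in a few lines of plain lua. This is a toy sketch of the described behaviour, not rspamd's implementation; the function `resolve_score` and its field names are invented for illustration:
+
+~~~lua
+-- Toy model (assumption): pick the effective score of one symbol from three possible sources
+local function resolve_score(sources)
+  if sources.dynamic ~= nil then
+    return sources.dynamic -- dynamic configuration overrides the earlier steps
+  end
+  if sources.metric ~= nil then
+    return sources.metric  -- a score from a `metric` section is NOT overridden by lua
+  end
+  return sources.lua       -- otherwise the score given in the lua rule is used
+end
+
+print(resolve_score({ lua = 1.5 }))                              -- 1.5
+print(resolve_score({ metric = 2.5, lua = 1.5 }))                -- 2.5
+print(resolve_score({ metric = 2.5, lua = 1.5, dynamic = 4.5 })) -- 4.5
+~~~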
+
+## Rules check order
+
+Rules in rspamd are checked in the following order:
+
+1. **Prefilters**: checked every time and can stop all further processing by calling `task:set_pre_result()`
+2. **All symbols**: can depend on each other by calling `rspamd_config:add_dependency(from, to)`
+3. **Statistics**: checked only when all symbols have been checked
+4. **Composites**: combine symbols to adjust the final results
+5. **Post filters**: executed even if a message is already rejected and symbols processing has been stopped