# Rspamd regexp module

This is a core module that filters messages using regular expressions.

## Principles of work

The regexp module operates on `expressions`, which are logical sequences of `atoms`. Atoms
are the elements of an expression and can be regular expressions, rspamd
functions or lua functions. Rspamd supports the following operators in expressions:

* `&&` - logical AND (can also be written as `and` or even `&`)
* `||` - logical OR (`or` or `|`)
* `!` - logical NOT (`not`)
* `+` - logical PLUS, usually used with comparisons:
	- `>` greater than
	- `<` less than
	- `>=` greater than or equal to
	- `<=` less than or equal to

While the logical operators are self-explanatory, PLUS needs some explanation. In rspamd,
it joins multiple atoms or subexpressions so that the number of true operands can be compared to a specific number:

	A + B + C + D > 2 - evaluates to `true` if at least 3 operands are true
	(A & B) + C + D + E >= 2 - evaluates to `true` if at least 2 operands are true (`(A & B)` counts as one operand)
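
To make the PLUS operator more concrete, here is a minimal sketch of a rule that fires when at least three of four part regexps match. It uses the table form described in the "Regexp objects" section. The rule name, the patterns and the score are hypothetical. Writing the atoms without surrounding parentheses is also an assumption.

~~~lua
-- Hypothetical rule: true if at least 3 of the 4 part regexps match.
config['regexp']['RE_PLUS_EXAMPLE'] = {
    -- Four atoms joined by PLUS, then compared to a number.
    re = '/lottery/iP + /winner/iP + /prize/iP + /claim/iP > 2',
    score = 3.0, -- illustrative value only
}
~~~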

Operators have the following priorities, from highest to lowest:
	
1. NOT
2. PLUS
3. COMPARE
4. AND
5. OR

You can override these priorities with parentheses. All operations are *right* associative in rspamd.
While evaluating expressions, rspamd tries to reduce execution time by reordering atoms, and it does not
evaluate unnecessary branches.
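
As an illustration of the priority list above, the following sketch defines the same expression twice: once relying on the priorities and once with every grouping written out in parentheses. The rule names and patterns are hypothetical. That rspamd accepts the unparenthesized form exactly as written is an assumption based on the priorities listed here.

~~~lua
-- Hypothetical rule relying on operator priorities:
-- NOT binds first, then PLUS, then the comparison, then AND, then OR.
config['regexp']['RE_PRIO_IMPLICIT'] = {
    re = '!/alpha/P + /beta/P + /gamma/P >= 2 & /delta/P | /epsilon/P',
}

-- The same expression with the grouping made explicit.
config['regexp']['RE_PRIO_EXPLICIT'] = {
    re = '((((!/alpha/P) + /beta/P + /gamma/P) >= 2) & /delta/P) | /epsilon/P',
}
~~~

When in doubt, the explicit form is easier to review and does not depend on the reader remembering the priority order.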

## Expression components

Rspamd supports the following components within expressions:

* Regular expressions
* Internal functions
* Lua global functions (not widely used)

### Regular expressions

In rspamd, regular expressions can match different parts of messages:

* Headers (written as `Header-Name=/regexp/flags`)
* Textual mime parts
* Raw messages
* URLs

The match type is defined by special flags after the last `/` symbol:

* `H` - header regexp
* `M` - raw message regexp
* `P` - part regexp
* `U` - URL regexp

We strongly discourage using raw message regexps, as they are expensive and
should be replaced by [trie](trie.md) rules if possible.

Each regexp also supports the following flags:

* `i` - ignore case
* `u` - use utf8 regexp
* `m` - multiline regexp
* `x` - extended regexp
* `O` - do not optimize regexp (rspamd optimizes regexps by default)
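
The type and modifier flags above can be combined after the closing `/`. The following sketch shows one rule per common match type, using the table form described in the "Regexp objects" section. All rule names, patterns and scores are hypothetical.

~~~lua
-- Header regexp: matches the Subject header, ignoring case.
config['regexp']['RE_SUBJECT_INVOICE'] = {
    re = 'Subject=/invoice/iH',
    score = 1.0,
}

-- Part regexp: matches textual mime parts, ignoring case.
config['regexp']['RE_BODY_UNSUBSCRIBE'] = {
    re = '/unsubscribe/iP',
    score = 0.5,
}

-- URL regexp: matches URLs found in the message.
config['regexp']['RE_URL_LOGIN'] = {
    re = '/login/iU',
    score = 0.5,
}
~~~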

### Internal functions

Rspamd supports a set of internal functions to do some common spam filtering tasks:

* `check_smtp_data(type[, str or /re/])` - checks the specified envelope argument: `from`, `rcpt`, `user`, `subject`
* `compare_encoding(str or /re/)` - compares the message encoding with a string or regexp
* `compare_parts_distance(inequality_percent)` - if a message is multipart/alternative, compares the two parts and returns `true` if they differ by more than `inequality_percent`
* `compare_recipients_distance(inequality_percent)` - checks how different the recipients of a message are (works for > 5 recipients)
* `compare_transfer_encoding(str or /re/)` - compares the message transfer encoding with a string or regexp
* `content_type_compare_param(param, str or /re/)` - compares the content-type parameter `param` with a string or regexp
* `content_type_has_param(param)` - returns `true` if `param` exists in the content-type
* `content_type_is_subtype(str or /re/)` - returns `true` if the subtype of the content-type matches a string or regexp
* `content_type_is_type(str or /re/)` - returns `true` if the type of the content-type matches a string or regexp
* `has_content_part(type)` - returns `true` if a part with the specified `type` exists
* `has_content_part_len(type, len)` - returns `true` if a part with the specified `type` exists and has a length of at least `len`
* `has_fake_html()` - checks whether the message has an HTML part with no HTML tags
* `has_html_tag(tagname)` - returns `true` if the HTML part contains the specified tag
* `has_only_html_part()` - returns `true` if the message has only a single HTML part
* `header_exists(header)` - returns `true` if the specified header exists in the message
* `is_html_balanced()` - checks whether the HTML part has balanced tags
* `is_recipients_sorted()` - returns `true` if there are more than 5 recipients in a message and they are sorted
* `raw_header_exists()` - does the same as `header_exists`

Many of these functions are legacy and are kept only for compatibility.
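
Internal functions are atoms, so they can be combined with the operators described earlier. The sketch below flags messages that consist of a single HTML part and lack a `Subject` header. The rule name, score and description are hypothetical, and passing the header name without quotes is an assumption about the argument syntax.

~~~lua
-- Hypothetical rule combining two internal functions with AND and NOT.
config['regexp']['RE_HTML_ONLY_NO_SUBJECT'] = {
    -- Assumption: function arguments are written unquoted.
    re = 'has_only_html_part() & !header_exists(Subject)',
    score = 2.0,
    description = 'Single HTML part and no Subject header',
}
~~~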

### Lua atoms

Lua atoms can now be names of global lua functions or callbacks. This is
a compatibility feature for previously written rules.

### Regexp objects

Since rspamd 1.0, regexp rules can be made more powerful by writing them in
table notation. A table can have the following fields:

- `callback`: lua callback for the rule
- `re`: regular expression (mutually exclusive with `callback` option)
- `condition`: function that takes the task and determines whether the rule should be executed
- `score`: default score
- `description`: default description
- `one_shot`: default one shot settings

Here is an example of a regexp rule defined in table form:

~~~lua
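config['regexp']['RE_TEST'] = {
    -- Part regexp: matches "test" in textual mime parts
    re = '/test/P',
    score = 10.0,
    -- Execute the rule only for messages that have a Subject header
    condition = function(task)
        if task:get_header('Subject') then
            return true
        end
        return false
    end,
}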
~~~
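
The `callback` field can be used instead of `re` when a check is easier to express in lua. The following is a minimal sketch, assuming the callback receives the task and returns a boolean in the same way as `condition`, and that `one_shot` takes a boolean. The rule name, score and description are hypothetical.

~~~lua
config['regexp']['RE_CALLBACK_TEST'] = {
    -- Assumption: the callback receives the task, and its boolean
    -- result decides whether the rule matches.
    callback = function(task)
        -- True when the message has no Subject header
        return task:get_header('Subject') == nil
    end,
    score = 1.0,
    description = 'Message has no Subject header',
    one_shot = true, -- assumption: a boolean value
}
~~~

Because `re` and `callback` are mutually exclusive, a rule that needs both a regexp and custom logic can keep the regexp in `re` and put the extra check in `condition`.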