README.en.txt



API
===========

The rspamd API is described in the Doxygen documentation.

How rspamd filters work
==============================

1) All filters are registered in the configuration file, in the description of the filter chains:
header_filters = "regexp, my_func"
Each filter name is either the name of a C module or the name of a script (Lua or Perl) function.
Types of filters:
* header_filters - filters for headers
* mime_filters - filters for every MIME part
* message_filters - filters for the whole message, without MIME parsing
* url_filters - filters for URLs in messages

Filters register their results in metrics.

2) A metric is a symbolic value in which filters register their results.
There is one metric by default, named "default".
Each metric has a consolidation function that calculates the overall result from the
registered symbols and the coefficients assigned to them.
By default this function is a simple sum of coefficients, which are set in the configuration file:

# The factors block
factors {
	# For example: "SURBL_DNS" = 5.0;
	"SYMBOL_NAME" = coefficient;
};

A custom consolidation function can also be registered for a metric:

metric {
	name = "test_metric";
	function = "some_function";
	required_score = 20.0;
};
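
The default "simple sum" consolidation described above can be sketched in Python as follows. This is only an illustration, not rspamd's implementation: the factor values, the default coefficient of 1.0 for symbols missing from the factors block, and the ">=" comparison against required_score are all assumptions.

```python
# Hypothetical sketch of the default "simple sum" consolidation.
# The factor values below are illustrative, not rspamd defaults.
FACTORS = {"SURBL_DNS": 5.0, "R_MIXED_CHARSET": 2.5}


def consolidate(symbols, factors, default_factor=1.0):
    """Sum the coefficients of all symbols registered by filters.

    Assumption: a symbol without an entry in the factors block
    counts with default_factor.
    """
    return sum(factors.get(name, default_factor) for name in symbols)


def is_spam(symbols, factors, required_score):
    """Compare the consolidated result with the metric's required_score.

    Assumption: reaching the required score exactly already counts as spam.
    """
    return consolidate(symbols, factors) >= required_score


score = consolidate(["SURBL_DNS", "R_MIXED_CHARSET"], FACTORS)
print(score)  # 7.5
```

A custom consolidation function, as registered in the metric block, would take the place of consolidate() for that metric.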


The protocol
=========

Reply format:
SPAMD/1.1 0 EX_OK
      \/  \/   \/
  Version Code Errors
Spam: False; 2 / 5
This format is kept for compatibility with sa-spamd (it has no metrics).

New reply format:
RSPAMD/1.0 0 EX_OK
Metric: Name; Spam_Result; Spam_Mark / Spam_Mark_Required
Metric: Name2; Spam_Result2; Spam_Mark2 / Spam_Mark_Required2

A reply can contain several Metric headers.
Symbol output format:
SYMBOL1, SYMBOL2, SYMBOL3 - the sa-spamd compatible format
Symbol: Name; Param1, Param2, Param3 - the rspamd format
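
A client could parse the new reply format roughly as in the Python sketch below. The sample reply text is made up for illustration, and the sketch assumes one header per line and keeps symbols in a flat list, because the text above does not say how symbols are tied to a metric.

```python
import re

# "Metric: Name; Spam_Result; Spam_Mark / Spam_Mark_Required"
METRIC_RE = re.compile(
    r"^Metric:\s*(?P<name>[^;]+);\s*(?P<result>[^;]+);\s*"
    r"(?P<mark>-?\d+(?:\.\d+)?)\s*/\s*(?P<required>-?\d+(?:\.\d+)?)$"
)


def parse_reply(text):
    """Parse a reply into the status line, metrics and symbols."""
    lines = [line.strip() for line in text.strip().splitlines()]
    # Status line: protocol/version, numeric code, error string.
    protocol, code, status = lines[0].split(None, 2)
    reply = {"protocol": protocol, "code": int(code), "status": status,
             "metrics": [], "symbols": []}
    for line in lines[1:]:
        match = METRIC_RE.match(line)
        if match:
            reply["metrics"].append({
                "name": match["name"].strip(),
                "result": match["result"].strip(),
                "mark": float(match["mark"]),
                "required": float(match["required"]),
            })
        elif line.startswith("Symbol:"):
            # "Symbol: Name; Param1, Param2, Param3"
            name, _, params = line[len("Symbol:"):].partition(";")
            reply["symbols"].append(
                (name.strip(), [p.strip() for p in params.split(",") if p.strip()])
            )
    return reply


# Made-up sample reply, for illustration only.
SAMPLE = """RSPAMD/1.0 0 EX_OK
Metric: default; False; 2 / 5
Symbol: R_MIXED_CHARSET; 0.3
"""
reply = parse_reply(SAMPLE)
print(reply["metrics"][0]["name"], reply["metrics"][0]["mark"])  # default 2.0
```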

Request format:
PROCESS SPAMC/1.2
\/      \/
Command Version

SPAMC - the protocol compatible with sa-spamd
RSPAMC - the new rspamd protocol
In either mode the following headers are supported:
Content-Length - Length of the message
Helo - HELO, received from the client
From - MAIL FROM
IP - IP of the client
Recipient-Number - Number of recipients
Rcpt - the recipient
Queue-ID - The queue identifier

These values can be used in rspamd filters.
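
As an illustration, a request with these headers could be assembled as in the Python sketch below. The CRLF line endings and the empty line separating the headers from the message body are assumptions borrowed from the sa-spamd protocol, and the header values are invented.

```python
def build_request(message, headers, command="PROCESS", protocol="SPAMC/1.2"):
    """Build a request: command line, headers, empty line, message body.

    message is the raw message as bytes; headers is a list of
    (name, value) pairs so that a header can be repeated.
    Assumptions: CRLF line endings and an empty line before the body.
    """
    lines = [f"{command} {protocol}", f"Content-Length: {len(message)}"]
    lines += [f"{name}: {value}" for name, value in headers]
    head = "\r\n".join(lines) + "\r\n\r\n"
    return head.encode("ascii") + message


# Invented values, for illustration only.
message = b"Subject: hi\r\n\r\nbody"
request = build_request(message, [
    ("Helo", "mail.example.com"),
    ("From", "sender@example.com"),
    ("IP", "192.0.2.1"),
    ("Recipient-Number", "1"),
    ("Rcpt", "user@example.org"),
])
print(request.decode("ascii"))
```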

Regular expressions
====================

Regular expressions are defined in the regexp module:
.module 'regexp' {
	SYMBOL = "regexp_expression";
};
header_filters = "regexp";

Format of a regular expression:
"/pattern/flags"
For header lines there is also a special form:
headername=/pattern/flags

Regexp flags:
i, m, s, x, u, o - the same as in Perl/PCRE
r - raw regexp, not encoded in UTF-8
H - searches in a header
M - searches in the undecoded message
P - searches in decoded MIME parts
U - searches in URLs
X - searches in undecoded headers
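
The two forms above can be split into header name, pattern and flags as in the following Python sketch. It illustrates the syntax only and is not rspamd's parser; the function name is hypothetical, and rejecting flags outside the list above is an assumption.

```python
KNOWN_FLAGS = set("imsxuorHMPUX")


def parse_regexp_atom(atom):
    """Split '/pattern/flags' or 'headername=/pattern/flags' into parts.

    Returns (header, pattern, flags); header is None for the plain form.
    """
    atom = atom.strip()
    header = None
    if not atom.startswith("/"):
        header, sep, atom = atom.partition("=")
        if not sep:
            raise ValueError("expected headername=/pattern/flags")
        header, atom = header.strip(), atom.strip()
    if not atom.startswith("/"):
        raise ValueError("pattern must start with '/'")
    # Find the closing '/', skipping characters escaped with a backslash.
    i = 1
    while i < len(atom):
        if atom[i] == "\\":
            i += 2
            continue
        if atom[i] == "/":
            break
        i += 1
    else:
        raise ValueError("unterminated pattern")
    pattern, flags = atom[1:i], atom[i + 1:]
    unknown = set(flags) - KNOWN_FLAGS
    if unknown:
        raise ValueError(f"unknown flags: {sorted(unknown)}")
    return header, pattern, flags


print(parse_regexp_atom("Subject=/blah/H"))  # ('Subject', 'blah', 'H')
```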

An expression can contain regular expressions, functions, logic operators and brackets:
SOME_SYMBOL = "To=/blah@blah/H & !(From=/blah@blah/H | Subject=/blah/H)"

Variables can also be used:
$to_blah = "To=/blah@blah/H";
$from_blah = "From=/blah@blah/H";
$subject_blah = "Subject=/blah/H";

The previous expression then becomes:

SOME_SYMBOL = "${to_blah} & !(${from_blah} | ${subject_blah})"
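
Variable references of this kind can be expanded with plain text substitution, as in the Python sketch below. It assumes that a variable is replaced by its value verbatim, with no added brackets; whether rspamd does exactly this is not stated here.

```python
import re

# Matches ${name}; optional spaces around the braces are tolerated.
VAR_RE = re.compile(r"\$\s*\{\s*(\w+)\s*\}")


def expand(expression, variables):
    """Replace every ${name} in expression with the value of that variable.

    Raises KeyError for a variable that was never defined.
    """
    return VAR_RE.sub(lambda match: variables[match.group(1)], expression)


variables = {
    "to_blah": "To=/blah@blah/H",
    "from_blah": "From=/blah@blah/H",
    "subject_blah": "Subject=/blah/H",
}
print(expand("${to_blah} & !(${from_blah} | ${subject_blah})", variables))
# To=/blah@blah/H & !(From=/blah@blah/H | Subject=/blah/H)
```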

rspamd logic expressions
===========================

Expressions containing regular expressions, functions, logic operations and brackets can be used
for filtering. General rules:
- The logic operations are boolean AND '&', boolean OR '|' and boolean negation '!'.
- Priority of logic operations: &| -> !. Brackets can be used to change the priority:
 (A & !B) | !(C | D)
- Whitespace in expressions is ignored.
- An operand of the form /re/args or string=/re/args is treated as a regular expression. In regular
expressions every '/' and '"' must be escaped with '\', but '\' itself does not need to be escaped.
- An operand that accepts arguments is treated as a function. Function arguments can be expressions, regexps or other functions.
Arguments of a function are evaluated from left to right.
- There are a number of built-in functions:
  * header_exists - accepts a header name as its argument, returns true if such a header exists
  * compare_parts_distance - accepts as its argument a number from 0 to 100 that expresses the difference, in percent,
    between message parts. The function works with messages containing 2 text parts (text/plain and text/html) and
    returns true when these parts differ by more than N percent. If the argument is not specified,
    the function looks for completely different parts.
  * compare_transfer_encoding - compares Content-Transfer-Encoding with the argument
  * content_type_compare_param - compares a Content-Type parameter with a regular expression or a string:
     content_type_compare_param(Charset, /windows-\d+/)
     content_type_compare_param(Charset, ascii)
  * content_type_has_param - checks whether the specified Content-Type parameter is present
  * content_type_is_subtype - compares the subtype of Content-Type with a regular expression or a string
  * content_type_is_type - compares the type of Content-Type with a regular expression or a string
     content_type_is_type(text)
     content_type_is_subtype(/?.html/)
  * regexp_match_number - accepts a number as its first argument, followed by a list of expressions.
    If the number of matched expressions is more than the first argument, the function returns TRUE, for example:
    regexp_match_number(2, ${__RE1}, ${__RE2}, header_exists(Subject))
  * has_only_html_part - function returns TRUE if there is only HTML part in the message
  * compare_recipients_distance - calculates the percentage of similar recipients of the message. Accepts one argument: a threshold
    percentage of similar recipients.
  * is_recipients_sorted - returns TRUE if the list of recipients is sorted (works only if the number of recipients is >= 5).
  * is_html_balanced - returns TRUE if tags in all html parts are balanced
  * has_html_tag - returns TRUE if specified html tag is found
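
To make one of the built-in functions above concrete, the check done by is_recipients_sorted might look like the Python sketch below. It is an assumption-laden illustration: the comparison is a plain ascending, case-sensitive string comparison, and lists with fewer than 5 recipients simply yield False. The real function may compare addresses differently.

```python
def is_recipients_sorted(recipients, min_count=5):
    """Return True if the recipient list is in ascending order.

    Assumptions: plain case-sensitive string comparison, and False for
    lists shorter than min_count (the check "works only" from 5 up).
    """
    if len(recipients) < min_count:
        return False
    # Every recipient must be <= the one that follows it.
    return all(a <= b for a, b in zip(recipients, recipients[1:]))


rcpts = ["a@example.com", "b@example.com", "c@example.com",
         "d@example.com", "e@example.com"]
print(is_recipients_sorted(rcpts))      # True
print(is_recipients_sorted(rcpts[:3]))  # False
```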

The chartable module
================

The module searches for words with mixed character sets, for example:
kашa - partly in Latin and partly in Cyrillic.
Module parameters:

.module 'chartable' {
	metric = "default";
	symbol = "R_MIXED_CHARSET";
	threshold = "0.1";
};

threshold is the ratio of transitions between character sets to the total number of transitions between symbols in a word. For example, in the word
"kаша" (the first letter is Latin) the total number of transitions is 3 and the number of transitions between character sets is 1, so
the ratio is 1/3.
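
The ratio from this example can be reproduced with the short Python sketch below. It is not the module's code: taking the first word of a character's Unicode name ("LATIN", "CYRILLIC") as its character set is a heuristic chosen for the illustration.

```python
import unicodedata


def charset_of(ch):
    """Heuristic: the first word of the Unicode name, e.g. 'LATIN'."""
    return unicodedata.name(ch, "").split(" ", 1)[0]


def mixed_ratio(word):
    """Transitions between character sets / all transitions in the word."""
    letters = [ch for ch in word if ch.isalpha()]
    if len(letters) < 2:
        return 0.0
    pairs = list(zip(letters, letters[1:]))
    changes = sum(1 for a, b in pairs if charset_of(a) != charset_of(b))
    return changes / len(pairs)


# Latin "k" followed by Cyrillic "аша", as in the example above.
word = "k\u0430\u0448\u0430"
ratio = mixed_ratio(word)
print(ratio)        # 1 transition out of 3
print(ratio > 0.1)  # True: above the threshold from the sample config
```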

To enable the module, add it to the mime_filters list:
mime_filters = "chartable";