author | Vsevolod Stakhov <vsevolod@highsecure.ru> | 2016-07-09 11:17:18 +0100 |
---|---|---|
committer | Vsevolod Stakhov <vsevolod@highsecure.ru> | 2016-07-09 11:17:18 +0100 |
commit | 14803e9faeefeee69e97902573f3e367ceaf9744 (patch) | |
tree | a2a27b8032f2e4c96d8801c1436ec25ab465c9e7 /doc/markdown/architecture | |
parent | 2d2a741df6954e042ae2f5ce6c1a66c2d61acc11 (diff) | |
[Doc] Documentation now lives in rspamd.com repo
Diffstat (limited to 'doc/markdown/architecture')
-rw-r--r-- | doc/markdown/architecture/index.md | 106 |
-rw-r--r-- | doc/markdown/architecture/protocol.md | 154 |
2 files changed, 0 insertions, 260 deletions
diff --git a/doc/markdown/architecture/index.md b/doc/markdown/architecture/index.md
deleted file mode 100644
index 710a21064..000000000
--- a/doc/markdown/architecture/index.md
+++ /dev/null
@@ -1,106 +0,0 @@
-# Rspamd architecture
-
-## Introduction
-
-Rspamd is a universal spam filtering system based on an event-driven processing model, which means that Rspamd is not intended to block anywhere in the code. To process messages Rspamd uses a set of `rules`. Each `rule` is a symbolic name associated with a message property. For example, we can define the following rules:
-
-- `SPF_ALLOW` - means that a message is validated by SPF;
-- `BAYES_SPAM` - means that a message is statistically considered as spam;
-- `FORGED_OUTLOOK_MID` - message ID seems to be forged for the Outlook MUA.
-
-Rules are defined by [modules](../modules/). If there is a module, for example, that performs SPF checks it may define several rules according to SPF policy:
-
-- `SPF_ALLOW` - a sender is allowed to send messages for this domain;
-- `SPF_DENY` - a sender is denied by SPF policy;
-- `SPF_SOFTFAIL` - there is no affinity defined by SPF policy.
-
-Rspamd supports two main types of modules: internal modules written in C and external modules written in Lua. There is no real difference between the two types with the exception that C modules are embedded and can be enabled in a `filters` attribute in the `options` section of the config:
-
-~~~ucl
-options {
-    filters = "regexp,surbl,spf,dkim,fuzzy_check,chartable,email";
-    ...
-}
-~~~
-
-## Protocol
-
-Rspamd uses the HTTP protocol for all operations. This protocol is described in the [protocol section](protocol.md).
-
-## Metrics
-
-Rules in Rspamd define a logic of checks, but it is required to set up weights for each rule. (For Rspamd, weight means `significance`.) Rules with a greater absolute value of weight are considered more important. The weight of rules is defined in `metrics`. Each metric is a set of grouped rules with specific weights. For example, we may define the following weights for our SPF rules:
-
-- `SPF_ALLOW`: -1
-- `SPF_DENY`: 2
-- `SPF_SOFTFAIL`: 0.5
-
-Positive weights mean that this rule increases a message's 'spamminess', while negative weights mean the opposite.
-
-### Rules scheduler
-
-To avoid unnecessary checks Rspamd uses a scheduler of rules for each message. If a message is considered as definite spam then further checks are not performed. This scheduler is rather naive and it performs the following logic:
-
-- select negative rules *before* positive ones to prevent false positives;
-- prefer rules with the following characteristics:
-    - frequent rules;
-    - rules with more weight;
-    - faster rules
-
-These optimizations can filter definite spam more quickly than a generic queue.
-
-Since Rspamd-0.9 there are further optimizations for rules and expressions that are described generally in the [following presentation](http://highsecure.ru/ast-rspamd.pdf).
-
-## Actions
-
-Another important property of metrics is their actions set. This set defines recommended actions for a message if it reaches a certain score defined by all rules which have been triggered. Rspamd defines the following actions:
-
-- `No action`: a message is likely to be ham;
-- `Greylist`: greylist a message if it is not certainly ham;
-- `Add header`: a message is likely spam, so add a specific header;
-- `Rewrite subject`: a message is likely spam, so rewrite its subject;
-- `Reject`: a message is very likely spam, so reject it completely
-
-These actions are just recommendations for the MTA and are not to be strictly followed. For all actions that are greater than or equal to `greylist` it is recommended to perform explicit greylisting. `Add header` and `rewrite subject` actions are very close in semantics and are both considered as probable spam. `Reject` is a strong rule which usually means that a message should be really rejected by the MTA. The triggering score for these actions should be specified according to their logic priorities. If two actions have the same weight, the result is unspecified.
-
-## Rules weight
-
-The weight of rules is not necessarily constant. For example, for statistics rules we have no certain confidence if a message is spam or not; instead we have a measure of probability. To allow fuzzy rules weight, Rspamd supports `dynamic weights`. Generally, it means that a rule may add a dynamic range from 0 to a defined weight in the metric. So if we define the symbol `BAYES_SPAM` with a weight of 5.0, then this rule can add a resulting symbol with a weight from 0 to 5.0. To distribute values, Rspamd uses a form of Sigma function to provide a fair distribution curve. The majority of Rspamd rules, with the exception of fuzzy rules, use static weights.
-
-## Statistics
-
-Rspamd uses statistical algorithms to precisely calculate the final score of a message. Currently, the only algorithm defined is OSB-Bayes. You can find details of this algorithm in the following [paper](http://osbf-lua.luaforge.net/papers/osbf-eddc.pdf). Rspamd uses a window size of 5 words in its classification. During the classification procedure, Rspamd splits a message into a set of tokens. Tokens are separated by punctuation or whitespace characters. Short tokens (fewer than 3 characters) are ignored. For each token, Rspamd calculates two non-cryptographic hashes used subsequently as indices. All these tokens are stored in different statistics backends (mmapped files, SQLite3 database or Redis server). Currently, the recommended backend for statistics is `Redis`.
-
-## Running rspamd
-
-There are several command-line options that can be passed to rspamd. All of them can be displayed by passing the `--help` argument.
-
-All options are optional: by default rspamd will try to read the `etc/rspamd.conf` config file and run as a daemon. Also there is a test mode that can be turned on by passing the `-t` argument. In test mode, rspamd reads the config file and checks its syntax. If the configuration file is OK, the exit code is zero. Test mode is useful for testing new config files without restarting rspamd.
-
-## Managing rspamd using signals
-
-It is important to note that all user signals should be sent to the rspamd main process and not to its children (as for child processes these signals can have other meanings). You can identify the main process:
-
-- by reading the pidfile:
-
-        $ cat pidfile
-
-- by getting process info:
-
-        $ ps auxwww | grep rspamd
-        nobody 28378 0.0 0.2 49744 9424 rspamd: main process
-        nobody 64082 0.0 0.2 50784 9520 rspamd: worker process
-        nobody 64083 0.0 0.3 51792 11036 rspamd: worker process
-        nobody 64084 0.0 2.7 158288 114200 rspamd: controller process
-        nobody 64085 0.0 1.8 116304 75228 rspamd: fuzzy storage
-
-        $ ps auxwww | grep rspamd | grep main
-        nobody 28378 0.0 0.2 49744 9424 rspamd: main process
-
-After getting the pid of the main process it is possible to manage rspamd with signals, as follows:
-
-- `SIGHUP` - restart rspamd: reread config file, start new workers (as well as controller and other processes), stop accepting connections by old workers, reopen all log files. Note that old workers would be terminated after one minute which should allow processing of all pending requests. All new requests to rspamd will be processed by the newly started workers.
-- `SIGTERM` - terminate rspamd.
-- `SIGUSR1` - reopen log files (useful for log file rotation).
-
-These signals may be used in rc-style scripts. Restarting of rspamd is performed softly: no connections are dropped and if a new config is incorrect then the old config is used.
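
The Metrics and Actions sections of the removed `index.md` describe a message score as the sum of the weights of triggered rules, with an action recommended once the score reaches a threshold. The following is a minimal Python sketch of that idea, not Rspamd's implementation. The weights come from the text above; the numeric thresholds and the function names `score_message` and `recommend_action` are assumptions made for illustration only.

```python
# Weights taken from the Metrics section of the removed document.
METRIC_WEIGHTS = {"SPF_ALLOW": -1.0, "SPF_DENY": 2.0, "SPF_SOFTFAIL": 0.5}

# Hypothetical thresholds: the document names the actions but gives no numbers.
ACTION_THRESHOLDS = [
    ("reject", 15.0),
    ("rewrite subject", 8.0),
    ("add header", 6.0),
    ("greylist", 4.0),
]


def score_message(triggered, weights=METRIC_WEIGHTS):
    """Sum the static weights of every triggered rule."""
    return sum(weights[name] for name in triggered)


def recommend_action(score, thresholds=ACTION_THRESHOLDS):
    """Return the action with the highest threshold that the score reaches."""
    for action, limit in sorted(thresholds, key=lambda t: t[1], reverse=True):
        if score >= limit:
            return action
    return "no action"


if __name__ == "__main__":
    total = score_message(["SPF_DENY", "SPF_SOFTFAIL"])
    print(total, recommend_action(total))
```

The thresholds are kept distinct on purpose, since the text notes that the result is unspecified when two actions share the same score.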
diff --git a/doc/markdown/architecture/protocol.md b/doc/markdown/architecture/protocol.md
deleted file mode 100644
index 81d10d67b..000000000
--- a/doc/markdown/architecture/protocol.md
+++ /dev/null
@@ -1,154 +0,0 @@
-# Rspamd protocol
-
-## Protocol basics
-
-Rspamd uses the HTTP protocol, either version 1.0 or 1.1. (There is also a compatibility layer described further in this document.) Rspamd defines some headers which allow the passing of extra information about a scanned message, such as envelope data, IP address or SMTP SASL authentication data, etc. Rspamd supports normal and chunked encoded HTTP requests.
-
-## Rspamd HTTP request
-
-Rspamd encourages the use of the HTTP protocol since it is standard and can be used by every programming language without the use of exotic libraries. A typical HTTP request looks like the following:
-
-    POST /check HTTP/1.0
-    Content-Length: 26969
-    From: smtp@example.com
-    Pass: all
-    Ip: 95.211.146.161
-    Helo: localhost.localdomain
-    Hostname: localhost
-
-    <your message goes here>
-
-You can also use chunked encoding that allows streaming data transfer which is useful if you don't know the length of a message.
-
-### HTTP request
-
-Normally, you should just use '/check' here. However, if you want to communicate with the controller then you might want to use controller commands.
-
-(TODO: write this part)
-
-### HTTP headers
-
-To avoid unnecessary work, Rspamd allows an MTA to pass pre-processed data about the message by using either HTTP headers or a JSON control block (described further in this document). Rspamd supports the following non-standard HTTP headers:
-
-| Header | Description |
-| :-------------- | :-------------------------------- |
-| **Deliver-To:** | Defines actual delivery recipient of message. Can be used for personalized statistics and for user-specific options. |
-| **IP:** | Defines IP from which this message is received. |
-| **Helo:** | Defines SMTP helo |
-| **Hostname:** | Defines resolved hostname |
-| **From:** | Defines SMTP mail from command data |
-| **Queue-Id:** | Defines SMTP queue id for message (can be used instead of message id in logging). |
-| **Rcpt:** | Defines SMTP recipient (there may be several `Rcpt` headers) |
-| **Pass:** | If this header has `all` value, all filters would be checked for this message. |
-| **Subject:** | Defines subject of message (is used for non-mime messages). |
-| **User:** | Defines SMTP user. |
-| **Message-Length:** | Defines the length of message excluding the control block. |
-
-Controller also defines certain headers:
-
-(TODO: write this part)
-
-Standard HTTP headers, such as `Content-Length`, are also supported.
-
-## Rspamd HTTP reply
-
-Rspamd reply is encoded in `JSON`. Here is a typical HTTP reply:
-
-    HTTP/1.1 200 OK
-    Connection: close
-    Server: rspamd/0.9.0
-    Date: Mon, 30 Mar 2015 16:19:35 GMT
-    Content-Length: 825
-    Content-Type: application/json
-
-~~~json
-{
-    "default": {
-        "is_spam": false,
-        "is_skipped": false,
-        "score": 5.2,
-        "required_score": 7,
-        "action": "add header",
-        "DATE_IN_PAST": {
-            "name": "DATE_IN_PAST",
-            "score": 0.1
-        },
-        "FORGED_SENDER": {
-            "name": "FORGED_SENDER",
-            "score": 5
-        },
-        "TEST": {
-            "name": "TEST",
-            "score": 100500
-        },
-        "FUZZY_DENIED": {
-            "name": "FUZZY_DENIED",
-            "score": 0,
-            "options": [
-                "1: 1.00 / 1.00",
-                "1: 1.00 / 1.00"
-            ]
-        },
-        "HFILTER_HELO_5": {
-            "name": "HFILTER_HELO_5",
-            "score": 0.1
-        }
-    },
-    "urls": [
-        "www.example.com",
-        "another.example.com"
-    ],
-    "emails": [
-        "user@example.com"
-    ],
-    "message-id": "4E699308EFABE14EB3F18A1BB025456988527794@example"
-}
-~~~
-
-For convenience, the reply is LINTed using [JSONLint](http://jsonlint.com). The actual reply is compressed for speed.
-
-The reply can be treated as a JSON object where keys are metric names (namely `default`) and values are objects that represent metrics.
-
-Each metric has the following fields:
-
-* `is_spam` - boolean value that indicates whether a message is spam
-* `is_skipped` - boolean flag that is `true` if a message has been skipped due to settings
-* `score` - floating point value representing the effective score of message
-* `required_score` - floating point value meaning the threshold value for the metric
-* `action` - recommended action for a message:
-    - `no action` - message is likely ham;
-    - `greylist` - message should be greylisted;
-    - `add header` - message is suspicious and should be marked as spam
-    - `rewrite subject` - message is suspicious and should have subject rewritten
-    - `soft reject` - message should be temporarily rejected (for example, due to rate limit exhaustion)
-    - `reject` - message should be rejected as spam
-
-Additionally, metric contains all symbols added during a message's processing, indexed by symbol names.
-
-Additional keys which may be in the reply include:
-
-* `subject` - if action is `rewrite subject` this value defines the desired subject for a message
-* `urls` - a list of URLs found in a message (only hostnames)
-* `emails` - a list of emails found in a message
-* `message-id` - ID of message (useful for logging)
-* `messages` - array of optional messages added by Rspamd filters (such as `SPF`)
-
-## Rspamd JSON control block
-
-Since Rspamd version 0.9 it is also possible to pass additional data by prepending a JSON control block to a message. So you can use either headers or a JSON block to pass data from the MTA to Rspamd.
-
-To use a JSON control block, you need to pass an extra header called `Message-Length` to Rspamd. This header should be equal to the size of the message **excluding** the JSON control block. Therefore, the size of the control block is equal to `Content-Length - Message-Length`. Rspamd assumes that a message starts immediately after the control block (with no extra CRLF). This method is equally compatible with streaming transfer; however, even if you are not specifying `Content-Length` you are still required to specify `Message-Length`.
-
-Here is an example of a JSON control block:
-
-~~~json
-{
-    "from": "smtp@example.com",
-    "pass_all": "true",
-    "ip": "95.211.146.161",
-    "helo": "localhost.localdomain",
-    "hostname": "localhost"
-}
-~~~
-
-Moreover, [UCL](https://github.com/vstakhov/libucl) JSON extensions and syntax conventions are also supported inside the control block.
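
The control-block arithmetic and the reply layout described in the removed `protocol.md` can be illustrated with a short Python sketch. It only builds a request body and parses a reply string; it does not contact a server. The helper names `build_check_request` and `summarize_reply` are hypothetical and are not part of any Rspamd client library, and the rule used to recognise symbol entries (a nested object carrying a `name` key) is an assumption drawn from the sample reply.

```python
import json


def build_check_request(message: bytes, control: dict):
    """Prepend a JSON control block to a message and compute both lengths.

    Message-Length covers the message only, so the control block size is
    Content-Length minus Message-Length, as the document states.
    """
    block = json.dumps(control).encode("utf-8")
    body = block + message  # the message follows immediately, no extra CRLF
    headers = {
        "Content-Length": str(len(body)),
        "Message-Length": str(len(message)),
    }
    return headers, body


def summarize_reply(reply_text: str, metric: str = "default"):
    """Extract the action, the score and per-symbol scores from a reply."""
    reply = json.loads(reply_text)
    fields = reply[metric]
    # Assumption: symbols are the nested objects that carry a "name" key.
    symbols = {
        key: value["score"]
        for key, value in fields.items()
        if isinstance(value, dict) and "name" in value
    }
    return fields["action"], fields["score"], symbols


if __name__ == "__main__":
    control = {"from": "smtp@example.com", "ip": "95.211.146.161"}
    headers, body = build_check_request(b"Subject: hi\r\n\r\ntest\r\n", control)
    print(headers)
```

Keeping `Message-Length` separate from `Content-Length` lets the receiver find where the message starts without scanning for the end of the JSON block.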