\input texinfo
@settitle "Rspamd Spam Filtering System"
@titlepage
@title Rspamd Spam Filtering System
@subtitle A User's Guide for Rspamd
@author Vsevolod Stakhov
@end titlepage
@contents
@chapter Rspamd purposes and features.
@section Introduction.
The rspamd filtering system was created as a replacement for the popular
@code{spamassassin}
spamd and is designed to be a fast, modular and easily extensible system. The rspamd
core is written in @code{C} using an event-driven paradigm. Plugins for rspamd
can be written in @code{lua}. Rspamd is designed to process connections
completely asynchronously and never block anywhere in the code. The spam filtering system
consists of several processes, among them:
@itemize @bullet
@item Main process
@item Worker processes
@item Controller process
@item Other processes
@end itemize
The main process manages all other processes, accepts signals from the OS (for example
SIGHUP) and respawns processes of any type if they die. Worker processes
do all the work of filtering e-mail (or HTML messages when rspamd is used as a
non-MIME filter). The controller process is designed to manage rspamd itself (for
example, to get statistics or to train rspamd). Other processes can do different
jobs; those currently implemented are an @code{LMTP} worker, which implements the
@code{LMTP} protocol for filtering mail, and a fuzzy hashes storage server.
@section Features.
The main features of rspamd are:
@itemize @bullet
@item Completely asynchronous filtering that allows a large number of simultaneous
connections.
@item An easily extensible architecture that can be extended by plugins written in
@code{lua} and by dynamically loaded plugins written in @code{C}.
@item Ability to work in a cluster: rspamd is able to synchronize statfiles,
load lists dynamically via HTTP, and use distributed fuzzy hashes
storage.
@item Advanced statistics: rspamd is currently shipped with the winnow-osb classifier, which
provides more accurate statistics than traditional Bayesian algorithms based on
single words.
@item Internal optimizer: rspamd first checks the rules that have matched
most often, so during huge spam storms it works very fast, as it checks only
those rules that @emph{can} match and skips all others.
@item Ability to manage the whole cluster by using the controller process.
@item Compatibility with the existing @code{spamassassin} SPAMC protocol.
@item Extended @code{RSPAMC} protocol that allows passing additional data
from the SMTP dialog to the rspamd filter.
@item Internal support of IMAP in the rspamc client for automated learning.
@item Internal support of many anti-spam technologies, among them
@code{SPF} and @code{SURBL}.
@item Active support and development of new features.
@end itemize
@chapter Installation of rspamd.
@section Obtaining rspamd.
The main rspamd site is @url{http://rspamd.sourceforge.net/, sourceforge}. There
you can obtain the source code package as well as pre-built packages for different
operating systems and architectures. You can also use the
@url{http://mercurial.selenic.com, mercurial} SCM to access the rspamd development
repository, which can be found here:
@url{http://rspamd.hg.sourceforge.net:8000/hgroot/rspamd/rspamd}. Rspamd is
shipped with all modules and a sample config by default, but there are some
requirements for building and running rspamd.
@section Requirements.
To build rspamd from source you need the @code{CMake} build system. CMake is a very
nice build system and I decided to use it instead of GNU autotools.
CMake can be obtained here: @url{http://cmake.org}. Rspamd also uses gmime and
glib for MIME parsing and many other purposes (note that you are NOT required
to install any GUI libraries: neither glib nor gmime is a GUI library). Gmime
and glib can be obtained from the GNOME site: @url{http://ftp.gnome.org/}. For
plugins and the configuration system you also need the lua language interpreter and
libraries. They can be obtained from the @url{http://lua.org, official lua
site}. The rspamc client also requires the @code{perl} interpreter, which can be
installed from @url{http://www.perl.org}.
@section Building and Installation.
The build process of rspamd is rather simple:
@itemize @bullet
@item Configure the rspamd build environment using cmake:
@example
$ cmake .
...
-- Configuring done
-- Generating done
-- Build files have been written to: /home/cebka/rspamd
@end example
@noindent
For special configuration options you can use
@example
$ ccmake .
CMAKE_BUILD_TYPE
CMAKE_INSTALL_PREFIX /usr/local
DEBUG_MODE ON
ENABLE_GPERF_TOOLS OFF
ENABLE_OPTIMIZATION OFF
ENABLE_PERL OFF
ENABLE_PROFILING OFF
ENABLE_REDIRECTOR OFF
ENABLE_STATIC OFF
@end example
@noindent
These options allow building rspamd as a static module (note that in this case
dynamically loaded plugins are @strong{NOT} supported), linking rspamd with
google performance tools for benchmarking, and including some other flags while
building.
@item Build rspamd sources:
@example
$ make
[ 6%] Built target rspamd_lua
[ 11%] Built target rspamd_json
[ 12%] Built target rspamd_evdns
[ 12%] Built target perlmodule
[ 58%] Built target rspamd
[ 76%] Built target test/rspamd-test
[ 85%] Built target utils/expression-parser
[ 94%] Built target utils/url-extracter
[ 97%] Built target rspamd_ipmark
[100%] Built target rspamd_regmark
@end example
@noindent
@item Install rspamd (as superuser):
@example
# make install
Install the project...
...
@end example
@noindent
@end itemize
After installation you will have several new files installed:
@itemize @bullet
@item Binaries:
@itemize @bullet
@item PREFIX/bin/rspamd - main rspamd executable
@item PREFIX/bin/rspamc - rspamd client program
@end itemize
@item Sample configuration files and rules:
@itemize @bullet
@item PREFIX/etc/rspamd.xml.sample - sample main config file
@item PREFIX/etc/rspamd/lua/*.lua - rspamd rules
@end itemize
@item Lua plugins:
@itemize @bullet
@item PREFIX/etc/rspamd/plugins/lua/*.lua - lua plugins
@end itemize
@end itemize
On @code{FreeBSD} systems there will also be a start script for running rspamd in
@emph{PREFIX/etc/rc.d/rspamd.sh}.
@section Running rspamd.
Rspamd can be started by running the main rspamd executable,
@code{PREFIX/bin/rspamd}. There are several command-line options that can be
passed to rspamd. All of them can be displayed by passing the @option{--help} argument:
@example
$ rspamd --help
Usage:
rspamd [OPTION...] - run rspamd daemon
Summary:
Rspamd daemon version 0.3.0
Help Options:
-?, --help Show help options
Application Options:
-t, --config-test Do config test and exit
-f, --no-fork Do not daemonize main process
-c, --config Specify config file
-u, --user User to run rspamd as
-g, --group Group to run rspamd as
-p, --pid Path to pidfile
-V, --dump-vars Print all rspamd variables and exit
-C, --dump-cache Dump symbols cache stats and exit
-X, --convert-config Convert old style of config to xml one
@end example
@noindent
All options are optional: by default rspamd tries to read the
@code{PREFIX/etc/rspamd.xml} config file and run as a daemon. There is also a test
mode that can be turned on by passing the @option{-t} argument. In test mode rspamd
reads the config file and checks its syntax; if the config file is OK the exit
code is zero, and non-zero otherwise. Test mode is useful for testing a new config
file without restarting rspamd. With the @option{-V} and @option{-C} arguments it is
possible to dump variables or symbols cache data, respectively. The symbols cache dump can be used
to determine which symbols are most frequent and which are slowest, and to see
the real order of rules inside rspamd. The @option{-X} option can be used to convert
an old-style (pre 0.3.0) config to an xml one:
@example
$ rspamd -c ./rspamd.conf -X ./rspamd.xml
@end example
@noindent
After this command the new xml config is written to the rspamd.xml file.
@section Managing rspamd with signals.
First of all, it is important to note that all user signals should be sent to
the rspamd main process and not to its children (for child processes these
signals may have other meanings). There are two ways to determine which process
is the main one:
@itemize @bullet
@item by reading the pidfile:
@example
$ cat pidfile
@end example
@noindent
@item by getting process info:
@example
$ ps auxwww | grep rspamd
nobody 28378 0.0 0.2 49744 9424 rspamd: main process (rspamd)
nobody 64082 0.0 0.2 50784 9520 rspamd: worker process (rspamd)
nobody 64083 0.0 0.3 51792 11036 rspamd: worker process (rspamd)
nobody 64084 0.0 2.7 158288 114200 rspamd: controller process (rspamd)
nobody 64085 0.0 1.8 116304 75228 rspamd: fuzzy storage (rspamd)
$ ps auxwww | grep rspamd | grep main
nobody 28378 0.0 0.2 49744 9424 rspamd: main process (rspamd)
@end example
@noindent
@end itemize
After getting the pid of the main process it is possible to manage rspamd with signals:
@itemize @bullet
@item SIGHUP - restart rspamd: reread the config file, start new workers (as well as
the controller and other processes), stop accepting connections in old workers, and
reopen all log files. Note that old workers are terminated after one minute,
which should allow them to process all pending requests. All new requests to rspamd
are processed by the newly started workers.
@item SIGTERM - terminate the rspamd system.
@end itemize
These signals may be used in start scripts, as is done in the @code{FreeBSD} start
script. Restarting rspamd is done rather softly: no connections are
dropped, and if the new config is syntactically incorrect the old config is used.
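The pidfile approach described above can be scripted. The following is a minimal Python sketch, not part of rspamd itself: the function names and the injectable kill argument are assumptions made for illustration, and the pidfile path must be whatever your installation actually uses.

```python
import os
import signal


def read_main_pid(pidfile_path):
    """Return the pid stored in the pidfile as an integer."""
    with open(pidfile_path) as pidfile:
        return int(pidfile.read().strip())


def signal_main_process(pidfile_path, sig=signal.SIGHUP, kill=os.kill):
    """Send `sig` to the main process named in the pidfile and return its pid.

    `kill` is injectable (hypothetical convenience) so the function can be
    tried out without a running daemon.
    """
    pid = read_main_pid(pidfile_path)
    kill(pid, sig)
    return pid
```

Calling signal_main_process with the default signal corresponds to the soft restart described above; passing signal.SIGTERM corresponds to terminating rspamd.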
@chapter Configuring of rspamd.
@section Principles of work.
We need to define several terms to explain the configuration of rspamd. Rspamd
operates with @strong{rules}; each rule defines some actions that should be performed on a
message to obtain a result. The result is called a @strong{symbol} - a symbolic
representation of the rule. For example, if we have a rule that checks a DNS record for
a url contained in the message, we may insert the resulting symbol if this DNS record
is found. Each symbol has several attributes:
@itemize @bullet
@item name - symbolic name of the symbol (usually uppercase, e.g. MIME_HTML_ONLY)
@item weight - numeric weight of the symbol (how important this rule is); may
be negative
@item options - list of symbolic options that carry additional information about
the processing of this rule
@end itemize
Weights of symbols are called @strong{factors}. When a symbol is inserted it
is also possible to define an additional multiplier for its factor. This can be used for
rules that have dynamic weights, for example statistical rules (the higher the
probability, the higher the weight must be).
All symbols and the corresponding rules are combined into @strong{metrics}. A metric
defines a group of symbols that are designed for a common purpose. Each metric
has a maximum weight: if the sum of all rules' results (symbols) is bigger than this
limit, the message is considered spam in this metric. The default metric
is called @emph{default}, and rules that do not explicitly specify a metric
insert their results into this default metric.
Let's look at how this technique works:
@enumerate 1
@item First of all, when rspamd starts, each module (lua, internal or external
dynamic module) can register symbols in any defined metric. After this step
rspamd has a cache of symbols for each metric. This cache can be saved to a file
to speed up the optimization of the order in which symbols are called.
@item Rspamd gets a message from a client, parses it with the MIME parser and does
other parsing jobs such as extracting text parts and urls and stripping html tags.
@item For each metric rspamd looks into the metric's cache and selects rules to
check according to their order (this order depends on the frequency of the symbol, its
weight and execution time).
@item Rspamd calls the rules of the metric as long as the sum of symbol weights in the metric is
less than its limit.
@item If the sum of symbol weights is more than the limit, the processing of rules is
stopped and the message is counted as spam in this metric.
@end enumerate
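The last three steps can be illustrated with a short Python sketch. It is only a model of the idea, not rspamd's implementation: the tuple layout of a rule, the callback signature and the choice to stop only when the sum is strictly greater than the limit are assumptions.

```python
def process_metric(message, rules, limit):
    """Call rules in cache order until the metric limit is exceeded.

    `rules` is a list of (symbol_name, callback, factor) tuples, assumed to be
    already sorted by the symbols cache. Each callback takes the message and
    returns a multiplier; 0 means that the rule did not match.
    Returns (total_weight, inserted_symbols, is_spam).
    """
    total = 0.0
    inserted = []
    for name, callback, factor in rules:
        if total > limit:
            break  # limit exceeded: the remaining rules are not called
        multiplier = callback(message)
        if multiplier:
            total += factor * multiplier
            inserted.append(name)
    return total, inserted, total > limit
```

With a limit of 10 and three matching rules with factors 5, 6 and 1, the third rule is never called, because the first two already push the sum over the limit.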
After processing the rules rspamd also performs a statistical check of the message. The rspamd
statistics module is organized as a set of @strong{classifiers}. Each classifier
defines an algorithm for the statistical checking of messages. A classifier definition
also contains definitions of @strong{statistic files} (or @strong{statfiles} for short).
Each statfile contains a number of patterns that are extracted from messages.
These patterns are put into statfiles during the learning process. A short example:
you define a classifier that contains two statfiles, @emph{ham} and @emph{spam}.
Then you find 10000 messages that are spam and 10000 messages that are ham,
and train rspamd with these messages. After this process the @emph{ham}
statfile contains patterns from ham messages and the @emph{spam} statfile contains
patterns from spam messages. When you then check a message against these
statfiles, a message that looks like spam gets a higher probability/weight in the
@emph{spam} statfile than in the @emph{ham} statfile, so the classifier inserts
the symbol of the @emph{spam} statfile and calculates how similar the message is to the
patterns contained in the @emph{spam} statfile. Rspamd does not limit
you to one classifier or two statfiles, however. It is possible to define any number
of classifiers and any number of statfiles inside a classifier. This can be useful
for personal statistics or for specific spam patterns. Note that each classifier
can insert only one symbol - the symbol of the statfile with the maximum weight/probability.
Also note that the statfiles check is always done after all rules, so statistics can
@strong{correct} the result of the rules.
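The rule that a classifier inserts only the symbol of the best-scoring statfile can be sketched in Python as follows. How the per-statfile scores are computed is left out; the dictionary of scores and the function name are assumptions for illustration only.

```python
def pick_statfile_symbol(statfile_scores):
    """Return the (symbol, score) pair of the statfile with the highest score.

    `statfile_scores` maps a statfile symbol name to the weight/probability
    that the classifier computed for the message (computation not shown).
    Returns None when the classifier has no statfiles.
    """
    if not statfile_scores:
        return None
    symbol = max(statfile_scores, key=statfile_scores.get)
    return symbol, statfile_scores[symbol]
```

For a classifier with ham and spam statfiles, a message scoring 0.2 in ham and 0.9 in spam would yield only the spam symbol.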
Now some words about @strong{modules}. All rspamd rules are contained in
modules. Modules can be internal (like SURBL, SPF, fuzzy check, email and
others) or external, written in the @code{lua} language. In fact there is no difference
in the way the rules of these modules are called:
@enumerate 1
@item Rspamd loads the config and loads the specified modules.
@item Rspamd calls the init function of each module, passing it the configuration
arguments.
@item Each module examines the configuration arguments and registers its rules (or
does not register them, depending on the configuration) in rspamd metrics (or in a single
metric).
@item During metric processing rspamd calls the registered callbacks for the module's
rules.
@item These rules may insert results into the metric.
@end enumerate
So there is no actual difference between lua and internal modules: both just
provide callbacks for processing messages. Inside a callback it is also possible
to change the state of the message's processing. For example, this is done when it is
required to make a DNS or other network request and wait for the result. So modules can
pause the processing of a message while waiting for some event. This is true for lua
modules as well.
@section Rspamd config file structure.
The rspamd config file is placed in PREFIX/etc/rspamd.xml by default. You can
specify another location by passing the @option{-c} option to rspamd. The rspamd config file
contains configuration parameters in XML format. XML was selected because it is rather
simple to edit by hand, simple to generate automatically, and suitable
for dynamic configuration. I've decided to move the rules logic out of the XML file to
keep it small and simple. So rules are defined in the @code{lua} language and rspamd
parameters are defined in the xml file (rspamd.xml). Configuration rules are
included by the @strong{<lua>} tag, which has a @strong{src} attribute that defines the
relative path to the lua file (relative to the location of rspamd.xml):
@example
<lua src="rspamd/lua/rspamd.lua">fake</lua>
@end example
@noindent
Note that it is not currently possible to have empty tags. I hope this
restriction will be lifted in the future. The rspamd xml config consists of several
sections:
@itemize @bullet
@item Main section - section where the main config parameters are placed.
@item Workers section - section where workers are described.
@item Classifiers section - section where you define your classification logic.
@item Modules section - a set of sections that describe modules' rules (in fact
these rules should be in lua code).
@item Metrics section - a section where you can set the weights of symbols in metrics and the metrics settings.
@item Logging section - a section that describes rspamd logging.
@item Views section - a section that defines rspamd views.
@end itemize
So the common structure of rspamd.xml can be described this way:
@example
<? xml version="1.0" encoding="utf-8" ?>
<rspamd>
<!-- Main section directives -->
...
<!-- Workers directives -->
<worker>
...
</worker>
...
<!-- Classifiers directives -->
<classifier>
...
</classifier>
...
<!-- Logging section -->
<logging>
<type>console</type>
<level>info</level>
...
</logging>
<!-- Views section -->
<view>
...
</view>
...
<!-- Modules settings -->
<module name="regexp">
<option name="test">test</option>
...
</module>
...
</rspamd>
@end example
Each of these sections is described in detail below.
@section Rspamd configuration atoms.
There are several primitive types of rspamd configuration parameters:
@itemize @bullet
@item String - a common string that defines an option.
@item Number - an integer or fractional number (e.g. 10 or -1.5).
@item Time - an amount of time in milliseconds; may have a suffix:
@itemize @bullet
@item @emph{s} - for seconds (e.g. @emph{10s});
@item @emph{m} - for minutes (e.g. @emph{10m});
@item @emph{h} - for hours (e.g. @emph{10h});
@item @emph{d} - for days (e.g. @emph{10d});
@end itemize
@item Size - a numeric representation of a size, like a number, but may have a suffix:
@itemize @bullet
@item @emph{k} - 'kilo' - number * 1024 (e.g. @emph{10k});
@item @emph{m} - 'mega' - number * 1024 * 1024 (e.g. @emph{10m});
@item @emph{g} - 'giga' - number * 1024 * 1024 * 1024 (e.g. @emph{1g});
@end itemize
@noindent
Size atoms are used, for example, for memory limits.
@item Lists - path to a dynamic rspamd list (e.g. @emph{http://some.host/some/path}).
@end itemize
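As an illustration of the time and size atoms, here is a small Python sketch that converts such strings to milliseconds and bytes. It follows the suffix tables above; the function names and the handling of malformed input (a ValueError from float) are assumptions, and rspamd's own parser may differ in details.

```python
# Multipliers taken from the suffix tables above.
TIME_SUFFIXES_MS = {
    "s": 1000,
    "m": 60 * 1000,
    "h": 60 * 60 * 1000,
    "d": 24 * 60 * 60 * 1000,
}
SIZE_SUFFIXES = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def _parse_with_suffix(text, suffixes):
    """Parse a number with an optional one-letter suffix."""
    text = text.strip()
    if text and text[-1] in suffixes:
        return float(text[:-1]) * suffixes[text[-1]]
    return float(text)


def parse_time_ms(text):
    """Return the time atom as milliseconds (no suffix means milliseconds)."""
    return _parse_with_suffix(text, TIME_SUFFIXES_MS)


def parse_size_bytes(text):
    """Return the size atom as a whole number of bytes."""
    return int(_parse_with_suffix(text, SIZE_SUFFIXES))
```

Note that the suffix m means minutes for a time atom and 'mega' for a size atom, which is why the two tables are kept separate.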
While practically all atoms are rather trivial to understand, rspamd lists may
cause some confusion. Lists are widely used in rspamd for data that
changes often, for example white or black lists, lists of ip addresses, or lists
of domains. For such purposes it is possible to use files that are fetched
either from the local filesystem (e.g. @code{file:///var/run/rspamd/whitelist}) or
by HTTP (e.g. @code{http://some.host/some/path/list.txt}). Rspamd constantly
checks these files for changes; when using HTTP it also sets the
@emph{If-Modified-Since} header and checks for a @emph{Not modified} reply. So
there is no overhead when lists are not modified, which makes it possible to store huge lists
and to distribute them over HTTP. Monitoring of lists is done with some random
delay (jitter), so if you have many rspamd servers in a cluster that are
monitoring a single list, they check or download it at slightly different
times. The two most common list formats are @emph{IP list} and @emph{domains
list}. An IP list consists of ip addresses in dot notation (e.g.
@code{192.168.1.1}) or ip/network pairs in CIDR notation (e.g.
@code{172.16.0.0/16}). Items in lists are separated by newlines. Lines
that begin with the @emph{#} symbol are considered comments and are ignored during
parsing. A domains list is very similar to an ip list, except that it contains
domain names.
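The IP list format is simple enough to parse in a few lines. The Python sketch below uses the standard ipaddress module and is only an illustration of the format described above; skipping blank lines is an assumption, and rspamd's own parser is not shown here.

```python
import ipaddress


def parse_ip_list(text):
    """Parse an IP list: one address or CIDR network per line, # for comments."""
    networks = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # comments are ignored; blank lines skipped (assumption)
        # A bare address such as 192.168.1.1 becomes a /32 network.
        networks.append(ipaddress.ip_network(line, strict=False))
    return networks


def ip_in_list(address, networks):
    """Return True if the address belongs to any entry of the parsed list."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in networks)
```

A list containing the two example entries above would match 172.16.5.9 through the CIDR entry and 192.168.1.1 through the single-address entry.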
@section Main rspamd configuration section.
The main rspamd configuration section contains several definitions that determine
the main parameters of rspamd, for example the path to the pidfile, the temporary directory, lua
includes, several limits, etc. Here is a list of these directives with explanations:
@multitable @columnfractions .2 .8
@headitem Tag @tab Meaning
@item @var{<tempdir>}
@tab Defines the temporary directory for rspamd. The default is to use the @env{TEMP}
environment variable or @code{/tmp}.
@item @var{<pidfile>}
@tab Path to the rspamd pidfile, where the pid of the main process is stored.
The pidfile is used to manage rspamd from start scripts.
@item @var{<statfile_pool_size>}
@tab Limit of the statfile pool size: the total number of bytes that can be used for
mapping statistic files. Rspamd uses an LRU scheme and unmaps the least recently
used statfile when this limit is reached. It usually makes sense to set
this variable equal to the total size of all statfiles, but it can be less than this
in the case of dynamic statfiles (for per-user statistics).
@item @var{<filters>}
@tab List of enabled internal filters. Items in this list can be separated by
spaces, semicolons or commas. If an internal filter is not specified in this list
it is not loaded or enabled.
@item @var{<raw_mode>}
@tab Boolean flag that specifies whether rspamd should try to convert all
messages to UTF-8 or not. If @var{raw_mode} is enabled all messages are
processed @emph{as is} and are not converted. Raw mode is faster than UTF-8 mode
but it may confuse statistics and regular expressions.
@item @var{<lua>}
@tab Defines the path to a lua file that should be loaded for configuration. The path to
this file is defined in the @strong{src} attribute. Text inside the tag is required but
is not parsed (this is a stupid limitation of the parser's design).
@end multitable
  443. @section Rspamd logging configuration.
  444. Rspamd has a number of logging variants. First of all there are three types of
  445. logs that are supported by rspamd: console loggging (just output log messages to
  446. console), file logging (output log messages to file) and logging via syslog.
  447. Also it is possible to filter logging to specific level:
  448. @itemize @bullet
  449. @item error - log only critical errors
  450. @item warning - log errors and warnings
  451. @item info - log all non-debug messages
  452. @item debug - log all including debug messages (huge amount of logging)
  453. @end itemize
  454. Also it is possible to turn on debug messages for specific ip addresses. This
  455. ability is usefull for testing.
  456. For each logging type there are special mandatory parameters: log facility for
  457. syslog (read @emph{syslog (3)} manual page for details about facilities), log
  458. file for file logging. Also file logging may be buffered for speeding up. For
  459. reducing logging noise rspamd detects for sequential identic log messages and
  460. replace them with total number of repeats:
  461. @example
  462. #81123(fuzzy): May 11 19:41:54 rspamd file_log_function: Last message repeated 155 times
  463. #81123(fuzzy): May 11 19:41:54 rspamd process_write_command: fuzzy hash was successfully added
  464. @end example
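The collapsing of repeated messages shown above can be sketched in Python as
follows. This is only an illustration and not rspamd's implementation: the
function name @code{collapse_repeats} is hypothetical, and the sketch assumes
that the first message of a run is written as is and that the summary line is
emitted once a different message arrives.

```python
def collapse_repeats(messages):
    """Collapse runs of identical consecutive log messages.

    The first message of a run is kept; the repeats that follow it are
    replaced by a single "Last message repeated N times" line.
    """
    out = []
    previous = None
    repeats = 0
    for msg in messages:
        if msg == previous:
            repeats += 1  # same text again: count it instead of logging it
            continue
        if repeats:
            out.append("Last message repeated %d times" % repeats)
        out.append(msg)
        previous = msg
        repeats = 0
    if repeats:
        # flush the counter for a run that ends the input
        out.append("Last message repeated %d times" % repeats)
    return out
```

Only adjacent duplicates are collapsed; the same text appearing later, after a
different message, starts a new run.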
Here is a summary of logging parameters:
@multitable @columnfractions .2 .8
@headitem Tag @tab Meaning
@item @var{<type>}
@tab Defines logging type (file, console or syslog). For each type a mandatory
attribute must be present:
  471. @itemize @bullet
  472. @item @emph{filename} - path to log file for file logging type;
  473. @item @emph{facility} - syslog logging facility.
  474. @end itemize
  475. @item @var{<level>}
@tab Defines logging level (error, warning, info or debug).
  477. @item @var{<log_buffer>}
  478. @tab For file and console logging defines buffer in bytes (kilo, mega or giga
  479. bytes) that would be used for logging output.
  480. @item @var{<log_urls>}
  481. @tab Flag that defines whether all urls in message would be logged. Useful for
  482. testing.
  483. @item @var{<debug_ip>}
  484. @tab List that contains ip addresses for which debugging would be turned on. For
  485. more information about ip lists look at config atoms section.
  486. @end multitable
  487. @section Metrics configuration.
  488. Setting of rspamd metrics is the main way to change rules' weights. You can set
  489. up weights for all rules: for those that have static weights (for example simple
  490. regexp rules) and for those that have dynamic weights (for example statistic
  491. rules). In all cases the base weight of rule is multiplied by metric's weight value.
  492. For static rules base weight is usually 1.0. So we have:
  493. @itemize @bullet
  494. @item @math{w_{symbol} = w_{static} * factor} - for static rules
  495. @item @math{w_{symbol} = w_{dynamic} * factor} - for dynamic rules
  496. @end itemize
It is also possible to set a so-called "grow factor": an additional multiplier
that is applied when a metric has more than one symbol. The power of this factor
is incremented for each added symbol. This can be written as:
@math{w_{total} = w_1 * gf ^ 0 + w_2 * gf ^ 1 + ... + w_n * gf ^ {n - 1}}
The grow factor increases the weight of rules when a message has many
symbols (and is therefore likely spam). Note that only rules with positive weights
increase the power of the grow factor; rules with negative weights are just added. Also note
that the grow factor can be less than 1, but this is uncommon (in this case the
weight is lowered when a message has many symbols). Metrics
can be set up with config section(s) @emph{metric}:
  507. @example
  508. <metric>
  509. <name>test_metric</name>
  510. <action>reject</action>
  511. <symbol weight="0.1">MIME_HTML_ONLY</symbol>
  512. <grow_factor>1.1</grow_factor>
  513. </metric>
  514. @end example
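The grow factor formula given before this example can be sketched in Python as
follows. This is one reading of the text and not rspamd's code: the function
name @code{total_weight} is hypothetical, and the sketch assumes that the
exponent is advanced only by symbols with positive weight, while negative
weights are added unchanged.

```python
def total_weight(weights, grow_factor=1.0):
    """Sum symbol weights, applying the grow factor to positive ones.

    weights: symbol weights in the order the symbols were added.
    """
    total = 0.0
    power = 0  # exponent used for the next positive-weight symbol
    for w in weights:
        if w > 0:
            total += w * grow_factor ** power
            power += 1
        else:
            total += w  # negative (or zero) weights are added as is
    return total
```

With the default grow factor of 1.0 the result is the plain sum of weights;
with a grow factor of 2.0 the weights 1, 2 and 3 give 1 + 4 + 12 = 17.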
Note that you need to add symbols to a metric when you add new rules.
The weight of a newly added rule depends on its importance. For
example, if you are absolutely sure that a rule adds a symbol only for spam
messages, you can increase the weight of this rule so that it filters such spam.
But when you increase the weight of a rule you should be reasonably sure that it
does not raise the false positive rate to an unacceptable level (false
positives are errors where good mail is treated as spam). Rspamd comes with
a set of default rules, and the default weights of these rules are placed in
rspamd.xml.sample. In most cases it is reasonable to adjust them for your mail
system, for example to increase the weights of some rules or decrease others. Also
note that the default grow factor is 1.0, which means that the weights of rules do not
depend on the number of added symbols. In some situations it is useful to set the grow
factor to a value greater than 1.0. By modifying weights it is also possible to
manage the static multiplier for dynamic rules.
  529. @section Workers configuration.
Workers are rspamd processes that do specific jobs. Four types of workers are
currently supported:
@enumerate 1
@item Normal worker - a typical worker that processes messages.
@item Controller worker - a worker that manages rspamd, gets statistics and performs
learning tasks.
@item Fuzzy storage worker - a worker that contains a collection of fuzzy
hashes.
@item LMTP worker - an experimental worker that acts as an LMTP server.
@end enumerate
These types of workers have some common parameters:
  541. @multitable @columnfractions .2 .8
@headitem Parameter @tab Meaning
@item @emph{<type>}
@tab Type of worker (normal, controller, lmtp or fuzzy)
@item @emph{<bind_socket>}
@tab Socket address to bind this worker to. Inet and unix sockets are supported:
  547. @example
  548. <bind_socket>localhost:11333</bind_socket>
  549. <bind_socket>/var/run/rspamd.sock</bind_socket>
  550. @end example
  551. @noindent
  552. Also for inet sockets you may specify @code{*} as address to bind to all
  553. available inet interfaces:
  554. @example
  555. <bind_socket>*:11333</bind_socket>
  556. @end example
  557. @noindent
  558. @item @emph{<count>}
@tab Number of worker processes of this type. By default this number is
equal to the number of logical processors in the system.
  561. @item @emph{<maxfiles>}
  562. @tab Maximum number of file descriptors available to this worker process.
  563. @item @emph{<maxcore>}
@tab Maximum size of the core file that would be dumped in case of critical errors
(in mega/kilo/giga bytes).
  566. @end multitable
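The three forms of @emph{<bind_socket>} values shown in the table can be told
apart with a few lines of Python. This is a sketch under assumptions and not
rspamd's parser: the function name @code{parse_bind_socket} is hypothetical, a
value starting with a slash is treated as a unix socket path, and @code{*} is
mapped to the empty host string, which Python's socket module uses for "all
interfaces".

```python
def parse_bind_socket(value):
    """Return ("unix", path) or ("inet", host, port) for a bind_socket value."""
    value = value.strip()
    if value.startswith("/"):
        return ("unix", value)
    host, sep, port = value.rpartition(":")
    if not sep or not port.isdigit():
        raise ValueError("expected host:port or a socket path: %r" % value)
    if host == "*":
        host = ""  # empty host means INADDR_ANY for socket.bind()
    return ("inet", host, int(port))
```

For example, @code{localhost:11333} yields an inet address with port 11333,
while @code{/var/run/rspamd.sock} yields a unix socket path.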
Each worker type can also have specific parameters:
  568. @itemize @bullet
  569. @item Normal worker:
  570. @itemize @bullet
@item @var{<custom_filters>} - path to dynamically loaded plugins that perform the actual
checks of incoming messages. These modules are described further.
@item @var{<mime>} - if this parameter is "no" then this worker assumes that incoming
messages are in non-mime format (e.g. forum messages) and standard mime
headers are added to them.
  576. @end itemize
  577. @item Controller worker:
  578. @itemize @bullet
@item @var{<password>} - a password that is used to access the controller's
privileged commands.
  581. @end itemize
  582. @item Fuzzy worker:
  583. @itemize @bullet
@item @var{<hashfile>} - a path to the file where fuzzy hashes are permanently stored.
@item @var{<use_judy>} - if libJudy is present in the system, use it for faster storage.
@item @var{<frequent_score>} - if judy is not turned on, hashes with a score
greater than this value are placed in a special faster list (this is designed
to increase lookup speed for frequent hashes).
@item @var{<expire>} - expiration time of fuzzy hashes after their placement in storage.
  590. @end itemize
  591. @end itemize
  592. These parameters can be set inside worker's definition:
  593. @example
  594. <worker>
  595. <type>fuzzy</type>
  596. <bind_socket>*:11335</bind_socket>
  597. <count>1</count>
  598. <maxfiles>2048</maxfiles>
  599. <maxcore>0</maxcore>
  600. <!-- Other params -->
  601. <param name="use_judy">yes</param>
  602. <param name="hashfile">/spool/rspamd/fuzzy.db</param>
  603. <param name="expire">10d</param>
  604. </worker>
  605. @end example
  606. @noindent
The purpose of each worker type is described later. The main parameters
that can be defined are bind sockets for workers, their count, the password for
the controller's commands and parameters for fuzzy storage. The default config provides
reasonable values for these parameters (except the password, of course), so for a basic
configuration you may just replace the controller's password with a more secure one.
  612. @section Classifiers configuration.
  613. @subsection Common classifiers options.
Each classifier has a mandatory option @var{type} that defines the internal algorithm
that is used for classifying. Currently only @code{winnow} is supported. You can
read a theoretical description of the algorithm here:
@url{http://www.siefkes.net/papers/winnow-spam.pdf}
The common classifier configuration consists of base classifier parameters and
definitions of two (or more) statfiles. During classification rspamd
checks each statfile in the classifier and selects the one that has a higher
probability/weight than the others. If all statfiles have zero weight, the classifier
does not add any symbols. The common classifier options are:
  623. @multitable @columnfractions .2 .8
@headitem Tag @tab Meaning
  625. @item @var{<tokenizer>}
  626. @tab Tokenizer to extract tokens from messages. Currently only @emph{osb}
  627. tokenizer is supported
  628. @item @var{<metric>}
  629. @tab Metric to which this classifier would insert symbol.
  630. @end multitable
The option @var{min_tokens} is also supported to specify the minimum number of tokens to
work with (this is useful to avoid classifying short messages, as statistics
are practically useless for a small number of tokens). Here is an example of a base
classifier config:
  635. @example
  636. <classifier type="winnow">
  637. <tokenizer>osb-text</tokenizer>
  638. <metric>default</metric>
  639. <option name="min_tokens">20</option>
  640. <statfile>
  641. ...
  642. </statfile>
  643. </classifier>
  644. @end example
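The selection rule described at the start of this subsection (pick the statfile
with the highest weight, add nothing if all weights are zero) can be sketched in
Python. This is an illustration only: the function name @code{classify} and the
list-of-pairs input are hypothetical, and how the per-statfile weights are
computed is left out.

```python
def classify(statfile_weights):
    """Pick the symbol of the statfile with the highest weight.

    statfile_weights: list of (symbol, weight) pairs, one per statfile.
    Returns None when no statfile has a weight above zero, in which case
    the classifier adds no symbol.
    """
    best_symbol = None
    best_weight = 0.0
    for symbol, weight in statfile_weights:
        if weight > best_weight:
            best_symbol = symbol
            best_weight = weight
    return best_symbol
```

If two statfiles have the same weight, this sketch keeps the one listed first;
the text does not say how such ties are resolved.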
  645. @subsection Statfiles options.
The most common statfile options are @var{symbol} and @var{size}. The first one defines
which symbol is inserted if this statfile has the maximal weight inside the
classifier, and size defines the statfile size on disk and in memory. Note that
statfiles are mapped directly to memory, so you should pay attention to the
parameter @var{statfile_pool_size} of the main section, which defines the maximum amount
of memory for mapping statistic files. Also note that statistic files are
of constant size: if you define a 100 megabyte statfile it occupies 100
megabytes of disk space and 100 megabytes of memory when it is used (mapped).
Each statfile is indexed by tokens and contains so-called "token chains". This
mechanism is described further, but note that each statfile has a parameter
"free tokens" that defines how much space is available for new tokens. If a
statfile has no free space, the least used tokens are removed from the
statfile.
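The memory arithmetic above can be checked with a small Python sketch. It is
not part of rspamd: the names @code{parse_size} and @code{fits_in_pool} are
hypothetical, and the sketch assumes that the suffixes K, M and G stand for
powers of 1024, which the text does not state explicitly.

```python
# Assumption: size suffixes are binary multiples (1024-based).
SUFFIXES = dict(K=1024, M=1024 ** 2, G=1024 ** 3)


def parse_size(text):
    """Convert a size such as "2048", "64K", "100M" or "1G" to bytes."""
    text = text.strip()
    multiplier = SUFFIXES.get(text[-1:].upper())
    if multiplier is None:
        return int(text)  # plain number of bytes
    return int(text[:-1]) * multiplier


def fits_in_pool(statfile_sizes, pool_size):
    """Check that all statfiles can be mapped within the pool limit."""
    total = sum(parse_size(size) for size in statfile_sizes)
    return total <= parse_size(pool_size)
```

Two statfiles of 100M each need a pool of at least 200M under these
assumptions, because every statfile is mapped at its full size.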
Here is a list of common statfile options:
@multitable @columnfractions .2 .8
@headitem Tag @tab Meaning
  662. @item @var{<symbol>}
  663. @tab Defines symbol to insert for this statfile.
  664. @item @var{<size>}
  665. @tab Size of this statfile in bytes (kilo/mega/giga bytes).
  666. @item @var{<path>}
  667. @tab Filesystem path to statistic file.
  668. @item @var{<normalizer>}
  669. @tab Defines weight normalization structure. Can be lua function name or
  670. internal normalizer. Internal normalizer is defined in format:
  671. "internal:<max_weight>" where max_weight is fractional number that limits the
  672. maximum weight of this statfile's symbol (this is so called dynamic weight).
  673. @item @var{<binlog>}
  674. @tab Defines binlog affinity: master or slave. This option is used for statfiles
  675. binary sync that would be described further.
  676. @item @var{<binlog_master>}
@tab Defines the credentials of the binlog master for this statfile.
  678. @item @var{<binlog_rotate>}
  679. @tab Defines rotate time for binlog.
  680. @end multitable
  681. Internal normalization of statfile weight works in this way:
  682. @itemize @bullet
  683. @item @math{R_{score} = 1} when @math{W_{statfile} < 1}
@item @math{R_{score} = W_{statfile} ^ 2} when @math{1 < W_{statfile} < max / 2}
@item @math{R_{score} = W_{statfile}} when @math{max / 2 < W_{statfile} < max}
  686. @item @math{R_{score} = max} when @math{W_{statfile} > max}
  687. @end itemize
  688. The final result weight would be: @math{weight = R_{score} * W_{weight}}.
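The piecewise rule above translates directly into Python. This is a sketch and
not rspamd's code: the function names are hypothetical, and because the text
leaves the boundary values (exactly 1, max / 2 and max) unspecified, the sketch
assigns each boundary to the branch above it.

```python
def internal_normalize(w_statfile, max_weight):
    """Apply the internal normalizer to a raw statfile weight."""
    if w_statfile < 1:
        return 1.0
    if w_statfile < max_weight / 2:
        return float(w_statfile) ** 2
    if w_statfile < max_weight:
        return float(w_statfile)
    return float(max_weight)


def final_weight(w_statfile, max_weight, symbol_weight):
    """Multiply the normalized score by the symbol weight from the metric."""
    return internal_normalize(w_statfile, max_weight) * symbol_weight
```

With @code{internal:3}, a raw weight of 1.25 is squared to 1.5625, a raw weight
of 2 is kept as 2, and anything above 3 is capped at 3.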
Here is a sample classifier configuration with two statfiles that can be used for
spam/ham classification:
  691. @example
  692. <symbol weight="-1.00">WINNOW_HAM</symbol>
  693. <symbol weight="1.00">WINNOW_SPAM</symbol>
  694. ...
  695. <!-- Classifiers section -->
  696. <classifier type="winnow">
  697. <tokenizer>osb-text</tokenizer>
  698. <metric>default</metric>
  699. <option name="min_tokens">20</option>
  700. <statfile>
  701. <symbol>WINNOW_HAM</symbol>
  702. <size>100M</size>
  703. <path>/var/run/rspamd/data.ham</path>
  704. <normalizer>internal:3</normalizer>
  705. </statfile>
  706. <statfile>
  707. <symbol>WINNOW_SPAM</symbol>
  708. <size>100M</size>
  709. <path>/var/run/rspamd/data.spam</path>
  710. <normalizer>internal:3</normalizer>
  711. </statfile>
  712. </classifier>
  713. <!-- End of classifiers section -->
  714. @end example
  715. @noindent
In this sample we define a classifier that contains two statfiles:
@emph{WINNOW_SPAM} and @emph{WINNOW_HAM}. Each statfile is 100 megabytes in size
(so together they occupy 200 megabytes while classifying). Each statfile also has a maximum
weight of 3, so with these weights (-1 for WINNOW_HAM and 1 for WINNOW_SPAM) the
resulting weight of symbols is 0..3 for @emph{WINNOW_SPAM} and 0..-3 for
@emph{WINNOW_HAM}.
  722. @section Composites config.
Composite symbols are rules that combine several other symbols by
using logical expressions. For example, you can add a composite symbol COMP1 that
is added if SYMBOL1 and SYMBOL2 are present after message checks. When a
composite symbol is added, the symbols that form that composite are removed. So
if a message has symbols SYMBOL1 and SYMBOL2, the composite symbol COMP1 is
inserted in place of these two symbols. Note that if a composite symbol is not
inserted, the symbols that are inside it are not touched. So SYMBOL1 and SYMBOL2
can be present separately, but when COMP1 is added SYMBOL1 and SYMBOL2 are
removed. Composite symbols can be defined in the main configuration section. Here
is an example of composite rule definitions:
  733. @example
  734. <composite name="ONCE_RECEIVED_PBL">ONCE_RECEIVED &amp; RECEIVED_PBL</composite>
  735. <composite name="SPF_TRUSTED">R_SPF_TRUSTED &amp; R_SPF_ALLOW</composite>
  736. <composite name="TRUSTED_FROM">R_TRUSTED_FROM &amp; R_SPF_ALLOW</composite>
  737. @end example
Note that you need to insert the xml entity (@emph{&amp;}) instead of the '&' symbol.
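The replacement behaviour of composites can be sketched in Python. This is a
simplified illustration and not rspamd's expression evaluator: the function name
@code{apply_composites} is hypothetical, only conjunctions (the '&' operator
used in the examples above) are handled, and every composite is tested against
the symbols found by the message checks rather than against the result of
earlier composites.

```python
def apply_composites(symbols, composites):
    """Replace groups of symbols by their composite symbol.

    symbols: set of symbol names found for a message.
    composites: list of (name, members) pairs; a composite matches when
    all of its members are present.
    """
    result = set(symbols)
    for name, members in composites:
        if all(member in symbols for member in members):
            result.difference_update(members)  # members are removed...
            result.add(name)                   # ...and the composite added
    return result
```

A composite whose members are not all present leaves the symbols untouched, as
described above.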
  739. @section Modules config.
  740. @subsection Lua modules loading.
  741. For loading custom lua modules you should use @emph{<modules>} section:
  742. @example
  743. <modules>
  744. <module>/usr/local/etc/rspamd/plugins/lua</module>
  745. </modules>
  746. @end example
  747. @noindent
Each @emph{<module>} directive defines a path to lua modules. If this is a
directory, all @code{*.lua} files inside that directory are loaded. If
this is a file, it is loaded directly.
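The directory-or-file rule can be sketched in Python as a path expansion step.
This is an illustration only: the function name @code{lua_files} is
hypothetical, and the alphabetical ordering is an assumption, since the text
does not say in which order files from a directory are loaded.

```python
import glob
import os


def lua_files(path):
    """Return the lua files that a <module> path refers to."""
    if os.path.isdir(path):
        # a directory: every *.lua file directly inside it
        return sorted(glob.glob(os.path.join(path, "*.lua")))
    # a single file is used as is
    return [path]
```

Files in subdirectories are not included in this sketch.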
  751. @subsection Modules configuration.
Each module can have its own config section (this is true not only for internal
modules but also for lua modules). Such a section is called @emph{<module>} and has a
mandatory attribute @emph{"name"}. Each module can be configured by
@emph{<option>} directives. These directives must also have a @emph{"name"}
attribute. So module configuration is done in @code{param = value} style:
  757. @example
  758. <module name="fuzzy_check">
  759. <option name="servers">localhost:11335</option>
  760. <option name="symbol">R_FUZZY</option>
  761. <option name="min_length">300</option>
  762. <option name="max_score">10</option>
  763. </module>
  764. @end example
  765. @noindent
  766. The common parameters are:
  767. @itemize @bullet
  768. @item symbol - symbol that this module should insert.
  769. @end itemize
But each module can have its own unique parameters. These are discussed
further in the detailed modules description. Also note that for internal modules you
should edit the @emph{<filters>} parameter in the main section: this parameter defines
which internal modules are turned on in this configuration.
  774. @section Views config.
It is possible to make different rules for different
networks/senders/recipients. For this purpose you can use rspamd views: maps of
conditions (ip, sender, recipients) and actions associated with them. For
example, you can turn rspamd off for specific conditions by using the
@emph{skip_check} action or check only specific rules. Views are defined inside the
@emph{<view>} xml section. Here is the list of available tags inside this section:
  781. @multitable @columnfractions .2 .8
@headitem Tag @tab Meaning
  783. @item @var{<skip_check>}
  784. @tab Boolean flag (yes or no) that specifies whether rspamd checks should be
  785. turned off for this ip
  786. @item @var{<symbols>}
  787. @tab Defines comma-separated list of symbols that should be checked for this
  788. view
  789. @item @var{<ip>}
  790. @tab Map argument that defines path to list of ip addresses (may be with CIDR
  791. masks) to which this view should be applied.
  792. @item @var{<client_ip>}
  793. @tab Map argument that defines path to list of ip addresses of rspamd clients
  794. to which this view should be applied. Note that this is ip of rspamd client not
  795. ip of message's sender.
  796. @item @var{<from>}
  797. @tab Map argument that defines path to list of senders to which this view should
  798. be applied.
  799. @end multitable
Here is an example view definition:
  801. @example
  802. <view>
  803. <skip_check>yes</skip_check>
  804. <ip>file:///usr/local/etc/rspamd/whitelist</ip>
  805. </view>
  806. @end example
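Matching an address against an ip list with optional CIDR masks, as used by the
@emph{<ip>} and @emph{<client_ip>} tags, can be sketched with Python's standard
@code{ipaddress} module. This is an illustration and not rspamd's map code: the
function name @code{ip_in_list} is hypothetical, and the list entries are
assumed to be already read from the map file, one address or network per entry.

```python
import ipaddress


def ip_in_list(ip, entries):
    """Check whether ip matches any address or CIDR network in entries."""
    address = ipaddress.ip_address(ip)
    for entry in entries:
        # strict=False accepts entries with host bits set, e.g. 10.1.2.3/8
        network = ipaddress.ip_network(entry.strip(), strict=False)
        if address in network:
            return True
    return False
```

An entry without a mask is treated as a single host, so it matches only that
exact address.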
  807. @chapter Rspamd clients interaction.
  808. @section Introduction.
Once you have a basic config file you may test rspamd functionality by using
either a telnet-like utility or the @emph{rspamc} client. To test a newly installed
config it is possible to run a config file test:
  812. @example
  813. $ rspamd -t
  814. syntax OK
  815. @end example
The rspamc utility is written in @code{perl} and uses perl modules that are
shipped with rspamd: @emph{Mail::Rspamd::Client} for the client protocol and
@emph{Mail::Rspamd::Config} for reading and writing configuration. The
documentation for these modules can be displayed with the commands:
  820. @example
  821. $ perldoc Mail::Rspamd::Client
  822. $ perldoc Mail::Rspamd::Config
  823. @end example
Another way to access rspamd is to use the perl client API:
  825. @example
use Mail::Rspamd::Client;

# Hosts to query, in host:port form.
my $config = @{
    hosts => ['localhost:11333'],
@};
# $config is already a hash reference, so it is passed as is.
my $client = Mail::Rspamd::Client->new($config);
if (! $client->ping()) @{
    die "Cannot ping rspamd: $client->@{error@}";
@}
# $testmsg is expected to hold the raw text of the message to check.
my $result = $client->check($testmsg);
if ($result->@{'default'@}->@{isspam@} eq 'True') @{
    # do something with spam message here
@}
  838. @end example
  839. @section Rspamc protocol.
The rspamc protocol is an extension of the traditional spamc protocol that is used by
spamassassin. It looks like a traditional HTTP session: the first line is
the method with version, headers follow on the next lines, and the message itself
is expected after an empty line:
  844. @example
  845. <REQUEST>
  846. SYMBOLS RSPAMC/1.1
  847. Content-Length: 2200
  848. <message octets>
  849. <REPLY>
  850. RSPAMD/1.1 0 OK
  851. Metric: default; True; 10.40 / 10.00 / 0.00
  852. Symbol: R_UNDISC_RCPT
  853. Symbol: ONCE_RECEIVED
  854. Symbol: R_MISSING_CHARSET
  855. Urls:
  856. @end example
  857. @noindent
  858. The format of method line can be presented as:
  859. @example
  860. <COMMAND> RSPAMC/<version>
  861. @end example
  862. @noindent
Version can be 1.0 or 1.1. The main difference is that in 1.1 the metrics output also
has a @emph{reject score} - a hard limit of score for the metric. This is
discussed in the description of user's options. Commands are:
  866. @multitable @columnfractions .2 .8
@headitem Command @tab Meaning
@item CHECK
@tab Check a message and output results for each metric, but do not output
symbols.
  871. @item SYMBOLS
  872. @tab Same as @emph{CHECK} but output symbols.
  873. @item PROCESS
  874. @tab Same as @emph{SYMBOLS} but output also original message with inserted
  875. X-Spam headers.
  876. @item PING
  877. @tab Do not do any processing, just check rspamd state:
  878. @example
  879. $ telnet localhost 11333
  880. Trying 127.0.0.1...
  881. Connected to localhost.
  882. Escape character is '^]'.
  883. PING RSPAMC/1.1
  884. RSPAMD/1.1 0 PONG
  885. Connection closed by foreign host.
  886. @end example
  887. @noindent
  888. @end multitable
  889. After command there should be one mandatory header: @strong{Content-Length} that
  890. defines message's length in bytes and optional headers:
  891. @multitable @columnfractions .2 .8
@headitem Header @tab Meaning
  893. @item @var{Deliver-To:}
  894. @tab Defines actual delivery recipient of message. Can be used for personalized
  895. statistic and for user specific options.
  896. @item @var{IP:}
  897. @tab Defines IP from which this message is received.
  898. @item @var{Helo:}
  899. @tab Defines SMTP helo.
  900. @item @var{From:}
  901. @tab Defines SMTP mail from command data.
  902. @item @var{Queue-Id:}
  903. @tab Defines SMTP queue id for message (can be used instead of message id in
  904. logging).
  905. @item @var{Rcpt:}
@tab Defines SMTP recipient (there may be several @emph{Rcpt:} headers).
  907. @item @var{Pass:}
  908. @tab If this header has @emph{"all"} value, all filters would be checked for
  909. this message.
  910. @item @var{Subject:}
  911. @tab Defines subject of message (is used for non-mime messages).
  912. @item @var{User:}
  913. @tab Defines SMTP user (this is currently unused in rspamd however).
  914. @end multitable
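A request with the mandatory @strong{Content-Length} header and some of the
optional headers from the table can be assembled as in the following Python
sketch. It is an illustration and not the rspamc client: the function name
@code{build_request} is hypothetical, and the use of CRLF line endings is an
assumption borrowed from HTTP, since the line terminator is not stated here.

```python
def build_request(command, message, headers=None, version="1.1"):
    """Build an rspamc request as bytes.

    command: CHECK, SYMBOLS, PROCESS or PING.
    message: raw message as bytes.
    headers: optional list of (name, value) pairs such as ("IP", "127.0.0.1").
    """
    lines = ["%s RSPAMC/%s" % (command, version),
             "Content-Length: %d" % len(message)]
    for name, value in headers or []:
        lines.append("%s: %s" % (name, value))
    # an empty line separates the headers from the message body
    head = "\r\n".join(lines) + "\r\n\r\n"
    return head.encode("ascii") + message
```

A list of pairs is used for the headers because @emph{Rcpt:} may appear several
times in one request.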
So the rspamc protocol allows passing a lot of data from the MTA to rspamd. This is used to
increase the speed of processing and for building filters (like the SPF filter). Also
note that rspamd supports the spamassassin spamc protocol and you can even pass
rspamc headers in spamc mode, but the reply of rspamd in spamc mode is much
shorter: it only uses the "default" metric and does not show additional options
for symbols. An rspamc reply looks like this:
  921. @example
  922. RSPAMD/1.1 0 OK
  923. Metric: default; True; 10.40 / 10.00 / 0.00
  924. Symbol: R_UNDISC_RCPT
  925. Symbol: ONCE_RECEIVED
  926. Symbol: R_MISSING_CHARSET
  927. Urls:
  928. @end example
  929. @noindent
The first line is the method reply: @code{<PROTOCOL>/<VERSION> <ERROR_CODE> <ERROR_REPLY>}.
The error code is 0 when no error occurred. After the first reply line comes the metrics
output. For @emph{SYMBOLS} and @emph{PROCESS} commands there are symbol lines
after each metric. For the @emph{PROCESS} command the original
message follows all metric results. A metric result line looks like this:
  935. @example
  936. Metric: <name>; <result>; <score> / <required_score> / <reject_score>
  937. @end example
  938. @noindent
  939. For 1.0 version of rspamc protocol @emph{reject_score} parameter is not printed.
  940. Symbol line looks like this:
  941. @example
  942. Symbol: <Name>[; param1[, param2...]]
  943. @end example
  944. @noindent
  945. Some symbols can have parameters attached. It is useful for example for RBL
  946. checks (you can insert additional data after symbol name), for statistic and
  947. fuzzy checks. Also rspamd inserts @emph{Urls} line in which all urls that are
  948. contained in message are printed in comma-separated list.
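The reply format described above can be parsed with a short Python sketch. It is
an illustration and not the code of @emph{Mail::Rspamd::Client}: the function
name @code{parse_reply} and the layout of the returned dictionary are
hypothetical, and the sketch covers only @emph{CHECK} and @emph{SYMBOLS}
replies (the message appended by @emph{PROCESS} is not handled).

```python
def parse_reply(text):
    """Parse a CHECK or SYMBOLS reply into a dictionary."""
    lines = text.splitlines()
    # first line: <PROTOCOL>/<VERSION> <ERROR_CODE> <ERROR_REPLY>
    protocol, code, status = lines[0].split(None, 2)
    reply = dict(protocol=protocol, code=int(code), status=status,
                 metrics=[], urls=[])
    for line in lines[1:]:
        key, _, value = line.partition(":")
        value = value.strip()
        if key == "Metric":
            name, result, scores = [part.strip() for part in value.split(";")]
            # scores has two numbers for version 1.0 and three for 1.1
            metric = dict(name=name, is_spam=(result == "True"),
                          scores=[float(s) for s in scores.split("/")],
                          symbols=[])
            reply["metrics"].append(metric)
        elif key == "Symbol" and reply["metrics"]:
            name, _, params = value.partition(";")
            params = [p.strip() for p in params.split(",") if p.strip()]
            reply["metrics"][-1]["symbols"].append((name.strip(), params))
        elif key == "Urls":
            reply["urls"] = [u.strip() for u in value.split(",") if u.strip()]
    return reply
```

Each symbol is attached to the most recent metric, which follows the rule that
symbol lines come after the metric they belong to.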
  949. Note that this protocol is used for normal workers. Controller, fuzzy storage
  950. and lmtp/smtp workers are using other protocols. For example controller's
  951. protocol is oriented on interactive sessions: you can pass many commands to
  952. controller before disconnecting. Fuzzy storage is using UDP for making
  953. interaction with storage faster. LMTP/SMTP workers are using lmtp and smtp
  954. protocols. All of these protocols would be described in further chapters about
  955. rspamd workers.
  956. @section Controller protocol.
  957. Rspamd controller can also be accessed by telnet, by rspamc client or by using
  958. perl module Mail::Rspamd::Client. Controller protocol accepts commands and it is
  959. possible to send several commands during a single session. Here is an example
  960. telnet session:
  961. @example
  962. >telnet localhost 11334
  963. Trying 127.0.0.1...
  964. Connected to localhost.
  965. Escape character is '^]'.
  966. Rspamd version 0.3.0 is running on spam1.rambler.ru
  967. stat
  968. Messages scanned: 1526901
  969. Messages treated as spam: 238171, 15.60%
  970. Messages treated as ham: 1288730, 84.40%
  971. Messages learned: 0
  972. Connections count: 1529758
  973. Control connections count: 15
  974. Pools allocated: 3059589
  975. Pools freed: 3056134
  976. Bytes allocated: 98545852799
  977. Memory chunks allocated: 8745374
  978. Shared chunks allocated: 7
  979. Chunks freed: 8737507
  980. Oversized chunks: 768784
  981. Fuzzy hashes stored: 0
  982. Fuzzy hashes expired: 0
  983. Statfile: WINNOW_SPAM (version 186); length: 100.0 MB; free blocks: 748504; total blocks: 6553581; free: 11.42%
  984. Statfile: WINNOW_HAM (version 186); length: 100.0 MB; free blocks: 748504; total blocks: 6553581; free: 11.42%
  985. END
  986. @end example
  987. @noindent
  988. So you can see that reply from controller is ended with line that contains word
  989. @strong{END}. It is also possible to get summary help for controller's commands:
  990. @example
  991. help
  992. Rspamd CLI commands (* - privilleged command):
  993. help - this help message
  994. (*) learn <statfile> <size> [-r recipient] [-m multiplier] [-f from] [-n] - learn message to specified statfile
  995. quit - quit CLI session
  996. (*) reload - reload rspamd
  997. (*) shutdown - shutdown rspamd
  998. stat - show different rspamd stat
  999. counters - show rspamd counters
  1000. uptime - rspamd uptime
  1001. END
  1002. @end example
  1003. @noindent
Note that some commands are privileged - you are required to enter a
password for them:
  1006. @example
  1007. >telnet localhost 11334
  1008. Trying 127.0.0.1...
  1009. Connected to localhost.
  1010. Escape character is '^]'.
  1011. Rspamd version 0.3.0 is running on spam1.rambler.ru
  1012. reload
  1013. not authorized
  1014. END
  1015. password q1
  1016. password accepted
  1017. END
  1018. reload
  1019. reload request sent
  1020. END
  1021. Connection closed by foreign host.
  1022. @end example
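Because every controller reply ends with an @strong{END} line, a client can
split the received text into replies as in this Python sketch. It is an
illustration only: the function name @code{split_replies} is hypothetical, and
the input is assumed to contain only lines sent by the controller after its
greeting line (the greeting is not followed by @strong{END}).

```python
def split_replies(lines):
    """Group controller output lines into replies terminated by END."""
    replies = []
    current = []
    for line in lines:
        if line.strip() == "END":
            replies.append(current)  # END closes the current reply
            current = []
        else:
            current.append(line)
    # lines after the last END belong to an unfinished reply and are dropped
    return replies
```

For the session above this yields three replies: "not authorized", "password
accepted" and "reload request sent".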
  1023. @noindent
  1024. This password is configured in rspamd.xml in worker section where you are
  1025. describing controller:
  1026. @example
  1027. <worker>
  1028. <type>controller</type>
  1029. ...
  1030. <!-- Other params -->
  1031. <param name="password">q1</param>
  1032. </worker>
  1033. @end example
In many cases it is easier to use rspamc to access the controller. Here is
an example of learning statfiles using the rspamc CLI:
  1036. @example
  1037. % rspamc -h localhost:11334 -P q1 -s WINNOW_HAM learn < /tmp/exim.eml
  1038. Results for host localhost:11334:
  1039. Learn succeed. Sum weight: 1.51
  1040. % rspamc -h localhost:11334 -P q1 -s WINNOW_SPAM learn < /tmp/bad.eml
  1041. Results for host localhost:11334:
  1042. Learn succeed. Sum weight: 1.51
  1043. @end example
Note that rspamc handles passwords, timeouts and
error handling internally, which makes these tasks rather easy.
  1046. @section More about rspamc client.
Rspamc is a small and simple client that simplifies common
rspamd management tasks. Rspamc is written in perl and requires some modules for
its work:
  1050. @itemize @bullet
  1051. @item Mail::Rspamd::Client - a module that contains common function for
  1052. accessing rspamd, shipped with rspamd and installed automatically
  1053. @item Term::Cap - a module that allows basic interaction with terminal, can be
  1054. obtained via @url{http://www.cpan.org, cpan}.
  1055. @end itemize
  1056. Rspamc accepts several command line options:
  1057. @example
  1058. % rspamc --help
  1059. Usage: rspamc.pl [-h host] [-H hosts_list] [-P password] [-c conf_file] [-s statfile] [-d user@@domain] [command] [path]
  1060. -h host to connect (in format host:port) or unix socket path
  1061. -H path to file that contains list of hosts
  1062. -P define control password
  1063. -c config file to parse
  1064. -s statfile to use for learn commands
  1065. Additional options:
  1066. -d define deliver-to header
  1067. -w define weight for fuzzy operations
  1068. -S define search string for IMAP operations
  1069. -i emulate that message was send from specified IP
  1070. -p pass message throught all filters
  1071. Notes:
  1072. imap format: imap:user:<username>:password:[<password>]:host:<hostname>:mbox:<mboxname>
  1073. Password may be omitted and then it would be asked in terminal
  1074. imaps requires IO::Socket::SSL
  1075. IMAP search strings samples:
  1076. ALL - All messages in the mailbox;
  1077. FROM <string> - Messages that contain the specified string in the envelope structure's FROM field;
  1078. HEADER <field-name> <string> - Messages that have a header with the specified field-name and that
  1079. contains the specified string in the text of the header (what comes after the colon);
  1080. NEW - Messages that have the Recent flag set but not the Seen flag.
  1081. This is functionally equivalent to "(RECENT UNSEEN)".
  1082. OLD - Messages that do not have the Recent flag set.
  1083. SEEN - Messages that have the Seen flag set.
  1084. SENTBEFORE <date> - Messages whose [RFC-2822] Date: header (disregarding time and timezone)
  1085. is earlier than the specified date.
  1086. TO <string> - Messages that contain the specified string in the envelope structure's TO field.
  1087. TEXT <string> - Messages that contain the specified string in the header or body of the message.
  1088. OR <search-key1> <search-key2> - Messages that match either search key (same for AND and NOT operations).
  1089. Version: 0.3.0
  1090. @end example
  1091. @noindent
  1092. After options you should specify command to execute, for example:
  1093. @example
  1094. % rspamc symbols < /tmp/exim.eml
  1095. @end example
  1096. @noindent
  1097. After command name you may specify objects to apply to: files, directories or
  1098. even imap folders:
  1099. @itemize @bullet
  1100. @item A single file:
  1101. @example
  1102. % rspamc symbols /tmp/exim.eml
  1103. @end example
  1104. @noindent
  1105. @item A list of files:
  1106. @example
  1107. % rspamc symbols /tmp/*.eml
  1108. @end example
  1109. @noindent
  1110. @item Directories:
  1111. @example
  1112. % rspamc symbols /tmp/*.eml /tmp/to_scan/
  1113. @end example
  1114. @noindent
  1115. @item IMAP folder:
  1116. @example
  1117. % rspamc symbols imap:user:username:password::host:localhost:mbox:INBOX
  1118. Enter IMAP password:
  1119. @end example
  1120. @noindent
  1121. Note that you can leave the password empty and be prompted for it
  1122. during execution (the Perl module Term::ReadKey is also required to turn
  1123. off echoing while the password is typed).
  1124. @end itemize
  1125. When fetching IMAP messages you may also supply a search string with the -S
  1126. option. Some examples of IMAP search strings can be found in the help message. For
  1127. more complex queries, see the description of IMAP4 search keys in RFC 3501,
  1128. available for example at @url{http://www.faqs.org/rfcs/rfc3501.html}. IMAP access
  1129. may be useful for setting up automatic learning scripts. It is also possible to
  1130. use IMAP over SSL by specifying @strong{imaps} instead of @strong{imap} as the
  1131. first component. Note that SSL access requires the @emph{IO::Socket::SSL} Perl
  1132. module.
  1133. @chapter Statistics and hashes storage.
  1134. @section Introduction.
  1135. First of all we need to define precisely the purposes of hashes and statistics. Hashes
  1136. are used to find very similar messages (for example messages in which only
  1137. a few words are changed), while statistics estimate the @strong{probability}
  1138. that a message belongs to a specified class of messages. So when you teach rspamd a
  1139. message's hash you just add this hash to the storage, and when you teach rspamd
  1140. statistics you add tokens from the message to the specified class. Statistics are thus a
  1141. probabilistic method of filtering messages, while fuzzy hashes detect specific
  1142. patterns in messages and filter them.
  1143. @section Classifiers and statistic.
  1144. @subsection Tokenization.
  1145. Rspamd currently supports the OSB-Winnow statistical algorithm. Let's describe it in
  1146. detail. First of all the message is separated into a set of tokens. The algorithm
  1147. for extracting tokens is currently rather simple:
  1148. @enumerate 1
  1149. @item Extract graph symbols up to the first non-graph symbol (whitespace, punctuation
  1150. etc.); the group of graph symbols forms a token, and non-graph symbols are separators.
  1151. @item Fill an array with tokens until the @strong{window size} is reached (currently
  1152. this size is 5 tokens).
  1153. @item Get pairs of tokens from the array and extract their hashes:
  1154. @itemize @bullet
  1155. @item * . . . * -> token1 (h1, h5);
  1156. @item . * . . * -> token2 (h2, h5);
  1157. @item . . * . * -> token3 (h3, h5);
  1158. @item . . . * * -> token4 (h4, h5);
  1159. @end itemize
  1160. @noindent
  1161. @item Insert these tokens into the statfile (indexed by the first hash).
  1162. @item Shift the window to the next word.
  1163. @end enumerate
  1164. So after tokenizing we have tokens, each of which contains 2 hashes of 2
  1165. words from the message. This mechanism counts not only the words themselves but
  1166. also their combinations within a message, thus providing more accurate statistics.
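The tokenization steps above can be sketched in Python. This is only an
illustration, not rspamd's implementation: the names @code{word_hash} and
@code{osb_tokens} are made up for the example, @code{zlib.crc32} stands in for
the real hash function, and runs of letters and digits are treated as words.

```python
import re
import zlib

WINDOW_SIZE = 5  # window size named in the text


def word_hash(word):
    # Stand-in hash (assumption); rspamd uses its own hash function.
    return zlib.crc32(word.encode("utf-8"))


def osb_tokens(text, window=WINDOW_SIZE):
    # Runs of letters/digits form words; everything else separates them
    # (a simplification of the "graph symbol" rule).
    words = re.findall(r"\w+", text)
    tokens = []
    for end in range(window - 1, len(words)):
        last = word_hash(words[end])
        # Pair each earlier word in the window with the last one:
        # (h1, h5), (h2, h5), (h3, h5), (h4, h5).
        for pos in range(end - window + 1, end):
            tokens.append((word_hash(words[pos]), last))
    return tokens
```

Each full window of 5 words contributes 4 tokens, so a text with fewer than 5
words yields no tokens in this sketch.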
  1167. @subsection Classifying.
  1168. The @strong{winnow} algorithm is used for classification. This statistical
  1169. algorithm operates not with probabilities but with weights. Each token has
  1170. its own weight, and when we teach a statfile some tokens rspamd does several
  1171. things:
  1172. @enumerate 1
  1173. @item Try to find the token inside the statfile.
  1174. @item If the token is found, multiply its weight by the so-called @strong{promotion
  1175. factor} (currently 1.23).
  1176. @item If the token is not found, insert it into the statfile with weight 1.
  1177. @end enumerate
  1178. If a token's weight needs to be lowered, it is multiplied by the
  1179. @strong{demotion factor} (currently 0.83). The classification process is even simpler:
  1180. @enumerate 1
  1181. @item Extract tokens from a message.
  1182. @item For each statfile look up the weights of the obtained tokens and store their
  1183. sum.
  1184. @item Compare the sums for each statfile and select the statfile with the largest sum.
  1185. @item Normalize the weight and insert the symbol of the selected statfile.
  1186. @end enumerate
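The learning and classification steps above can be sketched as follows. This
is a minimal illustration under stated assumptions: a statfile is modelled as
a plain dictionary from token to weight, the function names are hypothetical,
and the final weight normalization is left out.

```python
PROMOTION = 1.23  # promotion factor quoted in the text
DEMOTION = 0.83   # demotion factor quoted in the text


def learn(statfile, tokens):
    # statfile: dict mapping token -> weight (a stand-in for the real file).
    for token in tokens:
        if token in statfile:
            statfile[token] *= PROMOTION
        else:
            statfile[token] = 1.0  # new tokens start with weight 1


def demote(statfile, tokens):
    # Lower the weight of tokens that are already known.
    for token in tokens:
        if token in statfile:
            statfile[token] *= DEMOTION


def classify(statfiles, tokens):
    # Sum the weights of the message tokens per statfile, pick the largest.
    sums = {
        symbol: sum(weights.get(token, 0.0) for token in tokens)
        for symbol, weights in statfiles.items()
    }
    return max(sums, key=sums.get), sums
```

Here @code{statfiles} maps a symbol such as WINNOW_HAM to its dictionary;
@code{classify} returns the winning symbol together with all sums.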
  1187. @subsection Statfiles synchronization.
  1188. Rspamd supports master/slave synchronization of statfiles. This is done by
  1189. writing changes to statfiles into a special @emph{binary log}. The binary log is a file
  1190. on the filesystem named like the statfile but with a @emph{.binlog} suffix. The binary log
  1191. consists of two-level indexes and binary changes to each statfile. After each
  1192. learning process the version of the affected statfiles is increased by 1 and a
  1193. record is written to the binary log. Binary logs have a fixed size limit and may have a
  1194. time limit (rotate time). The synchronization process may be described as follows:
  1195. @enumerate 1
  1196. @item The slave rspamd periodically asks the master for the version of the monitored statfiles.
  1197. @item If the master has a version that is larger than the slave's, the synchronization
  1198. process starts.
  1199. @item During synchronization the master looks up the version reported by the client
  1200. in the binary log.
  1201. @item If the version is found, all records that come @strong{after} the client's version
  1202. are sent to the client.
  1203. @item The client accepts the changes and applies the binary patches one by one, incrementing
  1204. the statfile's version.
  1205. @item If the version that the client reports is not found in the binary log, the complete
  1206. statfile is sent to the client (the slow way, but in practice this happens only
  1207. once for fresh slaves).
  1208. @end enumerate
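The decision the master makes in the steps above can be sketched like this.
The data layout is an assumption made for the example: the binary log is a
list of (version, patch) pairs, a patch is a (token, weight) pair, and the
function names are hypothetical.

```python
def changes_for_slave(binlog, statfile, slave_version):
    # binlog: list of (version, patch) pairs in increasing version order.
    versions = [version for version, _ in binlog]
    if slave_version in versions:
        # Send every record that comes after the version the slave reported.
        start = versions.index(slave_version) + 1
        return "patches", binlog[start:]
    # Unknown version: fall back to sending the complete statfile.
    return "full", statfile


def apply_patches(slave, patches):
    # Apply patches one by one, incrementing the version after each.
    for _, (token, weight) in patches:
        slave["data"][token] = weight
        slave["version"] += 1
```

A fresh slave reports a version the master does not have in its log, so it
receives the whole statfile once and only patches afterwards.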
  1209. Here is an example configuration for a master statfile:
  1210. @example
  1211. <statfile>
  1212. <symbol>WINNOW_HAM</symbol>
  1213. <size>100M</size>
  1214. <path>/spool/rspamd/data.ham</path>
  1215. <normalizer>internal:3</normalizer>
  1216. <binlog>master</binlog>
  1217. <binlog_rotate>1d</binlog_rotate>
  1218. </statfile>
  1219. @end example
  1220. @noindent
  1221. Here we define the binlog affinity (master), which automatically creates the binlog file
  1222. @file{/spool/rspamd/data.ham.binlog}, and set a time limit for it (1 day).
  1223. For slaves you should first set up a controller worker to accept network
  1224. connections (statfile synchronization is done via controller workers). The
  1225. second task is to define the affinity for the slave and the master's address:
  1226. @example
  1227. <statfile>
  1228. <symbol>WINNOW_HAM</symbol>
  1229. <size>100M</size>
  1230. <path>/spool/rspamd/data.ham</path>
  1231. <normalizer>internal:3</normalizer>
  1232. <binlog>slave</binlog>
  1233. <binlog_master>spam10:11334</binlog_master>
  1234. </statfile>
  1235. @end example
  1236. @subsection Conclusion.
  1237. Statfile synchronization makes it possible to set up an rspamd cluster that uses common
  1238. statfiles and to teach the whole cluster easily without unnecessary overhead.
  1239. @section Hashes and hash storage.
  1240. @subsection Fuzzy hashes.
  1241. The hashes that rspamd uses for messages are not cryptographic. Fuzzy hashes
  1242. are used instead. Fuzzy hashing is a technique that yields
  1243. similar hashes for similar messages (cryptographic hashes are usually very
  1244. different even if the input messages are very similar but not identical). The
  1245. main principle of fuzzy hashing is to break up the text parts of a message into small
  1246. pieces (blocks) and calculate a hash for each block using a so-called @emph{rolling
  1247. hash}. The final hash is then formed by setting its bytes from the
  1248. blocks. So if we have 2 messages that each contain 100 blocks and 99 of them
  1249. are identical, we get 2 hashes that differ in only one byte. We can then
  1250. consider one message to be 99% similar to the other.
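The idea of one hash byte per block can be illustrated with a short sketch.
It is deliberately simplified and is not rspamd's algorithm: blocks have a
fixed length instead of boundaries chosen by a rolling hash, @code{zlib.crc32}
is used as the per-block hash, and the names @code{fuzzy_hash} and
@code{similarity} are hypothetical.

```python
import zlib

BLOCK_SIZE = 8  # block length in bytes (illustrative assumption)


def fuzzy_hash(text, block_size=BLOCK_SIZE):
    # One output byte per fixed-size block of the input.
    data = text.encode("utf-8")
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    return bytes(zlib.crc32(block) & 0xFF for block in blocks)


def similarity(hash_a, hash_b):
    # Share of positions that hold the same byte in both hashes.
    same = sum(1 for a, b in zip(hash_a, hash_b) if a == b)
    return same / max(len(hash_a), len(hash_b), 1)
```

Two texts of 100 blocks that differ inside a single block produce hashes that
agree in at least 99 of 100 positions, which matches the 99% example above.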
  1251. @subsection Fuzzy storage.
  1252. In rspamd hashes can be stored in fuzzy storage. Fuzzy storage is a special
  1253. worker that stores hashes and replies with the score of a hash. Inside fuzzy
  1254. storage each hash has its own weight and list number. The list number is an integer
  1255. that specifies which list the hash belongs to. This number can be used in the
  1256. fuzzy_check plugin inside rspamd to add a custom symbol. There are two ways of
  1257. storing fuzzy hashes: in a set of linear linked lists or
  1258. in a very fast judy tree. The first way is good for a relatively small number
  1259. of fuzzy hashes. In this case @emph{fuzzy matching} is used, so you can find
  1260. not only identical hashes but also similar ones. For a large number of hashes, however,
  1261. this method is very slow. The second way requires libJudy on the system (it can be
  1262. found at @url{http://judy.sourceforge.net}) and turns off @emph{fuzzy matching}:
  1263. only identical hashes are found. On the other hand you may store millions
  1264. of hashes in a judy tree without wasting either memory or CPU.
  1265. @subsection Conclusion.
  1266. Fuzzy hashes are an efficient way to build various black or white lists. Fuzzy
  1267. storage can be distributed over several machines (if you specify several storage
  1268. servers rspamd selects the upstream by a hash of the fuzzy hash). A storage can also
  1269. contain several lists identified by number. Each hash has its own weight, which
  1270. makes it possible to set up dynamic rules that add different scores for different hashes.
  1271. @chapter Rspamd modules.
  1272. @section Introduction.
  1273. This chapter describes the modules that are shipped with rspamd. Here you can find
  1274. details about module configuration, principles of operation and tricks to make spam
  1275. filtering effective. The first sections describe internal modules written in C:
  1276. regexp (regular expressions), surbl (black list for URLs), fuzzy_check (checks
  1277. for fuzzy hashes), chartable (checks for character sets in messages) and emails
  1278. (checks for blacklisted email addresses in messages). Module configuration can
  1279. be done in lua or in the config file itself.
  1280. @subsection Lua configuration.
  1281. You may use lua to set configuration options for modules. With lua you can
  1282. write rather complex rules that contain not only text lines but also
  1283. lua functions that are called while processing messages. To load a lua
  1284. configuration, add a line like this to rspamd.xml:
  1285. @example
  1286. <lua src="/usr/local/etc/rspamd/lua/my.lua">fake</lua>
  1287. @end example
  1288. @noindent
  1289. It is possible to load several scripts this way. Inside the lua file a
  1290. global table named @var{config} is defined. This table should contain
  1291. configuration options for modules, indexed by module name. This can be written this
  1292. way:
  1293. @example
  1294. config['module_name'] = @{@}
  1295. local mconfig = config['module_name']
  1296. mconfig['option_name'] = 'option value'
  1297. local a = 'aa'
  1298. local b = 'bb'
  1299. mconfig['other_option'] = string.format('%s, %s', a, b)
  1300. @end example
  1301. @noindent
  1302. In this simple example we define a new element of the table that is associated with
  1303. the module named 'module_name'. We assign an empty table (@code{@{@}}) to it
  1304. and bind it to the local variable mconfig. Then we set some elements of this table,
  1305. which is equivalent to setting module options like this:
  1306. @example
  1307. option_name = option_value
  1308. other_option = aa, bb
  1309. @end example
  1310. @noindent
  1311. You may also assign functions to elements of module tables. Such functions
  1312. should accept one argument, the worker task object, and return a result specific to
  1313. that option: a number, string or boolean. This is shown in the following simple example:
  1314. @example
  1315. local function test (task)
  1316.   if task:get_ip() == '127.0.0.1' then
  1317.     return 1
  1318.   else
  1319.     return 0
  1320.   end
  1321. end
  1322. mconfig['some_option'] = test
  1323. @end example
  1324. In this example we assign to the module option 'some_option' a function that checks
  1325. the message's ip and returns 1 if that ip is '127.0.0.1' and 0 otherwise.
  1326. So using lua for configuration helps with writing complex rules and with
  1327. structuring them: you can place the options for specific modules in separate files
  1328. and use the lua function @code{dofile} to load them (or add another @code{<lua>}
  1329. tag to rspamd.xml).
  1330. @subsection XML configuration.
  1331. Options for rspamd modules can be set in the xml file too. This can be used for
  1332. simple and/or temporary rules and should not be used for complex rules, as this
  1333. would make the xml file too hard to read and edit. Complex rules are possible there, but
  1334. not recommended because they make the config file hard to understand. Here is a simple
  1335. example of module config options:
  1336. @example
  1337. <module name="module_name">
  1338. <option name="option_name">option_value</option>
  1339. <option name="other_option">aa, bb</option>
  1340. </module>
  1341. @end example
  1342. @noindent
  1343. Note that you need to encode xml entities, for example @code{&} as @code{&amp;}, and so
  1344. on. Also only utf8 encoding is allowed. In the sample rspamd configuration all
  1345. modules except the regexp module are configured via xml, as they have only settings,
  1346. whereas the regexp module has rules that are sometimes rather complex.
  1347. @section Regexp module.
  1348. @subsection Introduction.
  1349. The regexp module is one of the most important rspamd modules. It can
  1350. load regular expressions and filter messages according to them. It is also
  1351. possible to combine regexps into logical expressions to create complex filtering
  1352. rules. The following logical operators are allowed:
  1353. @itemize @bullet
  1354. @item & - logical @strong{AND} function
  1355. @item | - logical @strong{OR} function
  1356. @item ! - logical @strong{NOT} function
  1357. @end itemize
  1358. It is also possible to use brackets to set priorities in expressions. The regexp
  1359. module operates with @emph{regexp items} that can be combined with logical
  1360. operators into logical @emph{regexp expressions}. Each expression is associated
  1361. with its symbol, and if it evaluates to true for a message the symbol is
  1362. inserted. Note that rspamd uses internal optimization of logical expressions
  1363. (for example, in the expression 'rule1 & rule2' rule2 is not evaluated
  1364. if rule1 is false) and an internal regexp cache (so if rule1 and rule2 have common
  1365. items they are evaluated only once). If you need to optimize the speed of
  1366. your rules you should take this into consideration.
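The effect of short-circuit evaluation combined with a regexp cache can be
shown with a small sketch. The class @code{RegexpCache} is a hypothetical
stand-in written for this example; it relies on Python's own @code{and} and
@code{or} operators for short-circuiting and does not reproduce rspamd's
expression parser.

```python
import re


class RegexpCache:
    # Remembers the result of each pattern for the current message.
    def __init__(self, message):
        self.message = message
        self.results = {}
        self.evaluations = 0  # how many patterns were really matched

    def match(self, pattern):
        if pattern not in self.results:
            self.evaluations += 1
            self.results[pattern] = re.search(pattern, self.message) is not None
        return self.results[pattern]


cache = RegexpCache("cheap pills here")
# 'rule1 & rule2': the second item is skipped because the first is false.
rule_and = cache.match(r"viagra") and cache.match(r"pills")
# 'rule2 | rule1': the first item is true, so the second is not consulted.
rule_or = cache.match(r"pills") or cache.match(r"viagra")
```

After both expressions only two patterns have actually been matched against
the message; any later use of either pattern is answered from the cache.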
  1367. @subsection Regular expressions.
  1368. Rspamd uses perl compatible regular expressions. You may read about perl regular
  1369. expression syntax here: @url{http://perldoc.perl.org/perlre.html}. In rspamd
  1370. regular expressions must be enclosed in slashes:
  1371. @example
  1372. /^\\d+$/
  1373. @end example
  1374. @noindent
  1375. If the '/' symbol has to appear inside a regular expression it must be escaped:
  1376. @example
  1377. /^\\/\\w+$/
  1378. @end example
  1379. @noindent
  1380. After the last slash it is possible to place regular expression modifiers:
  1381. @multitable @columnfractions 0.1 0.9
  1382. @headitem Modifier @tab Meaning
  1383. @item @strong{i} @tab Ignore case for this expression.
  1384. @item @strong{m} @tab Treat this expression as multiline.
  1385. @item @strong{s} @tab Let @emph{.} match all characters including newline
  1386. characters (should be used with the @strong{m} flag).
  1387. @item @strong{x} @tab Treat this expression as an extended regexp.
  1388. @item @strong{u} @tab Perform ungreedy matches.
  1389. @item @strong{o} @tab Optimize the regular expression.
  1390. @item @strong{r} @tab Treat this expression as @emph{raw} (this is relevant for the
  1391. utf8 mode of rspamd).
  1392. @item @strong{H} @tab Search expression in message's headers.
  1393. @item @strong{X} @tab Search expression in raw message's headers (without mime
  1394. decoding).
  1395. @item @strong{M} @tab Search expression in the whole message (must be used
  1396. carefully as @strong{the whole message} would be checked with this expression).
  1397. @item @strong{P} @tab Search expression in all text parts.
  1398. @item @strong{U} @tab Search expression in all urls.
  1399. @end multitable
  1400. You can combine flags with each other:
  1401. @example
  1402. /^some text$/iP
  1403. @end example
  1404. @noindent
  1405. Every regexp must have a type: H, X, M, P or U, as rspamd needs to know where to
  1406. search for the specified pattern. Header regexps (H and X) have a special syntax if
  1407. you need to check a specific header, for example the @emph{From} header:
  1408. @example
  1409. From=/^evil.*$/Hi
  1410. @end example
  1411. @noindent
  1412. If the header name is not specified, all headers are matched. Raw header
  1413. matching is useful for searching for mime specific headers like MIME-Version.
  1414. The problem is that gmime, which is used for mime parsing, adds some headers
  1415. implicitly, for example @emph{MIME-Version}, so you should match them using raw
  1416. headers. Also, if a header's value is encoded (base64 or quoted-printable encoding),
  1417. you can search the decoded version using the H modifier and the raw one using the X
  1418. modifier. This is useful for finding bad encoding types or unnecessary
  1419. encoding.
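The item syntax described above (optional header name, pattern in slashes,
trailing flags with a mandatory type) can be taken apart with a short sketch.
The function @code{parse_item} and its regular expression are assumptions made
for illustration; they are not the parser that rspamd uses.

```python
import re

# Optional "Header=" prefix, then /pattern/, then the flag letters.
ITEM = re.compile(r"^(?:(?P<header>[\w-]+)=)?/(?P<pattern>.*)/(?P<flags>[A-Za-z]*)$")
TYPE_FLAGS = set("HXMPU")  # the type flags listed in the text


def parse_item(text):
    found = ITEM.match(text)
    if found is None:
        raise ValueError("not a regexp item: " + text)
    flags = found.group("flags")
    if not any(flag in TYPE_FLAGS for flag in flags):
        raise ValueError("a type flag (H, X, M, P or U) is required")
    return found.group("header"), found.group("pattern"), flags
```

For the header example above this sketch returns the header name From, the
pattern between the slashes and the flag string Hi.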
  1420. @subsection Internal function.
  1421. Rspamd provides several internal functions that simplify message processing.
  1422. You can use internal functions as items in logical expressions since, like
  1423. regular expressions, they return a logical value (true or false). Here is the list of
  1424. internal functions with their arguments:
  1425. @multitable @columnfractions 0.3 0.2 0.5
  1426. @headitem Function @tab Arguments @tab Description
  1427. @item header_exists
  1428. @tab header name
  1429. @tab Returns true if specified header exists.
  1430. @item compare_parts_distance
  1431. @tab number
  1432. @tab If the message has two parts (text/plain and text/html), compares how much they
  1433. differ (html parts are compared with tags stripped). The difference is a
  1434. number in percent (0 means identical parts and 100 means totally different parts).
  1435. If the difference is greater than the given number this function returns true.
  1436. @item compare_transfer_encoding
  1437. @tab string
  1438. @tab Compares header Content-Transfer-Encoding with specified string.
  1439. @item content_type_compare_param
  1440. @tab param_name, param_value
  1441. @tab Compares specified parameter of Content-Type header with regexp or certain
  1442. string:
  1443. @example
  1444. content_type_compare_param(Charset, /windows-\d+/)
  1445. content_type_compare_param(Charset, ascii)
  1446. @end example
  1447. @noindent
  1448. @item content_type_has_param
  1449. @tab param_name
  1450. @tab Returns true if content-type has specified parameter.
  1451. @item content_type_is_subtype
  1452. @tab subtype_name
  1453. @tab Returns true if content-type is of the specified subtype (for example for
  1454. text/plain the subtype is 'plain').
  1455. @item content_type_is_type
  1456. @tab type_name
  1457. @tab Returns true if content-type is of the specified type (for example for
  1458. text/plain the type is 'text'):
  1459. @example
  1460. content_type_is_type(text)
  1461. content_type_is_subtype(/?.html/)
  1462. @end example
  1463. @noindent
  1464. @item regexp_match_number
  1465. @tab number,[regexps list]
  1466. @tab Returns true if the specified number of regexps match this message. This
  1467. can be used for rules where you do not know which regexps should match, but
  1468. the symbol should be inserted if 2 of them match. For example:
  1469. @example
  1470. regexp_match_number(2, /^some evil text.*$/Pi, From=/^hacker.*$/H, header_exists(Subject))
  1471. @end example
  1472. @noindent
  1473. @item has_only_html_part
  1474. @tab nothing
  1475. @tab Returns true when the message has only an HTML part.
  1476. @item compare_recipients_distance
  1477. @tab number
  1478. @tab Like compare_parts_distance, calculates the difference between recipients. The number
  1479. is used as the minimum percentage of difference. Note that this function checks the
  1480. distance only when there are more than 5 recipients in the message.
  1481. @item is_recipients_sorted
  1482. @tab nothing
  1483. @tab Returns true if the recipients list is sorted. This function also works only
  1484. for more than 5 recipients.
  1485. @item is_html_balanced
  1486. @tab nothing
  1487. @tab Returns true when all HTML tags in message are balanced.
  1488. @item has_html_tag
  1489. @tab tag_name
  1490. @tab Returns true if tag 'tag_name' exists in message.
  1491. @item check_smtp_data
  1492. @tab item, regexp
  1493. @tab Returns true if specified part of smtp dialog matches specified regexp. Can
  1494. check HELO, FROM and RCPT items.
  1495. @end multitable
  1496. These internal functions could easily be implemented in lua, but I've decided to
  1497. make them built-in as they are widely used in our rules. This list may
  1498. be extended in the future.
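As an illustration of how such a function behaves, here is a sketch of the
counting logic behind regexp_match_number. It is written in Python for
brevity and rests on assumptions: each check is a callable that takes the
message and returns true or false, and "the specified number" is read as "at
least that many".

```python
def regexp_match_number(threshold, checks, message):
    # checks: callables taking the message and returning True or False.
    matched = 0
    for check in checks:
        if check(message):
            matched += 1
            if matched >= threshold:
                return True  # enough items matched, stop early
    return False
```

Stopping as soon as the threshold is reached means the remaining checks are
not evaluated at all.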
  1499. @subsection Dynamic rules.
  1500. The rspamd regexp module can use dynamic rules written in json syntax.
  1501. Dynamic rules are loaded at runtime and can be modified while rspamd is running.
  1502. It is also possible to enable dynamic rules for specific networks only and to add rules
  1503. that do not contain any regexp (this can be useful for dynamic lists, for example).
  1504. Dynamic rules can be obtained like any other dynamic map, via file monitoring or via
  1505. http. Here are examples of dynamic rule definitions:
  1506. @example
  1507. <module name="regexp">
  1508. <option name="dynamic_rules">file:///tmp/rules.json</option>
  1509. </module>
  1510. @end example
  1511. @noindent
  1512. or for http map:
  1513. @example
  1514. <module name="regexp">
  1515. <option name="dynamic_rules">http://somehost/rules.json</option>
  1516. </module>
  1517. @end example
  1518. @noindent
  1519. Rules are presented as a json array (in brackets @emph{'[]'}). Each rule is a json object.
  1520. This object can have several properties (properties marked with @strong{*} are required):
  1521. @multitable @columnfractions 0.3 0.7
  1522. @headitem Property @tab Meaning
  1523. @item symbol(*)
  1524. @tab Symbol for rule.
  1525. @item factor(*)
  1526. @tab Factor for rule.
  1527. @item rule
  1528. @tab Rule itself (regexp expression).
  1529. @item enabled
  1530. @tab Boolean flag that defines whether this rule is enabled (by default, the rule is
  1531. enabled if this flag is not present).
  1532. @item networks
  1533. @tab Json array of networks in CIDR format; an item can be negated
  1534. by prepending the @emph{!} symbol to it.
  1535. @end multitable
  1536. Here is an example of dynamic rule:
  1537. @example
  1538. [
  1539.   @{
  1540.     "rule": "/test/rP",
  1541.     "symbol": "R_TMP_1",
  1542.     "factor": 1.1,
  1543.     "networks": ["!192.168.1.0/24", "172.16.0.0/16"],
  1544.     "enabled": false
  1545.   @}
  1546. ]
  1547. @end example
  1548. Note that dynamic rules are constantly monitored for changes and are reloaded
  1549. completely when a modification is detected. If you change dynamic rules they
  1550. are reloaded within a minute and applied to new messages.
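How the enabled and networks properties might be interpreted is sketched
below. The semantics are an assumption, not taken from rspamd: a disabled
rule never applies, an address inside a negated network is rejected, and
otherwise the address has to fall into one of the listed networks (or any
address is accepted when only negated entries are given). The function name
@code{rule_applies} is hypothetical.

```python
import ipaddress


def rule_applies(rule, client_ip):
    # A rule without the "enabled" property counts as enabled.
    if not rule.get("enabled", True):
        return False
    networks = rule.get("networks")
    if not networks:
        return True
    address = ipaddress.ip_address(client_ip)
    has_positive = False
    allowed = False
    for entry in networks:
        negated = entry.startswith("!")
        network = ipaddress.ip_network(entry.lstrip("!"))
        if negated:
            if address in network:
                return False  # explicitly excluded network
        else:
            has_positive = True
            if address in network:
                allowed = True
    return allowed or not has_positive
```

The rule object is what a json parser such as Python's @code{json.loads}
returns for one element of the rules array.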
  1551. @subsection Conclusion.
  1552. The rspamd regexp module is a powerful tool for matching different patterns in
  1553. messages. You may use logical expressions of regexps and internal rspamd
  1554. functions to make rules. Rspamd is shipped with many rules for the regexp module
  1555. (most of them are taken from spamassassin rules, as rspamd was originally a
  1556. replacement for spamassassin), so you can look at them in the ETCDIR/rspamd/lua/regexp
  1557. directory. There are many built-in rules with detailed comments. Also note that
  1558. if you add a logical rule to the XML file you need to escape all XML entities (like
  1559. @emph{&} operators). When you build complex rules from many parts do not forget
  1560. to add brackets around the parts inside an expression, as otherwise you cannot predict the order of
  1561. checks. The rspamd regexp module has internal logical optimization and a
  1562. regexp cache, so you may use identical regexps many times: they are matched
  1563. only once. In a logical expression you may optimize performance by putting the
  1564. likely TRUE regexp first in an @emph{OR} expression and the likely FALSE expression
  1565. first in an @emph{AND} expression. A number of internal functions can simplify
  1566. complex expressions and help build common filters. Lua functions can be added to
  1567. rules as well (they should return a boolean value).
  1568. @section SURBL module.
  1569. The surbl module is designed for checking urls against blacklists. You may read about
  1570. surbls at @url{http://www.surbl.org}. Here is the sequence of operations
  1571. performed by the surbl module:
  1572. @enumerate 1
  1573. @item Extract all urls in the message and get the domain for each url.
  1574. @item Check each domain against a special list called '2tld': extract 3 components for domains
  1575. from that list and 2 components for domains that are not listed:
  1576. @example
  1577. http://virtual.somehost.domain.com/some_path
  1578. -> somehost.domain.com if domain.com is in 2tld list
  1579. -> domain.com if not in 2tld
  1580. @end example
  1581. @noindent
  1582. @item Remove duplicates from the domain list.
  1583. @item For each registered surbl make a dns request of the form @emph{domain.surbl_name}.
  1584. @item Get the result and insert the symbol if that name resolves.
  1585. @item It is possible to examine the bits in the returned IP address and insert a different
  1586. symbol for each bit that is set in the result.
  1587. @end enumerate
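The domain extraction and query construction from the steps above can be
sketched as follows. The function names are hypothetical, the 2tld list is
passed in as a plain set, and no DNS request is actually made; the sketch
only builds the names that would be looked up.

```python
from urllib.parse import urlparse


def surbl_domain(url, two_tld):
    host = urlparse(url).hostname or ""
    labels = host.split(".")
    # Keep 3 components when the last two are in the 2tld list, else 2.
    keep = 3 if ".".join(labels[-2:]) in two_tld else 2
    return ".".join(labels[-keep:])


def surbl_queries(urls, two_tld, suffix):
    # Duplicates are removed before building the DNS names to look up.
    domains = sorted(set(surbl_domain(url, two_tld) for url in urls))
    return [domain + "." + suffix for domain in domains]
```

With domain.com in the 2tld set, the url from the example above is reduced
to somehost.domain.com; without it the result is domain.com.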
  1588. All DNS requests are made asynchronously, so you need not worry about blocking.
  1589. The SURBL module has several configuration options:
  1590. @itemize @bullet
  1591. @item @emph{metric} - metric to insert symbol to.
  1592. @item @emph{2tld} - list of domains for which 3 components of the domain name
  1593. are extracted.
  1594. @item @emph{max_urls} - maximum number of urls to check.
  1595. @item @emph{whitelist} - map of domains for which surbl checks would not be performed.
  1596. @item @emph{suffix} - a name of surbl. It is possible to add several suffixes:
  1597. @example
  1598. suffix_RAMBLER_URIBL = insecure-bl.rambler.ru
  1599. or in xml:
  1600. <param name="suffix_RAMBLER_URIBL">insecure-bl.rambler.ru</param>
  1601. @end example
  1602. @noindent
  1603. It is possible to add %b to symbol name for checking specific bits:
  1604. @example
  1605. suffix_%b_SURBL_MULTI = multi.surbl.org
  1606. then you may define replaces for %b in symbol name for each bit in result:
  1607. bit_2 = SC -> sc.surbl.org
  1608. bit_4 = WS -> ws.surbl.org
  1609. bit_8 = PH -> ph.surbl.org
  1610. bit_16 = OB -> ob.surbl.org
  1611. bit_32 = AB -> ab.surbl.org
  1612. bit_64 = JP -> jp.surbl.org
  1613. @end example
  1614. @noindent
  1615. So we make one DNS request and check for specific lists by checking bits in the
  1616. resulting ip. This is described on the surbl page:
  1617. @url{http://www.surbl.org/lists.html#multi}. Note that the resulting symbol does NOT
  1618. contain %b, as it is replaced by the bit name. Also, if several bits are set,
  1619. several corresponding symbols are added.
  1620. @end itemize
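The %b substitution can be sketched in a few lines. The bit table repeats
the values from the example above; reading the bits from the last octet of
the returned address is an assumption of this sketch, and the function name
@code{surbl_symbols} is hypothetical.

```python
import ipaddress

# Bit value -> replacement for %b, as in the configuration example above.
BITS = {2: "SC", 4: "WS", 8: "PH", 16: "OB", 32: "AB", 64: "JP"}


def surbl_symbols(template, result_ip, bits=BITS):
    # Only the last octet of the returned address is examined (assumption).
    last_octet = int(ipaddress.ip_address(result_ip)) & 0xFF
    return [
        template.replace("%b", name)
        for bit, name in sorted(bits.items())
        if last_octet & bit
    ]
```

For the template %b_SURBL_MULTI and the result 127.0.0.10 (bits 2 and 8 set)
this gives the two symbols SC_SURBL_MULTI and PH_SURBL_MULTI.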
  1621. The surbl module can also use a redirector, a special daemon that checks for
  1622. redirects. It uses HTTP/1.0 for requests, accepts a url and returns the resolved
  1623. result. The redirector is shipped with rspamd but is not enabled by default. You may
  1624. enable it at the configure stage, but note that it requires many perl modules
  1625. to work. The rspamd redirector is described in detail later. Here are the surbl
  1626. options for working with the redirector:
  1627. @itemize @bullet
  1628. @item @emph{redirector}: address of the redirector (in the format host:port)
  1629. @item @emph{redirector_connect_timeout} (seconds): redirector connect timeout (default: 1s)
  1630. @item @emph{redirector_read_timeout} (seconds): timeout for reading data (default: 5s)
  1631. @item @emph{redirector_hosts_map} (map string): map that contains domains to check with redirector
  1632. @end itemize
  1633. So the surbl module is an easy way to check a message's urls, and it may be used
  1634. in every configuration as it filters a rather large amount of email spam and scams.
  1635. @section SPF module.
  1636. The SPF module is designed to check the spf records of the sender's domain. SPF
  1637. records are placed in TXT DNS records of domains that have spf enabled. You may
  1638. read about SPF at @url{http://en.wikipedia.org/wiki/Sender_Policy_Framework}.
  1639. There are 3 possible results of an spf check for a domain:
  1640. @itemize @bullet
  1641. @item ALLOW - this ip is allowed to send messages for this domain
  1642. @item FAIL - this ip is @strong{not} allowed to send messages for this domain
  1643. @item SOFTFAIL - it is unknown whether this ip is allowed to send mail for this
  1644. domain
  1645. @end itemize
  1646. SPF supports different mechanisms for checking: dns subrequests, macros,
  1647. includes, blacklists. Rspamd supports most of them. For security
  1648. reasons there are also internal limits on DNS subrequests and on the recursion depth of inclusions.
  1649. The SPF module supports only a small number of options:
  1650. @itemize @bullet
  1651. @item @emph{metric} (string): metric to insert symbol (default: 'default')
  1652. @item @emph{symbol_allow} (string): symbol to insert (default: 'R_SPF_ALLOW')
  1653. @item @emph{symbol_fail} (string): symbol to insert (default: 'R_SPF_FAIL')
  1654. @item @emph{symbol_softfail} (string): symbol to insert (default: 'R_SPF_SOFTFAIL')
  1655. @end itemize
  1656. @section Chartable module.
  1657. Chartable is a simple module that detects different charsets in a message. This
  1658. module is intended to protect against emails that contain symbols from different
  1659. character sets that look like each other. The chartable module works differently
  1660. in raw and utf modes: in utf mode it detects characters from different unicode
  1661. tables, and in raw mode only ASCII and non-ASCII symbols. Configuration of this
  1662. module is very simple:
  1663. @itemize @bullet
  1664. @item @emph{metric} (string): metric to insert symbol (default: 'default')
  1665. @item @emph{symbol} (string): symbol to insert (default: 'R_BAD_CHARSET')
  1666. @item @emph{threshold} (double): value that is used as the threshold in the expression
  1667. @math{N_{charset-changes} / N_{chars}}
  1668. (e.g. if the threshold is 0.1 then a charset change should occur more often than once in 10 symbols),
  1669. default: 0.1
  1670. @end itemize
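The threshold expression can be illustrated for raw mode, where a character
is either ASCII or non-ASCII. This sketch rests on assumptions: only
alphabetic characters are counted, a change is counted between neighbouring
counted characters, and the function names are hypothetical.

```python
def charset_change_ratio(text):
    # Raw-mode view: each letter is either ASCII or non-ASCII.
    chars = [ch for ch in text if ch.isalpha()]
    if not chars:
        return 0.0
    changes = sum(
        1
        for prev, cur in zip(chars, chars[1:])
        if (ord(prev) < 128) != (ord(cur) < 128)
    )
    return changes / len(chars)


def is_bad_charset(text, threshold=0.1):
    # True when charset changes occur more often than the threshold allows.
    return charset_change_ratio(text) > threshold
```

A word that mixes Latin letters with look-alike Cyrillic ones changes charset
several times within a few characters and so exceeds the default threshold,
while plain ASCII text has a ratio of 0.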
  1671. @section Fuzzy check module.
  1672. The fuzzy check module provides a client for rspamd fuzzy storage. Fuzzy check can
  1673. work with a cluster of rspamd fuzzy storages, and the specific storage is
  1674. selected by the value of a hash of the message's hash. The available configuration options
  1675. are:
  1676. @itemize @bullet
  1677. @item @emph{metric} (string): metric to insert symbol (default: 'default')
  1678. @item @emph{symbol} (string): symbol to insert (default: 'R_FUZZY')
  1679. @item @emph{max_score} (double): maximum score to which the weights of hashes are
  1680. normalized (default: 0 - no normalization)
  1681. @item @emph{fuzzy_map} (string): a string that contains a map in the format @{ fuzzy_key => [
  1682. symbol, weight ] @} where fuzzy_key is the number of a fuzzy list. The string itself
  1683. should be in the format 1:R_FUZZY_SAMPLE1:10,2:R_FUZZY_SAMPLE2:1 etc., where the first
  1684. number is the fuzzy key, the second is the symbol to insert and the third is the weight for
  1685. normalization
  1686. @item @emph{min_length} (integer): minimum length (in characters) for text part to be
  1687. checked for fuzzy hash (default: 0 - no limit)
  1688. @item @emph{whitelist} (map string): map of ip addresses that should not be checked
  1689. with this module
  1690. @item @emph{servers} (string): list of fuzzy servers in format
  1691. "server1:port,server2:port" - these servers would be used for checking and
  1692. storing fuzzy hashes
  1693. @end itemize
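@noindent
As a hedged illustration of the options above, the following sketch assumes that
the fuzzy check module is configured with the same @emph{module} and
@emph{option} xml layout as the multimap example later in this chapter. The
module name @emph{fuzzy_check}, the host names and the port are placeholders,
not values confirmed by this manual:
@example
<!-- assumption: module name and xml layout mirror the multimap example -->
<module name="fuzzy_check">
<option name="symbol">R_FUZZY</option>
<!-- hypothetical storages; replace with real host:port pairs -->
<option name="servers">fuzzy1.example.com:11335,fuzzy2.example.com:11335</option>
<!-- list 1 inserts R_FUZZY_SAMPLE1 with weight 10, list 2 inserts R_FUZZY_SAMPLE2 with weight 1 -->
<option name="fuzzy_map">1:R_FUZZY_SAMPLE1:10,2:R_FUZZY_SAMPLE2:1</option>
<!-- 0 means that text parts of any length are checked -->
<option name="min_length">0</option>
</module>
@end example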
@section Forged recipients.
Forged recipients is a lua module that compares the recipients given in the smtp
dialog with the recipients from the @emph{To:} header. It can also compare the
@emph{From:} header with the SMTP from. Set the @strong{symbol_rcpt} option
to define the symbol that is inserted when the recipients differ, and
@strong{symbol_sender} to define the symbol that is inserted when the senders differ.
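@noindent
A possible configuration is sketched below. The module name
@emph{forged_recipients}, both symbol names and the use of the same
@emph{option} xml layout as the multimap example later in this chapter are
assumptions made for illustration:
@example
<!-- assumption: module name, symbol names and xml layout are illustrative only -->
<module name="forged_recipients">
<!-- inserted when smtp recipients and the To: header differ -->
<option name="symbol_rcpt">FORGED_RECIPIENTS</option>
<!-- inserted when the smtp from and the From: header differ -->
<option name="symbol_sender">FORGED_SENDER</option>
</module>
@end example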
@section Maillist.
Maillist is a module that detects whether a message was sent using one of the
popular mailing list systems (supported systems include ezmlm, mailman and
subscribe.ru). The module has a single option, @strong{symbol}, that defines the
symbol to insert if the message was sent via a mailing list.
@section Once received.
This lua module checks the received headers of a message and inserts a symbol if
only one received header is present (which usually signals that the mail was
sent directly to our MTA). It is also possible to insert a @emph{strict} symbol
indicating that the host from which we received the message is either
unresolvable or matches bad patterns (like 'dynamic', 'broadband' etc.) that are
typical of widely used botnets. Configuration options are:
@itemize @bullet
@item @emph{symbol}: symbol to insert for messages with one received header.
@item @emph{symbol_strict}: symbol to insert for messages with one received
header that contains bad patterns or an unresolvable sender.
@item @emph{bad_host}: defines a pattern that is counted as "bad".
@item @emph{good_host}: defines a pattern that is counted as "good" (no strict
symbol is inserted); note that a "good" pattern has priority over a "bad" one.
@end itemize
You can define several "good" and "bad" patterns for this module.
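@noindent
The options above could be combined as in the following sketch. It assumes that
the module is named @emph{once_received}, that it uses the same @emph{option}
xml layout as the multimap example later in this chapter, and that repeating an
option is how several patterns are defined. The symbol names and the good host
pattern are hypothetical, while 'dynamic' and 'broadband' are the bad patterns
mentioned above:
@example
<!-- assumption: module name, xml layout and repeated options are not confirmed by this manual -->
<module name="once_received">
<option name="symbol">ONCE_RECEIVED</option>
<option name="symbol_strict">ONCE_RECEIVED_STRICT</option>
<option name="bad_host">dynamic</option>
<option name="bad_host">broadband</option>
<!-- hypothetical trusted relay; a "good" match suppresses the strict symbol -->
<option name="good_host">mail.example.com</option>
</module>
@end example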
@section Received rbl.
The received rbl module checks all received headers and makes dns requests to IP
black lists. This can be used to check whether an email was transferred by
a blacklisted gateway. The available options are:
@itemize @bullet
@item @emph{symbol}: symbol to insert if the message contains blacklisted received
headers
@item @emph{rbl}: the name of an rbl to check; it is possible to define a specific
symbol for this rbl by adding the symbol name after a colon:
@example
rbl = pbl.spamhaus.org:RECEIVED_PBL
@end example
@end itemize
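@noindent
Put together, a received rbl section might look like the sketch below. The
module name @emph{received_rbl}, the default symbol name, the second list and
the use of one @emph{rbl} option per list are assumptions; only the
pbl.spamhaus.org line is taken from the example above:
@example
<!-- assumption: module name, xml layout and repeated rbl options are illustrative only -->
<module name="received_rbl">
<!-- hypothetical default symbol for lists without their own symbol -->
<option name="symbol">RECEIVED_RBL</option>
<!-- this list gets its own symbol, given after the colon -->
<option name="rbl">pbl.spamhaus.org:RECEIVED_PBL</option>
<!-- this list falls back to the default symbol -->
<option name="rbl">xbl.spamhaus.org</option>
</module>
@end example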
@section Multimap.
Multimap is a lua module that provides functionality for working with different
types of lists. It can work with maps of strings: it extracts values from MIME
headers and matches them against lists. It is also possible to create ip (or ip
network) maps for checking the ip address from which we received the mail. DNS
black lists are also supported.
The multimap module works with a set of rules. Each rule can be one of three types:
@enumerate 1
@item @emph{ip}: used for lists of ip addresses
@item @emph{header}: used to match headers
@item @emph{dnsbl}: used for dns lists of ip addresses
@end enumerate
Basically, each rule is a line of comma-separated rule parameters.
Each parameter has a name and a value, separated by an equals sign. Here is the
list of parameters (mandatory parameters are marked with @strong{*}):
@itemize @bullet
@item @strong{*} @emph{type}: rule type
@item @strong{*} @emph{map}: path to the map (in uri format, file:// or http://) or
the name of a dns blacklist
@item @strong{*} @emph{symbol}: symbol to insert
@item @emph{header}: header to use for header rules
@item @emph{pattern}: pattern that is used to extract a specific part of the
header
@end itemize
@noindent
Here is an example of multimap rules:
@example
<module name="multimap">
<option name="rule">type = header, header = To, pattern = @@(.+)>?$, map = file:///var/db/rspamd/rcpt_test, symbol = R_RCPT_WHITELIST</option>
<option name="rule">type = ip, map = file:///var/db/rspamd/ip_test, symbol = R_IP_WHITELIST</option>
<option name="rule">type = dnsbl, map = pbl.spamhaus.org, symbol = R_IP_PBL</option>
</module>
@end example
@section Conclusion.
Rspamd is shipped with a number of modules that provide basic functionality
for checking emails. You can add custom rules for the regexp module and
set the available parameters for other modules. You may also write your own
modules (in C or Lua); this is described later in this documentation.
You may set configuration options for modules from lua or from xml, depending on
their complexity. Internal modules are enabled and disabled by the @strong{filters}
configuration option. Lua modules are loaded and can usually be disabled by
removing their configuration section from the xml file or by removing the
corresponding line from the @strong{modules} section.
@bye