index : rspamd.git
Rapid spam filtering system: https://github.com/rspamd/rspamd
path: root/src/libstat/tokenizers
Commit message | Author | Date | Files | Lines (-/+)
[Fix] Fix format string and some length issues | Vsevolod Stakhov | 2023-09-26 | 2 | -12/+27
[Rework] Use clang-format to unify formatting in all sources | Vsevolod Stakhov | 2023-07-26 | 3 | -377/+380
[Minor] Get rid of some compiler warnings | Vsevolod Stakhov | 2022-11-04 | 1 | -1/+1
[Rework] Html: Further rework of the tags content extraction | Vsevolod Stakhov | 2021-06-22 | 1 | -8/+0
[Fix] Fix tokenization near exceptions | Vsevolod Stakhov | 2021-06-17 | 1 | -2/+2
[Project] Add process exceptions for invisible text | Vsevolod Stakhov | 2021-06-16 | 1 | -0/+8
[Minor] Reduce timer calls when doing tokenisation | Vsevolod Stakhov | 2021-06-07 | 1 | -1/+4
[Feature] Add multiple base32 alphabets for decoding | Vsevolod Stakhov | 2020-04-09 | 1 | -1/+1
[Rework] Rework URL structure: adjust tld part | Vsevolod Stakhov | 2020-03-09 | 1 | -1/+1
[Minor] Oops, check for UBRK_DONE first | Vsevolod Stakhov | 2019-10-25 | 1 | -3/+3
[Minor] Add safety check when using icu ubrk iterators | Vsevolod Stakhov | 2019-10-24 | 2 | -7/+42
[Minor] Fix array size | Vsevolod Stakhov | 2019-09-26 | 1 | -2/+2
[Fix] Fix normalization of non-alphabet based languages | Vsevolod Stakhov | 2019-08-27 | 1 | -6/+2
[Minor] Some more alignment fixes | Vsevolod Stakhov | 2019-08-12 | 1 | -4/+0
[Minor] Add long texts sanity checks | Vsevolod Stakhov | 2019-07-25 | 1 | -1/+54
[Project] Adopt libstat code | Vsevolod Stakhov | 2019-07-12 | 1 | -6/+9
[Rework] Add C++ guards to all headers | Vsevolod Stakhov | 2019-07-08 | 1 | -16/+27
[Fix] Fix DoS caused by bug in glib | Vsevolod Stakhov | 2019-05-08 | 1 | -0/+8
[Minor] Fix some more suspicious cases | Vsevolod Stakhov | 2019-04-07 | 1 | -2/+2
[Feature] Try to filter bad unicode types during normalisation | Vsevolod Stakhov | 2019-02-25 | 1 | -1/+19
[Minor] Slightly extend what we can treat as words | Vsevolod Stakhov | 2018-11-30 | 1 | -1/+1
[Fix] Some fixes for raw parts | Vsevolod Stakhov | 2018-11-27 | 1 | -0/+1
[Minor] Fix for DSN | Vsevolod Stakhov | 2018-11-27 | 1 | -1/+1
[Feature] Ignore bogus whitespaces in the words | Vsevolod Stakhov | 2018-11-26 | 1 | -1/+8
[Project] Use URLs TLDs instead of !!EX!! in stat tokens | Vsevolod Stakhov | 2018-11-26 | 1 | -16/+39
[Project] Use more generalised API to produce meta words | Vsevolod Stakhov | 2018-11-26 | 2 | -50/+82
[Minor] Check language detector pointer before use | Vsevolod Stakhov | 2018-11-26 | 1 | -2/+2
[Project] Finish basic tasks in new unicode project | Vsevolod Stakhov | 2018-11-25 | 1 | -10/+20
[Project] Rework parts conversion and serialization | Vsevolod Stakhov | 2018-11-25 | 1 | -8/+5
[Project] Another try to normalize unicode properly | Vsevolod Stakhov | 2018-11-25 | 2 | -109/+137
[Project] Various unicode fixes in language detector | Vsevolod Stakhov | 2018-11-25 | 1 | -3/+2
[Project] Rework stemming | Vsevolod Stakhov | 2018-11-24 | 3 | -9/+105
[Project] Add function to normalize unicode on per words basis | Vsevolod Stakhov | 2018-11-24 | 2 | -1/+137
[Project] Start words unicode structure rework | Vsevolod Stakhov | 2018-11-24 | 1 | -48/+52
[Feature] Skip stop words in statistics | Vsevolod Stakhov | 2018-11-15 | 2 | -19/+31
[Fix] Rework bayes calculations... | Vsevolod Stakhov | 2018-11-14 | 1 | -1/+1
[Minor] Move subject tokenisation to a separate routine | Vsevolod Stakhov | 2018-11-08 | 2 | -3/+69
[CritFix] Fix words decay one more time (affects long messages) | Vsevolod Stakhov | 2018-09-25 | 1 | -4/+8
[Fix] Fix words decay algorithm | Vsevolod Stakhov | 2018-09-11 | 1 | -1/+1
[Minor] Properly set flag on text tokens | Vsevolod Stakhov | 2018-09-07 | 1 | -3/+4
[Minor] Further fixes in tokenization algorithm | Vsevolod Stakhov | 2018-09-07 | 1 | -20/+28
[Feature] Implement new text tokenizer based on libicu | Vsevolod Stakhov | 2018-09-06 | 2 | -203/+218
[Rework] Rework utf content processing in text parts | Vsevolod Stakhov | 2018-09-05 | 2 | -5/+5
[Project] Start unicode rework | Vsevolod Stakhov | 2018-08-23 | 2 | -20/+28
[Minor] Fix out-of-boundary access | Vsevolod Stakhov | 2018-03-27 | 1 | -1/+1
[Fix] Do not skip the last character | Vsevolod Stakhov | 2017-10-31 | 1 | -0/+1
[Fix] Do not try to dereference last character | Vsevolod Stakhov | 2017-10-31 | 1 | -1/+8
[Minor] Further g_slice cleanup | Vsevolod Stakhov | 2017-10-28 | 1 | -2/+2
[Fix] Further tokenization fixes | Vsevolod Stakhov | 2017-10-21 | 1 | -1/+1
[Fix] Deal with another case when processing exceptions | Vsevolod Stakhov | 2017-10-21 | 1 | -0/+8