index : rspamd.git
Rapid spam filtering system: https://github.com/rspamd/rspamd
path: root/src/libstat/tokenizers/tokenizers.c
| Commit message | Author | Age | Files | Lines |
|---|---|---|---|---|
| [Fix] Fix format string and some length issues | Vsevolod Stakhov | 2023-09-26 | 1 | -11/+10 |
| [Rework] Use clang-format to unify formatting in all sources | Vsevolod Stakhov | 2023-07-26 | 1 | -240/+236 |
| [Minor] Get rid of some compiler warnings | Vsevolod Stakhov | 2022-11-04 | 1 | -1/+1 |
| [Rework] Html: Further rework of the tags content extraction | Vsevolod Stakhov | 2021-06-22 | 1 | -8/+0 |
| [Fix] Fix tokenization near exceptions | Vsevolod Stakhov | 2021-06-17 | 1 | -2/+2 |
| [Project] Add process exceptions for invisible text | Vsevolod Stakhov | 2021-06-16 | 1 | -0/+8 |
| [Minor] Reduce timer calls when doing tokenisation | Vsevolod Stakhov | 2021-06-07 | 1 | -1/+4 |
| [Rework] Rework URL structure: adjust tld part | Vsevolod Stakhov | 2020-03-09 | 1 | -1/+1 |
| [Minor] Oops, check for UBRK_DONE first | Vsevolod Stakhov | 2019-10-25 | 1 | -3/+3 |
| [Minor] Add safety check when using icu ubrk iterators | Vsevolod Stakhov | 2019-10-24 | 1 | -6/+40 |
| [Minor] Fix array size | Vsevolod Stakhov | 2019-09-26 | 1 | -2/+2 |
| [Fix] Fix normalization of non-alphabet based languages | Vsevolod Stakhov | 2019-08-27 | 1 | -6/+2 |
| [Minor] Some more alignment fixes | Vsevolod Stakhov | 2019-08-12 | 1 | -4/+0 |
| [Minor] Add long texts sanity checks | Vsevolod Stakhov | 2019-07-25 | 1 | -1/+54 |
| [Project] Adopt libstat code | Vsevolod Stakhov | 2019-07-12 | 1 | -6/+9 |
| [Fix] Fix DoS caused by bug in glib | Vsevolod Stakhov | 2019-05-08 | 1 | -0/+8 |
| [Minor] Fix some more suspicious cases | Vsevolod Stakhov | 2019-04-07 | 1 | -2/+2 |
| [Feature] Try to filter bad unicode types during normalisation | Vsevolod Stakhov | 2019-02-25 | 1 | -1/+19 |
| [Minor] Slightly extend what we can treat as words | Vsevolod Stakhov | 2018-11-30 | 1 | -1/+1 |
| [Fix] Some fixes for raw parts | Vsevolod Stakhov | 2018-11-27 | 1 | -0/+1 |
| [Minor] Fix for DSN | Vsevolod Stakhov | 2018-11-27 | 1 | -1/+1 |
| [Feature] Ignore bogus whitespaces in the words | Vsevolod Stakhov | 2018-11-26 | 1 | -1/+8 |
| [Project] Use URLs TLDs instead of !!EX!! in stat tokens | Vsevolod Stakhov | 2018-11-26 | 1 | -16/+39 |
| [Project] Use more generalised API to produce meta words | Vsevolod Stakhov | 2018-11-26 | 1 | -48/+79 |
| [Minor] Check language detector pointer before use | Vsevolod Stakhov | 2018-11-26 | 1 | -2/+2 |
| [Project] Rework parts conversion and serialization | Vsevolod Stakhov | 2018-11-25 | 1 | -8/+5 |
| [Project] Another try to normalize unicode properly | Vsevolod Stakhov | 2018-11-25 | 1 | -109/+136 |
| [Project] Various unicode fixes in language detector | Vsevolod Stakhov | 2018-11-25 | 1 | -3/+2 |
| [Project] Rework stemming | Vsevolod Stakhov | 2018-11-24 | 1 | -2/+98 |
| [Project] Add function to normalize unicode on per words basis | Vsevolod Stakhov | 2018-11-24 | 1 | -1/+133 |
| [Project] Start words unicode structure rework | Vsevolod Stakhov | 2018-11-24 | 1 | -48/+52 |
| [Minor] Move subject tokenisation to a separate routine | Vsevolod Stakhov | 2018-11-08 | 1 | -3/+67 |
| [CritFix] Fix words decay one more time (affects long messages) | Vsevolod Stakhov | 2018-09-25 | 1 | -4/+8 |
| [Fix] Fix words decay algorithm | Vsevolod Stakhov | 2018-09-11 | 1 | -1/+1 |
| [Minor] Properly set flag on text tokens | Vsevolod Stakhov | 2018-09-07 | 1 | -3/+4 |
| [Minor] Further fixes in tokenization algorithm | Vsevolod Stakhov | 2018-09-07 | 1 | -20/+28 |
| [Feature] Implement new text tokenizer based on libicu | Vsevolod Stakhov | 2018-09-06 | 1 | -203/+215 |
| [Rework] Rework utf content processing in text parts | Vsevolod Stakhov | 2018-09-05 | 1 | -4/+4 |
| [Project] Start unicode rework | Vsevolod Stakhov | 2018-08-23 | 1 | -17/+17 |
| [Minor] Fix out-of-boundary access | Vsevolod Stakhov | 2018-03-27 | 1 | -1/+1 |
| [Fix] Do not skip the last character | Vsevolod Stakhov | 2017-10-31 | 1 | -0/+1 |
| [Fix] Do not try to dereference last character | Vsevolod Stakhov | 2017-10-31 | 1 | -1/+8 |
| [Fix] Further tokenization fixes | Vsevolod Stakhov | 2017-10-21 | 1 | -1/+1 |
| [Fix] Deal with another case when processing exceptions | Vsevolod Stakhov | 2017-10-21 | 1 | -0/+8 |
| [Fix] Do not strip last character in the last word | Vsevolod Stakhov | 2017-10-21 | 1 | -2/+2 |
| [Fix] Fix another tokenization issue | Vsevolod Stakhov | 2017-10-21 | 1 | -1/+31 |
| [CritFix] Another portion of tokenization fixes | Vsevolod Stakhov | 2017-10-18 | 1 | -16/+19 |
| [Minor] More strict boundaries checks and composites policies fix | Vsevolod Stakhov | 2017-04-09 | 1 | -0/+2 |
| [Rework] Set token data as uint64_t instead of chars array | Vsevolod Stakhov | 2017-04-04 | 1 | -12/+0 |
| [Feature] Store text tokens inside bayes tokens | Vsevolod Stakhov | 2017-03-31 | 1 | -0/+1 |