path: root/src/libstat/tokenizers
Commit message | Author | Age | Files | Lines
* [Project] Reduce default window size of OSB tokenizer to 2 | Vsevolod Stakhov | 2024-06-11 | 1 | -1/+1
* [Rework] Further types conversion (no functional changes) | Vsevolod Stakhov | 2024-03-18 | 3 | -62/+62
* [Rework] Remove some of the GLib types in lieu of standard ones | Vsevolod Stakhov | 2024-03-18 | 3 | -25/+25
    This types have constant conflicts with the system ones especially on OSX.
* [Fix] Make stat tokens allocation consistent | Vsevolod Stakhov | 2024-02-13 | 1 | -5/+4
* [Fix] Fix format string and some length issues | Vsevolod Stakhov | 2023-09-26 | 2 | -12/+27
* [Rework] Use clang-format to unify formatting in all sources | Vsevolod Stakhov | 2023-07-26 | 3 | -377/+380
    No meaningful changes.
* [Minor] Get rid of some compiler warnings | Vsevolod Stakhov | 2022-11-04 | 1 | -1/+1
* [Rework] Html: Further rework of the tags content extraction | Vsevolod Stakhov | 2021-06-22 | 1 | -8/+0
* [Fix] Fix tokenization near exceptions | Vsevolod Stakhov | 2021-06-17 | 1 | -2/+2
* [Project] Add process exceptions for invisible text | Vsevolod Stakhov | 2021-06-16 | 1 | -0/+8
* [Minor] Reduce timer calls when doing tokenisation | Vsevolod Stakhov | 2021-06-07 | 1 | -1/+4
* [Feature] Add multiple base32 alphabets for decoding | Vsevolod Stakhov | 2020-04-09 | 1 | -1/+1
* [Rework] Rework URL structure: adjust tld part | Vsevolod Stakhov | 2020-03-09 | 1 | -1/+1
* [Minor] Oops, check for UBRK_DONE first | Vsevolod Stakhov | 2019-10-25 | 1 | -3/+3
* [Minor] Add safety check when using icu ubrk iterators | Vsevolod Stakhov | 2019-10-24 | 2 | -7/+42
* [Minor] Fix array size | Vsevolod Stakhov | 2019-09-26 | 1 | -2/+2
* [Fix] Fix normalization of non-alphabet based languages | Vsevolod Stakhov | 2019-08-27 | 1 | -6/+2
* [Minor] Some more alignment fixes | Vsevolod Stakhov | 2019-08-12 | 1 | -4/+0
* [Minor] Add long texts sanity checks | Vsevolod Stakhov | 2019-07-25 | 1 | -1/+54
* [Project] Adopt libstat code | Vsevolod Stakhov | 2019-07-12 | 1 | -6/+9
* [Rework] Add C++ guards to all headers | Vsevolod Stakhov | 2019-07-08 | 1 | -16/+27
* [Fix] Fix DoS caused by bug in glib | Vsevolod Stakhov | 2019-05-08 | 1 | -0/+8
* [Minor] Fix some more suspicious cases | Vsevolod Stakhov | 2019-04-07 | 1 | -2/+2
* [Feature] Try to filter bad unicode types during normalisation | Vsevolod Stakhov | 2019-02-25 | 1 | -1/+19
* [Minor] Slightly extend what we can treat as words | Vsevolod Stakhov | 2018-11-30 | 1 | -1/+1
* [Fix] Some fixes for raw parts | Vsevolod Stakhov | 2018-11-27 | 1 | -0/+1
* [Minor] Fix for DSN | Vsevolod Stakhov | 2018-11-27 | 1 | -1/+1
* [Feature] Ignore bogus whitespaces in the words | Vsevolod Stakhov | 2018-11-26 | 1 | -1/+8
    Issue: #2649
* [Project] Use URLs TLDs instead of !!EX!! in stat tokens | Vsevolod Stakhov | 2018-11-26 | 1 | -16/+39
* [Project] Use more generalised API to produce meta words | Vsevolod Stakhov | 2018-11-26 | 2 | -50/+82
* [Minor] Check language detector pointer before use | Vsevolod Stakhov | 2018-11-26 | 1 | -2/+2
* [Project] Finish basic tasks in new unicode project | Vsevolod Stakhov | 2018-11-25 | 1 | -10/+20
* [Project] Rework parts conversion and serialization | Vsevolod Stakhov | 2018-11-25 | 1 | -8/+5
* [Project] Another try to normalize unicode properly | Vsevolod Stakhov | 2018-11-25 | 2 | -109/+137
* [Project] Various unicode fixes in language detector | Vsevolod Stakhov | 2018-11-25 | 1 | -3/+2
* [Project] Rework stemming | Vsevolod Stakhov | 2018-11-24 | 3 | -9/+105
* [Project] Add function to normalize unicode on per words basis | Vsevolod Stakhov | 2018-11-24 | 2 | -1/+137
* [Project] Start words unicode structure rework | Vsevolod Stakhov | 2018-11-24 | 1 | -48/+52
* [Feature] Skip stop words in statistics | Vsevolod Stakhov | 2018-11-15 | 2 | -19/+31
* [Fix] Rework bayes calculations... | Vsevolod Stakhov | 2018-11-14 | 1 | -1/+1
* [Minor] Move subject tokenisation to a separate routine | Vsevolod Stakhov | 2018-11-08 | 2 | -3/+69
    Issue: #2623
* [CritFix] Fix words decay one more time (affects long messages) | Vsevolod Stakhov | 2018-09-25 | 1 | -4/+8
* [Fix] Fix words decay algorithm | Vsevolod Stakhov | 2018-09-11 | 1 | -1/+1
* [Minor] Properly set flag on text tokens | Vsevolod Stakhov | 2018-09-07 | 1 | -3/+4
* [Minor] Further fixes in tokenization algorithm | Vsevolod Stakhov | 2018-09-07 | 1 | -20/+28
* [Feature] Implement new text tokenizer based on libicu | Vsevolod Stakhov | 2018-09-06 | 2 | -203/+218
* [Rework] Rework utf content processing in text parts | Vsevolod Stakhov | 2018-09-05 | 2 | -5/+5
    - Store unicode in UTF parts
    - Store unicode for HTML parts
    - Rename struct fields and split them into unicode/utf components
* [Project] Start unicode rework | Vsevolod Stakhov | 2018-08-23 | 2 | -20/+28
* [Minor] Fix out-of-boundary access | Vsevolod Stakhov | 2018-03-27 | 1 | -1/+1
* [Fix] Do not skip the last character | Vsevolod Stakhov | 2017-10-31 | 1 | -0/+1
    MFH: rspamd-1.6