path: root/src/libstat/tokenizers
Commit message | Author | Age | Files | Lines
* [Project] Reduce default window size of OSB tokenizer to 2 | Vsevolod Stakhov | 2024-06-11 | 1 | -1/+1
* [Rework] Further types conversion (no functional changes) | Vsevolod Stakhov | 2024-03-18 | 3 | -62/+62
* [Rework] Remove some of the GLib types in lieu of standard ones | Vsevolod Stakhov | 2024-03-18 | 3 | -25/+25
* [Fix] Make stat tokens allocation consistent | Vsevolod Stakhov | 2024-02-13 | 1 | -5/+4
* [Fix] Fix format string and some length issues | Vsevolod Stakhov | 2023-09-26 | 2 | -12/+27
* [Rework] Use clang-format to unify formatting in all sources | Vsevolod Stakhov | 2023-07-26 | 3 | -377/+380
* [Minor] Get rid of some compiler warnings | Vsevolod Stakhov | 2022-11-04 | 1 | -1/+1
* [Rework] Html: Further rework of the tags content extraction | Vsevolod Stakhov | 2021-06-22 | 1 | -8/+0
* [Fix] Fix tokenization near exceptions | Vsevolod Stakhov | 2021-06-17 | 1 | -2/+2
* [Project] Add process exceptions for invisible text | Vsevolod Stakhov | 2021-06-16 | 1 | -0/+8
* [Minor] Reduce timer calls when doing tokenisation | Vsevolod Stakhov | 2021-06-07 | 1 | -1/+4
* [Feature] Add multiple base32 alphabets for decoding | Vsevolod Stakhov | 2020-04-09 | 1 | -1/+1
* [Rework] Rework URL structure: adjust tld part | Vsevolod Stakhov | 2020-03-09 | 1 | -1/+1
* [Minor] Oops, check for UBRK_DONE first | Vsevolod Stakhov | 2019-10-25 | 1 | -3/+3
* [Minor] Add safety check when using icu ubrk iterators | Vsevolod Stakhov | 2019-10-24 | 2 | -7/+42
* [Minor] Fix array size | Vsevolod Stakhov | 2019-09-26 | 1 | -2/+2
* [Fix] Fix normalization of non-alphabet based languages | Vsevolod Stakhov | 2019-08-27 | 1 | -6/+2
* [Minor] Some more alignment fixes | Vsevolod Stakhov | 2019-08-12 | 1 | -4/+0
* [Minor] Add long texts sanity checks | Vsevolod Stakhov | 2019-07-25 | 1 | -1/+54
* [Project] Adopt libstat code | Vsevolod Stakhov | 2019-07-12 | 1 | -6/+9
* [Rework] Add C++ guards to all headers | Vsevolod Stakhov | 2019-07-08 | 1 | -16/+27
* [Fix] Fix DoS caused by bug in glib | Vsevolod Stakhov | 2019-05-08 | 1 | -0/+8
* [Minor] Fix some more suspicious cases | Vsevolod Stakhov | 2019-04-07 | 1 | -2/+2
* [Feature] Try to filter bad unicode types during normalisation | Vsevolod Stakhov | 2019-02-25 | 1 | -1/+19
* [Minor] Slightly extend what we can treat as words | Vsevolod Stakhov | 2018-11-30 | 1 | -1/+1
* [Fix] Some fixes for raw parts | Vsevolod Stakhov | 2018-11-27 | 1 | -0/+1
* [Minor] Fix for DSN | Vsevolod Stakhov | 2018-11-27 | 1 | -1/+1
* [Feature] Ignore bogus whitespaces in the words | Vsevolod Stakhov | 2018-11-26 | 1 | -1/+8
* [Project] Use URLs TLDs instead of !!EX!! in stat tokens | Vsevolod Stakhov | 2018-11-26 | 1 | -16/+39
* [Project] Use more generalised API to produce meta words | Vsevolod Stakhov | 2018-11-26 | 2 | -50/+82
* [Minor] Check language detector pointer before use | Vsevolod Stakhov | 2018-11-26 | 1 | -2/+2
* [Project] Finish basic tasks in new unicode project | Vsevolod Stakhov | 2018-11-25 | 1 | -10/+20
* [Project] Rework parts conversion and serialization | Vsevolod Stakhov | 2018-11-25 | 1 | -8/+5
* [Project] Another try to normalize unicode properly | Vsevolod Stakhov | 2018-11-25 | 2 | -109/+137
* [Project] Various unicode fixes in language detector | Vsevolod Stakhov | 2018-11-25 | 1 | -3/+2
* [Project] Rework stemming | Vsevolod Stakhov | 2018-11-24 | 3 | -9/+105
* [Project] Add function to normalize unicode on per words basis | Vsevolod Stakhov | 2018-11-24 | 2 | -1/+137
* [Project] Start words unicode structure rework | Vsevolod Stakhov | 2018-11-24 | 1 | -48/+52
* [Feature] Skip stop words in statistics | Vsevolod Stakhov | 2018-11-15 | 2 | -19/+31
* [Fix] Rework bayes calculations... | Vsevolod Stakhov | 2018-11-14 | 1 | -1/+1
* [Minor] Move subject tokenisation to a separate routine | Vsevolod Stakhov | 2018-11-08 | 2 | -3/+69
* [CritFix] Fix words decay one more time (affects long messages) | Vsevolod Stakhov | 2018-09-25 | 1 | -4/+8
* [Fix] Fix words decay algorithm | Vsevolod Stakhov | 2018-09-11 | 1 | -1/+1
* [Minor] Properly set flag on text tokens | Vsevolod Stakhov | 2018-09-07 | 1 | -3/+4
* [Minor] Further fixes in tokenization algorithm | Vsevolod Stakhov | 2018-09-07 | 1 | -20/+28
* [Feature] Implement new text tokenizer based on libicu | Vsevolod Stakhov | 2018-09-06 | 2 | -203/+218
* [Rework] Rework utf content processing in text parts | Vsevolod Stakhov | 2018-09-05 | 2 | -5/+5
* [Project] Start unicode rework | Vsevolod Stakhov | 2018-08-23 | 2 | -20/+28
* [Minor] Fix out-of-boundary access | Vsevolod Stakhov | 2018-03-27 | 1 | -1/+1
* [Fix] Do not skip the last character | Vsevolod Stakhov | 2017-10-31 | 1 | -0/+1