path: root/src/libstat/tokenizers/tokenizers.c
Commit message | Author | Age | Files | Lines
* [Fix] Fix format string and some length issues | Vsevolod Stakhov | 2023-09-26 | 1 | -11/+10
* [Rework] Use clang-format to unify formatting in all sources | Vsevolod Stakhov | 2023-07-26 | 1 | -240/+236
* [Minor] Get rid of some compiler warnings | Vsevolod Stakhov | 2022-11-04 | 1 | -1/+1
* [Rework] Html: Further rework of the tags content extraction | Vsevolod Stakhov | 2021-06-22 | 1 | -8/+0
* [Fix] Fix tokenization near exceptions | Vsevolod Stakhov | 2021-06-17 | 1 | -2/+2
* [Project] Add process exceptions for invisible text | Vsevolod Stakhov | 2021-06-16 | 1 | -0/+8
* [Minor] Reduce timer calls when doing tokenisation | Vsevolod Stakhov | 2021-06-07 | 1 | -1/+4
* [Rework] Rework URL structure: adjust tld part | Vsevolod Stakhov | 2020-03-09 | 1 | -1/+1
* [Minor] Oops, check for UBRK_DONE first | Vsevolod Stakhov | 2019-10-25 | 1 | -3/+3
* [Minor] Add safety check when using icu ubrk iterators | Vsevolod Stakhov | 2019-10-24 | 1 | -6/+40
* [Minor] Fix array size | Vsevolod Stakhov | 2019-09-26 | 1 | -2/+2
* [Fix] Fix normalization of non-alphabet based languages | Vsevolod Stakhov | 2019-08-27 | 1 | -6/+2
* [Minor] Some more alignment fixes | Vsevolod Stakhov | 2019-08-12 | 1 | -4/+0
* [Minor] Add long texts sanity checks | Vsevolod Stakhov | 2019-07-25 | 1 | -1/+54
* [Project] Adopt libstat code | Vsevolod Stakhov | 2019-07-12 | 1 | -6/+9
* [Fix] Fix DoS caused by bug in glib | Vsevolod Stakhov | 2019-05-08 | 1 | -0/+8
* [Minor] Fix some more suspicious cases | Vsevolod Stakhov | 2019-04-07 | 1 | -2/+2
* [Feature] Try to filter bad unicode types during normalisation | Vsevolod Stakhov | 2019-02-25 | 1 | -1/+19
* [Minor] Slightly extend what we can treat as words | Vsevolod Stakhov | 2018-11-30 | 1 | -1/+1
* [Fix] Some fixes for raw parts | Vsevolod Stakhov | 2018-11-27 | 1 | -0/+1
* [Minor] Fix for DSN | Vsevolod Stakhov | 2018-11-27 | 1 | -1/+1
* [Feature] Ignore bogus whitespaces in the words | Vsevolod Stakhov | 2018-11-26 | 1 | -1/+8
* [Project] Use URLs TLDs instead of !!EX!! in stat tokens | Vsevolod Stakhov | 2018-11-26 | 1 | -16/+39
* [Project] Use more generalised API to produce meta words | Vsevolod Stakhov | 2018-11-26 | 1 | -48/+79
* [Minor] Check language detector pointer before use | Vsevolod Stakhov | 2018-11-26 | 1 | -2/+2
* [Project] Rework parts conversion and serialization | Vsevolod Stakhov | 2018-11-25 | 1 | -8/+5
* [Project] Another try to normalize unicode properly | Vsevolod Stakhov | 2018-11-25 | 1 | -109/+136
* [Project] Various unicode fixes in language detector | Vsevolod Stakhov | 2018-11-25 | 1 | -3/+2
* [Project] Rework stemming | Vsevolod Stakhov | 2018-11-24 | 1 | -2/+98
* [Project] Add function to normalize unicode on per words basis | Vsevolod Stakhov | 2018-11-24 | 1 | -1/+133
* [Project] Start words unicode structure rework | Vsevolod Stakhov | 2018-11-24 | 1 | -48/+52
* [Minor] Move subject tokenisation to a separate routine | Vsevolod Stakhov | 2018-11-08 | 1 | -3/+67
* [CritFix] Fix words decay one more time (affects long messages) | Vsevolod Stakhov | 2018-09-25 | 1 | -4/+8
* [Fix] Fix words decay algorithm | Vsevolod Stakhov | 2018-09-11 | 1 | -1/+1
* [Minor] Properly set flag on text tokens | Vsevolod Stakhov | 2018-09-07 | 1 | -3/+4
* [Minor] Further fixes in tokenization algorithm | Vsevolod Stakhov | 2018-09-07 | 1 | -20/+28
* [Feature] Implement new text tokenizer based on libicu | Vsevolod Stakhov | 2018-09-06 | 1 | -203/+215
* [Rework] Rework utf content processing in text parts | Vsevolod Stakhov | 2018-09-05 | 1 | -4/+4
* [Project] Start unicode rework | Vsevolod Stakhov | 2018-08-23 | 1 | -17/+17
* [Minor] Fix out-of-boundary access | Vsevolod Stakhov | 2018-03-27 | 1 | -1/+1
* [Fix] Do not skip the last character | Vsevolod Stakhov | 2017-10-31 | 1 | -0/+1
* [Fix] Do not try to dereference last character | Vsevolod Stakhov | 2017-10-31 | 1 | -1/+8
* [Fix] Further tokenization fixes | Vsevolod Stakhov | 2017-10-21 | 1 | -1/+1
* [Fix] Deal with another case when processing exceptions | Vsevolod Stakhov | 2017-10-21 | 1 | -0/+8
* [Fix] Do not strip last character in the last word | Vsevolod Stakhov | 2017-10-21 | 1 | -2/+2
* [Fix] Fix another tokenization issue | Vsevolod Stakhov | 2017-10-21 | 1 | -1/+31
* [CritFix] Another portion of tokenization fixes | Vsevolod Stakhov | 2017-10-18 | 1 | -16/+19
* [Minor] More strict boundaries checks and composites policies fix | Vsevolod Stakhov | 2017-04-09 | 1 | -0/+2
* [Rework] Set token data as uint64_t instead of chars array | Vsevolod Stakhov | 2017-04-04 | 1 | -12/+0
* [Feature] Store text tokens inside bayes tokens | Vsevolod Stakhov | 2017-03-31 | 1 | -0/+1