path: root/src/libstat/tokenizers
Commit message | Author | Age | Files | Lines
* [Minor] Fix array size | Vsevolod Stakhov | 2019-09-26 | 1 | -2/+2
* [Fix] Fix normalization of non-alphabet based languages | Vsevolod Stakhov | 2019-08-27 | 1 | -6/+2
* [Minor] Some more alignment fixes | Vsevolod Stakhov | 2019-08-12 | 1 | -4/+0
* [Minor] Add long texts sanity checks | Vsevolod Stakhov | 2019-07-25 | 1 | -1/+54
* [Project] Adopt libstat code | Vsevolod Stakhov | 2019-07-12 | 1 | -6/+9
* [Rework] Add C++ guards to all headers | Vsevolod Stakhov | 2019-07-08 | 1 | -16/+27
* [Fix] Fix DoS caused by bug in glib | Vsevolod Stakhov | 2019-05-08 | 1 | -0/+8
* [Minor] Fix some more suspicious cases | Vsevolod Stakhov | 2019-04-07 | 1 | -2/+2
* [Feature] Try to filter bad unicode types during normalisation | Vsevolod Stakhov | 2019-02-25 | 1 | -1/+19
* [Minor] Slightly extend what we can treat as words | Vsevolod Stakhov | 2018-11-30 | 1 | -1/+1
* [Fix] Some fixes for raw parts | Vsevolod Stakhov | 2018-11-27 | 1 | -0/+1
* [Minor] Fix for DSN | Vsevolod Stakhov | 2018-11-27 | 1 | -1/+1
* [Feature] Ignore bogus whitespaces in the words | Vsevolod Stakhov | 2018-11-26 | 1 | -1/+8
* [Project] Use URLs TLDs instead of !!EX!! in stat tokens | Vsevolod Stakhov | 2018-11-26 | 1 | -16/+39
* [Project] Use more generalised API to produce meta words | Vsevolod Stakhov | 2018-11-26 | 2 | -50/+82
* [Minor] Check language detector pointer before use | Vsevolod Stakhov | 2018-11-26 | 1 | -2/+2
* [Project] Finish basic tasks in new unicode project | Vsevolod Stakhov | 2018-11-25 | 1 | -10/+20
* [Project] Rework parts conversion and serialization | Vsevolod Stakhov | 2018-11-25 | 1 | -8/+5
* [Project] Another try to normalize unicode properly | Vsevolod Stakhov | 2018-11-25 | 2 | -109/+137
* [Project] Various unicode fixes in language detector | Vsevolod Stakhov | 2018-11-25 | 1 | -3/+2
* [Project] Rework stemming | Vsevolod Stakhov | 2018-11-24 | 3 | -9/+105
* [Project] Add function to normalize unicode on per words basis | Vsevolod Stakhov | 2018-11-24 | 2 | -1/+137
* [Project] Start words unicode structure rework | Vsevolod Stakhov | 2018-11-24 | 1 | -48/+52
* [Feature] Skip stop words in statistics | Vsevolod Stakhov | 2018-11-15 | 2 | -19/+31
* [Fix] Rework bayes calculations... | Vsevolod Stakhov | 2018-11-14 | 1 | -1/+1
* [Minor] Move subject tokenisation to a separate routine | Vsevolod Stakhov | 2018-11-08 | 2 | -3/+69
* [CritFix] Fix words decay one more time (affects long messages) | Vsevolod Stakhov | 2018-09-25 | 1 | -4/+8
* [Fix] Fix words decay algorithm | Vsevolod Stakhov | 2018-09-11 | 1 | -1/+1
* [Minor] Properly set flag on text tokens | Vsevolod Stakhov | 2018-09-07 | 1 | -3/+4
* [Minor] Further fixes in tokenization algorithm | Vsevolod Stakhov | 2018-09-07 | 1 | -20/+28
* [Feature] Implement new text tokenizer based on libicu | Vsevolod Stakhov | 2018-09-06 | 2 | -203/+218
* [Rework] Rework utf content processing in text parts | Vsevolod Stakhov | 2018-09-05 | 2 | -5/+5
* [Project] Start unicode rework | Vsevolod Stakhov | 2018-08-23 | 2 | -20/+28
* [Minor] Fix out-of-boundary access | Vsevolod Stakhov | 2018-03-27 | 1 | -1/+1
* [Fix] Do not skip the last character | Vsevolod Stakhov | 2017-10-31 | 1 | -0/+1
* [Fix] Do not try to dereference last character | Vsevolod Stakhov | 2017-10-31 | 1 | -1/+8
* [Minor] Further g_slice cleanup | Vsevolod Stakhov | 2017-10-28 | 1 | -2/+2
* [Fix] Further tokenization fixes | Vsevolod Stakhov | 2017-10-21 | 1 | -1/+1
* [Fix] Deal with another case when processing exceptions | Vsevolod Stakhov | 2017-10-21 | 1 | -0/+8
* [Fix] Do not strip last character in the last word | Vsevolod Stakhov | 2017-10-21 | 1 | -2/+2
* [Fix] Fix another tokenization issue | Vsevolod Stakhov | 2017-10-21 | 1 | -1/+31
* [CritFix] Another portion of tokenization fixes | Vsevolod Stakhov | 2017-10-18 | 1 | -16/+19
* [Feature] Add unigramms support in bayes | Vsevolod Stakhov | 2017-04-13 | 1 | -0/+12
* [Minor] More strict boundaries checks and composites policies fix | Vsevolod Stakhov | 2017-04-09 | 1 | -0/+2
* [Fix] Fix processing of small tokens vectors | Vsevolod Stakhov | 2017-04-04 | 1 | -3/+8
* [Rework] Set token data as uint64_t instead of chars array | Vsevolod Stakhov | 2017-04-04 | 2 | -17/+3
* [Minor] Some fixes for displaying tokens info | Vsevolod Stakhov | 2017-03-31 | 1 | -2/+3
* [Feature] Store text tokens inside bayes tokens | Vsevolod Stakhov | 2017-03-31 | 2 | -11/+23
* [Minor] Fix various style issues | Vsevolod Stakhov | 2017-03-23 | 1 | -1/+0
* [Minor] Use libicu for tokenizers | Vsevolod Stakhov | 2017-02-25 | 1 | -18/+22