| Commit message | Author | Age | Files | Lines |
* | [Rework] Html: Further rework of the tags content extraction | Vsevolod Stakhov | 2021-06-22 | 1 | -8/+0 |
* | [Fix] Fix tokenization near exceptions | Vsevolod Stakhov | 2021-06-17 | 1 | -2/+2 |
* | [Project] Add process exceptions for invisible text | Vsevolod Stakhov | 2021-06-16 | 1 | -0/+8 |
* | [Minor] Reduce timer calls when doing tokenisation | Vsevolod Stakhov | 2021-06-07 | 1 | -1/+4 |
* | [Rework] Rework URL structure: adjust tld part | Vsevolod Stakhov | 2020-03-09 | 1 | -1/+1 |
* | [Minor] Oops, check for UBRK_DONE first | Vsevolod Stakhov | 2019-10-25 | 1 | -3/+3 |
* | [Minor] Add safety check when using icu ubrk iterators | Vsevolod Stakhov | 2019-10-24 | 1 | -6/+40 |
* | [Minor] Fix array size | Vsevolod Stakhov | 2019-09-26 | 1 | -2/+2 |
* | [Fix] Fix normalization of non-alphabet based languages | Vsevolod Stakhov | 2019-08-27 | 1 | -6/+2 |
* | [Minor] Some more alignment fixes | Vsevolod Stakhov | 2019-08-12 | 1 | -4/+0 |
* | [Minor] Add long texts sanity checks | Vsevolod Stakhov | 2019-07-25 | 1 | -1/+54 |
* | [Project] Adopt libstat code | Vsevolod Stakhov | 2019-07-12 | 1 | -6/+9 |
* | [Fix] Fix DoS caused by bug in glib | Vsevolod Stakhov | 2019-05-08 | 1 | -0/+8 |
* | [Minor] Fix some more suspicious cases | Vsevolod Stakhov | 2019-04-07 | 1 | -2/+2 |
* | [Feature] Try to filter bad unicode types during normalisation | Vsevolod Stakhov | 2019-02-25 | 1 | -1/+19 |
* | [Minor] Slightly extend what we can treat as words | Vsevolod Stakhov | 2018-11-30 | 1 | -1/+1 |
* | [Fix] Some fixes for raw parts | Vsevolod Stakhov | 2018-11-27 | 1 | -0/+1 |
* | [Minor] Fix for DSN | Vsevolod Stakhov | 2018-11-27 | 1 | -1/+1 |
* | [Feature] Ignore bogus whitespaces in the words | Vsevolod Stakhov | 2018-11-26 | 1 | -1/+8 |
* | [Project] Use URLs TLDs instead of !!EX!! in stat tokens | Vsevolod Stakhov | 2018-11-26 | 1 | -16/+39 |
* | [Project] Use more generalised API to produce meta words | Vsevolod Stakhov | 2018-11-26 | 1 | -48/+79 |
* | [Minor] Check language detector pointer before use | Vsevolod Stakhov | 2018-11-26 | 1 | -2/+2 |
* | [Project] Rework parts conversion and serialization | Vsevolod Stakhov | 2018-11-25 | 1 | -8/+5 |
* | [Project] Another try to normalize unicode properly | Vsevolod Stakhov | 2018-11-25 | 1 | -109/+136 |
* | [Project] Various unicode fixes in language detector | Vsevolod Stakhov | 2018-11-25 | 1 | -3/+2 |
* | [Project] Rework stemming | Vsevolod Stakhov | 2018-11-24 | 1 | -2/+98 |
* | [Project] Add function to normalize unicode on per words basis | Vsevolod Stakhov | 2018-11-24 | 1 | -1/+133 |
* | [Project] Start words unicode structure rework | Vsevolod Stakhov | 2018-11-24 | 1 | -48/+52 |
* | [Minor] Move subject tokenisation to a separate routine | Vsevolod Stakhov | 2018-11-08 | 1 | -3/+67 |
* | [CritFix] Fix words decay one more time (affects long messages) | Vsevolod Stakhov | 2018-09-25 | 1 | -4/+8 |
* | [Fix] Fix words decay algorithm | Vsevolod Stakhov | 2018-09-11 | 1 | -1/+1 |
* | [Minor] Properly set flag on text tokens | Vsevolod Stakhov | 2018-09-07 | 1 | -3/+4 |
* | [Minor] Further fixes in tokenization algorithm | Vsevolod Stakhov | 2018-09-07 | 1 | -20/+28 |
* | [Feature] Implement new text tokenizer based on libicu | Vsevolod Stakhov | 2018-09-06 | 1 | -203/+215 |
* | [Rework] Rework utf content processing in text parts | Vsevolod Stakhov | 2018-09-05 | 1 | -4/+4 |
* | [Project] Start unicode rework | Vsevolod Stakhov | 2018-08-23 | 1 | -17/+17 |
* | [Minor] Fix out-of-boundary access | Vsevolod Stakhov | 2018-03-27 | 1 | -1/+1 |
* | [Fix] Do not skip the last character | Vsevolod Stakhov | 2017-10-31 | 1 | -0/+1 |
* | [Fix] Do not try to dereference last character | Vsevolod Stakhov | 2017-10-31 | 1 | -1/+8 |
* | [Fix] Further tokenization fixes | Vsevolod Stakhov | 2017-10-21 | 1 | -1/+1 |
* | [Fix] Deal with another case when processing exceptions | Vsevolod Stakhov | 2017-10-21 | 1 | -0/+8 |
* | [Fix] Do not strip last character in the last word | Vsevolod Stakhov | 2017-10-21 | 1 | -2/+2 |
* | [Fix] Fix another tokenization issue | Vsevolod Stakhov | 2017-10-21 | 1 | -1/+31 |
* | [CritFix] Another portion of tokenization fixes | Vsevolod Stakhov | 2017-10-18 | 1 | -16/+19 |
* | [Minor] More strict boundaries checks and composites policies fix | Vsevolod Stakhov | 2017-04-09 | 1 | -0/+2 |
* | [Rework] Set token data as uint64_t instead of chars array | Vsevolod Stakhov | 2017-04-04 | 1 | -12/+0 |
* | [Feature] Store text tokens inside bayes tokens | Vsevolod Stakhov | 2017-03-31 | 1 | -0/+1 |
* | [Minor] Use libicu for tokenizers | Vsevolod Stakhov | 2017-02-25 | 1 | -18/+22 |
* | [Rework] Use a special structure for stats tokens | Vsevolod Stakhov | 2017-02-14 | 1 | -8/+15 |
* | [Rework] Rework exceptions and newlines processing | Vsevolod Stakhov | 2016-07-13 | 1 | -9/+13 |