author | Vsevolod Stakhov <vsevolod@highsecure.ru> | 2020-02-25 09:55:31 +0000 |
---|---|---|
committer | Vsevolod Stakhov <vsevolod@highsecure.ru> | 2020-02-25 09:55:31 +0000 |
commit | b87995255fa2ef0de97d509b8cd27860f014e90f (patch) | |
tree | ff7fcc84aa85fcd4cd129d94f6fb23ac5f91d4cb /contrib/snowball/NEWS | |
parent | 52154a6c1dd7e46c174d4aab782494b92f955df5 (diff) | |
[Rework] Update snowball stemmer to 2.0 and remove all crap aside of UTF8
Diffstat (limited to 'contrib/snowball/NEWS')
-rw-r--r-- | contrib/snowball/NEWS | 407 |
1 files changed, 407 insertions, 0 deletions
diff --git a/contrib/snowball/NEWS b/contrib/snowball/NEWS
new file mode 100644
index 000000000..c71c12dd3
--- /dev/null
+++ b/contrib/snowball/NEWS
@@ -0,0 +1,407 @@
+Snowball 2.0.0 (2019-10-02)
+===========================
+
+C/C++
+-----
+
+* Fully handle 4-byte UTF-8 sequences. Previously `hop` and `next` handled
+  sequences of any length, but commands which look at the character value only
+  handled sequences up to length 3. Fixes #89.
+
+* Fix handling of a 3-byte UTF-8 sequence in a grouping in `backwardmode`.
+
+Java
+----
+
+* TestApp.java:
+
+  - Always use UTF-8 for I/O. Patch from David Corbett (#80).
+
+  - Allow reading input from stdin.
+
+  - Remove rather pointless "stem n times" feature.
+
+  - Only lower case ASCII to match stemwords.c.
+
+  - Stem empty lines too to match stemwords.c.
+
+Code Quality Improvements
+-------------------------
+
+* Fix various warnings from newer compilers.
+
+* Improve use of `const`.
+
+* Share common functions between compiler backends rather than having multiple
+  copies of the same code.
+
+* Assorted code clean-up.
+
+* Initialise line_labelled member of struct generator to 0. Previously we were
+  invoking undefined behaviour, though in practice it'll be zero initialised on
+  most platforms.
+
+New Code Generators
+-------------------
+
+* Add Python generator (#24). Originally written by Yoshiki Shibukawa, with
+  additional updates by Dmitry Shachnev.
+
+* Add Javascript generator. Based on JSX generator (#26) written by Yoshiki
+  Shibukawa.
+
+* Add Rust generator from Jakob Demler (#51).
+
+* Add Go generator from Marty Schoch (#57).
+
+* Add C# generator. Based on patch from Cesar Souza (#16, #17).
+
+* Add Pascal generator. Based on Delphi backend from stemming.zip file on old
+  website (#75).
+
+New Language Features
+---------------------
+
+* Add `len` and `lenof` to measure Unicode length. These are similar to `size`
+  and `sizeof` (respectively), but `size` and `sizeof` return the length in
+  bytes under `-utf8`, whereas these new commands give the same result whether
+  using `-utf8`, `-widechars` or neither (but under `-utf8` they are O(n) in
+  the length of the string). For compatibility with existing code which might
+  use these as variable or function names, they stop being treated as tokens if
+  declared to be a variable or function.
+
+* New `{U+1234}` stringdef notation for Unicode codepoints.
+
+* More versatile integer tests. Now you can compare any two arithmetic
+  expressions with a relational operator in parentheses after the `$`, so for
+  example `$(len > 3)` can now be used when previously a temporary variable was
+  required: `$tmp = len $tmp > 3`
+
+Code generation improvements
+----------------------------
+
+* General:
+
+  + Avoid unnecessarily saving and restoring of the cursor for more commands -
+    `atlimit`, `do`, `set` and `unset` all leave the cursor alone or always
+    restore its value, and for C `booltest` (which other languages already
+    handled).
+
+  + Special case handling for `setlimit tomark AE`. All uses of setlimit in
+    the current stemmers we ship follow this pattern, and by special-casing we
+    can avoid having to save and restore the cursor (#74).
+
+  + Merge duplicate actions in the same `among`. This reduces the size of the
+    switch/if-chain in the generated code which dispatch the among for many of
+    the stemmers.
+
+  + Generate simpler code for `among`. We always check for a zero return value
+    when we call the among, so there's no point also checking for that in the
+    switch/if-chain. We can also avoid the switch/if-chain entirely when
+    there's only one possible outcome (besides the zero return).
+
+  + Optimise code generated for `do <function call>`. This speeds up "make
+    check_python" by about 2%, and should speed up other interpreted languages
+    too (#110).
+
+  + Generate more and better comments referencing snowball source.
+
+  + Add homepage URL and compiler version as comments in generated files.
+
+* C/C++:
+
+  + Fix `size` and `sizeof` to not report one too high (reported by Assem
+    Chelli in #32).
+
+  + If signal `f` from a function call would lead to return from the current
+    function then handle this and bailing out on an error together with a
+    simple `if (ret <= 0) return ret;`
+
+  + Inline testing for single character literals.
+
+  + Avoid generating `|| 0` in corner case - this can result in a compiler
+    warning when building the generated code.
+
+  + Implement `insert_v()` in terms of `insert_s()`.
+
+  + Add conditional `extern "C"` so `runtime/api.h` can be included from C++
+    code. Closes #90, reported by vvarma.
+
+* Java:
+
+  + Fix functions in `among` to work in Java. We seem to need to make the
+    methods called from among `public` instead of `private`, and to call them
+    on `this` instead of the `methodObject` (which is cleaner anyway). No
+    revision in version control seems to generate working code for this case,
+    but Richard says it definitely used to work - possibly older JVMs failed to
+    correctly enforce the access controls when methods were invoked by
+    reflection.
+
+  + Code after handling `f` by returning from the current function is
+    unreachable too.
+
+  + Previously we incorrectly decided that code after an `or` was
+    unreachable in certain cases. None of the current stemmers in the
+    distribution triggered this, but Martin Porter's snowball version
+    of the Schinke Latin stemmer does. Fixes #58, reported by Alexander
+    Myltsev.
+
+  + The reachability logic was failing to consider reachability from
+    the final command in an `or`. Fixes #82, reported by David Corbett.
+
+  + Fix `maxint` and `minint`. Patch from David Corbett in #31.
+
+  + Fix `$` on strings. The previous generated code was just wrong. This
+    doesn't affect any of the included algorithms, but for example breaks
+    Martin Porter's snowball implementation of Schinke's Latin Stemmer.
+    Issue noted by Jakob Demler while working on the Rust backend in #51,
+    and reported in the Schinke's Latin Stemmer by Alexander Myltsev
+    in #58.
+
+  + Make SnowballProgram objects serializable. Patch from Oleg Smirnov in #43.
+
+  + Eliminate range-check implementation for groupings. This was removed from
+    the C generator 10 years earlier, isn't used for any of the existing
+    algorithms, and it doesn't seem likely it would be - the grouping would
+    have to consist entirely of a contiguous block of Unicode code-points.
+
+  + Simplify code generated for `repeat` and `atleast`.
+
+  + Eliminate unused return values and variables from runtime functions.
+
+  + Only import the `among` and `SnowballProgram` classes if they're actually
+    used.
+
+  + Only generate `copy_from()` method if it's used.
+
+  + Merge runtime functions `eq_s` and `eq_v`.
+
+  + Java arrays know their own length so stop storing it separately.
+
+  + Escape char 127 (DEL) in generated Java code. It's unlikely that this
+    character would actually be used in a real stemmer, so this was more of a
+    theoretical bug.
+
+  + Drop unused import of InvocationTargetException from SnowballStemmer.
+    Reported by GerritDeMeulder in #72.
+
+  + Fix lint check issues in generated Java code. The stemmer classes are only
+    referenced in the example app via reflection, so add
+    @SuppressWarnings("unused") for them. The stemmer classes override
+    equals() and hashCode() methods from the standard java Object class, so
+    mark these with @Override. Both suggested by GerritDeMeulder in #72.
+
+  + Declare Java variables at point of use in generated code. Putting all
+    declarations at the top of the function was adding unnecessary complexity
+    to the Java generator code for no benefit.
+
+  + Improve formatting of generated code.
+
+New stemming algorithms
+-----------------------
+
+* Add Tamil stemmer from Damodharan Rajalingam (#2, #3).
+
+* Add Arabic stemmer from Assem Chelli (#32, #50).
+
+* Add Irish stemmer from Jim O'Regan (#48).
+
+* Add Nepali stemmer from Arthur Zakirov (#70).
+
+* Add Indonesian stemmer from Olly Betts (#71).
+
+* Add Hindi stemmer from Olly Betts (#73). Thanks to David Corbett for review.
+
+* Add Lithuanian stemmer from Dainius Jocas (#22, #76).
+
+* Add Greek stemmer from Oleg Smirnov (#44).
+
+* Add Catalan and Basque stemmers from Israel Olalla (#104).
+
+Behavioural changes to existing algorithms
+------------------------------------------
+
+* Portuguese:
+
+  + Replace incorrect Spanish suffixes by Portuguese suffixes (#1).
+
+* French:
+
+  + The MSDOS CP850 version of the French algorithm was missing changes present
+    in the ISO8859-1 and Unicode versions. There's now a single version of
+    each algorithm which was based on the Unicode version.
+
+  + Recognize French suffixes even when they begin with diaereses. Patch from
+    David Corbett in #78.
+
+* Russian:
+
+  + We now normalise 'ё' to 'е' before stemming. The documentation has long
+    said "we assume ['ё'] is mapped into ['е']" but it's more convenient for
+    the stemmer to actually perform this normalisation. This change has no
+    effect if the caller is already normalising as we recommend. It's a change
+    in behaviour if they aren't, but 'ё' occurs rarely (there are currently no
+    instances in our test vocabulary) and this improves behaviour when it does
+    occur. Patch from Eugene Mirotin (#65, #68).
+
+* Finnish:
+
+  + Adjust the Finnish algorithm not to mangle numbers. This change also
+    means it tends to leave foreign words alone. Fixes #66.
+
+* Danish:
+
+  + Adjust Danish algorithm not to mangle alphanumeric codes. In particular
+    alphanumeric codes ending in a double digit (e.g. 0x0e00, hal9000,
+    space1999) are no longer mangled. See #81.
+
+Optimisations to existing algorithms
+------------------------------------
+
+* Turkish:
+
+  + Simplify uses of `test` in stemmer code.
+
+  + Check for 'ad' or 'soyad' more efficiently, and without needing the
+    strlen variable. This speeds up "make check_utf8_turkish" by 11%
+    on x86 Linux.
+
+* Kraaij-Pohlmann:
+
+  + Eliminate variable x - `$p1 <= cursor` is simpler and a little more
+    efficient than `setmark x $x >= p1`.
+
+Code clarity improvements to existing algorithms
+------------------------------------------------
+
+* Turkish:
+
+  + Use , for cedilla to match the conventions used in other stemmers.
+
+* Kraaij-Pohlmann:
+
+  + Avoid cryptic `[among ( (])` ... `)` construct - instead use the same
+    `[substring] among (` ... `)` construct we do in other stemmers.
+
+Compiler
+--------
+
+* Support conventional --help and --version options.
+
+* Warn if -r or -ep used with backend other than C/C++.
+
+* Warn if encoding command line options are specified when generating code in a
+  language with a fixed encoding.
+
+* The default classname is now set based on the output filename, so `-n` is now
+  often no longer needed. Fixes #64.
+
+* Avoid potential one byte buffer over-read when parsing snowball code.
+
+* Avoid comparing with uninitialised array element during compilation.
+
+* Improve `-syntax` output for `setlimit L for C`.
+
+* Optimise away double negation so generators don't have to worry about
+  generating `--` (decrement operator in many languages). Fixes #52, reported
+  by David Corbett.
+
+* Improved compiler error and warning messages:
+
+  - We now report FILE:LINE: before each diagnostic message.
+
+  - Improve warnings for unused declarations/definitions.
+
+  - Warn for variables which are used, but either never initialised
+    or never read.
+
+  - Flag non-ASCII literal strings. This is an error for wide Unicode, but
+    only a warning for single-byte and UTF-8 which work so long as the source
+    encoding matches the encoding used in the generated stemmer code.
+
+  - Improve error recovery after an undeclared `define`. We now sniff the
+    token after the identifier and if it is `as` we parse as a routine,
+    otherwise we parse as a grouping. Previously we always just assumed it was
+    a routine, which gave a confusing second error if it was a grouping.
+
+  - Improve error recovery after an unexpected token in `among`. Previously
+    we acted as if the unexpected token closed the `among` (this probably
+    wasn't intended but just a missing `break;` in a switch statement). Now we
+    issue an error and try the next token.
+
+* Report error instead of silently truncating character values (e.g. `hex 123`
+  previously silently became byte 0x23 which is `#` rather than a
+  g-with-cedilla).
+
+* Enlarge the initial input buffer size to 8192 bytes and double each time we
+  hit the end. Snowball programs are typically a few KB in size (with the
+  current largest we ship being the Greek stemmer at 27KB) so the previous
+  approach of starting with a 10 byte input buffer and increasing its size by
+  50% plus 40 bytes each time it filled was inefficient, needing up to 15
+  reallocations to load greek.sbl.
+
+* Identify variables only used by one `routine`/`external`. This information
+  isn't yet used, but such variables which are also always written to before
+  being read can be emitted as local variables in most target languages.
+
+* We now allow multiple source files on command line, and allow them to be
+  after (or even interspersed with) options to better match modern Unix
+  conventions. Support for multiple source files allows specifying a single
+  byte character set mapping via a source file of `stringdef`.
+
+* Avoid infinite recursion in compiler when optimising a recursive snowball
+  function. Recursive functions aren't typical in snowball programs, but
+  the compiler shouldn't crash for any input, especially not a valid one.
+  We now simply limit how deep the compiler will recurse and make the
+  pessimistic assumption in the unlikely event we hit this limit.
+
+Build system:
+
+* `make clean` in C libstemmer_c distribution now removes `examples/*.o`.
+  (#59)
+
+* Fix all the places which previously had to have a list of stemmers to work
+  dynamically or be generated, so now only modules.txt needs updating to add
+  a new stemmer.
+
+* Add check_java make target which runs tests for java.
+
+* Support gzipped test data (the uncompressed arabic test data is too big for
+  github).
+
+* GNUmakefile: Drop useless `-eprefix` and `-r` options from snowball
+  invocations for Java - these are only meaningful when generating C code.
+
+* Pass CFLAGS when linking which matches convention (e.g. automake does it) and
+  facilitates use of tools such as ASan. Fixes #84, reported by Thomas
+  Pointhuber.
+
+* Add CI builds with -std=c90 to check compiler and generated code are C90
+  (#54)
+
+libstemmer stuff:
+
+* Split out CPPFLAGS from CFLAGS and use CFLAGS when linking stemwords.
+
+* Add -O2 to CFLAGS.
+
+* Make generated tables of encodings and modules const.
+
+* Fix clang static analyzer memory leak warning (in practice this code path
+  can never actually be taken). Patch from Patrick O. Perry (#56)
+
+documentation
+
+* Added copyright and licensing details (#10).
+
+* Document that libstemmer supports ISO_8859_2 encoding. Currently hungarian
+  and romanian are available in ISO_8859_2.
+
+* Remove documentation falsely claiming that libstemmer supports CP850
+  encoding.
+
+* CONTRIBUTING.rst: Add guidance for contributing new stemming algorithms and
+  new language backends.
+
+* Overhaul libstemmer_python_README. Most notably, replace the benchmark data
+  which was very out of date.
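The input-buffer change described in the Compiler section of the NEWS text can be checked with a short calculation. The sketch below is illustrative Python, not the compiler's actual C code. It assumes integer arithmetic for the old "50% plus 40 bytes" growth rule and uses 27 * 1024 bytes as a stand-in for the size of greek.sbl, since the notes only say 27KB. The helper name `reallocations` is hypothetical.

```python
def reallocations(target_size, initial_size, grow):
    """Count how many times a buffer must grow before it can hold target_size bytes."""
    size = initial_size
    count = 0
    while size < target_size:
        size = grow(size)
        count += 1
    return count


# Assumption: the notes give the Greek stemmer source as "27KB"; the exact
# byte count is not stated, so 27 * 1024 is used for illustration.
GREEK_SBL_BYTES = 27 * 1024

# Old strategy from the notes: start at 10 bytes, grow by 50% plus 40 bytes.
old_count = reallocations(GREEK_SBL_BYTES, 10, lambda n: n + n // 2 + 40)

# New strategy from the notes: start at 8192 bytes, double each time.
new_count = reallocations(GREEK_SBL_BYTES, 8192, lambda n: n * 2)

print(old_count, new_count)  # prints "15 2"
```

Under these assumptions the old rule needs 15 growth steps, matching the "up to 15 reallocations" figure in the notes, while the new rule needs 2.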