aboutsummaryrefslogtreecommitdiffstats
path: root/contrib/snowball/NEWS
diff options
context:
space:
mode:
authorVsevolod Stakhov <vsevolod@highsecure.ru>2020-02-25 09:55:31 +0000
committerVsevolod Stakhov <vsevolod@highsecure.ru>2020-02-25 09:55:31 +0000
commitb87995255fa2ef0de97d509b8cd27860f014e90f (patch)
treeff7fcc84aa85fcd4cd129d94f6fb23ac5f91d4cb /contrib/snowball/NEWS
parent52154a6c1dd7e46c174d4aab782494b92f955df5 (diff)
downloadrspamd-b87995255fa2ef0de97d509b8cd27860f014e90f.tar.gz
rspamd-b87995255fa2ef0de97d509b8cd27860f014e90f.zip
[Rework] Update snowball stemmer to 2.0 and remove all crap aside of UTF8
Diffstat (limited to 'contrib/snowball/NEWS')
-rw-r--r--contrib/snowball/NEWS407
1 files changed, 407 insertions, 0 deletions
diff --git a/contrib/snowball/NEWS b/contrib/snowball/NEWS
new file mode 100644
index 000000000..c71c12dd3
--- /dev/null
+++ b/contrib/snowball/NEWS
@@ -0,0 +1,407 @@
+Snowball 2.0.0 (2019-10-02)
+===========================
+
+C/C++
+-----
+
+* Fully handle 4-byte UTF-8 sequences. Previously `hop` and `next` handled
+ sequences of any length, but commands which look at the character value only
+ handled sequences up to length 3. Fixes #89.
+
+* Fix handling of a 3-byte UTF-8 sequence in a grouping in `backwardmode`.
+
+Java
+----
+
+* TestApp.java:
+
+ - Always use UTF-8 for I/O. Patch from David Corbett (#80).
+
+ - Allow reading input from stdin.
+
+ - Remove rather pointless "stem n times" feature.
+
+ - Only lower case ASCII to match stemwords.c.
+
+ - Stem empty lines too to match stemwords.c.
+
+Code Quality Improvements
+-------------------------
+
+* Fix various warnings from newer compilers.
+
+* Improve use of `const`.
+
+* Share common functions between compiler backends rather than having multiple
+ copies of the same code.
+
+* Assorted code clean-up.
+
+* Initialise line_labelled member of struct generator to 0. Previously we were
+ invoking undefined behaviour, though in practice it'll be zero initialised on
+ most platforms.
+
+New Code Generators
+-------------------
+
+* Add Python generator (#24). Originally written by Yoshiki Shibukawa, with
+ additional updates by Dmitry Shachnev.
+
+* Add Javascript generator. Based on JSX generator (#26) written by Yoshiki
+ Shibukawa.
+
+* Add Rust generator from Jakob Demler (#51).
+
+* Add Go generator from Marty Schoch (#57).
+
+* Add C# generator. Based on patch from Cesar Souza (#16, #17).
+
+* Add Pascal generator. Based on Delphi backend from stemming.zip file on old
+ website (#75).
+
+New Language Features
+---------------------
+
+* Add `len` and `lenof` to measure Unicode length. These are similar to `size`
+ and `sizeof` (respectively), but `size` and `sizeof` return the length in
+ bytes under `-utf8`, whereas these new commands give the same result whether
+ using `-utf8`, `-widechars` or neither (but under `-utf8` they are O(n) in
+ the length of the string). For compatibility with existing code which might
+ use these as variable or function names, they stop being treated as tokens if
+ declared to be a variable or function.
+
+* New `{U+1234}` stringdef notation for Unicode codepoints.
+
+* More versatile integer tests. Now you can compare any two arithmetic
+ expressions with a relational operator in parentheses after the `$`, so for
+ example `$(len > 3)` can now be used when previously a temporary variable was
+ required: `$tmp = len $tmp > 3`
+
+Code generation improvements
+----------------------------
+
+* General:
+
+ + Avoid unnecessarily saving and restoring of the cursor for more commands -
+ `atlimit`, `do`, `set` and `unset` all leave the cursor alone or always
+ restore its value, and for C `booltest` (which other languages already
+ handled).
+
+ + Special case handling for `setlimit tomark AE`. All uses of setlimit in
+ the current stemmers we ship follow this pattern, and by special-casing we
+ can avoid having to save and restore the cursor (#74).
+
+ + Merge duplicate actions in the same `among`. This reduces the size of the
+ switch/if-chain in the generated code which dispatch the among for many of
+ the stemmers.
+
+ + Generate simpler code for `among`. We always check for a zero return value
+ when we call the among, so there's no point also checking for that in the
+ switch/if-chain. We can also avoid the switch/if-chain entirely when
+ there's only one possible outcome (besides the zero return).
+
+ + Optimise code generated for `do <function call>`. This speeds up "make
+ check_python" by about 2%, and should speed up other interpreted languages
+ too (#110).
+
+ + Generate more and better comments referencing snowball source.
+
+ + Add homepage URL and compiler version as comments in generated files.
+
+* C/C++:
+
+ + Fix `size` and `sizeof` to not report one too high (reported by Assem
+ Chelli in #32).
+
+ + If signal `f` from a function call would lead to return from the current
+ function then handle this and bailing out on an error together with a
+ simple `if (ret <= 0) return ret;`
+
+ + Inline testing for a single character literals.
+
+ + Avoiding generating `|| 0` in corner case - this can result in a compiler
+ warning when building the generated code.
+
+ + Implement `insert_v()` in terms of `insert_s()`.
+
+ + Add conditional `extern "C"` so `runtime/api.h` can be included from C++
+ code. Closes #90, reported by vvarma.
+
+* Java:
+
+ + Fix functions in `among` to work in Java. We seem to need to make the
+ methods called from among `public` instead of `private`, and to call them
+ on `this` instead of the `methodObject` (which is cleaner anyway). No
+ revision in version control seems to generate working code for this case,
+ but Richard says it definitely used to work - possibly older JVMs failed to
+ correctly enforce the access controls when methods were invoked by
+ reflection.
+
+ + Code after handling `f` by returning from the current function is
+ unreachable too.
+
+ + Previously we incorrectly decided that code after an `or` was
+ unreachable in certain cases. None of the current stemmers in the
+ distribution triggered this, but Martin Porter's snowball version
+ of the Schinke Latin stemmer does. Fixes #58, reported by Alexander
+ Myltsev.
+
+ + The reachability logic was failing to consider reachability from
+ the final command in an `or`. Fixes #82, reported by David Corbett.
+
+ + Fix `maxint` and `minint`. Patch from David Corbett in #31.
+
+ + Fix `$` on strings. The previous generated code was just wrong. This
+ doesn't affect any of the included algorithms, but for example breaks
+ Martin Porter's snowball implementation of Schinke's Latin Stemmer.
+ Issue noted by Jakob Demler while working on the Rust backend in #51,
+ and reported in the Schinke's Latin Stemmer by Alexander Myltsev
+ in #58.
+
+ + Make SnowballProgram objects serializable. Patch from Oleg Smirnov in #43.
+
+ + Eliminate range-check implementation for groupings. This was removed from
+ the C generator 10 years earlier, isn't used for any of the existing
+ algorithms, and it doesn't seem likely it would be - the grouping would
+ have to consist entirely of a contiguous block of Unicode code-points.
+
+ + Simplify code generated for `repeat` and `atleast`.
+
+ + Eliminate unused return values and variables from runtime functions.
+
+ + Only import the `among` and `SnowballProgram` classes if they're actually
+ used.
+
+ + Only generate `copy_from()` method if it's used.
+
+ + Merge runtime functions `eq_s` and `eq_v` functions.
+
+ + Java arrays know their own length so stop storing it separately.
+
+ + Escape char 127 (DEL) in generated Java code. It's unlikely that this
+ character would actually be used in a real stemmer, so this was more of a
+ theoretical bug.
+
+ + Drop unused import of InvocationTargetException from SnowballStemmer.
+ Reported by GerritDeMeulder in #72.
+
+ + Fix lint check issues in generated Java code. The stemmer classes are only
+ referenced in the example app via reflection, so add
+ @SuppressWarnings("unused") for them. The stemmer classes override
+ equals() and hashCode() methods from the standard java Object class, so
+ mark these with @Override. Both suggested by GerritDeMeulder in #72.
+
+ + Declare Java variables at point of use in generated code. Putting all
+ declarations at the top of the function was adding unnecessary complexity
+ to the Java generator code for no benefit.
+
+ + Improve formatting of generated code.
+
+New stemming algorithms
+-----------------------
+
+* Add Tamil stemmer from Damodharan Rajalingam (#2, #3).
+
+* Add Arabic stemmer from Assem Chelli (#32, #50).
+
+* Add Irish stemmer Jim O'Regan (#48).
+
+* Add Nepali stemmer from Arthur Zakirov (#70).
+
+* Add Indonesian stemmer from Olly Betts (#71).
+
+* Add Hindi stemmer from Olly Betts (#73). Thanks to David Corbett for review.
+
+* Add Lithuanian stemmer from Dainius Jocas (#22, #76).
+
+* Add Greek stemmer from Oleg Smirnov (#44).
+
+* Add Catalan and Basque stemmers from Israel Olalla (#104).
+
+Behavioural changes to existing algorithms
+------------------------------------------
+
+* Portuguese:
+
+ + Replace incorrect Spanish suffixes by Portuguese suffixes (#1).
+
+* French:
+
+ + The MSDOS CP850 version of the French algorithm was missing changes present
+ in the ISO8859-1 and Unicode versions. There's now a single version of
+ each algorithm which was based on the Unicode version.
+
+ + Recognize French suffixes even when they begin with diaereses. Patch from
+ David Corbett in #78.
+
+* Russian:
+
+ + We now normalise 'ё' to 'е' before stemming. The documentation has long
+ said "we assume ['ё'] is mapped into ['е']" but it's more convenient for
+ the stemmer to actually perform this normalisation. This change has no
+ effect if the caller is already normalising as we recommend. It's a change
+ in behaviour they aren't, but 'ё' occurs rarely (there are currently no
+ instances in our test vocabulary) and this improves behaviour when it does
+ occur. Patch from Eugene Mirotin (#65, #68).
+
+* Finish:
+
+ + Adjust the Finnish algorithm not to mangle numbers. This change also
+ means it tends to leave foreign words alone. Fixes #66.
+
+* Danish:
+
+ + Adjust Danish algorithm not to mangle alphanumeric codes. In particular
+ alphanumeric codes ending in a double digit (e.g. 0x0e00, hal9000,
+ space1999) are no longer mangled. See #81.
+
+Optimisations to existing algorithms
+------------------------------------
+
+* Turkish:
+
+ + Simplify uses of `test` in stemmer code.
+
+ + Check for 'ad' or 'soyad' more efficiently, and without needing the
+ strlen variable. This speeds up "make check_utf8_turkish" by 11%
+ on x86 Linux.
+
+* Kraaij-Pohlmann:
+
+ + Eliminate variable x `$p1 <= cursor` is simpler and a little more efficient
+ than `setmark x $x >= p1`.
+
+Code clarity improvements to existing algorithms
+------------------------------------------------
+
+* Turkish:
+
+ + Use , for cedilla to match the conventions used in other stemmers.
+
+* Kraaij-Pohlmann:
+
+ + Avoid cryptic `[among ( (])` ... `)` construct - instead use the same
+ `[substring] among (` ... `)` construct we do in other stemmers.
+
+Compiler
+--------
+
+* Support conventional --help and --version options.
+
+* Warn if -r or -ep used with backend other than C/C++.
+
+* Warn if encoding command line options are specified when generating code in a
+ language with a fixed encoding.
+
+* The default classname is now set based on the output filename, so `-n` is now
+ often no longer needed. Fixes #64.
+
+* Avoid potential one byte buffer over-read when parsing snowball code.
+
+* Avoid comparing with uninitialised array element during compilation.
+
+* Improve `-syntax` output for `setlimit L for C`.
+
+* Optimise away double negation so generators don't have to worry about
+ generating `--` (decrement operator in many languages). Fixes #52, reported
+ by David Corbett.
+
+* Improved compiler error and warning messages:
+
+ - We now report FILE:LINE: before each diagnostic message.
+
+ - Improve warnings for unused declarations/definitions.
+
+ - Warn for variables which are used, but either never initialised
+ or never read.
+
+ - Flag non-ASCII literal strings. This is an error for wide Unicode, but
+ only a warning for single-byte and UTF-8 which work so long as the source
+ encoding matches the encoding used in the generated stemmer code.
+
+ - Improve error recovery after an undeclared `define`. We now sniff the
+ token after the identifier and if it is `as` we parse as a routine,
+ otherwise we parse as a grouping. Previously we always just assumed it was
+ a routine, which gave a confusing second error if it was a grouping.
+
+ - Improve error recovery after an unexpected token in `among`. Previously
+ we acted as if the unexpected token closed the `among` (this probably
+ wasn't intended but just a missing `break;` in a switch statement). Now we
+ issue an error and try the next token.
+
+* Report error instead of silently truncating character values (e.g. `hex 123`
+ previously silently became byte 0x23 which is `#` rather than a
+ g-with-cedilla).
+
+* Enlarge the initial input buffer size to 8192 bytes and double each time we
+ hit the end. Snowball programs are typically a few KB in size (with the
+ current largest we ship being the Greek stemmer at 27KB) so the previous
+ approach of starting with a 10 byte input buffer and increasing its size by
+ 50% plus 40 bytes each time it filled was inefficient, needing up to 15
+ reallocations to load greek.sbl.
+
+* Identify variables only used by one `routine`/`external`. This information
+ isn't yet used, but such variables which are also always written to before
+ being read can be emitted as local variables in most target languages.
+
+* We now allow multiple source files on command line, and allow them to be
+ after (or even interspersed) with options to better match modern Unix
+ conventions. Support for multiple source files allows specifying a single
+ byte character set mapping via a source file of `stringdef`.
+
+* Avoid infinite recursion in compiler when optimising a recursive snowball
+ function. Recursive functions aren't typical in snowball programs, but
+ the compiler shouldn't crash for any input, especially not a valid one.
+ We now simply limit on how deep the compiler will recurse and make the
+ pessimistic assumption in the unlikely event we hit this limit.
+
+Build system:
+
+* `make clean` in C libstemmer_c distribution now removes `examples/*.o`.
+ (#59)
+
+* Fix all the places which previously had to have a list of stemmers to work
+ dynamically or be generated, so now only modules.txt needs updating to add
+ a new stemmer.
+
+* Add check_java make target which runs tests for java.
+
+* Support gzipped test data (the uncompressed arabic test data is too big for
+ github).
+
+* GNUmakefile: Drop useless `-eprefix` and `-r` options from snowball
+ invocations for Java - these are only meaningful when generating C code.
+
+* Pass CFLAGS when linking which matches convention (e.g. automake does it) and
+ facilitates use of tools such as ASan. Fixes #84, reported by Thomas
+ Pointhuber.
+
+* Add CI builds with -std=c90 to check compiler and generated code are C90
+ (#54)
+
+libstemmer stuff:
+
+* Split out CPPFLAGS from CFLAGS and use CFLAGS when linking stemwords.
+
+* Add -O2 to CFLAGS.
+
+* Make generated tables of encodings and modules const.
+
+* Fix clang static analyzer memory leak warning (in practice this code path
+ can never actually be taken). Patch from Patrick O. Perry (#56)
+
+documentation
+
+* Added copyright and licensing details (#10).
+
+* Document that libstemmer supports ISO_8859_2 encoding. Currently hungarian
+ and romanian are available in ISO_8859_2.
+
+* Remove documentation falsely claiming that libstemmer supports CP850
+ encoding.
+
+* CONTRIBUTING.rst: Add guidance for contributing new stemming algorithms and
+ new language backends.
+
+* Overhaul libstemmer_python_README. Most notably, replace the benchmark data
+ which was very out of date.