Snowball 2.0.0 (2019-10-02)
===========================

C/C++
-----

* Fully handle 4-byte UTF-8 sequences. Previously `hop` and `next` handled
  sequences of any length, but commands which look at the character value only
  handled sequences up to length 3. Fixes #89.

* Fix handling of a 3-byte UTF-8 sequence in a grouping in `backwardmode`.

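To illustrate why commands that inspect the character value need to handle all four sequence lengths, here is a minimal Python sketch of decoding one UTF-8 sequence. It is not the Snowball runtime code; the names `utf8_sequence_length` and `decode_utf8_at` are hypothetical, and the sketch assumes well-formed input (continuation bytes are not validated).

```python
from typing import Tuple


def utf8_sequence_length(lead: int) -> int:
    """Return the length in bytes of the UTF-8 sequence whose first byte is `lead`."""
    if lead < 0x80:
        return 1  # 0xxxxxxx: ASCII
    if lead >> 5 == 0b110:
        return 2  # 110xxxxx
    if lead >> 4 == 0b1110:
        return 3  # 1110xxxx
    if lead >> 3 == 0b11110:
        return 4  # 11110xxx: the case that was previously not fully handled
    raise ValueError("invalid UTF-8 lead byte: %#x" % lead)


def decode_utf8_at(data: bytes, pos: int) -> Tuple[int, int]:
    """Decode one code point at `pos`; return (code_point, sequence_length)."""
    length = utf8_sequence_length(data[pos])
    if length == 1:
        return data[pos], 1
    # Keep the payload bits of the lead byte: 5, 4 or 3 bits for lengths 2, 3, 4.
    value = data[pos] & (0xFF >> (length + 1))
    for byte in data[pos + 1:pos + length]:
        value = (value << 6) | (byte & 0x3F)  # each continuation byte adds 6 bits
    return value, length
```

A function which only steps over characters needs just `utf8_sequence_length`, whereas one which compares character values needs the full decode, which is where a limit of three bytes would give wrong results for code points above U+FFFF.
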
Java
----

* TestApp.java:

  - Always use UTF-8 for I/O. Patch from David Corbett (#80).

  - Allow reading input from stdin.

  - Remove rather pointless "stem n times" feature.

  - Only lower case ASCII to match stemwords.c.

  - Stem empty lines too to match stemwords.c.

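The "only lower case ASCII" behaviour can be pictured with a minimal Python sketch. This is not the code from TestApp.java or stemwords.c; `ascii_lower` is a hypothetical helper which only shows the idea that characters outside `A`-`Z` pass through unchanged.

```python
def ascii_lower(text: str) -> str:
    """Lower-case only the ASCII letters A-Z, leaving every other character as-is."""
    return "".join(
        chr(ord(ch) + 32) if "A" <= ch <= "Z" else ch  # 'A'..'Z' -> 'a'..'z'
        for ch in text
    )
```

Unlike a locale-aware or Unicode-aware lower-casing function, this leaves a character such as `É` untouched, so the result does not depend on the platform's case-mapping tables.
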
Code Quality Improvements
-------------------------

* Fix various warnings from newer compilers.

* Improve use of `const`.

* Share common functions between compiler backends rather than having multiple
  copies of the same code.

* Assorted code clean-up.

* Initialise line_labelled member of struct generator to 0. Previously we were
  invoking undefined behaviour, though in practice it'll be zero initialised on
  most platforms.

New Code Generators
-------------------

* Add Python generator (#24). Originally written by Yoshiki Shibukawa, with
  additional updates by Dmitry Shachnev.

* Add JavaScript generator. Based on JSX generator (#26) written by Yoshiki
  Shibukawa.

* Add Rust generator from Jakob Demler (#51).

* Add Go generator from Marty Schoch (#57).

* Add C# generator. Based on patch from Cesar Souza (#16, #17).

* Add Pascal generator. Based on Delphi backend from stemming.zip file on old
  website (#75).

New Language Features
---------------------

* Add `len` and `lenof` to measure Unicode length. These are similar to `size`
  and `sizeof` (respectively), but `size` and `sizeof` return the length in
  bytes under `-utf8`, whereas these new commands give the same result whether
  using `-utf8`, `-widechars` or neither (but under `-utf8` they are O(n) in
  the length of the string). For compatibility with existing code which might
  use these as variable or function names, they stop being treated as tokens if
  declared to be a variable or function.

* New `{U+1234}` stringdef notation for Unicode codepoints.

* More versatile integer tests. Now you can compare any two arithmetic
  expressions with a relational operator in parentheses after the `$`, so for
  example `$(len > 3)` can now be used when previously a temporary variable was
  required: `$tmp = len $tmp > 3`

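The difference between a byte length and a Unicode length can be sketched in Python as follows. This is only an analogy for what `size` and `len` measure under `-utf8`, not Snowball code; the function names are hypothetical.

```python
def size_utf8(text: str) -> int:
    """Length in bytes of `text` when stored as UTF-8 (analogous to `size` under `-utf8`)."""
    return len(text.encode("utf-8"))


def len_codepoints(text: str) -> int:
    """Length in Unicode code points (analogous to `len`)."""
    return len(text)


def len_from_utf8(data: bytes) -> int:
    """Count code points in UTF-8 bytes by skipping continuation bytes (10xxxxxx).

    Every byte has to be examined, which is why this is O(n) in the string length.
    """
    return sum(1 for byte in data if byte & 0xC0 != 0x80)
```

For the word `naïve` (with a precomposed `ï`), the byte length is 6 while the code point length is 5, so a test in the spirit of `$(len > 3)` and one based on the byte length can disagree for short non-ASCII words.
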
Code generation improvements
----------------------------

* General:

  + Avoid unnecessarily saving and restoring the cursor for more commands -
    `atlimit`, `do`, `set` and `unset` all leave the cursor alone or always
    restore its value, and for C `booltest` (which other languages already
    handled).

  + Special case handling for `setlimit tomark AE`. All uses of setlimit in
    the current stemmers we ship follow this pattern, and by special-casing we
    can avoid having to save and restore the cursor (#74).

  + Merge duplicate actions in the same `among`. This reduces the size of the
    switch/if-chain in the generated code which dispatches the among for many
    of the stemmers.

  + Generate simpler code for `among`. We always check for a zero return value
    when we call the among, so there's no point also checking for that in the
    switch/if-chain. We can also avoid the switch/if-chain entirely when
    there's only one possible outcome (besides the zero return).

  + Optimise code generated for `do <function call>`. This speeds up "make
    check_python" by about 2%, and should speed up other interpreted languages
    too (#110).

  + Generate more and better comments referencing snowball source.

  + Add homepage URL and compiler version as comments in generated files.

* C/C++:

  + Fix `size` and `sizeof` to not report one too high (reported by Assem
    Chelli in #32).

  + If signal `f` from a function call would lead to return from the current
    function then handle both this and bailing out on an error with a single
    `if (ret <= 0) return ret;`

  + Inline testing for single character literals.

  + Avoid generating `|| 0` in a corner case - this can result in a compiler
    warning when building the generated code.

  + Implement `insert_v()` in terms of `insert_s()`.

  + Add conditional `extern "C"` so `runtime/api.h` can be included from C++
    code. Closes #90, reported by vvarma.

* Java:

  + Fix functions in `among` to work in Java. We seem to need to make the
    methods called from among `public` instead of `private`, and to call them
    on `this` instead of the `methodObject` (which is cleaner anyway). No
    revision in version control seems to generate working code for this case,
    but Richard says it definitely used to work - possibly older JVMs failed to
    correctly enforce the access controls when methods were invoked by
    reflection.

  + Code after handling `f` by returning from the current function is
    unreachable too.

  + Previously we incorrectly decided that code after an `or` was
    unreachable in certain cases. None of the current stemmers in the
    distribution triggered this, but Martin Porter's snowball version
    of the Schinke Latin stemmer does. Fixes #58, reported by Alexander
    Myltsev.

  + The reachability logic was failing to consider reachability from
    the final command in an `or`. Fixes #82, reported by David Corbett.

  + Fix `maxint` and `minint`. Patch from David Corbett in #31.

  + Fix `$` on strings. The previous generated code was just wrong. This
    doesn't affect any of the included algorithms, but for example breaks
    Martin Porter's snowball implementation of Schinke's Latin Stemmer.
    Issue noted by Jakob Demler while working on the Rust backend in #51,
    and reported for Schinke's Latin Stemmer by Alexander Myltsev
    in #58.

  + Make SnowballProgram objects serializable. Patch from Oleg Smirnov in #43.

  + Eliminate range-check implementation for groupings. This was removed from
    the C generator 10 years earlier, isn't used for any of the existing
    algorithms, and it doesn't seem likely it would be - the grouping would
    have to consist entirely of a contiguous block of Unicode code-points.

  + Simplify code generated for `repeat` and `atleast`.

  + Eliminate unused return values and variables from runtime functions.

  + Only import the `among` and `SnowballProgram` classes if they're actually
    used.

  + Only generate `copy_from()` method if it's used.

  + Merge runtime functions `eq_s` and `eq_v`.

  + Java arrays know their own length so stop storing it separately.

  + Escape char 127 (DEL) in generated Java code. It's unlikely that this
    character would actually be used in a real stemmer, so this was more of a
    theoretical bug.

  + Drop unused import of InvocationTargetException from SnowballStemmer.
    Reported by GerritDeMeulder in #72.

  + Fix lint check issues in generated Java code. The stemmer classes are only
    referenced in the example app via reflection, so add
    @SuppressWarnings("unused") for them. The stemmer classes override
    equals() and hashCode() methods from the standard Java Object class, so
    mark these with @Override. Both suggested by GerritDeMeulder in #72.

  + Declare Java variables at point of use in generated code. Putting all
    declarations at the top of the function was adding unnecessary complexity
    to the Java generator code for no benefit.

  + Improve formatting of generated code.

New stemming algorithms
-----------------------

* Add Tamil stemmer from Damodharan Rajalingam (#2, #3).

* Add Arabic stemmer from Assem Chelli (#32, #50).

* Add Irish stemmer from Jim O'Regan (#48).

* Add Nepali stemmer from Arthur Zakirov (#70).

* Add Indonesian stemmer from Olly Betts (#71).

* Add Hindi stemmer from Olly Betts (#73). Thanks to David Corbett for review.

* Add Lithuanian stemmer from Dainius Jocas (#22, #76).

* Add Greek stemmer from Oleg Smirnov (#44).

* Add Catalan and Basque stemmers from Israel Olalla (#104).

Behavioural changes to existing algorithms
------------------------------------------

* Portuguese:

  + Replace incorrect Spanish suffixes by Portuguese suffixes (#1).

* French:

  + The MSDOS CP850 version of the French algorithm was missing changes present
    in the ISO8859-1 and Unicode versions. There's now a single version of
    each algorithm which was based on the Unicode version.

  + Recognize French suffixes even when they begin with diaereses. Patch from
    David Corbett in #78.

* Russian:

  + We now normalise 'ё' to 'е' before stemming. The documentation has long
    said "we assume ['ё'] is mapped into ['е']" but it's more convenient for
    the stemmer to actually perform this normalisation. This change has no
    effect if the caller is already normalising as we recommend. It's a change
    in behaviour if they aren't, but 'ё' occurs rarely (there are currently no
    instances in our test vocabulary) and this improves behaviour when it does
    occur. Patch from Eugene Mirotin (#65, #68).

* Finnish:

  + Adjust the Finnish algorithm not to mangle numbers. This change also
    means it tends to leave foreign words alone. Fixes #66.

* Danish:

  + Adjust Danish algorithm not to mangle alphanumeric codes. In particular
    alphanumeric codes ending in a double digit (e.g. 0x0e00, hal9000,
    space1999) are no longer mangled. See #81.

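The Russian normalisation described above amounts to a single character mapping. The following Python sketch shows what a caller might previously have done before invoking the stemmer. It is not the stemmer's own code, `normalise_yo` is a hypothetical name, and it assumes lower-case input, so the upper-case 'Ё' is not handled.

```python
def normalise_yo(word: str) -> str:
    """Map Cyrillic 'ё' (U+0451) to 'е' (U+0435), leaving all other characters alone."""
    return word.replace("\u0451", "\u0435")
```

Applying the mapping twice gives the same result as applying it once, which is why callers who already normalise see no change in behaviour now that the stemmer does it too.
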
Optimisations to existing algorithms
------------------------------------

* Turkish:

  + Simplify uses of `test` in stemmer code.

  + Check for 'ad' or 'soyad' more efficiently, and without needing the
    strlen variable. This speeds up "make check_utf8_turkish" by 11%
    on x86 Linux.

* Kraaij-Pohlmann:

  + Eliminate variable x - `$p1 <= cursor` is simpler and a little more
    efficient than `setmark x $x >= p1`.

Code clarity improvements to existing algorithms
------------------------------------------------

* Turkish:

  + Use `,` for cedilla to match the conventions used in other stemmers.

* Kraaij-Pohlmann:

  + Avoid cryptic `[among ( (])` ... `)` construct - instead use the same
    `[substring] among (` ... `)` construct we do in other stemmers.

Compiler
--------

* Support conventional --help and --version options.

* Warn if -r or -ep used with backend other than C/C++.

* Warn if encoding command line options are specified when generating code in a
  language with a fixed encoding.

* The default classname is now set based on the output filename, so `-n` is now
  often no longer needed. Fixes #64.

* Avoid potential one byte buffer over-read when parsing snowball code.

* Avoid comparing with uninitialised array element during compilation.

* Improve `-syntax` output for `setlimit L for C`.

* Optimise away double negation so generators don't have to worry about
  generating `--` (decrement operator in many languages). Fixes #52, reported
  by David Corbett.

* Improved compiler error and warning messages:

  - We now report FILE:LINE: before each diagnostic message.

  - Improve warnings for unused declarations/definitions.

  - Warn for variables which are used, but either never initialised
    or never read.

  - Flag non-ASCII literal strings. This is an error for wide Unicode, but
    only a warning for single-byte and UTF-8 which work so long as the source
    encoding matches the encoding used in the generated stemmer code.

  - Improve error recovery after an undeclared `define`. We now sniff the
    token after the identifier and if it is `as` we parse as a routine,
    otherwise we parse as a grouping. Previously we always just assumed it was
    a routine, which gave a confusing second error if it was a grouping.

  - Improve error recovery after an unexpected token in `among`. Previously
    we acted as if the unexpected token closed the `among` (this probably
    wasn't intended but just a missing `break;` in a switch statement). Now we
    issue an error and try the next token.

* Report error instead of silently truncating character values (e.g. `hex 123`
  previously silently became byte 0x23 which is `#` rather than a
  g-with-cedilla).

* Enlarge the initial input buffer size to 8192 bytes and double each time we
  hit the end. Snowball programs are typically a few KB in size (with the
  current largest we ship being the Greek stemmer at 27KB) so the previous
  approach of starting with a 10 byte input buffer and increasing its size by
  50% plus 40 bytes each time it filled was inefficient, needing up to 15
  reallocations to load greek.sbl.

* Identify variables only used by one `routine`/`external`. This information
  isn't yet used, but such variables which are also always written to before
  being read can be emitted as local variables in most target languages.

* We now allow multiple source files on command line, and allow them to be
  after (or even interspersed with) options to better match modern Unix
  conventions. Support for multiple source files allows specifying a single
  byte character set mapping via a source file of `stringdef`.

* Avoid infinite recursion in compiler when optimising a recursive snowball
  function. Recursive functions aren't typical in snowball programs, but
  the compiler shouldn't crash for any input, especially not a valid one.
  We now simply limit how deep the compiler will recurse and make the
  pessimistic assumption in the unlikely event we hit this limit.

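The reallocation counts quoted for the input buffer can be checked with a short simulation. This Python sketch only models the two growth policies described above and is not the compiler's code. It assumes that "50% plus 40 bytes" rounds down with integer division and that greek.sbl is exactly 27 * 1024 bytes; both are assumptions made for illustration.

```python
def count_reallocations(initial_size, grow, needed):
    """Count how many times a buffer must grow before it can hold `needed` bytes."""
    size, count = initial_size, 0
    while size < needed:
        size = grow(size)
        count += 1
    return count


FILE_SIZE = 27 * 1024  # assumed size of greek.sbl ("27KB" in the text above)

# Old policy: start at 10 bytes, grow by 50% plus 40 bytes each time.
old_policy = count_reallocations(10, lambda size: size + size // 2 + 40, FILE_SIZE)

# New policy: start at 8192 bytes, double each time.
new_policy = count_reallocations(8192, lambda size: size * 2, FILE_SIZE)

print(old_policy, new_policy)
```

Under these assumptions the old policy needs 15 reallocations, matching the figure above, while the new one needs 2. A typical source file of a few KB fits in the initial 8192 bytes with no reallocation at all.
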
Build system
------------

* `make clean` in C libstemmer_c distribution now removes `examples/*.o`.
  (#59)

* Fix all the places which previously had to have a list of stemmers to work
  dynamically or be generated, so now only modules.txt needs updating to add
  a new stemmer.

* Add check_java make target which runs tests for Java.

* Support gzipped test data (the uncompressed Arabic test data is too big for
  GitHub).

* GNUmakefile: Drop useless `-eprefix` and `-r` options from snowball
  invocations for Java - these are only meaningful when generating C code.

* Pass CFLAGS when linking which matches convention (e.g. automake does it) and
  facilitates use of tools such as ASan. Fixes #84, reported by Thomas
  Pointhuber.

* Add CI builds with -std=c90 to check compiler and generated code are C90
  (#54)

libstemmer stuff
----------------

* Split out CPPFLAGS from CFLAGS and use CFLAGS when linking stemwords.

* Add -O2 to CFLAGS.

* Make generated tables of encodings and modules const.

* Fix clang static analyzer memory leak warning (in practice this code path
  can never actually be taken). Patch from Patrick O. Perry (#56)

Documentation
-------------

* Added copyright and licensing details (#10).

* Document that libstemmer supports ISO_8859_2 encoding. Currently hungarian
  and romanian are available in ISO_8859_2.

* Remove documentation falsely claiming that libstemmer supports CP850
  encoding.

* CONTRIBUTING.rst: Add guidance for contributing new stemming algorithms and
  new language backends.

* Overhaul libstemmer_python_README. Most notably, replace the benchmark data
  which was very out of date.