123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282283284285286287288289290291292293294295296297298299300301302303304305306307308309310311312313314315316317318319320321322323324325326327328329330331332333334335336337338339340341342343344345346347348349350351352353354355356357358359360361362363364365366367368369370371372373374375376377378379380381382383384385386387388389390391392393394395396397398399400401402403404405406407 |
- Snowball 2.0.0 (2019-10-02)
- ===========================
-
- C/C++
- -----
-
- * Fully handle 4-byte UTF-8 sequences. Previously `hop` and `next` handled
- sequences of any length, but commands which look at the character value only
- handled sequences up to length 3. Fixes #89.
-
- * Fix handling of a 3-byte UTF-8 sequence in a grouping in `backwardmode`.
-
- Java
- ----
-
- * TestApp.java:
-
- - Always use UTF-8 for I/O. Patch from David Corbett (#80).
-
- - Allow reading input from stdin.
-
- - Remove rather pointless "stem n times" feature.
-
- - Only lower case ASCII to match stemwords.c.
-
- - Stem empty lines too to match stemwords.c.
-
- Code Quality Improvements
- -------------------------
-
- * Fix various warnings from newer compilers.
-
- * Improve use of `const`.
-
- * Share common functions between compiler backends rather than having multiple
- copies of the same code.
-
- * Assorted code clean-up.
-
- * Initialise line_labelled member of struct generator to 0. Previously we were
- invoking undefined behaviour, though in practice it'll be zero initialised on
- most platforms.
-
- New Code Generators
- -------------------
-
- * Add Python generator (#24). Originally written by Yoshiki Shibukawa, with
- additional updates by Dmitry Shachnev.
-
- * Add Javascript generator. Based on JSX generator (#26) written by Yoshiki
- Shibukawa.
-
- * Add Rust generator from Jakob Demler (#51).
-
- * Add Go generator from Marty Schoch (#57).
-
- * Add C# generator. Based on patch from Cesar Souza (#16, #17).
-
- * Add Pascal generator. Based on Delphi backend from stemming.zip file on old
- website (#75).
-
- New Language Features
- ---------------------
-
- * Add `len` and `lenof` to measure Unicode length. These are similar to `size`
- and `sizeof` (respectively), but `size` and `sizeof` return the length in
- bytes under `-utf8`, whereas these new commands give the same result whether
- using `-utf8`, `-widechars` or neither (but under `-utf8` they are O(n) in
- the length of the string). For compatibility with existing code which might
- use these as variable or function names, they stop being treated as tokens if
- declared to be a variable or function.
-
- * New `{U+1234}` stringdef notation for Unicode codepoints.
-
- * More versatile integer tests. Now you can compare any two arithmetic
- expressions with a relational operator in parentheses after the `$`, so for
- example `$(len > 3)` can now be used when previously a temporary variable was
- required: `$tmp = len $tmp > 3`
-
- Code generation improvements
- ----------------------------
-
- * General:
-
- + Avoid unnecessarily saving and restoring of the cursor for more commands -
- `atlimit`, `do`, `set` and `unset` all leave the cursor alone or always
- restore its value, and for C `booltest` (which other languages already
- handled).
-
- + Special case handling for `setlimit tomark AE`. All uses of setlimit in
- the current stemmers we ship follow this pattern, and by special-casing we
- can avoid having to save and restore the cursor (#74).
-
- + Merge duplicate actions in the same `among`. This reduces the size of the
- switch/if-chain in the generated code which dispatch the among for many of
- the stemmers.
-
- + Generate simpler code for `among`. We always check for a zero return value
- when we call the among, so there's no point also checking for that in the
- switch/if-chain. We can also avoid the switch/if-chain entirely when
- there's only one possible outcome (besides the zero return).
-
- + Optimise code generated for `do <function call>`. This speeds up "make
- check_python" by about 2%, and should speed up other interpreted languages
- too (#110).
-
- + Generate more and better comments referencing snowball source.
-
- + Add homepage URL and compiler version as comments in generated files.
-
- * C/C++:
-
- + Fix `size` and `sizeof` to not report one too high (reported by Assem
- Chelli in #32).
-
- + If signal `f` from a function call would lead to return from the current
- function then handle this and bailing out on an error together with a
- simple `if (ret <= 0) return ret;`
-
- + Inline testing for a single character literals.
-
- + Avoiding generating `|| 0` in corner case - this can result in a compiler
- warning when building the generated code.
-
- + Implement `insert_v()` in terms of `insert_s()`.
-
- + Add conditional `extern "C"` so `runtime/api.h` can be included from C++
- code. Closes #90, reported by vvarma.
-
- * Java:
-
- + Fix functions in `among` to work in Java. We seem to need to make the
- methods called from among `public` instead of `private`, and to call them
- on `this` instead of the `methodObject` (which is cleaner anyway). No
- revision in version control seems to generate working code for this case,
- but Richard says it definitely used to work - possibly older JVMs failed to
- correctly enforce the access controls when methods were invoked by
- reflection.
-
- + Code after handling `f` by returning from the current function is
- unreachable too.
-
- + Previously we incorrectly decided that code after an `or` was
- unreachable in certain cases. None of the current stemmers in the
- distribution triggered this, but Martin Porter's snowball version
- of the Schinke Latin stemmer does. Fixes #58, reported by Alexander
- Myltsev.
-
- + The reachability logic was failing to consider reachability from
- the final command in an `or`. Fixes #82, reported by David Corbett.
-
- + Fix `maxint` and `minint`. Patch from David Corbett in #31.
-
- + Fix `$` on strings. The previous generated code was just wrong. This
- doesn't affect any of the included algorithms, but for example breaks
- Martin Porter's snowball implementation of Schinke's Latin Stemmer.
- Issue noted by Jakob Demler while working on the Rust backend in #51,
- and reported in the Schinke's Latin Stemmer by Alexander Myltsev
- in #58.
-
- + Make SnowballProgram objects serializable. Patch from Oleg Smirnov in #43.
-
- + Eliminate range-check implementation for groupings. This was removed from
- the C generator 10 years earlier, isn't used for any of the existing
- algorithms, and it doesn't seem likely it would be - the grouping would
- have to consist entirely of a contiguous block of Unicode code-points.
-
- + Simplify code generated for `repeat` and `atleast`.
-
- + Eliminate unused return values and variables from runtime functions.
-
- + Only import the `among` and `SnowballProgram` classes if they're actually
- used.
-
- + Only generate `copy_from()` method if it's used.
-
- + Merge runtime functions `eq_s` and `eq_v` functions.
-
- + Java arrays know their own length so stop storing it separately.
-
- + Escape char 127 (DEL) in generated Java code. It's unlikely that this
- character would actually be used in a real stemmer, so this was more of a
- theoretical bug.
-
- + Drop unused import of InvocationTargetException from SnowballStemmer.
- Reported by GerritDeMeulder in #72.
-
- + Fix lint check issues in generated Java code. The stemmer classes are only
- referenced in the example app via reflection, so add
- @SuppressWarnings("unused") for them. The stemmer classes override
- equals() and hashCode() methods from the standard java Object class, so
- mark these with @Override. Both suggested by GerritDeMeulder in #72.
-
- + Declare Java variables at point of use in generated code. Putting all
- declarations at the top of the function was adding unnecessary complexity
- to the Java generator code for no benefit.
-
- + Improve formatting of generated code.
-
- New stemming algorithms
- -----------------------
-
- * Add Tamil stemmer from Damodharan Rajalingam (#2, #3).
-
- * Add Arabic stemmer from Assem Chelli (#32, #50).
-
- * Add Irish stemmer Jim O'Regan (#48).
-
- * Add Nepali stemmer from Arthur Zakirov (#70).
-
- * Add Indonesian stemmer from Olly Betts (#71).
-
- * Add Hindi stemmer from Olly Betts (#73). Thanks to David Corbett for review.
-
- * Add Lithuanian stemmer from Dainius Jocas (#22, #76).
-
- * Add Greek stemmer from Oleg Smirnov (#44).
-
- * Add Catalan and Basque stemmers from Israel Olalla (#104).
-
- Behavioural changes to existing algorithms
- ------------------------------------------
-
- * Portuguese:
-
- + Replace incorrect Spanish suffixes by Portuguese suffixes (#1).
-
- * French:
-
- + The MSDOS CP850 version of the French algorithm was missing changes present
- in the ISO8859-1 and Unicode versions. There's now a single version of
- each algorithm which was based on the Unicode version.
-
- + Recognize French suffixes even when they begin with diaereses. Patch from
- David Corbett in #78.
-
- * Russian:
-
- + We now normalise 'ё' to 'е' before stemming. The documentation has long
- said "we assume ['ё'] is mapped into ['е']" but it's more convenient for
- the stemmer to actually perform this normalisation. This change has no
- effect if the caller is already normalising as we recommend. It's a change
- in behaviour they aren't, but 'ё' occurs rarely (there are currently no
- instances in our test vocabulary) and this improves behaviour when it does
- occur. Patch from Eugene Mirotin (#65, #68).
-
- * Finish:
-
- + Adjust the Finnish algorithm not to mangle numbers. This change also
- means it tends to leave foreign words alone. Fixes #66.
-
- * Danish:
-
- + Adjust Danish algorithm not to mangle alphanumeric codes. In particular
- alphanumeric codes ending in a double digit (e.g. 0x0e00, hal9000,
- space1999) are no longer mangled. See #81.
-
- Optimisations to existing algorithms
- ------------------------------------
-
- * Turkish:
-
- + Simplify uses of `test` in stemmer code.
-
- + Check for 'ad' or 'soyad' more efficiently, and without needing the
- strlen variable. This speeds up "make check_utf8_turkish" by 11%
- on x86 Linux.
-
- * Kraaij-Pohlmann:
-
- + Eliminate variable x `$p1 <= cursor` is simpler and a little more efficient
- than `setmark x $x >= p1`.
-
- Code clarity improvements to existing algorithms
- ------------------------------------------------
-
- * Turkish:
-
- + Use , for cedilla to match the conventions used in other stemmers.
-
- * Kraaij-Pohlmann:
-
- + Avoid cryptic `[among ( (])` ... `)` construct - instead use the same
- `[substring] among (` ... `)` construct we do in other stemmers.
-
- Compiler
- --------
-
- * Support conventional --help and --version options.
-
- * Warn if -r or -ep used with backend other than C/C++.
-
- * Warn if encoding command line options are specified when generating code in a
- language with a fixed encoding.
-
- * The default classname is now set based on the output filename, so `-n` is now
- often no longer needed. Fixes #64.
-
- * Avoid potential one byte buffer over-read when parsing snowball code.
-
- * Avoid comparing with uninitialised array element during compilation.
-
- * Improve `-syntax` output for `setlimit L for C`.
-
- * Optimise away double negation so generators don't have to worry about
- generating `--` (decrement operator in many languages). Fixes #52, reported
- by David Corbett.
-
- * Improved compiler error and warning messages:
-
- - We now report FILE:LINE: before each diagnostic message.
-
- - Improve warnings for unused declarations/definitions.
-
- - Warn for variables which are used, but either never initialised
- or never read.
-
- - Flag non-ASCII literal strings. This is an error for wide Unicode, but
- only a warning for single-byte and UTF-8 which work so long as the source
- encoding matches the encoding used in the generated stemmer code.
-
- - Improve error recovery after an undeclared `define`. We now sniff the
- token after the identifier and if it is `as` we parse as a routine,
- otherwise we parse as a grouping. Previously we always just assumed it was
- a routine, which gave a confusing second error if it was a grouping.
-
- - Improve error recovery after an unexpected token in `among`. Previously
- we acted as if the unexpected token closed the `among` (this probably
- wasn't intended but just a missing `break;` in a switch statement). Now we
- issue an error and try the next token.
-
- * Report error instead of silently truncating character values (e.g. `hex 123`
- previously silently became byte 0x23 which is `#` rather than a
- g-with-cedilla).
-
- * Enlarge the initial input buffer size to 8192 bytes and double each time we
- hit the end. Snowball programs are typically a few KB in size (with the
- current largest we ship being the Greek stemmer at 27KB) so the previous
- approach of starting with a 10 byte input buffer and increasing its size by
- 50% plus 40 bytes each time it filled was inefficient, needing up to 15
- reallocations to load greek.sbl.
-
- * Identify variables only used by one `routine`/`external`. This information
- isn't yet used, but such variables which are also always written to before
- being read can be emitted as local variables in most target languages.
-
- * We now allow multiple source files on command line, and allow them to be
- after (or even interspersed) with options to better match modern Unix
- conventions. Support for multiple source files allows specifying a single
- byte character set mapping via a source file of `stringdef`.
-
- * Avoid infinite recursion in compiler when optimising a recursive snowball
- function. Recursive functions aren't typical in snowball programs, but
- the compiler shouldn't crash for any input, especially not a valid one.
- We now simply limit on how deep the compiler will recurse and make the
- pessimistic assumption in the unlikely event we hit this limit.
-
- Build system:
-
- * `make clean` in C libstemmer_c distribution now removes `examples/*.o`.
- (#59)
-
- * Fix all the places which previously had to have a list of stemmers to work
- dynamically or be generated, so now only modules.txt needs updating to add
- a new stemmer.
-
- * Add check_java make target which runs tests for java.
-
- * Support gzipped test data (the uncompressed arabic test data is too big for
- github).
-
- * GNUmakefile: Drop useless `-eprefix` and `-r` options from snowball
- invocations for Java - these are only meaningful when generating C code.
-
- * Pass CFLAGS when linking which matches convention (e.g. automake does it) and
- facilitates use of tools such as ASan. Fixes #84, reported by Thomas
- Pointhuber.
-
- * Add CI builds with -std=c90 to check compiler and generated code are C90
- (#54)
-
- libstemmer stuff:
-
- * Split out CPPFLAGS from CFLAGS and use CFLAGS when linking stemwords.
-
- * Add -O2 to CFLAGS.
-
- * Make generated tables of encodings and modules const.
-
- * Fix clang static analyzer memory leak warning (in practice this code path
- can never actually be taken). Patch from Patrick O. Perry (#56)
-
- documentation
-
- * Added copyright and licensing details (#10).
-
- * Document that libstemmer supports ISO_8859_2 encoding. Currently hungarian
- and romanian are available in ISO_8859_2.
-
- * Remove documentation falsely claiming that libstemmer supports CP850
- encoding.
-
- * CONTRIBUTING.rst: Add guidance for contributing new stemming algorithms and
- new language backends.
-
- * Overhaul libstemmer_python_README. Most notably, replace the benchmark data
- which was very out of date.
|