  1. Snowball 2.0.0 (2019-10-02)
  2. ===========================
  3. C/C++
  4. -----
  5. * Fully handle 4-byte UTF-8 sequences. Previously `hop` and `next` handled
  6. sequences of any length, but commands which look at the character value only
  7. handled sequences up to length 3. Fixes #89.
  8. * Fix handling of a 3-byte UTF-8 sequence in a grouping in `backwardmode`.
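To illustrate what handling 4-byte sequences involves when looking at the character value, here is a minimal Python sketch of decoding one UTF-8 sequence of length 1 to 4. The helper name `decode_utf8_at` is hypothetical and is not part of the Snowball runtime; it assumes well-formed input and does no validation.

```python
def decode_utf8_at(data, pos):
    """Return (code point, sequence length) for the UTF-8 sequence at pos.

    Hypothetical helper for illustration only; assumes well-formed UTF-8.
    """
    b0 = data[pos]
    if b0 < 0x80:
        # 1 byte: 0xxxxxxx
        return b0, 1
    if b0 < 0xE0:
        # 2 bytes: 110xxxxx 10xxxxxx
        return ((b0 & 0x1F) << 6) | (data[pos + 1] & 0x3F), 2
    if b0 < 0xF0:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return (((b0 & 0x0F) << 12)
                | ((data[pos + 1] & 0x3F) << 6)
                | (data[pos + 2] & 0x3F)), 3
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return (((b0 & 0x07) << 18)
            | ((data[pos + 1] & 0x3F) << 12)
            | ((data[pos + 2] & 0x3F) << 6)
            | (data[pos + 3] & 0x3F)), 4
```

A decoder that stops at the 3-byte case would misread any code point above U+FFFF, which is the kind of mismatch the first entry above describes.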
  9. Java
  10. ----
  11. * TestApp.java:
  12. - Always use UTF-8 for I/O. Patch from David Corbett (#80).
  13. - Allow reading input from stdin.
  14. - Remove rather pointless "stem n times" feature.
  15. - Only lower case ASCII to match stemwords.c.
  16. - Stem empty lines too to match stemwords.c.
  17. Code Quality Improvements
  18. -------------------------
  19. * Fix various warnings from newer compilers.
  20. * Improve use of `const`.
  21. * Share common functions between compiler backends rather than having multiple
  22. copies of the same code.
  23. * Assorted code clean-up.
  24. * Initialise line_labelled member of struct generator to 0. Previously we were
  25. invoking undefined behaviour, though in practice it'll be zero initialised on
  26. most platforms.
  27. New Code Generators
  28. -------------------
  29. * Add Python generator (#24). Originally written by Yoshiki Shibukawa, with
  30. additional updates by Dmitry Shachnev.
  31. * Add Javascript generator. Based on JSX generator (#26) written by Yoshiki
  32. Shibukawa.
  33. * Add Rust generator from Jakob Demler (#51).
  34. * Add Go generator from Marty Schoch (#57).
  35. * Add C# generator. Based on patch from Cesar Souza (#16, #17).
  36. * Add Pascal generator. Based on Delphi backend from stemming.zip file on old
  37. website (#75).
  38. New Language Features
  39. ---------------------
  40. * Add `len` and `lenof` to measure Unicode length. These are similar to `size`
  41. and `sizeof` (respectively), but `size` and `sizeof` return the length in
  42. bytes under `-utf8`, whereas these new commands give the same result whether
  43. using `-utf8`, `-widechars` or neither (but under `-utf8` they are O(n) in
  44. the length of the string). For compatibility with existing code which might
  45. use these as variable or function names, they stop being treated as tokens if
  46. declared to be a variable or function.
  47. * New `{U+1234}` stringdef notation for Unicode codepoints.
  48. * More versatile integer tests. Now you can compare any two arithmetic
  49. expressions with a relational operator in parentheses after the `$`, so for
  50. example `$(len > 3)` can now be used when previously a temporary variable was
  51. required: `$tmp = len $tmp > 3`
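The difference between `size` and `len` under `-utf8` can be pictured with a rough Python analogue. This is not Snowball code or generated output; the function names are hypothetical and only mirror the byte-count versus code-point-count distinction described above.

```python
def size_in_bytes(word):
    # Analogue of `size` under -utf8: counts bytes of the UTF-8 encoding.
    return len(word.encode("utf-8"))


def length_in_codepoints(word):
    # Analogue of `len`: counts Unicode code points, whatever the encoding.
    return len(word)


# "na\u00efve" contains one non-ASCII character (U+00EF), which takes
# two bytes in UTF-8.
word = "na\u00efve"
byte_count = size_in_bytes(word)
codepoint_count = length_in_codepoints(word)

# Rough analogue of the integer test `$(len > 3)`.
longer_than_three = codepoint_count > 3
```

For this word the byte count is 6 while the code point count is 5, so a test written with `size` could behave differently from one written with `len` on non-ASCII input.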
  52. Code generation improvements
  53. ----------------------------
  54. * General:
  55. + Avoid unnecessary saving and restoring of the cursor for more commands -
  56. `atlimit`, `do`, `set` and `unset` all leave the cursor alone or always
  57. restore its value, and for C `booltest` (which other languages already
  58. handled).
  59. + Special case handling for `setlimit tomark AE`. All uses of setlimit in
  60. the current stemmers we ship follow this pattern, and by special-casing we
  61. can avoid having to save and restore the cursor (#74).
  62. + Merge duplicate actions in the same `among`. This reduces the size of the
  63. switch/if-chain in the generated code which dispatches the among for many of
  64. the stemmers.
  65. + Generate simpler code for `among`. We always check for a zero return value
  66. when we call the among, so there's no point also checking for that in the
  67. switch/if-chain. We can also avoid the switch/if-chain entirely when
  68. there's only one possible outcome (besides the zero return).
  69. + Optimise code generated for `do <function call>`. This speeds up "make
  70. check_python" by about 2%, and should speed up other interpreted languages
  71. too (#110).
  72. + Generate more and better comments referencing snowball source.
  73. + Add homepage URL and compiler version as comments in generated files.
  74. * C/C++:
  75. + Fix `size` and `sizeof` to not report one too high (reported by Assem
  76. Chelli in #32).
  77. + If signal `f` from a function call would lead to return from the current
  78. function then handle this and bailing out on an error together with a
  79. simple `if (ret <= 0) return ret;`
  80. + Inline testing for single character literals.
  81. + Avoid generating `|| 0` in corner case - this can result in a compiler
  82. warning when building the generated code.
  83. + Implement `insert_v()` in terms of `insert_s()`.
  84. + Add conditional `extern "C"` so `runtime/api.h` can be included from C++
  85. code. Closes #90, reported by vvarma.
  86. * Java:
  87. + Fix functions in `among` to work in Java. We seem to need to make the
  88. methods called from among `public` instead of `private`, and to call them
  89. on `this` instead of the `methodObject` (which is cleaner anyway). No
  90. revision in version control seems to generate working code for this case,
  91. but Richard says it definitely used to work - possibly older JVMs failed to
  92. correctly enforce the access controls when methods were invoked by
  93. reflection.
  94. + Code after handling `f` by returning from the current function is
  95. unreachable too.
  96. + Previously we incorrectly decided that code after an `or` was
  97. unreachable in certain cases. None of the current stemmers in the
  98. distribution triggered this, but Martin Porter's snowball version
  99. of the Schinke Latin stemmer does. Fixes #58, reported by Alexander
  100. Myltsev.
  101. + The reachability logic was failing to consider reachability from
  102. the final command in an `or`. Fixes #82, reported by David Corbett.
  103. + Fix `maxint` and `minint`. Patch from David Corbett in #31.
  104. + Fix `$` on strings. The previous generated code was just wrong. This
  105. doesn't affect any of the included algorithms, but for example breaks
  106. Martin Porter's snowball implementation of Schinke's Latin Stemmer.
  107. Issue noted by Jakob Demler while working on the Rust backend in #51,
  108. and reported for Schinke's Latin Stemmer by Alexander Myltsev
  109. in #58.
  110. + Make SnowballProgram objects serializable. Patch from Oleg Smirnov in #43.
  111. + Eliminate range-check implementation for groupings. This was removed from
  112. the C generator 10 years earlier, isn't used for any of the existing
  113. algorithms, and it doesn't seem likely it would be - the grouping would
  114. have to consist entirely of a contiguous block of Unicode code-points.
  115. + Simplify code generated for `repeat` and `atleast`.
  116. + Eliminate unused return values and variables from runtime functions.
  117. + Only import the `among` and `SnowballProgram` classes if they're actually
  118. used.
  119. + Only generate `copy_from()` method if it's used.
  120. + Merge runtime functions `eq_s` and `eq_v`.
  121. + Java arrays know their own length so stop storing it separately.
  122. + Escape char 127 (DEL) in generated Java code. It's unlikely that this
  123. character would actually be used in a real stemmer, so this was more of a
  124. theoretical bug.
  125. + Drop unused import of InvocationTargetException from SnowballStemmer.
  126. Reported by GerritDeMeulder in #72.
  127. + Fix lint check issues in generated Java code. The stemmer classes are only
  128. referenced in the example app via reflection, so add
  129. @SuppressWarnings("unused") for them. The stemmer classes override
  130. equals() and hashCode() methods from the standard Java Object class, so
  131. mark these with @Override. Both suggested by GerritDeMeulder in #72.
  132. + Declare Java variables at point of use in generated code. Putting all
  133. declarations at the top of the function was adding unnecessary complexity
  134. to the Java generator code for no benefit.
  135. + Improve formatting of generated code.
  136. New stemming algorithms
  137. -----------------------
  138. * Add Tamil stemmer from Damodharan Rajalingam (#2, #3).
  139. * Add Arabic stemmer from Assem Chelli (#32, #50).
  140. * Add Irish stemmer from Jim O'Regan (#48).
  141. * Add Nepali stemmer from Arthur Zakirov (#70).
  142. * Add Indonesian stemmer from Olly Betts (#71).
  143. * Add Hindi stemmer from Olly Betts (#73). Thanks to David Corbett for review.
  144. * Add Lithuanian stemmer from Dainius Jocas (#22, #76).
  145. * Add Greek stemmer from Oleg Smirnov (#44).
  146. * Add Catalan and Basque stemmers from Israel Olalla (#104).
  147. Behavioural changes to existing algorithms
  148. ------------------------------------------
  149. * Portuguese:
  150. + Replace incorrect Spanish suffixes by Portuguese suffixes (#1).
  151. * French:
  152. + The MSDOS CP850 version of the French algorithm was missing changes present
  153. in the ISO8859-1 and Unicode versions. There's now a single version of
  154. each algorithm which was based on the Unicode version.
  155. + Recognize French suffixes even when they begin with diaereses. Patch from
  156. David Corbett in #78.
  157. * Russian:
  158. + We now normalise 'ё' to 'е' before stemming. The documentation has long
  159. said "we assume ['ё'] is mapped into ['е']" but it's more convenient for
  160. the stemmer to actually perform this normalisation. This change has no
  161. effect if the caller is already normalising as we recommend. It's a change
  162. in behaviour if they aren't, but 'ё' occurs rarely (there are currently no
  163. instances in our test vocabulary) and this improves behaviour when it does
  164. occur. Patch from Eugene Mirotin (#65, #68).
  165. * Finnish:
  166. + Adjust the Finnish algorithm not to mangle numbers. This change also
  167. means it tends to leave foreign words alone. Fixes #66.
  168. * Danish:
  169. + Adjust Danish algorithm not to mangle alphanumeric codes. In particular
  170. alphanumeric codes ending in a double digit (e.g. 0x0e00, hal9000,
  171. space1999) are no longer mangled. See #81.
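For callers on older versions, the Russian normalisation described above can be applied before stemming. The sketch below is a caller-side illustration in Python, not the stemmer's own code; `normalise_yo` is a hypothetical name, and it assumes the input has already been lower-cased.

```python
def normalise_yo(word):
    # Map 'ё' (U+0451) to 'е' (U+0435); other characters are left alone.
    # Assumes lower-case input, so 'Ё' (U+0401) is not handled here.
    return word.replace("\u0451", "\u0435")


# "ёлка" becomes "елка"; a word without 'ё' is returned unchanged.
normalised = normalise_yo("\u0451\u043b\u043a\u0430")
```

Applying this before calling a 2.0.0 stemmer is harmless, since the stemmer's own normalisation then has nothing left to do.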
  172. Optimisations to existing algorithms
  173. ------------------------------------
  174. * Turkish:
  175. + Simplify uses of `test` in stemmer code.
  176. + Check for 'ad' or 'soyad' more efficiently, and without needing the
  177. strlen variable. This speeds up "make check_utf8_turkish" by 11%
  178. on x86 Linux.
  179. * Kraaij-Pohlmann:
  180. + Eliminate variable x. `$p1 <= cursor` is simpler and a little more efficient
  181. than `setmark x $x >= p1`.
  182. Code clarity improvements to existing algorithms
  183. ------------------------------------------------
  184. * Turkish:
  185. + Use , for cedilla to match the conventions used in other stemmers.
  186. * Kraaij-Pohlmann:
  187. + Avoid cryptic `[among ( (])` ... `)` construct - instead use the same
  188. `[substring] among (` ... `)` construct we do in other stemmers.
  189. Compiler
  190. --------
  191. * Support conventional --help and --version options.
  192. * Warn if -r or -ep used with backend other than C/C++.
  193. * Warn if encoding command line options are specified when generating code in a
  194. language with a fixed encoding.
  195. * The default classname is now set based on the output filename, so `-n` is now
  196. often no longer needed. Fixes #64.
  197. * Avoid potential one byte buffer over-read when parsing snowball code.
  198. * Avoid comparing with uninitialised array element during compilation.
  199. * Improve `-syntax` output for `setlimit L for C`.
  200. * Optimise away double negation so generators don't have to worry about
  201. generating `--` (decrement operator in many languages). Fixes #52, reported
  202. by David Corbett.
  203. * Improved compiler error and warning messages:
  204. - We now report FILE:LINE: before each diagnostic message.
  205. - Improve warnings for unused declarations/definitions.
  206. - Warn for variables which are used, but either never initialised
  207. or never read.
  208. - Flag non-ASCII literal strings. This is an error for wide Unicode, but
  209. only a warning for single-byte and UTF-8 which work so long as the source
  210. encoding matches the encoding used in the generated stemmer code.
  211. - Improve error recovery after an undeclared `define`. We now sniff the
  212. token after the identifier and if it is `as` we parse as a routine,
  213. otherwise we parse as a grouping. Previously we always just assumed it was
  214. a routine, which gave a confusing second error if it was a grouping.
  215. - Improve error recovery after an unexpected token in `among`. Previously
  216. we acted as if the unexpected token closed the `among` (this probably
  217. wasn't intended but just a missing `break;` in a switch statement). Now we
  218. issue an error and try the next token.
  219. * Report error instead of silently truncating character values (e.g. `hex 123`
  220. previously silently became byte 0x23 which is `#` rather than a
  221. g-with-cedilla).
  222. * Enlarge the initial input buffer size to 8192 bytes and double each time we
  223. hit the end. Snowball programs are typically a few KB in size (with the
  224. current largest we ship being the Greek stemmer at 27KB) so the previous
  225. approach of starting with a 10 byte input buffer and increasing its size by
  226. 50% plus 40 bytes each time it filled was inefficient, needing up to 15
  227. reallocations to load greek.sbl.
  228. * Identify variables only used by one `routine`/`external`. This information
  229. isn't yet used, but such variables which are also always written to before
  230. being read can be emitted as local variables in most target languages.
  231. * We now allow multiple source files on command line, and allow them to be
  232. after (or even interspersed with) options to better match modern Unix
  233. conventions. Support for multiple source files allows specifying a single
  234. byte character set mapping via a source file of `stringdef`.
  235. * Avoid infinite recursion in compiler when optimising a recursive snowball
  236. function. Recursive functions aren't typical in snowball programs, but
  237. the compiler shouldn't crash for any input, especially not a valid one.
  238. We now simply limit how deep the compiler will recurse and make the
  239. pessimistic assumption in the unlikely event we hit this limit.
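The reallocation counts behind the input buffer change above can be checked with a small Python simulation. This is a sketch under stated assumptions, not the compiler's code: it takes "27KB" to mean 27 * 1024 bytes, uses integer arithmetic for the 50% growth, and assumes the buffer grows only while its capacity is smaller than the file. The function name `count_reallocations` is hypothetical.

```python
def count_reallocations(file_size, initial, grow):
    """Count how many times a buffer must grow to hold file_size bytes."""
    capacity = initial
    count = 0
    while capacity < file_size:
        capacity = grow(capacity)
        count += 1
    return count


FILE_SIZE = 27 * 1024  # assumed size of the largest source file

# Old strategy: start at 10 bytes, grow by 50% plus 40 bytes each time.
old_count = count_reallocations(FILE_SIZE, 10, lambda n: n + n // 2 + 40)

# New strategy: start at 8192 bytes, double each time.
new_count = count_reallocations(FILE_SIZE, 8192, lambda n: n * 2)
```

Under these assumptions the old strategy needs 15 reallocations, matching the figure quoted above, and the new one needs 2.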
  240. Build system:
  241. * `make clean` in C libstemmer_c distribution now removes `examples/*.o`.
  242. (#59)
  243. * Fix all the places which previously had to have a list of stemmers to work
  244. dynamically or be generated, so now only modules.txt needs updating to add
  245. a new stemmer.
  246. * Add check_java make target which runs tests for java.
  247. * Support gzipped test data (the uncompressed arabic test data is too big for
  248. github).
  249. * GNUmakefile: Drop useless `-eprefix` and `-r` options from snowball
  250. invocations for Java - these are only meaningful when generating C code.
  251. * Pass CFLAGS when linking which matches convention (e.g. automake does it) and
  252. facilitates use of tools such as ASan. Fixes #84, reported by Thomas
  253. Pointhuber.
  254. * Add CI builds with -std=c90 to check compiler and generated code are C90
  255. (#54)
  256. libstemmer stuff:
  257. * Split out CPPFLAGS from CFLAGS and use CFLAGS when linking stemwords.
  258. * Add -O2 to CFLAGS.
  259. * Make generated tables of encodings and modules const.
  260. * Fix clang static analyzer memory leak warning (in practice this code path
  261. can never actually be taken). Patch from Patrick O. Perry (#56)
  262. Documentation:
  263. * Added copyright and licensing details (#10).
  264. * Document that libstemmer supports ISO_8859_2 encoding. Currently hungarian
  265. and romanian are available in ISO_8859_2.
  266. * Remove documentation falsely claiming that libstemmer supports CP850
  267. encoding.
  268. * CONTRIBUTING.rst: Add guidance for contributing new stemming algorithms and
  269. new language backends.
  270. * Overhaul libstemmer_python_README. Most notably, replace the benchmark data
  271. which was very out of date.