<?xml version="1.0" standalone="no"?>
<!--
Copyright 1999-2005 The Apache Software Foundation
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<!-- $Id$ -->
<!DOCTYPE document PUBLIC "-//APACHE//DTD Documentation V2.0//EN" "http://forrest.apache.org/dtd/document-v20.dtd">
<document>
<header>
<title>Apache FOP: Hyphenation</title>
<version>$Revision$</version>
</header>
<body>
<section id="support">
<title>Hyphenation Support</title>
<section id="intro">
<title>Introduction</title>
<p>FOP uses Liang's hyphenation algorithm, well known from TeX. It needs
language-specific patterns and other data to operate.</p>
<p>Because of <a href="#license-issues">licensing issues</a> (and for
convenience), all hyphenation patterns for FOP are made available through
the <a class="fork" href="http://offo.sourceforge.net/hyphenation/index.html">Objects For
Formatting Objects</a> project.</p>
<note>If you have made improvements to an existing FOP hyphenation pattern,
or if you have created one from scratch, please consider contributing these
to OFFO so that they can benefit other FOP users as well.
Please inquire on the <a href="../maillist.html#fop-user">FOP User
mailing list</a>.</note>
</section>
<section id="license-issues">
<title>License Issues</title>
<p>Many of the hyphenation files distributed with TeX and its offspring are
licensed under the <a class="fork" href="http://www.latex-project.org/lppl.html">LaTeX
Project Public License (LPPL)</a>, which prevents them from being
distributed with Apache software. The LPPL puts restrictions on file names
in redistributed derived works which we feel we can't guarantee. Some
hyphenation pattern files have other or additional restrictions, for
example against use for commercial purposes.</p>
<p>Although Apache FOP cannot redistribute hyphenation pattern files that do
not conform with its license scheme, that does not necessarily prevent users
from using such hyphenation patterns with FOP. However, it does place on
the user the responsibility for determining whether the user can rightly use
such hyphenation patterns under the hyphenation pattern license.</p>
<warning>The user is responsible for settling license issues for hyphenation
pattern files that are obtained from non-Apache sources.</warning>
</section>
<section id="sources">
<title>Sources of Custom Hyphenation Pattern Files</title>
<p>The most important source of hyphenation pattern files is the
<a class="fork" href="http://www.ctan.org/tex-archive/language/hyphenation/">CTAN TeX
Archive</a>.</p>
</section>
<section id="install">
<title>Installing Custom Hyphenation Patterns</title>
<p>To install a custom hyphenation pattern for use with FOP:</p>
<ol>
<li>Convert the TeX hyphenation pattern file to the FOP format. The FOP
format is an XML file conforming to the DTD found at
<code>{fop-dir}/hyph/hyphenation.dtd</code>.</li>
<li>Name this new file following this schema:
<code>languageCode_countryCode.xml</code>. The country code is
optional, and should be used only if needed. For example:
<ul>
<li><code>en_US.xml</code> would be the file name for American
English hyphenation patterns.</li>
<li><code>it.xml</code> would be the file name for Italian
hyphenation patterns.</li>
</ul>
The language and country codes must match the XSL-FO input, which
follows <a href="http://www.ics.uci.edu/pub/ietf/http/related/iso639.txt">ISO
639</a> (languages) and <a href="http://www.ics.uci.edu/pub/ietf/http/related/iso3166.txt">ISO
3166</a> (countries). NOTE: The ISO 639/ISO 3166 convention is that
language codes are written in lower case, while country codes are written
in upper case. FOP does not check whether the language and country specified
in the FO source are actually from the current standard, but it relies
on them being two-letter strings in a few places. So you can make up your
own codes for custom hyphenation patterns, but they should be two-letter
strings too (patches for proper handling of extensions are welcome).</li>
<li>There are basically three ways to make the FOP-compatible hyphenation pattern
file(s) accessible to FOP:
<ul>
<li>Download the precompiled JAR from <a class="fork" href="http://offo.sourceforge.net/hyphenation/index.html">OFFO</a>
and place it either in the <code>{fop-dir}/lib</code> directory, or
in a directory of your choice (and append the full path to the JAR to
the environment variable <code>FOP_HYPHENATION_PATH</code>).</li>
<li>Download the desired FOP-compatible hyphenation pattern file(s) from
<a class="fork" href="http://offo.sourceforge.net/hyphenation/index.html">OFFO</a>,
and/or take your self-created hyphenation pattern file(s),
<ul>
<li>place them in the directory <code>{fop-dir}/hyph</code>, </li>
<li>or place them in a directory of your choice and set the Ant variable
<code>user.hyph.dir</code> to point to that directory (in
<code>build-local.properties</code>),</li>
</ul>
and run Ant with build target
<code>jar-hyphenation</code>. This will create a JAR containing the
compiled patterns in <code>{fop-dir}/build</code> that will be added to the
classpath on the next run.
(When FOP is built from scratch, and there are pattern source file(s)
present in the directory pointed to by the
<code>user.hyph.dir</code> variable, this JAR will automatically
be created from the supplied pattern(s).)</li>
<li>Put the pattern source file(s) into a directory of your choice and
configure FOP to look for custom patterns in this directory, by setting the
<a href="configuration.html#hyphenation-dir">&lt;hyphenation-dir&gt;</a>
configuration option.</li>
</ul>
</li>
</ol>
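<p>As a minimal sketch of how the naming rule in step 2 relates to the FO input
(the text content is made up for illustration), the following block asks for
American English hyphenation, which corresponds to a pattern file named
<code>en_US.xml</code>. The <code>language</code>, <code>country</code> and
<code>hyphenate</code> properties are standard XSL-FO properties; what FOP does
when no matching pattern file is found is not covered by this sketch.</p>
<source><![CDATA[<fo:block xmlns:fo="http://www.w3.org/1999/XSL/Format"
          language="en" country="US" hyphenate="true">
  Hyphenation is only attempted where the hyphenate property is set to true.
</fo:block>]]></source>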
<warning>
Any of the three options in step 3 will ensure hyphenation is working when using
FOP from the command-line. If FOP is being embedded, remember to add the location(s)
of the hyphenation JAR(s) to the CLASSPATH (options 1 and 2) or to set the
<a href="configuration.html#hyphenation-dir">&lt;hyphenation-dir&gt;</a>
configuration option programmatically (option 3).
</warning>
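<p>For option 3, a configuration fragment might look like the following sketch.
The element name is taken from the text above. The directory path is a
hypothetical placeholder, and the position of the element within the
configuration file is not shown, so please check the
<a href="configuration.html#hyphenation-dir">configuration documentation</a>
before relying on it.</p>
<source><![CDATA[<!-- Fragment only; /home/me/fop-patterns is a made-up example path. -->
<hyphenation-dir>/home/me/fop-patterns</hyphenation-dir>]]></source>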
</section>
</section>
<section id="patterns">
<title>Hyphenation Patterns</title>
<p>If you would like to build your own hyphenation pattern files, or modify
existing ones, this section will help you understand how to do so. Even
when creating a pattern file from scratch, it may be beneficial to start
with an existing file and modify it. See <a class="fork" href="http://offo.sourceforge.net/hyphenation/index.html">
OFFO's Hyphenation page</a> for examples.
Here is a brief explanation of the contents of FOP's hyphenation patterns:</p>
<warning>The remaining content of this section should be considered "draft"
quality. It was drafted from theoretical literature, and has not been
tested against actual FOP behavior. It may contain errors or omissions.
Do not rely on these instructions without testing everything stated here.
If you use these instructions, please provide feedback on the
<a href="../maillist.html#fop-user">FOP User mailing list</a>, either
confirming their accuracy, or raising specific problems that we can
address.</warning>
<ul>
<li>The root of the pattern file is the &lt;hyphenation-info> element.</li>
<li>&lt;hyphen-char>: its attribute "value" contains the character signalling
a hyphen in the &lt;exceptions> section. It has nothing to do with the
hyphenation character used in FOP's output; use the XSL-FO hyphenation-character
property to define that character. At some points
a dash (U+002D) is hardwired in the code, so you'd better use this too
(patches to rectify the situation are welcome). There is no default:
if you declare exceptions with hyphenations, you must declare the
hyphen-char too.</li>
<li>&lt;hyphen-min> contains two attributes:
<ul>
<li>before: the minimum number of characters in a word allowed to exist
on a line immediately preceding a hyphenated word-break.</li>
<li>after: the minimum number of characters in a word allowed to exist
on a line immediately after a hyphenated word-break.</li>
</ul>
This element is unused and not even read. It should be considered
documentation for the parameters used during pattern generation.
</li>
<li>&lt;classes> contains whitespace-separated character sets. The members
of each set should be treated as equivalent for purposes of hyphenation,
usually upper and lower case of the same character. The first character
of the set is the canonical character; the patterns and exceptions
should only contain these canonical representation characters (except
digits for weight, the period (.) as word delimiter in the patterns and
the hyphen char in exceptions, of course).</li>
<li>&lt;exceptions> contains whitespace-separated words, each of which
has either explicit hyphen characters to denote acceptable break
points, or no hyphen characters, to indicate that this word should
never be hyphenated, or explicit &lt;hyp> elements for specifying
changes of spelling due to hyphenation (like backen -> bak-ken or
Stoffarbe -> Stoff-farbe in the old German spelling). Exceptions override
the patterns described below. Explicit &lt;hyp> declarations don't work
yet (patches welcome). Exceptions are generally a bit brittle, so test
carefully.</li>
<li>&lt;patterns> includes whitespace-separated patterns, which are what
drive most hyphenation decisions. The characters in these patterns are
explained as follows:
<ul>
<li>non-numeric characters represent characters in a sub-word to be
evaluated</li>
<li>the period character (.) represents a word boundary, i.e. either
the beginning or ending of a word</li>
<li>numeric characters represent a scoring system for indicating the
acceptability of a hyphen in this location. Odd numbers represent an
acceptable location for a hyphen, with higher values overriding lower
inhibiting values. Even numbers indicate an unacceptable location, with
higher values overriding lower values indicating an acceptable position.
A value of zero (inhibiting) is implied when there is no number present.
Generally patterns are constructed so that values greater than 4 are rare.
Due to a bug, patterns with values of 8 and greater currently have
no effect, so do not be surprised if they are ignored.</li>
</ul>
Here are some examples from the English patterns file:
<ul>
<li>Knuth (<em>The TeXbook</em>, Appendix H) uses the example <strong>hach4</strong>, which indicates that it is extremely undesirable to place a hyphen after the substring "hach", for example in the word "toothach-es".</li>
<li><strong>.leg5e</strong> indicates that "leg-e", when it occurs at the beginning of a word, is a very good place to place a hyphen, if one is needed. Words like "leg-end" and "leg-er-de-main" fit this pattern.</li>
</ul>
Note that the algorithm that uses this data searches for each of the word's substrings in the patterns, and chooses the <em>highest</em> value found for each position between letters.
</li>
</ul>
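<p>To tie the elements described above together, here is a minimal sketch of a
pattern file. It is an illustration only, not a usable English pattern set: the
nine patterns are the ones <em>The TeXbook</em> (Appendix H) lists as matching
the word "hyphenation", the two exceptions are sample entries, and none of it
has been tested against FOP. The DOCTYPE line assumes that
<code>hyphenation.dtd</code> can be resolved next to the file.</p>
<source><![CDATA[<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE hyphenation-info SYSTEM "hyphenation.dtd">
<hyphenation-info>
  <!-- Character that marks break points in the exceptions below -->
  <hyphen-char value="-"/>
  <!-- Documentation only; see the description of hyphen-min above -->
  <hyphen-min before="2" after="3"/>
  <!-- One set per letter; the first character is the canonical one -->
  <classes>
aA bB eE hH iI lL nN oO pP rR sS tT yY
  </classes>
  <!-- "ta-ble" may break at the hyphen; "present" is never hyphenated -->
  <exceptions>
ta-ble
present
  </exceptions>
  <patterns>
hy3ph he2n hena4 hen5at 1na n2at 1tio 2io o2n
  </patterns>
</hyphenation-info>]]></source>
<p>Applied to "hyphenation", these patterns yield
<code>h0y3p0h0e2n5a4t2i0o2n</code> once the highest value is taken at each
position between letters. Only the odd values (3 and 5) permit a break, so the
word is hyphenated as "hy-phen-ation".</p>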
<p>If you want to convert a TeX hyphenation pattern file, you have to undo
the TeX encoding for non-ASCII text. FOP uses Unicode, and the patterns
must be proper Unicode too. Be aware of XML encoding issues, and
preferably use a good Unicode editor.</p>
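<p>As a small illustration of the encoding point (the German letters are only
an example), whatever notation the TeX source uses for a non-ASCII letter, the
converted file can either contain the character itself, provided the file is
really saved in the encoding named in its XML declaration, or a numeric
character reference, which is independent of the file encoding. The two
variants below are alternatives, not parts of one file:</p>
<source><![CDATA[<!-- Variant 1: characters typed directly; the file must be saved in the
     encoding it declares, for example UTF-8 -->
<classes>
äÄ öÖ
</classes>

<!-- Variant 2: the same two sets written as character references
     (U+00E4, U+00C4 and U+00F6, U+00D6) -->
<classes>
&#xE4;&#xC4; &#xF6;&#xD6;
</classes>]]></source>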
<p>Note that FOP does not do Unicode character normalization. If you use
combining characters for accents and other character decorations, you must
declare character classes for them, and use the same sequence of base character
and combining marks in the XSL-FO source, otherwise the patterns will not match.
Fortunately, Unicode provides precomposed characters for all important cases
in common languages, so until now nobody has seriously run into this issue. Some dead
languages and dialects, especially ancient ones, may pose a real problem
though.</p>
<p>If you want to generate your own patterns, an open-source utility called
patgen can assist in creating pattern files from dictionaries. It is
available on many Unix/Linux distributions and in every TeX
distribution. Pattern creation for languages like
English or German is an art. If you can, read Frank Liang's original paper
"Word Hy-phen-a-tion by Com-put-er" (yes, with hyphens). It is not available
online. The original patgen.web source, included in the TeX source distributions,
contains valuable comments; unfortunately, technical details often obscure the
high-level issues. Another important source is
<a class="fork" href="http://www.ctan.org/tex-archive/systems/knuth/tex/texbook.tex">The
TeXbook</a>, Appendix H (either read the TeX source, or run it through
TeX to typeset it). Secondary articles, for example the works by Petr Sojka,
may also give some much needed insight into problems arising in automated
hyphenation.</p>
</section>
</body>
</document>