|
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237 |
- <?xml version="1.0" encoding="UTF-8" standalone="no"?>
- <!--
- Licensed to the Apache Software Foundation (ASF) under one or more
- contributor license agreements. See the NOTICE file distributed with
- this work for additional information regarding copyright ownership.
- The ASF licenses this file to You under the Apache License, Version 2.0
- (the "License"); you may not use this file except in compliance with
- the License. You may obtain a copy of the License at
-
- http://www.apache.org/licenses/LICENSE-2.0
-
- Unless required by applicable law or agreed to in writing, software
- distributed under the License is distributed on an "AS IS" BASIS,
- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- See the License for the specific language governing permissions and
- limitations under the License.
- -->
- <!-- $Id$ -->
- <!DOCTYPE document PUBLIC "-//APACHE//DTD Documentation V2.0//EN" "http://forrest.apache.org/dtd/document-v20.dtd">
- <document>
- <header>
- <title>Apache™ FOP: Hyphenation</title>
- <version>$Revision$</version>
- </header>
- <body>
- <section id="support">
- <title>Hyphenation Support</title>
- <section id="intro">
- <title>Introduction</title>
- <p>Apache™ FOP uses Liang's hyphenation algorithm, well known from TeX. It needs
- language specific pattern and other data for operation.</p>
- <p>Because of <a href="#license-issues">licensing issues</a> (and for
- convenience), all hyphenation patterns for FOP are made available through
- the <a class="fork" href="http://offo.sourceforge.net/hyphenation/index.html">Objects For
- Formatting Objects</a> project.</p>
- <note>If you have made improvements to an existing FOP hyphenation pattern,
- or if you have created one from scratch, please consider contributing these
- to OFFO so that they can benefit other FOP users as well.
- Please inquire on the <a href="../maillist.html#fop-user">FOP User
- mailing list</a>.</note>
- </section>
- <section id="license-issues">
- <title>License Issues</title>
- <p>Many of the hyphenation files distributed with TeX and its offspring are
- licenced under the <a class="fork" href="http://www.latex-project.org/lppl.html">LaTeX
- Project Public License (LPPL)</a>, which prevents them from being
- distributed with Apache software. The LPPL puts restrictions on file names
- in redistributed derived works which we feel can't guarantee. Some
- hyphenation pattern files have other or additional restrictions, for
- example against use for commercial purposes.</p>
- <p>Although Apache FOP cannot redistribute hyphenation pattern files that do
- not conform with its license scheme, that does not necessarily prevent users
- from using such hyphenation patterns with FOP. However, it does place on
- the user the responsibility for determining whether the user can rightly use
- such hyphenation patterns under the hyphenation pattern license.</p>
- <warning>The user is responsible to settle license issues for hyphenation
- pattern files that are obtained from non-Apache sources.</warning>
- </section>
- <section id="sources">
- <title>Sources of Custom Hyphenation Pattern Files</title>
- <p>The most important source of hyphenation pattern files is the
- <a class="fork" href="http://www.ctan.org/tex-archive/language/hyphenation/">CTAN TeX
- Archive</a>.</p>
- </section>
- <section id="install">
- <title>Installing Custom Hyphenation Patterns</title>
- <p>To install a custom hyphenation pattern for use with FOP:</p>
- <ol>
- <li>Convert the TeX hyphenation pattern file to the FOP format. The FOP
- format is an xml file conforming to the DTD found at
- <code>{fop-dir}/hyph/hyphenation.dtd</code>.</li>
- <li>Name this new file following this schema:
- <code>languageCode_countryCode.xml</code>. The country code is
- optional, and should be used only if needed. For example:
- <ul>
- <li><code>en_US.xml</code> would be the file name for American
- English hyphenation patterns.</li>
- <li><code>it.xml</code> would be the file name for Italian
- hyphenation patterns.</li>
- </ul>
- The language and country codes must match the XSL-FO input, which
- follows <a href="http://www.ics.uci.edu/pub/ietf/http/related/iso639.txt">ISO
- 639</a> (languages) and <a href="http://www.ics.uci.edu/pub/ietf/http/related/iso3166.txt">ISO
- 3166</a> (countries). NOTE: The ISO 639/ISO 3166 convention is that
- language names are written in lower case, while country codes are written
- in upper case. FOP does not check whether the language and country specified
- in the FO source are actually from the current standard, but it relies
- on it being two letter strings in a few places. So you can make up your
- own codes for custom hyphenation patterns, but they should be two
- letter strings too (patches for proper handling extensions are welcome)</li>
- <li>There are basically three ways to make the FOP-compatible hyphenation pattern
- file(s) accessible to FOP:
- <ul>
- <li>Download the precompiled JAR from <a class="fork" href="http://offo.sourceforge.net/hyphenation/index.html">OFFO
- </a> and place it either in the <code>{fop-dir}/lib</code> directory, or
- in a directory of your choice (and append the full path to the JAR to
- the environment variable <code>FOP_HYPHENATION_PATH</code>).</li>
- <li>Download the desired FOP-compatible hyphenation pattern file(s) from
- <a class="fork" href="http://offo.sourceforge.net/hyphenation/index.html">OFFO</a>,
- and/or take your self created hyphenation pattern file(s),
- <ul>
- <li>place them in the directory <code>{fop-dir}/hyph</code>, </li>
- <li>or place them in a directory of your choice and set the Ant variable
- <code>user.hyph.dir</code> to point to that directory (in
- <code>build-local.properties</code>),</li>
- </ul>
- and run Ant with build target
- <code>jar-hyphenation</code>. This will create a JAR containing the
- compiled patterns in <code>{fop-dir}/build</code> that will be added to the
- classpath on the next run.
- (When FOP is built from scratch, and there are pattern source file(s)
- present in the directory pointed to by the
- <code>user.hyph.dir</code> variable, this JAR will automatically
- be created from the supplied pattern(s)).</li>
- <li>Put the pattern source file(s) into a directory of your choice and
- configure FOP to look for custom patterns in this directory, by setting the
- <a href="configuration.html"><hyphenation-base></a>
- configuration option.</li>
- </ul>
- </li>
- </ol>
- <warning>
- Either of these three options will ensure hyphenation is working when using
- FOP from the command-line. If FOP is being embedded, remember to add the location(s)
- of the hyphenation JAR(s) to the CLASSPATH (option 1 and 2) or to set the
- <a href="configuration.html#hyphenation-dir"><hyphenation-dir></a>
- configuration option programmatically (option 3).
- </warning>
- </section>
- </section>
- <section id="patterns">
- <title>Hyphenation Patterns</title>
- <p>If you would like to build your own hyphenation pattern files, or modify
- existing ones, this section will help you understand how to do so. Even
- when creating a pattern file from scratch, it may be beneficial to start
- with an existing file and modify it. See <a class="fork" href="http://offo.sourceforge.net/hyphenation/index.html">
- OFFO's Hyphenation page</a> for examples.
- Here is a brief explanation of the contents of FOP's hyphenation patterns:</p>
- <warning>The remaining content of this section should be considered "draft"
- quality. It was drafted from theoretical literature, and has not been
- tested against actual FOP behavior. It may contain errors or omissions.
- Do not rely on these instructions without testing everything stated here.
- If you use these instructions, please provide feedback on the
- <a href="../maillist.html#fop-user">FOP User mailing list</a>, either
- confirming their accuracy, or raising specific problems that we can
- address.</warning>
- <ul>
- <li>The root of the pattern file is the <hyphenation-info> element.</li>
- <li><hyphen-char>: its attribute "value" contains the character signalling
- a hyphen in the <exceptions> section. It has nothing to do with the
- hyphenation character used in FOP, use the XSLFO hyphenation-character
- property for defining the hyphenation character there. At some points
- a dash U+002D is hardwired in the code, so you'd better use this too
- (patches to rectify the situation are welcome). There is no default,
- if you declare exceptions with hyphenations, you must declare the
- hyphen-char too.</li>
- <li><hyphen-min> contains two attributes:
- <ul>
- <li>before: the minimum number of characters in a word allowed to exist
- on a line immediately preceding a hyphenated word-break.</li>
- <li>after: the minimum number of characters in a word allowed to exist
- on a line immediately after a hyphenated word-break.</li>
- </ul>
- This element is unused and not even read. It should be considered a
- documentation for parameters used during pattern generation.
- </li>
- <li><classes> contains whitespace-separated character sets. The members
- of each set should be treated as equivalent for purposes of hyphenation,
- usually upper and lower case of the same character. The first character
- of the set is the canonical character, the patterns and exceptions
- should only contain these canonical representation characters (except
- digits for weight, the period (.) as word delimiter in the patterns and
- the hyphen char in exceptions, of course).</li>
- <li><exceptions> contains whitespace-separated words, each of which
- has either explicit hyphen characters to denote acceptable breakage
- points, or no hyphen characters, to indicate that this word should
- never be hyphenated, or contain explicit <hyp> elements for specifying
- changes of spelling due to hyphenation (like backen -> bak-ken or
- Stoffarbe -> Stoff-farbe in the old german spelling). Exceptions override
- the patterns described below. Explicit <hyp> declarations don't work
- yet (patches welcome). Exceptions are generally a bit brittle, test
- carefully.</li>
- <li><patterns> includes whitespace-separated patterns, which are what
- drive most hyphenation decisions. The characters in these patterns are
- explained as follows:
- <ul>
- <li>non-numeric characters represent characters in a sub-word to be
- evaluated</li>
- <li>the period character (.) represents a word boundary, i.e. either
- the beginning or ending of a word</li>
- <li>numeric characters represent a scoring system for indicating the
- acceptability of a hyphen in this location. Odd numbers represent an
- acceptable location for a hyphen, with higher values overriding lower
- inhibiting values. Even numbers indicate an unacceptable location, with
- higher values overriding lower values indicating an acceptable position.
- A value of zero (inhibiting) is implied when there is no number present.
- Generally patterns are constructed so that valuse greater than 4 are rare.
- Due to a bug currently patterns with values of 8 and greater don't
- have an effect, so don't wonder.</li>
- </ul>
- Here are some examples from the English patterns file:
- <ul>
- <li>Knuth (<em>The TeXBook</em>, Appendix H) uses the example <strong>hach4</strong>, which indicates that it is extremely undesirable to place a hyphen after the substring "hach", for example in the word "toothach-es".</li>
- <li><strong>.leg5e</strong> indicates that "leg-e", when it occurs at the beginning of a word, is a very good place to place a hyphen, if one is needed. Words like "leg-end" and "leg-er-de-main" fit this pattern.</li>
- </ul>
- Note that the algorithm that uses this data searches for each of the word's substrings in the patterns, and chooses the <em>highest</em> value found for letter combination.
- </li>
- </ul>
- <p>If you want to convert a TeX hyphenation pattern file, you have to undo
- the TeX encoding for non-ASCII text. FOP uses Unicode, and the patterns
- must be proper Unicode too. You should be aware of the XML encoding issues,
- preferably use a good Unicode editor.</p>
- <p>Note that FOP does not do Unicode character normalization. If you use
- combining chars for accents and other character decorations, you must
- declare character classes for them, and use the same sequence of base character
- and combining marks in the XSLFO source, otherwise the pattern wouldn't match.
- Fortunately, Unicode provides precomposed characters for all important cases
- in common languages, until now nobody run seriously into this issue. Some dead
- languages and dialects, especially ancient ones, may pose a real problem
- though.</p>
- <p>If you want to generate your own patterns, an open-source utility called
- patgen is available on many Unix/Linux distributions and every TeX
- distribution which can be used to assist in
- creating pattern files from dictionaries. Pattern creation for languages like
- english or german is an art. If you can, read Frank Liang's original paper
- "Word Hy-phen-a-tion by Com-pu-ter" (yes, with hyphens). It is not available
- online. The original patgen.web source, included in the TeX source distributions,
- contains valuable comments, unfortunately technical details obscure often the
- high level issues. Another important source is
- <a class="fork" href="http://www.ctan.org/tex-archive/systems/knuth/tex/texbook.tex">The
- TeX Book</a>, appendix H (either read the TeX source, or run it through
- TeX to typeset it). Secondary articles, for example the works by Petr Sojka,
- may also give some much needed insight into problems arising in automated
- hyphenation.</p>
- </section>
- </body>
- </document>
|