diff options
author | Glenn Adams <gadams@apache.org> | 2012-07-21 03:22:26 +0000 |
---|---|---|
committer | Glenn Adams <gadams@apache.org> | 2012-07-21 03:22:26 +0000 |
commit | 56517e87d298dea311c231fc446072ea5a58cb21 (patch) | |
tree | 553ce8673b2f689a712d9eb5ca5690a0e6f9b10c /src/documentation/content/xdocs/1.1rc1/hyphenation.xml | |
parent | adbd5aac98fc107156ee108092ab25cf502c1491 (diff) | |
download | xmlgraphics-fop-56517e87d298dea311c231fc446072ea5a58cb21.tar.gz xmlgraphics-fop-56517e87d298dea311c231fc446072ea5a58cb21.zip |
Merge from ^/branches/fop-1_1.
git-svn-id: https://svn.apache.org/repos/asf/xmlgraphics/fop/trunk@1364036 13f79535-47bb-0310-9956-ffa450edef68
Diffstat (limited to 'src/documentation/content/xdocs/1.1rc1/hyphenation.xml')
-rw-r--r-- | src/documentation/content/xdocs/1.1rc1/hyphenation.xml | 237 |
1 files changed, 237 insertions, 0 deletions
diff --git a/src/documentation/content/xdocs/1.1rc1/hyphenation.xml b/src/documentation/content/xdocs/1.1rc1/hyphenation.xml new file mode 100644 index 000000000..e6f666826 --- /dev/null +++ b/src/documentation/content/xdocs/1.1rc1/hyphenation.xml @@ -0,0 +1,237 @@ +<?xml version="1.0" encoding="UTF-8" standalone="no"?> +<!-- + Licensed to the Apache Software Foundation (ASF) under one or more + contributor license agreements. See the NOTICE file distributed with + this work for additional information regarding copyright ownership. + The ASF licenses this file to You under the Apache License, Version 2.0 + (the "License"); you may not use this file except in compliance with + the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. +--> +<!-- $Id$ --> +<!DOCTYPE document PUBLIC "-//APACHE//DTD Documentation V2.0//EN" "http://forrest.apache.org/dtd/document-v20.dtd"> +<document> + <header> + <title>Apache™ FOP: Hyphenation</title> + <version>$Revision$</version> + </header> + <body> + <section id="support"> + <title>Hyphenation Support</title> + <section id="intro"> + <title>Introduction</title> + <p>Apache™ FOP uses Liang's hyphenation algorithm, well known from TeX. It needs + language specific pattern and other data for operation.</p> + <p>Because of <a href="#license-issues">licensing issues</a> (and for + convenience), all hyphenation patterns for FOP are made available through + the <a class="fork" href="http://offo.sourceforge.net/hyphenation/index.html">Objects For + Formatting Objects</a> project.</p> + <note>If you have made improvements to an existing Apache™ FOP hyphenation pattern, + or if you have created one from scratch, please consider contributing these + to OFFO so that they can benefit other FOP users as well. + Please inquire on the <a href="../maillist.html#fop-user">FOP User + mailing list</a>.</note> + </section> + <section id="license-issues"> + <title>License Issues</title> + <p>Many of the hyphenation files distributed with TeX and its offspring are + licenced under the <a class="fork" href="http://www.latex-project.org/lppl.html">LaTeX + Project Public License (LPPL)</a>, which prevents them from being + distributed with Apache software. The LPPL puts restrictions on file names + in redistributed derived works which we feel can't guarantee. Some + hyphenation pattern files have other or additional restrictions, for + example against use for commercial purposes.</p> + <p>Although Apache FOP cannot redistribute hyphenation pattern files that do + not conform with its license scheme, that does not necessarily prevent users + from using such hyphenation patterns with FOP. However, it does place on + the user the responsibility for determining whether the user can rightly use + such hyphenation patterns under the hyphenation pattern license.</p> + <warning>The user is responsible to settle license issues for hyphenation + pattern files that are obtained from non-Apache sources.</warning> + </section> + <section id="sources"> + <title>Sources of Custom Hyphenation Pattern Files</title> + <p>The most important source of hyphenation pattern files is the + <a class="fork" href="http://www.ctan.org/tex-archive/language/hyphenation/">CTAN TeX + Archive</a>.</p> + </section> + <section id="install"> + <title>Installing Custom Hyphenation Patterns</title> + <p>To install a custom hyphenation pattern for use with FOP:</p> + <ol> + <li>Convert the TeX hyphenation pattern file to the FOP format. The FOP + format is an xml file conforming to the DTD found at + <code>{fop-dir}/hyph/hyphenation.dtd</code>.</li> + <li>Name this new file following this schema: + <code>languageCode_countryCode.xml</code>. The country code is + optional, and should be used only if needed. For example: + <ul> + <li><code>en_US.xml</code> would be the file name for American + English hyphenation patterns.</li> + <li><code>it.xml</code> would be the file name for Italian + hyphenation patterns.</li> + </ul> + The language and country codes must match the XSL-FO input, which + follows <a href="http://www.ics.uci.edu/pub/ietf/http/related/iso639.txt">ISO + 639</a> (languages) and <a href="http://www.ics.uci.edu/pub/ietf/http/related/iso3166.txt">ISO + 3166</a> (countries). NOTE: The ISO 639/ISO 3166 convention is that + language names are written in lower case, while country codes are written + in upper case. FOP does not check whether the language and country specified + in the FO source are actually from the current standard, but it relies + on it being two letter strings in a few places. So you can make up your + own codes for custom hyphenation patterns, but they should be two + letter strings too (patches for proper handling extensions are welcome)</li> + <li>There are basically three ways to make the FOP-compatible hyphenation pattern + file(s) accessible to FOP: + <ul> + <li>Download the precompiled JAR from <a class="fork" href="http://offo.sourceforge.net/hyphenation/index.html">OFFO + </a> and place it either in the <code>{fop-dir}/lib</code> directory, or + in a directory of your choice (and append the full path to the JAR to + the environment variable <code>FOP_HYPHENATION_PATH</code>).</li> + <li>Download the desired FOP-compatible hyphenation pattern file(s) from + <a class="fork" href="http://offo.sourceforge.net/hyphenation/index.html">OFFO</a>, + and/or take your self created hyphenation pattern file(s), + <ul> + <li>place them in the directory <code>{fop-dir}/hyph</code>, </li> + <li>or place them in a directory of your choice and set the Ant variable + <code>user.hyph.dir</code> to point to that directory (in + <code>build-local.properties</code>),</li> + </ul> + and run Ant with build target + <code>jar-hyphenation</code>. This will create a JAR containing the + compiled patterns in <code>{fop-dir}/build</code> that will be added to the + classpath on the next run. + (When FOP is built from scratch, and there are pattern source file(s) + present in the directory pointed to by the + <code>user.hyph.dir</code> variable, this JAR will automatically + be created from the supplied pattern(s)).</li> + <li>Put the pattern source file(s) into a directory of your choice and + configure FOP to look for custom patterns in this directory, by setting the + <a href="configuration.html"><hyphenation-base></a> + configuration option.</li> + </ul> + </li> + </ol> + <warning> + Either of these three options will ensure hyphenation is working when using + FOP from the command-line. If FOP is being embedded, remember to add the location(s) + of the hyphenation JAR(s) to the CLASSPATH (option 1 and 2) or to set the + <a href="configuration.html#hyphenation-dir"><hyphenation-dir></a> + configuration option programmatically (option 3). + </warning> + </section> + </section> + <section id="patterns"> + <title>Hyphenation Patterns</title> + <p>If you would like to build your own hyphenation pattern files, or modify + existing ones, this section will help you understand how to do so. Even + when creating a pattern file from scratch, it may be beneficial to start + with an existing file and modify it. See <a class="fork" href="http://offo.sourceforge.net/hyphenation/index.html"> + OFFO's Hyphenation page</a> for examples. + Here is a brief explanation of the contents of FOP's hyphenation patterns:</p> + <warning>The remaining content of this section should be considered "draft" + quality. It was drafted from theoretical literature, and has not been + tested against actual FOP behavior. It may contain errors or omissions. + Do not rely on these instructions without testing everything stated here. + If you use these instructions, please provide feedback on the + <a href="../maillist.html#fop-user">FOP User mailing list</a>, either + confirming their accuracy, or raising specific problems that we can + address.</warning> + <ul> + <li>The root of the pattern file is the <hyphenation-info> element.</li> + <li><hyphen-char>: its attribute "value" contains the character signalling + a hyphen in the <exceptions> section. It has nothing to do with the + hyphenation character used in FOP, use the XSLFO hyphenation-character + property for defining the hyphenation character there. At some points + a dash U+002D is hardwired in the code, so you'd better use this too + (patches to rectify the situation are welcome). There is no default, + if you declare exceptions with hyphenations, you must declare the + hyphen-char too.</li> + <li><hyphen-min> contains two attributes: + <ul> + <li>before: the minimum number of characters in a word allowed to exist + on a line immediately preceding a hyphenated word-break.</li> + <li>after: the minimum number of characters in a word allowed to exist + on a line immediately after a hyphenated word-break.</li> + </ul> + This element is unused and not even read. It should be considered a + documentation for parameters used during pattern generation. + </li> + <li><classes> contains whitespace-separated character sets. The members + of each set should be treated as equivalent for purposes of hyphenation, + usually upper and lower case of the same character. The first character + of the set is the canonical character, the patterns and exceptions + should only contain these canonical representation characters (except + digits for weight, the period (.) as word delimiter in the patterns and + the hyphen char in exceptions, of course).</li> + <li><exceptions> contains whitespace-separated words, each of which + has either explicit hyphen characters to denote acceptable breakage + points, or no hyphen characters, to indicate that this word should + never be hyphenated, or contain explicit <hyp> elements for specifying + changes of spelling due to hyphenation (like backen -> bak-ken or + Stoffarbe -> Stoff-farbe in the old german spelling). Exceptions override + the patterns described below. Explicit <hyp> declarations don't work + yet (patches welcome). Exceptions are generally a bit brittle, test + carefully.</li> + <li><patterns> includes whitespace-separated patterns, which are what + drive most hyphenation decisions. The characters in these patterns are + explained as follows: + <ul> + <li>non-numeric characters represent characters in a sub-word to be + evaluated</li> + <li>the period character (.) represents a word boundary, i.e. either + the beginning or ending of a word</li> + <li>numeric characters represent a scoring system for indicating the + acceptability of a hyphen in this location. Odd numbers represent an + acceptable location for a hyphen, with higher values overriding lower + inhibiting values. Even numbers indicate an unacceptable location, with + higher values overriding lower values indicating an acceptable position. + A value of zero (inhibiting) is implied when there is no number present. + Generally patterns are constructed so that valuse greater than 4 are rare. + Due to a bug currently patterns with values of 8 and greater don't + have an effect, so don't wonder.</li> + </ul> + Here are some examples from the English patterns file: + <ul> + <li>Knuth (<em>The TeXBook</em>, Appendix H) uses the example <strong>hach4</strong>, which indicates that it is extremely undesirable to place a hyphen after the substring "hach", for example in the word "toothach-es".</li> + <li><strong>.leg5e</strong> indicates that "leg-e", when it occurs at the beginning of a word, is a very good place to place a hyphen, if one is needed. Words like "leg-end" and "leg-er-de-main" fit this pattern.</li> + </ul> + Note that the algorithm that uses this data searches for each of the word's substrings in the patterns, and chooses the <em>highest</em> value found for letter combination. + </li> + </ul> + <p>If you want to convert a TeX hyphenation pattern file, you have to undo + the TeX encoding for non-ASCII text. FOP uses Unicode, and the patterns + must be proper Unicode too. You should be aware of the XML encoding issues, + preferably use a good Unicode editor.</p> + <p>Note that FOP does not do Unicode character normalization. If you use + combining chars for accents and other character decorations, you must + declare character classes for them, and use the same sequence of base character + and combining marks in the XSLFO source, otherwise the pattern wouldn't match. + Fortunately, Unicode provides precomposed characters for all important cases + in common languages, until now nobody run seriously into this issue. Some dead + languages and dialects, especially ancient ones, may pose a real problem + though.</p> + <p>If you want to generate your own patterns, an open-source utility called + patgen can be used to assist in creating pattern files from dictionaries. + It is available in many Unix/Linux distributions and every TeX distribution. + Pattern creation for languages like english or german is an art. Read + Frank Liang's original paper <a class="fork" href="http://www.tug.org/docs/liang/">"Word + Hy-phen-a-tion by Com-pu-ter"</a> (yes, with hyphens) for details. + The original patgen.web source, included in the TeX source distributions, + contains valuable comments, unfortunately technical details often obscure the + high level issues. Another important source of information is + <a class="fork" href="http://mirrors.ctan.org/systems/knuth/dist/tex/texbook.tex">The + TeX Book</a>, appendix H (either read the TeX source, or run it through + TeX to typeset it). Secondary articles, for example the works by Petr Sojka, + may also give some much needed insight into problems arising in automated + hyphenation.</p> + </section> + </body> +</document>
\ No newline at end of file |