<?xml version="1.0" standalone="no"?>
<!DOCTYPE document PUBLIC "-//APACHE//DTD Documentation V1.1//EN"
    "http://cvs.apache.org/viewcvs.cgi/*checkout*/xml-forrest/src/resources/schema/dtd/document-v11.dtd">
<document>
  <header>
    <title>XML Parsing</title>
  </header>
  <body>
    <section id="intro">
      <title>Introduction</title>
      <p>Parsing is the process of reading the XSL-FO input and making the information in it available to FOP.</p>
    </section>
    <section id="input">
      <title>SAX for Input</title>
      <p>The two standard ways of dealing with XML input are SAX and DOM.
SAX fires events as it parses an XML document serially; a program using SAX (and not storing anything itself) sees only a small window of the document at any point in time, and can never look ahead in the document.
DOM creates and stores a tree representation of the document, allowing a view of the entire document as an integrated whole.
One point that may seem counter-intuitive to new FOP developers, and which has from time to time been contentious, is that FOP uses SAX for input.
(DOM can be used as input as well, but it is converted into SAX events before entering FOP, effectively negating its advantages.)</p>
      <p>Since FOP essentially needs a tree representation of the FO input, at first glance it seems to make sense to use DOM.
Instead, FOP takes SAX events and builds its own tree-like structure. Why?</p>
      <ul>
        <li>DOM has a relatively large memory footprint. FOP's FO Tree is a lighter-weight structure.</li>
        <li>DOM contains an entire document. FOP is able to process individual fo:page-sequence objects discretely, without the need to have the entire document in memory. For documents that have only one fo:page-sequence object, FOP's approach offers no advantage, but in other cases the advantage is large. A 500-page book that is broken into 100 five-page chapters, each in its own fo:page-sequence, essentially needs only 1% of the document memory that would be required if using DOM as input.</li>
      </ul>
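      <p>The memory argument above can be illustrated with a minimal sketch. This is not FOP's actual handler: the class PageSequenceWindow and its fields are hypothetical, and a plain list of element names stands in for the FO Tree. It assumes only the standard JAXP SAX API and the XSL-FO namespace URI. The handler keeps data for the page sequence currently being read and drops it when that page sequence ends.</p>
      <source><![CDATA[
```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical sketch, not FOP code: holds one page-sequence at a time. */
class PageSequenceWindow extends DefaultHandler {
    static final String FO_NS = "http://www.w3.org/1999/XSL/Format";

    // Element names held for the page-sequence currently being read
    // (a stand-in for a real FO Tree fragment).
    private List<String> current = null;
    // Only a summary number survives once a page-sequence has ended.
    final List<Integer> sizes = new ArrayList<>();
    // Largest number of element names held at any one time.
    int peakHeld = 0;

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        if (FO_NS.equals(uri) && "page-sequence".equals(local)) {
            current = new ArrayList<>();
        }
        if (current != null) {
            current.add(local);
            peakHeld = Math.max(peakHeld, current.size());
        }
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        if (FO_NS.equals(uri) && "page-sequence".equals(local)) {
            sizes.add(current.size());
            current = null; // the fragment is now eligible for garbage collection
        }
    }

    /** Parses the given XML text and returns the handler with its counters. */
    static PageSequenceWindow parse(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            PageSequenceWindow handler = new PageSequenceWindow();
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
            return handler;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<fo:root xmlns:fo='" + FO_NS + "'>"
            + "<fo:page-sequence><fo:flow><fo:block>a</fo:block></fo:flow></fo:page-sequence>"
            + "<fo:page-sequence><fo:flow><fo:block>b</fo:block>"
            + "<fo:block>c</fo:block></fo:flow></fo:page-sequence>"
            + "</fo:root>";
        PageSequenceWindow h = parse(xml);
        System.out.println(h.sizes + " peak=" + h.peakHeld);
    }
}
```
]]></source>
      <p>For the two page sequences in this sample, the handler records 3 and 4 elements respectively and never holds more than 4 element names at once, however many page sequences the document contains.</p>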
      <p>See the <link href="../embedding.html#input">Input Section of the User Embedding Document</link> for a discussion of input usage patterns and some implementation details.</p>
    </section>
    <section id="validation">
      <title>Validation</title>
      <p>If the input XML is not well-formed, the parser reports the error.</p>
      <p>There is no DTD for XSL-FO, so no formal validation is possible at the parser level.</p>
      <p>The SAX handler will report an error for unrecognized <link href="#namespaces">namespaces</link>.</p>
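      <p>The two kinds of error described above can be sketched as follows. This is an illustration under stated assumptions, not FOP's implementation: the class NamespaceCheckSketch, its check method and the set of known namespaces are all hypothetical. Well-formedness errors come from the parser itself as a SAXParseException, while the namespace check is something the handler has to do in its startElement callback.</p>
      <source><![CDATA[
```java
import java.io.StringReader;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical sketch, not FOP code: rejects elements in unknown namespaces. */
class NamespaceCheckSketch extends DefaultHandler {
    private final Set<String> known;

    NamespaceCheckSketch(Set<String> known) {
        this.known = known;
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts)
            throws SAXException {
        // The handler, not the parser, decides which namespaces are acceptable.
        if (!known.contains(uri)) {
            throw new SAXException("Unrecognized namespace: " + uri);
        }
    }

    /** Returns "ok", "not well-formed", or the handler's error message. */
    static String check(String xml, Set<String> known) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)),
                    new NamespaceCheckSketch(known));
            return "ok";
        } catch (SAXParseException e) {
            // Raised by the parser for input that is not well-formed.
            return "not well-formed";
        } catch (Exception e) {
            // Raised by the handler above (or by parser setup).
            return e.getMessage();
        }
    }
}
```
]]></source>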
    </section>
    <section id="namespaces">
      <title>Namespaces</title>
      <p>To allow for extensions to the XSL-FO language, FOP provides a mechanism for handling foreign namespaces.</p>
      <p>See <link href="../extensions.html">User Extensions</link> for a discussion of standard extensions shipped with FOP, and their related namespaces.</p>
      <p>See <link href="../dev/extenstions.html">Developer Extensions</link> for a discussion of the mechanisms in place to allow developers to add their own extensions, including how to tell FOP about the foreign namespace.</p>
    </section>
    <section>
      <title>Tree Building</title>
      <p>The SAX events deliver all of the information in the document: start element, end element, text data, and so on.
This information is used to build up a representation of the FO document.
To do this, each namespace has a set of element mappings.
When a mapping is found for an element and its namespace, an object is created for that element.
If no mapping is found, a dummy object is created, or a generic DOM for unknown namespaces.</p>
      <p>The object is then set up and given the attributes for the element.
For the FO Tree the attributes are converted into properties.
The FO objects use a property list mapping to convert the attributes into a list of properties for the element.
For other XML, for example SVG, a DOM of the XML is constructed.
This DOM can then be passed through to the renderer.
Other element mappings can be used in different ways, for example to create elements that create areas during the layout process or that set up information for the renderer.</p>
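      <p>The lookup described above can be sketched as a two-level table keyed by namespace and then by element name. The names used here (ElementMappingSketch, Node, register, create) are hypothetical and do not correspond to FOP classes, and copying attributes straight into a property map is a simplification of the real property list mapping.</p>
      <source><![CDATA[
```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical sketch, not FOP code: element mappings per namespace. */
class ElementMappingSketch {
    static final String FO_NS = "http://www.w3.org/1999/XSL/Format";

    /** Minimal stand-in for a node of the FO Tree. */
    static class Node {
        final String kind;
        final Map<String, String> properties = new LinkedHashMap<>();

        Node(String kind) {
            this.kind = kind;
        }
    }

    // namespace URI -> (local element name -> node maker)
    private final Map<String, Map<String, Supplier<Node>>> mappings = new HashMap<>();

    /** Adds a mapping for one element in one namespace. */
    void register(String namespace, String localName, Supplier<Node> maker) {
        mappings.computeIfAbsent(namespace, ns -> new HashMap<>()).put(localName, maker);
    }

    /** Creates the node for an element, falling back to a dummy "unknown" node. */
    Node create(String namespace, String localName, Map<String, String> attributes) {
        Map<String, Supplier<Node>> table = mappings.get(namespace);
        Supplier<Node> maker = (table == null) ? null : table.get(localName);
        Node node = (maker != null) ? maker.get() : new Node("unknown");
        // Simplification: attributes are copied as-is instead of being
        // converted through a property list mapping.
        node.properties.putAll(attributes);
        return node;
    }
}
```
]]></source>
      <p>A caller would register one maker per supported element, then call create from its startElement handling; an element with no registered mapping yields the dummy node rather than an error.</p>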
      <p>While tree building is mainly about creating the FO Tree, some stages can propagate to the renderer.
At the end of a page sequence we know that all pages in the page sequence can be laid out without being affected by any further XML.
The significance of this is that the FO Tree for the page sequence can potentially be disposed of.
The end of the XML document also tells us that we can finalise the output document.
(The layout of individual pages is accomplished by the layout managers one page at a time; i.e. they do not need to wait for the end of the page sequence.
A page may not yet be complete, however: it may contain forward page number references, for example.)</p>
    </section>
  </body>
</document>