path: root/src/documentation/content/xdocs/design/alt.design/xml-parsing.ehtml
Diffstat (limited to 'src/documentation/content/xdocs/design/alt.design/xml-parsing.ehtml')
-rw-r--r--src/documentation/content/xdocs/design/alt.design/xml-parsing.ehtml366
1 files changed, 0 insertions, 366 deletions
diff --git a/src/documentation/content/xdocs/design/alt.design/xml-parsing.ehtml b/src/documentation/content/xdocs/design/alt.design/xml-parsing.ehtml
deleted file mode 100644
index 43033270a..000000000
--- a/src/documentation/content/xdocs/design/alt.design/xml-parsing.ehtml
+++ /dev/null
@@ -1,366 +0,0 @@
-<?xml version="1.0"?>
-<!--
- Copyright 1999-2004 The Apache Software Foundation
-
- Licensed under the Apache License, Version 2.0 (the "License");
- you may not use this file except in compliance with the License.
- You may obtain a copy of the License at
-
- http://www.apache.org/licenses/LICENSE-2.0
-
- Unless required by applicable law or agreed to in writing, software
- distributed under the License is distributed on an "AS IS" BASIS,
- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- See the License for the specific language governing permissions and
- limitations under the License.
--->
-<!-- $Id$ -->
-<html>
- <body text="#000000" bgcolor="#FFFFFF">
- <script type="text/javascript" src="codedisplay.js" />
- <div class="content">
- <h1>Implementing Pull Parsing</h1>
- <p>
- <font size="-2">by Peter B. West</font>
- </p>
- <ul class="minitoc">
- <li>
- <a href="#An+alternative+parsing+methodology">An alternative
- parsing methodology</a>
- <ul class="minitoc">
- <li>
- <a href="#Structure+of+SAX+parsing">Structure of SAX parsing</a>
- </li>
- <li>
- <a href="#Cluttered+callbacks">Cluttered callbacks</a>
- </li>
- <li>
- <a href="#From+">From push to pull parsing</a>
- </li>
- <li>
- <a href="#FoXMLEvent+methods">FoXMLEvent methods</a>
- </li>
- <li>
- <a href="#FOP+modularisation">FOP modularisation</a>
- </li>
- </ul>
- </li>
- </ul>
-
- <a name="N101C5"></a><a name="An+alternative+parsing+methodology"></a>
- <h3>An alternative parsing methodology</h3>
- <div style="margin-left: 0 ; border: 2px">
- <p>
- This note proposes an alternative method of integrating the
- output of the SAX parsing of the Flow Object (FO) tree into
- FOP processing. The purpose of the proposed changes is to
- provide for:
- </p>
- <ul>
-
- <li>
- better decomposition of FOP into processing phases
- </li>
-
- <li>
- top-down FO tree building, providing
- </li>
-
- <li>
- integrated validation of FO tree input.
- </li>
-
- </ul>
- <a name="N101DA"></a><a name="Structure+of+SAX+parsing"></a>
- <h4>Structure of SAX parsing</h4>
- <div style="margin-left: 0 ; border: 2px">
- <p>
- Figure 1 is a schematic representation of the process of
- SAX parsing of an input source. SAX parsing involves the
- registration, with an object implementing the <span
- class="codefrag">XMLReader</span> interface, of a <span
- class="codefrag">ContentHandler</span> which contains a
- callback routine for each of the event types encountered
- by the parser, e.g., <span
- class="codefrag">startDocument()</span>, <span
- class="codefrag">startElement()</span>, <span
- class="codefrag">characters()</span>, <span
- class="codefrag">endElement()</span> and <span
- class="codefrag">endDocument()</span>. Parsing is
- initiated by a call to the <span
- class="codefrag">parse()</span> method of the <span
- class="codefrag">XMLReader</span>. Note that the call to
- <span class="codefrag">parse()</span> and the calls to
- individual callback methods are synchronous: <span
- class="codefrag">parse()</span> will only return when the
- last callback method returns, and each callback must
- complete before the next is called.<br/> <br/>
-
- <strong>Figure 1</strong>
-
- </p>
- <div align="center">
- <img class="figure" alt="SAX parsing schematic"
- src="images/design/alt.design/SAXParsing.png" /></div>
- <p>
- In the process of parsing, the hierarchical structure of the
- original FO tree is flattened into a number of streams of
- events of the same type which are reported in the sequence
- in which they are encountered. Apart from that, the API
- imposes no structure or constraint which expresses the
- relationship between, e.g., a startElement event and the
- endElement event for the same element. To the extent that
- such relationship information is required, it must be
- managed by the callback routines.
- </p>
- <p>
- The most direct approach here is to build the tree
- "invisibly"; to bury within the callback routines the
- necessary code to construct the tree. In the simplest
- case, the whole of the FO tree is built within the call
- to <span class="codefrag">parse()</span>, and that
- in-memory tree is subsequently processed to (a) validate
- the FO structure, and (b) construct the Area tree. The
- problem with this approach is the potential size of the
- FO tree in memory. FOP has suffered from this problem
- in the past.
- </p>
- </div>
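- <p>
- The registration and callback sequence described above can be
- sketched as follows. This is only an illustration, not FOP code:
- the class <span class="codefrag">SaxCallbackSketch</span> and its
- helper method are hypothetical names, and the handler merely logs
- each callback. The SAX and JAXP classes it uses are the standard
- ones shipped with the JDK.
- </p>

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration class; not part of FOP.
public class SaxCallbackSketch {

    /** Parses the given XML and returns the callbacks in the order they were made. */
    public static List<String> parse(String xml) throws Exception {
        final List<String> log = new ArrayList<>();

        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();

        // Register the ContentHandler; DefaultHandler supplies empty callbacks
        // for every event type that is not overridden here.
        reader.setContentHandler(new DefaultHandler() {
            @Override public void startDocument() { log.add("startDocument"); }
            @Override public void startElement(String uri, String localName,
                                               String qName, Attributes atts) {
                log.add("startElement " + localName);
            }
            @Override public void endElement(String uri, String localName, String qName) {
                log.add("endElement " + localName);
            }
            @Override public void endDocument() { log.add("endDocument"); }
        });

        // parse() is synchronous: it returns only after endDocument() has returned.
        reader.parse(new InputSource(new StringReader(xml)));
        return log;
    }

    public static void main(String[] args) throws Exception {
        for (String line : parse("<root><block/></root>")) {
            System.out.println(line);
        }
    }
}
```

- <p>
- Nothing in the handler ties an
- <span class="codefrag">endElement()</span> call back to its
- <span class="codefrag">startElement()</span>; any such pairing
- would have to be tracked by the callbacks themselves.
- </p>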
- <a name="N10218"></a><a name="Cluttered+callbacks"></a>
- <h4>Cluttered callbacks</h4>
- <div style="margin-left: 0 ; border: 2px">
- <p>
- On the other hand, the callback code may become
- increasingly complex as tree validation and the triggering
- of the Area tree processing and subsequent rendering are
- moved into the callbacks, typically the <span
- class="codefrag">endElement()</span> method. In order to
- overcome acute memory problems, the FOP code was recently
- modified in this way, to trigger Area tree building and
- rendering in the <span
- class="codefrag">endElement()</span> method, when the end
- of a page-sequence was detected.
- </p>
- <p>
- The drawback with such a method is that it becomes difficult
- to determine the order of events and the circumstances in
- which any particular processing events are triggered. When
- the processing events are inherently self-contained, this is
- irrelevant. But the more complex and context-dependent the
- relationships are among the processing elements, the more
- obscurity is engendered in the code by such "side-effect"
- processing.
- </p>
- </div>
- <a name="N1022B"></a><a name="From+"></a>
- <h4>From push to pull parsing</h4>
- <div style="margin-left: 0 ; border: 2px">
- <p>
- In order to solve the simultaneous problems of exposing
- the structure of the processing and minimising in-memory
- requirements, the experimental code separates the
- parsing of the input source from the building of the FO
- tree and all downstream processing. The callback
- routines become minimal, consisting of the creation and
- buffering of <span class="codefrag">XMLEvent</span>
- objects as a <em>producer</em>. All of these objects
- are effectively merged into a single event stream, in
- strict event order, for subsequent access by the FO tree
- building process, acting as a <em>consumer</em>. This,
- essentially, is the difference between <em>push</em> and
- <em>pull</em> parsing. In itself, this does not reduce
- the memory footprint. That reduction comes when the approach is
- generalised to modularise FOP processing.<br/> <br/>
- <strong>Figure 2</strong>
-
- </p>
- <div align="center">
- <img class="figure" alt="XML event buffer"
- src="images/design/alt.design/pull-parsing.png" /></div>
- <p>
- The most useful change that this brings about is the switch
- from <em>passive</em> to <em>active</em> XML element
- processing. The process of parsing now becomes visible to
- the controlling process. All local validation requirements,
- all object and data structure building, are initiated by the
- process(es) <em>get</em>ting from the queue - in the case
- above, the FO tree builder.
- </p>
- </div>
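- <p>
- A minimal sketch of the producer/consumer arrangement, assuming a
- bounded <span class="codefrag">java.util.concurrent.ArrayBlockingQueue</span>
- as the event buffer. The names
- <span class="codefrag">PullParsingSketch</span>,
- <span class="codefrag">Event</span> and
- <span class="codefrag">Producer</span> are hypothetical stand-ins;
- the experimental code uses its own event and buffer classes, and
- error handling is omitted here.
- </p>

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration class; not the experimental FOP code.
public class PullParsingSketch {

    enum Type { START_DOCUMENT, START_ELEMENT, END_ELEMENT, END_DOCUMENT }

    /** One entry in the merged event stream. */
    static final class Event {
        final Type type;
        final String name;
        Event(Type type, String name) { this.type = type; this.name = name; }
    }

    /** Producer: each callback only creates an event and buffers it. */
    static final class Producer extends DefaultHandler {
        private final BlockingQueue<Event> queue;
        Producer(BlockingQueue<Event> queue) { this.queue = queue; }

        private void put(Type type, String name) {
            try {
                queue.put(new Event(type, name)); // blocks while the buffer is full
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        @Override public void startDocument() { put(Type.START_DOCUMENT, null); }
        @Override public void startElement(String uri, String local, String qName,
                                           Attributes atts) { put(Type.START_ELEMENT, local); }
        @Override public void endElement(String uri, String local, String qName) {
            put(Type.END_ELEMENT, local);
        }
        @Override public void endDocument() { put(Type.END_DOCUMENT, null); }
    }

    /** Consumer: actively pulls events, in strict event order, until END_DOCUMENT. */
    public static List<String> consume(final String xml) throws Exception {
        final BlockingQueue<Event> queue = new ArrayBlockingQueue<>(2);
        Thread parser = new Thread(() -> {
            try {
                SAXParserFactory factory = SAXParserFactory.newInstance();
                factory.setNamespaceAware(true);
                XMLReader reader = factory.newSAXParser().getXMLReader();
                reader.setContentHandler(new Producer(queue));
                reader.parse(new InputSource(new StringReader(xml)));
            } catch (Exception e) {
                throw new IllegalStateException(e); // sketch: no recovery
            }
        });
        parser.setDaemon(true);
        parser.start();

        List<String> seen = new ArrayList<>();
        Event e;
        do {
            e = queue.take(); // blocks while the buffer is empty
            seen.add(e.type + (e.name == null ? "" : ":" + e.name));
        } while (e.type != Type.END_DOCUMENT);
        parser.join();
        return seen;
    }
}
```

- <p>
- Because the buffer is bounded, the parser thread stalls whenever
- the consumer falls behind, so the two sides stay in step without
- the callbacks knowing anything about tree building.
- </p>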
- <a name="N10260"></a><a name="FoXMLEvent+methods"></a>
- <h4>FoXMLEvent methods</h4>
- <div style="margin-left: 0 ; border: 2px">
- <a name="FoXMLEvent-methods"></a>
- <p>
- The experimental code uses a class <span id = "span00"
- /><span class = "codefrag" ><a
- href="javascript:toggleCode( 'span00',
- 'FoXMLEvent.html#FoXMLEventClass', '400', '100%'
- )">FoXMLEvent</a></span > to provide the objects which are
- placed in the queue. <em>FoXMLEvent</em> includes a
- variety of methods to access elements in the queue.
- Namespace URIs encountered in parsing are maintained in an
- <span id = "span01" /><span class="codefrag"><a
- href="javascript:toggleCode( 'span01',
- 'XMLNamespaces.html#XMLNamespacesClass', '400', '100%'
- )">XMLNamespaces</a></span> object where they are
- associated with a unique integer index. This integer
- value is used in the signature of some of the access
- methods.
- </p>
- <p>
- The class which manages the buffer is <span id = "span02"
- /><span class = "codefrag" ><a href =
- "javascript:toggleCode( 'span02',
- 'SyncedFoXmlEventsBuffer.html#SyncedFoXmlEventsBufferClass',
- '400', '100%' )" >SyncedFoXmlEventsBuffer</a>.</span >
- </p>
- <dl>
-
- <dt>
- <span id = "span03" /><a href="javascript:toggleCode(
- 'span03', 'SyncedFoXmlEventsBuffer.html#getEvent',
- '400', '100%' )">FoXMLEvent
- getEvent(SyncedCircularBuffer events)</a>
- </dt>
-
- <dd>
- This is the basis of all of the queue access methods. It
- returns the next element from the queue, which may be a
- pushback element.
- </dd>
-
- <dt>
- <span id = "span04" /><a href="javascript:toggleCode(
- 'span04', 'SyncedFoXmlEventsBuffer.html#getTypedEvent',
- '400', '100%' )">FoXMLEvent getTypedEvent()</a>
- </dt>
-
- <dd>
- A series of these methods provide for the recovery only
- of events of a particular event type, and possibly other
- specific characteristics. <em>Get</em> methods discard
- input which does not meet the requirements. E.g.
- <dl>
- <dt>
- <span id = "span040" /><a
- href="javascript:toggleCode( 'span040',
- 'SyncedFoXmlEventsBuffer.html#getEndDocument',
- '400', '100%' )">FoXMLEvent getEndDocument()</a>
- </dt>
- <dd>
- Discard input until an EndDocument event occurs.
- Return this event.
- </dd>
- <dt>
- <span id = "span041" /><a
- href="javascript:toggleCode( 'span041',
- 'SyncedFoXmlEventsBuffer.html#getStartElement',
- '400', '100%' )">FoXMLEvent getStartElement()</a>
- </dt>
- <dd>
- A series of <span class = "codefrag"
- >getStartElement</span > methods provide for
- discarding input until a StartElement event of the
- appropriate type occurs. This event is returned.
- This series of methods includes some which accept a
- list of Element specifiers.
- </dd>
- </dl>
- </dd>
-
- <dt>
- <span id = "span05" /><a href="javascript:toggleCode(
- 'span05',
- 'SyncedFoXmlEventsBuffer.html#expectTypedEvent', '400',
- '100%' )">FoXMLEvent expectTypedEvent()</a>
- </dt>
-
- <dd>
- A series of these methods provide for the recovery only
- of events of a particular event type, and possibly other
- specific characteristics. <em>Expect</em> methods throw
- an exception on input which does not meet the
- requirements. <em>Expect</em> methods generally take a
- <span class = "codefrag" >boolean</span> argument
- specifying whitespace treatment. Examples include:
- <dl>
- <dt>
- <span id = "span050" /><a
- href="javascript:toggleCode( 'span050',
- 'SyncedFoXmlEventsBuffer.html#expectEndDocument',
- '400', '100%' )">FoXMLEvent expectEndDocument()</a>
- </dt>
- <dd>
- Expect an EndDocument event. Return this event.
- </dd>
- <dt>
- <span id = "span051" /><a
- href="javascript:toggleCode( 'span051',
- 'SyncedFoXmlEventsBuffer.html#expectStartElement',
- '400', '100%' )">FoXMLEvent expectStartElement()</a>
- </dt>
- <dd>
- A series of <span class = "codefrag"
- >expectStartElement</span > methods provide for
- examining the pending input for a StartElement
- event of the appropriate type. This event is
- returned. This series of methods includes some
- which accept a list of Element specifiers.
- </dd>
- </dl>
- </dd>
- </dl>
- </div>
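- <p>
- The difference between the <em>get</em> and <em>expect</em>
- families can be sketched with a toy queue of string events. The
- class <span class="codefrag">GetExpectSketch</span> and its method
- signatures are hypothetical and much simpler than those of
- <span class="codefrag">SyncedFoXmlEventsBuffer</span>; in
- particular, leaving a rejected event in the queue is an assumption
- made for this illustration.
- </p>

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical illustration class; not the SyncedFoXmlEventsBuffer API.
public class GetExpectSketch {
    private final Deque<String> events;

    public GetExpectSketch(List<String> events) {
        this.events = new ArrayDeque<>(events);
    }

    /** "Get": discard events until a matching one turns up, then return it. */
    public String get(String wanted) {
        while (!events.isEmpty()) {
            String event = events.removeFirst();
            if (event.equals(wanted)) {
                return event;
            }
        }
        throw new NoSuchElementException(wanted + " not found");
    }

    /** "Expect": the next event must match; otherwise it stays queued and we throw. */
    public String expect(String wanted) {
        String next = events.peekFirst();
        if (next == null || !next.equals(wanted)) {
            throw new IllegalStateException("expected " + wanted + " but found " + next);
        }
        return events.removeFirst();
    }

    /** Number of events still waiting in the queue. */
    public int pending() {
        return events.size();
    }
}
```

- <p>
- A <em>get</em> call suits skipping over input that is of no
- interest, while an <em>expect</em> call doubles as a validation
- step: the exception signals that the input does not have the
- structure the tree builder requires at that point.
- </p>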
- <a name="N102FE"></a><a name="FOP+modularisation"></a>
- <h4>FOP modularisation</h4>
- <div style="margin-left: 0 ; border: 2px">
- <p>
- This same principle can be extended to the other major
- sub-systems of FOP processing. In each case, while it is
- possible to hold a complete intermediate result in memory,
- the memory costs of that approach are too high. The
- sub-systems - XML parsing, FO tree construction, Area tree
- construction and rendering - must run in parallel if the
- footprint is to be kept manageable. By creating a series of
- producer-consumer pairs linked by synchronized buffers,
- logical isolation can be achieved while rates of processing
- remain coupled. By introducing feedback loops conveying
- information about the completion of processing of the
- elements, sub-systems can dispose of or precis those
- elements without having to be tightly coupled to downstream
- processes.
- <br/>
- <br/>
-
- <strong>Figure 3</strong>
-
- </p>
- <div align="center">
- <img class="figure" alt="FOP modularisation"
- src="images/design/alt.design/processPlumbing.png" />
- </div>
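- <p>
- As a rough sketch of such a chain, the following links
- stand-in stages with single-slot buffers, so that no stage can
- run far ahead of the one downstream. The class
- <span class="codefrag">PipelineSketch</span>, the stage labels and
- the end marker are all hypothetical, each stage just wraps its
- input in a label, and the feedback loops are not modelled.
- </p>

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical illustration class; stages are placeholders for real sub-systems.
public class PipelineSketch {
    private static final String END = "<end>"; // end-of-stream marker (assumption)

    /** Starts a stage that consumes from 'in', produces to 'out', and forwards END. */
    static Thread stage(final String label, final BlockingQueue<String> in,
                        final BlockingQueue<String> out) {
        Thread t = new Thread(() -> {
            try {
                String item;
                while (!(item = in.take()).equals(END)) {
                    out.put(label + "(" + item + ")"); // blocks while 'out' is full
                }
                out.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        t.start();
        return t;
    }

    /** Runs parser -> FO tree -> Area tree stages and collects what reaches the end. */
    public static List<String> run(final List<String> input) throws InterruptedException {
        final BlockingQueue<String> parsed = new ArrayBlockingQueue<>(1);
        BlockingQueue<String> foTree = new ArrayBlockingQueue<>(1);
        BlockingQueue<String> areas = new ArrayBlockingQueue<>(1);

        Thread fo = stage("fo", parsed, foTree);
        Thread area = stage("area", foTree, areas);
        Thread source = new Thread(() -> {
            try {
                for (String s : input) {
                    parsed.put(s);
                }
                parsed.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        source.start();

        // The calling thread plays the role of the final consumer (the renderer).
        List<String> rendered = new ArrayList<>();
        String item;
        while (!(item = areas.take()).equals(END)) {
            rendered.add(item);
        }
        source.join();
        fo.join();
        area.join();
        return rendered;
    }
}
```

- <p>
- With a capacity of one per buffer, at most a handful of items are
- in flight at any time, which is the sense in which the stages are
- logically isolated while their rates remain coupled.
- </p>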
-
- <p>
- In the case of communication between the FO tree
- building process and the layout process, feedback is
- required in order to parse expressions containing
- lengths expressed as a percentage of some enclosing
- area. This communication is incorporated within the
- general model of inter-phase communication discussed above.
- <br/><br/>
- <strong>Figure 4</strong>
-
- </p>
- <div align="center">
- <img class="figure" alt="FO - layout interaction"
- src="images/design/alt.design/fo-layout-interaction.png" />
- </div>
-
-
- </div>
- </div>
-
- </div>
- </body>
-</html>