--- /dev/null
+<?xml version="1.0"?>
+<html>
+ <body text="#000000" bgcolor="#FFFFFF">
+ <script type="text/javascript" src="codedisplay.js" />
+ <div class="content">
+ <h1>Implementing Pull Parsing</h1>
+ <p>
+ <font size="-2">by Peter B. West</font>
+ </p>
+ <ul class="minitoc">
+ <li>
+ <a href="#An+alternative+parsing+methodology">An alternative
+ parsing methodology</a>
+ <ul class="minitoc">
+ <li>
+ <a href="#Structure+of+SAX+parsing">Structure of SAX parsing</a>
+ </li>
+ <li>
+ <a href="#Cluttered+callbacks">Cluttered callbacks</a>
+ </li>
+ <li>
+ <a href="#From+">From push to pull parsing</a>
+ </li>
+ <li>
+ <a href="#FoXMLEvent+methods">FoXMLEvent methods</a>
+ </li>
+ <li>
+ <a href="#FOP+modularisation">FOP modularisation</a>
+ </li>
+ </ul>
+ </li>
+ </ul>
+
+ <a name="N101C5"></a><a name="An+alternative+parsing+methodology"></a>
+ <h3>An alternative parsing methodology</h3>
+ <div style="margin-left: 0 ; border: 2px">
+ <p>
+ This note proposes an alternative method of integrating the
+ output of the SAX parsing of the Flow Object (FO) tree into
+ FOP processing. The purpose of the proposed changes is to
+ provide for:
+ </p>
+ <ul>
+
+ <li>
+ better decomposition of FOP into processing phases
+ </li>
+
+ <li>
+ top-down FO tree building, providing integrated
+ validation of FO tree input.
+ </li>
+
+ </ul>
+ <a name="N101DA"></a><a name="Structure+of+SAX+parsing"></a>
+ <h4>Structure of SAX parsing</h4>
+ <div style="margin-left: 0 ; border: 2px">
+ <p>
+ Figure 1 is a schematic representation of the process of
+ SAX parsing of an input source. SAX parsing involves the
+ registration, with an object implementing the <span
+ class="codefrag">XMLReader</span> interface, of a <span
+ class="codefrag">ContentHandler</span> which contains a
+ callback routine for each of the event types encountered
+ by the parser, e.g., <span
+ class="codefrag">startDocument()</span>, <span
+ class="codefrag">startElement()</span>, <span
+ class="codefrag">characters()</span>, <span
+ class="codefrag">endElement()</span> and <span
+ class="codefrag">endDocument()</span>. Parsing is
+ initiated by a call to the <span
+ class="codefrag">parse()</span> method of the <span
+ class="codefrag">XMLReader</span>. Note that the call to
+ <span class="codefrag">parse()</span> and the calls to
+ individual callback methods are synchronous: <span
+ class="codefrag">parse()</span> will only return when the
+ last callback method returns, and each callback must
+ complete before the next is called.<br/> <br/>
+
+ <strong>Figure 1</strong>
+
+ </p>
+ <div align="center">
+ <img class="figure" alt="SAX parsing schematic"
+ src="images/design/alt.design/SAXParsing.png" /></div>
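+ <p>
+ The registration and callback sequence described above can be
+ sketched as follows. This is a minimal illustration using the
+ standard JAXP and SAX classes, not code taken from FOP: the
+ class <span class="codefrag">SaxPushSketch</span> and its
+ <span class="codefrag">RecordingHandler</span> are hypothetical
+ names, and the handler merely records which callbacks the
+ parser makes and in what order.
+ </p>

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration; not part of FOP.
class SaxPushSketch {

    /** Records each callback in the order in which the parser makes it. */
    static class RecordingHandler extends DefaultHandler {
        final List<String> calls = new ArrayList<>();

        @Override public void startDocument() { calls.add("startDocument"); }

        @Override public void startElement(String uri, String localName,
                                           String qName, Attributes atts) {
            calls.add("startElement:" + qName);
        }

        @Override public void characters(char[] ch, int start, int length) {
            calls.add("characters");
        }

        @Override public void endElement(String uri, String localName, String qName) {
            calls.add("endElement:" + qName);
        }

        @Override public void endDocument() { calls.add("endDocument"); }
    }

    /** Registers the handler, then parses; parse() returns after the last callback. */
    static List<String> parse(String xml) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        RecordingHandler handler = new RecordingHandler();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(xml)));
        return handler.calls;
    }
}
```

+ <p>
+ Because <span class="codefrag">parse()</span> is synchronous,
+ the list returned by the sketch is complete as soon as the
+ call returns; nothing in the API itself pairs a start
+ callback with its matching end callback.
+ </p>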
+ <p>
+ In the process of parsing, the hierarchical structure of the
+ original FO tree is flattened into a number of streams of
+ events of the same type which are reported in the sequence
+ in which they are encountered. Apart from that, the API
+ imposes no structure or constraint which expresses the
+ relationship between, e.g., a startElement event and the
+ endElement event for the same element. To the extent that
+ such relationship information is required, it must be
+ managed by the callback routines.
+ </p>
+ <p>
+ The most direct approach here is to build the tree
+ "invisibly"; to bury within the callback routines the
+ necessary code to construct the tree. In the simplest
+ case, the whole of the FO tree is built within the call
+ to <span class="codefrag">parse()</span>, and that
+ in-memory tree is subsequently processed to (a) validate
+ the FO structure, and (b) construct the Area tree. The
+ problem with this approach is the potential size of the
+ FO tree in memory. FOP has suffered from this problem
+ in the past.
+ </p>
+ </div>
+ <a name="N10218"></a><a name="Cluttered+callbacks"></a>
+ <h4>Cluttered callbacks</h4>
+ <div style="margin-left: 0 ; border: 2px">
+ <p>
+ On the other hand, the callback code may become
+ increasingly complex as tree validation and the triggering
+ of the Area tree processing and subsequent rendering are
+ moved into the callbacks, typically the <span
+ class="codefrag">endElement()</span> method. In order to
+ overcome acute memory problems, the FOP code was recently
+ modified in this way, to trigger Area tree building and
+ rendering in the <span
+ class="codefrag">endElement()</span> method, when the end
+ of a page-sequence was detected.
+ </p>
+ <p>
+ The drawback with such a method is that it becomes difficult
+ to determine the order of events and the circumstances in
+ which any particular processing events are triggered. When
+ the processing events are inherently self-contained, this is
+ irrelevant. But the more complex and context-dependent the
+ relationships are among the processing elements, the more
+ obscurity is engendered in the code by such "side-effect"
+ processing.
+ </p>
+ </div>
+ <a name="N1022B"></a><a name="From+"></a>
+ <h4>From push to pull parsing</h4>
+ <div style="margin-left: 0 ; border: 2px">
+ <p>
+ In order to solve the simultaneous problems of exposing
+ the structure of the processing and minimising in-memory
+ requirements, the experimental code separates the
+ parsing of the input source from the building of the FO
+ tree and all downstream processing. The callback
+ routines become minimal, consisting of the creation and
+ buffering of <span class="codefrag">XMLEvent</span>
+ objects as a <em>producer</em>. All of these objects
+ are effectively merged into a single event stream, in
+ strict event order, for subsequent access by the FO tree
+ building process, acting as a <em>consumer</em>. This,
+ essentially, is the difference between <em>push</em> and
+ <em>pull</em> parsing. In itself, this does not reduce
+ the memory footprint; that reduction comes when the
+ approach is generalised to modularise FOP processing.<br/> <br/>
+ <strong>Figure 2</strong>
+
+ </p>
+ <div align="center">
+ <img class="figure" alt="XML event buffer"
+ src="images/design/alt.design/pull-parsing.png" /></div>
+ <p>
+ The most useful change that this brings about is the switch
+ from <em>passive</em> to <em>active</em> XML element
+ processing. The process of parsing now becomes visible to
+ the controlling process. All local validation requirements,
+ all object and data structure building, are initiated by the
+ process(es) <em>get</em>ting from the queue - in the case
+ above, the FO tree builder.
+ </p>
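+ <p>
+ The producer and consumer roles can be sketched with a
+ bounded queue from <span class="codefrag">java.util.concurrent</span>.
+ This is an assumption-laden illustration rather than the
+ experimental code itself: <span class="codefrag">PullBufferSketch</span>,
+ its <span class="codefrag">Event</span> class and the buffer
+ capacity of 4 are hypothetical, and a standard
+ <span class="codefrag">ArrayBlockingQueue</span> stands in for
+ the synchronised buffer described in this note.
+ </p>

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration; not the experimental FOP code.
class PullBufferSketch {
    enum Type { START_DOCUMENT, START_ELEMENT, CHARACTERS, END_ELEMENT, END_DOCUMENT }

    /** Much-reduced stand-in for an XML event object. */
    static final class Event {
        final Type type;
        final String name;   // element name, or null for non-element events
        Event(Type type, String name) { this.type = type; this.name = name; }
        @Override public String toString() {
            return name == null ? type.name() : type + ":" + name;
        }
    }

    /** Producer: every callback only wraps the SAX event and buffers it. */
    static final class Producer extends DefaultHandler {
        private final BlockingQueue<Event> buffer;
        Producer(BlockingQueue<Event> buffer) { this.buffer = buffer; }

        private void enqueue(Type type, String name) throws SAXException {
            try {
                buffer.put(new Event(type, name));   // blocks while the buffer is full
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new SAXException(e);
            }
        }
        @Override public void startDocument() throws SAXException {
            enqueue(Type.START_DOCUMENT, null);
        }
        @Override public void startElement(String uri, String local, String qName,
                                           Attributes atts) throws SAXException {
            enqueue(Type.START_ELEMENT, qName);
        }
        @Override public void characters(char[] ch, int start, int length)
                throws SAXException {
            enqueue(Type.CHARACTERS, null);
        }
        @Override public void endElement(String uri, String local, String qName)
                throws SAXException {
            enqueue(Type.END_ELEMENT, qName);
        }
        @Override public void endDocument() throws SAXException {
            enqueue(Type.END_DOCUMENT, null);
        }
    }

    /** Consumer: runs the parser in its own thread and pulls events in order. */
    static List<String> pullAll(String xml) throws Exception {
        BlockingQueue<Event> buffer = new ArrayBlockingQueue<>(4);
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setContentHandler(new Producer(buffer));
        Thread parserThread = new Thread(() -> {
            try {
                reader.parse(new InputSource(new StringReader(xml)));
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        });
        parserThread.setDaemon(true);
        parserThread.start();

        List<String> pulled = new ArrayList<>();
        Event event;
        do {
            // Time out rather than wait forever if the parser thread has failed.
            event = buffer.poll(5, TimeUnit.SECONDS);
            if (event == null) {
                throw new IllegalStateException("no event arrived; did parsing fail?");
            }
            pulled.add(event.toString());
        } while (event.type != Type.END_DOCUMENT);
        parserThread.join();
        return pulled;
    }
}
```

+ <p>
+ With a bounded buffer the parser thread blocks whenever the
+ consumer falls behind, so the number of events held in memory
+ never exceeds the buffer capacity.
+ </p>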
+ </div>
+ <a name="N10260"></a><a name="FoXMLEvent+methods"></a>
+ <h4>FoXMLEvent methods</h4>
+ <div style="margin-left: 0 ; border: 2px">
+ <a name="FoXMLEvent-methods"></a>
+ <p>
+ The experimental code uses a class <span id = "span00"
+ /><span class = "codefrag" ><a
+ href="javascript:toggleCode( 'span00',
+ 'FoXMLEvent.html#FoXMLEventClass', '400', '100%'
+ )">FoXMLEvent</a></span > to provide the objects which are
+ placed in the queue. <em>FoXMLEvent</em> includes a
+ variety of methods to access elements in the queue.
+ Namespace URIs encountered in parsing are maintained in an
+ <span id = "span01" /><span class="codefrag"><a
+ href="javascript:toggleCode( 'span01',
+ 'XMLNamespaces.html#XMLNamespacesClass', '400', '100%'
+ )">XMLNamespaces</a></span> object where they are
+ associated with a unique integer index. This integer
+ value is used in the signature of some of the access
+ methods.
+ </p>
+ <p>
+ The class which manages the buffer is <span id = "span02"
+ /><span class = "codefrag" ><a href =
+ "javascript:toggleCode( 'span02',
+ 'SyncedFoXmlEventsBuffer.html#SyncedFoXmlEventsBufferClass',
+ '400', '100%' )" >SyncedFoXmlEventsBuffer</a>.</span >
+ </p>
+ <dl>
+
+ <dt>
+ <span id = "span03" /><a href="javascript:toggleCode(
+ 'span03', 'SyncedFoXmlEventsBuffer.html#getEvent',
+ '400', '100%' )">FoXMLEvent
+ getEvent(SyncedCircularBuffer events)</a>
+ </dt>
+
+ <dd>
+ This is the basis of all of the queue access methods. It
+ returns the next element from the queue, which may be a
+ pushback element.
+ </dd>
+
+ <dt>
+ <span id = "span04" /><a href="javascript:toggleCode(
+ 'span04', 'SyncedFoXmlEventsBuffer.html#getTypedEvent',
+ '400', '100%' )">FoXMLEvent getTypedEvent()</a>
+ </dt>
+
+ <dd>
+ A series of these methods provide for the recovery only
+ of events of a particular event type, and possibly other
+ specific characteristics. <em>Get</em> methods discard
+ input which does not meet the requirements. E.g.
+ <dl>
+ <dt>
+ <span id = "span040" /><a
+ href="javascript:toggleCode( 'span040',
+ 'SyncedFoXmlEventsBuffer.html#getEndDocument',
+ '400', '100%' )">FoXMLEvent getEndDocument()</a>
+ </dt>
+ <dd>
+ Discard input until an EndDocument event occurs.
+ Return this event.
+ </dd>
+ <dt>
+ <span id = "span041" /><a
+ href="javascript:toggleCode( 'span041',
+ 'SyncedFoXmlEventsBuffer.html#getStartElement',
+ '400', '100%' )">FoXMLEvent getStartElement()</a>
+ </dt>
+ <dd>
+ A series of <span class = "codefrag"
+ >getStartElement</span > methods provide for
+ discarding input until a StartElement event of the
+ appropriate type occurs. This event is returned.
+ This series of methods includes some which accept a
+ list of Element specifiers.
+ </dd>
+ </dl>
+ </dd>
+
+ <dt>
+ <span id = "span05" /><a href="javascript:toggleCode(
+ 'span05',
+ 'SyncedFoXmlEventsBuffer.html#expectTypedEvent', '400',
+ '100%' )">FoXMLEvent expectTypedEvent()</a>
+ </dt>
+
+ <dd>
+ A series of these methods provide for the recovery only
+ of events of a particular event type, and possibly other
+ specific characteristics. <em>Expect</em> methods throw
+ an exception on input which does not meet the
+ requirements. <em>Expect</em> methods generally take a
+ <span class = "codefrag" >boolean</span> argument
+ specifying whitespace treatment. Examples include:
+ <dl>
+ <dt>
+ <span id = "span050" /><a
+ href="javascript:toggleCode( 'span050',
+ 'SyncedFoXmlEventsBuffer.html#expectEndDocument',
+ '400', '100%' )">FoXMLEvent expectEndDocument()</a>
+ </dt>
+ <dd>
+ Expect an EndDocument event. Return this event.
+ </dd>
+ <dt>
+ <span id = "span051" /><a
+ href="javascript:toggleCode( 'span051',
+ 'SyncedFoXmlEventsBuffer.html#expectStartElement',
+ '400', '100%' )">FoXMLEvent expectStartElement()</a>
+ </dt>
+ <dd>
+ A series of <span class = "codefrag"
+ >expectStartElement</span > methods provide for
+ examining the pending input for a StartElement
+ event of the appropriate type. This event is
+ returned. This series of methods includes some
+ which accept a list of Element specifiers.
+ </dd>
+ </dl>
+ </dd>
+ </dl>
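+ <p>
+ The difference between the <em>get</em> and <em>expect</em>
+ families can be sketched with a small single-threaded queue.
+ Everything here is a hypothetical simplification and not the
+ <span class="codefrag">SyncedFoXmlEventsBuffer</span> API:
+ the class names, the string-based element matching and the
+ choice to push a rejected event back before throwing are
+ assumptions, and the whitespace-handling
+ <span class="codefrag">boolean</span> argument is omitted.
+ </p>

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.NoSuchElementException;

// Hypothetical illustration; not the SyncedFoXmlEventsBuffer API.
class GetExpectSketch {
    enum Type { START_ELEMENT, CHARACTERS, END_ELEMENT, END_DOCUMENT }

    static final class Event {
        final Type type;
        final String name;   // element name, or null for non-element events
        Event(Type type, String name) { this.type = type; this.name = name; }
        @Override public String toString() {
            return name == null ? type.name() : type + ":" + name;
        }
    }

    /** Single-threaded stand-in for the synchronised event buffer. */
    static final class EventQueue {
        private final Deque<Event> events = new ArrayDeque<>();

        EventQueue(Event... initial) { events.addAll(Arrays.asList(initial)); }

        int size() { return events.size(); }

        /** Returns the next event; a pushed-back event is returned first. */
        Event getEvent() {
            Event e = events.pollFirst();
            if (e == null) throw new NoSuchElementException("event queue exhausted");
            return e;
        }

        void pushBack(Event e) { events.addFirst(e); }

        /** "get": discards input until a matching start element is found. */
        Event getStartElement(String name) {
            while (true) {
                Event e = getEvent();
                if (e.type == Type.START_ELEMENT && e.name.equals(name)) return e;
            }
        }

        /** "expect": the very next event must match, otherwise an exception is thrown. */
        Event expectStartElement(String name) {
            Event e = getEvent();
            if (e.type == Type.START_ELEMENT && e.name.equals(name)) return e;
            pushBack(e);   // assumption: leave the offending event in the queue
            throw new IllegalStateException(
                "expected start of " + name + " but found " + e);
        }
    }
}
```

+ <p>
+ In this sketch a <em>get</em> call is the tolerant form,
+ skipping whatever precedes the wanted event, while an
+ <em>expect</em> call acts as a validation step: a document
+ whose next event is not the required one is rejected at that
+ point.
+ </p>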
+ </div>
+ <a name="N102FE"></a><a name="FOP+modularisation"></a>
+ <h4>FOP modularisation</h4>
+ <div style="margin-left: 0 ; border: 2px">
+ <p>
+ This same principle can be extended to the other major
+ sub-systems of FOP processing. In each case, while it is
+ possible to hold a complete intermediate result in memory,
+ the memory costs of that approach are too high. The
+ sub-systems - XML parsing, FO tree construction, Area tree
+ construction and rendering - must run in parallel if the
+ footprint is to be kept manageable. By creating a series of
+ producer-consumer pairs linked by synchronized buffers,
+ logical isolation can be achieved while rates of processing
+ remain coupled. By introducing feedback loops conveying
+ information about the completion of processing of the
+ elements, sub-systems can dispose of or summarise those
+ elements without having to be tightly coupled to downstream
+ processes.
+ <br/>
+ <br/>
+
+ <strong>Figure 3</strong>
+
+ </p>
+ <div align="center">
+ <img class="figure" alt="FOP modularisation"
+ src="images/design/alt.design/processPlumbing.png" />
+ </div>
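+ <p>
+ A chain of producer-consumer pairs of this kind can be
+ sketched with one bounded queue between each pair of stages.
+ The sketch is illustrative only: <span class="codefrag">PipelineSketch</span>,
+ the string payloads, the end-of-stream marker and the queue
+ capacity of 2 are all assumptions, and the feedback loops
+ mentioned above are not modelled.
+ </p>

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Function;

// Hypothetical illustration of stages linked by bounded buffers; not FOP code.
class PipelineSketch {
    private static final String END = "<end-of-stream>";

    /** Starts a stage that consumes from 'in', transforms, and produces to 'out'. */
    static void stage(BlockingQueue<String> in, BlockingQueue<String> out,
                      Function<String, String> work) {
        Thread t = new Thread(() -> {
            try {
                for (String item = in.take(); !END.equals(item); item = in.take()) {
                    out.put(work.apply(item));   // blocks while downstream is full
                }
                out.put(END);                    // pass the marker along
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        t.setDaemon(true);
        t.start();
    }

    /** Runs parser -> FO tree builder -> layout, collecting results as the renderer. */
    static List<String> run(List<String> pageSequences) throws InterruptedException {
        BlockingQueue<String> xmlEvents = new ArrayBlockingQueue<>(2);
        BlockingQueue<String> foNodes = new ArrayBlockingQueue<>(2);
        BlockingQueue<String> areas = new ArrayBlockingQueue<>(2);

        stage(xmlEvents, foNodes, e -> "fo(" + e + ")");     // FO tree construction
        stage(foNodes, areas, n -> "area(" + n + ")");       // Area tree construction

        Thread parser = new Thread(() -> {                   // XML parsing
            try {
                for (String p : pageSequences) xmlEvents.put(p);
                xmlEvents.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        parser.setDaemon(true);
        parser.start();

        List<String> rendered = new ArrayList<>();           // rendering
        for (String a = areas.take(); !END.equals(a); a = areas.take()) {
            rendered.add(a);
        }
        return rendered;
    }
}
```

+ <p>
+ Each stage knows only its input and output queues, which gives
+ the logical isolation described above, while the bounded
+ capacities keep the rates of the stages coupled: a fast
+ upstream stage blocks until the stage below it catches up.
+ </p>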
+
+ <p>
+ In the case of communication between the FO tree
+ building process and the layout process, feedback is
+ required in order to parse expressions containing
+ lengths expressed as a percentage of some enclosing
+ area. This communication is incorporated within the
+ general model of inter-phase communication discussed above.
+ <br/><br/>
+ <strong>Figure 4</strong>
+
+ </p>
+ <div align="center">
+ <img class="figure" alt="FO - layout interaction"
+ src="images/design/alt.design/fo-layout-interaction.png" />
+ </div>
+
+
+ </div>
+ </div>
+
+ </div>
+ </body>
+</html>