src/documentation/content/xdocs/design/alt.design/xml-parsing.ehtml


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366

<?xml version="1.0"?>
<!--
  Copyright 1999-2004 The Apache Software Foundation

  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
-->
<!-- $Id$ -->
<html>
  <body text="#000000" bgcolor="#FFFFFF">
    <script type="text/javascript" src="codedisplay.js" />
    <div class="content">
      <h1>Implementing Pull Parsing</h1>
      <p>
        <font size="-2">by Peter B. West</font>
      </p>
      <ul class="minitoc">
        <li>
          <a href="#An+alternative+parsing+methodology">An alternative
            parsing methodology</a>
          <ul class="minitoc">
            <li>
              <a href="#Structure+of+SAX+parsing">Structure of SAX parsing</a>
            </li>
            <li>
              <a href="#Cluttered+callbacks">Cluttered callbacks</a>
            </li>
            <li>
              <a href="#From+">From push to pull parsing</a>
            </li>
            <li>
              <a href="#FoXMLEvent+me%5Bthods">FoXMLEvent me[thods</a>
            </li>
            <li>
              <a href="#FOP+modularisation">FOP modularisation</a>
            </li>
          </ul>
        </li>
      </ul>

      <a name="N101C5"></a><a name="An+alternative+parsing+methodology"></a>
      <h3>An alternative parsing methodology</h3>
      <div style="margin-left: 0 ; border: 2px">
        <p>
          This note proposes an alternative method of integrating the
          output of the SAX parsing of the Flow Object (FO) tree into
          FOP processing.  The pupose of the proposed changes is to
          provide for:
        </p>
        <ul>

          <li>
            better decomposition of FOP into processing phases
          </li>

          <li>
            top-down FO tree building, providing
          </li>

          <li>
            integrated validation of FO tree input.
          </li>

        </ul>
        <a name="N101DA"></a><a name="Structure+of+SAX+parsing"></a>
        <h4>Structure of SAX parsing</h4>
        <div style="margin-left: 0 ; border: 2px">
          <p>
            Figure 1 is a schematic representation of the process of
            SAX parsing of an input source.  SAX parsing involves the
            registration, with an object implementing the <span
            class="codefrag">XMLReader</span> interface, of a <span
            class="codefrag">ContentHandler</span> which contains a
            callback routine for each of the event types encountered
            by the parser, e.g., <span
            class="codefrag">startDocument()</span>, <span
            class="codefrag">startElement()</span>, <span
            class="codefrag">characters()</span>, <span
            class="codefrag">endElement()</span> and <span
            class="codefrag">endDocument()</span>.  Parsing is
            initiated by a call to the <span
            class="codefrag">parser()</span> method of the <span
            class="codefrag">XMLReader</span>.  Note that the call to
            <span class="codefrag">parser()</span> and the calls to
            individual callback methods are synchronous: <span
            class="codefrag">parser()</span> will only return when the
            last callback method returns, and each callback must
            complete before the next is called.<br/> <br/>

            <strong>Figure 1</strong>

          </p>
          <div align="center">
            <img class="figure" alt="SAX parsing schematic"
                 src="images/design/alt.design/SAXParsing.png" /></div>
          <p>
            In the process of parsing, the hierarchical structure of the
            original FO tree is flattened into a number of streams of
            events of the same type which are reported in the sequence
            in which they are encountered.  Apart from that, the API
            imposes no structure or constraint which expresses the
            relationship between, e.g., a startElement event and the
            endElement event for the same element.  To the extent that
            such relationship information is required, it must be
            managed by the callback routines.
          </p>
          <p>
            The most direct approach here is to build the tree
            "invisibly"; to bury within the callback routines the
            necessary code to construct the tree.  In the simplest
            case, the whole of the FO tree is built within the call
            to <span class="codefrag">parser()</span>, and that
            in-memory tree is subsequently processed to (a) validate
            the FO structure, and (b) construct the Area tree.  The
            problem with this approach is the potential size of the
            FO tree in memory.  FOP has suffered from this problem
            in the past.
          </p>
        </div>
        <a name="N10218"></a><a name="Cluttered+callbacks"></a>
        <h4>Cluttered callbacks</h4>
        <div style="margin-left: 0 ; border: 2px">
          <p>
            On the other hand, the callback code may become
            increasingly complex as tree validation and the triggering
            of the Area tree processing and subsequent rendering is
            moved into the callbacks, typically the <span
            class="codefrag">endElement()</span> method.  In order to
            overcome acute memory problems, the FOP code was recently
            modified in this way, to trigger Area tree building and
            rendering in the <span
            class="codefrag">endElement()</span> method, when the end
            of a page-sequence was detected.
          </p>
          <p>
            The drawback with such a method is that it becomes difficult
            to detemine the order of events and the circumstances in
            which any particular processing events are triggered.  When
            the processing events are inherently self-contained, this is
            irrelevant.  But the more complex and context-dependent the
            relationships are among the processing elements, the more
            obscurity is engendered in the code by such "side-effect"
            processing.
          </p>
        </div>
        <a name="N1022B"></a><a name="From+"></a>
        <h4>From push to pull parsing</h4>
        <div style="margin-left: 0 ; border: 2px">
          <p>
            In order to solve the simultaneous problems of exposing
            the structure of the processing and minimising in-memory
            requirements, the experimental code separates the
            parsing of the input source from the building of the FO
            tree and all downstream processing.  The callback
            routines become minimal, consisting of the creation and
            buffering of <span class="codefrag">XMLEvent</span>
            objects as a <em>producer</em>.  All of these objects
            are effectively merged into a single event stream, in
            strict event order, for subsequent access by the FO tree
            building process, acting as a <em>consumer</em>.  This,
            essentially, is the difference between <em>push</em> and
            <em>pull</em> parsing.  In itself, this does not reduce
            the footprint.  This occurs when the approach is
            generalised to modularise FOP processing.<br/> <br/>
            <strong>Figure 2</strong>

          </p>
          <div align="center">
            <img class="figure" alt="XML event buffer"
                 src="images/design/alt.design/pull-parsing.png" /></div>
          <p>
            The most useful change that this brings about is the switch
            from <em>passive</em> to <em>active</em> XML element
            processing.  The process of parsing now becomes visible to
            the controlling process.  All local validation requirements,
            all object and data structure building, are initiated by the
            process(es) <em>get</em>ting from the queue - in the case
            above, the FO tree builder.
          </p>
        </div>
        <a name="N10260"></a><a name="FoXMLEvent+methods"></a>
        <h4>FoXMLEvent methods</h4>
        <div style="margin-left: 0 ; border: 2px">
          <a name="FoXMLEvent-methods"></a>
          <p>
            The experimental code uses a class <span id = "span00"
            /><span class = "codefrag" ><a
            href="javascript:toggleCode( 'span00',
            'FoXMLEvent.html#FoXMLEventClass', '400', '100%'
            )">FoXMLEvent</a></span > to provide the objects which are
            placed in the queue.  <em>FoXMLEvent</em> includes a
            variety of methods to access elements in the queue.
            Namespace URIs encountered in parsing are maintained in an
            <span id = "span01" /><span class="codefrag"><a
            href="javascript:toggleCode( 'span01',
            'XMLNamespaces.html#XMLNamespacesClass', '400', '100%'
            )">XMLNamespaces</a></span> object where they are
            associated with a unique integer index.  This integer
            value is used in the signature of some of the access
            methods.
          </p>
          <p>
            The class which manages the buffer is <span id = "span02"
            /><span class = "codefrag" ><a href =
            "javascript:toggleCode( 'span02',
            'SyncedFoXmlEventsBuffer.html#SyncedFoXmlEventsBufferClass',
            '400', '100%' )" >SyncedFoXmlEventsBuffer</a>.</span >
          </p>
          <dl>

            <dt>
              <span id = "span03" /><a href="javascript:toggleCode(
              'span03', 'SyncedFoXmlEventsBuffer.html#getEvent',
              '400', '100%' )">FoXMLEvent
              getEvent(SyncedCircularBuffer events)</a>
            </dt>

            <dd>
              This is the basis of all of the queue access methods.  It
              returns the next element from the queue, which may be a
              pushback element.
            </dd>

            <dt>
              <span id = "span04" /><a href="javascript:toggleCode(
              'span04', 'SyncedFoXmlEventsBuffer.html#getTypedEvent',
              '400', '100%' )">FoXMLEvent getTypedEvent()</a>
            </dt>

            <dd>
              A series of these methods provide for the recovery only
              of events of a particular event type, and possibly other
              specific characteristics.  <em>Get</em> methods discard
              input which does not meet the requirements.  E.g.
              <dl>
                <dt>
                  <span id = "span040" /><a
                  href="javascript:toggleCode( 'span040',
                  'SyncedFoXmlEventsBuffer.html#getEndDocument',
                  '400', '100%' )">FoXMLEvent getEndDocument()</a>
                </dt>
                <dd>
                  Discard input until and EndDocument event occurs.
                  Return this event.
                </dd>
                <dt>
                  <span id = "span041" /><a
                  href="javascript:toggleCode( 'span041',
                  'SyncedFoXmlEventsBuffer.html#getStartElement',
                  '400', '100%' )">FoXMLEvent getStartElement()</a>
                </dt>
                <dd>
                  A series of <span class = "codefrag"
                  >getStartElement</span > methods provide for
                  discarding input until a StartElement event of the
                  appropriate type occurs.  This event is returned.
                  This series of methods includes some which accept a
                  list of Element specifiers.
                </dd>
              </dl>
            </dd>

            <dt>
              <span id = "span05" /><a href="javascript:toggleCode(
              'span05',
              'SyncedFoXmlEventsBuffer.html#expectTypedEvent', '400',
              '100%' )">FoXMLEvent expectTypedEvent()</a>
            </dt>

            <dd>
              A series of these methods provide for the recovery only
              of events of a particular event type, and possibly other
              specific characteristics.  <em>Expect</em> methods throw
              an exception on input which does not meet the
              requirements.  <em>Expect</em> methods generally take a
              <span class = "codefrag" >boolean</span> argument
              specifying whitespace treatment.  Examples include:
              <dl>
                <dt>
                  <span id = "span050" /><a
                  href="javascript:toggleCode( 'span050',
                  'SyncedFoXmlEventsBuffer.html#expectEndDocument',
                  '400', '100%' )">FoXMLEvent expectEndDocument()</a>
                </dt>
                <dd>
                  Expect an EndDocument event. Return this event.
                </dd>
                <dt>
                  <span id = "span051" /><a
                  href="javascript:toggleCode( 'span051',
                  'SyncedFoXmlEventsBuffer.html#expectStartElement',
                  '400', '100%' )">FoXMLEvent expectStartElement()</a>
                </dt>
                <dd>
                  A series of <span class = "codefrag"
                  >expectStartElement</span > methods provide for
                  examinging the pending input for a StartElement
                  event of the appropriate type.  This event is
                  returned.  This series of methods includes some
                  which accept a list of Element specifiers.
                </dd>
              </dl>
            </dd>
          </dl>
        </div>
        <a name="N102FE"></a><a name="FOP+modularisation"></a>
        <h4>FOP modularisation</h4>
        <div style="margin-left: 0 ; border: 2px">
          <p>
            This same principle can be extended to the other major
            sub-systems of FOP processing.  In each case, while it is
            possible to hold a complete intermediate result in memory,
            the memory costs of that approach are too high.  The
            sub-systems - xml parsing, FO tree construction, Area tree
            construction and rendering - must run in parallel if the
            footprint is to be kept manageable.  By creating a series of
            producer-consumer pairs linked by synchronized buffers,
            logical isolation can be achieved while rates of processing
            remain coupled.  By introducing feedback loops conveying
            information about the completion of processing of the
            elements, sub-systems can dispose of or precis those
            elements without having to be tightly coupled to downstream
            processes.
            <br/>
            <br/>

            <strong>Figure 3</strong>

          </p>
          <div align="center">
            <img class="figure" alt="FOP modularisation"
                 src="images/design/alt.design/processPlumbing.png" />
          </div>

          <p>
            In the case of communication between the FO tree
            building process and the layout process, feedback is
            required in order to parse expressions containing
            lengths expressed as a percentage of some enclosing
            area.  This communication is incorporated within the
            general model of inter-phase communication discussed above.
            <br/><br/>
            <strong>Figure 4</strong>

          </p>
          <div align="center">
            <img class="figure" alt="FO - layout interaction"
                 src="images/design/alt.design/fo-layout-interaction.png" />
          </div>


        </div>
      </div>

    </div>
  </body>
</html>