<?xml version="1.0" standalone="no"?>
<!DOCTYPE document PUBLIC "-//APACHE//DTD Documentation V1.1//EN"
    "http://cvs.apache.org/viewcvs.cgi/*checkout*/xml-forrest/src/resources/schema/dtd/document-v11.dtd">

<document>
    <header>
        <title>Integrating XML Parsing</title>
    <authors>
      <person name="Peter B. West" email="pbwest@powerup.com.au"/>
    </authors>
    </header>
    <body>
    <section>
      <title>An alternative parser integration</title>
      <p>
  This note proposes an alternative method of integrating the
  output of the SAX parsing of the Formatting Object (FO) tree into
  FOP processing.  The purpose of the proposed changes is to
  provide for better decomposition of the process of analysing
  and rendering an FO tree such as is represented in the output
  from initial (XSLT) processing of an XML source document.
      </p>
      <section>
        <title>Structure of SAX parsing</title>
  <p>
    Figure 1 is a schematic representation of the process of SAX
    parsing of an input source.  SAX parsing involves the 
    registration, with an object implementing the 
    <code>XMLReader</code> interface, of a
    <code>ContentHandler</code> which contains a callback
    routine for each of the event types encountered by the
    parser, e.g., <code>startDocument()</code>,
    <code>startElement()</code>, <code>characters()</code>,
    <code>endElement()</code> and <code>endDocument()</code>.
    Parsing is initiated by a call to the <code>parse()</code>
    method of the <code>XMLReader</code>.  Note that the call to
    <code>parse()</code> and the calls to individual callback
    methods are synchronous: <code>parse()</code> will only
    return when the last callback method returns, and each
    callback must complete before the next is called.<br/><br/>
    <strong>Figure 1</strong>
  </p>
  <figure src="SAXParsing.png" alt="SAX parsing schematic"/>
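  <p>
    The registration and callback sequence described above can be
    sketched with the standard JAXP and SAX classes.  This is an
    illustration only, not FOP code: the class
    <code>SaxCallbackDemo</code> and its <code>trace()</code>
    method are hypothetical names, and
    <code>DefaultHandler</code> is used as a convenient
    <code>ContentHandler</code> implementation with empty
    callbacks.
  </p>
  <source><![CDATA[
```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; not part of FOP.
class SaxCallbackDemo {

    /** Parses the XML text and returns the callbacks in the order they fired. */
    static List<String> trace(String xml) throws Exception {
        final List<String> log = new ArrayList<>();

        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();

        // Register a ContentHandler; only the events of interest are overridden.
        reader.setContentHandler(new DefaultHandler() {
            @Override public void startDocument() {
                log.add("startDocument");
            }
            @Override public void startElement(String uri, String localName,
                                               String qName, Attributes atts) {
                log.add("startElement " + qName);
            }
            @Override public void characters(char[] ch, int start, int length) {
                log.add("characters " + new String(ch, start, length));
            }
            @Override public void endElement(String uri, String localName,
                                             String qName) {
                log.add("endElement " + qName);
            }
            @Override public void endDocument() {
                log.add("endDocument");
            }
        });

        // parse() is synchronous: all callbacks have returned when it returns.
        reader.parse(new InputSource(new StringReader(xml)));
        return log;
    }
}
```
]]></source>
  <p>
    Because <code>parse()</code> does not return until the last
    callback has completed, the returned list is complete as soon
    as <code>trace()</code> returns.
  </p>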
  <p>
    In the process of parsing, the hierarchical structure of the
    original FO tree is flattened into a number of streams of
    events of the same type which are reported in the sequence
    in which they are encountered.  Apart from that, the API
    imposes no structure or constraint which expresses the
    relationship between, e.g., a startElement event and the
    endElement event for the same element.  To the extent that
    such relationship information is required, it must be
    managed by the callback routines.
  </p>
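  <p>
    As a minimal sketch of what "managed by the callback
    routines" means in practice, the handler below keeps its own
    stack of open elements so that each start event can be
    related to its ancestors.  The class
    <code>ElementStackHandler</code> and the method
    <code>pathsOf()</code> are hypothetical and are not taken
    from the FOP sources.
  </p>
  <source><![CDATA[
```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler; SAX itself does not track element nesting.
class ElementStackHandler extends DefaultHandler {
    // Elements that have been started but not yet ended, root first.
    private final Deque<String> open = new ArrayDeque<>();
    // Slash-separated path of every element, in document order.
    final List<String> paths = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        open.addLast(qName);
        paths.add(String.join("/", open));
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // The parser guarantees well-formedness, so this end event
        // belongs to the most recently opened element.
        open.removeLast();
    }

    static List<String> pathsOf(String xml) throws Exception {
        ElementStackHandler handler = new ElementStackHandler();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
        return handler.paths;
    }
}
```
]]></source>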
  <p>
    The most direct approach here is to build the tree
    "invisibly", burying within the callback routines the
    necessary code to construct the tree.  In the simplest case,
    the whole of the FO tree is built within the call to
    <code>parse()</code>, and that in-memory tree is subsequently
    processed to (a) validate the FO structure, and (b)
    construct the Area tree.  The problem with this approach is
    the potential size of the FO tree in memory.  FOP has
    suffered from this problem in the past.
  </p>
      </section>
      <section>
        <title>Cluttered callbacks</title>
  <p>
    On the other hand, the callback code may become increasingly
    complex as tree validation and the triggering of the Area
    tree processing and subsequent rendering are moved into the
    callbacks, typically the <code>endElement()</code> method.
    In order to overcome acute memory problems, the FOP code was
    recently modified in this way, to trigger Area tree building
    and rendering in the <code>endElement()</code> method, when
    the end of a page-sequence was detected.
  </p>
  <p>
    The drawback with such a method is that it becomes difficult
    to determine the order of events and the circumstances in
    which any particular processing events are triggered.  When
    the processing events are inherently self-contained, this is
    irrelevant.  But the more complex and context-dependent the
    relationships are among the processing elements, the more
    obscurity is engendered in the code by such "side-effect"
    processing.
  </p>
      </section>
      <section>
        <title>From passive to active parsing</title>
  <p>
    In order to solve the simultaneous problems of exposing the
    structure of the processing and minimising in-memory
    requirements, the experimental code separates the parsing of
    the input source from the building of the FO tree and all
    downstream processing.  The callback routines become
    minimal, consisting of the creation and buffering of
    <code>XMLEvent</code> objects as a <em>producer</em>.  All
    of these objects are effectively merged into a single event
    stream, in strict event order, for subsequent access by the
    FO tree building process, acting as a
    <em>consumer</em>.  In itself, this does not reduce the
    memory footprint.  That reduction comes when the approach is
    generalised to modularise FOP processing.<br/><br/> <strong>Figure 2</strong>
  </p>
  <figure src="XML-event-buffer.png" alt="XML event buffer"/>
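  <p>
    The producer and consumer arrangement might look roughly like
    the following sketch.  It makes several assumptions: a
    <code>java.util.concurrent.ArrayBlockingQueue</code> stands in
    for the synchronized buffer, the nested <code>Event</code>
    class is a simplified placeholder for
    <code>XMLEvent</code>, and the names
    <code>EventBufferDemo</code>, <code>Producer</code> and
    <code>consume()</code> are hypothetical.  Error handling is
    reduced to the minimum needed to unblock the consumer.
  </p>
  <source><![CDATA[
```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch; Event is a placeholder, not FOP's XMLEvent.
class EventBufferDemo {
    enum Type { STARTDOCUMENT, STARTELEMENT, CHARACTERS, ENDELEMENT, ENDDOCUMENT }

    static final class Event {
        final Type type;
        final String data;
        Event(Type type, String data) { this.type = type; this.data = data; }
        @Override public String toString() {
            return data == null ? type.name() : type + ":" + data;
        }
    }

    /** Blocks until the buffer has room, then appends the event. */
    static void put(BlockingQueue<Event> events, Type type, String data) {
        try {
            events.put(new Event(type, data));
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while buffering", e);
        }
    }

    /** Producer: each callback only creates and buffers an event. */
    static final class Producer extends DefaultHandler {
        private final BlockingQueue<Event> events;
        Producer(BlockingQueue<Event> events) { this.events = events; }

        @Override public void startDocument() {
            put(events, Type.STARTDOCUMENT, null);
        }
        @Override public void startElement(String uri, String localName,
                                           String qName, Attributes atts) {
            put(events, Type.STARTELEMENT, qName);
        }
        @Override public void characters(char[] ch, int start, int length) {
            put(events, Type.CHARACTERS, new String(ch, start, length));
        }
        @Override public void endElement(String uri, String localName,
                                         String qName) {
            put(events, Type.ENDELEMENT, qName);
        }
        @Override public void endDocument() {
            put(events, Type.ENDDOCUMENT, null);
        }
    }

    /** Consumer: actively pulls events, in strict event order. */
    static List<String> consume(String xml) throws Exception {
        // A small capacity makes the parser wait for the consumer.
        BlockingQueue<Event> events = new ArrayBlockingQueue<>(4);
        Thread parser = new Thread(() -> {
            try {
                SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)),
                           new Producer(events));
            } catch (Exception ex) {
                // Sketch only: signal the end so the consumer does not hang.
                put(events, Type.ENDDOCUMENT, "error: " + ex);
            }
        });
        parser.start();

        List<String> seen = new ArrayList<>();
        Event e;
        do {
            e = events.take();
            seen.add(e.toString());
        } while (e.type != Type.ENDDOCUMENT);
        parser.join();
        return seen;
    }
}
```
]]></source>
  <p>
    With a bounded buffer the parser thread blocks whenever the
    consumer falls behind, so at most a handful of events are
    held at any time, whatever the size of the input.
  </p>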
  <p>
    The most useful change that this brings about is the switch
    from <em>passive</em> to <em>active</em> XML element
    processing.  The process of parsing now becomes visible to
    the controlling process.  All local validation requirements,
    and all object and data structure building, are initiated by
    the process(es) <em>get</em>ting from the queue - in the case
    above, the FO tree builder.
  </p>
      </section>
      <section>
        <title>XMLEvent methods</title>
  <anchor id="XMLEvent-methods"/>
  <p>
    The experimental code uses a class <strong>XMLEvent</strong>
    to provide the objects which are placed in the queue.
    <em>XMLEvent</em> includes a variety of methods to access
    elements in the queue.  Namespace URIs encountered in
    parsing are maintained in a <code>static</code>
    <code>HashMap</code> where they are associated with a unique
    integer index.  This integer value is used in the signature
    of some of the access methods.
  </p>
  <dl>
    <dt>XMLEvent getEvent(SyncedCircularBuffer events)</dt>
    <dd>
      This is the basis of all of the queue access methods.  It
      returns the next element from the queue, which may be a
      pushback element.
    </dd>
    <dt>XMLEvent getEndDocument(events)</dt>
    <dd>
      <em>get</em>  and discard elements from the queue
      until an ENDDOCUMENT element is found and returned.
    </dd>
    <dt>XMLEvent expectEndDocument(events)</dt>
    <dd>
      If the next element on the queue is an ENDDOCUMENT event,
      return it.  Otherwise, push the element back and throw an
      exception.  Each of the <em>get</em>  methods (except
      <em>getEvent()</em>  itself) has a corresponding
      <em>expect</em>  method.
    </dd>
    <dt>XMLEvent get/expectStartElement(events)</dt>
    <dd> Return the next STARTELEMENT event from the queue.</dd>
    <dt>XMLEvent get/expectStartElement(events, String
      qName)</dt>
    <dd>
      Return the next STARTELEMENT with a QName matching
      <em>qName</em>.
    </dd>
    <dt>
      XMLEvent get/expectStartElement(events, int uriIndex,
      String localName)
    </dt>
    <dd>
      Return the next STARTELEMENT with a URI indicated by the
      <em>uriIndex</em> and a local name matching <em>localName</em>.
    </dd>
    <dt>
      XMLEvent get/expectStartElement(events, LinkedList list)
    </dt>
    <dd>
      <em>list</em>  contains instances of the nested class
      <code>UriLocalName</code>, which hold a
      <em>uriIndex</em>  and a <em>localName</em>.  Return
      the next STARTELEMENT with a URI indicated by the
      <em>uriIndex</em>  and a local name matching
      <em>localName</em>  from any element of
      <em>list</em>.
    </dd>
    <dt>XMLEvent get/expectEndElement(events)</dt>
    <dd>Return the next ENDELEMENT.</dd>
    <dt>XMLEvent get/expectEndElement(events, qName)</dt>
    <dd>Return the next ENDELEMENT with QName
      <em>qName</em>.</dd>
    <dt>XMLEvent get/expectEndElement(events, uriIndex, localName)</dt>
    <dd>
      Return the next ENDELEMENT with a URI indicated by the
      <em>uriIndex</em>  and a local name matching
      <em>localName</em>.
    </dd>
    <dt>
      XMLEvent get/expectEndElement(events, XMLEvent event)
    </dt>
    <dd>
      Return the next ENDELEMENT with a URI matching the
      <em>uriIndex</em>  and <em>localName</em>
      matching those in the <em>event</em>  argument.  This
      is intended as a quick way to find the ENDELEMENT matching
      a previously returned STARTELEMENT.
    </dd>
    <dt>XMLEvent get/expectCharacters(events)</dt>
    <dd>Return the next CHARACTERS event.</dd>
  </dl>
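  <p>
    The difference between the <em>get</em> and
    <em>expect</em> families, and the role of the pushback
    element, can be illustrated with a simplified single-threaded
    sketch.  Everything here is an assumption made for
    illustration: the class <code>EventReader</code>, its
    instance methods, the use of a plain <code>Deque</code> in
    place of <code>SyncedCircularBuffer</code>, and the choice of
    <code>NoSuchElementException</code> as the exception thrown.
    The experimental code's actual signatures are those listed
    above.
  </p>
  <source><![CDATA[
```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.NoSuchElementException;

// Hypothetical, simplified model of get/expect queue access.
class EventReader {
    enum Type { STARTELEMENT, CHARACTERS, ENDELEMENT, ENDDOCUMENT }

    static final class Event {
        final Type type;
        final String qName;   // null for events that carry no name
        Event(Type type, String qName) { this.type = type; this.qName = qName; }
    }

    private final Deque<Event> events;  // stand-in for the synchronized buffer
    private Event pushedBack;           // at most one pushback element

    EventReader(Deque<Event> events) { this.events = events; }

    /** Returns the next event, which may be a pushed-back one. */
    Event getEvent() {
        if (pushedBack != null) {
            Event e = pushedBack;
            pushedBack = null;
            return e;
        }
        return events.removeFirst();  // throws NoSuchElementException if empty
    }

    void pushBack(Event e) { pushedBack = e; }

    private static boolean isStart(Event e, String qName) {
        return e.type == Type.STARTELEMENT && e.qName.equals(qName);
    }

    /** get: discard events until a matching STARTELEMENT is found. */
    Event getStartElement(String qName) {
        while (true) {
            Event e = getEvent();
            if (isStart(e, qName)) return e;
        }
    }

    /** expect: the very next event must match; otherwise push back and throw. */
    Event expectStartElement(String qName) {
        Event e = getEvent();
        if (isStart(e, qName)) return e;
        pushBack(e);
        throw new NoSuchElementException("expected start of " + qName);
    }
}
```
]]></source>
  <p>
    In this model a failed <em>expect</em> leaves the queue
    unchanged from the caller's point of view, because the
    rejected event is returned by the next call to
    <code>getEvent()</code>.
  </p>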
      </section>
      <section>
        <title>FOP modularisation</title>
  <p>
    This same principle can be extended to the other major
    sub-systems of FOP processing.  In each case, while it is
    possible to hold a complete intermediate result in memory,
    the memory costs of that approach are too high.  The
    sub-systems - XML parsing, FO tree construction, Area tree
    construction and rendering - must run in parallel if the
    footprint is to be kept manageable.  By creating a series of
    producer-consumer pairs linked by synchronized buffers,
    logical isolation can be achieved while rates of processing
    remain coupled.  By introducing feedback loops conveying
    information about the completion of processing of the
    elements, sub-systems can dispose of or precis those
    elements without having to be tightly coupled to downstream
    processes.<br/><br/>
    <strong>Figure 3</strong>
  </p>
  <figure src="processPlumbing.png" alt="FOP modularisation"/>
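  <p>
    A chain of producer-consumer pairs of this kind could be
    sketched as follows.  This is a toy model under stated
    assumptions: strings stand in for XML events, FO nodes and
    areas, <code>ArrayBlockingQueue</code> stands in for the
    synchronized buffers, the names <code>PipelineSketch</code>,
    <code>stage()</code> and <code>run()</code> are hypothetical,
    and the feedback loops are not modelled.
  </p>
  <source><![CDATA[
```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Function;

// Hypothetical toy pipeline; strings stand in for real FOP objects.
class PipelineSketch {
    // End-of-stream marker, compared by identity so no input can equal it.
    private static final String END = new String("END");

    /** Starts a thread that moves items from in to out, applying step. */
    static Thread stage(BlockingQueue<String> in, BlockingQueue<String> out,
                        Function<String, String> step) {
        Thread t = new Thread(() -> {
            try {
                for (String item = in.take(); item != END; item = in.take()) {
                    out.put(step.apply(item));
                }
                out.put(END);  // propagate end-of-stream downstream
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        t.start();
        return t;
    }

    static List<String> run(List<String> input) throws InterruptedException {
        // Small bounded buffers keep the stages' rates coupled.
        BlockingQueue<String> xmlEvents = new ArrayBlockingQueue<>(2);
        BlockingQueue<String> foNodes = new ArrayBlockingQueue<>(2);
        BlockingQueue<String> areas = new ArrayBlockingQueue<>(2);

        Thread foBuilder = stage(xmlEvents, foNodes, e -> "fo(" + e + ")");
        Thread areaBuilder = stage(foNodes, areas, n -> "area(" + n + ")");

        // Stand-in for the parser: the first producer in the chain.
        Thread parser = new Thread(() -> {
            try {
                for (String s : input) xmlEvents.put(s);
                xmlEvents.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        parser.start();

        // Stand-in for the renderer: the final consumer, on this thread.
        List<String> rendered = new ArrayList<>();
        for (String a = areas.take(); a != END; a = areas.take()) {
            rendered.add(a);
        }
        parser.join();
        foBuilder.join();
        areaBuilder.join();
        return rendered;
    }
}
```
]]></source>
  <p>
    Each stage sees only its own input and output buffers, which
    gives the logical isolation described above, while the
    buffer capacity bounds how far any stage can run ahead of
    the one downstream.
  </p>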
      </section>
    </section>
    </body>
</document>