<?xml version="1.0" standalone="no"?>
<!DOCTYPE document PUBLIC "-//APACHE//DTD Documentation V1.1//EN"
"http://cvs.apache.org/viewcvs.cgi/*checkout*/xml-forrest/src/resources/schema/dtd/document-v11.dtd">
<document>
<header>
<title>Integrating XML Parsing</title>
<authors>
<person name="Peter B. West" email="pbwest@powerup.com.au"/>
</authors>
</header>
<body>
<section>
<title>An alternative parser integration</title>
<p>
This note proposes an alternative method of integrating the
output of the SAX parsing of the Flow Object (FO) tree into
FOP processing. The purpose of the proposed changes is to
provide for better decomposition of the process of analysing
and rendering an FO tree such as is represented in the output
from initial (XSLT) processing of an XML source document.
</p>
<section>
<title>Structure of SAX parsing</title>
<p>
Figure 1 is a schematic representation of the process of SAX
parsing of an input source. SAX parsing involves the
registration, with an object implementing the
<code>XMLReader</code> interface, of a
<code>ContentHandler</code> which contains a callback
routine for each of the event types encountered by the
parser, e.g., <code>startDocument()</code>,
<code>startElement()</code>, <code>characters()</code>,
<code>endElement()</code> and <code>endDocument()</code>.
Parsing is initiated by a call to the <code>parse()</code>
method of the <code>XMLReader</code>. Note that the call to
<code>parse()</code> and the calls to individual callback
methods are synchronous: <code>parse()</code> will only
return when the last callback method returns, and each
callback must complete before the next is called.<br/><br/>
<strong>Figure 1</strong>
</p>
<figure src="SAXParsing.png" alt="SAX parsing schematic"/>
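<p>
The registration and callback sequence described above can be
sketched as follows. This is a minimal illustration, not FOP
code: the class names <code>SaxCallbackDemo</code> and
<code>RecordingHandler</code> are hypothetical, and the parser
is obtained through the standard JAXP
<code>SAXParserFactory</code>. The handler simply records each
callback in the order in which the parser makes it.
</p>
<source><![CDATA[
```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; not part of FOP.
class SaxCallbackDemo {

    /** Records every callback in the order the parser makes it. */
    static class RecordingHandler extends DefaultHandler {
        final List<String> log = new ArrayList<>();

        @Override public void startDocument() { log.add("startDocument"); }

        @Override public void startElement(String uri, String localName,
                                           String qName, Attributes atts) {
            log.add("startElement " + localName);
        }

        @Override public void characters(char[] ch, int start, int length) {
            // The parser may deliver one text node in several chunks.
            log.add("characters " + new String(ch, start, length));
        }

        @Override public void endElement(String uri, String localName,
                                         String qName) {
            log.add("endElement " + localName);
        }

        @Override public void endDocument() { log.add("endDocument"); }
    }

    /** Parses the given XML text and returns the recorded callbacks. */
    static List<String> parse(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader reader = factory.newSAXParser().getXMLReader();
            RecordingHandler handler = new RecordingHandler();
            reader.setContentHandler(handler);
            // parse() returns only after the last callback has returned.
            reader.parse(new InputSource(new StringReader(xml)));
            return handler.log;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        for (String line : parse("<a><b>text</b></a>")) {
            System.out.println(line);
        }
    }
}
```
]]></source>
<p>
Because the log is complete by the time <code>parse()</code>
returns, the sketch also shows the synchronous nature of the
call: all processing happens inside the callbacks.
</p>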
<p>
In the process of parsing, the hierarchical structure of the
original FO tree is flattened into a number of streams of
events of the same type which are reported in the sequence
in which they are encountered. Apart from that, the API
imposes no structure or constraint which expresses the
relationship between, e.g., a startElement event and the
endElement event for the same element. To the extent that
such relationship information is required, it must be
managed by the callback routines.
</p>
<p>
The most direct approach here is to build the tree
"invisibly", burying within the callback routines the
necessary code to construct the tree. In the simplest case,
the whole of the FO tree is built within the call to
<code>parse()</code>, and that in-memory tree is subsequently
processed to (a) validate the FO structure, and (b)
construct the Area tree. The problem with this approach is
the potential size of the FO tree in memory. FOP has
suffered from this problem in the past.
</p>
</section>
<section>
<title>Cluttered callbacks</title>
<p>
On the other hand, the callback code may become increasingly
complex as tree validation and the triggering of Area tree
processing and subsequent rendering are moved into the
callbacks, typically the <code>endElement()</code> method.
In order to overcome acute memory problems, the FOP code was
recently modified in this way, to trigger Area tree building
and rendering in the <code>endElement()</code> method, when
the end of a page-sequence was detected.
</p>
<p>
The drawback with such a method is that it becomes difficult
to determine the order of events and the circumstances in
which any particular processing events are triggered. When
the processing events are inherently self-contained, this is
irrelevant. But the more complex and context-dependent the
relationships are among the processing elements, the more
obscurity is engendered in the code by such "side-effect"
processing.
</p>
</section>
<section>
<title>From passive to active parsing</title>
<p>
In order to solve the simultaneous problems of exposing the
structure of the processing and minimising in-memory
requirements, the experimental code separates the parsing of
the input source from the building of the FO tree and all
downstream processing. The callback routines become
minimal, consisting of the creation and buffering of
<code>XMLEvent</code> objects as a <em>producer</em>. All
of these objects are effectively merged into a single event
stream, in strict event order, for subsequent access by the
FO tree building process, acting as a
<em>consumer</em>. In itself, this does not reduce the
memory footprint; that reduction comes when the approach is
generalised to modularise FOP processing.<br/><br/> <strong>Figure 2</strong>
</p>
<figure src="XML-event-buffer.png" alt="XML event buffer"/>
<p>
The most useful change that this brings about is the switch
from <em>passive</em> to <em>active</em> XML element
processing. The process of parsing now becomes visible to
the controlling process. All local validation requirements
and all object and data structure building are initiated by
the process(es) <em>get</em>ting from the queue - in the case
above, the FO tree builder.
</p>
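<p>
The producer/consumer arrangement might be sketched as below.
The sketch rests on several assumptions: a
<code>java.util.concurrent.ArrayBlockingQueue</code> stands in
for the experimental synchronized buffer, and the
<code>Event</code>, <code>Producer</code> and
<code>EventBufferDemo</code> classes are hypothetical
simplifications rather than the actual <code>XMLEvent</code>
code. The callbacks do nothing but create and buffer events,
while the consumer actively pulls them in strict event order.
</p>
<source><![CDATA[
```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; a stand-in for the experimental code.
class EventBufferDemo {
    enum Type { STARTDOCUMENT, STARTELEMENT, CHARACTERS, ENDELEMENT, ENDDOCUMENT }

    /** Simplified event: a type plus an element name or text. */
    static final class Event {
        final Type type;
        final String text;
        Event(Type type, String text) { this.type = type; this.text = text; }
        @Override public String toString() {
            return text.isEmpty() ? type.name() : type + " " + text;
        }
    }

    /** Producer: each callback only creates and buffers an event. */
    static class Producer extends DefaultHandler {
        private final BlockingQueue<Event> events;
        Producer(BlockingQueue<Event> events) { this.events = events; }

        private void put(Type type, String text) throws SAXException {
            try {
                events.put(new Event(type, text)); // blocks while the buffer is full
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new SAXException(e);
            }
        }
        @Override public void startDocument() throws SAXException {
            put(Type.STARTDOCUMENT, "");
        }
        @Override public void startElement(String uri, String localName,
                String qName, Attributes atts) throws SAXException {
            put(Type.STARTELEMENT, localName);
        }
        @Override public void characters(char[] ch, int start, int length)
                throws SAXException {
            put(Type.CHARACTERS, new String(ch, start, length));
        }
        @Override public void endElement(String uri, String localName,
                String qName) throws SAXException {
            put(Type.ENDELEMENT, localName);
        }
        @Override public void endDocument() throws SAXException {
            put(Type.ENDDOCUMENT, "");
        }
    }

    /** Consumer: pulls events from the buffer until ENDDOCUMENT. */
    static List<String> consume(String xml) {
        BlockingQueue<Event> events = new ArrayBlockingQueue<>(4);
        Thread parser = new Thread(() -> {
            try {
                SAXParserFactory factory = SAXParserFactory.newInstance();
                factory.setNamespaceAware(true);
                XMLReader reader = factory.newSAXParser().getXMLReader();
                reader.setContentHandler(new Producer(events));
                reader.parse(new InputSource(new StringReader(xml)));
            } catch (Exception e) {
                throw new IllegalStateException(e);
            }
        });
        parser.setDaemon(true);
        parser.start();

        List<String> seen = new ArrayList<>();
        try {
            Event e;
            do {
                e = events.poll(5, TimeUnit.SECONDS);
                if (e == null) throw new IllegalStateException("no event received");
                seen.add(e.toString());
            } while (e.type != Type.ENDDOCUMENT);
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(ex);
        }
        return seen;
    }

    public static void main(String[] args) {
        consume("<a><b/></a>").forEach(System.out::println);
    }
}
```
]]></source>
<p>
The buffer capacity of four is an arbitrary choice for the
sketch. Because it is smaller than the number of events, the
parser thread is held back until the consumer catches up.
</p>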
</section>
<section>
<title>XMLEvent methods</title>
<anchor id="XMLEvent-methods"/>
<p>
The experimental code uses a class <strong>XMLEvent</strong>
to provide the objects which are placed in the queue.
<em>XMLEvent</em> includes a variety of methods to access
elements in the queue. Namespace URIs encountered in
parsing are maintained in a <code>static</code>
<code>HashMap</code> where they are associated with a unique
integer index. This integer value is used in the signature
of some of the access methods.
</p>
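<p>
One way such a URI index could be kept is sketched below. The
class and method names (<code>UriIndexDemo</code>,
<code>getUriIndex</code>, <code>getUri</code>) are assumptions
made for illustration, as is the use of a companion list for
reverse lookup; the experimental code may differ.
</p>
<source><![CDATA[
```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a static namespace URI registry.
class UriIndexDemo {
    private static final Map<String, Integer> URI_INDICES = new HashMap<>();
    private static final List<String> URIS = new ArrayList<>();

    /** Returns the index for a URI, allocating a new one on first sight. */
    static synchronized int getUriIndex(String uri) {
        Integer index = URI_INDICES.get(uri);
        if (index == null) {
            index = URIS.size();
            URIS.add(uri);
            URI_INDICES.put(uri, index);
        }
        return index;
    }

    /** Reverse lookup: the URI registered under the given index. */
    static synchronized String getUri(int index) {
        return URIS.get(index);
    }

    public static void main(String[] args) {
        int fo = getUriIndex("http://www.w3.org/1999/XSL/Format");
        System.out.println(fo + " -> " + getUri(fo));
    }
}
```
]]></source>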
<dl>
<dt>XMLEvent getEvent(SyncedCircularBuffer events)</dt>
<dd>
This is the basis of all of the queue access methods. It
returns the next element from the queue, which may be a
pushback element.
</dd>
<dt>XMLEvent getEndDocument(events)</dt>
<dd>
<em>Get</em> and discard elements from the queue
until an ENDDOCUMENT element is found and returned.
</dd>
<dt>XMLEvent expectEndDocument(events)</dt>
<dd>
If the next element on the queue is an ENDDOCUMENT event,
return it. Otherwise, push the element back and throw an
exception. Each of the <em>get</em> methods (except
<em>getEvent()</em> itself) has a corresponding
<em>expect</em> method.
</dd>
<dt>XMLEvent get/expectStartElement(events)</dt>
<dd> Return the next STARTELEMENT event from the queue.</dd>
<dt>XMLEvent get/expectStartElement(events, String
qName)</dt>
<dd>
Return the next STARTELEMENT with a QName matching
<em>qName</em>.
</dd>
<dt>
XMLEvent get/expectStartElement(events, int uriIndex,
String localName)
</dt>
<dd>
Return the next STARTELEMENT with a URI indicated by the
<em>uriIndex</em> and a local name matching <em>localName</em>.
</dd>
<dt>
XMLEvent get/expectStartElement(events, LinkedList list)
</dt>
<dd>
<em>list</em> contains instances of the nested class
<code>UriLocalName</code>, which hold a
<em>uriIndex</em> and a <em>localName</em>. Return
the next STARTELEMENT with a URI indicated by the
<em>uriIndex</em> and a local name matching
<em>localName</em> from any element of
<em>list</em>.
</dd>
<dt>XMLEvent get/expectEndElement(events)</dt>
<dd>Return the next ENDELEMENT.</dd>
<dt>XMLEvent get/expectEndElement(events, qName)</dt>
<dd>Return the next ENDELEMENT with a QName matching
<em>qName</em>.</dd>
<dt>XMLEvent get/expectEndElement(events, uriIndex, localName)</dt>
<dd>
Return the next ENDELEMENT with a URI indicated by the
<em>uriIndex</em> and a local name matching
<em>localName</em>.
</dd>
<dt>
XMLEvent get/expectEndElement(events, XMLEvent event)
</dt>
<dd>
Return the next ENDELEMENT whose <em>uriIndex</em> and
<em>localName</em> match those of the <em>event</em>
argument. This
is intended as a quick way to find the ENDELEMENT matching
a previously returned STARTELEMENT.
</dd>
<dt>XMLEvent get/expectCharacters(events)</dt>
<dd>Return the next CHARACTERS event.</dd>
</dl>
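<p>
The difference between the <em>get</em> and <em>expect</em>
families can be illustrated with a small single-threaded
sketch. Everything in it is hypothetical: the
<code>Events</code> class is a simple stand-in for the
synchronized buffer, <code>Event</code> is a cut-down
substitute for <code>XMLEvent</code>, and the choice of
<code>NoSuchElementException</code> is an assumption, since
the text above does not name the exception thrown.
</p>
<source><![CDATA[
```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.NoSuchElementException;

// Hypothetical sketch; not the experimental XMLEvent code.
class GetExpectDemo {
    enum Type { STARTELEMENT, ENDELEMENT, CHARACTERS, ENDDOCUMENT }

    static final class Event {
        final Type type;
        final String qName; // null for events that carry no name
        Event(Type type, String qName) { this.type = type; this.qName = qName; }
    }

    /** Stand-in for the event buffer: a queue that supports pushback. */
    static final class Events {
        private final Deque<Event> queue = new ArrayDeque<>();
        Events(Event... events) { queue.addAll(Arrays.asList(events)); }

        /** Next event, which may be one that was pushed back. */
        Event getEvent() { return queue.removeFirst(); }

        void pushBack(Event e) { queue.addFirst(e); }
    }

    /** get: discard events until a matching STARTELEMENT is found. */
    static Event getStartElement(Events events, String qName) {
        while (true) {
            Event e = events.getEvent();
            if (e.type == Type.STARTELEMENT && e.qName.equals(qName)) {
                return e;
            }
        }
    }

    /** expect: the very next event must match, else push back and throw. */
    static Event expectStartElement(Events events, String qName) {
        Event e = events.getEvent();
        if (e.type == Type.STARTELEMENT && e.qName.equals(qName)) {
            return e;
        }
        events.pushBack(e);
        throw new NoSuchElementException("expected start of " + qName);
    }

    public static void main(String[] args) {
        Events events = new Events(
            new Event(Type.STARTELEMENT, "fo:root"),
            new Event(Type.CHARACTERS, null),
            new Event(Type.STARTELEMENT, "fo:block"));
        System.out.println(expectStartElement(events, "fo:root").qName);
        System.out.println(getStartElement(events, "fo:block").qName);
    }
}
```
]]></source>
<p>
In this sketch a failed <em>expect</em> leaves the queue
unchanged, so the caller can try a different expectation
against the same event.
</p>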
</section>
<section>
<title>FOP modularisation</title>
<p>
This same principle can be extended to the other major
sub-systems of FOP processing. In each case, while it is
possible to hold a complete intermediate result in memory,
the memory costs of that approach are too high. The
sub-systems - XML parsing, FO tree construction, Area tree
construction and rendering - must run in parallel if the
footprint is to be kept manageable. By creating a series of
producer-consumer pairs linked by synchronized buffers,
logical isolation can be achieved while rates of processing
remain coupled. By introducing feedback loops conveying
information about the completion of processing of the
elements, sub-systems can dispose of or precis those
elements without having to be tightly coupled to downstream
processes.<br/><br/>
<strong>Figure 3</strong>
</p>
<figure src="processPlumbing.png" alt="FOP modularisation"/>
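<p>
A chain of producer-consumer pairs of this kind might look
like the following sketch. It is illustrative only: the stages
are reduced to string transformations, the
<code>PipelineDemo</code> class and its <code>stage</code> and
<code>run</code> methods are hypothetical, bounded
<code>ArrayBlockingQueue</code>s stand in for the synchronized
buffers, and the feedback loops mentioned above are not
modelled. An empty <code>Optional</code> is used here as an
assumed end-of-stream marker.
</p>
<source><![CDATA[
```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Function;

// Hypothetical sketch of sub-systems linked by bounded buffers.
class PipelineDemo {

    /** Starts a stage that consumes from 'in' and produces to 'out'. */
    static Thread stage(BlockingQueue<Optional<String>> in,
                        BlockingQueue<Optional<String>> out,
                        Function<String, String> work) {
        Thread t = new Thread(() -> {
            try {
                while (true) {
                    Optional<String> item = in.take();
                    out.put(item.map(work)); // the empty marker passes through
                    if (!item.isPresent()) {
                        return;
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        t.setDaemon(true);
        t.start();
        return t;
    }

    /** Runs the inputs through FO tree, Area tree and rendering stages. */
    static List<String> run(List<String> inputs) {
        // Small capacities keep the stages' rates of processing coupled.
        BlockingQueue<Optional<String>> parsed = new ArrayBlockingQueue<>(2);
        BlockingQueue<Optional<String>> foTree = new ArrayBlockingQueue<>(2);
        BlockingQueue<Optional<String>> areaTree = new ArrayBlockingQueue<>(2);

        stage(parsed, foTree, s -> "fo(" + s + ")");
        stage(foTree, areaTree, s -> "area(" + s + ")");

        Thread source = new Thread(() -> {
            try {
                for (String s : inputs) {
                    parsed.put(Optional.of(s));
                }
                parsed.put(Optional.empty());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        source.setDaemon(true);
        source.start();

        // The calling thread acts as the final consumer (the renderer).
        List<String> rendered = new ArrayList<>();
        try {
            for (Optional<String> item = areaTree.take(); item.isPresent();
                    item = areaTree.take()) {
                rendered.add("render(" + item.get() + ")");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return rendered;
    }

    public static void main(String[] args) {
        run(Arrays.asList("page-sequence-1", "page-sequence-2"))
            .forEach(System.out::println);
    }
}
```
]]></source>
<p>
Each stage knows only its own input and output buffers, which
gives the logical isolation described above, while the bounded
capacities prevent any stage from running far ahead of its
consumer.
</p>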
</section>
</section>
</body>
</document>