<html>
<head>
<META http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta content="text/html; charset=ISO-8859-1">
<title>HPSF HOW-TO</title>
<style type="text/css">
 body { background-color: white; font-size: normal; color: black ; }
 a { color: #525d76; }
 a.black { color: #000000;} 
 table {border-width: 0; width: 100%}
 table.centered {text-align: center}
 table.title {text-align: center; width: 80%} 
 img{border-width: 0;} 
 span.s1 {font-family: Helvetica, Arial, sans-serif; font-weight: bold; color: #000000; }
 span.s1_white { font-family: Helvetica, Arial, sans-serif; font-weight: bold; color: #ffffff; } 
 span.title {font-family: Helvetica, Arial, sans-serif; font-weight: bold; color: #000000; }
 span.c1 {color: #000000; font-family: Helvetica, Arial, sans-serif}
 tr.left {text-align: left}
 hr { width: 100%; size: 2} 
</style>
</head>
<body>
<table width="100%" cellspacing="0" cellpadding="0" border="0">
<tr>
<td valign="top" align="left"><a href="http://jakarta.apache.org/index.html"><img hspace="0" vspace="0" border="0" src="images/jakarta-logo.gif"></a></td><td width="100%" valign="top" align="left" bgcolor="#ffffff"><img hspace="0" vspace="0" border="0" align="right" src="images/header.gif"></td>
</tr>
<tr>
<td colspan="2" bgcolor="#525d76"><span class="c1"><a class="black" href="http://www.apache.org/">www.apache.org &gt;</a><a class="black" href="http://jakarta.apache.org/">jakarta.apache.org &gt;</a><a href="http://jakarta.apache.org/poi/" class="black">jakarta.apache.org/poi</a></span></td>
</tr>
<tr>
<td height="8"></td>
</tr>
</table>
<table border="0" cellpadding="0" cellspacing="0" width="100%">
<tr>
<td width="1%">
<br>
</td><td nowrap="1" valign="top" width="14%">
<br>
<span class="s1">Navigation</span>
<br>
<a class="s1" href="../index.html">Main</a>
<br>
<br>
<span class="s1">HPSF</span>
<br>
<a class="s1" href="index.html">Overview</a>
<br>
<a class="s1" href="how-to.html">How To</a>
<br>
<a class="s1" href="internals.html">Internals</a>
<br>
<a class="s1" href="todo.html">To Do</a>
<br>
</td><td width="1%">
<br>
</td><td align="left" valign="top" width="*">
<table width="100%" align="center" class="centered">
<tbody>
<tr>
<td align="center">
<table border="0" cellpadding="1" cellspacing="0" class="title">
<tbody>
<tr>
<td bgcolor="#525d76">
<table width="100%" border="0" cellpadding="2" cellspacing="0" class="centered">
<tbody>
<tr>
<td bgcolor="#f3dd61"><span class="title">HPSF HOW-TO</span></td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
<div align="right">
<table cellspacing="0" cellpadding="2" border="0" width="100%">
<tr>
<td bgcolor="#525D76"><font color="#ffffff" size="+1"><font face="Arial,sans-serif"><b>How To Use the HPSF APIs</b></font></font></td>
</tr>
<tr>
<td>
<br>

   
<p align="justify">This HOW-TO is organized in three sections. You should read them
    sequentially because the later sections build upon the earlier ones.</p>

   
<ol>
    
<li>
     
<p align="justify">The <a href="#sec1">first section</a> explains how to read
      the most important standard properties of a Microsoft Office
      document. Standard properties are things like title, author, creation
      date, etc. You will quite likely find what you need here and will not
      have to read the other sections.</p>
    
</li>

    
<li>
     
<p align="justify">The <a href="#sec2">second section</a> goes a small step
      further and focusses on reading additional standard properties. It also
      talks about exceptions that may be thrown when dealing with HPSF and
      shows how you can read properties of embedded objects.</p>
     
</li>

    
<li>
     
<p align="justify">The <a href="#sec3">third section</a> tells how to read
      non-standard properties. Non-standard properties are application-specific
      name/value/type triples.</p>
     
</li>
   
</ol>

   
<anchor id="sec1"></anchor>
   
<div align="right">
<table cellspacing="0" cellpadding="2" border="0" width="99%">
<tr>
<td bgcolor="#525D76"><font color="#ffffff" size="+0"><font face="Arial,sans-serif"><b>Reading Standard Properties</b></font></font></td>
</tr>
<tr>
<td>
<br>

    
<note>This section explains how to read
      the most important standard properties of a Microsoft Office
      document. Standard properties are things like title, author, creation
      date etc. Chances are that you will find here what you need and
      don't have to read the other sections.</note>

    
<p align="justify">The first thing you should understand is that properties are stored in
     separate documents inside the POI filesystem. (If you don't know what a
     POI filesystem is, read its <a href="../poifs/index.html">documentation</a>.)  A document in a POI
     filesystem is also called a <em>stream</em>.</p>

    
<p align="justify">The following example shows how to read the "title" property
     of a document stored in a POI filesystem. Reading other properties is
     similar. Consult the API documentation of
     <code>org.apache.poi.hpsf.SummaryInformation</code> for details.</p>

    
<p align="justify">The standard properties this section focusses on can be
     found in a document called <em>\005SummaryInformation</em> in the root of
     the POI filesystem. The notation <em>\005</em> in the document's name
     means the character with the decimal value of 5. In order to read the
     title, an application has to perform the following steps:</p>

    
<ol>
     
<li>
      
<p align="justify">Open the document <em>\005SummaryInformation</em> located in the root
       of the POI filesystem.</p>
     
</li>
     
<li>
      
<p align="justify">Create an instance of the class
       <code>SummaryInformation</code> from that
       document.</p>
     
</li>
     
<li>
      
<p align="justify">Call the <code>SummaryInformation</code> instance's
       <code>getTitle()</code> method.</p>
     
</li>
    
</ol>
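
<p align="justify">A small aside on the stream name used in the steps above: in
     Java source code the name can be written with the octal escape
     <code>\005</code>, which stands for a single character with the decimal
     value 5. The following sketch only illustrates that notation. The class
     name <code>StreamNameDemo</code> and the constant are made up for this
     example and are not part of POI.</p>

<div align="center">
<table cellspacing="2" cellpadding="2" border="1">
<tr>
<td>
<pre>
```java
class StreamNameDemo {
    // Stream name as written in Java source; the constant name is hypothetical.
    static final String SUMMARY_INFORMATION = "\005SummaryInformation";

    public static void main(String[] args) {
        // "\005" is an octal escape: one character with the decimal value 5.
        int first = SUMMARY_INFORMATION.charAt(0);
        System.out.println("First character code: " + first);

        // The escape counts as one character, not four.
        System.out.println("Name length: " + SUMMARY_INFORMATION.length());

        // The same name spelled with a Unicode escape for the first character.
        System.out.println("Equal to Unicode-escape spelling: "
            + SUMMARY_INFORMATION.equals("\u0005SummaryInformation"));
    }
}
```
</pre>
</td>
</tr>
</table>
</div>

<p align="justify">The first line printed shows the character code 5, so the
     name really begins with a control character rather than with a backslash
     and three digits.</p>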

    
<p align="justify">Sounds easy, doesn't it? Here are the steps in detail.</p>


    
<div align="right">
<table cellspacing="0" cellpadding="2" border="0" width="98%">
<tr>
<td bgcolor="#525D76"><font color="#ffffff" size="-1"><font face="Arial,sans-serif"><b>Open the document \005SummaryInformation in the root of the        POI filesystem</b></font></font></td>
</tr>
<tr>
<td>
<br>

     
<p align="justify">An application that wants to open a document in a POI filesystem
      (POIFS) proceeds as shown by the following code fragment. (The full
      source code of the sample application is available in the
      <em>examples</em> section of the POI source tree as
      <em>ReadTitle.java</em>.)</p>

     
<div align="center">
<table cellspacing="2" cellpadding="2" border="1">
<tr>
<td>
<pre>
import java.io.*;
import org.apache.poi.hpsf.*;
import org.apache.poi.poifs.eventfilesystem.*;

// ...

public static void main(String[] args)
    throws IOException
{
    final String filename = args[0];
    POIFSReader r = new POIFSReader();
    r.registerListener(new MyPOIFSReaderListener(),
                       "\005SummaryInformation");
    r.read(new FileInputStream(filename));
}</pre>
</td>
</tr>
</table>
</div>

     
<p align="justify">The first interesting statement is</p>

     
<div align="center">
<table cellspacing="2" cellpadding="2" border="1">
<tr>
<td>
<pre>POIFSReader r = new POIFSReader();</pre>
</td>
</tr>
</table>
</div>

     
<p align="justify">It creates a
      <code>org.apache.poi.poifs.eventfilesystem.POIFSReader</code> instance
      which we shall need to read the POI filesystem. Before the application
      actually opens the POI filesystem we have to tell the
      <code>POIFSReader</code> which documents we are interested in. In this
      case the application should do something with the document
      <em>\005SummaryInformation</em>.</p>

     
<div align="center">
<table cellspacing="2" cellpadding="2" border="1">
<tr>
<td>
<pre>
r.registerListener(new MyPOIFSReaderListener(),
                   "\005SummaryInformation");</pre>
</td>
</tr>
</table>
</div>

     
<p align="justify">This method call registers a
      <code>org.apache.poi.poifs.eventfilesystem.POIFSReaderListener</code>
      with the <code>POIFSReader</code>. The <code>POIFSReaderListener</code>
      interface specifies the method <code>processPOIFSReaderEvent</code>
      which processes a document. The class
      <code>MyPOIFSReaderListener</code> implements the
      <code>POIFSReaderListener</code> interface and thus the
      <code>processPOIFSReaderEvent</code> method. The eventing POI filesystem
      calls this method when it finds the <em>\005SummaryInformation</em>
      document. (In the sample application <code>MyPOIFSReaderListener</code> is
      a static class in the <em>ReadTitle.java</em> source file.)</p>

     
<p align="justify">Now everything is prepared and reading the POI filesystem can
      start:</p>

     
<div align="center">
<table cellspacing="2" cellpadding="2" border="1">
<tr>
<td>
<pre>r.read(new FileInputStream(filename));</pre>
</td>
</tr>
</table>
</div>

     
<p align="justify">The following source code fragment shows the
      <code>MyPOIFSReaderListener</code> class and how it retrieves the
      title.</p>

     
<div align="center">
<table cellspacing="2" cellpadding="2" border="1">
<tr>
<td>
<pre>
static class MyPOIFSReaderListener implements POIFSReaderListener
{
    public void processPOIFSReaderEvent(POIFSReaderEvent e)
    {
        SummaryInformation si = null;
        try
        {
            si = (SummaryInformation)
                 PropertySetFactory.create(e.getStream());
        }
        catch (Exception ex)
        {
            throw new RuntimeException
                ("Property set stream \"" +
                 e.getPath() + e.getName() + "\": " + ex);
        }
        final String title = si.getTitle();
        if (title != null)
            System.out.println("Title: \"" + title + "\"");
        else
            System.out.println("Document has no title.");
    }
}
</pre>
</td>
</tr>
</table>
</div>

     
<p align="justify">The line</p>

     
<div align="center">
<table cellspacing="2" cellpadding="2" border="1">
<tr>
<td>
<pre>SummaryInformation si = null;</pre>
</td>
</tr>
</table>
</div>

     
<p align="justify">declares a <code>SummaryInformation</code> variable and initializes it
      with <code>null</code>. We need an instance of this class to access the
      title. The instance is created in a <code>try</code> block:</p>

     
<div align="center">
<table cellspacing="2" cellpadding="2" border="1">
<tr>
<td>
<pre>si = (SummaryInformation)
     PropertySetFactory.create(e.getStream());</pre>
</td>
</tr>
</table>
</div>

     
<p align="justify">The expression <code>e.getStream()</code> returns the input stream
      containing the bytes of the property set stream named
      <em>\005SummaryInformation</em>. This stream is passed into the
      <code>create</code> method of the factory class
      <code>org.apache.poi.hpsf.PropertySetFactory</code> which returns
      a <code>org.apache.poi.hpsf.PropertySet</code> instance. It is more or
      less safe to cast this result to <code>SummaryInformation</code>, a
      convenience class with methods like <code>getTitle()</code>,
      <code>getAuthor()</code> etc.</p>

     
<p align="justify">The <code>PropertySetFactory.create</code> method may throw all sorts
      of exceptions. We'll deal with them in the next section. For now we just
      catch all exceptions and throw a <code>RuntimeException</code>
      containing the message text of the original exception.</p>

     
<p align="justify">If all goes well, the sample application retrieves the title and prints
      it to the standard output. As you can see, you must be prepared for the
      case that the document does not have a title, which the sample detects
      by checking the result for <code>null</code>.</p>

     
<div align="center">
<table cellspacing="2" cellpadding="2" border="1">
<tr>
<td>
<pre>final String title = si.getTitle();
if (title != null)
    System.out.println("Title: \"" + title + "\"");
else
    System.out.println("Document has no title.");</pre>
</td>
</tr>
</table>
</div>

     
<p align="justify">Please note that a Microsoft Office document does not necessarily
      contain the <em>\005SummaryInformation</em> stream. The documents created
      by the Microsoft Office suite have one, as far as I know. However, an
      Excel spreadsheet exported from StarOffice 5.2 won't have a
      <em>\005SummaryInformation</em> stream. In this case the application
      won't throw an exception; the <code>processPOIFSReaderEvent</code>
      method is simply never called. You have been warned!</p>
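
<p align="justify">One way to notice a missing stream is to let the listener
      record whether it was ever called and to check that flag after reading.
      The following sketch shows the idea with plain Java only. The types
      <code>StreamListener</code>, <code>TinyReader</code> and
      <code>RecordingListener</code> are hypothetical stand-ins invented for
      this illustration. They are not POI classes, and a real application
      would put the flag into its own <code>POIFSReaderListener</code>
      implementation.</p>

<div align="center">
<table cellspacing="2" cellpadding="2" border="1">
<tr>
<td>
<pre>
```java
class MissingStreamDemo {
    // Hypothetical stand-in for a reader listener; not the POI interface.
    interface StreamListener {
        void onStream(String name);
    }

    // Hypothetical stand-in for an event-driven reader: it calls the
    // listener only for streams that exist in the simulated filesystem.
    static class TinyReader {
        private StreamListener listener;
        private String wantedName;

        void registerListener(StreamListener l, String name) {
            listener = l;
            wantedName = name;
        }

        void read(String[] streamNames) {
            for (String name : streamNames) {
                if (name.equals(wantedName)) {
                    listener.onStream(name);
                }
            }
        }
    }

    // Listener that remembers whether it was ever invoked.
    static class RecordingListener implements StreamListener {
        boolean called = false;

        public void onStream(String name) {
            called = true;
        }
    }

    // Returns true if the wanted stream was seen while reading.
    static boolean hasStream(String[] streamNames, String wanted) {
        TinyReader r = new TinyReader();
        RecordingListener l = new RecordingListener();
        r.registerListener(l, wanted);
        r.read(streamNames);
        return l.called;
    }

    public static void main(String[] args) {
        String wanted = "\005SummaryInformation";
        String[] withSummary = { "Workbook", "\005SummaryInformation" };
        String[] withoutSummary = { "Workbook" };
        System.out.println("with summary: " + hasStream(withSummary, wanted));
        System.out.println("without summary: " + hasStream(withoutSummary, wanted));
    }
}
```
</pre>
</td>
</tr>
</table>
</div>

<p align="justify">The point of the design is that the absence of a callback is
      turned into an explicit value the application can test, instead of
      passing unnoticed.</p>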
    
</td>
</tr>
</table>
</div>
<br>
   
</td>
</tr>
</table>
</div>
<br>

   
<anchor id="sec2"></anchor>
   
<div align="right">
<table cellspacing="0" cellpadding="2" border="0" width="99%">
<tr>
<td bgcolor="#525D76"><font color="#ffffff" size="+0"><font face="Arial,sans-serif"><b>Additional Standard Properties, Exceptions And Embedded Objects</b></font></font></td>
</tr>
<tr>
<td>
<br>

    
<note>This section focusses on reading additional standard properties. It
     also talks about exceptions that may be thrown when dealing with HPSF and
     shows how you can read properties of embedded objects.</note>

    
<p align="justify">A couple of <em>additional standard properties</em> are not
     contained in the <em>\005SummaryInformation</em> stream explained above,
     for example a document's category or the number of multimedia clips in a
     PowerPoint presentation. Microsoft has invented an additional stream named
     <em>\005DocumentSummaryInformation</em> to hold these properties. With two
     minor exceptions you can proceed exactly as described above to read the
     properties stored in <em>\005DocumentSummaryInformation</em>:</p>

    
<ul>
     
<li>
<p align="justify">Instead of <em>\005SummaryInformation</em> use
       <em>\005DocumentSummaryInformation</em> as the stream's name.</p>
</li>
     
<li>
<p align="justify">Replace all occurrences of the class
       <code>SummaryInformation</code> by
       <code>DocumentSummaryInformation</code>.</p>
</li>
    
</ul>

    
<p align="justify">And of course you cannot call <code>getTitle()</code> because
     <code>DocumentSummaryInformation</code> has different query methods. See
     the API documentation for the details!</p>

    
<p align="justify">In the previous section the application simply caught all
     <em>exceptions</em> and was in no way interested in any
     details. However, a real application will likely want to know what went
     wrong and act appropriately. Besides I/O exceptions there are three
     HPSF- or POI-specific exceptions you should know about:</p>

    
<dl>
     
<dt>
<code>NoPropertySetStreamException</code></dt>
     
<dd>
<p align="justify">This exception is thrown if the application tries to create a
       <code>PropertySet</code> or one of its subclasses
       <code>SummaryInformation</code> and
       <code>DocumentSummaryInformation</code> from a stream that is not a
       property set stream. A faulty property set stream counts as not being a
       property set stream at all. An application should be prepared to deal
       with this case even if it only opens streams named
       <em>\005SummaryInformation</em> or
       <em>\005DocumentSummaryInformation</em>. These are just names. A
       stream's name by itself does not ensure that the stream contains the
       expected contents or that these contents are correct.</p>
</dd>

     
<dt>
<code>UnexpectedPropertySetTypeException</code>
</dt>
     
<dd>
<p align="justify">This exception is thrown if a certain type of property set is
       expected somewhere (e.g. a <code>SummaryInformation</code> or
       <code>DocumentSummaryInformation</code>) but the provided property
       set is not of that type.</p>
</dd>

     
<dt>
<code>MarkUnsupportedException</code>
</dt>
     
<dd>
<p align="justify">This exception is thrown if an input stream that is to be parsed
       into a property set does not support the
       <code>InputStream.mark(int)</code> operation. The POI filesystem uses
       the <code>DocumentInputStream</code> class which does support this
       operation, so you are safe here. However, if you read a property set
       stream from another kind of input stream things may be
       different.</p>
</dd>
    
</dl>
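
<p align="justify">The mark support mentioned for
     <code>MarkUnsupportedException</code> can be checked with the standard
     <code>InputStream.markSupported()</code> method. The following sketch
     uses only <code>java.io</code> classes. The class
     <code>MarkSupportDemo</code> and its helper methods are made up for this
     illustration, and wrapping a stream in a
     <code>BufferedInputStream</code> is shown as one common way to obtain
     mark support, not as something the HPSF documentation prescribes.</p>

<div align="center">
<table cellspacing="2" cellpadding="2" border="1">
<tr>
<td>
<pre>
```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

class MarkSupportDemo {
    // Returns the stream itself if it supports mark/reset,
    // otherwise a buffered wrapper that does.
    static InputStream ensureMarkSupported(InputStream in) {
        return in.markSupported() ? in : new BufferedInputStream(in);
    }

    // Builds a stream without mark support: InputStream.markSupported()
    // returns false unless a subclass overrides it.
    static InputStream unmarkable(final byte[] data) {
        return new InputStream() {
            private int pos = 0;

            public int read() {
                return pos == data.length ? -1 : Byte.toUnsignedInt(data[pos++]);
            }
        };
    }

    // Reads one byte and rewinds, so the next reader sees the same byte.
    static int peek(InputStream in) {
        try {
            in.mark(1);
            int b = in.read();
            in.reset();
            return b;
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    public static void main(String[] args) {
        InputStream raw = unmarkable(new byte[] { 1, 2, 3 });
        System.out.println("raw supports mark: " + raw.markSupported());

        InputStream safe = ensureMarkSupported(raw);
        System.out.println("wrapped supports mark: " + safe.markSupported());

        // Peeking twice yields the same byte because of the rewind.
        System.out.println("peek is repeatable: " + (peek(safe) == peek(safe)));
    }
}
```
</pre>
</td>
</tr>
</table>
</div>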

    
<p align="justify">Many Microsoft Office documents contain <em>embedded
      objects</em>, for example an Excel sheet on a page in a Word
     document. Embedded objects may have property sets of their own. An
     application can open these property set streams as described above. The
     only difference is that they are not located in the POI filesystem's root
     but in a nested directory instead. Just register a
     <code>POIFSReaderListener</code> for the property set streams you are
     interested in. For example, the <em>POIBrowser</em> application in the
     contrib section tries to open each and every document in a POI filesystem
     as a property set stream. If this operation was successful it displays the
     properties.</p>
   
</td>
</tr>
</table>
</div>
<br>

   
<anchor id="sec3"></anchor>
   
<div align="right">
<table cellspacing="0" cellpadding="2" border="0" width="99%">
<tr>
<td bgcolor="#525D76"><font color="#ffffff" size="+0"><font face="Arial,sans-serif"><b>Reading Non-Standard Properties</b></font></font></td>
</tr>
<tr>
<td>
<br>

    
<note>This section tells how to read
     non-standard properties. Non-standard properties are application-specific
     name/value/type triples.</note>

    
<div align="center">
<table cellspacing="2" cellpadding="2" border="1">
<tr>
<td bgcolor="#c0c0c0"><font size="-1" color="#023264">Write this section!</font></td>
</tr>
</table>
</div>
   
</td>
</tr>
</table>
</div>
<br>
  
</td>
</tr>
</table>
</div>
<br>
</td>
</tr>
</table>
<br>
<table width="100%" border="0" cellspacing="0" cellpadding="0">
<tbody>
<tr>
<td>
<hr noshade="" size="1">
</td>
</tr>
<tr>
<td align="center"><i>Copyright &copy; 2002 Apache Software Foundation</i></td>
</tr>
<tr>
<td align="right" width="100%">
<br>
</td>
</tr>
<tr>
<td align="right" width="100%"><a href="http://krysalis.org/"><img alt="Krysalis Logo" src="images/krysalis-compatible.jpg"></a><a href="http://xml.apache.org/cocoon/"><img alt="Cocoon Logo" src="images/built-with-cocoon.gif"></a></td>
</tr>
</tbody>
</table>
</body>
</html>