$Id$
+<book software="POI Project"
+ title="HPSF"
+ copyright="@year@ POI Project">
+ <menu label="Navigation">
+ <menu-item label="Main" href="../index.html"/>
+ </menu>
+ <menu label="HPSF">
+ <menu-item label="Overview" href="index.html"/>
+ <menu-item label="How To" href="how-to.html"/>
+ <menu-item label="Thumbnails" href="thumbnails.html"/>
+ <menu-item label="Internals" href="internals.html"/>
+ <menu-item label="To Do" href="todo.html"/>
+ </menu>
diff --git a/src/documentation/content/xdocs/hpsf/how-to.xml b/src/documentation/content/xdocs/hpsf/how-to.xml
new file mode 100644
index 0000000000..57f880700e
--- /dev/null
+++ b/src/documentation/content/xdocs/hpsf/how-to.xml
@@ -0,0 +1,868 @@
+ <header>
+ <title>HPSF HOW-TO</title>
+ <authors>
+ <person name="Rainer Klute" email=""/>
+ </authors>
+ </header>
+ <body>
+ <section><title>How To Use the HPSF APIs</title>
+ <p>This HOW-TO is organized in three sections. You should read them
+ sequentially because the later sections build upon the earlier ones.</p>
+ <ol>
+ <li>
+ The <link href="#sec1">first section</link> explains how to read
+ the most important standard properties of a Microsoft Office
+ document. Standard properties are things like title, author, creation
+ date etc. It is quite likely that you will find here what you need and
+ don't have to read the other sections.
+ </li>
+ <li>
+ The <link href="#sec2">second section</link> goes a small step
+ further and focusses on reading additional standard properties. It also
+ talks about exceptions that may be thrown when dealing with HPSF and
+ shows how you can read properties of embedded objects.
+ </li>
+ <li>
+ The <link href="#sec3">third section</link> tells how to read
+ non-standard properties. Non-standard properties are application-specific
+ triples consisting of an ID, a type, and a value.
+ </li>
+ </ol>
+ <anchor id="sec1"/>
+ <section><title>Reading Standard Properties</title>
+ <note>This section explains how to read
+ the most important standard properties of a Microsoft Office
+ document. Standard properties are things like title, author, creation
+ date etc. Chances are that you will find here what you need and
+ don't have to read the other sections.</note>
+ <p>The first thing you should understand is that properties are stored in
+ separate documents inside the POI filesystem. (If you don't know what a
+ POI filesystem is, read the <link href="../poifs/index.html">POIFS
+ documentation</link>.) A document in a POI filesystem is also called a
+ <strong>stream</strong>.</p>
+ <p>The following example shows how to read a POI filesystem's
+ "title" property. Reading other properties is similar. Consult the API
+ documentation of <code>org.apache.poi.hpsf.SummaryInformation</code> to
+ learn which methods are available.</p>
+ <p>The standard properties this section focusses on can be found in a
+ document called <em>\005SummaryInformation</em> located in the root of the
+ POI filesystem. The notation <em>\005</em> in the document's name means
+ the character with the decimal value of 5. In order to read the title, an
+ application has to perform the following steps:</p>
+ <ol>
+ <li>
+ Open the document <em>\005SummaryInformation</em> located in the root
+ of the POI filesystem.
+ </li>
+ <li>
+ Create an instance of the class <code>SummaryInformation</code> from
+ that document.
+ </li>
+ <li>
+ Call the <code>SummaryInformation</code> instance's
+ <code>getTitle()</code> method.
+ </li>
+ </ol>
+ <p>Sounds easy, doesn't it? Here are the steps in detail.</p>
+ <section><title>Open the document \005SummaryInformation in the root of the
+ POI filesystem</title>
+ <p>An application that wants to open a document in a POI filesystem
+ (POIFS) proceeds as shown by the following code fragment. (The full
+ source code of the sample application is available in the
+ <em>examples</em> section of the POI source tree as
+ <em></em>.)</p>
+ <source>
+import java.io.*;
+import org.apache.poi.hpsf.*;
+import org.apache.poi.poifs.eventfilesystem.*;
+
+// ...
+
+public static void main(String[] args)
+    throws IOException
+{
+    final String filename = args[0];
+    POIFSReader r = new POIFSReader();
+    r.registerListener(new MyPOIFSReaderListener(),
+                       "\005SummaryInformation");
+    r.read(new FileInputStream(filename));
+}</source>
+ <p>The first interesting statement is</p>
+ <source>POIFSReader r = new POIFSReader();</source>
+ <p>It creates a
+ <code>org.apache.poi.poifs.eventfilesystem.POIFSReader</code> instance
+ which we shall need to read the POI filesystem. Before the application
+ actually opens the POI filesystem we have to tell the
+ <code>POIFSReader</code> which documents we are interested in. In this
+ case the application should do something with the document
+ <em>\005SummaryInformation</em>.</p>
+ <source>
+r.registerListener(new MyPOIFSReaderListener(),
+ "\005SummaryInformation");</source>
+ <p>This method call registers a
+ <code>org.apache.poi.poifs.eventfilesystem.POIFSReaderListener</code>
+ with the <code>POIFSReader</code>. The <code>POIFSReaderListener</code>
+ interface specifies the method <code>processPOIFSReaderEvent</code>
+ which processes a document. The class
+ <code>MyPOIFSReaderListener</code> implements the
+ <code>POIFSReaderListener</code> and thus the
+ <code>processPOIFSReaderEvent</code> method. The eventing POI filesystem
+ calls this method when it finds the <em>\005SummaryInformation</em>
+ document. In the sample application <code>MyPOIFSReaderListener</code> is
+ a static class in the <em></em> source file.</p>
+ <p>Now everything is prepared and reading the POI filesystem can
+ start:</p>
+ <source>r.read(new FileInputStream(filename));</source>
+ <p>The following source code fragment shows the
+ <code>MyPOIFSReaderListener</code> class and how it retrieves the
+ title.</p>
+ <source>
+static class MyPOIFSReaderListener implements POIFSReaderListener
+{
+    public void processPOIFSReaderEvent(POIFSReaderEvent event)
+    {
+        SummaryInformation si = null;
+        try
+        {
+            si = (SummaryInformation)
+                 PropertySetFactory.create(event.getStream());
+        }
+        catch (Exception ex)
+        {
+            throw new RuntimeException
+                ("Property set stream \"" +
+                 event.getPath() + event.getName() + "\": " + ex);
+        }
+        final String title = si.getTitle();
+        if (title != null)
+            System.out.println("Title: \"" + title + "\"");
+        else
+            System.out.println("Document has no title.");
+    }
+}</source>
+ <p>The line</p>
+ <source>SummaryInformation si = null;</source>
+ <p>declares a <code>SummaryInformation</code> variable and initializes it
+ with <code>null</code>. We need an instance of this class to access the
+ title. The instance is created in a <code>try</code> block:</p>
+ <source>si = (SummaryInformation)
+ PropertySetFactory.create(event.getStream());</source>
+ <p>The expression <code>event.getStream()</code> returns the input stream
+ containing the bytes of the property set stream named
+ <em>\005SummaryInformation</em>. This stream is passed into the
+ <code>create</code> method of the factory class
+ <code>org.apache.poi.hpsf.PropertySetFactory</code> which returns
+ a <code>org.apache.poi.hpsf.PropertySet</code> instance. It is more or
+ less safe to cast this result to <code>SummaryInformation</code>, a
+ convenience class with methods like <code>getTitle()</code>,
+ <code>getAuthor()</code> etc.</p>
+ <p>The <code>PropertySetFactory.create</code> method may throw all sorts
+ of exceptions. We'll deal with them in the next sections. For now we just
+ catch all exceptions and throw a <code>RuntimeException</code>
+ containing the message text of the original exception.</p>
+ <p>If all goes well, the sample application retrieves the title and prints
+ it to the standard output. As you can see you must be prepared for the
+ case that the POI filesystem does not have a title.</p>
+ <source>final String title = si.getTitle();
+if (title != null)
+    System.out.println("Title: \"" + title + "\"");
+else
+    System.out.println("Document has no title.");</source>
+ <p>Please note that a Microsoft Office document does not necessarily
+ contain the <em>\005SummaryInformation</em> stream. The documents created
+ by the Microsoft Office suite have one, as far as I know. However, an
+ Excel spreadsheet exported from StarOffice 5.2 won't have a
+ <em>\005SummaryInformation</em> stream. In this case the application
+ does not throw an exception; the reader simply never calls the
+ <code>processPOIFSReaderEvent</code> method. You have been warned!</p>
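+ <p>The stream name used throughout this section begins with a control
+ character, which is easy to get wrong when typing the name by hand. The
+ following minimal sketch uses only the Java standard library, not HPSF;
+ the class <code>StreamNameDemo</code> and its helper
+ <code>printable</code> are made up for illustration. It shows that the
+ Java octal escape <code>\005</code> yields the character with the value 5
+ and how such a name could be made readable in log output.</p>

```java
/** Illustration only: inspects the name of the summary information stream. */
class StreamNameDemo {
    /** The stream name as passed to registerListener() in this section. */
    static final String SUMMARY_INFORMATION = "\005SummaryInformation";

    /** Replaces control characters by a three-digit octal escape. */
    static String printable(String name) {
        StringBuilder sb = new StringBuilder();
        for (char c : name.toCharArray()) {
            if (c < 32) {
                // Control character: write a backslash and its octal code.
                sb.append(String.format("\\%03o", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Numeric value of the first character of the stream name.
        System.out.println((int) SUMMARY_INFORMATION.charAt(0));
        // The name with the control character made visible.
        System.out.println(printable(SUMMARY_INFORMATION));
    }
}
```

+ <p>A helper of this kind is only a convenience for diagnostics; the name
+ passed to the reader must still contain the real control character.</p>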
+ </section>
+ </section>
+ <anchor id="sec2"/>
+ <section><title>Additional Standard Properties, Exceptions And Embedded Objects</title>
+ <note>This section focusses on reading additional standard properties. It
+ also talks about exceptions that may be thrown when dealing with HPSF and
+ shows how you can read properties of embedded objects.</note>
+ <p>A couple of <strong>additional standard properties</strong> are not
+ contained in the <em>\005SummaryInformation</em> stream explained above,
+ for example a document's category or the number of multimedia clips in a
+ PowerPoint presentation. Microsoft has invented an additional stream named
+ <em>\005DocumentSummaryInformation</em> to hold these properties. With two
+ minor exceptions you can proceed exactly as described above to read the
+ properties stored in <em>\005DocumentSummaryInformation</em>:</p>
+ <ul>
+ <li>Instead of <em>\005SummaryInformation</em> use
+ <em>\005DocumentSummaryInformation</em> as the stream's name.</li>
+ <li>Replace all occurrences of the class
+ <code>SummaryInformation</code> by
+ <code>DocumentSummaryInformation</code>.</li>
+ </ul>
+ <p>And of course you cannot call <code>getTitle()</code> because
+ <code>DocumentSummaryInformation</code> has different query methods. See
+ the Javadoc API documentation for the details!</p>
+ <p>In the previous section the application simply caught all
+ <strong>exceptions</strong> and was in no way interested in any
+ details. However, a real application will likely want to know what went
+ wrong and act appropriately. Besides any I/O exceptions there are three
+ exceptions specific to HPSF or POI that you should know about:</p>
+ <dl>
+ <dt><code>NoPropertySetStreamException</code>:</dt>
+ <dd>
+ This exception is thrown if the application tries to create a
+ <code>PropertySet</code> instance from a stream that is not a
+ property set stream. (<code>SummaryInformation</code> and
+ <code>DocumentSummaryInformation</code> are subclasses of
+ <code>PropertySet</code>.) A faulty property set stream counts as not
+ being a property set stream at all. An application should be prepared to
+ deal with this case even if it opens streams named
+ <em>\005SummaryInformation</em> or
+ <em>\005DocumentSummaryInformation</em> only. These are just names. A
+ stream's name by itself does not ensure that the stream contains the
+ expected contents or that these contents are correct.
+ </dd>
+ <dt><code>UnexpectedPropertySetTypeException</code></dt>
+ <dd>This exception is thrown if a certain type of property set is
+ expected somewhere (e.g. a <code>SummaryInformation</code> or
+ <code>DocumentSummaryInformation</code>) but the provided property
+ set is not of that type.</dd>
+ <dt><code>MarkUnsupportedException</code></dt>
+ <dd>This exception is thrown if an input stream that is to be parsed
+ into a property set does not support the
+ <code>InputStream.mark(int)</code> operation. The POI filesystem uses
+ the <code>DocumentInputStream</code> class which does support this
+ operation, so you are safe here. However, if you read a property set
+ stream from another kind of input stream things may be
+ different.</dd>
+ </dl>
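+ <p>When a property set stream comes from some other kind of input stream,
+ mark support can be checked before parsing. The sketch below uses only
+ the Java standard library and assumes that wrapping the stream in a
+ <code>java.io.BufferedInputStream</code> is acceptable for the
+ application. The class <code>MarkSupportDemo</code> and the method
+ <code>ensureMarkSupported</code> are hypothetical names, not part of
+ HPSF.</p>

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.InputStream;

/** Illustration only: makes sure a stream supports mark() before parsing. */
class MarkSupportDemo {
    /**
     * Returns the stream itself if it supports mark(), otherwise a
     * BufferedInputStream around it, which does support mark().
     */
    static InputStream ensureMarkSupported(InputStream in) {
        return in.markSupported() ? in : new BufferedInputStream(in);
    }

    public static void main(String[] args) {
        // A bare InputStream subclass inherits markSupported() == false.
        InputStream plain = new InputStream() {
            @Override
            public int read() {
                return -1;
            }
        };
        System.out.println(plain.markSupported());
        System.out.println(ensureMarkSupported(plain).markSupported());

        // ByteArrayInputStream already supports mark() and is returned as is.
        InputStream bytes = new ByteArrayInputStream(new byte[] {1, 2, 3});
        System.out.println(ensureMarkSupported(bytes) == bytes);
    }
}
```

+ <p>The wrapped stream could then be handed to the property set factory
+ in place of the original one.</p>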
+ <p>Many Microsoft Office documents contain <strong>embedded
+ objects</strong>, for example an Excel sheet on a page in a Word
+ document. Embedded objects may have property sets of their own. An
+ application can open these property set streams as described above. The
+ only difference is that they are not located in the POI filesystem's root
+ but in a <strong>nested directory</strong> instead. Just register a
+ <code>POIFSReaderListener</code> for the property set streams you are
+ interested in. For example, the <em>POIBrowser</em> application in the
+ contrib section tries to open each and every document in a POI filesystem
+ as a property set stream. If this operation was successful it displays the
+ properties.</p>
+ </section>
+ <anchor id="sec3"/>
+ <section><title>Reading Non-Standard Properties</title>
+ <note>This section tells how to read non-standard properties. Non-standard
+ properties are application-specific ID/type/value triples.</note>
+ <section><title>Overview</title>
+ <p>Now comes the real hardcore stuff. As mentioned above,
+ <code>SummaryInformation</code> and
+ <code>DocumentSummaryInformation</code> are just special cases of the
+ general concept of a property set. This concept says that a
+ <strong>property set</strong> consists of properties and that each
+ <strong>property</strong> is an entity with an <strong>ID</strong>, a
+ <strong>type</strong>, and a <strong>value</strong>.</p>
+ <p>Okay, that was still rather easy. However, to make things more
+ complicated, Microsoft in its infinite wisdom decided that a property set
+ shalt be broken into one or more <strong>sections</strong>. Each section
+ holds a bunch of properties. But since that's still not complicated
+ enough, a section may have an optional <strong>dictionary</strong> that
+ maps property IDs to <strong>property names</strong> - we'll explain
+ later what that means.</p>
+ <p>The procedure to get to the properties is the following:</p>
+ <ol>
+ <li>Use the <strong><code>PropertySetFactory</code></strong> class to
+ create a <code>PropertySet</code> object from a property set stream. If
+ you don't know whether an input stream is a property set stream, just
+ try to call <code>PropertySetFactory.create(InputStream)</code>:
+ Either a <code>PropertySet</code> instance is returned or an
+ exception is thrown.</li>
+ <li>Call the <code>PropertySet</code>'s method <code>getSections()</code>
+ to get the sections contained in the property set. Each section is
+ an instance of the <code>Section</code> class.</li>
+ <li>Each section has a format ID. The format ID of the first section in a
+ property set determines the property set's type. For example, the first
+ (and only) section of the SummaryInformation property set has a format
+ ID of <code>F29F85E0-4FF9-1068-AB-91-08-00-2B-27-B3-D9</code>. You can
+ get the format ID with <code>Section.getFormatID()</code>.</li>
+ <li>The properties contained in a <code>Section</code> can be retrieved
+ with <code>Section.getProperties()</code>. The result is an array of
+ <code>Property</code> instances.</li>
+ <li>A property has an ID, a type, and a value. The <code>Property</code>
+ class has methods to retrieve them.</li>
+ </ol>
+ </section>
+ <section><title>A Sample Application</title>
+ <p>Let's have a look at a sample Java application that dumps all property
+ set streams contained in a POI file system. The full source code of this
+ program can be found as <em></em> in the
+ <em>examples</em> area of the POI source code tree. Here are the key
+ sections:</p>
+ <source>import java.io.*;
+import java.util.*;
+import org.apache.poi.hpsf.*;
+import org.apache.poi.poifs.eventfilesystem.*;
+import org.apache.poi.util.HexDump;</source>
+ <p>The most important package the application needs is
+ <code>org.apache.poi.hpsf.*</code>. This package contains the HPSF
+ classes. Most classes named below are from the HPSF package. Of course we
+ also need the POIFS event file system's classes and <code>java.io.*</code>
+ since we are dealing with POI I/O. From the <code>java.util</code> package
+ we use the <code>List</code> and <code>Iterator</code> interfaces. The class
+ <code>org.apache.poi.util.HexDump</code> provides methods to dump byte
+ arrays as nicely formatted strings.</p>
+ <source>public static void main(String[] args)
+    throws IOException
+{
+    final String filename = args[0];
+    POIFSReader r = new POIFSReader();
+
+    /* Register a listener for *all* documents. */
+    r.registerListener(new MyPOIFSReaderListener());
+    r.read(new FileInputStream(filename));
+}</source>
+ <p>The <code>POIFSReader</code> is set up in a way that the listener
+ <code>MyPOIFSReaderListener</code> is called on every file in the POI file
+ system.</p>
+ </section>
+ <section><title>The Property Set</title>
+ <p>The listener class tries to create a <code>PropertySet</code> from each
+ stream using the <code>PropertySetFactory.create()</code> method:</p>
+ <source>static class MyPOIFSReaderListener implements POIFSReaderListener
+{
+ public void processPOIFSReaderEvent(POIFSReaderEvent event)
+ {
+ PropertySet ps = null;
+ try
+ {
+ ps = PropertySetFactory.create(event.getStream());
+ }
+ catch (NoPropertySetStreamException ex)
+ {
+ out("No property set stream: \"" + event.getPath() +
+ event.getName() + "\"");
+ return;
+ }
+ catch (Exception ex)
+ {
+ throw new RuntimeException
+ ("Property set stream \"" +
+ event.getPath() + event.getName() + "\": " + ex);
+ }
+ /* Print the name of the property set stream: */
+ out("Property set stream \"" + event.getPath() +
+ event.getName() + "\":");</source>
+ <p>Creating the <code>PropertySet</code> is done in a <code>try</code>
+ block, because not every stream in the POI file system contains a property
+ set. If the stream holds something else,
+ <code>PropertySetFactory.create()</code> throws a
+ <code>NoPropertySetStreamException</code>, which is caught and
+ logged. Then the program continues with the next stream. However, all
+ other types of exceptions cause the program to terminate by throwing a
+ runtime exception. If all went well, we can print the name of the property
+ set stream.</p>
+ </section>
+ <section><title>The Sections</title>
+ <p>The next step is to print the number of sections followed by the
+ sections themselves:</p>
+ <source>/* Print the number of sections: */
+final long sectionCount = ps.getSectionCount();
+out(" No. of sections: " + sectionCount);
+
+/* Print the list of sections: */
+List sections = ps.getSections();
+int nr = 0;
+for (Iterator i = sections.iterator(); i.hasNext();)
+{
+    /* Print a single section: */
+    Section sec = (Section) i.next();
+
+    // See below for the complete loop body.
+}</source>
+ <p>The <code>PropertySet</code>'s method <code>getSectionCount()</code>
+ returns the number of sections.</p>
+ <p>To retrieve the sections, use the <code>getSections()</code>
+ method. This method returns a <code>java.util.List</code> containing
+ instances of the <code>Section</code> class in their proper order.</p>
+ <p>The sample code shows a loop that retrieves the <code>Section</code>
+ objects one by one and prints some information about each one. Here is
+ the complete body of the loop:</p>
+ <source>/* Print a single section: */
+Section sec = (Section) i.next();
+out(" Section " + nr++ + ":");
+String s = hex(sec.getFormatID().getBytes());
+s = s.substring(0, s.length() - 1);
+out(" Format ID: " + s);
+
+/* Print the number of properties in this section. */
+int propertyCount = sec.getPropertyCount();
+out(" No. of properties: " + propertyCount);
+
+/* Print the properties: */
+Property[] properties = sec.getProperties();
+for (int i2 = 0; i2 &lt; properties.length; i2++)
+{
+    /* Print a single property: */
+    Property p = properties[i2];
+    int id = p.getID();
+    long type = p.getType();
+    Object value = p.getValue();
+    out(" Property ID: " + id + ", type: " + type +
+        ", value: " + value);
+}</source>
+ </section>
+ <section><title>The Section's Format ID</title>
+ <p>The first method called on the <code>Section</code> instance is
+ <code>getFormatID()</code>. As explained above, the format ID of the
+ first section in a property set determines the type of the property
+ set. Its type is <code>ClassID</code> which is essentially a sequence of
+ 16 bytes. A real application using its own type of a custom property set
+ should have defined a unique format ID and, when reading a property set
+ stream, should check the format ID is equal to that unique format ID. The
+ sample program just prints the format ID it finds in a section:</p>
+ <source>String s = hex(sec.getFormatID().getBytes());
+s = s.substring(0, s.length() - 1);
+out(" Format ID: " + s);</source>
+ <p>As you can see, the <code>getFormatID()</code> method returns a
+ <code>ClassID</code> object. An array containing the bytes can be
+ retrieved with <code>ClassID.getBytes()</code>. In order to get a nicely
+ formatted printout, the sample program uses the <code>hex()</code> helper
+ method which in turn uses the POI utility class <code>HexDump</code> in
+ the <code>org.apache.poi.util</code> package. Another helper method is
+ <code>out()</code> which just saves typing
+ <code>System.out.println()</code>.</p>
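+ <p>Since a format ID is just 16 bytes, checking for an expected format ID
+ boils down to a byte-wise comparison. The following sketch uses only the
+ Java standard library; the class <code>FormatIdDemo</code> and its
+ methods are hypothetical and do not reproduce the sample program's
+ <code>hex()</code> helper. The two byte sequences are the format IDs of
+ the document summary information sections as they appear in the sample
+ output of this HOW-TO.</p>

```java
/** Illustration only: formats and compares 16-byte format IDs. */
class FormatIdDemo {
    /** Format ID of the first document summary information section. */
    static final byte[] DOC_SUMMARY_SECTION_0 = bytes(
        0xD5, 0xCD, 0xD5, 0x02, 0x2E, 0x9C, 0x10, 0x1B,
        0x93, 0x97, 0x08, 0x00, 0x2B, 0x2C, 0xF9, 0xAE);

    /** Format ID of the second document summary information section. */
    static final byte[] DOC_SUMMARY_SECTION_1 = bytes(
        0xD5, 0xCD, 0xD5, 0x05, 0x2E, 0x9C, 0x10, 0x1B,
        0x93, 0x97, 0x08, 0x00, 0x2B, 0x2C, 0xF9, 0xAE);

    /** Converts int literals to a byte array to avoid repeated casts. */
    static byte[] bytes(int... values) {
        byte[] result = new byte[values.length];
        for (int i = 0; i < values.length; i++) {
            result[i] = (byte) values[i];
        }
        return result;
    }

    /** Renders the bytes as two-digit hex numbers separated by spaces. */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < bytes.length; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", bytes[i] & 0xFF));
        }
        return sb.toString();
    }

    /** Returns the index of the first differing byte, or -1 if equal. */
    static int firstDifference(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            if (a[i] != b[i]) {
                return i;
            }
        }
        return a.length == b.length ? -1 : n;
    }

    public static void main(String[] args) {
        System.out.println(toHex(DOC_SUMMARY_SECTION_0));
        System.out.println(toHex(DOC_SUMMARY_SECTION_1));
        System.out.println(firstDifference(DOC_SUMMARY_SECTION_0,
                                           DOC_SUMMARY_SECTION_1));
    }
}
```

+ <p>An application with its own property set type would compare the bytes
+ it reads against its own constant in the same way.</p>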
+ </section>
+ <section><title>The Properties</title>
+ <p>Before getting the properties, it is possible to find out how many
+ properties are available in the section by calling
+ <code>Section.getPropertyCount()</code>. The sample application uses this
+ method to print the number of properties to the standard output:</p>
+ <source>int propertyCount = sec.getPropertyCount();
+out(" No. of properties: " + propertyCount);</source>
+ <p>Now it's time to get to the properties themselves. You can retrieve a
+ section's properties with the method
+ <code>Section.getProperties()</code>:</p>
+ <source>Property[] properties = sec.getProperties();</source>
+ <p>As you can see the result is an array of <code>Property</code>
+ objects. This class has three methods to retrieve a property's ID, its
+ type, and its value. The following code snippet shows how to call
+ them:</p>
+ <source>for (int i2 = 0; i2 &lt; properties.length; i2++)
+{
+    /* Print a single property: */
+    Property p = properties[i2];
+    int id = p.getID();
+    long type = p.getType();
+    Object value = p.getValue();
+    out(" Property ID: " + id + ", type: " + type +
+        ", value: " + value);
+}</source>
+ </section>
+ <section><title>Sample Output</title>
+ <p>The output of the sample program might look like the following. It
+ shows the summary information and the document summary information
+ property sets of a Microsoft Word document. However, unlike the first and
+ second section of this HOW-TO the application does not have any code
+ which is specific to the <code>SummaryInformation</code> and
+ <code>DocumentSummaryInformation</code> classes.</p>
+ <source>Property set stream "/SummaryInformation":
+ No. of sections: 1
+ Section 0:
+ Format ID: 00000000 F2 9F 85 E0 4F F9 10 68 AB 91 08 00 2B 27 B3 D9 ....O..h....+'..
+ No. of properties: 17
+ Property ID: 1, type: 2, value: 1252
+ Property ID: 2, type: 30, value: Titel
+ Property ID: 3, type: 30, value: Thema
+ Property ID: 4, type: 30, value: Rainer Klute (Autor)
+ Property ID: 5, type: 30, value: Test (Stichwörter)
+ Property ID: 6, type: 30, value: This is a document for testing HPSF
+ Property ID: 7, type: 30, value:
+ Property ID: 8, type: 30, value: Unknown User
+ Property ID: 9, type: 30, value: 3
+ Property ID: 18, type: 30, value: Microsoft Word 9.0
+ Property ID: 12, type: 64, value: Mon Jan 01 00:59:25 CET 1601
+ Property ID: 13, type: 64, value: Thu Jul 18 16:22:00 CEST 2002
+ Property ID: 14, type: 3, value: 1
+ Property ID: 15, type: 3, value: 20
+ Property ID: 16, type: 3, value: 93
+ Property ID: 19, type: 3, value: 0
+ Property ID: 17, type: 71, value: [B@13582d
+Property set stream "/DocumentSummaryInformation":
+ No. of sections: 2
+ Section 0:
+ Format ID: 00000000 D5 CD D5 02 2E 9C 10 1B 93 97 08 00 2B 2C F9 AE ............+,..
+ No. of properties: 14
+ Property ID: 1, type: 2, value: 1252
+ Property ID: 2, type: 30, value: Test
+ Property ID: 14, type: 30, value: Rainer Klute (Manager)
+ Property ID: 15, type: 30, value: Rainer Klute IT-Consulting GmbH
+ Property ID: 5, type: 3, value: 3
+ Property ID: 6, type: 3, value: 2
+ Property ID: 17, type: 3, value: 111
+ Property ID: 23, type: 3, value: 592636
+ Property ID: 11, type: 11, value: false
+ Property ID: 16, type: 11, value: false
+ Property ID: 19, type: 11, value: false
+ Property ID: 22, type: 11, value: false
+ Property ID: 13, type: 4126, value: [B@56a499
+ Property ID: 12, type: 4108, value: [B@506411
+ Section 1:
+ Format ID: 00000000 D5 CD D5 05 2E 9C 10 1B 93 97 08 00 2B 2C F9 AE ............+,..
+ No. of properties: 7
+ Property ID: 0, type: 0, value: {6=Test-JaNein, 5=Test-Zahl, 4=Test-Datum, 3=Test-Text, 2=_PID_LINKBASE}
+ Property ID: 1, type: 2, value: 1252
+ Property ID: 2, type: 65, value: [B@c9ba38
+ Property ID: 3, type: 30, value: This is some text.
+ Property ID: 4, type: 64, value: Wed Jul 17 00:00:00 CEST 2002
+ Property ID: 5, type: 3, value: 27
+ Property ID: 6, type: 11, value: true
+No property set stream: "/WordDocument"
+No property set stream: "/CompObj"
+No property set stream: "/1Table"</source>
+ <p>There are some interesting items to note:</p>
+ <ul>
+ <li>The first property set (summary information) consists of a single
+ section, the second property set (document summary information) consists
+ of two sections.</li>
+ <li>Each section type (identified by its format ID) has its own domain of
+ property IDs. For example, in the second property set the properties with
+ ID 2 have different meanings in the two sections. By the way, the format
+ IDs of these sections are <strong>not</strong> equal, but you have to
+ look hard to find the difference.</li>
+ <li>The properties are not in any particular order in the section,
+ although they slightly tend to be sorted by their IDs.</li>
+ </ul>
+ </section>
+ <section><title>Property IDs</title>
+ <p>Properties in the same section are distinguished by their IDs. This is
+ similar to variables in a programming language like Java, which are
+ distinguished by their names. But unlike variable names, property IDs are
+ simple integral numbers. There is another similarity, however. Just like
+ a Java variable has a certain scope (e.g. a member variable in a class),
+ a property ID also has its scope of validity: the section.</p>
+ <p>Two property IDs in sections with different section format IDs
+ don't have the same meaning even though their IDs might be equal. For
+ example, ID 4 in the first (and only) section of a summary
+ information property set denotes the document's author, while ID 4 in the
+ first section of the document summary information property set means the
+ document's byte count. The sample output above does not show a property
+ with an ID of 4 in the first section of the document summary information
+ property set. That means that the document does not have a byte
+ count. However, there is a property with an ID of 4 in the
+ <em>second</em> section: This is a user-defined property ID - we'll get
+ to that topic in a minute.</p>
+ <p>So, how can you find out what the meaning of a certain property ID in
+ the summary information and the document summary information property set
+ is? The standard property sets as such don't have any hints about the
+ <strong>meanings of their property IDs</strong>. For example, the summary
+ information property set does not tell you that the property ID 4 stands
+ for the document's author. This is external knowledge. Microsoft defined
+ standard meanings for some of the property IDs in the summary information
+ and the document summary information property sets. As a help to the Java
+ and POI programmer, the class <code>PropertyIDMap</code> in the
+ <code>org.apache.poi.hpsf.wellknown</code> package defines constants
+ for the "well-known" property IDs. For example, there is the
+ definition</p>
+ <source>public final static int PID_AUTHOR = 4;</source>
+ <p>These definitions allow you to use symbolic names instead of
+ numbers.</p>
+ <p>To support the opposite direction as well, i.e. mapping property IDs
+ to property names, the class <code>PropertyIDMap</code>
+ defines two static methods:
+ <code>getSummaryInformationProperties()</code> and
+ <code>getDocumentSummaryInformationProperties()</code>. Both return
+ <code>java.util.Map</code> objects which map property IDs to
+ strings. Such a string gives a hint about the property's meaning. For
+ example,
+ <code>PropertyIDMap.getSummaryInformationProperties().get(4)</code>
+ returns the string "PID_AUTHOR". An application could use this string as
+ a key to a localized string which is displayed to the user, e.g. "Author"
+ in English or "Verfasser" in German. HPSF might provide such
+ language-dependent ("localized") mappings in a later release.</p>
+ <p>Usually you won't have to deal with those two maps. Instead you should
+ call the <code>Section.getPIDString(int)</code> method. It returns the
+ string associated with the specified property ID in the context of the
+ <code>Section</code> object.</p>
+ <p>Above you learned that property IDs have a meaning in the scope of a
+ section only. However, there are two exceptions to the rule: The property
+ IDs 0 and 1 have a fixed meaning in <strong>all</strong> sections:</p>
+ <table>
+ <tr>
+ <th>Property ID</th>
+ <th>Meaning</th>
+ </tr>
+ <tr>
+ <td>0</td>
+ <td>The property's value is a <strong>dictionary</strong>, i.e. a
+ mapping from property IDs to strings.</td>
+ </tr>
+ <tr>
+ <td>1</td>
+ <td>The property's value is the number of a <strong>codepage</strong>,
+ i.e. a mapping from character codes to characters. All strings in the
+ section containing this property must be interpreted using this
+ codepage. Typical property values are 1252 (8-bit "western" characters)
+ or 1200 (16-bit Unicode characters).</td>
+ </tr>
+ </table>
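+ <p>To illustrate what "interpreting strings using the codepage" means,
+ here is a minimal sketch that decodes raw bytes for the two codepage
+ numbers mentioned in the table. It is not HPSF code. The class
+ <code>CodepageDemo</code> is hypothetical, and the mapping of codepage
+ 1252 to the Java charset name <code>windows-1252</code> and of codepage
+ 1200 to little-endian UTF-16 is an assumption of this sketch. Whether
+ <code>windows-1252</code> is available depends on the Java runtime.</p>

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Illustration only: decodes string bytes according to a codepage number. */
class CodepageDemo {
    /** Maps the two codepage numbers from the table to Java charsets. */
    static Charset charsetFor(int codepage) {
        switch (codepage) {
            case 1200:
                // Assumption: 16-bit Unicode characters, low byte first.
                return StandardCharsets.UTF_16LE;
            case 1252:
                // Assumption: this charset name is known to the runtime.
                return Charset.forName("windows-1252");
            default:
                throw new IllegalArgumentException(
                    "Unsupported codepage: " + codepage);
        }
    }

    /** Turns raw string bytes into a Java string using the codepage. */
    static String decode(byte[] raw, int codepage) {
        return new String(raw, charsetFor(codepage));
    }

    public static void main(String[] args) {
        // The same two characters, encoded once per codepage.
        byte[] western = {(byte) 0x54, (byte) 0xF6};
        byte[] unicode = {(byte) 0x54, 0x00, (byte) 0xF6, 0x00};
        System.out.println(decode(western, 1252).equals(decode(unicode, 1200)));
    }
}
```

+ <p>The point of the sketch is that the same characters occupy different
+ bytes depending on the codepage, which is why the codepage property must
+ be consulted before any string in the section is decoded.</p>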
+ </section>
+ <section><title>Property types</title>
+ <p>A property is nothing without its value. It is stored in a property set
+ stream as a sequence of bytes. You must know the property's
+ <strong>type</strong> in order to properly interpret those bytes and
+ reasonably handle the value. A property's type is one of the so-called
+ Microsoft-defined <strong>"variant types"</strong>. When you call
+ <code>Property.getType()</code> you'll get a <code>long</code> value
+ denoting the property's variant type. The class
+ <code>Variant</code> in the <code>org.apache.poi.hpsf</code> package
+ holds most of those <code>long</code> values as named constants. For
+ example, the constant <code>VT_I4 = 3</code> means a signed integer value
+ of four bytes. Examples of other types are <code>VT_LPSTR = 30</code>
+ meaning a null-terminated string of 8-bit characters, <code>VT_LPWSTR =
+ 31</code> which means a null-terminated Unicode string, or <code>VT_BOOL
+ = 11</code> denoting a boolean value.</p>
+ <p>In most cases you won't need a property's type because HPSF does all
+ the work for you.</p>
+ </section>
+ <section><title>Property values</title>
+ <p>When an application wants to retrieve a property's value and calls
+ <code>Property.getValue()</code>, HPSF has to interpret the bytes making
+ up the value according to the property's type. The type determines how
+ many bytes the value consists of and what
+ to do with them. For example, if the type is <code>VT_I4</code>, HPSF
+ knows that the value is four bytes long and that these bytes
+ comprise a signed integer value in the little-endian format. This is
+ quite different from e.g. a type of <code>VT_LPWSTR</code>. In this case
+ HPSF has to scan the value bytes for a Unicode null character and collect
+ everything from the beginning to that null character as a Unicode
+ string.</p>
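+ <p>The two interpretations described above can be sketched with the Java
+ standard library alone. The code below is not HPSF's implementation; the
+ class <code>VariantDecodeDemo</code> and its methods are hypothetical. It
+ only mirrors the description in this section and ignores any further
+ fields the real stream layout may contain, such as a length prefix in
+ front of a string.</p>

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

/** Illustration only: interprets value bytes for two variant types. */
class VariantDecodeDemo {
    /** Reads four bytes at the offset as a signed little-endian integer. */
    static int readI4(byte[] data, int offset) {
        return ByteBuffer.wrap(data, offset, 4)
                         .order(ByteOrder.LITTLE_ENDIAN)
                         .getInt();
    }

    /**
     * Collects 16-bit characters from the offset up to, but excluding,
     * the first Unicode null character (or the end of the data).
     */
    static String readNullTerminatedUtf16(byte[] data, int offset) {
        int end = offset;
        // Advance two bytes at a time until a 16-bit zero is found.
        while (end + 1 < data.length
               && (data[end] != 0 || data[end + 1] != 0)) {
            end += 2;
        }
        return new String(data, offset, end - offset,
                          StandardCharsets.UTF_16LE);
    }

    public static void main(String[] args) {
        byte[] number = {0x1B, 0x00, 0x00, 0x00};
        System.out.println(readI4(number, 0));

        byte[] text = {0x48, 0x00, 0x69, 0x00, 0x00, 0x00, 0x5A, 0x00};
        System.out.println(readNullTerminatedUtf16(text, 0));
    }
}
```

+ <p>Note how the integer has a fixed size while the length of the string
+ is only known after scanning for the terminator.</p>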
+ <p>The good news is that HPSF does another job for you, too: It maps the
+ variant type to an adequate Java type.</p>
+ <table>
+ <tr>
+ <th>Variant type:</th>
+ <th>Java type:</th>
+ </tr>
+ <tr>
+ <td>VT_I2</td>
+ <td>java.lang.Integer</td>
+ </tr>
+ <tr>
+ <td>VT_I4</td>
+ <td>java.lang.Long</td>
+ </tr>
+ <tr>
+ <td>VT_FILETIME</td>
+ <td>java.util.Date</td>
+ </tr>
+ <tr>
+ <td>VT_LPSTR</td>
+ <td>java.lang.String</td>
+ </tr>
+ <tr>
+ <td>VT_LPWSTR</td>
+ <td>java.lang.String</td>
+ </tr>
+ <tr>
+ <td>VT_CF</td>
+ <td>byte[]</td>
+ </tr>
+ <tr>
+ <td>VT_BOOL</td>
+ <td>java.lang.Boolean</td>
+ </tr>
+ </table>
+ <p>The bad news is that there are still a couple of variant types HPSF
+ does not yet support. If it encounters one of these types it
+ returns the property's value as a byte array and leaves it to be
+ interpreted by the application.</p>
+ <p>An application retrieves a property's value by calling the
+ <code>Property.getValue()</code> method. This method's return type is
+ <code>Object</code>. The <code>getValue()</code> method
+ looks up the property's variant type, reads the property's value bytes,
+ creates an instance of an appropriate Java type, assigns it the property's
+ value and returns it. Values that correspond to primitive types like
+ <code>int</code> or <code>long</code> are returned as instances of the
+ corresponding wrapper class, e.g. <code>Integer</code> or
+ <code>Long</code>.</p>
+ </section>
+ <section><title>Dictionaries</title>
+ <p>The property with ID 0 has a very special meaning: It is a
+ <strong>dictionary</strong> mapping property IDs to property names. We
+ have seen already that the meanings of standard properties in the
+ summary information and the document summary information property sets
+ have been defined by Microsoft. The advantage is that the labels of
+ properties like "Author" or "Title" don't have to be stored in the
+ property set. However, a user can define custom fields in, say, Microsoft
+ Word. For each field the user has to specify a name, a type, and a
+ value.</p>
+ <p>The names of the custom-defined fields (i.e. the property names) are
+ stored in the <strong>dictionary</strong> of the second section of the
+ document summary information. The dictionary is a map which associates
+ property IDs with property names.</p>
+ <p>The method <code>Section.getPIDString(int)</code> returns not only
+ the well-known property names of the summary information and document
+ summary information property sets, but also the names of self-defined
+ properties. It should also work with self-defined properties in
+ self-defined sections.</p>
+ </section>
+ <section><title>Codepage support</title>
+ <fixme author="Rainer Klute">Improve codepage support!</fixme>
+ <p>The property with ID 1 holds the number of the codepage which was used
+ to encode the strings in this section. The present HPSF codepage support
+ is still very limited: When reading property value strings, HPSF
+ distinguishes between 16-bit characters and 8-bit characters. 16-bit
+ characters should be Unicode characters and thus be okay. 8-bit
+ characters are interpreted according to the platform's default character
+ set. This is fine as long as the document being read has been written on
+ a platform with the same default character set. However, if you receive a
+ document from another region of the world and want to process it with
+ HPSF you are in trouble - unless the creator used Unicode, of course.</p>
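+ <p>Why the character set matters can be shown with a few lines of Java. The
+ sketch below is not HPSF code; <code>CodepageSketch</code> is a hypothetical
+ class, and it assumes the application already knows which Java
+ <code>Charset</code> belongs to the codepage number. The same 8-bit bytes
+ yield different strings depending on the character set used to decode
+ them.</p>

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper, not part of HPSF. */
final class CodepageSketch {

    /** Decodes 8-bit characters with an explicit character set instead of the platform default. */
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    /** Convenience variant for bytes known to be ISO-8859-1 (Latin-1). */
    static String decodeLatin1(byte[] bytes) {
        return decode(bytes, StandardCharsets.ISO_8859_1);
    }
}
```

+ <p>Passing the character set explicitly makes the result independent of
+ the platform the reading application happens to run on.</p>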
+ </section>
+ <section><title>Further Reading</title>
+ <p>There are still some aspects of HPSF left which are not covered by this
+ HOW-TO. You should dig into the Javadoc API documentation to learn
+ further details. Since you've struggled through this document up to this
+ point, you are well prepared.</p>
+ </section>
+ </section>
+ </section>
+ </body>
diff --git a/src/documentation/content/xdocs/hpsf/index.xml b/src/documentation/content/xdocs/hpsf/index.xml
+ <header>
+ <title>HPSF (Horrible Property Set Format)</title>
+ <subtitle>Overview</subtitle>
+ <authors>
+ <person name="Rainer Klute" email=""/>
+ </authors>
+ </header>
+ <body>
+ <section><title>Overview</title>
+ <p>Microsoft applications like "Word", "Excel" or "PowerPoint" let the user
+ describe a document by properties like "title", "category" and so on. The
+ application itself adds further information: last author, creation date
+ etc. These document properties are stored in so-called <strong>property set
+ streams</strong>. A property set stream is a separate document within a
+ <link href="../poifs/index.html">POI filesystem</link>. We'll call property
+ set streams mostly just "property sets". HPSF is POI's pure-Java
+ implementation to read (and in future to write) property sets.</p>
+ <p>The <link href="how-to.html">HPSF HOW-TO</link> describes what a Java
+ application should do to read a property set using HPSF and to retrieve the
+ information it needs.</p>
+ <p>HPSF supports OLE2 property set streams in general, and is not limited to
+ the special case of document properties in the Microsoft Office files
+ mentioned above. The <link href="internals.html">HPSF description</link>
+ describes the internal structure of property set streams. A separate
+ document explains the internals of <link href="thumbnails.html">thumbnail
+ images</link>.</p>
+ </section>
+ </body>
diff --git a/src/documentation/content/xdocs/hpsf/internals.xml b/src/documentation/content/xdocs/hpsf/internals.xml
+ <header>
+ <title>HPSF Internals: The Horrible Property Set Format</title>
+ <authors>
+ <person name="Rainer Klute" email=""/>
+ </authors>
+ </header>
+ <body>
+ <section><title>HPSF Internals</title>
+ <section><title>Introduction</title>
+ <p>A Microsoft Office document is internally organized like a filesystem
+ with directories and files. Microsoft calls these files
+ <strong>streams</strong>. A document can have properties attached to it,
+ like author, title, number of words etc. These metadata are not stored in
+ the main stream of, say, a Word document, but instead in a dedicated
+ stream with a special format. Usually this stream's name is
+ <code>\005SummaryInformation</code>, where <code>\005</code> represents
+ the character with a decimal value of 5.</p>
+ <p>A single piece of information in the stream is called a
+ <strong>property</strong>, for example the document title. Each property
+ has an integral <strong>ID</strong> (e.g. 2 for title), a
+ <strong>type</strong> (telling that the title is a string of bytes) and a
+ <strong>value</strong> (what this is should be obvious). A stream
+ containing properties is called a
+ <strong>property set stream</strong>.</p>
+ <p>This document describes the internal structure of a property set stream,
+ i.e. the <strong>Horrible Property Set Format (HPSF)</strong>. It does not
+ describe how a Microsoft Office document is organized internally and how
+ to retrieve a stream from it. See the <link
+ href="../poifs/index.html">POIFS documentation</link> for that kind of
+ stuff.</p>
+ <p>The Horrible Property Set Format is not only used in the Summary
+ Information stream in the top-level document of a Microsoft Office
+ document. Often there is also a property set stream named
+ <code>\005DocumentSummaryInformation</code> with additional properties.
+ Embedded documents may have their own property set streams. You cannot
+ tell by a stream's name whether it is a property set stream or not.
+ Instead you have to open the stream and look at its bytes.</p>
+ </section>
+ <section><title>Data Types</title>
+ <p>Before delving into the details of the property set stream format we
+ have to take a short look at data types. Integral values are stored in the
+ so-called <strong>little endian</strong> format. In this format the bytes
+ that make up an integral value are stored in the "wrong" order. For
+ example, the decimal value 4660 is 0x1234 in hexadecimal notation. If
+ you think this should be represented by a byte 0x12 followed by another
+ byte 0x34, you are right. This is called the <strong>big endian</strong>
+ format. In the little endian format, however, this order is reversed and
+ the low-value byte comes first: a byte 0x34 followed by a byte 0x12.
+ </p>
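+ <p>The difference between the two byte orders can be tried out with
+ <code>java.nio.ByteBuffer</code>. The following is a small sketch with a
+ hypothetical class name <code>EndianSketch</code>; it is not taken from
+ HPSF.</p>

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical helper, not part of HPSF. */
final class EndianSketch {

    /** Interprets the first two bytes as an unsigned Word, low-value byte first. */
    static int readWordLittleEndian(byte[] bytes) {
        return ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).getShort() & 0xFFFF;
    }

    /** Interprets the first two bytes as an unsigned Word, high-value byte first. */
    static int readWordBigEndian(byte[] bytes) {
        return ByteBuffer.wrap(bytes).order(ByteOrder.BIG_ENDIAN).getShort() & 0xFFFF;
    }
}
```

+ <p>Reading the bytes 0x34, 0x12 as little endian and the bytes 0x12, 0x34
+ as big endian gives the same value, 0x1234, in both cases.</p>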
+ <p>The following table gives an overview of some important data
+ types:</p>
+ <table>
+ <tr>
+ <th>Name</th>
+ <th>Length</th>
+ <th>Example (Little Endian)</th>
+ <th>Example (Big Endian)</th>
+ </tr>
+ <tr>
+ <td><strong>Byte</strong></td>
+ <td>1 byte</td>
+ <td><code>0x12</code></td>
+ <td><code>0x12</code></td>
+ </tr>
+ <tr>
+ <td><strong>Word</strong></td>
+ <td>2 bytes</td>
+ <td><code>0x3412</code></td>
+ <td><code>0x1234</code></td>
+ </tr>
+ <tr>
+ <td><strong>DWord</strong></td>
+ <td>4 bytes</td>
+ <td><code>0x78563412</code></td>
+ <td><code>0x12345678</code></td>
+ </tr>
+ <tr>
+ <td><strong>ClassID</strong><br/>
+ A sequence of one DWord, two Words and eight Bytes</td>
+ <td>16 bytes</td>
+ <td><code>0xE0859FF2F94F6810AB9108002B27B3D9</code> resp.
+ <code>E0859FF2-F94F-6810-AB-91-08-00-2B-27-B3-D9</code></td>
+ <td><code>0xF29F85E04FF91068AB9108002B27B3D9</code> resp.
+ <code>F29F85E0-4FF9-1068-AB-91-08-00-2B-27-B3-D9</code></td>
+ </tr>
+ <tr>
+ <td></td>
+ <td></td>
+ <td>The ClassID examples are given here in two different notations. The
+ second notation without the "0x" at the beginning and with dashes
+ inside shows the internal grouping into one DWord, two Words and eight
+ Bytes.</td>
+ <td><em>Watch out:</em> Microsoft documentation and tools show class IDs
+ slightly differently, like
+ <code>F29F85E0-4FF9-1068-AB91-08002B27B3D9</code>.
+ However, that representation is (intentionally?) misleading with
+ respect to endianness.</td>
+ </tr>
+ </table>
+ </section>
+ <section><title>HPSF Overview</title>
+ <p>A property set stream consists of three main parts:</p>
+ <ol>
+ <li>The <strong>header</strong>,</li>
+ <li>the <strong>section list</strong> and</li>
+ <li>the <strong>section(s)</strong> containing the properties.</li>
+ </ol>
+ </section>
+ <section><title>The Header</title>
+ <p>The first bytes in a property set stream form the
+ <strong>header</strong>. It has a fixed length and looks like this:</p>
+ <table>
+ <tr>
+ <th>Offset</th>
+ <th>Type</th>
+ <th>Contents</th>
+ <th>Remarks</th>
+ </tr>
+ <tr>
+ <td>0</td>
+ <td>Word</td>
+ <td><code>0xFFFE</code></td>
+ <td>If the first four bytes of a stream do not contain these values, the
+ stream is not a property set stream.</td>
+ </tr>
+ <tr>
+ <td>2</td>
+ <td>Word</td>
+ <td><code>0x0000</code></td>
+ <td></td>
+ </tr>
+ <tr>
+ <td>4</td>
+ <td>DWord</td>
+ <td>Denotes the operating system and the OS version under which this
+ stream was created. The operating system ID is in the DWord's higher
+ word (after little endian decoding): <code>0x0000</code> for Win16,
+ <code>0x0001</code> for Macintosh and <code>0x0002</code> for Win32 - that's
+ all. The reader is most likely aware of the fact that there are some
+ more operating systems. However, Microsoft does not seem to know.</td>
+ <td></td>
+ </tr>
+ <tr>
+ <td>8</td>
+ <td>ClassID</td>
+ <td><code>0x00000000000000000000000000000000</code></td>
+ <td>Most property set streams have this value but this is not
+ required.</td>
+ </tr>
+ <tr>
+ <td>24</td>
+ <td>DWord</td>
+ <td><code>0x01000000</code> or greater</td>
+ <td>Section count. This field's value should be equal to 1 or greater.
+ Microsoft claims that this is a "reserved" field, but it seems to tell
+ how many sections (see below) are following in the stream. This would
+ really make sense because otherwise you could not know where and how
+ far you should read section data.</td>
+ </tr>
+ </table>
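+ <p>Reading this header might look like the following sketch. It is not
+ HPSF's implementation: the class <code>HeaderSketch</code> and its methods
+ are hypothetical, and the sketch assumes that the Word at offset 0 yields
+ <code>0xFFFE</code> after little endian decoding.</p>

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical helper, not part of HPSF. */
final class HeaderSketch {

    /** 2 Words + 1 DWord + 1 ClassID + 1 DWord = 28 bytes. */
    static final int HEADER_LENGTH = 28;

    /** Returns the section count, or -1 if the bytes do not start like a property set stream. */
    static long sectionCount(byte[] stream) {
        if (stream.length < HEADER_LENGTH) {
            return -1;
        }
        ByteBuffer buf = ByteBuffer.wrap(stream).order(ByteOrder.LITTLE_ENDIAN);
        int first = buf.getShort(0) & 0xFFFF;   // expected: 0xFFFE
        int second = buf.getShort(2) & 0xFFFF;  // expected: 0x0000
        if (first != 0xFFFE || second != 0x0000) {
            return -1;
        }
        return buf.getInt(24) & 0xFFFFFFFFL;    // DWord at offset 24
    }

    /** Returns the operating system ID: the higher word of the DWord at offset 4. */
    static int osId(byte[] stream) {
        ByteBuffer buf = ByteBuffer.wrap(stream).order(ByteOrder.LITTLE_ENDIAN);
        return (buf.getInt(4) >>> 16) & 0xFFFF;
    }
}
```

+ <p>A caller would check the result of <code>sectionCount</code> first and
+ treat <code>-1</code> as "not a property set stream".</p>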
+ </section>
+ <section><title>Section List</title>
+ <p>Following the header is the section list. This is an array of pairs,
+ each consisting of a section format ID and an offset. This array has as
+ many pairs of ClassID and DWord fields as the section count field in the
+ header says. The Summary Information stream contains a single section, the
+ Document Summary Information stream contains two.</p>
+ <table>
+ <tr>
+ <th>Type</th>
+ <th>Contents</th>
+ <th>Remarks</th>
+ </tr>
+ <tr>
+ <td>ClassID</td>
+ <td>Section format ID</td>
+ <td><code>0xF29F85E04FF91068AB9108002B27B3D9</code> for the single section
+ in the Summary Information stream.<br/><br/>
+ <code>0xD5CDD5022E9C101B939708002B2CF9AE</code> for the first
+ section in the Document Summary Information stream.</td>
+ </tr>
+ <tr>
+ <td>DWord</td>
+ <td>Offset</td>
+ <td>The number of bytes between the beginning of the stream and the
+ beginning of the section within the stream.</td>
+ </tr>
+ <tr>
+ <td>ClassID</td>
+ <td>Section format ID</td>
+ <td>...</td>
+ </tr>
+ <tr>
+ <td>DWord</td>
+ <td>Offset</td>
+ <td>...</td>
+ </tr>
+ <tr>
+ <td>...</td>
+ <td>...</td>
+ <td>...</td>
+ </tr>
+ </table>
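+ <p>As a sketch, the section offsets could be collected as shown below. The
+ class <code>SectionListSketch</code> is hypothetical and not HPSF code. It
+ assumes a 28-byte header directly followed by the section list, with each
+ entry taking 20 bytes (a 16-byte ClassID plus a DWord), and it does no
+ validation.</p>

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical helper, not part of HPSF. */
final class SectionListSketch {

    static final int HEADER_LENGTH = 28;  // the section list starts right after the header
    static final int CLASS_ID_LENGTH = 16;
    static final int ENTRY_LENGTH = CLASS_ID_LENGTH + 4;

    /** Returns the offset of each section, counted from the beginning of the stream. */
    static long[] sectionOffsets(byte[] stream) {
        ByteBuffer buf = ByteBuffer.wrap(stream).order(ByteOrder.LITTLE_ENDIAN);
        int count = buf.getInt(24);       // section count field of the header
        long[] offsets = new long[count];
        for (int i = 0; i < count; i++) {
            int entry = HEADER_LENGTH + i * ENTRY_LENGTH;
            // Skip the section format ID and read the DWord offset behind it.
            offsets[i] = buf.getInt(entry + CLASS_ID_LENGTH) & 0xFFFFFFFFL;
        }
        return offsets;
    }
}
```

+ <p>Each returned offset tells where to start reading the corresponding
+ section.</p>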
+ </section>
+ <section><title>Section</title>
+ <p>A section is divided into three parts: the section header (with the
+ section length and the number of properties in the section), the
+ properties list (with ID and offset of each property), and the
+ properties themselves. Here are the details:</p>
+ <table>
+ <tr>
+ <th>&nbsp;</th>
+ <th>Type</th>
+ <th>Contents</th>
+ <th>Remarks</th>
+ </tr>
+ <tr>
+ <td>Section header</td>
+ <td>DWord</td>
+ <td>Length</td>
+ <td>The length of the section in bytes.</td>
+ </tr>
+ <tr>
+ <td></td>
+ <td>DWord</td>
+ <td>Property count</td>
+ <td>The number of properties in the section.</td>
+ </tr>
+ <tr>
+ <td>Properties list</td>
+ <td>DWord</td>
+ <td>Property ID</td>
+ <td>The property ID tells what the property means. For example, an ID of
+ <code>0x0002</code> in the Summary Information stands for the document's
+ title. See the <link href="#property_ids">Property IDs</link>
+ chapter below for more details.</td>
+ </tr>
+ <tr>
+ <td></td>
+ <td>DWord</td>
+ <td>Offset</td>
+ <td>The number of bytes between the beginning of the section and the
+ property.</td>
+ </tr>
+ <tr>
+ <td></td>
+ <td>...</td>
+ <td>...</td>
+ <td>...</td>
+ </tr>
+ <tr>
+ <td>Properties</td>
+ <td>DWord</td>
+ <td>Property type ("variant")</td>
+ <td>This is the property's data type, e.g. an integer value, a byte
+ string or a Unicode string. See the
+ <link href="#property_types"><em>Property Types</em></link> chapter for
+ details!</td>
+ </tr>
+ <tr>
+ <td></td>
+ <td><em>Field length depends on the property type
+ ("variant")</em></td>
+ <td>Property value</td>
+ <td>This field's length depends on the property's type. These are the
+ bytes that make up the DWord, the byte string or some other data of
+ fixed or variable length.<br/><br/>
+ The property value is always stored in an area whose length is a
+ multiple of 4. If the value is shorter, e.g. a byte
+ string of 13 bytes, the remaining bytes are padded with <code>0x00</code>
+ bytes.</td>
+ </tr>
+ <tr>
+ <td></td>
+ <td>...</td>
+ <td>...</td>
+ <td>...</td>
+ </tr>
+ </table>
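+ <p>The padding rule for property values can be expressed in a single line.
+ This is a sketch with a hypothetical class name, not code taken from
+ HPSF.</p>

```java
/** Hypothetical helper, not part of HPSF. */
final class PaddingSketch {

    /** Rounds a value length up to the next multiple of 4. */
    static int paddedLength(int length) {
        return (length + 3) & ~3;
    }
}
```

+ <p>For the byte string of 13 bytes mentioned in the table the area is 16
+ bytes long, i.e. three <code>0x00</code> bytes are appended.</p>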
+ </section>
+ <section><title>Property IDs</title>
+ <anchor id="property_ids"/>
+ <p>As seen above, a section holds a property list: an array with property
+ IDs and offsets. The property ID gives each property a meaning. For
+ example, in the Summary Information stream the property ID 2 says that
+ this property is the document's title.</p>
+ <p>If you want to know a property ID's meaning, it is not sufficient to
+ know the ID itself. You must also know the
+ <strong>section format ID</strong>. For example, in the Document Summary
+ Information stream the property ID 2 means not the document's title but
+ its category. Due to Microsoft's infinite wisdom the section format ID is
+ not part of the section. Thus if you have only a section without the
+ stream it is in, you cannot make any sense of the properties because you
+ do not know what they mean.</p>
+ <p>So each section format ID has its own name space of property IDs.
+ Microsoft defined some "well-known" property IDs for the Summary
+ Information and the Document Summary Information streams. You can extend
+ them by your own additional IDs. This will be described below.</p>
+ <section><title>Property IDs in The Summary Information Stream</title>
+ <p>The Summary Information stream has a single section with a section
+ format ID of <code>0xF29F85E04FF91068AB9108002B27B3D9</code>. The following
+ table defines the meaning of its property IDs. Each row associates a
+ property ID with a <em>name</em> and an <em>ID string</em>. (The property
+ <em>type</em> is given here for informational purposes only. As we have
+ seen above, the type is always given along with the value.)</p>
+ <p>The property <em>name</em> is a readable string which could be
+ displayed to the user. However, this string is useful only for users who
+ understand English. The property name does not help with other
+ languages.</p>
+ <p>The property <em>ID string</em> is about the same but looks more
+ technical and is nothing a user should bother with. You could take the ID
+ string and map it to an appropriate display string in a particular
+ language. Of course you could do that with the property ID as well and
+ with less overhead, but people (including software developers) tend to be
+ better at remembering symbolic constants than at remembering numbers.</p>
+ <table>
+ <tr>
+ <th>Property ID</th>
+ <th>Property Name</th>
+ <th>Property ID String</th>
+ <th>Property Type</th>
+ </tr>
+ <tr>
+ <td>2</td>
+ <td>Title</td>
+ <td>PID_TITLE</td>
+ <td>VT_LPSTR</td>
+ </tr>
+ <tr>
+ <td>3</td>
+ <td>Subject</td>
+ <td>PID_SUBJECT</td>
+ <td>VT_LPSTR</td>
+ </tr>
+ <tr>
+ <td>4</td>
+ <td>Author</td>
+ <td>PID_AUTHOR</td>
+ <td>VT_LPSTR</td>
+ </tr>
+ <tr>
+ <td>5</td>
+ <td>Keywords</td>
+ <td>PID_KEYWORDS</td>
+ <td>VT_LPSTR</td>
+ </tr>
+ <tr>
+ <td>6</td>
+ <td>Comments</td>
+ <td>PID_COMMENTS</td>
+ <td>VT_LPSTR</td>
+ </tr>
+ <tr>
+ <td>7</td>
+ <td>Template</td>
+ <td>PID_TEMPLATE</td>
+ <td>VT_LPSTR</td>
+ </tr>
+ <tr>
+ <td>8</td>
+ <td>Last Saved By</td>
+ <td></td>
+ <td>VT_LPSTR</td>
+ </tr>
+ <tr>
+ <td>9</td>
+ <td>Revision Number</td>
+ <td>PID_REVNUMBER</td>
+ <td>VT_LPSTR</td>
+ </tr>
+ <tr>
+ <td>10</td>
+ <td>Total Editing Time</td>
+ <td>PID_EDITTIME</td>
+ <td>VT_FILETIME</td>
+ </tr>
+ <tr>
+ <td>11</td>
+ <td>Last Printed</td>
+ <td></td>
+ <td>VT_FILETIME</td>
+ </tr>
+ <tr>
+ <td>12</td>
+ <td>Create Time/Date</td>
+ <td>PID_CREATE_DTM</td>
+ <td>VT_FILETIME</td>
+ </tr>
+ <tr>
+ <td>13</td>
+ <td>Last Saved Time/Date</td>
+ <td></td>
+ <td>VT_FILETIME</td>
+ </tr>
+ <tr>
+ <td>14</td>
+ <td>Number of Pages</td>
+ <td>PID_PAGECOUNT</td>
+ <td>VT_I4</td>
+ </tr>
+ <tr>
+ <td>15</td>
+ <td>Number of Words</td>
+ <td>PID_WORDCOUNT</td>
+ <td>VT_I4</td>
+ </tr>
+ <tr>
+ <td>16</td>
+ <td>Number of Characters</td>
+ <td>PID_CHARCOUNT</td>
+ <td>VT_I4</td>
+ </tr>
+ <tr>
+ <td>17</td>
+ <td>Thumbnail</td>
+ <td>PID_THUMBNAIL</td>
+ <td>VT_CF</td>
+ </tr>
+ <tr>
+ <td>18</td>
+ <td>Name of Creating Application</td>
+ <td>PID_APPNAME</td>
+ <td>VT_LPSTR</td>
+ </tr>
+ <tr>
+ <td>19</td>
+ <td>Security</td>
+ <td>PID_SECURITY</td>
+ <td>VT_I4</td>
+ </tr>
+ </table>
+ </section>
+ <section><title>Property IDs in The Document Summary Information Stream</title>
+ <p>The Document Summary Information stream has two sections with a section
+ format ID of <code>0xD5CDD5022E9C101B939708002B2CF9AE</code> for the first
+ one. The following table defines the meaning of the property IDs in the
+ first section. See the preceding section for interpreting the table.</p>
+ <table>
+ <tr>
+ <th>Property ID</th>
+ <th>Property name</th>
+ <th>Property ID string</th>
+ <th>VT type</th>
+ </tr>
+ <tr>
+ <td>0</td>
+ <td>Dictionary</td>
+ <td></td>
+ <td>[Special format]</td>
+ </tr>
+ <tr>
+ <td>1</td>
+ <td>Code page</td>
+ <td>PID_CODEPAGE</td>
+ <td>VT_I2</td>
+ </tr>
+ <tr>
+ <td>2</td>
+ <td>Category</td>
+ <td>PID_CATEGORY</td>
+ <td>VT_LPSTR</td>
+ </tr>
+ <tr>
+ <td>3</td>
+ <td>PresentationTarget</td>
+ <td></td>
+ <td>VT_LPSTR</td>
+ </tr>
+ <tr>
+ <td>4</td>
+ <td>Bytes</td>
+ <td>PID_BYTECOUNT</td>
+ <td>VT_I4</td>
+ </tr>
+ <tr>
+ <td>5</td>
+ <td>Lines</td>
+ <td>PID_LINECOUNT</td>
+ <td>VT_I4</td>
+ </tr>
+ <tr>
+ <td>6</td>
+ <td>Paragraphs</td>
+ <td>PID_PARCOUNT</td>
+ <td>VT_I4</td>
+ </tr>
+ <tr>
+ <td>7</td>
+ <td>Slides</td>
+ <td></td>
+ <td>VT_I4</td>
+ </tr>
+ <tr>
+ <td>8</td>
+ <td>Notes</td>
+ <td>PID_NOTECOUNT</td>
+ <td>VT_I4</td>
+ </tr>
+ <tr>
+ <td>9</td>
+ <td>HiddenSlides</td>
+ <td></td>
+ <td>VT_I4</td>
+ </tr>
+ <tr>
+ <td>10</td>
+ <td>MMClips</td>
+ <td></td>
+ <td>VT_I4</td>
+ </tr>
+ <tr>
+ <td>11</td>
+ <td>ScaleCrop</td>
+ <td>PID_SCALE</td>
+ <td>VT_BOOL</td>
+ </tr>
+ <tr>
+ <td>12</td>
+ <td>HeadingPairs</td>
+ <td></td>
+ <td></td>
+ </tr>
+ <tr>
+ <td>13</td>
+ <td>TitlesofParts</td>
+ <td>PID_DOCPARTS</td>
+ <td>VT_LPSTR | VT_VECTOR</td>
+ </tr>
+ <tr>
+ <td>14</td>
+ <td>Manager</td>
+ <td>PID_MANAGER</td>
+ <td>VT_LPSTR</td>
+ </tr>
+ <tr>
+ <td>15</td>
+ <td>Company</td>
+ <td>PID_COMPANY</td>
+ <td>VT_LPSTR</td>
+ </tr>
+ <tr>
+ <td>16</td>
+ <td>LinksUpToDate</td>
+ <td></td>
+ <td>VT_BOOL</td>
+ </tr>
+ </table>
+ </section>
+ </section>
+ <section><title>Property Types</title>
+ <anchor id="property_types"/>
+ <p>A property consists of a DWord <em>type field</em> followed by the
+ property value. The property type is an integer value and tells how the
+ data bytes following it are to be interpreted. In the Microsoft world it is
+ also known as the <em>variant</em>.</p>
+ <p>The <em>Usage</em> column says where a variant type may occur. Only
+ the types marked with [P] are allowed in a property set. The markers
+ mean: <strong>[V]</strong> - may appear in a VARIANT, <strong>[T]</strong> - may
+ appear in a TYPEDESC, <strong>[P]</strong> - may appear in an OLE property
+ set, <strong>[S]</strong> - may appear in a Safe Array.</p>
+ <table>
+ <tr>
+ <th>Variant ID</th>
+ <th>Variant Type</th>
+ <th>Usage</th>
+ <th>Description</th>
+ </tr>
+ <tr>
+ <td>0</td>
+ <td>VT_EMPTY</td>
+ <td>[V] [P]</td>
+ <td>nothing</td>
+ </tr>
+ <tr>
+ <td>1</td>
+ <td>VT_NULL</td>
+ <td>[V] [P]</td>
+ <td>SQL style Null</td>
+ </tr>
+ <tr>
+ <td>2</td>
+ <td>VT_I2</td>
+ <td>[V] [T] [P] [S]</td>
+ <td>2 byte signed int</td>
+ </tr>
+ <tr>
+ <td>3</td>
+ <td>VT_I4</td>
+ <td>[V] [T] [P] [S]</td>
+ <td>4 byte signed int</td>
+ </tr>
+ <tr>
+ <td>4</td>
+ <td>VT_R4</td>
+ <td>[V] [T] [P] [S]</td>
+ <td>4 byte real</td>
+ </tr>
+ <tr>
+ <td>5</td>
+ <td>VT_R8</td>
+ <td>[V] [T] [P] [S]</td>
+ <td>8 byte real</td>
+ </tr>
+ <tr>
+ <td>6</td>
+ <td>VT_CY</td>
+ <td>[V] [T] [P] [S]</td>
+ <td>currency</td>
+ </tr>
+ <tr>
+ <td>7</td>
+ <td>VT_DATE</td>
+ <td>[V] [T] [P] [S]</td>
+ <td>date</td>
+ </tr>
+ <tr>
+ <td>8</td>
+ <td>VT_BSTR</td>
+ <td>[V] [T] [P] [S]</td>
+ <td>OLE Automation string</td>
+ </tr>
+ <tr>
+ <td>9</td>
+ <td>VT_DISPATCH</td>
+ <td>[V] [T] [P] [S]</td>
+ <td>IDispatch *</td>
+ </tr>
+ <tr>
+ <td>10</td>
+ <td>VT_ERROR</td>
+ <td>[V] [T] [S]</td>
+ <td>SCODE</td>
+ </tr>
+ <tr>
+ <td>11</td>
+ <td>VT_BOOL</td>
+ <td>[V] [T] [P] [S]</td>
+ <td>True=-1, False=0</td>
+ </tr>
+ <tr>
+ <td>12</td>
+ <td>VT_VARIANT</td>
+ <td>[V] [T] [P] [S]</td>
+ <td>VARIANT *</td>
+ </tr>
+ <tr>
+ <td>13</td>
+ <td>VT_UNKNOWN</td>
+ <td>[V] [T] [S]</td>
+ <td>IUnknown *</td>
+ </tr>
+ <tr>
+ <td>14</td>
+ <td>VT_DECIMAL</td>
+ <td>[V] [T] [S]</td>
+ <td>16 byte fixed point</td>
+ </tr>
+ <tr>
+ <td>16</td>
+ <td>VT_I1</td>
+ <td>[T]</td>
+ <td>signed char</td>
+ </tr>
+ <tr>
+ <td>17</td>
+ <td>VT_UI1</td>
+ <td>[V] [T] [P] [S]</td>
+ <td>unsigned char</td>
+ </tr>
+ <tr>
+ <td>18</td>
+ <td>VT_UI2</td>
+ <td>[T] [P]</td>
+ <td>unsigned short</td>
+ </tr>
+ <tr>
+ <td>19</td>
+ <td>VT_UI4</td>
+ <td>[T] [P]</td>
+ <td>4 byte unsigned int</td>
+ </tr>
+ <tr>
+ <td>20</td>
+ <td>VT_I8</td>
+ <td>[T] [P]</td>
+ <td>signed 64-bit int</td>
+ </tr>
+ <tr>
+ <td>21</td>
+ <td>VT_UI8</td>
+ <td>[T] [P]</td>
+ <td>unsigned 64-bit int</td>
+ </tr>
+ <tr>
+ <td>22</td>
+ <td>VT_INT</td>
+ <td>[T]</td>
+ <td>signed machine int</td>
+ </tr>
+ <tr>
+ <td>23</td>
+ <td>VT_UINT</td>
+ <td>[T]</td>
+ <td>unsigned machine int</td>
+ </tr>
+ <tr>
+ <td>24</td>
+ <td>VT_VOID</td>
+ <td>[T]</td>
+ <td>C style void</td>
+ </tr>
+ <tr>
+ <td>25</td>
+ <td>VT_HRESULT</td>
+ <td>[T]</td>
+ <td>Standard return type</td>
+ </tr>
+ <tr>
+ <td>26</td>
+ <td>VT_PTR</td>
+ <td>[T]</td>
+ <td>pointer type</td>
+ </tr>
+ <tr>
+ <td>27</td>
+ <td>VT_SAFEARRAY</td>
+ <td>[T]</td>
+ <td>(use VT_ARRAY in VARIANT)</td>
+ </tr>
+ <tr>
+ <td>28</td>
+ <td>VT_CARRAY</td>
+ <td>[T]</td>
+ <td>C style array</td>
+ </tr>
+ <tr>
+ <td>29</td>
+ <td></td>
+ <td>[T]</td>
+ <td>user defined type</td>
+ </tr>
+ <tr>
+ <td>30</td>
+ <td>VT_LPSTR</td>
+ <td>[T] [P]</td>
+ <td>null terminated string</td>
+ </tr>
+ <tr>
+ <td>31</td>
+ <td>VT_LPWSTR</td>
+ <td>[T] [P]</td>
+ <td>wide null terminated string</td>
+ </tr>
+ <tr>
+ <td>64</td>
+ <td>VT_FILETIME</td>
+ <td>[P]</td>
+ <td>FILETIME</td>
+ </tr>
+ <tr>
+ <td>65</td>
+ <td>VT_BLOB</td>
+ <td>[P]</td>
+ <td>Length prefixed bytes</td>
+ </tr>
+ <tr>
+ <td>66</td>
+ <td>VT_STREAM</td>
+ <td>[P]</td>
+ <td>Name of the stream follows</td>
+ </tr>
+ <tr>
+ <td>67</td>
+ <td>VT_STORAGE</td>
+ <td>[P]</td>
+ <td>Name of the storage follows</td>
+ </tr>
+ <tr>
+ <td>68</td>
+ <td></td>
+ <td>[P]</td>
+ <td>Stream contains an object</td>
+ </tr>
+ <tr>
+ <td>69</td>
+ <td></td>
+ <td>[P]</td>
+ <td>Storage contains an object</td>
+ </tr>
+ <tr>
+ <td>70</td>
+ <td>VT_BLOB_OBJECT</td>
+ <td>[P]</td>
+ <td>Blob contains an object</td>
+ </tr>
+ <tr>
+ <td>71</td>
+ <td>VT_CF</td>
+ <td>[P]</td>
+ <td>Clipboard format</td>
+ </tr>
+ <tr>
+ <td>72</td>
+ <td>VT_CLSID</td>
+ <td>[P]</td>
+ <td>A Class ID</td>
+ </tr>
+ <tr>
+ <td>0x1000</td>
+ <td>VT_VECTOR</td>
+ <td>[P]</td>
+ <td>simple counted array</td>
+ </tr>
+ <tr>
+ <td>0x2000</td>
+ <td>VT_ARRAY</td>
+ <td>[V]</td>
+ <td>SAFEARRAY*</td>
+ </tr>
+ <tr>
+ <td>0x4000</td>
+ <td>VT_BYREF</td>
+ <td>[V]</td>
+ <td>void* for local use</td>
+ </tr>
+ <tr>
+ <td>0x8000</td>
+ <td>VT_RESERVED</td>
+ <td><br/></td>
+ <td><br/></td>
+ </tr>
+ <tr>
+ <td>0xFFFF</td>
+ <td>VT_ILLEGAL</td>
+ <td><br/></td>
+ <td><br/></td>
+ </tr>
+ <tr>
+ <td>0xFFF</td>
+ <td></td>
+ <td><br/></td>
+ <td><br/></td>
+ </tr>
+ <tr>
+ <td>0xFFF</td>
+ <td>VT_TYPEMASK</td>
+ <td><br/></td>
+ <td><br/></td>
+ </tr>
+ </table>
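+ <p>Some entries in this table, like <code>VT_VECTOR</code>, are flags which
+ are combined with a base type, e.g. <code>VT_LPSTR | VT_VECTOR</code>. The
+ following sketch shows how such a combined type field could be taken
+ apart. It assumes that <code>VT_TYPEMASK</code> selects the base type; the
+ class <code>VariantTypeSketch</code> is hypothetical and not part of
+ HPSF.</p>

```java
/** Hypothetical helper, not part of HPSF; the constants are taken from the table above. */
final class VariantTypeSketch {

    static final int VT_LPSTR = 30;
    static final int VT_VECTOR = 0x1000;
    static final int VT_TYPEMASK = 0xFFF;

    /** Strips the flag bits and returns the base variant type. */
    static int baseType(long typeField) {
        return (int) (typeField & VT_TYPEMASK);
    }

    /** Tells whether the value is a counted array of the base type. */
    static boolean isVector(long typeField) {
        return (typeField & VT_VECTOR) != 0;
    }
}
```

+ <p>A type field of <code>0x101E</code> thus denotes a vector of
+ <code>VT_LPSTR</code> values.</p>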
+ </section>
+ <section><title>References</title>
+ <p>In order to assemble the HPSF description I used information publicly
+ available on the Internet only. The references given below have been very
+ helpful. If you have any amendments or corrections, please let us know!
+ Thank you!</p>
+ <ol>
+ <li>In
+ <link href=""><em>Understanding OLE
+ documents</em></link>, Ken Kyler gives an introduction to OLE2
+ documents
+ and especially to property sets. He names the property names, types, and
+ IDs of the Summary Information and Document Summary Information
+ stream.</li>
+ <li>The
+ <link href=""><em>ActiveX Programmer's
+ Reference</em></link> at
+ <link href=""></link>
+ seems a little outdated, but that's what I have found.</li>
+ <li>An overview of the <code>VT_</code> types is in
+ <link href=""><em>Variant
+ Type Definitions</em></link>.</li>
+ <li>What is a <code>FILETIME</code>? The answer can be found
+ under <link
+ href=""></link>, <link href=""></link> or
+ <link href=""></link>.
+ In short: <em>The FILETIME structure holds a date and time associated
+ with a file. The structure identifies a 64-bit integer specifying the
+ number of 100-nanosecond intervals which have passed since January 1,
+ 1601. This 64-bit value is split into the two dwords stored in the
+ structure.</em></li>
+ <li>Information about the code page property in the
+ DocumentSummaryInformation stream is available at <link
+ href=""></link>.</li>
+ <li>This documentation originates from the <link href="">HPSF description</link> available at <link href=""></link>.</li>
+ </ol>
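+ <p>Based on the <code>FILETIME</code> definition quoted in the list above,
+ a conversion to <code>java.util.Date</code> could be sketched as follows.
+ The class <code>FiletimeSketch</code> is hypothetical and not HPSF code.
+ The sketch assumes that both DWords have already been little endian
+ decoded and that the <code>FILETIME</code> is in UTC.</p>

```java
import java.util.Date;

/** Hypothetical helper, not part of HPSF. */
final class FiletimeSketch {

    /** Number of 100-nanosecond intervals between January 1, 1601 and January 1, 1970. */
    static final long EPOCH_DIFF = 116444736000000000L;

    /** Combines the low and high DWord of a FILETIME and converts the result to a Date. */
    static Date toDate(int lowDWord, int highDWord) {
        long filetime = ((long) highDWord << 32) | (lowDWord & 0xFFFFFFFFL);
        // 10,000 intervals of 100 nanoseconds make up one millisecond.
        long millis = Math.floorDiv(filetime - EPOCH_DIFF, 10_000L);
        return new Date(millis);
    }
}
```

+ <p>The precision below one millisecond is lost in the conversion because
+ <code>java.util.Date</code> cannot represent it.</p>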
+ </section>
+ </section>
+ </body>
diff --git a/src/documentation/content/xdocs/hpsf/thumbnails.xml b/src/documentation/content/xdocs/hpsf/thumbnails.xml
+ <header>
+ <title>HPSF THUMBNAIL HOW-TO</title>
+ <authors>
+ <person name="Drew Varner" email="" />
+ </authors>
+ </header>
+ <body>
+ <section><title>The VT_CF Format</title>
+ <p>Thumbnail information is stored as a VT_CF, or Thumbnail Variant. The
+ Thumbnail Variant is used to store various types of information in a
+ clipboard. The VT_CF can store information in formats for the Macintosh or
+ Windows clipboard.</p>
+ <p>There are many types of data that can be copied to the clipboard, but the
+ only types of information needed for thumbnail manipulation are the image
+ formats.</p>
+ <p>The <code>VT_CF</code> structure looks like this:</p>
+ <table>
+ <tr>
+ <th>Element:</th>
+ <td>Clipboard Size</td>
+ <td>Clipboard Format Tag</td>
+ <td>Clipboard Data</td>
+ </tr>
+ <tr>
+ <th>Size:</th>
+ <td>32 bit unsigned integer (DWord)</td>
+ <td>32 bit signed integer (DWord)</td>
+ <td>variable length (byte array)</td>
+ </tr>
+ </table>
+ <p>The Clipboard Size refers to the size (in bytes) of Clipboard Data
+ (variable size) plus the Clipboard Format Tag (four bytes).</p>
+ <p>Clipboard Format Tag has four possible values:</p>
+ <table>
+ <tr>
+ <th>Value</th>
+ <th>Identifier</th>
+ <th>Description</th>
+ </tr>
+ <tr>
+ <td><code>-1L</code></td>
+ <td><code>CFTAG_WINDOWS</code></td>
+ <td>a built-in Windows&copy; clipboard format value</td>
+ </tr>
+ <tr>
+ <td><code>-2L</code></td>
+ <td><code>CFTAG_MACINTOSH</code></td>
+ <td>a Macintosh clipboard format value</td>
+ </tr>
+ <tr>
+ <td><code>-3L</code></td>
+ <td><code>CFTAG_FMTID</code></td>
+ <td>a format identifier (FMTID). This is rarely used.</td>
+ </tr>
+ <tr>
+ <td><code>0L</code></td>
+ <td><code>CFTAG_NODATA</code></td>
+ <td>No data. This is rarely used.</td>
+ </tr>
+ </table>
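+ <p>The first two fields of the <code>VT_CF</code> structure could be read
+ as in the sketch below. It is only an illustration: the class
+ <code>ClipboardHeaderSketch</code> is hypothetical, and the sketch assumes
+ that the byte array starts with the Clipboard Size field and that both
+ fields are stored in little endian format.</p>

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical helper, not part of HPSF; the tag values are taken from the table above. */
final class ClipboardHeaderSketch {

    static final int CFTAG_WINDOWS = -1;
    static final int CFTAG_MACINTOSH = -2;
    static final int CFTAG_FMTID = -3;
    static final int CFTAG_NODATA = 0;

    /** Returns the length of Clipboard Data: Clipboard Size minus the four bytes of the tag. */
    static long dataLength(byte[] vtCf) {
        long clipboardSize = littleEndian(vtCf).getInt(0) & 0xFFFFFFFFL;
        return clipboardSize - 4;
    }

    /** Returns the signed Clipboard Format Tag stored after the Clipboard Size. */
    static int formatTag(byte[] vtCf) {
        return littleEndian(vtCf).getInt(4);
    }

    private static ByteBuffer littleEndian(byte[] bytes) {
        return ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN);
    }
}
```

+ <p>An application would compare the tag with <code>CFTAG_WINDOWS</code> or
+ <code>CFTAG_MACINTOSH</code> before it interprets the Clipboard Data.</p>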
+ </section>
+ <section><title>Windows Clipboard Data</title>
+ <p>Windows clipboard data has four image formats for thumbnails:</p>
+ <table>
+ <tr>
+ <th>Value</th>
+ <th>Identifier</th>
+ <th>Description</th>
+ </tr>
+ <tr>
+ <td>3</td>
+ <td><code>CF_METAFILEPICT</code></td>
+ <td>Windows metafile format - recommended</td>
+ </tr>
+ <tr>
+ <td>8</td>
+ <td><code>CF_DIB</code></td>
+ <td>Device Independent Bitmap</td>
+ </tr>
+ <tr>
+ <td>14</td>
+ <td><code>CF_ENHMETAFILE</code></td>
+ <td>Enhanced Windows metafile format</td>
+ </tr>
+ <tr>
+ <td>2</td>
+ <td><code>CF_BITMAP</code></td>
+ <td>Bitmap - Obsolete - Use <code>CF_DIB</code> instead</td>
+ </tr>
+ </table>
+ </section>
+ <section><title>Windows Metafile Format</title>
+ <p>The most common format for thumbnails on the Windows platform is the
+ Windows metafile format. The clipboard places an extra header in front of
+ the standard Windows Metafile Format data.</p>
+ <p>The Clipboard Data byte array looks like this when an image is stored in
+ Windows' Clipboard WMF format.</p>
+ <table>
+ <tr>
+ <th>Identifier</th>
+ <td></td>
+ <td>mm</td>
+ <td>width</td>
+ <td>height</td>
+ <td>handle</td>
+ <td>WMF data</td>
+ </tr>
+ <tr>
+ <th>Size</th>
+ <td>32 bit unsigned int</td>
+ <td>16 bit unsigned(?) int</td>
+ <td>16 bit unsigned(?) int</td>
+ <td>16 bit unsigned(?) int</td>
+ <td>16 bit unsigned(?) int</td>
+ <td>byte array - variable length</td>
+ </tr>
+ <tr>
+ <th>Description</th>
+ <td>Clipboard WMF</td>
+ <td>Mapping Mode</td>
+ <td>Image Width</td>
+ <td>Image Height</td>
+ <td>handle to the WMF data array in memory, or 0</td>
+ <td>standard WMF byte stream</td>
+ </tr>
+ </table>
+ </section>
+ <section><title>Device Independent Bitmap</title>
+ <p><strong>FIXME:</strong> Describe the Device Independent Bitmap
+ format!</p>
+ </section>
+ <section><title>Macintosh Clipboard Data</title>
+ <p><strong>FIXME:</strong> Describe the Macintosh clipboard formats!</p>
+ </section>
+ </body>
new file mode 100644
+ <header>
+ <title>To Do</title>
+ <authors>
+ <person name="Rainer Klute" email=""/>
+ </authors>
+ </header>
+ <body>
+ <section><title>To Do</title>
+ <p>The following functionality should be added to HPSF:</p>
+ <ol>
+ <li>
+ Add writing capability for property sets. Presently property sets can
+ be read only.
+ </li>
+ <li>
+ Add codepage support: Presently the bytes making up the string in a
+ property's value are interpreted using the platform's default character
+ set.
+ </li>
+ <li>
+ Add resource bundles to
+ <code>org.apache.poi.hpsf.wellknown</code> to ease
+ localizations. This would be useful for mapping standard property IDs to
+ localized strings. Example: The property ID 4 could be mapped to "Author"
+ in English or "Verfasser" in German.
+ </li>
+ <li>
+ Implement reading functionality for those property types that are not
+ yet supported. HPSF should return proper Java types instead of just byte
+ arrays.
+ </li>
+ <li>
+ Add WMF to <code>java.awt.Image</code> example code to the <link
+ href="thumbnails.html">Thumbnail
+ HOW-TO</link>.
+ </li>
+ </ol>
+ </section>
+ </body>
