Diffstat (limited to 'src/documentation/content/xdocs/hpsf')
-rw-r--r-- | src/documentation/content/xdocs/hpsf/book.xml | 21 | ||||
-rw-r--r-- | src/documentation/content/xdocs/hpsf/how-to.xml | 868 | ||||
-rw-r--r-- | src/documentation/content/xdocs/hpsf/index.xml | 54 | ||||
-rw-r--r-- | src/documentation/content/xdocs/hpsf/internals.xml | 1010 | ||||
-rw-r--r-- | src/documentation/content/xdocs/hpsf/thumbnails.xml | 182 | ||||
-rw-r--r-- | src/documentation/content/xdocs/hpsf/todo.xml | 65 |
6 files changed, 2200 insertions, 0 deletions
diff --git a/src/documentation/content/xdocs/hpsf/book.xml b/src/documentation/content/xdocs/hpsf/book.xml new file mode 100644 index 0000000000..529baed75a --- /dev/null +++ b/src/documentation/content/xdocs/hpsf/book.xml @@ -0,0 +1,21 @@ +<?xml version="1.0"?> +<!DOCTYPE book PUBLIC "-//APACHE//DTD Cocoon Documentation Book V1.0//EN" "../dtd/book-cocoon-v10.dtd"> +<!-- $Id$ --> +<book software="POI Project" + title="HPSF" + copyright="@year@ POI Project"> + + <menu label="Navigation"> + <menu-item label="Main" href="../index.html"/> + </menu> + <menu label="HPSF"> + <menu-item label="Overview" href="index.html"/> + <menu-item label="How To" href="how-to.html"/> + <menu-item label="Thumbnails" href="thumbnails.html"/> + <menu-item label="Internals" href="internals.html"/> + <menu-item label="To Do" href="todo.html"/> + </menu> + +</book> + + diff --git a/src/documentation/content/xdocs/hpsf/how-to.xml b/src/documentation/content/xdocs/hpsf/how-to.xml new file mode 100644 index 0000000000..57f880700e --- /dev/null +++ b/src/documentation/content/xdocs/hpsf/how-to.xml @@ -0,0 +1,868 @@ +<?xml version="1.0" encoding="iso-8859-1"?> +<!DOCTYPE document PUBLIC "-//APACHE//DTD Documentation V1.1//EN" +"../dtd/document-v11.dtd"> +<!-- $Id$ --> + +<document> + <header> + <title>HPSF HOW-TO</title> + <authors> + <person name="Rainer Klute" email="klute@apache.org"/> + </authors> + </header> + <body> + <section><title>How To Use the HPSF APIs</title> + + <p>This HOW-TO is organized in three sections. You should read them + sequentially because the later sections build upon the earlier ones.</p> + + <ol> + <li> + The <link href="#sec1">first section</link> explains how to read + the most important standard properties of a Microsoft Office + document. Standard properties are things like title, author, creation + date etc. It is quite likely that you will find here what you need and + don't have to read the other sections. 
+ </li> + + <li> + The <link href="#sec2">second section</link> goes a small step + further and focusses on reading additional standard properties. It also + talks about exceptions that may be thrown when dealing with HPSF and + shows how you can read properties of embedded objects. + </li> + + <li> + The <link href="#sec3">third section</link> explains how to read + non-standard properties. Non-standard properties are application-specific + triples consisting of an ID, a type, and a value. + </li> + </ol> + + + + <anchor id="sec1"/> + <section><title>Reading Standard Properties</title> + + <note>This section explains how to read + the most important standard properties of a Microsoft Office + document. Standard properties are things like title, author, creation + date etc. Chances are that you will find here what you need and + don't have to read the other sections.</note> + + <p>The first thing you should understand is that properties are stored in + separate documents inside the POI filesystem. (If you don't know what a + POI filesystem is, read the <link href="../poifs/index.html">POIFS + documentation</link>.) A document in a POI filesystem is also called a + <strong>stream</strong>.</p> + + <p>The following example shows how to read a POI filesystem's + "title" property. Reading other properties is similar. Consult the API + documentation of <code>org.apache.poi.hpsf.SummaryInformation</code> to + learn which methods are available.</p> + + <p>The standard properties this section focusses on can be found in a + document called <em>\005SummaryInformation</em> located in the root of the + POI filesystem. The notation <em>\005</em> in the document's name means + the character with the decimal value of 5. In order to read the title, an + application has to perform the following steps:</p> + + <ol> + <li> + Open the document <em>\005SummaryInformation</em> located in the root + of the POI filesystem. 
+ </li> + <li> + Create an instance of the class <code>SummaryInformation</code> from + that document. + </li> + <li> + Call the <code>SummaryInformation</code> instance's + <code>getTitle()</code> method. + </li> + </ol> + + <p>Sounds easy, doesn't it? Here are the steps in detail.</p> + + + <section><title>Open the document \005SummaryInformation in the root of the + POI filesystem</title> + + <p>An application that wants to open a document in a POI filesystem + (POIFS) proceeds as shown by the following code fragment. (The full + source code of the sample application is available in the + <em>examples</em> section of the POI source tree as + <em>ReadTitle.java</em>.)</p> + + <source> +import java.io.*; +import org.apache.poi.hpsf.*; +import org.apache.poi.poifs.eventfilesystem.*; + +// ... + +public static void main(String[] args) + throws IOException +{ + final String filename = args[0]; + POIFSReader r = new POIFSReader(); + r.registerListener(new MyPOIFSReaderListener(), + "\005SummaryInformation"); + r.read(new FileInputStream(filename)); +}</source> + + <p>The first interesting statement is</p> + + <source>POIFSReader r = new POIFSReader();</source> + + <p>It creates a + <code>org.apache.poi.poifs.eventfilesystem.POIFSReader</code> instance + which we shall need to read the POI filesystem. Before the application + actually opens the POI filesystem we have to tell the + <code>POIFSReader</code> which documents we are interested in. In this + case the application should do something with the document + <em>\005SummaryInformation</em>.</p> + + <source> +r.registerListener(new MyPOIFSReaderListener(), + "\005SummaryInformation");</source> + + <p>This method call registers a + <code>org.apache.poi.poifs.eventfilesystem.POIFSReaderListener</code> + with the <code>POIFSReader</code>. The <code>POIFSReaderListener</code> + interface specifies the method <code>processPOIFSReaderEvent</code> + which processes a document. 
The class + <code>MyPOIFSReaderListener</code> implements the + <code>POIFSReaderListener</code> and thus the + <code>processPOIFSReaderEvent</code> method. The eventing POI filesystem + calls this method when it finds the <em>\005SummaryInformation</em> + document. In the sample application <code>MyPOIFSReaderListener</code> is + a static class in the <em>ReadTitle.java</em> source file.</p> + + <p>Now everything is prepared and reading the POI filesystem can + start:</p> + + <source>r.read(new FileInputStream(filename));</source> + + <p>The following source code fragment shows the + <code>MyPOIFSReaderListener</code> class and how it retrieves the + title.</p> + + <source> +static class MyPOIFSReaderListener implements POIFSReaderListener +{ + public void processPOIFSReaderEvent(POIFSReaderEvent event) + { + SummaryInformation si = null; + try + { + si = (SummaryInformation) + PropertySetFactory.create(event.getStream()); + } + catch (Exception ex) + { + throw new RuntimeException + ("Property set stream \"" + + event.getPath() + event.getName() + "\": " + ex); + } + final String title = si.getTitle(); + if (title != null) + System.out.println("Title: \"" + title + "\""); + else + System.out.println("Document has no title."); + } +} +</source> + + <p>The line</p> + + <source>SummaryInformation si = null;</source> + + <p>declares a <code>SummaryInformation</code> variable and initializes it + with <code>null</code>. We need an instance of this class to access the + title. The instance is created in a <code>try</code> block:</p> + + <source>si = (SummaryInformation) + PropertySetFactory.create(event.getStream());</source> + + <p>The expression <code>event.getStream()</code> returns the input stream + containing the bytes of the property set stream named + <em>\005SummaryInformation</em>. 
This stream is passed into the + <code>create</code> method of the factory class + <code>org.apache.poi.hpsf.PropertySetFactory</code> which returns + a <code>org.apache.poi.hpsf.PropertySet</code> instance. It is more or + less safe to cast this result to <code>SummaryInformation</code>, a + convenience class with methods like <code>getTitle()</code>, + <code>getAuthor()</code> etc.</p> + + <p>The <code>PropertySetFactory.create</code> method may throw all sorts + of exceptions. We'll deal with them in the next sections. For now we just + catch all exceptions and throw a <code>RuntimeException</code> + containing the message text of the original exception.</p> + + <p>If all goes well, the sample application retrieves the title and prints + it to the standard output. As you can see you must be prepared for the + case that the POI filesystem does not have a title.</p> + + <source>final String title = si.getTitle(); +if (title != null) + System.out.println("Title: \"" + title + "\""); +else + System.out.println("Document has no title.");</source> + + <p>Please note that a Microsoft Office document does not necessarily + contain the <em>\005SummaryInformation</em> stream. The documents created + by the Microsoft Office suite have one, as far as I know. However, an + Excel spreadsheet exported from StarOffice 5.2 won't have a + <em>\005SummaryInformation</em> stream. In this case the application + won't throw an exception; the + <code>processPOIFSReaderEvent</code> method is simply never called. You have been warned!</p> + </section> + </section> + + <anchor id="sec2"/> + <section><title>Additional Standard Properties, Exceptions And Embedded Objects</title> + + <note>This section focusses on reading additional standard properties. 
It + also talks about exceptions that may be thrown when dealing with HPSF and + shows how you can read properties of embedded objects.</note> + + <p>A couple of <strong>additional standard properties</strong> are not + contained in the <em>\005SummaryInformation</em> stream explained above, + for example a document's category or the number of multimedia clips in a + PowerPoint presentation. Microsoft has invented an additional stream named + <em>\005DocumentSummaryInformation</em> to hold these properties. With two + minor exceptions you can proceed exactly as described above to read the + properties stored in <em>\005DocumentSummaryInformation</em>:</p> + + <ul> + <li>Instead of <em>\005SummaryInformation</em> use + <em>\005DocumentSummaryInformation</em> as the stream's name.</li> + <li>Replace all occurrences of the class + <code>SummaryInformation</code> by + <code>DocumentSummaryInformation</code>.</li> + </ul> + + <p>And of course you cannot call <code>getTitle()</code> because + <code>DocumentSummaryInformation</code> has different query methods. See + the Javadoc API documentation for the details!</p> + + <p>In the previous section the application simply caught all + <strong>exceptions</strong> and was in no way interested in any + details. However, a real application will likely want to know what went + wrong and act appropriately. Besides any IO exceptions there are three + exceptions specific to HPSF or POI that you should know about:</p> + + <dl> + <dt><code>NoPropertySetStreamException</code></dt> + <dd> + This exception is thrown if the application tries to create a + <code>PropertySet</code> instance from a stream that is not a + property set stream. (<code>SummaryInformation</code> and + <code>DocumentSummaryInformation</code> are subclasses of + <code>PropertySet</code>.) A faulty property set stream counts as not + being a property set stream at all. 
An application should be prepared to + deal with this case even if it opens streams named + <em>\005SummaryInformation</em> or + <em>\005DocumentSummaryInformation</em> only. These are just names. A + stream's name by itself does not ensure that the stream contains the + expected contents and that these contents are correct. + </dd> + + <dt><code>UnexpectedPropertySetTypeException</code></dt> + <dd>This exception is thrown if a certain type of property set is + expected somewhere (e.g. a <code>SummaryInformation</code> or + <code>DocumentSummaryInformation</code>) but the provided property + set is not of that type.</dd> + + <dt><code>MarkUnsupportedException</code></dt> + <dd>This exception is thrown if an input stream that is to be parsed + into a property set does not support the + <code>InputStream.mark(int)</code> operation. The POI filesystem uses + the <code>DocumentInputStream</code> class which does support this + operation, so you are safe here. However, if you read a property set + stream from another kind of input stream things may be + different.</dd> + </dl> + + <p>Many Microsoft Office documents contain <strong>embedded + objects</strong>, for example an Excel sheet on a page in a Word + document. Embedded objects may have property sets of their own. An + application can open these property set streams as described above. The + only difference is that they are not located in the POI filesystem's root + but in a <strong>nested directory</strong> instead. Just register a + <code>POIFSReaderListener</code> for the property set streams you are + interested in. For example, the <em>POIBrowser</em> application in the + contrib section tries to open each and every document in a POI filesystem + as a property set stream. If this operation is successful it displays the + properties.</p> + </section> + + <anchor id="sec3"/> + <section><title>Reading Non-Standard Properties</title> + + <note>This section tells how to read non-standard properties. 
Non-standard + properties are application-specific ID/type/value triples.</note> + + <section><title>Overview</title> + <p>Now comes the real hardcore stuff. As mentioned above, + <code>SummaryInformation</code> and + <code>DocumentSummaryInformation</code> are just special cases of the + general concept of a property set. This concept says that a + <strong>property set</strong> consists of properties and that each + <strong>property</strong> is an entity with an <strong>ID</strong>, a + <strong>type</strong>, and a <strong>value</strong>.</p> + + <p>Okay, that was still rather easy. However, to make things more + complicated, Microsoft in its infinite wisdom decided that a property set + shall be broken into one or more <strong>sections</strong>. Each section + holds a bunch of properties. But since that's still not complicated + enough, a section may have an optional <strong>dictionary</strong> that + maps property IDs to <strong>property names</strong> - we'll explain + later what that means.</p> + + <p>The procedure to get to the properties is the following:</p> + + <ol> + <li>Use the <strong><code>PropertySetFactory</code></strong> class to + create a <code>PropertySet</code> object from a property set stream. If + you don't know whether an input stream is a property set stream, just + try to call <code>PropertySetFactory.create(java.io.InputStream)</code>: + You'll either get a <code>PropertySet</code> instance returned or an + exception is thrown.</li> + + <li>Call the <code>PropertySet</code>'s method <code>getSections()</code> + to get the sections contained in the property set. Each section is + an instance of the <code>Section</code> class.</li> + + <li>Each section has a format ID. The format ID of the first section in a + property set determines the property set's type. For example, the first + (and only) section of the SummaryInformation property set has a format + ID of <code>F29F85E0-4FF9-1068-AB-91-08-00-2B-27-B3-D9</code>. 
You can + get the format ID with <code>Section.getFormatID()</code>.</li> + + <li>The properties contained in a <code>Section</code> can be retrieved + with <code>Section.getProperties()</code>. The result is an array of + <code>Property</code> instances.</li> + + <li>A property has an ID, a type, and a value. The <code>Property</code> + class has methods to retrieve them.</li> + </ol> + </section> + + <section><title>A Sample Application</title> + <p>Let's have a look at a sample Java application that dumps all property + set streams contained in a POI file system. The full source code of this + program can be found as <em>ReadCustomPropertySets.java</em> in the + <em>examples</em> area of the POI source code tree. Here are the key + sections:</p> + + <source>import java.io.*; +import java.util.*; +import org.apache.poi.hpsf.*; +import org.apache.poi.poifs.eventfilesystem.*; +import org.apache.poi.util.HexDump;</source> + + <p>The most important package the application needs is + <code>org.apache.poi.hpsf.*</code>. This package contains the HPSF + classes. Most classes named below are from the HPSF package. Of course we + also need the POIFS event file system's classes and <code>java.io.*</code> + since we are dealing with POI I/O. From the <code>java.util</code> package + we use the <code>List</code> and <code>Iterator</code> interfaces. The class + <code>org.apache.poi.util.HexDump</code> provides methods to dump byte + arrays as nicely formatted strings.</p> + + <source>public static void main(String[] args) + throws IOException +{ + final String filename = args[0]; + POIFSReader r = new POIFSReader(); + + /* Register a listener for *all* documents. 
*/ + r.registerListener(new MyPOIFSReaderListener()); + r.read(new FileInputStream(filename)); +}</source> + + <p>The <code>POIFSReader</code> is set up in a way that the listener + <code>MyPOIFSReaderListener</code> is called on every file in the POI file + system.</p> + </section> + + <section><title>The Property Set</title> + <p>The listener class tries to create a <code>PropertySet</code> from each + stream using the <code>PropertySetFactory.create()</code> method:</p> + + <source>static class MyPOIFSReaderListener implements POIFSReaderListener +{ + public void processPOIFSReaderEvent(POIFSReaderEvent event) + { + PropertySet ps = null; + try + { + ps = PropertySetFactory.create(event.getStream()); + } + catch (NoPropertySetStreamException ex) + { + out("No property set stream: \"" + event.getPath() + + event.getName() + "\""); + return; + } + catch (Exception ex) + { + throw new RuntimeException + ("Property set stream \"" + + event.getPath() + event.getName() + "\": " + ex); + } + + /* Print the name of the property set stream: */ + out("Property set stream \"" + event.getPath() + + event.getName() + "\":");</source> + + <p>Creating the <code>PropertySet</code> is done in a <code>try</code> + block, because not each stream in the POI file system contains a property + set. If it is some other file, the + <code>PropertySetFactory.create()</code> throws a + <code>NoPropertySetStreamException</code>, which is caught and + logged. Then the program continues with the next stream. However, all + other types of exceptions cause the program to terminate by throwing a + runtime exception. If all went well, we can print the name of the property + set stream.</p> + </section> + + <section><title>The Sections</title> + <p>The next step is to print the number of sections followed by the + sections themselves:</p> + + <source>/* Print the number of sections: */ +final long sectionCount = ps.getSectionCount(); +out(" No. 
of sections: " + sectionCount); + +/* Print the list of sections: */ +List sections = ps.getSections(); +int nr = 0; +for (Iterator i = sections.iterator(); i.hasNext();) +{ + /* Print a single section: */ + Section sec = (Section) i.next(); + + // See below for the complete loop body. +}</source> + + <p>The <code>PropertySet</code>'s method <code>getSectionCount()</code> + returns the number of sections.</p> + + <p>To retrieve the sections, use the <code>getSections()</code> + method. This method returns a <code>java.util.List</code> containing + instances of the <code>Section</code> class in their proper order.</p> + + <p>The sample code shows a loop that retrieves the <code>Section</code> + objects one by one and prints some information about each one. Here is + the complete body of the loop:</p> + + <source>/* Print a single section: */ +Section sec = (Section) i.next(); +out(" Section " + nr++ + ":"); +String s = hex(sec.getFormatID().getBytes()); +s = s.substring(0, s.length() - 1); +out(" Format ID: " + s); + +/* Print the number of properties in this section. */ +int propertyCount = sec.getPropertyCount(); +out(" No. of properties: " + propertyCount); + +/* Print the properties: */ +Property[] properties = sec.getProperties(); +for (int i2 = 0; i2 < properties.length; i2++) +{ + /* Print a single property: */ + Property p = properties[i2]; + int id = p.getID(); + long type = p.getType(); + Object value = p.getValue(); + out(" Property ID: " + id + ", type: " + type + + ", value: " + value); +}</source> + </section> + + <section><title>The Section's Format ID</title> + <p>The first method called on the <code>Section</code> instance is + <code>getFormatID()</code>. As explained above, the format ID of the + first section in a property set determines the type of the property + set. Its type is <code>ClassID</code> which is essentially a sequence of + 16 bytes. 
A real application using its own type of a custom property set + should have defined a unique format ID and, when reading a property set + stream, should check that the format ID is equal to that unique format ID. The + sample program just prints the format ID it finds in a section:</p> + + <source>String s = hex(sec.getFormatID().getBytes()); +s = s.substring(0, s.length() - 1); +out(" Format ID: " + s);</source> + + <p>As you can see, the <code>getFormatID()</code> method returns a + <code>ClassID</code> object. An array containing the bytes can be + retrieved with <code>ClassID.getBytes()</code>. In order to get a nicely + formatted printout, the sample program uses the <code>hex()</code> helper + method which in turn uses the POI utility class <code>HexDump</code> in + the <code>org.apache.poi.util</code> package. Another helper method is + <code>out()</code> which just saves typing + <code>System.out.println()</code>.</p> + </section> + + <section><title>The Properties</title> + <p>Before getting the properties, it is possible to find out how many + properties are available in the section via the + <code>Section.getPropertyCount()</code> method. The sample application uses this + method to print the number of properties to the standard output:</p> + + <source>int propertyCount = sec.getPropertyCount(); +out(" No. of properties: " + propertyCount);</source> + + <p>Now it's time to get to the properties themselves. You can retrieve a + section's properties with the method + <code>Section.getProperties()</code>:</p> + + <source>Property[] properties = sec.getProperties();</source> + + <p>As you can see the result is an array of <code>Property</code> + objects. This class has three methods to retrieve a property's ID, its + type, and its value. 
The following code snippet shows how to call + them:</p> + + <source>for (int i2 = 0; i2 < properties.length; i2++) +{ + /* Print a single property: */ + Property p = properties[i2]; + int id = p.getID(); + long type = p.getType(); + Object value = p.getValue(); + out(" Property ID: " + id + ", type: " + type + + ", value: " + value); +}</source> + </section> + + <section><title>Sample Output</title> + <p>The output of the sample program might look like the following. It + shows the summary information and the document summary information + property sets of a Microsoft Word document. However, unlike the first and + second section of this HOW-TO the application does not have any code + which is specific to the <code>SummaryInformation</code> and + <code>DocumentSummaryInformation</code> classes.</p> + + <source>Property set stream "/SummaryInformation": + No. of sections: 1 + Section 0: + Format ID: 00000000 F2 9F 85 E0 4F F9 10 68 AB 91 08 00 2B 27 B3 D9 ....O..h....+'.. + No. of properties: 17 + Property ID: 1, type: 2, value: 1252 + Property ID: 2, type: 30, value: Titel + Property ID: 3, type: 30, value: Thema + Property ID: 4, type: 30, value: Rainer Klute (Autor) + Property ID: 5, type: 30, value: Test (Stichwörter) + Property ID: 6, type: 30, value: This is a document for testing HPSF + Property ID: 7, type: 30, value: Normal.dot + Property ID: 8, type: 30, value: Unknown User + Property ID: 9, type: 30, value: 3 + Property ID: 18, type: 30, value: Microsoft Word 9.0 + Property ID: 12, type: 64, value: Mon Jan 01 00:59:25 CET 1601 + Property ID: 13, type: 64, value: Thu Jul 18 16:22:00 CEST 2002 + Property ID: 14, type: 3, value: 1 + Property ID: 15, type: 3, value: 20 + Property ID: 16, type: 3, value: 93 + Property ID: 19, type: 3, value: 0 + Property ID: 17, type: 71, value: [B@13582d +Property set stream "/DocumentSummaryInformation": + No. of sections: 2 + Section 0: + Format ID: 00000000 D5 CD D5 02 2E 9C 10 1B 93 97 08 00 2B 2C F9 AE ............+,.. 
+ No. of properties: 14 + Property ID: 1, type: 2, value: 1252 + Property ID: 2, type: 30, value: Test + Property ID: 14, type: 30, value: Rainer Klute (Manager) + Property ID: 15, type: 30, value: Rainer Klute IT-Consulting GmbH + Property ID: 5, type: 3, value: 3 + Property ID: 6, type: 3, value: 2 + Property ID: 17, type: 3, value: 111 + Property ID: 23, type: 3, value: 592636 + Property ID: 11, type: 11, value: false + Property ID: 16, type: 11, value: false + Property ID: 19, type: 11, value: false + Property ID: 22, type: 11, value: false + Property ID: 13, type: 4126, value: [B@56a499 + Property ID: 12, type: 4108, value: [B@506411 + Section 1: + Format ID: 00000000 D5 CD D5 05 2E 9C 10 1B 93 97 08 00 2B 2C F9 AE ............+,.. + No. of properties: 7 + Property ID: 0, type: 0, value: {6=Test-JaNein, 5=Test-Zahl, 4=Test-Datum, 3=Test-Text, 2=_PID_LINKBASE} + Property ID: 1, type: 2, value: 1252 + Property ID: 2, type: 65, value: [B@c9ba38 + Property ID: 3, type: 30, value: This is some text. + Property ID: 4, type: 64, value: Wed Jul 17 00:00:00 CEST 2002 + Property ID: 5, type: 3, value: 27 + Property ID: 6, type: 11, value: true +No property set stream: "/WordDocument" +No property set stream: "/CompObj" +No property set stream: "/1Table"</source> + + <p>There are some interesting items to note:</p> + + <ul> + <li>The first property set (summary information) consists of a single + section, the second property set (document summary information) consists + of two sections.</li> + + <li>Each section type (identified by its format ID) has its own domain of + property IDs. For example, in the second property set the properties with + ID 2 have different meanings in the two sections. 
By the way, the format + IDs of these sections are <strong>not</strong> equal, but you have to + look hard to find the difference.</li> + + <li>The properties are not in any particular order in the section, + although they slightly tend to be sorted by their IDs.</li> + </ul> + </section> + + <section><title>Property IDs</title> + <p>Properties in the same section are distinguished by their IDs. This is + similar to variables in a programming language like Java, which are + distinguished by their names. But unlike variable names, property IDs are + simple integral numbers. There is another similarity, however. Just like + a Java variable has a certain scope (e.g. a member variable in a class), + a property ID also has its scope of validity: the section.</p> + + <p>Two property IDs in sections with different section format IDs + don't have the same meaning even though their IDs might be equal. For + example, ID 4 in the first (and only) section of a summary + information property set denotes the document's author, while ID 4 in the + first section of the document summary information property set means the + document's byte count. The sample output above does not show a property + with an ID of 4 in the first section of the document summary information + property set. That means that the document does not have a byte + count. However, there is a property with an ID of 4 in the + <em>second</em> section: This is a user-defined property ID - we'll get + to that topic in a minute.</p> + + <p>So, how can you find out what the meaning of a certain property ID in + the summary information and the document summary information property set + is? The standard property sets as such don't have any hints about the + <strong>meanings of their property IDs</strong>. For example, the summary + information property set does not tell you that the property ID 4 stands + for the document's author. This is external knowledge. 
Microsoft defined + standard meanings for some of the property IDs in the summary information + and the document summary information property sets. As a help to the Java + and POI programmer, the class <code>PropertyIDMap</code> in the + <code>org.apache.poi.hpsf.wellknown</code> package defines constants + for the "well-known" property IDs. For example, there is the + definition</p> + + <source>public final static int PID_AUTHOR = 4;</source> + + <p>These definitions allow you to use symbolic names instead of + numbers.</p> + + <p>In order to provide support for the other way, too, - i.e. to map + property IDs to property names - the class <code>PropertyIDMap</code> + defines two static methods: + <code>getSummaryInformationProperties()</code> and + <code>getDocumentSummaryInformationProperties()</code>. Both return + <code>java.util.Map</code> objects which map property IDs to + strings. Such a string gives a hint about the property's meaning. For + example, + <code>PropertyIDMap.getSummaryInformationProperties().get(4)</code> + returns the string "PID_AUTHOR". An application could use this string as + a key to a localized string which is displayed to the user, e.g. "Author" + in English or "Verfasser" in German. HPSF might provide such + language-dependent ("localized") mappings in a later release.</p> + + <p>Usually you won't have to deal with those two maps. Instead you should + call the <code>Section.getPIDString(int)</code> method. It returns the + string associated with the specified property ID in the context of the + <code>Section</code> object.</p> + + <p>Above you learned that property IDs have a meaning in the scope of a + section only. However, there are two exceptions to the rule: The property + IDs 0 and 1 have a fixed meaning in <strong>all</strong> sections:</p> + + <table> + <tr> + <th>Property ID</th> + <th>Meaning</th> + </tr> + + <tr> + <td>0</td> + <td>The property's value is a <strong>dictionary</strong>, i.e. 
a + mapping from property IDs to strings.</td> + </tr> + + <tr> + <td>1</td> + <td>The property's value is the number of a <strong>codepage</strong>, + i.e. a mapping from character codes to characters. All strings in the + section containing this property must be interpreted using this + codepage. Typical property values are 1252 (8-bit "western" characters) + or 1200 (16-bit Unicode characters).</td> + </tr> + </table> + </section> + + <section><title>Property types</title> + <p>A property is nothing without its value. It is stored in a property set + stream as a sequence of bytes. You must know the property's + <strong>type</strong> in order to properly interpret those bytes and + reasonably handle the value. A property's type is one of the so-called + Microsoft-defined <strong>"variant types"</strong>. When you call + <code>Property.getType()</code> you'll get a <code>long</code> value + which denotes the property's variant type. The class + <code>Variant</code> in the <code>org.apache.poi.hpsf</code> package + holds most of those <code>long</code> values as named constants. For + example, the constant <code>VT_I4 = 3</code> means a signed integer value + of four bytes. Examples of other types are <code>VT_LPSTR = 30</code> + meaning a null-terminated string of 8-bit characters, <code>VT_LPWSTR = + 31</code> which means a null-terminated Unicode string, or <code>VT_BOOL + = 11</code> denoting a boolean value.</p> + + <p>In most cases you won't need a property's type because HPSF does all + the work for you.</p> + </section> + + <section><title>Property values</title> + <p>When an application wants to retrieve a property's value and calls + <code>Property.getValue()</code>, HPSF has to interpret the bytes making + up the value according to the property's type. The type determines how + many bytes the value consists of and what + to do with them. 
For example, if the type is <code>VT_I4</code>, HPSF + knows that the value is four bytes long and that these bytes + comprise a signed integer value in the little-endian format. This is + quite different from e.g. a type of <code>VT_LPWSTR</code>. In this case + HPSF has to scan the value bytes for a Unicode null character and collect + everything from the beginning to that null character as a Unicode + string.</p> + + <p>The good news is that HPSF does another job for you, too: It maps the + variant type to an appropriate Java type.</p> + + <table> + <tr> + <th>Variant type:</th> + <th>Java type:</th> + </tr> + + <tr> + <td>VT_I2</td> + <td>java.lang.Integer</td> + </tr> + + <tr> + <td>VT_I4</td> + <td>java.lang.Long</td> + </tr> + + <tr> + <td>VT_FILETIME</td> + <td>java.util.Date</td> + </tr> + + <tr> + <td>VT_LPSTR</td> + <td>java.lang.String</td> + </tr> + + <tr> + <td>VT_LPWSTR</td> + <td>java.lang.String</td> + </tr> + + <tr> + <td>VT_CF</td> + <td>byte[]</td> + </tr> + + <tr> + <td>VT_BOOL</td> + <td>java.lang.Boolean</td> + </tr> + + </table> + + <p>The bad news is that there are still a couple of variant types HPSF + does not yet support. If it encounters one of these types it + returns the property's value as a byte array and leaves it to be + interpreted by the application.</p> + + <p>An application retrieves a property's value by calling the + <code>Property.getValue()</code> method. This method's return type is the + general <code>Object</code> class. The <code>getValue()</code> method + looks up the property's variant type, reads the property's value bytes, + creates an instance of an appropriate Java type, assigns it the property's + value and returns it. Primitive types like <code>int</code> or + <code>long</code> will be returned as the corresponding wrapper class, + e.g. 
<code>Integer</code> or <code>Long</code>.</p> + </section> + + + <section><title>Dictionaries</title> + <p>The property with ID 0 has a very special meaning: It is a + <strong>dictionary</strong> mapping property IDs to property names. We + have seen already that the meanings of standard properties in the + summary information and the document summary information property sets + have been defined by Microsoft. The advantage is that the labels of + properties like "Author" or "Title" don't have to be stored in the + property set. However, a user can define custom fields in, say, Microsoft + Word. For each field the user has to specify a name, a type, and a + value.</p> + + <p>The names of the custom-defined fields (i.e. the property names) are + stored in the <strong>dictionary</strong> of the document summary + information's second section. The dictionary is a map which associates + property IDs with property names.</p> + + <p>The method <code>Section.getPIDString(int)</code> works not only with + the well-known property names of the summary information and document + summary information property sets, but with self-defined properties, + too. It should also work with self-defined properties in self-defined + sections.</p> + </section> + + <section><title>Codepage support</title> + <fixme author="Rainer Klute">Improve codepage support!</fixme> + + <p>The property with ID 1 holds the number of the codepage which was used + to encode the strings in this section. The present HPSF codepage support + is still very limited: When reading property value strings, HPSF + distinguishes between 16-bit characters and 8-bit characters. 16-bit + characters should be Unicode characters and thus be okay. 8-bit + characters are interpreted according to the platform's default character + set. This is fine as long as the document being read has been written on + a platform with the same default character set. 
However, if you receive a + document from another region of the world and want to process it with + HPSF you are in trouble - unless the creator used Unicode, of course.</p> + </section> + + <section><title>Further Reading</title> + <p>There are still some aspects of HPSF left which are not covered by this + HOW-TO. You should dig into the Javadoc API documentation to learn + further details. Since you've struggled through this document up to this + point, you are well prepared.</p> + </section> + </section> + </section> + </body> +</document> + +<!-- Keep this comment at the end of the file +Local variables: +mode: xml +sgml-omittag:nil +sgml-shorttag:nil +sgml-namecase-general:nil +sgml-general-insert-case:lower +sgml-minimize-attributes:nil +sgml-always-quote-attributes:t +sgml-indent-step:1 +sgml-indent-data:t +sgml-parent-document:nil +sgml-exposed-tags:nil +sgml-local-catalogs:nil +sgml-local-ecat-files:nil +End: +--> diff --git a/src/documentation/content/xdocs/hpsf/index.xml b/src/documentation/content/xdocs/hpsf/index.xml new file mode 100644 index 0000000000..b0e52c86d8 --- /dev/null +++ b/src/documentation/content/xdocs/hpsf/index.xml @@ -0,0 +1,54 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE document PUBLIC "-//APACHE//DTD Documentation V1.1//EN" "../dtd/document-v11.dtd"> +<!-- $Id$ --> + +<document> + <header> + <title>HPSF (Horrible Property Set Format)</title> + <subtitle>Overview</subtitle> + <authors> + <person name="Rainer Klute" email="klute@apache.org"/> + </authors> + </header> + <body> + <section><title>Overview</title> + <p>Microsoft applications like "Word", "Excel" or "PowerPoint" let the user + describe a document by properties like "title", "category" and so on. The + application itself adds further information: last author, creation date + etc. These document properties are stored in so-called <strong>property set + streams</strong>. 
A property set stream is a separate document within a + <link href="../poifs/index.html">POI filesystem</link>. We'll call property + set streams mostly just "property sets". HPSF is POI's pure-Java + implementation to read (and in future to write) property sets.</p> + + <p>The <link href="how-to.html">HPSF HOWTO</link> describes what a Java + application should do to read a property set using HPSF and to retrieve the + information it needs.</p> + + <p>HPSF supports OLE2 property set streams in general, and is not limited to + the special case of document properties in the Microsoft Office files + mentioned above. The <link href="internals.html">HPSF description</link> + describes the internal structure of property set streams. A separate + document explains the internals of <link href="thumbnails.html">thumbnail + images</link>.</p> + </section> + </body> +</document> + +<!-- Keep this comment at the end of the file +Local variables: +mode: xml +sgml-omittag:nil +sgml-shorttag:nil +sgml-namecase-general:nil +sgml-general-insert-case:lower +sgml-minimize-attributes:nil +sgml-always-quote-attributes:t +sgml-indent-step:1 +sgml-indent-data:t +sgml-parent-document:nil +sgml-exposed-tags:nil +sgml-local-catalogs:nil +sgml-local-ecat-files:nil +End: +--> diff --git a/src/documentation/content/xdocs/hpsf/internals.xml b/src/documentation/content/xdocs/hpsf/internals.xml new file mode 100644 index 0000000000..b4792dcd33 --- /dev/null +++ b/src/documentation/content/xdocs/hpsf/internals.xml @@ -0,0 +1,1010 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE document PUBLIC "-//APACHE//DTD Documentation V1.1//EN" "../dtd/document-v11.dtd"> +<!-- $Id$ --> + +<document> + <header> + <title>HPSF Internals: The Horrible Property Set Format</title> + <authors> + <person name="Rainer Klute" email="klute@rainer-klute.de"/> + </authors> + </header> + <body> + <section><title>HPSF Internals</title> + + <section><title>Introduction</title> + + <p>A Microsoft Office document is 
internally organized like a filesystem + with directories and files. Microsoft calls these files + <strong>streams</strong>. A document can have properties attached to it, + like author, title, number of words etc. These metadata are not stored in + the main stream of, say, a Word document, but instead in a dedicated + stream with a special format. Usually this stream's name is + <code>\005SummaryInformation</code>, where <code>\005</code> represents + the character with a decimal value of 5.</p> + + <p>A single piece of information in the stream is called a + <strong>property</strong>, for example the document title. Each property + has an integral <strong>ID</strong> (e.g. 2 for title), a + <strong>type</strong> (telling that the title is a string of bytes) and a + <strong>value</strong> (what this is should be obvious). A stream + containing properties is called a + <strong>property set stream</strong>.</p> + + <p>This document describes the internal structure of a property set stream, + i.e. the <strong>Horrible Property Set Format (HPSF)</strong>. It does not + describe how a Microsoft Office document is organized internally and how + to retrieve a stream from it. See the <link + href="../poifs/index.html">POIFS documentation</link> for that kind of + stuff.</p> + + <p>The Horrible Property Set Format is not only used in the Summary + Information stream in the top-level document of a Microsoft Office + document. Often there is also a property set stream named + <code>\005DocumentSummaryInformation</code> with additional properties. + Embedded documents may have their own property set streams. You cannot + tell by a stream's name whether it is a property set stream or not. + Instead you have to open the stream and look at its bytes.</p> + </section> + + + + <section><title>Data Types</title> + + <p>Before delving into the details of the property set stream format we + have to have a short look at data types. 
Integral values are stored in the + so-called <strong>little endian</strong> format. In this format the bytes + that make up an integral value are stored in the "wrong" order. For + example, the decimal value 4660 is 0x1234 in the hexadecimal notation. If + you think this should be represented by a byte 0x12 followed by another + byte 0x34, you are right. This is called the <strong>big endian</strong> + format. In the little endian format, however, this order is reversed and + the low-value byte comes first: 0x3412. + </p> + + <p>The following table gives an overview of some important data + types:</p> + + <table> + + <tr> + <th>Name</th> + <th>Length</th> + <th>Example (Little Endian)</th> + <th>Example (Big Endian)</th> + </tr> + + <tr> + <td><strong>Bytes</strong></td> + <td>1 byte</td> + <td><code>0x12</code></td> + <td><code>0x12</code></td> + </tr> + + <tr> + <td><strong>Word</strong></td> + <td>2 bytes</td> + <td><code>0x3412</code></td> + <td><code>0x1234</code></td> + </tr> + + <tr> + <td><strong>DWord</strong></td> + <td>4 bytes</td> + <td><code>0x78563412</code></td> + <td><code>0x12345678</code></td> + </tr> + + <tr> + <td><strong>ClassID</strong><br/> + A sequence of one DWord, two Words and eight Bytes</td> + + <td>16 bytes</td> + + <td><code>0xE0859FF2F94F6810AB9108002B27B3D9</code> resp. + <code>E0859FF2-F94F-6810-AB-91-08-00-2B-27-B3-D9</code></td> + + <td><code>0xF29F85E04FF91068AB9108002B27B3D9</code> resp. + <code>F29F85E0-4FF9-1068-AB-91-08-00-2B-27-B3-D9</code></td> + </tr> + + <tr> + <td></td> + <td></td> + <td>The ClassID examples are given here in two different notations. The + second notation without the "0x" at the beginning and with dashes + inside shows the internal grouping into one DWord, two Words and eight + Bytes.</td> + <td><em>Watch out:</em> Microsoft documentation and tools show class IDs + a little differently, like + <code>F29F85E0-4FF9-1068-AB91-08002B27B3D9</code>. + However, that representation is (intentionally?) 
misleading with + respect to endianness.</td> + </tr> + </table> + </section> + + + + <section><title>HPSF Overview</title> + + <p>A property set stream consists of two main parts:</p> + + <ol> + <li>The <strong>header</strong> and</li> + <li>the <strong>section(s)</strong> containing the properties.</li> + </ol> + </section> + + + + <section><title>The Header</title> + + <p>The first bytes in a property set stream make up the <strong>header</strong>. + It has a fixed length and looks like this:</p> + + <table> + <tr> + <th>Offset</th> + <th>Type</th> + <th>Contents</th> + <th>Remarks</th> + </tr> + + <tr> + <td>0</td> + <td>Word</td> + <td><code>0xFFFE</code></td> + <td>If the first four bytes of a stream do not contain these values, the + stream is not a property set stream.</td> + </tr> + + <tr> + <td>2</td> + <td>Word</td> + <td><code>0x0000</code></td> + <td></td> + </tr> + + <tr> + <td>4</td> + <td>DWord</td> + <td>Denotes the operating system and the OS version under which this + stream was created. The operating system ID is in the DWord's higher + word (after little endian decoding): <code>0x0000</code> for Win16, + <code>0x0001</code> for Macintosh and <code>0x0002</code> for Win32 - that's + all. The reader is most likely aware of the fact that there are some + more operating systems. However, Microsoft does not seem to know.</td> + <td></td> + </tr> + + <tr> + <td>8</td> + <td>ClassID</td> + <td><code>0x00000000000000000000000000000000</code></td> + <td>Most property set streams have this value but this is not + required.</td> + </tr> + + <tr> + <td>24</td> + <td>DWord</td> + <td><code>0x01000000</code> or greater</td> + <td>Section count. This field's value should be equal to 1 or greater. + Microsoft claims that this is a "reserved" field, but it seems to tell + how many sections (see below) follow in the stream. 
This would + really make sense because otherwise you could not know where and how + far you should read section data.</td> + </tr> + </table> + </section> + + + + <section><title>Section List</title> + + <p>Following the header is the section list. This is an array of pairs each + consisting of a section format ID and an offset. This array has as many + pairs of ClassID and DWord fields as the section count field in the + header says. The Summary Information stream contains a single section, the + Document Summary Information stream contains two.</p> + + <table> + <tr> + <th>Type</th> + <th>Contents</th> + <th>Remarks</th> + </tr> + + <tr> + <td>ClassID</td> + <td>Section format ID</td> + <td><code>0xF29F85E04FF91068AB9108002B27B3D9</code> for the single section + in the Summary Information stream.<br/><br/> + + <code>0xD5CDD5022E9C101B939708002B2CF9AE</code> for the first + section in the Document Summary Information stream.</td> + </tr> + + <tr> + <td>DWord</td> + <td>Offset</td> + <td>The number of bytes between the beginning of the stream and the + beginning of the section within the stream.</td> + </tr> + + <tr> + <td>ClassID</td> + <td>Section format ID</td> + <td>...</td> + </tr> + + <tr> + <td>DWord</td> + <td>Offset</td> + <td>...</td> + </tr> + + <tr> + <td>...</td> + <td>...</td> + <td>...</td> + </tr> + </table> + </section> + + + + <section><title>Section</title> + + <p>A section is divided into three parts: the section header (with the + section length and the number of properties in the section), the + properties list (with type and offset of each property), and the + properties themselves. 
Here are the details:</p> + + <table> + <tr> + <th> </th> + <th>Type</th> + <th>Contents</th> + <th>Remarks</th> + </tr> + + <tr> + <td>Section header</td> + + <td>DWord</td> + <td>Length</td> + <td>The length of the section in bytes.</td> + </tr> + + <tr> + <td></td> + <td>DWord</td> + <td>Property count</td> + <td>The number of properties in the section.</td> + </tr> + + <tr> + + <td>Properties list</td> + + <td>DWord</td> + <td>Property ID</td> + <td>The property ID tells what the property means. For example, an ID of + <code>0x0002</code> in the Summary Information stands for the document's + title. See the <link href="#property_ids">Property IDs</link> + chapter below for more details.</td> + </tr> + + <tr> + <td></td> + <td>DWord</td> + <td>Offset</td> + <td>The number of bytes between the beginning of the section and the + property.</td> + </tr> + + <tr> + <td></td> + <td>...</td> + <td>...</td> + <td>...</td> + </tr> + + <tr> + <td>Properties</td> + + <td>DWord</td> + <td>Property type ("variant")</td> + <td>This is the property's data type, e.g. an integer value, a byte + string or a Unicode string. See the + <link href="#property_types"><em>Property Types</em></link> chapter for + details!</td> + </tr> + + <tr> + <td></td> + <td><em>Field length depends on the property type + ("variant")</em></td> + <td>Property value</td> + <td>This field's length depends on the property's type. These are the + bytes that make up the DWord, the byte string or some other data of + fixed or variable length.<br/><br/> + + The property value is always stored in an area whose length is a + multiple of 4. If the property is shorter, e.g. 
a byte + string of 13 bytes, the remaining bytes are padded with <code>0x00</code> + bytes.</td> + </tr> + + <tr> + <td></td> + <td>...</td> + <td>...</td> + <td>...</td> + </tr> + </table> + </section> + + + + <section><title>Property IDs</title> + <anchor id="property_ids"/> + + <p>As seen above, a section holds a property list: an array with property + IDs and offsets. The property ID gives each property a meaning. For + example, in the Summary Information stream the property ID 2 says that + this property is the document's title.</p> + + <p>If you want to know a property ID's meaning, it is not sufficient to + know the ID itself. You must also know the + <strong>section format ID</strong>. For example, in the Document Summary + Information stream the property ID 2 means not the document's title but + its category. Due to Microsoft's infinite wisdom the section format ID is + not part of the section. Thus if you have only a section without the + stream it is in, you cannot make any sense of the properties because you + do not know what they mean.</p> + + <p>So each section format ID has its own name space of property IDs. + Microsoft defined some "well-known" property IDs for the Summary + Information and the Document Summary Information streams. You can extend + them by your own additional IDs. This will be described below.</p> + + <section><title>Property IDs in The Summary Information Stream</title> + + <p>The Summary Information stream has a single section with a section + format ID of <code>0xF29F85E04FF91068AB9108002B27B3D9</code>. The following + table defines the meaning of its property IDs. Each row associates a + property ID with a <em>name</em> and an <em>ID string</em>. (The property + <em>type</em> is given here just for informational purposes. As we have + seen above, the type is always given along with the value.)</p> + + <p>The property <em>name</em> is a readable string which could be + displayed to the user. 
However, this string is useful only for users who + understand English. The property name does not help with other + languages.</p> + + <p>The property <em>ID string</em> is about the same but looks more + technical and is nothing a user should bother with. You could take the ID + string and map it to an appropriate display string in a particular + language. Of course you could do that with the property ID as well and + with less overhead, but people (including software developers) tend to be + better at remembering symbolic constants than remembering numbers.</p> + + <table> + <tr> + <th>Property ID</th> + <th>Property Name</th> + <th>Property ID String</th> + <th>Property Type</th> + </tr> + <tr> + <td>2</td> + <td>Title</td> + <td>PID_TITLE</td> + <td>VT_LPSTR</td> + </tr> + <tr> + <td>3</td> + <td>Subject</td> + <td>PID_SUBJECT</td> + <td>VT_LPSTR</td> + </tr> + <tr> + <td>4</td> + <td>Author</td> + <td>PID_AUTHOR</td> + <td>VT_LPSTR</td> + </tr> + <tr> + <td>5</td> + <td>Keywords</td> + <td>PID_KEYWORDS</td> + <td>VT_LPSTR</td> + </tr> + <tr> + <td>6</td> + <td>Comments</td> + <td>PID_COMMENTS</td> + <td>VT_LPSTR</td> + </tr> + <tr> + <td>7</td> + <td>Template</td> + <td>PID_TEMPLATE</td> + <td>VT_LPSTR</td> + </tr> + <tr> + <td>8</td> + <td>Last Saved By</td> + <td>PID_LASTAUTHOR</td> + <td>VT_LPSTR</td> + </tr> + <tr> + <td>9</td> + <td>Revision Number</td> + <td>PID_REVNUMBER</td> + <td>VT_LPSTR</td> + </tr> + <tr> + <td>10</td> + <td>Total Editing Time</td> + <td>PID_EDITTIME</td> + <td>VT_FILETIME</td> + </tr> + <tr> + <td>11</td> + <td>Last Printed</td> + <td>PID_LASTPRINTED</td> + <td>VT_FILETIME</td> + </tr> + <tr> + <td>12</td> + <td>Create Time/Date</td> + <td>PID_CREATE_DTM</td> + <td>VT_FILETIME</td> + </tr> + <tr> + <td>13</td> + <td>Last Saved Time/Date</td> + <td>PID_LASTSAVE_DTM</td> + <td>VT_FILETIME</td> + </tr> + <tr> + <td>14</td> + <td>Number of Pages</td> + <td>PID_PAGECOUNT</td> + <td>VT_I4</td> + </tr> + <tr> + <td>15</td> + <td>Number of 
Words</td> + <td>PID_WORDCOUNT</td> + <td>VT_I4</td> + </tr> + <tr> + <td>16</td> + <td>Number of Characters</td> + <td>PID_CHARCOUNT</td> + <td>VT_I4</td> + </tr> + <tr> + <td>17</td> + <td>Thumbnail</td> + <td>PID_THUMBNAIL</td> + <td>VT_CF</td> + </tr> + <tr> + <td>18</td> + <td>Name of Creating Application</td> + <td>PID_APPNAME</td> + <td>VT_LPSTR</td> + </tr> + <tr> + <td>19</td> + <td>Security</td> + <td>PID_SECURITY</td> + <td>VT_I4</td> + </tr> + </table> + </section> + + + + <section><title>Property IDs in The Document Summary Information Stream</title> + + <p>The Document Summary Information stream has two sections with a section + format ID of <code>0xD5CDD5022E9C101B939708002B2CF9AE</code> for the first + one. The following table defines the meaning of the property IDs in the + first section. See the preceding section for interpreting the table.</p> + + <table> + <tr> + <th>Property ID</th> + <th>Property name</th> + <th>Property ID string</th> + <th>VT type</th> + </tr> + + <tr> + <td>0</td> + <td>Dictionary</td> + <td>PID_DICTIONARY</td> + <td>[Special format]</td> + </tr> + <tr> + <td>1</td> + <td>Code page</td> + <td>PID_CODEPAGE</td> + <td>VT_I2</td> + </tr> + <tr> + <td>2</td> + <td>Category</td> + <td>PID_CATEGORY</td> + <td>VT_LPSTR</td> + </tr> + <tr> + <td>3</td> + <td>PresentationTarget</td> + <td>PID_PRESFORMAT</td> + <td>VT_LPSTR</td> + </tr> + <tr> + <td>4</td> + <td>Bytes</td> + <td>PID_BYTECOUNT</td> + <td>VT_I4</td> + </tr> + <tr> + <td>5</td> + <td>Lines</td> + <td>PID_LINECOUNT</td> + <td>VT_I4</td> + </tr> + <tr> + <td>6</td> + <td>Paragraphs</td> + <td>PID_PARCOUNT</td> + <td>VT_I4</td> + </tr> + <tr> + <td>7</td> + <td>Slides</td> + <td>PID_SLIDECOUNT</td> + <td>VT_I4</td> + </tr> + <tr> + <td>8</td> + <td>Notes</td> + <td>PID_NOTECOUNT</td> + <td>VT_I4</td> + </tr> + <tr> + <td>9</td> + <td>HiddenSlides</td> + <td>PID_HIDDENCOUNT</td> + <td>VT_I4</td> + </tr> + <tr> + <td>10</td> + <td>MMClips</td> + <td>PID_MMCLIPCOUNT</td> + 
<td>VT_I4</td> + </tr> + <tr> + <td>11</td> + <td>ScaleCrop</td> + <td>PID_SCALE</td> + <td>VT_BOOL</td> + </tr> + <tr> + <td>12</td> + <td>HeadingPairs</td> + <td>PID_HEADINGPAIR</td> + <td>VT_VARIANT | VT_VECTOR</td> + </tr> + <tr> + <td>13</td> + <td>TitlesofParts</td> + <td>PID_DOCPARTS</td> + <td>VT_LPSTR | VT_VECTOR</td> + </tr> + <tr> + <td>14</td> + <td>Manager</td> + <td>PID_MANAGER</td> + <td>VT_LPSTR</td> + </tr> + <tr> + <td>15</td> + <td>Company</td> + <td>PID_COMPANY</td> + <td>VT_LPSTR</td> + </tr> + <tr> + <td>16</td> + <td>LinksUpToDate</td> + <td>PID_LINKSDIRTY</td> + <td>VT_BOOL</td> + </tr> + </table> + </section> + </section> + + + + <section><title>Property Types</title> + <anchor id="property_types"/> + + <p>A property consists of a DWord <em>type field</em> followed by the + property value. The property type is an integer value and tells how the + data bytes following it are to be interpreted. In the Microsoft world it is + also known as the <em>variant</em>.</p> + + <p>The <em>Usage</em> column says where a variant type may occur. Not all + of them are allowed in a property set but just those marked with a [P]. 
+ <strong>[V]</strong> - may appear in a VARIANT, <strong>[T]</strong> - may + appear in a TYPEDESC, <strong>[P]</strong> - may appear in an OLE property + set, <strong>[S]</strong> - may appear in a Safe Array.</p> + + <table> + <tr> + <th>Variant ID</th> + <th>Variant Type</th> + <th>Usage</th> + <th>Description</th> + </tr> + <tr> + <td>0</td> + <td>VT_EMPTY</td> + <td>[V] [P]</td> + <td>nothing</td> + </tr> + <tr> + <td>1</td> + <td>VT_NULL</td> + <td>[V] [P]</td> + <td>SQL style Null</td> + </tr> + <tr> + <td>2</td> + <td>VT_I2</td> + <td>[V] [T] [P] [S]</td> + <td>2 byte signed int</td> + </tr> + <tr> + <td>3</td> + <td>VT_I4</td> + <td>[V] [T] [P] [S]</td> + <td>4 byte signed int</td> + </tr> + <tr> + <td>4</td> + <td>VT_R4</td> + <td>[V] [T] [P] [S]</td> + <td>4 byte real</td> + </tr> + <tr> + <td>5</td> + <td>VT_R8</td> + <td>[V] [T] [P] [S]</td> + <td>8 byte real</td> + </tr> + <tr> + <td>6</td> + <td>VT_CY</td> + <td>[V] [T] [P] [S]</td> + <td>currency</td> + </tr> + <tr> + <td>7</td> + <td>VT_DATE</td> + <td>[V] [T] [P] [S]</td> + <td>date</td> + </tr> + <tr> + <td>8</td> + <td>VT_BSTR</td> + <td>[V] [T] [P] [S]</td> + <td>OLE Automation string</td> + </tr> + <tr> + <td>9</td> + <td>VT_DISPATCH</td> + <td>[V] [T] [P] [S]</td> + <td>IDispatch *</td> + </tr> + <tr> + <td>10</td> + <td>VT_ERROR</td> + <td>[V] [T] [S]</td> + <td>SCODE</td> + </tr> + <tr> + <td>11</td> + <td>VT_BOOL</td> + <td>[V] [T] [P] [S]</td> + <td>True=-1, False=0</td> + </tr> + <tr> + <td>12</td> + <td>VT_VARIANT</td> + <td>[V] [T] [P] [S]</td> + <td>VARIANT *</td> + </tr> + <tr> + <td>13</td> + <td>VT_UNKNOWN</td> + <td>[V] [T] [S]</td> + <td>IUnknown *</td> + </tr> + <tr> + <td>14</td> + <td>VT_DECIMAL</td> + <td>[V] [T] [S]</td> + <td>16 byte fixed point</td> + </tr> + <tr> + <td>16</td> + <td>VT_I1</td> + <td>[T]</td> + <td>signed char</td> + </tr> + <tr> + <td>17</td> + <td>VT_UI1</td> + <td>[V] [T] [P] [S]</td> + <td>unsigned char</td> + </tr> + <tr> + <td>18</td> + 
<td>VT_UI2</td> + <td>[T] [P]</td> + <td>unsigned short</td> + </tr> + <tr> + <td>19</td> + <td>VT_UI4</td> + <td>[T] [P]</td> + <td>unsigned long</td> + </tr> + <tr> + <td>20</td> + <td>VT_I8</td> + <td>[T] [P]</td> + <td>signed 64-bit int</td> + </tr> + <tr> + <td>21</td> + <td>VT_UI8</td> + <td>[T] [P]</td> + <td>unsigned 64-bit int</td> + </tr> + <tr> + <td>22</td> + <td>VT_INT</td> + <td>[T]</td> + <td>signed machine int</td> + </tr> + <tr> + <td>23</td> + <td>VT_UINT</td> + <td>[T]</td> + <td>unsigned machine int</td> + </tr> + <tr> + <td>24</td> + <td>VT_VOID</td> + <td>[T]</td> + <td>C style void</td> + </tr> + <tr> + <td>25</td> + <td>VT_HRESULT</td> + <td>[T]</td> + <td>Standard return type</td> + </tr> + <tr> + <td>26</td> + <td>VT_PTR</td> + <td>[T]</td> + <td>pointer type</td> + </tr> + <tr> + <td>27</td> + <td>VT_SAFEARRAY</td> + <td>[T]</td> + <td>(use VT_ARRAY in VARIANT)</td> + </tr> + <tr> + <td>28</td> + <td>VT_CARRAY</td> + <td>[T]</td> + <td>C style array</td> + </tr> + <tr> + <td>29</td> + <td>VT_USERDEFINED</td> + <td>[T]</td> + <td>user defined type</td> + </tr> + <tr> + <td>30</td> + <td>VT_LPSTR</td> + <td>[T] [P]</td> + <td>null terminated string</td> + </tr> + <tr> + <td>31</td> + <td>VT_LPWSTR</td> + <td>[T] [P]</td> + <td>wide null terminated string</td> + </tr> + <tr> + <td>64</td> + <td>VT_FILETIME</td> + <td>[P]</td> + <td>FILETIME</td> + </tr> + <tr> + <td>65</td> + <td>VT_BLOB</td> + <td>[P]</td> + <td>Length prefixed bytes</td> + </tr> + <tr> + <td>66</td> + <td>VT_STREAM</td> + <td>[P]</td> + <td>Name of the stream follows</td> + </tr> + <tr> + <td>67</td> + <td>VT_STORAGE</td> + <td>[P]</td> + <td>Name of the storage follows</td> + </tr> + <tr> + <td>68</td> + <td>VT_STREAMED_OBJECT</td> + <td>[P]</td> + <td>Stream contains an object</td> + </tr> + <tr> + <td>69</td> + <td>VT_STORED_OBJECT</td> + <td>[P]</td> + <td>Storage contains an object</td> + </tr> + <tr> + <td>70</td> + <td>VT_BLOB_OBJECT</td> + <td>[P]</td> + <td>Blob 
contains an object</td> + </tr> + <tr> + <td>71</td> + <td>VT_CF</td> + <td>[P]</td> + <td>Clipboard format</td> + </tr> + <tr> + <td>72</td> + <td>VT_CLSID</td> + <td>[P]</td> + <td>A Class ID</td> + </tr> + <tr> + <td>0x1000</td> + <td>VT_VECTOR</td> + <td>[P]</td> + <td>simple counted array</td> + </tr> + <tr> + <td>0x2000</td> + <td>VT_ARRAY</td> + <td>[V]</td> + <td>SAFEARRAY*</td> + </tr> + <tr> + <td>0x4000</td> + <td>VT_BYREF</td> + <td>[V]</td> + <td>void* for local use</td> + </tr> + <tr> + <td>0x8000</td> + <td>VT_RESERVED</td> + <td><br/></td> + <td><br/></td> + </tr> + <tr> + <td>0xFFFF</td> + <td>VT_ILLEGAL</td> + <td><br/></td> + <td><br/></td> + </tr> + <tr> + <td>0xFFF</td> + <td>VT_ILLEGALMASKED</td> + <td><br/></td> + <td><br/></td> + </tr> + <tr> + <td>0xFFF</td> + <td>VT_TYPEMASK</td> + <td><br/></td> + <td><br/></td> + </tr> + </table> + </section> + + + + <section><title>References</title> + + <p>In order to assemble the HPSF description I used only information + publicly available on the Internet. The references given below have been very + helpful. If you have any amendments or corrections, please let us know! + Thank you!</p> + + <ol> + + <li>In + <link href="http://www.kyler.com/pubs/ddj9894.html"><em>Understanding OLE + documents</em></link>, Ken Kyler gives an introduction to OLE2 + documents + and especially to property sets. He names the property names, types, and + IDs of the Summary Information and Document Summary Information + stream.</li> + + <li>The + <link href="http://www.dwam.net/docs/oleref/"><em>ActiveX Programmer's + Reference</em></link> at + <link href="http://www.dwam.net/docs/oleref/">http://www.dwam.net/docs/oleref/</link> + seems a little outdated, but that's what I have found.</li> + + <li>An overview of the <code>VT_</code> types is in + <link href="http://www.marin.clara.net/COM/variant_type_definitions.htm"><em>Variant + Type Definitions</em></link>.</li> + + <li>What is a <code>FILETIME</code>? 
The answer can be found + under <link + href="http://msdn.microsoft.com/library/default.asp?url=/library/en-us/sysinfo/base/filetime_str.asp">http://msdn.microsoft.com/library/default.asp?url=/library/en-us/sysinfo/base/filetime_str.asp</link>, <link href="http://www.vbapi.com/ref/f/filetime.html">http://www.vbapi.com/ref/f/filetime.html</link> or + <link href="http://www.cs.rpi.edu/courses/fall01/os/FILETIME.html">http://www.cs.rpi.edu/courses/fall01/os/FILETIME.html</link>. + In short: <em>The FILETIME structure holds a date and time associated + with a file. The structure identifies a 64-bit integer specifying the + number of 100-nanosecond intervals which have passed since January 1, + 1601. This 64-bit value is split into the two dwords stored in the + structure.</em></li> + + <li>Information about the code page property in the + DocumentSummaryInformation stream is available at <link + href="http://msdn.microsoft.com/library/default.asp?url=/library/en-us/stg/stg/property_id_1.asp">http://msdn.microsoft.com/library/default.asp?url=/library/en-us/stg/stg/property_id_1.asp</link>.</li> + + <li>This documentation originates from the <link href="http://www.rainer-klute.de/~klute/Software/poibrowser/doc/HPSF-Description.html">HPSF description</link> available at <link href="http://www.rainer-klute.de/~klute/Software/poibrowser/doc/HPSF-Description.html">http://www.rainer-klute.de/~klute/Software/poibrowser/doc/HPSF-Description.html</link>.</li> + </ol> + </section> + </section> + </body> +</document> + +<!-- Keep this comment at the end of the file +Local variables: +mode: xml +sgml-omittag:nil +sgml-shorttag:nil +sgml-namecase-general:nil +sgml-general-insert-case:lower +sgml-minimize-attributes:nil +sgml-always-quote-attributes:t +sgml-indent-step:1 +sgml-indent-data:t +sgml-parent-document:nil +sgml-exposed-tags:nil +sgml-local-catalogs:nil +sgml-local-ecat-files:nil +End: +--> diff --git a/src/documentation/content/xdocs/hpsf/thumbnails.xml 
b/src/documentation/content/xdocs/hpsf/thumbnails.xml new file mode 100644 index 0000000000..c6d4bb19c3 --- 
/dev/null +++ b/src/documentation/content/xdocs/hpsf/thumbnails.xml @@ -0,0 +1,182 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE document PUBLIC "-//APACHE//DTD Documentation V1.1//EN" +"../dtd/document-v11.dtd"> +<!-- $Id$ --> + +<document> + <header> + <title>HPSF THUMBNAIL HOW-TO</title> + <authors> + <person name="Drew Varner" email="Drew.Varner@-deleteThis-sc.edu" /> + </authors> + </header> + <body> + <section><title>The VT_CF Format</title> + + <p>Thumbnail information is stored as a VT_CF, or Thumbnail Variant. The + Thumbnail Variant is used to store various types of information in a + clipboard. The VT_CF can store information in formats for the Macintosh or + Windows clipboard.</p> + + <p>There are many types of data that can be copied to the clipboard, but the + only types of information needed for thumbnail manipulation are the image + formats.</p> + + <p>The <code>VT_CF</code> structure looks like this:</p> + + <table> + <tr> + <th>Element:</th> + <td>Clipboard Size</td> + <td>Clipboard Format Tag</td> + <td>Clipboard Data</td> + </tr> + <tr> + <th>Size:</th> + <td>32 bit unsigned integer (DWord)</td> + <td>32 bit signed integer (DWord)</td> + <td>variable length (byte array)</td> + </tr> + </table> + + <p>The Clipboard Size refers to the size (in bytes) of Clipboard Data + (variable size) plus the Clipboard Format Tag (four bytes).</p> + + <p>Clipboard Format Tag has four possible values:</p> + + <table> + <tr> + <th>Value</th> + <th>Identifier</th> + <th>Description</th> + </tr> + <tr> + <td><code>-1L</code></td> + <td><code>CFTAG_WINDOWS</code></td> + <td>a built-in Windows© clipboard format value</td> + </tr> + <tr> + <td><code>-2L</code></td> + <td><code>CFTAG_MACINTOSH</code></td> + <td>a Macintosh clipboard format value</td> + </tr> + <tr> + <td><code>-3L</code></td> + <td><code>CFTAG_FMTID</code></td> + <td>a format identifier (FMTID). This is rarely used.</td> + </tr> + <tr> + <td><code>0L</code></td> + <td><code>CFTAG_NODATA</code></td> + 
<td>No data. This is rarely used.</td> + </tr> + </table> + </section> + + + + <section><title>Windows Clipboard Data</title> + + <p>Windows clipboard data has four image formats for thumbnails:</p> + + <table> + <tr> + <th>Value</th> + <th>Identifier</th> + <th>Description</th> + </tr> + <tr> + <td>3</td> + <td><code>CF_METAFILEPICT</code></td> + <td>Windows metafile format - recommended</td> + </tr> + <tr> + <td>8</td> + <td><code>CF_DIB</code></td> + <td>Device Independent Bitmap</td> + </tr> + <tr> + <td>14</td> + <td><code>CF_ENHMETAFILE</code></td> + <td>Enhanced Windows metafile format</td> + </tr> + <tr> + <td>2</td> + <td><code>CF_BITMAP</code></td> + <td>Bitmap - Obsolete - Use <code>CF_DIB</code> instead</td> + </tr> + </table> + </section> + + <section><title>Windows Metafile Format</title> + + <p>The most common format for thumbnails on the Windows platform is the + Windows metafile format. The Clipboard places an extra header in front of + the standard Windows Metafile Format data.</p> + + <p>The Clipboard Data byte array looks like this when an image is stored in + Windows' Clipboard WMF format:</p> + + <table> + <tr> + <th>Identifier</th> + <td>CF_METAFILEPICT</td> + <td>mm</td> + <td>width</td> + <td>height</td> + <td>handle</td> + <td>WMF data</td> + </tr> + <tr> + <th>Size</th> + <td>32 bit unsigned int</td> + <td>16 bit unsigned(?) int</td> + <td>16 bit unsigned(?) int</td> + <td>16 bit unsigned(?) int</td> + <td>16 bit unsigned(?) 
int</td> + <td>byte array - variable length</td> + </tr> + <tr> + <th>Description</th> + <td>Clipboard WMF</td> + <td>Mapping Mode</td> + <td>Image Width</td> + <td>Image Height</td> + <td>handle to the WMF data array in memory, or 0</td> + <td>standard WMF byte stream</td> + </tr> + </table> + </section> + + + <section><title>Device Independent Bitmap</title> + <p><strong>FIXME:</strong> Describe the Device Independent Bitmap + format!</p> + </section> + + + + <section><title>Macintosh Clipboard Data</title> + <p><strong>FIXME:</strong> Describe the Macintosh clipboard formats!</p> + </section> + + </body> +</document> + +<!-- Keep this comment at the end of the file +Local variables: +mode: xml +sgml-omittag:nil +sgml-shorttag:nil +sgml-namecase-general:nil +sgml-general-insert-case:lower +sgml-minimize-attributes:nil +sgml-always-quote-attributes:t +sgml-indent-step:1 +sgml-indent-data:t +sgml-parent-document:nil +sgml-exposed-tags:nil +sgml-local-catalogs:nil +sgml-local-ecat-files:nil +End: +--> diff --git a/src/documentation/content/xdocs/hpsf/todo.xml b/src/documentation/content/xdocs/hpsf/todo.xml new file mode 100644 index 0000000000..3c3ca4e7f3 --- /dev/null +++ b/src/documentation/content/xdocs/hpsf/todo.xml @@ -0,0 +1,65 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!DOCTYPE document PUBLIC "-//APACHE//DTD Documentation V1.1//EN" "../dtd/document-v11.dtd"> +<!-- $Id$ --> + +<document> + <header> + <title>To Do</title> + <authors> + <person name="Rainer Klute" email="klute@rainer-klute.de"/> + </authors> + </header> + <body> + <section><title>To Do</title> + + <p>The following functionality should be added to HPSF:</p> + + <ol> + <li> + Add writing capability for property sets. Presently, property sets can + only be read. + </li> + <li> + Add codepage support: Presently the bytes making up the string in a + property's value are interpreted using the platform's default character + set. 
+ </li> + <li> + Add resource bundles to + <code>org.apache.poi.hpsf.wellknown</code> to ease + localization. This would be useful for mapping standard property IDs to + localized strings. Example: The property ID 4 could be mapped to "Author" + in English or "Verfasser" in German. + </li> + <li> + Implement reading functionality for those property types that are not + yet supported. HPSF should return proper Java types instead of just byte + arrays. + </li> + <li> + Add WMF to <code>java.awt.Image</code> example code in <link + href="thumbnails.html">Thumbnail + HOW TO</link>. + </li> + </ol> + </section> + </body> +</document> + +<!-- Keep this comment at the end of the file +Local variables: +mode: xml +sgml-omittag:nil +sgml-shorttag:nil +sgml-namecase-general:nil +sgml-general-insert-case:lower +sgml-minimize-attributes:nil +sgml-always-quote-attributes:t +sgml-indent-step:1 +sgml-indent-data:t +sgml-parent-document:nil +sgml-exposed-tags:nil +sgml-local-catalogs:nil +sgml-local-ecat-files:nil +End: +-->