A few small updates to the HSLF usage docs, and adding some initial documentation...

author Nick Burch <nick@apache.org>

Thu, 9 Jun 2005 13:12:59 +0000 (13:12 +0000)

committer Nick Burch <nick@apache.org>

Thu, 9 Jun 2005 13:12:59 +0000 (13:12 +0000)
diff --git a/src/documentation/content/xdocs/hslf/book.xml b/src/documentation/content/xdocs/hslf/book.xml

index cc92cdb1c7f26b6830c74f94dfa2494914fe1535..a0a827b0a7b35632011d4571f70aa5699e04f857 100644 (file)
--- a/src/documentation/content/xdocs/hslf/book.xml
+++ b/src/documentation/content/xdocs/hslf/book.xml
@@ -13,6 +13,7 @@
      <menu label="HSLF">
          <menu-item label="Overview" href="index.html"/>
          <menu-item label="Quick Guide" href="quick-guide.html"/>
+        <menu-item label="PPT File Format" href="ppt-file-format.html"/>
         </menu>
         
  </book>
diff --git a/src/documentation/content/xdocs/hslf/ppt-file-format.xml b/src/documentation/content/xdocs/hslf/ppt-file-format.xml

new file mode 100644 (file)

index 0000000..ede2eee
--- /dev/null
+++ b/src/documentation/content/xdocs/hslf/ppt-file-format.xml
@@ -0,0 +1,181 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!-- Copyright (C) 2004 The Apache Software Foundation. All rights reserved. -->
+<!DOCTYPE document PUBLIC "-//APACHE//DTD Documentation V1.1//EN" "../dtd/document-v11.dtd">
+
+<document>
+    <header>
+        <title>POI-HSLF - A Guide to the PowerPoint File Format</title>
+        <subtitle>Overview</subtitle>
+        <authors>
+            <person name="Nick Burch" email="nick at torchbox dot com"/>
+        </authors>
+    </header>
+
+    <body>
+        <section><title>Records, Containers and Atoms</title>
+               <p>
+               PowerPoint documents are made up of a tree of records. A record may
+               contain either other records (in which case it is a Container),
+               or data (in which case it's an Atom). A record can't hold both.
+               </p>
+               <p>
+               PowerPoint documents don't have one overall container record. Instead,
+               there are a number of different container records to be found at
+               the top level.
+               </p>
+               <p>
+               Any numbers or strings stored in the records are always stored in
+               Little Endian format (least significant byte first). This is the case
+               no matter what platform the file was written on, be that a
+               Little Endian or a Big Endian system.
+               </p>
+               <p>
+               PowerPoint may have Escher (DDF) records embedded in it. These
+               are always held as the children of a PPDrawing record (record
+               type 1036). Escher records have the same format as PowerPoint
+               records.
+               </p>
+               </section>
+               
+               <section><title>Record Headers</title>
+               <p>
+               All records, be they containers or atoms, have the same standard
+               8 byte header. It is:
+               </p>
+               <ul><li>1/2 byte container flag</li>
+               <li>1.5 byte option field</li>
+               <li>2 byte record type</li>
+               <li>4 byte record length</li></ul>
+               <p>
+               If the first byte of the header, bitwise ANDed with 0x0f, is 0x0f,
+               then the record is a container. Otherwise, it's an atom. The rest
+               of the first two bytes are used to store the "options" for the
+               record. Most commonly, this is used to indicate the version of
+               the record, but the exact usage is record specific.
+               </p>
+               <p>
+               The record type is a little endian number, which tells you what
+               kind of record you're dealing with. Each different kind of record
+               has its own value that gets stored here. PowerPoint records have
+               a type that's normally less than 6000 (decimal). Escher records
+               normally have a type between 0xF000 and 0xF1FF.
+               </p>
+               <p>
+               The record length is another little endian number. For an atom,
+               it's the size of the data part of the record, i.e. the length
+               of the record <em>less</em> its 8 byte record header. For a
+               container, it's the size of all the records that are children of
+               this record. That means that the size of a container record is the
+               length, plus 8 bytes for its record header.
+               </p>
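+               <p>
+               The header layout above can be sketched in Java as follows. This is
+               only an illustration: RecordHeaderSketch is a hypothetical helper
+               written for this guide, not a class that ships with POI, and it
+               assumes the whole header is available in a byte array.
+               </p>

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical helper (not a POI class) that decodes the 8 byte record header. */
final class RecordHeaderSketch {
    final boolean container; // true if the low nibble of the first byte is 0x0f
    final int options;       // the remaining 12 bits of the first two bytes
    final int type;          // 2 byte record type
    final long length;       // 4 byte record length, read as an unsigned value

    private RecordHeaderSketch(boolean container, int options, int type, long length) {
        this.container = container;
        this.options = options;
        this.type = type;
        this.length = length;
    }

    /** Decodes the header that starts at the given offset in data. */
    static RecordHeaderSketch parse(byte[] data, int offset) {
        // Everything in the header is little endian, whatever the platform.
        ByteBuffer buf = ByteBuffer.wrap(data, offset, 8).order(ByteOrder.LITTLE_ENDIAN);
        int flagAndOptions = buf.getShort() & 0xFFFF;
        int type = buf.getShort() & 0xFFFF;
        long length = buf.getInt() & 0xFFFFFFFFL;
        boolean container = (flagAndOptions & 0x0F) == 0x0F;
        return new RecordHeaderSketch(container, flagAndOptions >>> 4, type, length);
    }

    /** Size of the whole record on disk: the length field plus the 8 byte header. */
    long totalSize() {
        return length + 8;
    }
}
```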
+               </section>
+
+               <section><title>CurrentUserAtom, UserEditAtom and PersistPtrIncrementalBlock</title>
+               <p><strong>aka Records that care about the byte level position of other records</strong></p>
+               <p>
+               A small number of records contain byte level position offsets to other
+               records. If you change the position of any records in the file, then
+               there's a good chance that you will need to update some of these
+               special records.
+               </p>
+               <p>
+               First up, CurrentUserAtom. This is actually stored in a different
+               OLE2 (POIFS) stream to the main PowerPoint document. It contains
+               a few bits of information on who last edited the file. Most
+               importantly, at byte 8 of its contents, it stores (as a 32 bit
+               little endian number) the offset in the main stream to the most
+               recent UserEditAtom.
+               </p>
+               <p>
+               The UserEditAtom contains two byte level offsets (again as 32 bit
+               little endian numbers). At byte 12 is the offset to the 
+               PersistPtrIncrementalBlock associated with this UserEditAtom
+               (each UserEditAtom has one and only one PersistPtrIncrementalBlock).
+               At byte 8, there's the offset to the previous UserEditAtom. If this
+               is 0, then you're at the first one.
+               </p>
+               <p>
+               Every time you do a non-full save in PowerPoint, it tacks on another
+               UserEditAtom and another PersistPtrIncrementalBlock. The 
+               CurrentUserAtom is updated to point to this new UserEditAtom, and the
+               new UserEditAtom points back to the previous UserEditAtom. You then
+               end up with a chain, starting from the CurrentUserAtom, linking
+               back through all the UserEditAtoms, until you reach the first one
+               from a full save.
+               </p>
+<source>
+/-------------------------------\
+| CurrentUserAtom (own stream)  |
+|   OffsetToCurrentEdit = 10562 |==\
+\-------------------------------/  |
+                                   |
+/==================================/
+|                                         /-----------------------------------\
+|                                         | PersistPtrIncrementalBlock @ 6144 |
+|                                         \-----------------------------------/
+|  /---------------------------------\                  |
+|  | UserEditAtom @ 6176             |                  |
+|  |   LastUserEditAtomOffset = 0    |                  |
+|  |   PersistPointersOffset =  6144 |==================/
+|  \---------------------------------/
+|                 |                       /-----------------------------------\
+|                 \====================\  | PersistPtrIncrementalBlock @ 8646 |
+|                                      |  \-----------------------------------/
+|  /---------------------------------\ |                |
+|  | UserEditAtom @ 8674             | |                |
+|  |   LastUserEditAtomOffset = 6176 |=/                |
+|  |   PersistPointersOffset =  8646 |==================/
+|  \---------------------------------/
+|                 |                       /------------------------------------\
+|                 \====================\  | PersistPtrIncrementalBlock @ 10538 |
+|                                      |  \------------------------------------/
+|  /---------------------------------\ |                |
+\==| UserEditAtom @ 10562            | |                |
+   |   LastUserEditAtomOffset = 8674 |=/                |
+   |   PersistPointersOffset = 10538 |==================/
+   \---------------------------------/
+</source>
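+               <p>
+               Walking the chain drawn above could look roughly like the sketch
+               below. UserEditChainSketch is a hypothetical name, not a POI class.
+               The sketch assumes that the whole main stream is held in memory,
+               that "byte 8" and "byte 12" are counted from the start of the
+               atom's data (after its 8 byte header), and that every earlier
+               UserEditAtom sits at a lower offset than the one pointing to it.
+               </p>

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not a POI class) that follows the UserEditAtom chain. */
final class UserEditChainSketch {
    private static final int HEADER_SIZE = 8;

    /**
     * Starts at the most recent UserEditAtom (the offset stored in the
     * CurrentUserAtom) and returns the PersistPtrIncrementalBlock offsets,
     * newest first.
     */
    static List<Integer> persistPtrOffsets(byte[] mainStream, int latestUserEditOffset) {
        ByteBuffer buf = ByteBuffer.wrap(mainStream).order(ByteOrder.LITTLE_ENDIAN);
        List<Integer> offsets = new ArrayList<>();
        int current = latestUserEditOffset;
        while (true) {
            int data = current + HEADER_SIZE;    // skip the record header (assumption)
            int previous = buf.getInt(data + 8); // offset of the previous UserEditAtom
            offsets.add(buf.getInt(data + 12));  // offset of this edit's PersistPtrIncrementalBlock
            if (previous == 0) {
                return offsets;                  // reached the first (full save) edit
            }
            if (previous >= current) {
                // Assumed safety check so that a corrupt file cannot loop forever.
                throw new IllegalStateException("UserEditAtom chain does not move backwards");
            }
            current = previous;
        }
    }
}
```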
+               <p>
+               The PersistPtrIncrementalBlock contains byte offsets to all the
+               Slides, Notes, Documents and MasterSlides in the file. The first
+               PersistPtrIncrementalBlock will point to all the ones that
+               were present the first time the file was saved. Subsequent 
+               PersistPtrIncrementalBlocks will contain pointers to all the ones
+               that were changed in that edit. To find the offset to a given
+               sheet in the latest version, start with the most recent
+               PersistPtrIncrementalBlock. If this knows about the sheet, use the
+               offset it has. If it doesn't, then work back through older
+               PersistPtrIncrementalBlocks until you find one which does, and
+               use that.
+               </p>
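+               <p>
+               The lookup rule above amounts to a newest-first search. The sketch
+               below assumes each PersistPtrIncrementalBlock has already been
+               decoded into a map from sheet number to offset; the class and
+               method names are hypothetical and are not part of POI.
+               </p>

```java
import java.util.List;
import java.util.Map;

/** Hypothetical helper (not a POI class) for finding the latest offset of a sheet. */
final class SheetOffsetLookupSketch {
    /**
     * Searches the decoded blocks, most recent first, and returns the first
     * offset found for the sheet, or null if no block knows about it.
     */
    static Integer findOffset(List<Map<Integer, Integer>> blocksNewestFirst, int sheetNumber) {
        for (Map<Integer, Integer> block : blocksNewestFirst) {
            Integer offset = block.get(sheetNumber);
            if (offset != null) {
                return offset; // the newest block that mentions the sheet wins
            }
        }
        return null;
    }
}
```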
+               <p>
+               Each PersistPtrIncrementalBlock can contain a number of entry
+               blocks. Each block holds information on a sequence of sheets.
+               Each block starts with a 32 bit little endian integer. Once read
+               into memory, the lower 20 bits contain the starting number for the
+               sequence of sheets to be described. The higher 12 bits contain
+               the count of the number of sheets described. Following that is
+               one 32 bit little endian integer for each sheet in the sequence, 
+               the value being the offset to that sheet. If there is any data
+               left after parsing a block, then it corresponds to the next block.
+               </p>
+<source>
+hex on disk      decimal        description
+-----------      -------        -----------
+0000             0              No options
+7217             6002           Record type is 6002
+2000 0000        32             Length of data is 32 bytes
+0100 5000        5242881        Count is 5 (12 highest bits)
+                                Starting number is 1 (20 lowest bits)
+0000 0000        0              Sheet (1+0)=1 starts at offset 0
+900D 0000        3472           Sheet (1+1)=2 starts at offset 3472
+E403 0000        996            Sheet (1+2)=3 starts at offset 996
+9213 0000        5010           Sheet (1+3)=4 starts at offset 5010
+BE15 0000        5566           Sheet (1+4)=5 starts at offset 5566
+0900 1000        1048585        Count is 1 (12 highest bits)
+                                Starting number is 9 (20 lowest bits)
+4418 0000        6212           Sheet (9+0)=9 starts at offset 6212
+</source>
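+               <p>
+               Decoding the data part of a PersistPtrIncrementalBlock, as in the
+               dump above, might be sketched like this. PersistPtrBlockSketch is a
+               hypothetical name, not a POI class, and the input is assumed to be
+               the record's data with the 8 byte header already removed. Truncated
+               data makes the ByteBuffer reads throw a BufferUnderflowException.
+               </p>

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper (not a POI class) that decodes PersistPtrIncrementalBlock data. */
final class PersistPtrBlockSketch {
    /** Returns a map from sheet number to byte offset, in the order read. */
    static Map<Integer, Integer> parse(byte[] data) {
        ByteBuffer buf = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
        Map<Integer, Integer> offsets = new LinkedHashMap<>();
        while (buf.remaining() >= 4) {
            int info = buf.getInt();
            int start = info & 0xFFFFF; // lower 20 bits: first sheet number in the run
            int count = info >>> 20;    // upper 12 bits: number of sheets in the run
            for (int i = 0; i < count; i++) {
                offsets.put(start + i, buf.getInt()); // one 32 bit offset per sheet
            }
        }
        return offsets;
    }
}
```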
+               </section>
+       </body>
+</document>
diff --git a/src/documentation/content/xdocs/hslf/quick-guide.xml b/src/documentation/content/xdocs/hslf/quick-guide.xml

index 5f6525232c88565ea5c2bf18574861edd0480876..7b7b98deda0abc3a616d3b6abc5a95192e178ec6 100644 (file)
--- a/src/documentation/content/xdocs/hslf/quick-guide.xml
+++ b/src/documentation/content/xdocs/hslf/quick-guide.xml
@@ -15,8 +15,9 @@
          <section><title>Basic Text Extraction</title>
          <p>For basic text extraction, make use of 
  <code>org.apache.poi.hslf.extractor.PowerPointExtractor</code>. It accepts a file or an input
-stream. The <code>getText()</code> method can be used to get the text from the slides,
-from the notes, or from both.
+stream. The <code>getText()</code> method can be used to get the text from the slides, and the <code>getNotes()</code> method can be used to get the text
+from the notes. Finally, <code>getText(true,true)</code> will get the text
+from both.
                 </p>
                 </section>
                 
@@ -31,19 +32,45 @@ what type of text it is (eg Body, Title)
                 </p>
                 </section>
                 
+        <section><title>Poor Quality Text Extraction</title>
+        <p>If speed is the most important thing for you, you don't care
+               about getting duplicate blocks of text, you don't care about 
+               getting text from master sheets, and you don't care about getting
+               old text, then 
+               <code>org.apache.poi.hslf.extractor.QuickButCruddyTextExtractor</code>
+               might be of use.</p>
+               <p>QuickButCruddyTextExtractor doesn't use the normal record 
+               parsing code. Instead, it uses a tree structure blind search 
+               method to get all text holding records. You will get all the text,
+               including lots of text you normally wouldn't ever want. However,
+               you will get it back very, very fast!</p>
+               <p>There are two ways of getting the text back. 
+               <code>getTextAsString()</code> will return a single string with all
+               the text in it. <code>getTextAsVector()</code> will return a 
+               vector of strings, one for each text record found in the file.
+               </p>
+               </section>
+
                 <section><title>Changing Text</title>
-               <p>It is possible to change the text via <code>TextRun.setText(String)</code>. However, if
-the length of the text is changed, things will break because PowerPoint has
-internal file references in byte offsets, which are not yet all updated when
-the size changes.
+               <p>It is possible to change the text via 
+               <code>TextRun.setText(String)</code>. However, if the length of 
+               the text is changed, things will break because PowerPoint has
+               internal file references in byte offsets. We currently update all
+               of these byte references that we know about when writing out, but
+               there are a few more still to be found.
                 </p>
                 </section>
                 
                 <section><title>Guide to key classes</title>
                 <ul>
                 <li><code>org.apache.poi.hslf.HSLFSlideShow</code>
-  Handles reading in and writing out files. Generates a tree of the records
-  in the file
+               Handles reading in and writing out files. Calls 
+               <code>org.apache.poi.hslf.record.Record</code> to build a tree
+               of all the records in the file, which it allows access to.
+               </li>
+               <li><code>org.apache.poi.hslf.record.Record</code>
+               Base class of all records. Also provides the main record generation
+               code, which will build up a tree of records for a file.
                 </li>
                 <li><code>org.apache.poi.hslf.usermodel.SlideShow</code>
    Builds up model entries from the records, and presents a user facing
@@ -55,4 +82,4 @@ the size changes.
                 </ul>
                 </section>
         </body>
-</document>
-\ No newline at end of file
+</document>