--- /dev/null
+<?xml version="1.0" encoding="UTF-8"?>
+<!-- Copyright (C) 2004 The Apache Software Foundation. All rights reserved. -->
+<!DOCTYPE document PUBLIC "-//APACHE//DTD Documentation V1.1//EN" "../dtd/document-v11.dtd">
+
+<document>
+ <header>
+ <title>POI-HSLF - A Guide to the PowerPoint File Format</title>
+ <subtitle>Overview</subtitle>
+ <authors>
+ <person name="Nick Burch" email="nick at torchbox dot com"/>
+ </authors>
+ </header>
+
+ <body>
+ <section><title>Records, Containers and Atoms</title>
+ <p>
+ PowerPoint documents are made up of a tree of records. A record may
+ contain either other records (in which case it is a Container),
+ or data (in which case it's an Atom). A record can't hold both.
+ </p>
+ <p>
+ PowerPoint documents don't have one overall container record. Instead,
+ there are a number of different container records to be found at
+ the top level.
+ </p>
+ <p>
+ Any numbers or strings stored in the records are always stored in
+ Little Endian format (least significant byte first). This is the case
+ no matter what platform the file was written on - be that a
+ Little Endian or a Big Endian system.
+ </p>
+ <p>
+ PowerPoint may have Escher (DDF) records embedded in it. These
+ are always held as the children of a PPDrawing record (record
+ type 1036). Escher records have the same format as PowerPoint
+ records.
+ </p>
+ </section>
+
+ <section><title>Record Headers</title>
+ <p>
+ All records, be they containers or atoms, have the same standard
+ 8 byte header. It is:
+ </p>
+ <ul><li>0.5 byte (4 bit) container flag</li>
+ <li>1.5 byte (12 bit) option field</li>
+ <li>2 byte record type</li>
+ <li>4 byte record length</li></ul>
+ <p>
+ If the first byte of the header, bitwise ANDed with 0x0f, is 0x0f,
+ then the record is a container. Otherwise, it's an atom. The rest
+ of the first two bytes are used to store the "options" for the
+ record. Most commonly, this is used to indicate the version of
+ the record, but the exact usage is record specific.
+ </p>
+ <p>
+ The record type is a little endian number, which tells you what
+ kind of record you're dealing with. Each different kind of record
+ has its own value that gets stored here. PowerPoint records have
+ a type that's normally less than 6000 (decimal). Escher records
+ normally have a type between 0xF000 and 0xF1FF.
+ </p>
+ <p>
+ The record length is another little endian number. For an atom,
+ it's the size of the data part of the record, i.e. the length
+ of the record <em>less</em> its 8 byte record header. For a
+ container, it's the size of all the records that are children of
+ this record. That means that the size of a container record is the
+ length, plus 8 bytes for its record header.
+ </p>
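+ <p>
+ The header layout above can be sketched in Java roughly as follows.
+ This is a minimal illustration, not POI's own parsing code: the class
+ <code>RecordHeaderSketch</code> and its field names are hypothetical,
+ and it assumes the options are the upper 12 bits of the first two
+ bytes when they are read as one little endian 16 bit number.
+ </p>

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper (not part of POI): decodes the standard 8 byte record header.
final class RecordHeaderSketch {
    final boolean container; // true if the low 4 bits of the first byte are 0x0f
    final int options;       // assumed: the remaining 12 bits of the first two bytes
    final int type;          // 2 byte record type
    final long length;       // 4 byte record length, excluding the header itself

    private RecordHeaderSketch(boolean container, int options, int type, long length) {
        this.container = container;
        this.options = options;
        this.type = type;
        this.length = length;
    }

    // Reads the 8 header bytes starting at the given offset.
    static RecordHeaderSketch parse(byte[] data, int offset) {
        ByteBuffer buf = ByteBuffer.wrap(data, offset, 8).order(ByteOrder.LITTLE_ENDIAN);
        int flagAndOptions = Short.toUnsignedInt(buf.getShort());
        int type = Short.toUnsignedInt(buf.getShort());
        long length = Integer.toUnsignedLong(buf.getInt());
        boolean container = (flagAndOptions & 0x0F) == 0x0F;
        int options = flagAndOptions >>> 4;
        return new RecordHeaderSketch(container, options, type, length);
    }

    public static void main(String[] args) {
        // Header bytes of an atom with no options, type 6002 and 32 bytes of data.
        byte[] header = {0x00, 0x00, 0x72, 0x17, 0x20, 0x00, 0x00, 0x00};
        RecordHeaderSketch h = parse(header, 0);
        System.out.println("type=" + h.type + " length=" + h.length
                + " container=" + h.container);
    }
}
```

+ <p>
+ For a container, the next header to read would start 8 bytes after
+ this one. For an atom, it would start 8 bytes plus the record length
+ after it.
+ </p>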
+ </section>
+
+ <section><title>CurrentUserAtom, UserEditAtom and PersistPtrIncrementalBlock</title>
+ <p><strong>aka Records that care about the byte level position of other records</strong></p>
+ <p>
+ A small number of records contain byte level position offsets to other
+ records. If you change the position of any records in the file, then
+ there's a good chance that you will need to update some of these
+ special records.
+ </p>
+ <p>
+ First up, CurrentUserAtom. This is actually stored in a different
+ OLE2 (POIFS) stream to the main PowerPoint document. It contains
+ a few bits of information on who last edited the file. Most
+ importantly, at byte 8 of its contents, it stores (as a 32 bit
+ little endian number) the offset in the main stream to the most
+ recent UserEditAtom.
+ </p>
+ <p>
+ The UserEditAtom contains two byte level offsets (again as 32 bit
+ little endian numbers). At byte 12 is the offset to the
+ PersistPtrIncrementalBlock associated with this UserEditAtom
+ (each UserEditAtom has one and only one PersistPtrIncrementalBlock).
+ At byte 8, there's the offset to the previous UserEditAtom. If this
+ is 0, then you're at the first one.
+ </p>
+ <p>
+ Every time you do a non-full save in PowerPoint, it tacks on another
+ UserEditAtom and another PersistPtrIncrementalBlock. The
+ CurrentUserAtom is updated to point to this new UserEditAtom, and the
+ new UserEditAtom points back to the previous UserEditAtom. You then
+ end up with a chain, starting from the CurrentUserAtom, linking
+ back through all the UserEditAtoms, until you reach the first one
+ from a full save.
+ </p>
+<source>
+/-------------------------------\
+| CurrentUserAtom (own stream) |
+| OffsetToCurrentEdit = 10562 |==\
+\-------------------------------/ |
+ |
+/==================================/
+| /-----------------------------------\
+| | PersistPtrIncrementalBlock @ 6144 |
+| \-----------------------------------/
+| /---------------------------------\ |
+| | UserEditAtom @ 6176 | |
+| | LastUserEditAtomOffset = 0 | |
+| | PersistPointersOffset = 6144 |==================/
+| \---------------------------------/
+| | /-----------------------------------\
+| \====================\ | PersistPtrIncrementalBlock @ 8646 |
+| | \-----------------------------------/
+| /---------------------------------\ | |
+| | UserEditAtom @ 8674 | | |
+| | LastUserEditAtomOffset = 6176 |=/ |
+| | PersistPointersOffset = 8646 |==================/
+| \---------------------------------/
+| | /------------------------------------\
+| \====================\ | PersistPtrIncrementalBlock @ 10538 |
+| | \------------------------------------/
+| /---------------------------------\ | |
+\==| UserEditAtom @ 10562 | | |
+ | LastUserEditAtomOffset = 8674 |=/ |
+ | PersistPointersOffset = 10538 |==================/
+ \---------------------------------/
+</source>
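+ <p>
+ Walking the chain shown in the diagram could look something like the
+ sketch below. The class <code>EditChainSketch</code> is hypothetical
+ and not part of POI. It assumes that the byte positions 8 and 12 are
+ counted from the start of the UserEditAtom's data, i.e. after its
+ 8 byte record header, and that the starting offset has already been
+ read from the CurrentUserAtom.
+ </p>

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of POI): follows the UserEditAtom chain backwards.
final class EditChainSketch {
    static final int HEADER_SIZE = 8;

    // Returns {userEditAtomOffset, persistPtrOffset} pairs, newest edit first.
    // Offsets are read as signed ints, which assumes the stream is under 2GB.
    static List<int[]> walk(byte[] stream, int latestEditOffset) {
        ByteBuffer buf = ByteBuffer.wrap(stream).order(ByteOrder.LITTLE_ENDIAN);
        List<int[]> chain = new ArrayList<>();
        int offset = latestEditOffset;
        while (true) {
            // Assumed layout: data byte 8 = previous edit, data byte 12 = persist pointers.
            int previous = buf.getInt(offset + HEADER_SIZE + 8);
            int persistPtr = buf.getInt(offset + HEADER_SIZE + 12);
            chain.add(new int[] {offset, persistPtr});
            if (previous == 0) {
                break; // reached the UserEditAtom from the first full save
            }
            if (previous >= offset) {
                // Sanity check added for this sketch: older edits sit earlier in the stream.
                throw new IllegalStateException("UserEditAtom chain does not move backwards");
            }
            offset = previous;
        }
        return chain;
    }
}
```

+ <p>
+ With the offsets from the diagram, starting at 10562 would yield the
+ UserEditAtoms at 10562, 8674 and 6176, in that order.
+ </p>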
+ <p>
+ The PersistPtrIncrementalBlock contains byte offsets to all the
+ Slides, Notes, Documents and MasterSlides in the file. The first
+ PersistPtrIncrementalBlock will point to all the ones that
+ were present the first time the file was saved. Subsequent
+ PersistPtrIncrementalBlocks will contain pointers to all the ones
+ that were changed in that edit. To find the offset to a given
+ sheet in the latest version, start with the most recent
+ PersistPtrIncrementalBlock. If this knows about the sheet, use the
+ offset it has. If it doesn't, then work back through older
+ PersistPtrIncrementalBlocks until you find one which does, and
+ use that.
+ </p>
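+ <p>
+ The lookup just described might be sketched like this. It assumes
+ each PersistPtrIncrementalBlock has already been decoded into a map
+ from sheet number to offset, and that the maps are supplied newest
+ first. The class <code>SheetLookupSketch</code> is hypothetical and
+ not part of POI.
+ </p>

```java
import java.util.List;
import java.util.Map;
import java.util.OptionalInt;

// Hypothetical helper (not part of POI): finds the latest offset of a sheet.
final class SheetLookupSketch {
    // blocksNewestFirst: one map per PersistPtrIncrementalBlock, most recent first.
    static OptionalInt findOffset(List<Map<Integer, Integer>> blocksNewestFirst, int sheet) {
        for (Map<Integer, Integer> block : blocksNewestFirst) {
            Integer offset = block.get(sheet);
            if (offset != null) {
                return OptionalInt.of(offset); // the newest block that knows the sheet wins
            }
        }
        return OptionalInt.empty(); // no block mentions this sheet
    }
}
```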
+ <p>
+ Each PersistPtrIncrementalBlock can contain a number of entry
+ blocks. Each block holds information on a sequence of sheets.
+ Each block starts with a 32 bit little endian integer. Once read
+ into memory, the lower 20 bits contain the starting number for the
+ sequence of sheets to be described. The higher 12 bits contain
+ the count of the number of sheets described. Following that is
+ one 32 bit little endian integer for each sheet in the sequence,
+ the value being the offset to that sheet. If there is any data
+ left after parsing a block, then it corresponds to the next block.
+ </p>
+<source>
+hex on disk decimal description
+----------- ------- -----------
+0000 0 No options
+7217 6002 Record type is 6002
+2000 0000 32 Length of data is 32 bytes
+0100 5000 5242881 Count is 5 (12 highest bits)
+ Starting number is 1 (20 lowest bits)
+0000 0000 0 Sheet (1+0)=1 starts at offset 0
+900D 0000 3472 Sheet (1+1)=2 starts at offset 3472
+E403 0000 996 Sheet (1+2)=3 starts at offset 996
+9213 0000 5010 Sheet (1+3)=4 starts at offset 5010
+BE15 0000 5566 Sheet (1+4)=5 starts at offset 5566
+0900 1000 1048585 Count is 1 (12 highest bits)
+ Starting number is 9 (20 lowest bits)
+4418 0000 6212 Sheet (9+0)=9 starts at offset 6212
+</source>
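+ <p>
+ As a rough illustration, the data part of the block in the hex dump
+ above (the 32 bytes after the 8 byte header) could be decoded as
+ follows. The class <code>PersistPtrSketch</code> is hypothetical and
+ not part of POI, and the sketch assumes the data is well formed, so
+ that every count is followed by that many offsets.
+ </p>

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not part of POI): decodes PersistPtrIncrementalBlock data.
final class PersistPtrSketch {
    // data: the record contents without the 8 byte header.
    // Returns a map from sheet number to byte offset, in file order.
    static Map<Integer, Integer> parse(byte[] data) {
        ByteBuffer buf = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
        Map<Integer, Integer> offsets = new LinkedHashMap<>();
        while (buf.remaining() >= 4) {
            int info = buf.getInt();
            int start = info & 0xFFFFF; // lower 20 bits: first sheet number
            int count = info >>> 20;    // upper 12 bits: number of sheets
            for (int i = 0; i < count; i++) {
                offsets.put(start + i, buf.getInt());
            }
        }
        return offsets;
    }

    public static void main(String[] args) {
        byte[] data = {
            0x01, 0x00, 0x50, 0x00,                   // count 5, starting at 1
            0x00, 0x00, 0x00, 0x00,
            (byte) 0x90, 0x0D, 0x00, 0x00,
            (byte) 0xE4, 0x03, 0x00, 0x00,
            (byte) 0x92, 0x13, 0x00, 0x00,
            (byte) 0xBE, 0x15, 0x00, 0x00,
            0x09, 0x00, 0x10, 0x00,                   // count 1, starting at 9
            0x44, 0x18, 0x00, 0x00
        };
        System.out.println(parse(data));
    }
}
```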
+ </section>
+ </body>
+</document>
<section><title>Basic Text Extraction</title>
<p>For basic text extraction, make use of
 <code>org.apache.poi.hslf.extractor.PowerPointExtractor</code>. It accepts a file or an input
-stream. The <code>getText()</code> method can be used to get the text from the slides,
-from the notes, or from both.
+stream. The <code>getText()</code> method can be used to get the text from the slides, and the <code>getNotes()</code> method can be used to get the text
+from the notes. Finally, <code>getText(true,true)</code> will get the text
+from both.
</p>
</section>
</p>
</section>
+ <section><title>Poor Quality Text Extraction</title>
+ <p>If speed is the most important thing for you, you don't care
+ about getting duplicate blocks of text, you don't care about
+ getting text from master sheets, and you don't care about getting
+ old text, then
+ <code>org.apache.poi.hslf.extractor.QuickButCruddyTextExtractor</code>
+ might be of use.</p>
+ <p>QuickButCruddyTextExtractor doesn't use the normal record
+ parsing code. Instead, it uses a blind search that ignores the tree
+ structure to find all text holding records. You will get all the text,
+ including lots of text you normally wouldn't ever want. However,
+ you will get it back very, very fast!</p>
+ <p>There are two ways of getting the text back.
+ <code>getTextAsString()</code> will return a single string with all
+ the text in it. <code>getTextAsVector()</code> will return a
+ vector of strings, one for each text record found in the file.
+ </p>
+ </section>
+
<section><title>Changing Text</title>
- <p>It is possible to change the text via <code>TextRun.setText(String)</code>. However, if
-the length of the text is changed, things will break because PowerPoint has
-internal file references in byte offsets, which are not yet all updated when
-the size changes.
+ <p>It is possible to change the text via
+ <code>TextRun.setText(String)</code>. However, if the length of
+ the text is changed, things may break because PowerPoint has
+ internal file references in byte offsets. We currently update all
+ of the byte references that we know about when writing out, but
+ there are a few more still to be found.
</p>
</section>
<section><title>Guide to key classes</title>
<ul>
<li><code>org.apache.poi.hslf.HSLFSlideShow</code>
- Handles reading in and writing out files. Generates a tree of the records
- in the file
+ Handles reading in and writing out files. Calls
+ <code>org.apache.poi.hslf.record.Record</code> to build a tree
+ of all the records in the file, which it allows access to.
+ </li>
+ <li><code>org.apache.poi.hslf.record.Record</code>
+ Base class of all records. Also provides the main record generation
+ code, which will build up a tree of records for a file.
</li>
 <li><code>org.apache.poi.hslf.usermodel.SlideShow</code>
Builds up model entries from the records, and presents a user facing
</ul>
</section>
</body>
-</document>
\ No newline at end of file
+</document>