author | Said Ryan Ackley <sackley@apache.org> | 2002-04-14 06:45:58 +0000 |
---|---|---|
committer | Said Ryan Ackley <sackley@apache.org> | 2002-04-14 06:45:58 +0000 |
commit | a99d6c8ef4c6b8a08f1a6d5a8a997705536dfa87 (patch) | |
tree | d4548b419a5c696587950b393e6bba802c969cf6 | |
parent | c7785a5ce332f9be920ba7ee2b4b812f5e3334c1 (diff) | |
Work in progress
git-svn-id: https://svn.apache.org/repos/asf/jakarta/poi/trunk@352408 13f79535-47bb-0310-9956-ffa450edef68
-rw-r--r-- | src/documentation/xdocs/hdf/docoverview.xml | 94 |
1 files changed, 94 insertions, 0 deletions
diff --git a/src/documentation/xdocs/hdf/docoverview.xml b/src/documentation/xdocs/hdf/docoverview.xml
new file mode 100644
index 0000000000..d23c71b894
--- /dev/null
+++ b/src/documentation/xdocs/hdf/docoverview.xml
@@ -0,0 +1,94 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE document PUBLIC "-//APACHE//DTD Documentation V1.0//EN" "../dtd/document-v10.dtd">
+
+<document>
+ <header>
+ <title>HDF</title>
+ <subtitle>Word file format</subtitle>
+ <authors>
+ <person name="S. Ryan Ackley" email="sackley@cfl.rr.com"/>
+ </authors>
+ </header>
+
+ <body>
+ <s1 title="The Word 97 File Format in semi-plain English">
+
+ <p>The purpose of this document is to give a brief high level overview of the
+ Word 97 document format. This document does not go into in-depth technical
+ detail and is only meant as a supplement to the Microsoft Word 97 Binary
+ File Format written by Microsoft.</p>
+ <p>The OLE file format is not discussed in this document. It is assumed that
+ the reader has a working knowledge of the POIFS API. </p>
+
+ <s2 title="Word file structure">
+ <p>A Word file is made up of the document text and data structures
+ containing formatting information about the text. Of course, this is a
+ very simplified illustration. There are fields and macros and other
+ things that have not been considered. At this stage, HDF is mainly
+ concerned with formatted text.</p>
+ </s2>
+ <s2 title="Reading Word files">
+ <p>The entry point for HDF's reading of a Word file is the File Information
+ Block (FIB). This structure is the entry point for the locations and size
+ of a document's text and data structures. The FIB is located at the
+ beginning of the main stream.</p>
+ <s3 title="Text">
+ <p>The document's text is also located in the main stream. Its starting
+ location is given as FIB.fcMin and its length is given in bytes by
+ FIB.ccpText. These two values are not very useful in getting the text
+ because of unicode. There may be unicode text intermingled with ASCII
+ text. That brings us to the piece table.</p>
+ <p>The piece table is used to divide the text into non-unicode and unicode
+ pieces. The offset and size are given in FIB.fcClx and FIB.lcbClx
+ respectively. The piece table may contain Property Modifiers (prm).
+ These are for complex(fast-saved) files and are skipped. Each text piece
+ contains offsets in the main stream that contain text for that piece.
+ If the piece does not use unicode, the file offset is masked with a certain
+ bit. Then you have to unmask the bit and divide by 2 to get the real file
+ offset. </p>
+ </s3>
+ <s3 title="Text Formatting">
+ <s4 title="Stylesheet">
+ <p>All text formatting is based on styles contained in the StyleSheet.
+ The StyleSheet is a data structure containing among other things, style
+ descriptions. Each style description can contain a paragraph style and
+ a character style or simply a character style. Each style description
+ is stored in a compressed version on file. Basically these are deltas
+ from another style.</p>
+ <p>Eventually, you have to chain back to the nil style which is an
+ imaginary style with certain implied values.</p>
+ </s4>
+ <s4 title="Paragraph and Character styles">
+ <p>Paragraph and Character formatting properties for a document's text are
+ stored on file as deltas from some base style in the Stylesheet. The
+ deltas are used to create a complete uncompressed style in memory.</p>
+ <p>Uncompressed paragraph styles are represented by the Paragraph
+ Properties(PAP) data structure. Uncompressed character styles are
+ represented by the Character Properties(CHP) data structure. The styles
+ for the document text are stored in compressed format in the
+ corresponding Formatted Disk Pages (FKP). A compressed PAP is referred
+ to as a PAPX and a compressed CHP is a CHPX. The FKP locations are
+ stored in the bin table. There are separate bin tables for CHPXs and
+ PAPXs. The bin tables' locations and sizes are stored in the FIB.</p>
+ <p>A FKP is a 512 byte OLE page. It contains the offsets of the beginning
+ and end of each paragraph/character run in the main stream and the
+ compressed properties for that interval. The compressed PAPX is based on
+ its base style in the StyleSheet. The compressed CHPX is based on the
+ enclosing paragraph's base style in the Stylesheet.</p>
+ </s4>
+ <s4 title="Uncompressing styles and other data structures">
+ <p>All compressed properties(CHPX, PAPX, SEPX) contain a grpprl. A grpprl
+ is an array of sprms. A sprm defines a delta from some base property.
+ There is a table of possible sprms in the Word 97 spec. Each sprm is a
+ two byte opcode followed by a parameter. The parameter size depends on
+ the sprm. Each sprm describes an operation that should be performed on
+ the base style. After every sprm in the grpprl is performed on the base
+ style you will have the style for the paragraph, character run,
+ section, etc.</p>
+ </s4>
+ </s3>
+ </s2>
+ </s1>
+ </body>
+</document>
+
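The "Uncompressing styles and other data structures" section added by this commit describes walking a grpprl and applying each sprm to a base style. A minimal Java sketch of that loop follows, under stated assumptions: the class name `GrpprlSketch`, the map-based style, and the property keys are hypothetical and are not HDF code; the two opcode values and the rule that the top three bits of the two-byte code give the parameter size are recalled from the Word 97 spec and should be checked against it; special-case sprms with irregular sizes are ignored.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class GrpprlSketch {
    // Assumed opcode values, for illustration only; verify against the Word 97 spec.
    static final int SPRM_C_F_BOLD = 0x0835; // assumed: bold flag, 1-byte parameter
    static final int SPRM_C_HPS = 0x4A43;    // assumed: font size in half-points, 2-byte parameter

    /** Parameter size in bytes, assumed to be encoded in the top three bits of the code. */
    static int parameterSize(int opcode, byte[] grpprl, int paramOffset) {
        switch ((opcode >> 13) & 0x7) {
            case 0: case 1: return 1;
            case 2: case 4: case 5: return 2;
            case 3: return 4;
            case 7: return 3;
            default: return 1 + (grpprl[paramOffset] & 0xFF); // variable: length byte comes first
        }
    }

    /** Copies the base style, then applies every sprm in the grpprl as a delta. */
    static Map<String, Integer> uncompress(Map<String, Integer> base, byte[] grpprl) {
        Map<String, Integer> style = new LinkedHashMap<>(base);
        int pos = 0;
        while (pos + 2 < grpprl.length) {
            // The two-byte code is stored little-endian.
            int opcode = (grpprl[pos] & 0xFF) | ((grpprl[pos + 1] & 0xFF) << 8);
            pos += 2;
            int size = parameterSize(opcode, grpprl, pos);
            if (pos + size > grpprl.length) {
                break; // truncated grpprl: stop rather than read past the end
            }
            if (opcode == SPRM_C_F_BOLD) {
                style.put("bold", grpprl[pos] & 0xFF);
            } else if (opcode == SPRM_C_HPS) {
                style.put("halfPoints", (grpprl[pos] & 0xFF) | ((grpprl[pos + 1] & 0xFF) << 8));
            }
            // Unknown sprms are skipped; their size is still needed to find the next one.
            pos += size;
        }
        return style;
    }

    public static void main(String[] args) {
        Map<String, Integer> base = new LinkedHashMap<>();
        base.put("bold", 0);
        base.put("halfPoints", 20);
        // Two sprms: bold on, then font size 24 half-points.
        byte[] grpprl = {0x35, 0x08, 0x01, 0x43, 0x4A, 0x18, 0x00};
        System.out.println(uncompress(base, grpprl));
    }
}
```

The base map is copied rather than modified because one StyleSheet style can serve as the base for many paragraphs and character runs.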