<?xml version="1.0" encoding="UTF-8"?>
<!--
   ====================================================================
   Licensed to the Apache Software Foundation (ASF) under one or more
   contributor license agreements.  See the NOTICE file distributed with
   this work for additional information regarding copyright ownership.
   The ASF licenses this file to You under the Apache License, Version 2.0
   (the "License"); you may not use this file except in compliance with
   the License.  You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.
   ====================================================================
-->
<!DOCTYPE document PUBLIC "-//APACHE//DTD Documentation V2.0//EN" "document-v20.dtd">

<document>
 <header>
  <title>Apache POI™ - HWPF and XWPF - Java API to Handle Microsoft Word Files</title>
  <subtitle>Overview</subtitle>
  <authors>
   <person name="Nicola Ken Barozzi" email="barozzi@nicolaken.com"/>
   <person name="Andrew C. Oliver" email="acoliver@apache.org"/>
   <person name="Ryan Ackley" email="sackley@apache.org"/>
   <person name="Rainer Klute" email="klute@apache.org"/>
  </authors>
 </header>

 <body>
 <section><title>Overview</title>

  <p>HWPF is the name of our port of the Microsoft Word 97(-2007) file format
    to pure Java. It also provides limited read-only support for the older
    Word 6 and Word 95 file formats.</p>

  <p>The partner to HWPF for the new Word 2007 .docx format is <em>XWPF</em>.
    Whilst HWPF and XWPF provide similar features, there is not a common
    interface across the two of them at this time.</p>

  <p>Both HWPF and XWPF could be described as "moderately functional". For some
    use cases, especially around text extraction, support is very strong. For
    others, support may be limited or incomplete, and it may be necessary to
    dig down into low-level code. Error checking may be missing in places,
    so it may be possible to accidentally generate invalid files. Enhancements
    to fix such things are generally very well received!</p>

  <p>As detailed in the <a href="site:components">Components
    Page</a>, HWPF is contained within the poi-scratchpad-XXX.jar, while XWPF
    is in the poi-ooxml-XXX.jar. You will need to ensure you include the appropriate
    jars (and their dependencies!) in your classpath to use HWPF or XWPF.</p>

  <p>Please note that in version 3.12, due to a bug, you might need to include
    poi-scratchpad-XXX.jar when using XWPF. This has been fixed for the next
    release, as there should not be such a dependency.</p>

   </section>
   <section>
    <title>An overview of the code</title>
    <p>
        Source in the <em>org.apache.poi.hwpf.model</em> tree is the Java representation of
        the internal Word format structure. This code is "internal" and should not
        be used by your code. The <em>org.apache.poi.hwpf.usermodel</em>
        package is the actual public and (as far as possible) user-friendly API for accessing
        document parts. Source code in the
        <em>org.apache.poi.hwpf.extractor</em>
        tree is a wrapper around this that makes it easy to extract interesting things (e.g. the text),
        and the
        <em>org.apache.poi.hwpf.converter</em>
        package contains Word-to-HTML and Word-to-FO converters (the latter can be used to generate PDF
        from Word files when used with
        <a href="https://xmlgraphics.apache.org/fop/">Apache FOP</a>).
        There is also a small file-structure-dumping utility in the
        <em>org.apache.poi.hwpf.dev</em>
        package, primarily for development purposes.
    </p>

    <p>
        The main entry point to HWPF is HWPFDocument. Currently it has a lot of references both to
        internal interfaces (the
        <em>org.apache.poi.hwpf.model</em>
        package) and to the public API (the
        <em>org.apache.poi.hwpf.usermodel</em>
        package). It is possible that it will be split into two different interfaces (like WordFile
        and WordDocument) in later versions.
    </p>

    <p>
      The main entry point to XWPF is XWPFDocument. From there, you can get the
      paragraphs, pictures, tables, sections, headers etc.
    </p>
    <p>
      Currently, there are only a handful of example programs using HWPF and XWPF
      available. They can be found in the examples section of the source repository, under
      <a href="https://github.com/apache/poi/tree/trunk/poi-examples/src/main/java/org/apache/poi/examples/hwpf">HWPF</a>
      and
      <a href="https://github.com/apache/poi/tree/trunk/poi-examples/src/main/java/org/apache/poi/examples/xwpf">XWPF</a>.
      Both HWPF and XWPF have fairly high levels of unit test coverage, and the
      tests provide examples of using the various areas of functionality of both
      modules. These can be found in the source repository, under
      <a href="https://github.com/apache/poi/tree/trunk/poi-scratchpad/src/test/java/org/apache/poi/hwpf">HWPF</a>
      and
      <a href="https://github.com/apache/poi/tree/trunk/poi-ooxml/src/test/java/org/apache/poi/xwpf">XWPF</a>.
      Contributions of more examples, whether inspired by the unit tests or
      not, would be most welcome!
    </p>

   </section>
   <section>
    <title>HWPF Notes</title>

    <p>A .doc Word document, as handled by HWPF, can be considered as a single, very long
        text buffer. The HWPF API provides "pointers"
        to document parts, like sections, paragraphs and character runs. Usually the user will iterate
        over the sections of the main document part, the paragraphs of each section and the character
        runs of each paragraph. Each such object is a pointer to a subrange of the document text along
        with additional properties (and they all extend the same Range parent class). There are
        additional Range implementations like Table, TableRow, TableCell, etc. Some structures like
        Bookmark or Field can also provide subrange pointers.
    </p>
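
    <p>The idea of ranges as pointers into one shared buffer can be sketched in plain Java.
        This is a toy model under stated assumptions, not the HWPF classes:
        <em>RangeSketch</em> and its methods are hypothetical names invented for illustration,
        and the sketch assumes that every paragraph ends with a carriage return character.
    </p>
    <source><![CDATA[
```java
import java.util.ArrayList;
import java.util.List;

/** Toy model only (not an HWPF class): a range is two offsets into one shared text buffer. */
class RangeSketch {
    final String buffer; // the whole document text, shared by every range
    final int start;     // inclusive offset into buffer
    final int end;       // exclusive offset into buffer

    RangeSketch(String buffer, int start, int end) {
        this.buffer = buffer;
        this.start = start;
        this.end = end;
    }

    /** A range covering the whole buffer, comparable to the overall document range. */
    static RangeSketch over(String buffer) {
        return new RangeSketch(buffer, 0, buffer.length());
    }

    /** The text this range points at. */
    String text() {
        return buffer.substring(start, end);
    }

    /** Splits this range into sub-ranges, each ending just after an occurrence of mark. */
    List<RangeSketch> split(char mark) {
        List<RangeSketch> parts = new ArrayList<>();
        int from = start;
        for (int i = start; i < end; i++) {
            if (buffer.charAt(i) == mark) {
                parts.add(new RangeSketch(buffer, from, i + 1));
                from = i + 1;
            }
        }
        if (from < end) {
            parts.add(new RangeSketch(buffer, from, end)); // trailing text without a mark
        }
        return parts;
    }

    public static void main(String[] args) {
        // Assumption of this sketch: '\r' terminates each paragraph.
        RangeSketch whole = RangeSketch.over("First paragraph.\rSecond one.\r");
        for (RangeSketch paragraph : whole.split('\r')) {
            System.out.println(paragraph.start + ".." + paragraph.end + ": " + paragraph.text().trim());
        }
    }
}
```
]]></source>
    <p>No text is copied when a sub-range is created. Each sub-range only records where its
        part of the shared buffer begins and ends, which is why an edit to the buffer can
        affect other ranges.
    </p>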

    <p>Changing file content usually requires a lot of synchronized changes in those structures, like
        updating property boundaries, position handlers, etc. Because of that, the HWPF API should be
        considered not thread-safe. In addition, there is a "one pointer" rule for changing
        content: you should not use two different Range instances at the same time. More
        precisely, if you change file content using some range pointer, all other range
        pointers except its parents become invalid. For example, if you obtain the overall range (1),
        a paragraph range (2) from the overall range and a character run range (3) from the paragraph
        range, and then change the text of the paragraph, the character run range is now invalid and
        should not be used, but the overall range pointer is still valid. Each time you obtain a range
        (pointer), a new instance is created. This means that if you obtain two range pointers and
        change the document text using the first one, the second one becomes invalid.
    </p>
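
    <p>The "one pointer" rule can be illustrated with a small self-contained model. This is a
        sketch under stated assumptions, not the HWPF implementation:
        <em>OnePointerSketch</em>, <em>Doc</em>, <em>Range</em> and the version counter are
        hypothetical names and mechanisms invented for illustration. In particular, the sketch
        reports use of a stale range with an exception, which HWPF itself should not be assumed
        to do.
    </p>
    <source><![CDATA[
```java
/** Toy model only, not the HWPF implementation: ranges are offsets into one shared buffer. */
class OnePointerSketch {

    /** The shared text buffer plus a counter that is bumped on every edit. */
    static class Doc {
        final StringBuilder text;
        int version = 0;

        Doc(String initial) {
            text = new StringBuilder(initial);
        }
    }

    static class Range {
        final Doc doc;
        final Range parent; // null for the overall range
        int start;          // inclusive offset into doc.text
        int end;            // exclusive offset into doc.text
        int seenVersion;    // document version this range is known to match

        Range(Doc doc, Range parent, int start, int end) {
            this.doc = doc;
            this.parent = parent;
            this.start = start;
            this.end = end;
            this.seenVersion = doc.version;
        }

        /** Every call creates a new pointer; offsets are relative to this range. */
        Range sub(int from, int to) {
            checkValid();
            return new Range(doc, this, start + from, start + to);
        }

        boolean isValid() {
            return seenVersion == doc.version;
        }

        String text() {
            checkValid();
            return doc.text.substring(start, end);
        }

        /** Replaces the text of this range; only this range and its parents stay in sync. */
        void replaceText(String replacement) {
            checkValid();
            int delta = replacement.length() - (end - start);
            doc.text.replace(start, end, replacement);
            doc.version++;
            for (Range r = this; r != null; r = r.parent) {
                r.end += delta;               // parents enclose this range, so only their end moves
                r.seenVersion = doc.version;  // every other range is now stale
            }
        }

        private void checkValid() {
            if (!isValid()) {
                throw new IllegalStateException("stale range: obtain a new one from a valid parent");
            }
        }
    }

    public static void main(String[] args) {
        Doc doc = new Doc("Hello world.\r");
        Range overall = new Range(doc, null, 0, doc.text.length()); // (1)
        Range paragraph = overall.sub(0, 13);                       // (2)
        Range run = paragraph.sub(0, 5);                            // (3)
        paragraph.replaceText("Goodbye.\r");
        System.out.println("overall valid: " + overall.isValid());
        System.out.println("run valid: " + run.isValid());
    }
}
```
]]></source>
    <p>In this model the safe pattern after an edit is to discard any child or sibling ranges
        and obtain new ones from a range that is still valid, such as the overall range.
    </p>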

   </section>
   <section>
    <title>XWPF Patches Required!</title>

    <p>At the moment, XWPF covers many common use cases for reading and writing
     .docx files. Whilst this is a great thing, it does mean that XWPF does
     everything that the current POI committers need it to do, and so none of
     the committers are actively adding new features.</p>

    <p>If you need a feature that XWPF does not currently provide,
     please do send in a patch to add the extra functionality! More details
     on contributing patches are available on the <a
      href="site:guidelines">"Contribution to POI" page</a>.</p>
   </section>

   <section>
    <title>HWPF Patches Required!</title>

    <p>At the moment we unfortunately do not have anyone taking care of HWPF
     and fostering its development. What we need is someone to stand up, take
     it under their wing and push it forward. Ryan Ackley,
     who put a lot of effort into HWPF, is no longer on board, so HWPF is an
     orphan child waiting to be adopted.</p>

    <p>If <strong>you</strong> are interested in becoming the new HWPF
     pointman, you should look into the Microsoft Word internals. A good
     starting point seems to be Ryan Ackley's <a
      href="site:docformat">overview</a>. An introduction to the binary
     file formats is <a
     href="https://msdn.microsoft.com/en-us/library/cc998577%28v=office.12%29.aspx">available
     from Microsoft</a>, which has some good references and links. After that,
     the full details of the Word format are available from
     <a href="https://msdn.microsoft.com/en-us/library/cc313153%28v=office.12%29.aspx">Microsoft</a>,
     but the documentation can be a little hard to get into at first. Try reading the
     <a href="site:docformat">overview</a> first and looking at the existing
     code, then look up the documentation for specific missing features.</p>

    <p>As a first step you should familiarize yourself with the source code,
     examples, test cases, and the HWPF patches available at <a
      href="https://issues.apache.org/">Bugzilla</a> (if any). Then you
     should compile an overview of</p>

    <ul>
     <li>the current HWPF status,</li>
     <li>the patches in <a
       href="https://issues.apache.org/bugzilla/">Bugzilla</a> to be checked
      in (and those that are better discarded),</li>
     <li>the available test cases and the test cases still to be written,</li>
     <li>the available documentation and the docs to be written,</li>
     <li>anything else that seems reasonable</li>
    </ul>

    <p>When you start coding, you will not yet have write access to the
     SVN repository. Please submit your patches to <a
      href="https://issues.apache.org/">Bugzilla</a> and nag <a
      href="mailto:dev@poi.apache.org">the dev list</a> until someone commits
     them. Besides the actual checking in of HWPF patches, current POI
     committers will also do some minor reviews now and then of your source code
     patches, test cases and documentation to help ensure software quality. But
     most of the time you will be on your own. However, anyone offering useful
     contributions over a period of time will be offered committership!</p>

    <p>Please do not forget to write <a
      href="https://www.junit.org/">JUnit</a> test cases and documentation!
     We won't accept code that doesn't come with test cases. And please
     consider that other contributors should be able to understand your source
     code easily. If you need any help getting started with JUnit test cases
     for HWPF, please ask on the developers' mailing list! If you show that you
     are prepared to stick at it you will most likely be given SVN commit
     access. See the <a href="site:guidelines">"Contribution to POI" page</a>
     for more details and help getting started.</p>

    <p>Of course we will help you as best we can. However, presently there
     is no committer who is really familiar with the Word format, so you'll be
     mostly on your own. We look forward to you and your contributions!
     The honor and glory of becoming a POI committer await!</p>
   </section>
 </body>
</document>

<!-- Keep this comment at the end of the file
Local variables:
mode: xml
sgml-omittag:nil
sgml-shorttag:nil
sgml-namecase-general:nil
sgml-general-insert-case:lower
sgml-minimize-attributes:nil
sgml-always-quote-attributes:t
sgml-indent-step:1
sgml-indent-data:t
sgml-parent-document:nil
sgml-exposed-tags:nil
sgml-local-catalogs:nil
sgml-local-ecat-files:nil
End:
-->