<?xml version="1.0" encoding="UTF-8"?>
<!--
====================================================================
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
====================================================================
-->
<!DOCTYPE document PUBLIC "-//APACHE//DTD Documentation V2.0//EN" "document-v20.dtd">
<document>
<header>
<title>Apache POI™ - HWPF and XWPF - Java API to Handle Microsoft Word Files</title>
<subtitle>Overview</subtitle>
<authors>
<person name="Nicola Ken Barozzi" email="barozzi@nicolaken.com"/>
<person name="Andrew C. Oliver" email="acoliver@apache.org"/>
<person name="Ryan Ackley" email="sackley@apache.org"/>
<person name="Rainer Klute" email="klute@apache.org"/>
</authors>
</header>
<body>
<section><title>Overview</title>
<p>HWPF is the name of our port of the Microsoft Word 97(-2007) file format
to pure Java. It also provides limited read-only support for the older
Word 6 and Word 95 file formats.</p>
<p>The partner to HWPF for the new Word 2007 .docx format is <em>XWPF</em>.
Whilst HWPF and XWPF provide similar features, there is not a common
interface across the two of them at this time.</p>
<p>Both HWPF and XWPF could be described as "moderately functional". For some
use cases, especially around text extraction, support is very strong. For
others, support may be limited or incomplete, and it may be necessary to
dig down into low-level code. Error checking may be missing in places,
so it may be possible to accidentally generate invalid files. Enhancements
to fix such things are generally very well received!</p>
<p>As detailed in the <a href="site:components">Components
Page</a>, HWPF is contained within the poi-scratchpad-XXX.jar, while XWPF
is in the poi-ooxml-XXX.jar. You will need to ensure you include the appropriate
jars (and their dependencies!) in your classpath to use HWPF or XWPF.</p>
<p>Please note that in version 3.12, due to a bug, you might need to include
poi-scratchpad-XXX.jar when using XWPF. This has been fixed for the next
release, as there should not be such a dependency.</p>
</section>
<section>
<title>An overview of the code</title>
<p>
Source in the <em>org.apache.poi.hwpf.model</em> tree is the Java representation of
the internal Word format structure. This code is "internal" and should not
be used by your code. Code in the <em>org.apache.poi.hwpf.usermodel</em>
package is the actual public and (as far as possible) user-friendly API for accessing document
parts. Source code in the
<em>org.apache.poi.hwpf.extractor</em>
tree is a wrapper around this to facilitate easy extraction of interesting things (e.g. the text),
and the
<em>org.apache.poi.hwpf.converter</em>
package contains Word-to-HTML and Word-to-FO converters (the latter can be used to generate PDF
from Word files when used with
<a href="https://xmlgraphics.apache.org/fop/">Apache FOP</a>).
There is also a small file-structure-dumping utility in the
<em>org.apache.poi.hwpf.dev</em>
package, primarily for development purposes.
</p>
<p>
The main entry point to HWPF is HWPFDocument. Currently it has a lot of references both to
internal interfaces (the
<em>org.apache.poi.hwpf.model</em>
package) and to the public API (the
<em>org.apache.poi.hwpf.usermodel</em>
package). It is possible that it will be split into two different interfaces (like WordFile
and WordDocument) in later versions.
</p>
<p>
The main entry point to XWPF is XWPFDocument. From there, you can get the
paragraphs, pictures, tables, sections, headers etc.
</p>
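<p>
As a minimal sketch of that entry point, the following builds a small document in memory,
writes it out and reads the paragraphs back through XWPFDocument. The class name
<em>XwpfRoundTrip</em> and its two helper methods are illustrative names, not part of POI,
and the example assumes that poi-ooxml-XXX.jar and its dependencies are on the classpath.
</p>
<source><![CDATA[
```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.poi.xwpf.usermodel.XWPFRun;

public class XwpfRoundTrip {

    /** Builds a .docx in memory with one paragraph per line and returns its bytes. */
    static byte[] write(String... lines) throws IOException {
        try (XWPFDocument doc = new XWPFDocument();
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            for (String line : lines) {
                XWPFParagraph paragraph = doc.createParagraph();
                XWPFRun run = paragraph.createRun();
                run.setText(line);
            }
            doc.write(out);
            return out.toByteArray();
        }
    }

    /** Re-opens the bytes and collects the text of each body paragraph. */
    static List<String> readParagraphs(byte[] docx) throws IOException {
        List<String> texts = new ArrayList<>();
        try (XWPFDocument doc = new XWPFDocument(new ByteArrayInputStream(docx))) {
            for (XWPFParagraph paragraph : doc.getParagraphs()) {
                texts.add(paragraph.getText());
            }
        }
        return texts;
    }

    public static void main(String[] args) throws IOException {
        byte[] docx = write("First paragraph", "Second paragraph");
        readParagraphs(docx).forEach(System.out::println);
    }
}
```
]]></source>
<p>
Tables, pictures and headers are reachable from the same XWPFDocument instance in a similar
way, for example through <em>getTables()</em>, <em>getAllPictures()</em> and <em>getHeaderList()</em>.
</p>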
<p>
Currently, there are only a handful of example programs using HWPF and XWPF
available. They can be found in the examples section of the source repository, under
<a href="https://github.com/apache/poi/tree/trunk/poi-examples/src/main/java/org/apache/poi/examples/hwpf">HWPF</a>
and
<a href="https://github.com/apache/poi/tree/trunk/poi-examples/src/main/java/org/apache/poi/examples/xwpf">XWPF</a>.
Both HWPF and XWPF have fairly high levels of unit test coverage, and the tests
provide examples of using the various areas of functionality of both
modules. These can be found in the source repository, under
<a href="https://github.com/apache/poi/tree/trunk/poi-scratchpad/src/test/java/org/apache/poi/hwpf">HWPF</a>
and
<a href="https://github.com/apache/poi/tree/trunk/poi-ooxml/src/test/java/org/apache/poi/xwpf">XWPF</a>.
Contributions of more examples, whether inspired by the unit tests or
not, would be most welcome!
</p>
</section>
<section>
<title>HWPF Notes</title>
<p>A .doc Word document, as handled by HWPF, can be considered as a single, very long
text buffer. The HWPF API provides "pointers"
to document parts, like sections, paragraphs and character runs. Usually you will iterate
over the sections of the main document part, the paragraphs of each section and the character
runs of each paragraph. Each such object is a pointer to a subrange of the document text along with additional
properties (and they all extend the same Range parent class). There are additional Range
implementations like Table, TableRow, TableCell, etc. Some structures like Bookmark or Field
can also provide subrange pointers.
</p>
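<p>
The iteration over sections, paragraphs and character runs might look roughly like the
following. This is a sketch rather than a reference implementation: <em>HwpfTraversal</em> and
<em>dumpRuns</em> are made-up names, the input is whatever .doc file you pass on the command
line, and poi-scratchpad-XXX.jar with its dependencies is assumed to be on the classpath.
</p>
<source><![CDATA[
```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.usermodel.CharacterRun;
import org.apache.poi.hwpf.usermodel.Paragraph;
import org.apache.poi.hwpf.usermodel.Range;
import org.apache.poi.hwpf.usermodel.Section;

public class HwpfTraversal {

    /** Prints every character run of the main document part and returns how many were seen. */
    static int dumpRuns(File docFile) throws IOException {
        int runs = 0;
        try (InputStream in = new FileInputStream(docFile);
             HWPFDocument doc = new HWPFDocument(in)) {
            Range overall = doc.getRange(); // range of the main document part
            for (int s = 0; s < overall.numSections(); s++) {
                Section section = overall.getSection(s);
                for (int p = 0; p < section.numParagraphs(); p++) {
                    Paragraph paragraph = section.getParagraph(p);
                    for (int r = 0; r < paragraph.numCharacterRuns(); r++) {
                        CharacterRun run = paragraph.getCharacterRun(r);
                        // Each range knows its offsets into the document text.
                        System.out.printf("section %d, paragraph %d, run %d [%d..%d) bold=%b: %s%n",
                                s, p, r, run.getStartOffset(), run.getEndOffset(),
                                run.isBold(), run.text());
                        runs++;
                    }
                }
            }
        }
        return runs;
    }

    public static void main(String[] args) throws IOException {
        dumpRuns(new File(args[0]));
    }
}
```
]]></source>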
<p>Changing file content usually requires a lot of synchronized changes in those structures, like
updating property boundaries, position handlers, etc. Because of that, the HWPF API should be
considered not thread safe. In addition, there is a "one pointer" rule for changing
content: you should not use two different Range instances at the same time. More
precisely, if you change file content using some range pointer, all other range
pointers except its parents become invalid. For example, if you obtain the overall range (1),
a paragraph range (2) from the overall range and a character run range (3) from the paragraph range, and
then change the text of the paragraph, the character run range is now invalid and should not be used, but
the overall range pointer is still valid. Each time you obtain a range (pointer), a new instance is
created. So if you obtained two range pointers and changed the document text using the first
range pointer, the second one becomes invalid.
</p>
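<p>
To see why other pointers go stale, consider the following toy model in plain Java. It is
not HWPF code and says nothing about how HWPF is implemented internally; <em>ToyRange</em>
and its methods are invented purely for illustration. It only mimics the idea of ranges as
offsets into one shared text buffer, where an edit updates the range you edit through and
its parents, and nothing else.
</p>
<source><![CDATA[
```java
public class OnePointerRuleDemo {

    /** Toy "pointer" to the half-open subrange [start, end) of a shared text buffer. */
    static final class ToyRange {
        private final StringBuilder buffer; // the single document-wide text buffer
        private final ToyRange parent;      // null for the overall range
        private final int start;
        private int end;

        private ToyRange(StringBuilder buffer, ToyRange parent, int start, int end) {
            this.buffer = buffer;
            this.parent = parent;
            this.start = start;
            this.end = end;
        }

        /** Creates the overall range covering the whole text. */
        static ToyRange over(String text) {
            return new ToyRange(new StringBuilder(text), null, 0, text.length());
        }

        /** Every call hands out a new, independent pointer with this range as its parent. */
        ToyRange sub(int start, int end) {
            return new ToyRange(buffer, this, start, end);
        }

        String text() {
            return buffer.substring(start, end);
        }

        /** Inserts text at the start of this range; only this range and its parents grow. */
        void insertBefore(String inserted) {
            buffer.insert(start, inserted);
            for (ToyRange r = this; r != null; r = r.parent) {
                r.end += inserted.length();
            }
        }
    }

    public static void main(String[] args) {
        ToyRange overall = ToyRange.over("abcdEFGH");
        ToyRange paragraph1 = overall.sub(0, 4);   // "abcd"
        ToyRange run = paragraph1.sub(2, 4);       // "cd"
        ToyRange paragraph2 = overall.sub(4, 8);   // "EFGH"

        paragraph1.insertBefore("xx");

        System.out.println(overall.text());            // parent of the edited range: still correct
        System.out.println(paragraph1.text());         // the edited range itself: still correct
        System.out.println(run.text());                // child pointer: stale offsets
        System.out.println(paragraph2.text());         // sibling pointer: stale offsets
        System.out.println(overall.sub(6, 10).text()); // freshly obtained pointer: correct again
    }
}
```
]]></source>
<p>
The practical consequence for real HWPF code is the same: after changing text through one
range, obtain child and sibling ranges again from a still-valid parent instead of reusing
old instances.
</p>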
</section>
<section>
<title>XWPF Patches Required!</title>
<p>At the moment, XWPF covers many common use cases for reading and writing
.docx files. Whilst this is a great thing, it does mean that XWPF does
everything that the current POI committers need it to do, and so none of
the committers are actively adding new features.</p>
<p>If you need a feature that XWPF doesn't currently provide,
please do send in a patch to add the extra functionality! More details
on contributing patches are available on the <a
href="site:guidelines">"Contribution to POI" page</a>.</p>
</section>
<section>
<title>HWPF Patches Required!</title>
<p>At the moment we unfortunately do not have anyone taking care of HWPF
and fostering its development. What we need is someone to stand up, take
it under their wing and push it forward. Ryan Ackley,
who put a lot of effort into HWPF, is no longer on board, so HWPF is an
orphan child waiting to be adopted.</p>
<p>If <strong>you</strong> are interested in becoming the new HWPF
point person, you should look into the Microsoft Word internals. A good
starting point seems to be Ryan Ackley's <a
href="site:docformat">overview</a>. An introduction to the binary
file formats is <a
href="https://msdn.microsoft.com/en-us/library/cc998577%28v=office.12%29.aspx">available
from Microsoft</a>, which has some good references and links. After that,
the full details on the Word format are available from
<a href="https://msdn.microsoft.com/en-us/library/cc313153%28v=office.12%29.aspx">Microsoft</a>,
but the documentation can be a little hard to get into at first. Try reading the
<a href="site:docformat">overview</a> first, and looking at the existing
code, then finally look up the documentation for specific missing features.</p>
<p>As a first step you should familiarize yourself with the source code,
examples, test cases, and the HWPF patches available at <a
href="https://issues.apache.org/">Bugzilla</a> (if any). Then you
should compile an overview of</p>
<ul>
<li>the current HWPF status,</li>
<li>the patches in <a
href="https://issues.apache.org/bugzilla/">Bugzilla</a> to be checked
in (and those that are better discarded),</li>
<li>the available test cases and the test cases still to be written,</li>
<li>the available documentation and the docs to be written,</li>
<li>anything else that seems reasonable</li>
</ul>
<p>When you start coding, you will not yet have write access to the
SVN repository. Please submit your patches to <a
href="https://issues.apache.org/">Bugzilla</a> and nag <a
href="mailto:dev@poi.apache.org">the dev list</a> until someone commits
them. Besides the actual checking in of HWPF patches, current POI
committers will also do some minor reviews now and then of your source code
patches, test cases and documentation to help ensure software quality. But
most of the time you will be on your own. However, anyone offering useful
contributions over a period of time will be offered committership!</p>
<p>Please do not forget to write <a
href="https://www.junit.org/">JUnit</a> test cases and documentation!
We won't accept code that doesn't come with test cases. And please
consider that other contributors should be able to understand your source
code easily. If you need any help getting started with JUnit test cases
for HWPF, please ask on the developers' mailing list! If you show that you
are prepared to stick at it you will most likely be given SVN commit
access. See the <a href="site:guidelines">"Contribution to POI" page</a>
for more details and help getting started.</p>
<p>Of course we will help you as best as we can. However, presently there
is no committer who is really familiar with the Word format, so you'll be
mostly on your own. We are looking forward to you and your contributions!
The honor and glory of becoming a POI committer await!</p>
</section>
</body>
</document>
<!-- Keep this comment at the end of the file
Local variables:
mode: xml
sgml-omittag:nil
sgml-shorttag:nil
sgml-namecase-general:nil
sgml-general-insert-case:lower
sgml-minimize-attributes:nil
sgml-always-quote-attributes:t
sgml-indent-step:1
sgml-indent-data:t
sgml-parent-document:nil
sgml-exposed-tags:nil
sgml-local-catalogs:nil
sgml-local-ecat-files:nil
End:
-->