From: Nick Burch
Date: Wed, 30 Jun 2010 17:40:33 +0000 (+0000)
Subject: Update HWPF documentation to include the newly added word 6/95 text extraction suppor...
X-Git-Tag: REL_3_7_BETA2~43
X-Git-Url: https://source.dussan.org/?a=commitdiff_plain;h=f71a66b0abdb0a4f1b84958fea72b085e91eae0f;p=poi.git

Update HWPF documentation to include the newly added word 6/95 text extraction support, as well as mention XWPF + Microsoft spec docs

git-svn-id: https://svn.apache.org/repos/asf/poi/trunk@959384 13f79535-47bb-0310-9956-ffa450edef68
---

diff --git a/src/documentation/content/xdocs/hwpf/index.xml b/src/documentation/content/xdocs/hwpf/index.xml
index c7f58122e3..1d14be318b 100644
--- a/src/documentation/content/xdocs/hwpf/index.xml
+++ b/src/documentation/content/xdocs/hwpf/index.xml
@@ -35,8 +35,12 @@
Overview

HWPF is the name of our port of the Microsoft Word 97(-2007) file format
- to pure Java. It does not support the new Word 2007 .docx
- file format, which is not OLE2 based.
+ to pure Java. It also provides limited read only support for the older
+ Word 6 and Word 95 file formats.
+
+ The partner to HWPF for the new Word 2007 .docx format is XWPF.
+ Whilst HWPF and XWPF provide similar features, there is not a common
+ interface across the two of them at this time.

HWPF is still in early development. It is in the
@@ -53,6 +57,20 @@
code.

+
+ XWPF Patches Required!
+
+ At the moment, XWPF covers many common use cases for reading and writing
+ .docx files. Whilst this is a great thing, it does mean that XWPF does
+ everything that the current POI committers need it to do, and so none of
+ the committers are actively adding new features.
+
+ If you come across a feature in XWPF that you need, and isn't currently
+ there, please do send in a patch to add the extra functionality! More details
+ on contributing patches are available on the "Contribution to POI" page.
+
+
HWPF Pointman Needed!
@@ -65,12 +83,12 @@

If you are interested in becoming the new HWPF pointman, you should look into the Microsoft Word internals. A good starting point seems to be Ryan Ackley's overview. This document contains a link to - a detailled Word format description you can find somewhere at - http://www.wotsit.org/. Please - do not contact Ryan Ackley directly, because he is working for a company - now that signed a NDA with Microsoft and thus he will be no longer able to - answer questions.

+ href="docoverview.html">overview. Full details on the word format + is available from + Microsoft, + but the documentation can be a little hard to get into at first... Try reading the + overview first, and looking at the existing + code, then finally look up the documentation for specific missing features.

As a first step you should familiarize yourself with the source code, examples, test cases, and the HWPF patches available at

When you start coding, you will not yet have write access to the - CVS repository. Please submit your patches to Bugzilla and nag Rainer Klute until he commits - them. Besides the actual checking in of HWPF patches Rainer will also do - some minor reviews now and then of your source code patches, test cases - and documentation to help ensure software quality. But most of the time - you will be on your own.

+ href="mailto:dev@poi.apache.org">the dev list until someone commits + them. Besides the actual checking in of HWPF patches, current POI + committers will also do some minor reviews now and then of your source code + patches, test cases and documentation to help ensure software quality. But + most of the time you will be on your own. However, anyone offering useful + contributions over a period of time will be offered committership!

Please do not forget to write JUnit test cases and documentation! @@ -102,15 +121,9 @@ consider that other contributors should be able to understand your source code easily. If you need any help getting started with JUnit test cases for HWPF, please ask on the developers' mailing list! If you show that you - are prepared to stick at it you will most likely be given CVS commit - access.

- -

Important: It is legally vital for POI that you have - never seen any documentation or specification from Microsoft that required - you or your employer to sign an NDA to get it. Please do read the "Contribution to POI" page for - details! This page also contains further information for you to start POI - development.

+ are prepared to stick at it you will most likely be given SVN commit + access. See "Contribution to POI" page + for more details and help getting started.

Of course we will help you as best as we can. However, presently there is no committer who is really familiar with the Word format, so you'll be
diff --git a/src/documentation/content/xdocs/index.xml b/src/documentation/content/xdocs/index.xml
index d2cc1d068c..8af82859c7 100644
--- a/src/documentation/content/xdocs/index.xml
+++ b/src/documentation/content/xdocs/index.xml
@@ -86,7 +86,8 @@
provide this functionality. Examples include: Cocoon for which there are serializers for HSSF; Open Office.org with whom we collaborate in documenting the
- XLS format; and Lucene
+ XLS format; and Tika /
+ Lucene, for which we provide format interpretors. When practical, we donate components directly to those projects for POI-enabling them.

diff --git a/src/documentation/content/xdocs/overview.xml b/src/documentation/content/xdocs/overview.xml
index 35ae35f380..34c9d15ea4 100644
--- a/src/documentation/content/xdocs/overview.xml
+++ b/src/documentation/content/xdocs/overview.xml
@@ -50,14 +50,16 @@
HWPF and XWPF for Word Documents

HWPF is our port of the Microsoft Word 97 (-2003) file format to pure
- Java. It supports read, and limited write capabilities. Please see the HWPF project page for more
+ Java. It supports read, and limited write capabilities. It also provides
+ simple text extraction support for the older Word 6 and Word 95 formats.
+ Please see the HWPF project page for more
information. This component remains in early stages of development. It can already read and write simple files.

We are also working on the XWPF for the WordprocessingML (2007+) format from the
- OOXML specification.
+ OOXML specification. This provides read and write support for simpler
+ files, along with text extraction capabilities.

HSLF and XSLF for PowerPoint Documents
@@ -108,8 +110,8 @@
HSMF for Outlook Messages

HSMF is our port of the Microsoft Outlook message file format to pure
- Java. It currently only some of the textual content of MSG files.
- Further support and documentation is expected over the comming weeks and months.
+ Java. It currently only some of the textual content of MSG files, and
+ some attachments. Further support and documentation is coming in slowly.
For now, users are advised to consult the unit tests for example use. Please see the HPBF project page for more information.
diff --git a/src/documentation/content/xdocs/text-extraction.xml b/src/documentation/content/xdocs/text-extraction.xml
index 61bc5c4643..0357f34fa9 100644
--- a/src/documentation/content/xdocs/text-extraction.xml
+++ b/src/documentation/content/xdocs/text-extraction.xml
@@ -81,11 +81,15 @@

Word
- For .doc files, in scratchpad there is
+ For .doc files from Word 97 - Word 2003, in scratchpad there is
org.apache.poi.hwpf.extractor.WordExtractor, which will
- return text for your document. Those using POI 3.5 can also use
+ return text for your document.
+
+ Those using POI 3.7 can also extract simple textual content from
+ older Word 6 and Word 95 files, using the scratchpad class
+ org.apache.poi.hwpf.extractor.Word6Extractor.
+
+ Since POI 3.5, it is possible to use
org.apache.poi.xwpf.extractor.XPFFWordExtractor, to perform
- a similar task for .docx files.
+ text extraction for .docx files.

PowerPoint
@@ -97,6 +101,12 @@
perform a similar task for .pptx files.

+ Publisher
+ For .pub files, in scratchpad there is
+ org.apache.poi.hpbf.extractor.PublisherExtractor, which
+ will return text for your file.

+
+
Visio

For .vsd files, in scratchpad there is org.apache.poi.hdgf.extractor.VisioTextExtractor, which