author | Nick Burch <nick@apache.org> | 2010-06-30 17:40:33 +0000 |
---|---|---|
committer | Nick Burch <nick@apache.org> | 2010-06-30 17:40:33 +0000 |
commit | f71a66b0abdb0a4f1b84958fea72b085e91eae0f (patch) | |
tree | 86049c569b5efbeeb80e1824dc4192ab0680104b /src | |
parent | fd922298ef4f40256e7c3803e54c59cd9ddefb80 (diff) | |
Update HWPF documentation to include the newly added word 6/95 text extraction support, as well as mention XWPF + Microsoft spec docs
git-svn-id: https://svn.apache.org/repos/asf/poi/trunk@959384 13f79535-47bb-0310-9956-ffa450edef68
Diffstat (limited to 'src')
-rw-r--r-- | src/documentation/content/xdocs/hwpf/index.xml | 59 | ||||
-rw-r--r-- | src/documentation/content/xdocs/index.xml | 3 | ||||
-rw-r--r-- | src/documentation/content/xdocs/overview.xml | 12 | ||||
-rw-r--r-- | src/documentation/content/xdocs/text-extraction.xml | 16 |
4 files changed, 58 insertions, 32 deletions
diff --git a/src/documentation/content/xdocs/hwpf/index.xml b/src/documentation/content/xdocs/hwpf/index.xml index c7f58122e3..1d14be318b 100644 --- a/src/documentation/content/xdocs/hwpf/index.xml +++ b/src/documentation/content/xdocs/hwpf/index.xml @@ -35,8 +35,12 @@ <section><title>Overview</title> <p>HWPF is the name of our port of the Microsoft Word 97(-2007) file format - to pure Java. It <em>does not</em> support the new Word 2007 .docx - file format, which is not OLE2 based.</p> + to pure Java. It also provides limited read only support for the older + Word 6 and Word 95 file formats.</p> + + <p>The partner to HWPF for the new Word 2007 .docx format is <em>XWPF</em>. + Whilst HWPF and XWPF provide similar features, there is not a common + interface across the two of them at this time.</p> <p>HWPF is still in early development. It is in the <link href="http://svn.apache.org/viewcvs.cgi/poi/trunk/src/scratchpad/"> @@ -54,6 +58,20 @@ </p> <section> + <title>XWPF Patches Required!</title> + + <p>At the moment, XWPF covers many common use cases for reading and writing + .docx files. Whilst this is a great thing, it does mean that XWPF does + everything that the current POI committers need it to do, and so none of + the committers are actively adding new features.</p> + + <p>If you come across a feature in XWPF that you need, and isn't currently + there, please do send in a patch to add the extra functionality! More details + on contributing patches are available on the <link + href="../getinvolved/index.html">"Contribution to POI" page</link>.</p> + </section> + + <section> <title>HWPF Pointman Needed!</title> <p>At the moment we unfortunately do not have someone taking care for HWPF @@ -65,12 +83,12 @@ <p>If <strong>you</strong> are interested in becoming the new HWPF pointman, you should look into the Microsoft Word internals. A good starting point seems to be Ryan Ackley's <link - href="docoverview.html">overview</link>. 
This document contains a link to - a detailled Word format description you can find somewhere at - <link href="http://www.wotsit.org/">http://www.wotsit.org/</link>. Please - do not contact Ryan Ackley directly, because he is working for a company - now that signed a NDA with Microsoft and thus he will be no longer able to - answer questions.</p> + href="docoverview.html">overview</link>. Full details on the word format + is available from + <link href="http://www.microsoft.com/interop/docs/OfficeBinaryFormats.mspx">Microsoft</link>, + but the documentation can be a little hard to get into at first... Try reading the + <link href="docoverview.html">overview</link> first, and looking at the existing + code, then finally look up the documentation for specific missing features.</p> <p>As a first step you should familiarize yourself with the source code, examples, test cases, and the HWPF patches available at <link @@ -88,13 +106,14 @@ </ul> <p>When you start coding, you will not yet have write access to the - CVS repository. Please submit your patches to <link + SVN repository. Please submit your patches to <link href="http://issues.apache.org/">Bugzilla</link> and nag <link - href="mailto:klute@apache.org">Rainer Klute</link> until he commits - them. Besides the actual checking in of HWPF patches Rainer will also do - some minor reviews now and then of your source code patches, test cases - and documentation to help ensure software quality. But most of the time - you will be on your own.</p> + href="mailto:dev@poi.apache.org">the dev list</link> until someone commits + them. Besides the actual checking in of HWPF patches, current POI + committers will also do some minor reviews now and then of your source code + patches, test cases and documentation to help ensure software quality. But + most of the time you will be on your own. 
However, anyone offering useful + contributions over a period of time will be offered committership!</p> <p>Please do not forget to write <link href="http://www.junit.org/">JUnit</link> test cases and documentation! @@ -102,15 +121,9 @@ consider that other contributors should be able to understand your source code easily. If you need any help getting started with JUnit test cases for HWPF, please ask on the developers' mailing list! If you show that you - are prepared to stick at it you will most likely be given CVS commit - access.</p> - - <p><strong>Important:</strong> It is legally vital for POI that you have - never seen any documentation or specification from Microsoft that required - you or your employer to sign an NDA to get it. Please do read the <link - href="../getinvolved/index.html">"Contribution to POI" page</link> for - details! This page also contains further information for you to start POI - development.</p> + are prepared to stick at it you will most likely be given SVN commit + access. See <link href="../getinvolved/index.html">"Contribution to POI" page</link> + for more details and help getting started.</p> <p>Of course we will help you as best as we can. However, presently there is no committer who is really familiar with the Word format, so you'll be diff --git a/src/documentation/content/xdocs/index.xml b/src/documentation/content/xdocs/index.xml index d2cc1d068c..8af82859c7 100644 --- a/src/documentation/content/xdocs/index.xml +++ b/src/documentation/content/xdocs/index.xml @@ -86,7 +86,8 @@ provide this functionality. 
Examples include: <link href="http://xml.apache.org/cocoon">Cocoon</link> for which there are serializers for HSSF; <link href="http://www.openoffice.org">Open Office.org</link> with whom we collaborate in documenting the - XLS format; and <link href="http://lucene.apache.org/">Lucene</link> + XLS format; and <link href="http://tika.apache.org/">Tika</link> / + <link href="http://lucene.apache.org/">Lucene</link>, for which we provide format interpretors. When practical, we donate components directly to those projects for POI-enabling them. </p> diff --git a/src/documentation/content/xdocs/overview.xml b/src/documentation/content/xdocs/overview.xml index 35ae35f380..34c9d15ea4 100644 --- a/src/documentation/content/xdocs/overview.xml +++ b/src/documentation/content/xdocs/overview.xml @@ -50,14 +50,16 @@ <section><title>HWPF and XWPF for Word Documents</title> <p> HWPF is our port of the Microsoft Word 97 (-2003) file format to pure - Java. It supports read, and limited write capabilities. Please see <link - href="./hwpf/index.html">the HWPF project page for more + Java. It supports read, and limited write capabilities. It also provides + simple text extraction support for the older Word 6 and Word 95 formats. + Please see <link href="./hwpf/index.html">the HWPF project page for more information</link>. This component remains in early stages of development. It can already read and write simple files. </p> <p> We are also working on the XWPF for the WordprocessingML (2007+) format from the - OOXML specification. + OOXML specification. This provides read and write support for simpler + files, along with text extraction capabilities. </p> </section> <section><title>HSLF and XSLF for PowerPoint Documents</title> @@ -108,8 +110,8 @@ <section><title>HSMF for Outlook Messages</title> <p> HSMF is our port of the Microsoft Outlook message file format to pure - Java. It currently only some of the textual content of MSG files. 
- Further support and documentation is expected over the comming weeks and months. + Java. It currently only some of the textual content of MSG files, and + some attachments. Further support and documentation is coming in slowly. For now, users are advised to consult the unit tests for example use. Please see <link href="./hsmf/index.html">the HPBF project page for more information</link>. diff --git a/src/documentation/content/xdocs/text-extraction.xml b/src/documentation/content/xdocs/text-extraction.xml index 61bc5c4643..0357f34fa9 100644 --- a/src/documentation/content/xdocs/text-extraction.xml +++ b/src/documentation/content/xdocs/text-extraction.xml @@ -81,11 +81,15 @@ </section> <section><title>Word</title> - <p>For .doc files, in scratchpad there is + <p>For .doc files from Word 97 - Word 2003, in scratchpad there is <em>org.apache.poi.hwpf.extractor.WordExtractor</em>, which will - return text for your document. Those using POI 3.5 can also use + return text for your document.</p> + <p>Those using POI 3.7 can also extract simple textual content from + older Word 6 and Word 95 files, using the scratchpad class + <em>org.apache.poi.hwpf.extractor.Word6Extractor</em>.</p> + <p>Since POI 3.5, it is possible to use <em>org.apache.poi.xwpf.extractor.XPFFWordExtractor</em>, to perform - a similar task for .docx files.</p> + text extraction for .docx files.</p> </section> <section><title>PowerPoint</title> @@ -97,6 +101,12 @@ perform a similar task for .pptx files.</p> </section> + <section><title>Publisher</title> + <p>For .pub files, in scratchpad there is + <em>org.apache.poi.hpbf.extractor.PublisherExtractor</em>, which + will return text for your file.</p> + </section> + <section><title>Visio</title> <p>For .vsd files, in scratchpad there is <em>org.apache.poi.hdgf.extractor.VisioTextExtractor</em>, which |