Apache POI - POIFS - Documents embedded in other documents

diff --git a/src/documentation/content/xdocs/poifs/book.xml b/src/documentation/content/xdocs/poifs/book.xml index c96a71eeef..d2fc4a3c78 100644 --- a/src/documentation/content/xdocs/poifs/book.xml +++ b/src/documentation/content/xdocs/poifs/book.xml @@ -30,6 +30,7 @@

diff --git a/src/documentation/content/xdocs/poifs/embeded.xml b/src/documentation/content/xdocs/poifs/embeded.xml new file mode 100644 index 0000000000..d888e2ed53 --- /dev/null +++ b/src/documentation/content/xdocs/poifs/embeded.xml @@ -0,0 +1,95 @@ + + + + +

+ Apache POI - POIFS - Documents embedded in other documents + Overview + + + + +

+ +

Overview +

It is possible for one OLE 2 based document to have other + OLE 2 documents embedded in it. For example, an Excel file + may have a Word document and a PowerPoint slideshow + embedded as part of it.

Normally, these other documents are stored in subdirectories + of the OLE 2 (POIFS) filesystem. The exact location of the + embedded documents will vary depending on the type of the + master document, and the exact directory names will differ + each time. To figure out exactly which directory to look + in, you will either need to process the appropriate OLE 2 + linking entry in the master document, or simply iterate + over all the directories in the filesystem.

As a general rule, you will find the same OLE 2 entries + in the subdirectories as you would have found at the root + of the filesystem had the document not been embedded.

+ +

Files embedded in Excel +

Excel normally stores embedded files in subdirectories + of the filesystem root. Typically these subdirectories + are named starting with MBD, with 8 hex characters following.
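
The naming convention above can be turned into a quick filter when iterating over the directories at the filesystem root. The following is only a sketch: `ExcelEmbeddedDirNames` and `looksLikeEmbeddedDir` are hypothetical names, not part of POI, and accepting both upper and lower case hex digits is an assumption. Since the text says the names are only "typically" of this form, treat a match as a hint, not a guarantee.

```java
import java.util.regex.Pattern;

/** Hypothetical helper (not part of POI) for spotting Excel-style embedded directories. */
final class ExcelEmbeddedDirNames {
    // "MBD" followed by exactly 8 hex characters; allowing either case is an assumption.
    private static final Pattern MBD_DIR = Pattern.compile("MBD[0-9A-Fa-f]{8}");

    private ExcelEmbeddedDirNames() {}

    /** Returns true if the directory name follows the typical MBDxxxxxxxx pattern. */
    static boolean looksLikeEmbeddedDir(String name) {
        return name != null && MBD_DIR.matcher(name).matches();
    }

    public static void main(String[] args) {
        // Made-up directory names, purely for illustration.
        for (String name : new String[] {"MBD0012ABCD", "Workbook", "MBD12"}) {
            System.out.println(name + " -> " + looksLikeEmbeddedDir(name));
        }
    }
}
```

A directory that matches would then be the one to open as an embedded document; anything else at the root is more likely part of the host workbook itself.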

+ +

Files embedded in Word +

Word normally stores embedded files in subdirectories + of the ObjectPool directory, itself a subdirectory of the + filesystem root. Typically these subdirectories are named + starting with an underscore, followed by 10 digits.
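
As a rough illustration of the Word layout, the check below takes the name of a directory and the name of its parent and tests them against the convention just described. `WordEmbeddedDirNames` and its method are hypothetical names, not part of POI, and the pattern is a heuristic based on the "typically" wording above.

```java
import java.util.regex.Pattern;

/** Hypothetical helper (not part of POI) for spotting Word-style embedded directories. */
final class WordEmbeddedDirNames {
    // Name of the directory under the filesystem root that holds the embedded objects.
    static final String OBJECT_POOL = "ObjectPool";

    // An underscore followed by exactly 10 decimal digits.
    private static final Pattern POOL_ENTRY = Pattern.compile("_[0-9]{10}");

    private WordEmbeddedDirNames() {}

    /** Returns true if 'name' looks like an embedded document directory inside ObjectPool. */
    static boolean looksLikeEmbeddedDir(String parentName, String name) {
        return OBJECT_POOL.equals(parentName)
                && name != null
                && POOL_ENTRY.matcher(name).matches();
    }

    public static void main(String[] args) {
        // Made-up names, purely for illustration.
        System.out.println(looksLikeEmbeddedDir("ObjectPool", "_1234567890"));
        System.out.println(looksLikeEmbeddedDir("ObjectPool", "WordDocument"));
    }
}
```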

+ +

Files embedded in PowerPoint +

PowerPoint does not normally store embedded files + in the OLE2 layer. Instead, they are held within records + of the main PowerPoint file. To get at them, you need to + find the appropriate data within the PowerPoint stream, + and work from that.

+ +

Listing POIFS contents +

POIFS provides a simple tool for listing the contents of + OLE2 files. This allows you to see what your POIFS file + contains, and hence whether it has any embedded documents in it, + and where.

The tool to use is org.apache.poi.poifs.dev.POIFSLister. + This tool may be run from the command line, and takes a filename + as its parameter. It will print out all the directories and + files contained within the POIFS file.

+ +

Opening embedded files +

All of the POIDocument classes (HSSFWorkbook, HSLFSlideShow, + HWPFDocument and HDGFDiagram) can either be opened from + a POIFSFileSystem, or from a specific directory within a + POIFSFileSystem. So, to open embedded files, simply locate the + appropriate DirectoryNode that represents the subdirectory + of interest, and pass this + the overall POIFSFileSystem to + the constructor.

If you want to extract the textual contents of the embedded file, + then open the appropriate POIDocument, and then pass this to + the extractor class, instead of simply passing the POIFSFileSystem + to the extractor.

+ + diff --git a/src/documentation/content/xdocs/text-extraction.xml b/src/documentation/content/xdocs/text-extraction.xml new file mode 100644 index 0000000000..397aa1b39d --- /dev/null +++ b/src/documentation/content/xdocs/text-extraction.xml @@ -0,0 +1,106 @@ + + + + + +

+ POI Text Extraction + + + +

+ + +

Overview +

POI provides text extraction for all the supported file + formats. In addition, it provides access to the metadata + associated with a given file, such as title and author.

In addition to providing direct text extraction classes, + POI works closely with the + Apache Tika + text extraction library. Users may wish to simply utilise + the functionality provided by Tika.

+ +

Common functionality +

All of the POI text extractors extend from + org.apache.poi.POITextExtractor. This provides a common + method across all extractors, getText(). For many cases, the text + returned will be all you need. However, many extractors do provide + more targeted text extraction methods, so you may wish to use + these in some cases.

All POIFS / OLE 2 based text extractors also extend from + org.apache.poi.POIOLE2TextExtractor. This additionally + provides common methods to get at the HPSF + document metadata.

All OOXML based text extractors (available in POI 3.5 and later) + also extend from + org.apache.poi.POIOOXMLTextExtractor. This additionally + provides common methods to get at the OOXML metadata.

+ +

Text Extractor Factory - POI 3.5 or later +

A new class in POI 3.5, + org.apache.poi.extractor.ExtractorFactory, provides a + similar function to WorkbookFactory. You simply pass it an + InputStream, a file, a POIFSFileSystem or an OOXML Package. It + figures out the correct text extractor for you, and returns it.
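
To give a feel for what "figuring out" the right extractor involves, the sketch below distinguishes the two container types by their leading bytes: OLE 2 files begin with the signature D0 CF 11 E0 A1 B1 1A E1, while OOXML files are ZIP archives and begin with "PK" 03 04. This is not taken from ExtractorFactory, and no claim is made that it works this way internally; `ContainerSniffer` is a hypothetical name. A ZIP signature alone only makes a file an OOXML candidate, since any ZIP archive starts the same way.

```java
/** Hypothetical sketch (not POI code): classify a file by its first few bytes. */
final class ContainerSniffer {
    enum Container { OLE2, ZIP_OOXML_CANDIDATE, UNKNOWN }

    // Signature at the start of every OLE 2 compound document.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // ZIP local file header signature "PK\3\4"; OOXML packages are ZIP archives.
    private static final byte[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};

    private ContainerSniffer() {}

    /** Inspects the leading bytes of a file and reports which container they indicate. */
    static Container sniff(byte[] header) {
        if (startsWith(header, OLE2_MAGIC)) {
            return Container.OLE2;
        }
        if (startsWith(header, ZIP_MAGIC)) {
            return Container.ZIP_OOXML_CANDIDATE;
        }
        return Container.UNKNOWN;
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data == null || data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

In practice you would read the first eight bytes of the file and pass them to `sniff`; telling .xls from .doc, or .xlsx from .docx, needs a further look inside the container.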

+ +

Excel +

For .xls files, there is + org.apache.poi.hssf.extractor.ExcelExtractor, which will + return text, optionally with formulas instead of their results. + Those using POI 3.5 can also use + org.apache.poi.xssf.extractor.XSSFExcelExtractor, to perform + a similar task for .xlsx files.

+ +

Word +

For .doc files, in scratchpad there is + org.apache.poi.hwpf.extractor.WordExtractor, which will + return text for your document. Those using POI 3.5 can also use + org.apache.poi.xwpf.extractor.XWPFWordExtractor, to perform + a similar task for .docx files.

+ +

PowerPoint +

For .ppt files, in scratchpad there is + org.apache.poi.hslf.extractor.PowerPointExtractor, which + will return text for your slideshow, optionally restricted to just + slides text or notes text. Those using POI 3.5 can also use + org.apache.poi.xslf.extractor.XSLFPowerPointExtractor, to + perform a similar task for .pptx files.

+ +

Visio +

For .vsd files, in scratchpad there is + org.apache.poi.hdgf.extractor.VisioTextExtractor, which + will return text for your file.
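
The per-format sections above amount to a lookup from file extension to extractor class. The sketch below records that lookup as plain strings so it runs without POI on the classpath; `ExtractorNames` and `forFileName` are hypothetical names, and the .docx entry uses XWPFWordExtractor, the class name found in POI's xwpf package. Choosing by extension is only a convenience and says nothing about what the file actually contains.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical lookup (not part of POI): file extension to extractor class name. */
final class ExtractorNames {
    private static final Map<String, String> BY_EXTENSION = new HashMap<>();

    static {
        BY_EXTENSION.put("xls", "org.apache.poi.hssf.extractor.ExcelExtractor");
        BY_EXTENSION.put("xlsx", "org.apache.poi.xssf.extractor.XSSFExcelExtractor");
        BY_EXTENSION.put("doc", "org.apache.poi.hwpf.extractor.WordExtractor");
        BY_EXTENSION.put("docx", "org.apache.poi.xwpf.extractor.XWPFWordExtractor");
        BY_EXTENSION.put("ppt", "org.apache.poi.hslf.extractor.PowerPointExtractor");
        BY_EXTENSION.put("pptx", "org.apache.poi.xslf.extractor.XSLFPowerPointExtractor");
        BY_EXTENSION.put("vsd", "org.apache.poi.hdgf.extractor.VisioTextExtractor");
    }

    private ExtractorNames() {}

    /** Returns the extractor class name for the file's extension, or null if none is listed. */
    static String forFileName(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0) {
            return null;
        }
        String extension = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return BY_EXTENSION.get(extension);
    }
}
```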

+ + + +