Basic Text Extraction

For basic text extraction, make use of org.apache.poi.hslf.extractor.PowerPointExtractor. It accepts a file or an input stream. The getText() method can be used to get the text from the slides, and the getNotes() method can be used to get the text from the notes. Finally, getText(true,true) will get the text from both.

Specific Text Extraction

To get specific bits of text, first create an org.apache.poi.hslf.usermodel.SlideShow (from an org.apache.poi.hslf.HSLFSlideShow, which accepts a file or an input stream). Use getSlides() and getNotes() to get the slides and notes. These can be queried to get their page ID (though they should be returned in the right order).

... about getting duplicate blocks of text, you don't care about getting text from master sheets, and you don't care about getting old text, then org.apache.poi.hslf.extractor.QuickButCruddyTextExtractor might be of use.

QuickButCruddyTextExtractor doesn't use the normal record parsing code; instead it uses a tree structure blind search ...
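As a rough illustration of what a blind search over a record tree can look like, here is a minimal sketch. It is not the extractor's actual code. It assumes details that this text does not state: an 8-byte little-endian record header (option bytes, type, length), a low nibble of 0x0F marking a container, and type ids 4000 and 4008 for the two text atoms. The class and method names (BlindTextScan, findText, record, concat) are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Sketch of a blind walk over PowerPoint-style records (hypothetical names). */
final class BlindTextScan {
    // Assumed record type ids for the two text atoms.
    static final int TEXT_CHARS_ATOM = 4000; // two bytes per character (UTF-16LE)
    static final int TEXT_BYTES_ATOM = 4008; // one byte per character

    /** Walks every record header in data and collects the text of each text atom. */
    static List<String> findText(byte[] data) {
        List<String> found = new ArrayList<>();
        int pos = 0;
        while (pos + 8 <= data.length) {
            int version = data[pos] & 0x0F;
            int type = readUShort(data, pos + 2);
            long len = readUInt(data, pos + 4);
            if (version == 0x0F) {
                // Container: skip only its header so the children are visited next.
                pos += 8;
                continue;
            }
            long end = pos + 8 + len;
            if (end > data.length) {
                break; // length runs past the buffer, so stop scanning
            }
            if (type == TEXT_CHARS_ATOM) {
                found.add(new String(data, pos + 8, (int) len, StandardCharsets.UTF_16LE));
            } else if (type == TEXT_BYTES_ATOM) {
                found.add(new String(data, pos + 8, (int) len, StandardCharsets.ISO_8859_1));
            }
            pos = (int) end; // any other atom is skipped without being parsed
        }
        return found;
    }

    static int readUShort(byte[] d, int off) {
        return (d[off] & 0xFF) | ((d[off + 1] & 0xFF) << 8);
    }

    static long readUInt(byte[] d, int off) {
        return (d[off] & 0xFFL) | ((d[off + 1] & 0xFFL) << 8)
                | ((d[off + 2] & 0xFFL) << 16) | ((d[off + 3] & 0xFFL) << 24);
    }

    /** Builds one synthetic record; used only to produce demo input. */
    static byte[] record(int version, int type, byte[] body) {
        byte[] out = new byte[8 + body.length];
        out[0] = (byte) (version & 0x0F);
        out[2] = (byte) type;
        out[3] = (byte) (type >>> 8);
        out[4] = (byte) body.length;
        out[5] = (byte) (body.length >>> 8);
        out[6] = (byte) (body.length >>> 16);
        out[7] = (byte) (body.length >>> 24);
        System.arraycopy(body, 0, out, 8, body.length);
        return out;
    }

    static byte[] concat(byte[] a, byte[] b) {
        byte[] out = new byte[a.length + b.length];
        System.arraycopy(a, 0, out, 0, a.length);
        System.arraycopy(b, 0, out, a.length, b.length);
        return out;
    }

    public static void main(String[] args) {
        byte[] chars = record(0, TEXT_CHARS_ATOM, "Title".getBytes(StandardCharsets.UTF_16LE));
        byte[] bytes = record(0, TEXT_BYTES_ATOM, "old text".getBytes(StandardCharsets.ISO_8859_1));
        // 1000 is an arbitrary container type id chosen for the demo.
        byte[] container = record(0x0F, 1000, concat(chars, bytes));
        System.out.println(findText(container)); // prints [Title, old text]
    }
}
```

Because a scan like this never consults the slide model, it reports every text atom it meets, whether or not the record is still referenced. That would be consistent with such an extractor being fast while also returning duplicate or old text.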

org.apache.poi.hslf.extractor.PowerPointExtractor: uses the model code to allow extraction of text from files.

org.apache.poi.hslf.extractor.QuickButCruddyTextExtractor: uses the record code to extract all the text from files very fast, but including deleted text (and other bits of Crud).
