For basic text extraction, use org.apache.poi.sl.extractor.SlideShowExtractor. It accepts a slideshow, which can be created from a file or stream via org.apache.poi.sl.usermodel.SlideShowFactory. Call getText() to get the text from the slides.
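
The extraction steps described above might look roughly like the following. This is a minimal sketch, assuming the poi and poi-scratchpad jars are on the classpath; the class name BasicTextExtraction and the helper textOf are illustrative names, not part of POI, and the input path is taken from the command line.

```java
import java.io.File;
import java.io.IOException;

import org.apache.poi.sl.extractor.SlideShowExtractor;
import org.apache.poi.sl.usermodel.Shape;
import org.apache.poi.sl.usermodel.SlideShow;
import org.apache.poi.sl.usermodel.SlideShowFactory;
import org.apache.poi.sl.usermodel.TextParagraph;
import org.apache.poi.sl.usermodel.TextRun;

public class BasicTextExtraction {

    // Hypothetical helper: wraps a slideshow in a SlideShowExtractor
    // and returns whatever getText() reports for it.
    public static <S extends Shape<S, P>, P extends TextParagraph<S, P, ? extends TextRun>>
            String textOf(SlideShow<S, P> show) {
        SlideShowExtractor<S, P> extractor = new SlideShowExtractor<>(show);
        return extractor.getText();
    }

    public static void main(String[] args) throws IOException {
        // args[0] is assumed to be the path of a presentation file.
        try (SlideShow<?, ?> show = SlideShowFactory.create(new File(args[0]))) {
            System.out.println(textOf(show));
        }
    }
}
```

The generic helper is only there so that the same code accepts whatever slideshow type SlideShowFactory hands back; an HSLFSlideShow can be passed to it directly.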
To get specific bits of text, first create an org.apache.poi.hslf.usermodel.HSLFSlideShow (from an org.apache.poi.hslf.usermodel.HSLFSlideShowImpl, which accepts a file or an input stream). Use getSlides() and getNotes() to get the slides and notes. These can be queried for their page ID, though they should already be returned in the right order.
You can then call getTextParagraphs() on these to get their blocks of text. (A list of HSLFTextParagraphs normally holds all the text in a given area of the page, e.g. in the title bar or in a box.)
From the HSLFTextParagraph, you can extract the text and check what type of text it is (e.g. Body, Title). You can also call getTextRuns(), which returns the HSLFTextRuns that make up the HSLFTextParagraph. An HSLFTextRun is a text fragment whose characters all share the same formatting. The paragraph formatting is defined in the parent HSLFTextParagraph.
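
To make the slide, paragraph and run hierarchy concrete, here is a hedged sketch that walks it and collects one line per text block and per run. The class SlideTextWalker and the method describe are made-up names for illustration; the static helper HSLFTextParagraph.getText(List) and the accessors getSlideNumber(), getRunType() and getRawText() are used as I understand them from the HSLF usermodel, so check them against the POI version you build with.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

import org.apache.poi.hslf.usermodel.HSLFSlide;
import org.apache.poi.hslf.usermodel.HSLFSlideShow;
import org.apache.poi.hslf.usermodel.HSLFSlideShowImpl;
import org.apache.poi.hslf.usermodel.HSLFTextParagraph;
import org.apache.poi.hslf.usermodel.HSLFTextRun;

public class SlideTextWalker {

    // Hypothetical helper: returns one description line per text block and per run.
    public static List<String> describe(InputStream in) throws IOException {
        List<String> lines = new ArrayList<>();
        try (HSLFSlideShow ppt = new HSLFSlideShow(new HSLFSlideShowImpl(in))) {
            for (HSLFSlide slide : ppt.getSlides()) {
                // Each inner list is one block of text (e.g. a title or a text box).
                for (List<HSLFTextParagraph> block : slide.getTextParagraphs()) {
                    lines.add("slide " + slide.getSlideNumber() + ": "
                            + HSLFTextParagraph.getText(block));
                    for (HSLFTextParagraph paragraph : block) {
                        // getRunType() reports the numeric text type (body, title, ...).
                        int type = paragraph.getRunType();
                        for (HSLFTextRun run : paragraph.getTextRuns()) {
                            lines.add("  type " + type + " run: " + run.getRawText());
                        }
                    }
                }
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // args[0] is assumed to be the path of a .ppt file.
        try (InputStream in = new FileInputStream(args[0])) {
            describe(in).forEach(System.out::println);
        }
    }
}
```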
If speed is the most important thing for you, and you don't mind getting duplicate blocks of text, text from master sheets, and old text, then org.apache.poi.hslf.extractor.QuickButCruddyTextExtractor might be of use.
QuickButCruddyTextExtractor doesn't use the normal record parsing code; instead, it does a blind search of the tree structure to find all text-holding records. You will get all the text, including lots of text you normally wouldn't want. However, you will get it back very fast.
There are two ways of getting the text back. getTextAsString() returns a single string with all the text in it. getTextAsVector() returns a vector of strings, one for each text record found in the file.
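
Both retrieval styles can be sketched as follows. The class QuickTextDump and its two methods are hypothetical names. The sketch assumes QuickButCruddyTextExtractor has a constructor taking an InputStream and a close() method, and it copies the result of getTextAsVector() into a List so that it does not depend on the exact collection type returned.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

import org.apache.poi.hslf.extractor.QuickButCruddyTextExtractor;

public class QuickTextDump {

    // Hypothetical helper: all text as one string.
    public static String allText(InputStream in) throws IOException {
        QuickButCruddyTextExtractor extractor = new QuickButCruddyTextExtractor(in);
        try {
            return extractor.getTextAsString();
        } finally {
            extractor.close();
        }
    }

    // Hypothetical helper: one entry per text record found in the file.
    public static List<String> textRecords(InputStream in) throws IOException {
        QuickButCruddyTextExtractor extractor = new QuickButCruddyTextExtractor(in);
        try {
            return new ArrayList<>(extractor.getTextAsVector());
        } finally {
            extractor.close();
        }
    }

    public static void main(String[] args) throws IOException {
        // args[0] is assumed to be the path of a .ppt file.
        try (InputStream in = new FileInputStream(args[0])) {
            for (String record : textRecords(in)) {
                System.out.println(record);
            }
        }
    }
}
```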
It is possible to change the text via HSLFTextParagraph.setText(List<HSLFTextParagraph>,String) or HSLFTextRun.setText(String). You can add further text runs with HSLFTextParagraph.appendText(List<HSLFTextParagraph>,String,boolean) or HSLFTextParagraph.addTextRun(HSLFTextRun).
When calling HSLFTextParagraph.setText(List<HSLFTextParagraph>,String), all the text will end up with the same formatting. When calling HSLFTextRun.setText(String), the text will retain the old formatting of that HSLFTextRun.
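
The three editing calls can be tried on an in-memory slideshow, as in the sketch below. TextEditing and its helper methods are illustrative names only. The boolean passed to appendText is assumed here to mean "start a new paragraph", and createTextBox() is assumed to return a text box that is already attached to the slide.

```java
import java.io.IOException;
import java.util.List;

import org.apache.poi.hslf.usermodel.HSLFSlideShow;
import org.apache.poi.hslf.usermodel.HSLFTextBox;
import org.apache.poi.hslf.usermodel.HSLFTextParagraph;
import org.apache.poi.hslf.usermodel.HSLFTextRun;

public class TextEditing {

    // Replaces the whole block; all text ends up with the same formatting.
    public static void replaceBlock(List<HSLFTextParagraph> block, String text) {
        HSLFTextParagraph.setText(block, text);
    }

    // Rewrites only the first run; that run keeps its old formatting.
    public static void rewriteFirstRun(List<HSLFTextParagraph> block, String text) {
        HSLFTextRun run = block.get(0).getTextRuns().get(0);
        run.setText(text);
    }

    // Appends text to the block; 'true' is assumed to request a new paragraph.
    public static void appendParagraph(List<HSLFTextParagraph> block, String text) {
        HSLFTextParagraph.appendText(block, text, true);
    }

    public static void main(String[] args) throws IOException {
        try (HSLFSlideShow ppt = new HSLFSlideShow()) {
            HSLFTextBox box = ppt.createSlide().createTextBox();
            box.setText("Draft title");
            List<HSLFTextParagraph> block = box.getTextParagraphs();
            rewriteFirstRun(block, "Final title");
            appendParagraph(block, "Subtitle");
            System.out.println(HSLFTextParagraph.getText(block));
        }
    }
}
```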
You may add new slides by calling HSLFSlideShow.createSlide(), which adds a new slide to the end of the slideshow. It is possible to re-order slides with HSLFSlideShow.reorderSlide(...).
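
A short sketch of adding and re-ordering slides follows. SlideOrdering, buildDeck and reorder are hypothetical names. The sketch assumes reorderSlide takes the old and the new slide number as two 1-based ints, which is how I understand the method; verify the signature and the resulting order against your POI version.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;

import org.apache.poi.hslf.usermodel.HSLFSlideShow;

public class SlideOrdering {

    // Creates an empty slideshow and appends 'count' slides to the end of it.
    public static HSLFSlideShow buildDeck(int count) {
        HSLFSlideShow ppt = new HSLFSlideShow();
        for (int i = 0; i < count; i++) {
            ppt.createSlide();
        }
        return ppt;
    }

    // Assumption: both arguments are 1-based slide numbers.
    public static void reorder(HSLFSlideShow ppt, int oldNumber, int newNumber) {
        ppt.reorderSlide(oldNumber, newNumber);
    }

    public static void main(String[] args) throws IOException {
        // args[0] is assumed to be the path of the .ppt file to write.
        try (HSLFSlideShow ppt = buildDeck(3);
             OutputStream out = new FileOutputStream(args[0])) {
            reorder(ppt, 3, 1);
            ppt.write(out);
        }
    }
}
```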
org.apache.poi.hslf.usermodel.HSLFSlideShowImpl
Handles reading in and writing out files. Calls org.apache.poi.hslf.record.Record to build a tree of all the records in the file, which it allows access to.
org.apache.poi.hslf.record.Record
Base class of all records. Also provides the main record generation
code, which will build up a tree of records for a file.
org.apache.poi.hslf.usermodel.HSLFSlideShow
Builds up model entries from the records, and presents a user-facing view of the file.
org.apache.poi.hslf.usermodel.HSLFSlide
A user facing view of a Slide in a slideshow. Allows you to get at the
Text of the slide, and at any drawing objects on it.
org.apache.poi.hslf.usermodel.HSLFTextParagraph
A list of HSLFTextParagraphs holds all the text in a given area of the Slide, and will contain one or more HSLFTextRuns.
org.apache.poi.hslf.usermodel.HSLFTextRun
Holds a run of text, all having the same character stylings. It is possible to modify text, and/or text stylings.
org.apache.poi.sl.extractor.SlideShowExtractor
Uses the model code to allow extraction of text from files.
org.apache.poi.hslf.extractor.QuickButCruddyTextExtractor
Uses the record code to extract all the text from files very fast,
but including deleted text (and other bits of Crud).