I'm doing topic modelling on a PDF e-book and need to extract the text paragraph by paragraph. For this I use Apache PDFBox, which extracts text from PDFs efficiently.
PDFParser parser;
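PDFTextStripper pdfStrip = new PDFTextStripper(); // must be instantiated; calling getText on null throws a NullPointerException
String parsedText = pdfStrip.getText(pdDoc); // pdDoc is the PDDocument loaded earlier (loading code omitted)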
However, I cannot extract the paragraphs separately. PDFBox provides a way to set paragraph start/end markers, but for that to help I need to know what identifies a paragraph break in the first place.
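To make clear what I am after, here is a rough sketch of how I would use such a marker once I have it. It assumes I call setParagraphEnd on the stripper with a marker string of my own choosing; MARKER and ParagraphSplitter are made-up names, and a hard-coded string stands in for the real PDFBox output. Whether the marker would actually land on the true paragraph boundaries of my book is exactly the part I am unsure about.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class ParagraphSplitter {
    // Hypothetical marker; any string that cannot occur in the book's text would do.
    static final String MARKER = "<<PARA_END>>";

    // Splits text in which every paragraph is terminated by the given marker.
    static List<String> split(String text, String marker) {
        List<String> paragraphs = new ArrayList<>();
        // Pattern.quote treats the marker literally instead of as a regex.
        for (String chunk : text.split(Pattern.quote(marker))) {
            String paragraph = chunk.trim();
            if (!paragraph.isEmpty()) {
                paragraphs.add(paragraph);
            }
        }
        return paragraphs;
    }

    public static void main(String[] args) {
        // Stand-in for the string returned by pdfStrip.getText(pdDoc)
        // after pdfStrip.setParagraphEnd(MARKER) has been called (assumption).
        String parsedText = "First paragraph,\nsecond line." + MARKER
                + "Second paragraph." + MARKER;
        for (String paragraph : split(parsedText, MARKER)) {
            System.out.println("[" + paragraph + "]");
        }
    }
}
```

Each element of the returned list would then be one document for the topic model.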
Is there a way to do this, or is there some other tool that can extract paragraphs effectively?