java - Apache POI - How to remove all links from a Word document

Question

I want to remove all hyperlinks from a Word document while keeping the text. I have these two methods for reading Word documents with the doc and docx extensions.

private void readDocXExtensionDocument() {
    File inputFile = new File(inputFolderDir, "test.docx");
    // try-with-resources closes the input stream and the document when done
    try (FileInputStream in = new FileInputStream(inputFile);
         XWPFDocument document = new XWPFDocument(OPCPackage.open(in))) {
        XWPFWordExtractor extractor = new XWPFWordExtractor(document);
        // Include hyperlink targets in the extracted text
        extractor.setFetchHyperlinks(true);
        String context = extractor.getText();
        System.out.println(context);
    } catch (InvalidFormatException | IOException e) {
        // FileNotFoundException is an IOException, so it is covered here
        e.printStackTrace();
    }
}

private void readDocExtensionDocument() {
    File inputFile = new File(inputFolderDir, "test.doc");
    // try-with-resources closes the input stream when done
    try (FileInputStream in = new FileInputStream(inputFile)) {
        POIFSFileSystem fs = new POIFSFileSystem(in);
        HWPFDocument document = new HWPFDocument(fs);
        WordExtractor wordExtractor = new WordExtractor(document);
        String[] paragraphs = wordExtractor.getParagraphText();
        System.out.println("Word document has " + paragraphs.length + " paragraphs");
        for (int i = 0; i < paragraphs.length; i++) {
            // Strip the line terminator at the end of each paragraph
            paragraphs[i] = paragraphs[i].replaceAll("\\cM?\r?\n", "");
            System.out.println(paragraphs[i]);
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}

Is it possible to remove all links from a Word document using the Apache POI library? If not, is there another library that can do this?

score 2 · Accepted Answer

My solution, at least for the .docx case, is to use a regular expression. Check this out:

private void readDocXExtensionDocument() {
    // Matches anything between angle brackets; group 1 is the text inside them
    Pattern p = Pattern.compile("<(.+?)>");
    File inputFile = new File(inputFolderDir, "test.docx");
    try (FileInputStream in = new FileInputStream(inputFile);
         XWPFDocument document = new XWPFDocument(OPCPackage.open(in))) {
        XWPFWordExtractor extractor = new XWPFWordExtractor(document);
        extractor.setFetchHyperlinks(true);
        String context = extractor.getText();
        Matcher m = p.matcher(context);
        while (m.find()) {
            String link = m.group(0);       // the bracketed part
            String textString = m.group(1); // the text of the link without the brackets
            // Ordering is important: link first, then textString.
            // replace() matches literally; replaceAll() would treat the URL as a regex.
            context = context.replace(link, "");
            context = context.replace(textString, "");
        }
        System.out.println(context);
    } catch (InvalidFormatException | IOException e) {
        e.printStackTrace();
    }
}

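The only caveat with this approach is that if the text contains angle-bracketed material that is not a link, that material may be removed as well. If you have a better idea of what kinds of links will appear, try a more specific regular expression instead of the one I provided.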
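
As a rough illustration of what a more specific regular expression might look like, here is a minimal sketch that only strips angle-bracketed text starting with a URL scheme. The class name `LinkStripper`, the method `stripBracketedUrls`, and the list of schemes (http, https, ftp, mailto) are assumptions made for this example, not part of Apache POI. It also assumes the extracted text carries each link target as `<url>` after the link text, possibly preceded by a single space.

```java
import java.util.regex.Pattern;

public class LinkStripper {
    // Hypothetical pattern: an optional leading space, then <scheme...> with no
    // whitespace or nested angle brackets inside. The scheme list is an assumption;
    // extend it to match the links your documents actually contain.
    private static final Pattern BRACKETED_URL =
            Pattern.compile(" ?<(?:https?://|ftp://|mailto:)[^<>\\s]+>");

    // Removes only bracketed URLs; other angle-bracketed text is left alone.
    public static String stripBracketedUrls(String text) {
        return BRACKETED_URL.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        String sample = "See the docs <https://poi.apache.org/> and note that a < b > c.";
        System.out.println(stripBracketedUrls(sample));
    }
}
```

With this pattern, text such as `a < b > c` or `<tag>` is left untouched because it does not start with one of the listed schemes. The trade-off is that links using any other scheme stay in the output.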
