java - Getting the text of a document using Apache POI WordToHtmlConverter

Question

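I have software that converts Word documents to HTML. What I need is a way to get just the text of the document. Once the document is converted, its content ends up wrapped in HTML tags. I don't have much experience with HTML or web development; I'm mainly an app and desktop developer. What is a simple way to get only the paragraph text? I need the result as a String so that I can parse it. Here is the code I have so far: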

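import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.apache.poi.hwpf.HWPFDocumentCore;
import org.apache.poi.hwpf.converter.WordToHtmlConverter;
import org.apache.poi.hwpf.converter.WordToHtmlUtils;
import org.openide.util.Exceptions; // assumed: the NetBeans Platform helper class

// docEditorPane, selectedFile and newDocText are fields of the enclosing class (not shown).
private void convertWordDoctoHTML(File file) throws ParserConfigurationException, TransformerConfigurationException, TransformerException, IOException {
    HWPFDocumentCore wordDocument;
    // Close the input stream once the document has been loaded.
    try (InputStream in = new FileInputStream(file)) {
        wordDocument = WordToHtmlUtils.loadDoc(in);
    } catch (IOException ex) {
        Exceptions.printStackTrace(ex);
        return; // nothing to convert; avoids a NullPointerException below
    }

    // Convert the Word document into an HTML DOM tree.
    WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument());
    wordToHtmlConverter.processDocument(wordDocument);
    org.w3c.dom.Document htmlDocument = wordToHtmlConverter.getDocument();

    // Serialize the DOM tree to UTF-8 encoded HTML.
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    Transformer serializer = TransformerFactory.newInstance().newTransformer();
    serializer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
    serializer.setOutputProperty(OutputKeys.INDENT, "yes");
    serializer.setOutputProperty(OutputKeys.METHOD, "html");
    serializer.transform(new DOMSource(htmlDocument), new StreamResult(out));

    // Decode with the charset the serializer wrote, not the platform default.
    String result = new String(out.toByteArray(), StandardCharsets.UTF_8);
    System.out.println(result);
    docEditorPane.setPage(selectedFile.toURI().toURL());
    newDocText = result;
}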
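
One possible way to get at the paragraph text, sketched here under some assumptions: since the converter already hands back an org.w3c.dom.Document, the text can be read straight from the DOM instead of parsing the serialized HTML string. The class and method names below (ParagraphTextExtractor, extractParagraphs, extractText) are hypothetical helpers, not part of Apache POI, and the sketch assumes the converter emits one <p> element per Word paragraph. It uses only the JDK's DOM API.

```java
import java.util.ArrayList;
import java.util.List;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

/** Hypothetical helper (not part of Apache POI): collects paragraph text from an HTML DOM. */
final class ParagraphTextExtractor {

    /** Returns the trimmed text of every non-empty p element, in document order. */
    static List<String> extractParagraphs(Document htmlDocument) {
        List<String> paragraphs = new ArrayList<>();
        // Assumption: each Word paragraph was converted to one <p> element.
        NodeList nodes = htmlDocument.getElementsByTagName("p");
        for (int i = 0; i < nodes.getLength(); i++) {
            // getTextContent() concatenates all descendant text nodes,
            // so text nested in child elements such as <span> is included.
            String text = nodes.item(i).getTextContent().trim();
            if (!text.isEmpty()) {
                paragraphs.add(text);
            }
        }
        return paragraphs;
    }

    /** Joins the paragraphs into a single String, one paragraph per line. */
    static String extractText(Document htmlDocument) {
        return String.join("\n", extractParagraphs(htmlDocument));
    }
}
```

In the method from the question, this could be called right after wordToHtmlConverter.getDocument(), for example `String text = ParagraphTextExtractor.extractText(htmlDocument);`. If the HTML itself is not needed, POI's org.apache.poi.hwpf.extractor.WordExtractor and its getParagraphText() method may be a more direct route for .doc files.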

