java - Converting .docx to HTML with Apache POI returns no text

Question

I currently have some code that converts .doc documents to HTML, but the code I am using for .docx files unfortunately does not pick up the text and convert it. Here is my code.

private void convertWordDocXtoHTML(File file) throws ParserConfigurationException, TransformerConfigurationException, TransformerException, IOException {
    XWPFDocument wordDocument = null;
    try {
        wordDocument = new XWPFDocument(new FileInputStream(file));
    } catch (IOException ex) {
        Exceptions.printStackTrace(ex);
    }

    WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument());
    org.w3c.dom.Document htmlDocument = wordToHtmlConverter.getDocument();
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    DOMSource domSource = new DOMSource(htmlDocument);
    StreamResult streamResult = new StreamResult(out);

    TransformerFactory tf = TransformerFactory.newInstance();
    Transformer serializer = tf.newTransformer();
    serializer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
    serializer.setOutputProperty(OutputKeys.INDENT, "yes");
    serializer.setOutputProperty(OutputKeys.METHOD, "html");
    serializer.transform(domSource, streamResult);
    out.close();

    String result = new String(out.toByteArray());
    acDocTextArea.setText(newDocText);
    String htmlText = result;

}

Any ideas on why this is not working would be much appreciated. The ByteArrayOutputStream should contain the complete HTML, but it comes back empty, with no text.

score 5 · Accepted Answer

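Mark, you are using the HWPF package, which supports only the .doc format; see this description. That documentation also mentions attempts to provide an interface for .docx files through the XWPF package. However, there seems to be a shortage of contributors, and users are encouraged to submit extensions. Limited functionality should still be available, though, and text extraction should be part of it.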
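Since the two packages handle different file formats, it can help to check which format a file really is before choosing one. The following is a minimal sketch using only the JDK (Java 9 or later for `readNBytes`). The class name `WordFormatSniffer` and the enum are hypothetical names made up for illustration. It relies on two well-known signatures: legacy .doc files are OLE2 compound files, and .docx files are ZIP packages. A ZIP signature alone does not prove the file is a .docx, only that it is not a legacy .doc.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class WordFormatSniffer {

    // Hypothetical labels for this sketch, not part of Apache POI.
    enum WordFormat { DOC_OLE2, DOCX_ZIP, UNKNOWN }

    // OLE2 compound file signature, used by legacy .doc files (HWPF territory).
    private static final int[] OLE2_MAGIC = {0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1};
    // ZIP local file header signature; a .docx is a ZIP package (XWPF territory).
    private static final int[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};

    // Reads up to the first 8 bytes of the stream and compares them to the known signatures.
    static WordFormat detect(InputStream in) throws IOException {
        byte[] head = new byte[8];
        int read = in.readNBytes(head, 0, head.length);
        if (startsWith(head, read, OLE2_MAGIC)) {
            return WordFormat.DOC_OLE2;
        }
        if (startsWith(head, read, ZIP_MAGIC)) {
            return WordFormat.DOCX_ZIP;
        }
        return WordFormat.UNKNOWN;
    }

    private static boolean startsWith(byte[] data, int length, int[] magic) {
        if (length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if ((data[i] & 0xFF) != magic[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        byte[] zipHeader = {0x50, 0x4B, 0x03, 0x04, 0, 0, 0, 0};
        System.out.println(detect(new ByteArrayInputStream(zipHeader))); // prints DOCX_ZIP
    }
}
```

A caller could route `DOC_OLE2` files to the existing HWPF-based conversion and `DOCX_ZIP` files to an XWPF-based path.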

You should also take a look at this question: How to Extract docx (word 2007 above) using apache POI.
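As a rough illustration of the text extraction mentioned above, here is a minimal sketch that reads a .docx with the XWPF classes and returns its plain text through `XWPFWordExtractor`. It assumes the Apache POI OOXML artifact (`poi-ooxml`) is on the classpath. The class name `DocxTextExtractor` and the sample sentence are made up for the example, and the demo builds its document in memory so that no input file is needed. This yields plain text only, not HTML.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

public class DocxTextExtractor {

    // Opens the stream as a .docx package and returns the document's plain text.
    static String extractText(InputStream docxStream) throws IOException {
        try (XWPFDocument document = new XWPFDocument(docxStream)) {
            XWPFWordExtractor extractor = new XWPFWordExtractor(document);
            return extractor.getText();
        }
    }

    public static void main(String[] args) throws IOException {
        // Build a tiny .docx in memory so the demo does not depend on a file on disk.
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (XWPFDocument sample = new XWPFDocument()) {
            sample.createParagraph().createRun().setText("Hello from a .docx file");
            sample.write(buffer);
        }

        // Read it back through XWPF and print the extracted paragraph text.
        String text = extractText(new ByteArrayInputStream(buffer.toByteArray()));
        System.out.println(text);
    }
}
```

In the question's method, the same call could be fed with `new FileInputStream(file)` instead of the in-memory buffer.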

score 0 · Accepted Answer

I was stuck at this point too. I have since found a third-party API that converts docx to HTML and works well: https://code.google.com/p/xdocreport/wiki/XWPFConverterXHTML
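For reference, a conversion with that library might look like the sketch below. It assumes the XDocReport XHTML converter and Apache POI are both on the classpath. The package name `fr.opensagres.poi.xwpf.converter.xhtml` is the one used by the 2.x releases as far as I know (older releases used `org.apache.poi.xwpf.converter.xhtml`), so adjust the imports to your version. The class name `DocxToXhtml` is made up for the example, and the method declares `throws Exception` because the exact exception types may differ between versions.

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;

import org.apache.poi.xwpf.usermodel.XWPFDocument;

// Assumed package for XDocReport 2.x; older releases used org.apache.poi.xwpf.converter.xhtml.
import fr.opensagres.poi.xwpf.converter.xhtml.XHTMLConverter;
import fr.opensagres.poi.xwpf.converter.xhtml.XHTMLOptions;

public class DocxToXhtml {

    // Loads the .docx with XWPF and lets the XDocReport converter write XHTML to the output stream.
    static void convert(InputStream docxStream, OutputStream htmlOut) throws Exception {
        try (XWPFDocument document = new XWPFDocument(docxStream)) {
            XHTMLOptions options = XHTMLOptions.create();
            XHTMLConverter.getInstance().convert(document, htmlOut, options);
        }
    }

    public static void main(String[] args) throws Exception {
        // Usage: java DocxToXhtml input.docx output.html
        try (InputStream in = new FileInputStream(args[0]);
             OutputStream out = new FileOutputStream(args[1])) {
            convert(in, out);
        }
    }
}
```

Writing into a `ByteArrayOutputStream` instead of a file would give the HTML as bytes, which is closer to what the question's method does with its `result` string.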
