java - How do I clean up a JTextPane's/JEditorPane's HTML content into a string in Java?

Question

I am trying to get clean text content out of a JTextPane. Here is some sample code using JTextPane:

JTextPane textPane = new JTextPane ();
textPane.setContentType ("text/html");
textPane.setText ("This <b>is</b> a <b>test</b>.");
String text = textPane.getText ();
System.out.println (text);

The text looks like this in the JTextPane:

This is a test. (with "is" and "test" rendered in bold)

I get this kind of output on the console:

<html>
  <head>

  </head>
  <body>
    This <b>is</b> a <b>test</b>.
  </body>
</html>

I have used substring() and/or replace(), but that feels uncomfortable:

String text = textPane.getText ().replace ("<html> ... <body>\n    ", "");

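Is there a simple function that removes every tag except the <b> tags (and their content) from the string?

Sometimes JTextPane adds <p> tags around the content, so I want to get rid of those too.

Like this: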

<html>
  <head>

  </head>
  <body>
    <p style="margin-top: 0">
      hdfhdfgh
    </p>
  </body>
</html>

I want to get only the text content, with its tags:

This <b>is</b> a <b>test</b>.

score 5 · Accepted Answer

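I subclassed HTMLWriter and overrode startTag and endTag to skip every tag outside of <body>.

I did not test it much, but it seems to work. One drawback is that the output string contains a lot of whitespace. Getting rid of that should not be too hard.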

import java.io.*;
import javax.swing.*;
import javax.swing.text.*;
import javax.swing.text.html.*;

public class Foo {

    public static void main(String[] args) throws Exception {
        JTextPane textPane = new JTextPane();
        textPane.setContentType("text/html");
        textPane.setText("<p>This</p> <b>is</b> a <b>test</b>.");

        StringWriter writer = new StringWriter();
        HTMLDocument doc = (HTMLDocument) textPane.getStyledDocument();

        HTMLWriter htmlWriter = new OnlyBodyHTMLWriter(writer, doc);
        htmlWriter.write();

        System.out.println(writer.toString());
    }

    private static class OnlyBodyHTMLWriter extends HTMLWriter {

        public OnlyBodyHTMLWriter(Writer w, HTMLDocument doc) {
            super(w, doc);
        }

        private boolean inBody = false;

        private boolean isBody(Element elem) {
            // copied from HTMLWriter.startTag()
            AttributeSet attr = elem.getAttributes();
            Object nameAttribute = attr
                    .getAttribute(StyleConstants.NameAttribute);
            HTML.Tag name = null;
            if (nameAttribute instanceof HTML.Tag) {
                name = (HTML.Tag) nameAttribute;
            }
            return name == HTML.Tag.BODY;
        }

        @Override
        protected void startTag(Element elem) throws IOException,
                BadLocationException {
            if (inBody) {
                super.startTag(elem);
            }
            if (isBody(elem)) {
                inBody = true;
            }
        }

        @Override
        protected void endTag(Element elem) throws IOException {
            if (isBody(elem)) {
                inBody = false;
            }
            if (inBody) {
                super.endTag(elem);
            }
        }
    }
}
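
The answer above mentions that the writer's output contains a lot of extra whitespace. A minimal sketch of one way to collapse it is shown here; the class and method names (WhitespaceCollapser, collapse) are hypothetical and not part of the original answer. It assumes the content has no <pre> blocks or other places where whitespace is significant.

```java
final class WhitespaceCollapser {

    // Hypothetical helper: replaces every run of whitespace (spaces, tabs,
    // newlines) with a single space and trims both ends of the string.
    static String collapse(String html) {
        return html.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        // Hand-written stand-in for indented writer output.
        String raw = "\n    <p>\n      This\n    </p>\n    <b>is</b> a <b>test</b>.\n";
        System.out.println(collapse(raw));
        // prints: <p> This </p> <b>is</b> a <b>test</b>.
    }
}
```

In the Foo example this would be applied to writer.toString() before printing.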

score 1 · Accepted Answer

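You can use the HTML parser that JEditorPane itself uses: ParserDelegator (javax.swing.text.html.parser.ParserDelegator, an implementation of HTMLEditorKit.Parser).

See this example and the API documentation.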
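
As a rough sketch of the parser approach, the following uses ParserDelegator with an HTMLEditorKit.ParserCallback that re-emits only <b> tags and text. The class and method names (BoldOnlyExtractor, keepBoldOnly) are made up for illustration. Note that the parser decodes entities such as &amp; in the text it delivers, and the exact whitespace of the result may differ from the input, so treat this as a starting point rather than a finished solution.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

final class BoldOnlyExtractor {

    // Hypothetical helper: parses the HTML and rebuilds a string that keeps
    // the text and <b>...</b> tags while dropping every other tag.
    static String keepBoldOnly(String html) {
        final StringBuilder out = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.B) {
                    out.append("<b>");
                }
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if (tag == HTML.Tag.B) {
                    out.append("</b>");
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                out.append(data);
            }
        };
        try {
            // The third argument tells the parser to ignore charset directives.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            // Not expected when reading from an in-memory string.
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<html><head></head><body><p style=\"margin-top: 0\">"
                + "This <b>is</b> a <b>test</b>.</p></body></html>";
        System.out.println(keepBoldOnly(html));
    }
}
```

With a JTextPane, the input would be the string returned by textPane.getText().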

score 0 · Accepted Answer

I found a solution to this problem using the substring and replace methods.

// Get textPane content to string
String text = textPane.getText();

// Then I take substring to remove tags (html, head, body)
text = text.substring(44, text.length() - 19);

// Sometimes the program adds <p style="margin-top: 0"> and </p> tags, so remove them
// (this step is optional)
text = text.replace("<p style=\"margin-top: 0\">\n      ", "").replace("\n    </p>", "");

// Convert HTML escape sequences back to plain characters, for example &amp; -> &
text = StringEscapeUtils.unescapeHtml(text);

Here is a link to the StringEscapeUtils library, which converts escape characters back to their normal form. Thanks to OzhanDuz for the suggestion.

(commons-lang download)
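
The fixed offsets in substring(44, text.length() - 19) depend on the exact layout that getText() produces. A slightly more defensive sketch, under the assumption that the markup contains lowercase <body> and </body> tags as in the console output shown in the question, locates the tags instead. The names BodyExtractor, bodyContent and stripParagraphTags are hypothetical.

```java
final class BodyExtractor {

    // Hypothetical helper: returns the markup between <body ...> and </body>,
    // trimmed. Falls back to the trimmed input if no body element is found.
    static String bodyContent(String html) {
        int open = html.indexOf("<body");
        if (open < 0) {
            return html.trim();
        }
        int start = html.indexOf('>', open);      // end of the opening tag
        int end = html.lastIndexOf("</body>");
        if (start < 0 || end < start) {
            return html.trim();
        }
        return html.substring(start + 1, end).trim();
    }

    // Hypothetical helper: removes <p ...> and </p> tags but keeps their
    // content; tags such as <pre> are left alone.
    static String stripParagraphTags(String html) {
        return html.replaceAll("</?p(\\s[^>]*)?>", "").trim();
    }

    public static void main(String[] args) {
        String html = "<html>\n  <head>\n\n  </head>\n  <body>\n"
                + "    This <b>is</b> a <b>test</b>.\n  </body>\n</html>\n";
        System.out.println(bodyContent(html));
        // prints: This <b>is</b> a <b>test</b>.
    }
}
```

Calling stripParagraphTags(bodyContent(text)) covers the case where JTextPane wraps the content in a <p style="margin-top: 0"> element.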

score 0 · Accepted Answer

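// Returns the plain text only (all tags, including <b>, are dropped); getText can throw BadLocationException
String text = textPane.getDocument().getText(0, textPane.getDocument().getLength());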

answered 2018-09-11T19:31:40.007
