java - Convert HTML characters back to text using the Java standard library

Question

I would like to convert some HTML characters back to text using the Java standard library. Is there a library that would achieve this?

import java.io.UnsupportedEncodingException;

public class Main {
    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) {
        // "Happy & Sad" in HTML form.
        String s = "Happy &amp; Sad";
        System.out.println(s);

        try {
            // Change to "Happy & Sad". DOESN'T WORK!
            s = java.net.URLDecoder.decode(s, "UTF-8");
            System.out.println(s);
        } catch (UnsupportedEncodingException ex) {
            // Not expected: every Java platform is required to support UTF-8.
        }
    }
}

score 59 · Accepted Answer

I think the StringEscapeUtils.unescapeHtml3() and unescapeHtml4() methods from Apache Commons are what you are looking for. They were introduced in Commons Lang and have since moved to Commons Text, which is the library the following link documents. See https://commons.apache.org/proper/commons-text/javadocs/api-release/org/apache/commons/text/StringEscapeUtils.html.

score 28 · Accepted Answer

For this you need to add the jsoup jar file to your application's lib directory and then use this code.

import org.jsoup.Jsoup;

public class Encoder {
    public static void main(String args[]) {
        String s = Jsoup.parse("&lt;Fran&ccedil;ais&gt;").text();
        System.out.print(s);
    }
}

Link to download jsoup: http://jsoup.org/download

score 7 · Accepted Answer

java.net.URLDecoder deals only with the application/x-www-form-urlencoded MIME format (e.g. "%20" represents space), not with HTML character entities. I don't think there's anything on the Java platform for that. You could write your own utility class to do the conversion, like this one.
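As a rough illustration of what such a hand-written utility could look like, here is a minimal sketch that uses only the standard library. The class name HtmlUnescaper, its unescape method, and the tiny entity table are all made up for this example. HTML defines far more named entities than the handful listed here, so treat it as a starting point and not as a complete decoder.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the Java platform.
final class HtmlUnescaper {
    // Deliberately small table of named entities (assumption: extend as needed).
    private static final Map<String, String> NAMED = new HashMap<>();
    static {
        NAMED.put("amp", "&");
        NAMED.put("lt", "<");
        NAMED.put("gt", ">");
        NAMED.put("quot", "\"");
        NAMED.put("apos", "'");
        NAMED.put("nbsp", "\u00A0");
        NAMED.put("ccedil", "\u00E7");
    }

    // Matches &name;, decimal &#233; and hexadecimal &#xE9; references.
    private static final Pattern ENTITY =
            Pattern.compile("&(#[xX][0-9a-fA-F]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);");

    static String unescape(String input) {
        Matcher m = ENTITY.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String decoded = decode(m.group(1));
            // Entities this sketch does not know are copied through unchanged.
            String replacement = (decoded != null) ? decoded : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Returns the decoded text, or null if the reference is not recognised.
    private static String decode(String body) {
        if (body.charAt(0) != '#') {
            return NAMED.get(body);
        }
        try {
            boolean hex = body.charAt(1) == 'x' || body.charAt(1) == 'X';
            int codePoint = hex
                    ? Integer.parseInt(body.substring(2), 16)
                    : Integer.parseInt(body.substring(1), 10);
            return Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : null;
        } catch (NumberFormatException ex) {
            return null; // more digits than fit in an int
        }
    }

    public static void main(String[] args) {
        System.out.println(unescape("Happy &amp; Sad")); // prints: Happy & Sad
    }
}
```

The input is scanned in a single pass, so "&amp;lt;" becomes "&lt;" and is not decoded a second time.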

score 5 · Accepted Answer

The URL decoder should only be used for decoding strings from URLs generated by HTML forms, which use the "application/x-www-form-urlencoded" MIME type. It does not support HTML character entities.
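A small sketch of that difference, using only java.net.URLDecoder from the standard library. The class name UrlDecoderDemo and the helper formDecode are hypothetical names chosen for this example.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical demo class showing what URLDecoder does and does not decode.
final class UrlDecoderDemo {
    // Decodes application/x-www-form-urlencoded text as UTF-8.
    static String formDecode(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException ex) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(ex);
        }
    }

    public static void main(String[] args) {
        // Form-encoded input: '+' is a space and %26 is '&'.
        System.out.println(formDecode("Happy+%26+Sad"));   // prints: Happy & Sad
        // HTML entity input contains no '+' or '%', so it comes back unchanged.
        System.out.println(formDecode("Happy &amp; Sad")); // prints: Happy &amp; Sad
    }
}
```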

After a search I found a Translate class within the HTML Parser library.

score 2 · Accepted Answer

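I'm not aware of any way to do it using the standard library. But I do know and use this class that deals with HTML entities.

"HTMLEntities is an Open Source Java class that contains a collection of static methods (htmlentities, unhtmlentities, ...) to convert special and extended characters into HTML entities and vice versa."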

http://www.tecnick.com/public/code/cp_dpage.php?aiocp_dp=htmlentities

score 1 · Accepted Answer

As @jem suggested, it is possible to use jsoup.

With jSoup 1.8.3 it is possible to use the method Parser.unescapeEntities, which retains the original HTML.

import org.jsoup.parser.Parser;
...
String html = Parser.unescapeEntities(original_html, false);

It seems that this method is not present in some earlier releases.
