java - Check whether HTML code represents visible text/images

Question

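I have a string that contains HTML code. I would like to know whether that HTML represents visible text or an image. I am using Java and approached the problem with the following regular expressions (I know that HTML cannot be parsed with RegExps, but I thought they would be sufficient for this case).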

public static String regex_html_tags_1 = "<\\s*br\\s*[/]?>";
public static String regex_html_tags_2 = "<\\s*([a-zA-Z0-9]+)\\s*([^=/>]+\\s*=\\s*[^/>]+\\s*)*\\s*/>"; 
public static String regex_html_tags_3 = "<\\s*([a-zA-Z0-9]+)\\s*([^=>]+\\s*=\\s*[^>]+\\s*)*\\s*>\\s*</\\s*\\1\\s*>"; 

public static String[] HTMLWhiteSpaces = {"&nbsp;", "&#160;"};

The code that uses these regular expressions works fine for strings such as

<h2></h2>

or similar. However, the string

<img src="someImage.png"></img>

is also considered empty.
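The false positive can be reproduced with the standard library alone. The sketch below assumes the pattern is applied as a full match against the whole string, since the question does not show how the regular expressions are actually used; the class name `EmptyTagRegexDemo` and the helper `looksEmpty` are made up for illustration.

```java
import java.util.regex.Pattern;

public class EmptyTagRegexDemo {

    // regex_html_tags_3 from the question: an opening tag with optional
    // attributes, directly followed by the matching closing tag
    static final Pattern EMPTY_PAIR = Pattern.compile(
            "<\\s*([a-zA-Z0-9]+)\\s*([^=>]+\\s*=\\s*[^>]+\\s*)*\\s*>\\s*</\\s*\\1\\s*>");

    // Hypothetical helper: true if the whole string is one "empty" tag pair
    static boolean looksEmpty(String html) {
        return EMPTY_PAIR.matcher(html).matches();
    }

    public static void main(String[] args) {
        System.out.println(looksEmpty("<h2></h2>"));
        System.out.println(looksEmpty("<img src=\"someImage.png\"></img>"));
        System.out.println(looksEmpty("<h2>Title</h2>"));
    }
}
```

The pattern only looks at whether anything sits between the opening and closing tag, so it cannot tell an empty heading from an image whose visible content comes from its `src` attribute.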

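Is there a better idea than regular expressions for finding out whether some HTML code actually represents human-readable text when interpreted by a browser? Or do you think my approach will eventually lead to success?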

Thanks in advance.

score 2 · Accepted Answer

Try JSoup. It parses HTML documents and lets you query them with CSS selectors (jQuery style).

A very simple example that selects all non-empty elements:

Document doc = Jsoup.connect("http://my.awesome.site.com").get();
Elements nonEmpties = doc.select(":not(:empty)");

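Of course, a full-fledged solution requires additional work, such as:

iterating over the list of elements,
checking CSS styles (display, visibility, size, or overlaying elements),
checking the src attribute of images,
etc.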

But it is definitely worth it: you will learn a new framework, discover the ways content can be "hidden" in HTML/CSS and, most importantly, stop using regular expressions to parse HTML ;-)
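One of the extra checks listed above, the CSS style check, might be sketched as follows. This is only a minimal illustration under stated assumptions: it looks at an inline `style` string only (for example the value you would read from an element's `style` attribute), ignores stylesheets, size and overlays, and the class name `InlineStyleCheck` and method `isHiddenByInlineStyle` are hypothetical.

```java
import java.util.Locale;

public class InlineStyleCheck {

    /**
     * Returns true if the inline style string hides the element through
     * display:none or visibility:hidden. Stylesheets are not considered.
     */
    public static boolean isHiddenByInlineStyle(String style) {
        if (style == null) {
            return false;
        }
        // Declarations are separated by ';', property and value by ':'
        for (String declaration : style.split(";")) {
            String[] parts = declaration.split(":", 2);
            if (parts.length != 2) {
                continue; // empty or malformed declaration
            }
            String property = parts[0].trim().toLowerCase(Locale.ROOT);
            String value = parts[1].trim().toLowerCase(Locale.ROOT);
            // Ignore a trailing !important marker
            value = value.replace("!important", "").trim();
            if (property.equals("display") && value.equals("none")) {
                return true;
            }
            if (property.equals("visibility") && value.equals("hidden")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isHiddenByInlineStyle("color: red; display: none"));
        System.out.println(isHiddenByInlineStyle("visibility: visible"));
    }
}
```

Splitting on `;` is a simplification that would mis-split values containing a semicolon, such as data URLs, so treat this as a starting point rather than a complete visibility test.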

score 1 · Accepted Answer

I came up with the following code, which works well in my setting, where invisible elements do not need to be considered.

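// jsoup imports needed by the method below
// (StringUtil lives in org.jsoup.internal in newer jsoup releases)
import org.jsoup.Jsoup;
import org.jsoup.helper.StringUtil;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

// HTML white spaces that might occur in between tags; this list probably needs to be extended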
public static String[] HTML_WHITE_SPACES = {"&nbsp;", "&#160;"};

/**
 * check if the given HTML text contains visible text or images
 * 
 * @param htmlText String the text that is checked for visibility
 * @return boolean    (1) true if the htmlText contains some visible elements 
 *                 or (2) false in case (1) does not hold
 */
public static boolean containsVisibleElements(String htmlText) {

    // do not analyze the HTML text if it is blank already
    if (StringUtil.isBlank(htmlText)) {
        return false;
    }

    // the string from which all whitespaces are removed
    String htmlTextRemovedWhiteSpaces = htmlText; 

    // first, remove white spaces from the string
    for (String whiteSpace: HTML_WHITE_SPACES) {
        htmlTextRemovedWhiteSpaces = htmlTextRemovedWhiteSpaces.replaceAll(whiteSpace, "");
    }

    // the HTML text is blank 
    if (StringUtil.isBlank(htmlTextRemovedWhiteSpaces)) {
        return false;
    }

    // parse the HTML text from which the white space have been removed
    Document doc = Jsoup.parse(htmlTextRemovedWhiteSpaces);

    // find real text within the body (and its children)
    String text = doc.body().text(); 

    // there exists visible text
    if (!StringUtil.isBlank(text.trim())) {
        return true;
    }

    // now we know that there does not exist visible text and that the string 
    // htmlTextRemovedWhiteSpaces is not blank

    // look for images as they are visible and not a text ;-)
    Elements images = doc.select("img");

    // there do not exist any image elements
    if (images.isEmpty()) {
        return false;
    }       

    // at least one img element exists, so the HTML contains a visible element
    return true;
}
