java - Search function for HTML

Question

How can I search the text inside an HTMLDocument and get back the start index and end index of the matched word or sentence, while ignoring tags during the search?

Searching for: stackoverflow
html: <p class="red">stack<b>overflow</b></p>

This should return the indices 15 and 31.

It should behave the way a browser does when you search a web page.

score 0 · Accepted Answer

If you want to do this in Java, here is a rough example using Jsoup. You will of course need to fill in the details so that the code parses your particular HTML correctly.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

String html = "<html><head><title>First parse</title></head>"
      + "<body><p class=\"red\">stack<b>overflow</b></p></body></html>";

String search = "stackoverflow";

Document doc = Jsoup.parse(html);
Element p = doc.body().getElementsByTag("p").first();
String pPlainText = p.text(); // tag-free text: "stackoverflow"

// contains() compares literally; matches() would treat its argument as a regex
if (pPlainText.contains(search)) {
    System.out.println("text found in html");

    String pElementString = doc.body().html(); // inner HTML of body: <p class="red">stack<b>overflow</b></p>
    String firstWord = p.ownText(); // "stack"
    String secondWord = p.children().first().ownText(); // "overflow"

    // locate the first and last text fragments in the markup
    int start = pElementString.indexOf(firstWord); // 15
    int end = pElementString.lastIndexOf(secondWord) + secondWord.length(); // 31
    System.out.println(start + " >> " + end);
} else {
    System.out.println("cannot find searched text");
}
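
The example above relies on knowing which elements hold the two halves of the word. A more general way to get the same indices, sketched here without Jsoup, is to strip the tags while recording the original position of every visible character, search the stripped text, and then map the match back to the markup. The class name `TagIgnoringSearch` and the method `find` are made up for illustration. The tag scanner is deliberately naive: it assumes every `<` opens a tag, so it does not handle comments, `<script>` content, entities such as `&amp;`, or a `>` inside an attribute value.

```java
import java.util.ArrayList;
import java.util.List;

class TagIgnoringSearch {

    /**
     * Returns {start, end} offsets into html (end exclusive) of the first
     * occurrence of query in the text outside tags, or null if not found.
     */
    static int[] find(String html, String query) {
        if (query.isEmpty()) {
            return null;
        }
        StringBuilder text = new StringBuilder();   // visible characters only
        List<Integer> offsets = new ArrayList<>();  // offsets.get(i) = index in html of text.charAt(i)
        boolean inTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') {
                inTag = true;                       // naive assumption: '<' always starts a tag
            } else if (c == '>' && inTag) {
                inTag = false;
            } else if (!inTag) {
                text.append(c);
                offsets.add(i);
            }
        }
        int hit = text.indexOf(query);
        if (hit < 0) {
            return null;
        }
        int start = offsets.get(hit);
        int end = offsets.get(hit + query.length() - 1) + 1;
        return new int[] {start, end};
    }

    public static void main(String[] args) {
        String html = "<p class=\"red\">stack<b>overflow</b></p>";
        int[] range = find(html, "stackoverflow");
        System.out.println(range[0] + " >> " + range[1]); // 15 >> 31
    }
}
```

For real-world pages, the same offset-mapping idea could be applied to the text nodes produced by a proper HTML parser rather than to a hand-rolled scanner.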

