java - Remove HTML entities and their content

Question

I have a snippet of HTML that was extracted using Document doc = Jsoup.connect(someUrl).get() and Elements body = doc.select("div.chapter"):

String myHtml = "
<div class="chapter">
  <h1>Hello this is my example</h1>
  <p>This is paragraph one</p>
  <p>This is paragraph two <sup class="num">Nuisance 1</sup><span class="notes">Nuisance 2</span></p>
  <p>This is paragraph three</p>
</div>"

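I want to remove the <sup> </sup> and <span> </span> elements together with their content using jsoup. I have read that using regular expressions for this is a bad idea, and most examples and answers on this topic deal with removing the tags while keeping the content. What I want to get is: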

String newHtml = "
<div class="chapter">
  <h1>Hello this is my example</h1>
  <p>This is paragraph one</p>
  <p>This is paragraph two</p>
  <p>This is paragraph three</p>
</div>"

I tried jsoup but did not get a satisfactory result (the sup and span entities/tags are preserved).

score 1 · Accepted Answer

After more reading (a lot more!) and trying various options, I adapted a solution to my case:

doc.getElementsByClass("notes").remove();
doc.getElementsByClass("num").remove(); 
Elements newElement = doc.select("div.chapter");
String newHtml=newElement.toString();

score 1 · Accepted Answer

body.select("p > sup.num, p > span.notes").remove();
System.out.println(body.html());

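This should work for your case: the selector matches only sup.num and span.notes elements that are direct children of a p, and remove() deletes each matched element along with its content.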
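For anyone who wants to try this without fetching a page, here is a minimal self-contained sketch that parses the HTML from the question as a string. It assumes jsoup is on the classpath. The class name RemoveNotes and the helper stripNotes are made up for illustration and are not part of jsoup.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

class RemoveNotes {

    // Hypothetical helper: parses the HTML, removes every sup.num and
    // span.notes element (tags and content) inside div.chapter, and
    // returns the remaining markup of div.chapter.
    static String stripNotes(String html) {
        Document doc = Jsoup.parse(html);
        Elements chapter = doc.select("div.chapter");
        // Elements.remove() detaches each matched element from the DOM,
        // so its child text goes away with it.
        chapter.select("sup.num, span.notes").remove();
        return chapter.outerHtml();
    }

    public static void main(String[] args) {
        String myHtml =
              "<div class=\"chapter\">"
            + "<h1>Hello this is my example</h1>"
            + "<p>This is paragraph one</p>"
            + "<p>This is paragraph two <sup class=\"num\">Nuisance 1</sup>"
            + "<span class=\"notes\">Nuisance 2</span></p>"
            + "<p>This is paragraph three</p>"
            + "</div>";
        System.out.println(stripNotes(myHtml));
    }
}
```

Because the elements are removed from the parsed document itself, any later call on doc or on the selected div.chapter sees the cleaned tree. The exact whitespace of the printed markup depends on jsoup's output settings, so it may not match the desired string character for character.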
