unix - sed magic to extract from html

Question

I have an html page that contains many tables.

<html>
<table>
  POINTER_TEXT
  some other stuff
  <table that i want START>
  </table that i want END>
  some other stuff
  <table bad>
  </table bad>
</table>
</html>

I want to get the table that comes after a specific piece of text. I am fine up to this stage:

curl --silent http://xyz.com/1.htm | sed -n '/POINTER_TEXT/,$p'

This gives me:

  POINTER_TEXT
  some other stuff
  <table that i want START>
  </table that i want END>
  some other stuff
  <table bad>
  </table bad>
</table>
</html>

Next, I add this:

curl --silent http://xyz.com/1.htm | sed -n '/POINTER_TEXT/,$p' | sed -n '/<table*/,/<\/table>/p'

This gives me:

  <table that i want START>
  </table that i want END>
  <table bad>
  </table bad>

My problem is that I only need this:

  <table that i want START>
  </table that i want END>

Please help!

score 1 · Accepted Answer

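Add

| sed '\=</table={p;Q}'

to the end of the pipeline. It passes lines through until the first one containing </table, prints that line, and quits, so everything after the end of the first table is discarded. (Q is a GNU sed extension.)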
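Put together with the commands from the question, the whole pipeline might look like the sketch below. It reads a local file instead of calling curl so that it can be tried offline; the file name sample.htm and the function name first_table_after are assumptions made for illustration, and Q requires GNU sed.

```shell
# Hypothetical sample file mirroring the layout in the question.
cat > sample.htm <<'EOF'
<html>
<table>
  POINTER_TEXT
  some other stuff
  <table>
  wanted row
  </table>
  some other stuff
  <table>
  unwanted row
  </table>
</table>
</html>
EOF

# first_table_after FILE: print the first table that follows POINTER_TEXT.
first_table_after() {
  sed -n '/POINTER_TEXT/,$p' "$1" |      # keep everything from the marker on
    sed -n '/<table/,/<\/table>/p' |     # keep only lines inside table blocks
    sed '\=</table={p;Q}'                # print up to the first closing tag, then quit
}

first_table_after sample.htm
```

With the real page, the first sed would read from curl instead of a file.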

But what will the script do if the html contains no newlines? Using a real parser to process HTML is much more robust.
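To see the concern concretely, the sketch below feeds the same pipeline a single-line version of the page. Because sed works line by line, the range filters cannot cut inside a line, so the whole page comes back. One possible workaround, assuming GNU sed, is to put every tag on its own line first. The names oneline.htm and split_tags are hypothetical, and this remains a text hack rather than real HTML parsing.

```shell
# Hypothetical one-line page: same structure as the question, no newlines.
printf '%s\n' '<html><table>POINTER_TEXT x<table>wanted</table>y<table>unwanted</table></table></html>' > oneline.htm

# The line-based pipeline cannot separate the tables: the whole line is printed.
sed -n '/POINTER_TEXT/,$p' oneline.htm |
  sed -n '/<table/,/<\/table>/p' |
  sed '\=</table={p;Q}'

# split_tags FILE: put each tag on its own line and drop the empty lines
# this creates (GNU sed: \n in the replacement inserts a newline).
split_tags() {
  sed 's/</\n</g; s/>/>\n/g' "$1" | sed '/^$/d'
}

# After splitting, the same pipeline isolates the first table again.
split_tags oneline.htm |
  sed -n '/POINTER_TEXT/,$p' |
  sed -n '/<table/,/<\/table>/p' |
  sed '\=</table={p;Q}'
```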

score 0 · Accepted Answer

Here is the guide you need:

(1) The general solution is to use GNU sed or ssed with one of these range expressions. The first script ("print only the first match") works with any version of sed.
 sed -n '/RE/{p;q;}' file       # print only the first match
 sed '0,/RE/{//d;}' file        # delete only the first match
 sed '0,/RE/s//to_that/' file   # change only the first match
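
A small demonstration may make the difference between the three one-liners clearer. The sample file demo.txt and the replacement text baz are made up for illustration. The 0,/RE/ address form is a GNU sed (and ssed) extension, so only the first command should be expected to work with other seds.

```shell
# Hypothetical three-line input with two lines matching /foo/.
printf '%s\n' 'foo 1' 'bar' 'foo 2' > demo.txt

sed -n '/foo/{p;q;}' demo.txt      # print only the first matching line
sed '0,/foo/{//d;}' demo.txt       # delete the first matching line, keep the rest
sed '0,/foo/s//baz/' demo.txt      # replace "foo" on the first matching line only
```

The empty pattern // reuses the most recently used regular expression, which here is the /foo/ from the address.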

score 0 · Accepted Answer

Depending on what you are trying to do, you may be better off using a real parser, as choroba suggested. Conveniently, the W3C already provides one that accepts CSS3 selectors.

Example input "infile":

<html>
<table>
  POINTER_TEXT
  some other stuff
  <table>
  Wanted data
  </table>
  some other stuff
  <table>
  Not wanted
  </table>
</table>
</html>

To extract the first <table> that is a descendant of a <table>, use hxselect like this:

hxselect 'table > table:first-child' < infile

score 0 · Accepted Answer

This might work for you (GNU sed):

sed '/POINTER_TEXT/,${/<table/,/<\/table/{/<\/table/!b;q}};d' file
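
For readers unfamiliar with the nesting, the same command can be written across several lines with comments. This is only a restatement of the one-liner above, tried against a hypothetical file named infile2 with the same layout as the question; the function name extract_table is also made up, and the sketch still assumes GNU sed and one tag per line.

```shell
# Hypothetical input with the same layout as the question.
cat > infile2 <<'EOF'
<html>
<table>
  POINTER_TEXT
  some other stuff
  <table>
  wanted row
  </table>
  some other stuff
  <table>
  unwanted row
  </table>
</table>
</html>
EOF

# extract_table FILE: print the first table found after POINTER_TEXT.
extract_table() {
  sed '
    # From the marker line to the end of the file ...
    /POINTER_TEXT/,$ {
      # ... within a block running from an opening to a closing table tag ...
      /<table/,/<\/table/ {
        # not the closing tag yet: end this cycle, which prints the line
        /<\/table/!b
        # closing tag: print it and stop reading input
        q
      }
    }
    # every line that reaches this point is dropped
    d
  ' "$1"
}

extract_table infile2
```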
