python - Need help creating an xpath string to match multiple, but not all, table cells

Question

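Note: the question has been updated in response to some of the early answers. It is still the same question, but hopefully clearer.

I'm trying to get a site scraper to work properly, but I'm having trouble finding the right xpath string for some of the table cells.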

<tbody>
  <tr>
    <td class="Label" width="20%" valign="top">Uninteresting section</td>
    <td class="Data"> I don't care about this</td>
  </tr>
  <tr>
    <td></td>
    <td class="Data"> I don't care about this</td>
  </tr>
  <tr>
    <td class="Label" width="20%" valign="top">Interesting section</td>
    <td class="Data"> I want this-1</td>
  </tr>
  <tr>
    <td></td>
    <td class="Data"> I want this-2</td>
  </tr>
  <tr>
    <td></td>
    <td class="Data"> I want this-n</td>
  </tr>
  <tr>
    <td class="Label" width="20%" valign="top">Uninteresting section</td>
    <td class="Data"> I don't care about this</td>
  </tr>
  <tr>
    <td></td>
    <td class="Data"> I don't care about this</td>
  </tr>
</tbody>

I need the contents of every Data field in the interesting section. There can be any number of them. I don't care about anything else in the markup, but I need all of these.

In the example above, that would be "I want this-1", "I want this-2", and "I want this-n".

In case it's relevant, I'm using Python 2.7 with xml.dom.minidom and py-dom-xpath.
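To pin down the expected result, here is a minimal sketch that walks the rows with xml.dom.minidom instead of XPath. It assumes a Label cell always opens a section and that the markup parses as well-formed XML; `section_data` and `cell_text` are hypothetical helper names, and the sample is a shortened copy of the table above.

```python
from xml.dom import minidom

SAMPLE = """<tbody>
  <tr><td class="Label">Uninteresting section</td><td class="Data"> I don't care about this</td></tr>
  <tr><td></td><td class="Data"> I don't care about this</td></tr>
  <tr><td class="Label">Interesting section</td><td class="Data"> I want this-1</td></tr>
  <tr><td></td><td class="Data"> I want this-2</td></tr>
  <tr><td></td><td class="Data"> I want this-n</td></tr>
  <tr><td class="Label">Uninteresting section</td><td class="Data"> I don't care about this</td></tr>
</tbody>"""


def cell_text(td):
    """Concatenate the text nodes that are direct children of a td."""
    return "".join(n.data for n in td.childNodes if n.nodeType == n.TEXT_NODE)


def section_data(xml_text, title):
    """Return the stripped Data cells of the section whose Label equals title."""
    doc = minidom.parseString(xml_text)
    wanted = []
    inside = False
    for tr in doc.getElementsByTagName("tr"):
        for td in tr.getElementsByTagName("td"):
            css_class = td.getAttribute("class")
            if css_class == "Label":
                # A Label cell opens a new section: switch collecting on or off.
                inside = cell_text(td).strip() == title
            elif css_class == "Data" and inside:
                wanted.append(cell_text(td).strip())
    return wanted


result = section_data(SAMPLE, "Interesting section")
print(result)
```

This is only meant to describe the selection I'm after; what I'd like is a single xpath string that yields the same cells.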

score 1 · Accepted Answer

To get all n tds that follow the section heading (including those that belong to later sections), use:

 //tr[@class="Entry"]//tr/td[contains(text(), "Section title")]/following::td[@class = "Data"]/text()

Then you can get all m tds of the following sections, which you don't want:

//tr[@class="Entry"]//tr/td[contains(text(), "Section title")]/following::td[@class="Label"][1]/following::td[@class = "Data"]/text()

Then, in Python, use the first n - m tds.
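The slicing step can be sketched as follows. The two lists stand in for the results of the two XPath queries (they are hard-coded placeholders here, with whitespace stripped), and `first_section_only` is a hypothetical helper name, not part of any library.

```python
def first_section_only(all_following, after_next_label):
    """Keep the first n - m items, where the last m belong to later sections."""
    n, m = len(all_following), len(after_next_label)
    # The second query's result should be exactly the tail of the first one.
    if m > n or all_following[n - m:] != after_next_label:
        raise ValueError("second result is not a suffix of the first")
    return all_following[:n - m]


# Placeholder for the first query: every Data cell after the section label.
all_following = [
    "I want this-1",
    "I want this-2",
    "I want this-n",
    "I don't care about this",
    "I don't care about this",
]
# Placeholder for the second query: every Data cell after the next Label.
after_next_label = [
    "I don't care about this",
    "I don't care about this",
]

wanted = first_section_only(all_following, after_next_label)
print(wanted)
```

The suffix check is optional; it just guards against the two queries being run against different documents.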

You can also try to do the same thing in XPath itself, using the position and count functions:

  //tr[@class="Entry"]//tr/td[contains(text(), "Section title")]/following::td[@class = "Data"][position() <= (count(//tr[@class="Entry"]//tr/td[contains(text(), "Section title")]/following::td[@class = "Data"]/text())  - count(//tr[@class="Entry"]//tr/td[contains(text(), "Section title")]/following::td[@class="Label"][1]/following::td[@class = "Data"]/text()) )]/text()

If you are using XPath 2.0, you can do this elegantly with the except operator:

 //tr[@class="Entry"]//tr/td[contains(text(), "Section title")]/following::td[@class = "Data"]/text() except  //tr[@class="Entry"]//tr/td[contains(text(), "Section title")]/following::td[@class="Label"][1]/following::td[@class = "Data"]/text()

score 0 · Accepted Answer


//tr[@class="Entry"]/td[@class="Data"]/text()

Answered 2012-07-25T14:13:36.953

score 0 · Accepted Answer

//tbody[tr/td[contains(text(),"Section title")]]/tr/td[@class="Data"]/text()

Updated. Here is what this does:

Gets each tbody that contains a tr containing a td with the text "Section title"
From those, gets the text of each td with class="Data"
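As a rough illustration of those two steps, here is a Python emulation using xml.dom.minidom rather than an XPath engine. The two-tbody sample and the helper names `direct_text` and `data_in_matching_tbody` are made up for this sketch, and the descendant lookup only approximates the child steps of the expression (the sample has no nested tables).

```python
from xml.dom import minidom

SAMPLE = """<table>
  <tbody>
    <tr><td class="Label">Section title</td><td class="Data">a-1</td></tr>
    <tr><td></td><td class="Data">a-2</td></tr>
  </tbody>
  <tbody>
    <tr><td class="Label">Other</td><td class="Data">b-1</td></tr>
  </tbody>
</table>"""


def direct_text(node):
    """Concatenate the text nodes that are direct children of node."""
    return "".join(n.data for n in node.childNodes if n.nodeType == n.TEXT_NODE)


def data_in_matching_tbody(xml_text, title):
    """Collect Data cell text from every tbody that has a td containing title."""
    doc = minidom.parseString(xml_text)
    found = []
    for tbody in doc.getElementsByTagName("tbody"):
        tds = tbody.getElementsByTagName("td")
        # Step 1: keep only a tbody in which some td contains the title text.
        if any(title in direct_text(td) for td in tds):
            # Step 2: take the text of each td with class="Data" in that tbody.
            found.extend(direct_text(td).strip() for td in tds
                         if td.getAttribute("class") == "Data")
    return found


cells = data_in_matching_tbody(SAMPLE, "Section title")
print(cells)
```

Note that every Data cell of a matching tbody is returned, so this separates sections only when each section sits in its own tbody.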
