xpath - Extracting text between nodes via XPath

Question

I'm trying to read a specific part of a web page via XPath. The page is not very well-formed, but I can't change it...

<root>
    <div class="textfield">
        <div class="header">First item</div>
        Here is the text of the <strong>first</strong> item.
        <div class="header">Second item</div>
        <span>Here is the text of the second item.</span>
        <div class="header">Third item</div>
        Here is the text of the third item.
    </div>
    <div class="textfield">
        Footer text
    </div>
</root>

I want to extract the text of the individual items, i.e. the text between the header divs (for example, "Here is the text of the first item."). So far I have been using this XPath expression:

//text()[preceding::*[@class='header' and contains(text(),'First item')] and following::*[@class='header' and contains(text(),'Second item')]]

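However, I can't hardcode the name of the ending item, because the items appear in a different order on the pages I want to scrape (for example, "First item" may be followed by "Third item").

Any help on how to adapt my XPath query would be greatly appreciated.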

score 2 · Accepted Answer

//*[@class='header' and contains(text(),'First item')]/following::text()[1]

selects the first text node after <div class="header">First item</div>.

//*[@class='header' and contains(text(),'Second item')]/following::text()[1]

selects the first text node after <div class="header">Second item</div>.

EDIT: Sorry, this does not work for the <strong> case. I will update my answer.

EDIT2: Used @Michiel's part. It looks like omg, but it works (this really should be solved with a better solution :)

//div[@class='textfield'][1]//text()[preceding::*[@class='header' and contains(text(),'First item')]][following::*[preceding::*[not(self::strong) and not(self::span)][1][contains(text(),'First item')]] or not(//*[preceding::*[@class='header' and contains(text(),'First item')]])]

score 2 · Accepted Answer

Found it!

//text()[preceding::*[@class='header' and contains(text(),'First item')]][following::*[preceding::*[@class='header'][1][contains(text(),'First item')]]]

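Actually, Aleh, your solution does not work for tags within the text.

The one remaining case is the last item, which is not followed by an element with class=header, so the query would include all text found up to the end of the document. Any ideas?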
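
If post-processing outside XPath is an option, the last-item case can be sidestepped by walking the children of the container and grouping text under the most recent header. The sketch below is not an XPath solution; it uses Python's standard xml.etree.ElementTree, and it assumes the headers are direct children of the first textfield div, as in the sample above. The name texts_by_header is made up for illustration.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<root>
    <div class="textfield">
        <div class="header">First item</div>
        Here is the text of the <strong>first</strong> item.
        <div class="header">Second item</div>
        <span>Here is the text of the second item.</span>
        <div class="header">Third item</div>
        Here is the text of the third item.
    </div>
    <div class="textfield">
        Footer text
    </div>
</root>"""


def texts_by_header(container):
    """Map each header's text to the text that follows it, up to the next header.

    Hypothetical helper: assumes headers are direct children of `container`.
    """
    sections = {}
    current = None
    for child in container:
        if child.get("class") == "header":
            # A new item starts; the header's tail is the first text after it.
            current = (child.text or "").strip()
            sections[current] = [child.tail or ""]
        elif current is not None:
            # Inline element such as <strong> or <span>: keep its inner text
            # and whatever text follows it before the next element.
            sections[current].append("".join(child.itertext()))
            sections[current].append(child.tail or "")
    # Collapse runs of whitespace left over from the page's indentation.
    return {name: " ".join("".join(parts).split())
            for name, parts in sections.items()}


first_textfield = ET.fromstring(SAMPLE).find("div")
sections = texts_by_header(first_textfield)
for name, text in sections.items():
    print(f"{name}: {text}")
```

Because the walk stops at the end of the container, the last item ends there too and the footer in the second textfield div is never reached.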

score 1 · Accepted Answer

For completeness, here is the final query, composed of the various suggestions throughout the thread:

//*[
    @class='textfield' and position() = 1
]
//text() [
    preceding::*[
        @class='header' and contains(text(),'First item')
    ]
][
    following::*[
        preceding::*[
            @class='header'
        ][1][
            contains(text(),'First item')
        ]
    ]
]
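
To try the query against the sample document from the question, something like the following could be used. It is a minimal sketch that assumes the third-party lxml package is installed; the helper name item_text and the $name XPath variable are introduced here for illustration so that the header text is not repeated in the expression.

```python
from lxml import etree  # third-party package, assumed to be installed

SAMPLE = """<root>
    <div class="textfield">
        <div class="header">First item</div>
        Here is the text of the <strong>first</strong> item.
        <div class="header">Second item</div>
        <span>Here is the text of the second item.</span>
        <div class="header">Third item</div>
        Here is the text of the third item.
    </div>
    <div class="textfield">
        Footer text
    </div>
</root>"""

# Same structure as the final query, with the header text passed in as $name.
QUERY = """
//*[@class='textfield' and position() = 1]
//text()[
    preceding::*[@class='header' and contains(text(), $name)]
][
    following::*[
        preceding::*[@class='header'][1][contains(text(), $name)]
    ]
]
"""


def item_text(root, name):
    """Join the matched text nodes and collapse whitespace (hypothetical helper)."""
    nodes = root.xpath(QUERY, name=name)
    return " ".join("".join(nodes).split())


root = etree.fromstring(SAMPLE)
print(item_text(root, "First item"))
print(item_text(root, "Second item"))
```

The text nodes come back in document order, including the one inside <strong>, so joining them and normalizing whitespace yields the full sentence for each item.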
