python - XPath: Select the text of the current and next node by the current node's attribute

Question

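Apologies if this is a duplicate, but I couldn't find another question on SO or elsewhere that covers what I need. Here is my question:

I'm using scrapy to get information from this web page. To make it clear, the following is the block of source code from that page that interests me: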

<p class="titlestyle">ANT101H5 Introduction to Biological Anthropology and Archaeology 
                        <span class='distribution'>(SCI)</span></p> 

<span class='normaltext'> 
Anthropology is the global and holistic study of human biology and behaviour, and includes four subfields: biological anthropology, archaeology, sociocultural anthropology and linguistics. The material covered is  directed  to answering the question: What makes us human? This course is a survey of  biological  anthropology and  archaeology.  [<span class='Helpcourse'
            onMouseover="showtip(this,event,'24 Lectures')"
            onMouseout="hidetip()">24L</span>, <span class='Helpcourse'
            onMouseover="showtip(this,event,'12 Tutorials')"
            onMouseout="hidetip()">12T</span>]<br> 

<span class='title2'>Exclusion: </span><a href='javascript:OpenCourse("WEBCOURSENOTFOUND.html")'>ANT100Y5</a><br>

<span class='title2'>Prerequisite: </span><a href='javascript:OpenCourse("WEBCOURSEANT102H5.pl?fv=1")'>ANT102H5</a><br> 
</span><br/><br/<br/>

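Practically all of the code on that page looks like the block above.

From all of this, I need to grab:

ANT101H5 Introduction to Biological Anthropology and Archaeology
Exclusion: ANT100Y5
Prerequisite: ANT102H5

The problem is that Exclusion: is inside a <span>, while ANT100Y5 is inside the <a> that follows it.

I can't seem to grab both of them from this source code. Currently I have code that tries (and fails) to grab ANT100Y5, which looks like this: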

hxs = HtmlXPathSelector(response)
sites = hxs.select("//*[(name() = 'p' and @class = 'titlestyle') or (name() = 'a' and @href and preceding-sibling::'//span/@class=title2')]")

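I'd appreciate any help with this, even if the answer is "you're blind for not seeing this other SO question that answers it completely" (in which case I'll vote to close this myself). I'm truly at my wits' end.

Thanks in advance.

EDIT: Completing the original code after the change suggested by @Dimitre

I'm using the following code: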

class regcalSpider(BaseSpider):
    name = "disc"
    allowed_domains = ['www.utm.utoronto.ca']
    start_urls = ['http://www.utm.utoronto.ca/regcal/WEBLISTCOURSES1.html']

    def parse(self, response):
            items = []
            hxs = HtmlXPathSelector(response)
            sites = hxs.select("/*/p/text()[1] | \
                              (//span[@class='title2'])[1]/text() | \
                              (//span[@class='title2'])[1]/following-sibling::a[1]/text() | \
                              (//span[@class='title2'])[2]/text() | \
                              (//span[@class='title2'])[2]/following-sibling::a[1]/text()")

            for site in sites:
                    item = RegcalItem()
                    item['title'] = site.select("a/text()").extract()
                    item['link'] = site.select("a/@href").extract()
                    item['desc'] = site.select("text()").extract()
                    items.append(item)
            return items

            filename = response.url.split("/")[-2]
            open(filename, 'wb').write(response.body)

This gives me the following result:

[{"title": [], "link": [], "desc": []},
 {"title": [], "link": [], "desc": []},
 {"title": [], "link": [], "desc": []}]

This is not the output I need. What am I doing wrong? Keep in mind that, as mentioned, I'm running this script against this page.

score 3 · Accepted Answer

1. ANT101H5 Introduction to Biological Anthropology and Archaeology

p[@class='titlestyle']/text()

2. Exclusion: ANT100Y5

concat(
    span/span[@class='title2'][1],
    span/span[@class='title2'][1]/following-sibling::a[1]
    )

3. Prerequisite: ANT102H5

concat(
    span/span[@class='title2'][2],
    span/span[@class='title2'][2]/following-sibling::a[1]
    )
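For readers working in Python, the same pairing can be sketched with only the standard library. This is a rough illustration, not the method used on the real page: `xml.etree.ElementTree` supports just a subset of XPath (no `concat()` and no `following-sibling` axis), so the helper `first_following_sibling` below is a hypothetical stand-in for `following-sibling::a[1]`, and `SNIPPET` is a shortened, already well-formed version of the course block.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed stand-in for one course block (assumption: the
# real page has to be tidied into XML before ElementTree can parse it).
SNIPPET = """<body>
  <p class="titlestyle">ANT101H5 Introduction to Biological Anthropology and Archaeology
    <span class="distribution">(SCI)</span>
  </p>
  <span class="normaltext">Course description text.
    <br/>
    <span class="title2">Exclusion: </span>
    <a href="#">ANT100Y5</a>
    <br/>
    <span class="title2">Prerequisite: </span>
    <a href="#">ANT102H5</a>
    <br/>
  </span>
</body>"""


def first_following_sibling(parent, node, tag):
    """Emulate following-sibling::tag[1] for a direct child of parent."""
    children = list(parent)
    start = children.index(node) + 1
    for sibling in children[start:]:
        if sibling.tag == tag:
            return sibling
    return None


def extract_course(body):
    """Return the course title and each 'label: code' string of one block."""
    # Equivalent of normalize-space(p[@class='titlestyle']/text()[1]).
    title = " ".join(body.find("p[@class='titlestyle']").text.split())
    container = body.find("span[@class='normaltext']")
    pairs = []
    for label in container.findall("span[@class='title2']"):
        link = first_following_sibling(container, label, "a")
        if link is not None:
            # Equivalent of concat(label, following-sibling::a[1]).
            pairs.append(label.text + link.text)
    return title, pairs


title, pairs = extract_course(ET.fromstring(SNIPPET))
print(title)
print(pairs)
```

Looping over every `title2` span, instead of hard-coding positions 1 and 2, also covers courses that list only one of the two labels.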

score 2 · Accepted Answer

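It's not difficult to select the three nodes you mention (using techniques such as Flack's). What is difficult is (a) selecting them without also selecting things you don't want, and (b) making the selection robust, so that it still finds them when the input is slightly different. We have to assume you don't know exactly what is in your input; if you did, you wouldn't need to write an XPath expression to find out.

You've told us three things you want to grab. But what are the criteria for selecting these three and nothing else? How much is known about what you're looking for?

You've expressed your problem as an XPath problem, but I would tackle it differently. I would start by using XSLT to transform the input you've shown into something with better structure. In particular, I would wrap all the sibling elements that aren't inside a <p> element into <p> elements, treating each group of consecutive elements as a paragraph. That can be done without too much difficulty using the <xsl:for-each-group group-ending-with> construct in XSLT 2.0.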
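The grouping step can be approximated outside XSLT as well. The sketch below is a plain-Python imitation of the `group-ending-with` idea, not XSLT itself. It assumes the groups are closed by `<br/>` elements, as in the sample block, and it differs from XSLT in two ways: the closing `<br/>` is dropped from each group, and loose text nodes between elements are ignored. `FRAGMENT` and both function names are made up for the illustration.

```python
import xml.etree.ElementTree as ET

# Simplified, well-formed stand-in for the span that holds the course details.
FRAGMENT = """<span class="normaltext">Description.
  <span class="Helpcourse">24L</span>
  <br/>
  <span class="title2">Exclusion: </span><a href="#">ANT100Y5</a>
  <br/>
  <span class="title2">Prerequisite: </span><a href="#">ANT102H5</a>
  <br/>
</span>"""


def group_ending_with(parent, end_tag):
    """Split parent's child elements into groups, closing one at each end_tag."""
    groups, current = [], []
    for child in parent:
        if child.tag == end_tag:
            groups.append(current)
            current = []
        else:
            current.append(child)
    if current:  # trailing elements that were never closed by end_tag
        groups.append(current)
    return groups


def group_text(group):
    """Concatenate the direct text of every element in one group."""
    return "".join((element.text or "") for element in group).strip()


root = ET.fromstring(FRAGMENT)
paragraphs = [group_text(group) for group in group_ending_with(root, "br")]
print(paragraphs)
```

Once the content is grouped this way, each "label: code" pair sits in its own unit, so picking out the wanted paragraphs no longer depends on counting positions across the whole page.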

score 1 · Accepted Answer

My answer is very similar to @Flack's:

Having this XML document (I fixed the provided one by closing the numerous unclosed <br>s and wrapping everything in a single top element):

<body>
    <p class="titlestyle">ANT101H5 Introduction to Biological Anthropology and Archaeology 
        <span class='distribution'>(SCI)</span>
    </p>
    <span class='normaltext'> Anthropology is the global and holistic study of human biology and behaviour, and includes four subfields: biological anthropology, archaeology, sociocultural anthropology and linguistics. The material covered is directed to answering the question: What makes us human? This course is a survey of biological anthropology and archaeology. [
        <span class='Helpcourse' onMouseover="showtip(this,event,'24 Lectures')" onMouseout="hidetip()">24L</span>, 
        <span class='Helpcourse' onMouseover="showtip(this,event,'12 Tutorials')" onMouseout="hidetip()">12T</span>]
        <br/>
        <span class='title2'>Exclusion: </span>
        <a href='javascript:OpenCourse("WEBCOURSENOTFOUND.html")'>ANT100Y5</a>
        <br/>
        <span class='title2'>Prerequisite: </span>
        <a href='javascript:OpenCourse("WEBCOURSEANT102H5.pl?fv=1")'>ANT102H5</a>
        <br/>
    </span>
    <br/>
    <br/>
    <br/>
</body>

this XPath expression:

normalize-space(/*/p/text()[1])

when evaluated, produces the wanted string (the surrounding quotes are not part of the result; I added them to show the exact string produced):

"ANT101H5 Introduction to Biological Anthropology and Archaeology"

This XPath expression:

concat((//span[@class='title2'])[1],
            (//span[@class='title2'])[1]
                   /following-sibling::a[1]
            )

when evaluated, produces the following wanted result:

"Exclusion: ANT100Y5"

This XPath expression:

concat((//span[@class='title2'])[2],
            (//span[@class='title2'])[2]
                   /following-sibling::a[1]
            )

when evaluated, produces the following wanted result:

"Prerequisite: ANT102H5"

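Note: In this particular case the // abbreviation is not needed. In fact, it should be avoided whenever possible, because it slows down evaluation of the expression and in many cases causes a complete (sub)tree traversal. I am using // on purpose because the provided XML fragment doesn't give us the full structure of the XML document. It also demonstrates how to index the results of // correctly (note the surrounding parentheses), which helps prevent a very frequent mistake.

UPDATE: The OP has asked for a single XPath expression that selects all the required text nodes. Here it is: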
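The effect of the parentheses can be made visible with a small Python sketch. It emulates the two XPath 1.0 behaviours in plain Python on top of the standard library's `xml.etree.ElementTree`, which implements only a subset of XPath; the two-block document is hypothetical and exists only to show the difference.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-course document, used only to make the difference visible.
DOC = ET.fromstring("""<body>
  <div>
    <span class="title2">Exclusion A: </span>
    <span class="title2">Prerequisite A: </span>
  </div>
  <div>
    <span class="title2">Exclusion B: </span>
  </div>
</body>""")

MATCH = "span[@class='title2']"

# (//span[@class='title2'])[1]: collect every match in document order,
# then take the first node of that single list.
all_matches = DOC.findall(".//" + MATCH)
global_first = [all_matches[0].text]

# //span[@class='title2'][1]: the predicate is applied per parent, so
# every parent contributes its own first matching child.
per_parent_first = [
    parent.findall(MATCH)[0].text
    for parent in DOC.iter()
    if parent.findall(MATCH)
]

print(global_first)
print(per_parent_first)
```

Without the parentheses the expression returns one node per parent rather than one node for the whole document, which is the mistake the note warns about.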

/*/p/text()[1]
   |
    (//span[@class='title2'])[1]/text()
   |
    (//span[@class='title2'])[1]/following-sibling::a[1]/text()
   |
    (//span[@class='title2'])[2]/text()
   |
    (//span[@class='title2'])[2]/following-sibling::a[1]/text()

When applied to the same XML document as above, the concatenation of the selected text nodes is exactly what is wanted:

ANT101H5 Introduction to Biological Anthropology and Archaeology          
        Exclusion: ANT100Y5Prerequisite: ANT102H5

This result can be confirmed by running the following XSLT transformation:

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>
 <xsl:strip-space elements="*"/>

 <xsl:template match="/">
  <xsl:copy-of select=
   "/*/p/text()[1]
   |
    (//span[@class='title2'])[1]/text()
   |
    (//span[@class='title2'])[1]/following-sibling::a[1]/text()
   |
    (//span[@class='title2'])[2]/text()
   |
    (//span[@class='title2'])[2]/following-sibling::a[1]/text()
   "/>
 </xsl:template>
</xsl:stylesheet>

When this transformation is applied to the same XML document (the one given earlier in this answer), the wanted, correct result is produced:

ANT101H5 Introduction to Biological Anthropology and Archaeology          
        Exclusion: ANT100Y5Prerequisite: ANT102H5

Finally: the following single XPath expression selects exactly the wanted text nodes in the HTML page at the provided link (after tidying it up to become well-formed XML):

  (//p[@class='titlestyle'])[2]/text()[1]
|
  (//span[@class='title2'])[2]/text()
|
  (//span[@class='title2'])[2]/following-sibling::a[1]/text()
|
  (//span[@class='title2'])[3]/text()
|
  (//span[@class='title2'])[3]/following-sibling::a[1]/text()
