python - Using lxml to parse namespaced HTML?

Question

This is driving me completely nuts. I have been struggling with it for hours. Any help would be much appreciated.

I'm using PyQuery 1.2.9 (which is built on top of lxml) to scrape this URL. I just want to get a list of all the links in the .linkoutlist section.

This is my request in full:

response = requests.get('http://www.ncbi.nlm.nih.gov/pubmed/?term=The%20cost-effectiveness%20of%20mirtazapine%20versus%20paroxetine%20in%20treating%20people%20with%20depression%20in%20primary%20care')
doc = pq(response.content)
links = doc('#maincontent .linkoutlist a')
print links

But that returns an empty list. If I use this query instead:

links = doc('#maincontent .linkoutlist')

then I get this HTML back:

<div xmlns="http://www.w3.org/1999/xhtml" xmlns:xi="http://www.w3.org/2001/XInclude" class="linkoutlist">
   <h4>Full Text Sources</h4>
   <ul>
      <li><a title="Full text at publisher's site" href="http://meta.wkhealth.com/pt/pt-core/template-journal/lwwgateway/media/landingpage.htm?issn=0268-1315&amp;volume=19&amp;issue=3&amp;spage=125" ref="itool=Abstract&amp;PrId=3159&amp;uid=15107654&amp;db=pubmed&amp;log$=linkoutlink&amp;nlmid=8609061" target="_blank">Lippincott Williams &amp; Wilkins</a></li>
      <li><a href="http://ovidsp.ovid.com/ovidweb.cgi?T=JS&amp;PAGE=linkout&amp;SEARCH=15107654.ui" ref="itool=Abstract&amp;PrId=3682&amp;uid=15107654&amp;db=pubmed&amp;log$=linkoutlink&amp;nlmid=8609061" target="_blank">Ovid Technologies, Inc.</a></li>
   </ul>
   <h4>Other Literature Sources</h4>
   ...
</div>

So the parent selector returns HTML that contains plenty of <a> tags. It also appears to be valid HTML.

More experimentation reveals that, for some reason, lxml does not like the xmlns attribute on the opening div.

How can I get lxml to ignore this and parse the page like regular HTML?

UPDATE: Tried ns_clean, still failing:

    parser = etree.XMLParser(ns_clean=True)
    tree = etree.parse(StringIO(response.content), parser)
    sel = CSSSelector('#maincontent .rprt_all a')
    print sel(tree)

score 6 · Accepted Answer

You need to handle namespaces, including the empty (default) one.

Working solution:

from pyquery import PyQuery as pq
import requests


response = requests.get('http://www.ncbi.nlm.nih.gov/pubmed/?term=The%20cost-effectiveness%20of%20mirtazapine%20versus%20paroxetine%20in%20treating%20people%20with%20depression%20in%20primary%20care')

namespaces = {'xi': 'http://www.w3.org/2001/XInclude', 'test': 'http://www.w3.org/1999/xhtml'}
links = pq('#maincontent .linkoutlist test|a', response.content, namespaces=namespaces)
for link in links:
    print link.attrib.get("title", "No title")

This prints the titles of all links that match the selector:

Full text at publisher's site
No title
Free resource
Free resource
Free resource
Free resource
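
To see why the un-prefixed selector matches nothing, here is a minimal sketch (Python 3 syntax, not part of the original answer). It uses a trimmed-down, made-up copy of the markup with placeholder example.com URLs, and it assumes the page was parsed as XML, which is what the output in the question suggests. It falls back to the standard library's ElementTree when lxml is not installed, since both offer the same findall interface for this case.

```python
try:
    from lxml import etree
except ImportError:  # fallback so the sketch runs without lxml
    import xml.etree.ElementTree as etree

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Illustrative stand-in for the real page; the URLs are placeholders.
SNIPPET = """
<div xmlns="http://www.w3.org/1999/xhtml" class="linkoutlist">
  <h4>Full Text Sources</h4>
  <ul>
    <li><a title="Full text at publisher's site" href="http://example.com/a">Publisher</a></li>
    <li><a href="http://example.com/b">Ovid</a></li>
  </ul>
</div>
"""

root = etree.fromstring(SNIPPET.strip())

# The default xmlns puts every element into the XHTML namespace,
# so the real tag name is "{namespace-uri}localname".
print(root.tag)

# A plain "a" means "an <a> in no namespace", which does not exist here.
unqualified = root.findall(".//a")

# Mapping any prefix to the XHTML URI makes the lookup succeed.
qualified = root.findall(".//x:a", {"x": XHTML_NS})

titles = [a.get("title", "No title") for a in qualified]
print(len(unqualified), len(qualified))
print(titles)
```

The prefix name is arbitrary (x here, test in the answer's code); only the URI it maps to matters.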

Alternatively, just set parser to "html" and forget about namespaces:

links = pq('#maincontent .linkoutlist a', response.content, parser="html")
for link in links:
    print link.attrib.get("title", "No title")
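
The reason this works can be sketched with lxml directly. As far as I can tell, its HTML parser does not apply XML namespace rules, so xmlns ends up as an ordinary attribute and the tags keep their plain names. The snippet below is a cut-down, illustrative copy of the markup (the example.com URLs are placeholders), written in Python 3 syntax, and uses a relative XPath rather than PyQuery so that it needs nothing beyond lxml.

```python
import lxml.html

# Illustrative stand-in for the real page; the URLs are placeholders.
SNIPPET = """
<div xmlns="http://www.w3.org/1999/xhtml" class="linkoutlist">
  <h4>Full Text Sources</h4>
  <ul>
    <li><a title="Full text at publisher's site" href="http://example.com/a">Publisher</a></li>
    <li><a href="http://example.com/b">Ovid</a></li>
  </ul>
</div>
"""

# Parse with the HTML parser instead of the XML parser.
root = lxml.html.fromstring(SNIPPET.strip())

# Tags are plain names now, so an un-prefixed lookup finds the links.
links = root.xpath(".//a")
titles = [a.get("title", "No title") for a in links]

for title in titles:
    print(title)
```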
