python - upper case html tags encoded in lxml

Question

I am parsing an HTML file using lxml.html. The file contains tags in both lowercase and uppercase letters. Part of my code is shown below:

        response = urllib2.urlopen(link)
        html = response.read().decode('cp1251')
        content_html = etree.HTML(html)
        first_link_xpath =  content_html.xpath('//TR')
        print (first_link_xpath)

A small part of my HTML file is shown below:

<TR>
    <TR vAlign="top" align="left">
        <!--<TD><B  onmouseover="tips.Display('Metadata_WEB', event)" onmouseout="tips.Hide('Metadata_WEB')">Meta Data:</B></TD>-->
        <TD></TD>
    </TR>
 </TR>

When I run the code above on this HTML sample, it returns an empty list. I then tried first_link_xpath = content_html.xpath('//tr/node()'), and all the uppercase tags were represented as \r\n\t\t\t\t' in the output. What is the reason behind this?

Note: If the question is unclear, please let me know and I will revise it.

score 1 · Accepted Answer

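To follow up on unutbu's answer, I suggest comparing lxml's XML parser with its HTML parser, in particular by looking at the output of lxml.etree.tostring(). You can see how the tags, their case, and the hierarchy differ between the two (the hierarchy may not be what a human would expect ;)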

$ python
>>> import lxml.etree
>>> doc = """<TR>
...     <TR vAlign="top" align="left">
...         <!--<TD><B  onmouseover="tips.Display('Metadata_WEB', event)" onmouseout="tips.Hide('Metadata_WEB')">Meta Data:</B></TD>-->
...         <TD></TD>
...     </TR>
...  </TR>"""
>>> xmldoc = lxml.etree.fromstring(doc)
>>> xmldoc
<Element TR at 0x1e79b90>
>>> htmldoc = lxml.etree.HTML(doc)
>>> htmldoc
<Element html at 0x1f0baa0>
>>> lxml.etree.tostring(xmldoc)
'<TR>\n    <TR vAlign="top" align="left">\n        <!--<TD><B  onmouseover="tips.Display(\'Metadata_WEB\', event)" onmouseout="tips.Hide(\'Metadata_WEB\')">Meta Data:</B></TD>-->\n        <TD/>\n    </TR>\n </TR>'
>>> lxml.etree.tostring(htmldoc)
'<html><body><tr/><tr valign="top" align="left"><!--<TD><B  onmouseover="tips.Display(\'Metadata_WEB\', event)" onmouseout="tips.Hide(\'Metadata_WEB\')">Meta Data:</B></TD>--><td/>\n    </tr></body></html>'
>>>

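With the HTML parser, you can see that enclosing html and body tags are created, and that there is an empty tr node at the beginning. This is because in HTML a tr cannot directly contain another tr (the HTML fragment you provided either contains a typo, or the original document is broken as well).

Then, as unutbu suggested, you can try out different XPath expressions: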

>>> xmldoc.xpath('//tr')
[]
>>> xmldoc.xpath('//TR')
[<Element TR at 0x1e79b90>, <Element TR at 0x1f0baf0>]
>>> xmldoc.xpath('//TR/node()')
['\n    ', <Element TR at 0x1f0baf0>, '\n        ', <!--<TD><B  onmouseover="tips.Display('Metadata_WEB', event)" onmouseout="tips.Hide('Metadata_WEB')">Meta Data:</B></TD>-->, '\n        ', <Element TD at 0x1f0bb40>, '\n    ', '\n ']
>>> 
>>> htmldoc.xpath('//tr')
[<Element tr at 0x1f0bbe0>, <Element tr at 0x1f0bc30>]
>>> htmldoc.xpath('//TR')
[]
>>> htmldoc.xpath('//tr/node()')
[<!--<TD><B  onmouseover="tips.Display('Metadata_WEB', event)" onmouseout="tips.Hide('Metadata_WEB')">Meta Data:</B></TD>-->, <Element td at 0x1f0bbe0>, '\n    ']
>>>

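Indeed, as unutbu stressed, for HTML the XPath expressions should use lowercase tag names to select elements.

To me, the '\r\n\t\t\t\t' output is not an error; it is simply the whitespace between the various tr and td tags. For text content, if you do not want this whitespace, you can use, for example, lxml.etree.tostring(element, method="text", encoding=unicode).strip(), where element comes from the XPath result. (This removes leading and trailing whitespace.) Note that the method argument is important: by default, tostring() outputs the markup representation, as in the tests above.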
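If the same expression has to work on both the XML-parsed and the HTML-parsed tree, one possible approach is to compare a lowercased local-name() instead of writing the tag name directly. XPath 1.0 has no lower-case() function, so this sketch uses translate(). The helper name find_ci and the shortened sample document are made up for illustration and are not part of the original answer.

```python
import lxml.etree

# Shortened sample document (an assumption, not the asker's full file).
doc = "<TR><TD>cell</TD></TR>"

UPPER = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
LOWER = "abcdefghijklmnopqrstuvwxyz"


def find_ci(root, name):
    """Hypothetical helper: select elements whose tag matches `name`,
    ignoring ASCII case, by lowercasing local-name() with translate()."""
    expr = "//*[translate(local-name(), $upper, $lower) = $name]"
    # lxml passes keyword arguments to xpath() as XPath variables.
    return root.xpath(expr, upper=UPPER, lower=LOWER, name=name.lower())


xml_root = lxml.etree.fromstring(doc)   # XML parser keeps the case as written
html_root = lxml.etree.HTML(doc)        # HTML parser lowercases tag names

xml_hits = find_ci(xml_root, "tr")
html_hits = find_ci(html_root, "TR")

print([el.tag for el in xml_hits])
print([el.tag for el in html_hits])
```

This is slower than a plain //tr and only handles ASCII letters, so when the document is always parsed with the HTML parser, writing the tag names in lowercase is probably the simpler choice.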

>>> map(lambda element: lxml.etree.tostring(element, method="text", encoding=unicode), htmldoc.xpath('//tr'))
[u'', u'\n    ']
>>>

You can also verify that the text representation consists only of whitespace.
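The session above is Python 2 (it relies on the unicode built-in and on map() returning a list). A rough Python 3 equivalent, assuming a reasonably recent lxml, could look like the sketch below. It also shows one way to drop whitespace-only text nodes directly in the XPath expression; that predicate is an addition for illustration, not something from the original answer, and the sample document is shortened.

```python
import lxml.etree

# Shortened version of the sample fragment (the comment node is omitted).
doc = """<TR>
    <TR vAlign="top" align="left">
        <TD></TD>
    </TR>
 </TR>"""

htmldoc = lxml.etree.HTML(doc)

# Python 3: pass encoding="unicode" to get a str back instead of bytes.
texts = [
    lxml.etree.tostring(el, method="text", encoding="unicode")
    for el in htmldoc.xpath("//tr")
]
stripped = [t.strip() for t in texts]

# Keep every child node except text nodes that contain only whitespace.
nodes = htmldoc.xpath("//tr/node()[not(self::text()) or normalize-space()]")

print(texts)
print(stripped)
print(nodes)
```

Filtering in XPath keeps the element and comment nodes intact, whereas strip() only cleans up strings that have already been extracted.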

score 0 · Accepted Answer

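The HTML parser converts all tag names to lowercase. That is why xpath('//TR') returns an empty list.

I cannot reproduce the second problem, where the uppercase tags are printed as \r\n\t\t\t\t'. Could you modify the code below so that it demonstrates the problem?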

import lxml.etree as ET

content = '''\
<TR>
    <TR vAlign="top" align="left">
        <!--<TD><B  onmouseover="tips.Display('Metadata_WEB', event)" onmouseout="tips.Hide('Metadata_WEB')">Meta Data:</B></TD>-->
        <TD></TD>
    </TR>
 </TR>'''

root = ET.HTML(content)
print(root.xpath('//TR'))
# []
print(root.xpath('//tr/node()'))
# [<!--<TD><B  onmouseover="tips.Display('Metadata_WEB', event)" onmouseout="tips.Hide('Metadata_WEB')">Meta Data:</B></TD>-->, <Element td at 0xb77463ec>, '\n    ']
print(root.xpath('//tr'))
# [<Element tr at 0xb77462fc>, <Element tr at 0xb77463ec>]
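To see the lowercasing described above directly, one could list the tag names each parser produces for the same input. This is a minimal sketch using a small, well-formed table fragment chosen for illustration (an assumption, not the asker's document).

```python
import lxml.etree

# Illustrative fragment with uppercase tag names.
doc = "<TABLE><TR><TD>cell</TD></TR></TABLE>"

# '//*' selects element nodes only, so comments and text are skipped.
xml_tags = [el.tag for el in lxml.etree.fromstring(doc).xpath("//*")]
html_tags = [el.tag for el in lxml.etree.HTML(doc).xpath("//*")]

print(xml_tags)   # the XML parser keeps the names exactly as written
print(html_tags)  # the HTML parser reports lowercase names (plus added wrappers)
```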
