python - Disable the "--" comment check in lxml

Question

Use case:

lxml fails to parse https://www.banca-romaneasca.ro/en/tools-and-resources/.

...
/opt/python-env/ciur/local/lib/python2.7/site-packages/html5lib/html5parser.py:468: in processComment
    self.tree.insertComment(token, self.tree.openElements[-1])
/opt/python-env/ciur/local/lib/python2.7/site-packages/html5lib/treebuilders/etree_lxml.py:312: in insertCommentMain
    super(TreeBuilder, self).insertComment(data, parent)
/opt/python-env/ciur/local/lib/python2.7/site-packages/html5lib/treebuilders/_base.py:262: in insertComment
    parent.appendChild(self.commentClass(token["data"]))
/opt/python-env/ciur/local/lib/python2.7/site-packages/html5lib/treebuilders/etree.py:148: in __init__
    self._element = ElementTree.Comment(data)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

- src/lxml/lxml.etree.pyx:3017: ValueError: Comment may not contain '--' or end with '-'

The error comes from lxml: https://github.com/lxml/lxml/blob/master/src/lxml/lxml.etree.pyx#L3017

This is the offending comment found at https://www.banca-romaneasca.ro/en/tools-and-resources/:

...
<script type="text/javascript" src="/_res/js/forms.js"></script>

<!-- Google Code for Remarketing Tag -->
<!--------------------------------------------------
Remarketing tags may not be associated with personally identifiable information or placed on pages related to sensitive categories. See more information and instructions on how to setup the tag on: http://google.com/ads/remarketingsetup
--------------------------------------------------->
<script type="text/javascript">
/* <![CDATA[ */
var google_conversion_id = 958631629;
var google_custom_params = window.google_tag_params;
...

I am looking for a solution along these lines:

Disable the check below (some magic or a flag in lxml):

if b'--' in text or text.endswith(b'-'):
    raise ValueError("Comment may not contain '--' or end with '-'")

Monkey patching (code changes, injection, ...)
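
To illustrate the kind of workaround meant here, the comment text could be rewritten before it ever reaches lxml. This is only a minimal sketch: `sanitize_comment` is a hypothetical helper, not part of lxml or html5lib, and how to hook it into the tree builder is left open.

```python
import re


def sanitize_comment(data):
    """Return comment text with no '--' and no trailing '-'.

    Hypothetical helper for illustration only; it is not an lxml
    or html5lib API.
    """
    # Split every run of two or more hyphens into single hyphens
    # separated by spaces, e.g. "---" becomes "- - -".
    data = re.sub(r"-{2,}", lambda m: " ".join(m.group()), data)
    # A comment may not end with '-', so pad with a space if needed.
    if data.endswith("-"):
        data += " "
    return data
```

The text changes slightly (extra spaces), which is usually acceptable for comments but does mean the parsed tree no longer matches the source byte for byte.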

Update 1:

I am using html5lib because I want HTML5 tags such as audio, section, and video to be available.

import html5lib
from lxml.html import html5parser, fromstring

context = fromstring(document.content)  # works
context = html5parser.fromstring(document.content)  # does not work

context = html5lib.parse(  # does not work
    document.content,
    treebuilder="lxml",
    namespaceHTMLElements=document.namespace,
    encoding=document.encoding
)

Versions:

html5lib==0.9999999
lxml==3.5.0 (downgrading lxml does not solve it either)

Update 2:

This looks like an lxml improvement/issue: https://github.com/lxml/lxml/pull/172#issuecomment-169084439

Waiting for feedback from the lxml developers.

Update 3:

Got feedback: it appears to be a bug in html5lib. The latest development version on GitHub already contains a fix.

score 2 · Accepted Answer

I found a solution based on @opottone's comment on GitHub.

I installed the latest version of html5parser from GitHub. Now I only get a warning instead of an error.
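
If the warning gets noisy, Python's standard `warnings` module can record or filter it around the parse call. A minimal sketch follows; `parse_with_warning` is a hypothetical stand-in for the real parser call, and the warning class and message are assumptions, since the ones html5lib actually emits are not stated here.

```python
import warnings


def parse_with_warning(comment_text):
    # Hypothetical stand-in for a parser that warns about a bad
    # comment instead of raising ValueError.
    if "--" in comment_text:
        warnings.warn("comment contains '--'", UserWarning)
    return comment_text


# Record warnings raised during parsing instead of printing them.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = parse_with_warning("a -- b")

messages = [str(w.message) for w in caught]
```

Replacing `simplefilter("always")` with `simplefilter("ignore")` would silence the warnings entirely, at the cost of hiding any data loss they report.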

score 1 · Accepted Answer

Since you are parsing HTML data, use lxml.html and not lxml.etree.

This worked for me:

>>> import requests
>>> import lxml.html
>>> 
>>> data = requests.get("https://www.banca-romaneasca.ro/en/tools-and-resources/").content
>>> tree = lxml.html.fromstring(data)
>>> tree.xpath("//title/text()")
['Tools and resources - Banca Romaneasca']
