python - lxml: clean_html replaces html tag with div?

Question

I'm using lxml 3.1.0 (installed with easy_install) and I'm seeing a strange result:

>>> from lxml.html.clean import clean_html
>>> clean_html("<html><body><h1>hi</h1></body></html>")
'<div><body><h1>hi</h1></body></div>'

The html tag has been replaced with div.

The same thing happens with the sample HTML from http://lxml.de/lxmlhtml.html#cleaning-up-html.

What gives? Have I run into an lxml bug or a version incompatibility with libxml2, or is this somehow expected?

score 4 · Accepted Answer

I believe you need a Cleaner that leaves page_structure alone:

>>> from lxml.html.clean import Cleaner
>>> cleaner = Cleaner(page_structure=False)
>>> cleaner.clean_html("<html><body><h1>hi</h1></body></html>")
'<html><body><h1>hi</h1></body></html>'

As explained here, page_structure is True by default. I suspect the documentation on the site you linked is wrong or out of date.

Edit #1: Further confirmation that this is the expected behavior can be found in this test in the source code. A pull request has been submitted to fix the documentation.

Edit #2: The pull request was merged into master as of 2013-04-28.
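
To see the two settings side by side, here is a minimal sketch. It assumes lxml.html.clean is importable in your environment (recent lxml releases ship the cleaner as the separate lxml_html_clean package), and the variable names are mine, not from the lxml docs:

```python
from lxml.html.clean import Cleaner

html = "<html><body><h1>hi</h1></body></html>"

# Default settings: page_structure is True, so <html> counts as removable.
default_result = Cleaner().clean_html(html)

# page_structure=False leaves <html>, <head>, and <title> alone.
kept_result = Cleaner(page_structure=False).clean_html(html)

print(default_result)
print(kept_result)
```

Only the outermost tag should differ between the two results; the body content is the same in both.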

score 3 · Accepted Answer

Structural parts of the page such as <head>, <html>, and <title> are removed when page_structure=True, which is the default. To change this:

import lxml.html.clean as clean
content = '<html><body><h1>hi</h1></body></html>'
cleaner = clean.Cleaner(page_structure=False)
cleaned = cleaner.clean_html(content)
print(cleaned)
# <html><body><h1>hi</h1></body></html>

See the docstring of the clean.Cleaner class:

In [105]: clean.Cleaner?
Type:       type
String Form:<class 'lxml.html.clean.Cleaner'>
File:       /usr/lib/python2.7/dist-packages/lxml/html/clean.py
Definition: clean.Cleaner(self, doc)
Docstring:
Instances cleans the document of each of the possible offending
elements.  The cleaning is controlled by attributes; you can
override attributes in a subclass, or set them in the constructor.

``scripts``:
    Removes any ``<script>`` tags.

``javascript``:
    Removes any Javascript, like an ``onclick`` attribute.

``comments``:
    Removes any comments.

``style``:
    Removes any style tags or attributes.

``links``:
    Removes any ``<link>`` tags

``meta``:
    Removes any ``<meta>`` tags

``page_structure``:
    Structural parts of a page: ``<head>``, ``<html>``, ``<title>``.

``processing_instructions``:
    Removes any processing instructions.

``embedded``:
    Removes any embedded objects (flash, iframes)

``frames``:
    Removes any frame-related tags

``forms``:
    Removes any form tags

``annoying_tags``:
    Tags that aren't *wrong*, but are annoying.  ``<blink>`` and ``<marquee>``

``remove_tags``:
    A list of tags to remove.

``allow_tags``:
    A list of tags to include (default include all).

``remove_unknown_tags``:
    Remove any tags that aren't standard parts of HTML.

``safe_attrs_only``:
    If true, only include 'safe' attributes (specifically the list
    from `feedparser
    <http://feedparser.org/docs/html-sanitization.html>`_).

``add_nofollow``:
    If true, then any <a> tags will have ``rel="nofollow"`` added to them.

``host_whitelist``:
    A list or set of hosts that you can use for embedded content
    (for content like ``<object>``, ``<link rel="stylesheet">``, etc).
    You can also implement/override the method
    ``allow_embedded_url(el, url)`` or ``allow_element(el)`` to
    implement more complex rules for what can be embedded.
    Anything that passes this test will be shown, regardless of
    the value of (for instance) ``embedded``.

    Note that this parameter might not work as intended if you do not
    make the links absolute before doing the cleaning.

``whitelist_tags``:
    A set of tags that can be included with ``host_whitelist``.
    The default is ``iframe`` and ``embed``; you may wish to
    include other tags like ``script``, or you may want to
    implement ``allow_embedded_url`` for more control.  Set to None to
    include all tags.

This modifies the document *in place*.
Constructor information:
 Definition:clean.Cleaner(self, **kw)
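
The docstring mentions two things that are easy to miss: attributes can be overridden in a subclass, and calling a cleaner on a parsed tree modifies it in place. A rough sketch of both, assuming lxml.html.clean is importable (recent lxml releases ship it as the separate lxml_html_clean package); StructureKeepingCleaner is a hypothetical name I made up for illustration:

```python
import lxml.html
from lxml.html.clean import Cleaner


class StructureKeepingCleaner(Cleaner):
    # Hypothetical subclass: override the class-level default
    # instead of passing page_structure=False to the constructor.
    page_structure = False


root = lxml.html.fromstring("<html><body><h1>hi</h1></body></html>")

# Calling a cleaner on a parsed tree modifies that tree in place.
StructureKeepingCleaner()(root)
kept_tag = root.tag

# With the default settings, the same call changes the root element's tag.
Cleaner()(root)
default_tag = root.tag

print(kept_tag, default_tag)
```

If you want to keep the original tree untouched, use cleaner.clean_html(doc) instead, which returns a cleaned result rather than editing the tree you pass in.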
