web-scraping - html5lib: TypeError: __init__() got an unexpected keyword argument 'encoding'

Question

I am trying to install html5lib. At first I tried to install the latest version (eight or nine nines), but it conflicted with my BeautifulSoup, so I decided to try an older version (0.9999999, seven nines). I installed it, but when I try to use it:

>>> with urlopen("http://example.com/") as f:
    document = html5lib.parse(f, encoding=f.info().get_content_charset())

I get an error:

Traceback (most recent call last):
  File "<pyshell#11>", line 2, in <module>
    document = html5lib.parse(f, encoding=f.info().get_content_charset())
  File "C:\Python\Python35-32\lib\site-packages\html5lib\html5parser.py", line 35, in parse
    return p.parse(doc, **kwargs)
  File "C:\Python\Python35-32\lib\site-packages\html5lib\html5parser.py", line 235, in parse
    self._parse(stream, False, None, *args, **kwargs)
  File "C:\Python\Python35-32\lib\site-packages\html5lib\html5parser.py", line 85, in _parse
    self.tokenizer = _tokenizer.HTMLTokenizer(stream, parser=self, **kwargs)
  File "C:\Python\Python35-32\lib\site-packages\html5lib\_tokenizer.py", line 36, in __init__
    self.stream = HTMLInputStream(stream, **kwargs)
  File "C:\Python\Python35-32\lib\site-packages\html5lib\_inputstream.py", line 151, in HTMLInputStream
    return HTMLBinaryInputStream(source, **kwargs)
TypeError: __init__() got an unexpected keyword argument 'encoding'

What is wrong, and what should I do?

score 9 · Accepted Answer

It looks like something in the latest version of html5lib is broken with respect to bs4. Pinning the older version works fine:

 pip3 install -U html5lib=="0.9999999"

Tested with bs4 4.4.1:

In [1]: import bs4

In [2]: bs4.__version__
Out[2]: '4.4.1'

In [3]: import html5lib

In [4]: html5lib.__version__
Out[4]: '0.9999999'

In [5]: from urllib.request import  urlopen

In [6]: with urlopen("http://example.com/") as f:
   ...:         document = html5lib.parse(f, encoding=f.info().get_content_charset())
   ...:     

In [7]:
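
The session above reads `__version__` from modules that are already imported. As a rough sketch of the same check without importing anything, the standard library can report what pip has installed for the current interpreter. This assumes Python 3.8 or later (where `importlib.metadata` exists); `installed_version` is a hypothetical helper name, not part of html5lib or bs4.

```python
from importlib import metadata  # standard library on Python 3.8+


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Distribution names as used on PyPI; bs4 is published as "beautifulsoup4".
print("html5lib:", installed_version("html5lib"))
print("beautifulsoup4:", installed_version("beautifulsoup4"))
```

If the html5lib line does not print the version you pinned, the `pip3` you ran probably belongs to a different Python installation than the one you are testing in.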

You can see the change in this commit, "Rename treebuilders._base to .base to reflect public status", where the name was changed.

The error you see is because you are still using the newest version: in html5lib/_inputstream.py, HTMLBinaryInputStream has no encoding argument:

class HTMLBinaryInputStream(HTMLUnicodeInputStream):
    """Provides a unicode stream of characters to the HTMLTokenizer.

    This class takes care of character encoding and removing or replacing
    incorrect byte-sequences and also provides column and line tracking.

    """

    def __init__(self, source, override_encoding=None, transport_encoding=None,
                 same_origin_parent_encoding=None, likely_encoding=None,
                 default_encoding="windows-1252", useChardet=True):

Setting override_encoding=f.info().get_content_charset() should do the trick.
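
If you want to check for yourself which keyword names a constructor accepts, `inspect.signature` will list them. The sketch below runs against a hypothetical stand-in class (`StandInInputStream`) that only copies the parameter list quoted above, so it works without html5lib installed; `keyword_names` is likewise a made-up helper name.

```python
import inspect


class StandInInputStream:
    """Hypothetical stand-in that copies the parameter names quoted above."""

    def __init__(self, source, override_encoding=None, transport_encoding=None,
                 same_origin_parent_encoding=None, likely_encoding=None,
                 default_encoding="windows-1252", useChardet=True):
        self.source = source


def keyword_names(cls):
    """List the parameter names of cls.__init__, skipping self and the source argument."""
    names = list(inspect.signature(cls.__init__).parameters)
    return names[2:]


names = keyword_names(StandInInputStream)
print(names)
print("encoding" in names)  # no plain 'encoding' keyword in this signature
```

With html5lib installed, you could pass the real class from the module shown in your traceback (`html5lib/_inputstream.py`) instead of the stand-in. Keep in mind that it is a private module, so its layout may differ between releases.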

Also, upgrading to the latest version of bs4 works fine with the latest version of html5lib:

In [16]: bs4.__version__
Out[16]: '4.5.1'

In [17]: html5lib.__version__
Out[17]: '0.999999999'

In [18]: with urlopen("http://example.com/") as f:
   ....:     document = html5lib.parse(f, override_encoding=f.info().get_content_charset())
   ....:     

In [19]:
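
If the same script has to run against either of the two versions shown in the sessions above, one option is to try the newer keyword first and fall back to the older one. This is only a sketch: `parse_with_charset` is a hypothetical helper, and the two `*_style_parse` functions are stand-ins that mimic just the keyword each version accepts, so the example runs without html5lib.

```python
def parse_with_charset(parse, source, charset):
    """Call `parse` with whichever charset keyword it accepts."""
    try:
        # Keyword accepted by the newer version (0.999999999 above).
        return parse(source, override_encoding=charset)
    except TypeError:
        # Assumption: the TypeError means the keyword was rejected, so
        # retry with the keyword accepted by the older version (0.9999999).
        return parse(source, encoding=charset)


# Hypothetical stand-ins for the two APIs; they are not html5lib code.
def old_style_parse(doc, encoding=None):
    return ("old", encoding)


def new_style_parse(doc, override_encoding=None):
    return ("new", override_encoding)


print(parse_with_charset(old_style_parse, b"<p>hi</p>", "utf-8"))
print(parse_with_charset(new_style_parse, b"<p>hi</p>", "utf-8"))
```

With the real library the call would be `parse_with_charset(html5lib.parse, f, f.info().get_content_charset())`. Note that a TypeError raised for some unrelated reason would also trigger the fallback, so pinning matching versions of bs4 and html5lib is still the cleaner fix.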
