python - urllib を使用してソースコードを検索する

Question

Web サイトのソースコード内のテキストを検索するスクリプトを作成しようとしています。私はそれを持っているので、ソースコードを正常に取得して出力し、次のようになります: b'<?xml version="1.0" encoding="UTF-8" ?>\n<!DOCTYPE html...など

ただし、を使用してコード内の 'div' タグを検索しようとすると、バイトリテラルを受け取っているという事実に関係していると思わprint(page.find('div'))れるというエラーが表示されます。TypeError: Type str doesn't support the buffer API文字列を検索できるようにするには、これを UTF-8 または ASCII としてエンコードするにはどうすればよいですか?

必要に応じて、実行している簡単なコードを次に示します。

import urllib.request
from urllib.error import URLError

def get_page(url):
  #make the request
  req = urllib.request.Request(url)
  the_page = urllib.request.urlopen(req)

  #get the results of the request
  try:
    #read the page
    page = the_page.read()
    print(page)
    print(page.find('div'))

  #except error
  except URLError as e:
    #if error has a reason (thus is url error) print the reason
    if hasattr(e, 'reason'):
      print(e.reason)
    #if error has a code (thus is html error) print the code and the error
    if hasattr(e, 'code'):
      print(e.code)
      print(e.read())

score 0 · Accepted Answer

Python v.3を使用していると思います（ステートメントではなく関数としての印刷から述べられているように）。

Python 3 では、ページはバイトオブジェクトです。そのため、バイトオブジェクトも使用して検索する必要があります。これを試してください：

print(page.find(b'div'))

これが役立つことを願っています

python - urllib を使用してソース コードを検索する

1 に答える 1

Related

Reference

python - urllib を使用してソースコードを検索する