python - How do I get all the links from an HTML file in Python using easyhtmlparser?

Question

I am using the HTML parser from http://easyhtmlparser.sourceforge.net/ and trying to get all the links and images from a page.

fd = open('file.html', 'r')
data = fd.read()
fd.close()
html = Html()
dom = html.feed(data)
for ind in dom.sail():
    if ind.name == 'a':
        print ind.attr['ref']

score 1 · Accepted Answer

I don't particularly want to read the easyhtmlparser documentation, but if you are willing to use Beautiful Soup, you can do it like this:

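from bs4 import BeautifulSoup

# Read the whole file; the with-block closes it automatically
with open('file.html', 'r') as fd:
    data = fd.read()

# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(data, 'html.parser')
for link in soup.find_all('a'):
    print(link.get('href'))  # or do whatever you need with it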

It should work, but I haven't tested it. Good luck!

Edit: I have tested it now. It works.

Edit 2: To find the images, search for all the img tags and read their src attributes. You should be able to find out how to do that in the Beautiful Soup or easyhtmlparser documentation.
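The image step can be sketched without either library. The example below is only an illustration under stated assumptions: it swaps in the standard library's html.parser in place of Beautiful Soup or easyhtmlparser, and the class name LinkAndImageCollector and the sample markup are made up for the example.

```python
from html.parser import HTMLParser


class LinkAndImageCollector(HTMLParser):
    """Collect href values of <a> tags and src values of <img> tags.

    Hypothetical helper built on the standard library parser; it is not
    part of Beautiful Soup or easyhtmlparser.
    """

    def __init__(self):
        super().__init__()
        self.links = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        attributes = dict(attrs)
        if tag == "a" and attributes.get("href"):
            self.links.append(attributes["href"])
        elif tag == "img" and attributes.get("src"):
            self.images.append(attributes["src"])


# Made-up sample markup, just to show the calls.
sample = '<p><a href="/about">About</a> <img src="logo.png" alt="logo"></p>'
collector = LinkAndImageCollector()
collector.feed(sample)
print(collector.links)
print(collector.images)
```

Tags that lack the attribute (an a tag with no href, for instance) are skipped instead of producing None entries.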

To download them and put them in a folder:

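import os
import urllib.request  # Python 3; in Python 2 the call was urllib.urlretrieve
# IMAGE_URL, path_to_folder and imagename are placeholders to fill in
urllib.request.urlretrieve(IMAGE_URL, os.path.join(path_to_folder, imagename))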

Or you can simply read the data through urllib, since in the end it is all just a string and reading is simpler than retrieving.
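A minimal sketch of the read-based approach, assuming Python 3 (where the data comes back as bytes, not a text string). The function name save_image and the fallback file name "image" are hypothetical choices for this example, not something the answer specifies.

```python
import os
from urllib.parse import urlsplit
from urllib.request import urlopen  # Python 3; Python 2 used urllib.urlopen


def save_image(image_url, folder):
    """Read the bytes behind image_url and write them into folder.

    Hypothetical helper; returns the path of the file it wrote.
    """
    # Take the file name from the last path segment of the URL.
    name = os.path.basename(urlsplit(image_url).path) or "image"
    target = os.path.join(folder, name)
    os.makedirs(folder, exist_ok=True)
    with urlopen(image_url) as response:
        data = response.read()  # raw bytes, so the file is opened in "wb" mode
    with open(target, "wb") as out:
        out.write(data)
    return target
```

Calling it once per src value found in the page would put every image into the same folder. Relative src values would first need to be joined with the page URL, for example with urllib.parse.urljoin.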

score 0 · Accepted Answer

I would do it like this.

from ehp import *

with open('file.html', 'r') as fd:
    data = fd.read()

html = Html()
dom = html.feed(data)

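# Walk every node in the document and pick out links and images
for ind in dom.sail():
    if ind.name == 'a':
        print(ind.attr['href'])  # print() with one argument works on Python 2 and 3
    elif ind.name == 'img':
        print(ind.attr['src'])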
