python - Get all links from a web page's HTML source (python)

Question

I want to get all the links on a single web page. This function returns only one link, but I need all of them. I know I need a loop (something like `while True`), but I don't know how to use it.

I need to get all of the links.

def get_next_target(page):
    start_link = page.find('<a href=')
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    url = page[start_quote + 1:end_quote]
    return url, end_quote
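
For reference, the loop the question asks about could be sketched roughly as follows. This is only an illustration under assumptions: the `start` parameter and the `get_all_links` helper are hypothetical additions, not part of the original function, and the string search only handles links written exactly as `<a href="...">` with double quotes.

```python
def get_next_target(page, start=0):
    """Return (url, end_quote) for the first link at or after `start`.

    Returns (None, -1) when no further link is found.
    """
    start_link = page.find('<a href=', start)
    if start_link == -1:
        return None, -1
    start_quote = page.find('"', start_link)
    if start_quote == -1:
        return None, -1
    end_quote = page.find('"', start_quote + 1)
    if end_quote == -1:
        return None, -1
    return page[start_quote + 1:end_quote], end_quote


def get_all_links(page):
    """Collect every URL by resuming the search after each match."""
    links = []
    position = 0
    while True:
        url, end_quote = get_next_target(page, position)
        if url is None:
            break  # no more links in the remaining text
        links.append(url)
        position = end_quote + 1  # continue just past the closing quote
    return links
```

String searching like this is fragile (single-quoted attributes, extra attributes before `href`, and so on), which is why the answers suggest an HTML parser instead.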

score 2 · Accepted Answer

This is where an HTML parser comes in handy. I recommend BeautifulSoup:

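from bs4 import BeautifulSoup as BS

def get_next_target(page):
    # Name the parser explicitly; 'html.parser' ships with Python.
    soup = BS(page, 'html.parser')
    # Return every <a> tag that has an href attribute.
    return soup.find_all('a', href=True)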
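
Note that `find_all` returns tag objects rather than URL strings. A minimal sketch of pulling out the `href` values, assuming `bs4` is installed; the `get_all_links` name and the optional `base_url` parameter (used to resolve relative links) are hypothetical and not part of the answer above:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def get_all_links(page, base_url=""):
    """Return the href of every <a> tag in `page` as a list of strings.

    If `base_url` is given, relative links are resolved against it.
    """
    soup = BeautifulSoup(page, "html.parser")
    return [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]
```

Called as `get_all_links(html, "http://example.org/index.html")`, a link such as `/docs` would come back as `http://example.org/docs`; without `base_url` the hrefs are returned unchanged.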

score 1 · Accepted Answer

You can use lxml for that.

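import lxml.html

def get_all_links(page):
    # parse() expects a filename, URL, or file-like object, not an HTML string.
    document = lxml.html.parse(page)
    # Return every <a> element in the document.
    return document.xpath("//a")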
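
If the page is already in memory as a string, as in the question, a variant along these lines may be closer to what is needed. It is a sketch assuming `lxml` is installed; it swaps in `lxml.html.fromstring` and an XPath that selects the `href` attributes directly, and the `html_text` parameter name is only illustrative:

```python
import lxml.html


def get_all_links(html_text):
    """Return the href value of every <a> element in an HTML string."""
    document = lxml.html.fromstring(html_text)
    # "//a/@href" selects the attribute values rather than the elements.
    return document.xpath("//a/@href")
```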

score 0 · Accepted Answer

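from urllib.request import urlopen
from bs4 import BeautifulSoup, SoupStrainer

site = urlopen('http://somehwere/over/the/rainbow.html')
site_data = site.read()
# Parse only the <a> tags, then print the href of each one that has it.
soup = BeautifulSoup(site_data, 'html.parser', parse_only=SoupStrainer('a'))
for link in soup.find_all('a', href=True):
    print(link['href'])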
