python - Selecting image URLs from HTML

Question

I'm trying to select the image URLs from a very long HTML file. The file looks something like this:

...Lots_of_html><a href=somelink.com>Human Readable Text</a><img src="http://image.com">....

I want to pick out http://image.com from the HTML above. I've tried the following without any luck:

sed -n 's%.*src=%%;s%\".*%%p' image_urls.txt



import re
rex = re.compile(r'src=.(.*?)>',re.S|re.M)
data="<long html string>"
match = rex.match(data)

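I don't have much experience with regular expressions, so I suspect I'm making some basic errors in the above. Any help would be appreciated, but in particular I'd like to get one of the sed commands working so that I can easily integrate it into a bash script.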

Thanks in advance.

score 2 · Accepted Answer

I recommend using the lxml module together with urllib2 and an XPath query. For example:

#!/usr/bin/env python
# -*- coding: utf8 -*-
# vim:ts=4:sw=4
# Python 2 only: cookielib and urllib2 were renamed in Python 3.

import cookielib, urllib2
from lxml import etree

cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
# Set the User-agent on the opener before making the request.
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
page = opener.open("http://stackoverflow.com/q/14129900/465183")
tree = etree.HTML(page.read())

# Print the src attribute of every <img> element.
for img in tree.xpath('//img/@src'):
    print(img)

score 2 · Accepted Answer

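Since you tagged this as Python, I would use BeautifulSoup.

Beautiful Soup parses anything you give it, and does the tree traversal stuff for you. You can tell it "Find all the links", or "Find all the links of class externalLink", or "Find all the links whose urls match 'foo.com'", or "Find the table heading that's got bold text, then give me that text."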

>>> from bs4 import BeautifulSoup
>>> html = """<a href=somelink.com>Human Readable Text</a><img src="http://image.com">"""
>>> soup = BeautifulSoup(html, "html.parser")
>>> img_tags = soup.find_all("img")
>>> for img in img_tags:
...     print(img.get("src"))
...
http://image.com

Or you could do it even more simply:

>>> soup.find_all("img", src="http://image.com")
[<img src="http://image.com"/>]
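
If installing a third-party parser is not an option, the same traversal can be approximated with the standard library's html.parser. The following is a minimal Python 3 sketch, not part of the original answer; the class name ImgSrcCollector and the helper img_srcs are made up for illustration.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; attrs is a list of (name, value) pairs.
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.srcs.append(value)


def img_srcs(html):
    """Return the src values of all <img> tags in the given HTML string."""
    parser = ImgSrcCollector()
    parser.feed(html)
    parser.close()
    return parser.srcs


html = '<a href=somelink.com>Human Readable Text</a><img src="http://image.com">'
print(img_srcs(html))  # prints ['http://image.com']
```

Unlike a regular expression, this does not care whether the attribute is quoted with single or double quotes, or where src appears inside the tag.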

score 0 · Accepted Answer

perl

Since you already have two Python solutions, here is one way you could do it with perl's WWW::Mechanize:

perl -MWWW::Mechanize -e '
  $m = WWW::Mechanize->new;
  $m->get($ARGV[0]);
  $m->dump_images(undef, 1)' file://`pwd`/image_urls.txt

sed

If you can make some assumptions about the input, you can get away with a simple sed regex.

Here is how you could use sed with the test data you provided:

sed -n 's%.*src="\([^"]*\)".*%\1%p'

This captures the content between the quotes into \1 and removes everything else.

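You can also do it your way, as long as you pay attention to what gets matched. In your version the first substitution leaves the opening quote in place, so the second substitution deletes everything from that quote onward, including the URL. One way around this is to consume the opening quote in the first substitution: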

sed -n 's%.*src="%%; s%".*%%p'
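
For comparison, the same capture can be written in Python. The attempt in the question used rex.match, which only matches at the very start of the string, so it returns None for a long document; re.findall scans the whole string instead. This is a minimal Python 3 sketch that assumes, like the sed version, that every src value is double-quoted. The name find_img_srcs is hypothetical.

```python
import re

# Same assumption as the sed command: src values are wrapped in double quotes.
SRC_RE = re.compile(r'<img\b[^>]*?\bsrc="([^"]*)"', re.IGNORECASE)


def find_img_srcs(html):
    """Return every double-quoted src value found inside an <img ...> tag."""
    return SRC_RE.findall(html)


html = '<a href=somelink.com>Human Readable Text</a><img src="http://image.com">'
print(find_img_srcs(html))  # prints ['http://image.com']
```

Because the sed pattern starts with a greedy .*, it keeps only the last src on each input line, whereas findall returns every match.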

score -1 · Accepted Answer

You can use this function:

import re

import requests

#
#
# get_url_images_in_text()
#
# @param html - the HTML text to extract image URLs from.
# @param protocol - the protocol of the website (e.g. "http:"), prepended to URLs that do not start with one.
#
# @return list of image URLs.
#
#
def get_url_images_in_text(html, protocol):
    urls = []
    # Find all image URLs with a regex. Only .png and .jpg are matched here, but you can add any extension you want.
    all_urls = re.findall(r'((http\:|https\:)?\/\/[^"\' ]*?\.(png|jpg))', html, flags=re.IGNORECASE | re.MULTILINE | re.UNICODE)
    for url in all_urls:
        # Each match is a tuple of groups; url[0] is the whole URL.
        if not url[0].lower().startswith("http"):
            urls.append(protocol + url[0])
        else:
            urls.append(url[0])

    return urls

#
#
# get_images_from_url()
#
# @param url - the URL of the page to extract image URLs from.
#
# @return list of image URLs.
#
#
def get_images_from_url(url):
    # "http://host/page".split('/')[0] is "http:", which is prepended to "//host/..." URLs.
    protocol = url.split('/')[0]
    resp = requests.get(url)
    return get_url_images_in_text(resp.text, protocol)
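
Prepending the protocol by hand handles protocol-relative URLs such as //cdn.example.com/a.png. As an alternative sketch (not part of the original answer), urllib.parse.urljoin from the Python 3 standard library resolves such URLs against the page URL. The function name absolute_image_urls and the example.com addresses are placeholders chosen for illustration.

```python
import re
from urllib.parse import urljoin

# Same idea as the answer's regex: optional scheme, then //host/path ending in .png or .jpg.
IMAGE_URL_RE = re.compile(r'(?:https?:)?//[^"\' ]*?\.(?:png|jpg)', re.IGNORECASE)


def absolute_image_urls(html, page_url):
    """Return image URLs found in html, resolved against page_url."""
    return [urljoin(page_url, match) for match in IMAGE_URL_RE.findall(html)]


sample = '<img src="//cdn.example.com/a.png"> <img src="http://example.com/b.JPG">'
print(absolute_image_urls(sample, "https://example.com/page"))
# prints ['https://cdn.example.com/a.png', 'http://example.com/b.JPG']
```

Like the original function, this only finds URLs that contain //, so purely relative paths such as images/a.png are still missed.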
