python - Python: simple substring/parsing

Question

I have a string like this:

 <img src="http://www.askgamblers.com/cache/97299a130feb2e59a08a08817daf2c0e6825991f_begado-casino-logo-review1.jpg" /><br/>
 Begado is the newest online casino in our listings. As the newest
 member of the Affactive group, Begado features NuWorks slots and games
 for both US and international players.
<img src="http://feeds.feedburner.com/~r/AskgamblesCasinoNews/~4/SXhvCskjiYo" height="1" width="1"/>

I need to get the src from the first img tag.

Is there an easy way to do this?

score 4 · Accepted Answer

For HTML screen scraping in Python, I recommend the BeautifulSoup library.

from bs4 import BeautifulSoup

# html_doc is the HTML string from the question
soup = BeautifulSoup(html_doc, 'html.parser')
images = soup.find_all('img')
print(images[0]['src'])

score 2 · Accepted Answer

Obligatory "don't parse HTML with regex" warning: https://stackoverflow.com/a/1732454/505154

Evil regex solution:

import re
re.findall(r'<img\s*src="([^"]*)"\s*/>', text)

This returns a list containing the src attribute of every <img> tag that has only a src attribute (since you said you only wanted to match the first one).
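If only the first tag is wanted and the tag may carry other attributes, a slightly more tolerant pattern with re.search is one option. The following is a minimal sketch, not a robust HTML parser: the helper name first_img_src and the shortened sample URLs are made up for illustration, and it assumes the src value is double-quoted.

```python
import re

# Hypothetical sample input, shaped like the string in the question.
text = ('<img src="http://example.com/logo.jpg" /><br/>\n'
        ' Some descriptive text.\n'
        '<img src="http://example.com/pixel" height="1" width="1"/>')

# Find an <img ...> tag and capture its double-quoted src value.
# Other attributes may appear before or after src.
IMG_SRC = re.compile(r'<img\b[^>]*?\ssrc="([^"]*)"', re.IGNORECASE)

def first_img_src(html):
    """Return the src of the first matching <img> tag, or None."""
    match = IMG_SRC.search(html)
    return match.group(1) if match else None

print(first_img_src(text))
```

Because re.search stops at the first match, there is no need to build the full list and index into it. The warning above still applies: unusual markup (single quotes, unquoted values, comments) will defeat the pattern.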

score 0 · Accepted Answer

One way to do this is with a regular expression.

Another way is to split the string on double quotes and take the second element of the result.

splits = your_string.split('"')
print(splits[1])
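The split approach works only when the src value is the first double-quoted text in the string. A guarded variant might look like the sketch below; the helper name first_quoted and the sample strings are assumptions made for illustration.

```python
def first_quoted(text):
    """Return the text between the first pair of double quotes, or None."""
    parts = text.split('"')
    # At least three pieces are needed for one complete quoted value.
    return parts[1] if len(parts) >= 3 else None

your_string = '<img src="http://example.com/a.jpg" /><br/> some text'
print(first_quoted(your_string))                  # http://example.com/a.jpg
print(first_quoted('<img alt="logo" src="x"/>'))  # logo (not the src)
```

The second call shows the limitation: if any other quoted attribute comes before src, that value is returned instead.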

score 0 · Accepted Answer

Here is a quick and ugly way to do it without any libraries.

"""
    >>> get_src(data)
    ['http://www.askgamblers.com/cache/97299a130feb2e59a08a08817daf2c0e6825991f_begado-casino-logo-review1.jpg', 'http://feeds.feedburner.com/~r/AskgamblesCasinoNews/~4/SXhvCskjiYo']
"""

data = """<img src="http://www.askgamblers.com/cache/97299a130feb2e59a08a08817daf2c0e6825991f_begado-casino-logo-review1.jpg" /><br/>
 Begado is the newest online casino in our listings. As the newest
 member of the Affactive group, Begado features NuWorks slots and games
 for both US and international players.
<img src="http://feeds.feedburner.com/~r/AskgamblesCasinoNews/~4/SXhvCskjiYo" height="1" width="1"/>"""

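def get_src(text):
    """Return the src value found on each line of text that has one."""
    srcs = []
    for line in text.splitlines():
        marker = line.find('src="')
        if marker == -1:
            continue  # no src attribute on this line
        i = marker + len('src="')  # index of the first character of the URL
        f = line.find('"', i)      # index of the closing quote
        if f != -1:
            srcs.append(line[i:f])
    return srcs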

However, I would recommend using Beautiful Soup. It is a very good library designed to handle the real web (broken HTML and so on). If your data is valid XML, you can use ElementTree from the Python standard library.
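As a rough illustration of the ElementTree route, the sketch below parses a fragment shaped like the one in the question. It assumes the markup is well-formed XML (self-closed tags, no unescaped & characters); the sample URLs and the wrapper element name "root" are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment shaped like the one in the question.
fragment = ('<img src="http://example.com/logo.jpg" /><br/>\n'
            ' Some descriptive text.\n'
            '<img src="http://example.com/pixel" height="1" width="1"/>')

# The fragment has several top-level nodes, so wrap it in a single root
# element before parsing; "root" is an arbitrary name for this sketch.
root = ET.fromstring('<root>' + fragment + '</root>')

first_img = root.find('img')  # first <img> child of the wrapper, or None
src = first_img.get('src') if first_img is not None else None
print(src)
```

If the input is not well-formed, ET.fromstring raises xml.etree.ElementTree.ParseError, which is where a lenient parser such as Beautiful Soup is the better fit.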
