python - How do I run a regex on an attribute value after extracting it with BeautifulSoup?

Question

I have a URL that I want to parse part of, specifically the widgetid.

<a href="http://www.somesite.com/process.asp?widgetid=4530">Widgets Rock!</a>

I wrote this Python (I'm a Python beginner; the version is 2.7):

import re
from bs4 import BeautifulSoup

doc = open('c:\Python27\some_xml_file.txt')
soup = BeautifulSoup(doc)


links = soup.findAll('a')

# debugging statements

print type(links[7])
# output: <class 'bs4.element.Tag'>

print links[7]
# output: <a href="http://www.somesite.com/process.asp?widgetid=4530">Widgets Rock!</a>

theURL = links[7].attrs['href']
print theURL
# output: http://www.somesite.com/process.asp?widgetid=4530

print type(theURL)
# output: <type 'unicode'>

is_widget_url = re.compile('[0-9]')
print is_widget_url.match(theURL)
# output: None (I know this isn't the correct regex but I'd think it
#         would match if there's any number in there!)

I think I'm missing something in the regex (or in my understanding of how to use regexes), but I can't figure out what.

Thanks for your help!

score 5 · Accepted Answer

This question has nothing to do with BeautifulSoup.

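The problem is that, as the documentation explains, match only matches at the beginning of the string. The digits you are looking for are at the end of the string, so it returns None.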

To match digits anywhere in the string, use search instead. You should probably also use the \d escape for digits.

matches = re.search(r'\d+', theURL)
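
To make the difference concrete, here is a minimal sketch comparing the two calls on the URL from the question. The pattern widgetid=(\d+) and the variable names are my own illustration, not part of the original answer.

```python
import re

the_url = 'http://www.somesite.com/process.asp?widgetid=4530'

# match() anchors at position 0, and the URL starts with 'h', not a digit.
at_start = re.match(r'\d+', the_url)

# search() scans the whole string and returns the first run of digits.
anywhere = re.search(r'\d+', the_url)

# Assumed refinement: capture only the digits that follow 'widgetid='.
widget = re.search(r'widgetid=(\d+)', the_url)
widget_id = widget.group(1) if widget else None

print(at_start)          # None
print(anywhere.group())  # 4530
print(widget_id)         # 4530
```

Anchoring the pattern on the parameter name keeps it from picking up unrelated digits elsewhere in a URL, such as a port number.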

score 4 · Accepted Answer

I don't think you want re here. You probably want:

from urlparse import urlparse, parse_qs
s = 'http://www.somesite.com/process.asp?widgetid=4530'
qs = parse_qs(urlparse(s).query)
if 'widgetid' in qs:
    # it's got a widget, a widget it has got...
    print qs['widgetid'][0]  # parse_qs maps each key to a list of values
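
If you need the value rather than just a yes/no answer, something along these lines should work. This is a sketch: get_widget_id is a hypothetical helper name, and the try/except import is only there so the same code runs on both Python 2.7 and Python 3.

```python
try:
    from urllib.parse import urlparse, parse_qs  # Python 3
except ImportError:
    from urlparse import urlparse, parse_qs      # Python 2.7, as in the question


def get_widget_id(url):
    """Hypothetical helper: first widgetid value in the query string, or None."""
    query = parse_qs(urlparse(url).query)
    # parse_qs maps each key to a list of values, e.g. {'widgetid': ['4530']}
    values = query.get('widgetid')
    return values[0] if values else None


print(get_widget_id('http://www.somesite.com/process.asp?widgetid=4530'))  # 4530
print(get_widget_id('http://www.somesite.com/process.asp'))                # None
```

The value comes back as a string, so wrap it in int() if you need a number.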

score 2 · Accepted Answer

Use urlparse:

from urlparse import urlparse, parse_qs
o = urlparse("http://www.somesite.com/process.asp?widgetid=4530")
if "widgetid" in parse_qs(o.query):  # keys are case-sensitive; the URL uses "widgetid"
    # this is a 'widget URL'
    pass
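
The same check can be applied to every link pulled out of the soup. Below is a rough sketch under a few assumptions: the hrefs list is made-up data standing in for what soup.findAll('a') would give you, and has_widget_id is a hypothetical helper name.

```python
try:
    from urllib.parse import urlparse, parse_qs  # Python 3
except ImportError:
    from urlparse import urlparse, parse_qs      # Python 2.7

# Made-up hrefs, standing in for [a.get('href') for a in soup.findAll('a')].
hrefs = [
    'http://www.somesite.com/process.asp?widgetid=4530',
    'http://www.somesite.com/about.asp',
    'http://www.somesite.com/process.asp?widgetId=99',  # different key casing
    None,  # an <a> tag with no href attribute
]


def has_widget_id(href):
    """Hypothetical helper: True if the query has an exact 'widgetid' key."""
    if not href:
        return False
    # parse_qs keys are compared as-is, so 'widgetId' does not count here.
    return 'widgetid' in parse_qs(urlparse(href).query)


widget_urls = [h for h in hrefs if has_widget_id(h)]
print(widget_urls)
```

Only the first URL passes the check. The third is skipped because its key is spelled widgetId.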
