python - How do I get a value using a regular expression in Python?

Question

I wrote code like this:

print(re.findall(r'(<td width="[0-9]+[%]?" align="(.+)">|<td align="(.+)"> width="[0-9]+[%]?")([ \n\t\r]*)([0-9,]+\.[0-9]+)([ \n\t\r]*)([&]?[a-zA-Z]+[;]?)([ \n\t\r]*)<span class="(.+)">', r.text, re.MULTILINE))

to match this line:

<td width="47%" align="left">556.348&nbsp;<span class="uccResCde">

I need the value 556.348. How can I get it with a regular expression?

score 3 · Accepted Answer

Cut and pasted straight from the HTMLParser documentation, this gets the data out of the tags without using a regular expression.

from html.parser import HTMLParser  # Python 3; the module was named HTMLParser in Python 2

# Create a subclass and override the handler methods
class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Encountered a start tag:", tag)
    def handle_endtag(self, tag):
        print("Encountered an end tag :", tag)
    def handle_data(self, data):
        print("Encountered some data  :", data)

# Instantiate the parser and feed it some HTML
parser = MyHTMLParser()
parser.feed('<td width="47%" align="left">556.348&nbsp;<span class="uccResCde">')
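To pull out only the number instead of printing every event, the same handler approach can be narrowed to collect the text found inside `<td>` cells. This is a minimal Python 3 sketch; the class name `CellTextParser` and its `cells` attribute are made up for illustration and are not part of the documentation example above.

```python
from html.parser import HTMLParser


class CellTextParser(HTMLParser):
    """Collect the non-empty text that appears inside <td> elements."""

    def __init__(self):
        super().__init__()
        self.in_td = False  # True while we are between <td> and </td>
        self.cells = []     # text fragments found inside table cells

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_td = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_td = False

    def handle_data(self, data):
        # strip() also removes the non-breaking space decoded from &nbsp;
        text = data.strip()
        if self.in_td and text:
            self.cells.append(text)


html = '<td width="47%" align="left">556.348&nbsp;<span class="uccResCde">'
parser = CellTextParser()
parser.feed(html)
parser.close()  # flush any text still buffered by the parser
cells = parser.cells
print(cells)
```

Because the parser tracks state rather than matching a fixed string, attribute order and extra whitespace in the `<td>` tag do not matter here.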

score 0 · Accepted Answer

Here is a solution that shows how to get the matched group. You should read the documentation for the re module.

import re

text_to_parse = '<td width="47%" align="left">556.348&nbsp;<span class="uccResCde">'
pattern = r'(<td width="[0-9]+[%]?" align="(.+)">|<td align="(.+)"> width="[0-9]+[%]?")([ \n\t\r]*)([0-9,]+\.[0-9]+)([ \n\t\r]*)([&]?[a-zA-Z]+[;]?)([ \n\t\r]*)<span class="(.+)">'
m = re.search(pattern, text_to_parse)
# Groups are numbered by their opening parenthesis; the number is the fifth group
print(m.group(5))
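Counting groups in a long pattern is error-prone, since each one is numbered by the position of its opening parenthesis. A named group avoids the counting. The following is a minimal Python 3 sketch, assuming the number is always the first text after the opening `<td ...>` tag; the shortened pattern and the group name `value` are illustrative, not the asker's original pattern.

```python
import re

html = '<td width="47%" align="left">556.348&nbsp;<span class="uccResCde">'

# Match any <td ...> opening tag, skip optional whitespace,
# then capture a decimal number (commas allowed in the integer part).
value_pattern = re.compile(r'<td\b[^>]*>\s*(?P<value>[0-9,]+\.[0-9]+)')

match = value_pattern.search(html)
# search() returns None when nothing matches, so guard before using the group
value = match.group("value") if match else None
print(value)
```

Using `[^>]*` instead of `.+` keeps the match inside a single tag, so the pattern does not depend on the order of the `width` and `align` attributes.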

However, there is no need to use regular expressions to parse HTML. Use an HTML parser such as Beautiful Soup instead.

from bs4 import BeautifulSoup

# Name the parser explicitly; text_to_parse is the string defined above
soup = BeautifulSoup(text_to_parse, "html.parser")
print(soup.text)
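Either way the value comes back as text, and a parser that decodes `&nbsp;` typically leaves a trailing non-breaking space (`\xa0`) on it. The following is a minimal Python 3 sketch of turning that text into a float. It assumes commas are thousands separators, which is suggested by the `[0-9,]+` in the pattern but not stated in the question, and `to_number` is a hypothetical helper name.

```python
def to_number(raw_text):
    """Convert extracted cell text such as '556.348\\xa0' to a float."""
    # str.strip() removes Unicode whitespace, including the non-breaking space
    cleaned = raw_text.strip()
    # Assumption: commas are thousands separators, so they can be dropped
    cleaned = cleaned.replace(",", "")
    return float(cleaned)


number = to_number("556.348\xa0")
print(number)
```

If the data used a comma as the decimal separator instead, this helper would give wrong results, so check the source format before relying on it.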
