python - Python regular expression for HTML parsing (BeautifulSoup)

Question

I want to grab the value of a hidden input field in HTML.

<input type="hidden" name="fooId" value="12-3456789-1111111111" />

I want to write a regular expression in Python that returns the value of fooId, given that I know the line in the HTML follows this format:

<input type="hidden" name="fooId" value="**[id is here]**" />

Can someone provide an example in Python that parses the HTML for the value?

score 27 · Accepted Answer

For this particular case, BeautifulSoup is harder to write than a regex, but it is much more robust... Since you already know which regex to use, I'm just contributing the BeautifulSoup example :-)

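# BeautifulSoup 3 import; BeautifulSoup 4 is imported as: from bs4 import BeautifulSoup
from BeautifulSoup import BeautifulSoup

# Or retrieve it from the web, etc.
html_data = open('/yourwebsite/page.html', 'r').read()

# Create the soup object from the HTML data
soup = BeautifulSoup(html_data)
# Find the proper tag; the filters go in attrs because 'name' is already
# the first positional parameter of find()
fooId = soup.find('input', attrs={'name': 'fooId', 'type': 'hidden'})
value = fooId.attrs[2][1]  # The value of the third attribute of the desired tag
                           # (attrs is a list of (name, value) pairs in BeautifulSoup 3)
                           # or index it directly via fooId['value']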

score 18 · Accepted Answer

I agree with Vinko: BeautifulSoup is the way to go. However, I suggest using fooId['value'] to get the attribute rather than relying on value being the third attribute.

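# BeautifulSoup 3 import; BeautifulSoup 4 is imported as: from bs4 import BeautifulSoup
from BeautifulSoup import BeautifulSoup
# Or retrieve it from the web, etc.
html_data = open('/yourwebsite/page.html', 'r').read()
# Create the soup object from the HTML data
soup = BeautifulSoup(html_data)
# Find the proper tag; the filters go in attrs because 'name' is already
# the first positional parameter of find()
fooId = soup.find('input', attrs={'name': 'fooId', 'type': 'hidden'})
value = fooId['value']  # The value attribute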

score 8 · Accepted Answer

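import re
# Sample input; in practice inputHTML holds the page source
inputHTML = '<input type="hidden" name="fooId" value="12-3456789-1111111111" />'
# Capture the value attribute of the fooId field
reg = re.compile(r'<input type="hidden" name="fooId" value="([^"]*)" />')
value = reg.search(inputHTML).group(1)
print('Value is %s' % value)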
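Note that `search` returns `None` when nothing matches, so calling `.group(1)` directly raises `AttributeError` on a page that lacks the field. A minimal sketch that guards against that, assuming the tag is written exactly as in the question (same attribute order and spacing); the helper name `extract_hidden_value` is made up for illustration:

```python
import re


def extract_hidden_value(html, field_name):
    """Return the value of the hidden input named field_name, or None.

    Assumes the tag looks exactly like the one in the question:
    type, then name, then value, single spaces, self-closing.
    """
    pattern = re.compile(
        r'<input type="hidden" name="%s" value="([^"]*)" />' % re.escape(field_name)
    )
    match = pattern.search(html)
    # search() returns None when the pattern is absent
    return match.group(1) if match else None


sample = '<input type="hidden" name="fooId" value="12-3456789-1111111111" />'
print(extract_hidden_value(sample, "fooId"))
print(extract_hidden_value("<p>no form here</p>", "fooId"))
```

The second call prints `None` instead of raising, which is usually easier to handle in calling code.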

score 5 · Accepted Answer

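Parsing is one of those areas where you really don't want to roll your own if you can avoid it, because you'll be chasing down edge cases and bugs for years to come.

I'd recommend using BeautifulSoup. It has a very good reputation, and from the docs it looks pretty easy to use.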
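If adding a third-party dependency is not an option, the same "use an existing parser" advice can also be followed with the standard library's `html.parser` module. This is only a rough sketch of that alternative, not something the answer above proposes; the class name `HiddenInputFinder` is hypothetical:

```python
from html.parser import HTMLParser


class HiddenInputFinder(HTMLParser):
    """Collect the value of every hidden <input> with a given name."""

    def __init__(self, field_name):
        super().__init__()
        self.field_name = field_name
        self.values = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names;
        # attrs is a list of (name, value) pairs in source order.
        attr_map = dict(attrs)
        if (tag == "input"
                and (attr_map.get("type") or "").lower() == "hidden"
                and attr_map.get("name") == self.field_name):
            self.values.append(attr_map.get("value"))


finder = HiddenInputFinder("fooId")
finder.feed('<form><INPUT NAME="fooId" value="12-3456789-1111111111" type="hidden" /></form>')
print(finder.values)
```

Because the lookup goes through a dictionary of attributes, the order and case of the attribute names in the source do not matter.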

score 1 · Accepted Answer

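Pyparsing is a good intermediate step between BeautifulSoup and regex. It is more robust than a plain regex, since its HTML tag parsing accounts for variations in case, whitespace, and attribute presence, absence, and order, yet this kind of basic tag extraction is simpler to do with it than with BS.

Your example is especially simple, since everything you are looking for is in the attributes of the opening "input" tag. Here is a pyparsing example showing several variations on your input tag that would give a regex fits, and it also shows how NOT to match a tag when it sits inside a comment: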

html = """<html><body>
<input type="hidden" name="fooId" value="**[id is here]**" />
<blah>
<input name="fooId" type="hidden" value="**[id is here too]**" />
<input NAME="fooId" type="hidden" value="**[id is HERE too]**" />
<INPUT NAME="fooId" type="hidden" value="**[and id is even here TOO]**" />
<!--
<input type="hidden" name="fooId" value="**[don't report this id]**" />
-->
<foo>
</body></html>"""

from pyparsing import makeHTMLTags, withAttribute, htmlComment

# use makeHTMLTags to create tag expression - makeHTMLTags returns expressions for
# opening and closing tags, we're only interested in the opening tag
inputTag = makeHTMLTags("input")[0]

# only want input tags with special attributes
inputTag.setParseAction(withAttribute(type="hidden", name="fooId"))

# don't report tags that are commented out
inputTag.ignore(htmlComment)

# use searchString to skip through the input 
foundTags = inputTag.searchString(html)

# dump out first result to show all returned tags and attributes
print(foundTags[0].dump())
print('')

# print out the value attribute for all matched tags
for inpTag in foundTags:
    print(inpTag.value)

Prints:

['input', ['type', 'hidden'], ['name', 'fooId'], ['value', '**[id is here]**'], True]
- empty: True
- name: fooId
- startInput: ['input', ['type', 'hidden'], ['name', 'fooId'], ['value', '**[id is here]**'], True]
  - empty: True
  - name: fooId
  - type: hidden
  - value: **[id is here]**
- type: hidden
- value: **[id is here]**

**[id is here]**
**[id is here too]**
**[id is HERE too]**
**[and id is even here TOO]**

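You can see that pyparsing not only matches these unpredictable variations, it also returns the data in an object that makes it easy to read out the individual tag attributes and their values.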
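For contrast, here is a small sketch of what a literal regex does with the same sample HTML. The pattern is an assumption (a strict, character-for-character match of the format given in the question), not one taken from any answer verbatim:

```python
import re

html = """<html><body>
<input type="hidden" name="fooId" value="**[id is here]**" />
<blah>
<input name="fooId" type="hidden" value="**[id is here too]**" />
<input NAME="fooId" type="hidden" value="**[id is HERE too]**" />
<INPUT NAME="fooId" type="hidden" value="**[and id is even here TOO]**" />
<!--
<input type="hidden" name="fooId" value="**[don't report this id]**" />
-->
<foo>
</body></html>"""

# Strict pattern: fixed attribute order, fixed case, single spaces
strict = re.compile(r'<input type="hidden" name="fooId" value="([^"]*)" />')

matches = strict.findall(html)
for value in matches:
    print(value)
```

This prints only `**[id is here]**` and `**[don't report this id]**`: the three reordered or differently cased tags are missed, and the commented-out tag is reported even though it should be skipped.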

score 0 · Accepted Answer

/<input type="hidden" name="fooId" value="([\d-]+)" \/>/

answered 2008-09-10T21:56:05.290
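The pattern above is written with `/.../` delimiters rather than as Python. A minimal sketch of how it might be applied with Python's `re` module, assuming the sample line from the question (the variable names are illustrative):

```python
import re

s = '<input type="hidden" name="fooId" value="12-3456789-1111111111" />'

# Same pattern without the /.../ delimiters; a raw string keeps \d intact
pattern = re.compile(r'<input type="hidden" name="fooId" value="([\d-]+)" />')

match = pattern.search(s)
value = match.group(1) if match else None
print(value)
```

Since `[\d-]+` accepts only digits and hyphens, an id containing any other character would not match at all.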

score 0 · Accepted Answer

/<input\s+type="hidden"\s+name="([A-Za-z0-9_]+)"\s+value="([A-Za-z0-9_\-]*)"\s*/>/

>>> import re
>>> s = '<input type="hidden" name="fooId" value="12-3456789-1111111111" />'
>>> re.match(r'<input\s+type="hidden"\s+name="([A-Za-z0-9_]+)"\s+value="([A-Za-z0-9_\-]*)"\s*/>', s).groups()
('fooId', '12-3456789-1111111111')
