python - Replacing the contents of a string with a regular expression

Question

I'm trying to strip all the HTML surrounding the data I scrape from a web page, so that only the raw data, ready to be inserted into a database, is left. So if I have something like:

<p class="location"> Atlanta, GA </p>

the code below returns

Atlanta, GA </p>

But that is not what I expect it to return. This is a more specific variant of a basic problem I found here. Any help would be greatly appreciated, thanks! The code is below.

def delHTML(self, html):
    """
    html is a list made up of items with data surrounded by html
    this function should get rid of the html and return the data as a list
    """

    for n,i in enumerate(html):
        if i==re.match('<p class="location">',str(html[n])):
            html[n]=re.sub('<p class="location">', '', str(html[n]))

    return html

score 2 · Accepted Answer

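As correctly pointed out in the comments, you should use a dedicated library to parse the HTML and extract the text. Here are a few examples:

html2text: limited in functionality, but exactly what you need.
BeautifulSoup: more complex, more powerful.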
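
To make the BeautifulSoup suggestion concrete, here is a minimal sketch. It assumes the third-party beautifulsoup4 package is installed, and the helper name extract_locations is hypothetical (it does not appear in the question). It is one possible way to pull the text out of the <p class="location"> elements, not the only one.

```python
from bs4 import BeautifulSoup  # assumption: third-party package (pip install beautifulsoup4)


def extract_locations(html):
    """Return the text of every <p class="location"> element, inner tags removed."""
    soup = BeautifulSoup(html, "html.parser")
    # get_text() concatenates all text nodes under the element;
    # strip() drops the surrounding whitespace seen in the question's sample.
    return [p.get_text().strip() for p in soup.find_all("p", class_="location")]


print(extract_locations('<p class="location"> Atlanta, GA </p>'))
# prints ['Atlanta, GA']
```

Because the parser handles both the opening and the closing tag, the stray </p> from the question does not survive.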

score 0 · Accepted Answer

Assuming you just want to extract the data contained in <p class="location"> tags, you can use a quick & dirty (but correct) approach with the Python HTMLParser module (a simple HTML SAX parser), like this:

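# Python 3 import; in Python 2 this was "from HTMLParser import HTMLParser"
from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        # per-instance state (a class-level list would be shared by all parsers)
        self.PLocationID = 0  # <p> depth of the location tag being captured, 0 = none
        self.PCount = 0       # current <p> nesting depth
        self.buf = ""         # text collected for the current location tag
        self.out = []         # extracted strings, one per location tag

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.PCount += 1
            # start capturing only if we are not already inside a location tag
            if ("class", "location") in attrs and self.PLocationID == 0:
                self.PLocationID = self.PCount

    def handle_endtag(self, tag):
        if tag == "p":
            # the closing tag that matches the captured opening tag
            if self.PLocationID == self.PCount:
                self.out.append(self.buf)
                self.buf = ""
                self.PLocationID = 0
            self.PCount -= 1

    def handle_data(self, data):
        # keep text only while inside a captured location tag
        if self.PLocationID:
            self.buf += data

# instantiate the parser and feed it some HTML
parser = MyHTMLParser()
parser.feed("""
<html>
<body>
<p>This won't appear!</p>
<p class="location">This <b>will</b></p>
<div>
<p class="location">This <span class="someclass">too</span></p>
<p>Even if <p class="location">nested Ps <p class="location"><b>shouldn't</b> <p>be allowed</p></p> <p>this will work</p></p> (this last text is out!)</p>
</div>
</body>
</html>
""")
print(parser.out)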

Output:

['This will', 'This too', "nested Ps shouldn't be allowed this will work"]

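This extracts all the text contained in <p class="location"> tags, stripping every tag inside them. Separate tags (as long as they are not nested, which is not allowed for paragraphs anyway) each get their own entry in the out list.

Note that for more complex requirements this can easily get out of hand. In those cases a DOM parser is a better fit.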
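
As a rough illustration of the DOM approach, the sketch below uses the standard library's xml.dom.minidom. Note the assumptions: minidom is an XML parser, so this only works on well-formed markup (real-world HTML usually needs a more lenient parser), and the helper names text_of and locations are hypothetical.

```python
from xml.dom import minidom


def text_of(node):
    """Recursively concatenate the text nodes beneath a DOM node."""
    if node.nodeType == node.TEXT_NODE:
        return node.data
    return "".join(text_of(child) for child in node.childNodes)


def locations(markup):
    """Collect the text of each <p class="location"> in well-formed markup."""
    dom = minidom.parseString(markup)
    return [
        text_of(p).strip()
        for p in dom.getElementsByTagName("p")
        if p.getAttribute("class") == "location"
    ]


sample = """<div>
<p>This won't appear!</p>
<p class="location">This <b>will</b></p>
<p class="location">This <span class="someclass">too</span></p>
</div>"""
print(locations(sample))
# prints ['This will', 'This too']
```

With the whole tree in memory, selecting elements and reading their text takes a few lines, and no depth counters need to be maintained by hand.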
