python - extracting facebook page from html using regex

Question

I am trying to extract the address of a website's facebook page by running a regular expression search on its html

usually the link appears as <a href="http://www.facebook.com/googlechrome">Facebook</a>

but sometimes the address will be http://www.facebook.com/some.other

and sometimes with numbers

at the moment the regex that I have is

'(facebook.com)\S\w+'

but it won't catch the last 2 possibilities

what is it called when I want the regex to search for something but not return it? (for instance, I want the regex to match the www.facebook.com part but not have that part in the result, only the part that comes after it)

note I use python with re and urllib2

score 1 · Accepted Answer

Your main problem is that you don't understand regular expressions well enough yet.

fb_re = re.compile(r'www\.facebook\.com([^"]+)')

Then simply:

results = fb_re.findall(url)

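Why this works:

In regular expressions, the part inside parentheses () is what gets captured. You had put the facebook.com part inside the parentheses, so that was all you got back.

Here, I used a character set [] to match whatever is in it and the ^ operator to negate it, which means anything not in the set. The set contains only the " character, so the pattern matches everything that follows www.facebook.com until it reaches a " and then stops.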
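As a small illustration of the capture group and negated character set described above, here is a minimal sketch for Python 3. The `html` string is a made-up sample, not taken from a real page, and the dots are escaped so that they match literal periods only.

```python
import re

# Hypothetical HTML snippet used only for illustration.
html = (
    '<a href="http://www.facebook.com/googlechrome">Facebook</a> '
    '<a href="http://www.facebook.com/some.other">Other</a> '
    '<a href="http://www.facebook.com/page123">Numbers</a>'
)

# findall() returns only the parenthesised group;
# [^"]+ matches one or more characters that are not a double quote.
fb_re = re.compile(r'www\.facebook\.com([^"]+)')

results = fb_re.findall(html)
print(results)  # ['/googlechrome', '/some.other', '/page123']
```

The host name is still required for a match, but because it sits outside the parentheses it is left out of the result.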

Note - this catches facebook links that are embedded in the html, where a " closes the address. If the facebook link simply appears as plain text on the page, you can use:

fb_re = re.compile(r'www\.facebook\.com(\S+)')

This means grab any run of non-whitespace characters, so it stops as soon as it hits whitespace.
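A minimal sketch of the plain-text case, assuming Python 3. The `text` string is an invented example.

```python
import re

# Hypothetical plain-text input, used only for illustration.
text = "Find us at www.facebook.com/some.other or www.facebook.com/page123 today"

# \S+ matches a run of non-whitespace characters, so the capture
# ends at the first space, tab, or newline after the link.
fb_re = re.compile(r'www\.facebook\.com(\S+)')

paths = fb_re.findall(text)
print(paths)  # ['/some.other', '/page123']
```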

If you are worried about links that end with a period, you can add to it like this:

fb_re = re.compile(r'www\.facebook\.com(\S+)\.\s')

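This tells it to search for the same thing as above, but to stop when it reaches the end of a sentence: a . followed by whitespace such as a space or a newline. That way, a link written as /some.other. at the end of a sentence is captured as /some.other, with the final . dropped. Note that this pattern only matches links that are actually followed by a period and whitespace, so it complements the previous pattern rather than replacing it.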
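The sketch below is one possible variant, not part of the original answer: it treats the trailing period as optional, so links at the end of a sentence and links in the middle of one are both captured. It assumes Python 3, and the `text` string is made up for the example.

```python
import re

# Hypothetical sentences: one link ends a sentence, one does not.
text = "Like us at www.facebook.com/some.other. Or visit www.facebook.com/page123 today"

# (\S+?) lazily captures non-whitespace characters; \.? swallows an
# optional trailing period; (?=\s|$) requires whitespace or the end
# of the string next, without consuming it.
fb_re = re.compile(r'www\.facebook\.com(\S+?)\.?(?=\s|$)')

paths = fb_re.findall(text)
print(paths)  # ['/some.other', '/page123']
```

The lookahead `(?=...)` checks what follows without including it in the match, which is one of the ways to "search but not fetch" that the question asks about.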

score 0 · Accepted Answer

If I am assuming correctly, the url is always enclosed in double quotes. Right?

re.findall(r'"http://www\.facebook\.com(.+?)"', url)

Overall, trying to parse html with regular expressions is a bad idea. I recommend using an html parser such as lxml.html to find the links, and then using urlparse on them

>>> from urlparse import urlparse # in 3.x use from urllib.parse import urlparse
>>> url = 'http://www.facebook.com/some.other'
>>> parse_object = urlparse(url)
>>> parse_object.netloc
'www.facebook.com'
>>> parse_object.path
'/some.other'
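To make the parser-based approach concrete, here is a minimal sketch for Python 3. It swaps in the standard library's html.parser in place of lxml.html so that it runs without extra dependencies. The class name `FacebookLinkParser` and the sample `html` string are hypothetical, introduced only for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class FacebookLinkParser(HTMLParser):
    """Collect the path of every <a href> that points at facebook.com."""

    def __init__(self):
        super().__init__()
        self.paths = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        parsed = urlparse(href)
        # Accept facebook.com with or without the www. prefix.
        if parsed.netloc.lower() in ("facebook.com", "www.facebook.com"):
            self.paths.append(parsed.path)


# Hypothetical HTML snippet used only for illustration.
html = (
    '<a href="http://www.facebook.com/googlechrome">Facebook</a>'
    '<a href="http://example.com/about">About</a>'
    '<a href="http://www.facebook.com/some.other">Other</a>'
)

parser = FacebookLinkParser()
parser.feed(html)
print(parser.paths)  # ['/googlechrome', '/some.other']
```

Because the host check is done on the parsed url and not on raw text, links to other sites are skipped even if they happen to mention facebook.com elsewhere in the address.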
