python - Python regex alternation

Question

I'm trying to find all links on a web page of the form "http://something" or "https://something". I wrote a regex, and it works:

L = re.findall(r"http://[^/\"]+/|https://[^/\"]+/", site_str)

But is there a shorter way to write this? I'm repeating ://[^/\"]+/ twice, probably unnecessarily. I tried various things, but none of them work. I tried:

L = re.findall(r"http|https(://[^/\"]+/)", site_str)
L = re.findall(r"(http|https)://[^/\"]+/", site_str)
L = re.findall(r"(http|https)(://[^/\"]+/)", site_str)

I'm obviously missing something here, or I don't understand Python regexes well enough.

score 10 · Accepted Answer

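You are using capturing groups, and .findall() changes its behaviour when the pattern contains them: it returns only the contents of the capturing groups instead of the whole match. Your regex can be simplified, but your versions will work if you use a non-capturing group instead: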

L = re.findall(r"(?:http|https)://[^/\"]+/", site_str)
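To make the difference concrete, here is a small sketch that runs the three attempts from the question next to the non-capturing version. The value of site_str is a made-up sample, not the asker's actual page, and the names attempt_1, attempt_2, attempt_3 and fixed are only for illustration.

```python
import re

# Assumed sample input; the real site_str would be the page's HTML.
site_str = '"http://someserver.com/" "https://anotherserver.com/with/path"'

# Attempt 1: | splits the whole pattern, so this reads as
# "http" OR "https(://...)". "http" matches first and group 1 stays empty.
attempt_1 = re.findall(r"http|https(://[^/\"]+/)", site_str)

# Attempt 2: the whole URL prefix is matched, but findall returns only group 1.
attempt_2 = re.findall(r"(http|https)://[^/\"]+/", site_str)

# Attempt 3: two groups, so findall returns one tuple per match.
attempt_3 = re.findall(r"(http|https)(://[^/\"]+/)", site_str)

# Non-capturing group: no capturing groups, so findall returns the whole match.
fixed = re.findall(r"(?:http|https)://[^/\"]+/", site_str)

print(attempt_1)  # ['', '']
print(attempt_2)  # ['http', 'https']
print(attempt_3)  # [('http', '://someserver.com/'), ('https', '://anotherserver.com/')]
print(fixed)      # ['http://someserver.com/', 'https://anotherserver.com/']
```

Attempts 2 and 3 actually match the right text; it is only what findall chooses to return that makes them look broken.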

You don't need to escape the double quote if you use single quotes around the expression. And since only the s varies between the two alternatives, s? works too:

L = re.findall(r'https?://[^/"]+/', site_str)

Demo:

>>> import re
>>> example = '''
... "http://someserver.com/"
... "https://anotherserver.com/with/path"
... '''
>>> re.findall(r'https?://[^/"]+/', example)
['http://someserver.com/', 'https://anotherserver.com/']
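If capturing groups are wanted for some other reason (an assumption; the question does not ask for this), one possible approach is re.finditer(), which yields match objects, so the full match stays available through group(0). The pattern below, which captures the scheme and host separately, is a hypothetical variation and not part of the original answer.

```python
import re

example = '''
"http://someserver.com/"
"https://anotherserver.com/with/path"
'''

# Hypothetical pattern: group 1 is the scheme, group 2 is the host.
pattern = re.compile(r'(https?)://([^/"]+)/')

# group(0) is always the entire match, regardless of capturing groups.
links = [m.group(0) for m in pattern.finditer(example)]

# groups() returns the captured pieces as a tuple.
parts = [m.groups() for m in pattern.finditer(example)]

print(links)  # ['http://someserver.com/', 'https://anotherserver.com/']
print(parts)  # [('http', 'someserver.com'), ('https', 'anotherserver.com')]
```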
