python - Improving a regex for robots.txt

Question

I wrote the following regular expression to extract the links from a robots.txt file:

re.compile(r"/\S+(?:\/+)")

It gives me the following results:

/includes/
/modules/
/search/
/?q=user/password/
/?q=user/register/
/node/add/
/logout/
/?q=admin/
/themes/
/?q=node/add/
/admin/
/?q=comment/reply/
/misc/
//example.com/
//example.com/site/
/profiles/
//www.robotstxt.org/wc/
/?q=search/
/user/password/
/?q=logout/
/comment/reply/
/?q=filter/tips/
/?q=user/login/
/user/register/
/user/login/
/scripts/
/filter/tips/
//www.sxw.org.uk/computing/robots/

How can I exclude the links that start with two slashes, such as the following?

 //www.sxw.org.uk/computing/robots/
 //www.robotstxt.org/wc/
 //example.com/
 //example.com/site/

Any ideas?

score 1 · Accepted Answer

I suggest adding an if condition:

 if not line.startswith('//'):
     pass  # then do something with the link here
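As a minimal sketch of this approach, the filter can be applied to each match returned by the regex from the question. The function name extract_links and the sample text are made up for illustration; they are not from the question.

```python
import re

# Pattern from the question: a slash, non-whitespace characters, then a trailing slash.
LINK_RE = re.compile(r"/\S+(?:\/+)")

def extract_links(text):
    """Return matched links, skipping those that start with two slashes."""
    links = []
    for link in LINK_RE.findall(text):
        if not link.startswith("//"):
            links.append(link)
    return links

# Hypothetical robots.txt excerpt, for illustration only.
sample = """\
# See http://www.robotstxt.org/wc/ for details
Disallow: /includes/
Disallow: /?q=user/password/
"""

print(extract_links(sample))
```

The URL in the comment line is matched as //www.robotstxt.org/wc/ and then dropped by the startswith check, which is presumably where the unwanted entries in the question come from.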

score 1 · Accepted Answer

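Assuming each string to match sits on its own line, as in your sample, you can anchor the regex to the start of the line and use a negative lookahead: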

^(?!//)/\S+(?:\/+)

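Be sure to enable the regex option that makes ^ match at the start of every line (multiline mode: re.MULTILINE or an inline (?m) in Python).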

My Python is rusty, but this should work:

import re

# subject: the text to search, one candidate link per line
for match in re.finditer(r"(?m)^(?!//)/\S+(?:/+)", subject):
    start = match.start()  # match start
    end = match.end()      # match end (exclusive)
    text = match.group()   # matched text
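To check the anchored pattern, here is a small sketch run against a few of the links from the question, one per line. The variable names are illustrative, and re.MULTILINE is used here in place of the inline (?m) flag, which should behave the same way.

```python
import re

# A few links from the question, one per line.
subject = """\
/includes/
//example.com/
/?q=user/password/
//www.robotstxt.org/wc/
/user/login/
"""

# ^ anchors at each line start (re.MULTILINE);
# (?!//) rejects lines that begin with two slashes.
pattern = re.compile(r"^(?!//)/\S+(?:/+)", re.MULTILINE)

kept = [m.group() for m in pattern.finditer(subject)]
print(kept)
```

Without the multiline flag, ^ only matches at the very start of the string, so only the first line could ever match.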
