python - Remove all forms of URLs from a given string in Python

Question

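I am new to Python and was wondering whether there is a better solution for matching every form of URL that might be found in a given string. Googling turns up plenty of solutions that extract the domain or replace URLs with links, but none that remove them from the string. I have included an example below for reference. Thanks!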

str = 'this is some text that will have one form or the other url embeded, most will have valid URLs while there are cases where they can be bad. for eg, http://www.google.com and http://www.google.co.uk and www.domain.co.uk and etc.'

URLless_string = re.sub(r'(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))', '', thestring)

print '==' + URLless_string + '=='

Error log:

C:\Python27>python test.py
  File "test.py", line 7
SyntaxError: Non-ASCII character '\xab' in file test.py on line 7, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details

score 8 · Accepted Answer

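There is an error in your code (two, actually):

1. The second-to-last single quote must be escaped by putting a backslash in front of it. Otherwise it ends the string literal in the middle of the pattern.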

URLless_string = re.sub(r'(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?«»“”‘’]))', '', thestring)
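A side note on why the backslash is safe inside a raw string: in r'...' a backslash before a quote stops the quote from ending the literal, but the backslash itself stays in the string. The regex engine then treats \' as an escaped literal quote, so the character class still matches a plain '. A minimal sketch of that behaviour (the names escaped, pattern and matches are illustrative only, not from the original post):

```python
import re

# In a raw string, \' does not terminate the literal, and the backslash is kept.
escaped = r'\''
print(len(escaped))  # two characters: a backslash and a single quote

# Inside a character class, \' still stands for a literal single quote.
pattern = re.compile(r'[\'"]')
matches = pattern.findall('it\'s "quoted"')
print(matches)
```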

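2. Do not use str as a variable name. It is the name of Python's built-in string type (not strictly a reserved keyword, but rebinding it hides the built-in), so name the variable thestring or something else.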
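To illustrate the second point, here is a small sketch of what goes wrong when the name str is rebound. Strictly speaking, str is a built-in name rather than a keyword, so Python allows the assignment, and the problem only shows up later when the built-in is needed. The function shadowing_demo is a hypothetical helper written for this illustration:

```python
import keyword

# str is not in Python's keyword list, which is why assigning to it is allowed.
is_keyword = keyword.iskeyword('str')

def shadowing_demo():
    str = 'some text'           # the local name now hides the built-in type
    try:
        return str(42)          # attempts to call a string object
    except TypeError as exc:
        return 'TypeError: %s' % exc

result = shadowing_demo()
print(is_keyword)
print(result)
```

Using a different name such as thestring avoids the problem entirely.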

For example:

thestring = 'this is some text that will have one form or the other url embeded, most will have valid URLs while there are cases where they can be bad. for eg, http://www.google.com and http://www.google.co.uk and www.domain.co.uk and etc.'

URLless_string = re.sub(r'(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?«»“”‘’]))', '', thestring)

print URLless_string

Result:

this is some text that will have one form or the other url embeded, most will have valid URLs while there are cases where they can be bad. for eg, and and and etc.

score 7 · Accepted Answer

Include an encoding declaration at the top of your source file (the regex string contains non-ASCII symbols such as »). For example:

# -*- coding: utf-8 -*-
import re
...

Also, enclose the regex string in triple single (or double) quotes, ''' or """, rather than single quotes, because the string already contains the quote characters themselves (' and ").

r'''(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))'''
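Putting both suggestions together, a complete script might look like the sketch below. It is written for Python 3 (hence print as a function), where the encoding line is optional because UTF-8 is the default source encoding; under Python 2.7 the declaration is what prevents the SyntaxError from the question. The names URL_PATTERN, remove_urls, sample and cleaned are assumptions made for this example, and the sample is a shortened version of the string from the question.

```python
# -*- coding: utf-8 -*-
import re

# Triple-quoted raw string: neither ' nor " inside the pattern needs escaping.
URL_PATTERN = re.compile(r'''(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))''')

def remove_urls(text):
    """Return text with every match of URL_PATTERN deleted."""
    return URL_PATTERN.sub('', text)

sample = 'for eg, http://www.google.com and http://www.google.co.uk and www.domain.co.uk and etc.'
cleaned = remove_urls(sample)
print('==' + cleaned + '==')
```

Only the URLs themselves are deleted, so the spaces that surrounded them remain. If that matters, the whitespace can be collapsed afterwards, for example with ' '.join(cleaned.split()).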
