I have some Python strings from which I want to remove unwanted characters.
Examples:
"This is '-' a test"
should be "This is a test"
"This is a test L)[_U_O-Y OH : l’J1.l'}/"
should be "This is a test"
"> FOO < BAR"
should be "FOO BAR"
"I<<W5§!‘1“¢!°\" I"
should be ""
(because if only words are extracted then it returns I W I and none of them form words)
"l‘?£§l%nbia ;‘\\~siI.ve_rswinq m"
should be ""
"2|'J]B"
should be ""
This is what I have so far, but it does not preserve the original spaces between words.
>>> import re
>>> line = re.sub(r"\W+","","This is '-' a test")
>>> line
'Thisisatest'
>>> line = re.sub(r"\W+","","This is a test L)[_U_O-Y OH : l’J1.l'}/")
>>> line
'ThisisatestL_U_OYOHlJ1l'
# Although I would prefer this to be "This is a test", if that is not possible
# I would prefer "This is a test L_U_OYOHlJ1l".
>>> line = re.sub(r"\W+","","> FOO < BAR")
>>> line
'FOOBAR'
>>> line = re.sub(r"\W+","","I<<W5§!‘1“¢!°\" I")
>>> line
'IW51I'
>>> line = re.sub(r"\W+","","l‘?£§l%nbia ;‘\\~siI.ve_rswinq m")
>>> line
'llnbiasiIve_rswinqm'
>>> line = re.sub(r"\W+","","2|'J]B")
>>> line
'2JB'
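One possible way to keep the spaces, as a minimal sketch: since `\W+` also matches whitespace, replacing each match with a single space instead of an empty string leaves the words separated. The helper name `strip_symbols` is made up for illustration. Note that `_` counts as a word character, so underscores survive this step.

```python
import re

def strip_symbols(text):
    # Replace every run of non-word characters (spaces included) with one
    # space, so neighbouring words are not glued together.
    spaced = re.sub(r"\W+", " ", text)
    # Drop the space left behind by leading or trailing junk.
    return spaced.strip()

print(strip_symbols("This is '-' a test"))  # This is a test
print(strip_symbols("> FOO < BAR"))         # FOO BAR
```

This only fixes the spacing. Junk fragments such as `J1` or `_U_O` are still present and need a separate filtering step.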
Later, I will filter the regex-cleaned words against a predefined list of words.
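A rough sketch of how that filtering step might look, under some assumptions: `KNOWN_WORDS` is a hypothetical stand-in for the real word list, `clean_line` is an illustrative name, matching is case-insensitive, and a "word" is taken to be a run of ASCII letters. None of this is confirmed by the examples beyond the expected outputs listed at the top.

```python
import re

# Hypothetical word list; the real predefined list is not shown here.
KNOWN_WORDS = {"this", "is", "a", "test", "foo", "bar"}

def clean_line(text, known_words=KNOWN_WORDS):
    # Pull out runs of ASCII letters; digits, punctuation, underscores and
    # whitespace all act as separators.
    tokens = re.findall(r"[A-Za-z]+", text)
    # Keep only tokens found in the word list (compared in lower case).
    kept = [tok for tok in tokens if tok.lower() in known_words]
    # Rejoin with single spaces; an empty result becomes "".
    return " ".join(kept)

print(clean_line("This is '-' a test"))  # This is a test
print(clean_line("> FOO < BAR"))         # FOO BAR
print(repr(clean_line("2|'J]B")))        # ''
```

With this approach the result depends entirely on the word list: a short real word that happens to appear inside the noise (for example a stray `a`) would be kept.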