python - Removing specific characters from a long string in Python

Question

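I'm working on a project that takes the source code of a page and boils it down to just the words displayed on the page. I can already remove all the html tags and everything between script tags, but I can't figure out how to remove all the characters that start with a backslash. The page contains \t, \n, and \x**, where each * appears to be a lowercase letter or a digit.

How would I write code that replaces all of these parts of the string with a space? I'm working in Python.

For example, this is the string from a web page: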

\n\t\n\t\n\t\tApple - Wikipedia, the free encyclopedia\n\t\t\n\t\t\t\t\t\t\n\t\t\t\t\t\t\n\t\n\t\n\t\t\t\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\n\t\t\t\n\t\t\t\t\n\t\t\t\t\n\t\t\t\n\t\t\t\t\t\t\n\t\t\t\t\n\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\n\t\t\tLanguage:English\xd8\xa7\xd9\x84\xd8\xb9\xd8\xb1\xd8\xa8\xd9\x8a\xd8\xa9Aragon\xc3\xa9sAsturianuAz\xc9\x99rbaycanca\xe0\xa6\xac\xe0\xa6\xbe\xe0\xa6\x82\xe0\xa6\xb2\xe0\xa6\xbeB\xc3\xa2n-l\xc3\xa2m-g\xc3\xbaBasa Banyumasan\xd0\x91\xd0\xb5\xd0\xbb\xd0\xb0\xd1\x80\xd1\x83\xd1\x81\xd0\xba\xd0

It should become:

Apple - Wikipedia, the free encyclopedia Language:English sAsturianuAz rbaycanca Basa Banyumasan

score 1 · Accepted Answer

import re

# repr() of the byte string spells control and non-ASCII bytes as literal
# backslash escapes (\n, \t, \xd8, ...), which the patterns below can match.
s = repr(b'''\n\t\n\t\n\t\tApple - Wikipedia, the free encyclopedia\n\t\t\n\t\t\t\t\t\t\n\t\t\t\t\t\t\n\t\n\t\n\t\t\t\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\n\t\t\t\n\t\t\t\t\n\t\t\t\t\n\t\t\t\n\t\t\t\t\t\t\n\t\t\t\t\n\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\n\t\t\tLanguage:English\xd8\xa7\xd9\x84\xd8\xb9\xd8\xb1\xd8\xa8\xd9\x8a\xd8\xa9Aragon\xc3\xa9sAsturianuAz\xc9\x99rbaycanca\xe0\xa6\xac\xe0\xa6\xbe\xe0\xa6\x82\xe0\xa6\xb2\xe0\xa6\xbeB\xc3\xa2n-l\xc3\xa2m-g\xc3\xbaBasa Banyumasan\xd0\x91\xd0\xb5\xd0\xbb\xd0\xb0\xd1\x80\xd1\x83\xd1\x81\xd0\xba\xd0''')
s = re.sub(r'\\[tn]', ' ', s)          # literal \t and \n escapes become spaces
s = re.sub(r'\\x[0-9a-f]{2}', ' ', s)  # literal \xNN escapes become spaces
print(s)

score 0 · Accepted Answer

Assuming that the plain-text words contain at least 3 characters:

' '.join(re.findall(r'\w{3,}', s)) # where s represents the string

Or:

' '.join(re.findall(r'(?:\w{3,}|-(?=\s))', s)) # in order to preserve the dash char
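A hedged illustration for Python 3, where `\w` also matches non-ASCII letters in a str by default: the sketch below passes `re.ASCII` so that only ASCII words survive. The helper name `ascii_words`, its `min_len` parameter, and the sample text are assumptions made up for this example, not part of the answer.

```python
import re

def ascii_words(text, min_len=3):
    """Join runs of at least min_len ASCII word characters with single spaces."""
    pattern = r'\w{%d,}' % min_len
    # re.ASCII restricts \w to [a-zA-Z0-9_], so non-ASCII letters act as separators.
    return ' '.join(re.findall(pattern, text, flags=re.ASCII))

# Hypothetical sample: real tab/newline characters plus two Arabic letters.
sample = '\n\t\tApple - Wikipedia, the free encyclopedia\n\tLanguage:English\u0627\u0644Basa'
print(ascii_words(sample))
```

Note that this approach drops words shorter than three characters and all punctuation, so it suits word extraction better than faithful text cleanup.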

score 0 · Accepted Answer

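Wikipedia uses the UTF-8 string encoding. To convert it to plain ASCII:

Encode the text as UTF-8 bytes
Decode those bytes as ASCII, replacing anything that cannot be decoded
Convert the replacement characters to spaces
Collapse runs of whitespace (tabs, newlines, etc.) into a single space
Strip leading and trailing spaces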


s = "\n\t\n\t\n\t\tApple - Wikipedia, the free encyclopedia\n\t\t\n\t\t\t\t\t\t\n\t\t\t\t\t\t\n\t\n\t\n\t\t\t\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t\t\t\n\t\t\t\t\t\t\n\t\t\t\n\t\t\t\t\n\t\t\t\t\n\t\t\t\n\t\t\t\t\t\t\n\t\t\t\t\n\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\n\t\t\tLanguage:English\xd8\xa7\xd9\x84\xd8\xb9\xd8\xb1\xd8\xa8\xd9\x8a\xd8\xa9Aragon\xc3\xa9sAsturianuAz\xc9\x99rbaycanca\xe0\xa6\xac\xe0\xa6\xbe\xe0\xa6\x82\xe0\xa6\xb2\xe0\xa6\xbeB\xc3\xa2n-l\xc3\xa2m-g\xc3\xbaBasa Banyumasan\xd0\x91\xd0\xb5\xd0\xbb\xd0\xb0\xd1\x80\xd1\x83\xd1\x81\xd0\xba"

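import re

# Matches any run of whitespace (spaces, tabs, newlines)
whitespaces = re.compile(r'\s+')

def utf8_to_ascii(s, ws=whitespaces):
    s = s.encode("utf8")                      # text -> UTF-8 bytes
    s = s.decode("ascii", errors="replace")   # each non-ASCII byte -> U+FFFD
    s = s.replace(u"\ufffd", " ")             # replacement characters -> spaces
    s = ws.sub(" ", s)                        # collapse whitespace runs
    return s.strip()                          # drop leading/trailing spaces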

s = utf8_to_ascii(s)

which leaves you with the string

Apple - Wikipedia, the free encyclopedia Language:English Aragon sAsturianuAz rbaycanca B n-l m-g Basa Banyumasan
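If the page arrives as raw bytes rather than text (for example, read from a file opened in binary mode), the same idea can probably be applied without the encode step. This is only a sketch under that assumption; `bytes_to_ascii_text` and the sample bytes are hypothetical.

```python
import re

def bytes_to_ascii_text(raw):
    """Decode raw bytes as ASCII, turning every undecodable byte into a space."""
    text = raw.decode('ascii', errors='replace')  # non-ASCII bytes -> U+FFFD
    text = text.replace('\ufffd', ' ')            # replacement characters -> spaces
    return re.sub(r'\s+', ' ', text).strip()      # collapse whitespace, trim ends

# Hypothetical sample: UTF-8 bytes for "Aragonés" surrounded by tabs/newlines.
raw = b'\n\t\tLanguage:EnglishAragon\xc3\xa9sAsturianu\n'
print(bytes_to_ascii_text(raw))
```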

score 0 · Accepted Answer

Build a regular expression that matches all of the patterns you need, then replace the matches with a space.
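A minimal sketch of that idea, assuming the page text is already a Python 3 str in which tabs, newlines, and non-ASCII characters are real characters rather than literal backslash sequences. The pattern, the helper name `strip_to_ascii_words`, and the sample text are illustrative assumptions.

```python
import re

# Any run of characters outside the printable, non-space ASCII range
# (so whitespace, control characters, and all non-ASCII characters).
UNWANTED = re.compile(r'[^\x21-\x7e]+')

def strip_to_ascii_words(text):
    """Replace each run of unwanted characters with a single space."""
    return UNWANTED.sub(' ', text).strip()

# Hypothetical sample with real tab/newline characters and one accented letter.
sample = '\n\t\tApple - Wikipedia\n\t\tLanguage:EnglishAragon\u00e9sAsturianu'
print(strip_to_ascii_words(sample))
```

Because ordinary spaces are part of the matched class, runs of mixed spaces, tabs, and newlines collapse to one space in the same pass.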

score 0 · Accepted Answer

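Assuming the default ascii encoding, you can do this quite nicely in one line without using evil regex ;) by iterating over the string and dropping characters based on their code point, using ord(i) < 128 or whatever specification you choose.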

>>> ' '.join(''.join([i if ord(i) < 128 else ' ' for i in mystring]).split())
#Output:
Apple - Wikipedia, the free encyclopedia Language:English Aragon sAsturianuAz rbaycanca B n-l m-g Basa Banyumasan

Alternatively, you can specify a string of allowed characters and test with "in", for example using string.ascii_letters.

>>> import string
>>> ' '.join(''.join([i if i in string.ascii_letters else ' ' for i in mystring]).split())
#Output:
Apple Wikipedia the free encyclopedia Language English Aragon sAsturianuAz rbaycanca B n l m g Basa Banyumasan

This also removes punctuation (though, if you need it, you can easily avoid that by adding those characters back to the allowed-string definition, e.g. check = string.ascii_letters + ',.-:').
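The allowed-character variant might be sketched as follows. The set contents follow the `check` string mentioned above; the helper name `keep_allowed` and the sample text are assumptions for illustration.

```python
import string

# ASCII letters plus the punctuation to keep (an illustrative choice).
ALLOWED = set(string.ascii_letters + ',.-:')

def keep_allowed(text, allowed=ALLOWED):
    """Replace every character not in `allowed` with a space, then collapse runs."""
    spaced = ''.join(ch if ch in allowed else ' ' for ch in text)
    return ' '.join(spaced.split())

# Hypothetical sample with real tab/newline characters and one accented letter.
sample = '\n\tApple - Wikipedia, the free encyclopedia\n\tLanguage:English\u00e9s'
print(keep_allowed(sample))
```

Digits are not in this set, so they are replaced as well; adding `string.digits` to the allowed characters would keep them.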
