python - Delete all files in a directory that contain non-UTF-8 symbols

Question

I have a set of data, but I only need to work with UTF-8 data, so I need to delete all data that contains non-UTF-8 symbols.

When I try to work with these files, I get:

UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 3062: character maps to <undefined> and UnicodeDecodeError: 'utf8' codec can't decode byte 0xc1 in position 1576: invalid start byte

My code:

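import io
import os


class Corpus:
    def __init__(self, path_to_dir=None):
        # Fall back to the current directory; the original default of [] is not a valid path.
        self.path_to_dir = path_to_dir if path_to_dir else "."

    def emails_as_string(self):
        # Yield [file_name, contents] for every file whose name does not start with "!".
        for file_name in os.listdir(self.path_to_dir):
            if not file_name.startswith("!"):
                # Strict UTF-8 decoding: body.read() raises UnicodeDecodeError on invalid bytes.
                with io.open(self.add_slash(self.path_to_dir) + file_name, 'r', encoding='utf-8') as body:
                    yield [file_name, body.read()]

    def add_slash(self, path):
        # Make sure the directory path ends with a slash before the file name is appended.
        if path.endswith("/"):
            return path
        return path + "/"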

I get the error at yield [file_name, body.read()] and at list_of_emails = mailsrch.findall(text), but when the files are valid UTF-8 everything works fine.

score 2 · Accepted Answer

I suspect you want to use the errors='ignore' argument of bytes.decode. For more information, see http://docs.python.org/3/howto/unicode.html#unicode-howto and http://docs.python.org/3/library/stdtypes.html#bytes.decode.
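To make the effect of errors='ignore' concrete, here is a minimal sketch. The sample bytes and the helper name decode_lenient are made up for illustration; only bytes.decode and its errors argument come from the documentation linked above.

```python
# Hypothetical sample: valid UTF-8 for "café", then a stray 0xc1 byte,
# which can never start a valid UTF-8 sequence.
raw = b"caf\xc3\xa9 \xc1 ok"


def decode_lenient(data):
    """Decode bytes as UTF-8, silently dropping bytes that are not valid UTF-8."""
    return data.decode("utf-8", errors="ignore")


# Strict decoding (the default) raises on the same input.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

text = decode_lenient(raw)  # the 0xc1 byte is dropped, the rest is kept
```

Note that this discards the offending bytes without any warning, so it suits cases where losing a few characters is acceptable.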

Edit:

Here is an example that shows a good way to do this:

for file_name in os.listdir(self.path_to_dir):
    if not file_name.startswith("!"):
        fullpath = os.path.join(self.path_to_dir, file_name)
        with open(fullpath, 'r', encoding ='utf-8', errors='ignore') as body:
            yield [file_name, body.read()]

Using os.path.join lets you remove the add_slash method and ensures the code works cross-platform.
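If the goal is literally to remove the files that are not valid UTF-8, as the question title asks, rather than to skip the bad bytes, one possible approach is to attempt a strict decode of each file and delete those that fail. This is only a sketch: the function names is_valid_utf8 and remove_non_utf8_files and the dry_run flag are hypothetical and are not part of the original code.

```python
import os


def is_valid_utf8(path):
    """Return True if the entire file decodes as strict UTF-8."""
    with open(path, "rb") as fh:
        try:
            fh.read().decode("utf-8")
        except UnicodeDecodeError:
            return False
    return True


def remove_non_utf8_files(directory, dry_run=True):
    """Return the names of files that are not valid UTF-8.

    The files are deleted only when dry_run is False.
    """
    rejected = []
    for name in sorted(os.listdir(directory)):
        full_path = os.path.join(directory, name)
        if os.path.isfile(full_path) and not is_valid_utf8(full_path):
            rejected.append(name)
            if not dry_run:
                os.remove(full_path)
    return rejected
```

Running it first with dry_run=True lists the candidates without touching anything, which is a sensible check before deleting data.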
