python - UTF-8 encoded file is reported as ASCII by chardetect

Translated from: https://stackoverflow.com/questions/45574275 2017-08-08T17:09:56.260

Viewed 349 times

I am creating a single file that combines all the files in a folder, and I want the resulting text file to be encoded as UTF-8. My code is as follows:

import os
import codecs
import re

def file_concatenation(path):
    # The output file is opened with the UTF-8 codec.
    with codecs.open('C:/Users/JAYASHREE/Documents/NLP/text-corpus.txt', 'w', encoding='utf8') as outfile:
        for root, dirs, files in os.walk(path):
            for dir_name in dirs:
                dir_path = os.path.join(root, dir_name)
                for fname in os.listdir(dir_path):
                    with open(os.path.join(dir_path, fname)) as infile:
                        for line in infile:
                            # Replace every character that is not an ASCII letter with a space.
                            new_line = re.sub('[^a-zA-Z]', ' ', line)
                            # Collapse runs of whitespace into a single space.
                            outfile.write(re.sub(r"\s\s+", " ", new_line.lstrip()))

file_concatenation('C:/Users/JAYASHREE/Documents/NLP/bbc-fulltext/bbc')

When I check the encoding with chardetect, it reports ASCII with confidence 1.0:

C:\Users\JAYASHREE>chardetect "C:/Users/JAYASHREE/Documents/NLP/text-corpus.txt"
C:/Users/JAYASHREE/Documents/NLP/text-corpus.txt: ascii with confidence 1.0

Please help me resolve this. Thank you.
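
A likely explanation, offered as a sketch rather than a confirmed answer: the regex in the code above replaces every character outside a-z and A-Z with a space, so the output file can only contain ASCII characters. ASCII text encoded as UTF-8 is byte-for-byte identical to ASCII, so an encoding detector has nothing to distinguish the two and reports the narrower one. The helper name clean_line and the sample string below are hypothetical and only mirror the two re.sub calls from the question.

```python
import re

def clean_line(line):
    # Hypothetical helper: same two substitutions as in the question's loop.
    letters_only = re.sub('[^a-zA-Z]', ' ', line)
    return re.sub(r'\s\s+', ' ', letters_only.lstrip())

# Made-up sample line containing non-ASCII characters (e-acute and an en dash).
sample = "Caf\u00e9 prices rose 5% \u2013 analysts say\n"
cleaned = clean_line(sample)

# After cleaning, only ASCII letters and spaces remain, so both codecs
# produce exactly the same bytes.
utf8_bytes = cleaned.encode('utf8')
ascii_bytes = cleaned.encode('ascii')  # would raise UnicodeEncodeError if non-ASCII survived

print(utf8_bytes == ascii_bytes)

# The original line, by contrast, cannot be encoded as ASCII at all.
try:
    sample.encode('ascii')
    original_is_ascii = True
except UnicodeEncodeError:
    original_is_ascii = False
print(original_is_ascii)
```

Under this reading the file is already valid UTF-8 and nothing needs fixing. If a downstream tool has to see an explicit UTF-8 marker, one option is to open the output with the 'utf-8-sig' codec, which writes a byte order mark at the start of the file; whether that is desirable depends on the consumers of the file.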
