python - Can I import a CSV file and automatically infer the delimiter?

Question

I want to import two kinds of CSV files: some use ";" as the delimiter, others use ",". So far I have been switching between the following two lines:

reader=csv.reader(f,delimiter=';')

and

reader=csv.reader(f,delimiter=',')

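Is it possible to leave the delimiter unspecified and have the program check which one is correct?

The solutions below (Blender and sharth) seem to work well for comma-separated files (generated with LibreOffice) but not for semicolon-separated files (generated with MS Office). Here are the first lines of one semicolon-separated file: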

ReleveAnnee;ReleveMois;NoOrdre;TitreRMC;AdopCSRegleVote;AdopCSAbs;AdoptCSContre;NoCELEX;ProposAnnee;ProposChrono;ProposOrigine;NoUniqueAnnee;NoUniqueType;NoUniqueChrono;PropoSplittee;Suite2LecturePE;Council PATH;Notes
1999;1;1;1999/83/EC: Council Decision of 18 January 1999 authorising the Kingdom of Denmark to apply or to continue to apply reductions in, or exemptions from, excise duties on certain mineral oils used for specific purposes, in accordance with the procedure provided for in Article 8(4) of Directive 92/81/EEC;U;;;31999D0083;1998;577;COM;NULL;CS;NULL;;;;Propos* are missing on Celex document
1999;1;2;1999/81/EC: Council Decision of 18 January 1999 authorising the Kingdom of Spain to apply a measure derogating from Articles 2 and 28a(1) of the Sixth Directive (77/388/EEC) on the harmonisation of the laws of the Member States relating to turnover taxes;U;;;31999D0081;1998;184;COM;NULL;CS;NULL;;;;Propos* are missing on Celex document

score 62 · Accepted Answer

The csv module seems to recommend using csv.Sniffer for this problem.

The documentation gives the following example, which I have adapted to your case.

with open('example.csv', 'rb') as csvfile:  # python 3: 'r',newline=""
    dialect = csv.Sniffer().sniff(csvfile.read(1024), delimiters=";,")
    csvfile.seek(0)
    reader = csv.reader(csvfile, dialect)
    # ... process CSV file contents here ...

Let's try it out.

[9:13am][wlynch@watermelon /tmp] cat example 
#!/usr/bin/env python
import csv

def parse(filename):
    with open(filename, 'rb') as csvfile:
        dialect = csv.Sniffer().sniff(csvfile.read(), delimiters=';,')
        csvfile.seek(0)
        reader = csv.reader(csvfile, dialect)

        for line in reader:
            print line

def main():
    print 'Comma Version:'
    parse('comma_separated.csv')

    print
    print 'Semicolon Version:'
    parse('semicolon_separated.csv')

    print
    print 'An example from the question (kingdom.csv)'
    parse('kingdom.csv')

if __name__ == '__main__':
    main()

Our sample inputs:

[9:13am][wlynch@watermelon /tmp] cat comma_separated.csv 
test,box,foo
round,the,bend

[9:13am][wlynch@watermelon /tmp] cat semicolon_separated.csv 
round;the;bend
who;are;you

[9:22am][wlynch@watermelon /tmp] cat kingdom.csv 
ReleveAnnee;ReleveMois;NoOrdre;TitreRMC;AdopCSRegleVote;AdopCSAbs;AdoptCSContre;NoCELEX;ProposAnnee;ProposChrono;ProposOrigine;NoUniqueAnnee;NoUniqueType;NoUniqueChrono;PropoSplittee;Suite2LecturePE;Council PATH;Notes
1999;1;1;1999/83/EC: Council Decision of 18 January 1999 authorising the Kingdom of Denmark to apply or to continue to apply reductions in, or exemptions from, excise duties on certain mineral oils used for specific purposes, in accordance with the procedure provided for in Article 8(4) of Directive 92/81/EEC;U;;;31999D0083;1998;577;COM;NULL;CS;NULL;;;;Propos* are missing on Celex document
1999;1;2;1999/81/EC: Council Decision of 18 January 1999 authorising the Kingdom of Spain to apply a measure derogating from Articles 2 and 28a(1) of the Sixth Directive (77/388/EEC) on the harmonisation of the laws of the Member States relating to turnover taxes;U;;;31999D0081;1998;184;COM;NULL;CS;NULL;;;;Propos* are missing on Celex document

Running the sample program gives:

[9:14am][wlynch@watermelon /tmp] ./example 
Comma Version:
['test', 'box', 'foo']
['round', 'the', 'bend']

Semicolon Version:
['round', 'the', 'bend']
['who', 'are', 'you']

An example from the question (kingdom.csv)
['ReleveAnnee', 'ReleveMois', 'NoOrdre', 'TitreRMC', 'AdopCSRegleVote', 'AdopCSAbs', 'AdoptCSContre', 'NoCELEX', 'ProposAnnee', 'ProposChrono', 'ProposOrigine', 'NoUniqueAnnee', 'NoUniqueType', 'NoUniqueChrono', 'PropoSplittee', 'Suite2LecturePE', 'Council PATH', 'Notes']
['1999', '1', '1', '1999/83/EC: Council Decision of 18 January 1999 authorising the Kingdom of Denmark to apply or to continue to apply reductions in, or exemptions from, excise duties on certain mineral oils used for specific purposes, in accordance with the procedure provided for in Article 8(4) of Directive 92/81/EEC', 'U', '', '', '31999D0083', '1998', '577', 'COM', 'NULL', 'CS', 'NULL', '', '', '', 'Propos* are missing on Celex document']
['1999', '1', '2', '1999/81/EC: Council Decision of 18 January 1999 authorising the Kingdom of Spain to apply a measure derogating from Articles 2 and 28a(1) of the Sixth Directive (77/388/EEC) on the harmonisation of the laws of the Member States relating to turnover taxes', 'U', '', '', '31999D0081', '1998', '184', 'COM', 'NULL', 'CS', 'NULL', '', '', '', 'Propos* are missing on Celex document']

It is probably also worth noting which version of Python I am using.

[9:20am][wlynch@watermelon /tmp] python -V
Python 2.7.2
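
The transcript above was produced under Python 2. As a minimal sketch for Python 3, assuming the same two candidate delimiters, the file can be opened in text mode with newline="" as the comment in the first snippet suggests. The name read_rows and the 4096-character sample size are illustrative choices, not part of the original answer.

```python
import csv


def read_rows(path, candidates=";,", sample_size=4096):
    """Sniff the delimiter from the start of the file, then parse every row."""
    # newline="" lets the csv module handle line endings itself (Python 3).
    with open(path, "r", newline="") as csvfile:
        sample = csvfile.read(sample_size)
        csvfile.seek(0)  # rewind so the reader also sees the sampled part
        # sniff() raises csv.Error if it cannot settle on a delimiter.
        dialect = csv.Sniffer().sniff(sample, delimiters=candidates)
        return dialect.delimiter, list(csv.reader(csvfile, dialect))
```

Calling read_rows('semicolon_separated.csv') should return the detected delimiter together with the parsed rows. A caller may want to catch csv.Error and fall back to a default delimiter.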

score 12 · Accepted Answer

Given a project that deals with well-formed CSV files delimited by either , (comma) or | (vertical bar), I tried the following (as given at https://docs.python.org/2/library/csv.html#csv.Sniffer):

dialect = csv.Sniffer().sniff(csvfile.read(1024), delimiters=',|')

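However, on a |-delimited file this raised the "Could not determine delimiter" exception. It seemed reasonable to speculate that the sniff heuristic works best when every line has the same number of delimiters (not counting any that are enclosed in quotes). A fixed 1024-byte sample can end in the middle of a line, so instead of reading the first 1024 bytes of the file, I tried reading the first two lines in their entirety: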

temp_lines = csvfile.readline() + '\n' + csvfile.readline()
dialect = csv.Sniffer().sniff(temp_lines, delimiters=',|')

So far, this is working well for me.
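
The same idea can be sketched for Python 3 as follows. This is only an illustration under stated assumptions: the helper name sniff_from_lines and its line_count parameter are hypothetical, and it assumes no quoted field spans more than one line within the sample.

```python
import csv
from itertools import islice


def sniff_from_lines(path, candidates=",|", line_count=2):
    """Sniff the dialect from the first few complete lines of a file."""
    with open(path, "r", newline="") as csvfile:
        # Each line already ends with its own newline, so a plain join is enough.
        sample = "".join(islice(csvfile, line_count))
    # Raises csv.Error if no candidate delimiter fits the sample.
    return csv.Sniffer().sniff(sample, delimiters=candidates)
```

The returned dialect can be passed to csv.reader after reopening the file. Raising line_count gives the Sniffer more rows to compare while still handing it whole lines only.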

score 2 · Accepted Answer

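I don't think there can be a perfectly general solution to this (one reason to use , as a delimiter is that some data fields need to be able to contain ;). A simple heuristic is to read the first line (or more) and count how many , and ; characters it contains, possibly ignoring those inside quotes if whatever creates your .csv files quotes entries properly and consistently. Then guess that the more frequent of the two is the right delimiter.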
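
A minimal sketch of that counting heuristic, assuming double quotes are the quote character. The function name guess_delimiter and its parameters are hypothetical, and the tie-breaking rule is a choice made here, not something the answer specifies.

```python
def guess_delimiter(line, candidates=";,", quotechar='"'):
    """Return the candidate that occurs most often outside quoted fields."""
    counts = {c: 0 for c in candidates}
    in_quotes = False
    for ch in line:
        if ch == quotechar:
            # A doubled quote ("") toggles twice, so the state is preserved.
            in_quotes = not in_quotes
        elif not in_quotes and ch in counts:
            counts[ch] += 1
    # Ties resolve to whichever candidate is listed first.
    return max(candidates, key=lambda c: counts[c])
```

For the header line in the question this picks ";", and the commas inside the long title field of the data rows are outnumbered by the semicolons. It remains a guess: a file with a single column, or one whose text fields contain many unquoted candidate characters, can still fool it.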
