固定幅ファイルの扱いが特に厄介です。私が思っていた形式でエンコードされていません。一言で言えば、私はさまざまなことをやろうとしています:
- RPLY01 の前のすべての空白をスキップします。
- これらの奇抜な \x00* 文字を削除します。それらを削除することを検討しましたが、アスタリスク (*) で覆われている文字は、何を試しても削除できませんでした。
- FDXSPD01/で分割
- CHKP01 で分割します。
- 最終的に、すべての CSV を適切なリストに分割し、他のすべてのがらくた (RPLY と CHKP) を解析可能なわかりやすい形式にトラップします。
元の出力は次のとおりです。
�cRPLY01 IREQ 0000011
N00 �9FDXSPD01"CASH","","",10219575.34,0.00,0,"000000000000773"
�EFDXSPD01"CAD","CANADA DOLLAR","CU",-14564.52,0.00,0,"000000000000773"
�PFDXSPD01"CCTUSD","CURRENCY CONTRACT - USD","CC",3644.00,0.00,0,"000000000000773"
�QFDXSPD01"CCTUSD","CURRENCY CONTRACT - USD","CC",-3641.07,0.00,0,"000000000000773"
�PFDXSPD01"CCTUSD","CURRENCY CONTRACT - USD","CC",1457.00,0.00,0,"000000000000773"
�QFDXSPD01"CCTUSD","CURRENCY CONTRACT - USD","CC",-1456.43,0.00,0,"000000000000773"
�DFDXSPD01"CD5427","JOHN CD","CD",100000.00,197.95,0,"000000000000773"
�LFDXSPD01"CP5427","COMMERCIAL PAPER","CP",9925000.00,0.00,0,"000000000000773"
�JFDXSPD01"FS5427","JOHNFORSTOCK","FS",81000.00,10000.00,0,"000000000000773"
�FFDXSPD01"FUT5427","JOHNFUTURE","FT",264000.00,0.00,0,"000000000000773"
�BFDXSPD01"JKSTOCK","JK STOCK","S",31500.00,0.00,0,"000000000000773"
�LFDXSPD01"MB5427","JOHN MUNI BOND","M",255000.00,15611.92,0,"000000000000773"
�QFDXSPD01"MBS5427","JOHNMORTGAGEBACKED","G1",996500.00,2916.67,0,"000000000000773"
�EFDXSPD01"OPT5427","JOHNOPTION","O",464000.00,0.00,0,"000000000000773"
�CFDXSPD01"TB5427","TREASURY BILL","TI",0.00,0.00,0,"000000000000773"
�HFDXSPD01"UB5427","JOHN BOND","G",2994000.00,13281.26,0,"000000000000773"
�9FDXSPD01"UNITS","UNITS","S",0.00,0.00,0,"000000000000773"
�CHKP01 N0000000170
これは同じ出力ですが、次のrepr
とおりです。
\x00cRPLY01 IREQ 0000011
N00 \x009FDXSPD01"CASH","","",10219575.34,0.00,0,"000000000000773"
\x00EFDXSPD01"CAD","CANADA DOLLAR","CU",-14564.52,0.00,0,"000000000000773"
\x00PFDXSPD01"CCTUSD","CURRENCY CONTRACT - USD","CC",3644.00,0.00,0,"000000000000773"
\x00QFDXSPD01"CCTUSD","CURRENCY CONTRACT - USD","CC",-3641.07,0.00,0,"000000000000773"
\x00PFDXSPD01"CCTUSD","CURRENCY CONTRACT - USD","CC",1457.00,0.00,0,"000000000000773"
\x00QFDXSPD01"CCTUSD","CURRENCY CONTRACT - USD","CC",-1456.43,0.00,0,"000000000000773"
\x00DFDXSPD01"CD5427","JOHN CD","CD",100000.00,197.95,0,"000000000000773"
\x00LFDXSPD01"CP5427","COMMERCIAL PAPER","CP",9925000.00,0.00,0,"000000000000773"
\x00JFDXSPD01"FS5427","JOHNFORSTOCK","FS",81000.00,10000.00,0,"000000000000773"
\x00FFDXSPD01"FUT5427","JOHNFUTURE","FT",264000.00,0.00,0,"000000000000773"
\x00BFDXSPD01"JKSTOCK","JK STOCK","S",31500.00,0.00,0,"000000000000773"
\x00LFDXSPD01"MB5427","JOHN MUNI BOND","M",255000.00,15611.92,0,"000000000000773"
\x00QFDXSPD01"MBS5427","JOHNMORTGAGEBACKED","G1",996500.00,2916.67,0,"000000000000773"
\x00EFDXSPD01"OPT5427","JOHNOPTION","O",464000.00,0.00,0,"000000000000773"
\x00CFDXSPD01"TB5427","TREASURY BILL","TI",0.00,0.00,0,"000000000000773"
\x00HFDXSPD01"UB5427","JOHN BOND","G",2994000.00,13281.26,0,"000000000000773"
\x009FDXSPD01"UNITS","UNITS","S",0.00,0.00,0,"000000000000773"
\x00\x13CHKP01 N0000000170'
これまでの私の思考の流れは次のとおりです。
#!/usr/local/bin/python
import string
import struct
import re
import array
cols = []
splitcols = []
def removeNonAscii(s): return "".join(i for i in s if ord(i) < 128)
rgx = r'[\x00-\x20\x22\x2F\x3A\x3C\x3E\x5C]'
fname2 = open('/Users/abcd/Documents/SVN/scripts/Python/project/I1DL/output.txt', 'r')
fname = open('/Users/abcd/Documents/SVN/scripts/Python/project/I1DL/output.txt', 'r')
ename2 = open('/Users/abcd/Documents/SVN/scripts/Python/project/I1DL/error_output.txt', 'r')
ename = open('/Users/abcd/Documents/SVN/scripts/Python/project/I1DL/error_output.txt', 'r')
losfile = open('/Users/abcd/Documents/SVN/scripts/Python/project/I1DL/error_output.txt', 'r')
losfile2 = open('/Users/abcd/Documents/SVN/scripts/Python/project/I1DL/error_output.txt', 'r')
filename = '/Users/abcd/Documents/SVN/scripts/Python/project/I1DL/output.txt'
filename2 = '/Users/abcd/Documents/SVN/scripts/Python/project/I1DL/error_output.txt'
almost = open('/Users/abcd/Documents/SVN/scripts/Python/project/I1DL/output.txt', 'r')
almost2 = open('/Users/abcd/Documents/SVN/scripts/Python/project/I1DL/output.txt', 'r')
for line in fname.read().split('\t'):
print repr(line)
filtered_string = filter(lambda x: x in string.printable, line)
print "unfilter line: " + line
print "filtered line: " + filtered_string
removeNonAscii(line)
print "======================================================================================"
for line in ename.read().split('\t'):
filtered_string = filter(lambda x: x in string.printable, line)
print "error unfilter line: " + line
print "error filtered line: " + filtered_string
print "======================================================================================"
for line in ename2.read().split('\t'):
for chars in line:
if chars in string.printable:
print chars
else:
print "!!!!!!!!!!!!!!!"
print "======================================================================================"
#for line in fname2.read().split('\t'):
#for chars in line:
#if chars in string.printable:
#print chars
#else:
#print "@@@@@@@@@@@@@@@"
print "======================================================================================"
print "======================================================================================"
testlist = []
testlist2 = []
with open(filename) as f:
readarray = f.readline()
for eachitem in readarray[6:]:
if(all(ord(c) < 127 and c in string.printable for c in eachitem)):
testlist.append(eachitem)
print repr(''.join(testlist))
with open(filename2) as f:
readarray2 = f.readline()
for eachitem in readarray2[6:]:
testlist2.append(removeNonAscii(eachitem))
print repr(''.join(testlist2))
with open(filename) as f:
readarray = f.readline()
print repr(readarray[6:].split(' FDXSPD01'))
for eachitem in readarray[6:]:
if(all(ord(c) < 127 and c in string.printable for c in eachitem)):
testlist.append(eachitem)
print repr(''.join(testlist))
#for eachitem in almost.read():
#if(all(ord(c) < 127 and c in string.printable for c in eachitem)):
#testlist.append(eachitem)
#print ''.join(testlist)
#for eachitem in losfile2.read():
#testlist2.append(removeNonAscii(eachitem))
#print ''.join(testlist2)
寝不足なのかもしれませんが、正解が見つからないようです。おそらく、Python に精通している人が道を教えてくれるでしょう。