python - Convert a text CSV file to binary and get random lines from it in Python without reading it into memory

Question

I have several CSV text files in the following format:

1.3, 0, 1.0
20.0, 3.2, 0
30.5, 5.0, 5.2

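Each file is about 3.5 GB, and I cannot read them into memory with Pandas in a useful amount of time.

But I don't need to read the whole file. What I want is to pick some random lines from the file and read the values there. I think that would be possible if the file were formatted so that all fields have the same size, for instance float16 in a binary file.

Now, I think I can convert it using the NumPy method given in the answer to this question: How to output list of floats to a binary file in Python

But once the conversion is done, how do I go about picking a random line from it?

In a normal text file, I could do it like this: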

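import os
import random

filesize = os.path.getsize('really_big_file')
offset = random.randrange(filesize)
with open('really_big_file', 'rb') as f:   # binary mode, so any byte offset is a valid seek target
    f.seek(offset)                  # go to random position
    f.readline()                    # discard - bound to be partial line
    random_line = f.readline()      # bingo! (a bytes object; empty if we landed in the last line)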

But I can't find a way to make this work with a binary file created with NumPy.

score 2 · Accepted Answer

I'd use struct to convert to binary:

import struct
with open('input.txt') as fin, open('output.txt','wb') as fout:
     for line in fin:
         #You could also use `csv` if you're not lazy like me ...
         out_line = struct.pack('3f',*(float(x) for x in line.split(',')))
         fout.write(out_line)

This writes everything as standard 4-byte floats on most systems.
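
As a quick check of the "4-byte float" claim, struct.calcsize reports how many bytes a format string occupies. The sketch below also uses the explicit little-endian format '<3f'; that format is my assumption and not part of the code above. It pins the size and byte order of each record regardless of platform.

```python
import struct

# Native format, as used in the answer: the size follows the platform's C float.
native_size = struct.calcsize('3f')

# Standard little-endian format (an assumption, not part of the answer):
# 'f' is always 4 bytes in this mode, so a 3-float record is always 12 bytes.
portable_size = struct.calcsize('<3f')

record = struct.pack('<3f', 1.3, 0.0, 1.0)
values = struct.unpack('<3f', record)

print(native_size, portable_size, len(record))
print(values)  # 1.3 comes back only approximately, because of float32 rounding
```

If the binary file is ever read on a different machine, writing and reading with the same explicit format avoids surprises.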

Now, to read the data back again:

import os
import random
import struct

line_size = 12  # each line is 12 bytes long (3 floats, 4 bytes each)
filesize = os.path.getsize('output.txt')
with open('output.txt', 'rb') as fin:
    offset = random.randrange(filesize // line_size)  # pick n'th line randomly
    fin.seek(offset * line_size)  # seek to position of n'th line
    three_floats_bytes = fin.read(line_size)
    three_floats = struct.unpack('3f', three_floats_bytes)
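
Since the question asks for several random lines, here is a minimal sketch that extends the same seek-and-unpack idea to k distinct rows. The helper name read_random_rows and the small demo file are hypothetical and exist only so the example runs on its own. Sorting the chosen indices is a design choice, not a requirement: it makes the reads move forward through the file.

```python
import os
import random
import struct
import tempfile

RECORD = struct.Struct('3f')  # one row: 3 native floats (12 bytes on most systems)

def read_random_rows(path, k):
    """Return k distinct rows, chosen uniformly at random, as (index, values) pairs."""
    n_rows = os.path.getsize(path) // RECORD.size
    indices = sorted(random.sample(range(n_rows), k))  # distinct row numbers
    rows = []
    with open(path, 'rb') as fin:
        for index in indices:
            fin.seek(index * RECORD.size)              # jump straight to the row
            rows.append((index, RECORD.unpack(fin.read(RECORD.size))))
    return rows

# Hypothetical demo data; the values are exactly representable in float32.
demo_rows = [(1.5, 0.0, 1.0), (20.0, 3.25, 0.0), (30.5, 5.0, 5.25)]
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.bin')
with open(demo_path, 'wb') as fout:
    for row in demo_rows:
        fout.write(RECORD.pack(*row))

sample = read_random_rows(demo_path, 2)
print(sample)
```

Only the requested rows are read from disk, so memory use stays proportional to k rather than to the file size.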

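If you're concerned about disk space and want to compress the data with np.float16 (2-byte floats), you can do that too with the basic skeleton above: substitute np.fromstring for struct.unpack and ndarray.tostring for struct.pack (with an ndarray of the appropriate data type, of course; line_size would then drop to 6). In current NumPy releases the equivalent calls are np.frombuffer and ndarray.tobytes.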
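
A minimal sketch of that float16 variant follows, assuming three values per row as in the question. It uses ndarray.tobytes and np.frombuffer, the current names for the tostring/fromstring pair; the demo file and its values are made up for illustration. Keep in mind that float16 holds only about three significant decimal digits and tops out at 65504.

```python
import os
import random
import tempfile

import numpy as np

N_COLS = 3                                          # values per row (assumed)
LINE_SIZE = N_COLS * np.dtype('float16').itemsize   # 3 values * 2 bytes = 6

# Hypothetical demo data; every value is exactly representable in float16.
rows = [(1.5, 0.0, 1.0), (20.0, 3.25, 0.0), (30.5, 5.0, 5.25)]

path = os.path.join(tempfile.mkdtemp(), 'demo_f16.bin')
with open(path, 'wb') as fout:
    for row in rows:
        fout.write(np.array(row, dtype='float16').tobytes())

n_rows = os.path.getsize(path) // LINE_SIZE
with open(path, 'rb') as fin:
    index = random.randrange(n_rows)                # pick a row number
    fin.seek(index * LINE_SIZE)                     # jump to that row
    values = np.frombuffer(fin.read(LINE_SIZE), dtype='float16')

print(index, values)
```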

score 0 · Accepted Answer

So, using the examples from the helpful answers, I found a way to do it with NumPy, in case anyone is interested:

import ast
import zipfile

import numpy

# this converts the file from text CSV to bin
with zipfile.ZipFile("input.zip", 'r') as inputZipFile:
    # the zip contains only one file
    with inputZipFile.open(inputZipFile.namelist()[0], 'r') as inputCSVFile, \
            open("output.bin", 'wb') as outFile:
        for line in inputCSVFile:
            # zip members are read as bytes, so decode before parsing
            lineParsed = ast.literal_eval(line.decode())
            lineOut = numpy.array(lineParsed, 'float16')
            lineOut.tofile(outFile)

import os
import random

import numpy

# this reads random lines from the binary file
lineSize = 20  # float16 has 2 bytes and there are 10 values
fileSize = os.path.getsize("output.bin")

with open("output.bin", 'rb') as file:  # 'rb': opening with 'wb' would truncate the file
    offset = random.randrange(fileSize // lineSize)
    file.seek(offset * lineSize)
    random_line = file.read(lineSize)
    randomArr = numpy.frombuffer(random_line, dtype='float16')
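
As an alternative to manual seek and read (not what the code above does), numpy.memmap can present the binary file as a 2-D array without loading it. This is only a sketch: it assumes 10 float16 values per row, matching the 20-byte line size above, and the demo file is made up for illustration.

```python
import os
import tempfile

import numpy as np

N_COLS = 10  # values per row, matching a line size of 20 bytes

# Hypothetical demo file: 4 rows of 10 float16 values (0, 1, 2, ... 39).
path = os.path.join(tempfile.mkdtemp(), 'demo_memmap.bin')
np.arange(4 * N_COLS, dtype='float16').reshape(4, N_COLS).tofile(path)

# Map the file read-only; data is pulled from disk only for the rows accessed.
data = np.memmap(path, dtype='float16', mode='r').reshape(-1, N_COLS)

rng = np.random.default_rng()
picked = rng.integers(0, data.shape[0], size=2)  # two random row indices
random_rows = np.array(data[picked])             # copy just those rows into memory

print(picked)
print(random_rows)
```

With this layout, picking random rows becomes ordinary array indexing, and the operating system takes care of reading the relevant parts of the file.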

score 0 · Accepted Answer

You'll need to play around with the offsets depending on the storage size, but something like this:

import csv
import struct
import random

count = 0
with open('input.csv') as fin, open('input.dat', 'wb') as fout:
    csvin = csv.reader(fin)
    for row in csvin:
        for col in map(float, row):
            fout.write(struct.pack('f', col))
            count += 1


with open('input.dat', 'rb') as fin:
    i = random.randrange(count)
    fin.seek(i * 4)
    print(struct.unpack('f', fin.read(4)))  # one random value, not a whole row
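
The code above picks a single random value. To get a whole random row, the offset has to be scaled by the row size instead of by 4 bytes. The sketch below does that under the assumption that every CSV row has exactly three fields; N_COLS, the temporary paths and the sample data are all hypothetical.

```python
import csv
import os
import random
import struct
import tempfile

N_COLS = 3                          # assumed: every CSV row has exactly 3 fields
ROW = struct.Struct('%df' % N_COLS) # one packed row of native floats

workdir = tempfile.mkdtemp()
csv_path = os.path.join(workdir, 'input.csv')
dat_path = os.path.join(workdir, 'input.dat')

# Hypothetical sample data in the question's format.
with open(csv_path, 'w', newline='') as f:
    f.write('1.5, 0, 1.0\n20.0, 3.25, 0\n30.5, 5.0, 5.25\n')

# Convert: one fixed-size binary record per CSV row.
n_rows = 0
with open(csv_path, newline='') as fin, open(dat_path, 'wb') as fout:
    for row in csv.reader(fin):
        fout.write(ROW.pack(*map(float, row)))
        n_rows += 1

# Read back one random row by seeking to row_index * row size.
with open(dat_path, 'rb') as fin:
    row_index = random.randrange(n_rows)
    fin.seek(row_index * ROW.size)
    row_values = ROW.unpack(fin.read(ROW.size))

print(row_index, row_values)
```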
