python - Losing lines from two text files when iterating

Question

I have two text files (A and B) that look like this:

A:
1 stringhere 5
1 stringhere 3
...
2 stringhere 4
2 stringhere 4
...

B:
1 stringhere 4
1 stringhere 5
...
2 stringhere 1
2 stringhere 2
...

What I need to do is read the two files and then create a new text file like this:

1 stringhere 5
1 stringhere 3
...
1 stringhere 4
1 stringhere 5
...
2 stringhere 4
2 stringhere 4
...
2 stringhere 1
2 stringhere 2
...

I wrote a function using a for loop (in Python):

def find(arch, i):
    l = arch
    for line in l:
        lines = line.split('\t')
        if i == int(lines[0]):
            pass  # write the line to the output text file
        else:
            break

Then I call the function like this:

for i in range(1,3):        
    find(o, i)
    find(r, i)

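Some data is lost, because the first line that contains a different number is read but never makes it into the final .txt file. In this example, 2 stringhere 4 and 2 stringhere 1 are lost.

Is there any way to avoid this?

Thanks in advance.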

score 2 · Accepted Answer

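There may be a less complicated way to accomplish this. The following also keeps the lines in the order in which they appear in the files, which seems to be what you want.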

lines = []
lines.extend(open('file_a.txt').readlines())
lines.extend(open('file_b.txt').readlines())
lines = [line.strip('\n') + '\n' for line in lines]
key = lambda line: int(line.split()[0])
open('out_file.txt', 'w').writelines(sorted(lines, key=key))

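The first three lines read the input files into a single list of lines.

The fourth line ensures that each line ends with exactly one newline. If you are sure that both files end with a newline, you can omit this line.

The fifth line defines the sort key as the integer value of the first word of each line.

The sixth line sorts the lines and writes the result to the output file. Because Python's sort is stable, lines with the same key keep their original relative order.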
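
As a rough sketch of the same idea with the files closed explicitly via with blocks, something like the following should work. The function name merge_by_first_column and its parameters are made up for illustration, and it assumes every line starts with an integer column and that there are no blank lines.

```python
def merge_by_first_column(path_a, path_b, out_path):
    """Concatenate two files and sort the lines by their first column."""
    lines = []
    for path in (path_a, path_b):
        with open(path) as f:
            for line in f:
                # Drop the newline so a missing final newline does not matter.
                lines.append(line.rstrip("\n"))
    # sorted() is stable: lines with equal keys keep their original order.
    lines = sorted(lines, key=lambda line: int(line.split()[0]))
    with open(out_path, "w") as out:
        out.writelines(line + "\n" for line in lines)
```

Calling merge_by_first_column('file_a.txt', 'file_b.txt', 'out_file.txt') would then produce the same output file as the six-line version.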

score 2 · Accepted Answer

If the files fit in memory:

with open('A') as file1, open('B') as file2:
    L = file1.read().splitlines()
    L.extend(file2.read().splitlines())
L.sort(key=lambda line: int(line.partition(' ')[0])) # sort by 1st column
print("\n".join(L)) # print result

This is an efficient method if the total number of lines is under a million. Otherwise, and especially if you have many sorted files, you could use heapq.merge() to combine them.
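
A minimal sketch of the heapq.merge() approach, assuming each input file is already sorted by its first column and Python 3.5 or later (where heapq.merge() accepts a key argument). The names first_column and merge_sorted_files are invented for this example:

```python
import heapq

def first_column(line):
    # Sort key: the integer in the first whitespace-separated column.
    return int(line.split()[0])

def merge_sorted_files(paths, out_path):
    """Merge files that are each already sorted by their first column."""
    files = [open(path) for path in paths]
    try:
        with open(out_path, "w") as out:
            # heapq.merge() pulls lines lazily, one at a time from each input,
            # so the files never have to be loaded into memory in full.
            for line in heapq.merge(*files, key=first_column):
                out.write(line if line.endswith("\n") else line + "\n")
    finally:
        for f in files:
            f.close()
```

For example, merge_sorted_files(['A', 'B'], 'out.txt') would write the merged lines to a hypothetical out.txt; lines with equal keys come from the earlier file first.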

score 2 · Accepted Answer

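In your loop, you break when a line does not start with the same value as i, but by then you have already consumed that line, so when the function is called a second time with i+1, it starts at the second valid line.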
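
A small demonstration of the effect, using io.StringIO in place of a real file (the helper name take_while_key and the sample data are hypothetical):

```python
import io

def take_while_key(handle, i):
    """Collect lines whose first column equals i; stop at the first mismatch."""
    taken = []
    for line in handle:
        if int(line.split()[0]) == i:
            taken.append(line)
        else:
            break  # this line has already been read and is now discarded
    return taken

f = io.StringIO("1 a\n1 b\n2 c\n2 d\n")
first = take_while_key(f, 1)   # ['1 a\n', '1 b\n']
second = take_while_key(f, 2)  # ['2 d\n'], because '2 c' was consumed by the first call
```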

Either read the whole file into memory beforehand (see @JFSebastian's answer) or, if that is not an option, replace your function with something like:

def find(arch, i):
    l = arch
    while True:
        line=l.readline()
        lines = line.split('\t')
        if line != "" and i == int(lines[0]): # Need to catch end of file
            print " ".join(lines),
        else:
            l.seek(-len(line), 1) # Need to 'unread' the last read line
            break

This version "rewinds" the cursor so that the next call to readline reads the correct line again. Note that mixing the implicit for line in l with seek calls is not recommended, hence the while True.

Example:

$ cat t.py
o = open("t1")
r = open("t2")
print o
print r


def find(arch, i):
    l = arch
    while True:
        line=l.readline()
        lines = line.split(' ')
        if line != "" and i == int(lines[0]):
            print " ".join(lines),
        else:
            l.seek(-len(line), 1)
            break

for i in range(1, 3):
    find(o, i)
    find(r, i)

$ cat t1 
1 stringhere 1
1 stringhere 2
1 stringhere 3
2 stringhere 1
2 stringhere 2
$ cat t2
1 stringhere 4
1 stringhere 5
2 stringhere 1
2 stringhere 2
$ python t.py
<open file 't1', mode 'r' at 0x100261e40>
<open file 't2', mode 'r' at 0x100261ed0>
1 stringhere 1
1 stringhere 2
1 stringhere 3
1 stringhere 4
1 stringhere 5
2 stringhere 1
2 stringhere 2
2 stringhere 1
2 stringhere 2
$
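
The code above is Python 2 (print statement, relative seek on a text file). In Python 3, text-mode files do not allow a non-zero seek relative to the current position, so a sketch of the same "unread" idea could remember the position with tell() before each readline() and seek back to it. The out parameter and the io.StringIO sample data are additions for illustration only:

```python
import io

def find(arch, i, out):
    """Copy consecutive lines whose first column equals i from arch to out."""
    while True:
        pos = arch.tell()       # remember where this line starts
        line = arch.readline()
        if line and int(line.split()[0]) == i:
            out.write(line if line.endswith("\n") else line + "\n")
        else:
            arch.seek(pos)      # 'unread' the line so the next call sees it
            break

# In-memory stand-ins for the two input files and the output file.
o = io.StringIO("1 x 1\n1 x 2\n2 x 1\n")
r = io.StringIO("1 x 4\n2 x 2\n")
out = io.StringIO()
for i in range(1, 3):
    find(o, i, out)
    find(r, i, out)
```

The same function should work with files opened in text mode, as long as they are read with readline() rather than a for loop, since iterating over a file disables tell().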
