python - Match different columns and join them using Python

Question

I have two text files.

The first is a space-separated list:

23 dog 4
24 cat 5
28 cow 7

The second is a list delimited by '|':

?dog|parallel|numbering|position
Dogsarebarking
?cat|parallel|nuucers|position
CatisBeautiful

I want to get an output file like this:

?dog|parallel|numbering|position|23
?cat|parallel|nuucers|position|24

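That is, a '|'-delimited list containing the values from the second file, with the first-column value from the first file appended wherever the second-column values of both files match.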

score 3 · Accepted Answer

This is the kind of task the pandas library excels at.

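import pandas as pd
# Lines without '|' leave columns 1-3 empty (NaN), so dropna() removes them
df1 = pd.read_csv("c1.txt", sep="|", header=None).dropna()
df2 = pd.read_csv("c2.txt", sep=" ", header=None)
# Inner-join on the column labeled 1, then drop the last column
merged = df1.merge(df2, on=1).iloc[:, :-1]
merged.to_csv("merged.csv", sep="|", header=False, index=False)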

Some explanation follows. First, we read the files into objects called DataFrames.

>>> df1 = pd.read_csv("c1.txt", sep="|", header=None).dropna()
>>> df1
               0      1          2         3
0      ?parallel    dog  numbering  position
3      ?parallel    cat    nuucers  position
6  ?non parallel  honey  numbering  position
>>> df2 = pd.read_csv("c2.txt", sep=" ", header=None)
>>> df2
    0    1  2
0  23  dog  4
1  24  cat  5
2  28  cow  7

.dropna() skips the rows that have missing data. Alternatively, df1 = df1[df1[0].str.startswith("?")] would have been another way to do it.
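As a rough check that the two filters agree on this kind of input, here is a small sketch. It assumes the sample layout shown in the DataFrame above and uses an in-memory string instead of c1.txt. The names sample, raw, by_dropna and by_prefix are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical in-memory stand-in for c1.txt
sample = io.StringIO(
    "?parallel|dog|numbering|position\n"
    "Dogsarebarking\n"
    "?parallel|cat|nuucers|position\n"
    "CatisBeautiful\n"
)
raw = pd.read_csv(sample, sep="|", header=None)

# Filter 1: drop every row that has at least one missing value
by_dropna = raw.dropna()

# Filter 2: keep only the rows whose first field starts with "?"
by_prefix = raw[raw[0].str.startswith("?")]
```

The two filters stop agreeing once a record line has an empty field: dropna() would discard that line, while the prefix test would keep it.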

Next, we merge them on the column labeled 1 (the second column, since labels start at 0).

>>> df1.merge(df2, on=1)
         0_x    1        2_x         3  0_y  2_y
0  ?parallel  dog  numbering  position   23    4
1  ?parallel  cat    nuucers  position   24    5

We don't need the last column, so we slice it off.

>>> df1.merge(df2, on=1).iloc[:, :-1]
         0_x    1        2_x         3  0_y
0  ?parallel  dog  numbering  position   23
1  ?parallel  cat    nuucers  position   24
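Slicing by position depends on the column order of the merge result. One possible alternative, sketched here with in-memory data in the same layout as the frames above, is to select and rename the wanted columns before merging. The names records, lookup and ids and the column label "id" are assumptions made for this example.

```python
import io

import pandas as pd

records = pd.read_csv(
    io.StringIO(
        "?parallel|dog|numbering|position\n"
        "?parallel|cat|nuucers|position\n"
    ),
    sep="|",
    header=None,
)
lookup = pd.read_csv(
    io.StringIO("23 dog 4\n24 cat 5\n28 cow 7\n"), sep=" ", header=None
)

# Keep only the join key (column 1) and the value to append (column 0),
# renaming the latter so the merge does not need _x/_y suffixes
ids = lookup[[1, 0]].rename(columns={0: "id"})
merged = records.merge(ids, on=1)

# With no path argument, to_csv returns the text instead of writing a file
text = merged.to_csv(sep="|", header=False, index=False)
```

Here text holds the same two '|'-delimited lines that the positional slice would produce.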

And writing it out with to_csv gives:

>>> !cat merged.csv
?parallel|dog|numbering|position|23
?parallel|cat|nuucers|position|24

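Now, for many simple tasks pandas can be overkill, and it's important to learn how to use lower-level tools such as the csv module as well. On the other hand, when you just want to get something done right now, it's very handy.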

score 3 · Accepted Answer

Use a dictionary to store the file1 rows, keyed on the second column, and csv to write the output. The second file is in FASTA format, so we only take the lines that start with ?.

import csv

# Map the second column of file1 to its first column, e.g. {'dog': '23'}
with open('file1') as file1:
    file1_data = dict(line.split(None, 2)[1::-1] for line in file1 if line.strip())

# newline='' lets the csv module control the line endings itself
with open('file2') as file2, open('output', 'w', newline='') as outputfile:
    output = csv.writer(outputfile, delimiter='|')
    for line in file2:
        if line[:1] == '?':
            row = line.strip().split('|')
            key = row[0][1:]  # drop the leading '?'
            if key in file1_data:
                output.writerow(row + [file1_data[key]])

This produces:

?dog|parallel|numbering|position|23
?cat|parallel|nuucers|position|24

for your example input.
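The expression line.split(None, 2)[1::-1] is fairly terse: it splits on whitespace, then takes the first two fields in reverse order, so each line becomes a (name, number) pair. A spelled-out equivalent might look like the sketch below. The function name build_lookup and the sample list are made up for illustration.

```python
def build_lookup(lines):
    """Map the second whitespace-separated field of each line to the first."""
    lookup = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        number, name = line.split(None, 2)[:2]
        lookup[name] = number
    return lookup


# Hypothetical stand-in for the lines of file1, including a blank line
sample = ["23 dog 4\n", "24 cat 5\n", "\n", "28 cow 7\n"]
lookup = build_lookup(sample)

# The one-liner from the answer, applied to the same lines
one_liner = dict(line.split(None, 2)[1::-1] for line in sample if line.strip())
```

Both forms build the same dictionary, so the choice between them is a matter of readability.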

score 0 · Accepted Answer

This looks like exactly what JOIN is for in relational databases.

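An inner join is the most common join operation used in applications and can be regarded as the default join type. An inner join creates a new result table by combining the column values of two tables (A and B) based on the join predicate. The query compares each row of A with each row of B to find all pairs of rows that satisfy the join predicate. When the join predicate is satisfied, the column values of each matched pair of rows of A and B are combined into a result row.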
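To make that definition concrete, here is a naive pure-Python sketch of an inner join as a nested loop over both tables. It only illustrates the definition and is not how a database engine actually executes a join. The function name inner_join and the sample tuples (taken from the question's data) are assumptions for this example.

```python
def inner_join(table_a, table_b, predicate):
    """Return a + b for every pair of rows (a, b) that satisfies predicate."""
    return [a + b for a in table_a for b in table_b if predicate(a, b)]


table_a = [("23", "dog", "4"), ("24", "cat", "5"), ("28", "cow", "7")]
table_b = [
    ("?dog", "parallel", "numbering", "position"),
    ("?cat", "parallel", "nuucers", "position"),
]

# Join predicate: the name in A, prefixed with "?", equals the first field of B
joined = inner_join(table_a, table_b, lambda a, b: "?" + a[1] == b[0])
```

The cow row has no partner in table_b, so it does not appear in joined, which is the behavior that distinguishes an inner join from an outer join.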

Take a look at this example:

import sqlite3
conn = sqlite3.connect('example.db')

# get a cursor for the database
c = conn.cursor()

# create and populate table1
c.execute("DROP TABLE IF EXISTS table1")
c.execute("CREATE TABLE table1 (col1 text, col2 text, col3 text)")
with open("file1") as f:
    for line in f:
        fields = line.split()
        if len(fields) == 3:  # skip blank or malformed lines
            c.execute("INSERT INTO table1 VALUES (?, ?, ?)", fields)

# create and populate table2
c.execute("DROP TABLE IF EXISTS table2")
c.execute("CREATE TABLE table2 (col1 text, col2 text, col3 text, col4 text)")
with open("file2") as f:
    for line in f:
        fields = line.strip().split('|')
        if len(fields) == 4:  # skip the free-text lines between records
            c.execute("INSERT INTO table2 VALUES (?, ?, ?, ?)", fields)

# make changes persistent
conn.commit()

# retrieve desired data and write it to file
with open("file3", "w+") as f:
    for x in c.execute(
        """
        SELECT table2.col1
             , table2.col2
             , table2.col3
             , table2.col4
             , table1.col1 
        FROM table1 JOIN table2 ON table1.col2 = table2.col2
        """):
        f.write("%s\n" % "|".join(x))

# close connection
conn.close()

The output file looks like this:

parallel|dog|numbering|position|23
parallel|cat|nuucers|position|24
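The query above joins on table2.col2, which fits a layout where the animal name is the second field of each record. For the sample in the question, where the name is the first field and carries a ? prefix, a variant might look like the following sketch. It uses an in-memory database and hard-coded lines instead of the files. The names file1_lines, file2_lines, query and result are made up for illustration.

```python
import sqlite3

# Hypothetical stand-ins for the contents of file1 and file2
file1_lines = ["23 dog 4", "24 cat 5", "28 cow 7"]
file2_lines = [
    "?dog|parallel|numbering|position",
    "Dogsarebarking",
    "?cat|parallel|nuucers|position",
    "CatisBeautiful",
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (col1 text, col2 text, col3 text)")
conn.execute("CREATE TABLE table2 (col1 text, col2 text, col3 text, col4 text)")

conn.executemany(
    "INSERT INTO table1 VALUES (?, ?, ?)",
    (line.split() for line in file1_lines),
)
# Only the record lines (those starting with "?") are loaded
conn.executemany(
    "INSERT INTO table2 VALUES (?, ?, ?, ?)",
    (line.split("|") for line in file2_lines if line.startswith("?")),
)

# Prepend "?" to the name from table1 so it matches table2.col1
query = """
    SELECT table2.col1, table2.col2, table2.col3, table2.col4, table1.col1
    FROM table2 JOIN table1 ON table2.col1 = '?' || table1.col2
    ORDER BY table2.rowid
"""
result = ["|".join(row) for row in conn.execute(query)]
conn.close()
```

The ORDER BY clause keeps the rows in the order they appeared in the second file, since SQL does not otherwise guarantee the order of a join result.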
