python - Compare two columns of a csv and output the string similarity ratio in another csv

Question

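I'm very new to Python programming. I have a csv file with two columns of string values, and I want to compare the similarity of the strings in the two columns. I'd then like to take the resulting ratios and write them to another file.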

The csv looks like this:

Column 1|Column 2 
tomato|tomatoe 
potato|potatao 
apple|appel

I need the output file to show, row by row, how similar the string in Column 1 is to the string in Column 2. I'm using difflib to compute the ratio score.

This is the code I have so far:

import csv
import difflib

f = open('test.csv')

csf_f = csv.reader(f)

row_a = []
row_b = []

for row in csf_f:
    row_a.append(row[0])
    row_b.append(row[1])

a = row_a
b = row_b

def similar(a, b):
    return difflib.SequenceMatcher(a, b).ratio()

match_ratio = similar(a, b)

match_list = []
for row in match_ratio:
    match_list.append(row)

with open("output.csv", "wb") as f:
    writer = csv.writer(f, delimiter=',')
    writer.writerows(match_list)

f.close()

I get this error:

Traceback (most recent call last):
  File "comparison.py", line 24, in <module>
    for row in match_ratio:
TypeError: 'float' object is not iterable

I feel like I'm not importing the column lists correctly and running them against the SequenceMatcher function.

score 3 · Accepted Answer

Here is another way to do this, using pandas.

Assume your csv data looks like this:

Column 1,Column 2 
tomato,tomatoe 
potato,potatao 
apple,appel

Code

import pandas as pd
import difflib as diff

# Read the CSV
df = pd.read_csv('datac.csv')
# Create a new column 'diff' holding the similarity ratio of the first two columns
# (.iloc selects by position; .strip() drops the trailing spaces in the data)
df['diff'] = df.apply(lambda x: diff.SequenceMatcher(None, x.iloc[0].strip(), x.iloc[1].strip()).ratio(), axis=1)
# Save the DataFrame to CSV; other formats such as Excel or HTML are also possible
df.to_csv('outdata.csv', index=False)

Result

Column 1,Column 2 ,diff
tomato,tomatoe ,0.923076923077
potato,potatao ,0.923076923077
apple,appel ,0.8
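
For comparison, here is a minimal sketch of the same idea using only the standard library, closer to the code in the question. The TypeError there comes from calling SequenceMatcher once on the two whole column lists: ratio() returns a single float, which cannot be looped over, and the first positional argument of SequenceMatcher is the isjunk callback rather than the first sequence. Calling it once per row with None as the first argument avoids both problems. The helper names similarity and add_ratio_column, and the output column name ratio, are made up for this illustration.

```python
import csv
import difflib
import io


def similarity(a, b):
    # The first positional argument of SequenceMatcher is `isjunk`,
    # so None is passed explicitly before the two strings.
    return difflib.SequenceMatcher(None, a.strip(), b.strip()).ratio()


def add_ratio_column(in_file, out_file, delimiter=","):
    """Copy rows from in_file to out_file, appending a per-row ratio."""
    reader = csv.reader(in_file, delimiter=delimiter)
    writer = csv.writer(out_file, delimiter=delimiter)

    header = next(reader, None)
    if header is None:
        return  # empty input, nothing to write
    writer.writerow(header + ["ratio"])

    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed lines
        # One SequenceMatcher call per row, not one call on whole columns.
        writer.writerow(row + [similarity(row[0], row[1])])


if __name__ == "__main__":
    # In-memory sample so the sketch runs without any files on disk.
    sample = "Column 1,Column 2\ntomato,tomatoe\npotato,potatao\napple,appel\n"
    out = io.StringIO()
    add_ratio_column(io.StringIO(sample), out)
    print(out.getvalue())
```

With real files, the same function could be called with open('test.csv', newline='') and open('output.csv', 'w', newline=''). In Python 3 the csv module expects text mode, so the 'wb' mode from the question would fail there. If the file really is pipe-separated as shown in the question, pass delimiter='|'.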
