python - Python: Remove lines that contain a word from a list

Question

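I'm working on a Python script that doesn't seem to be working correctly. It takes the following two inputs:

a data file
a stop file

The data file consists of four sorted, tab-separated columns. The stop file consists of a sorted list of words.

The goal of the script is:

If the string in column 1 of the data file matches a string in the stop file, the entire line is removed.

Here is an example of the data file: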

abandonment-n   after+n-the+n-a-j   stop-n  1
abandonment-n   against+n-the+ns    leave-n 1
cake-n  against+n-the+vg    rest-v  1
abandonment-n   as+n-a+vd   require-v   1
abandonment-n   as+n-a-j+vg-up  use-v   1

Here is an example of the stop file:

apple-n
banana-n
cake-n
pigeon-n

Here is the code I have so far:

with open("input1", "rb") as oIndexFile:
        for line in oIndexFile: 
            lemma = line.split()
            #print lemma

with open ("input2", "rb") as oSenseFile:
    with open("output", "wb") as oOutFile:
        for line in oSenseFile:
            concept, slot, filler, freq = line.split()
            nounsInterest = [concept, slot, filler, freq]
            #print concept
            if concept != lemma:
                outstring = '\t'.join(nounsInterest)
                oOutFile.write(outstring + '\n')
            else: 
                pass

The desired output is:

abandonment-n   after+n-the+n-a-j-stop-n    1
abandonment-n   against+n-the+ns-leave-n    1
abandonment-n   as+n-a+vd-require-v 1
abandonment-n   as+n-a-j+vg-up-use-v    1

Any insights?

For now, the output I'm getting is the following, which is basically just a printout of the data file I started with:

abandonment-n   after+n-the+n-a-j   stop-n  1
abandonment-n   against+n-the+ns    leave-n 1
cake-n  against+n-the+vg    rest-v  1
abandonment-n   as+n-a+vd   require-v   1
abandonment-n   as+n-a-j+vg-up  use-v   1

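*** Some things I have tried that still don't work:

Instead of "if concept != lemma:", I first tried "if concept not in lemma:"

This produces the same output as shown above.

I also suspect that the code is not actually reading the first input file, so I tried incorporating it into the same block: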

with open ("input2", "rb") as oSenseFile:
    with open("tinput1", "rb") as oIndexFile:
        for line in oIndexFile: 
            lemma = line.split()
            with open("out", "wb") as oOutFile:
                for line in oSenseFile:
                    concept, slot, filler, freq = line.split()
                    nounsInterest = [concept, slot, filler, freq]
                    if concept not in lemma:
                        outstring = '\t'.join(nounsInterest)
                        oOutFile.write(outstring + '\n')
                    else: 
                        pass

This produces an empty output file.

I also tried a different approach, as described here:

filename = "input1.txt" 
filename2 = "input2.txt"
filename3 = "output1"

def fixup(filename): 
    fin1 = open(filename) 
    fin2 = open(filename2, "r")
    fout = open(filename3, "w") 
    for word in filename: 
        words = word.split()
    for line in filename2:
        concept, slot, filler, freq = line.split()
        nounsInterest = [concept, slot, filler, freq]
        if True in [concept in line for word in toRemove]:
            pass
        else:
            outstring = '\t'.join(nounsInterest)
            fout.write(outstring + '\n')
    fin1.close() 
    fin2.close() 
    fout.close()

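This was adapted from here, but without success. In this case, no output is produced at all.

Can anyone point me to where I'm going wrong with this task? The sample files are small, but I need to run this on large files. Thanks for your help.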

score 4 · Accepted Answer

I think you're trying to do something like this:

with open('input1', 'rb') as indexfile:
    lemma = {x.strip() for x in indexfile}

with open('input2', 'rb') as sensefile, open('output', 'wb') as outfile:
    for line in sensefile:
        nouns_interest = concept, slot, filler, freq = line.split()
        if concept not in lemma:
            outfile.write('\t'.join(nouns_interest) + '\n')

Your desired output appears to put a hyphen between slot and filler, so you may want to use:

            outfile.write('{}\t{}-{}\t{}\n'.format(*nouns_interest))
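For readers on Python 3, the same idea can be sketched as a small function. This is an illustration rather than the answerer's code: the name filter_lines, its parameters, and the hyphenate flag are made up here. Files are opened in text mode, because in Python 3 the binary modes used above yield bytes and '\t'.join(...) would raise a TypeError. It also assumes every useful data line has exactly four whitespace-separated fields.

```python
def filter_lines(stop_path, data_path, out_path, hyphenate=False):
    """Copy data lines whose first column is not listed in the stop file."""
    # Load the stop words into a set for fast membership tests.
    with open(stop_path, "r") as stop_file:
        stop_words = {line.strip() for line in stop_file if line.strip()}

    with open(data_path, "r") as data_file, open(out_path, "w") as out_file:
        for line in data_file:
            fields = line.split()
            if len(fields) != 4:
                continue  # assumption: skip blank or malformed lines
            concept, slot, filler, freq = fields
            if concept in stop_words:
                continue  # first column is a stop word: drop the line
            if hyphenate:
                # Join slot and filler with a hyphen, as in the desired output.
                out_file.write("{}\t{}-{}\t{}\n".format(concept, slot, filler, freq))
            else:
                out_file.write("\t".join(fields) + "\n")
```

Called as filter_lines('input1', 'input2', 'output', hyphenate=True), this should reproduce the hyphenated layout from the question. The input is read line by line, so only the stop list is held in memory.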

score 1 · Accepted Answer

I haven't checked the logic yet, but you are overwriting lemma on every line there. Maybe append each line to a list instead?

lemma = []
for line in oIndexFile:
    lemma.append(line.strip())  # strip() removes leading/trailing whitespace, including the newline

Alternatively, as @gnibbler suggested, you can use a set, which is a bit more efficient:

lemma = set()
for line in oIndexFile:
    lemma.add(line.strip())
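To see why the original loop never filters anything, here is a small sketch that uses io.StringIO as a stand-in for the stop file (the in-memory text is just the sample from the question). After the first loop, lemma holds only the tokens of the last line. Comparing a string to that list with != is always true, and "not in" only ever checks against the final stop word.

```python
import io

stop_text = "apple-n\nbanana-n\ncake-n\npigeon-n\n"

# Original pattern: the name is rebound on every iteration,
# so only the last line's tokens survive the loop.
for line in io.StringIO(stop_text):
    lemma = line.split()
print(lemma)               # ['pigeon-n']
print("cake-n" in lemma)   # False

# Accumulating pattern: every stop word is kept.
lemma_set = set()
for line in io.StringIO(stop_text):
    lemma_set.add(line.strip())
print("cake-n" in lemma_set)  # True
```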

Edit: It looks like you don't want to split the line, just strip the newline character. And yes, your logic was almost correct.

The second part would look like this:

with open ("data_php.txt", "rb") as oSenseFile:
    with open("out_FILTER_LINES", "wb") as oOutFile:
        for line in oSenseFile:
            concept, slot, filler, freq = line.split()
            nounsInterest = [concept, slot, filler, freq]
            #print concept
            if concept not in lemma: #check if the concept exists in lemma
                outstring = '\t'.join(nounsInterest)
                oOutFile.write(outstring + '\n')
            else: 
                pass

score 1 · Accepted Answer

If you are certain that the lines in the data file do not start with whitespace, you don't need to split the lines. This is a slight tweak of @gnibbler's answer:

with open('input1', 'rb') as indexfile:
    lemma = {x.strip() for x in indexfile}

with open('input2', 'rb') as sensefile, open('output', 'wb') as outfile:
    for line in sensefile:
        if not any(line.startswith(x) for x in lemma):
            outfile.write(line)
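One caveat with prefix matching: startswith also succeeds when a stop word is only a prefix of a longer first column. The sketch below uses a made-up data line ("cake-ness" does not appear in the question) and assumes the columns really are tab-separated. It compares the prefix test with an exact match on the first field.

```python
stop_words = {"cake-n"}
lines = [
    "cake-n\tagainst+n-the+vg\trest-v\t1\n",
    "cake-ness\tas+n-a+vd\trequire-v\t1\n",  # hypothetical first column
]

# Prefix test: both lines are dropped, although only the first one
# has a first column equal to a stop word.
# (str.startswith accepts a tuple of candidate prefixes.)
kept_prefix = [ln for ln in lines if not ln.startswith(tuple(stop_words))]

# Exact test: split off only the first tab-separated field and look it up.
kept_exact = [ln for ln in lines if ln.split("\t", 1)[0] not in stop_words]

print(len(kept_prefix))  # 0
print(len(kept_exact))   # 1
```

The exact test is also a single set lookup per line, whereas the prefix test compares each line against every stop word, which may matter for large files.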
