python - How do I extract matching strings into a defaultdict(set)?

Question

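I have a text file containing lines like the one below, where an English sentence is followed by a Spanish sentence and the corresponding translation table, separated by " {##} ". (If you know it, this is the output of giza-pp.)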

you have requested a debate on this subject in the course of the next few days , during this part-session . {##} sus señorías han solicitado un debate sobre el tema para los próximos días , en el curso de este período de sesiones . {##} 0-0 0-1 1-2 2-3 3-4 4-5 5-6 6-7 7-8 8-9 12-10 13-11 14-11 15-12 16-13 17-14 9-15 10-16 11-17 18-18 17-19 19-21 20-22

In the translation table, 0-0 0-1 means that the 0th word in English (i.e. you) matches the 0th and 1st words in Spanish (i.e. sus señorías).
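To make the indexing concrete, here is a minimal sketch (Python 3) that reads the "i-j" pairs into a mapping from English word positions to Spanish word positions. The shortened line and the names line and alignment are made up for illustration; only the " {##} " separator and the pair format come from the data above.

```python
from collections import defaultdict

# Shortened, made-up example line in the same " {##} "-separated format.
line = "you have requested {##} sus señorías han solicitado {##} 0-0 0-1 1-2 2-3"

eng, spa, trans = line.split(" {##} ")
eng_words = eng.split(" ")
spa_words = spa.split(" ")

# English word index -> list of Spanish word indices, in table order.
alignment = defaultdict(list)
for pair in trans.split(" "):
    e, s = pair.split("-")
    alignment[int(e)].append(int(s))

for e_idx in sorted(alignment):
    targets = [spa_words[s] for s in alignment[e_idx]]
    print(eng_words[e_idx], "->", " ".join(targets))
# you -> sus señorías
# have -> han
# requested -> solicitado
```

The pair 0-0 0-1 ends up as alignment[0] == [0, 1], which is why "you" maps to both "sus" and "señorías".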

Let's say I want to know what the Spanish translation of course is in this sentence. Normally I would do it this way:

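from collections import defaultdict

# x is one line of the file, in the format shown above
eng, spa, trans = x.split(" {##} ")
tt = defaultdict(set)
for s, t in [i.split("-") for i in trans.split(" ")]:
    # store the indices as ints so they can be used to index the word lists
    tt[int(s)].add(int(t))

query = 'course'
# look up the word position of the query, not its character offset
for i in tt[eng.split(" ").index(query)]:
    print(spa.split(" ")[i])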

Is there a simpler way to do the above? Maybe a regex? line.find()?

After some attempts, I ended up with this, in order to cover many other issues such as MWEs (multi-word expressions) and missing translations:

def getTranslation(gizaline, query):
    src, trg, trans = gizaline.split(" {##} ")
    tt = defaultdict(set)
    for s, t in [i.split("-") for i in trans.split(" ")]:
        tt[int(s)].add(int(t))
    try:
        query_translated = [trg.split(" ")[i] for i in tt[src.split(" ").index(query)]]
    except ValueError:
        # query is not a token by itself: look for a hyphenated token
        # (e.g. "part-session") that contains it
        for i in src.split(" "):
            if "-" + query in i or query + "-" in i:
                query = i
                break
        # still raises ValueError if no token matched
        query_translated = [trg.split(" ")[i] for i in tt[src.split(" ").index(query)]]

    if len(query_translated) > 0:
        return ":".join(query_translated)
    else:
        return "#NULL"

score 2 · Accepted Answer

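That approach works fine, but I would do it slightly differently, using a list instead of a set so that the words come out in the right order (a set does not keep the order in which the words were added, which is not what we want).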

File: q_15125575.py

#-*- encoding: utf8 -*-
from collections import defaultdict

INPUT = """you have requested a debate on this subject in the course of the next few days , during this part-session . {##} sus señorías han solicitado un debate sobre el tema para los próximos días , en el curso de este período de sesiones . {##} 0-0 0-1 1-2 2-3 3-4 4-5 5-6 6-7 7-8 8-9 12-10 13-11 14-11 15-12 16-13 17-14 9-15 10-16 11-17 18-18 17-19 19-21 20-22"""

if __name__ == "__main__":
    english, spanish, trans = INPUT.split(" {##} ")
    eng_words = english.split(' ')
    spa_words = spanish.split(' ')
    transtable = defaultdict(list)
    for e, s in [i.split('-') for i in trans.split(' ')]:
        transtable[eng_words[int(e)]].append(spa_words[int(s)])

    print(transtable['course'])
    print(transtable['you'])
    print(" ".join(transtable['course']))
    print(" ".join(transtable['you']))

Output:
['curso']
['sus', 'se\xc3\xb1or\xc3\xadas']
curso
sus señorías

The code is slightly longer because I use the actual words instead of the indexes, but this lets you look words up directly in transtable.

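However, both your method and mine fail on the same problem: repeated words.
print(" ".join(transtable['this']))
gives:
el este
At least the words are in the order in which they appear, so it is workable. Want the translation for the first occurrence of 'this'?
transtable['this'][0] would give you the first word.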

And using your code:

tt = defaultdict(set)
for e, s in [i.split('-') for i in trans.split(' ')]:
    tt[int(e)].add(int(s))

query = 'this'
for i in tt[eng_words.index(query)]:
    print(i)

gives:
7

Your code prints only the index for the first occurrence of the word, because list.index() stops at the first match.
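One possible way around the repeated-word problem is to visit every position of the query word instead of only the first one. The following is a minimal sketch (Python 3), not part of the code above: the function name translations_by_occurrence is made up for illustration, and it assumes the same " {##} "-separated line format as INPUT.

```python
from collections import defaultdict

INPUT = ("you have requested a debate on this subject in the course of the "
         "next few days , during this part-session . {##} sus señorías han "
         "solicitado un debate sobre el tema para los próximos días , en el "
         "curso de este período de sesiones . {##} 0-0 0-1 1-2 2-3 3-4 4-5 "
         "5-6 6-7 7-8 8-9 12-10 13-11 14-11 15-12 16-13 17-14 9-15 10-16 "
         "11-17 18-18 17-19 19-21 20-22")


def translations_by_occurrence(gizaline, query):
    """Return one (source index, [target words]) pair per occurrence of query."""
    src, trg, trans = gizaline.split(" {##} ")
    src_words = src.split(" ")
    trg_words = trg.split(" ")

    # Source word index -> target word indices, kept in table order.
    tt = defaultdict(list)
    for s, t in (pair.split("-") for pair in trans.split(" ")):
        tt[int(s)].append(int(t))

    # enumerate() visits every position, so repeated words are not lost.
    return [(i, [trg_words[j] for j in tt.get(i, [])])
            for i, word in enumerate(src_words) if word == query]


for index, words in translations_by_occurrence(INPUT, "this"):
    print(index, words)
# 6 ['el']
# 18 ['este']
```

Keying on the position rather than on the word keeps each occurrence separate, at the cost of one pass over the sentence per query. A word that has no entry in the table gets an empty list, and a word that is not in the sentence gives an empty result.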
