python - Merging two files of different lengths in Python

Question

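I have two files with the same number of columns but different numbers of rows. The first file contains timestamps and a list of words; the second contains timestamps and the list of sounds (phones) that make up each word. That is: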

9640 12783 she
12783 17103 had
...

and:

9640 11240 sh
11240 12783 iy
12783 14078 hv
14078 16157 ae
16157 16880 dcl
16880 17103 d
...

I want to merge these two files into a list of entries, each holding the word as one value and its phonetic transcription as the other. That is:

[['she', 'sh iy']
 ['had', 'hv ae dcl d']
  ...

I'm a complete Python (and programming) beginner, but my first idea was to search the second file for the second field of the first file and append the matches to a list. I tried it like this:

word = open('SA1.WRD','r')
phone = open('SA1.PHN','r')
word_phone = []

for line in word.readlines():
    words = line.split()
    word = words[2]
    word_phone.append(word)

for line in phone.readlines():
    phones = line.split()
    phone = phones[2]
    if int(phones[1]) <= int(words[1]):
        word_phone.append(phone)

print word_phone

This is the output:

['she', 'had', 'your', 'dark', 'suit', 'in', 'greasy', 'wash', 'water', 'all', 'year', 'sh', 'iy', 'hv', 'ae', 'dcl', 'd', 'y', 'er', 'dcl', 'd', 'aa', 'r', 'kcl', 'k', 's', 'uw', 'dx', 'ih', 'ng', 'gcl', 'g', 'r', 'iy', 's', 'iy', 'w', 'aa', 'sh', 'epi', 'w', 'aa', 'dx', 'er', 'q', 'ao', 'l', 'y', 'iy', 'axr']

As I said, I'm a complete beginner, and any suggestions would be very helpful.

Update: I'd like to revisit this question, if possible. I modified Lattyware's code to work on a directory:

phns = []
wrds = []
for root, dir, files in os.walk(sys.argv[1]):
    wrds = wrds + [ os.path.join( root, f ) for f in files if f.endswith( '.WRD' ) ]
    phns = phns + [ os.path.join( root, f ) for f in files if f.endswith( '.PHN' ) ]
phns.sort()
wrds.sort()
files = (zip(wrds,phns))

#OPEN THE WORD AND PHONE FILES, COMPARE THEM
output = []
for file in files:
    with open( file[0] ) as unsplit_words, open( file[1] ) as unsplit_sounds:
        sounds = (line.split() for line in unsplit_sounds)
        words = (line.split() for line in unsplit_words)
        output = output +  [
          (word, " ".join(sound for _, _, sound in
                    takeuntil(sounds, stop)))
                for start, stop, word in words
            ]

There is some information in these files' paths that I want to keep. I was wondering how to add the split file path to the tuples in the list this code returns, like this:

[('she', 'sh iy', 'directory', 'subdirectory'), ('had', 'hv ae dcl d', 'directory', 'subdirectory')]

I thought I could split the paths and zip the lists together, but the list the code above produces has 53,000 items in total, while only 6,300 file pairs are processed.

score 3 · Accepted Answer

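The main problem here is matching the sounds to the words. Fortunately, this is easy to do: for each word, we just take sounds until we reach the one whose end time matches the word's end time.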

To do this, we need to write a takeuntil() function. itertools.takewhile() (my original solution) unfortunately consumes one extra value, so this is the best solution.

def takeuntil(iterable, stop):
    for x in iterable:
        yield x
        if x[1] == stop:
            break

with open("SA1.WRD") as unsplit_words, open("SA1.PHN") as unsplit_sounds:
    sounds = (line.split() for line in unsplit_sounds)
    words = (line.split() for line in unsplit_words)
    output = [
        (word, " ".join(sound for _, _, sound in takeuntil(sounds, stop)))
        for start, stop, word in words
    ]

print(output)

Which gives us:

[('she', 'sh iy'), ('had', 'hv ae dcl d')]
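To see why itertools.takewhile() falls short here, the following minimal sketch compares it with the takeuntil() helper on the first three rows of the sample data. The variable names and the lambda predicate are illustrative assumptions, not part of the original solution. takewhile() has to read one row past the word boundary to find out that its predicate has become false, and that row is then lost to the next word.

```python
from itertools import takewhile

def takeuntil(iterable, stop):
    # Same helper as in the answer: yield rows up to and including
    # the one whose end time equals `stop`.
    for x in iterable:
        yield x
        if x[1] == stop:
            break

rows = [
    ["9640", "11240", "sh"],
    ["11240", "12783", "iy"],
    ["12783", "14078", "hv"],  # first sound of the next word, "had"
]

# takewhile reads the "hv" row, sees the predicate fail, and discards it.
sounds = iter(rows)
first = list(takewhile(lambda s: int(s[1]) <= 12783, sounds))
left_after_takewhile = list(sounds)

# takeuntil stops right after the row that ends the word.
sounds = iter(rows)
first_until = list(takeuntil(sounds, "12783"))
left_after_takeuntil = list(sounds)

print(left_after_takewhile)  # the "hv" row is gone
print(left_after_takeuntil)  # the "hv" row is still available
```

Both calls return the "sh" and "iy" rows, but only takeuntil() leaves "hv" in the iterator for the next word.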

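Here we use the with statement, both for readability and to make sure the files are closed (even if an exception occurs). We also make heavy use of list comprehensions and generator expressions.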

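There are a few bad patterns in your code. Using open() without the with statement is a bad idea, and there is no need to use readlines(): loop over the file directly instead. That is lazy, so in most cases it is much more efficient, not to mention easier to read and less to type.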
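As a small illustration of that advice, here is a sketch of the word-reading loop from the question rewritten with a with statement and direct iteration over the file. The temporary file exists only to keep the example self-contained; in practice you would open SA1.WRD directly.

```python
import os
import tempfile

# Throwaway file holding the two sample rows from the question
# (an assumption made so the example runs on its own).
with tempfile.NamedTemporaryFile("w", suffix=".WRD", delete=False) as tmp:
    tmp.write("9640 12783 she\n12783 17103 had\n")
    path = tmp.name

words = []
# The with block closes the file even if an exception is raised inside it.
with open(path) as word_file:
    # Iterating over the file object yields one line at a time,
    # so readlines() is not needed.
    for line in word_file:
        start, stop, word = line.split()
        words.append(word)

os.remove(path)
print(words)  # ['she', 'had']
```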

So how does this work? Let's run through it:

First, we open both files for reading and throw in quick generator expressions that split the lines in each file.

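Next comes the monster list comprehension. What it does is take sounds from the sounds iterable until it reaches the last sound belonging to the current word, then move on to the next word, returning each word together with its associated sounds. We then use str.join() to combine the sounds into a single string.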

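If you have trouble following the thought process, here is an expanded version that works the same way but is a lot less efficient because of the Python-side loops (the generator expressions and list comprehension make the version above much faster):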

with open("SA1.WRD") as words, open("SA1.PHN") as sounds:
    output = []
    current = []
    for line in words:
        start, stop, word = line.split()
        for sound_line in sounds:
            sound_start, sound_stop, sound = sound_line.split()
            current.append(sound)
            if sound_stop == stop:
                break
        output.append((word, " ".join(current)))
        current = []

print(output)
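Regarding the update in the question: one possible approach, sketched here under assumptions, is to attach the path components while each file pair is being processed rather than zipping them on afterwards. Every word tuple then carries the path of the file it came from, so the mismatch between the number of entries and the number of file pairs no longer matters. The function name merge_pair is hypothetical, and the sketch assumes the two parts worth keeping are the two directories directly above each file.

```python
import os

def takeuntil(iterable, stop):
    # Same helper as in the answer.
    for x in iterable:
        yield x
        if x[1] == stop:
            break

def merge_pair(word_path, phone_path):
    """Return (word, sounds, directory, subdirectory) tuples for one file pair."""
    # Assumption: keep the names of the two directories above the file.
    parent = os.path.dirname(word_path)
    subdirectory = os.path.basename(parent)
    directory = os.path.basename(os.path.dirname(parent))
    with open(word_path) as unsplit_words, open(phone_path) as unsplit_sounds:
        sounds = (line.split() for line in unsplit_sounds)
        words = (line.split() for line in unsplit_words)
        # The list is built eagerly, before the files are closed.
        return [
            (word,
             " ".join(sound for _, _, sound in takeuntil(sounds, stop)),
             directory,
             subdirectory)
            for start, stop, word in words
        ]
```

In the directory loop from the question, this would replace the body of the for loop with something like output.extend(merge_pair(file[0], file[1])).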
