python - Convert a string into token positions from a list

Question

I have a list of about 5000 unique words/tokens, one per line (smileys count as words). I am trying to produce something that works with SVM for Python.

As an example, imagine the list contains only a few words:

happy
sad
is
:(
i
the
day
am
today
:)

My strings are as follows:

tweets =['i am happy today :)','is today the sad day :(']

The output for each tweet should then be:

5:1 8:1 1:1 9:1 10:1
3:1 9:1 6:1 2:1 7:1 4:1

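Note the colon-separated format: the number before the colon must refer to the word by its line number/position in the list. For example, ':)' is the 10th word in the list (a text file with one token per line).

I was thinking of writing a function that reads the text file and puts each line (each word/token) into its own slot in a list or dictionary. I could then read the words of each tweet and convert them to numbers based on their position in the list.

Does anyone have an idea how to do this in Python? I was thinking of something like: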

 for i in tweets:
         <translate-words-into-list-position>

score 5 · Accepted Answer

# Token list; a token's 1-based position is its feature index.
words = ['happy', 'sad', 'is', ':(', 'i', 'the', 'day', 'am', 'today', ':)']
d = {w: i for i, w in enumerate(words, start=1)}
tweets = ['i am happy today :)', 'is today the sad day :(']
for tweet in tweets:
    # Tokens missing from the list are skipped.
    print(' '.join('{0}:1'.format(d[w]) for w in tweet.split() if w in d))


5:1 8:1 1:1 9:1 10:1
3:1 9:1 6:1 2:1 7:1 4:1

If words is a file, you can still use this solution; just remember to .rstrip('\n') each line. For example:

# Python 3 text mode already handles universal newlines; the 'U' flag was removed in 3.11.
with open('words.txt') as f:
    d = {w.rstrip('\n'): i for i, w in enumerate(f, start=1)}
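
Putting the two snippets above together, here is a minimal end-to-end sketch of the function the question describes. The names load_vocab and encode are made up for illustration, and the temporary file merely stands in for the real word list, so treat this as one possible arrangement, not a definitive implementation.

```python
import os
import tempfile


def load_vocab(path):
    """Map each token (one per line) to its 1-based line number."""
    with open(path, encoding='utf-8') as f:
        return {line.rstrip('\n'): i for i, line in enumerate(f, start=1)}


def encode(tweet, vocab):
    """Return space-separated 'index:1' entries for the known tokens in tweet."""
    return ' '.join('{0}:1'.format(vocab[w]) for w in tweet.split() if w in vocab)


# Demo only: write the example word list to a throwaway file.
words = ['happy', 'sad', 'is', ':(', 'i', 'the', 'day', 'am', 'today', ':)']
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False,
                                 encoding='utf-8') as tmp:
    tmp.write('\n'.join(words) + '\n')
vocab = load_vocab(tmp.name)
os.remove(tmp.name)

tweets = ['i am happy today :)', 'is today the sad day :(']
encoded = [encode(t, vocab) for t in tweets]
for line in encoded:
    print(line)
```

Building the dictionary once keeps each token lookup constant-time, which matters more with the full list of about 5000 tokens than with this ten-word example.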

score 0 · Accepted Answer

>>> from itertools import count
>>> D = dict(zip(words, count(1)))
>>> tweets =['i am happy today :)','is today the sad day :(']
>>> [["{}:1".format(D[k]) for k in t.split() if k in D] for t in tweets]
[['5:1', '8:1', '1:1', '9:1', '10:1'], ['3:1', '9:1', '6:1', '2:1', '7:1', '4:1']]
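
Some sparse SVM input formats, such as SVMlight and LIBSVM, expect feature indices in ascending order. The question does not say whether its tool requires that, so the following is only a sketch under that assumption. The helper name encode_sorted is hypothetical, and repeated tokens are collapsed to a single entry.

```python
from itertools import count

words = ['happy', 'sad', 'is', ':(', 'i', 'the', 'day', 'am', 'today', ':)']
D = dict(zip(words, count(1)))


def encode_sorted(tweet, vocab):
    # The set drops repeated tokens; sorted() puts indices in ascending order.
    indices = sorted({vocab[k] for k in tweet.split() if k in vocab})
    return ' '.join('{}:1'.format(i) for i in indices)


tweets = ['i am happy today :)', 'is today the sad day :(']
rows = [encode_sorted(t, D) for t in tweets]
for row in rows:
    print(row)
```

If word order within the tweet matters for a later step, keep the unsorted variant from the answer instead.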
