
My question about building a dictionary from strings leans more toward linguistics/NLP than toward the general task of turning a string into a dictionary.

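Given a list of sentences as strings, is there a simple way to build a dictionary of unique words and then vectorize each sentence? I know there are external libraries for this, such as gensim, but I would like to avoid them. This is how I have been doing it: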

from itertools import chain

def getKey(dic, value):
  return [k for k,v in sorted(dic.items()) if v == value]

# vectorize returns a list of tuples; each tuple is
# (<position of word in dictionary>, <number of times it occurs in sentence>)
def vectorize(sentence, dictionary): # is there simpler way to do this?
  vector = []
  for word in sentence.split():
    word_count = sentence.lower().split().count(word)
    dic_pos = getKey(dictionary, word)[0]
    vector.append((dic_pos,word_count))
  return vector

s1 = "this is is a foo"
s2 = "this is a a bar"
s3 = "that 's a foobar"

uniq = list(set(chain(" ".join([s1,s2,s3]).split()))) # is there simpler way for this?
dictionary = {}
for i in range(len(uniq)): # can this be done with dict(list_comprehension)?
  dictionary[i] = uniq[i]

v1 = vectorize(s1, dictionary)
v2 = vectorize(s2, dictionary)
v3 = vectorize(s3, dictionary)

print(v1)
print(v2)
print(v3)

4 Answers


Here you go:

from itertools import chain, count

s1 = "this is is a foo"
s2 = "this is a a bar"
s3 = "that 's a foobar"

# convert each sentence into a list of words, because the lists
# will be used twice, to build the dictionary and to vectorize
w1, w2, w3 = all_ws = [s.split() for s in [s1, s2, s3]]

# chain the lists and turn into a set, and then a list, of unique words
index_to_word = list(set(chain(*all_ws)))

# build the inverse mapping of index_to_word, by pairing it with a counter
word_to_index = dict(zip(index_to_word, count()))

# create the vectors of word indices and of word count for each sentence
v1 = [(word_to_index[word], w1.count(word)) for word in w1]
v2 = [(word_to_index[word], w2.count(word)) for word in w2]
v3 = [(word_to_index[word], w3.count(word)) for word in w3]

print(v1)
print(v2)
print(v3)

Notes:

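  • A dictionary should only ever be traversed from key to value. If you need to go the other way, create (and keep updated) two dictionaries, one being the inverse mapping of the other, as I did above.
  • If you need a dictionary whose keys are consecutive integers, just use a list (thanks to Jeff).
  • Never compute the same thing twice (see the split() versions of the sentences). If you will need a result later, save it in a variable.
  • Use list comprehensions wherever possible, for performance, conciseness, and readability.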
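
The first two notes can be illustrated with a small sketch that keeps both mappings in sync as new words arrive. The helper name add_word is made up for this example and is not part of the code above; it assigns indices in first-seen order.

```python
index_to_word = []   # list: position -> word (keys are consecutive integers)
word_to_index = {}   # dict: word -> position (the inverse mapping)

def add_word(word):
    """Register a word in both mappings (hypothetical helper) and return its index."""
    if word not in word_to_index:
        # the next free index is simply the current length of the list
        word_to_index[word] = len(index_to_word)
        index_to_word.append(word)
    return word_to_index[word]

for sentence in ["this is is a foo", "this is a a bar", "that 's a foobar"]:
    for word in sentence.split():
        add_word(word)

print(index_to_word)
print(word_to_index["foo"])
```

Because both structures are only ever updated inside add_word, they cannot drift apart.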
answered 2013-03-14T00:19:41.497

You've got multiple questions in your code, so let's answer them one by one.


uniq = list(set(chain(" ".join([s1,s2,s3]).split()))) # is there simpler way for this?

For one thing, it might be conceptually simpler (although just as verbose) to split() the strings independently, instead of joining them together then splitting the result.

uniq = list(set(chain(*map(str.split, (s1, s2, s3)))))

Beyond that: it looks like you're always using the word lists, not the actual sentences, so you're splitting in multiple places. Why not just split them all at once, up at the top?

Meanwhile, instead of having to explicitly pass around s1, s2, and s3, why not stick them in a collection? And you can stick the results in a collection as well.

So:

sentences = (s1, s2, s3)
wordlists = [sentence.split() for sentence in sentences]

uniq = list(set(chain.from_iterable(wordlists)))

# ...

vectors = [vectorize(sentence, dictionary) for sentence in sentences]
for vector in vectors:
    print(vector)

dictionary = {}
for i in range(len(uniq)): # can this be done with dict(list_comprehension)?
  dictionary[i] = uniq[i]

You could do it as dict() on a list comprehension, but it is even simpler to use a dict comprehension. And, while you're at it, use enumerate instead of the for i in range(len(uniq)) bit.

dictionary = {idx: word for (idx, word) in enumerate(uniq)}

That replaces the whole # ... part in the above.
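
In this particular case, where the keys are just positions, calling dict() directly on enumerate should give the same mapping without any comprehension. A quick check, using a made-up word list in place of the real uniq:

```python
uniq = ["this", "is", "a", "foo"]  # stand-in for the real list of unique words

by_comprehension = {idx: word for idx, word in enumerate(uniq)}
by_constructor = dict(enumerate(uniq))  # enumerate yields (index, word) pairs

print(by_comprehension == by_constructor)
```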


Meanwhile, if you want a reverse dictionary lookup, this is not the way to do it:

def getKey(dic, value):
    return [k for k,v in sorted(dic.items()) if v == value]

Instead, create an inverse dictionary, mapping values to lists of keys.

from collections import defaultdict

def invert_dict(dic):
    # map each value to the list of keys that point to it
    d = defaultdict(list)
    for k, v in dic.items():
        d[v].append(k)
    return d

Then, instead of your getKey function, just do a normal lookup in the inverted dict.
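
As a rough illustration of that lookup, here is the inversion applied to a small, made-up index-to-word dictionary (the sample data is an assumption, not taken from the question):

```python
from collections import defaultdict

def invert_dict(dic):
    # map each value to the list of keys that point to it
    d = defaultdict(list)
    for k, v in dic.items():
        d[v].append(k)
    return d

dictionary = {0: "this", 1: "is", 2: "a"}  # hypothetical index -> word mapping
inverse = invert_dict(dictionary)

print(inverse["is"])     # every key that maps to "is"
print(inverse["is"][0])  # the value getKey(dictionary, "is")[0] would have returned
```

One caveat: because the result is a defaultdict, indexing a missing word silently inserts an empty list; inverse.get(word) avoids that.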

If you need to alternate modifications and lookups, you probably want some kind of bidirectional dictionary, that manages its own inverse dictionary as it goes along. There are a bunch of recipes for such a thing on ActiveState, and there may be some modules on PyPI, but it's not that hard to build yourself. And at any rate, you don't seem to need that here.
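
For the curious, a bare-bones version of such a bidirectional dictionary might look like the following. BiDict and key_for are hypothetical names, and the sketch assumes a strict one-to-one mapping with hashable keys and values; it is not one of the ActiveState recipes.

```python
class BiDict(object):
    """Hypothetical two-way mapping; assumes keys and values are unique and hashable."""

    def __init__(self):
        self.forward = {}   # key -> value
        self.backward = {}  # value -> key

    def __setitem__(self, key, value):
        # drop stale pairs first so the two dicts never disagree
        if key in self.forward:
            del self.backward[self.forward[key]]
        if value in self.backward:
            del self.forward[self.backward[value]]
        self.forward[key] = value
        self.backward[value] = key

    def __getitem__(self, key):
        return self.forward[key]

    def key_for(self, value):
        return self.backward[value]

words = BiDict()
words[0] = "this"
words[1] = "is"
words[1] = "a"  # overwriting also removes the old reverse entry for "is"
print(words[1])
print(words.key_for("a"))
```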


Finally, there's your vectorize function.

The first thing to do is to take a word list instead of a sentence to split, as mentioned above.

And there's no reason to re-split the sentence after lower; just use a map or generator expression on the word list.

In fact, I'm not sure why you're doing lower here, when your dictionary is built out of the original-case versions. I'm guessing that's a bug, and you wanted to do lower when building the dictionary as well. That's one of the advantages of making the word lists in advance in a single, easy-to-find place: you just need to change that one line:

wordlists = [sentence.lower().split() for sentence in sentences]

Now you're already a bit simpler:

def vectorize(wordlist, dictionary):
    vector = []
    for word in wordlist:
        word_count = wordlist.count(word)
        dic_pos = getKey(dictionary, word)[0]
        vector.append((dic_pos,word_count))
    return vector

Meanwhile, you may recognize that the vector = []… for word in wordlist… vector.append is exactly what a list comprehension is for. But how do you turn three lines of code into a list comprehension? Easy: refactor it into a function. So:

def vectorize(wordlist, dictionary):
    def vectorize_word(word):
        word_count = wordlist.count(word)
        dic_pos = getKey(dictionary, word)[0]
        return (dic_pos,word_count)
    return [vectorize_word(word) for word in wordlist]
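
The helper function is not strictly required: each element is just a tuple of two expressions, so the comprehension can build it inline. The sketch below also swaps getKey for a plain word-to-index dict (word_to_index is a name introduced here) and sorts the unique words, which is an added assumption to keep the indices reproducible between runs.

```python
def vectorize(wordlist, word_to_index):
    # one (position, count) tuple per word, built directly in the comprehension
    return [(word_to_index[word], wordlist.count(word)) for word in wordlist]

sentences = ("This is is a foo", "this is a a bar")
wordlists = [sentence.lower().split() for sentence in sentences]

# sorted() is an assumption added here so that indices are deterministic
uniq = sorted(set(word for wordlist in wordlists for word in wordlist))
word_to_index = {word: idx for idx, word in enumerate(uniq)}

vectors = [vectorize(wordlist, word_to_index) for wordlist in wordlists]
for vector in vectors:
    print(vector)
```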
answered 2013-03-14T00:07:45.330

If you are trying to count how many times each word occurs in a sentence, use collections.Counter.
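
A minimal sketch of what that looks like, using one of the sentences from the question:

```python
from collections import Counter

counts = Counter("this is is a foo".split())

print(counts["is"])           # 2
print(counts["missing"])      # 0: a Counter returns 0 for words it has not seen
print(counts.most_common(1))  # [('is', 2)]
```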

Problems with your code:

uniq = list(set(chain(" ".join([s1,s2,s3]).split()))) # is there simpler way for this?
dictionary = {}
for i in range(len(uniq)): # can this be done with dict(list_comprehension)?
  dictionary[i] = uniq[i]

What the part above does is build a dictionary indexed by arbitrary numbers (they come from iterating over a set, which has no concept of an index). That dictionary is then accessed using

def getKey(dic, value):
  return [k for k,v in sorted(dic.items()) if v == value]

This function also completely ignores the spirit of a dict, which is meant to be looked up by key, not by value.

The idea behind vectorize is also unclear. What are you trying to achieve with this function? You asked for a simpler version of vectorize, but you did not tell us what it is supposed to do.

answered 2013-03-13T23:58:01.507

OK, so it looks like you want:

  • A dictionary that returns the position of each token.
  • The number of times each token was found in the set.

You could do:

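import bisect

uniq.sort()  # sort it, since order didn't seem to matter

def getPosition(value):
    position = bisect.bisect_left(uniq, value)  # O(log n) binary search
    # bisect_left returns len(uniq) when value is larger than every element
    if position == len(uniq) or uniq[position] != value:
        raise IndexError(value)
    return position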

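To look words up in O(1) time instead, you could build a hash-based mapping (a dict) by iterating over the values and inserting each one with a sequential index. This is far less memory-efficient, but it gives O(1) lookup by hashing... and Tobia posted a great code example while I was writing this, so see that answer.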
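
The two lookup strategies can be compared side by side. In this sketch, position_by_bisect and position_by_hash are hypothetical names and the word list is made up; it only shows that both approaches agree on the positions.

```python
import bisect

uniq = sorted(["this", "is", "a", "foo", "bar"])

def position_by_bisect(value):
    """O(log n) lookup in the sorted list; needs no extra memory."""
    position = bisect.bisect_left(uniq, value)
    if position == len(uniq) or uniq[position] != value:
        raise KeyError(value)
    return position

# O(1) average-case lookup by hashing, at the cost of a second data structure
position_by_hash = {word: idx for idx, word in enumerate(uniq)}

for word in uniq:
    assert position_by_bisect(word) == position_by_hash[word]
print(position_by_hash["foo"])
```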

answered 2013-03-14T00:24:56.440