python - How do I get the weirdest word/string from a list of strings in Python?

Question

I have a list of strings:

['twas', 'brillig', 'and', 'the', 'slithy', 'toves', 'did', 'gyre', 'and', 'gimble', 'in', 'the', 'wabe', 'all', 'mimsy', 'were', 'the', 'borogoves', 'and', 'the', 'mome', 'raths', 'outgrabe', '"beware', 'the', 'jabberwock', 'my', 'son', 'the', 'jaws', 'that', 'bite', 'the', 'claws', 'that', 'catch', 'beware', 'the', 'jubjub', 'bird', 'and', 'shun', 'the', 'frumious', 'bandersnatch', 'he', 'took', 'his', 'vorpal', 'sword', 'in', 'hand', 'long', 'time', 'the', 'manxome', 'foe', 'he', 'sought', 'so', 'rested', 'he', 'by', 'the', 'tumtum', 'tree', 'and', 'stood', 'awhile', 'in', 'thought', 'and', 'as', 'in', 'uffish', 'thought', 'he', 'stood', 'the', 'jabberwock', 'with', 'eyes', 'of', 'flame', 'came', 'whiffling', 'through', 'the', 'tulgey', 'wood', 'and', 'burbled', 'as', 'it', 'came', 'one', 'two', 'one', 'two', 'and', 'through', 'and', 'through', 'the', 'vorpal', 'blade', 'went', 'snicker-snack', 'he', 'left', 'it', 'dead', 'and', 'with', 'its', 'head', 'he', 'went', 'galumphing', 'back', '"and', 'has', 'thou', 'slain', 'the', 'jabberwock', 'come', 'to', 'my', 'arms', 'my', 'beamish', 'boy', 'o', 'frabjous', 'day', 'callooh', 'callay', 'he', 'chortled', 'in', 'his', 'joy', '`twas', 'brillig', 'and', 'the', 'slithy', 'toves', 'did', 'gyre', 'and', 'gimble', 'in', 'the', 'wabe', 'all', 'mimsy', 'were', 'the', 'borogoves', 'and', 'the', 'mome', 'raths', 'outgrabe']

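How can I return a list of the words that differ most from the other words in the list, based on each word's minimum similarity and average similarity (as a float) to all the other words in the list?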

I have no idea how to do this. Our lecturer gave us a cossim(word1,word2) function that calculates the similarity between `word1` and `word2`, and I think I need to use it, but I don't know how.

import math
from collections import defaultdict

def cossim(word1,word2):
    """Calculate the cosine similarity between the two words"""

    # sub-function for constructing a letter vector from argument `word`
    # which returns the tuple `(vec,veclen)`, where `vec` is a dictionary of
    # characters in `word`, and `veclen` is the length of the vector
    def wordvec(word):
        vec = defaultdict(int)  # letter vector

        # count the letters in the word
        for char in word:
            vec[char] += 1

        # calculate the length of the letter vector
        # (named `length` so the built-in `len` is not shadowed)
        length = 0.0
        for char in vec:
            length += vec[char]**2

        # return the letter vector and vector length
        return vec,math.sqrt(length)

    # calculate a vector,length tuple for each of `word1` and `word2`
    vec1,len1 = wordvec(word1)
    vec2,len2 = wordvec(word2)

    # calculate the dot product between the letter vectors for the two words
    dotprod = 0.0
    for char in vec1:
        dotprod += vec1[char]*vec2[char]

    # divide by the lengths of the two vectors
    if dotprod:
        dotprod /= len1*len2

    return dotprod

The answer I should get from the list above is:

(['my'], 0.088487238234566931)

Any help is much appreciated,

Thanks,

Keely

score 2 · Accepted Answer

Before using an approach like the one Robert Rossney suggested, you should first deduplicate the list of words. Otherwise the resulting numbers will be slightly off.

One possible way to do this is to create a set from the list.

set_of_words = set(mylist)
differences = {}
for word in set_of_words:
    differences[word] = [cossim(word, word2) for word2 in set_of_words if word != word2]

This creates a dictionary that assigns to each word a list of its differences from the other words (here, the cossim values).

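Instead of assigning these lists directly to the dictionary entries, you could also store them in a variable inside the loop and use that variable to calculate the average, as agf suggested under Robert's solution.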

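The dictionary method iteritems() (items() in Python 3) lets you iterate over the (key, value) pairs. The min function has a special parameter, key, for specifying what to minimize, for example key=lambda x: x[1] to compare by the second element of a tuple or list.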
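Putting those pieces together might look roughly like the sketch below. It is only an illustration under a few assumptions: the cossim here is a compact stand-in written with collections.Counter rather than the lecturer's code, the short mylist is a hypothetical sample, and the averages dictionary is a name I made up.

```python
import math
from collections import Counter

def cossim(word1, word2):
    """Compact stand-in for the question's cossim (cosine of letter counts)."""
    vec1, vec2 = Counter(word1), Counter(word2)
    dot = sum(count * vec2[char] for char, count in vec1.items())
    len1 = math.sqrt(sum(c * c for c in vec1.values()))
    len2 = math.sqrt(sum(c * c for c in vec2.values()))
    return dot / (len1 * len2) if dot else 0.0

# Hypothetical short sample; substitute the full list from the question.
mylist = ['the', 'that', 'thought', 'the', 'my', 'that']

set_of_words = set(mylist)  # deduplicate first
averages = {}
for word in set_of_words:
    # keep the similarities in a local variable, then store only the average
    sims = [cossim(word, word2) for word2 in set_of_words if word != word2]
    averages[word] = sum(sims) / len(sims)

# min() with key= picks the (word, average) pair with the smallest average
oddest_word, lowest_average = min(averages.items(), key=lambda x: x[1])
print(oddest_word, lowest_average)
```

In this sample 'my' shares no letters with the other words, so every one of its similarities is zero and it comes out as the minimum.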

score 1 · Accepted Answer

As a starting point, I'd suggest building a dictionary whose keys are the words in the list and whose values are all the other words in the list.

d = {}
for word in mylist:
   d[word] = [w for w in mylist if w != word]

This lets you quickly calculate the similarity values for each word.

similarities = {}
for word in mylist:
   similarities[word] = [cossim(w, word) for w in d[word]]

From there, it's easy to calculate the minimum and average similarity for each word.
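That last step could be sketched as follows. The similarity lists here are made-up placeholder numbers standing in for the similarities dictionary, and stats and weirdest are names I chose for illustration.

```python
# Hypothetical similarity lists standing in for the `similarities`
# dictionary; these numbers are invented purely for illustration.
similarities = {
    'the': [0.5, 0.25, 0.0],
    'that': [0.5, 0.75, 0.0],
    'my': [0.0, 0.0, 0.5],
}

# For each word, record (minimum similarity, average similarity).
stats = {}
for word, sims in similarities.items():
    stats[word] = (min(sims), sum(sims) / len(sims))

# The word with the lowest average similarity is the "weirdest" one.
weirdest = min(stats, key=lambda w: stats[w][1])
print(weirdest, stats[weirdest])
```

Ranking by the average is one choice; swapping the key to stats[w][0] would rank by the minimum instead.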

score 1 · Accepted Answer

So, if I understand correctly, the goal is to find the word whose sum of cossim values with all the other words is smallest. For that, the following code should be enough.

/* removed at the reasonable request of agf */

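Roughly speaking, what we are doing is looping over each word in the list and checking how similar it is to all the other words. If it is less similar than any word we have seen so far, we store it. Our output is the word with the least similarity to all the other words.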
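For readers who want something concrete, here is a minimal sketch of the loop described above. It is not the code that was removed from this answer; least_similar is a hypothetical helper name, the cossim is a compact Counter-based stand-in for the one in the question, and the deduplication step follows the advice in another answer.

```python
import math
from collections import Counter

def cossim(word1, word2):
    """Compact stand-in for the question's cossim (cosine of letter counts)."""
    vec1, vec2 = Counter(word1), Counter(word2)
    dot = sum(count * vec2[char] for char, count in vec1.items())
    norm = math.sqrt(sum(c * c for c in vec1.values())) * \
           math.sqrt(sum(c * c for c in vec2.values()))
    return dot / norm if dot else 0.0

def least_similar(words, similarity):
    """Return (word, total) for the word with the smallest summed similarity."""
    unique = list(dict.fromkeys(words))  # drop duplicates, keep first-seen order
    best_word, best_total = None, float('inf')
    for word in unique:
        total = sum(similarity(word, other) for other in unique if other != word)
        if total < best_total:  # less similar than anything seen so far
            best_word, best_total = word, total
    return best_word, best_total

# Hypothetical short sample; substitute the full list from the question.
print(least_similar(['the', 'that', 'thought', 'my', 'the'], cossim))
```

Dividing the total by the number of other words would give the average instead of the sum; the ranking is the same either way because every word is compared against the same number of others.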

score 0 · Accepted Answer

I think the Python-Levenshtein module (pypi link) can help you get the similarity between word1 and word2.

The following two functions should be enough:

import Levenshtein

str1 = 'abcde'
str2 = 'abcdf'
print(Levenshtein.distance(str1,str2))
# 1
print(Levenshtein.ratio(str1,str2))
# 0.8
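If installing the package is not an option, the edit distance itself is a short dynamic program. The sketch below is a plain-Python stand-in that I wrote for illustration (levenshtein_distance is my own name, not part of the module), and it only covers the distance, not the module's ratio.

```python
def levenshtein_distance(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string `a` into string `b`."""
    # previous[j] holds the distance between the prefix of `a` handled
    # so far and the first j characters of `b`
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete char_a
                               current[j - 1] + 1,       # insert char_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]

print(levenshtein_distance('abcde', 'abcdf'))
# 1
```

Note that a distance grows as words become less alike, so the weirdest word under this measure would be the one with the largest average distance rather than the smallest.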
