I'm using the Lesk algorithm to get SynSets from a text, but I get different results for the same input. Is this a "feature" of the Lesk algorithm, or am I doing something wrong? The following is the code I'm using:
self.SynSets = []
sentences = sent_tokenize("Python is a widely used general-purpose, high-level programming language.\
Its design philosophy emphasizes code readability, and its syntax allows programmers to express concepts in fewer lines of code than would be possible in languages such as C++ or Java.\
The language provides constructs intended to enable clear programs on both a small and large scale.\
Python supports multiple programming paradigms, including object-oriented, imperative and functional programming or procedural styles.\
")
stopwordsList = stopwords.words('english')
self.sentNum = 0
for sentence in sentences:
    raw_tokens = word_tokenize(sentence)
    final_tokens = [token.lower() for token in raw_tokens
                    if token not in stopwordsList
                    # and (len(token) > 3)
                    and not token.isdigit()]
    for token in final_tokens:
        synset = wsd.lesk(sentence, token)
        if synset is not None:
            self.SynSets.append(synset)

self.SynSets = set(self.SynSets)
self.WriteSynSets()
return self
In the output I get these results (the first 3 results from two different runs):
Synset('allow.v.09') Synset('code.n.03') Synset('coffee.n.01')
------------
Synset('allow.v.09') Synset('argumentation.n.02') Synset('boastfully.r.01')
I would appreciate your help if there is another (more stable) way to get the synsets.
Thanks in advance.
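In case a concrete comparison helps, below is a minimal sketch of a simplified Lesk scorer with explicit, deterministic tie-breaking. It is a hypothetical illustration, not NLTK's own wsd.lesk: the candidate senses are sorted by synset name before scoring, so ties between equally overlapping glosses are always resolved the same way. It also takes an already tokenized context, on the assumption that the overlap should be computed over word tokens rather than over a raw sentence string.

from nltk.corpus import wordnet as wn
from nltk.tokenize import word_tokenize

def simple_lesk(context_tokens, word):
    # Simplified Lesk sketch: count the overlap between the context tokens
    # and each sense's gloss, breaking ties by synset name so the winner
    # does not depend on any internal iteration order.
    context = set(t.lower() for t in context_tokens)
    best, best_score = None, -1
    for ss in sorted(wn.synsets(word), key=lambda s: s.name()):
        gloss = set(w.lower() for w in ss.definition().split())
        score = len(context & gloss)
        if score > best_score:
            best, best_score = ss, score
    return best

sentence = "Python is a widely used general-purpose, high-level programming language."
print(simple_lesk(word_tokenize(sentence), "language"))

Because the candidates are iterated in a fixed order and ties never consult a set or dict ordering, repeated runs on the same input should print the same synset.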
EDIT:
As an additional example, here is the full script, which I ran twice:
import nltk
from nltk.tokenize import sent_tokenize
from nltk import word_tokenize
from nltk import wsd
from nltk.corpus import stopwords

SynSets = []
sentences = sent_tokenize("Python is a widely used general-purpose, high-level programming language.\
Its design philosophy emphasizes code readability, and its syntax allows programmers to express concepts in fewer lines of code than would be possible in languages such as C++ or Java.\
The language provides constructs intended to enable clear programs on both a small and large scale.\
Python supports multiple programming paradigms, including object-oriented, imperative and functional programming or procedural styles.\
")
stopwordsList = stopwords.words('english')

for sentence in sentences:
    raw_tokens = word_tokenize(sentence)  # WordPunctTokenizer().tokenize(sentence)
    # removing stopwords and words smaller than 3 characters
    final_tokens = [token.lower() for token in raw_tokens
                    if token not in stopwordsList
                    # and (len(token) > 3)
                    and not token.isdigit()]
    for token in final_tokens:
        synset = wsd.lesk(sentence, token)
        if synset is not None:
            SynSets.append(synset)

SynSets = set(SynSets)
SynSets = sorted(SynSets)
with open("synsets.txt", "a") as file:
    file.write("\n-------------------\n")
    for synset in SynSets:
        file.write("{} ".format(str(synset)))
file.close()
And I got these results (the first 4 synsets written to the file for each of the 2 runs of the program):
Synset('allow.v.04') Synset('boastfully.r.01') Synset('clear.v.11') Synset('code.n.02')
Synset('boastfully.r.01') Synset('clear.v.19') Synset('code.n.01') Synset('design.n.04')
SOLUTION: I figured out what the problem was. After reinstalling Python 2.7, all the problems went away. So, do not use Python 3.x with the lesk algorithm.
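For what it is worth, one possible explanation (an assumption on my part, not verified against the exact NLTK version used above) is Python 3's hash randomization: it is enabled by default since Python 3.3, so set and dict iteration order changes between interpreter runs, and any scoring or output step that depends on that order can differ from run to run. A quick sanity check, without reinstalling Python, is to run the script with PYTHONHASHSEED=0 and to write the synsets in a fixed order, as in this small sketch:

import sys
from nltk.corpus import wordnet as wn

# Start the interpreter with PYTHONHASHSEED=0 to disable hash randomization;
# this flag prints 0 when randomization is off for the current run.
print(sys.flags.hash_randomization)

# Sorting by synset name makes the written output comparable across runs,
# independently of how a set happens to iterate.
synsets = set(wn.synsets("code"))
for ss in sorted(synsets, key=lambda s: s.name()):
    print(ss)

If the output still varies with randomization disabled and a fixed write order, the variation comes from the disambiguation step itself rather than from Python's hashing.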