I'm using the Lesk algorithm to get SynSets from a text, but I get different results for the same input. Is this a "feature" of the Lesk algorithm, or am I doing something wrong? The following is the code I'm using:
self.SynSets = []
sentences = sent_tokenize("Python is a widely used general-purpose, high-level programming language.\
Its design philosophy emphasizes code readability, and its syntax allows programmers to express concepts in fewer lines of code than would be possible in languages such as C++ or Java.\
The language provides constructs intended to enable clear programs on both a small and large scale.\
Python supports multiple programming paradigms, including object-oriented, imperative and functional programming or procedural styles.\
")
stopwordsList = stopwords.words('english')
self.sentNum = 0
for sentence in sentences:
    raw_tokens = word_tokenize(sentence)
    final_tokens = [token.lower() for token in raw_tokens
                    if token not in stopwordsList
                    # and (len(token) > 3)
                    and not token.isdigit()]
    for token in final_tokens:
        synset = wsd.lesk(sentence, token)
        if synset is not None:
            self.SynSets.append(synset)

self.SynSets = set(self.SynSets)
self.WriteSynSets()
return self
In the output I get these results (the first 3 results from two different runs):
Synset('allow.v.09') Synset('code.n.03') Synset('coffee.n.01')
------------
Synset('allow.v.09') Synset('argumentation.n.02') Synset('boastfully.r.01')
I would appreciate your help if there is another (more stable) way to get the synsets.
Thanks in advance.
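In case a concrete comparison helps, below is a minimal sketch of a simplified Lesk scorer with explicit, deterministic tie-breaking. It is a hypothetical illustration, not NLTK's own wsd.lesk: the candidate senses are sorted by synset name before scoring, so ties between equally overlapping glosses are always resolved the same way. It also takes an already tokenized context, on the assumption that the overlap should be computed over word tokens rather than over a raw sentence string.

from nltk.corpus import wordnet as wn
from nltk.tokenize import word_tokenize

def simple_lesk(context_tokens, word):
    # Simplified Lesk sketch: count the overlap between the context tokens
    # and each sense's gloss, breaking ties by synset name so the winner
    # does not depend on any internal iteration order.
    context = set(t.lower() for t in context_tokens)
    best, best_score = None, -1
    for ss in sorted(wn.synsets(word), key=lambda s: s.name()):
        gloss = set(w.lower() for w in ss.definition().split())
        score = len(context & gloss)
        if score > best_score:
            best, best_score = ss, score
    return best

sentence = "Python is a widely used general-purpose, high-level programming language."
print(simple_lesk(word_tokenize(sentence), "language"))

Because the candidates are iterated in a fixed order and ties never consult a set or dict ordering, repeated runs on the same input should print the same synset.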
EDIT:
As an additional example, here is the full script, which I ran twice:
import nltk
from nltk.tokenize import sent_tokenize
from nltk import word_tokenize
from nltk import wsd
from nltk.corpus import stopwords

SynSets = []
sentences = sent_tokenize("Python is a widely used general-purpose, high-level programming language.\
Its design philosophy emphasizes code readability, and its syntax allows programmers to express concepts in fewer lines of code than would be possible in languages such as C++ or Java.\
The language provides constructs intended to enable clear programs on both a small and large scale.\
Python supports multiple programming paradigms, including object-oriented, imperative and functional programming or procedural styles.\
")
stopwordsList = stopwords.words('english')

for sentence in sentences:
    raw_tokens = word_tokenize(sentence)  # WordPunctTokenizer().tokenize(sentence)
    # removing stopwords and words smaller than 3 characters
    final_tokens = [token.lower() for token in raw_tokens
                    if token not in stopwordsList
                    # and (len(token) > 3)
                    and not token.isdigit()]
    for token in final_tokens:
        synset = wsd.lesk(sentence, token)
        if synset is not None:
            SynSets.append(synset)

SynSets = set(SynSets)
SynSets = sorted(SynSets)
with open("synsets.txt", "a") as file:
    file.write("\n-------------------\n")
    for synset in SynSets:
        file.write("{} ".format(str(synset)))
file.close()
And I got these results (the first 4 synsets written to the file for each of the 2 runs of the program):
Synset('allow.v.04') Synset('boastfully.r.01') Synset('clear.v.11') Synset('code.n.02')
Synset('boastfully.r.01') Synset('clear.v.19') Synset('code.n.01') Synset('design.n.04')
SOLUTION: I figured out what the problem was. After reinstalling Python 2.7, all the problems went away. So, do not use Python 3.x with the lesk algorithm.
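For what it is worth, one possible explanation (an assumption on my part, not verified against the exact NLTK version used above) is Python 3's hash randomization: it is enabled by default since Python 3.3, so set and dict iteration order changes between interpreter runs, and any scoring or output step that depends on that order can differ from run to run. A quick sanity check, without reinstalling Python, is to run the script with PYTHONHASHSEED=0 and to write the synsets in a fixed order, as in this small sketch:

import sys
from nltk.corpus import wordnet as wn

# Start the interpreter with PYTHONHASHSEED=0 to disable hash randomization;
# this flag prints 0 when randomization is off for the current run.
print(sys.flags.hash_randomization)

# Sorting by synset name makes the written output comparable across runs,
# independently of how a set happens to iterate.
synsets = set(wn.synsets("code"))
for ss in sorted(synsets, key=lambda s: s.name()):
    print(ss)

If the output still varies with randomization disabled and a fixed write order, the variation comes from the disambiguation step itself rather than from Python's hashing.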