c# - How to remove commas from tokens using a custom TokenFilter in Lucene.net

Question

I have a custom TokenFilter set up to parse keywords such as

oracle,java,sybase,vb.net etc.

into

oracle java sybase vb.net

It works fine, but one of my test documents contains the following text:

,oracle java,sybase,unix

and I am trying to strip the leading comma from

,oracle

using the code below:

    public override bool IncrementToken()
    {
        if (!input.IncrementToken())
            return false;

        char[] buffer = termAtt.TermBuffer();
        int bufferLength = termAtt.TermLength();

        ...
        else if (bufferLength > 1 && buffer[0] == ',')
        {
            // strip the starting , off !
            // where offsetAtt = AddAttribute<IOffsetAttribute>();
            offsetAtt.SetOffset(offsetAtt.StartOffset + 1, offsetAtt.EndOffset);
        }
        ...

        return true;

    }

However, the comma is not removed.

How can I get this to work?

Thanks

score 1 · Accepted Answer

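Tokens in Lucene are built on attributes: each property of a token (its text value, its offsets, and so on) is a separate attribute.

The text value of a token is held by its TermAttribute.

Changing the offsets only changes the positions recorded for the token in the original text; it does not touch the term text. So after adjusting the offset you also need to change the text itself, probably with something like the following snippet.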

    // Lucene.Net 3.x naming is assumed here (ITermAttribute, AddAttribute<T>())
    private readonly ITermAttribute termAtt; // instance field

    termAtt = AddAttribute<ITermAttribute>(); // initialization in the constructor

    ....

        else if (bufferLength > 1 && buffer[0] == ',')
        {
            // strip the starting , off !
            offsetAtt.SetOffset(offsetAtt.StartOffset + 1, offsetAtt.EndOffset);

            // update the termAtt too: keep everything after the leading comma
            termAtt.SetTermBuffer(new string(buffer, 1, bufferLength - 1));
        }

    ....
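If you want to check the trimming logic separately from Lucene, here is a minimal sketch. `CommaTrimmer` and `StripLeadingComma` are hypothetical names made up for illustration, not part of Lucene.Net. The helper shifts the characters in place and returns the new length, and it keeps the same `length > 1` guard as the question, so a token that is only a comma is left alone.

```csharp
using System;

// Hypothetical helper (not a Lucene.Net type): removes one leading comma
// from a term buffer in place and returns the new term length.
public static class CommaTrimmer
{
    public static int StripLeadingComma(char[] buffer, int length)
    {
        if (length > 1 && buffer[0] == ',')
        {
            // Shift the remaining characters one position to the left.
            // Array.Copy handles overlapping source and destination ranges.
            Array.Copy(buffer, 1, buffer, 0, length - 1);
            return length - 1;
        }

        // No leading comma (or the token is a lone comma): nothing to do.
        return length;
    }
}
```

Inside `IncrementToken()` this could be called with `termAtt.TermBuffer()` and `termAtt.TermLength()`. Assuming the Lucene.Net 3.x `ITermAttribute` API, when the returned length is smaller you would then call `termAtt.SetTermLength(newLength)` and bump the start offset as in the snippet above. With the buffer `,oracle` the helper leaves `oracle` in the first six characters.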

Let me know whether that works.
