regex - Complex synonym matching

Question

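I need to do synonym matching with Solr.

For example, in Sweden street names usually have the form Foogatan, where gatan is the Swedish word for street. Such a street name can also be written in abbreviated form as Foog. (like writing st. instead of street in English).

I know how synonyms.txt works, but I don't know how to create a synonym that checks that there are some letters before gatan or before g.

I need a synonym that matches *g. with *gatan.

I ended up with this (it seems to work as a rough draft of what I'm after):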

public boolean incrementToken() throws IOException {

    // See http://solr.pl/en/2012/05/14/developing-your-own-solr-filter/

    if (!input.incrementToken()) return false;

    String string = charTermAttr.toString();

    boolean containsGatan = string.contains("gatan");
    boolean containsG = string.contains("g.");

    if (containsGatan) {

        string = string.replace("gatan", "g.");

        char[] newBuffer = string.toCharArray();

        charTermAttr.setEmpty();
        charTermAttr.copyBuffer(newBuffer, 0, newBuffer.length);

        return true;
    }

    if (containsG) {

        string = string.replace("g.", "gatan");

        char[] newBuffer = string.toCharArray();

        charTermAttr.setEmpty();
        charTermAttr.copyBuffer(newBuffer, 0, newBuffer.length);

        return true;
    }

    return true; // pass other tokens through unchanged; returning false here would end the token stream
}

A similar problem I have is that phone numbers can be written either as 031-123456 or as 031123456. A search for a phone number like 031123456 should also find 031-123456.

How can I achieve this in Solr?

score 0 · Accepted Answer

For the first example, you could write a custom TokenFilter and plug it into your analyzer (it's not that hard; see org.apache.lucene.analysis.ASCIIFoldingFilter for a simple example).

The second one can probably be solved with PatternReplaceCharFilterFactory: http://docs.lucidworks.com/display/solr/CharFilterFactories

You need to remove the '-' character from the numbers and index/search only the digits. A similar question: Solr PatternReplaceCharFilterFactory not replacing with the specified pattern
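To make the hyphen stripping concrete, here is a minimal plain-Java sketch of the kind of regex replacement this amounts to. The class PhoneNormalizer and the exact pattern are assumptions for illustration, not something taken from Solr. The same pattern and an empty replacement could presumably be given to PatternReplaceCharFilterFactory through its pattern and replacement attributes, applied at both index and query time.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Solr or Lucene.
final class PhoneNormalizer {

    // A hyphen that has a digit directly before it and directly after it.
    private static final Pattern HYPHEN_BETWEEN_DIGITS =
            Pattern.compile("(?<=\\d)-(?=\\d)");

    private PhoneNormalizer() {}

    /** Removes hyphens between digits, so "031-123456" becomes "031123456". */
    static String normalize(String text) {
        return HYPHEN_BETWEEN_DIGITS.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(normalize("031-123456")); // 031123456
        System.out.println(normalize("031123456"));  // 031123456
        System.out.println(normalize("foo-bar"));    // foo-bar (no digits around the hyphen)
    }
}
```

Restricting the match to hyphens between digits leaves hyphenated words alone, which matters if the same field also holds ordinary text.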

A simple example that removes gatan from the end of each token:

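import java.io.IOException;

import org.apache.commons.lang3.StringUtils; // or org.apache.commons.lang.StringUtils on Commons Lang 2.x
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Declared final because Lucene asserts that TokenStream subclasses (or their incrementToken()) are final.
public final class Gatanizer extends TokenFilter {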

    private final CharTermAttribute termAttribute = addAttribute(CharTermAttribute.class);

    /**
     * Construct a token stream filtering the given input.
     */
    protected Gatanizer(TokenStream input) {
        super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (input.incrementToken()) {

            final char[] buffer = termAttribute.buffer();
            final int length = termAttribute.length();

            String tokenString = new String(buffer, 0, length);
            tokenString = StringUtils.removeEnd(tokenString, "gatan");

            termAttribute.setEmpty();
            termAttribute.append(tokenString);

            return true;
        }

        return false;
    }

}
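The filter above only strips gatan, so it does not yet cover the abbreviated g. form from the question. One possible way to handle both is to rewrite the abbreviation to the long form before indexing and querying. The sketch below shows that rule in plain Java. StreetSuffix and canonicalize are hypothetical names, the input is assumed to be lowercased already, and the token is assumed to still carry its trailing dot (depending on the tokenizer, the dot may have been stripped before a filter sees it).

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Lucene or Solr.
final class StreetSuffix {

    // At least one letter, followed by "gatan" or the abbreviation "g." at the end of the token.
    private static final Pattern SUFFIX =
            Pattern.compile("^(\\p{L}+)(?:gatan|g\\.)$");

    private StreetSuffix() {}

    /** Rewrites "foog." to "foogatan"; tokens that do not match the rule are returned unchanged. */
    static String canonicalize(String token) {
        return SUFFIX.matcher(token).replaceFirst("$1gatan");
    }

    public static void main(String[] args) {
        for (String token : new String[] {"foog.", "foogatan", "gatan", "g.", "väg"}) {
            System.out.println(token + " -> " + canonicalize(token));
        }
    }
}
```

Inside incrementToken() this could be applied with something like termAttribute.setEmpty().append(StreetSuffix.canonicalize(termAttribute.toString())). Note that the rule is deliberately crude: any word ending in g. would be expanded, so it probably only makes sense on a field that holds street names.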

Then register the TokenFilter on a Solr field type:

    <fieldtype name="gatan" stored="false" indexed="false" multiValued="true" class="solr.TextField">
        <analyzer>
            <tokenizer class="solr.StandardTokenizerFactory"/>
            <filter class="solr.LowerCaseFilterFactory"/>
            <filter class="solr.ASCIIFoldingFilterFactory"/>
            <filter class="gatanizer.GatanizerFactory"/>
        </analyzer>
    </fieldtype>

You will also need a GatanizerFactory that returns your Gatanizer.
