nlp - インド語で使用できるステマーはありますか

Question

(ヒンディー語、テルグ語) などのインド語のステマーの実装はありますか?

score 5 · Accepted Answer

Hindi Analyzerはステマー付きで、Lucene で使用できます。このアルゴリズムに基づいています(pdf)。

score 3 · Accepted Answer

Hindi_stemmerは、 Ananthakrishnan Ramanathan と Durgesh D Rao による「 A Lightweight Stemmer for Hindi 」で説明されているヒンディー語ステマーの Python 実装です。

score -2 · Accepted Answer

オリジナルコードのJava版を作成しましたPython Hindi Stemmer

元のヒンディー語ステマーの作成者が変数 L を使用した方法はよく理解されていませんが、機能する完全なコードがあります。

import java.util.ArrayList; org.apache.commons.lang.StringUtils をインポートします。

public class SimpleHindiStemmer {

/* This is not coming while pasting the code better copy from Python Code as given in suffixes[1]  */
private static String [] stem1 = new String [] { "ो", "े", "ू", "ु", "ी", "ि", "ा" };

/* This is not coming while pasting the code better copy from Python Code as given in suffixes[2]  */
private static String [] stem2 = new String [] { "कर", "ाओ", "िए", "ाई", "ाए", "ने", "नी", "ना", "ते", "ीं", "ती", "ता", "ाँ", "ां", "ों", "ें" };

private static String [] stem3 = new String [] { "ाकर", "ाइए", "ाईं", "ाया", "ेगी", "ेगा", "ोगी", "ोगे", "ाने", "ाना", "ाते", "ाती", "ाता", "तीं", "ाओं", "ाएं", "ुओं", "ुएं", "ुआं" };

private static String [] stem4 = new String [] { "ाएगी", "ाएगा", "ाओगी", "ाओगे", "एंगी", "ेंगी", "एंगे", "ेंगे", "ूंगी", "ूंगा", "ातीं", "नाओं", "नाएं", "ताओं", "ताएं", "ियाँ", "ियों", "ियां" };

private static String [] stem5 = new String [] { "ाएंगी", "ाएंगे", "ाऊंगी", "ाऊंगा", "ाइयाँ", "ाइयों", "ाइयां" };

private static ArrayList<String []> stemList = new ArrayList<String []>(5);

static int [] cut = new int [] { 0, 1, 1, 1, 2, 2 };

static {

    stemList.add(stem5);
    stemList.add(stem4);
    stemList.add(stem3);
    stemList.add(stem2);
    stemList.add(stem1);

}

public SimpleHindiStemmer(){

}

public static void main(String [] argv){

    SimpleHindiStemmer sm = new SimpleHindiStemmer();
    String word = "रास्ते";
    System.out.println(sm.stemprocess(word));
}

public String stemprocess(String word){


    int wlen = word.length();
    int wordlen = wlen*3;
    int icnt = 5;
    for (String [] stemwords : stemList){
        if(wordlen > (icnt + 1)){
            for(String sw: stemwords){
                if(StringUtils.endsWith(word, sw)){
                    return StringUtils.substring(word, 0, wlen - cut[icnt]);

                }
            }

        }


        icnt--;
    }

    return word;
}
}

ご覧のとおり、utf-8 文字がうまくキャプチャされない場合があります。元の python コードを見て、そこからサフィックスの値をコピーするだけです。

nlp - インド語で使用できるステマーはありますか

4 に答える 4

Related

Reference