hadoop - Hive UDF execution

Question

I wrote a Hive UDF that decrypts values using an in-house API, as follows:

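// Requires: import org.apache.hadoop.io.Text;
public Text evaluate(String customer) {
    if (customer == null) { return null; }

    try {
        // Decrypt via the in-house API; "name" selects the AlphaNumeric format.
        String result = com.voltage.data.access.Data.decrypt(customer, "name");
        return new Text(result);
    } catch (Exception e) {
        // On failure, the error message is returned in place of the value.
        return new Text(e.getMessage());
    }
}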

and Data.decrypt does the following:

public static String decrypt(String data, String type) throws Exception {
    configure();

    // Map the logical type to the FPE format name; unknown types leave it empty.
    String format = "";
    if (type.equals("ccn")) {
        format = "CC";
    } else if (type.equals("ssn")) {
        format = "SSN";
    } else if (type.equals("name")) {
        format = "AlphaNumeric";
    }

    return library.FPEAccess(identity, LibraryContext.getFPE_FORMAT_CUSTOM(),
            String.format("formatName=%s", format), authMethod, authInfo, data);
}

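Here, configure() creates a fairly expensive context object.

My question is: does Hive execute this UDF once for every row returned by the query? That is, if I select 10,000 rows, is the evaluate method run 10,000 times?

My gut instinct says yes. If so, here is my second question:

Is there a way to do either of the following?

a) Run configure() once when the query first starts, and then share the context object

b) Instead of a UDF that returns a decrypted string, aggregate the encrypted strings into some kind of set and then bulk-decrypt the set

Thanks in advance.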

score 2 · Accepted Answer

Does configure() need to be called once per JVM, or once per instance of the UDF class?

If once per JVM, put it in a static block in the class, like this:

static {
    configure();
}

If once per instance, put it in the constructor:

public [class name]() {
    super();
    configure();
}
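
To illustrate the once-per-JVM case, here is a minimal sketch that does not depend on Hive or the Voltage library. The class name DecryptUdfSketch, the configureCalls counter, and the placeholder configure() and "decryption" logic are all assumptions made up for the example. It uses a static holder class, a lazy variant of the static block above, so the expensive setup runs once even though evaluate() is called for every row (which, as far as I know, is how Hive invokes a simple UDF).

```java
import java.util.concurrent.atomic.AtomicInteger;

// Stand-in for a Hive UDF; hypothetical, with no Hive or Voltage dependency.
class DecryptUdfSketch {
    // Counts how often the (hypothetical) expensive setup runs.
    static final AtomicInteger configureCalls = new AtomicInteger();

    // Initialization-on-demand holder: the JVM runs this initializer once,
    // the first time Holder.CONTEXT is read, and does so thread-safely.
    private static final class Holder {
        static final String CONTEXT = configure();
    }

    // Placeholder for the costly configure() described in the question.
    private static String configure() {
        configureCalls.incrementAndGet();
        return "ctx";
    }

    // Called once per row, like evaluate() in a Hive UDF.
    public String evaluate(String customer) {
        if (customer == null) {
            return null;
        }
        // Placeholder "decryption" that reuses the shared context.
        return Holder.CONTEXT + ":" + customer;
    }

    public static void main(String[] args) {
        DecryptUdfSketch udf = new DecryptUdfSketch();
        for (int i = 0; i < 10_000; i++) {
            udf.evaluate("row" + i);
        }
        // prints "configure() calls: 1"
        System.out.println("configure() calls: " + configureCalls.get());
    }
}
```

Note that "once per JVM" on a cluster generally means once per task JVM rather than once per query, so the setup cost is still paid on each JVM that runs the UDF.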
