java - Inline code is slower than a function call/static function in Java

Question

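I have been running some tests to see how inlining function code (explicitly writing a function's algorithm in the calling code itself) affects performance. I wrote a simple byte-array-to-integer conversion inline, then wrapped it in a function, called it statically from another class, and called it statically from the class itself. The code is as follows: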

public class FunctionCallSpeed {
    public static final int numIter = 50000000;

    public static void main (String [] args) {
        byte [] n = new byte[4];

        long start;

        System.out.println("Function from Static Class =================");
        start = System.nanoTime();
        for (int i = 0; i < numIter; i++) {
            StaticClass.toInt(n);
        }
        System.out.println("Elapsed time: " + (double)(System.nanoTime() - start) / 1000000000 + "s");

        System.out.println("Function from Class ========================");
        start = System.nanoTime();
        for (int i = 0; i < numIter; i++) {
            toInt(n);
        }
        System.out.println("Elapsed time: " + (double)(System.nanoTime() - start) / 1000000000 + "s");

        int actual = 0;

        int len = n.length;

        System.out.println("Inline Function ============================");
        start = System.nanoTime();
        for (int i = 0; i < numIter; i++) {
            for (int j = 0; j < len; j++) {
                actual += n[len - 1 - j] << 8 * j;
            }
        }
        System.out.println("Elapsed time: " + (double)(System.nanoTime() - start) / 1000000000 + "s");
    }

    public static int toInt(byte [] num) {
        int actual = 0;

        int len = num.length;

        for (int i = 0; i < len; i++) {
            actual += num[len - 1 - i] << 8 * i;
        }

        return actual;
    }
}

The results are as follows:

Function from Static Class =================
Elapsed time: 0.096559931s
Function from Class ========================
Elapsed time: 0.015741711s
Inline Function ============================
Elapsed time: 0.837626286s

Is something strange going on in the bytecode? I have looked at the bytecode myself, but I am not very familiar with it and cannot make heads or tails of it.

EDIT

After adding assert statements that read the output, and randomizing the bytes being read, the benchmark now behaves the way I expected. Thanks to Tomasz Nurkiewicz, who pointed me to the microbenchmark article. The resulting code is:

public class FunctionCallSpeed {
public static final int numIter = 50000000;

public static void main (String [] args) {
    byte [] n;

    long start, end;
    int checker, calc;

    end = 0;
    System.out.println("Function from Object =================");
    for (int i = 0; i < numIter; i++) {
        checker = (int)(Math.random() * 65535);
        n = toByte(checker);
        start = System.nanoTime();
        calc = StaticClass.toInt(n);
        end += System.nanoTime() - start;
        assert calc == checker;
    }
    System.out.println("Elapsed time: " + (double)end / 1000000000 + "s");
    end = 0;
    System.out.println("Function from Class ==================");
    start = System.nanoTime();
    for (int i = 0; i < numIter; i++) {
        checker = (int)(Math.random() * 65535);
        n = toByte(checker);
        start = System.nanoTime();
        calc = toInt(n);
        end += System.nanoTime() - start;
        assert calc == checker;
    }
    System.out.println("Elapsed time: " + (double)end / 1000000000 + "s");


    int len = 4;
    end = 0;
    System.out.println("Inline Function ======================");
    start = System.nanoTime();
    for (int i = 0; i < numIter; i++) {
        calc = 0;
        checker = (int)(Math.random() * 65535);
        n = toByte(checker);
        start = System.nanoTime();
        for (int j = 0; j < len; j++) {
            calc += n[len - 1 - j] << 8 * j;
        }
        end += System.nanoTime() - start;
        assert calc == checker;
    }
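    // Note: this prints the time since the last `start`, not the accumulated `end` used by the
    // two blocks above, so the "Inline Function" figure is not comparable with the other two.
    System.out.println("Elapsed time: " + (double)(System.nanoTime() - start) / 1000000000 + "s");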
}

public static byte [] toByte(int val) {
    byte [] n = new byte[4];

    for (int i = 0; i < 4; i++) {
        n[i] = (byte)((val >> 8 * i) & 0xFF);
    }
    return n;
}

public static int toInt(byte [] num) {
    int actual = 0;

    int len = num.length;

    for (int i = 0; i < len; i++) {
        actual += num[len - 1 - i] << 8 * i;
    }

    return actual;
}
}

Results:

Function from Static Class =================
Elapsed time: 9.276437031s
Function from Class ========================
Elapsed time: 9.225660708s
Inline Function ============================
Elapsed time: 5.9512E-5s

score 5 · Accepted Answer

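It is always hard to say for certain what the JIT is doing, but if I had to guess, it noticed that the function's return value is never used and optimized much of the work away.

If you actually use the return value of the function, I bet the speed changes.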
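To illustrate the suggestion above, here is a minimal sketch in which every return value feeds into a running sum that is printed at the end, so the calls cannot be treated as dead code. The class name `UseReturnValue` and the helper `sumOfCalls` are hypothetical names introduced for this example; `toInt` is copied from the question. This only keeps the result observable. It does not guarantee that the JIT performs the full computation on every iteration.

```java
class UseReturnValue {
    // Same conversion as toInt in the question.
    static int toInt(byte[] num) {
        int actual = 0;
        int len = num.length;
        for (int i = 0; i < len; i++) {
            actual += num[len - 1 - i] << 8 * i;
        }
        return actual;
    }

    // Hypothetical helper: adds up every result so the return values are used.
    static long sumOfCalls(byte[] n, int iterations) {
        long sum = 0;
        for (int i = 0; i < iterations; i++) {
            sum += toInt(n);
        }
        return sum;
    }

    public static void main(String[] args) {
        byte[] n = {0, 0, 1, 2};
        int iterations = 50_000_000;

        long start = System.nanoTime();
        long sum = sumOfCalls(n, iterations);
        long elapsed = System.nanoTime() - start;

        // Printing the sum lets the result escape the timed loop.
        System.out.println("sum = " + sum + ", elapsed: " + elapsed / 1e9 + "s");
    }
}
```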

score 3 · Accepted Answer

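There are several issues, but the main one is that you are testing a single iteration of a single piece of optimised code. That is sure to give you mixed results. I suggest ignoring the first 10,000 or so iterations and then running the test for 2 seconds.

If the result of a loop is not kept, the JIT can discard the entire loop after some random interval.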
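The warm-up advice can be sketched roughly as follows: run some untimed iterations first, then time the code for a fixed duration and report the average cost per call. The names `WarmupBenchmark`, `averageNanos` and `sink`, and the batch size of 1000, are assumptions made for this illustration. Only the 10,000 iterations and the 2 seconds come from the text above.

```java
class WarmupBenchmark {
    // Results are written here so they are not obviously unused.
    static int sink;

    // Same conversion as toInt in the question.
    static int toInt(byte[] num) {
        int actual = 0;
        int len = num.length;
        for (int i = 0; i < len; i++) {
            actual += num[len - 1 - i] << 8 * i;
        }
        return actual;
    }

    // Hypothetical helper: runs warmupIters untimed calls, then times calls
    // for at least durationNanos and returns the average nanoseconds per call.
    static double averageNanos(byte[] n, int warmupIters, long durationNanos) {
        for (int i = 0; i < warmupIters; i++) {
            sink += toInt(n);
        }

        long calls = 0;
        long elapsed;
        long start = System.nanoTime();
        do {
            // Call in batches so reading the clock does not dominate the measurement.
            for (int i = 0; i < 1000; i++) {
                sink += toInt(n);
            }
            calls += 1000;
            elapsed = System.nanoTime() - start;
        } while (elapsed < durationNanos);

        return (double) elapsed / calls;
    }

    public static void main(String[] args) {
        // Ignore the first 10,000 iterations, then measure for about 2 seconds.
        double avg = averageNanos(new byte[4], 10_000, 2_000_000_000L);
        System.out.println("average: " + avg + " ns per call (sink = " + sink + ")");
    }
}
```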

Breaking each test into a separate method

public class FunctionCallSpeed {
    public static final int numIter = 50000000;
    private static int dontOptimiseAway;

    public static void main(String[] args) {
        byte[] n = new byte[4];

        for (int i = 0; i < 10; i++) {
            test1(n);
            test2(n);
            test3(n);
            System.out.println();
        }
    }

    private static void test1(byte[] n) {
        System.out.print("from Static Class: ");
        long start = System.nanoTime();
        for (int i = 0; i < numIter; i++) {
            dontOptimiseAway = FunctionCallSpeed.toInt(n);
        }
        System.out.print((System.nanoTime() - start) / numIter + "ns ");
    }

    private static void test2(byte[] n) {
        long start;
        System.out.print("from Class: ");
        start = System.nanoTime();
        for (int i = 0; i < numIter; i++) {
            dontOptimiseAway = toInt(n);
        }
        System.out.print((System.nanoTime() - start) / numIter + "ns ");
    }

    private static void test3(byte[] n) {
        long start;
        int actual = 0;

        int len = n.length;

        System.out.print("Inlined: ");
        start = System.nanoTime();
        for (int i = 0; i < numIter; i++) {
            for (int j = 0; j < len; j++) {
                actual += n[len - 1 - j] << 8 * j;
            }
            dontOptimiseAway = actual;
        }
        System.out.print((System.nanoTime() - start) / numIter + "ns ");
    }

    public static int toInt(byte[] num) {
        int actual = 0;

        int len = num.length;

        for (int i = 0; i < len; i++) {
            actual += num[len - 1 - i] << 8 * i;
        }

        return actual;
    }
}

prints

from Class: 7ns Inlined: 11ns from Static Class: 9ns 
from Class: 6ns Inlined: 8ns from Static Class: 8ns 
from Class: 6ns Inlined: 9ns from Static Class: 6ns

This suggests that when the inner loop is optimised separately, it is slightly more efficient.

However, if I use an optimised conversion of bytes to int

public static int toInt(byte[] num) {
    return num[0] + (num[1] << 8) + (num[2] << 16) + (num[3] << 24);
}

all the tests report

from Static Class: 0ns from Class: 0ns Inlined: 0ns 
from Static Class: 0ns from Class: 0ns Inlined: 0ns 
from Static Class: 0ns from Class: 0ns Inlined: 0ns

because the JIT has realised the test doesn't do anything useful. ;)
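One detail worth checking before reusing the one-line conversion: as posted, it treats num[0] as the least significant byte, whereas the loop version treats the last element as the least significant byte. With the all-zero array used in the benchmark both return 0, so the timings are unaffected, but on other inputs the results differ. The sketch below compares them. The class `ByteOrderCheck` and the variant `unrolledBigEndian` are hypothetical names added for this comparison.

```java
class ByteOrderCheck {
    // Loop version from the question: last array element is the low byte.
    static int loopToInt(byte[] num) {
        int actual = 0;
        int len = num.length;
        for (int i = 0; i < len; i++) {
            actual += num[len - 1 - i] << 8 * i;
        }
        return actual;
    }

    // One-line version exactly as posted: first array element is the low byte.
    static int unrolledAsPosted(byte[] num) {
        return num[0] + (num[1] << 8) + (num[2] << 16) + (num[3] << 24);
    }

    // Hypothetical variant that uses the same byte order as the loop version.
    static int unrolledBigEndian(byte[] num) {
        return num[3] + (num[2] << 8) + (num[1] << 16) + (num[0] << 24);
    }

    public static void main(String[] args) {
        byte[] n = {1, 2, 3, 4};
        System.out.println("loop:               0x" + Integer.toHexString(loopToInt(n)));
        System.out.println("unrolled as posted: 0x" + Integer.toHexString(unrolledAsPosted(n)));
        System.out.println("unrolled, same order as loop: 0x" + Integer.toHexString(unrolledBigEndian(n)));
    }
}
```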

score 3 · Accepted Answer

I ported your test case to Caliper:

import com.google.caliper.SimpleBenchmark;

public class ToInt extends SimpleBenchmark {

    private byte[] n;
    private int total;

    @Override
    protected void setUp() throws Exception {
        n = new byte[4];
    }

    public int timeStaticClass(int reps) {
        for (int i = 0; i < reps; i++) {
            total += StaticClass.toInt(n);
        }
        return total;
    }

    public int timeFromClass(int reps) {
        for (int i = 0; i < reps; i++) {
            total += toInt(n);
        }
        return total;
    }

    public int timeInline(int reps) {
        for (int i = 0; i < reps; i++) {
            int actual = 0;
            int len = n.length;
            for (int i1 = 0; i1 < len; i1++) {
                actual += n[len - 1 - i1] << 8 * i1;
            }
            total += actual;
        }
        return total;
    }

    public static int toInt(byte[] num) {
        int actual = 0;
        int len = num.length;
        for (int i = 0; i < len; i++) {
            actual += num[len - 1 - i] << 8 * i;
        }
        return actual;
    }
}

class StaticClass {
    public static int toInt(byte[] num) {
        int actual = 0;

        int len = num.length;

        for (int i = 0; i < len; i++) {
            actual += num[len - 1 - i] << 8 * i;
        }

        return actual;
    }

}

Indeed, the inline version appears to be the slowest, while the two static versions are nearly the same (as expected).

[Image: Caliper]

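The reason is hard to pin down. Two factors might be at play:

The JVM is better at performing micro-optimizations when code blocks are as small and as simple to reason about as possible. When the function is inlined by hand, the code as a whole becomes more complex and the JVM gives up. With the smaller toInt() function, the JIT can be smarter.
Cache locality - somehow the JVM performs better with two small chunks of code (the loop and the method) than with one larger chunk.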

score 0 · Accepted Answer

Your test is flawed. The second test has the advantage that the first test has already been run. You need to run each test case in its own JVM invocation.
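One way to follow this advice, sketched under the assumption that the test name is passed as a command-line argument, is to have the program run only the selected test, so that each measurement starts in a fresh JVM. The class `SingleTestRunner`, the test names "call" and "inline", and the helpers below are hypothetical. `toInt` is copied from the question.

```java
class SingleTestRunner {
    // Results are written here so they are not obviously unused.
    static int sink;

    // Same conversion as toInt in the question.
    static int toInt(byte[] num) {
        int actual = 0;
        int len = num.length;
        for (int i = 0; i < len; i++) {
            actual += num[len - 1 - i] << 8 * i;
        }
        return actual;
    }

    // Times the version that calls the static method.
    static long timeCall(byte[] n, int iters) {
        long start = System.nanoTime();
        for (int i = 0; i < iters; i++) {
            sink += toInt(n);
        }
        return System.nanoTime() - start;
    }

    // Times the version with the conversion written out in the loop body.
    static long timeInline(byte[] n, int iters) {
        int len = n.length;
        long start = System.nanoTime();
        for (int i = 0; i < iters; i++) {
            int actual = 0;
            for (int j = 0; j < len; j++) {
                actual += n[len - 1 - j] << 8 * j;
            }
            sink += actual;
        }
        return System.nanoTime() - start;
    }

    // Hypothetical dispatcher: runs exactly one test, chosen by name.
    static long run(String testName, byte[] n, int iters) {
        switch (testName) {
            case "call":
                return timeCall(n, iters);
            case "inline":
                return timeInline(n, iters);
            default:
                throw new IllegalArgumentException("unknown test: " + testName);
        }
    }

    public static void main(String[] args) {
        String testName = args.length > 0 ? args[0] : "call";
        long nanos = run(testName, new byte[4], 50_000_000);
        System.out.println(testName + ": " + nanos / 1e9 + "s (sink = " + sink + ")");
    }
}
```

After compiling, the two measurements would be taken as separate commands, for example `java SingleTestRunner call` followed by `java SingleTestRunner inline`, so that neither run benefits from code the other has already warmed up.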
