java - Java matrix-vector multiplication is 100 times slower than the C version

Question

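I am investigating the performance difference between an Android Java application and an Android NDK application. As a 3D graphics example, I ran a Matrix4D-Vector4D transformation over more than 90,000 vertices.

The Java version appears to be almost 100 times slower than the C version. Did I do something wrong? Has anyone had a similar experience?

My Java code for the transformation: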

        long t1 = System.nanoTime();
        for ( int i = 0; i < vCount; i++)
        {

            Vector4 vOut = new Vector4();
            Vector4 v = vertices[i];

            vOut.v_[0] = v.v_[0] * matrix[0].v_[0];
            vOut.v_[1] = v.v_[0] * matrix[0].v_[1];
            vOut.v_[2] = v.v_[0] * matrix[0].v_[2];
            vOut.v_[3] = v.v_[0] * matrix[0].v_[3];

            vOut.v_[0] += v.v_[1] * matrix[1].v_[0];
            vOut.v_[1] += v.v_[1] * matrix[1].v_[1];
            vOut.v_[2] += v.v_[1] * matrix[1].v_[2];
            vOut.v_[3] += v.v_[1] * matrix[1].v_[3];

            vOut.v_[0] += v.v_[2] * matrix[2].v_[0];
            vOut.v_[1] += v.v_[2] * matrix[2].v_[1];
            vOut.v_[2] += v.v_[2] * matrix[2].v_[2];
            vOut.v_[3] += v.v_[2] * matrix[2].v_[3];

            vOut.v_[0] += v.v_[3] * matrix[3].v_[0];
            vOut.v_[1] += v.v_[3] * matrix[3].v_[1];
            vOut.v_[2] += v.v_[3] * matrix[3].v_[2];
            vOut.v_[3] += v.v_[3] * matrix[3].v_[3]; 

            vertices[i] = vOut;

        }
        long t2 = System.nanoTime();        
        long diff = t2 - t1;        
        double ms = (double)(diff / 1000000.0f);
        Log.w("GL2JNIView", String.format("ms %.2f ", ms));

Performance (transforming > 90,000 vertices | Android 4.0.4, SGS II), median of 200 runs:

JAVA-Version:   2 FPS
C-Version:    190 FPS
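
For anyone wanting to reproduce the measurement, here is a rough sketch of how a median over repeated runs could be collected on the Java side. The `MedianTimer` class and its method names are made up for illustration and do not come from the original post, and the `fps` helper assumes the FPS figures above are simply 1000 divided by the frame time in milliseconds. On Android the result would go to `Log` rather than standard output.

```java
import java.util.Arrays;

// Hypothetical helper, not part of the original post.
class MedianTimer {
    // Runs the task `runs` times (runs must be > 0) and returns the
    // median wall-clock time of a single run in milliseconds.
    static double medianMillis(Runnable task, int runs) {
        long[] nanos = new long[runs];
        for (int r = 0; r < runs; r++) {
            long t1 = System.nanoTime();
            task.run();
            nanos[r] = System.nanoTime() - t1;
        }
        // Divide in double precision; a float divisor would lose digits.
        return median(nanos) / 1000000.0;
    }

    // Median of a non-empty sample array. For an even number of samples
    // this is the mean of the two middle values.
    static double median(long[] samples) {
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return (sorted.length % 2 == 1)
                ? sorted[mid]
                : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    // Assumed conversion from a frame time in milliseconds to frames per second.
    static double fps(double millisPerFrame) {
        return 1000.0 / millisPerFrame;
    }
}
```

With this helper, the transformation loop would be wrapped in a `Runnable` and passed to `medianMillis(task, 200)`. The median is less sensitive than the mean to the occasional run that is interrupted by garbage collection.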

score 5 · Accepted Answer

You create a new Vector4 on every iteration. In my own experience, using new inside an inner loop can cause unexpected performance problems on Android.
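
A minimal sketch of what avoiding the allocation could look like. The question does not show the `Vector4` class, so the stand-in below (a wrapper around a `float[] v_` of length 4) and the `InPlaceTransform` name are assumptions made for illustration. Each vertex is read into locals first and then overwritten in place, so nothing is allocated inside the loop.

```java
// Hypothetical stand-in for the Vector4 class used in the question;
// the real class layout and element type are not shown there.
class Vector4 {
    final float[] v_ = new float[4];

    Vector4() {}

    Vector4(float x, float y, float z, float w) {
        v_[0] = x; v_[1] = y; v_[2] = z; v_[3] = w;
    }
}

class InPlaceTransform {
    // Computes v' = v * M for every vertex (row-vector convention, matching
    // the question's code) and writes the result back into the same object.
    static void transform(Vector4[] vertices, Vector4[] matrix) {
        final float[] m0 = matrix[0].v_, m1 = matrix[1].v_;
        final float[] m2 = matrix[2].v_, m3 = matrix[3].v_;
        for (int i = 0; i < vertices.length; i++) {
            final float[] v = vertices[i].v_;
            // Read all four inputs before overwriting any of them.
            final float x = v[0], y = v[1], z = v[2], w = v[3];
            v[0] = x * m0[0] + y * m1[0] + z * m2[0] + w * m3[0];
            v[1] = x * m0[1] + y * m1[1] + z * m2[1] + w * m3[1];
            v[2] = x * m0[2] + y * m1[2] + z * m2[2] + w * m3[2];
            v[3] = x * m0[3] + y * m1[3] + z * m2[3] + w * m3[3];
        }
    }
}
```

Note that this overwrites the original vertex data. If the untransformed vertices are needed again, a second array of output vectors allocated once up front would serve the same purpose.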

score 0 · Accepted Answer

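As far as I know, Java on Android runs on a virtual machine called Dalvik, which has a different instruction set from the JVM and does not use just-in-time compilation to dynamically translate bytecode into machine code; it only interprets it. That makes Dalvik noticeably slower than C on CPU-bound tasks.

This may have changed in very recent Android systems.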

score 0 · Accepted Answer

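You should change the loop as well. In addition to the answer by @toopok4k3, try the following:

Drop the for loop and its bounds check, and catch ArrayIndexOutOfBoundsException instead. Your loop is large enough to amortize the try/catch overhead.
If the matrix array and the values it contains do not change between loop iterations, assign them to constants outside the loop. Array dereferences and member variable accesses are not as fast as local variables.
Each v.v_[] element is used several times, so assign it to a local variable and use that local four times before fetching the next element.

The version below assumes the values are doubles.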

int i = 0;
try  
{
    Vector4 vOut = new Vector4();
    final double m0v0 = matrix[0].v_[0];
    final double m0v1 = matrix[0].v_[1];
    final double m0v2 = matrix[0].v_[2];
    final double m0v3 = matrix[0].v_[3];
    final double m1v0 = matrix[1].v_[0];
    final double m1v1 = matrix[1].v_[1];
    final double m1v2 = matrix[1].v_[2];
    final double m1v3 = matrix[1].v_[3];
    final double m2v0 = matrix[2].v_[0];
    final double m2v1 = matrix[2].v_[1];
    final double m2v2 = matrix[2].v_[2];
    final double m2v3 = matrix[2].v_[3];
    final double m3v0 = matrix[3].v_[0];
    final double m3v1 = matrix[3].v_[1];
    final double m3v2 = matrix[3].v_[2];
    final double m3v3 = matrix[3].v_[3];

    while (true)
    {
        // Throws ArrayIndexOutOfBoundsException once i reaches vertices.length.
        Vector4 v = vertices[i];

        double vertexVal = v.v_[0];
        vOut.v_[0] = vertexVal * m0v0;
        vOut.v_[1] = vertexVal * m0v1;
        vOut.v_[2] = vertexVal * m0v2;
        vOut.v_[3] = vertexVal * m0v3;

        vertexVal = v.v_[1];
        vOut.v_[0] += vertexVal * m1v0;
        vOut.v_[1] += vertexVal * m1v1;
        vOut.v_[2] += vertexVal * m1v2;
        vOut.v_[3] += vertexVal * m1v3;

        vertexVal = v.v_[2];
        vOut.v_[0] += vertexVal * m2v0;
        vOut.v_[1] += vertexVal * m2v1;
        vOut.v_[2] += vertexVal * m2v2;
        vOut.v_[3] += vertexVal * m2v3;

        vertexVal = v.v_[3];
        vOut.v_[0] += vertexVal * m3v0;
        vOut.v_[1] += vertexVal * m3v1;
        vOut.v_[2] += vertexVal * m3v2;
        vOut.v_[3] += vertexVal * m3v3;

        // Copy the result back into the vertex. Storing vOut itself would make
        // every slot of vertices refer to the same scratch object.
        v.v_[0] = vOut.v_[0];
        v.v_[1] = vOut.v_[1];
        v.v_[2] = vOut.v_[2];
        v.v_[3] = vOut.v_[3];
        i++;

    } 
}
catch (ArrayIndexOutOfBoundsException aioobe) 
{
    // loop is done
}
