c# - LockBits performance-critical code

Question

I have a method that needs to be as fast as possible. It uses unsafe memory pointers, and since this is my first foray into this type of coding, I know it can probably be made faster.

    /// <summary>
    /// Copies bitmapdata from one bitmap to another at a specified point on the output bitmapdata
    /// </summary>
    /// <param name="sourcebtmpdata">The source bitmap must be smaller than the dest bitmap</param>
    /// <param name="destbtmpdata"></param>
    /// <param name="point">The point on the destination bitmap to draw at</param>
    private static unsafe void CopyBitmapToDest(BitmapData sourcebtmpdata, BitmapData destbtmpdata, Point point)
    {
        // calculate total number of rows to draw.
        var totalRow = Math.Min(
            destbtmpdata.Height - point.Y,
            sourcebtmpdata.Height);


        //loop through each row on the source bitmap and get mem pointers
        //to the source bitmap and dest bitmap
        for (int i = 0; i < totalRow; i++)
        {
            int destRow = point.Y + i;

            //get the pointer to the start of the current pixel "row" on the output image
            byte* destRowPtr = (byte*)destbtmpdata.Scan0 + (destRow * destbtmpdata.Stride);
            //get the pointer to the start of the current pixel row on the source image
            byte* srcRowPtr = (byte*)sourcebtmpdata.Scan0 + (i * sourcebtmpdata.Stride);

            int pointX = point.X;
            //the rowSize is computed once per row, outside the inner loop
            int rowSize = Math.Min(destbtmpdata.Width - pointX, sourcebtmpdata.Width);
            //for each pixel in the row, copy its three bytes
            for (int j = 0; j < rowSize; j++)
            {
                int firstBlueByte = ((pointX + j)*3);

                int srcByte = j *3;
                destRowPtr[(firstBlueByte)] = srcRowPtr[srcByte];
                destRowPtr[(firstBlueByte) + 1] = srcRowPtr[srcByte + 1];
                destRowPtr[(firstBlueByte) + 2] = srcRowPtr[srcByte + 2];
            }


        }
    }

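So, is there anything that can be done to make this faster? Ignore the todo for now; I will fix that later, once I have some baseline performance measurements.

Update: Sorry, the reason I am using this instead of Graphics.DrawImage is that I am implementing multi-threading, and because of that I can't use DrawImage.

Update 2: I am still not satisfied with the performance, and I am sure there are a few more milliseconds to be had.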

score 4 · Accepted Answer

I can't believe I didn't notice it until now, but there was something fundamentally wrong with the code.

byte* destRowPtr = (byte*)destbtmpdata.Scan0 + (destRow * destbtmpdata.Stride);

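This gets a pointer to the destination row, but not to the column being copied to. In the old code, the column offset was applied inside the rowSize loop. The line now looks like this: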

byte* destRowPtr = (byte*)destbtmpdata.Scan0 + (destRow * destbtmpdata.Stride) + pointX * 3;

Now we have the correct pointer into the destination data, so we can get rid of that for loop. Using the suggestions from Vilx and Rob, the code now looks like this:

    // P/Invoke declaration assumed by the call below; the answer did not show it.
    // Requires: using System.Runtime.InteropServices;
    [DllImport("kernel32.dll", EntryPoint = "RtlMoveMemory", SetLastError = false)]
    private static extern void CopyMemory(IntPtr dest, IntPtr src, UIntPtr length);

    private static unsafe void CopyBitmapToDestSuperFast(BitmapData sourcebtmpdata, BitmapData destbtmpdata, Point point)
    {
        //calculate total number of rows to copy.
        //using ternary operator instead of Math.Min, few ms faster
        int totalRows = (destbtmpdata.Height - point.Y < sourcebtmpdata.Height) ? destbtmpdata.Height - point.Y : sourcebtmpdata.Height;
        //calculate the width of the image to draw, this cuts off the image
        //if it goes past the width of the destination image
        int rowWidth = (destbtmpdata.Width - point.X < sourcebtmpdata.Width) ? destbtmpdata.Width - point.X : sourcebtmpdata.Width;

        //loop through each row on the source bitmap and get mem pointers
        //to the source bitmap and dest bitmap
        for (int i = 0; i < totalRows; i++)
        {
            int destRow = point.Y + i;

            //get the pointer to the start of the current pixel "row" and column on the output image
            byte* destRowPtr = (byte*)destbtmpdata.Scan0 + (destRow * destbtmpdata.Stride) + point.X * 3;

            //get the pointer to the start of the current pixel row on the source image
            byte* srcRowPtr = (byte*)sourcebtmpdata.Scan0 + (i * sourcebtmpdata.Stride);

            //RtlMoveMemory function: copy the whole row (3 bytes per pixel) in one call
            CopyMemory(new IntPtr(destRowPtr), new IntPtr(srcRowPtr), new UIntPtr((uint)rowWidth * 3));

        }
    }

Copying a 500x500 image onto a 5000x5000 image 50 times in a grid took 00:00:07.9948993 seconds. With the changes above it takes 00:00:01.8714263 seconds. Much better.
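
For readers who would rather not P/Invoke into kernel32, the same row-at-a-time idea can be sketched with `Buffer.MemoryCopy` (available in .NET Framework 4.6 and later, and in .NET Core). This is a sketch under assumptions, not the code that produced the timings above: the class `RowCopySketch`, the method `CopyRows`, and its raw pointer and stride parameters are hypothetical stand-ins for the `Scan0` and `Stride` values of the two `BitmapData` objects. It assumes a 24bpp top-down layout and non-negative coordinates, and it must be compiled with unsafe code enabled.

```csharp
using System;

public static class RowCopySketch
{
    // Assumption: 3 bytes per pixel (Format24bppRgb), as in the answer above.
    private const int PixelSize = 3;

    // Hypothetical helper: src/dest play the role of BitmapData.Scan0,
    // srcStride/destStride the role of BitmapData.Stride.
    // Assumes x >= 0 and y >= 0.
    public static unsafe void CopyRows(
        byte* src, int srcWidth, int srcHeight, int srcStride,
        byte* dest, int destWidth, int destHeight, int destStride,
        int x, int y)
    {
        // Clip the copied region to the destination bounds.
        int totalRows = Math.Min(destHeight - y, srcHeight);
        int rowBytes = Math.Min(destWidth - x, srcWidth) * PixelSize;
        if (totalRows <= 0 || rowBytes <= 0) return;

        byte* srcRow = src;
        byte* destRow = dest + (long)y * destStride + (long)x * PixelSize;

        for (int i = 0; i < totalRows; i++)
        {
            // Copies one whole row. The arguments are: source, destination,
            // bytes available at the destination, bytes to copy.
            Buffer.MemoryCopy(srcRow, destRow, rowBytes, rowBytes);
            srcRow += srcStride;
            destRow += destStride;
        }
    }
}
```

With real bitmaps the call would pass `(byte*)data.Scan0`, `data.Width`, `data.Height` and `data.Stride` for each locked `BitmapData`. Whether this is faster or slower than RtlMoveMemory on a given machine is something to measure rather than assume.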

score 2 · Accepted Answer

Hmm... I am not sure whether the .NET bitmap data formats are entirely compatible with the Windows GDI32 functions...

But one of the first few Win32 APIs I learned was BitBlt:

BOOL BitBlt(
  HDC hdcDest, 
  int nXDest, 
  int nYDest, 
  int nWidth, 
  int nHeight, 
  HDC hdcSrc, 
  int nXSrc, 
  int nYSrc, 
  DWORD dwRop
);

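If I remember correctly, it was the fastest way to copy data around.

Here is the BitBlt PInvoke signature for use in C#, along with related usage information. It is a great read for anyone working with high-performance graphics in C#: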

http://www.pinvoke.net/default.aspx/gdi32/BitBlt.html

Definitely worth a look.
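
As a rough illustration of how the native prototype above maps to managed code, here is a minimal declaration sketch. The class name `Gdi32Sketch` is hypothetical; the parameter names follow the native prototype, and `SRCCOPY` is the raster-operation code for a plain copy. Note that BitBlt operates on device contexts (HDC handles), not on the `Scan0` pointers of a `BitmapData`, so it is not a drop-in replacement for the method in the question, and it is only available on Windows.

```csharp
using System;
using System.Runtime.InteropServices;

public static class Gdi32Sketch
{
    // Raster-operation code for a plain source-to-destination copy.
    public const uint SRCCOPY = 0x00CC0020;

    // Managed declaration mirroring the native prototype above.
    // HDC handles are passed as IntPtr; DWORD maps to uint.
    [DllImport("gdi32.dll", SetLastError = true)]
    [return: MarshalAs(UnmanagedType.Bool)]
    public static extern bool BitBlt(
        IntPtr hdcDest, int nXDest, int nYDest, int nWidth, int nHeight,
        IntPtr hdcSrc, int nXSrc, int nYSrc, uint dwRop);
}
```

With System.Drawing, the two HDC values would typically come from `Graphics.GetHdc()` and must be handed back with `Graphics.ReleaseHdc()` afterwards.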

score 1 · Accepted Answer

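Unfortunately I don't have time to write a full solution, but I would look into using the platform RtlMoveMemory() function to move whole rows at a time rather than byte by byte. That should be a lot faster.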

score 1 · Accepted Answer

The inner loop is where you want to concentrate most of your effort (but do take measurements to make sure).

for (int j = 0; j < sourcebtmpdata.Width; j++)
{
    destRowPtr[(point.X + j) * 3] = srcRowPtr[j * 3];
    destRowPtr[((point.X + j) * 3) + 1] = srcRowPtr[(j * 3) + 1];
    destRowPtr[((point.X + j) * 3) + 2] = srcRowPtr[(j * 3) + 2];
}

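Get rid of the multiplications and the array indexing (which is a multiply under the hood) and replace them with a pointer that you increment.
Ditto with the +1, +2: increment a pointer instead.
Your compiler probably won't keep recomputing point.X (check), but make a local variable just in case. It won't recompute it within a single iteration, but it might on each iteration.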
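
The three suggestions above can be sketched roughly as follows. The class `PointerIncrementSketch` and the helper `CopyRow` are hypothetical names; the sketch assumes 3 bytes per pixel, as in the question, and needs unsafe code enabled.

```csharp
using System;

public static class PointerIncrementSketch
{
    // Hypothetical helper: copies pixelCount 24bpp pixels from srcRow into
    // destRow, starting at pixel column x of the destination row.
    public static unsafe void CopyRow(byte* srcRow, byte* destRow, int x, int pixelCount)
    {
        // One multiplication up front instead of several per pixel;
        // x is read once into the starting pointer.
        byte* dest = destRow + x * 3;
        byte* src = srcRow;
        byte* srcEnd = srcRow + pixelCount * 3;

        while (src < srcEnd)
        {
            // Three byte copies per pixel, each advancing both pointers.
            *dest++ = *src++;
            *dest++ = *src++;
            *dest++ = *src++;
        }
    }
}
```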

score 1 · Accepted Answer

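You may want to look at Eigen.

It is a C++ template library that uses the SSE (2 and later) and AltiVec instruction sets, with graceful fallback to non-vectorized code.

Fast. (See benchmark.)
Expression templates make it possible to intelligently remove temporaries and enable lazy evaluation when that is appropriate. Eigen takes care of this automatically and, in most cases, handles aliasing too.
Explicit vectorization is performed for the SSE (2 and later) and AltiVec instruction sets, with graceful fallback to non-vectorized code. Expression templates make it possible to perform these optimizations globally for whole expressions.
With fixed-size objects, dynamic memory allocation is avoided, and loops are unrolled when that makes sense.
For large matrices, special attention is paid to cache-friendliness.

You could implement your function in C++ and call it from C#.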

score 1 · Accepted Answer

You don't always need to use pointers to get good speed. This should be within a few milliseconds of the original:

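    private static void CopyBitmapToDest(BitmapData sourcebtmpdata, BitmapData destbtmpdata, Point point)
    {
        const int pixelSize = 3; // assumes Format24bppRgb and a positive Stride
        int totalRows = Math.Min(destbtmpdata.Height - point.Y, sourcebtmpdata.Height);
        int copyLength = Math.Min(destbtmpdata.Width - point.X, sourcebtmpdata.Width) * pixelSize;
        if (totalRows <= 0 || copyLength <= 0) return;

        // size the buffers by Stride, not Width * 3, because rows may be padded
        byte[] src = new byte[sourcebtmpdata.Stride * totalRows];
        byte[] dest = new byte[destbtmpdata.Stride * totalRows];
        // only the destination rows that will be touched are copied out and back
        IntPtr destStart = IntPtr.Add(destbtmpdata.Scan0, point.Y * destbtmpdata.Stride);
        Marshal.Copy(sourcebtmpdata.Scan0, src, 0, src.Length);
        Marshal.Copy(destStart, dest, 0, dest.Length);

        int pointX = point.X * pixelSize;
        for (int i = 0; i < totalRows; i++)
        {
            // copy one source row into the destination row at the column offset
            Array.Copy(src, i * sourcebtmpdata.Stride, dest, i * destbtmpdata.Stride + pointX, copyLength);
        }
        Marshal.Copy(dest, 0, destStart, dest.Length);
    }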

score 0 · Accepted Answer

I think the stride size and the row count limits can be calculated in advance.

I also pre-calculated all the multiplications, which resulted in the following code:

private static unsafe void CopyBitmapToDest(BitmapData sourcebtmpdata, BitmapData destbtmpdata, Point point)
{
    //TODO: It is expected that the bitmap PixelFormat is Format24bppRgb but this could change in the future
    const int pixelSize = 3;

    // calculate total number of rows to draw.
    var totalRow = Math.Min(
        destbtmpdata.Height - point.Y,
        sourcebtmpdata.Height);

    var rowSize = Math.Min(
        (destbtmpdata.Width - point.X) * pixelSize,
        sourcebtmpdata.Width * pixelSize);

    // starting point of copy operation
    byte* srcPtr = (byte*)sourcebtmpdata.Scan0;
    // offset by both the destination row and the destination column (in bytes)
    byte* destPtr = (byte*)destbtmpdata.Scan0 + point.Y * destbtmpdata.Stride + point.X * pixelSize;

    // loop through each row
    for (int i = 0; i < totalRow; i++) {

        // draw the entire row
        for (int j = 0; j < rowSize; j++)
            destPtr[j] = srcPtr[j];

        // advance each pointer by 1 row
        destPtr += destbtmpdata.Stride;
        srcPtr += sourcebtmpdata.Stride;
    }

}

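I haven't tested it thoroughly, but you should be able to get it to work.

I have removed the multiplication operations from the loop (pre-calculating them instead) and removed most of the branches, so it should be somewhat faster.

Let me know if this helps :-)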

score 0 · Accepted Answer

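I am looking at your C# code and I can't recognize anything familiar. It all looks like a ton of C++. By the way, it looks like DirectX/XNA needs to become your new friend. Just my 2 cents. Don't kill the messenger.

If you must rely on the CPU to do this: I have done some 24-bit layout optimizations myself, and memory access speed should be your bottleneck. Use SSE3 instructions for the fastest possible byte-wise access. This means C++ and embedded assembly language. In pure C you will be 30% slower on most machines.

Keep in mind that modern GPUs are much faster than CPUs at this sort of operation.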

score 0 · Accepted Answer

I am not sure whether this will give any extra performance, but I see this pattern a lot in Reflector.

So:

int srcByte = j *3;
destRowPtr[(firstBlueByte)] = srcRowPtr[srcByte];
destRowPtr[(firstBlueByte) + 1] = srcRowPtr[srcByte + 1];
destRowPtr[(firstBlueByte) + 2] = srcRowPtr[srcByte + 2];

Becomes:

*destRowPtr++ = *srcRowPtr++;
*destRowPtr++ = *srcRowPtr++;
*destRowPtr++ = *srcRowPtr++;

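It probably needs more braces.

If the width is fixed, you could probably unroll the entire row into a few hundred lines. :)

Update

You could also try using a larger type, such as Int32 or Int64, for better performance.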
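
One way the larger-type idea might look is sketched below. The class `WideCopySketch` and the method `CopyBytes` are hypothetical names. The sketch moves 8 bytes at a time through `ulong*` and finishes the remainder byte by byte; it assumes the platform tolerates unaligned 64-bit loads and stores (x86 and x64 do) and that the two ranges do not overlap. It needs unsafe code enabled.

```csharp
using System;

public static class WideCopySketch
{
    // Hypothetical helper: copies byteCount bytes from src to dest.
    public static unsafe void CopyBytes(byte* src, byte* dest, int byteCount)
    {
        int i = 0;

        // Move 8 bytes per iteration while at least 8 remain.
        for (; i + 8 <= byteCount; i += 8)
        {
            *(ulong*)(dest + i) = *(ulong*)(src + i);
        }

        // Copy the remaining 0..7 bytes one at a time.
        for (; i < byteCount; i++)
        {
            dest[i] = src[i];
        }
    }
}
```

For a 24bpp row, `byteCount` would be the clipped row width times 3, which is usually not a multiple of 8, hence the byte-wise tail loop.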

score 0 · Accepted Answer

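Alright, this is getting fairly close to the limit of how many milliseconds you can squeeze out of the algorithm, but get rid of the calls to Math.Min and replace them with the ternary operator instead.

Generally, making a library call takes longer than doing something yourself. To confirm this, I wrote a simple test driver for Math.Min.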

using System;
using System.Diagnostics;

namespace TestDriver
{
    class Program
    {
        static void Main(string[] args)
        {
            // Report whether the high resolution timer is available
            if (Stopwatch.IsHighResolution)
            { Console.WriteLine("Using high resolution timer"); }
            else
            { Console.WriteLine("High resolution timer unavailable"); }
            // Test Math.Min for 10000 iterations
            Stopwatch sw = Stopwatch.StartNew();
            for (int ndx = 0; ndx < 10000; ndx++)
            {
                int result = Math.Min(ndx, 5000);
            }
            Console.WriteLine(sw.Elapsed.TotalMilliseconds.ToString("0.0000"));
            // Test ternary operator for 10000 iterations
            sw = Stopwatch.StartNew();
            for (int ndx = 0; ndx < 10000; ndx++)
            {
                int result = (ndx < 5000) ? ndx : 5000;
            }
            Console.WriteLine(sw.Elapsed.TotalMilliseconds.ToString("0.0000"));
            Console.ReadKey();
        }
    }
}

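These are the results of running the above on my computer, an Intel T2400 @1.83GHz. There is some variation between runs, but in general the ternary operator is about 0.01 ms faster. That is not much, but over a large enough data set it adds up.

Using high resolution timer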
0.0539
0.0402
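
A difference of about 0.01 ms over 10000 iterations is small enough to be influenced by JIT compilation and by the unused `result` variable. The sketch below is one possible way to make such a comparison a little more robust: it accumulates the results so the work cannot be discarded, and it makes a warm-up call before timing. The class `MinBenchmarkSketch` and its method names are hypothetical, and no particular timings are implied.

```csharp
using System;
using System.Diagnostics;

public static class MinBenchmarkSketch
{
    public static long SumWithMathMin(int iterations)
    {
        long sum = 0;
        for (int ndx = 0; ndx < iterations; ndx++)
            sum += Math.Min(ndx, 5000);
        return sum;
    }

    public static long SumWithTernary(int iterations)
    {
        long sum = 0;
        for (int ndx = 0; ndx < iterations; ndx++)
            sum += (ndx < 5000) ? ndx : 5000;
        return sum;
    }

    // Times one call of f and returns the elapsed milliseconds.
    public static double TimeMs(Func<int, long> f, int iterations, out long result)
    {
        Stopwatch sw = Stopwatch.StartNew();
        result = f(iterations);
        sw.Stop();
        return sw.Elapsed.TotalMilliseconds;
    }

    public static void Run()
    {
        const int iterations = 100000000;
        long r1, r2;

        // Warm-up calls so that first-call JIT compilation is not timed.
        SumWithMathMin(1000);
        SumWithTernary(1000);

        double t1 = TimeMs(SumWithMathMin, iterations, out r1);
        double t2 = TimeMs(SumWithTernary, iterations, out r2);
        Console.WriteLine("Math.Min: {0:0.0000} ms (sum {1})", t1, r1);
        Console.WriteLine("Ternary:  {0:0.0000} ms (sum {1})", t2, r2);
    }
}
```

Calling `MinBenchmarkSketch.Run()` from a release build, several times, gives a better picture than a single short run.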
