assembly - このアセンブリコードを最適化する

Question

私は現在、コンピュータアーキテクチャのコースを受講しており、基本的な R タイプおよび I タイプの命令 (これもRISCアーキテクチャです) などについて学習しています。これを最適化する方法がわかりません。コード。

説明: このコードは、ゼロに到達するまで、数字の配列 ($s1 が指す) に単語を追加します。結果は $t1 に格納されます。$t0 は現在の単語を保持します。

        add   $t1, $zero, $zero      # Initialize result to zero
again:  
        lw    $t0, 0($s1)            # Load the word from the array
        beq   $t0, $zero, done       # Terminate if current word is a zero
        add   $t1, $t1, $t0          # Add current word to result
        addi  $s1, $s1, 4            # Point to the next word in the array
        beq   $t1, $t1, again        # Loop again
done:
        nop                          # Do nothing

コードの最適化に苦労しています。(常に true であるため) は不要だと思いbeq $t1, $t1, againますが、削除する方法がわかりません。これが私の試みですが、コードが終了しないことに気付きました。

        add   $t1, $zero, $zero      # Initialize result to zero
again:  
        lw    $t0, 0($s1)            # Load the word from the array
        add   $t1, $t1, $t0          # Add current word to result
        addi  $s1, $s1, 4            # Point to the next word in the array
        bne   $t1, $zero, again      # If result is not zero, loop
done:
        nop                          # Do nothing

終了ゼロをチェックして完了にジャンプすることは決してありません。しかし、別のチェックを追加すると、コードは以前と同じになるのではないでしょうか?

score 2 · Accepted Answer

通常、上部ループのテストを下部ループのテストに変換します。これを行うには、最初の反復でループ本体の途中に (多かれ少なかれ) 頻繁にジャンプする必要があります。疑似コードでは、現在持っているものは基本的に次のとおりです。

    initialize sum
beginning:
    load a word
    if (done) goto end
    add to sum
    increment pointer
    goto beginning
end:

それを最適化するには、構造を次のように変更します。

    initialize sum
    goto start_loop

beginning:
    add to sum
    increment pointer
start_loop:
    load a word
    if (!done) goto beginning

この方法では、ループごとに 2 つではなく 1 つのジャンプしかありません (また、後方への短いジャンプであるため、ターゲットはほぼ常にキャッシュ内にあります)。

そうは言っても、この最適化は実際にはほとんど時代遅れになっていることを付け加えておく必要があります。まともな分岐予測があれば、無条件ジャンプは通常本質的に無料です。

編集: ループのアンローリングが言及されているので、それについて 2 セントを追加します。分岐予測は、より多くの命令を並行して実行するために使用できない限り、通常、ループ展開を時代遅れにします。それはここでは問題ではありませんが、実生活ではしばしば役に立ちます。たとえば、2 つの個別の加算器が利用可能な CPU を想定すると、ループの 2 つの反復を展開し、それらの反復の結果を 2 つの個別のターゲットレジスタに追加できるため、2 つの値を同時に加算することで両方の加算器を利用できます。時間。次に、ループが終了したら、これら 2 つのレジスタを合計して最終的な値を取得します。C ライクな疑似コードでは、次のようになります。

odds = 0
evens = 0

do {   
    evens += pointer[0];
    odds += pointer[1];
    pointer += 2;
while (pointer[0] && pointer[1]);
total = odds + evens;

書かれているように、これにより、1 つだけではなく 2 つの連続するゼロがシーケンスを終了するためのマイナーな追加要件が追加され、配列内に最低 2 つの項目が追加されます。ただし、ここで大きな利点をもたらすのは実際にはループの展開ではなく、並列実行であることに注意してください。

それがなければ、(両方が正しく予測されたとしても) 取られなかった分岐が取られた分岐よりもコストがかからない場合にのみ、ループ展開の利点が実際に見られます。古いプロセッサ (古い Intel など) では、分岐を取得すると、取得されていない分岐と比較してペナルティが発生しました。同時に、展開されたループはより多くのキャッシュスペースを使用します。したがって、最新のプロセッサでは、全体的な損失になることがよくあります (前に述べたように、アンローリングを使用して並列性を得ることができない限り)。

score 1 · Accepted Answer

ループを展開します。


        add   $t1, $zero, $zero      # Initialize result to zero
again:  
        lw    $t0, 0($s1)            # Load the word from the array
        beq   $t0, $zero, done       # Terminate if current word is a zero
        add   $t1, $t1, $t0          # Add current word to result
        addi  $s1, $s1, 4            # Point to the next word in the array

        lw    $t0, 0($s1)            # Load the word from the array
        beq   $t0, $zero, done       # Terminate if current word is a zero
        add   $t1, $t1, $t0          # Add current word to result
        addi  $s1, $s1, 4            # Point to the next word in the array

        lw    $t0, 0($s1)            # Load the word from the array
        beq   $t0, $zero, done       # Terminate if current word is a zero
        add   $t1, $t1, $t0          # Add current word to result
        addi  $s1, $s1, 4            # Point to the next word in the array

        lw    $t0, 0($s1)            # Load the word from the array
        beq   $t0, $zero, done       # Terminate if current word is a zero
        add   $t1, $t1, $t0          # Add current word to result
        addi  $s1, $s1, 4            # Point to the next word in the array

        # and so on to a reasonable amount, 4-8 times are common.

        b     again        # Loop again
done:
        nop

assembly - このアセンブリ コードを最適化する

2 に答える 2

Related

Reference

assembly - このアセンブリコードを最適化する