android - Speeding up floating-point arithmetic (Android ARMv6)

Question

I am doing image compression on Android using native code. For various reasons, I cannot use a prebuilt library.

I profiled my code with android-ndk-profiler and found that the bottleneck is, surprisingly, floating-point arithmetic. Here is the profile output:

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 40.37      0.44     0.44                             __addsf3
 11.93      0.57     0.13     7200     0.02     0.03  EncodeBlock
  6.42      0.64     0.07   535001     0.00     0.00  BitsOut
  6.42      0.71     0.07                             __aeabi_fdiv
  6.42      0.78     0.07                             __gnu_mcount_nc
  5.50      0.84     0.06                             __aeabi_fmul
  5.50      0.90     0.06                             __floatdisf
  ...

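I googled __addsf3, and it appears to be a software floating-point routine. Yuck. I did more research on the ARMv6 architecture core, and unless I missed something, it has no hardware floating-point support. So what can I do here to speed this up? Fixed point? I know that is normally done with integers, but I am not sure how to convert my code to do it. Is there a compiler flag I could set to make it do that? Other suggestions are welcome.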

score 8 · Accepted Answer

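Of course you can do anything with integer arithmetic alone (after all, that is exactly what your program is doing right now), but whether it can be done faster depends on what exactly you are trying to do.

Floating point is a sort of generic solution that you can apply in most cases and then forget about, but it is rare that you actually need numbers ranging from the incredibly small to the incredibly large, all with 52 bits of mantissa precision. Supposing your computations are about graphics using double-precision floating point, you can go from much smaller than subatomic scale to much bigger than the size of the universe... is that range really needed? Of course, with FP the precision you get depends on the scale, but how much precision do you really need?

What numbers are used in your "inner loop"? Without knowing that, it is hard to tell whether the computation can be sped up by a lot. Almost surely it can be sped up (FP is a generic, blind solution), but the gain you can expect varies a lot. I don't know the specific implementation, but I would expect it to be reasonably efficient (for the generic case).

You should aim for an optimization at a higher logical level.

For image (de)compression based on DCT or wavelet transforms, for example, I think there is really no need for floating-point arithmetic: you can account for the exact scale of the numbers involved and use integer arithmetic. Moreover, you may have an extra degree of freedom because you can produce approximate results.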
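
As a rough illustration of "account for the scale and use integers", here is a minimal sketch of one step that often appears in image compression: a weighted sum with fractional coefficients. The asker's code is not shown, so the function name `luma_q8` and the choice of the RGB-to-luma weights (0.299, 0.587, 0.114) are assumptions made only for this example. The weights are pre-multiplied by 256, so the inner loop needs only integer multiplies, adds, and one shift.

```c
#include <stdint.h>

/* Hypothetical example: weights scaled by 2^8 and rounded.
   0.299*256 ~ 77, 0.587*256 ~ 150, 0.114*256 ~ 29; they sum to exactly 256. */
enum { LUMA_R = 77, LUMA_G = 150, LUMA_B = 29, LUMA_SHIFT = 8 };

/* Integer-only weighted sum; the result stays within 0..255. */
uint8_t luma_q8(uint8_t r, uint8_t g, uint8_t b)
{
    uint32_t acc = (uint32_t)LUMA_R * r
                 + (uint32_t)LUMA_G * g
                 + (uint32_t)LUMA_B * b;
    acc += 1u << (LUMA_SHIFT - 1);      /* add half an output unit to round */
    return (uint8_t)(acc >> LUMA_SHIFT); /* drop the 8 fractional bits */
}
```

The result can differ from the floating-point version by one unit in some cases, which is the kind of approximation an image codec can usually tolerate. More fractional bits give more accuracy as long as the accumulator does not overflow.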

score 2 · Accepted Answer

See 6502's excellent answer first...

Most processors don't have FPUs because they are not needed. And when they do, for some reason they try to conform to IEEE 754, which is equally unnecessary; the cases that need any of that are quite rare. The FPU is just an integer ALU with some stuff around it to keep track of the floating point, all of which you can do yourself.

How? Let's think in decimals and dollars. We can think about $110.50, add $0.07, and get $110.57, or you could have just done everything in pennies, 11050 + 7 = 11057, and then, when you print it for a user, place a dot in the right place. That is all the FPU is doing, and that is all you need to do. This link may or may not give some insight into this: http://www.divms.uiowa.edu/~jones/bcd/divide.html
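
The pennies idea can be sketched in a few lines of C. The names `to_cents` and `format_cents` are made up for this illustration, and the sketch assumes non-negative amounts to keep the formatting simple.

```c
#include <stdint.h>
#include <stdio.h>

/* Store money as a whole number of cents; the "point" is implied. */
int32_t to_cents(int32_t dollars, int32_t cents)
{
    return dollars * 100 + cents;
}

/* Put the dot back only when printing (non-negative values assumed).
   Returns what snprintf returns. */
int format_cents(int32_t value, char *buf, size_t size)
{
    return snprintf(buf, size, "$%ld.%02ld",
                    (long)(value / 100), (long)(value % 100));
}
```

Adding `to_cents(110, 50)` and `to_cents(0, 7)` is a plain integer add that gives 11057, and `format_cents` turns that into "$110.57". Binary fixed point works the same way, except that the implied scale is a power of two instead of 100, so rescaling becomes a shift.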

Don't blanket all ARMv6 processors that way; that is not how ARMs are categorized. Some cores have the option for an FPU, or you can add one on yourself after you buy from ARM, etc. The ARM11s, for example, are ARMv6 with the option for an FPU.
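
One way to find out what a particular device has is to look at the "Features" line that Linux on ARM prints in /proc/cpuinfo, where a VFP unit normally shows up as the word "vfp". The sketch below assumes that format; the function names `has_feature` and `cpu_reports_vfp` are made up for this example. Even when it reports VFP, the code still has to be compiled to use the hardware instead of the soft-float helpers such as __addsf3.

```c
#include <stdio.h>
#include <string.h>

/* Returns 1 if `token` appears as a whole word in `line`, else 0. */
int has_feature(const char *line, const char *token)
{
    size_t len = strlen(token);
    const char *p = line;
    if (len == 0) return 0;
    while ((p = strstr(p, token)) != NULL) {
        /* Require a separator (or string boundary) on both sides. */
        int starts = (p == line) || p[-1] == ' ' || p[-1] == '\t' || p[-1] == ':';
        char end = p[len];
        int ends = (end == '\0' || end == ' ' || end == '\t' || end == '\n');
        if (starts && ends) return 1;
        p += 1;  /* keep scanning past this partial match, e.g. "vfpv3" */
    }
    return 0;
}

/* Returns 1 if /proc/cpuinfo lists "vfp", 0 if not, -1 if unreadable. */
int cpu_reports_vfp(void)
{
    char line[512];
    int found = 0;
    FILE *f = fopen("/proc/cpuinfo", "r");
    if (f == NULL) return -1;
    while (fgets(line, sizeof line, f) != NULL) {
        if (strncmp(line, "Features", 8) == 0 && has_feature(line, "vfp")) {
            found = 1;
            break;
        }
    }
    fclose(f);
    return found;
}
```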

Also, just because you can keep track of the decimal point yourself doesn't mean you always should: if there is a hard FPU, it may well be faster than doing it yourself in fixed point. Likewise, it is possible and easy to not know how to use an FPU and get bad results, just get them faster. It is very easy to write bad floating-point code. Whether you use fixed or float, you need to keep track of the range of your numbers, and from that control where you move the point around to keep the integer math at the core within the mantissa. This means that to use floating point effectively you should be thinking in terms of what the integer math is doing. One very common mistake is to think that multiplies mess up your precision, when it is actually addition and subtraction that can hurt you the most.
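
A small sketch of that last point, assuming IEEE 754 single precision (24 significant bits). The function names are made up for this example, and `volatile` is used only to force each intermediate result to be rounded to a real `float`.

```c
/* (big + small) - big: the addition must fit both values into one
   24-bit mantissa, so a small operand can be rounded away entirely. */
float add_then_subtract(float big, float small)
{
    volatile float sum = big + small;
    volatile float result = sum - big;
    return result;
}

/* (big * small) / big: a multiply only adds exponents and multiplies
   mantissas, so the small operand is not swallowed by the large one. */
float mul_then_divide(float big, float small)
{
    volatile float product = big * small;
    volatile float result = product / big;
    return result;
}
```

With big = 16777216.0f (2^24) and small = 1.0f, `add_then_subtract` gives 0.0f because 16777217 is not representable in a float, while `mul_then_divide(16777216.0f, 3.0f)` gives back exactly 3.0f. In fixed point the same addition is exact as long as the integer does not overflow, which is why knowing the range of your numbers matters either way.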
