optimization - Haskell テキスト処理の最適化

Question

Haskell でいくつかの単純な文字カウントルーチンを作成し、統計を新しいデータ型に格納しています。

data Stat = Stat {
    stChars    :: !Int,
    stVowels   :: !Int,
    stPairEL   :: !Int,
    stWords    :: !Int
}

これを、それぞれ約 50K ～ 100K の数百または数千のプレーンテキストファイルで実行しています。

tabulateFile :: FilePath -> IO Stat
tabulateFile path = do
  putStrLn path
  contents <- L.readFile path
  return $! tabulateText ' ' contents defaultStat

左折を使用する代わりに、プリミティブ再帰を使用しているため、前の文字を保持できます。

tabulateText :: Char -> L.ByteString -> Stat -> Stat
tabulateText lastChr bs stat =
  case U.uncons bs of
    Nothing -> stat
    Just (chr, newBs) ->
      tabulateText lchr newBs (countChar lastChr lchr stat)
        where lchr = toLower chr

{-# INLINE countChar #-}
countChar :: Char -> Char -> Stat -> Stat
countChar !lastChr !chr !(Stat stChars stVowels stPairEL stWords) =
  Stat
    (stChars  + 1)
    (stVowels + (countIf $ isVowel chr))
    (stPairEL + (countIf (lastChr == 'e' && chr == 'l')))
    (stWords  + (countIf ((not $ isLetter lastChr) && isLetter chr)))

isVowel :: Char -> Bool
isVowel c = Set.member c vowels

vowels = Set.fromAscList ['a', 'e', 'i', 'o', 'u', ...] -- rest of vowels elided

現時点では、実行cat * | wc中の 2 倍以上遅いですが、ファイル I/O が必要な CPU 時間をかなりのマージンで上回っているはずだと直感的にわかります。ホットキャッシュで約 20MB/s のプロセスを使用するだけcat * | wcですが、私の Haskell プログラム ( でコンパイル-O) を使用すると、いくつかの基本的な最適化の後でも、10MB/s 未満で実行されます。プロファイリングによると、ほとんどの時間はとで費やされtabulateTextていcountCharます。

ここで最適化できることを見逃しているものはありますか?

編集: http://hpaste.org/74638に貼り付けられた完全なファイル

score 10 · Accepted Answer

誰かがコードをコンパイルできるように、インポートを提供する必要があります。ただし、ここには可能性が高いと思われることがいくつかあります。

コンパイル-O2 -funbox-strict-fields（厳密なフィールドの利点を得るには）
tabulateTextはlastChr、およびで厳密にする必要がありますstat
Set.memberは、同等性の比較を行うための非常に費用のかかる方法のようです。ジャンプテーブルを使用します。

例えば

isSpaceChar8 :: Char -> Bool
isSpaceChar8 c =
    c == ' '     ||
    c == '\t'    ||
    c == '\n'    ||
    c == '\r'    ||
    c == '\f'    ||
    c == '\v'    ||
    c == '\xa0'

これはインライン化され、非常にうまく最適化されます。

何をするのかわからcountIfないが、それはひどい。私はそれを疑って、ifあなたは0を返しますか？どうですか：

Stat
   (a + 1)
   (if isVowel c then a + 1 else a)
   ...

次に、コアを見てください。

score 4 · Accepted Answer

{-# LANGUAGE BangPatterns #-}
import qualified Data.ByteString.Lazy.Char8 as U
import qualified Data.ByteString.Lazy as L
import Data.Word
import Data.Char
import Control.Applicative 

data Stat = Stat {
    stChars    :: !Int,
    stVowels   :: !Int,
    stPairEL   :: !Int,
    stWords    :: !Int
} deriving Show
defaultStat = Stat 0 0 0 0

{-# INLINE  tabulateFile #-}
tabulateFile :: FilePath -> IO Stat
tabulateFile path = newTabulate <$> L.readFile path

{-#  INLINE newTabulate #-}
newTabulate :: L.ByteString -> Stat 
newTabulate = snd . U.foldl' countIt (' ',defaultStat) 
    where 
        {-#  INLINE countIt #-}
        countIt :: (Char,Stat) -> Char -> (Char,Stat)
        countIt (!lastChr,!Stat stChars stVowels stPairEL stWords) !chr = 
            (chr,Stat
                (stChars  + 1)
                (if isVowel chr then stVowels + 1 else stVowels)
                (if (lastChr == 'e' && chr == 'l') then stPairEL + 1 else stPairEL)
                (if ((isSpace lastChr) && isLetter chr) then stWords+1 else stWords))

{-# INLINE isVowel #-}
isVowel :: Char -> Bool
isVowel c = 
    c == 'e' ||
    c == 'a' ||
    c == 'i' ||
    c == 'o' ||
    c == 'u' 



main:: IO ()
main = do 
    stat <- tabulateFile "./test.txt"
    print stat

Don によって提案された最適化のほとんどは、効率的な foldl の使用とともに含まれています。パフォーマンスは cat + wc よりわずかに遅くなりますが、さらに計算を行っているので問題ありません。非常に大きなファイルでテストしていませんが、cat + wc と同等に動作するはずです。

でコンパイルし-O2 -funbox-strict-fieldsて、最適化されたコードを取得します。

コアを調べた後、さらに変更し、さらに最適化が可能かどうかを確認します。最適化の 1 つの考えられるポイントは、stat の構築中にコンストラクターから if 条件を使用することです。たとえば、ifchrが母音の場合、それは既にであるletterため、stWords などで他の if は必要ありませんが、それは実際にコードを爆破します。ただし、大きなファイルで本当に役立つかどうかを試すことができます。

score 0 · Accepted Answer

他の代替案をテストした後、CPU使用率が高いのは、主にData.ByteString.Lazy.UTF8を使用していたためと思われます。UTF8 ByteStringでtabulateText使用するようにデータ構造を変更することで、実行時間のごくわずかな部分を削減しました。foldl

それを考えると、私はファイル上でプログラムを並列化し、時々私のマシンで7倍のスピードアップを得ることができました。

私は最初に：で包みtabulateFileましたunsafePerformIO

unsafeTabulateFile :: FilePath -> Stat
unsafeTabulateFile f =
  unsafePerformIO $ tabulateFile f

その後、を行うために使用さControl.Parallel.StrategiesれますparMap rseq unsafeTabulateFile files。

optimization - Haskell テキスト処理の最適化

3 に答える 3

Related

Reference