c - より高速な ByteString 構築のヒント

Question

私はHaskellを初めて使用し、効率の問題に悩まされています..

タスク: 列のサイズが一定の 4 GB のテキストファイルから CSV ファイルを作成する

列のサイズは既知です。たとえば、[col1: 幅 4 文字、col2: 幅 2 文字など...
ファイルには [A-Z0-9] ASCII 文字のみを含めることができるため、セルをエスケープしても意味がありません。

I have: 

$ cat example.txt 
AAAABBCCCC...
AAA1B1CCC1...
... (72 chars per line, usually 50 mln lines)


I need: 
$ cat done.csv
AAAA,BB,CCCC, ...
AAA1,B1,CCC1, ...
...

これは Haskell で最速のコードで、4GB のファイル全体を処理するのに約 2 分かかります。
最大 30 秒必要です

import qualified Data.ByteString.Lazy as BL
import qualified Data.ByteString as B
import qualified Data.ByteString.Unsafe as U
import Data.ByteString.Lazy.Builder
import Data.Monoid
import Data.List

col_sizes = intercalate [1] $ map (`replicate` 0) cs
  where
    cs = [4, 4, 4, 3, 5, 1, 1, 3, 3, 3, 3, 3, 3, 10, 3, 1, 1, 1, 2, 3, 10]

sp = char8 ',' -- column separator
nl = char8 '\n'

separator !cs !cl !xs !xl !ci !xi
  | c  == 1   = ps
  | xi == xl  = mempty -- at the end of bytestring, end recursion
  | cl == ci  = pr
  | otherwise = pc
  where
    c  = U.unsafeIndex cs ci         -- get column separation indicator
    w  = word8 . U.unsafeIndex xs    -- get char from BS at position
    p  = separator cs cl xs xl       -- partial recursion call
    pr = nl   <> p  0       (xi + 1) -- end of row, put '\n', reset counter, recur
    ps = sp   <> p (ci + 1)  xi      -- end of column, put column separator, recur
    pc = w xi <> p (ci + 1) (xi + 1) -- in the middle of column, copy byte, recur


main = do
  contents <- B.getContents
  BL.putStr . toLazyByteString $ init_sep sp_after_char contents


init_sep cs xs = separator cs (l cs) xs (l xs) 0 0
  where l = fromIntegral . B.length

sp_after_char = B.pack col_sizes

そして、これはC http://pastebin.com/Kjz3Mugsでの私の実装です（ここに貼り付けるには長いです...）
同じファイルを処理するのに約5秒かかります

したがって、私のHaskellコードは約です。20倍遅い。

Haskell ByteString フィルターとマップは、私の C での実装よりも高速であるため
(単純な変更を行って同じファイルを処理するのに、どちらも 2 秒もかかりません)
、私の Haskell コードに何か問題があり、C を使用せざるを得ないことを願っています。

更新: テストデータジェネレーターはこちらから入手できますhttp://pastebin.com/aJ3RW3jG

本番環境では、データは 1 つのバイナリから別のバイナリにパイプされるため、ハードドライブの IO はありません

ソリューションをテストするために SSD ドライブを使用しましたが、とにかく Ext4 がそのファイルを RAM にキャッシュしたと思います

time cat test.txt > /dev/null
cat test.txt > /dev/null  0,00s user 0,35s system 99% cpu 0,353 total

バニラジェネレーター:

time ./data_builder | head -50000000 > /dev/null
./data_builder  0,02s user 1,09s system 30% cpu 3,709 total
head -50000000 > /dev/null  2,95s user 0,76s system 99% cpu 3,708 total

私のCソリューション：

time ./tocsvc < test.txt > /dev/null 
./tocsvc < test.txt > /dev/null  5,35s user 0,35s system 100% cpu 5,689 total

発電機付き

time ./data_builder | head -50000000 | ./tocsvc > /dev/null
./data_builder  0,02s user 1,18s system 18% cpu 6,460 total
head -50000000  3,15s user 1,19s system 67% cpu 6,459 total
./tocsvc > /dev/null  5,81s user 0,55s system 98% cpu 6,459 total

@GabrielGonzalez Haskell ソリューション

time ./tocsvh1 < test.txt > /dev/null 
./tocsv < test.txt > /dev/null  19,56s user 0,41s system 100% cpu 19,950 total

発電機付き

time ./data_builder | head -50000000 | ./tocsvh1 > /dev/null 
./data_builder  0,11s user 3,04s system 7% cpu 41,320 total
head -50000000  7,29s user 3,56s system 26% cpu 41,319 total
./tocsvh2 > /dev/null  33,01s user 2,42s system 85% cpu 41,327 total

私のHaskellソリューション

time ./tocsvh2 < test.txt > /dev/null 
./tocsvh2 < test.txt > /dev/null  128,63s user 2,95s system 100% cpu 2:11,45 total

発電機付き

time ./data_builder | head -50000000 | ./tocsvh2 > /dev/null 
./data_builder  0,02s user 1,26s system 28% cpu 4,526 total
head -50000000  3,17s user 1,33s system 99% cpu 4,524 total
./tocsvh2 > /dev/null  129,95s user 3,33s system 98% cpu 2:14,75 total

@LukeTaylorソリューション

time ./tocsvh3 < test.txt > /dev/null 
./tocsv < test.txt > /dev/null  324,38s user 4,13s system 100% cpu 5:28,18 total

発電機付き

time ./data_builder | head -50000000 | ./tocsvh3 > /dev/null 
./data_builder  0,43s user 4,46s system 1% cpu 5:30,34 total
head -50000000  5,20s user 2,82s system 2% cpu 5:30,34 total
./tocsv > /dev/null  329,08s user 4,21s system 100% cpu 5:32,96 total

score 0 · Accepted Answer

ファイルを1行ずつ処理するための代替コードをいくつか書きました（ByteStringそれ自体をサポートしているため）。それはうまくいくようです：

{-# LANGUAGE OverloadedStrings #-}

import qualified Data.ByteString.Lazy.Char8 as B
import Data.Int (Int64)

main = do
    contents <- B.getContents
    mapM B.putStrLn (process contents)

process :: B.ByteString -> [B.ByteString]
process bs = do
    line <- B.lines bs
    return $ B.intercalate "," $ splitLine line indices

splitLine :: B.ByteString -> [Int64] -> [B.ByteString]
splitLine l []      = [l] 
splitLine l (i:is)  = let (head, tail) = B.splitAt i l 
                      in  head : splitLine tail is

indices = [1,2,3] :: [Int64]

文字列「1223334444」のコピーを 500000 個含むファイルを作成すると、1 秒ほどで実行されます。

 $ time ./blines < blah.txt > blah2.txt

 real   0m1.280s
 user   0m1.192s
 sys    0m0.080s

それはあなたが達成しようとしていることと一致していますか？

更新: 大量のデータがある場合、これはまだかなり遅いです。800万行で、約15秒かかります。

c - より高速な ByteString 構築のヒント

更新: テスト データ ジェネレーターはこちらから入手できますhttp://pastebin.com/aJ3RW3jG

本番環境では、データは 1 つのバイナリから別のバイナリにパイプされるため、ハード ドライブの IO はありません

ソリューションをテストするために SSD ドライブを使用しましたが、とにかく Ext4 がそのファイルを RAM にキャッシュしたと思います

私のCソリューション：

@GabrielGonzalez Haskell ソリューション

私のHaskellソリューション

@LukeTaylorソリューション

2 に答える 2

Related

Reference

更新: テストデータジェネレーターはこちらから入手できますhttp://pastebin.com/aJ3RW3jG

本番環境では、データは 1 つのバイナリから別のバイナリにパイプされるため、ハードドライブの IO はありません