haskell - A more efficient way to parse a file of lines of digits in Haskell

Question

I have files of about 8 MB each, where every line holds 6 ints separated by spaces.

My current method for parsing them is:

tuplify6 :: [a] -> (a, a, a, a, a, a)
tuplify6 [l, m, n, o, p, q] = (l, m, n, o, p, q)

toInts :: String -> (Int, Int, Int, Int, Int, Int)
toInts line =
        tuplify6 $ map read stringNumbers
        where stringNumbers = split " " line

Mapping toInts over

liftM lines . readFile

returns a list of tuples. However, when I run this, it takes nearly 25 seconds to load and parse the file. Is there any way to speed this up? The file is just plain text.

score 8 · Accepted Answer

You can speed it up by using ByteStrings, for example:

module Main (main) where

import System.Environment (getArgs)
import qualified Data.ByteString.Lazy.Char8 as C
import Data.Char

main :: IO ()
main = do
    args <- getArgs
    mapM_ doFile args

doFile :: FilePath -> IO ()
doFile file = do
    bs <- C.readFile file
    let tups = buildTups 0 [] $ C.dropWhile (not . isDigit) bs
    print (length tups)

buildTups :: Int -> [Int] -> C.ByteString -> [(Int,Int,Int,Int,Int,Int)]
-- acc is built by consing, so it is in reverse order; restore file order here.
buildTups 6 acc bs = tuplify6 (reverse acc) : buildTups 0 [] bs
buildTups k acc bs
    | C.null bs = if k == 0 then [] else error ("Bad file format " ++ show k)
    | otherwise = case C.readInt bs of
                    Just (i,rm) -> buildTups (k+1) (i:acc) $ C.dropWhile (not . isDigit) rm
                    Nothing -> error ("No Int found: " ++ show (C.take 100 bs))

tuplify6:: [a] -> (a, a, a, a, a, a)
tuplify6 [l, m, n, o, p, q] = (l, m, n, o, p, q)

This runs pretty fast:

$ time ./fileParse IntList 
200000

real    0m0.119s
user    0m0.115s
sys     0m0.003s

for an 8.1 MiB file.
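As a smaller illustration of the C.readInt call that buildTups relies on, here is a minimal sketch that parses a single line. The helper name parseLine is hypothetical and not part of the answer's code. Like the code above, it skips non-digit characters before each number, so it assumes the integers are non-negative.

```haskell
import qualified Data.ByteString.Lazy.Char8 as C
import Data.Char (isDigit)

-- Hypothetical helper: collect every non-negative integer found in the input.
-- C.readInt returns Just (value, rest) when the input starts with an integer,
-- and Nothing otherwise.
parseLine :: C.ByteString -> [Int]
parseLine bs =
    case C.readInt (C.dropWhile (not . isDigit) bs) of
        Just (i, rest) -> i : parseLine rest
        Nothing        -> []

main :: IO ()
main = print (parseLine (C.pack "1 20 300 4000 50000 600000"))
```

Because readInt hands back the unconsumed remainder, no intermediate String or list of words is built, which is likely where much of the speedup comes from.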

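On the other hand, using Strings with your conversion (plus a couple of seqs to force evaluation) also took only 0.66 s, so most of the time seems to be spent working with the result rather than parsing.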

Oops, I missed a seq, so the reads were not actually evaluated in the String version. With that fixed, String + read takes about 4 seconds. The custom Int parser from @Rotsor's comment is:

foldl' (\a c -> 10*a + fromEnum c - fromEnum '0') 0

So parsing evidently did take a significant amount of the time.
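To make the String-based variant concrete, here is a rough sketch that combines the digit-folding parser quoted above with explicit forcing via seq. The names parseInt, parseLine and forceTup are hypothetical, and the parser assumes every field consists only of the characters '0' to '9' (no signs, no error checking).

```haskell
import Data.List (foldl')

-- Digit-by-digit parser from the comment quoted above; assumes the input
-- contains only the characters '0'..'9'.
parseInt :: String -> Int
parseInt = foldl' (\a c -> 10 * a + fromEnum c - fromEnum '0') 0

-- Hypothetical helper: parse one line of six space-separated integers.
parseLine :: String -> (Int, Int, Int, Int, Int, Int)
parseLine line = case map parseInt (words line) of
    [l, m, n, o, p, q] -> (l, m, n, o, p, q)
    fields -> error ("Expected 6 fields, got " ++ show (length fields))

-- Hypothetical helper: force all six components, so a timing run measures
-- the parsing itself rather than just building unevaluated thunks.
forceTup :: (Int, Int, Int, Int, Int, Int) -> ()
forceTup (l, m, n, o, p, q) =
    l `seq` m `seq` n `seq` o `seq` p `seq` q `seq` ()

main :: IO ()
main = do
    let tups = map parseLine (lines "1 2 3 4 5 6\n10 20 30 40 50 60")
    -- Walk the list and force every tuple before counting.
    foldl' (\u t -> u `seq` forceTup t) () tups `seq` print (length tups)
```

Without the forcing step, length only walks the spine of the list and never evaluates the numbers inside the tuples, which is the kind of measurement mistake described above.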
