I should preface this by saying that I am very new to Haskell and the pipes library. I would like to understand what is causing the high memory usage of this program in the test function.
Specifically, in the fold that produces the r1 value, I see MyRecord values accumulating until the final result is produced, unless deepseq is used. On my sample data set of ~500000 lines / ~230 MB, memory usage grows above 1.5 GB.
The fold that produces the r2 value runs in constant memory.
What I would like to understand is:
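1) What causes the build-up of MyRecord values in the first fold, and why does using deepseq fix it? I was randomly throwing things at it until using deepseq gave me constant memory usage, but I would like to understand why it works. Can constant memory usage be achieved without deepseq while still producing the same result type of Maybe Int?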
2) What is different about the second fold that keeps it from exhibiting the same problem?
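To make the contrast between the two step functions concrete, here is a minimal sketch that uses only base, with Data.List.foldl' standing in for P.fold (as far as I can tell, both force the accumulator only to weak head normal form after each step). The names lazyStep and strictStep are hypothetical and not part of the program in this question, and the small list stands in for the real pipeline, so this shows one possible reading of the behaviour, not a confirmed diagnosis.

```haskell
{-# LANGUAGE BangPatterns #-}
module Main where

import Data.List (foldl')
import Data.Maybe (fromMaybe)

-- Mirrors the r1 step without deepseq. Forcing the result to weak head
-- normal form only exposes the Just constructor; the addition inside it
-- (and the reference to v) can stay an unevaluated thunk.
lazyStep :: Maybe Int -> (Maybe Int, e) -> Maybe Int
lazyStep acc (v, _) = (+) <$> acc <*> pure (fromMaybe 0 v)

-- Hypothetical strict variant with the same Maybe Int result type.
-- The bang on n' makes the sum get evaluated before Just is returned,
-- so nothing inside the accumulator refers back to v.
strictStep :: Maybe Int -> (Maybe Int, e) -> Maybe Int
strictStep Nothing _ = Nothing
strictStep (Just n) (v, _) =
  let !n' = n + fromMaybe 0 v
  in Just n'

-- Stand-in for the pipeline output: parsed values plus one failed line.
inputs :: [(Maybe Int, Maybe String)]
inputs = [(Just i, Nothing) | i <- [1 .. 1000]] ++ [(Nothing, Just "bad line")]

main :: IO ()
main = do
  print (foldl' lazyStep (Just 0) inputs)   -- prints Just 500500
  print (foldl' strictStep (Just 0) inputs) -- prints Just 500500
```

Both folds return the same value; the difference would only show up in how much is retained while the fold is running. On a list this small nothing is measurable, so treat it as an illustration of evaluation order only.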
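I know that if I used just integers instead of tuples I could use the built-in sum function from Pipes.Prelude, but I will eventually want to process the second element, which contains any parse errors.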
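For reference, a small sketch of the two options mentioned above: P.sum over the integers alone, and a P.fold that also looks at the second tuple element. The Totals type, the step function and the sample producer are hypothetical names introduced only for this illustration, and String stands in for the ByteString error payload.

```haskell
module Main where

import Data.Maybe (fromMaybe)
import Pipes
import qualified Pipes.Prelude as P

-- Hypothetical accumulator with strict fields, so forcing the constructor
-- also forces the running sum and the error count.
data Totals = Totals !Int !Int
  deriving (Eq, Show)

-- Adds the parsed value (0 when missing) and counts lines that carry an error.
step :: Totals -> (Maybe Int, Maybe String) -> Totals
step (Totals s e) (v, err) =
  Totals (s + fromMaybe 0 v) (e + maybe 0 (const 1) err)

-- Stand-in for the real pipeline output.
sample :: Monad m => Producer (Maybe Int, Maybe String) m ()
sample = each [(Just 1, Nothing), (Nothing, Just "bad line"), (Just 2, Nothing)]

main :: IO ()
main = do
  -- Integers only: drop the error element and use the built-in sum.
  s <- P.sum (sample >-> P.map (fromMaybe 0 . fst))
  print s -- prints 3
  -- Keep the tuple and track both the sum and the number of errors.
  t <- P.fold step (Totals 0 0) id sample
  print t -- prints Totals 3 1
```

The strict fields are a design choice for this sketch: they keep both counters evaluated at every step without needing deepseq.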
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE FlexibleContexts #-}
{-# LANGUAGE ScopedTypeVariables #-}
module Test where
import Control.Arrow
import Control.DeepSeq
import Control.Monad
import Data.Aeson
import Data.Function
import Data.Maybe
import Data.Monoid
import Data.Text (Text)
import Pipes
import qualified Pipes.Aeson as PA (DecodingError(..))
import qualified Pipes.Aeson.Unchecked as PA
import qualified Pipes.ByteString as PB
import qualified Pipes.Group as PG
import qualified Pipes.Parse as PP
import qualified Pipes.Prelude as P
import System.IO
import Control.Lens
import qualified Control.Foldl as Fold
data MyRecord = MyRecord
  { myRecordField1 :: !Text
  , myRecordField2 :: !Int
  , myRecordField3 :: !Text
  , myRecordField4 :: !Text
  , myRecordField5 :: !Text
  , myRecordField6 :: !Text
  , myRecordField7 :: !Text
  , myRecordField8 :: !Text
  , myRecordField9 :: !Text
  , myRecordField10 :: !Int
  , myRecordField11 :: !Text
  , myRecordField12 :: !Text
  , myRecordField13 :: !Text
  } deriving (Eq, Show)
instance FromJSON MyRecord where
  parseJSON (Object o) =
    MyRecord <$> o .: "field1" <*> o .: "field2" <*> o .: "field3" <*>
    o .: "field4" <*>
    o .: "field5" <*>
    o .: "field6" <*>
    o .: "field7" <*>
    o .: "field8" <*>
    o .: "field9" <*>
    (read <$> o .: "field10") <*>
    o .: "field11" <*>
    o .: "field12" <*>
    o .: "field13"
  parseJSON x = fail $ "MyRecord: expected Object, got: " <> show x
instance ToJSON MyRecord where
  toJSON _ = undefined
test :: IO ()
test = do
  withFile "some-file" ReadMode $ \hIn
    {-
      the pipeline is composed as follows:
      1 a producer reading a file with Pipes.ByteString, splitting chunks into lines,
        and parsing the lines as JSON to produce tuples of (Maybe MyRecord, Maybe
        ByteString), the second element being an error if parsing failed
      2 a pipe filtering that tuple on a field of Maybe MyRecord, passing matching
        (Maybe MyRecord, Maybe ByteString) downstream
      3 and a pipe that picks an Int field out of Maybe MyRecord, passing (Maybe Int,
        Maybe ByteString) downstream
      pipeline == 1 >-> 2 >-> 3
      memory profiling indicates the memory build up is due to accumulation of
      MyRecord "objects", and data types comprising their fields (mainly
      Text/ARR_WORDS)
    -}
   -> do
    let pipeline = f1 hIn >-> f2 >-> f3
    -- need to use deepseq to avoid leaking memory
    -- (parentheses make the intended grouping independent of deepseq's fixity)
    r1 <-
      P.fold
        (\acc (v, _) -> (+) <$> (acc `deepseq` acc) <*> pure (fromMaybe 0 v))
        (Just 0)
        id
        (pipeline :: Producer (Maybe Int, Maybe PB.ByteString) IO ())
    print r1
    hSeek hIn AbsoluteSeek 0
    -- this works just fine as is and streams in constant memory
    r2 <-
      P.fold
        (\acc v ->
           case fst v of
             Just x -> acc + x
             Nothing -> acc)
        0
        id
        (pipeline :: Producer (Maybe Int, Maybe PB.ByteString) IO ())
    print r2
    return ()
  return ()
f1
  :: (FromJSON a, MonadIO m)
  => Handle -> Producer (Maybe a, Maybe PB.ByteString) m ()
f1 hIn = PB.fromHandle hIn & asLines & resumingParser PA.decode

f2
  :: Pipe (Maybe MyRecord, Maybe PB.ByteString) (Maybe MyRecord, Maybe PB.ByteString) IO r
f2 = filterRecords (("some value" ==) . myRecordField5)
f3 :: Pipe (Maybe MyRecord, d) (Maybe Int, d) IO r
f3 = P.map (first (fmap myRecordField10))
filterRecords
  :: Monad m
  => (MyRecord -> Bool)
  -> Pipe (Maybe MyRecord, Maybe PB.ByteString) (Maybe MyRecord, Maybe PB.ByteString) m r
filterRecords predicate =
  for cat $ \(l, e) ->
    when (isNothing l || (predicate <$> l) == Just True) $ yield (l, e)

asLines
  :: Monad m
  => Producer PB.ByteString m x -> Producer PB.ByteString m x
asLines p = Fold.purely PG.folds Fold.mconcat (view PB.lines p)

parseRecords
  :: (Monad m, FromJSON a, ToJSON a)
  => Producer PB.ByteString m r
  -> Producer a m (Either (PA.DecodingError, Producer PB.ByteString m r) r)
parseRecords = view PA.decoded
resumingParser
  :: Monad m
  => PP.StateT (Producer a m r) m (Maybe (Either e b))
  -> Producer a m r
  -> Producer (Maybe b, Maybe a) m ()
resumingParser parser p = do
  (x, p') <- lift $ PP.runStateT parser p
  case x of
    Nothing -> return ()
    Just (Left _) -> do
      (x', p'') <- lift $ PP.runStateT PP.draw p'
      yield (Nothing, x')
      resumingParser parser p''
    Just (Right b) -> do
      yield (Just b, Nothing)
      resumingParser parser p'