
I've written a daemon in Haskell that scrapes information from a webpage every 5 minutes.

The daemon originally ran fine for about 50 minutes, but then it unexpectedly died with out of memory (requested 1048576 bytes). Every time I ran it, it died after the same amount of time. When I shortened the sleep to only 30 seconds, it instead died after 8 minutes.

I realized the code that scrapes the website was incredibly memory inefficient (going from about 30M while sleeping to 250M while parsing 9M of html), so I rewrote it so that it now uses only about 15M extra while parsing. Thinking the problem was fixed, I ran the daemon overnight, and when I woke up it was actually using less memory than it had been the night before. I thought I was done, but roughly 20 hours after it had started, it crashed with the same error.

I started looking into GHC profiling but I wasn't able to get that to work. Next I started messing with RTS options: I tried setting -H64m to make the suggested heap size larger than what my program was using, and also using -Ksize to shrink the maximum stack size to see if that would make it crash sooner.
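For concreteness, the invocations looked roughly like this (the binary has to be built with -rtsopts for most RTS flags to be accepted; the exact -K value here is just an example):

ghc -O2 -rtsopts bannerstalkerd.hs
./bannerstalkerd +RTS -H64m -K1m -RTS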

Despite every change I've made, the daemon still seems to crash after a constant number of iterations. Making the parsing more memory efficient raised that number, but it still crashes. This doesn't make sense to me because none of these runs have come close to using all of my memory, much less swap space. The heap size is supposed to be unlimited by default, shrinking the stack size didn't make a difference, and all my ulimits are either unlimited or significantly higher than what the daemon is using.

In the original code I pinpointed the crash to somewhere in the html parsing, but I haven't done the same for the more memory-efficient version because 20 hours takes so long to run. I don't know if that would even be useful to know, because it doesn't seem like any specific part of the program is broken: it runs successfully for dozens of iterations before crashing.

Out of ideas, I even looked through the GHC source code for this error, and it appears to be a failed call to mmap. That wasn't very helpful to me, because I assume it isn't the root of the problem.

(Edit: code rewritten and moved to end of post)

I'm pretty new at Haskell, so I'm hoping this is some quirk of lazy evaluation or something else that has a quick fix. Otherwise, I'm fresh out of ideas.
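By a quirk of lazy evaluation I mean something like the classic foldl thunk leak; a toy example, not my actual code:

import Data.List (foldl')

leaky, strict :: Integer
leaky  = foldl  (+) 0 [1..10^7]  -- builds ten million (+) thunks before forcing any of them
strict = foldl' (+) 0 [1..10^7]  -- forces the accumulator at each step; constant space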

I'm using GHC version 7.4.2 on FreeBSD 9.1.

Edit:

Replacing the download with static html got rid of the problem, so I've narrowed it down to how I'm using http-conduit. I've edited the code above to include my networking code. The Hackage docs recommend sharing a Manager, so I've done that. They also say that for http you have to explicitly close connections, but I don't think I need to do that for httpLbs.
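For what it's worth, my understanding of what explicit cleanup would look like if it were needed (closeManager is exported by Network.HTTP.Conduit; I'm not doing this, since the loop never exits):

import Control.Exception (finally)

main :: IO ()
main = do
    manager <- newManager def
    -- Release the manager's pooled connections if the loop ever exits.
    daemonLoop manager `finally` closeManager manager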

Here's my code.

{-# LANGUAGE OverloadedStrings #-}
-- OverloadedStrings is needed for the ByteString literals below.

import Control.Concurrent (threadDelay)
import qualified Data.ByteString as B
import qualified Data.ByteString.Lazy as BL
import Text.Regex.PCRE
import Network.HTTP.Conduit

main :: IO ()
main = do
    manager <- newManager def
    daemonLoop manager

daemonLoop :: Manager -> IO ()
daemonLoop manager = do
    rows <- scrapeWebpage manager
    putStrLn $ "number of rows parsed: " ++ (show $ length rows)
    doSleep
    daemonLoop manager

scrapeWebpage :: Manager -> IO [[BL.ByteString]]
scrapeWebpage manager = do
    putStrLn "before makeRequest"
    html <- makeRequest manager
    -- Force evaluation of html.
    putStrLn $ "html length: " ++ (show $ BL.length html)
    putStrLn "after makeRequest"
    -- Breaks ~10M html table into 2d list of bytestrings.
    -- Max memory usage is about 45M, which is about 15M more than when sleeping.
    return $ map tail $ html =~ pattern
    where
        pattern :: BL.ByteString
        pattern = BL.concat $ replicate 12 "<td[^>]*>([^<]+)</td>\\s*"

makeRequest :: Manager -> IO BL.ByteString
makeRequest manager = runResourceT $ do
    defReq <- parseUrl url
    let request = urlEncodedBody params $ defReq
                    -- Don't throw errors for bad statuses.
                    { checkStatus = \_ _ -> Nothing
                    -- 1 minute.
                    , responseTimeout = Just 60000000
                    }
    response <- httpLbs request manager
    return $ responseBody response

-- Sleep for 5 minutes between iterations.
doSleep :: IO ()
doSleep = threadDelay (5 * 60 * 1000000)

-- Placeholder target URL and POST parameters; the real values are
-- site-specific and aren't relevant to the problem.
url :: String
url = "http://www.example.com/"

params :: [(B.ByteString, B.ByteString)]
params = [("key", "value")]

and its output:

before makeRequest
html length: 1555212
after makeRequest
number of rows parsed: 3608
...
before makeRequest
html length: 1555212
after makeRequest
bannerstalkerd: out of memory (requested 2097152 bytes)

Getting rid of the regex computations fixed the problem, but the error seems to happen after the networking and during the regex, so presumably I'm doing something wrong with http-conduit. Any ideas?
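One thing I'm wondering about (just a guess, not something I've confirmed): length only forces the spine of the list, so the captured ByteStrings inside rows may still be unevaluated regex thunks that hold onto the whole downloaded page through the sleep. Forcing everything would look something like this (with import Control.Exception (evaluate) at the top):

daemonLoop :: Manager -> IO ()
daemonLoop manager = do
    rows <- scrapeWebpage manager
    -- Force every captured ByteString, not just the list spine, so no
    -- thunk can keep the downloaded page alive between iterations.
    mapM_ (mapM_ (evaluate . BL.length)) rows
    putStrLn $ "number of rows parsed: " ++ show (length rows)
    doSleep
    daemonLoop manager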

Also, when I try to compile with profiling enabled I get this error:

Could not find module `Network.HTTP.Conduit'
Perhaps you haven't installed the profiling libraries for package `http-conduit-1.8.9'?

Indeed, I have not installed profiling libraries for http-conduit and I don't know how.


2 Answers


So you've found out for yourself that you have a leak. Fiddling with compiler options and memory settings can only postpone the moment your program crashes; it can't remove the source of the problem, so no matter what you set there, you will eventually run out of memory.

I recommend carefully walking through all your impure code, primarily the parts that work with resources. Check that every resource gets released correctly. Check whether you have accumulating state, such as a growing unbounded channel. And, of course, profile it, as wisely suggested by nm.
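With a cabal-based setup, getting a heap profile looks roughly like this (you may need profiling builds of http-conduit's dependencies too; setting library-profiling: True in ~/.cabal/config makes that the default for future installs):

cabal install --reinstall --enable-library-profiling http-conduit
ghc -O2 -prof -fprof-auto -rtsopts bannerstalkerd.hs
./bannerstalkerd +RTS -hc -RTS    # writes a cost-centre heap profile to bannerstalkerd.hp
hp2ps -c bannerstalkerd.hp        # renders it as PostScript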

I have a scraper that parses pages and downloads files without pausing, running everything concurrently, and I've never seen it use more than 60M of memory. I've compiled it with GHC 7.4.2, GHC 7.6.1, and GHC 7.6.2 and had no problems with any of them.

Note that the root of the problem may also be in the libraries you're using. My scraper uses http-conduit, http-conduit-browser, HandsomeSoup, and HXT.

answered 2013-02-25 07:44

I ended up solving my own problem. It appears to be a GHC bug on FreeBSD. I filed a bug report and switched to Linux, and the daemon has been running flawlessly for the last few days.

answered 2013-03-03 03:23