html - Parsing tags with TagSoup in Haskell

Question

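I have been trying to learn how to extract data from HTML files in Haskell, and I have hit a wall. I have no real experience with Haskell; my background is in Python (with BeautifulSoup for HTML parsing).

I am using TagSoup to look at my HTML (it seems to be the recommended choice) and I have a basic idea of how it works. Here is the relevant segment of my code (self-contained, and it prints information for testing):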

import System.IO
import Network.HTTP
import Text.HTML.TagSoup
import Data.List

main :: IO ()
main = do
    http <- simpleHTTP (getRequest "http://www.cbssports.com/nba/scoreboard/20130310") >>= getResponseBody
    let tags = dropWhile (~/= TagOpen "div" []) (parseTags http)
    done tags where
        done xs = case xs of
            [] -> putStrLn $ "\n"
            _ -> do
                putStrLn $ show $ head xs
                done (tail xs)

However, I am not trying to reach just any "div" tag. I want to drop everything before a tag of the following form:

TagOpen "div" [("id","scores-1997830"),("class","scoreBox spanCol2")]
TagOpen "div" [("id","scores-1997831"),("class","scoreBox spanCol2 lastCol")]

I tried writing it out like this:

let tags = dropWhile (~/= TagOpen "div" [("id", "scores-[0-9]+"), ("class", "scoreBox( spanCol[0-9]?)+( lastCol)?")]) (parseTags http)

But then it tries to find the literal text [0-9]+. I have not yet found a workaround using the Text.Regex.Posix module, and escaping the characters does not work either. What is the solution here?

score 4 · Accepted Answer

~== does not do regular expressions (it compares attribute values literally), so you need to write your own matcher. Something like this:

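import Data.Maybe (isJust)
import Text.HTML.TagSoup
import Text.Regex (matchRegex, mkRegex)  -- from the regex-compat package

-- The argument is a Tag String; TagOpen is one of its constructors.
goodTag :: Tag String -> Bool
goodTag tag = tag ~== TagOpen "div" []
    && fromAttrib "id" tag `matches` "scores-[0-9]+"

-- Just a wrapper around Text.Regex.matchRegex
matches :: String -> String -> Bool
matches string regex = isJust $ mkRegex regex `matchRegex` string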
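
To show how a matcher like this could plug into the dropWhile call from the question, here is a minimal sketch. It assumes the tagsoup and regex-compat packages are installed. The names isScoreBox, matchesAll and sampleHtml are made up for this illustration, and the sample markup is an invented stand-in for the downloaded page, not the real scoreboard HTML. Unlike the wrapper above, matchesAll anchors the pattern so the whole attribute value has to match.

```haskell
import Data.Maybe (isJust)
import Text.HTML.TagSoup (Tag (..), fromAttrib, isTagOpenName, parseTags)
import Text.Regex (matchRegex, mkRegex)

-- Hypothetical helper: True when the whole string matches the regex.
-- The pattern is wrapped in ^( ... )$ so partial matches are rejected.
matchesAll :: String -> String -> Bool
matchesAll regex str = isJust (matchRegex (mkRegex ("^(" ++ regex ++ ")$")) str)

-- Hypothetical predicate: True for <div> open tags whose id and class
-- fit the two patterns given in the question.
-- isTagOpenName is checked first, so fromAttrib only sees open tags.
isScoreBox :: Tag String -> Bool
isScoreBox tag =
    isTagOpenName "div" tag
        && matchesAll "scores-[0-9]+" (fromAttrib "id" tag)
        && matchesAll "scoreBox( spanCol[0-9]?)+( lastCol)?" (fromAttrib "class" tag)

-- Invented sample markup standing in for the downloaded page.
sampleHtml :: String
sampleHtml = concat
    [ "<div id=\"header\">top</div>"
    , "<div id=\"scores-1997830\" class=\"scoreBox spanCol2\">a</div>"
    , "<div id=\"scores-1997831\" class=\"scoreBox spanCol2 lastCol\">b</div>"
    ]

main :: IO ()
main = do
    let tags = parseTags sampleHtml
    -- Drop everything before the first matching tag, as in the question.
    mapM_ print (take 1 (dropWhile (not . isScoreBox) tags))
    -- Alternatively, collect every matching open tag.
    mapM_ print (filter isScoreBox tags)
```

In the original program, the same predicate would replace the (~/= TagOpen "div" [...]) argument, that is, dropWhile (not . isScoreBox) (parseTags http). With the sample markup, the header div should be skipped and only the two scoreBox tags kept.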
