I am scraping a website and receive an HTML page.
The page contains tables whose rows map an actor to a role
(actor -> role)
For example:
(actor = Jason Priestley -> role = Brandon Walsh)
Some rows are missing either the "actor" or the "role" cell
(one column where two are expected).
Example input:
<div id="90210">
<h2 style="margin:0 0 2px 0">beverly hills 90210</h2>
<table class="actors">
<tr><td class="actor">Jennie Garth</td><td class="role">Kelly Taylor</td></tr>
<tr><td class="actor">Shannen Doherty</td></tr>
<tr><td class="actor">Jason Priestley</td><td class="role">Brandon Walsh</td></tr>
</table>
</div>
I am having trouble filtering out the rows that have only one column.
My code:
def beverlyHillsParser(page: xml.NodeSeq): Map[String, String] = {
  // Attributes are selected with the "@" prefix; a bare name would match child elements instead.
  val beverlyHillsData = page \\ "div" find ((node: xml.Node) => (node \ "@id").text == "90210")
  beverlyHillsData match {
    case Some(data) =>
      // Crude row filter: keeps any row whose serialized text mentions both words.
      val goodRows = data \\ "tr" filter (_.toString() contains "actor" ) filter (_.toString() contains "role" )
      val actors = goodRows \\ "td" filter ((node: xml.Node) => (node \ "@class").text == "actor") map (_.text)
      val roles = goodRows \\ "td" filter ((node: xml.Node) => (node \ "@class").text == "role") map (_.text)
      (actors zip roles).toMap
    case None => Map.empty
  }
}
My main concern is with this line:
val goodRows = data \\ "tr" filter (_.toString() contains "actor" ) filter (_.toString() contains "role" )
How can I filter out the bad rows more precisely (without the _.toString())?
Any suggestions?
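For illustration, one possible direction is to test each row's cells structurally instead of searching its serialized text. This is only a sketch, assuming scala-xml is on the classpath; the names `ActorRows`, `hasClass` and `actorRoles` are hypothetical and not part of my code above.

```scala
import scala.xml.{Node, NodeSeq}

object ActorRows {
  // Hypothetical helper: true when the cell's class attribute equals `cls`.
  def hasClass(cls: String)(td: Node): Boolean =
    (td \ "@class").text == cls

  // Keep only rows that contain both an "actor" cell and a "role" cell.
  // Pairing happens inside each row, so a missing cell drops that row
  // instead of shifting the remaining actors against the roles.
  def actorRoles(data: NodeSeq): Map[String, String] =
    (data \\ "tr").toList.flatMap { tr =>
      val cells = tr \ "td"
      for {
        actor <- cells.find(hasClass("actor"))
        role  <- cells.find(hasClass("role"))
      } yield actor.text -> role.text
    }.toMap
}
```

If only the column count matters, something like `data \\ "tr" filter (tr => (tr \ "td").size == 2)` might already be enough, though it would not check which classes the two cells carry.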