
I am scraping a web site and receiving an HTML page.

The page has some tables with rows

(actor -> role)

For example:

( actor = Jason Priestley -> role = Brandon Walsh)

Sometimes a row is missing either the "actor" or the "role" cell

(the row has 1 column when 2 are expected).

Example file:

<div id="90210">
      <h2 style="margin:0 0 2px 0">beverly hills 90210</h2>
      <table class="actors">
        <tr><td class="actor">Jennie Garth</td><td class="role">Kelly Taylor</td></tr>
        <tr><td class="actor">Shannen Doherty</td></tr>
        <tr><td class="actor">Jason Priestley</td><td class="role">Brandon Walsh</td></tr>
      </table>
</div>

I am having trouble filtering out the rows that have only 1 column.

My code:

  def beverlyHillsParser(page: xml.NodeSeq) : Map[String, String] = {
    // "@id" selects the id attribute; a bare "id" would look for a child element named id
    val beverlyHillsData = page \\ "div" find ((node: xml.Node) => (node \ "@id").text == "90210")
    beverlyHillsData match {
      case Some(data) => {
        val goodRows = data \\ "tr" filter (_.toString() contains "actor" ) filter (_.toString() contains "role" )
        val actors = goodRows \\ "td" filter ((node: xml.Node) => (node \ "@class").text == "actor") map { _.text }
        val roles  = goodRows \\ "td" filter ((node: xml.Node) => (node \ "@class").text == "role")  map { _.text }
        (actors zip roles).toMap
      }
      case None => Map()
    }
  }

My main concern is with this line:

val goodRows = data \\ "tr" filter (_.toString() contains "actor" ) filter (_.toString() contains "role" )

How can I filter out the bad rows more precisely (without the _.toString())?

Any suggestions?


1 Answer

You can do:

def actorWithRole(n: Node) = n \\ "@class" xml_sameElements(List("actor", "role"))

val goodRows = data \\ "tr" filter actorWithRole

I would also change the data extraction so that each actor/role pair is kept together. I would need more time to find a clean solution.

What I would suggest is:

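import scala.xml.Node

def beverlyHillsParser(page: xml.NodeSeq) : Map[String, String] = {

  // a row is good when its class attributes are exactly "actor" followed by "role"
  def actorWithRole(n: Node) = n \\ "@class" xml_sameElements(List("actor", "role"))

  // safe only for rows that passed actorWithRole (exactly two cells)
  def rowToEntry(r: Node) =
    r \ "td" map (_.text) match {
      case actor :: role :: Nil => (actor -> role)
    }

  // whereId is not defined in the answer as posted; assumed here to match on the id attribute
  def whereId(id: String)(n: Node) = (n \ "@id").text == id

  val beverlyHillsData = page \\ "div" find whereId("90210")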

  beverlyHillsData match {
    case Some(data) => {
      val goodRows = data \\ "tr" filter actorWithRole
      val entries = goodRows map rowToEntry
      entries.toMap
    }
    case None => Map()
  }
}
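
As a variation on the idea of keeping each actor/role pair together, here is a minimal self-contained sketch that returns an Option per row, so incomplete rows are dropped without a separate filter step. This is not the code from the answer above: the names ActorRoles and attr are made up for illustration, and it assumes the scala-xml module is on the classpath (it is a separate dependency in recent Scala versions).

```scala
import scala.xml.{Node, NodeSeq}

object ActorRoles {
  // Hypothetical helper: text of the attribute `name` on `n`, or "" when absent.
  def attr(n: Node, name: String): String = (n \ s"@$name").text

  // Some(actor -> role) only when the row has exactly one "actor" cell
  // and exactly one "role" cell; None for incomplete rows.
  def rowToEntry(row: Node): Option[(String, String)] = {
    val cells  = row \ "td"
    val actors = cells.filter(c => attr(c, "class") == "actor")
    val roles  = cells.filter(c => attr(c, "class") == "role")
    if (actors.length == 1 && roles.length == 1)
      Some(actors.text.trim -> roles.text.trim)
    else
      None
  }

  def beverlyHillsParser(page: NodeSeq): Map[String, String] = {
    // Select the div by its id attribute, then look at every row below it.
    val divs = (page \\ "div").filter(d => attr(d, "id") == "90210")
    val rows = (divs \\ "tr").toList
    rows.flatMap(r => rowToEntry(r)).toMap
  }
}
```

With the example file from the question loaded through scala.xml.XML.loadString, ActorRoles.beverlyHillsParser should return the Jennie Garth and Jason Priestley entries and skip the Shannen Doherty row. Because each row is turned into a pair on its own, actors and roles cannot get misaligned the way they could with two separate lists zipped together.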
answered 2013-11-06T14:31:21.543