scala - Scala: ストリーム/イテレータ収集結果をいくつかの異なるコレクションにトラバースする方法

Question

大きすぎてメモリに収まらないログファイルを調べて、2 種類の式を収集しています。以下の反復スニペットの代わりに機能的に優れているのはどれですか?

def streamData(file: File, errorPat: Regex, loginPat: Regex): List[(String, String)]={
  val lines : Iterator[String] = io.Source.fromFile(file).getLines()

  val logins: mutable.Map[String, String] = new mutable.HashMap[String, String]()
  val errors: mutable.ListBuffer[(String, String)] = mutable.ListBuffer.empty

  for (line <- lines){
    line match {
      case errorPat(date,ip)=> errors.append((ip,date))
      case loginPat(date,user,ip,id) =>logins.put(ip, id)
      case _ => ""
    }
  }

  errors.toList.map(line => (logins.getOrElse(line._1,"none") + " " + line._1,line._2))
}

score 1 · Accepted Answer

考えられる解決策は次のとおりです。

def streamData(file: File, errorPat: Regex, loginPat: Regex): List[(String,String)] = {
  val lines = Source.fromFile(file).getLines
  val (err, log) = lines.collect {
        case errorPat(inf, ip) => (Some((ip, inf)), None)
        case loginPat(_, _, ip, id) => (None, Some((ip, id)))
      }.toList.unzip
  val ip2id = log.flatten.toMap
  err.collect{ case Some((ip,inf)) => (ip2id.getOrElse(ip,"none") + "" + ip, inf) }
}

score 0 · Accepted Answer

私はいくつかの提案があります：

ペア/タプルの代わりに、独自のクラスを使用する方がよい場合がよくあります。タイプとそのフィールドの両方に意味のある名前が付けられ、コードがはるかに読みやすくなります。
コードを小さな部分に分割します。特に、一緒に結び付ける必要のないコードの断片を切り離すようにしてください。これにより、コードが理解しやすくなり、堅牢になり、エラーが発生しにくくなり、テストが容易になります。あなたの場合、入力（ログファイルの行）の生成とそれを消費して結果を生成することを分離するのが良いでしょう。たとえば、サンプルデータをファイルに保存しなくても、関数の自動テストを行うことができます。

例と演習として、Scalaziterateesに基づいたソリューションを作成しようとしました。それは少し長く（の補助コードを含むIteratorEnumerator）、おそらくタスクには少しやり過ぎですが、おそらく誰かがそれを役立つと思うでしょう。

import java.io._;
import scala.util.matching.Regex
import scalaz._
import scalaz.IterV._

object MyApp extends App {
  // A type for the result. Having names keeps things
  // clearer and shorter.
  type LogResult = List[(String,String)]

  // Represents a state of our computation. Not only it
  // gives a name to the data, we can also put here
  // functions that modify the state.  This nicely
  // separates what we're computing and how.
  sealed case class State(
    logins: Map[String,String],
    errors: Seq[(String,String)]
  ) {
    def this() = {
      this(Map.empty[String,String], Seq.empty[(String,String)])
    }

    def addError(date: String, ip: String): State =
      State(logins, errors :+ (ip -> date));
    def addLogin(ip: String, id: String): State =
      State(logins + (ip -> id), errors);

    // Produce the final result from accumulated data.
    def result: LogResult =
      for ((ip, date) <- errors.toList)
        yield (logins.getOrElse(ip, "none") + " " + ip) -> date
  }

  // An iteratee that consumes lines of our input. Based
  // on the given regular expressions, it produces an
  // iteratee that parses the input and uses State to
  // compute the result.
  def logIteratee(errorPat: Regex, loginPat: Regex):
            IterV[String,List[(String,String)]] = {
    // Consumes a signle line.
    def consume(line: String, state: State): State =
      line match {
        case errorPat(date, ip)           => state.addError(date, ip);
        case loginPat(date, user, ip, id) => state.addLogin(ip, id);
        case _                            => state
      }

    // The core of the iteratee. Every time we consume a
    // line, we update our state. When done, compute the
    // final result.
    def step(state: State)(s: Input[String]): IterV[String, LogResult] =
      s(el    = line => Cont(step(consume(line, state))),
        empty = Cont(step(state)),
        eof   = Done(state.result, EOF[String]))
    // Return the iterate waiting for its first input.
    Cont(step(new State()));
  }


  // Converts an iterator into an enumerator. This
  // should be more likely moved to Scalaz.
  // Adapted from scalaz.ExampleIteratee
  implicit val IteratorEnumerator = new Enumerator[Iterator] {
    @annotation.tailrec def apply[E, A](e: Iterator[E], i: IterV[E, A]): IterV[E, A] = {
      val next: Option[(Iterator[E], IterV[E, A])] =
        if (e.hasNext) {
          val x = e.next();
          i.fold(done = (_, _) => None, cont = k => Some((e, k(El(x)))))
        } else
          None;
       next match {
         case None => i
         case Some((es, is)) => apply(es, is)
       }
    }
  }


  // main ---------------------------------------------------
  {
    // Read a file as an iterator of lines:
    // val lines: Iterator[String] =
    //    io.Source.fromFile("test.log").getLines();

    // Create our testing iterator:
    val lines: Iterator[String] = Seq(
      "Error: 2012/03 1.2.3.4",
      "Login: 2012/03 user 1.2.3.4 Joe",
      "Error: 2012/03 1.2.3.5",
      "Error: 2012/04 1.2.3.4"
    ).iterator;

    // Create an iteratee.
    val iter = logIteratee("Error: (\\S+) (\\S+)".r, 
                           "Login: (\\S+) (\\S+) (\\S+) (\\S+)".r);
    // Run the the iteratee against the input
    // (the enumerator is implicit)
    println(iter(lines).run);
  }
}

score 0 · Accepted Answer

修正:
1) 不要な型宣言を削除
2) ulgy の代わりにタプルの分解._1
3) 可変アキュムレータの代わりに左折畳み
4) より便利な演算子のようなメソッド:+を使用し、+

def streamData(file: File, errorPat: Regex, loginPat: Regex): List[(String, String)] = {
    val lines = io.Source.fromFile(file).getLines()

    val (logins, errors) =
        ((Map.empty[String, String], Seq.empty[(String, String)]) /: lines) {
            case ((loginsAcc, errorsAcc), next) =>
                next match {
                    case errorPat(date, ip) => (loginsAcc, errorsAcc :+ (ip -> date))
                    case loginPat(date, user, ip, id) => (loginsAcc + (ip -> id) , errorsAcc)
                    case _ => (loginsAcc, errorsAcc)
                }
        }

// more concise equivalent for
// errors.toList.map { case (ip, date) => (logins.getOrElse(ip, "none") + " " + ip) -> date }
    for ((ip, date) <- errors.toList) 
    yield (logins.getOrElse(ip, "none") + " " + ip) -> date


}

scala - Scala: ストリーム/イテレータ収集結果をいくつかの異なるコレクションにトラバースする方法

3 に答える 3

Related

Reference