You can use pull parsing, where the XML document is viewed as a sequence of events (open tag &lt;a&gt;, open tag &lt;i&gt;, text, close tag &lt;/i&gt;, ...).
This avoids storing the entire file in memory.
I have used it on XML files of several hundred MB without any major problems. (Of course, as Rex points out in a comment, if the elements you want to recover are themselves huge, there is no obvious way around holding them in memory.)
The pull parser is not as convenient as the "regular" one (or Anti-XML) because it does not give you a tree; instead, you have to manage state yourself to keep track of where you are in the document.
Here is a self-contained example that shows how to extract all internal links on the Wikipedia page for Scala:
import scala.io.Source
import scala.xml.Text
import scala.xml.pull._

val src = Source.fromURL("http://en.wikipedia.org/wiki/Scala_(programming_language)")
val reader = new XMLEventReader(src)

// Matches internal Wikipedia links such as /wiki/Functional_programming
val Internal = """/wiki/([\w_]*)""".r

// The state we track: are we currently inside an <a> tag, and where does it point?
var inLink = false
var linksTo = ""

for (event <- reader) {
  event match {
    case EvElemStart(_, "a", meta, _) => meta("href") match {
      case Text(Internal(href)) =>
        linksTo = href
        inLink = true
      case _ =>
    }
    case EvText(txt) if inLink => println(txt + " --> " + linksTo)
    case EvElemEnd(_, "a") => inLink = false
    case _ =>
  }
}
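The example above only needs a boolean and a string, but the same idea scales to richer state. A common pattern is to keep a stack of open element names, which gives you the current "path" in the document at every event. Here is a minimal sketch of that technique; the Event/Start/Text/End types are a hypothetical simplification of scala.xml.pull's EvElemStart/EvText/EvElemEnd, so the sketch runs without any XML library:

```scala
// Hypothetical, simplified pull-parser events (stand-ins for
// EvElemStart, EvText and EvElemEnd from scala.xml.pull).
sealed trait Event
case class Start(label: String) extends Event
case class Text(text: String) extends Event
case class End(label: String) extends Event

// Collect the text of every element whose path (innermost first)
// matches `path`, tracking position with a stack of open tags.
def textAtPath(events: Iterator[Event], path: List[String]): List[String] = {
  var stack = List.empty[String] // innermost open element first
  val found = List.newBuilder[String]
  for (ev <- events) ev match {
    case Start(l) => stack = l :: stack // push on open tag
    case End(_)   => stack = stack.tail // pop on close tag
    case Text(t)  => if (stack == path) found += t
  }
  found.result()
}

val events = List(
  Start("html"), Start("body"), Start("p"), Text("hello"),
  End("p"), Start("div"), Start("p"), Text("world"), End("p"),
  End("div"), End("body"), End("html"))

// Only the <p> nested inside <div> matches this path.
println(textAtPath(events.iterator, List("p", "div", "body", "html")))
// → List(world)
```

The stack is all the state you need to answer "where am I?" at any event, which is the part of pull parsing that a tree API would otherwise do for you.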