You can use pull parsing, where the XML document is viewed as a sequence of events (open tag &lt;a&gt;, open tag &lt;i&gt;, text, close tag &lt;/i&gt;, ...).
This avoids storing the entire file in memory.
I have used it on XML files of several hundred MB without any major problems. (Of course, as Rex points out in a comment, if the elements you want to recover are themselves huge, there is no obvious way around holding them in memory.)
The pull parser is not as convenient as the "regular" one (or Anti-XML) because it does not give you a tree; instead, you have to manage state yourself to keep track of where you are in the document.
Here is a self-contained example that shows how to extract all internal links on the Wikipedia page for Scala:
import scala.io.Source
import scala.xml.Text
import scala.xml.pull._

val src = Source.fromURL("http://en.wikipedia.org/wiki/Scala_(programming_language)")
val reader = new XMLEventReader(src)

// Matches internal Wikipedia links such as /wiki/Functional_programming
val Internal = """/wiki/([\w_]*)""".r

// The state we track: are we currently inside an <a> tag, and where does it point?
var inLink = false
var linksTo = ""

for (event <- reader) {
  event match {
    case EvElemStart(_, "a", meta, _) => meta("href") match {
      case Text(Internal(href)) =>
        linksTo = href
        inLink = true
      case _ =>
    }
    case EvText(txt) if inLink => println(txt + " --> " + linksTo)
    case EvElemEnd(_, "a") => inLink = false
    case _ =>
  }
}
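The example above only needs a boolean and a string, but the same idea scales to richer state. A common pattern is to keep a stack of open element names, which gives you the current "path" in the document at every event. Here is a minimal sketch of that technique; the Event/Start/Text/End types are a hypothetical simplification of scala.xml.pull's EvElemStart/EvText/EvElemEnd, so the sketch runs without any XML library:

```scala
// Hypothetical, simplified pull-parser events (stand-ins for
// EvElemStart, EvText and EvElemEnd from scala.xml.pull).
sealed trait Event
case class Start(label: String) extends Event
case class Text(text: String) extends Event
case class End(label: String) extends Event

// Collect the text of every element whose path (innermost first)
// matches `path`, tracking position with a stack of open tags.
def textAtPath(events: Iterator[Event], path: List[String]): List[String] = {
  var stack = List.empty[String] // innermost open element first
  val found = List.newBuilder[String]
  for (ev <- events) ev match {
    case Start(l) => stack = l :: stack // push on open tag
    case End(_)   => stack = stack.tail // pop on close tag
    case Text(t)  => if (stack == path) found += t
  }
  found.result()
}

val events = List(
  Start("html"), Start("body"), Start("p"), Text("hello"),
  End("p"), Start("div"), Start("p"), Text("world"), End("p"),
  End("div"), End("body"), End("html"))

// Only the <p> nested inside <div> matches this path.
println(textAtPath(events.iterator, List("p", "div", "body", "html")))
// → List(world)
```

The stack is all the state you need to answer "where am I?" at any event, which is the part of pull parsing that a tree API would otherwise do for you.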