After hearing the latest Stack Overflow podcast, Peter Norvig's compact Python spell-checker intrigued me, so I decided to implement it in Scala if I could express it well in the functional Scala idiom, and also to see how many lines of code it would take.
Here's the whole problem. (Let's not compare lines of code yet.)
(Two notes: You can run this in the Scala interpreter, if you wish. If you need a copy of big.txt, or the whole project, it's on GitHub.)
import scala.io.Source
val alphabet = "abcdefghijklmnopqrstuvwxyz"
def train(text:String) = {
"[a-z]+".r.findAllIn(text).foldLeft(Map[String, Int]() withDefaultValue 1)
{(a, b) => a(b) = a(b) + 1}
}
val NWORDS = train(Source.fromFile("big.txt").getLines.mkString.toLowerCase)
def known(words:Set[String]) =
{Set.empty ++ (for(w <- words if NWORDS contains w) yield w)}
def edits1(word:String) = {
Set.empty ++
(for (i <- 0 until word.length) // Deletes
yield (word take i) + (word drop (i + 1))) ++
(for (i <- 0 until word.length - 1) // Transposes
yield (word take i) + word(i + 1) + word(i) + (word drop (i + 2))) ++
(for (i <- 0 until word.length; j <- alphabet) // Replaces
yield (word take i) + j + (word drop (i+1))) ++
(for (i <- 0 until word.length; j <- alphabet) // Inserts
yield (word take i) + j + (word drop i))
}
def known_edits2(word:String) = {Set.empty ++ (for (e1 <- edits1(word);
e2 <- edits1(e1) if NWORDS contains e2) yield e2)}
def correct(word:String) = {
val options = Seq(() => known(Set(word)), () => known(edits1(word)),
() => known_edits2(word), () => Set(word))
val candidates = options.foldLeft(Set[String]())
{(a, b) => if (a.isEmpty) b() else a}
candidates.foldLeft("") {(a, b) => if (NWORDS(a) > NWORDS(b)) a else b}
}
Specifically, I'm wondering if there's anything cleaner I can do with the correct
function. In the original Python, the implementation is a bit cleaner:
def correct(word):
candidates = known([word]) or known(edits1(word)) or
known_edits2(word) or [word]
return max(candidates, key=NWORDS.get)
Apparently in Python, an empty set will evaluate to Boolean False
, so only the first of the candidates to return a non-empty set will be evaluated, saving potentially expensive calls to edits1
and known_edits2
.
The only solution I would come up with is the version you see here, where the Seq
of anonymous functions are called until one returns a non-empty Set
, which the last one is guaranteed to do.
So experienced Scala-heads, is there a more syntactically concise or better way to do this? Thanks in advance!