clojure - Clojure frequency dictionary from big data

Question

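I want to write my own naive Bayes classifier, and I have a file like the one below.

(It is a database of spam and ham messages. The first word on each line is either spam or ham, and the rest of the line is the message. Size: 0.5 MB. Taken from http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/ )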

ham     Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
ham     Ok lar... Joking wif u oni...
spam    Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
ham     U dun say so early hor... U c already then say...
ham     Nah I don't think he goes to usf, he lives around here though
spam    FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv

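I want to build a hash map like this: {"spam" {"go" 1, "until" 100, ...}, "ham" {......}}, that is, a hash map whose values are maps from each word to its frequency (kept separately for ham and spam).

I know how to do this in Python or C++. I also wrote it in Clojure, but my solution fails on large data with a StackOverflowError.

My solution: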

(defn read_data_from_file [fname]
    (map #(split % #"\s")(map lower-case (with-open [rdr (reader fname)] 
        (doall (line-seq rdr))))))

(defn do-to-map [amap keyseq f]
    (reduce #(assoc %1 %2 (f (%1 %2))) amap keyseq))

(defn dicts_from_data [raw_data]
    (let [data (group-by #(first %) raw_data)]
        (do-to-map
            data (keys data) 
                (fn [x] (frequencies (reduce concat (map #(rest %) x)))))))

To find out where it goes wrong, I wrote this:

(def raw_data (read_data_from_file (first args)))
(def d (group-by #(first %) raw_data))
(def f (map frequencies raw_data))
(def d1 (reduce concat (d "spam")))
(println (reduce concat (d "ham")))

Error:

Exception in thread "main" java.lang.RuntimeException: java.lang.StackOverflowError
    at clojure.lang.Util.runtimeException(Util.java:165)
    at clojure.lang.Compiler.eval(Compiler.java:6476)
    at clojure.lang.Compiler.eval(Compiler.java:6455)
    at clojure.lang.Compiler.eval(Compiler.java:6431)
    at clojure.core$eval.invoke(core.clj:2795)
    at clojure.main$eval_opt.invoke(main.clj:296)
    at clojure.main$initialize.invoke(main.clj:315)
.....

Can anyone help me make this better or more efficient? P.S. Sorry for my mistakes in writing; English is not my native language.

score 2 · Accepted Answer

Using apply instead of reduce in the anonymous function avoids the StackOverflowError. That is, instead of (fn [x] (frequencies (reduce concat (map #(rest %) x)))) use (fn [x] (frequencies (apply concat (map #(rest %) x)))).
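The difference comes from laziness: (reduce concat ...) wraps one lazy sequence inside another for every element, so realizing the first item has to descend through all of those layers on the call stack, whereas (apply concat ...) makes a single concat call over all the sequences. Here is a minimal sketch of that effect. The names chunks, nested, overflowed? and flat and the size 100000 are made up for illustration, and whether the first form actually overflows depends on the JVM stack size.

```clojure
;; Hypothetical demo data: 100,000 one-element sequences.
(def chunks (repeat 100000 [1]))

;; Builds 100,000 nested lazy seqs. Nothing is realized yet, so this is cheap.
(def nested (reduce concat chunks))

;; Realizing the first element walks every nested layer recursively.
;; Expected to be true with a default JVM stack size, but not guaranteed.
(def overflowed?
  (try
    (first nested)
    false
    (catch StackOverflowError _ true)))

;; One concat call over all the sequences: no deep nesting.
(def flat (apply concat chunks))

(println "reduce concat overflowed?" overflowed?)
(println "apply concat count:" (count flat))
```
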

Below is the same code refactored a little; the logic is exactly the same. read-data-from-file was changed to avoid mapping over the sequence of lines twice.

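;; Refer only the functions that are used; a bare (use 'clojure.string)
;; also works but warns about replacing clojure.core/replace and reverse.
(use '[clojure.string :only [lower-case split]])
(use '[clojure.java.io :only [reader]])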

(defn read-data-from-file [fname]
  (let [lines (with-open [rdr (reader fname)] 
                (doall (line-seq rdr)))]
    (map #(-> % lower-case (split #"\s")) lines)))

(defn do-to-map [m keyseq f]
    (reduce #(assoc %1 %2 (f (%1 %2))) m keyseq))

(defn process-words [x]
  (->> x 
    (map #(rest %)) 
    (apply concat) ; This is the only real change from the 
                   ; original code, it used to be (reduce concat).
    frequencies))

(defn dicts-from-data [raw_data]
  (let [data (group-by first raw_data)]
    (do-to-map data
               (keys data) 
               process-words)))

(-> "SMSSpamCollection.txt" read-data-from-file dicts-from-data keys)
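Once the nested map exists, per-class word counts can be read with get-in. A small sketch, assuming a hand-written map shaped like the result of dicts-from-data; the dicts literal and the word-count helper are hypothetical and not part of the code above.

```clojure
;; Hypothetical result, shaped like the output of dicts-from-data.
(def dicts {"ham"  {"ok" 2, "lar" 1}
            "spam" {"free" 1}})

;; Look up how often `word` occurs under `label`; 0 if it never occurs.
(defn word-count [dicts label word]
  (get-in dicts [label word] 0))

(println (word-count dicts "ham" "ok"))   ; prints 2
(println (word-count dicts "spam" "ok"))  ; prints 0
```

Passing 0 as the not-found value saves a nil check when a word was never seen for a class.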

score 1 · Accepted Answer

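Another thing to consider is the use of (doall (line-seq ...)), which reads the whole list of lines into memory. If the list is very large, this can cause problems. A handy trick for accumulating data like this is to use reduce. In your case you need to reduce twice: once over the lines, and then once over the words of each line. Something like this: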

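;; Aliases used below (str/ and io/).
(require '[clojure.string :as str]
         '[clojure.java.io :as io])

(defn parse-line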
  [line]
  (str/split (str/lower-case line) #"\s+"))

(defn build-word-freq
  [file]
  (with-open [rdr (io/reader file)]
    (reduce (fn [accum line]
              (let [[spam-or-ham & words] (parse-line line)]
                (reduce #(update-in %1 [spam-or-ham %2] (fnil inc 0)) accum words)))
            {}
            (line-seq rdr))))
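To see what the nested reduce accumulates without touching a file, the same update-in / fnil step can be run over a few in-memory lines. This is only a sketch: sample-lines, count-words and freqs are names invented for the example.

```clojure
(require '[clojure.string :as str])

;; Hypothetical sample input in the same "label words..." layout.
(def sample-lines ["ham Ok lar ok"
                   "spam Free entry"
                   "ham ok"])

;; One reduce step: split the line, then bump the counter of every word
;; under its label. (fnil inc 0) treats a missing counter as 0.
(defn count-words [accum line]
  (let [[label & words] (str/split (str/lower-case line) #"\s+")]
    (reduce #(update-in %1 [label %2] (fnil inc 0)) accum words)))

(def freqs (reduce count-words {} sample-lines))

(println freqs)
;; freqs is {"ham" {"ok" 3, "lar" 1}, "spam" {"free" 1, "entry" 1}}
```
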
