clojure - Clojure でテキスト処理のパフォーマンスを向上させる方法は?

Question

Clojure についてもっと学ぶ方法として、単純なデスクトップ検索エンジンを Clojure で書いています。これまで、私のプログラムのテキスト処理段階でのパフォーマンスは非常に悪かったです。

テキスト処理中に私はしなければなりません：

不要な文字をクリーンアップします。
文字列を小文字に変換します。
ドキュメントを分割して単語のリストを取得します。
各単語をドキュメント内の出現箇所に関連付けるマップを作成します。

コードは次のとおりです。

(ns txt-processing.core
  (:require [clojure.java.io :as cjio])
  (:require [clojure.string :as cjstr])
  (:gen-class))

(defn all-files [path]
  (let [entries (file-seq (cjio/file path))]
    (filter (memfn isFile) entries)))

(def char-val
  (let [value #(Character/getNumericValue %)]
    {:a (value \a) :z (value \z)
     :A (value \A) :Z (value \Z)
     :0 (value \0) :9 (value \9)}))

(defn is-ascii-alpha-num [c]
  (let [n (Character/getNumericValue c)]
    (or (and (>= n (char-val :a)) (<= n (char-val :z)))
        (and (>= n (char-val :A)) (<= n (char-val :Z)))
        (and (>= n (char-val :0)) (<= n (char-val :9))))))

(defn is-valid [c]
    (or (is-ascii-alpha-num c)
        (Character/isSpaceChar c)
        (.equals (str \newline) (str c))))

(defn lower-and-replace [c]
  (if (.equals (str \newline) (str c)) \space (Character/toLowerCase c)))

(defn tokenize [content]
  (let [filtered (filter is-valid content)
        lowered (map lower-and-replace filtered)]
    (cjstr/split (apply str lowered) #"\s+")))

(defn process-content [content]
  (let [words (tokenize content)]
    (loop [ws words i 0 hmap (hash-map)]
      (if (empty? ws)
        hmap
        (recur (rest ws) (+ i 1) (update-in hmap [(first ws)] #(conj % i)))))))

(defn -main [& args]
  (doseq [file (all-files (first args))]
    (let [content (slurp file)
          oc-list (process-content content)]
      (println "File:" (.getPath file)
               "| Words to be indexed:" (count oc-list )))))

Haskell でこの問題を別の方法で実装したので、次の出力でわかるように両方を比較しました。

Clojure のバージョン:

$ lein uberjar
Compiling txt-processing.core
Created /home/luisgabriel/projects/txt-processing/clojure/target/txt-processing-0.1.0-SNAPSHOT.jar
Including txt-processing-0.1.0-SNAPSHOT.jar
Including clojure-1.5.1.jar
Created /home/luisgabriel/projects/txt-processing/clojure/target/txt-processing-0.1.0-SNAPSHOT-standalone.jar
$ time java -jar target/txt-processing-0.1.0-SNAPSHOT-standalone.jar ../data
File: ../data/The.Rat.Racket.by.David.Henry.Keller.txt | Words to be indexed: 2033
File: ../data/Beyond.Pandora.by.Robert.J.Martin.txt | Words to be indexed: 1028
File: ../data/Bat.Wing.by.Sax.Rohmer.txt | Words to be indexed: 7562
File: ../data/Operation.Outer.Space.by.Murray.Leinster.txt | Words to be indexed: 7754
File: ../data/The.Reign.of.Mary.Tudor.by.James.Anthony.Froude.txt | Words to be indexed: 15418
File: ../data/.directory | Words to be indexed: 3
File: ../data/Home.Life.in.Colonial.Days.by.Alice.Morse.Earle.txt | Words to be indexed: 12191
File: ../data/The.Dark.Door.by.Alan.Edward.Nourse.txt | Words to be indexed: 2378
File: ../data/Storm.Over.Warlock.by.Andre.Norton.txt | Words to be indexed: 7451
File: ../data/A.Brief.History.of.the.United.States.by.John.Bach.McMaster.txt | Words to be indexed: 11049
File: ../data/The.Jesuits.in.North.America.in.the.Seventeenth.Century.by.Francis.Parkman.txt | Words to be indexed: 14721
File: ../data/Queen.Victoria.by.Lytton.Strachey.txt | Words to be indexed: 10494
File: ../data/Crime.and.Punishment.by.Fyodor.Dostoyevsky.txt | Words to be indexed: 10642

real    2m2.164s
user    2m3.868s
sys     0m0.978s

Haskell バージョン:

$ ghc -rtsopts --make txt-processing.hs 
[1 of 1] Compiling Main             ( txt-processing.hs, txt-processing.o )
Linking txt-processing ...
$ time ./txt-processing ../data/ +RTS -K12m
File: ../data/The.Rat.Racket.by.David.Henry.Keller.txt | Words to be indexed: 2033
File: ../data/Beyond.Pandora.by.Robert.J.Martin.txt | Words to be indexed: 1028
File: ../data/Bat.Wing.by.Sax.Rohmer.txt | Words to be indexed: 7562
File: ../data/Operation.Outer.Space.by.Murray.Leinster.txt | Words to be indexed: 7754
File: ../data/The.Reign.of.Mary.Tudor.by.James.Anthony.Froude.txt | Words to be indexed: 15418
File: ../data/.directory | Words to be indexed: 3
File: ../data/Home.Life.in.Colonial.Days.by.Alice.Morse.Earle.txt | Words to be indexed: 12191
File: ../data/The.Dark.Door.by.Alan.Edward.Nourse.txt | Words to be indexed: 2378
File: ../data/Storm.Over.Warlock.by.Andre.Norton.txt | Words to be indexed: 7451
File: ../data/A.Brief.History.of.the.United.States.by.John.Bach.McMaster.txt | Words to be indexed: 11049
File: ../data/The.Jesuits.in.North.America.in.the.Seventeenth.Century.by.Francis.Parkman.txt | Words to be indexed: 14721
File: ../data/Queen.Victoria.by.Lytton.Strachey.txt | Words to be indexed: 10494
File: ../data/Crime.and.Punishment.by.Fyodor.Dostoyevsky.txt | Words to be indexed: 10642

real    0m9.086s
user    0m8.591s
sys     0m0.463s

Clojure 実装での ( string -> lazy sequence ) 変換がパフォーマンスを低下させていると思います。どうすれば改善できますか？

PS: これらのテストで使用されるすべてのコードとデータは、ここからダウンロードできます。

score 4 · Accepted Answer

おそらくこのコードを高速化するためにできることは次のとおりです。

1) をにマッピングする代わりにchars、char-val文字間の値を直接比較するだけです。これは、Java で高速になるのと同じ理由で高速です。

str2)単一文字の値を本格的な文字列に変換するために繰り返し使用します。ここでも、文字値を直接使用することを検討してください。繰り返しますが、オブジェクトの作成は、Java と同じように低速です。

3) に置き換えprocess-contentてclojure.core/frequenciesください。おそらくfrequenciesソースを調べて、どのように高速であるかを確認してください。

(hash-map)4)ループ内で a を更新する必要がある場合は、を使用しますtransient。参照: http://clojuredocs.org/clojure_core/clojure.core/transient

また、(hash-map)が a を返すためPersistentArrayMap、呼び出しごとに新しいインスタンスを作成していることにも注意してupdate-inください。したがって、低速であり、トランジェントを使用する必要があります。

(set! *warn-on-reflection* true)5) これはあなたの友達です:

 Reflection warning, scratch.clj:10:13 - call to isFile can't be resolved.
 Reflection warning, scratch.clj:13:16 - call to getNumericValue can't be resolved.
 Reflection warning, scratch.clj:19:11 - call to getNumericValue can't be resolved.
 Reflection warning, scratch.clj:26:9 - call to isSpaceChar can't be resolved.
 Reflection warning, scratch.clj:30:47 - call to toLowerCase can't be resolved.
 Reflection warning, scratch.clj:48:24 - reference to field getPath can't be resolved.
 Reflection warning, scratch.clj:48:24 - reference to field getPath can't be resolved.

clojure - Clojure でテキスト処理のパフォーマンスを向上させる方法は?

2 に答える 2

Related

Reference