parsing - Clojure を使用して次のテキストを読み取り/解析するにはどうすればよいですか?

Question

Text の構造は次のとおりです。

Tag001
 0.1, 0.2, 0.3, 0.4
 0.5, 0.6, 0.7, 0.8
 ...
Tag002
 1.1, 1.2, 1.3, 1.4
 1.5, 1.6, 1.7, 1.8
 ...

ファイルには任意の数の TagXXX を含めることができ、各タグには任意の数の CSV 値行を含めることができます。

====PPPS。（これらのもので申し訳ありません:-)

その他の改善; 現在、atom ラップトップで 31842 行のデータを処理するのに約 1 秒かかります。これは、元のコードよりも 7 倍高速です。ただし、C 版はこれより 20 倍高速です。

(defn add-parsed-code [accu code]
  (if (empty? code)
    accu
    (conj accu code)))

(defn add-values [code comps]
  (let [values comps
        old-values (:values code)
        new-values (if old-values
                     (conj old-values values)
                     [values])]
    (assoc code :values new-values)))

(defn read-line-components [file]
  (map (fn [line] (clojure.string/split line #","))
       (with-open [rdr (clojure.java.io/reader file)]
         (doall (line-seq rdr)))))

(defn parse-file [file]
  (let [line-comps (read-line-components file)]
    (loop [line-comps line-comps
           accu []
           curr {}]
      (if line-comps
        (let [comps (first line-comps)]
          (if (= (count comps) 1) ;; code line?
            (recur (next line-comps)
                   (add-parsed-code accu curr)
                   {:code (first comps)})
            (recur (next line-comps)
                   accu
                   (add-values curr comps))))
        (add-parsed-code accu curr)))))

==== PPS。

最初のものは2番目のものよりも10倍速い理由はわかりませんが、slurpの代わりにmapとwith-openを使用すると読み取りが速くなります。全体の読み取り/処理時間はそれほど短縮されていませんが (7 秒から 6 秒に)

(time
 (let [lines (map (fn [line] line)
                  (with-open [rdr (clojure.java.io/reader
                                   "DATA.txt")]
                    (doall (line-seq rdr))))]
   (println (last lines))))

(time (let [lines
            (clojure.string/split-lines
             (slurp "DATA.txt"))]
        (println (last lines))))

====追伸。Skuro のソリューションは機能しました。しかし、解析速度はそれほど速くないので、読み取りと統計解析部分のみDBとclojureを構築。

score 5 · Accepted Answer

以下は、値の行を区切って上記のファイルを解析します。それがあなたが望むものでない場合は、add-values機能を変更することができます。解析状態はcurr変数にaccu保持されますが、以前に解析されたタグ（つまり、「TagXXX」が見つかる前に表示されたすべての行）が保持されます。タグなしの値を許可します。

更新：副作用が専用load-file関数にカプセル化されました

(defn tag? [line]
  (re-matches #"Tag[0-9]*" line))

; potentially unsafe, you might want to change this:
(defn parse-values [line]
  (read-string (str "[" line "]")))

(defn add-parsed-tag [accu tag]
  (if (empty? tag)
      accu
      (conj accu tag)))

(defn add-values [tag line]
  (let [values (parse-values line)
        old-values (:values tag)
        new-values (if old-values
                       (conj old-values values)
                       [values])]
    (assoc tag :values new-values)))

(defn load-file [path]
  (slurp path))

(defn parse-file [file]
  (let [lines (clojure.string/split-lines file)]
    (loop [lines lines ; remaining lines 
           accu []     ; already parsed tags
           curr {}]    ; current tag being parsed
          (if lines
              (let [line (first lines)]
                (if (tag? line)
                    ; we recur after starting a new tag
                    ; if curr is empty we don't add it to the accu (e.g. first iteration)
                    (recur (next lines)
                           (add-parsed-tag accu curr)
                           {:tag line})
                    ; we're parsing values for a currentl tag
                    (recur (next lines)
                           accu
                           (add-values curr line))))
              ; if we were parsing a tag, we need to add it to the final result
              (add-parsed-tag accu curr)))))

私は上記のコードにあまり興奮していませんが、それは仕事をします。次のようなファイルがあるとします。

Tag001
 0.1, 0.2, 0.3, 0.4
 0.5, 0.6, 0.7, 0.8
Tag002
 1.1, 1.2, 1.3, 1.4
 1.5, 1.6, 1.7, 1.8
Tag003
 1.1, 1.2, 1.3, 1.4
 1.1, 1.2, 1.3, 1.4
 1.5, 1.6, 1.7, 1.8
 1.5, 1.6, 1.7, 1.8

次の結果が生成されます。

user=> (clojure.pprint/print-table [:tag :values] (parse-file (load-file "tags.txt")))
================================================================
:tag   | :values
================================================================
Tag001 | [[0.1 0.2 0.3 0.4] [0.5 0.6 0.7 0.8]]
Tag002 | [[1.1 1.2 1.3 1.4] [1.5 1.6 1.7 1.8]]
Tag003 | [[1.1 1.2 1.3 1.4] [1.1 1.2 1.3 1.4] [1.5 1.6 1.7 1.8] [1.5 1.6 1.7 1.8]]
================================================================

score 1 · Accepted Answer

これは、partition-by 関数を使用して実行できます。読むのはおそらくやや不可解ですが、読みやすさは簡単に向上させることができます。この関数は、私の mini-mac で約 500 ミリ秒で実行されました。

まず、次の関数を使用してテストデータを作成しました。

(defn write-data[fname]
   (with-open [wrtr (clojure.java.io/writer fname) ]
     (dorun 
        (for [ x (take 7500 (range)) ]
          (do
             (.write wrtr (format "Tag%010d" x))
             (.write wrtr "
                            1.1, 1.2, 1.3, 1.4
                            1.1, 1.2, 1.3, 1.4
                            1.5, 1.6, 1.7, 1.8
                            1.5, 1.6, 1.7, 1.8
                           " ))))))

(write-data "my-data.txt")

; "a b c d " will be converted to [ a b c d ]
(defn to-vec[st]
   (load-string (str "[" st "]")))


(defn my-transform[fname]
   (let [tag (atom {:tag nil})]
      (with-open [rdr (clojure.java.io/reader fname)]
         (doall 
           (into {} 
               (map 
                  (fn[xs] {(first xs) (map to-vec (rest xs))}) 
                     ( partition-by 
                          (fn[y] 
                             (if(.startsWith 
                                  (str y) "Tag") 
                                  (swap! tag assoc :tag y) @tag)) 
                       (line-seq rdr))))))))


(time (count (my-transform "my-data.txt")))
;Elapsed time: 517.23 msecs

parsing - Clojure を使用して次のテキストを読み取り/解析するにはどうすればよいですか?

2 に答える 2

Related

Reference