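My goal is to extract and parse a series of bibliographic references from a web page so that they can later be entered into a database. The references are all in MLA format. The solution should be general for every instance of an MLA-format reference, and it should work on pages other than the one shown below.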

Here is my attempt, which does not work:

(use '[net.cgrand.enlive-html])

(def ^:dynamic *base-url* "https://www.impacttest.com/research/?Clinical-Research-Database-4")
(def ^:dynamic *ref-selector*     [:div#content_1 :ul :li])


(defn fetch-url [url]
  (html-resource (java.net.URL. url)))

(defn references []
  (select (fetch-url *base-url*) *ref-selector*))

(def ^:dynamic *ref-regex*    #"\s([A-Z]{1}[\w|\s]+)[,|\.]")
(def ^:dynamic *ref-modifier* `(remove :content))

(defmacro extract-re [node re modifier]
  `(doseq [seqs (map :content (node))]
    (re-find re (apply str (modifier seqs)))))

(extract-re references *ref-regex* *ref-modifier*)

(macroexpand-1 '(extract-re references *ref-regex* *ref-modifier*))

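I want to write a macro, extract-re, that uses doseq to run a regex matcher (re-find) over every enlive node. Two things need to vary: the regex itself, and a modifier that alters the enlive node before it is processed. Without the modifier, the regex matches both the authors and part of the title. I tried writing a function, but I couldn't get it to work in the general case, so I think a macro is the way to go.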

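For MLA references, I may be wrong, but I think it is easier to apply a modifier to the enlive nodes than to do all of the extraction with regexes. I can't think of a way to write a regex that matches only the title or only the authors.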

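So how do I pass the modifier to the macro so that it is executed properly? I don't fully understand the details of quoting in macros, so I may be way off in how I wrote the macro, or even in whether a macro is needed at all.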

2 Answers

There are a number of problems with this code.

'(use [net.cgrand.enlive-html])

This does not load the library. It creates a literal list and does nothing with it.

user> (class '(use [net.cgrand.enlive-html]))
clojure.lang.PersistentList

This is effectively a no-op.

(def ^:dynamic *ref-modifier* `(remove :content))

This creates a list of two elements, not a "modifier" of any kind.
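To make the difference concrete, here is a small sketch that compares the syntax-quoted form with an actual function. The names modifier-as-data and modifier-as-fn are made up for this illustration, and using partial is just one way to get a callable value.

```clojure
;; Syntax-quote builds data: a two-element sequence whose first
;; element is the namespace-qualified symbol clojure.core/remove.
(def modifier-as-data `(remove :content))

;; partial builds a function that can actually be called.
(def modifier-as-fn (partial remove :content))

(seq? modifier-as-data)   ;=> true
(count modifier-as-data)  ;=> 2
(first modifier-as-data)  ;=> clojure.core/remove
(fn? modifier-as-data)    ;=> false
(fn? modifier-as-fn)      ;=> true
```

Only the second value can be applied to a node's content.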

(defmacro extract-re [node re modifier]
  `(doseq [seqs (map :content (node))]
    (re-find re (apply str (modifier seqs)))))

Here you use syntax-quote, but you never unquote anything inside it. The macro doesn't use its arguments at all.
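As a rough illustration of the unquoting problem, compare two toy macros. Both names are hypothetical and unrelated to enlive. The first never unquotes its argument, the second does.

```clojure
;; No unquote: the symbol x in the template is NOT replaced by the
;; macro argument. Syntax-quote qualifies it with the current namespace.
(defmacro broken-double [x]
  `(* 2 x))

;; With ~ the argument form is inserted into the template.
(defmacro working-double [x]
  `(* 2 ~x))

(macroexpand-1 '(working-double (+ 1 2)))
;=> (clojure.core/* 2 (+ 1 2))

(macroexpand-1 '(broken-double (+ 1 2)))
;; expands to (clojure.core/* 2 <current-ns>/x), e.g. user/x at a
;; default REPL, so the (+ 1 2) argument is silently ignored

(working-double (+ 1 2))
;=> 6
```

The same thing happens to node, re and modifier in the extract-re macro: they stay as literal symbols in the expansion instead of being replaced by the forms passed in.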

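You seem to want to apply modifier as if it were a function (which won't happen; see the quoting problem above), but as the actual call shows, modifier is a list of two elements, and calling it would throw an error.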

Finally, doseq is only useful for side effects and always returns nil. The doseq block never uses the values produced by re-find, so the doseq body is effectively a no-op.
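A minimal sketch of the difference, using a made-up vector of strings in place of the scraped nodes:

```clojure
;; Hypothetical input standing in for the text of three references.
(def lines ["Smith J, Doe A." "no capitals here" "Lee K."])

;; doseq evaluates the body for each element and throws the result away.
(def from-doseq
  (doseq [s lines]
    (re-find #"[A-Z]\w+" s)))

;; for builds a lazy sequence of the results; doall forces it.
(def from-for
  (doall
    (for [s lines]
      (re-find #"[A-Z]\w+" s))))

from-doseq ;=> nil
from-for   ;=> ("Smith" nil "Lee")
```

If you want the matches back, use for, map or mapv rather than doseq.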

Additionally, I think there is dubious utility in declaring vars as dynamic when they are supplied as explicit function arguments anyway.

Addressing all of these problems, I think this gets closer to something that works:

(use 'net.cgrand.enlive-html)

(def ^:dynamic *base-url*
  "https://www.impacttest.com/research/?Clinical-Research-Database-4")

(def ^:dynamic *ref-selector* [:div#content_1 :ul :li])


(defn fetch-url [url]
  (html-resource (java.net.URL. url)))

(defn references []
  (select (fetch-url *base-url*) *ref-selector*))

(def ^:dynamic *ref-regex* #"\s([A-Z]{1}[\w|\s]+)[,|\.]")

(def ^:dynamic *ref-modifier* (partial remove :content))

(defn extract-re [node re modifier]
  (doall
    (for [sq (map :content (node))]
      (re-find re (apply str (modifier sq))))))
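
To see what the function version does for a single node without fetching the page, here is a sketch that runs the same modifier and regex on a hand-written :content sequence. The sample data and the names sample-content and strip-elements are made up for illustration and only approximate what enlive returns.

```clojure
;; Hypothetical :content of one <li>: plain strings mixed with element maps.
(def sample-content
  ["\n\t"
   {:tag :strong, :attrs nil, :content ["Some article title."]}
   " Dambinova SA, Shikuev, Weissman JD. "
   {:tag :em, :attrs nil, :content ["Military Medicine."]}
   " 2013."])

;; (:content "a string") is nil, while (:content element-map) is truthy,
;; so this drops the element maps and keeps only the bare strings.
(def strip-elements (partial remove :content))

(strip-elements sample-content)
;=> ("\n\t" " Dambinova SA, Shikuev, Weissman JD. " " 2013.")

;; With a capturing group, re-find returns [whole-match group-1].
(re-find #"\s([A-Z]{1}[\w|\s]+)[,|\.]"
         (apply str (strip-elements sample-content)))
;=> [" Dambinova SA," "Dambinova SA"]
```

That is why each result is a two-element vector: the whole match first, then the captured author name.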

And in action:

user> (extract-re references *ref-regex* *ref-modifier*)

([" Dambinova SA," "Dambinova SA"] [" Zuckerman SL," "Zuckerman SL"] [" Conklin HM," "Conklin HM"] [" Covassin T," "Covassin T"] [" Maerlender A," "Maerlender A"] [" Fedor A," "Fedor A"] [" Resch J," "Resch J"] [" Elbin RJ," "Elbin RJ"] [" Rabinowitz AR," "Rabinowitz AR"] [" Kinnaman KA," "Kinnaman KA"] [" Tsushima WT," "Tsushima WT"] [" Amonette WE," "Amonette WE"] [" Lovell MR," "Lovell MR"] [" Schatz P," "Schatz P"] [" McGrath N," "McGrath N"] [" Kontos AP," "Kontos AP"] [" AB," "AB"] [" Meehan WP," "Meehan WP"] [" Rieger BP," "Rieger BP"] [" Solomon GS," "Solomon GS"] [" Sandel NK," "Sandel NK"] [" Schatz P," "Schatz P"] [" Schatz P," "Schatz P"] [" Lebrun CM," "Lebrun CM"] [" Brooks B," "Brooks B"] [" Meehan WP," "Meehan WP"] [" Fakhran S," "Fakhran S"] [" Cole WR," "Cole WR"] [" Tsushima M," "Tsushima M"] [" Zuckerman SL," "Zuckerman SL"] [" JK," "JK"] [" Covassin T," "Covassin T"] [" Moser RS," "Moser RS"] [" Mayers LB," "Mayers LB"] [" McAllister TW," "McAllister TW"] [" Meehan WP 3rd," "Meehan WP 3rd"] [" Neal MT," "Neal MT"] [" Lau BC," "Lau BC"] [" Kontos AP," "Kontos AP"] [" Gardner A," "Gardner A"] [" Elbin RJ," "Elbin RJ"] [" Wolf EG," "Wolf EG"] [" Reddy CC," "Reddy CC"] [" Moser RS," "Moser RS"] [" Guerriero RM," "Guerriero RM"] [" Deibert E," "Deibert E"] [" Wiebe DJ," "Wiebe DJ"] [" Baillargeon A," "Baillargeon A"] [" Erdal K." "Erdal K"] [" Maugans TA," "Maugans TA"] [" Iverson GL," "Iverson GL"] [" Ponsford J," "Ponsford J"] [" Schatz P," "Schatz P"] [" Mulligan I," "Mulligan I"] [" Echlin PS," "Echlin PS"] [" McLeod TC," "McLeod TC"] [" Zuckerman SL," "Zuckerman SL"] [" Kontos AP," "Kontos AP"] [" Zuckerman SL," "Zuckerman SL"] [" Schatz P," "Schatz P"] [" Kontos AP," "Kontos AP"] [" Covassin T," "Covassin T"] [" Covassin T," "Covassin T"] [" Duhaime AC," "Duhaime AC"] [" Echemendia RJ," "Echemendia RJ"] [" Ramanathan DM," "Ramanathan DM"] [" Meehan WP 3rd," "Meehan WP 3rd"] [" Krol AL," "Krol AL"] [" Turgeon C," "Turgeon C"] [" Randolph C." 
"Randolph C"] [" Barlow M," "Barlow M"] [" Schatz P," "Schatz P"] [" Moser RS," "Moser RS"] [" Broglio SP," "Broglio SP"] [" Thomas DG," "Thomas DG"] [" Allen BJ," "Allen BJ"] [" Solomon GS," "Solomon GS"] [" Ponsford J," "Ponsford J"] [" Johnson EW," "Johnson EW"] [" Randolph C," "Randolph C"] [" Elbin RJ," "Elbin RJ"] [" Broglio SP," "Broglio SP"] [" Kontos AP," "Kontos AP"] [" Lau BC," "Lau BC"] [" Lau BC," "Lau BC"] [" Hettich T," "Hettich T"] [" Elbin T," "Elbin T"] [" Maerlender A," "Maerlender A"] [" Kontos AP," "Kontos AP"] [" Talavage TM," "Talavage TM"] [" Meehan WP 3rd," "Meehan WP 3rd"] [" Lange RT," "Lange RT"] [" Covassin T," "Covassin T"] [" Schatz P." "Schatz P"] [" Lange RT," "Lange RT"] [" Pardini JE," "Pardini JE"] [" Echlin PS," "Echlin PS"] [" Schatz P," "Schatz P"] [" Echlin PS," "Echlin PS"] [" Keightley ML," "Keightley ML"] [" McGrath N." "McGrath N"] [" Covassin T," "Covassin T"] [" Pontifex MB," "Pontifex MB"] [" AB," "AB"] [" Casson IR," "Casson IR"] [" McCrory P," "McCrory P"] [" Covassin T," "Covassin T"] [" Bruce JM," "Bruce JM"] [" Covassin T," "Covassin T"] [" Lovell M." "Lovell M"] [" Lau B," "Lau B"] [" Nance ML," "Nance ML"] [" Peterson SE," "Peterson SE"] [" Lovell M." "Lovell M"] [" Broglio SP," "Broglio SP"] [" Broglio SP," "Broglio SP"] [" Colvin AC," "Colvin AC"] [" Reddy CC," "Reddy CC"] [" Solomon GS," "Solomon GS"] [" Covassin T," "Covassin T"] [" Majerske CW," "Majerske CW"] [" Lovell MR," "Lovell MR"] [" AB," "AB"] [" Tsushima WT," "Tsushima WT"] [" Miller JR," "Miller JR"] [" Slobounov S," "Slobounov S"] [" Mihalik JP," "Mihalik JP"] [" Covassin T," "Covassin T"] [" Lovell MR," "Lovell MR"] [" Stoller KP." "Stoller KP"] [" Broglio SP," "Broglio SP"] [" Moser RS," "Moser RS"] [" Iverson G." 
"Iverson G"] [" Fazio VC," "Fazio VC"] [" Swanik CB," "Swanik CB"] [" Broglio SP," "Broglio SP"] [" Covassin T," "Covassin T"] [" Broglio SP," "Broglio SP"] [" Chen JK," "Chen JK"] [" Van Kampen DA," "Van Kampen DA"] [" Broglio SP," "Broglio SP"] [" Pellman EJ," "Pellman EJ"] [" Pellman EJ," "Pellman EJ"] [" Schatz P," "Schatz P"] [" Biasca N," "Biasca N"] [" Collins M," "Collins M"] [" Lovell MR," "Lovell MR"] [" Lovell MR," "Lovell MR"] [" Iverson GL," "Iverson GL"] [" Cantu RC," "Cantu RC"] [" McClincy MP," "McClincy MP"] [" Schatz P," "Schatz P"] [" Iverson GL," "Iverson GL"] [" Van Kampen DA," "Van Kampen DA"] [" Lovell M," "Lovell M"] [" Mihalik JP," "Mihalik JP"] [" Moser RS," "Moser RS"] [" Broshek DK," "Broshek DK"] [" Grove R," "Grove R"] [" McCrea M," "McCrea M"] [" McCrory P," "McCrory P"] [" Iverson GL," "Iverson GL"] [" Lovell MR," "Lovell MR"] [" Bruce JM," "Bruce JM"] [" Pellman EJ," "Pellman EJ"] [" Iverson GL," "Iverson GL"] [" Lovell MR," "Lovell MR"] [" Kontos A," "Kontos A"] [" Collins MW," "Collins MW"] [" Iverson GL," "Iverson GL"] [" Lovell M," "Lovell M"] [" Field M," "Field M"] [" Covassin T," "Covassin T"] [" Iverson GL," "Iverson GL"] [" Lovell MR," "Lovell MR"] [" Collins MW," "Collins MW"] [" Lovell MR," "Lovell MR"] [" Collins MW," "Collins MW"] [" Collins MW," "Collins MW"] [" Collins MW," "Collins MW"] [" Maroon JC," "Maroon JC"] [" Lovell MR," "Lovell MR"] [" Lovell MR." "Lovell MR"] [" Aubry M," "Aubry M"] [" Grindel SH," "Grindel SH"] [" Collins MW," "Collins MW"] [" Lovell MR," "Lovell MR"] [" Collins MW," "Collins MW"] [" Lovell MR," "Lovell MR"])
answered 2014-07-10T15:09:18.743

For the sake of illustration...

Note: enlive is aliased as html here, so its functions carry an html/ prefix.

(require '[net.cgrand.enlive-html :as html])

The output of (references) is a sequence of individual reference elements like this one:

(def data-sample 
  '{:tag :li, :attrs nil, 
    :content 
    ("\n\t\t\t\t\t\t\t\t\t\t\t\t\t" 
      {:tag :strong, :attrs nil, 
       :content 
       ("AMPAR peptide values *snip*.")}
      " Dambinova SA, Shikuev, Weissman JD, Mullins, JD. "
      {:tag :em, :attrs nil, :content ("Military Medicine.")}
      " 2013, 178 (3):285-290.\t\t\t\t\t\t\t\t\t\t\t\t")})

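You'll notice that the article title is in bold and the journal is in italics, so you could use selectors to extract at least those two. But since the formatting changes are there to separate the components visually, they also separate the data.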

(defn trimmed-text-only [html-data] 
  (as-> html-data x
    (html/select x [html/text-node])
    (map clojure.string/trim x)
    (remove empty? x)))

(trimmed-text-only data-sample)
;=> 
("AMPAR peptide values *snip*." 
 "Dambinova SA, Shikuev, Weissman JD, Mullins, JD." 
 "Military Medicine." 
 "2013, 178 (3):285-290.")

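This already reveals the components, but note that each component ends with a period and that no periods are used inside a component. So you could also ignore the formatting entirely and split on periods, with the added benefit of removing those periods.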

(defn extract-major-reference-components
  [html-data]
  (as-> html-data x
    (trimmed-text-only x)
    (apply str x)
    (clojure.string/split x #"\.")
    (zipmap [:title :authors :journal :issue-ref] x)))

(extract-major-reference-components data-sample)
;=> 
{:title "AMPAR peptide values *snip*",
 :authors "Dambinova SA, Shikuev, Weissman JD, Mullins, JD",
 :journal "Military Medicine",
 :issue-ref "2013, 178 (3):285-290"}

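Now you can map this extraction function over the sequence of references. With the output maps, you can do further transformations using update-in and regexps, for example to separate the individual authors, or to pull the year, issue number and pages out of issue-ref.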
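
A possible sketch of that last step, applied to the map from the sample above. The helper names parse-issue-ref and refine, the issue-ref regex, and the reading of 178 as a volume number are all assumptions made for illustration. The regex only covers the "year, volume (issue):pages" shape seen in this one sample.

```clojure
(require '[clojure.string :as string])

(def reference
  {:title "AMPAR peptide values *snip*"
   :authors "Dambinova SA, Shikuev, Weissman JD, Mullins, JD"
   :journal "Military Medicine"
   :issue-ref "2013, 178 (3):285-290"})

;; Assumed layout: "<year>, <volume> (<issue>):<pages>".
;; Returns nil when the string does not have that shape.
(defn parse-issue-ref [s]
  (when-let [[_ year volume issue pages]
             (re-matches #"(\d{4}),\s*(\d+)\s*\((\d+)\):(\S+)" s)]
    {:year year, :volume volume, :issue issue, :pages pages}))

(defn refine [reference]
  (-> reference
      (update-in [:authors] #(string/split % #",\s*"))
      (update-in [:issue-ref] parse-issue-ref)))

(refine reference)
;; :authors   becomes ["Dambinova SA" "Shikuev" "Weissman JD" "Mullins" "JD"]
;; :issue-ref becomes {:year "2013", :volume "178", :issue "3", :pages "285-290"}
```

Note that splitting on commas turns "Mullins, JD" into two entries, so real data will probably need something smarter for the author list.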

answered 2014-07-10T16:48:07.460