javascript - Using a large list of terms to search through page text and replace words with links

Question

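A while ago I posted this question asking whether it is possible to convert text to HTML links when it matches a list of terms from my database.

I have a fairly huge list of terms: around 6,000.

The accepted answer on that question was superb, but having never used XPath, I was at a loss when problems started occurring. At one point, after fiddling with the code, I somehow managed to add over 40,000 random characters to our database, the majority of which had to be removed manually. Since then I've lost faith in that idea, and the simpler PHP solutions were not efficient enough to deal with the amount of data and the number of terms.

My next attempt at a solution is to write a JS script which, once the page has loaded, retrieves the terms and matches them against the text on the page.

This answer has an idea which I'd like to attempt.

I would use AJAX to retrieve the terms from the database and build an object such as this: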

var words = [
    {
        word: 'Something',
        link: 'http://www.something.com'
    },
    {
        word: 'Something Else',
        link: 'http://www.something.com/else'
    }
];

Once the object has been built, I'd use this kind of code:

//for each array element
$.each(words,
    function() {
        //store it ("this" is gonna become the dom element in the next function)
        var search = this;
        $('.message').each(
            function() {
                //if it's exactly the same
                if ($(this).text() === search.word) {
                    //do your magic tricks
                    $(this).html('<a href="' + search.link + '">' + search.link + '</a>');
                }
            }
        );
    }
);

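Now, at first glance there is a major issue here: with 6,000 terms, will this code be efficient enough to do what I'm trying to do?

One option would be to do some of the overhead within the PHP script that the AJAX call communicates with. For instance, I could send the ID of the post, and the PHP script could then use SQL statements to retrieve all of the information from the post and match it against all 6,000 terms. The return call to the JavaScript would then simply be the matching terms, which would significantly reduce the number of matches the jQuery above has to make (to around 50 at most).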
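The pre-filtering step described above could look roughly like the sketch below. It is written in JavaScript only to keep it next to the other snippets; on the server the same loop would live in the PHP script. The function name `filterTermsForPost` and the sample data are assumptions made for illustration.

```javascript
// Hypothetical helper: keep only the terms that actually occur in a post,
// so the client receives a short list instead of all 6,000 terms.
function filterTermsForPost(postText, terms) {
    var matched = [], i;
    for (i = 0; i < terms.length; i += 1) {
        // plain substring test; no word-boundary or case handling here
        if (postText.indexOf(terms[i].word) !== -1) {
            matched.push(terms[i]);
        }
    }
    return matched;
}

// Sample data (made up for this example)
var allTerms = [
    { word: 'Something', link: 'http://www.something.com' },
    { word: 'Something Else', link: 'http://www.something.com/else' },
    { word: 'Unrelated', link: 'http://www.example.com/unrelated' }
];
var termsForPost = filterTermsForPost('I read Something today.', allTerms);
```

Only the first sample term survives the filter here, so the page-side loop would have a single term to look for.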

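I have no problem with the script taking a few seconds to "load" in the user's browser, as long as it isn't affecting their CPU usage or anything like that.

So, two questions in one:

Can I make this work?
What steps can I take to make it as efficient as possible?

Thanks in advance,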

score 2 · Accepted Answer

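You can cache the results on insert.

Basically, when someone inserts a new post, instead of just inserting it into the DB, you also run the replacement process.

If your posts are stored in the DB like this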

Table: Posts
id        post
102       "Google is a search engine"

you can create another table

Table: cached_Posts
id       post_id   date_generated   cached_post                             
1        102       2012-10-10       "<a href="http://google.com">Google</a> is a search engine"

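When you retrieve a post, first check whether it exists in the cached_Posts table.

The reason to keep the original is that you may add new keywords to replace in the future. All you have to do then is regenerate your cache.

Done this way, no client-side JS is needed, and the replacement only has to run once per post, so your results should come up quickly.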
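As a rough illustration of the lookup-then-generate flow, here is a minimal sketch in which plain objects stand in for the Posts and cached_Posts tables. The names `getPost`, `clearCache` and `linkTerms` are made up for this example, and the linker is deliberately naive; a real implementation would run database queries and a more careful replacement.

```javascript
// In-memory stand-ins for the two tables (assumption: the real thing
// would be database queries, not objects).
var posts = { 102: 'Google is a search engine' };
var cachedPosts = {};

// Hypothetical, deliberately naive linker used only to fill the cache.
function linkTerms(text, terms) {
    var result = text, i;
    for (i = 0; i < terms.length; i += 1) {
        result = result.split(terms[i].word).join(
            '<a href="' + terms[i].link + '">' + terms[i].word + '</a>');
    }
    return result;
}

// Serve the cached version if present, otherwise generate and store it.
function getPost(postId, terms) {
    if (!cachedPosts.hasOwnProperty(postId)) {
        cachedPosts[postId] = {
            dateGenerated: new Date(),
            cachedPost: linkTerms(posts[postId], terms)
        };
    }
    return cachedPosts[postId].cachedPost;
}

// Call this after the term list changes, so posts are regenerated.
function clearCache() {
    cachedPosts = {};
}

var terms = [{ word: 'Google', link: 'http://google.com' }];
var html = getPost(102, terms);
```

The first call to `getPost` pays for the replacement; later calls for the same post only read the stored string until `clearCache` is called.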

score 1 · Accepted Answer

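Here's something relatively simple I came up with. Sorry, there is no thorough testing and no performance testing either. I'm sure it can be optimized further; I just didn't have the time to do it. I added comments to make it easier to follow: http://pastebin.com/nkdTSvi6 It may be a bit long for StackOverflow, but I'll post it here anyway. The pastebin is just for more comfortable viewing.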

function buildTrie(hash) {
    "use strict";
    // A very simple function to build a Trie
    // we could compress this later, but simplicity
    // is better for this example. If we don't
    // perform well, we'll try to optimize this a bit
    // there is a room for optimization here.
    var p, result = {}, leaf, i;
    for (p in hash) {
        if (hash.hasOwnProperty(p)) {
            leaf = result;
            i = 0;
            do {
                if (p[i] in leaf) {
                    leaf = leaf[p[i]];
                } else {
                    leaf = leaf[p[i]] = {};
                }
                i += 1;
            } while (i < p.length);
            // since, obviously, no character
            // equals to empty character, we'll
            // use it to store the reference to the
            // original value
            leaf[""] = hash[p];
        }
    }
    return result;
}

function prefixReplaceHtml(html, trie) {
    "use strict";
    var i, len = html.length, result = [], lastMatch = 0,
        current, leaf, match, matched, replacement;
    for (i = 0; i < len; i += 1) {
        current = html[i];
        if (current === "<") {
            // don't check for out of bounds access
            // assume we never face a situation, when
            // "<" is the last character in an HTML
            if (match) {
                result.push(
                    html.substring(lastMatch, i - matched.length),
                    "<a href=\"", match, "\">", replacement, "</a>");
                lastMatch = i - matched.length + replacement.length;
                i = lastMatch - 1;
            } else {
                if (matched) {
                    // go back to the second character of the
                    // matched string and try again
                    i = i - matched.length;
                }
            }
            matched = match = replacement = leaf = "";
            if (html[i + 1] === "a") {
                // we want to skip replacing inside
                // anchor tags. We also assume they
                // are never nested, as valid HTML is
                // against that idea
                if (html[i + 2] in
                    { " " : 1, "\t" : 1, "\r" : 1, "\n" : 1 }) {
                    // this is certainly an anchor
                    i = html.indexOf("</a", i + 3) + 3;
                    continue;
                }
            }
            // if we got here, it's a regular tag, just look
            // for terminating ">"
            i = html.indexOf(">", i + 1);
            continue;
        }
        // if we got here, we need to start checking
        // for the match in the trie
        if (!leaf) {
            leaf = trie;
        }
        leaf = leaf[current];
        // we prefer longest possible match, just like POSIX
        // regular expressions do
        if (leaf && ("" in leaf)) {
            match = leaf[""];
            replacement = html.substring(
                i - (matched ? matched.length : 0), i + 1);
        }
        if (!leaf) {
            // newby-style inline (all hand work!) pay extra
            // attention, this code is duplicated few lines above
            if (match) {
                result.push(
                    html.substring(lastMatch, i - matched.length),
                    "<a href=\"", match, "\">", replacement, "</a>");
                lastMatch = i - matched.length + replacement.length;
                i = lastMatch - 1;
            } else {
                if (matched) {
                    // go back to the second character of the
                    // matched string and try again
                    i = i - matched.length;
                }
            }
            matched = match = replacement = "";
        } else if (matched) {
            // perhaps a bit premature, but we'll try to avoid
            // string concatenation, when we can.
            matched = html.substring(i - matched.length, i + 1);
        } else {
            matched = current;
        }
    }
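    // append whatever follows the last replacement; without this
    // the tail of the document would be dropped from the result
    result.push(html.substring(lastMatch));
    return result.join("");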
}

function testPrefixReplace() {
    "use strict";
    var trie = buildTrie(
        { "x" : "www.xxx.com", "yyy" : "www.y.com",
          "xy" : "www.xy.com", "yy" : "www.why.com" });
    return prefixReplaceHtml(
        "<html><head>x</head><body><a >yyy</a><p>" +
            "xyyy yy x xy</p><abrval><yy>xxy</yy>", trie);
}

score 1 · Accepted Answer

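As ConvertedSpear says, you shouldn't necessarily give up on PHP just because you couldn't get it working. A Javascript solution, while reducing the load on your server, may appear slower to the end user. A server-side solution can also be cached at any time, whereas a client-side one can't really be.

That said, here are my thoughts on your Javascript. I've never tried anything like this myself, so I can't comment on whether you can make it work, but there are a few things I can see that are potentially problematic:

jQuery's $.each() function is very useful, but not very efficient. Try running this benchmark and you'll see what I mean: http://jsperf.com/jquery-each-vs-for-loops/9
If you run $('.message') on every iteration of your words loop, you could be doing a lot of fairly expensive DOM traversal. If possible, cache the result of this operation in a variable before you start the loop.
Are you relying on each instance of the "search" text being encapsulated by an element with the class message, with no other text surrounding it? Because that is what your if ($(this).text() === search.word) { line implies. In your other question, you seem to suggest that there will be more text surrounding the terms to be replaced. In that case, you should look into regular expressions to perform the replacement. You will also need to check that the text is not already contained within an <a> tag.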
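To make the regular-expression suggestion concrete, here is a minimal sketch that works on an HTML string rather than on the DOM. It builds one combined pattern from the term list and leaves tags and existing anchors untouched. The function names are hypothetical, and it assumes every term starts and ends with a word character (otherwise \b will not match) and that anchors are not nested.

```javascript
// Escape characters that have a special meaning in a regular expression.
function escapeRegExp(text) {
    return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: link every known term in an HTML string in one pass.
function linkifyHtml(html, words) {
    var linkByWord = Object.create(null), patterns = [], i, sorted,
        termRegex, tokenRegex;
    if (words.length === 0) {
        return html;
    }
    // Longer terms first, so "Something Else" wins over "Something".
    sorted = words.slice().sort(function (a, b) {
        return b.word.length - a.word.length;
    });
    for (i = 0; i < sorted.length; i += 1) {
        linkByWord[sorted[i].word] = sorted[i].link;
        patterns.push(escapeRegExp(sorted[i].word));
    }
    termRegex = new RegExp('\\b(?:' + patterns.join('|') + ')\\b', 'g');
    // Tokens: an existing anchor, any other tag, or a run of plain text.
    tokenRegex = /<a[\s>][\s\S]*?<\/a>|<[^>]*>|[^<]+/gi;
    return html.replace(tokenRegex, function (token) {
        if (token.charAt(0) === '<') {
            return token; // tags and existing links are left as they are
        }
        return token.replace(termRegex, function (word) {
            return '<a href="' + linkByWord[word] + '">' + word + '</a>';
        });
    });
}

var sampleWords = [
    { word: 'Something', link: 'http://www.something.com' },
    { word: 'Something Else', link: 'http://www.something.com/else' }
];
var linked = linkifyHtml(
    '<p>Something Else and <a href="/x">Something</a> or Something.</p>',
    sampleWords);
```

On the page, this could be applied once per element of a cached $('.message') collection inside a plain for loop. Whether a single alternation of 6,000 terms is fast enough would need to be measured.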

score 0 · Accepted Answer

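If the messages and the word list are both accessible in your database, I'd suggest doing everything in PHP. It can be done in JS, but it is much better suited to a server-side script.

In JS, basically, you would have to

Load the messages
Load the "dictionary"
Loop through each word of the dictionary
- Look for a match in the DOM (ouch)
  - Replace

The first two points are requests, which carry a fairly large overhead. The loop puts a burden on the client's CPU.

Why I recommend doing this as server-side code:

Servers are better suited to these types of jobs
JS runs in the client's browser, and every client is different (e.g. someone using a poorly performing IE, or someone on a smartphone)

This is very easy to do in PHP.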

<?php
    $dict[] = array('word' => 'dolor', 'link' => 'DOLORRRRRR');
    $dict[] = array('word' => 'nulla', 'link' => 'NULLAAAARRRR');

    //  Pretty sure there's a more efficient way to separate an array.. my PHP is rusty, sorry. 
    $terms = array();
    $replace = array();
    foreach ($dict as $v) {
        // If you want to make sure it's a complete word, add a space to the term. 
        $terms[] = ' ' . $v['word'] . ' ';
        $replace[] = ' '. $v['link'] . ' ';
    }

    $text = "Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.";

    echo str_replace($terms, $replace, $text);


    /* Output: 
    Lorem ipsum DOLORRRRRR sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure DOLORRRRRR in reprehenderit in voluptate velit esse cillum dolore eu fugiat NULLAAAARRRR pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
    */

?>

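This script is fairly basic, though: it does not handle different letter cases.

What I would do:

If PHP performance really hits you hard (I doubt it..), you can run the replacement once and save the result. Then, when you add a new word, delete the caches and regenerate them (you could set up a cron job to do so).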

score 0 · Accepted Answer

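You can make anything work; the question is: is it worth the time you'd put into it?

Step 1, ditch the AJAX requirement. Ajax is meant for interactivity with the server: sending small amounts of data to the server and getting a response. It's not ideal for what you want.

Step 2, ditch the JS requirement; JS is for interaction with the user. What you really need is to deliver a block of text with some words replaced by links. That should be handled server side.

Step 3, focus on the PHP. If it's inefficient, attack that. Find ways to make it more efficient. What did you try in PHP? Why wasn't it efficient?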
