string - Pattern matching for URL classification

Question

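As part of a project, a few others and I are currently working on a URL classifier. What we are trying to implement is actually quite simple: we look at the URL, find the relevant keywords occurring in it, and classify the page accordingly.

Example: if the URL is http://cnnworld/sports/abcd, we would classify it under the category "sports".

To accomplish this, we have a database with mappings of the form: keyword -> category

What we currently do is, for each URL, keep reading all the data items in the database and use the String.find() method to check whether the keyword occurs in the URL. Once a match is found, we stop.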

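But this approach has a few problems, the main ones being:

(i) Our database is very big, and such repeated querying runs extremely slowly

(ii) A page may belong to more than one category, and our approach does not handle such cases. Of course, one simple way to handle this would be to continue querying the database even after a category match is found, but this would only make things even slower.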

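I was thinking of alternatives and wondering whether the reverse could be done: parse the URL, find the words occurring in it, and then query the database for those words only.

A naive algorithm for this would run in O( n^2 ), where n is the length of the URL: query the database for every substring that occurs in the URL.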
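A minimal sketch of that naive reverse lookup, assuming the keyword -> category table has been loaded into an in-memory map with lowercase keys (an assumption; above it lives in a database). The class and method names are hypothetical. Collecting every hit instead of stopping at the first also covers the multiple-category case:

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class SubstringClassifier {
    /** Hypothetical in-memory copy of the keyword -> category table (lowercase keys). */
    private final Map<String, String> keywordToCategory;
    /** Length of the longest keyword; bounds the inner loop. */
    private final int maxKeywordLength;

    SubstringClassifier(Map<String, String> keywordToCategory) {
        this.keywordToCategory = keywordToCategory;
        int max = 0;
        for (String keyword : keywordToCategory.keySet()) {
            max = Math.max(max, keyword.length());
        }
        this.maxKeywordLength = max;
    }

    /** Looks up every substring of the URL and collects all matching categories. */
    Set<String> classify(String url) {
        String text = url.toLowerCase(Locale.ROOT);
        Set<String> categories = new LinkedHashSet<>();
        for (int start = 0; start < text.length(); start++) {
            // No substring longer than the longest keyword can match
            int limit = Math.min(text.length(), start + maxKeywordLength);
            for (int end = start + 1; end <= limit; end++) {
                String category = keywordToCategory.get(text.substring(start, end));
                if (category != null) {
                    categories.add(category);
                }
            }
        }
        return categories;
    }

    public static void main(String[] args) {
        SubstringClassifier classifier =
                new SubstringClassifier(Map.of("sports", "Sports", "cnn", "News"));
        System.out.println(classifier.classify("http://cnnworld/sports/abcd"));
    }
}
```

Capping the substring length at the longest keyword keeps the number of lookups at roughly n times that length rather than n^2.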

I was wondering whether there is a better approach to accomplish this. Any ideas? Thanks in advance :)

score 2 · Accepted Answer

Our commercial classifier has a database of 4 million keywords :) and we also search the body of the HTML. There are a number of ways to solve this:

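Use Aho-Corasick. We used an algorithm specially modified to work with web content: for example, it treats tab, space, \r, and \n as whitespace and collapses a run of them into a single space, and it ignores lower/upper case.
Another option is to put all your keywords inside a tree (e.g. std::map), which makes the search very fast. The downside is that this takes memory, and a lot of it, but if it runs on a server you won't feel it.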
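To make the first option concrete, here is a minimal textbook Aho-Corasick sketch in Java. It is not the modified algorithm mentioned above, whose details are not given; the whitespace collapsing and case folding in normalize() are one reading of that description, and the class and method names are hypothetical. It assumes non-empty keywords.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

class AhoCorasickClassifier {
    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        Node fail;                                          // longest proper suffix that is also a trie prefix
        final List<String> categories = new ArrayList<>();  // categories of keywords ending here
    }

    private final Node root = new Node();

    AhoCorasickClassifier(Map<String, String> keywordToCategory) {
        // Step 1: insert every (normalized) keyword into the trie
        for (Map.Entry<String, String> entry : keywordToCategory.entrySet()) {
            Node node = root;
            for (char c : normalize(entry.getKey()).toCharArray()) {
                node = node.next.computeIfAbsent(c, k -> new Node());
            }
            node.categories.add(entry.getValue());
        }
        // Step 2: breadth-first pass to compute failure links
        Queue<Node> queue = new ArrayDeque<>();
        for (Node child : root.next.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            Node node = queue.remove();
            for (Map.Entry<Character, Node> edge : node.next.entrySet()) {
                Node child = edge.getValue();
                Node f = node.fail;
                while (f != root && !f.next.containsKey(edge.getKey())) {
                    f = f.fail;
                }
                Node target = f.next.get(edge.getKey());
                child.fail = (target != null) ? target : root;
                // Inherit matches that end at the failure target
                child.categories.addAll(child.fail.categories);
                queue.add(child);
            }
        }
    }

    /** Scans the text once and returns every category whose keyword occurs in it. */
    Set<String> classify(String text) {
        Set<String> found = new LinkedHashSet<>();
        Node node = root;
        for (char c : normalize(text).toCharArray()) {
            while (node != root && !node.next.containsKey(c)) {
                node = node.fail;
            }
            node = node.next.getOrDefault(c, root);
            found.addAll(node.categories);
        }
        return found;
    }

    /** Lowercases and collapses runs of tab, space, \r, \n into one space. */
    static String normalize(String s) {
        return s.toLowerCase(Locale.ROOT).replaceAll("[ \\t\\r\\n]+", " ");
    }

    public static void main(String[] args) {
        AhoCorasickClassifier classifier = new AhoCorasickClassifier(
                Map.of("sports", "Sports", "world cup", "Football"));
        System.out.println(classifier.classify("http://cnnworld/SPORTS/abcd World \t\n Cup"));
    }
}
```

The automaton is built once; each URL or page body is then scanned in a single pass, regardless of how many keywords there are.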

score 1 · Accepted Answer

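Your suggestion of breaking apart the URL to find the useful bits and then querying for just those items sounds like a decent way to go.

I tossed together some Java that might help illustrate, code-wise, what I think this would entail. The most valuable parts are probably the regexes, but I hope the general algorithm helps some as well: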

import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.List;

public class CategoryParser 
{
    /** The db field that keywords should be checked against */
    private static final String DB_KEYWORD_FIELD_NAME = "keyword";

    /** The db field that categories should be pulled from */
    private static final String DB_CATEGORY_FIELD_NAME = "category";

    /** The name of the table to query */
    private static final String DB_TABLE_NAME = "KeywordCategoryMap";

    /**
     * This method takes a URL and from that text alone determines what categories that URL belongs in.
     * @param url - String URL to categorize
     * @return categories - A List&lt;String&gt; of categories the URL seemingly belongs in
     */
    public static List<String> getCategoriesFromUrl(String url) {

        // Clean the URL to remove useless bits and encoding artifacts
        String normalizedUrl = normalizeURL(url);

        // Break the url apart and get the good stuff
        String[] keywords = tokenizeURL(normalizedUrl);

        // Construct the query we can query the database with
        String query = constructKeywordCategoryQuery(keywords);

        System.out.println("Generated Query: " + query);

        // At this point, you'd need to fire this query off to your database,
        // and the results you'd get back should each be a valid category
        // for your URL. This code is not provided because it's very implementation specific,
        // and you already know how to deal with databases.


        // Returning null to make this compile, even though you'd obviously want to return the
        // actual List of Strings
        return null;
    }

    /**
     * Removes the protocol, if it exists, from the front and
     * removes any random encoding characters
     * Extend this to do other url cleaning/pre-processing
     * @param url - The String URL to normalize
     * @return normalizedUrl - The String URL that has no junk or surprises
     */
    private static String normalizeURL(String url)
    {
        // Decode URL to remove any %20 type stuff
        String normalizedUrl = url;
        try {
            // I've used a URLDecoder that's part of Java here,
            // but this functionality exists in most modern languages
            // and is universally called url decoding
            normalizedUrl = URLDecoder.decode(url, "UTF-8");
        }
        catch(UnsupportedEncodingException uee)
        {
            System.err.println("Unable to Decode URL. Decoding skipped.");
            uee.printStackTrace();
        }

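        // Remove the protocol, http:// ftp:// or similar from the front.
        // indexOf/substring keeps everything after the first "://", even if
        // "://" appears again later (e.g. inside a query parameter)
        int protocolEnd = normalizedUrl.indexOf("://");
        if (protocolEnd >= 0)
        {
            normalizedUrl = normalizedUrl.substring(protocolEnd + 3);
        }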

        // Room here to do more pre-processing

        return normalizedUrl;
    }

    /**
     * Takes apart the url into the pieces that make at least some sense
     * This doesn't guarantee that each token is a potentially valid keyword, however
     * because that would require actually iterating over them again, which might be
     * seen as a waste.
     * @param url - Url to be tokenized
     * @return tokens - A String array of all the tokens
     */
    private static String[] tokenizeURL(String url)
    {
        // I assume that we're going to use the whole URL to find tokens in
        // If you want to just look in the GET parameters, or you want to ignore the domain
        // or you want to use the domain as a token itself, that would have to be
        // processed above the next line, and only the remaining parts split
        String[] tokens = url.split("\\b|_");

        // One could alternatively use a more complex regex to remove more invalid matches
        // but this is subject to your (?:in)?ability to actually write the regex you want

        // These next two get rid of tokens that are too short, also.

        // Destroys anything that's not alphanumeric and things that are
        // alphanumeric but only 1 character long
        //String[] tokens = url.split("(?:[\\W_]+\\w)*[\\W_]+");

        // Destroys anything that's not alphanumeric and things that are
        // alphanumeric but only 1 or 2 characters long
        //String[] tokens = url.split("(?:[\\W_]+\\w{1,2})*[\\W_]+");

        return tokens;
    }

    private static String constructKeywordCategoryQuery(String[] keywords)
    {
        // This will hold our WHERE body, keyword OR keyword2 OR keyword3
        StringBuilder whereItems = new StringBuilder();

        // Potential query, if we find anything valid
        String query = null;

        // Iterate over every found token
        for (String keyword : keywords)
        {
            // Reject invalid keywords
            if (isKeywordValid(keyword))
            {
                // If we need an OR
                if (whereItems.length() > 0)
                {
                    whereItems.append(" OR ");
                }

                // Simply append this item to the query
                // Yields something like "keyword='thisKeyword'"
                whereItems.append(DB_KEYWORD_FIELD_NAME);
                whereItems.append("='");
                whereItems.append(keyword);
                whereItems.append("'");
            }
        }

        // If a valid keyword actually made it into the query
        if (whereItems.length() > 0)
        {
            query = "SELECT DISTINCT(" + DB_CATEGORY_FIELD_NAME + ") FROM " + DB_TABLE_NAME
                    + " WHERE " + whereItems.toString() + ";";
        }

        return query;
    }

    private static boolean isKeywordValid(String keyword)
    {
        // Keywords better be at least 2 characters long
        return keyword.length() > 1
                // And they better be only composed of letters and numbers
                && keyword.matches("\\w+")
                // And they better not be *just* numbers
                // && !keyword.matches("\\d+") // If you want this
                ;
    }

    // How this would be used
    public static void main(String[] args)
    {
        List<String> soQuestionUrlClassifications = getCategoriesFromUrl("http://stackoverflow.com/questions/10046178/pattern-matching-for-url-classification");
        List<String> googleQueryURLClassifications = getCategoriesFromUrl("https://www.google.com/search?sugexp=chrome,mod=18&sourceid=chrome&ie=UTF-8&q=spring+is+a+new+service+instance+created#hl=en&sugexp=ciatsh&gs_nf=1&gs_mss=spring%20is%20a%20new%20bean%20instance%20created&tok=lnAt2g0iy8CWkY65Te75sg&pq=spring%20is%20a%20new%20bean%20instance%20created&cp=6&gs_id=1l&xhr=t&q=urlencode&pf=p&safe=off&sclient=psy-ab&oq=url+en&gs_l=&pbx=1&bav=on.2,or.r_gc.r_pw.r_cp.r_qf.,cf.osb&fp=2176d1af1be1f17d&biw=1680&bih=965");
    }
}

The generated query for the SO link would look like:

SELECT DISTINCT(category) FROM KeywordCategoryMap WHERE keyword='stackoverflow' OR keyword='com' OR keyword='questions' OR keyword='10046178' OR keyword='pattern' OR keyword='matching' OR keyword='for' OR keyword='url' OR keyword='classification';

There is plenty of room for optimization, but I imagine it would be much faster than checking the string for every possible keyword.
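One possible refinement, sketched under the assumption that the database is reached through JDBC: build a single IN (...) clause with ? placeholders and bind the keywords through a PreparedStatement instead of concatenating them into the SQL text. The table and column names are the ones used in the code above; the class and method names here are hypothetical.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class KeywordQueryBuilder {
    /** Builds the parameterized query text, or null if there are no keywords. */
    static String buildQuery(int keywordCount) {
        if (keywordCount <= 0) {
            return null;
        }
        // One "?" placeholder per keyword, e.g. "?, ?, ?"
        String placeholders = String.join(", ", Collections.nCopies(keywordCount, "?"));
        return "SELECT DISTINCT category FROM KeywordCategoryMap WHERE keyword IN ("
                + placeholders + ")";
    }

    /** Binds each keyword to its placeholder and collects the resulting categories. */
    static List<String> fetchCategories(Connection connection, List<String> keywords)
            throws SQLException {
        List<String> categories = new ArrayList<>();
        String sql = buildQuery(keywords.size());
        if (sql == null) {
            return categories;
        }
        try (PreparedStatement statement = connection.prepareStatement(sql)) {
            for (int i = 0; i < keywords.size(); i++) {
                statement.setString(i + 1, keywords.get(i)); // JDBC parameters are 1-based
            }
            try (ResultSet results = statement.executeQuery()) {
                while (results.next()) {
                    categories.add(results.getString(1));
                }
            }
        }
        return categories;
    }

    public static void main(String[] args) {
        System.out.println(buildQuery(3));
    }
}
```

With bound parameters the keyword text never becomes part of the SQL string, so the validity check on tokens is no longer the only thing standing between a URL and the query.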

score 0 · Accepted Answer

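If you have (many) fewer categories than keywords, you could create a regex for each category that matches any of the keywords for that category. You would then run your URL against each category's regex. This would also address the issue of matching multiple categories.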
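A rough Java sketch of this idea, assuming the keywords are already grouped by category in memory (the class and method names are hypothetical, and the case-insensitive matching is an added assumption):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class RegexCategoryClassifier {
    /** One compiled alternation per category, e.g. "sports|football". */
    private final Map<String, Pattern> patternByCategory = new LinkedHashMap<>();

    RegexCategoryClassifier(Map<String, List<String>> keywordsByCategory) {
        for (Map.Entry<String, List<String>> entry : keywordsByCategory.entrySet()) {
            if (entry.getValue().isEmpty()) {
                continue; // an empty alternation would match every URL
            }
            // Pattern.quote keeps regex metacharacters in keywords literal
            String alternation = entry.getValue().stream()
                    .map(Pattern::quote)
                    .collect(Collectors.joining("|"));
            patternByCategory.put(entry.getKey(),
                    Pattern.compile(alternation, Pattern.CASE_INSENSITIVE));
        }
    }

    /** Runs the URL against every category regex and returns all categories that match. */
    List<String> classify(String url) {
        List<String> matches = new ArrayList<>();
        for (Map.Entry<String, Pattern> entry : patternByCategory.entrySet()) {
            if (entry.getValue().matcher(url).find()) {
                matches.add(entry.getKey());
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        Map<String, List<String>> keywordsByCategory = new LinkedHashMap<>();
        keywordsByCategory.put("Sports", List.of("sports", "football"));
        keywordsByCategory.put("News", List.of("cnn", "bbc"));
        RegexCategoryClassifier classifier = new RegexCategoryClassifier(keywordsByCategory);
        System.out.println(classifier.classify("http://cnnworld/sports/abcd"));
    }
}
```

The cost per URL grows with the number of categories rather than the number of keywords, which is why this only pays off when categories are few.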
