python - URL の最長一致プレフィックス

Question

URL の「最長プレフィックス一致」に使用できる標準の python パッケージに関する情報が必要です。私は2つの標準パッケージhttp://packages.python.org/PyTrie/#pytrie.StringTrie & 'http://pypi.python.org/pypi/trie/0.1.1' を調べましたが、そうではないようですURL の最長プレフィックスマッチタスクに役立ちます。

たとえば、私のセットにこれらの URL がある場合、1->http://www.google.com/mail 、2->http://www.google.com/document、3->http://www.facebook.comなど。

「http://www.google.com/doc」を検索すると 2 が返され、「http://www.face」を検索すると 3 が返されます。

これを行うのに役立つ標準のpythonパッケージがあるかどうか、またはプレフィックスマッチングのためにTrieを実装する必要があるかどうかを確認したかったのです。

URLの数が増えるにつれて拡張性がないため、正規表現のようなソリューションは探していません。

どうもありがとう。

score 13 · Accepted Answer

性能比較

`suffixtree`vs. `pytrie`vs. `trie`vs. vs. `datrie`-`startswith`関数

設定

記録時間は、1000回の検索を3回繰り返したうちの最小時間です。トライ構築時間が含まれ、すべての検索に分散されます。検索は、1 ～ 1000000 項目のホスト名のコレクションに対して実行されます。

3 種類の検索文字列:

non_existent_key- 文字列に一致するものはありません
rare_key- 100万分の20程度
frequent_key- 出現回数はコレクションサイズに匹敵します

結果

100 万 URL の最大メモリ消費量:

| function    | memory, | ratio |
|             |     GiB |       |
|-------------+---------+-------|
| suffix_tree |   0.853 |   1.0 |
| pytrie      |   3.383 |   4.0 |
| trie        |   3.803 |   4.5 |
| datrie      |   0.194 |   0.2 |
| startswith  |   0.069 |   0.1 |
#+TBLFM: $3=$2/@3$2;%.1f

結果を再現するには、トライベンチマークコードを実行します。

まれなキー/存在しないキーのケース

URL の数が 10000 未満の場合、datrie が最も高速で、N>10000 の場合 -suffixtreeより速く、startwith平均して大幅に遅くなります。

レアキー

軸:
- 垂直 (時間) スケールは ~1 秒 (2**20 マイクロ秒)
- 横軸は、それぞれの場合の URL の総数を示します: N= 1、10、100、1000、10000、100000、および 1000000 (100 万)。

存在しない_キー

頻繁なキー

最大 N=100000datrieが最速です (100 万の URL の場合、時間はトライの構築時間に支配されます)。

見つかった一致の中で最も長い一致を見つけるのに最も時間がかかります。したがって、すべての関数は期待どおりに動作します。

頻繁なキー

startswith- 時間性能はキーの種類に依存しません。

trieそしてpytrieお互いに似たような振る舞いをします。

トライ構築時間のないパフォーマンス

datrie- 最速でまともなメモリ消費
startswith他のアプローチは、トライを構築するのにかかる時間によってペナルティを受けないため、ここではさらに不利です。
datrie, pytrie, trie- ほとんど O(1) (一定時間) まれな/存在しないキー

Rare_key_no_trie_build_time nonexistent_key_no_trie_build_time

frequent_key_no_trie_build_time

比較のために既知の関数の多項式を近似 (近似) します (図と同じ対数/対数スケール):

| Fitting polynom              | Function          |
|------------------------------+-------------------|
| 0.15  log2(N)   +      1.583 | log2(N)           |
| 0.30  log2(N)   +      3.167 | log2(N)*log2(N)   |
| 0.50  log2(N)   +  1.111e-15 | sqrt(N)           |
| 0.80  log2(N)   +  7.943e-16 | N**0.8            |
| 1.00  log2(N)   +  2.223e-15 | N                 |
| 2.00  log2(N)   +  4.446e-15 | N*N               |

score 12 · Accepted Answer

この例は小さな URL リストには適していますが、うまくスケーリングできません。

def longest_prefix_match(search, urllist):
    matches = [url for url in urllist if url.startswith(search)]
    if matches:
        return max(matches, key=len)
    else:
        raise Exception("Not found")

tryモジュールを使用した実装。

import trie


def longest_prefix_match(prefix_trie, search):
    # There may well be a more elegant way to do this without using
    # "hidden" method _getnode.
    try:
        return list(node.value for node in prefix_trie._getnode(search).walk())
    except KeyError:
        return list()

url_list = [ 
    'http://www.google.com/mail',
    'http://www.google.com/document',
    'http://www.facebook.com',
]

url_trie = trie.Trie()

for url in url_list:
    url_trie[url] = url 

searches = ("http", "http://www.go", "http://www.fa", "http://fail")

for search in searches:
    print "'%s' ->" % search, longest_prefix_match(url_trie, search)

結果：

'http' -> ['http://www.facebook.com', 'http://www.google.com/document', 'http://www.google.com/mail']
'http://www.go' -> ['http://www.google.com/document', 'http://www.google.com/mail']
'http://www.fa' -> ['http://www.facebook.com']
'http://fail' -> []

またはPyTrieを使用しても同じ結果が得られますが、リストの順序が異なります。

from pytrie import StringTrie


url_list = [ 
    'http://www.google.com/mail',
    'http://www.google.com/document',
    'http://www.facebook.com',
]

url_trie = StringTrie()

for url in url_list:
    url_trie[url] = url 

searches = ("http", "http://www.go", "http://www.fa", "http://fail")

for search in searches:
    print "'%s' ->" % search, url_trie.values(prefix=search)

メモリ使用量の観点から、基数ツリー/パトリシアツリーの方が優れていると思い始めています。基数ツリーは次のようになります。

サンプル URL の基数ツリー

トライは次のようになります。 URL の例を試す

score 1 · Accepted Answer

時間のパフォーマンスのためにRAMを交換しても構わないと思っている場合は、SuffixTree役立つかもしれません. 線形時間で最も長い一般的な部分文字列の問題を解決できるなど、優れたアルゴリズム特性があります。

任意の部分文字列ではなく常にプレフィックスを検索する場合は、入力中に一意のプレフィックスを追加できますSubstringDict()。

from SuffixTree import SubstringDict

substr_dict = SubstringDict()
for url in URLS: # urls must be ascii (valid urls are)
    assert '\n' not in url
    substr_dict['\n'+url] = url #NOTE: assume that '\n' can't be in a url

def longest_match(url_prefix, _substr_dict=substr_dict):
    matches = _substr_dict['\n'+url_prefix]
    return max(matches, key=len) if matches else ''

このようなの使用は最適ではないように思えますが、 @StephenPaulger のソリューション[ に基づく]よりも ( の構築時間SuffixTreeなしで) 20-150 倍高速であり、私が試したデータでは十分である可能性があります。SubstringDict().startswith()

をインストールするSuffixTreeには、次を実行します。

pip install SuffixTree -f https://hkn.eecs.berkeley.edu/~dyoo/python/suffix_trees

score 1 · Accepted Answer

以下の関数は、最長一致のインデックスを返します。他の有用な情報も簡単に抽出できます。

from os.path import commonprefix as oscp

def longest_prefix(s, slist):
    pfx_idx = ((oscp([s, url]), i) for i, url in enumerate(slist))
    len_pfx_idx = map(lambda t: (len(t[0]), t[0], t[1]), pfx_idx)
    length, pfx, idx = max(len_pfx_idx)
    return idx

slist = [
    'http://www.google.com/mail',
    'http://www.google.com/document',
    'http://www.facebook.com',
]

print(longest_prefix('http://www.google.com/doc', slist))
print(longest_prefix('http://www.face', slist))

python - URL の最長一致プレフィックス

4 に答える 4

性能比較

suffixtreevs. pytrievs. trievs. vs. datrie-startswith関数

設定

結果

トライ構築時間のないパフォーマンス

Related

Reference

`suffixtree`vs. `pytrie`vs. `trie`vs. vs. `datrie`-`startswith`関数