python - PageRank の計算を最適化する方法は?

Question

Programming Collective Intelligenceという本の中で、PageRank を計算する次の関数を見つけました。

def calculatepagerank(self,iterations=20):
    # clear out the current PageRank tables
    self.con.execute("drop table if exists pagerank")
    self.con.execute("create table pagerank(urlid primary key,score)")
    self.con.execute("create index prankidx on pagerank(urlid)")

    # initialize every url with a PageRank of 1.0
    self.con.execute("insert into pagerank select rowid,1.0 from urllist")
    self.dbcommit()

    for i in range(iterations):
        print "Iteration %d" % i
        for (urlid,) in self.con.execute("select rowid from urllist"):
            pr=0.15

            # Loop through all the pages that link to this one
            for (linker,) in self.con.execute("select distinct fromid from link where toid=%d" % urlid):
                # Get the PageRank of the linker
                linkingpr=self.con.execute("select score from pagerank where urlid=%d" % linker).fetchone()[0]

                # Get the total number of links from the linker
                linkingcount=self.con.execute("select count(*) from link where fromid=%d" % linker).fetchone()[0]

                pr+=0.85*(linkingpr/linkingcount)

            self.con.execute("update pagerank set score=%f where urlid=%d" % (pr,urlid))
        self.dbcommit()

ただし、すべての反復ですべての SQL クエリが実行されるため、この関数は非常に低速です。

>>> import cProfile
>>> cProfile.run("crawler.calculatepagerank()")
         2262510 function calls in 136.006 CPU seconds

   Ordered by: standard name

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1    0.000    0.000  136.006  136.006 <string>:1(<module>)
     1   20.826   20.826  136.006  136.006 searchengine.py:179(calculatepagerank)
    21    0.000    0.000    0.528    0.025 searchengine.py:27(dbcommit)
    21    0.528    0.025    0.528    0.025 {method 'commit' of 'sqlite3.Connecti
     1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler
1339864  112.602    0.000  112.602    0.000 {method 'execute' of 'sqlite3.Connec 
922600    2.050    0.000    2.050    0.000 {method 'fetchone' of 'sqlite3.Cursor' 
     1    0.000    0.000    0.000    0.000 {range}

だから私は関数を最適化し、これを思いついた：

def calculatepagerank2(self,iterations=20):
    # clear out the current PageRank tables
    self.con.execute("drop table if exists pagerank")
    self.con.execute("create table pagerank(urlid primary key,score)")
    self.con.execute("create index prankidx on pagerank(urlid)")

    # initialize every url with a PageRank of 1.0
    self.con.execute("insert into pagerank select rowid,1.0 from urllist")
    self.dbcommit()

    inlinks={}
    numoutlinks={}
    pagerank={}

    for (urlid,) in self.con.execute("select rowid from urllist"):
        inlinks[urlid]=[]
        numoutlinks[urlid]=0
        # Initialize pagerank vector with 1.0
        pagerank[urlid]=1.0
        # Loop through all the pages that link to this one
        for (inlink,) in self.con.execute("select distinct fromid from link where toid=%d" % urlid):
            inlinks[urlid].append(inlink)
            # get number of outgoing links from a page        
            numoutlinks[urlid]=self.con.execute("select count(*) from link where fromid=%d" % urlid).fetchone()[0]            

    for i in range(iterations):
        print "Iteration %d" % i

        for urlid in pagerank:
            pr=0.15
            for link in inlinks[urlid]:
                linkpr=pagerank[link]
                linkcount=numoutlinks[link]
                pr+=0.85*(linkpr/linkcount)
            pagerank[urlid]=pr
    for urlid in pagerank:
        self.con.execute("update pagerank set score=%f where urlid=%d" % (pagerank[urlid],urlid))
    self.dbcommit()

この関数は、すべての反復で不要な SQL クエリを回避するため、何倍も高速です (ただし、すべての一時辞書により多くのメモリを使用します)。

>>> cProfile.run("crawler.calculatepagerank2()")
     90070 function calls in 3.527 CPU seconds
Ordered by: standard name

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1    0.004    0.004    3.527    3.527 <string>:1(<module>)
     1    1.154    1.154    3.523    3.523 searchengine.py:207(calculatepagerank2
     2    0.000    0.000    0.058    0.029 searchengine.py:27(dbcommit)
 23065    0.013    0.000    0.013    0.000 {method 'append' of 'list' objects}
     2    0.058    0.029    0.058    0.029 {method 'commit' of 'sqlite3.Connectio
     1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler
 43932    2.261    0.000    2.261    0.000 {method 'execute' of 'sqlite3.Connecti
 23065    0.037    0.000    0.037    0.000 {method 'fetchone' of 'sqlite3.Cursor'
     1    0.000    0.000    0.000    0.000 {range}

しかし、関数をさらに高速化するために、SQL クエリの数をさらに減らすことは可能でしょうか? 更新: calculatepagerank2() のインデントを修正しました。

score 2 · Accepted Answer

非常に大規模なデータベース（たとえば、WWWの＃レコード〜＃ページ）がある場合は、本で提案されているのと同様の方法でデータベースを使用するのが理にかなっています。これは、すべてのデータをメモリに保持できないためです。。

データセットが十分に小さい場合は、それほど多くのクエリを実行しないことで、（おそらく）2番目のバージョンを改善できます。最初のループを次のようなものに置き換えてみてください。

for urlid, in self.con.execute('select rowid from urllist'):
    inlinks[urlid] = []
    numoutlinks[urlid] = 0
    pagerank[urlid] = 1.0

for src, dest in self.con.execute('select fromid, toid from link'):
    inlinks[dest].append(src)
    numoutlinks[src] += 1

このバージョンは、O（n ^ 2）クエリの代わりに正確に2つのクエリを実行します。

score 1 · Accepted Answer

ほとんどの時間は、次の SQL クエリに費やされていると思います。

for (urlid,) in self.con.execute("select rowid from urllist"):
    ...
    for (inlink,) in self.con.execute("select distinct fromid from link where toid=%d" % urlid):
        ...
        numoutlinks[urlid]=self.con.execute("select count(*) from link where fromid=%d" % urlid).fetchone()[0]

十分なメモリがあると仮定すると、これを 2 つのクエリに減らすことができる場合があります。

SELECT fromid,toid FROM link WHERE toid IN (SELECT rowid FROM urllist)
と
SELECT fromid,count(*) FROM link WHERE fromid IN (SELECT rowid FROM urllist) GROUP BY fromid

inlinks次に、結果をループして、numoutlinksおよびを構築できますpagerank。

また、以下を使用することでメリットが得られる場合がありますcollections.defaultdict。

import collections
import itertools
def constant_factory(value):
    return itertools.repeat(value).next

次にinlinks、セットの辞書を作成します。個別の URL のみが必要なため、セットが適切です

inlinks=collections.defaultdict(set)

そして、これはpagerankデフォルト値が 1.0 である dict を作成します:

pagerank=collections.defaultdict(constant_factory(1.0))

collections.defaultdict を使用する利点は、辞書を事前に初期化する必要がないことです。

したがって、私が提案していることをまとめると、次のようになります。

import collections
def constant_factory(value):
    return itertools.repeat(value).next
def calculatepagerank2(self,iterations=20):
    # clear out the current PageRank tables
    self.con.execute("DROP TABLE IF EXISTS pagerank")
    self.con.execute("CREATE TABLE pagerank(urlid primary key,score)")
    self.con.execute("CREATE INDEX prankidx ON pagerank(urlid)")

    # initialize every url with a PageRank of 1.0
    self.con.execute("INSERT INTO pagerank SELECT rowid,1.0 FROM urllist")
    self.dbcommit()

    inlinks=collections.defaultdict(set)

    sql='''SELECT fromid,toid FROM link WHERE toid IN (SELECT rowid FROM urllist)'''
    for f,t in self.con.execute(sql):
        inlinks[t].add(f)

    numoutlinks={}
    sql='''SELECT fromid,count(*) FROM link WHERE fromid IN (SELECT rowid FROM urllist) GROUP BY fromid'''
    for f,c in self.con.execute(sql):
        numoutlinks[f]=c

    pagerank=collections.defaultdict(constant_factory(1.0))
    for i in range(iterations):
        print "Iteration %d" % i
        for urlid in inlinks:
            pr=0.15
            for link in inlinks[urlid]:
                linkpr=pagerank[link]
                linkcount=numoutlinks[link]
                pr+=0.85*(linkpr/linkcount)
            pagerank[urlid]=pr
    sql="UPDATE pagerank SET score=? WHERE urlid=?"
    args=((pagerank[urlid],urlid) for urlid in pagerank)
    self.con.executemany(sql, args)
    self.dbcommit()

score 0 · Accepted Answer

(fromid, toid)スパース行列を何らかの形で保持するのに十分な RAM がありますか? これにより、大幅な最適化が可能になります (アルゴリズムの大幅な変更を伴う)。少なくとも、最も内側のループで(fromid, numlinks)行うことをメモリにキャッシュすることは役立つはずです（URLを扱っている場合、スペースにあるキャッシュはメモリに収まる可能性が高いと思います）。select count(*)O(N)N

score 0 · Accepted Answer

最終的に、すべての回答の組み合わせが私にとって最も効果的であることが判明したため、私は自分の質問に答えています。

    def calculatepagerank4(self,iterations=20):
    # clear out the current PageRank tables
    self.con.execute("drop table if exists pagerank")
    self.con.execute("create table pagerank(urlid primary key,score)")
    self.con.execute("create index prankidx on pagerank(urlid)")

    # initialize every url with a PageRank of 1.0
    self.con.execute("insert into pagerank select rowid,1.0 from urllist")
    self.dbcommit()

    inlinks={}
    numoutlinks={}
    pagerank={}

    for (urlid,) in self.con.execute("select rowid from urllist"):
        inlinks[urlid]=[]
        numoutlinks[urlid]=0
        # Initialize pagerank vector with 1.0
        pagerank[urlid]=1.0

    for src,dest in self.con.execute("select distinct fromid, toid from link"):
        inlinks[dest].append(src)
        numoutlinks[src]+=1          

    for i in range(iterations):
        print "Iteration %d" % i

        for urlid in pagerank:
            pr=0.15
            for link in inlinks[urlid]:
                linkpr=pagerank[link]
                linkcount=numoutlinks[link]
                pr+=0.85*(linkpr/linkcount)
            pagerank[urlid]=pr

    args=((pagerank[urlid],urlid) for urlid in pagerank)
    self.con.executemany("update pagerank set score=? where urlid=?" , args)
    self.dbcommit()

したがって、最初の 2 つのループをで提案されているように置き換えましたallyourcodeが、さらに、のソリューションのように executemany() も使用しました˜unutbu。しかし˜unutbu、引数にジェネレーター式を使用するのとは異なり、メモリを浪費しすぎないようにしていますが、リスト内包表記を使用した方が少し高速でした。最終的に、このルーチンは、本で提案されているルーチンよりも 100 倍高速でした。

>>> cProfile.run("crawler.calculatepagerank4()")
     33512 function calls in 1.377 CPU seconds
Ordered by: standard name

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1    0.004    0.004    1.377    1.377 <string>:1(<module>)
     2    0.000    0.000    0.073    0.036 searchengine.py:27(dbcommit)
     1    0.693    0.693    1.373    1.373 searchengine.py:286(calculatepagerank4
 10432    0.011    0.000    0.011    0.000 searchengine.py:321(<genexpr>)
 23065    0.009    0.000    0.009    0.000 {method 'append' of 'list' objects}
     2    0.073    0.036    0.073    0.036 {method 'commit' of 'sqlite3.Connectio
     1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler
     6    0.379    0.063    0.379    0.063 {method 'execute' of 'sqlite3.Connecti
     1    0.209    0.209    0.220    0.220 {method 'executemany' of 'sqlite3.Conn
     1    0.000    0.000    0.000    0.000 {range}

また、次の問題にも注意する必要があります。

SQL ステートメントを作成するため%fにプレースホルダーを使用する代わりにで文字列の書式設定を使用すると、精度が失われます (たとえば、を使用して 2.9796095721920315 を取得しましたが、を使用して 2.9796100000000001 を取得しました。??%f
あるページから別のページへの重複リンクは、デフォルトの PageRank アルゴリズムでは 1 つのリンクとして扱われます。しかし、本の解決策はそれを考慮していませんでした。
本のアルゴリズム全体に欠陥があります。その理由は、反復ごとに、pagerank スコアが 2 番目のテーブルに格納されていないためです。しかし、これは、反復の結果がループされるページの順序に依存することを意味し、数回の反復後に結果が大幅に変わる可能性があります。この問題を解決するには、追加のテーブル/ディクショナリを使用して次の反復のためにページランクを保存するか、Power Iterationのようなまったく異なるアルゴリズムを使用する必要があります。

python - PageRank の計算を最適化する方法は?

4 に答える 4

Related

Reference