python - How can I efficiently process db queries for a freq dist calculation?

Question

I have been working on this for a little while and am trying to build the frequency distribution on the database side:

from itertools import permutations
import sqlite3

def populate_character_probabilities(connection, table="freq_dist", alphabet='abcdefghijklmnopqrstuvwxyz'):
    c = connection.cursor()
    c.execute("DROP TABLE IF EXISTS tablename".replace('tablename', table))
    c.execute("create table tablename (char1 text, char2 text, freq integer);".replace("tablename", table))    
    char_freq_tuples = [x + (1,) for x in list(permutations(alphabet, 2)) + [(alpha, alpha) for alpha in alphabet + 'S' + 'E']]
    c.executemany("insert into tablename values (?,?,?);".replace("tablename", table), char_freq_tuples)
    connection.commit()
    c.close()

def populate_word_list(connection, table="word_list"):
    cursor = connection.cursor()
    cursor.execute("DROP TABLE IF EXISTS tablename".replace('tablename', table))
    cursor.execute("create table tablename (word text);".replace('tablename', table))
    cursor.executemany("insert into tablename values (?)".replace('tablename', table), [[u'nabisco'], [u'usa'], [u'sharp'], [u'rise']])
    connection.commit()

def update_freq_dist(connection, word_list="word_list", freq_dist="freq_dist"):
    cursor = connection.cursor()
    subset = cursor.execute("select * from tablename;".replace("tablename", word_list)).fetchmany(5)
    for elem in subset: # want to process e.g.: 5 at a time
        elem = 'S' + elem[0] + 'E' # Start and end delimiters
        for i in xrange(len(elem) - 1):
            freq = cursor.execute("SELECT freq FROM tablename WHERE char1=? and char2=?;".replace("tablename", freq_dist), (elem[i], elem[i + 1])).fetchone()
            cursor.execute("UPDATE tablename SET freq=? WHERE char1=? and char2=?;".replace("tablename", freq_dist), (freq + 1, elem[i], elem[i + 1]))
    connection.commit() # seems inefficient having two queries here^
    cursor.close()

if __name__ == '__main__':
    connection = sqlite3.connect('array.db')
    populate_word_list(connection)
    populate_character_probabilities(connection)
    update_freq_dist(connection)
    cursor = connection.cursor()
    print cursor.execute("SELECT * FROM freq_dist;").fetchmany(10)

(Wow, got a 180-line code base down to 37 lines for a test case! :D Note that the actual word list has 29 million entries, not 4!!!)

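I have realised that:

There should be no need for two queries inside the inner loop of update_freq_dist
There is a way to iterate through database elements (rows), e.g. 5 at a time

However, I have no idea how either problem can be solved.

Can you think of a solution?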

score 1 · Accepted Answer

Are you just updating the frequency by +1? Then let the database do the increment in a single statement:

UPDATE tablename
SET freq = freq + 1
WHERE char1=? and char2=?;
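As a rough sketch of how that single UPDATE could replace the SELECT-then-UPDATE pair in update_freq_dist, here is a small example against an in-memory database. The helper name bump_bigrams and the four seeded rows are assumptions made for illustration, and the code is written for Python 3 rather than the Python 2 used in the question.

```python
import sqlite3

def bump_bigrams(connection, word, table="freq_dist"):
    """Increment the count of each adjacent character pair in `word`.

    Hypothetical helper, not part of the original question or answer.
    """
    delimited = "S" + word + "E"  # same start/end delimiters as the question
    pairs = list(zip(delimited, delimited[1:]))
    # A table name cannot be a bound parameter, so it is formatted in;
    # only do this with trusted table names.
    connection.executemany(
        "UPDATE {} SET freq = freq + 1 WHERE char1 = ? AND char2 = ?".format(table),
        pairs,
    )
    return pairs

connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE freq_dist (char1 TEXT, char2 TEXT, freq INTEGER)")
# Seed only the pairs needed for the demo word, each starting at 1.
connection.executemany(
    "INSERT INTO freq_dist VALUES (?, ?, 1)",
    [("S", "u"), ("u", "s"), ("s", "a"), ("a", "E")],
)
bump_bigrams(connection, "usa")
connection.commit()
rows = connection.execute(
    "SELECT char1, char2, freq FROM freq_dist ORDER BY rowid"
).fetchall()
```

Because the increment happens inside SQLite, no frequency value has to travel back to Python. Note that an UPDATE whose WHERE clause matches no row silently does nothing, so every pair that can occur (including those involving 'S' and 'E') has to exist in the table beforehand.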

Or, if you are updating from another table:

UPDATE tablename 
SET freq = t2.freq + 1 -- whatever your calc is
FROM tablename t1
JOIN othertable t2
ON t1.other_id = t2.id
WHERE t1.char1=? and t1.char2=? and t2.char1=? and t2.char2=?

If you want to iterate 5 rows at a time, the LIMIT and OFFSET clauses will get you something close.
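A minimal sketch of both ways to walk the word list in batches of 5 might look like the following. The function names and the 12 generated test words are assumptions for illustration. The first generator uses LIMIT/OFFSET as suggested above; the second is a swapped-in alternative that calls fetchmany repeatedly on one open cursor. The code targets Python 3.

```python
import sqlite3

def iter_batches_limit(connection, table="word_list", size=5):
    """Yield lists of up to `size` words using LIMIT/OFFSET paging."""
    offset = 0
    while True:
        rows = connection.execute(
            "SELECT word FROM {} ORDER BY rowid LIMIT ? OFFSET ?".format(table),
            (size, offset),
        ).fetchall()
        if not rows:
            break
        yield [row[0] for row in rows]
        offset += size

def iter_batches_fetchmany(connection, table="word_list", size=5):
    """Yield lists of up to `size` words from a single open cursor."""
    cursor = connection.execute(
        "SELECT word FROM {} ORDER BY rowid".format(table)
    )
    while True:
        rows = cursor.fetchmany(size)
        if not rows:  # an empty list means the result set is exhausted
            break
        yield [row[0] for row in rows]

connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE word_list (word TEXT)")
connection.executemany(
    "INSERT INTO word_list VALUES (?)", [("w%d" % i,) for i in range(12)]
)
limit_batches = list(iter_batches_limit(connection))
fetchmany_batches = list(iter_batches_fetchmany(connection))
```

Both generators produce the same batches here. With a very large table, OFFSET paging tends to slow down as the offset grows, because SQLite still has to step over the skipped rows, so the fetchmany loop may be the better fit for a 29-million-row word list.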
