python - SQL を実行する Python スクリプトの最適化

Question

csv ファイルから情報を解析し、SQL ステートメントを実行してテーブルを作成し、データを挿入するスクリプトがあります。~25 GB の csv ファイルを解析する必要がありますが、現在のスクリプトでは、解析した以前のサイズのファイルから判断すると、最大 20 日かかると推定されます。スクリプトを最適化して実行速度を上げる方法について何か提案はありますか? createtable 関数は 1 回しか呼び出されないため省略しました。InsertRow() は、私が本当に高速化する必要があると思う関数です。前もって感謝します。

#Builds sql insert statements and executes sqlite3 calls to insert the rows 
def insertRow(cols):
    first = True; #First value for INSERT arguments doesn't need comma front of it.
    conn = sqlite3.connect('parsed_csv.sqlite')
    c = conn.cursor()
    print cols
    insert = "INSERT INTO test9 VALUES("
    for col in cols:
        col = col.replace("'", "")
        if(first):
            insert +=  "'" + col + "'"
            first = False;
        else:
            insert += "," + "'" + col+ "'" + " "
    insert += ")"
    print (insert)
    c.execute(insert)
    conn.commit()


def main():
    #Get rid of first argument (filename)
    cmdargs = sys.argv[1:]
    #Convert values to integers
    cmdargs = list(map(int, cmdargs))

    #Get headers
    with open(r'requests_fields.csv','rb') as source:
        rdr = csv.reader(source)
        for row in rdr:          
             createTable(row[:], cmdargs[:])

    with open(r'test.csv','rb') as source:
        rdr= csv.reader( source )
        for row in rdr:
            #Clear contents of list
            outlist =[]
            #Append all rows onto list and then write to row in output csv file
            for index in cmdargs:
                outlist.append(row[index])
            insertRow(outlist[:])

私が経験している遅い速度は、insertRow() で毎回データベースへの接続を作成することが原因である可能性がありますか?

score 1 · Accepted Answer

データベースに接続し、各行のカーソルを作成しています。このステップをループの外に移動すると、はるかに高速になります。

また、各行の後にコミットしています。アプリケーションのロジックに適合する場合は、これが発生する頻度を減らすこともお勧めしますが、1 回のコミットで 25 GB のファイル全体を実行すると、独自の問題が発生する可能性があります。

score 1 · Accepted Answer

1) Python、SQLLite、25 GB。速い=素晴らしい。

2) insert = "INSERT INTO test9 VALUES(" - それは間違った解決策です。良いスタイルはパラメータを使用しています:

insert = "INSERT INTO test9 (field1, field2, field3) VALUES(?, ?, ?)"
c.execute(insert, [value1, value2, value3])

python - SQL を実行する Python スクリプトの最適化

2 に答える 2

Related

Reference