
I have a very large CSV file that contains several strings of HTML code. I am using BeautifulSoup to extract only the code that is in the <p> tags. My code seems to work for several examples, except when I run it on the full CSV file I get a memory error. The CSV file is about 6 GB. Here is my code:

import csv
from bs4 import BeautifulSoup

def import_data():
    doc=[]
    with open('input_file.csv','rb') as f:
        reader=csv.reader(f)
        for row in reader:
            doc.append((row[0],row[2]))
    return doc

def main():

    data=import_data()

    desc=[]

    for i in data:
        soup = BeautifulSoup(i[1], 'html')
        desc.append([i[0],' '.join(el.string for el in soup.find_all('p', text=True))])


    with open("output_file.csv",'a') as the_file:
        writer=csv.writer(the_file,dialect='excel')
        writer.writerows(desc)

if __name__ == '__main__':
    main()

I can see why I am running out of memory: I am essentially holding the 6 GB file in two places (data and desc). I know that I can hold one of them in memory, since I am able to import the data without any problems. But how would you suggest I get around this? Should I replace the second column with the BeautifulSoup output rather than keeping two structures? Or should I read the input file line by line: read one line, perform the BeautifulSoup transformation on it, then write it out, so that only one line is in memory at a time? Thanks,
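To make the second option concrete, this is roughly the row-at-a-time loop I have in mind (same file names, column layout, and parser argument as above):

    import csv
    from bs4 import BeautifulSoup

    def stream_convert(in_path='input_file.csv', out_path='output_file.csv'):
        # Read one row, transform it, write it out immediately,
        # so only a single row is ever held in memory.
        with open(in_path, 'rb') as fin, open(out_path, 'wb') as fout:
            reader = csv.reader(fin)
            writer = csv.writer(fout, dialect='excel')
            for row in reader:
                soup = BeautifulSoup(row[2], 'html')
                text = ' '.join(el.string for el in soup.find_all('p', text=True))
                writer.writerow([row[0], text])

    if __name__ == '__main__':
        stream_convert()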


2 Answers


You could look at memory-mapped files, which can help you avoid having the entire input file in memory:

http://docs.python.org/2/library/mmap.html
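
A rough sketch of how that could look, assuming Python 2 (matching the linked docs); process() below is just a placeholder for whatever per-row handling you need:

    import csv
    import mmap

    with open('input_file.csv', 'rb') as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        # csv.reader accepts any iterable of lines; reading them off the
        # memory map lets the OS page the file in and out instead of you
        # loading all 6 GB into a Python list at once.
        reader = csv.reader(iter(mm.readline, ''))
        for row in reader:
            process(row)  # hypothetical per-row handler
        mm.close()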

answered 2013-09-17T14:07:01.880