I have a very large CSV file that contains several strings of HTML code. I am using BeautifulSoup to extract only the code that is in the <p> tags. My code works on several small examples, but when I run it on the full CSV file I get a memory error. The file is about 6 GB. Here is my code:
import csv
from bs4 import BeautifulSoup

def import_data():
    # Read the whole CSV into memory, keeping only columns 0 and 2
    doc = []
    with open('input_file.csv', 'r', newline='') as f:
        reader = csv.reader(f)
        for row in reader:
            doc.append((row[0], row[2]))
    return doc

def main():
    data = import_data()
    desc = []
    for i in data:
        # Extract the text of every <p> tag in the HTML column
        soup = BeautifulSoup(i[1], 'html.parser')
        desc.append([i[0], ' '.join(el.string for el in soup.find_all('p', text=True))])
    with open('output_file.csv', 'a', newline='') as the_file:
        writer = csv.writer(the_file, dialect='excel')
        writer.writerows(desc)

if __name__ == '__main__':
    main()
I can see why I am running out of memory: I am essentially holding the 6 GB file in two places (data and desc). I know I can hold one of them, since I am able to import the data without any problems. But how would you suggest I get around this? Should I replace the second column with the BeautifulSoup output in place, rather than keeping two structures? Or should I process the input file row by row: read one row, apply the BeautifulSoup transformation to it, and write the result out immediately, so that only one row is in memory at a time (something like the sketch below)? Thanks,
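For the second option, here is a minimal sketch of what I have in mind, assuming the same placeholder file names and column positions as above; I don't know if this is the right way to structure it:

import csv
from bs4 import BeautifulSoup

def main():
    # Stream the file: read one row, transform it, write it out immediately,
    # so only a single row is held in memory at a time.
    with open('input_file.csv', 'r', newline='') as in_file, \
         open('output_file.csv', 'w', newline='') as out_file:  # 'w' instead of 'a' so reruns start fresh
        reader = csv.reader(in_file)
        writer = csv.writer(out_file, dialect='excel')
        for row in reader:
            soup = BeautifulSoup(row[2], 'html.parser')
            text = ' '.join(el.string for el in soup.find_all('p', text=True))
            writer.writerow([row[0], text])

if __name__ == '__main__':
    main()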