I have a very large CSV file that contains several strings of HTML code. I am using BeautifulSoup to extract only the code that is in the <p> tags. My code works on several small examples, but when I run it on the full CSV file I get a memory error. The file is about 6 GB. Here is my code:
import csv
from bs4 import BeautifulSoup

def import_data():
    # Read the whole CSV into memory, keeping only columns 0 and 2
    doc = []
    with open('input_file.csv', 'r', newline='') as f:
        reader = csv.reader(f)
        for row in reader:
            doc.append((row[0], row[2]))
    return doc

def main():
    data = import_data()
    desc = []
    for i in data:
        # Extract the text of every <p> tag in the HTML column
        soup = BeautifulSoup(i[1], 'html.parser')
        desc.append([i[0], ' '.join(el.string for el in soup.find_all('p', text=True))])
    with open('output_file.csv', 'a', newline='') as the_file:
        writer = csv.writer(the_file, dialect='excel')
        writer.writerows(desc)

if __name__ == '__main__':
    main()
I can see why I am running out of memory: I am essentially holding the 6 GB file in two places (data and desc). I know I can hold one of them, since I am able to import the data without any problems. But how would you suggest I get around this? Should I replace the second column with the BeautifulSoup output in place, rather than keeping two structures? Or should I process the input file row by row: read one row, apply the BeautifulSoup transformation to it, and write the result out immediately, so that only one row is in memory at a time (something like the sketch below)? Thanks,
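For the second option, here is a minimal sketch of what I have in mind, assuming the same placeholder file names and column positions as above; I don't know if this is the right way to structure it:

import csv
from bs4 import BeautifulSoup

def main():
    # Stream the file: read one row, transform it, write it out immediately,
    # so only a single row is held in memory at a time.
    with open('input_file.csv', 'r', newline='') as in_file, \
         open('output_file.csv', 'w', newline='') as out_file:  # 'w' instead of 'a' so reruns start fresh
        reader = csv.reader(in_file)
        writer = csv.writer(out_file, dialect='excel')
        for row in reader:
            soup = BeautifulSoup(row[2], 'html.parser')
            text = ' '.join(el.string for el in soup.find_all('p', text=True))
            writer.writerow([row[0], text])

if __name__ == '__main__':
    main()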