I am new to Python, and with some really great assistance from StackOverflow, I've written a program that:
1) Looks in a given directory, and for each file in that directory:
2) Runs a HTML-cleaning program, which:
- Opens each file with BeautifulSoup
- Removes blacklisted tags & content
- Prettifies the remaining content
- Runs Bleach to remove all non-whitelisted tags & attributes
- Saves out as a new file
It works very well, except when it hits a certain kind of file content that throws up a bunch of BeautifulSoup errors and aborts the whole thing. I want it to be robust against that, as I won't have control over what sort of content winds up in this directory.
So, my question is: How can I re-structure the program so that when it errors on one file within the directory, it reports that it was unable to process that file, and then continues to run through the remaining files?
Here is my code so far (with extraneous detail removed):
def clean_dir(directory):
os.chdir(directory)
for filename in os.listdir(directory):
clean_file(filename)
def clean_file(filename):
tag_black_list = ['iframe', 'script']
tag_white_list = ['p', 'div']
attr_white_list = {'*': ['title']}
with open(filename, 'r') as fhandle:
text = BeautifulSoup(fhandle)
text.encode("utf-8")
print "Opened "+ filename
# Step one, with BeautifulSoup: Remove tags in tag_black_list, destroy contents.
[s.decompose() for s in text(tag_black_list)]
pretty = (text.prettify())
print "Prettified"
# Step two, with Bleach: Remove tags and attributes not in whitelists, leave tag contents.
cleaned = bleach.clean(pretty, strip="TRUE", attributes=attr_white_list, tags=tag_white_list)
fout = open("../posts-cleaned/"+filename, "w")
fout.write(cleaned.encode("utf-8"))
fout.close()
print "Saved " + filename +" in /posts-cleaned"
print "Done"
clean_dir("../posts/")
I looking for any guidance on how to write this so that it will keep running after hitting a parsing/encoding/content/attribute/etc error within the clean_file function.