I have the following Python script that I use to preprocess text from a 4-column CSV file.
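import csv
import re
import string
from bs4 import BeautifulSoup as BS  # assumed import; the post does not show where BS comes from

bad_tags=['code','a','img']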
stops=['the','i',"i'd",'cannot','like','if','an','a','is','or','no','that',"i'm",'and','you','which','there','way','to','if','from','certain','quite',
'help','me','how','should','why','can','what','in','on','where','thanks','thank','want','need','so','could','would','when','do','using',
'another',"i've",'gives','still','while','this','for','but','actually','that','into','these','something','some','want','not','please','me',
'know','it','have','stuff','with','each','able','wondering','such','finding','matter','question','as','make','use','my','any','be','more','than',
'was','of','etc','find','answer','myself','since','work','without','kinds','very','then','think','thinking','thought','although','however','which',
'anyway','anyways','more','at','every','everyone','never',"can't","won't","shouldn't","couldn't","there's",'sure','no','already','works','problem',
'most','mostly','turned','am','create',"that's",'whole','putting','getting','good','bad','great','worst','best','worse','only','better',
'now','often','happen','happens','happening','out','in','all','appreciate','basically','given','gives','gave','somewhere','try','tried','takes','taking',
'e.g','question','trouble','based','guess','after','enough','has','them','ie','eg','having','weird','those','trying','wants','said','its','giving','whats','later',
'used',"isn't",'gonna','will','explain','once','take','after','unfortunately','fortunately','receive','they','suppose','being','hence','did','wanna','usual',
'questions','before','by','are',"aren't",'almost','wanted','does','someone','containing','because','within','just','own','easier','much','appreciated']
with open(r"input_file.csv") as r, open(r"output_file.csv", "w") as w:
    reader=csv.reader(r)
    next(r)
    for row in reader:
        soup=BS(row[2],'html')
        for tag in soup.findAll(True):
            if tag.name in bad_tags:
                tag.extract()
        new_string=soup.renderContents()
        final0=re.sub(r'<[^>]+>', '', new_string)
        parsing=re.findall(r"[\w+]+(?:[-'/.][\w+]+)*|'|[-.(]+|\S[\w+]*",re.sub(r'<[^>]+>', '', row[1]))
        final=' '.join(w.lower() for w in parsing if w not in string.punctuation)
        parsing2=[b for b in final.split(' ') if not b in stops]
        final2=' '.join(parsing2)
        parsing3=re.findall(r"[\w+]+(?:[-'/.][\w+]+)*|'|[-.(]+|\S[\w+]*",final0)
        final3=' '.join(w.lower() for w in parsing3 if w not in string.punctuation)
        parsing4=[b for b in final3.split(' ') if not b in stops]
        final4=' '.join(parsing4)
        w.write("{},{},{},{}\n".format(row[0],final2,final4,row[3]))
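It works correctly for most examples, but sometimes multiple rows of the input file end up concatenated into a single row of the output file, and I can't figure out why. Can anyone explain this?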
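To narrow this down, here is a minimal diagnostic sketch in Python 3 (the script above looks like Python 2). It rests on an assumption I have not confirmed: that some input fields contain quoted newlines or unbalanced quotes, so `csv.reader` treats several physical lines as one record. The helper names `find_multiline_records` and `write_rows` and the sample data are made up for illustration. The first helper reports records that span more than one physical line, and the second shows how `csv.writer` would quote fields instead of formatting the output line by hand.

```python
import csv
import io


def find_multiline_records(text):
    """Return (record_number, first_line, last_line) for every CSV record
    that spans more than one physical line of the input text."""
    reader = csv.reader(io.StringIO(text, newline=""))
    spans = []
    previous_line = 0
    for record_number, _row in enumerate(reader, start=1):
        # reader.line_num counts the physical lines consumed so far,
        # so a jump larger than 1 means this record covered several lines.
        if reader.line_num - previous_line > 1:
            spans.append((record_number, previous_line + 1, reader.line_num))
        previous_line = reader.line_num
    return spans


def write_rows(rows):
    """Serialize rows with csv.writer so that embedded commas, quotes and
    newlines are quoted rather than breaking the column layout."""
    buffer = io.StringIO(newline="")
    csv.writer(buffer).writerows(rows)
    return buffer.getvalue()


# Hypothetical sample: the second record has a quoted body with a newline in it.
sample = 'id,title,body,tags\n1,"a title","line one\nline two",python\n2,t,b,csv\n'
print(find_multiline_records(sample))
print(repr(write_rows([["1", "a, b", "x\ny", "t"]])))
```

If the first helper reports any spans for the real input file, those are the places to inspect for stray quotes or embedded line breaks.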