python - Python -- How to automatically split headers/chapters into separate files

Question

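I am converting text directly to epub, and I am having trouble automatically splitting the HTML book file into separate header/chapter files. At the moment the code below partially works, but it only creates every other chapter file, so half of the header/chapter files are missing from the output. Here is the code: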

def splitHeaderstoFiles(fpath):

    infp = open(fpath, 'rt', encoding=('utf-8'))
    for line in infp:

        # format and split headers to files
        if '<h1' in line:

            #-----------format header file names and other stuff ------------#

            # create a new file for the header/chapter section
            path = os.getcwd() + os.sep + header
            with open(path, 'wt', encoding=('utf-8')) as outfp:

                # write html top meta headers
                outfp = addMetaHeaders(outfp)
                # add the header
                outfp.write(line)

                # add the chapter/header bodytext
                for line in infp:
                    if '<h1' not in line:
                        outfp.write(line)
                    else:
                        outfp.write('</body>\n</html>')
                        break
        else:
            continue

    infp.close()

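The problem occurs in the second for loop at the bottom of the code, which looks for the next h1 tag in order to stop the split. I cannot use seek() or tell() to rewind or step back one line so that the program can find the next header/chapter on the next iteration. Apparently these cannot be used in Python inside a for loop, which relies on an implicit iterator and next() calls on the file object. I just get a "can't do nonzero cur-relative seeks" error.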
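The missing chapters can be reproduced in a minimal sketch, assuming the cause is that both loops read from the same file iterator: by the time the inner loop breaks, it has already consumed the next `<h1` line, so the outer loop never sees that heading. The sample text and the `io.StringIO` stand-in for the file are made up for illustration.

```python
import io

# Hypothetical sample input standing in for the HTML book file.
sample = io.StringIO(
    "<h1>One</h1>\n"
    "text 1\n"
    "<h1>Two</h1>\n"
    "text 2\n"
    "<h1>Three</h1>\n"
    "text 3\n"
    "<h1>Four</h1>\n"
    "text 4\n"
)


def headers_seen_by_outer_loop(infp):
    """Return the <h1 lines that the outer loop gets to handle."""
    seen = []
    for line in infp:
        if '<h1' in line:
            seen.append(line.strip())
            # Same shape as the inner loop in the question: it reads from
            # the same iterator, so the <h1 line that triggers the break
            # is consumed here and never reaches the outer loop.
            for line in infp:
                if '<h1' in line:
                    break
    return seen


seen = headers_seen_by_outer_loop(sample)
print(seen)
```

Only the first and third headings reach the outer loop, which matches the "every other chapter" symptom.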

I also tried a while line != ' ' plus readline() combination in the code, but that gives the same error as above.

Does anyone know a simple way to split HTML headers/chapters of varying lengths into separate files in Python? Are there any special Python modules (such as pickle) that would make this task easier?

I am using Python 3.4.

Thanks in advance for any solution to this problem...

score 0 · Accepted Answer

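I finally found the answer to the problem above. The code below does more than just extract the file headers. It also loads two parallel lists at the same time, one holding the formatted file names (with extension) and the other holding the pure header names, so that these lists can be used in a while loop to fill in the titles and formatted file names for these HTML files in one pass. The code now works fine and is shown below.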

import os
import re
import shutil
from string import capwords

# addMainHeaders(), logmsg27() and DEBUG_FLAG are defined elsewhere in the
# program and are not shown here.

def splitHeaderstoFiles(dir, inpath):
    count = 1
    t_count = 0
    out_path = ''
    header = ''
    write_bodytext = False
    outfp = None
    file_path_names = []
    pure_header_names = []

    inpath = dir + os.sep + inpath
    with open(inpath, 'rt', encoding=('utf-8')) as infp:

        for line in infp:

            if '<h1' in line:
                # close the previous chapter file before starting a new one
                if outfp is not None:
                    outfp.close()

                # strip html tags, convert to start caps
                p = re.compile(r'<.*?>')
                header = p.sub('', line)
                header = capwords(header)
                line_save = header

                # Add 0 for count below 10
                if count < 10:
                    header = '0' + str(count) + '_' + header
                else:
                    header = str(count) + '_' + header

                # remove all spaces + add extension in header
                header = header.replace(' ', '_')
                header = header + '.xhtml'
                count = count + 1

                # create two parallel lists used later
                out_path = dir + os.sep + header
                outfp = open(out_path, 'wt', encoding=('utf-8'))
                file_path_names.insert(t_count, out_path)
                pure_header_names.insert(t_count, line_save)
                t_count = t_count + 1

                # Add html meta headers and write it
                outfp = addMainHeaders(outfp)
                outfp.write(line)
                write_bodytext = True

            # add header bodytext
            elif write_bodytext:
                outfp.write(line)

    # close the last chapter file so its contents are flushed to disk
    if outfp is not None:
        outfp.close()

    # now add html titles and close the html tails on all files
    max_num_files = len(file_path_names)
    tmp = dir + os.sep + 'temp1.tmp'
    i = 0

    while i < max_num_files:
        outfp = open(tmp, 'wt', encoding=('utf-8'))
        infp = open(file_path_names[i], 'rt', encoding=('utf-8'))

        for line in infp:
            if '<title>' in line:
                line = line.strip(' ')
                line = line.replace('<title></title>', '<title>' + pure_header_names[i] + '</title>')
                outfp.write(line)
            else:
                outfp.write(line)

        # add the html tail (line still holds the last line read above)
        if '</body>' in line or '</html>' in line:
            pass
        else:
            outfp.write('  </body>' + '\n</html>')

        # clean up
        infp.close()
        outfp.close()
        shutil.copy2(tmp, file_path_names[i])
        os.remove(tmp)
        i = i + 1

    # now rename just the title page
    if os.path.isfile(file_path_names[0]):
        title_page_name = file_path_names[0]
        new_title_page_name = dir + os.sep + '01_Title.xhtml'
        os.rename(title_page_name, new_title_page_name)
        file_path_names[0] = '01_Title.xhtml'
    else:
        logmsg27(DEBUG_FLAG)
        os._exit(0)

    # xhtml file is no longer needed
    if os.path.isfile(inpath):
        os.remove(inpath)

    # returned list values are also used
    # later to create epub opf and ncx files
    return file_path_names, pure_header_names
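The core of this approach can be sketched in isolation: a single loop over the lines, where each `<h1` line starts a new chapter and every other line is appended to the chapter currently open, so nothing ever needs to be rewound. This is a simplified illustration rather than the full code above; `split_chapters` and `chapter_filename` are hypothetical names, and the chapters are collected in memory instead of being written to disk.

```python
import re
from string import capwords

TAG_RE = re.compile(r'<.*?>')


def chapter_filename(count, heading_line):
    """Build a name such as '01_Chapter_One.xhtml' from an <h1> line."""
    title = capwords(TAG_RE.sub('', heading_line))
    return '{:02d}_{}.xhtml'.format(count, title.replace(' ', '_'))


def split_chapters(lines):
    """Group lines into (filename, chapter_lines) pairs, one per <h1>."""
    chapters = []
    current = None  # lines before the first <h1> are skipped
    for line in lines:
        if '<h1' in line:
            # A heading starts a new chapter; no look-ahead is needed.
            current = [line]
            name = chapter_filename(len(chapters) + 1, line)
            chapters.append((name, current))
        elif current is not None:
            current.append(line)
    return chapters


# Made-up sample input for illustration.
sample = [
    "<h1>chapter one</h1>\n",
    "<p>first</p>\n",
    "<h1>chapter two</h1>\n",
    "<p>second</p>\n",
]
chapters = split_chapters(sample)
for name, body in chapters:
    print(name, len(body))
```

Because the heading line is handled in the same loop that copies body text, every `<h1` is seen exactly once, which is what avoids the skipped chapters from the question.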

@Hai Vu and @Seth -- thank you for your help.
