python - Convert tabbed text to an HTML unordered list?

Question

I'm a beginner programmer, so this question may sound trivial: I have some text files containing tab-delimited text like this:

Now I want to generate an unordered .html list from this, with the following structure:

<ul>
<li>A
<ul><li>B</li>
<li>C
<ul><li>D</li>
<li>E</li></ul></li></ul></li>
</ul>

My idea was to write a Python script, but if there is an easier (automatic) way, that is fine too. To identify the indentation level and the item name, I use the following code:

import sys
indent = 0
last = []
for line in sys.stdin:
    count = 0
    while line.startswith("\t"):
       count += 1
       line = line[1:]
    if count > indent:
       indent += 1
       last.append(last[-1])
    elif count < indent:
       indent -= 1
       last = last[:-1]

score 5 · Accepted Answer

Try this (it works on your test case):

import itertools

def listify(filepath):
    depth = 0
    print("<ul>" * (depth + 1))
    with open(filepath) as f:
        for line in f:
            line = line.rstrip()
            # The nesting depth is the number of leading tabs.
            newDepth = sum(1 for _ in itertools.takewhile(lambda c: c == '\t', line))
            if newDepth > depth:
                print("<ul>" * (newDepth - depth))
            elif depth > newDepth:
                print("</ul>" * (depth - newDepth))
            print("<li>%s</li>" % line.strip())
            depth = newDepth
    print("</ul>" * (depth + 1))

Hope this helps.
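Note that the function above emits each nested <ul> as a sibling of the preceding <li>, whereas the structure in the question places the sublist inside its parent <li>. A minimal sketch of one way to get that nesting is shown here; the name tabs_to_nested_ul, the use of html.escape, and the choice to skip blank lines are assumptions of this sketch, not part of the original answer.

```python
from html import escape


def tabs_to_nested_ul(lines):
    """Build a <ul> where every sublist sits inside its parent <li>."""
    parts = []
    depth = -1  # -1 means no list has been opened yet
    for raw in lines:
        text = raw.rstrip("\r\n")
        if not text.strip():
            continue  # ignore blank lines
        level = len(text) - len(text.lstrip("\t"))
        # An item can be at most one level deeper than the previous one.
        level = min(level, depth + 1)
        if level > depth:
            parts.append("<ul>")  # open a sublist inside the still-open <li>
        elif level == depth:
            parts.append("</li>")  # close the previous sibling
        else:
            # Close the previous item, then one sublist and its parent item
            # for every level we climb back up.
            parts.append("</li>" + "</ul></li>" * (depth - level))
        parts.append("<li>" + escape(text.strip()))
        depth = level
    if depth >= 0:
        parts.append("</li>" + "</ul></li>" * depth + "</ul>")
    return "".join(parts)


if __name__ == "__main__":
    sample = ["A\n", "\tB\n", "\tC\n", "\t\tD\n", "\t\tE\n"]
    print(tabs_to_nested_ul(sample))
```

The key difference is that an <li> is left open until the next line shows whether it has children, so the closing tags are written one line late.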

score 2 · Accepted Answer

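The tokenize module understands your input format: the lines contain valid Python identifiers, and the indentation level of the statements is significant. The ElementTree module lets you manipulate tree structures in memory, so it may be more flexible to separate building the tree from rendering it as HTML.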

from tokenize import NAME, INDENT, DEDENT, ENDMARKER, NEWLINE, generate_tokens
from xml.etree import ElementTree as etree

def parse(file, TreeBuilder=etree.TreeBuilder):
    tb = TreeBuilder()
    tb.start('ul', {})
    for type_, text, start, end, line in generate_tokens(file.readline):
        if type_ == NAME: # convert name to <li> item
            tb.start('li', {})
            tb.data(text)
            tb.end('li')
        elif type_ == NEWLINE:
            continue
        elif type_ == INDENT: # start <ul>
            tb.start('ul', {})
        elif type_ == DEDENT: # end </ul>
            tb.end('ul')
        elif type_ == ENDMARKER: # done
            tb.end('ul') # end parent list
            break
        else: # unexpected token
            assert 0, (type_, text, start, end, line)
    return tb.close() # return root element

You can use any class that provides .start(), .end(), .data(), and .close() methods as the TreeBuilder. For example, instead of building a tree you could write the HTML on the fly.
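As a rough illustration of that idea, the sketch below defines a hypothetical HtmlWriter class (the name and its out parameter are assumptions, not part of ElementTree) that exposes the same four methods and writes tags immediately instead of storing them.

```python
import io
import sys
from html import escape


class HtmlWriter:
    """Hypothetical stand-in for TreeBuilder that streams HTML as it goes."""

    def __init__(self, out=None):
        # Default to stdout so the class can be constructed with no arguments.
        self.out = sys.stdout if out is None else out

    def start(self, tag, attrs):
        attr_text = "".join(' %s="%s"' % (k, escape(v)) for k, v in attrs.items())
        self.out.write("<%s%s>" % (tag, attr_text))

    def end(self, tag):
        self.out.write("</%s>" % tag)

    def data(self, text):
        self.out.write(escape(text))

    def close(self):
        # Nothing is kept in memory, so there is no root element to return.
        return None


# Drive the writer by hand to show the calls a parser would make.
buf = io.StringIO()
writer = HtmlWriter(buf)
writer.start("ul", {})
writer.start("li", {})
writer.data("A & B")
writer.end("li")
writer.end("ul")
writer.close()
print(buf.getvalue())
```

Because the constructor takes no required arguments, passing the class as TreeBuilder=HtmlWriter to the parse() function above should stream the list to stdout, with parse() then returning None rather than a root element.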

To parse stdin and write HTML to stdout, you can use ElementTree.write():

import sys

etree.ElementTree(parse(sys.stdin)).write(sys.stdout, method='html')

Output:

<ul><li>A</li><ul><li>B</li><li>C</li><ul><li>D</li><li>E</li></ul></ul></ul>

You can use any file, not just sys.stdin/sys.stdout.

Note: to write to stdout on Python 3, use sys.stdout.buffer or encoding="unicode", because of the bytes/Unicode distinction.
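The two options in the note can be sketched as follows. To keep the example self-contained and easy to check, it writes to in-memory streams (io.StringIO for text, io.BytesIO for bytes) as stand-ins for sys.stdout and sys.stdout.buffer; the tiny tree is made up for illustration.

```python
import io
from xml.etree import ElementTree as etree

# A tiny tree built by hand, standing in for the result of parse().
root = etree.Element("ul")
etree.SubElement(root, "li").text = "яяя"
tree = etree.ElementTree(root)

# Text stream (like sys.stdout): ask for str output with encoding="unicode".
text_out = io.StringIO()
tree.write(text_out, encoding="unicode", method="html")

# Binary stream (like sys.stdout.buffer): pass a real byte encoding.
byte_out = io.BytesIO()
tree.write(byte_out, encoding="utf-8", method="html")

print(text_out.getvalue())
print(byte_out.getvalue())
```

Choosing encoding="utf-8" for the byte stream keeps non-ASCII item names readable in the output; leaving the encoding at its default writes them as character references instead.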

score 0 · Accepted Answer

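I think the algorithm goes like this:

Keep track of the current indentation level (by counting the number of tabs per line)
If the indentation level increases: emit <ul> <li>current item</li>
If the indentation level decreases: emit <li>current item</li></ul>
If the indentation level stays the same: emit <li>current item</li>

Putting this into code is left as an exercise for the OP.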
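For readers who want to compare notes on the exercise, here is one possible sketch of those steps as a generator. It deviates from the outline in two ways that are assumptions of this sketch: when the level drops it emits the closing </ul> before the item (once per level dropped) so the item lands in the outer list, and it skips blank lines. The name indent_to_html is made up.

```python
def indent_to_html(lines):
    """Yield HTML fragments for tab-indented lines, tracking the depth."""
    depth = 0
    yield "<ul>"
    for line in lines:
        text = line.rstrip("\r\n")
        if not text.strip():
            continue  # ignore blank lines
        level = len(text) - len(text.lstrip("\t"))  # number of leading tabs
        if level > depth:
            yield "<ul>" * (level - depth)
        elif level < depth:
            yield "</ul>" * (depth - level)
        yield "<li>%s</li>" % text.strip()
        depth = level
    # Close whatever is still open, plus the outermost list.
    yield "</ul>" * (depth + 1)


if __name__ == "__main__":
    print("".join(indent_to_html(["A", "\tB", "\tC", "\t\tD", "\t\tE"])))
```

Like the outline it follows, this puts sublists next to their parent <li> rather than inside it.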

score -1 · Accepted Answer

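The algorithm is simple. You take the depth level of a line, indicated by tab characters (\t), and shift the next bullet to the right (\t+\t) or to the left (\t\t-\t), or leave it at the same level (\t).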

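Make sure your "in.txt" contains tabs, or replace the indentation with tabs if you copy it from here. If the indentation consists of spaces, nothing will work. The separator between blocks is a trailing blank line; you can change that in the code if you need to.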
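If the copied text ended up indented with spaces, a small pre-processing step can restore the tabs. This is only a sketch under the assumption that each level was four spaces; the function name spaces_to_tabs and the width parameter are hypothetical.

```python
import re


def spaces_to_tabs(text, width=4):
    """Replace each run of leading spaces with one tab per `width` spaces."""

    def to_tabs(match):
        # Any leftover spaces that do not fill a whole level are dropped.
        return "\t" * (len(match.group(0)) // width)

    return re.sub(r"^ +", to_tabs, text, flags=re.MULTILINE)


if __name__ == "__main__":
    sample = "qqq\n    www\n        яяя\n"
    print(repr(spaces_to_tabs(sample)))
```

Only spaces at the very start of a line are touched, so spaces inside item names are preserved.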

J.F. Sebastian's solution is fine, but it does not handle Unicode.

Create a text file "in.txt" in UTF-8 encoding:

qqq
    www
    www
        яяя
        яяя
    ыыы
    ыыы
qqq
qqq

Run the script "ul.py". The script creates "out.html" and opens it in Firefox.

#!/usr/bin/python
# -*- coding: utf-8 -*-

# The script exports a tabbed list from string into a HTML unordered list.

import io, subprocess, sys

f=io.open('in.txt', 'r',  encoding='utf8')
s=f.read()
f.close()

#---------------------------------------------

def ul(s):

    L=s.split('\n\n')

    s='<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">\n\
<html><head><meta content="text/html; charset=UTF-8" http-equiv="content-type"><title>List Out</title></head><body>'

    for p in L:
        e=''
        if p.find('\t') != -1:

            l=p.split('\n')
            depth=0
            e='<ul>'
            i=0

            for line in l:
                if len(line) >0:
                    a=line.split('\t')
                    d=len(a)-1

                    if depth==d:
                        e=e+'<li>'+line+'</li>'


                    elif depth < d:
                        i=i+1
                        e=e+'<ul><li>'+line+'</li>'
                        depth=d


                    elif depth > d:
                        e=e+'</ul>'*(depth-d)+'<li>'+line+'</li>'
                        depth=d
                        i=depth


            e=e+'</ul>'*i+'</ul>'
            p=e.replace('\t','')

            l=e.split('<ul>')
            n1= len(l)-1

            l=e.split('</ul>')
            n2= len(l)-1

            if n1 != n2:
                msg='<div style="color: red;">Wrong bullets position.<br>&lt;ul&gt;: '+str(n1)+'<br>&lt;&frasl;ul&gt;: '+str(n2)+'<br> Correct your source.</div>'
                p=p+msg

        s=s+p+'\n\n'

    return s

#-------------------------------------      

def detach(cmd):
    process = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=True)
    sys.exit()

s=ul(s)

f=io.open('out.html', 'w',  encoding='utf8')
s=f.write(s)
f.close()

cmd='firefox out.html'
detach(cmd)

The HTML looks like this:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html><head><meta content="text/html; charset=UTF-8" http-equiv="content-type"><title>List Out</title></head><body><ul><li>qqq</li><ul><li>www</li><li>www</li><ul><li>яяя</li><li>яяя</li></ul><li>ыыы</li><li>ыыы</li></ul><li>qqq</li><li>qqq</li></ul>
