python - Converting an HTML list to tabs (i.e. indentation)

Question

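I've worked in dozens of languages, but I'm new to Python.

This is my first (maybe second) question here, so please be gentle...

I'm trying to efficiently convert HTML-like markup text to wiki format (specifically, Linux Tomboy/GNote notes to Zim), and I've gotten stuck on converting lists.

For a two-level unordered list like this...

- First level
  - Second level

Tomboy/GNote uses something like this...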

<list><list-item>First level<list><list-item>Second level</list-item></list></list-item></list>

However, the Zim personal wiki wants it like this...

* First level
  * Second level

...with leading tabs.

I've explored the regex module functions re.sub(), re.match(), re.search(), etc., and found the cool Python feature of coding repeated text as...

 count * "text"

So there should be a way to do something like...

 newnote = re.sub("<list>", LEVEL * "\t", oldnote)

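Here LEVEL is the ordinal (occurrence) of <list> in the note: 0 for the first <list> encountered, 1 for the second, and so on.

LEVEL would then be decremented each time a </list> is encountered.

<list-item> tags are converted to the asterisk for the bullet (preceded by a newline where appropriate), and </list-item> tags are dropped.

Finally... the question...

How do I get the value of LEVEL and use it as a tab multiplier?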

score 4 · Accepted Answer

You should really use an XML parser to do this, but to answer your question:

import re

def next_tag(s, tag):
    i = -1
    while True:
        try:
            i = s.index(tag, i+1)
        except ValueError:
            return
        yield i

a = "<list><list-item>First level<list><list-item>Second level</list-item></list></list-item></list>"

a = a.replace("<list-item>", "* ")

for LEVEL, ind in enumerate(next_tag(a, "<list>")):
    a = re.sub("<list>", "\n" + LEVEL * "\t", a, 1)

a = a.replace("</list-item>", "")
a = a.replace("</list>", "")

print(a)
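The counter described in the question (up by one at each <list>, down by one at each </list>) can also be kept inside a replacement function passed to re.sub. What follows is only a sketch under stated assumptions: Python 3 (it relies on nonlocal), tags written exactly as in the example with no attributes, and a hypothetical function name, tomboy_lists_to_zim. It is still string matching rather than parsing, so malformed or unbalanced markup will produce wrong indentation.

```python
import re

def tomboy_lists_to_zim(note):
    """Convert <list>/<list-item> markup to tab-indented '* ' bullets (sketch)."""
    level = -1  # becomes 0 when the first <list> is seen

    def convert(match):
        nonlocal level
        tag = match.group(0)
        if tag == "<list>":
            level += 1          # one level deeper
            return ""
        if tag == "</list>":
            level -= 1          # back out one level
            return ""
        if tag == "<list-item>":
            # New bullet on its own line, indented by the current level
            return "\n" + "\t" * level + "* "
        return ""               # </list-item> is simply dropped

    # One pass over all four tag forms, in document order
    converted = re.sub(r"</?list(?:-item)?>", convert, note)
    return converted.lstrip("\n")

a = "<list><list-item>First level<list><list-item>Second level</list-item></list></list-item></list>"
print(tomboy_lists_to_zim(a))
```

Because re.sub calls the function once per match, from left to right, the counter always reflects the nesting depth at the tag currently being replaced.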

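The next_tag snippet above works for your example, and only for your example. Use an XML parser instead. You can use xml.dom.minidom (it ships with Python, at least with 2.7, so there is nothing to download):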

import xml.dom.minidom

def parseList(el, lvl=0):
    txt = ""
    indent = "\t" * (lvl)
    for item in el.childNodes:
        # These are the <list-item>s: They can have text and nested <list> tag
        for subitem in item.childNodes:
            if subitem.nodeType == xml.dom.minidom.Node.TEXT_NODE:
                # This is the text before the next <list> tag
                txt += "\n" + indent + "* " + subitem.nodeValue
            else:
                # This is the next list tag, its indent level is incremented
                txt += parseList(subitem, lvl=lvl+1)
    return txt

def parseXML(s):
    doc = xml.dom.minidom.parseString(s)
    return parseList(doc.firstChild)

a = "<list><list-item>First level<list><list-item>Second level</list-item><list-item>Second level 2<list><list-item>Third level</list-item></list></list-item></list></list-item></list>"
print(parseXML(a))

Output:

* First level
    * Second level
    * Second level 2
        * Third level
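
The same recursive walk can also be written with xml.etree.ElementTree, which is likewise part of the standard library. This is a sketch rather than a drop-in replacement, and it assumes Python 3, well-formed input whose root element is a single <list>, and a hypothetical helper name, element_to_zim. Unlike the minidom version, it only reads the text that comes before a nested <list> inside each item and strips surrounding whitespace from it.

```python
import xml.etree.ElementTree as ET

def element_to_zim(list_el, level=0):
    """Return Zim bullet lines for a <list> element, one tab per nesting level."""
    lines = []
    for item in list_el.findall("list-item"):      # direct children only
        # item.text is the text before the first nested <list>, if any
        text = (item.text or "").strip()
        lines.append("\t" * level + "* " + text)
        for sub in item.findall("list"):           # nested lists of this item
            lines.extend(element_to_zim(sub, level + 1))
    return lines

a = ("<list><list-item>First level<list><list-item>Second level</list-item>"
     "<list-item>Second level 2<list><list-item>Third level</list-item></list>"
     "</list-item></list></list-item></list>")
zim_text = "\n".join(element_to_zim(ET.fromstring(a)))
print(zim_text)
```

For the three-level example this yields four bullets with zero, one, one and two leading tabs respectively. Returning a list of lines and joining at the end avoids the leading newline that string concatenation produces.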

score 2 · Accepted Answer

With Beautiful Soup you can iterate over tags, even custom ones. It is very practical for this type of operation:

from BeautifulSoup import BeautifulSoup
tags = "<list><list-item>First level<list><list-item>Second level</list-item></list></list-item></list>"
soup = BeautifulSoup(tags)
print [[ item.text for item in list_tag('list-item')]  for list_tag in soup('list')]

Output : [[u'First level'], [u'Second level']]

I used a nested list comprehension, but you can use nested for loops instead:

for list_tag in soup('list'):
     for item in list_tag('list-item'):
         print item.text

I hope that helps.

I used BeautifulSoup 3 in my example, but it should work with BeautifulSoup 4 with only the import changed:

from bs4 import BeautifulSoup
