python - Converting HTML to text in Python while mimicking the formatting

Question

I'm learning BeautifulSoup and have found many "html2text" solutions, but the one I'm looking for needs to mimic the formatting:

<ul>
<li>One</li>
<li>Two</li>
</ul>

would become

* One
* Two

and

Some text
<blockquote>
More magnificent text here
</blockquote>
Final text

would become

Some text

    More magnificent text here

Final text

I've been reading the documentation, but I haven't seen anything straightforward. Any help? I'm open to using something other than Beautiful Soup.

score 13 · Accepted Answer

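Take a look at Aaron Swartz's html2text script (it can be installed with pip install html2text). Note that the output is valid Markdown. If for some reason that doesn't fully suit you, a few small tweaks should get you the exact output from the question.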

In [1]: import html2text

In [2]: h1 = """<ul>
   ...: <li>One</li>
   ...: <li>Two</li>
   ...: </ul>"""

In [3]: print(html2text.html2text(h1))
  * One
  * Two

In [4]: h2 = """<p>Some text
   ...: <blockquote>
   ...: More magnificent text here
   ...: </blockquote>
   ...: Final text</p>"""

In [5]: print(html2text.html2text(h2))
Some text

> More magnificent text here

Final text
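
The "small tweaks" mentioned above could also be applied after the fact, by post-processing the Markdown text. The following is only a minimal sketch under that assumption: markdown_to_plain is a hypothetical helper (not part of html2text), and it works on strings shaped like the outputs shown in the session above, so it does not call html2text itself.

```python
def markdown_to_plain(md_text):
    """Rewrite Markdown list items and blockquotes into the layout from the question.

    Hypothetical helper for illustration; it only handles the two
    constructs asked about ("* " items and ">" quotes).
    """
    lines = []
    for line in md_text.splitlines():
        stripped = line.lstrip()
        if stripped.startswith("* "):
            # "  * One" -> "* One" (drop the leading indentation)
            lines.append(stripped)
        elif stripped.startswith(">"):
            # "> quoted text" -> "    quoted text" (four-space indent)
            lines.append("    " + stripped[1:].lstrip())
        else:
            lines.append(line)
    # Drop leading/trailing blank lines but keep the indentation intact
    return "\n".join(lines).strip("\n")


print(markdown_to_plain("Some text\n\n> More magnificent text here\n\nFinal text\n"))
```

Anything beyond flat lists and single-level quotes (nested quotes, ordered lists, links) would need additional rules.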

score 5 · Accepted Answer

I have code for a simpler task: it removes the HTML tags and inserts newlines at the appropriate places. Maybe this can be a starting point for you.

Python's textwrap module might be helpful for creating indented blocks of text.

http://docs.python.org/2/library/textwrap.html

import re
from html import unescape


class HtmlTool(object):
    """
    Algorithms to process HTML.
    """
    # Regular expressions to recognize different parts of HTML.
    # Internal style sheets or JavaScript
    script_sheet = re.compile(r"<(script|style).*?>.*?(</\1>)",
                              re.IGNORECASE | re.DOTALL)
    # HTML comments - can contain ">"
    comment = re.compile(r"<!--(.*?)-->", re.DOTALL)
    # HTML tags: <any-text>
    tag = re.compile(r"<.*?>", re.DOTALL)
    # Consecutive whitespace characters
    nwhites = re.compile(r"[\s]+")
    # <p>, <div>, <br> tags and associated closing tags
    p_div = re.compile(r"</?(p|div|br).*?>",
                       re.IGNORECASE | re.DOTALL)
    # Consecutive whitespace, but no newlines
    nspace = re.compile(r"[^\S\n]+", re.UNICODE)
    # At least two consecutive newlines
    n2ret = re.compile("\n\n+")
    # A newline followed by a space
    retspace = re.compile("(\n )")

    @staticmethod
    def to_nice_text(html):
        """Remove all HTML tags, but produce a nicely formatted text."""
        if html is None:
            return ""
        text = str(html)
        text = HtmlTool.script_sheet.sub("", text)
        text = HtmlTool.comment.sub("", text)
        text = HtmlTool.nwhites.sub(" ", text)
        text = HtmlTool.p_div.sub("\n", text)  # convert <p>, <div>, <br> to "\n"
        text = HtmlTool.tag.sub("", text)      # remove all tags
        text = unescape(text)                  # convert HTML entities to Unicode
        # Get whitespace right
        text = HtmlTool.nspace.sub(" ", text)
        text = HtmlTool.retspace.sub("\n", text)
        text = HtmlTool.n2ret.sub("\n\n", text)
        text = text.strip()
        return text

There might be some superfluous regular expressions left in the code.
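
As a rough illustration of the textwrap suggestion in this answer, the sketch below wraps a piece of text and indents every line of it, which is roughly what the blockquote in the question needs. The name indent_block and its default width and prefix are assumptions made for this example only.

```python
import textwrap


def indent_block(text, width=40, prefix="    "):
    """Wrap text to at most `width` columns, prefixing every line with `prefix`."""
    return textwrap.fill(text, width=width,
                         initial_indent=prefix,
                         subsequent_indent=prefix)


quote = "More magnificent text here"
print("Some text\n\n" + indent_block(quote) + "\n\nFinal text")
```

Note that the width passed to textwrap.fill includes the prefix, so long quotes are wrapped to fit inside the indented column.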

score 4 · Accepted Answer

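Python's built-in html.parser module (HTMLParser in earlier versions) can easily be extended to create a simple translator that you can tailor to your exact needs. It lets you hook into certain events as the parser eats through the HTML.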

Because of its simple nature, you can't navigate the HTML tree as you would with Beautiful Soup (e.g. sibling, child, and parent nodes), but for a simple case like yours it should be enough.

html.parser homepage

In your case, you could use it like this, adding the appropriate formatting whenever a start tag or end tag of a specific type is encountered:

from html.parser import HTMLParser
from os import linesep

class MyHTMLParser(HTMLParser):
    def __init__(self):
        # The strict argument was removed in Python 3.5; passing it raises TypeError.
        HTMLParser.__init__(self)
    def feed(self, in_html):
        self.output = ""
        super(MyHTMLParser, self).feed(in_html)
        return self.output
    def handle_data(self, data):
        self.output += data.strip()
    def handle_starttag(self, tag, attrs):
        if tag == 'li':
            self.output += linesep + '* '
        elif tag == 'blockquote' :
            self.output += linesep + linesep + '\t'
    def handle_endtag(self, tag):
        if tag == 'blockquote':
            self.output += linesep + linesep

parser = MyHTMLParser()
content = "<ul><li>One</li><li>Two</li></ul>"
print(linesep + "Example 1:")
print(parser.feed(content))
content = "Some text<blockquote>More magnificent text here</blockquote>Final text"
print(linesep + "Example 2:")
print(parser.feed(content))
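
Since the parser only reports events and offers no tree to navigate, any context you need has to be tracked by hand. The sketch below is one possible way to do that, assuming you want nested lists indented by depth; NestedListParser and nested_list_to_text are hypothetical names introduced for illustration and are not part of the answer's code.

```python
from html.parser import HTMLParser


class NestedListParser(HTMLParser):
    """Hypothetical sketch: track <ul>/<ol> depth to indent nested list items."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # number of lists we are currently inside
        self.parts = []  # collected output fragments

    def handle_starttag(self, tag, attrs):
        if tag in ("ul", "ol"):
            self.depth += 1
        elif tag == "li":
            # Indent by four spaces per nesting level below the first
            indent = "    " * max(self.depth - 1, 0)
            self.parts.append("\n" + indent + "* ")

    def handle_endtag(self, tag):
        if tag in ("ul", "ol"):
            self.depth = max(self.depth - 1, 0)

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.parts.append(text)


def nested_list_to_text(html):
    parser = NestedListParser()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.parts).strip()


print(nested_list_to_text(
    "<ul><li>One<ul><li>Inner</li></ul></li><li>Two</li></ul>"))
```

A simple counter is enough here; more involved formatting would probably call for a stack of open tags, at which point a tree-based tool such as Beautiful Soup may be the easier choice.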

score 0 · Accepted Answer

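While using samaspin's solution, I found that if there are non-English Unicode characters, the parser stops working and just returns an empty string. Initializing the parser anew for each loop iteration ensures that even if one parser object gets corrupted, subsequent parses don't return an empty string. In addition to samaspin's solution, this version also handles the <br> tag. As for processing the HTML rather than just stripping the tags, further tags can be added, along with their expected output, in the handle_starttag function.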

from html.parser import HTMLParser
from os import linesep


class MyHTMLParser(HTMLParser):
    """
    This class is used to clean the HTML tags whilst ensuring the
    format is maintained. All the whitespace, newlines, line breaks, etc. are
    converted from HTML tags to their respective counterparts in Python.
    """

    def __init__(self):
        HTMLParser.__init__(self)

    def feed(self, in_html):
        # Reset the output on every call, then return the collected text
        self.output = ""
        super(MyHTMLParser, self).feed(in_html)
        return self.output

    def handle_data(self, data):
        self.output += data.strip()

    def handle_starttag(self, tag, attrs):
        if tag == 'li':
            self.output += linesep + '* '
        elif tag == 'blockquote':
            self.output += linesep + linesep + '\t'
        elif tag == 'br':
            self.output += linesep + '\n'

    def handle_endtag(self, tag):
        if tag == 'blockquote':
            self.output += linesep + linesep


parser = MyHTMLParser()
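
The "initialize the parser for each loop" advice could look roughly like the sketch below. It assumes a simplified stand-in for the MyHTMLParser class above; TextParser and html_to_text are hypothetical names, and "\n" is used instead of linesep to keep the output platform-independent.

```python
from html.parser import HTMLParser


class TextParser(HTMLParser):
    """Simplified stand-in for MyHTMLParser (assumption: only <li> and <br>)."""

    def __init__(self):
        super().__init__()
        self.output = ""

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.output += "\n* "
        elif tag == "br":
            self.output += "\n"

    def handle_data(self, data):
        self.output += data.strip()


def html_to_text(html):
    parser = TextParser()  # fresh instance: no state carried over between documents
    parser.feed(html)
    parser.close()         # flush any buffered trailing text
    return parser.output.strip()


documents = ["<ul><li>Één</li><li>Zwei</li></ul>", "Grüße<br>señor"]
for doc in documents:
    print(html_to_text(doc))
```

Creating one parser per document costs little and means a failure while parsing one input cannot affect the next.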
