python - Truncating HTML in Python

Question

Are there any pure-Python tools that take some HTML and truncate it as close to a given length as possible, while making sure the resulting snippet is well-formed? For example, given this HTML:

<h1>This is a header</h1>
<p>This is a paragraph</p>

it should not produce:

<h1>This is a hea

but:

<h1>This is a header</h1>

or at least:

<h1>This is a hea</h1>

I found one tool, but it depends on pullparser, which is obsolete and no longer works, so I have not found anything usable.

score 7 · Accepted Answer

If you are using the Django library, you can do the following:

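from django.utils import text, html


class class_name():

    def trim_string(self, stringf, limit, offset=0):
        # Plain slice; not HTML-aware.
        return stringf[offset:limit]

    def trim_html_words(self, html, limit, offset=0):
        # Truncate to `limit` words and close any tags left open.
        # Older Django only; later releases replace this helper with
        # text.Truncator(html).words(limit, html=True).
        return text.truncate_html_words(html, limit)

    def remove_html(self, htmls, tag, limit='all', offset=0):
        # Strip every tag; `tag`, `limit` and `offset` are unused.
        return html.strip_tags(htmls)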

Anyway, here is the code for Django's truncate_html_words:

import re

def truncate_html_words(s, num):
    """
    Truncates html to a certain number of words (not counting tags and comments).
    Closes opened tags if they were correctly closed in the given html.
    """
    length = int(num)
    if length <= 0:
        return ''
    html4_singlets = ('br', 'col', 'link', 'base', 'img', 'param', 'area', 'hr', 'input')
    # Set up regular expressions
    re_words = re.compile(r'&.*?;|<.*?>|([A-Za-z0-9][\w-]*)')
    re_tag = re.compile(r'<(/)?([^ ]+?)(?: (/)| .*?)?>')
    # Count non-HTML words and keep note of open tags
    pos = 0
    ellipsis_pos = 0
    words = 0
    open_tags = []
    while words <= length:
        m = re_words.search(s, pos)
        if not m:
            # Checked through whole string
            break
        pos = m.end(0)
        if m.group(1):
            # It's an actual non-HTML word
            words += 1
            if words == length:
                ellipsis_pos = pos
            continue
        # Check for tag
        tag = re_tag.match(m.group(0))
        if not tag or ellipsis_pos:
            # Don't worry about non tags or tags after our truncate point
            continue
        closing_tag, tagname, self_closing = tag.groups()
        tagname = tagname.lower()  # Element names are always case-insensitive
        if self_closing or tagname in html4_singlets:
            pass
        elif closing_tag:
            # Check for match in open tags list
            try:
                i = open_tags.index(tagname)
            except ValueError:
                pass
            else:
                # SGML: An end tag closes, back to the matching start tag, all unclosed intervening start tags with omitted end tags
                open_tags = open_tags[i+1:]
        else:
            # Add it to the start of the open tags list
            open_tags.insert(0, tagname)
    if words <= length:
        # Don't try to close tags if we don't need to truncate
        return s
    out = s[:ellipsis_pos] + ' ...'
    # Close any tags still open
    for tag in open_tags:
        out += '</%s>' % tag
    # Return string
    return out

score 7 · Accepted Answer

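I don't think you need a full-fledged parser. You only need to tokenize the input string into one of the following:

text
open tag
close tag
self-closing tag
character entity

Once you have a stream of tokens like that, it is easy to use a stack to keep track of which tags need closing. I actually ran into this problem a while ago and wrote a small library to do this: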

https://github.com/eentzel/htmltruncate.py

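It works well for me and handles most corner cases, including arbitrarily nested markup, counting character entities as a single character, and returning an error on malformed markup.

It will produce:

<h1>This is a hea</h1>

on your example. This could perhaps be changed, but it is hard in the general case. What if you are trying to truncate to 10 characters, but the <h1> tag is not closed for another, say, 300 characters?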
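The tokenize-and-stack idea described above can be sketched with the standard library alone. This is a minimal illustration, not the code of htmltruncate.py: the names `TOKEN_RE`, `VOID_TAGS` and `truncate_with_stack` are hypothetical, the list of void tags is partial, and comments, doctypes, script content and attribute values containing `>` are not handled.

```python
import re

# Tags that never take a closing tag (assumption: a partial list).
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}

# A token is a character entity, a tag, or one character of text.
TOKEN_RE = re.compile(
    r"(?P<entity>&#?\w+;)"
    r"|(?P<tag><(?P<close>/?)(?P<name>[A-Za-z][A-Za-z0-9]*)"
    r"[^>]*?(?P<selfclose>/?)>)"
    r"|(?P<char>.)",
    re.S,
)


def truncate_with_stack(html, limit):
    """Keep at most `limit` visible characters and close any open tags."""
    out = []
    stack = []  # names of the elements that are currently open
    count = 0   # visible characters emitted so far
    for m in TOKEN_RE.finditer(html):
        if m.group("tag") is None:
            # Text character or entity: each counts as one character.
            if count >= limit:
                break
            count += 1
        elif m.group("close"):
            name = m.group("name").lower()
            if not stack or stack[-1] != name:
                raise ValueError("unexpected closing tag: " + m.group(0))
            stack.pop()
        elif count >= limit:
            # Do not open new elements once the budget is used up.
            break
        elif not m.group("selfclose"):
            name = m.group("name").lower()
            if name not in VOID_TAGS:
                stack.append(name)
        out.append(m.group(0))
    # Close whatever is still open, innermost element first.
    out.extend("</%s>" % name for name in reversed(stack))
    return "".join(out)


sample = "<h1>This is a header</h1>\n<p>This is a paragraph</p>"
print(truncate_with_stack(sample, 13))  # -> <h1>This is a hea</h1>
print(truncate_with_stack(sample, 16))  # -> <h1>This is a header</h1>
```

Closing tags that directly follow the last kept character are still copied, so a limit that lands exactly on the end of the header yields the complete `<h1>` element rather than an early cut.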

score 4 · Accepted Answer

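I found the answer by slacy very helpful and would upvote it if I had the reputation, but there was one more thing to note. In my environment I had html5lib and BeautifulSoup4 installed, and BeautifulSoup used the html5lib parser. As a result, my HTML snippet was wrapped in html and body tags, which is not what I wanted.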

>>> truncate_html("<p>sdfsdaf</p>", 4)
u'<html><head></head><body><p>s</p></body></html>'

To fix this, I told BeautifulSoup to use Python's built-in parser:

from bs4 import BeautifulSoup
def truncate_html(html, length): 
    return unicode(BeautifulSoup(html[:length], "html.parser"))

>>> truncate_html("<p>sdfsdaf</p>", 4)
u'<p>s</p>'

score 3 · Accepted Answer

You can do this in one line with BeautifulSoup (assuming you want to truncate at a certain number of source characters rather than a number of content characters):

from BeautifulSoup import BeautifulSoup

def truncate_html(html, length): 
    return unicode(BeautifulSoup(html[:length]))

score 2 · Accepted Answer

This will meet your requirements. It is an easy-to-use HTML parser and bad-markup corrector:

http://www.crummy.com/software/BeautifulSoup/

score 0 · Accepted Answer

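My first thought would be to use an XML parser (perhaps Python's sax parser) and then count the text characters in each XML element. I would leave the tags out of the character count to keep it consistent and simple, but either way should be possible.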
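A rough sketch of that SAX idea, using only `xml.sax` from the standard library. The names `TruncatingHandler` and `truncate_with_sax` are hypothetical, and the sketch assumes the input is well-formed XML: HTML-only entities such as `&nbsp;` or unclosed tags make the parser raise. The fragment is wrapped in a synthetic `<root>` element so that several top-level elements still parse.

```python
import xml.sax
from xml.sax.saxutils import escape, quoteattr


class TruncatingHandler(xml.sax.ContentHandler):
    """Re-emit the document, keeping at most `limit` text characters."""

    def __init__(self, limit):
        super().__init__()
        self.remaining = limit  # text characters still allowed
        self.skipped = 0        # depth of elements opened past the limit
        self.out = []

    def startElement(self, name, attrs):
        if self.remaining <= 0:
            # Budget used up: drop this element and remember its depth.
            self.skipped += 1
            return
        attr_text = "".join(
            " %s=%s" % (key, quoteattr(value)) for key, value in attrs.items()
        )
        self.out.append("<%s%s>" % (name, attr_text))

    def endElement(self, name):
        if self.skipped:
            self.skipped -= 1
            return
        # Elements opened before the limit are always closed.
        self.out.append("</%s>" % name)

    def characters(self, content):
        if self.remaining <= 0:
            return
        piece = content[:self.remaining]
        self.remaining -= len(piece)
        self.out.append(escape(piece))


def truncate_with_sax(fragment, limit):
    handler = TruncatingHandler(limit)
    wrapped = "<root>%s</root>" % fragment
    xml.sax.parseString(wrapped.encode("utf-8"), handler)
    result = "".join(handler.out)
    # Remove the synthetic wrapper again.
    return result[len("<root>"):-len("</root>")]


sample = "<h1>This is a header</h1>\n<p>This is a paragraph</p>"
print(truncate_with_sax(sample, 13))  # -> <h1>This is a hea</h1>
```

Tags are ignored in the count, as suggested above. Note that the output is XML-style: an empty element such as `<br/>` comes back as `<br></br>`.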

score 0 · Accepted Answer

I would recommend parsing the HTML completely first and then truncating. A good HTML parser for Python is lxml. After parsing and truncating, you can print the result back out as HTML.
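The parse-first approach might look like the sketch below. To avoid a third-party dependency it uses the standard library's `xml.etree.ElementTree` as a stand-in for lxml, so it only accepts well-formed XML and does not show lxml's lenient HTML parsing. The names `_trim` and `truncate_parsed` are hypothetical.

```python
import xml.etree.ElementTree as ET


def _trim(elem, budget):
    """Trim the text under `elem` to `budget` characters; return what is left."""
    text = elem.text or ""
    elem.text = text[:budget]
    budget -= len(elem.text)
    for child in list(elem):
        if budget <= 0:
            # Nothing left to spend: drop the child (and its tail text).
            elem.remove(child)
            continue
        budget = _trim(child, budget)
        tail = child.tail or ""
        child.tail = tail[:budget]
        budget -= len(child.tail)
    return budget


def truncate_parsed(fragment, limit):
    # Wrap in a synthetic root so several top-level elements parse.
    root = ET.fromstring("<root>%s</root>" % fragment)
    _trim(root, max(limit, 0))
    # method="html" writes explicit end tags for non-void elements.
    html = ET.tostring(root, encoding="unicode", method="html")
    return html[len("<root>"):-len("</root>")]


sample = "<h1>This is a header</h1>\n<p>This is a paragraph</p>"
print(truncate_parsed(sample, 13))  # -> <h1>This is a hea</h1>
```

Because the whole tree exists before anything is cut, the closing tags come from the serializer rather than from manual bookkeeping, at the cost of parsing input that is later thrown away.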

score 0 · Accepted Answer

Take a look at HTML Tidy, which cleans up, reformats and reindents HTML.
