python - Requesting a robust regex to check whether img tags in an HTML document contain an alt attribute

Question

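I am writing a Python script to check the IMG tags in an HTML document. I need to confirm that alt="" is present inside each IMG tag, and then print the line number.

The regex has to account for the attributes appearing in different orders. For example: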

<img class="" alt="" src="">
<img class="" src="">
<img src="" class="">
<img src="">

So, in summary: a regex that confirms all the expected attributes of an img tag are present needs to account for the range of possible orderings.

Thanks

score 2 · Accepted Answer

Using regular expressions to evaluate HTML is a bit risky, but if you are willing to accept the drawbacks*, you can make this work with positive lookahead assertions.

regex = re.compile(r'<img (?=[^>]*\balt=")(?=[^>]*\bsrc=")(?=[^>]*\bclass=")')

This matches if, at the current position in the string, <img is followed (within the same tag) by alt=", src=" and class=" in any order.
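Since the question also asks for line numbers, here is a minimal sketch of how the lookahead idea could be used for that. It assumes only the alt attribute needs checking, so the pattern is reduced to one lookahead; the names HAS_ALT, IMG_START and lines_missing_alt are hypothetical and not part of the original answer.

```python
import re

# The answer's pattern, reduced to the one attribute the question asks about.
HAS_ALT = re.compile(r'<img (?=[^>]*\balt=")')
# Hypothetical helper pattern: every position where an img tag starts.
IMG_START = re.compile(r'<img ')

def lines_missing_alt(html):
    """Return 1-based line numbers of <img tags that lack alt=" (regex-based)."""
    missing = []
    for m in IMG_START.finditer(html):
        # Pattern.match with a pos argument tests the lookahead at this tag only.
        if not HAS_ALT.match(html, m.start()):
            # The number of newlines before the tag gives its line number.
            missing.append(html.count("\n", 0, m.start()) + 1)
    return missing

sample = '''<img class="" alt="" src="">
<img class="" src="">
<img src="" class="">
<img src="">
'''
print(lines_missing_alt(sample))  # [2, 3, 4]
```

Working on the whole document rather than line by line means a tag whose attributes are spread over several lines is still checked as one tag, and the reported line is the one where the tag starts.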

Explanation:

<img    # Match '<img'
(?=     # Assert that it's possible to match the following from this position:
 [^>]*  #  Any number of characters except >
 \b     #  A word boundary (here: start of a word)
 alt="  #  The literal text 'alt="'
)       # End of lookahead
(?=[^>]*\bsrc=")   # Do the same for `src`, from the same position as before
(?=[^>]*\bclass=") # Do the same for `class`, from the same position as before
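The commented breakdown above can be compiled almost as written by using re.VERBOSE. The following is a sketch under one assumption: the literal space after <img is written as [ ], because verbose mode ignores unescaped whitespace in the pattern. The tags list simply reuses the four examples from the question.

```python
import re

# Verbose form of the same pattern; comments inside the pattern are ignored.
regex = re.compile(r'''
    <img[ ]              # Match the text <img followed by one space
    (?=                  # Assert that the following can match from here:
      [^>]*              #   any number of characters except >
      \b                 #   a word boundary (here: start of a word)
      alt="              #   the literal text alt="
    )                    # End of lookahead
    (?=[^>]*\bsrc=")     # Same for src, from the same position
    (?=[^>]*\bclass=")   # Same for class, from the same position
''', re.VERBOSE)

tags = [
    '<img class="" alt="" src="">',
    '<img class="" src="">',
    '<img src="" class="">',
    '<img src="">',
]
results = [bool(regex.match(tag)) for tag in tags]
print(results)  # [True, False, False, False]
```

Only the first tag carries all three attributes, so it is the only match. Because every lookahead starts from the same position, the order of the attributes inside the tag does not matter.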

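*Of course, this regex is completely oblivious to whether the matched tag sits inside a comment, is interrupted by a comment, is malformed, is enclosed in <pre> tags, or is in any other situation that would change its meaning for a real HTML parser.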
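If those drawbacks matter, one alternative is to let a real parser find the tags. The sketch below uses the standard library's html.parser module instead of a regex; this is a swapped-in technique, not what the answer proposes, and the names ImgAltChecker and lines_missing_alt_parsed are hypothetical.

```python
from html.parser import HTMLParser

class ImgAltChecker(HTMLParser):
    """Hypothetical helper: records line numbers of img tags without alt."""

    def __init__(self):
        super().__init__()
        self.missing = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag and names are lowercased.
        if tag == "img" and "alt" not in dict(attrs):
            line, _offset = self.getpos()  # position where this tag starts
            self.missing.append(line)

def lines_missing_alt_parsed(html):
    """Return 1-based line numbers of img tags that have no alt attribute."""
    checker = ImgAltChecker()
    checker.feed(html)
    checker.close()
    return checker.missing

doc = '''<p>
<img src="a.png">
<!-- <img src="commented-out.png"> -->
<img alt="logo" src="b.png">
</p>
'''
print(lines_missing_alt_parsed(doc))  # [2]
```

The img tag inside the comment is not reported, because the parser hands comments to a separate handler rather than to handle_starttag.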
