python - Heavy regular expression - takes a really long time

Question

I am using the following regular expression to detect opening and closing script tags in an HTML file:

<script(?:[^<]+|<(?:[^/]|/(?:[^s])))*>(?:[^<]+|<(?:[^/]|/(?:[^s]))*)</script>

In short: <script NOT</s > NOT</s </script>

It works, but it needs a really long time to detect <script>: minutes or even hours for long strings.

The light version works perfectly, even on long strings:

<script[^<]*>[^<]*</script>

However, I also use the extended pattern for other tags such as <a>, where < and > are possible inside attribute values.

A Python test for you:

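import re
# Same pattern as above, compiled case-insensitively; the raw string keeps the regex literal.
pattern = re.compile(r'<script(?:[^<]+|<(?:[^/]|/(?:[^s])))*>(?:[^<]+|<(?:[^/]|/(?:[^s]))*)</script>', re.I | re.DOTALL)
# Short input: returns quickly.
pattern.search('11<script type="text/javascript"> easy>example</script>22').group()
# Long input: triggers heavy backtracking and may appear to hang.
pattern.search('<script type="text/javascript">' + ('hard example' * 50) + '</script>').group()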

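How can I fix it? The inner part of the regex (after <script>) should be changed and simplified.

P.S. :) I anticipate your answers about this being the wrong approach, such as using regular expressions for HTML parsing.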

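Comment: Well, I need to handle cases like:
each <a < document like this.border="5px;">
and the approach is to use parsers and regular expressions together. BeautifulSoup is only about 2k lines; it does not handle every kind of HTML and merely extends the regular expressions from sgmllib.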

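The main reason is that I must know exactly where every tag starts and stops, and every kind of broken HTML must be handled.
BS is not perfect; sometimes this happens:
BeautifulSoup('< scriPt\n\n>a<aa>s< /script>').findAll('script') == []

[...] is available in Python's re.
So non-greedy everything, .*? up to <\s*/\s*tag\s*>, is the winner at this point.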

I know it is not perfect in this case: re.search('<\s*script.*?<\s*/\s*script\s*>','< script </script> shit </script>').group() but I can handle the rejected tail in the next parsing pass.
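As a rough illustration of the non-greedy approach described in these comments, the sketch below builds the `<\s*tag.*?<\s*/\s*tag\s*>` pattern for a given tag name and reports match offsets with `Match.span()`. The helper name `element_spans` is made up for this example, and it keeps the limitation mentioned above: matching stops at the first closing tag it sees.

```python
import re

def element_spans(html, tag):
    """Return (start, end) offsets of each lazily matched <tag ...>...</tag> run."""
    name = re.escape(tag)
    # Lazy .*? stops at the first closing tag; \s* tolerates stray whitespace.
    pattern = re.compile(r'<\s*' + name + r'.*?<\s*/\s*' + name + r'\s*>',
                         re.I | re.DOTALL)
    return [m.span() for m in pattern.finditer(html)]

broken = '< scriPt\n\n>a<aa>s< /script>'
tail_case = '< script </script> shit </script>'

print(element_spans(broken, 'script'))     # one span covering the whole string
print(element_spans(tail_case, 'script'))  # the trailing text is left unmatched
```

Because the offsets come straight from the match object, the unmatched tail can be fed to a later pass, as the comment suggests.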

It is pretty clear that parsing HTML with regular expressions is not a fight you win in a single battle.

score 3 · Accepted Answer

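Use an HTML parser such as BeautifulSoup.

See the great answers to "Can I remove script tags with BeautifulSoup?".

If your only tool is a hammer, every problem starts to look like a nail. Regular expressions are a powerful hammer, but they are not always the best solution for a given problem.

I guess you want to remove scripts from user-posted HTML for security reasons. If security is the main concern, regular expressions are hard to get right: there are many things an attacker can change to fool your regex that most browsers will still happily evaluate... A specialized parser is easier to use, performs better, and is safer.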
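To give a feel for the parser-based route, here is a minimal sketch. It does not use BeautifulSoup, which the answer recommends; it swaps in the standard library's `html.parser.HTMLParser` so the example runs without extra packages. The class name `ScriptLocator` and its `events` attribute are invented for illustration. It records where each script start and end tag begins, which is the kind of position information the question asks for.

```python
from html.parser import HTMLParser

class ScriptLocator(HTMLParser):
    """Record where <script> start and end tags begin in the input."""

    def __init__(self):
        super().__init__()
        self.events = []  # (kind, (line, column)) tuples

    def handle_starttag(self, tag, attrs):
        if tag == 'script':
            # getpos() gives (1-based line, 0-based column) of the tag being handled
            self.events.append(('start', self.getpos()))

    def handle_endtag(self, tag):
        if tag == 'script':
            self.events.append(('end', self.getpos()))

html = '11<script type="text/javascript"> easy>example</script>22'
locator = ScriptLocator()
locator.feed(html)
locator.close()
print(locator.events)
```

How well this copes with badly broken markup depends on the parser, so treat it as a starting point and not as a sanitizer.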

If you are still thinking "why can't I use regex", read this answer pointed to by mayhewr's comment. I could not put it better; the guy nailed it, and his 4433 upvotes are well deserved.

score 2 · Accepted Answer

I don't know Python, but I know regular expressions.

If you use the greedy/non-greedy operators, you get a much simpler regex:

<script.*?>.*?</script>

This assumes that there are no nested scripts.
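For reference, this is roughly how the lazy pattern could be tried in Python against the two test strings from the question. The flags are an assumption carried over from the question's own code: `re.DOTALL` is needed for `.` to cross newlines, and `re.I` accepts upper-case tag names. The variable names are illustrative.

```python
import re

# Lazy quantifiers stop at the first '>' and at the first '</script>'.
script_re = re.compile(r'<script.*?>.*?</script>', re.I | re.DOTALL)

html = '11<script type="text/javascript"> easy>example</script>22'
long_html = '<script type="text/javascript">' + 'hard example' * 50 + '</script>'

print(script_re.search(html).group())
# The long input is matched in full without runaway backtracking.
print(script_re.search(long_html).group() == long_html)
```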

score 0 · Accepted Answer

The problem with the pattern is that it backtracks. Using atomic groups can solve this issue. Change your pattern to this:

<script(?>[^<]+?|<(?:[^/]|/(?:[^s])))*>(?>[^<]+|<(?:[^/]|/(?:[^s]))*)</script>   
         ^^^^^                           ^^^^^
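A hedged way to try this from Python: the `re` module accepts the `(?>...)` syntax only from Python 3.11, so the sketch below falls back to a variant without atomic groups on older interpreters. The fallback is my own reading, not part of the answer. It relies on the fact that an atomic group around the lazy `[^<]+?` can only ever consume one character, so `[^<]` stands in for it. It avoids the nested quantifier that causes the slowdown, but it is not guaranteed to accept exactly the same inputs as the atomic pattern. The names `ATOMIC` and `PORTABLE` are illustrative.

```python
import re
import sys

# Pattern from the answer; needs Python 3.11+ for (?>...).
ATOMIC = r'<script(?>[^<]+?|<(?:[^/]|/(?:[^s])))*>(?>[^<]+|<(?:[^/]|/(?:[^s]))*)</script>'
# Assumed fallback: one character per iteration, no nested quantifier.
PORTABLE = r'<script(?:[^<]|<(?:[^/]|/[^s]))*>(?:[^<]+|<(?:[^/]|/[^s])*)</script>'

source = ATOMIC if sys.version_info >= (3, 11) else PORTABLE
pattern = re.compile(source, re.I | re.DOTALL)

easy = '11<script type="text/javascript"> easy>example</script>22'
hard = '<script type="text/javascript">' + 'hard example' * 50 + '</script>'

print(pattern.search(easy).group())
print(pattern.search(hard).group() == hard)
```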

Explanation

<!--
<script(?>[^<]+?|<(?:[^/]|/(?:[^s])))*>(?>[^<]+|<(?:[^/]|/(?:[^s]))*)</script>

Match the characters “<script” literally «<script»
Atomic group (supported by Python's re module only from version 3.11) «(?>[^<]+?|<(?:[^/]|/(?:[^s])))*»
   Match either the regular expression below (attempting the next alternative only if this one fails) «[^<]+?»
      Match any character that is NOT a “<” «[^<]+?»
         Between one and unlimited times, as few times as possible, expanding as needed (lazy) «+?»
   Or match regular expression number 2 below (the entire group fails if this one fails to match) «<(?:[^/]|/(?:[^s]))»
      Match the character “<” literally «<»
      Match the regular expression below «(?:[^/]|/(?:[^s]))»
         Match either the regular expression below (attempting the next alternative only if this one fails) «[^/]»
            Match any character that is NOT a “/” «[^/]»
         Or match regular expression number 2 below (the entire group fails if this one fails to match) «/(?:[^s])»
            Match the character “/” literally «/»
            Match the regular expression below «(?:[^s])»
               Match any character that is NOT a “s” «[^s]»
Match the character “>” literally «>»
Atomic group (supported by Python's re module only from version 3.11) «(?>[^<]+|<(?:[^/]|/(?:[^s]))*)»
   Match either the regular expression below (attempting the next alternative only if this one fails) «[^<]+»
      Match any character that is NOT a “<” «[^<]+»
         Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
   Or match regular expression number 2 below (the entire group fails if this one fails to match) «<(?:[^/]|/(?:[^s]))*»
      Match the character “<” literally «<»
      Match the regular expression below «(?:[^/]|/(?:[^s]))*»
         Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
         Match either the regular expression below (attempting the next alternative only if this one fails) «[^/]»
            Match any character that is NOT a “/” «[^/]»
         Or match regular expression number 2 below (the entire group fails if this one fails to match) «/(?:[^s])»
            Match the character “/” literally «/»
            Match the regular expression below «(?:[^s])»
               Match any character that is NOT a “s” «[^s]»
Match the characters “</script>” literally «</script>»
-->
