python - How to catch all tags inside a specific tag with a regular expression?

Question

For example, I have markup like this:

<tag1 blablablah>sometext<i>sometext</i>sometext<i>sometext</i>sometext</tag1>

What I want to do is turn it into this:

<tag1 blablablah>sometext<XXX><i></XXX>sometext<XXX></i></XXX>sometext<XXX><i></XXX>sometext<XXX></i></XXX>sometext</tag1>

I am using this regular expression for the search (it works in Notepad++ and with Python's re.compile function as well):

(<tag1[^>]*>.*?)(<[^>]*>.*?)(.*?</tag1>)

And this for the replacement (it also works with re.sub):

\1<XXX>\2</XXX>\3

But it only finds and changes the first occurrence:

<tag1 blablablah>sometext<XXX><i></XXX>sometext</i>sometext<i>sometext</i>sometext</tag1>

Can anyone help me with this?

score 2 · Accepted Answer

Try this:

<((?:[a-z]+:)?[a-z]\w+)\b[^<>]+?>(.+)</\1>

Explanation

<              # Match the character “<” literally
(              # Match the regular expression below and capture its match into backreference number 1
   (?:            # Match the regular expression below
      [a-z]          # Match a single character in the range between “a” and “z”
         +              # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
      :              # Match the character “:” literally
   )?             # Between zero and one times, as many times as possible, giving back as needed (greedy)
   [a-z]          # Match a single character in the range between “a” and “z”
   \w             # Match a single character that is a “word character” (letters, digits, and underscores)
      +              # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
)
\b             # Assert position at a word boundary
[^<>]          # Match a single character NOT present in the list “<>”
   +?             # Between one and unlimited times, as few times as possible, expanding as needed (lazy)
>              # Match the character “>” literally
(              # Match the regular expression below and capture its match into backreference number 2
   .              # Match any single character that is not a line break character
      +              # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
)
</             # Match the characters “</” literally
\1             # Match the same text as most recently matched by capturing group number 1
>              # Match the character “>” literally
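
As a rough sketch of how this pattern could be put to work in Python: it captures the outer tag name in group 1 and everything between the outer tags in group 2, so a second substitution can be run on group 2 alone. The names `OUTER`, `INNER_TAG` and `wrap_inner_tags` are made up for this illustration and are not part of the answer above. Note that the pattern matches any element with at least one attribute character, not only `tag1`.

```python
import re

# The pattern from the answer: group 1 = outer tag name, group 2 = inner content.
OUTER = re.compile(r'<((?:[a-z]+:)?[a-z]\w+)\b[^<>]+?>(.+)</\1>')
# Assumed helper pattern: any single tag inside the inner content.
INNER_TAG = re.compile(r'<[^<>]+>')


def wrap_inner_tags(text):
    """Wrap every tag found inside the outer element in <XXX>...</XXX>."""
    def fix(m):
        # Text of the outer opening tag: everything before group 2 begins.
        opening = m.group(0)[:m.start(2) - m.start(0)]
        # Wrap each inner tag; the outer tags are never touched.
        inner = INNER_TAG.sub(lambda t: '<XXX>' + t.group(0) + '</XXX>', m.group(2))
        return opening + inner + '</' + m.group(1) + '>'
    return OUTER.sub(fix, text)


s = '<tag1 blablablah>sometext<i>sometext</i>sometext<i>sometext</i>sometext</tag1>'
print(wrap_inner_tags(s))
```

Passing a function to the substitution is what lets every inner tag be wrapped in one pass, instead of only the first one.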

score 0 · Accepted Answer

The problem is avoiding the first and last tags. If you split them off, it becomes quite easy.

import re

s = '<tag1 blablablah>sometext<i>sometext</i>sometext<i>sometext</i>sometext</tag1>'
# Split off the outer opening and closing tags, then wrap only the tags in between.
start, end = s.find('>') + 1, s.rfind('<')
s_list = [s[:start], s[start:end], s[end:]]
s_list[1] = re.sub(r'(<[^>]*>)', r'<XXX>\1</XXX>', s_list[1])
print(''.join(s_list))

However, this is not a one-liner.

Alternatively, you could do it like this:

print(re.sub(r'([^(^<)])(<[^>]*>(?!$))', r'\1<XXX>\2</XXX>', s))

Note that this only works when the outermost tag sits at the very start and very end of the string.
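
To illustrate that caveat, here is a small sketch that runs the one-liner on two inputs: one where the outer tag spans the whole string, and one with a character added on each side. The names `ONE_LINER`, `wrap_tags`, `bare` and `padded` are made up for this example, and the padded string is a hypothetical input, not something from the question.

```python
import re

# The one-liner pattern: a tag is wrapped only if some character other than
# "(", "^", "<" or ")" precedes it and the tag does not end the string.
ONE_LINER = re.compile(r'([^(^<)])(<[^>]*>(?!$))')


def wrap_tags(text):
    return ONE_LINER.sub(r'\1<XXX>\2</XXX>', text)


bare = '<tag1 blablablah>sometext<i>sometext</i>sometext</tag1>'
padded = 'x' + bare + 'y'  # hypothetical: outer tag no longer at the string edges

print(wrap_tags(bare))    # outer tags are left alone
print(wrap_tags(padded))  # outer tags get wrapped as well
```

If the outer element can appear in the middle of a longer string, the split-based version is probably the safer of the two.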

score 0 · Accepted Answer

Try changing your pattern like this:

(<tag1[^>]*>).*?(<[^>]+>).*?(</tag1>)

answered 2012-05-29T16:50:59.277
