5 Answers
The problem is that <code> is treated according to the normal rules for HTML markup, and content inside <code> tags is still HTML (the tag exists mainly to drive CSS formatting, not to change the parsing rules).
What you are trying to do is create a different markup language that is very similar, but not identical, to HTML. The simple solution would be to assume certain rules, such as "<code> and </code> must appear on a line by themselves," and do some pre-processing yourself.
- A very simple, though not 100% reliable, technique is to replace ^<code>$ with <code><![CDATA[ and ^</code>$ with ]]></code>. It isn't completely reliable, because if the code block contains ]]>, things will go horribly wrong.
- A safer option is to replace dangerous characters inside code blocks (<, > and & probably suffice) with their equivalent character entity references (&lt;, &gt; and &amp;). You can do this by passing each block of code you identify to cgi.escape(code_block).
Once you've completed preprocessing, submit the result to BeautifulSoup as usual.
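The safer option above can be sketched roughly as follows. This is a minimal illustration, not the answerer's code: it assumes Python 3, where html.escape from the standard library stands in for cgi.escape, and it assumes the "<code> and </code> on a line by themselves" rule. The function name escape_code_blocks is hypothetical.

```python
import html


def escape_code_blocks(text):
    """Escape <, > and & on every line between a <code> line and a </code> line.

    Assumption: the delimiters sit on lines of their own (surrounding
    whitespace is tolerated). Everything outside such blocks is left untouched.
    """
    out = []
    in_code = False
    for line in text.split("\n"):
        stripped = line.strip()
        if not in_code and stripped == "<code>":
            in_code = True
            out.append(line)
        elif in_code and stripped == "</code>":
            in_code = False
            out.append(line)
        elif in_code:
            # quote=False leaves quotation marks alone, like cgi.escape did by default
            out.append(html.escape(line, quote=False))
        else:
            out.append(line)
    return "\n".join(out)


sample = "<p>Example:</p>\n<code>\n#include <string.h>\nif (a && b) {}\n</code>"
print(escape_code_blocks(sample))
```

The result can then be handed to the parser like any other HTML string.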
From Python wiki
>>> import cgi
>>> from BeautifulSoup import BeautifulSoup
>>> cgi.escape("<string.h>")
'&lt;string.h&gt;'
>>> BeautifulSoup('&lt;string.h&gt;',
...               convertEntities=BeautifulSoup.HTML_ENTITIES)
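For readers on current Python: cgi.escape was removed in Python 3.8, and html.escape is the standard-library replacement. The sketch below shows the same escape-then-restore round trip using only the html module; it is an illustration under that assumption, not part of the original answer.

```python
import html

# html.escape defaults to quote=True; quote=False matches cgi.escape's old default
escaped = html.escape("<string.h>", quote=False)

# html.unescape turns the entity references back into the literal characters
restored = html.unescape(escaped)

print(escaped)
print(restored)
```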
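If a <code> element contains unescaped <, > or & characters inside the code, it is not valid html. BeautifulSoup will try to convert it to valid html, which is probably not what you want.
To convert the text to valid html, you can adapt a regex that strips tags from html so that it extracts the text from each <code> block and replaces it with the cgi.escape() version. This should work fine as long as there are no nested <code> tags. After that you can feed the sanitized html to BeautifulSoup.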
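A rough sketch of that regex approach, assuming Python 3 (html.escape in place of cgi.escape) and <code> tags that carry no attributes and are never nested. The names CODE_RE and sanitize_code are hypothetical.

```python
import html
import re

# Non-greedy match so each <code>...</code> pair is handled on its own.
# DOTALL lets the code span several lines.
CODE_RE = re.compile(r"(<code>)(.*?)(</code>)", re.DOTALL)


def sanitize_code(markup):
    """Return markup with the text of every <code> block escaped."""
    def escape_match(match):
        opening, body, closing = match.groups()
        return opening + html.escape(body, quote=False) + closing

    return CODE_RE.sub(escape_match, markup)


print(sanitize_code("see <code>#include <stdio.h></code> and <code>a & b</code>"))
```

Because the match is non-greedy, a nested <code> inside a block would end the match early, which is why the no-nesting assumption matters.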
Unfortunately, BeautifulSoup cannot be prevented from parsing the code blocks.
One way to achieve what you want is to:
1) Remove the code blocks
soup = BeautifulSoup(unicode(content))
code_blocks = soup.findAll(u'code')
for block in code_blocks:
    block.replaceWith(u'<code class="removed"></code>')
2) Do the usual parsing to strip the non-allowed tags.
3) Re-insert the code blocks and re-generate the html.
stripped_code = stripped_soup.findAll(u"code", u"removed")
# re-insert pygment formatted code
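The three steps can also be sketched without a parser, working on the raw string. This is only an illustration under stated assumptions: Python 3, un-nested <code> tags without attributes, and a sanitizing step that leaves the placeholders intact. The helper names stash_code, restore_code and strip_disallowed, and the data-index attribute, are hypothetical; strip_disallowed is a stand-in for whatever tag filtering you actually use.

```python
import html
import re

CODE_RE = re.compile(r"<code>(.*?)</code>", re.DOTALL)
PLACEHOLDER_RE = re.compile(r'<code class="removed" data-index="(\d+)"></code>')


def stash_code(markup):
    """Step 1: swap each code block for a numbered placeholder."""
    blocks = []

    def stash(match):
        blocks.append(match.group(1))
        return '<code class="removed" data-index="%d"></code>' % (len(blocks) - 1)

    return CODE_RE.sub(stash, markup), blocks


def strip_disallowed(markup):
    """Step 2 (stand-in): drop <blink> tags; a real filter would go here."""
    return re.sub(r"</?blink>", "", markup)


def restore_code(markup, blocks):
    """Step 3: put the escaped code back where the placeholders are."""
    def restore(match):
        code = blocks[int(match.group(1))]
        return "<code>%s</code>" % html.escape(code, quote=False)

    return PLACEHOLDER_RE.sub(restore, markup)


stripped, blocks = stash_code("<blink>hi</blink> <code>a < b</code>")
print(restore_code(strip_disallowed(stripped), blocks))
```

In step 3 the escaped text could just as well be replaced by syntax-highlighted output, as the comment about pygment formatting suggests.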
I would have answered with some code, but I recently read a blog that does this elegantly.
Edit:
Use python-markdown2 to process the input, and have users indent their code regions.
>>> print html
I like this article, but the third code example <em>could have been simpler</em>:

    #include <stdbool.h>
    #include <stdio.h>

    int main()
    {
        printf("Hello World\n");
    }

>>> import markdown2
>>> marked = markdown2.markdown(html)
>>> marked
u'<p>I like this article, but the third code example <em>could have been simpler</em>:</p>\n\n<pre><code>#include &lt;stdbool.h&gt;\n#include &lt;stdio.h&gt;\n\nint main()\n{\n    printf("Hello World\\n");\n}\n</code></pre>\n'
>>> print marked
<p>I like this article, but the third code example <em>could have been simpler</em>:</p>

<pre><code>#include &lt;stdbool.h&gt;
#include &lt;stdio.h&gt;

int main()
{
    printf("Hello World\n");
}
</code></pre>
If you need to navigate and edit the result with BeautifulSoup, do the following. Include the entity conversion if you need '<' and '>' re-inserted (instead of '&lt;' and '&gt;').
>>> soup = BeautifulSoup(marked,
...                      convertEntities=BeautifulSoup.HTML_ENTITIES)
>>> soup
<p>I like this article, but the third code example <em>could have been simpler</em>:</p>

<pre><code>#include <stdbool.h>
#include <stdio.h>

int main()
{
    printf("Hello World\n");
}
</code></pre>
import cgi

def thickened(soup):
    """
    Escape the contents of every code tag in the soup, e.g.

    <code>
    blah blah <entity> blah
    blah
    </code>
    """
    codez = soup.findAll('code')  # get the code tags
    for code in codez:
        # take all the contents inside of the code tags and convert
        # them into a single string
        escape_me = ''.join([str(k) for k in code.contents])
        escaped = cgi.escape(escape_me)  # escape them with cgi
        # replace the Tag object with the escaped string
        code.replaceWith('<code>%s</code>' % escaped)
    return soup