python - How to make Beautiful Soup encode and decode the contents of script tags

Question

I am trying to use BeautifulSoup to parse HTML, but whenever I hit a page with an inline script tag, BeautifulSoup encodes the contents of the tag and does not decode them again on output.

This is the code I am using:

from bs4 import BeautifulSoup

if __name__ == '__main__':

    htmlData = '<html> <head> <script type="text/javascript"> console.log("< < not able to write these & also these >> "); </script> </head> <body> <div> start of div </div> </body> </html>'
    soup = BeautifulSoup(htmlData)
    #... using BeautifulSoup ...
    print(soup.prettify() )

I want this output:

<html>
 <head>
  <script type="text/javascript">
   console.log("< < not able to write these & also these >> ");
  </script>
 </head>
 <body>
  <div>
   start of div
  </div>
 </body>
</html>

But I get this output:

<html>
 <head>
  <script type="text/javascript">
   console.log("&lt; &lt; not able to write these &amp; also these &gt;&gt; ");
  </script>
 </head>
 <body>
  <div>
   start of div
  </div>
 </body>
</html>

score 1 · Accepted Answer

Try lxml:

import lxml.html as LH

if __name__ == '__main__':
    htmlData = '<html> <head> <script type="text/javascript"> console.log("< < not able to write these & also these >> "); </script> </head> <body> <div> start of div </div> </body> </html>'
    doc = LH.fromstring(htmlData)
    # encoding='unicode' returns a str; without it, Python 3 prints a bytes literal (b'...')
    print(LH.tostring(doc, pretty_print=True, encoding='unicode'))

which yields:

<html>
<head><script type="text/javascript"> console.log("< < not able to write these & also these >> "); </script></head>
<body> <div> start of div </div> </body>
</html>
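
If staying with Beautiful Soup is preferable, one option to try is the `formatter` argument of `prettify()`. The following is a minimal sketch, assuming a bs4 release that accepts `formatter=None` and using the standard library's `html.parser` as the parser (the parser choice and the `pretty` variable are assumptions, not part of the original question). With `formatter=None`, Beautiful Soup does not substitute entities on output at all, so this affects the whole document, not only script tags.

```python
from bs4 import BeautifulSoup

htmlData = (
    '<html> <head> <script type="text/javascript"> '
    'console.log("< < not able to write these & also these >> "); '
    '</script> </head> <body> <div> start of div </div> </body> </html>'
)

# Name the parser explicitly so the result does not depend on which
# optional parsers happen to be installed.
soup = BeautifulSoup(htmlData, 'html.parser')

# formatter=None tells Beautiful Soup to write strings out unmodified,
# i.e. without turning <, > and & into entities.
pretty = soup.prettify(formatter=None)
print(pretty)
```

Because nothing is escaped, text outside the script tag that contains a literal `<` or `&` may come out as invalid HTML, so this is best suited to output meant for reading rather than re-parsing.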

score -1 · Accepted Answer

You can do something like this:

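htmlCodes = (
    ('<', '&lt;'),
    ('>', '&gt;'),
    ('"', '&quot;'),
    ("'", '&#39;'),
    ('&', '&amp;'),  # must come last so "&amp;lt;" is not unescaped twice
)

# str.replace returns a new string, so the result has to be kept
output = soup.prettify()
for char, entity in htmlCodes:
    output = output.replace(entity, char)
print(output)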
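
The standard library's `html.unescape` handles the same entities as the table above. The following is a minimal sketch rather than a definitive fix: the `escaped` sample string, `SCRIPT_RE`, and `unescape_scripts` are hypothetical names introduced here, and the regular expression assumes simple, non-nested script tags. It limits unescaping to script bodies so that entities elsewhere in the document stay intact.

```python
import html
import re

# Hypothetical sample: serialized output whose script body was entity-escaped.
escaped = (
    '<script type="text/javascript">\n'
    ' console.log("&lt; &lt; not able to write these &amp; also these &gt;&gt; ");\n'
    '</script>\n'
    '<div> 1 &lt; 2 </div>'
)

# Captures the opening tag, the body, and the closing tag of each script element.
SCRIPT_RE = re.compile(r'(<script\b[^>]*>)(.*?)(</script>)',
                       re.DOTALL | re.IGNORECASE)


def unescape_scripts(markup):
    """Unescape HTML entities only inside <script>...</script> bodies."""
    def fix(match):
        return match.group(1) + html.unescape(match.group(2)) + match.group(3)
    return SCRIPT_RE.sub(fix, markup)


result = unescape_scripts(escaped)
print(result)
```

In practice the input would be something like `soup.prettify()` instead of the sample string. Text outside script tags, such as the `1 &lt; 2` in the div, is left escaped.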
