python - BeautifulSoup is not reading documents correctly

Question

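I'm trying to scrape NBA player statistics so I can run some machine learning on them, and I found that these stats-rich "printable player files" are just what I need. Unfortunately, when I try to parse the HTML with BeautifulSoup, it doesn't work at all. For example: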

from bs4 import BeautifulSoup
import codecs
import urllib2

url = 'http://www.nba.com/playerfile/ray_allen/printable_player_files.html'
html = urllib2.urlopen(url).read()
soup = BeautifulSoup(html)

with open('ray_allen.txt', 'w') as f:
    # the with block closes the file on exit, so no explicit f.close() is needed
    f.write(soup.prettify())

I get a file like this:

<html>
 <head>
  <!--no description was found-->
  <!--no title was found-->
  <!--no keywords found-->
  <!--not article-->
  <script>
   var site = "nba";
var page = "player";
  </script>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <script language="Javascript">
   &lt;!--
var flashinstalled = 0;
var flashversion = 0;
MSDetect = "false";
if (navigator.plugins &amp;&amp; navigator.plugins.length) {
    x = navigator.plugins["Shockwave Flash"];
    if (x) {
        flashinstalle   d       =       2   ;   

           i   f       (   x   .   d   e   s   c   r   i   p   t   i   o   n   )       {   

               y       =       x   .   d   e   s   c   r   i   p   t   i   o   n   ;   

               f   l   a   s   h   v   e   r   s   i   o   n       =       y   .   c   h   a   r   A   t   (   y   .   i   n   d   e   x   O   f   (   '   .   '   )   -   1   )   ;   

           }   

       }       e   l   s   e   

           f   l   a   s   h   i   n   s   t   a   l   l   e   d       =       1   ;   

       i   f       (   n   a   v   i   g   a   t   o   r   .   p   l   u   g   i   n   s   [   "   S   h   o   c   k   w   a   v   e       F   l   a   s   h       2   .   0   "   ]   )       {   

           f   l   a   s   h   i   n   s   t   a   l   l   e   d       =       2   ;   

           f   l   a   s   h   v   e   r   s   i   o   n       =       2   ;   

       }   
[...]

It then continues like this for more than 3000 lines before ending with the following (the [...] markers are mine):

[...]
   &lt;   /   b   o   d   y   &gt;   

   &lt;   /   h   t   m   l   &gt;
  </script>
 </head>
</html>

I also tried 'http://www.basketball-reference.com/players/a/allenra02.html' instead, but then I get the following error:

Traceback (most recent call last):
  File "test.py", line 9, in <module>
    f.write(soup.prettify())
UnicodeEncodeError: 'ascii' codec can't encode character u'\xb7' in position 6167: ordinal not in range(128)

Should I be using something else to parse the HTML? Or is one of these problems easy to fix? Everything I've read here suggests that BeautifulSoup should make this easier, not harder.

Edit: The line:

print soup.prettify()

works in the terminal for the second page, so something goes wrong when I try to write to the file; BeautifulSoup itself is fine there.

score 4 · Accepted Answer

This shows the same symptoms as bug 972466, which was fixed in 4.0.3. I recommend upgrading to the latest version of Beautiful Soup 4.
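As a side note on the second error: the UnicodeEncodeError looks like a separate issue from the bug above. In Python 2, writing a unicode string to a file opened with plain open() encodes it implicitly as ASCII, which would likely explain why print works in the terminal while f.write() fails on a character such as u'\xb7'. A minimal sketch of one way around it, assuming you want UTF-8 output; the helper name save_pretty is made up for illustration and is not part of Beautiful Soup:

```python
# -*- coding: utf-8 -*-
import io


def save_pretty(soup, path):
    """Write soup.prettify() to `path` as UTF-8 text.

    `save_pretty` is a hypothetical helper, not a Beautiful Soup API.
    `soup` is assumed to be a bs4 BeautifulSoup object (anything with a
    prettify() method that returns unicode text will do).
    """
    text = soup.prettify()  # bs4 returns unicode when no encoding is passed
    # io.open accepts an explicit encoding on both Python 2 and Python 3,
    # so the text is never pushed through the implicit ASCII codec.
    with io.open(path, 'w', encoding='utf-8') as f:
        f.write(text)


# Usage, reusing the names from the question:
#   import bs4
#   print(bs4.__version__)      # check whether you are on 4.0.3 or later
#   soup = BeautifulSoup(html)
#   save_pretty(soup, 'ray_allen.txt')
```

UTF-8 is only an assumption here; pick whatever encoding the consumer of the file expects.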
