
I have this code:

import urllib
from bs4 import BeautifulSoup

url = 'http://www.brothersoft.com/windows/categories.html'
pageHtml = urllib.urlopen(url).read()
soup = BeautifulSoup(pageHtml)

for a in soup.select('div.brLeft a[href]'):
    suburl = "http://www.brothersoft.com"+ a['href'].encode('utf-8', 'replace')

    content = urllib.urlopen(suburl).read()
    soup = BeautifulSoup(content)
    for a in soup.select('div.coLeft.cate.mBottom dd a[href]'):
        print "http://www.brothersoft.com"+a['href'].encode('utf-8', 'replace')
        suburl = "http://www.brothersoft.com"+a['href'].encode('utf-8', 'replace')

        content = urllib.urlopen(suburl).read()
        soup = BeautifulSoup(content)
        for a in soup.select('div.freeText dl a[href]'):
            print "http://www.brothersoft.com"+ a['href'].encode('utf-8', 'replace')
            suburl2 = "http://www.brothersoft.com"+ a['href'].encode('utf-8', 'replace')

            content = urllib.urlopen(suburl2).read()
            soup = BeautifulSoup(content)
            for li in soup.select('div.Updated.coLeft li'):
                print ' '.join(li.stripped_strings).encode('utf-8', 'replace')

When I run this code, it runs until the following error occurs:

C:\Python27\lib\site-packages\bs4\builder\_htmlparser.py:155: RuntimeWarning: Python's built-in HTMLParser cannot parse the given document. This is not a bug in Beautiful Soup. The best solution is to install an external parser (lxml or html5lib), and use Beautiful Soup with that parser. See http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser for help.
  "Python's built-in HTMLParser cannot parse the given document. This is not a bug in Beautiful Soup. The best solution is to install an external parser (lxml or html5lib), and use Beautiful Soup with that parser. See http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser for help."))
Traceback (most recent call last):
  File "C:\Documents and Settings\Fairuz\Desktop\soup7.py", line 26, in <module>
    soup = BeautifulSoup(content)
  File "C:\Python27\lib\site-packages\bs4\__init__.py", line 183, in __init__
    self._feed()
  File "C:\Python27\lib\site-packages\bs4\__init__.py", line 197, in _feed
    self.builder.feed(self.markup)
  File "C:\Python27\lib\site-packages\bs4\builder\_htmlparser.py", line 156, in feed
    raise e
HTMLParser.HTMLParseError: malformed start tag, at line 1, column 18498

What is wrong with this code? It first runs as far as http://www.brothersoft.com/windows/photo_image/other_image_tools/ http://www.brothersoft.com/microsoft-office-visio-60485.html, and then the error message appears.
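For reference, the warning in the output recommends installing an external parser (lxml or html5lib) and telling Beautiful Soup to use it. Below is a minimal sketch of that suggestion, not a confirmed fix for this particular page. The helper names `pick_parser` and `make_soup` are hypothetical and do not appear in the code above. The sketch assumes that at least one of the two parsers may be installed and falls back to the built-in `html.parser` otherwise.

```python
import importlib


def pick_parser(candidates=("lxml", "html5lib")):
    """Return the first candidate parser that can be imported.

    Falls back to "html.parser", the built-in parser that raised
    HTMLParseError in the traceback above.
    """
    for name in candidates:
        try:
            importlib.import_module(name)
        except ImportError:
            continue  # this parser is not installed, try the next one
        return name
    return "html.parser"


def make_soup(markup):
    """Parse markup with an explicitly named parser (hypothetical helper)."""
    # Imported here so pick_parser() can be used even without bs4 installed.
    from bs4 import BeautifulSoup
    # The second argument names the parser Beautiful Soup should use.
    return BeautifulSoup(markup, pick_parser())
```

In the loop above, each `BeautifulSoup(content)` call could then be replaced by `make_soup(content)`. Whether the malformed start tag at column 18498 parses cleanly with lxml or html5lib is not something the output shown here can confirm.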
