python - issue with content in urllib2.urlopen

Question

I have some simple Python code that makes a request to a server:

import urllib2  # Python 2 standard library

html_page = urllib2.urlopen(baseurl, timeout=20)  # baseurl is the page being scraped
print html_page.read()
html_page.close()

The problem occurs when I try to scrape a page that has a '-' (dash) character in it. It shows as a dash in the browser, but when I print the response from urlopen it comes out as '?'. I tried recreating the HTML page as a local file, copying the affected text over from the source, but I could not reproduce the problem.

What other factors/variables might be in play? Could this have something to do with encoding?

UPDATE: I now know this problem is about encoding. The website is encoded in 'iso-8859-1'. The problem is that I still cannot decode it, even after following "Python: Converting from ISO-8859-1/latin1 to UTF-8".

The character, when decoded, gives me:

>>> text.decode("iso-8859-1")
  u"</strong><p>Let's\x97in "
>>> text.decode("iso-8859-1").encode("utf8")
  "</strong><p>Let's\xc2\x97in "
>>> print text.decode("iso-8859-1").encode("utf8")
  </strong><p>Let'sin

The character just completely disappears. Does anyone have any ideas?

score 1 · Accepted Answer

So, thanks to Adam Rosenfield, I figured out my problem. The website declared its character set as iso-8859-1:

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

But the character causing the problem was actually an em dash encoded in Windows-1252:

>>> text.decode("windows-1252")
  u"</strong><p>Let's\u2014in "
>>> print text.decode("windows-1252")
  </strong><p>Let's—in
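
For anyone hitting the same thing, here is a minimal sketch of the difference, using the byte string from the question. The helper name decode_declared is hypothetical, and treating a declared iso-8859-1 as windows-1252 is an assumption that mirrors what browsers do; it is not something urllib2 does for you. The "replace" error handler is there because a few bytes (0x81, for example) have no mapping in Python's windows-1252 codec.

```python
# The bytes from the question; \x97 is the byte that misbehaves.
raw = b"</strong><p>Let's\x97in "

# ISO-8859-1 maps 0x97 to U+0097, an invisible C1 control character,
# which is why nothing shows up when it is printed.
as_latin1 = raw.decode("iso-8859-1")

# Windows-1252 maps the same byte to U+2014, the em dash.
as_cp1252 = raw.decode("windows-1252")


def decode_declared(data, declared):
    """Hypothetical helper: decode data with the declared charset,
    but treat a declared ISO-8859-1 as Windows-1252 (an assumption)."""
    if declared.lower() in ("iso-8859-1", "latin-1", "latin1"):
        declared = "windows-1252"
    # "replace" avoids an exception on bytes the codec leaves undefined.
    return data.decode(declared, "replace")


text = decode_declared(raw, "iso-8859-1")
```

With this, text holds the em dash as U+2014, while any other declared charset (utf-8, say) is passed through unchanged.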

Thanks, everyone!
