I have some simple Python code that makes a request to a server:
import urllib2

# baseurl is the URL of the page being scraped
html_page = urllib2.urlopen(baseurl, timeout=20)
print html_page.read()
html_page.close()
I am trying to scrape a page that has a '-' (dash) character in it. It shows as a dash in the browser, but when I print the response returned by urlopen, it prints as '?'. I tried recreating the HTML page as a local file, copying the affected text over from the page source, but I could not reproduce the problem.
What other factors/variables might be in play? Could this have something to do with encoding?
UPDATE: I now know this problem is about encoding. The website is encoded in 'iso-8859-1'. The problem is that I still cannot decode it, even after following the question "Python: Converting from ISO-8859-1/latin1 to UTF-8".
The character, when decoded, gives me:
>>> text.decode("iso-8859-1")
u"</strong><p>Let's\x97in "
>>> text.decode("iso-8859-1").encode("utf8")
"</strong><p>Let's\xc2\x97in "
>>> print text.decode("iso-8859-1").encode("utf8")
</strong><p>Let'sin
The character disappears completely. Does anyone have any ideas?
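One thing that may be worth checking, offered as a guess rather than a confirmed diagnosis: in strict ISO-8859-1 the byte 0x97 maps to U+0097, an invisible C1 control character, whereas in Windows-1252 (cp1252) the same byte is an em dash. Browsers treat a page labelled ISO-8859-1 as Windows-1252, which would explain why the dash shows up there. The sketch below assumes the page is really cp1252; the variable names are made up for illustration, and the byte string is the one from the session above.

```python
# -*- coding: utf-8 -*-
# Sketch only: assumes the page is actually Windows-1252 (cp1252),
# even though it declares itself as ISO-8859-1.

# The raw bytes from the response, as shown in the interpreter session.
raw = b"</strong><p>Let's\x97in "

# ISO-8859-1 maps byte 0x97 to U+0097, a non-printing control character.
as_latin1 = raw.decode("iso-8859-1")

# cp1252 maps byte 0x97 to U+2014 (em dash).
as_cp1252 = raw.decode("cp1252")

# Re-encode as UTF-8 for output or storage.
utf8_bytes = as_cp1252.encode("utf-8")

print(repr(as_latin1))
print(repr(as_cp1252))
print(repr(utf8_bytes))
```

If the cp1252 result contains u'\u2014' where the dash should be, the page is most likely Windows-1252 in practice. Whether the dash then prints correctly still depends on the terminal's own encoding.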