I want to scrape the text content of a web page with Python. The page contains text like this:
Apple’s stock continued to dominate the news over the weekend, with Barron’s placing it on the top of its favorite 2013 stock list.
But when I print it, the output is garbled:
Apple âs stock continued to dominate the news over the weekend, with Barronâs placing it on the top of its favorite 2013 stock list.
The character "’" is not displayed correctly. Here is my code:
#-*- coding: utf-8 -*-
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
import urllib
import lxml.html as HTML
url = "http://www.forbes.com/sites/panosmourdoukoutas/2012/12/09/apple-tops-barrons-10-favorite-stocks-for-2013/?partner=yahootix"
sock = urllib.urlopen(url)
htmlSource = sock.read()
sock.close()
root = HTML.document_fromstring(htmlSource)
contents = ' '.join([x.strip() for x in root.xpath("//div[@class='body']/descendant::text()")])
print contents
f = open('C:/Users/yinyao/Desktop/Python Code/data.txt','w')
f.write(contents)
f.close()
However, after adding the sys.setdefaultencoding('utf-8') call, print no longer works as expected. Why does this happen, and what should I do? I'm using Windows, where the default encoding is GBK.
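
The garbled output looks consistent with UTF-8 bytes being decoded with a single-byte codec such as Latin-1. Below is a minimal sketch of that effect, assuming the page is served as UTF-8 (not verified here). It uses only the standard library and leaves out lxml and the network request. The sample string and the temporary file path are hypothetical and exist only for the demonstration.

```python
import io
import os
import tempfile

# Text as it appears on the page; \u2019 is the right single quotation mark.
original = u"Apple\u2019s stock"

# Assumption: the server delivers the page as UTF-8 bytes.
raw_bytes = original.encode("utf-8")

# Decoding UTF-8 bytes as Latin-1 turns the single quote character into three:
# "\xe2" (displayed as "â") followed by two invisible control characters.
garbled = raw_bytes.decode("latin-1")

# Decoding with the encoding the bytes were written in restores the text.
decoded = raw_bytes.decode("utf-8")

# Write the text with an explicit encoding instead of relying on a default.
# Hypothetical temporary path, used only for this demonstration.
path = os.path.join(tempfile.gettempdir(), "data_utf8_demo.txt")
with io.open(path, "w", encoding="utf-8") as f:
    f.write(decoded)

# Read it back with the same encoding to confirm nothing was lost.
with io.open(path, "r", encoding="utf-8") as f:
    round_tripped = f.read()
```

If this is what is happening, decoding the downloaded bytes explicitly before parsing, and opening the output file with an explicit encoding, may avoid the need for sys.setdefaultencoding altogether.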