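You don't need BeautifulSoup for a task this simple. It has to parse the entire document first, so it will be much slower than applying a regular expression directly.
Use re.findall(r'^\s*Text Start:.*', page, re.MULTILINE) instead. The re.MULTILINE flag is needed so that ^ matches at the start of every line, not only at the start of the page.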
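As a minimal sketch of that call, the snippet below runs the pattern against a small made-up page. The `sample_page` text is an assumption invented for illustration, not the real page content.

```python
import re

# Hypothetical page content, invented for illustration only.
sample_page = (
    "<p>intro</p>\n"
    "Text Start: first sentence\n"
    "   Text Start: indented sentence\n"
    "some Text Start: not at line start\n"
)

# re.MULTILINE makes ^ match at the beginning of every line,
# so only lines that begin with optional whitespace + "Text Start:" are kept.
matches = re.findall(r'^\s*Text Start:.*', sample_page, re.MULTILINE)

for m in matches:
    print(repr(m))
```

The last sample line is skipped because "Text Start:" does not appear at the beginning of that line.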
When scraping a web page, it helps to know exactly what the page contains, character by character. Personally, I do this:
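from http.client import HTTPConnection

# Fetch the raw page over plain HTTP, giving up after 3 seconds.
hypr = HTTPConnection(host='stackoverflow.com', timeout=3)
rekete = ('/questions/17503336/'
          'scraping-webpage-sentences-'
          'beginning-with-certain-word')
hypr.request('GET', rekete)
page = hypr.getresponse().read().decode('utf-8')

# Print each line with its index; %r makes whitespace and line endings visible.
print('\n'.join('%d %r' % (i, line)
                for i, line in enumerate(page.splitlines(True))))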
The output is:
0 '<!DOCTYPE html>\r\n'
1 '<html>\r\n'
2 '<head>\r\n'
3 ' \r\n'
4 ' <title>python - Scraping webpage sentences beginning with certain word - Stack Overflow</title>\r\n'
5 ' <link rel="shortcut icon" href="https://cdn.sstatic.net/stackoverflow/img/favicon.ico">\r\n'
6 ' <link rel="apple-touch-icon image_src" href="https://cdn.sstatic.net/stackoverflow/img/apple-touch-icon.png">\r\n'
7 ' <link rel="search" type="application/opensearchdescription+xml" title="Stack Overflow" href="/opensearch.xml">\r\n'
8 ' \r\n'
9 ' <script type="text/javascript" src="//ajax.googleapis.com/ajax/libs/jquery/1.7.1/jquery.min.js"></script>\r\n'
10 ' <script type="text/javascript" src="https://cdn.sstatic.net/js/stub.js?v=d2c9bad99c24"></script>\r\n'
11 ' <link rel="stylesheet" type="text/css" href="https://cdn.sstatic.net/stackoverflow/all.css?v=2079d4ae31a4">\r\n'
12 ' \r\n'
13 ' <link rel="canonical" href="http://stackoverflow.com/questions/17503336/scraping-webpage-sentences-beginning-with-certain-word">\r\n'
14 ' <link rel="alternate" type="application/atom+xml" title="Feed for question \'Scraping webpage sentences beginning with certain word\'" href="/feeds/question/17503336">\r\n'
15 ' <script type="text/javascript">\r\n'
16 ' \r\n'
17 ' StackExchange.ready(function () {\r\n'
18 ' StackExchange.using("postValidation", function () {\r\n'
19 " StackExchange.postValidation.initOnBlurAndSubmit($('#post-form'), 2, 'answer');\r\n"
20 ' });\r\n'
21 '\r\n'
22 ' \r\n'
23 " StackExchange.question.init({showAnswerHelp:true,totalCommentCount:0,shownCommentCount:0,highlightColor:'#F4A83D',backgroundColor:'#FFF',questionId:17503336});\r\n"
24 '\r\n'
25 ' styleCode();\r\n'
26 '\r\n'
27 " StackExchange.realtime.subscribeToQuestion('1', '17503336');\r\n"
28 ' \r\n'
29 ' });\r\n'
30 ' </script>\r\n'
31 '\r\n'
etc etc
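The listing above shows why this kind of inspection matters: every line of the page ends in '\r\n'. As a small sketch (the `sample` string is invented for illustration), `.` in a Python regular expression matches every character except '\n', so `.*` captures the trailing '\r' unless it is stripped afterwards.

```python
import re

# Made-up sample using the same '\r\n' line endings as the listing.
sample = "Text Start: alpha\r\nother line\r\nText Start: beta\r\n"

raw = re.findall(r'^\s*Text Start:.*', sample, re.MULTILINE)

# '.' does not match '\n' but does match '\r',
# so each raw match still ends with a carriage return.
cleaned = [m.rstrip('\r') for m in raw]

print(raw)
print(cleaned)
```

Stripping after the match is only one option; it is used here because it leaves the pattern itself unchanged.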