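I'm new to writing code, and I'm trying to write a program that scrapes a particular website. The problem is that the site first shows a page where you have to agree to its terms of use and privacy policy. You can see it here: http://cpdocket.cp.cuyahogacounty.us/
I need to get past this page somehow, but I don't know how. I'm writing in Java, and so far I have working code that fetches the source of any website. The code is as follows: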
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

// Scraper takes a URL string and returns the source of the page at that URL.
public class Scraper {
    private final String url; // the website to be scraped

    // constructor
    public Scraper(String url) {
        this.url = url;
    }

    // scrapeWebsite downloads the page and returns it as a String. Ideally this
    // string should be saved so that another method can parse it.
    public String scrapeWebsite() throws IOException {
        URL urlconnect = new URL(url); // creates the URL from the variable
        URLConnection connection = urlconnect.openConnection(); // opens a connection to that URL
        StringBuilder a = new StringBuilder(); // collects the page source
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                connection.getInputStream(), StandardCharsets.UTF_8))) {
            String inputLine; // holds the current line
            // append each line until the stream is exhausted
            while ((inputLine = in.readLine()) != null) {
                a.append(inputLine).append('\n'); // keep the line breaks that readLine strips
            }
        }
        return a.toString();
    }
}
Any suggestions on how to do this would be very welcome.
I'm rewriting my code based on some Ruby code, which is as follows:
def initializeSession()
## SETUP # POST headers
post_header = Hash.new()
post_header['Host'] = 'cpdocket.cp.cuyahogacounty.us'
post_header['User-Agent'] = 'Mozilla/5.0 (Windows NT 5.1; rv:20.0) Gecko/20100101 Firefox/20.0'
post_header['Accept'] = 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'
post_header['Accept-Language'] = 'en-US,en;q=0.5'
post_header['Accept-Encoding'] = 'gzip, deflate'
post_header['X-Requested-With'] = 'XMLHttpRequest'
post_header['X-MicrosoftAjax'] = 'Delta=true'
post_header['Cache-Control'] = 'no-cache'
post_header['Content-Type'] = 'application/x-www-form-urlencoded; charset=utf-8'
post_header['Referer'] = 'http://cpdocket.cp.cuyahogacounty.us/Search.aspx' # may have to alter this per request
# post_header['Content-Length'] = '12197'
post_header['Connection'] = 'keep-alive'
post_header['Pragma'] = 'no-cache'
# STEP # set up simulated browser and make first request
#browser = SimBrowser.new()
#logname = 'log.txt'
#s = Scribe.new(logname)
session_cookie = 'ASP.NET_SessionId'
url = 'http://cpdocket.cp.cuyahogacounty.us/'
@browser.http_get(url)
#puts browser.get_body() # debug
puts 'DEBUG: session cookie: ' + @browser.get_cookie_var(session_cookie)
@log.slog('DEBUG: home page response code: expected 200, actual ' + @browser.get_response().code)
# s.flog('### HOME PAGE RESPONSE')
# s.flog(browser.get_body()) # debug
# STEP # send our acceptance of the terms of service
data = {
'ctl00$SheetContentPlaceHolder$btnYes' => 'Yes',
'__EVENTARGUMENT'=>'',
'__EVENTTARGET'=>'',
'__EVENTVALIDATION'=>'/wEWBwKc78CQCQLn3/HqCQLZw/fZCgLipuudAQK42duKDQL33NjnAwKn6+K4CIM3TSmrbrsn2xBRJf2DRwg01Vsbdk+oJV9lhG/in+xD',
'__VIEWSTATE'=>'/wEPDwUKLTI4MzA1ODM0OA9kFgJmD2QWAgIDD2QWDgIDD2QWAgIBD2QWCAIBDxYCHgRUZXh0BQ9BbmRyZWEgRi4gUm9jY29kAgMPFgIfAAUfQ3V5YWhvZ2EgQ291bnR5IENsZXJrIG9mIENvdXJ0c2QCBQ8PFgIeB1Zpc2libGVoZGQCBw8PFgIfAWhkZAIHDw9kFgIeB29uY2xpY2sFGmphdmFzY3JpcHQ6d2luZG93LnByaW50KCk7ZAILDw9kFgIfAgUiamF2YXNjcmlwdDpvbkNsaWNrPXdpbmRvdy5jbG9zZSgpO2QCDw8PZBYCHwIFRmRpc3BsYXlQb3B1cCgnaF9EaXNjbGFpbWVyLmFzcHgnLCdteVdpbmRvdycsMzcwLDIyMCwnbm8nKTtyZXR1cm4gZmFsc2VkAhMPZBYCZg8PFgIeC05hdmlnYXRlVXJsBRMvVE9TLmFzcHg/aXNwcmludD1ZZGQCFQ8PZBYCHwIFRWRpc3BsYXlQb3B1cCgnaF9RdWVzdGlvbnMuYXNweCcsJ215V2luZG93JywzNzAsMzcwLCdubycpO3JldHVybiBmYWxzZWQCFw8WAh8ABQYxLjAuNTRkZEnXSWiVLEPsDmlc7dX4lH/53vU1P1SLMCBNASGt4T3B'
}
#post_header['Referer'] = url
@browser.http_post(url, data, post_header)
@log.slog('DEBUG: accept terms response code: expected 200, actual ' + @browser.get_response().code)
@log.flog('### TOS ACCEPTANCE RESPONSE')
# @log.flog(@browser.get_body()) # debug
end
Can this be done in Java as well?
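For reference, here is a rough sketch of how the same two steps from the Ruby code (GET the home page, then POST the "Yes" button) might look in Java. It assumes Java 11 or later for java.net.http.HttpClient, and it assumes the site still uses the form field names shown in the Ruby code. The class name TermsPageSketch and the helpers hiddenField, formEncode and acceptTerms are made up for illustration. Unlike the Ruby version, it reads __VIEWSTATE and __EVENTVALIDATION from the page it just fetched instead of hard-coding them, on the assumption that those values can change between sessions. It has not been run against the live site.

```java
import java.io.IOException;
import java.net.CookieManager;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper class; not part of the original Scraper.
class TermsPageSketch {

    // Pulls the value of a hidden <input> out of the page source.
    // Assumes the name attribute appears before the value attribute
    // inside the tag. Returns "" when the field is not found.
    static String hiddenField(String html, String name) {
        Pattern p = Pattern.compile(
                "name=\"" + Pattern.quote(name) + "\"[^>]*?value=\"([^\"]*)\"");
        Matcher m = p.matcher(html);
        return m.find() ? m.group(1) : "";
    }

    // Encodes the fields as an application/x-www-form-urlencoded body.
    static String formEncode(Map<String, String> fields) {
        return fields.entrySet().stream()
                .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                        + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
    }

    // GETs the terms page, then POSTs the "Yes" button together with the
    // state fields taken from that page. Returns the body of the POST response.
    static String acceptTerms(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .cookieHandler(new CookieManager()) // keeps ASP.NET_SessionId between requests
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();

        // Step 1: load the home page so the server issues a session cookie.
        HttpResponse<String> home = client.send(
                HttpRequest.newBuilder(URI.create(url)).GET().build(),
                HttpResponse.BodyHandlers.ofString());

        // Step 2: send the acceptance, using the same field names as the Ruby code.
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("ctl00$SheetContentPlaceHolder$btnYes", "Yes");
        fields.put("__EVENTARGUMENT", "");
        fields.put("__EVENTTARGET", "");
        fields.put("__EVENTVALIDATION", hiddenField(home.body(), "__EVENTVALIDATION"));
        fields.put("__VIEWSTATE", hiddenField(home.body(), "__VIEWSTATE"));

        HttpRequest post = HttpRequest.newBuilder(URI.create(url))
                .header("Content-Type", "application/x-www-form-urlencoded; charset=utf-8")
                .POST(HttpRequest.BodyPublishers.ofString(formEncode(fields)))
                .build();
        return client.send(post, HttpResponse.BodyHandlers.ofString()).body();
    }
}
```

Calling TermsPageSketch.acceptTerms("http://cpdocket.cp.cuyahogacounty.us/") would run both steps. Later requests would need to go through the same HttpClient instance so that the session cookie is sent along, which means acceptTerms would probably have to be reworked to return or share the client rather than create it internally.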