
Possible duplicate:
Ethics of Robots.txt

I'm experimenting with Mechanize to automate some work on a site. I was able to get around the error above by using br.set_handle_robots(False). How ethical is it to use that?

If it isn't, I thought I would follow robots.txt instead, but the site I'm trying to mechanize blocks access to its robots.txt. Does that mean bots aren't allowed at all? What's the next step?

Thanks in advance.


1 Answer


For your first question, see Ethics of robots.txt.

You need to keep in mind the purpose of robots.txt. Robots that crawl a site can potentially wreak havoc on it and essentially mount a DoS attack. So if your "automation" is doing any crawling at all, or is downloading more than just a few pages every day or so, and the site has a robots.txt file that excludes you, then you should honor it.
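If you do decide to honor it, Python's standard library can do the parsing for you. A minimal sketch, assuming Python 3's urllib.robotparser and a hypothetical example.com site and bot name:

    from urllib import robotparser

    # Fetch and parse the site's robots.txt (hypothetical URL).
    rp = robotparser.RobotFileParser()
    rp.set_url('http://example.com/robots.txt')
    rp.read()

    # Ask whether our (hypothetical) user agent may fetch a given page.
    if rp.can_fetch('MyBot/1.0', 'http://example.com/some/page.html'):
        print('Allowed to fetch this page')
    else:
        print('Disallowed by robots.txt')

As far as I know, if the server answers 401 or 403 for robots.txt itself (which sounds like your situation), RobotFileParser treats everything as disallowed, which is the conservative reading.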

Personally, I find this to be a bit of a grey area. If my script works at the same pace as a human using a browser and only grabs a few pages, then, in the spirit of the robots exclusion standard, I have no problem scraping those pages, so long as it doesn't access the site more than once a day. Please read that last sentence carefully before judging me; I feel it is perfectly logical. Many people may disagree with me there, though.
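For what it's worth, pacing a script like that is easy to do. A rough sketch under my own assumptions: the url list, the process helper, and the delay range are all hypothetical, and br is the mechanize.Browser from your question.

    import random
    import time

    urls = ['http://example.com/page1', 'http://example.com/page2']  # hypothetical pages

    for url in urls:
        response = br.open(url)            # br is the mechanize.Browser you already have
        process(response.read())           # 'process' stands in for whatever you do with each page
        time.sleep(random.uniform(5, 15))  # pause a few seconds, roughly like a human reader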

For your second question, web servers can return a 403 based on the User-Agent header sent with your request. In order for your script to mimic a browser, you have to misrepresent yourself: change the User-Agent header to match one used by a mainstream web browser (e.g., Firefox, IE, Chrome). Right now it probably says something like 'Mechanize'.
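With mechanize that is just a matter of setting the browser's request headers before opening a page. A minimal sketch, assuming the Python mechanize module and a made-up Firefox user-agent string:

    import mechanize

    br = mechanize.Browser()
    br.set_handle_robots(False)  # the call from your question

    # Replace the library's default identification with a user-agent
    # string copied from a mainstream browser.
    br.addheaders = [('User-agent',
                      'Mozilla/5.0 (Windows NT 6.1; rv:15.0) Gecko/20100101 Firefox/15.0')]

    response = br.open('http://example.com')  # hypothetical URL
    print(response.read()[:200])

The exact string doesn't matter much; what matters is that it no longer advertises itself as a Python script.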

Some sites are more sophisticated than that and have other methods for detecting non-human visitors. In that case, give up because they really don't want you accessing the site in that manner.

Answered 2012-08-31T01:48:03.657