I would like to scrape a web site. It has the following in its robots.txt file, but I'm not exactly sure what it is they don't want me to do:

User-agent: *
Disallow: /click

There is no click subdirectory. Or do they not want me to access anything that would normally require clicking (like submitting data via a form)? They sure aren't making it easy in any case: the main page's form sends a GET request to a page that sets a cookie, which is then read by a third page.

1 Answer

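It means that no bot should crawl any URL whose path starts with the string /click. The rule is a plain prefix match on the URL path, so it has nothing to do with clicking or submitting forms, and it applies whether or not a click directory actually exists.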

For example, the following URLs would be blocked:

  • example.com/click
  • example.com/click.html
  • example.com/click/
  • example.com/click/foo/bar
  • example.com/clicker

The following URLs would still be allowed:

  • example.com/foo/click
  • example.com/fooclick
  • example.com/clic
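
If you want to check the lists above yourself, here is a minimal sketch using Python's standard `urllib.robotparser` module. The `http://` scheme, the user agent name `MyScraper`, and the helper `is_allowed` are only assumptions for illustration; the rules are the two lines quoted in the question.

```python
from urllib.robotparser import RobotFileParser

# The rules quoted in the question.
ROBOTS_TXT = """\
User-agent: *
Disallow: /click
"""

parser = RobotFileParser()
# parse() takes the file's lines directly, so nothing is fetched over the network.
parser.parse(ROBOTS_TXT.splitlines())


def is_allowed(url, user_agent="MyScraper"):
    """Return True if the rules above let user_agent fetch url.

    "MyScraper" is a made-up name; any agent falls under "User-agent: *".
    """
    return parser.can_fetch(user_agent, url)


urls = [
    "http://example.com/click",
    "http://example.com/click.html",
    "http://example.com/click/",
    "http://example.com/click/foo/bar",
    "http://example.com/clicker",
    "http://example.com/foo/click",
    "http://example.com/fooclick",
    "http://example.com/clic",
]

for url in urls:
    print(url, "allowed" if is_allowed(url) else "blocked")
```

In a real scraper you would more likely point the parser at the live file with `set_url()` and `read()`, then call `can_fetch()` before each request.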

You can find the original robots.txt specification at http://www.robotstxt.org/wc/robots.html.

Answered 2013-01-16T17:19:28.933