linux - wget spider returns every URL twice -- where is the bug?

Question

I was looking for a script that builds a URL list for a sitemap and found this one:

wget --spider --force-html -r -l1 http://sld.tld 2>&1 \
  | grep '^--' | awk '{ print $3 }' \
  | grep -v '\.\(css\|js\|png\|gif\|jpg\|ico\|txt\)$' \
  > urllist.txt

The result looks like this:

http://sld.tld/
http://sld.tld/
http://sld.tld/home.html
http://sld.tld/home.html
http://sld.tld/news.html
http://sld.tld/news.html
...

Every URL entry is saved twice. How do I need to change the script to fix this?

score 0 · Accepted Answer

If you look at wget's output when the --spider flag is used, it looks like this:

Spider mode enabled. Check if remote file exists.
--2013-04-12 22:01:03--  http://www.google.com/intl/en/about/products/
Connecting to www.google.com|173.194.75.103|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Remote file exists and could contain links to other resources -- retrieving.

--2013-04-12 22:01:03--  http://www.google.com/intl/en/about/products/
Reusing existing connection to www.google.com:80.
HTTP request sent, awaiting response... 200 OK

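wget first checks whether the link exists (which prints one line starting with --), then downloads it to look for further links (which prints the second -- line). That is why each URL shows up (at least) twice when you use --spider.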
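To see the effect in isolation, the log excerpt above can be fed through the same grep/awk filter that the question uses. This is only a sketch: `extract_urls` and `spider_log` are hypothetical names introduced here, and the log text is copied from the excerpt rather than produced by a live wget run.

```shell
# Hypothetical helper: keep only wget's "--<date> <time>--  <url>" lines
# and print the third whitespace-separated field, which is the URL.
extract_urls() {
  grep '^--' | awk '{ print $3 }'
}

# Sample log copied from the --spider excerpt above (not a live run).
spider_log=$(cat <<'EOF'
Spider mode enabled. Check if remote file exists.
--2013-04-12 22:01:03--  http://www.google.com/intl/en/about/products/
Connecting to www.google.com|173.194.75.103|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Remote file exists and could contain links to other resources -- retrieving.

--2013-04-12 22:01:03--  http://www.google.com/intl/en/about/products/
Reusing existing connection to www.google.com:80.
HTTP request sent, awaiting response... 200 OK
EOF
)

# Prints the same URL twice: once for the existence check,
# once for the retrieval that looks for further links.
printf '%s\n' "$spider_log" | extract_urls
```

The line containing "-- retrieving." is not matched because the pattern is anchored to the start of the line.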

Compare that with the output without --spider:

Location: http://www.google.com/intl/en/about/products/ [following]
--2013-04-12 22:00:49--  http://www.google.com/intl/en/about/products/
Reusing existing connection to www.google.com:80.

So here you get only one line starting with --.

You can remove the --spider flag, but you may still get duplicates. If you really don't want any duplicates, add | sort | uniq to the command:

wget --spider --force-html -r -l1 http://sld.tld 2>&1 \
  | grep '^--' | awk '{ print $3 }' \
  | grep -v '\.\(css\|js\|png\|gif\|jpg\|ico\|txt\)$' \
  | sort | uniq > urllist.txt
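Note that sort | uniq (or the equivalent sort -u) reorders the list alphabetically. If the order in which wget visited the pages matters, one possible alternative is an awk filter that drops repeats while keeping the first occurrence of each line. The sketch below uses a hypothetical helper name, `dedupe_keep_order`, and the sample URLs from the question as input.

```shell
# Hypothetical helper: print each input line only the first time it is seen.
# seen[$0]++ evaluates to 0 (false) on first sight, so !seen[$0]++ is true
# exactly once per distinct line, and awk's default action prints it.
dedupe_keep_order() {
  awk '!seen[$0]++'
}

# Sample input taken from the question's duplicated output.
printf '%s\n' \
  'http://sld.tld/' \
  'http://sld.tld/' \
  'http://sld.tld/home.html' \
  'http://sld.tld/home.html' \
  'http://sld.tld/news.html' \
  'http://sld.tld/news.html' \
  | dedupe_keep_order
```

In the pipeline above, this filter would take the place of | sort | uniq just before the redirect to urllist.txt.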
