text - wget: reading from a list of ID numbers and URLs

Question

I have a .txt file with 500 lines, each containing an ID number and a website homepage URL, like this:

id_345  http://www.example1.com
id_367  http://www.example2.org
...
id_10452 http://www.example3.net

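I am trying to download part of these websites recursively using wget with the -i option, but I would like to save the files in a way that links them to the ID number: either store the files in a directory named after the ID number or, as the best option (though I think the hardest to achieve), store the html content in a single txt file named after the ID number. Unfortunately, the -i option cannot read a file like the one I am using. How can I link the content of each website to its corresponding ID?

Thanks

PS: I imagine that to do this I have to "get out" of wget and call it from a script. If so, please bear in mind that I am a newbie in this area (I only have some Python experience), and in particular I cannot yet follow the logic and code of bash scripts. Step-by-step explanations for dummies are therefore very welcome.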

score 1 · Accepted Answer

Get the sites recursively with wget -P ... -r -l ... from Python, using parallel processing (gist here):

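import multiprocessing, subprocess

def getSiteRecursive(site_id, url, depth=2):
  # -P saves into a directory named after the id; -r -l limits the recursion depth
  cmd = ["wget", "-P", site_id, "-r", "-l", str(depth), url]
  # pass an argument list (no shell) so odd characters in a URL are not interpreted
  subprocess.call(cmd)

if __name__ == "__main__":
  input_file = "site_list.txt"
  jobs = []
  max_jobs = multiprocessing.cpu_count() * 2 + 1
  with open(input_file) as f:
    for line in f:
      # split on whitespace; blank or incomplete lines give fewer than 2 fields
      id_url = line.split()
      if len(id_url) >= 2:
        try:
          print("Grabbing " + id_url[1] + " into " + id_url[0] + " recursively...")
          # once the limit is reached, wait for the oldest job before starting another
          if len(jobs) >= max_jobs:
            jobs[0].join()
            del jobs[0]
          p = multiprocessing.Process(target=getSiteRecursive, args=(id_url[0], id_url[1], 2))
          jobs.append(p)
          p.start()
        except Exception as e:
          print("Error for " + id_url[1] + ": " + str(e))
  # wait for the jobs that are still running
  for j in jobs:
    j.join()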
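
The manual job bookkeeping above can also be left to a worker pool. This is only a sketch, not the code from the answer: it swaps in concurrent.futures.ThreadPoolExecutor (threads are enough here because the real work happens in the wget subprocesses), and build_wget_command, fetch_all and the runner parameter are hypothetical names introduced for illustration.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_wget_command(site_id, url, depth=2):
    """Return the wget argument list that saves url under a directory named site_id."""
    return ["wget", "-P", site_id, "-r", "-l", str(depth), url]


def fetch_all(pairs, max_workers=4, runner=subprocess.call):
    """Run one wget per (id, url) pair, at most max_workers at a time.

    runner is injectable (an assumption made for this sketch) so the function
    can be tried out without touching the network.
    """
    commands = [build_wget_command(site_id, url) for site_id, url in pairs]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map keeps the input order; each result is whatever runner returned
        return list(pool.map(runner, commands))
```

Calling fetch_all([("id_345", "http://www.example1.com")]) would start wget with its output directory set to id_345; the returned list holds the exit code of each wget run.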

Get a single page into a named file using Python:

import urllib.request
input_file = "site_list.txt"
# open the site list file
with open(input_file) as f:
  # loop through lines
  for line in f:
    # split out the id and url, skipping blank or incomplete lines
    id_url = line.split()
    if len(id_url) < 2:
      continue
    print("Grabbing " + id_url[1] + " into " + id_url[0] + ".html...")
    try:
      # try to get the web page
      u = urllib.request.urlopen(id_url[1])
      # save the GET response data to the id file (appended with ".html")
      with open(id_url[0] + ".html", "wb") as localFile:
        localFile.write(u.read())
      print("got " + id_url[0] + "!")
    except Exception:
      print("Could not get " + id_url[0] + "!")

Example site_list.txt:

id_345  http://www.stackoverflow.com
id_367  http://stats.stackexchange.com

Output:

Grabbing http://www.stackoverflow.com into id_345.html...
got id_345!
Grabbing http://stats.stackexchange.com into id_367.html...
got id_367!

Directory listing:

get_urls.py
id_345.html
id_367.html
site_list.txt
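
The step that turns each line of site_list.txt into an ID and a URL can be pulled out and checked on its own, without any network access. A minimal sketch, assuming the two-column whitespace-separated format shown above; parse_site_list and output_name are hypothetical helper names, not part of the script.

```python
def parse_site_list(lines):
    """Yield (id, url) pairs from lines of the form '<id> <url>'.

    Lines with fewer than two whitespace-separated fields are skipped.
    """
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:
            yield fields[0], fields[1]


def output_name(site_id, extension=".html"):
    """Return the file name a page is saved under, e.g. the id plus '.html'."""
    return site_id + extension


if __name__ == "__main__":
    sample = [
        "id_345  http://www.stackoverflow.com\n",
        "\n",
        "id_367  http://stats.stackexchange.com\n",
    ]
    for site_id, url in parse_site_list(sample):
        print(url, "->", output_name(site_id))
```

Passing extension=".txt" gives the single txt file per ID that the question asks for.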

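Alternatively, if you prefer the command line or a shell script, you can use awk to read each line (split on whitespace by default), pipe the result into a loop, and execute it with backticks: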

awk '{print "wget -O " $1 ".html " $2}' site_list.txt | while read line ; do `$line` ; done

Breaking it down...

awk '{print "wget -O " $1 ".html " $2}' site_list.txt |

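Use the awk tool to read each line of the site_list.txt file and split it on whitespace (the default) into variables ($1, $2, $3, etc.), so that the ID is in $1 and the URL is in $2.
Add the print AWK command to build the wget call.
Add the pipe operator | to send the output to the next command.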
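
For readers more at home in Python, the awk step can be sketched roughly as follows. build_wget_lines is a hypothetical name, and unlike awk (which prints a line for every input line, even an empty one) this sketch skips lines with fewer than two fields.

```python
def build_wget_lines(lines):
    """Build one 'wget -O <id>.html <url>' text line per input line.

    Roughly mirrors: awk '{print "wget -O " $1 ".html " $2}'
    """
    commands = []
    for line in lines:
        fields = line.split()  # whitespace splitting, like awk's default
        if len(fields) >= 2:
            # fields[0] plays the role of $1 (the id), fields[1] of $2 (the url)
            commands.append("wget -O " + fields[0] + ".html " + fields[1])
    return commands


if __name__ == "__main__":
    for command in build_wget_lines(["id_345  http://www.stackoverflow.com\n"]):
        print(command)
```

Nothing is downloaded at this point; the result is plain text, just like the output of awk before it is piped onward.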

Then make the wget calls:

while read line ; do `$line` ; done

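Loop through the previous command's output line by line, storing each line in the $line variable, and execute it with the backtick operator, which interprets the text and runs it as a command. Because backticks also substitute the command's standard output and try to run that too, this relies on wget printing nothing to standard output (its progress messages go to standard error).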
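
The loop has a close Python counterpart: read each text line, split it into arguments, and run it. This is a sketch under assumptions, not a replacement for the one-liner: run_command_lines and its runner parameter are hypothetical names, and shlex.split is used so that the line is split the way a shell would split it, without handing it to a shell.

```python
import shlex
import subprocess


def run_command_lines(lines, runner=subprocess.call):
    """Run each non-empty text line as a command and collect the results.

    runner is injectable (an assumption made for this sketch) so the loop
    can be exercised without actually starting wget.
    """
    results = []
    for line in lines:
        args = shlex.split(line)  # "wget -O a.html http://x" -> ["wget", "-O", "a.html", "http://x"]
        if args:
            results.append(runner(args))
    return results
```

With the default runner, each entry of the returned list is the exit code of the corresponding command.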
