regex - Download the first 1000 images from a Google search

Question

I run some searches on Google Images:

http://www.google.com/search?hl=en&q=panda&bav=on.2,or.r_gc.r_pw.r_cp.r_qf.,cf.osb&biw=1287&bih=672&um=1&ie=UTF-8&tbm=isch&source=og&sa=N&tab=wi&ei=qW4FUJigJ4jWtAbToInABg

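The result is thousands of photos. I am looking for a shell script that will download the first n images, for example 1000 or 500.

How can I do this?

I guess it needs some advanced regular expressions or something like that. I have tried many things, to no avail. Can someone help me?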

score 18 · Accepted Answer

Update 4: PhantomJS is now deprecated. I wrote a new script, google-images.py, in Python using Selenium and headless Chrome. See here for details: https://stackoverflow.com/a/61982397/218294

Update 3: I fixed the script to work with phantomjs 2.x.

Update 2: I modified the script to use phantomjs. It is harder to install, but at least it works again. http://sam.nipl.net/b/google-images http://sam.nipl.net/b/google-images.js

Update 1: Unfortunately this no longer works. It seems that JavaScript and other magic are now required to find where the images are located. Here is a version of the script for Yahoo image search: http://sam.nipl.net/code/nipl-tools/bin/yimg

Original answer: I hacked something together for this. I normally write smaller tools and use them together, but you asked for one shell script, not three dozen. This is deliberately dense code.

http://sam.nipl.net/code/nipl-tools/bin/google-images

It seems to work very well so far. Please let me know if you can improve it, or suggest any better coding techniques (given that it is a shell script).

#!/bin/bash
[ $# = 0 ] && { prog=`basename "$0"`;
echo >&2 "usage: $prog query count parallel safe opts timeout tries agent1 agent2
e.g. : $prog ostrich
       $prog nipl 100 20 on isz:l,itp:clipart 5 10"; exit 2; }
query=$1 count=${2:-20} parallel=${3:-10} safe=$4 opts=$5 timeout=${6:-10} tries=${7:-2}
agent1=${8:-Mozilla/5.0} agent2=${9:-Googlebot-Image/1.0}
query_esc=`perl -e 'use URI::Escape; print uri_escape($ARGV[0]);' "$query"`
dir=`echo "$query_esc" | sed 's/%20/-/g'`; mkdir "$dir" || exit 2; cd "$dir"
url="http://www.google.com/search?tbm=isch&safe=$safe&tbs=$opts&q=$query_esc" procs=0
echo >.URL "$url" ; for A; do echo >>.args "$A"; done
htmlsplit() { tr '\n\r \t' ' ' | sed 's/</\n</g; s/>/>\n/g; s/\n *\n/\n/g; s/^ *\n//; s/ $//;'; }
for start in `seq 0 20 $[$count-1]`; do
wget -U"$agent1" -T"$timeout" --tries="$tries" -O- "$url&start=$start" | htmlsplit
done | perl -ne 'use HTML::Entities; /^<a .*?href="(.*?)"/ and print decode_entities($1), "\n";' | grep '/imgres?' |
perl -ne 'use URI::Escape; ($img, $ref) = map { uri_unescape($_) } /imgurl=(.*?)&imgrefurl=(.*?)&/;
$ext = $img; for ($ext) { s,.*[/.],,; s/[^a-z0-9].*//i; $_ ||= "img"; }
$save = sprintf("%04d.$ext", ++$i); print join("\t", $save, $img, $ref), "\n";' |
tee -a .images.tsv |
while IFS=$'\t' read -r save img ref; do
wget -U"$agent2" -T"$timeout" --tries="$tries" --referer="$ref" -O "$save" "$img" || rm "$save" &
procs=$[$procs + 1]; [ $procs = $parallel ] && { wait; procs=0; }
done ; wait

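Features:

under 1500 bytes
explains usage if run with no arguments
downloads full images in parallel
safe search option
opts string for image size, type, etc.
timeout / retries options
impersonates Googlebot to fetch all images
numbers the image files
saves metadata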

I'll post a modular version some time, to show that it can be done quite nicely with a set of shell scripts and simple tools.
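For readers who find the dense pipeline hard to follow, here is a rough Python sketch of its middle stage only: turning the /imgres? links into numbered (file name, image URL, referer URL) rows, as the two perl one-liners do. The function name parse_imgres_links and the sample links are made up for illustration, and the sketch assumes the old link format with imgurl= and imgrefurl= parameters, which Google no longer serves.

```python
import re
from html import unescape
from urllib.parse import unquote


def parse_imgres_links(hrefs):
    """Turn raw /imgres? hrefs into (save_name, image_url, referer_url) rows.

    Hypothetical helper: it mirrors the perl stage of the shell script, but
    unlike the script it skips links that do not match instead of numbering them.
    """
    rows = []
    for href in hrefs:
        href = unescape(href)  # same job as decode_entities in the script
        if "/imgres?" not in href:
            continue
        match = re.search(r"imgurl=(.*?)&imgrefurl=(.*?)&", href)
        if not match:
            continue
        img, ref = (unquote(part) for part in match.groups())
        # Extension: drop everything up to the last "/" or ".", then cut at
        # the first non-alphanumeric character; fall back to "img" if empty.
        tail = re.sub(r".*[/.]", "", img)
        ext = re.sub(r"[^a-z0-9].*", "", tail, flags=re.I) or "img"
        rows.append(("%04d.%s" % (len(rows) + 1, ext), img, ref))
    return rows


# Made-up sample links in the old result format
sample = [
    "/imgres?imgurl=http://example.com/a/panda.jpg&amp;"
    "imgrefurl=http://example.com/page.html&amp;h=600&amp;w=800",
    "/imgres?imgurl=http%3A%2F%2Fexample.com%2Fpic.PNG%3Fv%3D2&amp;"
    "imgrefurl=http://example.com/other.html&amp;h=10&amp;w=10",
    "/search?q=panda",
]
for row in parse_imgres_links(sample):
    print("\t".join(row))
```

Each row corresponds to one line of the .images.tsv file that the script writes before handing the URLs to wget.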

score 6 · Accepted Answer

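I don't think you can achieve the whole task using regexes alone. There are 3 parts to this problem:

1. Extract the links of all the images -----> This can't be done using regexes. You need to use a web-based language for this. Google has APIs to do this programmatically. Check out here and here.

2. Assuming you succeeded in the first step with some web-based language, you can use the following regex, which uses lookarounds, to extract the exact image URL: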

(?<=imgurl=).*?(?=&)

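The above regex says: grab everything starting after imgurl= until you encounter the & symbol. See here for an example, where I took the URL of the first image of the search result and extracted the image URL.

How did I arrive at the above regex? By examining the links of the images found in the image search.

3. Now that you have the image URLs, use some web-based language/tool to download the images.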
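The regex from step 2 can be tried out with a few lines of Python. This is only a sketch: the link below is a made-up example in the imgurl=...& format described above, and extract_imgurl is a hypothetical helper name. Note that the lookahead requires a trailing &, so a link where imgurl is the last parameter would not match.

```python
import re
from urllib.parse import unquote

# Lookbehind for "imgurl=", lazy match, lookahead for the next "&"
IMGURL_RE = re.compile(r"(?<=imgurl=).*?(?=&)")

# Made-up result link, for illustration only
link = ("http://www.google.com/imgres?imgurl=http://example.com/panda.jpg"
        "&imgrefurl=http://example.com/pandas.html&h=600&w=800")


def extract_imgurl(link):
    """Return the image URL embedded in a result link, or None if absent."""
    match = IMGURL_RE.search(link)
    if match is None:
        return None
    # The value may be percent-encoded, so decode it before use
    return unquote(match.group(0))


print(extract_imgurl(link))
```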

score 2 · Accepted Answer

answered 2012-07-17T15:19:12.010

score 1 · Accepted Answer

So much workload? Why not use Bulk Image Downloader? It has a 100-image limit.

It also needs some coding for sites that use Java image viewers.

score 0 · Accepted Answer

If you also need the height and width of the image, you can extend the regex from Pavan Manjunath's answer:

(?<=imgurl=)(?<imgurl>.*?)(?=&).*?(?<=h=)(?<height>.*?)(?=&).*?(?<=w=)(?<width>.*?)(?=&)

This gives you three regex groups, imgurl, height and width, holding that information.
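The (?<name>...) group syntax above is the .NET/PCRE style. Python's re module spells named groups as (?P<name>...), so a Python version of the same pattern could look like the sketch below. The sample link and the helper name extract_image_info are assumptions for illustration. The pattern also assumes that h= and w= appear after imgurl= and are each followed by another &.

```python
import re

# Same pattern as above, with Python-style named groups
SIZE_RE = re.compile(
    r"(?<=imgurl=)(?P<imgurl>.*?)(?=&).*?"
    r"(?<=h=)(?P<height>.*?)(?=&).*?"
    r"(?<=w=)(?P<width>.*?)(?=&)"
)

# Made-up result link, for illustration only
link = ("http://www.google.com/imgres?imgurl=http://example.com/panda.jpg"
        "&imgrefurl=http://example.com/pandas.html&h=600&w=800&sz=45")


def extract_image_info(link):
    """Return a dict with imgurl, height and width strings, or None."""
    match = SIZE_RE.search(link)
    return match.groupdict() if match else None


info = extract_image_info(link)
if info is not None:
    print(info["imgurl"], info["height"], info["width"])
```

Because the lookbehinds only check for the two characters before the value, a parameter whose name merely ends in h or w could match first in other link layouts, so the result is worth validating before use.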

score 0 · Accepted Answer

Rather than attempt to parse the HTML (which is very hard and likely to break), consider the APIs highlighted by @Pavan in his answer.

Additionally, consider using a tool that already tries to do something similar. Wget (web-get) has a spider-like feature for following links (specifically for specified file types). See this answer to the Stack Overflow question 'how do i use wget to download all images into a single folder'.

Regex is wonderfully useful, but I don't think it is in this context - remember the Regex mantra:

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.

-- Jamie Zawinski

score 0 · Accepted Answer

I found an easier way to do this with this tool. I can confirm that it works well as of this post.

Feature Requests to the developer:

Get a preview of the image(s) to verify that it's correct.
Allow input of multiple terms sequentially (i.e. batch processing).

score 0 · Accepted Answer

Python script to download full-resolution images from Google Image Search. Currently it downloads 100 images per query.

from bs4 import BeautifulSoup
import json
import os
import urllib.request


def get_soup(url, header):
    # fetch the page with the given headers and parse it
    request = urllib.request.Request(url, headers=header)
    return BeautifulSoup(urllib.request.urlopen(request), "html.parser")


query = input("query image")  # you can change the query for the image here
image_type = "ActiOn"  # prefix used for the saved file names
query = '+'.join(query.split())
url = "https://www.google.co.in/search?q=" + query + "&source=lnms&tbm=isch"
print(url)
# add the directory for your image here
DIR = "C:\\Users\\Rishabh\\Pictures\\" + query.split('+')[0] + "\\"
header = {'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.134 Safari/537.36"}
soup = get_soup(url, header)

ActualImages = []  # contains the link for Large original images, type of image
for a in soup.find_all("div", {"class": "rg_meta"}):
    meta = json.loads(a.text)  # each rg_meta div holds a JSON blob
    ActualImages.append((meta["ou"], meta["ity"]))

print("there are total", len(ActualImages), "images")

# download each image into DIR
for img, Type in ActualImages:
    try:
        # headers must be a plain dict of header name -> value
        req = urllib.request.Request(img, headers=header)
        raw_img = urllib.request.urlopen(req).read()
        if not os.path.exists(DIR):
            os.mkdir(DIR)
        # next file number = files already saved with this prefix + 1
        cntr = len([name for name in os.listdir(DIR) if image_type in name]) + 1
        print(cntr)
        ext = Type if Type else "jpg"  # fall back to .jpg when the type is empty
        with open(DIR + image_type + "_" + str(cntr) + "." + ext, 'wb') as f:
            f.write(raw_img)
    except Exception as e:
        print("could not load : " + img)
        print(e)

I am reposting my solution here; I originally posted it on the following question: https://stackoverflow.com/a/28487500/2875380

score 0 · Accepted Answer

How about using this library? google-images-download

Anyone still looking for a decent way to download hundreds of images can use this command-line code.

score 0 · Accepted Answer

I used this to download 1000 images and it 100% worked for me: atif93/google_image_downloader

After you download it, open a terminal and install Selenium:

$ pip install selenium --user

Then check your Python version:

$ python --version

If you are running Python 2.7, then to download 1000 images of pizza, run:

$ python image_download_python2.py 'pizza' '1000'

If you are running Python 3, then to download 1000 images of pizza, run:

$ python image_download_python3.py 'pizza' '1000'

The breakdown is:

python image_download_python2.py <query> <number of images>
python image_download_python3.py <query> <number of images>

query is the name of the image you're looking for, and number of images is how many to download. In my example above, the query is pizza and I want 1000 images of it.

score -1 · Accepted Answer

There are other libraries on GitHub; this one looks quite good: https://github.com/Achillefs/google-cse

g = GoogleCSE.image_search('Ian Kilminster')
img = g.fetch.results.first.link
file = img.split('/').last
File.open(file,'w') {|f| f.write(open(img).read)} 
`open -a Preview #{file}`
