python - How to list loaded resources with Selenium/PhantomJS?

Question

I want to load a web page and list all the resources (JavaScript/images/CSS) that the page loads. I use this code to load the page:

from selenium import webdriver
driver = webdriver.PhantomJS()
driver.get('http://example.com')

The code above works perfectly, and I can process the HTML page. The question is: how do I list all the resources loaded by that page? I want something like this:

['http://example.com/img/logo.png',
 'http://example.com/css/style.css',
 'http://example.com/js/jquery.js',
 'http://www.google-analytics.com/ga.js']

I am also open to other solutions, such as using the PySide.QWebView module. I just want to list the resources loaded by each page.

score 4 · Accepted Answer

This is not a Selenium solution, but it works well with Python and PhantomJS.

The idea is to do exactly what the Network tab in Chrome Developer Tools does. To do that, we need to listen for every request the web page makes.

JavaScript/PhantomJS part

With PhantomJS, you can do this with the script below. Adapt it to your needs.

// getResources.js
// Usage: 
// ./phantomjs --ssl-protocol=any --web-security=false getResources.js your_url
// the ssl-protocol and web-security flags are added to dismiss SSL errors

var page = require('webpage').create();
var system = require('system');
var urls = [];

// Check whether the requested resource is an image
function isImg(url) {
  var acceptedExts = ['jpg', 'jpeg', 'png'];
  // Drop the query string before reading the extension
  var baseUrl = url.split('?')[0];
  var ext = baseUrl.split('.').pop().toLowerCase();
  return acceptedExts.indexOf(ext) > -1;
}

// Check whether a URL has a given extension
function isExt(url, ext) {
  var baseUrl = url.split('?')[0];
  var fileExt = baseUrl.split('.').pop().toLowerCase();
  return ext == fileExt;
}

// Listen for all requests made by the web page
// (like the Network tab of Chrome Developer Tools)
// and add them to the array
page.onResourceRequested = function(request, networkRequest) {
  // If the requested URL is the page itself, do nothing
  // so that the other resource requests can proceed
  if (system.args[1] == request.url) {
    return;
  } else if (isImg(request.url) || isExt(request.url, 'js') || isExt(request.url, 'css')) {
    // The URL is an image, CSS or JS file:
    // add it to the array
    urls.push(request.url);
    // Abort the request for a better response time;
    // this can be omitted to collect asynchronously loaded files
    networkRequest.abort();
  }
};

// When all requests are made, output the array to the console
page.onLoadFinished = function(status) {
  console.log(JSON.stringify(urls));
  phantom.exit();
};

// If an error occurs, ignore it
page.onResourceError = function() {
  return false;
};
page.onError = function() {
  return false;
};

// Open the web page
page.open(system.args[1]);

Python part

Now call the script from Python:

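import json
from subprocess import check_output

# your_url is a placeholder for the page to inspect
your_url = 'http://example.com'

# Run the PhantomJS script and parse the JSON array it prints to stdout
out = check_output(['./phantomjs', '--ssl-protocol=any',
                    '--web-security=false', 'getResources.js', your_url])
data = json.loads(out)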
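
Once the list is parsed, it can be handy to group the URLs by file type. The following is only a minimal sketch, assuming `data` is a list of URL strings like the one printed by getResources.js; `group_by_extension` is a hypothetical helper name, not part of any library, and the sample list is the one from the question.

```python
import posixpath
from collections import defaultdict
from urllib.parse import urlsplit


def group_by_extension(urls):
    """Group URLs by the lowercase file extension of their path."""
    groups = defaultdict(list)
    for url in urls:
        # urlsplit separates the path from the query string and fragment
        path = urlsplit(url).path
        ext = posixpath.splitext(path)[1].lstrip('.').lower()
        groups[ext].append(url)
    return dict(groups)


# Sample data: the list shown in the question
data = ['http://example.com/img/logo.png',
        'http://example.com/css/style.css',
        'http://example.com/js/jquery.js',
        'http://www.google-analytics.com/ga.js']

by_ext = group_by_extension(data)
print(by_ext.get('js', []))
```

URLs whose path has no extension end up under the empty-string key, so nothing is silently dropped.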

Hope this helps.

score 1 · Accepted Answer

WebDriver has no function that returns all the resources on a web page, but you can do something like this:

from selenium.webdriver.common.by import By
images = driver.find_elements(By.TAG_NAME, "img")

The same goes for scripts and links.
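
Putting the three tag types together might look like the sketch below. It assumes `driver` is a WebDriver instance such as the one created in the question; `list_resource_urls` and `RESOURCE_ATTRS` are hypothetical names introduced for illustration. The locator string `"tag name"` is the value of Selenium's `By.TAG_NAME`, used directly here so the snippet has no import of its own. Note that this only sees elements present in the DOM, so resources pulled in from CSS or by XHR would not appear.

```python
# "tag name" is the string value of selenium's By.TAG_NAME
TAG_NAME = "tag name"

# Tag -> attribute that holds the resource URL (illustrative mapping)
RESOURCE_ATTRS = {"img": "src", "script": "src", "link": "href"}


def list_resource_urls(driver):
    """Collect the URLs referenced by img, script and link elements."""
    urls = []
    for tag, attr in RESOURCE_ATTRS.items():
        for element in driver.find_elements(TAG_NAME, tag):
            url = element.get_attribute(attr)
            # Inline scripts have no src; also skip duplicates
            if url and url not in urls:
                urls.append(url)
    return urls
```

With the driver from the question, `list_resource_urls(driver)` would be called after `driver.get(...)` has returned.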
