python - How to check whether a value on a website has changed

Question

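Basically, I'm trying to run some code (Python 3.2) if a value on a website changes; otherwise the script should wait a while and check again later.

At first I thought I could save the value in a variable and compare it with the new value fetched the next time the script runs. I quickly ran into a problem: the value is overwritten when the script runs again and the variable is re-initialized.

So I tried saving the web page's HTML to a file and comparing it with the HTML fetched the next time the script runs. No luck there either, because the comparison kept coming up False even when nothing had changed.

Next I tried pickling the web page and comparing that with the HTML. Interestingly, that didn't work inside the script either. However, if I type file = pickle.load( open( 'D:\\Download\\htmlString.p', 'rb')) after the script has run, and then file == html, it shows True when nothing has changed.

I'm a bit confused about why it doesn't work when the script runs, yet gives the right answer when I do the above.

Edit: Thanks for all the responses so far. My question isn't really about other ways of doing this (although it's always good to learn more ways to accomplish a task!), but about why the code below doesn't work when it runs as a script, whereas if I reload the pickle object at the prompt after the script has run and test it against the html, it returns True when nothing has changed.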

try: 
    file = pickle.load( open( 'D:\\Download\\htmlString.p', 'rb'))
    if pickle.load( open( 'D:\\Download\\htmlString.p', 'rb')) == htmlString:
        print("Values haven't changed!")
        sys.exit(0)
    else:
        pickle.dump( htmlString, open( 'D:\\Download\\htmlString.p', "wb" ) )  
        print('Saving')
except: 
    pickle.dump( htmlString, open( 'D:\\Download\\htmlString.p', "wb" ) )
    print('ERROR')

score 9 · Accepted Answer

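Edit: I hadn't realized you were only looking for the problem with your script. Here is what I think the problem is, followed by my original answer, which describes another approach to the bigger problem you're trying to solve.

Your script is a good example of the dangers of a catch-all except statement: it catches everything, including, in this case, your sys.exit(0).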
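To see the effect in isolation, here is a minimal sketch (the function names are made up for illustration). sys.exit() works by raising SystemExit, so a bare except swallows it, while a specific except clause lets it through:

```python
import sys


def exit_inside_bare_except():
    """A bare except catches SystemExit, so the script keeps running."""
    try:
        sys.exit(0)
    except:  # catches everything, including SystemExit
        return "swallowed"


def exit_inside_specific_except():
    """Catching only IOError lets SystemExit propagate and end the script."""
    try:
        sys.exit(0)
    except IOError:
        return "swallowed"  # never reached


print(exit_inside_bare_except())
```

In the question's script, this is why the except branch runs (and prints 'ERROR') right after "Values haven't changed!" is printed.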

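I assume you have the try block there to catch the case where D:\Download\htmlString.p doesn't exist yet. That error is called IOError, and you can catch it specifically with except IOError:

Here is your script, plus a bit of code before it to make it run, corrected for your except issue: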

import sys
import pickle
import urllib2

request = urllib2.Request('http://www.iana.org/domains/example/')
response = urllib2.urlopen(request) # Make the request
htmlString = response.read()

try: 
    file = pickle.load( open( 'D:\\Download\\htmlString.p', 'rb'))
    if file == htmlString:
        print("Values haven't changed!")
        sys.exit(0)
    else:
        pickle.dump( htmlString, open( 'D:\\Download\\htmlString.p', "wb" ) )  
        print('Saving')
except IOError: 
    pickle.dump( htmlString, open( 'D:\\Download\\htmlString.p', "wb" ) )
    print('Created new file.')

By the way, consider using os.path for file paths. It will help anyone who later wants to use the script on another platform, and it saves you the ugly doubled backslashes.
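As a rough illustration of that suggestion, the path could be built like this. The location (a "Download" folder under the user's home directory) is an assumption made for the example, not the path from the question:

```python
import os.path

# Assumption: the pickle lives in a "Download" folder under the user's home
# directory; substitute whatever location your script actually uses.
download_dir = os.path.join(os.path.expanduser("~"), "Download")
pickle_path = os.path.join(download_dir, "htmlString.p")

print(pickle_path)
```

os.path.join inserts the separator that is correct for the current platform, so no backslashes appear in the source.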

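Edit 2: Adapted for your specific URL.

The ad on that page contains a dynamically generated number that changes every time the page loads. It sits near the end of all the content, so you can split the HTML string at that point and keep the first half, discarding the part with the dynamic number.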

import sys
import pickle
import urllib2

request = urllib2.Request('http://ecal.forexpros.com/e_cal.php?duration=weekly')
response = urllib2.urlopen(request) # Make the request
# Grab everything before the dynamic double-click link
htmlString = response.read().split('<iframe src="http://fls.doubleclick')[0]

try: 
    file = pickle.load( open( 'D:\\Download\\htmlString.p', 'rb'))
    if file == htmlString:
        print("Values haven't changed!")
        sys.exit(0)
    else:
        pickle.dump( htmlString, open( 'D:\\Download\\htmlString.p', "wb" ) )  
        print('Saving')
except IOError: 
    pickle.dump( htmlString, open( 'D:\\Download\\htmlString.p', "wb" ) )
    print('Created new file.')

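Your string is no longer a valid HTML document, in case that matters. If it does, you could remove just that line instead. There is probably a more elegant way to do this, perhaps deleting the number with a regex, but this at least satisfies your question.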
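A minimal sketch of the regex idea mentioned above. The pattern is an assumption: it supposes the changing number appears as "ord=<digits>" in the ad URL, which would need to be checked against the real page source. The sample strings and the example.invalid host are made up.

```python
import re

# Assumption: the changing value appears as "ord=<digits>" inside the ad URL.
# Inspect the real page source and adjust the pattern accordingly.
DYNAMIC_NUMBER = re.compile(r"ord=\d+")


def normalize(html):
    """Replace the dynamic number with a fixed placeholder before comparing."""
    return DYNAMIC_NUMBER.sub("ord=0", html)


first = '<p>data</p><iframe src="http://example.invalid/ad?ord=12345"></iframe>'
second = '<p>data</p><iframe src="http://example.invalid/ad?ord=67890"></iframe>'
print(normalize(first) == normalize(second))
```

Comparing normalized strings keeps the document intact, so the stored copy stays valid HTML.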

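Original answer: an alternative approach to your problem.

What do the response headers from the web server look like? HTTP specifies a Last-Modified property that you can use to check whether the content has changed (assuming the server tells the truth). Use it with a HEAD request, as Uku showed in his answer, if you'd like to save bandwidth and be nice to the server you're polling.

There is also an If-Modified-Since header, which sounds like what you might be looking for.

Combining them, you might come up with something like this: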

import sys
import os.path
import urllib2

url = 'http://www.iana.org/domains/example/'
saved_time_file = 'last time check.txt'

request = urllib2.Request(url)
if os.path.exists(saved_time_file):
    """ If we've previously stored a time, get it and add it to the request"""
    last_time = open(saved_time_file, 'r').read()
    request.add_header("If-Modified-Since", last_time)

try:
    response = urllib2.urlopen(request) # Make the request
except urllib2.HTTPError, err:
    if err.code == 304:
        print "Nothing new."
        sys.exit(0)
    raise   # some other http error (like 404 not found etc); re-raise it.

last_modified = response.info().get('Last-Modified', False)
if last_modified:
    open(saved_time_file, 'w').write(last_modified)
else:
    print("Server did not provide a last-modified property. Continuing...")
    """
    Alternately, you could save the current time in HTTP-date format here:
    http://www.w3.org/Protocols/rfc2616/rfc2616-sec3.html#sec3.3
    This might work for some servers that don't provide Last-Modified, but do
    respect If-Modified-Since.
    """

"""
You should get here if the server won't confirm the content is old.
Hopefully, that means it's new.
HTML should be in response.read().
"""

Also check out this blog post by Stii, which may provide some inspiration. I don't know enough about ETags to have put them in my example, but his code checks for them as well.

score 4 · Accepted Answer

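You can always detect a change between a locally stored file and the remote data by hashing the contents of both. This is commonly used to verify the authenticity of downloaded data. For a continuous check, you will need a while loop.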

import hashlib
import time
import urllib

num_checks = 20
last_check = 1
while last_check != num_checks:
    remote_data = urllib.urlopen('http://remoteurl').read()
    remote_hash = hashlib.md5(remote_data).hexdigest()

    local_data = open('localfilepath').read()
    local_hash = hashlib.md5(local_data).hexdigest()
    if remote_hash == local_hash:
        print('right now, we match!')
    else:
        print('right now, we are different')

    last_check += 1   # without this the loop never terminates
    time.sleep(60)    # wait between checks; the interval is arbitrary

If you never need to store the actual data locally, store only the md5 hash and compute the remote hash on the fly when checking.
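A small sketch of that hash-only variant, written for Python 3. The function name and the idea of keeping the hash in a text file are assumptions for illustration; data is expected to be the downloaded page as bytes:

```python
import hashlib
import os.path


def content_changed(data, hash_file):
    """Compare the MD5 of data (bytes) with the stored hash, then update it."""
    new_hash = hashlib.md5(data).hexdigest()
    old_hash = None
    if os.path.exists(hash_file):
        with open(hash_file, "r") as f:
            old_hash = f.read().strip()
    with open(hash_file, "w") as f:
        f.write(new_hash)
    return new_hash != old_hash
```

Note that the first call reports a change, because there is no stored hash to compare against yet.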

score 4 · Accepted Answer

It would be more efficient to make a HEAD request and check the Content-Length of the document.

import urllib2
"""
read old length from file into variable
"""
request = urllib2.Request('http://www.yahoo.com')
request.get_method = lambda : 'HEAD'

response = urllib2.urlopen(request)
new_length = response.info()["Content-Length"]
if old_length != new_length:
    print "something has changed"

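Note that the Content-Length is unlikely to stay exactly the same when the content changes, though it can happen; at the same time, this is the most efficient method. Whether it is suitable depends on what kind of changes you expect.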
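For Python 3, a HEAD request can be sketched roughly as follows. The helper names are made up, the method argument of Request requires Python 3.3 or later, and the server may omit Content-Length entirely, in which case the value is None:

```python
import urllib.request


def build_head_request(url):
    """Build a HEAD request (the method argument needs Python 3.3+)."""
    return urllib.request.Request(url, method="HEAD")


def fetch_content_length(url):
    """Return the Content-Length header as a string, or None if absent."""
    with urllib.request.urlopen(build_head_request(url)) as response:
        return response.headers.get("Content-Length")


def length_changed(old_length, new_length):
    """Header values are strings, so compare them as strings."""
    return old_length != new_length
```

The previous length still has to be stored somewhere between runs, for example in a small text file.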

score 0 · Accepted Answer

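I wasn't entirely sure whether you just want to see if the website has changed, or whether you plan to do more with the website's data. If it's the former, definitely hash, as previously mentioned. Here is a working example (Python 2.6.1 on a Mac) that compares the complete old HTML against the new HTML. It should be easy to modify so that it uses hashes, or only specific parts of the website, as you need. Hopefully the comments and docstrings make everything clear.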

import urllib2

def getFilename(url):
    '''
    Input: url
    Return: a (string) filename to be used later for storing the urls contents
    '''
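    # lstrip() strips a set of characters, not a prefix, so remove 'http://' explicitly
    name = str(url)
    if name.startswith('http://'):
        name = name[len('http://'):]
    return name.replace("/", ":") + '.OLD'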


def getOld(url):
    '''
    Input: url- a string containing a url
    Return: a string containing the old html, or None if there is no old file
    (checks if there already is a url.OLD file, and make an empty one if there isn't to handle the case that this is the first run)
    Note: the file created with the old html is the format url(with : for /).OLD
    '''
    oldFilename = getFilename(url)
    oldHTML = ""
    try:
        oldHTMLfile = open(oldFilename,'r')
    except IOError:
        # file doesn't exist! so make it
        with open(oldFilename,'w') as oldHTMLfile:
            oldHTMLfile.write("")
        return None
    else:
        oldHTML = oldHTMLfile.read()
        oldHTMLfile.close()

    return oldHTML

class ConnectionError(Exception):
    def __init__(self, value):
        if type(value) != type(''):
            self.value = str(value)
        else:
            self.value = value
    def __str__(self):
        return 'ConnectionError: ' + self.value       


def htmlHasChanged(url):
    '''
    Input: url- a string containing a url
    Return: a boolean stating whether the website at url has changed
    '''

    try:
        fileRecvd = urllib2.urlopen(url).read()
    except:
        print 'Could not connect to %s, sorry!' % url
        #handle bad connection error...
        raise ConnectionError("urlopen() failed to open " + str(url))
    else:
        oldHTML = getOld(url)
        if oldHTML == fileRecvd:
            hasChanged = False
        else:
            hasChanged = True

        # rewrite file
        with open(getFilename(url),'w') as f:
            f.write(fileRecvd)

        return hasChanged

if __name__ == '__main__':
    # test it out with whatismyip.com
    try:
        print htmlHasChanged("http://automation.whatismyip.com/n09230945.asp")
    except ConnectionError,e:
        print e
