python - Tracking keywords in a live stream of tweets

Question

I installed tweepy and tried it out. Right now I'm using the following function:

api.public_timeline()

Returns the 20 most recent statuses from non-protected users who have set a custom user icon. The public timeline is cached for 60 seconds, so requesting it more often than that is a waste of resources.

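However, I want to extract every tweet from the full live stream that matches a certain regular expression. I could put public_timeline() inside a while True loop, but that would probably run into rate limiting problems. Either way, I don't think it would cover all current tweets.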

How can that be done? If not all tweets, then I want to extract as many tweets as possible that match a certain keyword.

score 2 · Accepted Answer

The streaming API is what you want. I use a library called tweetstream. Here's my basic listening function:

import time

import tweetstream

# username, password and db_initpop() are assumed to be defined elsewhere.


def retrieve_tweets(numtweets=10, *args):
    """
    This function optionally takes one or more arguments as keywords to filter tweets.
    It iterates through tweets from the stream that meet the given criteria and sends them
    to the database population function on a per-instance basis, so as to avoid disaster
    if the stream is disconnected.

    Both SampleStream and FilterStream methods access Twitter's stream of status elements.
    For status element documentation, (including proper arguments for tweet['arg'] as seen
    below) see https://dev.twitter.com/docs/api/1/get/statuses/show/%3Aid.
    """
    # Every extra positional argument becomes a keyword to track.
    filters = [str(key) for key in args]
    if not filters:
        stream = tweetstream.SampleStream(username, password)
    else:
        stream = tweetstream.FilterStream(username, password, track=filters)
    try:
        count = 0
        while count < numtweets:
            for tweet in stream:
                # a check is needed on text as some "tweets" are actually just API operations
                # the language selection doesn't really work but it's better than nothing(?)
                if tweet.get('text') and tweet['user']['lang'] == 'en':
                    # testing for 'retweeted_status' avoids a KeyError on tweets that are not RTs
                    if 'retweeted_status' not in tweet:
                        # bundle up the features I want and send them to the db population function
                        bundle = (tweet['id'], tweet['user']['screen_name'],
                                  tweet['retweet_count'], tweet['text'])
                        db_initpop(bundle)
                        break
                    else:
                        # a RT has a different structure.  This bundles the original tweet.  Getting the
                        # retweets comes later, after the stream is de-accessed.
                        original = tweet['retweeted_status']
                        bundle = (original['id'], original['user']['screen_name'],
                                  tweet['retweet_count'], original['text'])
                        db_initpop(bundle)
                        break
            count += 1
    except tweetstream.ConnectionError as e:
        print('Disconnected from Twitter at %s.  Reason: %s'
              % (time.strftime("%d %b %Y %H:%M:%S", time.localtime()), e.reason))

I haven't looked into it for a while, but I'm pretty sure this library only accesses the sample stream (as opposed to the firehose). HTH.

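Edited to add: you say you want the "full live stream", a.k.a. the firehose. That is financially and technically expensive, and only very large companies are allowed to have it. If you look at the docs, you'll see that the sample is basically representative.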

score 1 · Accepted Answer

Take a look at the streaming API. You can even subscribe to a list of words that you define, and only tweets matching those words will be returned.
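Since the question asks about a regular expression and the subscription works on plain keywords, one option is to track a broad keyword and then narrow the results down on your side. Here is a minimal sketch of that second step. The function name `matching_tweets` and the sample data are made up for illustration, and the only assumption about the stream is that it yields decoded status dicts with a `text` field.

```python
import re


def matching_tweets(statuses, pattern):
    """Yield statuses whose 'text' field matches the regular expression.

    `statuses` is any iterable of decoded status dicts (hypothetical input,
    e.g. whatever your streaming client yields).
    """
    regex = re.compile(pattern, re.IGNORECASE)
    for status in statuses:
        text = status.get('text')
        # Some stream messages (deletions, notices) carry no text at all.
        if text and regex.search(text):
            yield status


# Made-up sample messages standing in for a live stream.
sample = [
    {'text': 'Learning Python 3 today'},
    {'delete': {'status': {'id': 1}}},
    {'text': 'Nothing to see here'},
    {'text': 'python2 is retired'},
]
hits = [s['text'] for s in matching_tweets(sample, r'\bpython\d?\b')]
print(hits)
```

Because it is a generator, it can sit directly on top of a live stream iterator without buffering anything.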

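Rate limiting works differently on the streaming API. You get one connection per IP and a maximum number of events per second. If more events than that occur, you still only get the maximum, along with a notification telling you how many events you missed because of rate limiting.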
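To make the notification part concrete, here is a rough sketch of separating tweets from limit notices. It assumes the stream arrives as newline-delimited JSON and that a notice looks like `{"limit": {"track": N}}`, where N is the running total of undelivered tweets since the connection was opened; that is how I remember the docs describing it, so check it against the current documentation. The function name and the sample lines are made up.

```python
import json


def split_stream_messages(lines):
    """Separate tweets from limit notices in newline-delimited JSON.

    Returns (tweets, missed), where `missed` is the highest undelivered
    count reported by any limit notice seen so far.
    """
    tweets, missed = [], 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank keep-alive lines
        message = json.loads(line)
        if 'limit' in message:
            # Assumed to be cumulative, so keep the largest value seen.
            missed = max(missed, message['limit'].get('track', 0))
        elif 'text' in message:
            tweets.append(message)
    return tweets, missed


# Made-up sample lines standing in for the raw stream.
sample_lines = [
    '{"text": "first", "id": 1}',
    '',
    '{"limit": {"track": 5}}',
    '{"text": "second", "id": 2}',
    '{"limit": {"track": 12}}',
]
tweets, missed = split_stream_messages(sample_lines)
print(len(tweets), missed)
```

Logging `missed` over time gives you a rough idea of how much of the matching traffic your keywords are actually capturing.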

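My understanding is that the streaming API is best suited to servers that redistribute content to your users as needed, rather than being accessed directly by your users. Standing connections are expensive, and Twitter starts blacklisting IPs after too many failed connections and reconnections, and possibly your API key after that.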
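Given that, it pays to wait progressively longer between reconnection attempts instead of retrying in a tight loop. Below is a minimal sketch of exponential backoff. The `connect` callable is hypothetical (whatever opens your stream and raises `IOError` on failure), and the 5 second start and 320 second cap are placeholder values, so use whatever the current reconnection guidelines recommend.

```python
import itertools
import time


def backoff_delays(initial=5.0, factor=2.0, maximum=320.0):
    """Yield an endless series of wait times (seconds) that grow up to a cap."""
    delay = initial
    while True:
        yield delay
        delay = min(delay * factor, maximum)


def connect_with_backoff(connect, max_attempts=5, sleep=time.sleep):
    """Call connect() until it succeeds, waiting longer after each failure.

    `connect` is a hypothetical zero-argument callable that opens the
    stream and raises IOError when the connection fails.
    """
    delays = backoff_delays()
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()
        except IOError:
            if attempt == max_attempts:
                raise  # give up rather than hammering the endpoint
            sleep(next(delays))


print(list(itertools.islice(backoff_delays(), 8)))
```

Passing `sleep` as a parameter is only there so the retry logic can be exercised without actually waiting.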
