r - RでtwitteRパッケージのツイートジオコードを抽出する方法

Question

最近、Edwin Chen は、飲酒の文脈でこれらの単語を含むジオコーディングされたツイートから作成された、ソーダ、ポップス、コーラの地域的な使用法を示す素晴らしいマップを投稿しました。http://blog.echen.me/2012/07/06/soda-vs-pop-with-twitter/

彼は、R で Jeff Gentry によって作成された twitteR パッケージを使用したと述べています。案の定、特定の単語を使用するツイートを収集してデータフレームに入れるのは簡単です。

require(twitteR)
require(plyr)
cat.tweets<-searchTwitter("cats",n=1000)
tweets.df = ldply(cat.tweets, function(t) t$toDataFrame() )

データフレーム (tweets.df) には、各ツイートのユーザー ID、ツイートテキストなどが含まれますが、ジオコードは含まれていないようです。Rでそれを取得する方法について何か考えはありますか?

score 3 · Accepted Answer

私は R 関数をいじくり回してきました。検索テキスト、検索サイトの数、および各サイトの半径を入力します。たとえばtwitterMap("#rstats",10,"10mi")、コードは次のとおりです。

twitterMap <- function(searchtext,locations,radius){
require(ggplot2)
require(maps)
require(twitteR)
#radius from randomly chosen location
radius=radius
lat<-runif(n=locations,min=24.446667, max=49.384472)
long<-runif(n=locations,min=-124.733056, max=-66.949778)
#generate data fram with random longitude, latitude and chosen radius
coordinates<-as.data.frame(cbind(lat,long,radius))
coordinates$lat<-lat
coordinates$long<-long
#create a string of the lat, long, and radius for entry into searchTwitter()
for(i in 1:length(coordinates$lat)){
coordinates$search.twitter.entry[i]<-toString(c(coordinates$lat[i],
coordinates$long[i],radius))
}
# take out spaces in the string
coordinates$search.twitter.entry<-gsub(" ","", coordinates$search.twitter.entry ,
fixed=TRUE)

#Search twitter at each location, check how many tweets and put into dataframe
for(i in 1:length(coordinates$lat)){
coordinates$number.of.tweets[i]<-
 length(searchTwitter(searchString=searchtext,n=1000,geocode=coordinates$search.twitter.entry[i]))
}
#making the US map
all_states <- map_data("state")
#plot all points on the map
p <- ggplot()
p <- p + geom_polygon( data=all_states, aes(x=long, y=lat, group = group),colour="grey",     fill=NA )

p<-p + geom_point( data=coordinates, aes(x=long, y=lat,color=number.of.tweets
                                     )) + scale_size(name="# of tweets")
p
}
# Example
searchTwitter("dolphin",15,"10mi")

サンプルマップ

私が遭遇したいくつかの大きな問題があり、対処方法がわかりません。最初に、記述されているように、コードはランダムに生成された 15 の異なる場所を検索します。これらの場所は、米国の東経の最大値から西経最大値、および北極から南極までの緯度までの均一な分布から生成されます。これには、カナダのミネソタ州の森の湖のすぐ東など、米国以外の場所が含まれます。生成された場所が米国内にあるかどうかをランダムにチェックし、そうでない場合は破棄する機能が必要です。さらに重要なことは、何千もの場所を検索したいのですが、Twitter はそれを好まず、420 error enhance your calm. したがって、数時間ごとに検索し、ゆっくりとデータベースを構築し、重複するツイートを削除するのが最善の方法です。最後に、リモートで人気のあるトピックを選択すると、R は次のようなエラーを返しますError in function (type, msg, asError = TRUE) : transfer closed with 43756 bytes remaining to read。この問題を回避する方法が少しわかりません。

score 2 · Accepted Answer

呼び出しごとに 100 個のツイートしか抽出できないと仮定した場合のおもちゃの例を次に示します。

require(twitteR)
require(plyr)
URL = paste('http://search.twitter.com/search.atom? 
      q=','&geocode=39.724089,-104.820557,3mi','&rpp=100&page=', page, sep='') #Aurora,CO with radii of 3mi
XML = htmlTreeParse(URL, useInternal=TRUE)
entry = getNodeSet(XML, "//entry")
tweets = c()

for (i in 1:99){ 
    t = unlist(xpathApply(entry[[i]], "//title", xmlValue))
    tweets = c(tweets,t)
}

このソリューションはあまり洗練されていないかもしれませんが、特定のジオコードを指定してツイートを取得できました。

r - RでtwitteRパッケージのツイートジオコードを抽出する方法

3 に答える 3

Related

Reference