xml - Avoiding errors in a loop function (used to extract data from Twitter)

Question

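I have written a loop function that uses the search API to extract tweets at a fixed interval (say, every 5 minutes). The function does what it is supposed to do: it connects to Twitter, extracts tweets that contain a specific keyword, and saves them to a csv file. Occasionally, however (2 to 3 times a day), the loop stops because of one of the following two errors: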

Error in htmlTreeParse(URL, useInternal = TRUE) : error in creating parser for http://search.twitter.com/search.atom?q= 6.95322e-310tst&rpp=100&page=10
Error in UseMethod("xmlNamespaceDefinitions") : no applicable method for 'xmlNamespaceDefinitions' applied to an object of class "NULL"

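I hope you can help me deal with these errors by answering some of my questions:

What causes these errors to occur?
How can I adjust my code to avoid these errors?
How can I force the loop to keep running when an error occurs (for example, by using the `try` function)?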

My function (based on several scripts found online) is as follows:

    library(XML)   # htmlTreeParse

    twitter.search <- "Keyword"

    QUERY <- URLencode(twitter.search)

    # Set time loop (in seconds)
    d_time = 300
    number_of_times = 3000

    for(i in 1:number_of_times){

        tweets <- NULL
        tweet.count <- 0
        page <- 1
        read.more <- TRUE

        while (read.more)
        {
            # construct Twitter search URL
            URL <- paste('http://search.twitter.com/search.atom?q=',QUERY,'&rpp=100&page=', page, sep='')
            # fetch remote URL and parse
            XML <- htmlTreeParse(URL, useInternal=TRUE, error = function(...){})

            # Extract list of "entry" nodes
            entry     <- getNodeSet(XML, "//entry")

            read.more <- (length(entry) > 0)
            if (read.more)
            {
                for (i in 1:length(entry))
                {
                    subdoc     <- xmlDoc(entry[[i]])   # put entry in separate object to manipulate

                    published  <- unlist(xpathApply(subdoc, "//published", xmlValue))

                    published  <- gsub("Z"," ", gsub("T"," ",published) )

                    # Convert from GMT to central time
                    time.gmt   <- as.POSIXct(published,"GMT")
                    local.time <- format(time.gmt, tz="Europe/Amsterdam")

                    title  <- unlist(xpathApply(subdoc, "//title", xmlValue))

                    author <- unlist(xpathApply(subdoc, "//author/name",  xmlValue))

                    tweet  <-  paste(local.time, " @", author, ":  ", title, sep="")

                    entry.frame <- data.frame(tweet, author, local.time, stringsAsFactors=FALSE)
                    tweet.count <- tweet.count + 1
                    rownames(entry.frame) <- tweet.count
                    tweets <- rbind(tweets, entry.frame)
                }
                page <- page + 1
                read.more <- (page <= 15)   # Seems to be 15 page limit
            }
        }

        names(tweets)

        # top 15 tweeters
        #sort(table(tweets$author),decreasing=TRUE)[1:15]

        write.table(tweets, file=paste("Twitts - ", format(Sys.time(), "%a %b %d %H_%M_%S %Y"), ".csv"), sep = ";")

        Sys.sleep(d_time)

    } # end for

score 1 · Accepted Answer

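Here is my solution, using `try`, for a similar problem with the Twitter API.

I was asking the Twitter API for the follower count of each user in a long list of Twitter users. Before I added the `try` function, a protected account would raise an error and break the loop. With `try`, the loop skips to the next person on the list and keeps running.

Here is the setup: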

    # load library
    library(twitteR)
    #
    # Search Twitter for your term
    s <- searchTwitter('#rstats', n=1500)
    # convert search results to a data frame
    df <- do.call("rbind", lapply(s, as.data.frame))
    # extract the usernames
    users <- unique(df$screenName)
    users <- sapply(users, as.character)
    # make a data frame for the loop to work with
    users.df <- data.frame(users = users,
                           followers = "", stringsAsFactors = FALSE)

And here is the loop, which uses `try` to handle errors while filling `users.df$followers` with the follower counts obtained from the Twitter API:

    for (i in seq_len(nrow(users.df)))
    {
        # ask for the follower count; try() captures any error
        # (e.g. a protected account) instead of stopping the loop
        result <- try(getUser(users.df$users[i])$followersCount, silent = TRUE)
        # tell the loop to skip this user if an error occurred
        if (inherits(result, "try-error")) next
        # store the number of followers (reuses the result, so the
        # API is only called once per user)
        users.df$followers[i] <- result
        # tell the loop to pause for 60 s between iterations to
        # avoid exceeding the Twitter API request limit
        print('Sleeping for 60 seconds...')
        Sys.sleep(60)
    }
    #
    # Now inspect users.df to see the follower data
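
The same idea could be applied to the fetch step in the question's loop. What follows is only a minimal sketch, not tested against the real search API: `safe_fetch` and `fake_fetch` are hypothetical names, and `fake_fetch` is a stand-in that fails on one URL so the example runs without a network connection. In the real loop, `fetch` would presumably be something like `function(u) htmlTreeParse(u, useInternal = TRUE)`.

```r
# Hypothetical helper: call fetch(url) and return NULL instead of
# stopping the loop when fetch() raises an error.
safe_fetch <- function(url, fetch) {
  tryCatch(
    fetch(url),
    error = function(e) {
      message("Fetch failed for ", url, ": ", conditionMessage(e))
      NULL
    }
  )
}

# Stand-in for htmlTreeParse(), for demonstration only:
# it fails on the URL for page 2 and "parses" everything else.
fake_fetch <- function(url) {
  if (grepl("page=2", url, fixed = TRUE)) stop("error in creating parser")
  paste("parsed", url)
}

urls <- paste0("http://example.com/search?page=", 1:3)
results <- lapply(urls, safe_fetch, fetch = fake_fetch)

# TRUE marks the pages that failed and should be skipped
failed <- vapply(results, is.null, logical(1))
```

A `NULL` result should be checked before calling `getNodeSet`. Passing `NULL` on to `getNodeSet` is a likely source of the `xmlNamespaceDefinitions` error in the question, since that function evaluates `xmlNamespaceDefinitions(doc)` on whatever document it receives.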

score 0 · Accepted Answer

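My guess is that your problems arise when Twitter (or your connection to the web) is down or slow or whatever, so that you get a bad result back. Have you tried setting

    options(error = recover)

That way, the next time an error occurs, a nice browser environment will come up and you can have a poke around.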
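
If the failures do turn out to be transient, one possible approach is to retry the request a few times before giving up. The sketch below only illustrates that idea: `with_retry` and `flaky` are hypothetical names, and `flaky` is a stand-in that fails twice before succeeding, so that the example runs without contacting Twitter.

```r
# Hypothetical helper: call f() up to max_tries times, pausing `wait`
# seconds between attempts; returns NULL if every attempt fails.
with_retry <- function(f, max_tries = 3, wait = 0) {
  for (attempt in seq_len(max_tries)) {
    result <- try(f(), silent = TRUE)
    if (!inherits(result, "try-error")) return(result)
    if (attempt < max_tries) Sys.sleep(wait)
  }
  NULL
}

# Stand-in for a flaky network call: fails on the first two calls.
calls <- 0
flaky <- function() {
  calls <<- calls + 1
  if (calls < 3) stop("temporary failure")
  "ok"
}

out <- with_retry(flaky, max_tries = 5, wait = 0)
```

In the question's loop, `f` might wrap the `htmlTreeParse` call, with a `wait` of a few seconds. A `NULL` return would then mean that the page is skipped for this round.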
