r - Parsing tweets to extract hashtags in R

Question

I was wondering if anyone has a simple solution for extracting hashtags from tweets in R. For example, given the following string, how can I parse it and extract the words that contain a hashtag?

string <- 'Crowdsourcing is awesome. #stackoverflow'

score 6 · Accepted Answer

Unlike HTML, I expect you can probably parse hashtags with a regular expression.

library(stringr)
string <- "#hashtag Crowd#sourcing is awesome. #stackoverflow #question"
# I don't use Twitter, so this regex may not match the exact
# set of characters allowed in a hashtag.
# regex() replaces perl(), which is deprecated since stringr 1.0.0.
hashtag.regex <- regex("(?<=^|\\s)#\\S+")
hashtags <- str_extract_all(string, hashtag.regex)

Which yields:

> print(hashtags)
[[1]]
[1] "#hashtag"       "#stackoverflow" "#question"

Note that this also works unchanged if string is actually a vector of many tweets. In that case it returns a list of character vectors, one per tweet.
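To illustrate the vectorized behavior, here is a minimal sketch that uses only base R (regmatches and gregexpr with perl = TRUE) instead of stringr. The names tweets and hashtag_pattern are illustrative and not part of the original answer, and the pattern is the same one as above, so the same caveat about allowable hashtag characters applies.

```r
# Illustrative input: three tweets, the last one without any hashtag.
tweets <- c(
  "#hashtag Crowd#sourcing is awesome. #stackoverflow #question",
  "another #tag in this tweet",
  "no tags here"
)

# Same pattern as above: a '#' at the start of the string or after
# whitespace, followed by one or more non-whitespace characters.
hashtag_pattern <- "(?<=^|\\s)#\\S+"

# gregexpr() finds all matches per element; perl = TRUE enables lookbehind.
# regmatches() then returns one character vector per tweet.
hashtags <- regmatches(tweets, gregexpr(hashtag_pattern, tweets, perl = TRUE))
print(hashtags)
```

A tweet without hashtags yields an empty character vector, so the result always has the same length as the input.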

score 1 · Answer

Something like this?

string <- c('Crowdsourcing is awesome. #stackoverflow #answer', 
    "another #tag in this tweet")
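# Split each tweet at every '#'.
step1 <- strsplit(string, "#")
# Drop the first piece, which is the text before the first '#'.
step2 <- lapply(step1, tail, -1)
# Keep only the first word of each remaining piece.
result <- lapply(step2, function(x){
  sapply(strsplit(x, " "), head, 1)
})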
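As a rough sketch of how the split-based approach behaves, it can be wrapped in a helper. The function name extract_tags_by_split is hypothetical, and vapply is swapped in for sapply so that a tweet without hashtags gives an empty character vector. Two differences from the regex answer are worth noting: the returned tags do not include the leading "#", and a "#" in the middle of a word (as in "Crowd#sourcing") is also treated as a tag.

```r
# Hypothetical helper wrapping the split-based approach.
extract_tags_by_split <- function(tweets) {
  # Split at every literal '#'.
  pieces <- strsplit(tweets, "#", fixed = TRUE)
  # Drop the text that precedes the first '#'.
  after_hash <- lapply(pieces, tail, -1)
  # Take the first space-delimited word of each remaining piece.
  lapply(after_hash, function(x) {
    vapply(strsplit(x, " ", fixed = TRUE),
           function(words) words[1], character(1))
  })
}

tags <- extract_tags_by_split(c(
  "Crowd#sourcing is awesome. #stackoverflow #answer",
  "another #tag in this tweet"
))
print(tags)
```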
