json - Jaccard distance between tweets

Question

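I'm currently trying to measure the Jaccard distance between tweets in a dataset.

This is where the dataset is located:

http://www3.nd.edu/~dwang5/courses/spring15/assignments/A2/Tweets.json

I've tried a few things to measure the distance.

This is what I have so far.

I saved the linked dataset to a file called Tweets.json.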

json_alldata <- fromJSON(sprintf("[%s]", paste(readLines(file("Tweets.json")),collapse=",")))

Then I copied json_alldata into tweet.features and removed the geo column.

# get rid of geo column
tweet.features = json_alldata
tweet.features$geo <- NULL

This is what the first two tweets look like:

> tweet.features$text[1]
[1] "RT @ItsJennaMarbles: Reports of Marathon Runners that crossed finish line and continued to run to Mass General Hospital to give blood to victims. #PrayforBoston"
> tweet.features$text[2]
[1] "RT @NBCSN: Reports of Marathon Runners that crossed finish line and continued to run to Mass General Hospital to give blood to victims #PrayforBoston"

The first thing I tried was the stringdist method from the stringdist library.

install.packages("stringdist")
library(stringdist)

#This works?
#
stringdist(tweet.features$text[1], tweet.features$text[2], method = "jaccard")

When I run that, I get

[1] 0.1621622

However, I'm not sure whether that is right. A intersection B = 23 and A union B = 25. The Jaccard distance is A intersection B / A union B, so by my calculation the Jaccard distance should be 0.92.

So I thought I could do it with sets: just calculate the intersection and the union, then divide.

This is what I tried:

# Jaccard distance is the intersection of A and B divided by the Union of A and B
#
#create set for First Tweet
A1 <- as.set(tweet.features$text[1])
A2 <- as.set(tweet.features$text[2])

When I try to take the intersection, the output is just list():

 Intersection <- intersect(A1, A2)
 list()

When I try the union, I get this:

union(A1, A2)

[[1]]
[1] "RT @ItsJennaMarbles: Reports of Marathon Runners that crossed finish line and continued to run to Mass General Hospital to give blood to victims. #PrayforBoston"

[[2]]
[1] "RT @NBCSN: Reports of Marathon Runners that crossed finish line and continued to run to Mass General Hospital to give blood to victims #PrayforBoston"

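This doesn't appear to be grouping the words into a single set.

I figured I could divide the intersection by the union. But I think I would need a program that counts the number of words in each set and then does the calculation.

Needless to say, I'm a bit stuck and not sure whether I'm on the right track.

Any help would be appreciated. Thank you.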

score 3 · Accepted Answer

intersect and union expect vectors (as.set does not exist in base R). I think you want to compare words, so you can use strsplit, but how the split is done is up to you. An example follows.

tweet.features <- list(tweet1="RT @ItsJennaMarbles: Reports of Marathon Runners that crossed finish line and continued to run to Mass General Hospital to give blood to victims. #PrayforBoston",
                       tweet2=          "RT @NBCSN: Reports of Marathon Runners that crossed finish line and continued to run to Mass General Hospital to give blood to victims #PrayforBoston")

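jaccard_i <- function(tw1, tw2){
  # split each tweet into tokens on every space or dot
  tw1 <- unlist(strsplit(tw1, " |\\."))
  tw2 <- unlist(strsplit(tw2, " |\\."))
  # intersect() and union() drop duplicates, so these are set sizes
  i <- length(intersect(tw1, tw2))
  u <- length(union(tw1, tw2))
  # i = intersection size, u = union size, j = i/u
  list(i=i, u=u, j=i/u)
}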

jaccard_i(tweet.features[[1]], tweet.features[[2]])

$i
[1] 20

$u
[1] 23

$j
[1] 0.8695652

Is this what you want?
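One thing worth noting: the ratio j above (intersection over union) is usually called the Jaccard similarity, and the Jaccard distance is 1 minus that. Also, stringdist's "jaccard" method compares character q-grams (q = 1 by default) rather than words, which is likely why it returns a small value for two nearly identical tweets. Below is a minimal base-R sketch of the difference. The names jaccard_distance, split_words and split_chars are made up for this illustration, and the two short strings are just stand-ins for the real tweets.

```r
# Jaccard distance between two token vectors: 1 - |A n B| / |A u B|.
# jaccard_distance is a hypothetical helper, not part of any package.
jaccard_distance <- function(a, b) {
  a <- unique(a)
  b <- unique(b)
  1 - length(intersect(a, b)) / length(union(a, b))
}

# Word-level tokens (same split pattern as jaccard_i).
split_words <- function(s) unlist(strsplit(s, " |\\."))
# Character-level tokens, roughly what a q = 1 q-gram comparison looks at.
split_chars <- function(s) unlist(strsplit(s, ""))

x <- "give blood to victims. #PrayforBoston"
y <- "give blood to victims #PrayforBoston"

jaccard_distance(split_words(x), split_words(y))  # word-level distance
jaccard_distance(split_chars(x), split_chars(y))  # character-level distance
```

Both calls return a distance in [0, 1], where 0 means the two token sets are identical. Which tokenization is appropriate depends on whether you care about shared words or shared characters.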

Here the strsplit is done on every space or dot. You can refine the split argument of strsplit by replacing " |\\." with something more specific (see ?regex).
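As one possible refinement, the sketch below lower-cases the text, splits on any run of characters that is not a letter, digit, '#' or '@', and drops empty tokens (the " |\\." pattern yields an empty string wherever a dot is followed by a space). tokenize and jaccard_words are hypothetical names, and the choice of which characters to keep is an assumption you may want to change.

```r
# tokenize is a hypothetical helper: lower-case the text, split on runs of
# characters that are not letters, digits, '#' or '@', drop empty tokens,
# and return the unique tokens.
tokenize <- function(s) {
  tokens <- unlist(strsplit(tolower(s), "[^[:alnum:]#@]+"))
  unique(tokens[nzchar(tokens)])
}

# Intersection over union of the two token sets.
jaccard_words <- function(tw1, tw2) {
  a <- tokenize(tw1)
  b <- tokenize(tw2)
  length(intersect(a, b)) / length(union(a, b))
}

jaccard_words("Runners give blood to victims. #PrayforBoston",
              "runners give blood to victims #PrayforBoston")
```

With this tokenizer, differences in case and trailing punctuation no longer count as different words, so the two example strings are treated as the same set.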
