r - Visualizing distances between texts

Question

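I am working on a research project for school. I have written some text mining software that analyzes legal texts in a collection and spits out a score indicating how similar they are. I ran the program to compare each text with every other text, and I got data like this (although with many more points):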

codeofhammurabi.txt crete.txt      0.570737
codeofhammurabi.txt iraqi.txt      1.13475
codeofhammurabi.txt magnacarta.txt 0.945746
codeofhammurabi.txt us.txt         1.25546
crete.txt iraqi.txt                0.329545
crete.txt magnacarta.txt           0.589786
crete.txt us.txt                   0.491903
iraqi.txt magnacarta.txt           0.834488
iraqi.txt us.txt                   1.37718
magnacarta.txt us.txt              1.09582

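Now I need to plot them on a graph. I can easily invert the scores so that a small value indicates similar texts and a large value indicates dissimilar texts. The value can then be the distance between the points on a graph that represent the texts.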

codeofhammurabi.txt crete.txt      1.75212
codeofhammurabi.txt iraqi.txt      0.8812
codeofhammurabi.txt magnacarta.txt 1.0573
codeofhammurabi.txt us.txt         0.7965
crete.txt iraqi.txt                3.0344
crete.txt magnacarta.txt           1.6955
crete.txt us.txt                   2.0329
iraqi.txt magnacarta.txt           1.1983
iraqi.txt us.txt                   0.7261
magnacarta.txt us.txt              0.9125

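Short version: the values directly above are distances between points on a scatter plot (1.75212 is the distance between the codeofhammurabi point and the crete point). I can imagine a big system of equations with circles representing the distances between points. What is the best way to make this graph? I have MATLAB, R, Excel, and access to pretty much any software I might need.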

If you can even point me in a direction, I'll be infinitely grateful.

score 11 · Accepted Answer

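If your question is "how can I do something like what this guy did?" (from xiii1408's comment on the question), then the answer is to use Gephi's built-in Force Atlas 2 algorithm on Euclidean distances between document topic posterior probabilities.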

"This guy" is Matt Jockers, an innovative scholar in the digital humanities. Jockers works mostly in R and shares some of his code. His basic workflow seems to be:

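break plain text into 1000-word chunks,
remove stopwords (don't stem),
do part-of-speech tagging and keep only nouns,
build a topic model (using LDA),
calculate Euclidean distances between documents based on topic proportions, subset the distances to keep only those below a certain threshold, and then
visualize with a force-directed graph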
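The first step in the list above is not covered by the example further down, since the abstracts used there are already short. A minimal sketch of the chunking in base R, assuming each document is a single character string; `chunk_text` is a hypothetical helper name, not something taken from Jockers' code:

```r
# Hypothetical helper: split one plain-text string into chunks of `size` words
chunk_text <- function(txt, size = 1000) {
  words <- unlist(strsplit(txt, "\\s+"))      # split on runs of whitespace
  words <- words[nzchar(words)]               # drop empty tokens
  groups <- ceiling(seq_along(words) / size)  # chunk id for every word
  unname(vapply(split(words, groups), paste, character(1), collapse = " "))
}

# Toy input: 2500 identical words stand in for a real document
chunks <- chunk_text(paste(rep("lorem", 2500), collapse = " "))
length(chunks)
```

Each chunk can then be treated as its own document in the steps that follow.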

Here's a small-scale and reproducible example in R (with export to Gephi) that might be close to what Jockers did:

#### prepare workspace
# delete current objects and clear RAM
rm(list = ls(all.names = TRUE))
gc()

Get data...

#### import text
# working from the topicmodels package vignette
# using collection of abstracts of the Journal of Statistical Software (JSS) (up to 2010-08-05).
install.packages("corpus.JSS.papers", repos = "http://datacube.wu.ac.at/", type = "source")
data("JSS_papers", package = "corpus.JSS.papers")
# For reproducibility of results we use only abstracts published up to 2010-08-05 
JSS_papers <- JSS_papers[JSS_papers[,"date"] < "2010-08-05",]

Clean and reshape...

#### clean and reshape data
# Omit abstracts containing non-ASCII characters in the abstracts
JSS_papers <- JSS_papers[sapply(JSS_papers[, "description"], Encoding) == "unknown",]
# remove greek characters (from math notation, etc.)
library("tm")
library("XML")
remove_HTML_markup <- function(s) tryCatch({
    doc <- htmlTreeParse(paste("<!DOCTYPE html>", s),
                         asText = TRUE, trim = FALSE)
                         xmlValue(xmlRoot(doc))
                         }, error = function(s) s)
# create corpus
corpus <- Corpus(VectorSource(sapply(JSS_papers[, "description"], remove_HTML_markup)))
# clean corpus by removing stopwords, numbers, punctuation, whitespaces, words <3 characters long..
skipWords <- function(x) removeWords(x, stopwords("english"))
funcs <- list(tolower, removePunctuation, removeNumbers, stripWhitespace, skipWords)
corpus_clean <- tm_map(corpus, wordLengths=c(3,Inf), FUN = tm_reduce, tmFuns = funcs)

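Part-of-speech tagging and subsetting of nouns...

#### Part-of-speech tagging to extract nouns only
# library() attaches one package per call, so load each one separately
library("openNLP")
library("NLP")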
# function for POS tagging
tagPOS <-  function(x) {

  s <- NLP::as.String(x)
  ## Need sentence and word token annotations.

  a1 <- NLP::Annotation(1L, "sentence", 1L, nchar(s))
  a2 <- NLP::annotate(s, openNLP::Maxent_Word_Token_Annotator(), a1)
  a3 <- NLP::annotate(s,  openNLP::Maxent_POS_Tag_Annotator(), a2)

  ## Determine the distribution of POS tags for word tokens.
  a3w <- a3[a3$type == "word"]
  POStags <- unlist(lapply(a3w$features, `[[`, "POS"))

  ## Extract token/POS pairs (all of them): easy - not needed
  # POStagged <- paste(sprintf("%s/%s", s[a3w], POStags), collapse = " ")
  return(unlist(POStags))
} 
# a  loop to do POS tagging on each document and do garbage cleaning after each document
# first prepare vector to hold results (for optimal loop speed)
corpus_clean_tagged <- vector(mode = "list",  length = length(corpus_clean))
# then loop through each doc and do POS tagging
# warning: this may take some time!
for(i in 1:length(corpus_clean)){
  corpus_clean_tagged[[i]] <- tagPOS(corpus_clean[[i]])
  print(i) # nice to see what we're up to
  gc()
}

# subset nouns
wrds <- lapply(unlist(corpus_clean), function(i) unlist(strsplit(i, split = " ")))
NN <- lapply(corpus_clean_tagged, function(i) i == "NN")
Noun_strings <- lapply(1:length(wrds), function(i) unlist(wrds[i])[unlist(NN[i])])
Noun_strings <- lapply(Noun_strings, function(i) paste(i, collapse = " "))
# have a look to see what we've got
Noun_strings[[1]]
[8] "variogram model splus user quality variogram model pairs locations measurements variogram nonstationarity outliers variogram fit sets soil nitrogen concentration"

Topic modelling with latent Dirichlet allocation...

#### topic modelling with LDA (Jockers uses the lda package and MALLET, maybe topicmodels also, I'm not sure. I'm most familiar with the topicmodels package, so here it is. Note that MALLET can be run from R: https://gist.github.com/benmarwick/4537873
# put the cleaned documents back into a corpus for topic modelling
corpus <- Corpus(VectorSource(Noun_strings))
# create document term matrix 
JSS_dtm <- DocumentTermMatrix(corpus)
# generate topic model 
library("topicmodels")
k = 30 # arbitrary number of topics (there are ways to optimise this)
JSS_TM <- LDA(JSS_dtm, k) # make topic model
# make data frame where rows are documents, columns are topics and cells 
# are posterior probabilities of topics
JSS_topic_df <- setNames(as.data.frame(JSS_TM@gamma),  paste0("topic_",1:k))
# add row names that link each document to a human-readable bit of data
# in this case we'll just use a few words of the title of each paper
row.names(JSS_topic_df) <- lapply(1:length(JSS_papers[,1]), function(i) gsub("\\s","_",substr(JSS_papers[,1][[i]], 1, 60)))

Now calculate the Euclidean distance from one document to another, using the topic probabilities as the document's "DNA":

#### Euclidean distance matrix
library(cluster)
JSS_topic_df_dist <-  as.matrix(daisy(JSS_topic_df, metric =  "euclidean", stand = TRUE))
# Change row values to zero if greater than row minimum plus row standard deviation
# This is how Jockers subsets the distance matrix to keep only 
# closely related documents and avoid a dense spaghetti diagram 
# that's difficult to interpret (hat-tip: http://stackoverflow.com/a/16047196/1036500)
JSS_topic_df_dist[ sweep(JSS_topic_df_dist, 1, (apply(JSS_topic_df_dist,1,min) + apply(JSS_topic_df_dist,1,sd) )) > 0 ] <- 0
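To see what the thresholding line above does, here is a small sketch on made-up data (the `toy` matrix and all names are illustrative assumptions). `sweep()` subtracts each row's cutoff from that row, so a positive result marks a distance above the cutoff, and those entries are set to zero:

```r
set.seed(1)
# Toy "topic proportions": 4 documents, 3 topics (random numbers, not real data)
toy <- matrix(runif(12), nrow = 4, dimnames = list(paste0("doc", 1:4), NULL))
d <- as.matrix(dist(toy))                     # Euclidean distance matrix

# Per-row cutoff: row minimum plus row standard deviation
cutoff <- apply(d, 1, min) + apply(d, 1, sd)

d_thresh <- d
d_thresh[sweep(d, 1, cutoff) > 0] <- 0        # zero out distances above the cutoff
d_thresh
```

Note that each row's minimum is the zero on the diagonal, so in practice the cutoff is just the row standard deviation.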

Visualize using a force-directed graph...

#### network diagram using Fruchterman & Reingold algorithm (Jockers uses the ForceAtlas2 algorithm which is unique to Gephi)
library(igraph)
g <- as.undirected(graph.adjacency(JSS_topic_df_dist))
layout1 <- layout.fruchterman.reingold(g, niter=500)
plot(g, layout=layout1, edge.curved = TRUE, vertex.size = 1,  vertex.color= "grey", edge.arrow.size = 0.1, vertex.label.dist=0.5, vertex.label = NA)
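The calls above use older igraph function names. As a hedged sketch, assuming a recent igraph release, the same idea with the newer names looks like the following. The small matrix `m` is made-up data, and treating the retained distances as edge weights is my own assumption rather than something Jockers is known to do:

```r
library(igraph)

set.seed(42)
# Made-up symmetric distance matrix for 5 documents
m <- matrix(0, 5, 5, dimnames = list(letters[1:5], letters[1:5]))
m[upper.tri(m)] <- round(runif(10), 2)
m <- m + t(m)
m[m > 0.6] <- 0   # keep only the closer pairs, as in the thresholding step

# Non-zero cells become edges; the cell value is stored as E(g2)$weight
g2 <- graph_from_adjacency_matrix(m, mode = "undirected",
                                  weighted = TRUE, diag = FALSE)

# Fruchterman-Reingold treats larger weights as stronger attraction,
# so pass inverse distances to pull close documents together
lay <- layout_with_fr(g2, weights = 1 / E(g2)$weight)
plot(g2, layout = lay, vertex.size = 5, vertex.color = "grey")
```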

If you want to use the Force Atlas 2 algorithm in Gephi, export the R graph object to a graphml file, open it in Gephi, and set the layout to Force Atlas 2:

# this line will export from R and make the file 'JSS.graphml' in your working directory ready to open with Gephi
write.graph(g, file="JSS.graphml", format="graphml")

Here's the Gephi plot with the Force Atlas 2 algorithm:

[image: Gephi plot, Force Atlas 2 layout]

score 10 · Accepted Answer

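Your data are really distances (of some form) in the multivariate space spanned by the corpus of words contained in the documents. Dissimilarity data such as these are often ordinated to provide the best k-dimensional mapping of the dissimilarities. Principal coordinates analysis and non-metric multidimensional scaling are two such methods. I would suggest you plot the results of applying one or the other of these methods to your data. I provide examples of both below.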

First, load in the data you supplied (without labels at this stage):

con <- textConnection("1.75212
0.8812
1.0573
0.7965
3.0344
1.6955
2.0329
1.1983
0.7261
0.9125
")
vec <- scan(con)
close(con)

What you effectively have is the following distance matrix:

mat <- matrix(ncol = 5, nrow = 5)
mat[lower.tri(mat)] <- vec
colnames(mat) <- rownames(mat) <-
  c("codeofhammurabi","crete","iraqi","magnacarta","us")

> mat
                codeofhammurabi  crete  iraqi magnacarta us
codeofhammurabi              NA     NA     NA         NA NA
crete                   1.75212     NA     NA         NA NA
iraqi                   0.88120 3.0344     NA         NA NA
magnacarta              1.05730 1.6955 1.1983         NA NA
us                      0.79650 2.0329 0.7261     0.9125 NA

R, in general, needs a dissimilarity object of class "dist". We could use as.dist(mat) now to get such an object, or we could skip creating mat and go straight to the "dist" object like this:

class(vec) <- "dist"
attr(vec, "Labels") <- c("codeofhammurabi","crete","iraqi","magnacarta","us")
attr(vec, "Size") <- 5
attr(vec, "Diag") <- FALSE
attr(vec, "Upper") <- FALSE

> vec
           codeofhammurabi   crete   iraqi magnacarta
crete              1.75212                           
iraqi              0.88120 3.03440                   
magnacarta         1.05730 1.69550 1.19830           
us                 0.79650 2.03290 0.72610    0.91250
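For completeness, the other route mentioned above, going through `as.dist(mat)`, can be sketched as follows. It rebuilds the same inputs so that it runs on its own; `vals`, `labs`, `mat2` and `d` are just names chosen for this illustration:

```r
vals <- c(1.75212, 0.8812, 1.0573, 0.7965, 3.0344,
          1.6955, 2.0329, 1.1983, 0.7261, 0.9125)
labs <- c("codeofhammurabi", "crete", "iraqi", "magnacarta", "us")

# Fill the lower triangle column by column, as in the matrix shown earlier
mat2 <- matrix(NA_real_, nrow = 5, ncol = 5, dimnames = list(labs, labs))
mat2[lower.tri(mat2)] <- vals

# as.dist() reads only the lower triangle, so the NAs elsewhere are ignored
d <- as.dist(mat2)
d
```

Both routes should give an object with the same values and labels.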

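Now that we have an object of the right type, we can ordinate it. R has many packages and functions for doing this (see the Multivariate or Environmetrics Task Views on CRAN); here I use vegan: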

require("vegan")

Principal Coordinates

First I illustrate how to do principal coordinates analysis on your data using vegan.

pco <- capscale(vec ~ 1, add = TRUE)
pco

> pco
Call: capscale(formula = vec ~ 1, add = TRUE)

              Inertia Rank
Total           10.42     
Unconstrained   10.42    3
Inertia is squared Unknown distance (euclidified) 

Eigenvalues for unconstrained axes:
 MDS1  MDS2  MDS3 
7.648 1.672 1.098 

Constant added to distances: 0.7667353

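The first PCO axis is by far the most important in explaining the between-text differences, as shown by the eigenvalues. An ordination plot can now be produced by plotting the eigenvectors of the PCO, using the plot method: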

plot(pco)

producing

[image: PCO ordination plot]
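If vegan is not available, base R's `cmdscale()` also performs principal coordinates analysis (classical MDS). This is a sketch only: it does not add a constant to the distances the way `add = TRUE` does above, so the coordinates are not expected to match the vegan output exactly.

```r
vals <- c(1.75212, 0.8812, 1.0573, 0.7965, 3.0344,
          1.6955, 2.0329, 1.1983, 0.7261, 0.9125)
labs <- c("codeofhammurabi", "crete", "iraqi", "magnacarta", "us")
m <- matrix(0, 5, 5, dimnames = list(labs, labs))
m[lower.tri(m)] <- vals
d <- as.dist(m)

# Classical MDS in 2 dimensions; eig = TRUE also returns the eigenvalues
pco_base <- cmdscale(d, k = 2, eig = TRUE)

# Plot the texts by name at their PCO coordinates
plot(pco_base$points, type = "n", xlab = "PCO 1", ylab = "PCO 2")
text(pco_base$points, labels = labs)
```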

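Non-metric multidimensional scaling

Non-metric multidimensional scaling (nMDS) does not attempt to find a low-dimensional representation of the original distances in a Euclidean space. Instead it tries to find a mapping in k dimensions that best preserves the rank ordering of the distances between observations. There is no closed-form solution to this problem (unlike the PCO applied above), so an iterative algorithm is required. Random starts are advised, to assure yourself that the algorithm has not converged to a sub-optimal, locally optimal solution. vegan's metaMDS function incorporates these features and more besides. If you want plain old nMDS, see isoMDS in package MASS.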

set.seed(42)
sol <- metaMDS(vec)

> sol

Call:
metaMDS(comm = vec) 

global Multidimensional Scaling using monoMDS

Data:     vec 
Distance: user supplied 

Dimensions: 2 
Stress:     0 
Stress type 1, weak ties
No convergent solutions - best solution after 20 tries
Scaling: centring, PC rotation 
Species: scores missing

With this small data set we can essentially represent the rank ordering of the dissimilarities perfectly (hence the warning, not shown). A plot can be produced using the plot method:

plot(sol, type = "text", display = "sites")

producing

[image: nMDS ordination plot]
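The "plain old nMDS" route via `isoMDS` mentioned above can be sketched like this, assuming the MASS package is installed (it ships with standard R distributions). It uses a single start from the classical MDS solution rather than the random starts that metaMDS tries:

```r
library(MASS)

vals <- c(1.75212, 0.8812, 1.0573, 0.7965, 3.0344,
          1.6955, 2.0329, 1.1983, 0.7261, 0.9125)
labs <- c("codeofhammurabi", "crete", "iraqi", "magnacarta", "us")
m <- matrix(0, 5, 5, dimnames = list(labs, labs))
m[lower.tri(m)] <- vals
d <- as.dist(m)

# Kruskal's non-metric MDS in 2 dimensions, starting from cmdscale(d, 2)
fit <- isoMDS(d, k = 2, trace = FALSE)

# Plot the texts by name at their fitted coordinates
plot(fit$points, type = "n", xlab = "nMDS 1", ylab = "nMDS 2")
text(fit$points, labels = labs)
```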

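In both cases the distance on the plot between samples is the best 2-d approximation of their dissimilarity. In the PCO plot it is a 2-d approximation of the real dissimilarity (3 dimensions are needed to represent all of the dissimilarities fully), whereas in the nMDS plot the distance between samples reflects the rank order of the dissimilarities, not the actual dissimilarity between observations. But essentially, distances on the plot represent the computed dissimilarities. Texts that are close together are the most similar, and texts located far apart on the plot are the most dissimilar to one another.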

score 2 · Accepted Answer

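You could make a network graph using igraph. The Fruchterman-Reingold layout has a parameter for edge weights. Weights larger than 1 result in more "attraction" along the edges, and weights smaller than 1 do the opposite. In your example, crete.txt has the lowest distances, so it sits in the middle and has shorter edges to the other vertices. In fact, it is closest to iraqi.txt. Note that you have to invert the data for E(g)$weight to get the correct distances.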

data1 <- read.table(text="
codeofhammurabi.txt crete.txt      0.570737
codeofhammurabi.txt iraqi.txt      1.13475
codeofhammurabi.txt magnacarta.txt 0.945746
codeofhammurabi.txt us.txt         1.25546
crete.txt iraqi.txt                0.329545
crete.txt magnacarta.txt           0.589786
crete.txt us.txt                   0.491903
iraqi.txt magnacarta.txt           0.834488
iraqi.txt us.txt                   1.37718
magnacarta.txt us.txt              1.09582")
par(mar=c(3,7,3.5,5), las=1)

library(igraph)
g <- graph.data.frame(data1, directed = FALSE)
E(g)$weight <- 1/data1[,3] #inversed, high weights = more attraction along the edges
l <- layout.fruchterman.reingold(g, weights=E(g)$weight)
plot(g, layout=l)


score 0 · Accepted Answer

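Are you doing all pairwise comparisons? Depending on how you calculate the distance (similarity), I'm not sure it is possible to make such a scatter plot. When you have only 3 text files to consider, the scatter plot is easy to make (a triangle with sides equal to the distances). But when you add a fourth point, you may not be able to place it at a location where its distances to the existing 3 points satisfy all the constraints.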

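But if you can do that, then you have a solution: just keep adding new points... I think... Or, if you don't need the distances on the scatter plot to be exact, you can simply make a web and label the distances.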
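One quick way to see whether the constraints can even be met exactly is to check the triangle inequality on the inverted scores from the question. A minimal sketch in R (all names are illustrative). For example, crete to iraqi is 3.0344, which is more than crete to codeofhammurabi plus codeofhammurabi to iraqi (1.75212 + 0.8812 = 2.63332), so no exact placement of those three points exists:

```r
vals <- c(1.75212, 0.8812, 1.0573, 0.7965, 3.0344,
          1.6955, 2.0329, 1.1983, 0.7261, 0.9125)
labs <- c("codeofhammurabi", "crete", "iraqi", "magnacarta", "us")
m <- matrix(0, 5, 5, dimnames = list(labs, labs))
m[lower.tri(m)] <- vals
D <- m + t(m)                       # full symmetric distance matrix

# Count triples (i, j, k) where going via k is shorter than the direct distance
n <- nrow(D)
violations <- 0
for (i in 1:n) for (j in 1:n) for (k in 1:n) {
  if (i < j && k != i && k != j && D[i, j] > D[i, k] + D[k, j]) {
    violations <- violations + 1
  }
}
violations
```

Any violation means the values cannot all be drawn as exact distances in a plane, which is why an approximate method is needed.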

score 0 · Accepted Answer

Here is a potential solution for Matlab:

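You can arrange your data into a formal 5x5 similarity matrix S, where element S(i,j) represents the similarity (or dissimilarity) between document i and document j. Assuming your distance measure is an actual metric, you can apply multidimensional scaling to this matrix via mdscale(S,2).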

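This function will attempt to find a 5x2 dimensional representation of your data that preserves the similarity (or dissimilarity) between the classes found in the higher dimension. You can then visualize this data as a scatter plot of 5 points.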

You could also try this using mdscale(S,3) to project into a 5x3 dimensional matrix, which you can then visualize with plot3().
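The matrix S described above has to be assembled from the pair list first. As a rough sketch of that step, written in R like the rest of this thread rather than in Matlab, and using the inverted scores from the question (column names `a`, `b` and `dist` are my own):

```r
pairs <- read.table(text = "
codeofhammurabi.txt crete.txt      1.75212
codeofhammurabi.txt iraqi.txt      0.8812
codeofhammurabi.txt magnacarta.txt 1.0573
codeofhammurabi.txt us.txt         0.7965
crete.txt iraqi.txt                3.0344
crete.txt magnacarta.txt           1.6955
crete.txt us.txt                   2.0329
iraqi.txt magnacarta.txt           1.1983
iraqi.txt us.txt                   0.7261
magnacarta.txt us.txt              0.9125",
  col.names = c("a", "b", "dist"), stringsAsFactors = FALSE)

docs <- sort(unique(c(pairs$a, pairs$b)))
S <- matrix(0, length(docs), length(docs), dimnames = list(docs, docs))

# Index by (row name, column name) pairs and fill both triangles
S[cbind(pairs$a, pairs$b)] <- pairs$dist
S[cbind(pairs$b, pairs$a)] <- pairs$dist
S
```

Because the cells are addressed by name, the result does not depend on the order of the rows in the pair list.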

score 0 · Accepted Answer

If you want circles that represent the distances between points, this would work in R (I used the first table in the example):

data1 <- read.table(text="
codeofhammurabi.txt crete.txt      0.570737
codeofhammurabi.txt iraqi.txt      1.13475
codeofhammurabi.txt magnacarta.txt 0.945746
codeofhammurabi.txt us.txt         1.25546
crete.txt iraqi.txt                0.329545
crete.txt magnacarta.txt           0.589786
crete.txt us.txt                   0.491903
iraqi.txt magnacarta.txt           0.834488
iraqi.txt us.txt                   1.37718
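magnacarta.txt us.txt              1.09582", stringsAsFactors = TRUE) # the plotting calls below need factor columns (not the default since R 4.0)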
par(mar=c(3,7,3.5,5), las=1)

symbols(data1[,1],data1[,2], circles=data1[,3], inches=0.55, bg="lightblue", xaxt="n", yaxt="n", ylab="")
axis(1, at=data1[,1],labels=data1[,1])
axis(2, at=data1[,2],labels=data1[,2])
text(data1[,1], data1[,2], round(data1[,3],2), cex=0.9)
