If the question is "how can I do what this guy did?" (referring to xiii1408's comment on the question), then the answer is to use Gephi's built-in Force Atlas 2 algorithm on Euclidean distances of the document-topic posterior probabilities.
"This guy" is Matt Jockers, an innovative scholar in the digital humanities. Jockers works mostly in R and shares some of his code. His basic workflow seems to be:
- split the plain text into 1000-word chunks,
- remove stopwords (do not stem),
- do part-of-speech tagging and keep only the nouns,
- build a topic model (using LDA),
- calculate Euclidean distances between documents based on topic proportions, subset the distances to keep only those below a certain threshold, and then
- visualize with a force-directed graph.
Here is a small reproducible example in R (with export to Gephi) that may come close to what Jockers did:
#### prepare workspace
# delete current objects and clear RAM
rm(list = ls(all.names = TRUE))
gc()
Get the data...
#### import text
# working from the topicmodels package vignette
# using collection of abstracts of the Journal of Statistical Software (JSS) (up to 2010-08-05).
install.packages("corpus.JSS.papers", repos = "http://datacube.wu.ac.at/", type = "source")
data("JSS_papers", package = "corpus.JSS.papers")
# For reproducibility of results we use only abstracts published up to 2010-08-05
JSS_papers <- JSS_papers[JSS_papers[,"date"] < "2010-08-05",]
Clean and reshape...
#### clean and reshape data
# Omit abstracts containing non-ASCII characters in the abstracts
JSS_papers <- JSS_papers[sapply(JSS_papers[, "description"], Encoding) == "unknown",]
# remove greek characters (from math notation, etc.)
library("tm")
library("XML")
remove_HTML_markup <- function(s) tryCatch({
doc <- htmlTreeParse(paste("<!DOCTYPE html>", s),
asText = TRUE, trim = FALSE)
xmlValue(xmlRoot(doc))
}, error = function(s) s)
# create corpus
corpus <- Corpus(VectorSource(sapply(JSS_papers[, "description"], remove_HTML_markup)))
# clean corpus by removing stopwords, numbers, punctuation, whitespaces, words <3 characters long..
skipWords <- function(x) removeWords(x, stopwords("english"))
funcs <- list(tolower, removePunctuation, removeNumbers, stripWhitespace, skipWords)
corpus_clean <- tm_map(corpus, wordLengths=c(3,Inf), FUN = tm_reduce, tmFuns = funcs)
Part-of-speech tagging and subsetting the nouns...
#### Part-of-speech tagging to extract nouns only
library("NLP")
library("openNLP")
# function for POS tagging
tagPOS <- function(x) {
s <- NLP::as.String(x)
## Need sentence and word token annotations.
a1 <- NLP::Annotation(1L, "sentence", 1L, nchar(s))
a2 <- NLP::annotate(s, openNLP::Maxent_Word_Token_Annotator(), a1)
a3 <- NLP::annotate(s, openNLP::Maxent_POS_Tag_Annotator(), a2)
## Determine the distribution of POS tags for word tokens.
a3w <- a3[a3$type == "word"]
POStags <- unlist(lapply(a3w$features, `[[`, "POS"))
## Extract token/POS pairs (all of them): easy - not needed
# POStagged <- paste(sprintf("%s/%s", s[a3w], POStags), collapse = " ")
return(unlist(POStags))
}
# a loop to do POS tagging on each document and do garbage cleaning after each document
# first prepare vector to hold results (for optimal loop speed)
corpus_clean_tagged <- vector(mode = "list", length = length(corpus_clean))
# then loop through each doc and do POS tagging
# warning: this may take some time!
for(i in 1:length(corpus_clean)){
corpus_clean_tagged[[i]] <- tagPOS(corpus_clean[[i]])
print(i) # nice to see what we're up to
gc()
}
# subset nouns
wrds <- lapply(unlist(corpus_clean), function(i) unlist(strsplit(i, split = " ")))
NN <- lapply(corpus_clean_tagged, function(i) i == "NN")
Noun_strings <- lapply(1:length(wrds), function(i) unlist(wrds[i])[unlist(NN[i])])
Noun_strings <- lapply(Noun_strings, function(i) paste(i, collapse = " "))
# have a look to see what we've got
Noun_strings[[1]]
[8] "variogram model splus user quality variogram model pairs locations measurements variogram nonstationarity outliers variogram fit sets soil nitrogen concentration"
Topic modelling with latent Dirichlet allocation...
#### topic modelling with LDA (Jockers uses the lda package and MALLET, maybe topicmodels as well, I'm not sure; I'm most familiar with the topicmodels package, so that's what is used here. Note that MALLET can also be run from R: https://gist.github.com/benmarwick/4537873)
# put the cleaned documents back into a corpus for topic modelling
corpus <- Corpus(VectorSource(Noun_strings))
# create document term matrix
JSS_dtm <- DocumentTermMatrix(corpus)
# generate topic model
library("topicmodels")
k <- 30 # arbitrary number of topics (there are ways to optimise this; see the sketch just below)
JSS_TM <- LDA(JSS_dtm, k) # make topic model
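# (an aside, not from Jockers' workflow: one common way to pick k is to compare
#  held-out perplexity across a few candidate values and look for where it stops
#  improving; this is only a rough sketch, it refits the model for every candidate
#  so it can be slow, and the object names below are just for illustration)
set.seed(42)
test_idx    <- sample(nrow(JSS_dtm), ceiling(nrow(JSS_dtm) * 0.1)) # hold out ~10% of docs
candidate_k <- c(10, 20, 30, 40, 50)
perps <- sapply(candidate_k, function(kk)
  perplexity(LDA(JSS_dtm[-test_idx, ], kk), JSS_dtm[test_idx, ]))
plot(candidate_k, perps, type = "b",
     xlab = "number of topics", ylab = "held-out perplexity (lower is better)")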
# make data frame where rows are documents, columns are topics and cells
# are posterior probabilities of topics
JSS_topic_df <- setNames(as.data.frame(JSS_TM@gamma), paste0("topic_",1:k))
# add row names that link each document to a human-readble bit of data
# in this case we'll just use a few words of the title of each paper
row.names(JSS_topic_df) <- lapply(1:length(JSS_papers[,1]), function(i) gsub("\\s","_",substr(JSS_papers[,1][[i]], 1, 60)))
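As an aside, the comment above notes that Jockers also uses MALLET, and that MALLET can be driven from R. Here is a minimal sketch of just the topic-modelling step using the mallet package; this is my own illustration rather than Jockers' code, and the argument names of mallet.import() vary a little between versions of the package, so check ?mallet.import if it complains.
library("mallet")
# mallet.import() wants a stopword file on disk, even though the text is already cleaned
stops <- tempfile()
writeLines(stopwords("english"), stops)
mallet_instances <- mallet.import(as.character(seq_along(Noun_strings)),
                                  unlist(Noun_strings),
                                  stops)
mallet_model <- MalletLDA(num.topics = k)
mallet_model$loadDocuments(mallet_instances)
mallet_model$train(200)
# document-topic proportions, roughly analogous to JSS_TM@gamma above
mallet_doc_topics <- mallet.doc.topics(mallet_model, smoothed = TRUE, normalized = TRUE)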
Using the topic probabilities as each document's "DNA", calculate the Euclidean distance of every document from every other document...
#### Euclidean distance matrix
library(cluster)
JSS_topic_df_dist <- as.matrix(daisy(JSS_topic_df, metric = "euclidean", stand = TRUE))
# Change row values to zero if greater than the row minimum plus the row standard deviation
# This is how Jockers subsets the distance matrix to keep only
# closely related documents and avoid a dense spaghetti diagram
# that's difficult to interpret (hat-tip: http://stackoverflow.com/a/16047196/1036500)
JSS_topic_df_dist[ sweep(JSS_topic_df_dist, 1, (apply(JSS_topic_df_dist,1,min) + apply(JSS_topic_df_dist,1,sd) )) > 0 ] <- 0
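As a quick sanity check (my addition, not part of Jockers' workflow), you can look at how much of the distance matrix survives the thresholding; only a small fraction of cells should remain non-zero, and those are what become edges in the graph below.
# proportion of cells left non-zero after thresholding
mean(JSS_topic_df_dist > 0)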
Visualize using a force-directed graph...
#### network diagram using Fruchterman & Reingold algorithm (Jockers uses the ForceAtlas2 algorithm which is unique to Gephi)
library(igraph)
# weighted = TRUE so the real-valued distances become edge weights rather than edge counts
g <- as.undirected(graph.adjacency(JSS_topic_df_dist, weighted = TRUE, diag = FALSE))
layout1 <- layout.fruchterman.reingold(g, niter=500)
plot(g, layout=layout1, edge.curved = TRUE, vertex.size = 1, vertex.color= "grey", edge.arrow.size = 0.1, vertex.label.dist=0.5, vertex.label = NA)
If you want to use the Force Atlas 2 algorithm in Gephi, export the R graph object to a graphml file, then open that file in Gephi and set the layout to Force Atlas 2:
# this line will export from R and make the file 'JSS.graphml' in your working directory ready to open with Gephi
write.graph(g, file="JSS.graphml", format="graphml")
Here is the Gephi plot made using the Force Atlas 2 algorithm: