r - Unstructured text data to a data frame

Question

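I am trying to use R to convert multiple lines of this text data into a data frame. I can't get read.delim to work for this. I need to get all of these lines into 10 fixed columns, with each line split at the ":".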

*** 
Type:status
Origin: abc
Text: abc
URL: 
ID: 123
Time: Fri Jul 22 15:07:37 CDT 2011
RetCount: 0
Favorite: false
MentionedEntities: 
Hashtags: 
***
***
Type:status
Origin: cde
Text: rty
URL: http://ocs
ID: 456
Time: Thu Jul 21 14:09:47 CDT 2011
RetCount: 0
Favorite: false
MentionedEntities: 
Hashtags:  rty
***
***
.
..
...

score 0 · Accepted Answer

Something like this might work:

a <- readLines(textConnection("
*** 
Type:status
Origin: abc
Text: abc
URL: 
ID: 123
Time: Fri Jul 22 15:07:37 CDT 2011
RetCount: 0
Favorite: false
MentionedEntities: 
Hashtags: 
***
***
Type:status
Origin: cde
Text: rty
URL: http://ocs
ID: 456
Time: Thu Jul 21 14:09:47 CDT 2011
RetCount: 0
Favorite: false
MentionedEntities: 
Hashtags:  rty
***
***"))


ids <- c("Type", "Origin", "Text", "URL", "ID", "Time", "RetCount", "Favorite", "MentionedEntities", "Hashtags")

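## Match each key only at the start of a line, then drop everything up to the
## first ":" so that values containing colons (Time, URL) are kept whole.
sapply(ids, function(id) {
    rows <- a[grepl(paste0("^", id, ":"), a)]
    trimws(sub("^[^:]*:", "", rows))
})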
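
Several values in the sample, such as Time and URL, contain colons of their own, so the key should be separated from the value at the first colon only. Here is a minimal sketch of that idea using base R's regexpr and substr. The vector x is just two lines copied from the sample, and the names pos, key and value are illustrative.

```r
# Two sample lines whose values contain ':' themselves.
x <- c("Time: Fri Jul 22 15:07:37 CDT 2011", "URL: http://ocs")

# Position of the first ':' in each line (fixed = TRUE: literal match, no regex).
pos <- as.integer(regexpr(":", x, fixed = TRUE))

key   <- substr(x, 1, pos - 1)                 # text before the first ':'
value <- trimws(substr(x, pos + 1, nchar(x)))  # everything after it, trimmed

key    # "Time" "URL"
value  # "Fri Jul 22 15:07:37 CDT 2011" "http://ocs"
```

Splitting on every ":" instead would cut the timestamp at "15" and the URL at "http".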

score 0 · Accepted Answer

Here is a function that seems to get the job done. It doesn't use a delimiter; instead it uses readLines and a few regular expressions.

readData <- function(file, stringsAsFactors = TRUE) 
{
    rl <- readLines(file)                                    ## read the file
    rl2 <- rl[!grepl("^[[:space:]]*([*]+[[:space:]]*)?$", rl)]  ## drop '***' separators and blank lines
    values <- sub("^[A-Za-z]+[:]( ?)+", "", rl2)             ## strip the "Key:" prefix, leaving the value
    mat <- matrix(values, ncol = 10, byrow = TRUE,           ## one row per record, 10 fields each
        dimnames = list(NULL, sub("[:].*", "", rl2[1:10])))  ## column names from the first record's keys
    as.data.frame(mat, stringsAsFactors = stringsAsFactors)
}
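
The matrix(..., ncol = 10, byrow = TRUE) step relies on every record having exactly 10 lines in the same order. If that might not hold, one possible alternative is to group lines into records at the *** separators and look fields up by name. The helper below, parse_records, is a hypothetical name and not part of the answer above. It is a sketch that assumes each field sits on a single line in "Key: value" form.

```r
# parse_records() is a hypothetical helper, not part of the original answer.
# Assumes one "Key: value" field per line and records separated by '***' lines.
parse_records <- function(lines, keys) {
    lines  <- trimws(lines)
    is_sep <- grepl("^[*]+$", lines)
    rec_id <- cumsum(is_sep)                           # changes at every separator line
    keep   <- grepl("^[A-Za-z]+:", lines)              # keep only "Key:..." lines
    key    <- sub(":.*$", "", lines[keep])             # text before the first ':'
    value  <- trimws(sub("^[^:]*:", "", lines[keep]))  # text after the first ':'
    rows <- lapply(split(seq_along(key), rec_id[keep]), function(idx) {
        # Look each expected key up by name; keys absent from a record become NA.
        setNames(value[idx][match(keys, key[idx])], keys)
    })
    as.data.frame(do.call(rbind, unname(rows)), stringsAsFactors = FALSE)
}

# Shortened sample: the second record has no Time line.
sample_lines <- c("***", "Type:status", "Origin: abc", "URL: ",
                  "Time: Fri Jul 22 15:07:37 CDT 2011", "***", "***",
                  "Type:status", "Origin: cde", "URL: http://ocs", "***")
res <- parse_records(sample_lines, c("Type", "Origin", "URL", "Time"))
```

Here res has one row per record, and the missing Time in the second record shows up as NA instead of shifting the remaining values into the wrong columns.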

Here is a run on your data, where the file "new.txt" was created from the sample data.

readData("new.txt")
#     Type Origin Text        URL  ID                         Time RetCount Favorite MentionedEntities Hashtags
# 1 status    abc  abc            123 Fri Jul 22 15:07:37 CDT 2011        0    false                           
# 2 status    cde  rty http://ocs 456 Thu Jul 21 14:09:47 CDT 2011        0    false                        rty
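
Every column in the result above is text, or a factor when stringsAsFactors = TRUE is left at its default. If numeric or logical columns are wanted, they can be converted afterwards. The sketch below uses a small stand-in data frame, df, built by hand rather than read from "new.txt". The choice of which columns to convert is an assumption based on the sample values.

```r
# Stand-in for part of the result above, with factor columns as in the default call.
df <- data.frame(
    ID       = c("123", "456"),
    RetCount = c("0", "0"),
    Favorite = c("false", "false"),
    stringsAsFactors = TRUE
)

# as.integer() on a factor returns its internal level codes, not the printed
# values, so convert to character first.
df$ID       <- as.integer(as.character(df$ID))
df$RetCount <- as.integer(as.character(df$RetCount))

# as.logical() understands the strings "true" and "false".
df$Favorite <- as.logical(as.character(df$Favorite))

str(df)
```

Calling the reader with stringsAsFactors = FALSE avoids the factor step altogether. The Time column is left as text here because parsing its weekday and month names depends on the locale.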
