r - Convert text to a matrix in R and write it to .csv

Question

I have the following text:

Anada - Asociación de nada Address: calle 13 13 Medellin Colombia Other
address: Phone.: 13-13-136131 13-13-13-1313 E-mail: anada@13.co Web page: Category: 3. Private sector Notes:
Atodo - Asociación de todo Address: calle 12 Bogota Colombia
Other address: Phone.: 12-1-23-32  E-mail: Web page: www.atodoooo.com, Category: 99. Public sector Notes: note that there are missing fields.

I would like to get a matrix with the following column names, to be exported as a .csv file:

Company, Address, Other Address, Tel, E-mail, Web page, Category, Sector, Notes

and the following rows:

Anada - Asociación de nada, calle 13 13 Medellin Colombia, 13-13-136131 13-13-13-1313,anada@13.co,,3,Private,,

Atodo - Asociación de todo,calle 12 Bogota Colombia,,12-1-23-32,www.atodoooo.com,99,Public,note that there are missing fields.

How can I do this in R?

score 1 · Accepted Answer

The following assumes that each record sits on a single line, that is, the input looks like this:

text <- c("Anada - Asociación de nada Address: calle 13 13 Medellin Colombia Other address: Phone.: 13-13-136131 13-13-13-1313 E-mail: anada@13.co Web page: Category: 3. Private sector Notes:", 
          "Atodo - Asociación de todo Address: calle 12 Bogota Colombia Other address: Phone.: 12-1-23-32  E-mail: Web page: www.atodoooo.com, Category: 99. Public sector Notes: note that there are missing fields.")

If that is not the case, but you can assume that the " Address:" field is always on the first line of a record, you can do the following:

## Starting point
text <- c("Anada - Asociación de nada Address: calle 13 13 Medellin Colombia Other", 
          "address: Phone.: 13-13-136131 13-13-13-1313 E-mail: anada@13.co Web page: Category: 3. Private sector Notes:", 
          "Atodo - Asociación de todo Address: calle 12 Bogota Colombia", 
          "Other address: Phone.: 12-1-23-32  E-mail: Web page: www.atodoooo.com, Category: 99. Public sector Notes: note that there are missing fields.")

## Locate the elements that have "Address:" and use cumsum to get an index
## Use tapply to paste the relevant vector elements together into single strings
text <- tapply(text, 
               cumsum(grepl("Address:", text)), 
               paste, collapse = " ")
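To see why cumsum() over grepl() yields a usable grouping index, here is a small sketch on shortened, made-up lines (the toy strings are assumptions for illustration only):

```r
## Toy input: two records, each broken across two lines (hypothetical data)
lines <- c("A Address: x Other", "address: Phone.: 1",
           "B Address: y", "Other address: Phone.: 2")

## TRUE where a record starts; the match is case-sensitive,
## so the lowercase "address:" in "Other address:" is not picked up
starts <- grepl("Address:", lines)

## The running total of the TRUE values gives one group id per record
group <- cumsum(starts)

## Paste the lines of each group back into a single string
joined <- tapply(lines, group, paste, collapse = " ")
joined
```

Each line inherits the id of the most recent line containing "Address:", so this only works when that field really does appear on the first line of every record.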

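From there, the approach is basically:

Extract a list of the "header" parts.
Extract a list of the associated values.
Paste them together as vectors.
Split them apart again.
Reshape the result from a "long" form to a "wide" form.

The tools used are: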

library(devtools)
library(data.table)
library(reshape2)
source_gist("11380733") ## For cSplit

The approach starts out similar to @won782's.

splitlist <- c("Address:", "Other address:", "Phone.:", "E-mail:", "Web page:",
               "Category:", "Public sector Notes:", "Private sector Notes:")
pattern <- paste0(splitlist, collapse = "|")

I've found some of the "stringr" functions to be somewhat slow, so I'll stick with base R.

X1 <- regmatches(text, gregexpr(pattern, text))
X2 <- regmatches(text, gregexpr(pattern, text), invert = TRUE)

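## Prepend a "Company:" header, trim the values, and paste each header to its value
## trimws() is base R (>= 3.2.0) and avoids relying on the unexported data.table:::trim
Combined <- Map(paste0, 
                lapply(X1, append, values = "Company:", after = 0), 
                lapply(X2, trimws))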

So far, this is what we have:

Combined
# [[1]]
# [1] "Company:Anada - Asociación de nada"    "Address:calle 13 13 Medellin Colombia"
# [3] "Other address:"                        "Phone.:13-13-136131 13-13-13-1313"    
# [5] "E-mail:anada@13.co"                    "Web page:"                            
# [7] "Category:3."                           "Private sector Notes:"                
# 
# [[2]]
# [1] "Company:Atodo - Asociación de todo"                     
# [2] "Address:calle 12 Bogota Colombia"                       
# [3] "Other address:"                                         
# [4] "Phone.:12-1-23-32"                                      
# [5] "E-mail:"                                                
# [6] "Web page:www.atodoooo.com,"                             
# [7] "Category:99."                                           
# [8] "Public sector Notes:note that there are missing fields."
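The pairing above works because regmatches() with invert = TRUE returns one more piece than there are matches: the text before the first header, then the text after each header. A minimal sketch on a made-up string (the string and the shortened pattern are illustrative assumptions, not part of the original data):

```r
## Hypothetical single record with two headers
x <- "Acme Address: 1 Main St Phone.: 555"
pattern <- "Address:|Phone.:"

m <- gregexpr(pattern, x)

## The matched headers
headers <- regmatches(x, m)[[1]]

## Everything between the matches; one element more than 'headers'
values <- regmatches(x, m, invert = TRUE)[[1]]

## Prepending "Company:" gives the leading piece a header of its own,
## so headers and values line up one to one
combined <- paste0(c("Company:", headers), trimws(values))
combined
```

Because the first piece has no header in the text itself, the "Company:" label has to be supplied by hand, which is what the append() call does in the answer's code.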

The cSplit function works nicely with data.tables, so let's use it directly.

DT <- data.table(V1 = unlist(Combined))       ## unlist the values
DT <- cSplit(DT, "V1", ":")                   ## Split by a colon
DT[, V1_1 := gsub("Public sector |Private sector ", "", V1_1)]  ## Just "notes"
DT[, id := cumsum(V1_1 == "Company")]         ## Add an id column
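For readers who cannot source the gist, the split-and-index step can be approximated in base R. This is only a sketch of the same idea on made-up strings, not the cSplit implementation; the vector `v` and the variable names are assumptions:

```r
## Hypothetical "header:value" strings, shaped like unlist(Combined)
v <- c("Company:Acme", "Address:1 Main St", "Web page:",
       "Company:Beta", "Address:2 Side Rd")

## Everything before the first colon is the key
key <- sub(":.*$", "", v)

## Everything after the first colon is the value; empty values become NA
value <- sub("^[^:]*:", "", v)
value[value == ""] <- NA

## A new record starts at every "Company" key
id <- cumsum(key == "Company")

data.frame(id, key, value, stringsAsFactors = FALSE)
```

Splitting on the first colon only is a deliberate choice here, so that a value containing a colon (a URL with "http://", for example) would stay in one piece.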

From there, you can use dcast.data.table to convert the dataset from "long" to "wide", as follows:

dcast.data.table(DT, id ~ V1_1, value.var = "V1_2")
#    id                       Address Category                    Company
# 1:  1 calle 13 13 Medellin Colombia       3. Anada - Asociación de nada
# 2:  2      calle 12 Bogota Colombia      99. Atodo - Asociación de todo
#         E-mail                               Notes Other address
# 1: anada@13.co                                  NA            NA
# 2:          NA note that there are missing fields.            NA
#                        Phone.          Web page
# 1: 13-13-136131 13-13-13-1313                NA
# 2:                 12-1-23-32 www.atodoooo.com,
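The question also asks for a .csv file, which the output above stops short of. A minimal sketch of the final step, using base R reshape() in place of dcast.data.table so that it runs without extra packages; the data frame `long`, its values, and the temporary file are hypothetical:

```r
## Hypothetical long-format data: one row per (record id, key, value)
long <- data.frame(
  id    = c(1, 1, 1, 2, 2, 2),
  key   = c("Company", "Address", "Category",
            "Company", "Address", "Category"),
  value = c("Anada", "calle 13", "3.", "Atodo", "calle 12", "99."),
  stringsAsFactors = FALSE
)

## Long to wide: one row per id, one column per key
wide <- reshape(long, idvar = "id", timevar = "key", direction = "wide")

## reshape() prefixes the new columns with "value."; strip that prefix
names(wide) <- sub("^value\\.", "", names(wide))

## Write the result, leaving missing fields empty instead of "NA"
out_file <- tempfile(fileext = ".csv")
write.csv(wide, out_file, row.names = FALSE, na = "")
```

With the data.table result from the answer, the same write.csv() call should apply to the output of dcast.data.table directly, with a real file path in place of the temporary one.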
