r - R: Extract part of a file name

Question

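I am trying to extract part of a file name using R. I have a rough idea of how to go about it from this question: "Extract part of a file name in R", but I cannot get it to work on my list of file names.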

Example file names:

"Species Count (2011-12-15-07-09-39).xls"
"Species Count 0511.xls"
"Species Count 151112.xls" 
"Species Count1011.xls" 
"Species Count2012-01.xls" 
"Species Count201207.xls" 
"Species Count2013-01-15.xls"

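The file names are inconsistent: some have a space between "Species Count" and the date, some have none, they differ in length, and some contain parentheses. I just want to extract the numeric part of each file name and keep the hyphens. For the data above, that would give: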

Expected output:

2011-12-15-07-09-39 , 0511 , 151112 , 1011 , 2012-01 , 201207 , 2013-01-15

score 5 · Accepted Answer

Use gsub() to remove all letters, spaces, periods, and parentheses. What remains is the digits and hyphens. For example:

x <- c("Species Count (2011-12-15-07-09-39).xls", "Species Count 0511.xls", 
    "Species Count 151112.xls", "Species Count1011.xls", "Species Count2012-01.xls", 
    "Species Count201207.xls", "Species Count2013-01-15.xls")

gsub("[A-Za-z \\.\\(\\)]", "", x)

[1] "2011-12-15-07-09-39" "0511"                "151112"             
[4] "1011"                "2012-01"             "201207"             
[7] "2013-01-15"
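
The same result can be reached from the other direction: instead of listing the characters to remove, delete everything that is not a digit or a hyphen. A minimal sketch, assuming the same vector x as above; the negated character class and the name dates are illustrative and not part of the original answer. Note that it would also keep a hyphen that appears anywhere else in a file name.

```r
x <- c("Species Count (2011-12-15-07-09-39).xls", "Species Count 0511.xls",
       "Species Count 151112.xls", "Species Count1011.xls",
       "Species Count2012-01.xls", "Species Count201207.xls",
       "Species Count2013-01-15.xls")

# Delete every character that is neither a digit nor a hyphen.
# The hyphen is placed last in the class so it is read literally.
dates <- gsub("[^0-9-]", "", x)
print(dates)
```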

score 2 · Accepted Answer

If speed is a concern, you can use sub() with a back-reference to extract the part you need. Also note that perl=TRUE is often faster (according to ?grep).

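library(microbenchmark)

# tt is the character vector of file names (its definition is not shown here)
jj <- function() sub("[^0-9]*([0-9].*[0-9])[^0-9]*", "\\1", tt, perl=TRUE)
aa <- function() regmatches(tt, regexpr("[0-9].*[0-9]", tt, perl=TRUE))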

# Run on R-2.15.2 on 32-bit Windows
microbenchmark(arun <- aa(), josh <- jj(), times=25)
# Unit: milliseconds
#           expr       min        lq    median        uq       max
# 1 arun <- aa() 2156.5024 2189.5168 2191.9972 2195.4176 2410.3255
# 2 josh <- jj()  390.0142  390.8956  391.6431  394.5439  493.2545
identical(arun, josh)  # TRUE

# Run on R-3.0.1 on 64-bit Ubuntu
microbenchmark(arun <- aa(), josh <- jj(), times=25)
# Unit: seconds
#          expr      min       lq   median       uq      max neval
#  arun <- aa() 1.794522 1.839044 1.858556 1.894946 2.207016    25
#  josh <- jj() 1.003365 1.008424 1.009742 1.059129 1.074057    25
identical(arun, josh)  # still TRUE
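
To make the back-reference easier to follow, here is a small sketch of the two calls on three of the file names from the question. The short vector tt and the names by_sub and by_regmatches are assumptions for illustration; the patterns are the ones benchmarked above.

```r
tt <- c("Species Count (2011-12-15-07-09-39).xls", "Species Count1011.xls",
        "Species Count2013-01-15.xls")

# [^0-9]*          skips the leading non-digit text
# ([0-9].*[0-9])   captures from the first digit to the last digit
# [^0-9]*          consumes the trailing non-digit text, e.g. ").xls"
# "\\1" replaces the whole match with the captured group only.
by_sub <- sub("[^0-9]*([0-9].*[0-9])[^0-9]*", "\\1", tt, perl = TRUE)

# regexpr() locates the same span and regmatches() pulls it out.
by_regmatches <- regmatches(tt, regexpr("[0-9].*[0-9]", tt, perl = TRUE))

print(by_sub)
print(identical(by_sub, by_regmatches))
```

Both patterns need at least two digits to match. For a name with no match, sub() returns the input unchanged, while regmatches() drops that element, so the two results can differ in length on messier data.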

score 1 · Accepted Answer

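Use the stringr package to extract the first run of characters made up of digits, or of digits followed by a hyphen. The digit-plus-hyphen alternative is listed first so that backtracking regex engines (such as ICU, which current stringr versions use through stringi) do not stop at the first hyphen.

library(stringr)
# ll is the character vector of file names from the question
str_extract(ll, '([0-9][-]|[0-9])+')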

[1] "2011-12-15-07-09-39" "0511"                "151112"             
[4] "1011"                "2012-01"             "201207"             
[7] "2013-01-15"
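
If you would rather not load stringr, a rough base R counterpart is sketched below. It uses regmatches() with regexpr() and perl = TRUE, and its pattern tries the digit-plus-hyphen alternative before the lone digit, which matters for backtracking engines. The shortened vector ll and the names pattern and extracted are assumptions for illustration.

```r
ll <- c("Species Count (2011-12-15-07-09-39).xls", "Species Count 0511.xls",
        "Species Count2012-01.xls")

# "digit followed by a hyphen" is tried before "digit alone", so the
# match continues across hyphens instead of stopping at the first one.
pattern <- "([0-9][-]|[0-9])+"

# regexpr() finds the first match in each string; regmatches() extracts it.
extracted <- regmatches(ll, regexpr(pattern, ll, perl = TRUE))
print(extracted)
```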
