regex - R: remove the last three dots from a string

Question

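I have a text data file that I will most likely read with readLines. The first part of each string contains a lot of gibberish, followed by the data I need. The gibberish and the data are usually separated by three dots. I would like to split the string after the last three dots, or replace the last three dots with a marker of some sort that tells R to treat everything to the left of those three dots as one column.

Here is a similar post on Stack Overflow that finds the last dot:

R: find last dot in string

However, in my case some of the data contain decimals, so finding the last dot is not sufficient. Also, I think ... has a special meaning in R, which may complicate the issue. Another potential complication is that some of the dots are bigger than others. Also, in some lines one of the three dots has been replaced with a comma.

In addition to gregexpr from the post above, I have tried using gsub, but I cannot figure out a solution.

Here is an example data set and the result I hope to achieve: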

aa = matrix(c(
'first string of junk... 0.2 0 1', 
'next string ........2 0 2', 
'%%%... ! 1959 ...  0 3 3',
'year .. 2 .,.  7 6 5',
'this_string   is . not fine .•. 4 2 3'), 
nrow=5, byrow=TRUE,
dimnames = list(NULL, c("C1")))

aa <- as.data.frame(aa, stringsAsFactors = FALSE)
aa

# desired result
#                             C1  C2 C3 C4
# 1        first string of junk  0.2  0  1
# 2            next string .....   2  0  2
# 3             %%%... ! 1959      0  3  3
# 4                 year .. 2      7  6  5
# 5 this_string   is . not fine    4  2  3

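I hope this question is not considered too specific. The text data file was created using the steps outlined in my post from yesterday about reading an MSWord file in R.

Some lines contain neither gibberish nor three dots, only data. However, that may be a complication for a follow-up post.

Thank you for any advice.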

score 5 · Accepted Answer

This is not particularly elegant, but it does the trick...

options(stringsAsFactors = FALSE)


# Search for the last run of three delimiter characters (period, comma, or
# big dot), then pull out all of the characters after it
# (in parentheses, represented in the replacement by \\1)
nums <- gsub(pattern = "^.*[.,•]{3}\\s*(.*)", replacement = "\\1", x = aa$C1)

# Use strsplit to break the results apart at spaces and just get the numbers
# Use do.call(rbind, ...) to stack the resulting pieces into a character
# matrix with one row per string
num.mat <- do.call(rbind, strsplit(nums, split = " "))


# Mash it back together with your original strings
result <- as.data.frame(cbind(aa, num.mat))

# Give it informative names
names(result) <- c("original.string", "num1", "num2", "num3")
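Note that num1 to num3 above are still character strings, and the gibberish is not separated from the data. A minimal sketch of one way to get both, assuming the delimiter is always the last run of three characters drawn from period, comma, and the big dot (the names x, pat, junk, and result2 are illustrative and not part of the answer above):

```r
# Example lines copied from the question ("\u2022" is the big dot)
x <- c("first string of junk... 0.2 0 1",
       "next string ........2 0 2",
       "%%%... ! 1959 ...  0 3 3",
       "year .. 2 .,.  7 6 5",
       "this_string   is . not fine .\u2022. 4 2 3")

# Group 1: everything before the last three-character delimiter
# Group 2: everything after it (leading whitespace skipped)
# The greedy "^(.*)" is what pushes the match to the LAST delimiter.
pat <- "^(.*)[.,\u2022]{3}\\s*(.*)$"

# Assumption: every line contains a delimiter. sub() returns a line
# unchanged when the pattern does not match, so check with grepl() first.
stopifnot(all(grepl(pat, x)))

junk <- trimws(sub(pat, "\\1", x))
nums <- sub(pat, "\\2", x)

# Split the data part on whitespace and convert each piece to numeric
num.mat <- do.call(rbind, lapply(strsplit(nums, "\\s+"), as.numeric))

result2 <- data.frame(C1 = junk, num.mat, stringsAsFactors = FALSE)
names(result2)[-1] <- c("C2", "C3", "C4")
result2
```

Lines that contain only data, which the question mentions, would fail the grepl() check here and need separate handling.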

score 2 · Accepted Answer

This gets you most of the way there, and it has no trouble with numbers that contain commas.

# First, use a regex to eliminate the bad pattern.  This regex
# eliminates any three-character combination of periods, commas,
# and big dots (•), so long as the combination is followed by 
# 0-2 spaces and then a digit.
aa.sub <- as.matrix(
  apply(aa, 1, function (x) 
    gsub('[•.,]{3}(\\s{0,2}\\d)', '\\1', x, perl = TRUE)))

# Second: it looks as though you want your data split into columns.
# So this regex splits on spaces that are (a) preceded by a letter, 
# digit, or space, and (b) followed by a digit.  The result is a 
# list, each element of which is a list containing the parts of 
# one of the strings in aa.
aa.list <- apply(aa.sub, 1, function (x) 
  strsplit(x, '(?<=[\\w\\d\\s])\\s(?=\\d)', perl = TRUE))  

# Remove the element of aa.list that comes from the second string in aa.
# There is no space before the first data column in that string.  As a
# result, strsplit() split it into three columns, not 4.  That in turn
# throws off the code below.
aa.list <- aa.list[-2]

# Make the data frame.
aa.list <- lapply(aa.list, unlist)  # convert list of lists to list of vectors
aa.df   <- data.frame(aa.list)      
aa.df   <- data.frame(t(aa.df), row.names = NULL, stringsAsFactors = FALSE)

The only thing left to do is to modify the regex for strsplit() so that it can handle the second string in aa. Or perhaps it is better just to handle cases like that manually.
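One possible adjustment, sketched here under the assumption that the example strings are representative of the real file: have gsub() leave two spaces where the delimiter was, so the lookbehind in the strsplit() pattern always finds a space in front of the first data column. The data frame aa is rebuilt from the question so that the block runs on its own; parts and aa.df2 are illustrative names.

```r
aa <- data.frame(C1 = c("first string of junk... 0.2 0 1",
                        "next string ........2 0 2",
                        "%%%... ! 1959 ...  0 3 3",
                        "year .. 2 .,.  7 6 5",
                        "this_string   is . not fine .\u2022. 4 2 3"),
                 stringsAsFactors = FALSE)

# Same delimiter pattern as above, but replace the delimiter with two
# spaces instead of deleting it outright.
aa.sub <- gsub("[\u2022.,]{3}(\\s{0,2}\\d)", "  \\1", aa$C1, perl = TRUE)

# Same split pattern as above: a space preceded by a letter, digit, or
# space and followed by a digit.
parts <- strsplit(aa.sub, "(?<=[\\w\\d\\s])\\s(?=\\d)", perl = TRUE)

# Every string should now yield one junk column plus three data columns.
stopifnot(all(lengths(parts) == 4))

aa.df2 <- data.frame(do.call(rbind, parts), stringsAsFactors = FALSE)
aa.df2$X1 <- trimws(aa.df2$X1)   # drop the padding left next to the junk
aa.df2
```

Gibberish that itself contains a letter or digit followed by a space and a digit would still be split in the wrong place, so checking the piece counts as done with stopifnot() seems prudent.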

score 0 · Accepted Answer

Reverse the string
Reverse the pattern you are searching for if necessary - it is not in your case
Reverse the result

[俳句擬似コード]

a = 'first string of junk... 0.2 0 1' // string to search
b = 'junk' // pattern to match 

ra = reverseString(a) // now equals '1 0 2.0 ...knuj fo gnirts tsrif'
rb = reverseString (b) // now equals 'knuj'

// run your regular expression search / replace - search in 'ra' for 'rb'
// put the result in rResult
// and then unreverse the result
// apologies for not knowing the syntax for 'R' regex

[/ haiku-pseudocode]
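A rough R rendering of the pseudocode above, offered as a sketch rather than a tested solution. reverse_string is a hypothetical helper written for this example, and the "|" marker assumes that character never occurs in the data. Because a three-character delimiter drawn from period, comma, and the big dot reads the same in both directions, the pattern itself does not need reversing here.

```r
# Hypothetical helper: reverse the characters of each string in a vector
reverse_string <- function(s) {
  vapply(strsplit(s, ""),
         function(ch) paste(rev(ch), collapse = ""),
         character(1))
}

a  <- "first string of junk... 0.2 0 1"   # string to search
ra <- reverse_string(a)

# In the reversed string the LAST delimiter has become the FIRST one,
# so a plain (non-global) sub() replaces exactly that occurrence.
r_result <- sub("[.,\u2022]{3}", "|", ra)

# Un-reverse the result, then split on the marker
result <- reverse_string(r_result)
pieces <- strsplit(result, "|", fixed = TRUE)[[1]]
```

Here pieces[1] holds the gibberish and pieces[2] the data part, which can then be split on whitespace as in the other answers.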
