r - Reading a text file with multiple spaces as the delimiter in R

Question

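I have a large data set consisting of about 94 columns and 3 million rows. In this file, columns are separated by either a single space or multiple spaces. I need to read some of the columns from this file in R. To do this, I tried read.table() with the options shown in the code below.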

    # Define the columns to read from the file: read the first 5 columns,
    # skip the next 24, read the next 5, and skip the last 60.
    col_classes = c(rep("character",2), rep("numeric", 3), rep("NULL",24), rep("numeric", 5), rep("NULL", 60))

    # Read the first 100 rows of the data
    data <- read.table(file, sep = " ",header = F, nrows = 100, na.strings ="", stringsAsFactors= F)

This approach does not work because the file I need to read has multiple spaces as the delimiter between some of the columns. Is there an efficient way to read this file?

score 108 · Accepted Answer

You need to change the delimiter. sep = " " refers to exactly one whitespace character, whereas sep = "" treats whitespace of any length as the delimiter.

    data <- read.table(file, sep = "", header = FALSE, nrows = 100,
                       na.strings = "", stringsAsFactors = FALSE)

From the manual:

If sep = "" (the default for read.table) the separator is "white space", that is one or more spaces, tabs, newlines or carriage returns.
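To see the difference between the two separators on a small scale, here is a minimal sketch. The two-line sample file and the names path, sample_classes and small are made up for illustration and are not the asker's data. It counts fields under each setting, then reads the file with sep = "" while skipping columns through colClasses.

```r
# Hypothetical two-line sample with a mix of single and multiple spaces
path <- tempfile(fileext = ".txt")
writeLines(c("id1 A   1.5  2 3",
             "id2  B 4    5.5   6"), path)

# With sep = " " every single space is a delimiter,
# so runs of spaces produce extra empty fields
count.fields(path, sep = " ")

# With sep = "" any run of whitespace is one delimiter: 5 fields per line
count.fields(path, sep = "")

# "NULL" entries in colClasses skip a column entirely
sample_classes <- c("character", "NULL", "numeric", "numeric", "NULL")
small <- read.table(path, sep = "", header = FALSE,
                    colClasses = sample_classes, stringsAsFactors = FALSE)
str(small)
```

For the file in the question, passing colClasses = col_classes to the same call should restrict the read to the 10 wanted columns.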

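Also, with a large data file you may want to consider data.table::fread to quickly read the data straight into a data.table. I was using this function myself this morning. It is still experimental, but I find it works very well indeed.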
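A minimal sketch of the fread route, assuming the data.table package is installed. The sample file and the names path and dt are hypothetical. As far as I know, fread treats consecutive spaces as a single separator when sep = " ", but it is worth checking that on a sample of your own file first.

```r
library(data.table)

# Hypothetical two-line sample with a mix of single and multiple spaces
path <- tempfile(fileext = ".txt")
writeLines(c("id1 A   1.5  2 3",
             "id2  B 4    5.5   6"), path)

# sep = " " is stated explicitly; select keeps only columns 1, 3 and 4
dt <- fread(path, sep = " ", header = FALSE, select = c(1, 3, 4))
print(dt)
```

The select argument plays the role that "NULL" entries in colClasses play for read.table: unwanted columns are never materialised.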

score 8 · Accepted Answer

If you want to use the tidyverse (or readr, respectively) package instead, you can use read_table.

    read_table(file, col_names = TRUE, col_types = NULL,
      locale = default_locale(), na = "NA", skip = 0, n_max = Inf,
      guess_max = min(n_max, 1000), progress = show_progress(), comment = "")

And take a look at the description here:

read_table() and read_table2() are designed to read the type of textual data where
each column is separated by one (or more) columns of space.
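A minimal sketch of read_table on a small whitespace-aligned file, assuming readr is installed. The sample contents and the names path and tbl are invented for illustration. The columns are deliberately aligned so the example should not depend on which readr version is in use.

```r
library(readr)

# Hypothetical sample: a header line plus two rows, padded with spaces
path <- tempfile(fileext = ".txt")
writeLines(c("id    x      y",
             "a     1.5    2",
             "b     10.25  30"), path)

# Column types are guessed here; pass col_types to set them explicitly
tbl <- read_table(path, col_names = TRUE)
print(tbl)
```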

score 3 · Accepted Answer

If your fields have a fixed width, you should consider using read.fwf(), which may handle missing values better.
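A minimal sketch of why fixed-width reading can help with missing values. The widths, the sample records and the names rows, path and fixed are assumptions for this example. The second record has a blank middle field, which a whitespace-delimited read would collapse, shifting the last value into the wrong column.

```r
# Build two fixed-width records: widths 3, 5 and 3 characters.
# The second record has an empty (all-blank) middle field.
rows <- sprintf("%-3s%5s%3s", c("a1", "b2"), c("1.5", ""), c("2", "30"))
path <- tempfile(fileext = ".txt")
writeLines(rows, path)

# Each field is cut by position, so the blank field stays in its column
fixed <- read.fwf(path, widths = c(3, 5, 3),
                  col.names = c("id", "x", "y"),
                  colClasses = c("character", "numeric", "numeric"),
                  strip.white = TRUE, stringsAsFactors = FALSE)
print(fixed)
```

Note that read.fwf() is generally slower than read.table(), so it may be worth testing on a subset of rows (the n argument) before running it on 3 million lines.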
