r - How to read a file with a specific format in R?

Question

I would like to read a file in which each line represents one record containing a date, text, and numbers. For example:

Fri Dec 11 12:40:01 CET 2015    Uptime: 108491  Threads: 2  Questions: 576603  Slow queries: 10  Opens: 2238  Flush tables: 1  Open tables: 7  Queries per second avg: 5.314
Fri Dec 11 12:50:01 CET 2015    Uptime: 109090  Threads: 2  Questions: 580407  Slow queries: 10  Opens: 2253  Flush tables: 1  Open tables: 6  Queries per second avg: 5.320
Fri Dec 11 13:00:01 CET 2015    Uptime: 109690  Threads: 2  Questions: 583895  Slow queries: 10  Opens: 2268  Flush tables: 1  Open tables: 8  Queries per second avg: 5.323
Fri Dec 11 13:10:01 CET 2015    Uptime: 110290  Threads: 1  Questions: 586891  Slow queries: 10  Opens: 2279  Flush tables: 1  Open tables: 6  Queries per second avg: 5.321
Fri Dec 11 13:20:01 CET 2015    Uptime: 110890  Threads: 2  Questions: 590871  Slow queries: 10  Opens: 2292  Flush tables: 1  Open tables: 5  Queries per second avg: 5.328

There is no common delimiter (as in CSV), but the format can be described fairly precisely in terms of tabs, characters, and literal text:

%DATESTRING%\tUptime: %uptime%  Threads: %threads%  Questions: %questions%  Slow queries: %slow%  Opens: %opens%  Flush tables: %flush%  Open tables: %otables%  Queries per second avg: %qps%

Is there a function that takes a format description and a file, and fills a data.frame with the data it specifies?

score 0 · Accepted Answer

Two more options:

txt <- "Fri Dec 11 12:40:01 CET 2015    Uptime: 108491  Threads: 2  Questions: 576603  Slow queries: 10  Opens: 2238  Flush tables: 1  Open tables: 7  Queries per second avg: 5.314
Fri Dec 11 12:50:01 CET 2015    Uptime: 109090  Threads: 2  Questions: 580407  Slow queries: 10  Opens: 2253  Flush tables: 1  Open tables: 6  Queries per second avg: 5.320
Fri Dec 11 13:00:01 CET 2015    Uptime: 109690  Threads: 2  Questions: 583895  Slow queries: 10  Opens: 2268  Flush tables: 1  Open tables: 8  Queries per second avg: 5.323
Fri Dec 11 13:10:01 CET 2015    Uptime: 110290  Threads: 1  Questions: 586891  Slow queries: 10  Opens: 2279  Flush tables: 1  Open tables: 6  Queries per second avg: 5.321
Fri Dec 11 13:20:01 CET 2015    Uptime: 110890  Threads: 2  Questions: 590871  Slow queries: 10  Opens: 2292  Flush tables: 1  Open tables: 5  Queries per second avg: 5.328"

## first just tack on the date label
txt <- gsub('^', 'Date: ', readLines(textConnection(txt)))

Option 1

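## split each line into "key: value" fields on runs of two or more whitespace characters
sp <- strsplit(txt, '\\s{2,}')
## keep only the value part of each field
out <- lapply(sp, function(x) gsub('([\\w ]+:)\\s+(.*)$', '\\2', x, perl = TRUE))
## bind the rows and take the column names from the key part of the first line
dd <- setNames(do.call('rbind.data.frame', out),
               gsub('([\\w ]+):\\s+(.*)$', '\\1', sp[[1]], perl = TRUE))
## convert every column except Date to numeric
dd[, -1] <- lapply(dd[, -1], function(x) as.numeric(as.character(x)))
dd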

Option 2: this uses the yaml package, but it is much simpler and does the type conversion for you

## put each "key: value" field on its own line so that every record is a small YAML mapping
yml <- gsub('\\s{2,}', '\n', txt)
do.call('rbind.data.frame', lapply(yml, yaml::yaml.load))

#                    Date Uptime Threads Questions Slow queries Opens Flush tables
# 1 Fri Dec 11 12:40:01 CET 2015 108491       2    576603           10  2238            1
# 2 Fri Dec 11 12:50:01 CET 2015 109090       2    580407           10  2253            1
# 3 Fri Dec 11 13:00:01 CET 2015 109690       2    583895           10  2268            1
# 4 Fri Dec 11 13:10:01 CET 2015 110290       1    586891           10  2279            1
# 5 Fri Dec 11 13:20:01 CET 2015 110890       2    590871           10  2292            1
#   Open tables Queries per second avg
# 1           7                  5.314
# 2           6                  5.320
# 3           8                  5.323
# 4           6                  5.321
# 5           5                  5.328
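Both options above start from an inline string. As a minimal sketch, the same split-on-whitespace idea can be applied to a file on disk. The helper name `read_status_log()` and the temporary file are assumptions made for illustration and are not part of the original answer. Because the question describes a tab after the date, the sketch splits on either a tab or a run of two or more whitespace characters.

```r
# Hypothetical helper (name is an assumption): parse "key: value" fields
# separated by a tab or by two or more whitespace characters.
read_status_log <- function(path) {
  lines <- readLines(path)
  lines <- lines[nzchar(lines)]               # drop empty lines
  lines <- paste0("Date: ", lines)            # label the leading date field
  fields <- strsplit(lines, "\t|\\s{2,}")     # one character vector per line
  keys   <- sub(":\\s+.*$", "", fields[[1]])  # text before the first ": "
  values <- lapply(fields, function(x) sub("^[^:]+:\\s+", "", x))
  out <- as.data.frame(do.call(rbind, values), stringsAsFactors = FALSE)
  names(out) <- keys
  out[-1] <- lapply(out[-1], as.numeric)      # everything except Date is numeric
  out
}

# Usage with a temporary file standing in for the real log
path <- tempfile(fileext = ".log")
writeLines(c(
  "Fri Dec 11 12:40:01 CET 2015\tUptime: 108491  Threads: 2  Questions: 576603  Slow queries: 10  Opens: 2238  Flush tables: 1  Open tables: 7  Queries per second avg: 5.314",
  "Fri Dec 11 12:50:01 CET 2015\tUptime: 109090  Threads: 2  Questions: 580407  Slow queries: 10  Opens: 2253  Flush tables: 1  Open tables: 6  Queries per second avg: 5.320"
), path)
dd_file <- read_status_log(path)
```

One caveat with this approach: it assumes every line has the same fields in the same order, and a date with a space-padded day (two spaces before a single-digit day) would be split in the wrong place.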

score 0 · Accepted Answer

The tidyr package contains some utility functions that may help with this, although I would not be surprised if there were other, dedicated tools built for this job.

In this case, we start by loading the data from a string:

raw <- 'Fri Dec 11 12:40:01 CET 2015    Uptime: 108491  Threads: 2     Questions: 576603  Slow queries: 10  Opens: 2238  Flush tables: 1  Open tables: 7  Queries per second avg: 5.314
Fri Dec 11 12:50:01 CET 2015    Uptime: 109090  Threads: 2  Questions: 580407  Slow queries: 10  Opens: 2253  Flush tables: 1  Open tables: 6  Queries per second avg: 5.320
Fri Dec 11 13:00:01 CET 2015    Uptime: 109690  Threads: 2  Questions: 583895  Slow queries: 10  Opens: 2268  Flush tables: 1  Open tables: 8  Queries per second avg: 5.323
Fri Dec 11 13:10:01 CET 2015    Uptime: 110290  Threads: 1  Questions: 586891  Slow queries: 10  Opens: 2279  Flush tables: 1  Open tables: 6  Queries per second avg: 5.321
Fri Dec 11 13:20:01 CET 2015    Uptime: 110890  Threads: 2  Questions: 590871  Slow queries: 10  Opens: 2292  Flush tables: 1  Open tables: 5  Queries per second avg: 5.328'

df <- read.csv(textConnection(raw), header = FALSE)

Here I used read.csv to get the data as a data frame, but you could also use readLines and put the result into a data frame yourself.
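A minimal sketch of the readLines route, using a shortened two-line sample. The names `raw_short` and `df_lines` are chosen here for illustration so they do not overwrite the objects above. Unlike read.csv, this does not split a line if the data ever contains a comma.

```r
# Shortened sample (fewer fields than the real data), for illustration only
raw_short <- "Fri Dec 11 12:40:01 CET 2015    Uptime: 108491  Threads: 2  Questions: 576603
Fri Dec 11 12:50:01 CET 2015    Uptime: 109090  Threads: 2  Questions: 580407"

con <- textConnection(raw_short)
lines <- readLines(con)   # one string per input line
close(con)

# A single character column named V1, the same shape read.csv(header = FALSE) gives here
df_lines <- data.frame(V1 = lines, stringsAsFactors = FALSE)
```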

Then we process it:

> library(tidyr)
> processed <- df %>% extract(V1,
+   c("Date", "Uptime", "Threads", "Questions"),
+   "(.*) *Uptime: (\\d+) *Threads: (\\d+) *Questions: (\\d+)")
> processed
                              Date Uptime Threads Questions
1 Fri Dec 11 12:40:01 CET 2015     108491       2    576603
2 Fri Dec 11 12:50:01 CET 2015     109090       2    580407
3 Fri Dec 11 13:00:01 CET 2015     109690       2    583895
4 Fri Dec 11 13:10:01 CET 2015     110290       1    586891
5 Fri Dec 11 13:20:01 CET 2015     110890       2    590871

From here, it should be clear how to extract the remaining columns.
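For comparison, here is a sketch of the same capture-group idea extended to all nine fields. It swaps in base R's `utils::strcapture()` (available since R 3.4.0) instead of tidyr, and converts each column to the type given in a prototype data frame. The column names (`SlowQueries`, `QPS`, and so on) and the two-line sample are assumptions made for this illustration.

```r
lines <- c(
  "Fri Dec 11 12:40:01 CET 2015    Uptime: 108491  Threads: 2     Questions: 576603  Slow queries: 10  Opens: 2238  Flush tables: 1  Open tables: 7  Queries per second avg: 5.314",
  "Fri Dec 11 12:50:01 CET 2015    Uptime: 109090  Threads: 2  Questions: 580407  Slow queries: 10  Opens: 2253  Flush tables: 1  Open tables: 6  Queries per second avg: 5.320"
)

# One capture group per field; \\s+ tolerates tabs and uneven spacing
pattern <- paste0(
  "^(.*?)\\s+Uptime: (\\d+)\\s+Threads: (\\d+)\\s+Questions: (\\d+)",
  "\\s+Slow queries: (\\d+)\\s+Opens: (\\d+)\\s+Flush tables: (\\d+)",
  "\\s+Open tables: (\\d+)\\s+Queries per second avg: ([0-9.]+)$"
)

# Zero-row prototype: its names and column types define the result
proto <- data.frame(
  Date = character(), Uptime = integer(), Threads = integer(),
  Questions = integer(), SlowQueries = integer(), Opens = integer(),
  FlushTables = integer(), OpenTables = integer(), QPS = numeric(),
  stringsAsFactors = FALSE
)

all_cols <- utils::strcapture(pattern, lines, proto, perl = TRUE)
```

The lazy `(.*?)` keeps the trailing spaces out of the Date column, which the greedy `(.*)` in the tidyr call above does not. Lines that do not match the pattern come back as rows of NA.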
