r - Reading a strangely formatted CSV file

Question

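I am trying to download some data from the statistics.gov.scot website. For example, I would like to obtain data on hospital admission rates. The query that retrieves the data table I am interested in has the following form: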

http://statistics.gov.scot/slice/observations.csv?&dataset=http%3A%2F%2Fstatistics.gov.scot%2Fdata%2Freconvictions&http%3A%2F%2Fpurl.org%2Flinked-data%2Fcube%23measureType=http%3A%2F%2Fstatistics.gov.scot%2Fdef%2Fmeasure-properties%2Fratio&http%3A%2F%2Fstatistics.gov.scot%2Fdef%2Fdimension%2Fage=http%3A%2F%2Fstatistics.gov.scot%2Fdef%2Fconcept%2Fage%2Fall&http%3A%2F%2Fstatistics.gov.scot%2Fdef%2Fdimension%2Fgender=http%3A%2F%2Fstatistics.gov.scot%2Fdef%2Fconcept%2Fgender%2Fall

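Anyone who wants to try it can use the link above. The query produces a *.csv file containing the relevant information, but the format of the file poses some challenges.

Example file

The contents of the file look like this: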

Generated by http://statistics.gov.scot,2016-03-15T10:41:28+00:00
http://statistics.gov.scot/data/hospital-admissions,Hospital Admissions
measure type,""
Admission Type,""
Age,""
Gender,""
Measure (cell values): ,"Ratio (Rate Per 100,000 Population)"

,,http://reference.data.gov.uk/id/year/2002,http://reference.data.gov.uk/id/year/2003,http://reference.data.gov.uk/id/year/2004,http://reference.data.gov.uk/id/year/2005,http://reference.data.gov.uk/id/year/2006,http://reference.data.gov.uk/id/year/2007,http://reference.data.gov.uk/id/year/2008,http://reference.data.gov.uk/id/year/2009,http://reference.data.gov.uk/id/year/2010,http://reference.data.gov.uk/id/year/2011,http://reference.data.gov.uk/id/year/2012
http://purl.org/linked-data/sdmx/2009/dimension#refArea,Reference Area,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012
http://statistics.gov.scot/id/statistical-geography/S92000003,Scotland,"9,351","9,262","9,261","9,347","9,723","10,517","10,293","10,150","10,024","10,232","10,194"

When imported into Excel:

However, when imported into R via read.csv, it looks like this:

> head(problematicFile)
                                                   V1                        V2
1             Generated by http://statistics.gov.scot 2016-03-15T10:36:29+00:00
2 http://statistics.gov.scot/data/hospital-admissions       Hospital Admissions
3                                        measure type                          
4                                      Admission Type                          
5                                                 Age                          
6                                              Gender

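Problem

The read.csv import returns only two columns. I am guessing that the problem is related to the first columns being partly empty. I would like to read this file in a manner similar to the Excel import illustrated above. The point is that I intend to use the values in row 7 across columns A and B and, naturally, the data table below them. As for the resulting data.frame, I would like it to contain NA values wherever there are empty cells, but to have the same dimensions as the sheet in Excel. I tried: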

read.csv(file = link, header = FALSE, na.strings = "",
         fill = TRUE)

But I keep running into the same problem.

Desired result

The desired result would look like this (manually produced extract):

Generated by http://statistics.gov.scot 2016-03-15T10:41:28+00:00   NA  NA  NA  NA  NA  NA  NA
http://statistics.gov.scot/data/hospital-admissions Hospital Admissions NA  NA  NA  NA  NA  NA  NA
measure type    NA  NA  NA  NA  NA  NA  NA  NA
Admission Type  NA  NA  NA  NA  NA  NA  NA  NA
Age NA  NA  NA  NA  NA  NA  NA  NA
Gender  NA  NA  NA  NA  NA  NA  NA  NA
Measure (cell values):  Ratio (Rate Per 100,000 Population)         NA  NA  NA  NA  NA
NA  NA  NA  NA  NA  NA  NA  NA  NA
NA  NA  http://reference.data.gov.uk/id/year/2002   http://reference.data.gov.uk/id/year/2003   http://reference.data.gov.uk/id/year/2004   http://reference.data.gov.uk/id/year/2005   http://reference.data.gov.uk/id/year/2006   http://reference.data.gov.uk/id/year/2007   http://reference.data.gov.uk/id/year/2008
http://purl.org/linked-data/sdmx/2009/dimension#refArea Reference Area  2002    2003    2004    2005    2006    2007    2008
http://statistics.gov.scot/id/statistical-geography/S92000003   Scotland    9,351   9,262   9,261   9,347   9,723   10,517  10,293
http://statistics.gov.scot/id/statistical-geography/S16000082   Angus South 8,236   8,500   8,523   8,371   8,616   8,978   9,325
http://statistics.gov.scot/id/statistical-geography/S16000106   Edinburgh Northern and Leith    9,040   8,040   7,925   9,042   10,355  11,833  8,916
http://statistics.gov.scot/id/statistical-geography/S16000140   Renfrewshire South  9,391   9,122   9,491   9,586   10,425  10,900  11,065
http://statistics.gov.scot/id/statistical-geography/S16000108   Edinburgh Southern  5,878   5,910   6,101   6,035   7,426   9,343   6,766
http://statistics.gov.scot/id/statistical-geography/S16000075   Aberdeen Donside    10,047  10,963  10,629  10,512  10,383  10,787  10,685
http://statistics.gov.scot/id/statistical-geography/S16000137   Perthshire North    9,388   9,524   7,799   9,350   9,543   9,791   9,991
http://statistics.gov.scot/id/statistical-geography/S16000077   Aberdeenshire East  7,211   7,300   7,153   7,411   7,435   7,268   7,547
http://statistics.gov.scot/id/statistical-geography/S16000114   Galloway and West Dumfries  9,861   9,165   8,143   9,258   7,508   10,213  10,399
http://statistics.gov.scot/id/statistical-geography/S16000096   Dumbarton   8,703   8,570   8,727   9,310   9,389   9,885   10,237

Screenshot

To illustrate further, I would like to keep the dimensions and fill the missing values with NAs.

score 1 · Accepted Answer

To make read.csv read more than two columns, you need to specify col.names manually. Also, if you set na.strings to the empty string, empty cells will hold NA values.

read.csv(<parameters>, col.names=c("Col1","Col2".....), na.strings="")
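The reason read.csv stops at two columns is that read.table, which read.csv wraps, infers the number of columns from the first five lines of input unless col.names is longer. A minimal sketch of the effect, using a made-up miniature of the file passed through the text argument; sample_text, the shortened rows, and the names V1 to V4 are illustrative assumptions, not the real download:

```r
# Hypothetical miniature of the downloaded file: short metadata rows first,
# followed by a wider table.
sample_text <- 'Generated by http://statistics.gov.scot,2016-03-15T10:41:28+00:00
measure type,""
Admission Type,""
Age,""
Gender,""
Measure (cell values): ,"Ratio (Rate Per 100,000 Population)"
,,2002,2003
Scotland ref,Scotland,"9,351","9,262"'

# Without col.names the width is inferred from the first five lines (2 fields).
narrow <- read.csv(text = sample_text, header = FALSE, na.strings = "")

# With four explicit names, short rows are padded out to four columns.
wide <- read.csv(text = sample_text, header = FALSE, na.strings = "",
                 col.names = paste0("V", 1:4))

dim(narrow)
dim(wide)
```

Here wide keeps one row per input line, and the cells that were missing in the short metadata rows come back as NA.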

score 0 · Accepted Answer

You can use read.table with col.names to specify the number of columns.

read.table(file = link,
           fill = TRUE,
           sep = ",",
           na.strings = "",
           comment.char = "",  # read.table treats '#' as a comment by default; the refArea URI contains one
           col.names = paste("c", 1:12, sep = ""))

However, you need to know the number of columns in advance, so I am not sure this is a good solution.
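One way around knowing the width in advance is to let count.fields measure every line first and size col.names from the maximum. The following is a sketch under assumptions: path stands for a local copy of the download (a small stand-in file is written here so the example runs on its own), and the c1, c2, ... names simply follow the naming used above.

```r
# `path` is a hypothetical local copy of the downloaded file; a small stand-in
# is written to a temporary file so the sketch is self-contained.
path <- tempfile(fileext = ".csv")
writeLines(c(
  'Generated by http://statistics.gov.scot,2016-03-15T10:41:28+00:00',
  'measure type,""',
  '',
  ',,http://reference.data.gov.uk/id/year/2002,http://reference.data.gov.uk/id/year/2003',
  'http://purl.org/linked-data/sdmx/2009/dimension#refArea,Reference Area,2002,2003',
  'http://statistics.gov.scot/id/statistical-geography/S92000003,Scotland,"9,351","9,262"'
), path)

# Count the fields on every non-blank line. comment.char = "" stops the '#'
# in the refArea URI from being read as the start of a comment.
n_fields <- count.fields(path, sep = ",", quote = "\"", comment.char = "")
n_cols <- max(n_fields, na.rm = TRUE)

# Read with exactly as many column names as the widest line needs.
dat <- read.table(path, sep = ",", quote = "\"", fill = TRUE,
                  na.strings = "", comment.char = "",
                  col.names = paste0("c", seq_len(n_cols)))
dim(dat)
```

The file is scanned twice this way, which is unlikely to matter for a download of this size.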

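Another approach is to read the whole CSV in as strings. You can then preprocess it, storing the header in a separate object (a list, for example) and using the "table part" as a data frame.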
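A rough sketch of that idea follows. It assumes that the first blank line separates the metadata from the table, as in the example file, and that the row of year URIs can be dropped in favour of the "Reference Area, 2002, ..." row as the header. The vector raw_lines stands in for the result of readLines(link), and all object names are illustrative.

```r
# Stand-in for readLines(link), where `link` would be the query URL.
raw_lines <- c(
  'Generated by http://statistics.gov.scot,2016-03-15T10:41:28+00:00',
  'http://statistics.gov.scot/data/hospital-admissions,Hospital Admissions',
  'Measure (cell values): ,"Ratio (Rate Per 100,000 Population)"',
  '',
  ',,http://reference.data.gov.uk/id/year/2002,http://reference.data.gov.uk/id/year/2003',
  'http://purl.org/linked-data/sdmx/2009/dimension#refArea,Reference Area,2002,2003',
  'http://statistics.gov.scot/id/statistical-geography/S92000003,Scotland,"9,351","9,262"'
)

# Assumption: the first blank line separates the metadata from the table.
split_at <- which(raw_lines == "")[1]
meta_lines <- raw_lines[seq_len(split_at - 1)]
table_lines <- raw_lines[-seq_len(split_at)]

# Metadata as a named list: the first field is the key, the second the value.
meta_df <- read.csv(text = meta_lines, header = FALSE, stringsAsFactors = FALSE)
metadata <- setNames(as.list(meta_df[[2]]), trimws(meta_df[[1]]))

# Drop the row of year URIs and use the next row as the header.
tab <- read.csv(text = table_lines[-1], header = TRUE, check.names = FALSE,
                stringsAsFactors = FALSE)

# Values such as "9,351" carry a thousands separator; strip it before converting.
year_cols <- names(tab)[-(1:2)]
tab[year_cols] <- lapply(tab[year_cols], function(x) as.numeric(gsub(",", "", x)))
```

Because the table part is read on its own, its width is taken from its own header row, so no col.names guess is needed.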
