r - Dynamically splitting/subsetting a data frame by selected row numbers in R - parsing a Praat TextGrid

Question

I am trying to process a "segmentation file" called a .TextGrid (generated by the Praat program).

The original format looks like this:

File type = "ooTextFile"
Object class = "TextGrid"
xmin = 0 
xmax = 243.761375 
tiers? <exists> 
size = 17 
item []: 
    item [1]:
        class = "IntervalTier" 
        name = "phones" 
        xmin = 0 
        xmax = 243.761 
        intervals: size = 2505 
        intervals [1]:
            xmin = 0 
            xmax = 0.4274939687384032 
            text = "_" 
        intervals [2]:
            xmin = 0.4274939687384032 
            xmax = 0.472 
            text = "v" 
        intervals [3]:
[...]

(This continues until EOF, with intervals [3 to n] for each of the n items (annotation layers) in the file.)

Someone suggested a solution using the rPython R package.

Unfortunately:

I don't know Python well enough.
rPython is not available for R 3.0.2 (the version I am using).
My goal is to develop this parser so that the analysis is done in the R environment only.

For now, my goal is to split this file into several data frames. Each data frame should contain one item (annotation layer).

# Load the data (str_trim below comes from the stringr package)
library(stringr)
# read.delim() takes a single separator character, so split each line on "="
txtgrid <- read.delim("./xxx_01_xx.textgrid", sep="=", dec=".", header=FALSE)
# Remove leading and trailing white space from the first column
txtgrid[,1] <- str_trim(txtgrid[,1])
# Convert row.names to numeric
num.row <- as.numeric(row.names(txtgrid))
# Redefine the original textgrid and add those row numbers (I want to keep them for later processing)
txtgrid <- data.frame(num.row, txtgrid)
colnames(txtgrid) <- c("num.row", "object", "value")
head(txtgrid)

The output of head(txtgrid) is very raw, so here are the first 20 rows of the textgrid, txtgrid[1:20,]:

   num.row          object                value
1        1       File type           ooTextFile
2        2    Object class             TextGrid
3        3            xmin                   0 
4        4            xmax          243.761375 
5        5 tiers? <exists>                     
6        6            size                  17 
7        7        item []:                     
8        8       item [1]:                     
9        9           class        IntervalTier 
10      10            name              phones 
11      11            xmin                   0 
12      12            xmax             243.761 
13      13 intervals: size                2505 
14      14  intervals [1]:                     
15      15            xmin                   0 
16      16            xmax  0.4274939687384032 
17      17            text                   _ 
18      18  intervals [2]:                     
19      19            xmin  0.4274939687384032 
20      20            xmax               0.472

Now that it is preprocessed, I can do the following:

# Find the number of the rows where I want to split (i.e. Item)
tier.begining <- txtgrid[grep("item", txtgrid$object, perl=TRUE), ]
# And save those numbers in a variable
x <- as.numeric(row.names(tier.begining))

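This variable x gives me the row numbers (minus 1) at which the data should be split into several data frames.

I have 18 items, minus 1 (the first item is item [], which contains all the other items). So the vector x is: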

x
 [1]     7     8 10034 14624 19214 22444 25674 28904 31910 35140 38146 38156 38566 39040 39778 40222 44800
[18] 45018

How can I tell R to split this data frame into several data frames, textgrids$nameoftheItem, so that I get as many data frames as there are items? For example:

textgrid$phones
         item [1]:
            class = "IntervalTier" 
            name = "phones" 
            xmin = 0 
            xmax = 243.761 
            intervals: size = 2505 
            intervals [1]:
            xmin = 0 
            xmax = 0.4274939687384032 
            text = "_" 
            intervals [2]:
            xmin = 0.4274939687384032 
            xmax = 0.472 
            text = "v" 
            [...]
            intervals [n]:
textgrid$syllable
    item [2]:
            class = "IntervalTier" 
            name = "syllable" 
            xmin = 0 
            xmax = 243.761 
            intervals: size = 1200
            intervals [1]:
            xmin = 0 
            xmax = 0.500
            text = "ve" 
            intervals [2]:
            [...]
            intervals [n]:
    textgrid$item[n]

I wanted to use

txtgrid.new <- split(txtgrid, f=x)

but I get this message:

Warning message: In split.default(x = seq_len(nrow(x)), f = f, drop = drop, ...) : data length is not a multiple of split variable

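and I don't get the desired output. The row numbers are not consecutive, and the file seems to be all mixed up.

I also tried a few other functions, such as which, daply (from plyr) and subset, but couldn't get them to work properly!

Any ideas for structuring this data properly and efficiently are welcome. Ideally, I should be able to link the items (annotation layers) to one another (via the xmin & xmax of the different layers), and across multiple textgrid files, but this is just the beginning.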

score 2 · Accepted Answer

The length of the grouping vector passed to split must equal the number of rows of the data.frame.
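A toy illustration of that requirement, as a sketch only (the data frame toy and its values are made up here, they are not from the question). split recycles a grouping vector that is shorter than the data, which is why passing the start row numbers themselves scatters rows across groups instead of cutting the data at those rows:

```r
# Toy data standing in for the parsed TextGrid (hypothetical values)
toy <- data.frame(object = c("item [1]:", "name", "xmin",
                             "item [2]:", "name", "xmin"),
                  value  = c("", "phones", "0", "", "syllable", "0"),
                  stringsAsFactors = FALSE)

# Row numbers where each item starts, analogous to x in the question
starts <- grep("item", toy$object)

# Wrong: the length-2 vector is recycled along the 6 rows,
# so rows alternate between the two groups
bad <- split(toy, starts)

# Right: one group id per row; cumsum() increments at every "item" row
group <- cumsum(grepl("item", toy$object))
good <- split(toy, group)
```

Here good is a list with one data frame per item, with the rows in their original order.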

Try the following:

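# Drop the header rows, up to and including the "item []:" line
txtgrid.sub <- txtgrid[-(1:grep("item", txtgrid$object)[1]), ]

# Positions (within txtgrid.sub) where each tier starts
starts <- grep("item", txtgrid.sub$object)

# One group id per row: tier i runs from starts[i] to just before the next start (or to the last row)
splits <- rep(seq_along(starts), times = diff(c(starts, nrow(txtgrid.sub) + 1)))

df.list <- split(txtgrid.sub, splits)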
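The question also asked for access by tier name (textgrid$phones). A minimal sketch of one way to get that, assuming each tier contains a row whose object is "name"; the toy txtgrid below is a shortened, hypothetical stand-in for the real one, and tiers is a name introduced only for this example:

```r
# Toy stand-in for txtgrid from the question (hypothetical, shortened)
txtgrid <- data.frame(
  num.row = 1:13,
  object  = c("File type", "size", "item []:",
              "item [1]:", "class", "name", "xmin", "xmax",
              "item [2]:", "class", "name", "xmin", "xmax"),
  value   = c("ooTextFile", "2", "",
              "", "IntervalTier", "phones", "0", "243.761",
              "", "IntervalTier", "syllable", "0", "243.761"),
  stringsAsFactors = FALSE)

# Keep only the rows after the "item []:" container line
body <- txtgrid[-seq_len(grep("item", txtgrid$object)[1]), ]

# One group id per row, incremented at each "item [k]:" line
tiers <- split(body, cumsum(grepl("item", body$object)))

# Name each list element after the first "name" row of its tier,
# trimming stray white space around the value
names(tiers) <- vapply(tiers, function(d) {
  gsub("^\\s+|\\s+$", "", d$value[d$object == "name"][1])
}, character(1))

tiers$phones
```

If two tiers share a name, the list will contain duplicate names and $ will only return the first match, so make.unique may be needed in that case.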

Edit:

You can then simplify the data as follows:

l <- lapply(df.list, function(x) {
  tmp <- as.data.frame(t(x[, 3, drop=FALSE]), stringsAsFactors=FALSE)
  names(tmp) <- make.unique(make.names(x[, 2]))
  tmp
})

library(plyr)
do.call(rbind.fill, l)


  item..1..        class     name xmin    xmax intervals..size
1      <NA> IntervalTier   phones    0 243.761            2505
2      <NA> IntervalTier syllable    0 243.761            2505
  intervals..1.. xmin.1             xmax.1 text intervals..2..
1           <NA>      0 0.4274939687384032    _           <NA>
2           <NA>      0 0.4274939687384032    _           <NA>
              xmin.2 xmax.2
1 0.4274939687384032  0.472
2               <NA>   <NA>

Note: I used dummy data for the above.
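For linking tiers by xmin and xmax, a long layout with one row per interval may be easier to work with than the wide one above. A rough sketch, assuming an interval tier in which every "intervals [k]:" row is followed by xmin, xmax and text in that fixed order (point tiers are not handled); the toy tier and the name intervals are made up for this example:

```r
# Toy tier in the object/value layout used above (hypothetical values)
tier <- data.frame(
  object = c("item [1]:", "class", "name", "xmin", "xmax", "intervals: size",
             "intervals [1]:", "xmin", "xmax", "text",
             "intervals [2]:", "xmin", "xmax", "text"),
  value  = c("", "IntervalTier", "phones", "0", "243.761", "2",
             "", "0", "0.4275", "_",
             "", "0.4275", "0.472", "v"),
  stringsAsFactors = FALSE)

# Rows that open an interval look like "intervals [k]:"
# ("intervals: size" does not match because of the colon)
starts <- grep("^intervals \\[[0-9]+\\]", tier$object)

# The three rows after each opener hold xmin, xmax and text, in that order
intervals <- data.frame(
  tier = tier$value[tier$object == "name"][1],
  xmin = as.numeric(tier$value[starts + 1]),
  xmax = as.numeric(tier$value[starts + 2]),
  text = tier$value[starts + 3],
  stringsAsFactors = FALSE)
```

Applying the same steps to every element of df.list and stacking the results with rbind would give a single table that can be filtered by tier and time.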

score 0 · Accepted Answer

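It looks like you have found a good solution elsewhere, but I figured I could put this here for reference:

I recently finished the first working version of a JSON converter for Praat objects, which could be used for this. You can save a TextGrid as a JSON file using the script save_as_json.praat included in this plugin (again: I am the author of that plugin).

Copied from this other answer to a similar question: once you have installed the plugin, you can use the script from Praat's Save menu, or run it from another script like this: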

runScript: preferencesDirectory$ + "/plugin_jjatools/save_as_json.praat",
  ..."/output/path", "Pretty printed"

Once that is done, you can read it into R using rjson like this:

> library(rjson)
> tg <- fromJSON(file='/path/to/your_textgrid.json')
> str(tg)
List of 5
$ File type   : chr "json"
$ Object class: chr "TextGrid"
$ start       : num 0
$ end         : num 1.82
$ tiers       :List of 2
    ..$ :List of 5
    .. ..$ class    : chr "IntervalTier"
    .. ..$ name     : chr "keyword"
    .. ..$ start    : num 0
    .. ..$ end      : num 1.82
    .. ..$ intervals:List of 3
    .. .. ..$ :List of 3
    .. .. .. ..$ start: num 0
    .. .. .. ..$ end  : num 0.995
    .. .. .. ..$ label: chr ""
    .. .. ..$ :List of 3
    .. .. .. ..$ start: num 0.995
    .. .. .. ..$ end  : num 1.5
    .. .. .. ..$ label: chr "limite"
    .. .. ..$ :List of 3
    .. .. .. ..$ start: num 1.5
    .. .. .. ..$ end  : num 1.82
    .. .. .. ..$ label: chr ""
    ..$ :List of 5
    .. ..$ class    : chr "IntervalTier"
    .. ..$ name     : chr "segments"
    .. ..$ start    : num 0
    .. ..$ end      : num 1.82
    .. ..$ intervals:List of 8
    .. .. ..$ :List of 3
    .. .. .. ..$ start: num 0
    .. .. .. ..$ end  : num 0.995
    .. .. .. ..$ label: chr ""
    .. .. ..$ :List of 3
    .. .. .. ..$ start: num 0.995
    .. .. .. ..$ end  : num 1.07
    .. .. .. ..$ label: chr "l"
    .. .. ..$ :List of 3
    .. .. .. ..$ start: num 1.07
    .. .. .. ..$ end  : num 1.15
    .. .. .. ..$ label: chr "i"
    .. .. ..$ :List of 3
    .. .. .. ..$ start: num 1.15
    .. .. .. ..$ end  : num 1.23
    .. .. .. ..$ label: chr "m"
    .. .. ..$ :List of 3
    .. .. .. ..$ start: num 1.23
    .. .. .. ..$ end  : num 1.28
    .. .. .. ..$ label: chr "i"
    .. .. ..$ :List of 3
    .. .. .. ..$ start: num 1.28
    .. .. .. ..$ end  : num 1.37
    .. .. .. ..$ label: chr "t"
    .. .. ..$ :List of 3
    .. .. .. ..$ start: num 1.37
    .. .. .. ..$ end  : num 1.5
    .. .. .. ..$ label: chr "e"
    .. .. ..$ :List of 3
    .. .. .. ..$ start: num 1.5
    .. .. .. ..$ end  : num 1.82
    .. .. .. ..$ label: chr ""

You can then access individual intervals with, for example, tg$tiers[[tier_number]]$intervals[[interval_number]].
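If a data frame is more convenient than nested lists, the structure can be flattened. A sketch under the assumption that the list has the shape shown by str(tg) above; the list below is built by hand (shortened) so the example runs without the JSON file, and tier_to_df and all_intervals are hypothetical names:

```r
# Hand-built list mirroring the str(tg) structure shown above;
# in practice tg would come from rjson::fromJSON
tg <- list(tiers = list(
  list(class = "IntervalTier", name = "keyword", start = 0, end = 1.82,
       intervals = list(list(start = 0,     end = 0.995, label = ""),
                        list(start = 0.995, end = 1.5,   label = "limite"),
                        list(start = 1.5,   end = 1.82,  label = ""))),
  list(class = "IntervalTier", name = "segments", start = 0, end = 1.82,
       intervals = list(list(start = 0,     end = 0.995, label = ""),
                        list(start = 0.995, end = 1.07,  label = "l")))))

# Turn one tier into a data frame with one row per interval
tier_to_df <- function(tier) {
  data.frame(
    tier  = tier$name,
    start = vapply(tier$intervals, function(i) as.numeric(i$start), numeric(1)),
    end   = vapply(tier$intervals, function(i) as.numeric(i$end),   numeric(1)),
    label = vapply(tier$intervals, function(i) i$label, character(1)),
    stringsAsFactors = FALSE)
}

# Stack all tiers into a single long data frame
all_intervals <- do.call(rbind, lapply(tg$tiers, tier_to_df))
```

With everything in one table, the segments that fall inside a given keyword can be selected with an ordinary subset on tier, start and end.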
