r - R: Extracting times from an srt (subtitle) file

Question

I need to calculate the reading rate of each line of subtitles. The content of the srt (subtitle) file looks like this:

1
00:00:19,000 --> 00:00:21,989
I'm Annita McVeigh and welcome to Election Today where we'll bring you

2
00:00:22,000 --> 00:00:23,989
the latest from the campaign trail, plus debate and analysis.

3
00:00:24,000 --> 00:00:28,989
The Liberal Democrats promise to protect the pay of millions

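For example, it takes 4 seconds 989 milliseconds to say the 10 words "The Liberal Democrats promise to protect the pay of millions". The average speaking rate for these 10 words is 498.9 milliseconds per word.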

How can I read the srt file so that I get a data frame with startTime, endTime, textString and wordCount as columns and one row per subtitle entry, as shown below?

startTime<-c("00:00:19,000", "00:00:22,000", "00:00:24,000")

endTime<-c("00:00:21,989", "00:00:23,989", "00:00:28,989")

textString<-c("I'm Annita McVeigh and welcome to Election Today where we'll bring you", "the latest from the campaign trail, plus debate and analysis.", "The Liberal Democrats promise to protect the pay of millions")

wordCount<-c(12,10,10)

rate.df<-data.frame(startTime, endTime, textString, wordCount)

Given that the times are in hours:minutes:seconds,milliseconds format, how can I subtract startTime from endTime in R?

score 2 · Accepted Answer

Here is a possible solution (the code is fairly self-explanatory):

text="

1
00:00:19,000 --> 00:00:21,989
I'm Annita McVeigh and welcome to Election Today where we'll bring you

2
00:00:22,000 --> 00:00:23,989
the latest from the campaign trail, 
plus debate 
and analysis.



3
00:00:24,000 --> 00:00:28,989
The Liberal Democrats promise to protect 
the pay of millions"

con <- textConnection(text)
lines <- readLines(con)
close(con)

# the previous lines of code are just to replicate your case, and
# they should be replaced by the following single line in the real case
# lines <- readLines(srtFileName)

# split the lines into blocks at blank lines; each non-empty block is one subtitle entry
listOfEntries <- 
lapply(split(seq_along(lines),cumsum(grepl("^\\s*$",lines))),function(blockIdx){
    block <- lines[blockIdx]
    block <- block[!grepl("^\\s*$",block)]
    if(length(block) == 0){
      return(NULL)
    }
    if(length(block) < 3){
      warning("a block not respecting srt standards has been found")
    }
    # block[-(1:2)] is empty for short blocks, whereas block[3:length(block)] would count backwards
    return(data.frame(id=block[1], 
                      times=block[2], 
                      textString=paste0(block[-(1:2)],collapse="\n"),
                      stringsAsFactors = FALSE))
  })
m <- do.call(rbind,listOfEntries)


# split start and end times
tmp <- do.call(rbind,strsplit(m[,'times'],' --> '))
m$startTime <- tmp[,1]
m$endTime <- tmp[,2]

# parse times
tmp <- do.call(rbind,lapply(strsplit(m$startTime,':|,'),as.numeric))
m$fromSeconds  <- tmp %*% c(60*60,60,1,1/1000)

tmp <- do.call(rbind,lapply(strsplit(m$endTime,':|,'),as.numeric))
m$toSeconds  <- tmp %*% c(60*60,60,1,1/1000)

# compute time difference in seconds
m$timeDiffInSecs <- m$toSeconds - m$fromSeconds

# word count
m$wordCount <- vapply(gregexpr("\\W+",m$textString),length,0) + 1

# or if you consider "I'm" a single word you can remove the occurrences of ', e.g.:
#m$wordCount <- vapply(gregexpr("\\W+",gsub("'","",m$textString)),length,0) + 1

m$millisecsPerWord <- m$timeDiffInSecs * 1000 / m$wordCount

Result:

> m
  id                         times                                                             textString
2  1 00:00:19,000 --> 00:00:21,989 I'm Annita McVeigh and welcome to Election Today where we'll bring you
3  2 00:00:22,000 --> 00:00:23,989      the latest from the campaign trail, \nplus debate \nand analysis.
6  3 00:00:24,000 --> 00:00:28,989         The Liberal Democrats promise to protect \nthe pay of millions
     startTime      endTime fromSeconds toSeconds timeDiffInSecs wordCount millisecsPerWord
2 00:00:19,000 00:00:21,989          19    21.989          2.989        14         213.5000
3 00:00:22,000 00:00:23,989          22    23.989          1.989        11         180.8182
6 00:00:24,000 00:00:28,989          24    28.989          4.989        10         498.9000
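
Note that the wordCount column above (14, 11, 10) differs from the 12, 10, 10 given in the question: splitting on \W+ treats the apostrophes in "I'm" and "we'll" and the trailing period in "analysis." as word separators. If whitespace-delimited tokens are what is wanted, the following is a minimal sketch of an alternative, not part of the original answer. The helper name srt_to_seconds is hypothetical, and the input vectors are simply copied from the example so that the snippet runs on its own.

```r
# Hypothetical helper (an assumption, not from the answer above):
# convert "HH:MM:SS,mmm" stamps to seconds.
srt_to_seconds <- function(stamp) {
  parts <- do.call(rbind, lapply(strsplit(stamp, "[:,]"), as.numeric))
  as.vector(parts %*% c(3600, 60, 1, 1 / 1000))
}

startTime  <- c("00:00:19,000", "00:00:22,000", "00:00:24,000")
endTime    <- c("00:00:21,989", "00:00:23,989", "00:00:28,989")
textString <- c(
  "I'm Annita McVeigh and welcome to Election Today where we'll bring you",
  "the latest from the campaign trail, \nplus debate \nand analysis.",
  "The Liberal Democrats promise to protect \nthe pay of millions"
)

# Count whitespace-delimited tokens, so apostrophes and trailing
# punctuation do not create extra "words".
wordCount <- lengths(strsplit(trimws(textString), "\\s+"))

durationSecs <- srt_to_seconds(endTime) - srt_to_seconds(startTime)

rate.df <- data.frame(startTime, endTime, textString, wordCount,
                      millisecsPerWord = durationSecs * 1000 / wordCount,
                      stringsAsFactors = FALSE)
print(rate.df[, c("wordCount", "millisecsPerWord")])
```

With this definition the word counts come out as 12, 10 and 10, matching the question, and the third entry again gives 498.9 milliseconds per word. Whether hyphenated or numeric tokens should count as one word is a separate choice that this sketch does not address.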
