r - How to Count Text Lines in R?

Question

I would like to calculate the number of lines spoken by different speakers from a text using R (it is a transcript of parliamentary speaking records). The basic text looks like:

MR. JOHN: This activity has been going on in Tororo and I took it up with the office of the DPC. He told me that he was not aware of it.
MS. SMITH: Yes, I am aware of that. 
MR. LEHMAN: Therefore, I am seeking your guidance, Madam Speaker, and requesting that you re-assign the duty.  
MR. JOHN: Thank you

In the documents, each speaker has an identifier that begins with MR/MS and is always capitalized. I would like to create a dataset that counts the number of lines spoken by each speaker, each time they speak in a document, such that the above text would result in:

MR. JOHN: 2
MS. SMITH: 1
MR. LEHMAN: 2
MR. JOHN: 1

Thanks for pointers using R!

score 10 · Accepted Answer

You can split each string at the pattern : and then use table:

table(sapply(strsplit(x, ":"), "[[", 1))
#   MR. JOHN MR. LEHMAN  MS. SMITH 
#          2          1          1

strsplit - splits each string at : and returns a list
sapply with [[ - selects the first element of each list component
table - gets the frequencies
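
The three steps can be followed one at a time on the sample from the question. This is only a sketch: the variable names x, parts, speakers and counts are chosen here for illustration, and the text is the four lines quoted in the question.

```r
# Sample transcript lines from the question, one string per line.
x <- c(
  "MR. JOHN: This activity has been going on in Tororo and I took it up with the office of the DPC. He told me that he was not aware of it.",
  "MS. SMITH: Yes, I am aware of that.",
  "MR. LEHMAN: Therefore, I am seeking your guidance, Madam Speaker, and requesting that you re-assign the duty.",
  "MR. JOHN: Thank you"
)

parts    <- strsplit(x, ":")          # list with one character vector per line
speakers <- sapply(parts, "[[", 1)    # first piece of each line: the speaker label
counts   <- table(speakers)           # how often each label occurs

print(counts)
```

Note that this tallies how many times each speaker label occurs (MR. JOHN appears twice), not how many lines each individual turn contains.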

Edit: Following the OP's comment. You can save the transcript in a text file and use readLines to read the text into R.

tt <- readLines("./tmp.txt")

Now we need to find a pattern that filters this text down to just the lines that carry the name of the person speaking. Based on what I saw in the transcript you linked, I can think of two approaches.

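1. Check for a : and then look behind the : to see whether the preceding character is any of A-Z or [:punct:] (that is, whether the character before the : is either a capital letter or a punctuation mark; punctuation is included because some of the names have a ) before the :).

2. You can use strsplit followed by sapply (as shown below).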

Using strsplit:

# filter tt by pattern
tt.f <- tt[grepl("(?<=[A-Z[:punct:]]):", tt, perl = TRUE)]
# Now you should only have the required lines, use the command above:

out <- table(sapply(strsplit(tt.f, ":"), "[[", 1))
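
To see what the look-behind filter keeps and drops without needing the file, here is a small sketch on an in-memory vector. The lines in tt.demo are made up for illustration (including the hypothetical "(Minister)" suffix); they are not taken from the real transcript.

```r
# Hypothetical transcript lines: two speaker lines and three other lines.
tt.demo <- c(
  "MR. JOHN: Thank you",
  "(Applause)",
  "MS. SMITH (Minister): Yes, I am aware of that.",
  "He told me that he was not aware of it.",
  "The time is 10:30 now."
)

# TRUE where a ':' is directly preceded by a capital letter or punctuation.
is.speaker <- grepl("(?<=[A-Z[:punct:]]):", tt.demo, perl = TRUE)
print(is.speaker)

# Keep only the matching lines and tabulate the text before the first ':'.
out.demo <- table(sapply(strsplit(tt.demo[is.speaker], ":"), "[[", 1))
print(out.demo)
```

The "10:30" line is dropped because the character before its colon is a digit, which is neither a capital letter nor punctuation.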

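There are other possible approaches (using gsub, for example) or alternative patterns, but this should give you an idea of the approach. If your pattern differs, you will have to change it so that it captures all the required lines.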
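
As one possible reading of the gsub alternative, the speaker label can be extracted by deleting everything from the first : onward. This sketch uses sub (gsub's single-replacement sibling) on a made-up, already filtered vector named tt.f.demo; both the name and the contents are assumptions for illustration.

```r
# Hypothetical, already filtered speaker lines; the second has a ':' in the speech too.
tt.f.demo <- c(
  "MR. JOHN: Thank you",
  "MS. SMITH: Yes: I am aware of that."
)

# Remove the first ':' and everything after it, leaving only the label.
speakers.demo <- sub(":.*$", "", tt.f.demo)
print(speakers.demo)

# Tabulate the labels as before.
print(table(speakers.demo))
```

Because the match starts at the first colon in each line, a later colon inside the speech does not affect the extracted label.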

Of course, this assumes that there are no other lines like this one, for example:

"Mr. Chairman, whatever (bla bla): It is not a problem"

because our pattern will return TRUE for the ): in it. If this occurs in your text, you will have to find a better pattern.
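
One way a stricter pattern might look is to anchor the match at the start of the line and require the capitalized MR/MS prefix described in the question. This is a sketch under assumptions: the pattern strict.pat, the allowed name characters, and the optional parenthesized suffix are guesses about the transcript format and would need checking against the real file.

```r
# Hypothetical lines: two genuine speaker lines and the problem line from above.
lines.demo <- c(
  "MR. JOHN: Thank you",
  "MS. SMITH (Minister): Yes, I am aware of that.",
  "Mr. Chairman, whatever (bla bla): It is not a problem"
)

# Original look-behind pattern: also matches the third line because of '):'.
loose <- grepl("(?<=[A-Z[:punct:]]):", lines.demo, perl = TRUE)

# Assumed stricter pattern: line starts with MR. or MS., then an upper-case
# name, then an optional "(...)" suffix, then the colon.
strict.pat <- "^(MR|MS)\\. [A-Z][A-Z .'-]*(\\([^)]*\\))?:"
strict <- grepl(strict.pat, lines.demo, perl = TRUE)

print(loose)
print(strict)
```

The trade-off is that a stricter pattern will silently drop speaker lines that use a prefix or name format it does not anticipate, so it is worth inspecting the lines it rejects.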
