r - Count the number of all words in a string

Question

Is there a function to count the number of words in a string? For example:

str1 <- "How many words are in this sentence"

would return a result of 7.

score 76 · Accepted Answer

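Use the regular expression symbol \\W to match non-word characters, with + to indicate one or more in a row, along with gregexpr to find all matches in a string. The number of words is the number of word separators plus 1.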

lengths(gregexpr("\\W+", str1)) + 1

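This will fail with blank strings at the beginning or end of the character vector, when a "word" doesn't satisfy \\W's notion of non-word (one could work with other regular expressions, \\S+, [[:alpha:]], etc., but there will always be edge cases with a regex approach), and so on. It is likely more efficient than strsplit solutions, which allocate memory for each word. Regular expressions are described in ?regex.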
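To make the leading/trailing whitespace caveat concrete, here is a minimal sketch of the separator-counting idea on a padded string. The helper name `count_by_separators` and the sample string are made up for illustration; `trimws()` is base R (3.2.0 or later).

```r
# Hypothetical helper: count runs of non-word characters and add one.
count_by_separators <- function(x) lengths(gregexpr("\\W+", x)) + 1

padded <- "  How many words are in this sentence  "

# The leading and trailing runs of spaces are counted as separators too.
count_by_separators(padded)
# [1] 9

# Trimming first leaves only the six separators between the seven words.
count_by_separators(trimws(padded))
# [1] 7
```

Trimming only addresses this one edge case; empty and one-word strings still need separate handling.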

Update: As noted in the comments and in a different answer by @Andri, the approach fails with zero- and one-word strings, and with trailing punctuation:

str1 = c("", "x", "x y", "x y!" , "x y! z")
lengths(gregexpr("[A-z]\\W+", str1)) + 1L
# [1] 2 2 2 3 3

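Many of the other answers also fail in these or similar (e.g., multiple spaces) cases. I think the caveat in my original answer about the "notion of one word" covers the punctuation problem (solution: choose a different regular expression, e.g., [[:space:]]+), but the zero- and one-word cases are a problem; @Andri's solution fails to distinguish between zero and one words. So, taking a "positive" approach to finding words, one might use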

sapply(gregexpr("[[:alpha:]]+", str1), function(x) sum(x > 0))

leading to

sapply(gregexpr("[[:alpha:]]+", str1), function(x) sum(x > 0))
# [1] 0 1 2 2 3

Again, the regular expression might be refined for different notions of "word".
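As one illustration of refining the pattern, the sketch below treats an apostrophe as part of a word. The helper name `count_matches` and the second pattern are assumptions for this example, not part of the answer above.

```r
# Hypothetical helper: count positive matches of `pattern` in each string.
count_matches <- function(x, pattern) {
  vapply(gregexpr(pattern, x), function(m) sum(m > 0), integer(1))
}

x <- c("", "Don't stop me now")

# Letters only: "Don't" is split into "Don" and "t".
count_matches(x, "[[:alpha:]]+")
# [1] 0 5

# Letters plus apostrophe: "Don't" stays one word.
count_matches(x, "[[:alpha:]']+")
# [1] 0 4
```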

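I like the use of gregexpr() because it is memory efficient. An alternative that uses strsplit() (like @user813966, but with a regular expression to delimit words) and keeps the original notion of delimiting words is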

lengths(strsplit(str1, "\\W+"))
# [1] 0 1 2 2 3

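This needs to allocate new memory for each word that is created, and for the intermediate list of words. That could be relatively expensive when the data is "big", but it is probably effective and understandable for most purposes.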

score 57 · Accepted Answer

The simplest way would be:

require(stringr)
str_count("one,   two three 4,,,, 5 6", "\\S+")

... counting all sequences of non-space characters (\\S+).

But what about a small function that also lets us decide which kind of words to count, and that works on whole vectors as well?

library(stringr)

# pseudo = TRUE counts any run of non-space characters;
# pseudo = FALSE counts runs of alphabetic characters only.
nwords <- function(string, pseudo = FALSE) {
  pattern <- if (pseudo) "\\S+" else "[[:alpha:]]+"
  str_count(string, pattern)
}

nwords("one,   two three 4,,,, 5 6")
# 3

nwords("one,   two three 4,,,, 5 6", pseudo = TRUE)
# 6
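If stringr is not available, the same idea can be sketched in base R. This is only an illustrative variant: the name `nwords_base` is made up here, and it relies on `gregexpr()` returning -1 when nothing matches.

```r
# Hypothetical base-R counterpart of the function above.
nwords_base <- function(string, pseudo = FALSE) {
  pattern <- if (pseudo) "\\S+" else "[[:alpha:]]+"
  # Count only positive match positions, so "no match" (-1) contributes 0.
  vapply(gregexpr(pattern, string), function(m) sum(m > 0), integer(1))
}

x <- c("one,   two three 4,,,, 5 6", "")

nwords_base(x)
# [1] 3 0

nwords_base(x, pseudo = TRUE)
# [1] 6 0
```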

score 48 · Accepted Answer

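Use the str_count function from the stringr library with the escape sequence \w, which represents:

any "word" character (letter, digit or underscore in the current locale: in UTF-8 mode only ASCII letters and digits are considered)

Example: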

> str_count("How many words are in this sentence", '\\w+')
[1] 7

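Of the 9 other answers that I was able to test, only two (by Vincent Zoonekind and by petermeissner) worked for all inputs presented here so far, but they also require stringr.

However, only this solution works for all inputs presented so far plus inputs such as "foo+bar+baz~spam+eggs" or "Combien de mots sont dans cette phrase ?".

Benchmark: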
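To see why those two inputs separate \w+ from \S+, here is a small base-R sketch (the benchmark itself uses stringr; the helper name `count_regex_hits` is invented for this illustration).

```r
# Hypothetical helper: number of matches of `pattern` in each string.
count_regex_hits <- function(x, pattern) {
  vapply(gregexpr(pattern, x), function(m) sum(m > 0), integer(1))
}

inputs <- c("foo+bar+baz~spam+eggs",
            "Combien de mots sont dans cette phrase ?")

# Runs of word characters: "+", "~" and the lone "?" are not counted.
count_regex_hits(inputs, "\\w+")
# [1] 5 7

# Runs of non-space characters: the first input is one run, and "?" counts as a word.
count_regex_hits(inputs, "\\S+")
# [1] 1 8
```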

library(stringr)

questions <-
  c(
    "", "x", "x y", "x y!", "x y! z",
    "foo+bar+baz~spam+eggs",
    "one,   two three 4,,,, 5 6",
    "How many words are in this sentence",
    "How  many words    are in this   sentence",
    "Combien de mots sont dans cette phrase ?",
    "
    Day after day, day after day,
    We stuck, nor breath nor motion;
    "
  )

answers <- c(0, 1, 2, 2, 3, 5, 6, 7, 7, 7, 12)

score <- function(f) sum(unlist(lapply(questions, f)) == answers)

funs <-
  c(
    function(s) sapply(gregexpr("\\W+", s), length) + 1,
    function(s) sapply(gregexpr("[[:alpha:]]+", s), function(x) sum(x > 0)),
    function(s) vapply(strsplit(s, "\\W+"), length, integer(1)),
    function(s) length(strsplit(gsub(' {2,}', ' ', s), ' ')[[1]]),
    function(s) length(str_match_all(s, "\\S+")[[1]]),
    function(s) str_count(s, "\\S+"),
    function(s) sapply(gregexpr("\\W+", s), function(x) sum(x > 0)) + 1,
    function(s) length(unlist(strsplit(s," "))),
    function(s) sapply(strsplit(s, " "), length),
    function(s) str_count(s, '\\w+')
  )

unlist(lapply(funs, score))

Output (11 is the maximum possible score):

6 10 10  8  9  9  7  6  6 11

score 29 · Accepted Answer

You can use the strsplit and sapply functions

sapply(strsplit(str1, " "), length)

answered 2012-07-17T04:46:15.883
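One thing worth checking with this approach is consecutive spaces. The sketch below is illustrative only; the helper name `count_by_single_space` and the sample strings are assumptions.

```r
# Hypothetical wrapper around the one-liner above.
count_by_single_space <- function(x) sapply(strsplit(x, " "), length)

count_by_single_space("How many words")
# [1] 3

# Each extra space produces an empty string, which is counted as a word.
count_by_single_space("How  many   words")
# [1] 6

# Splitting on runs of whitespace avoids the empty strings between words.
sapply(strsplit("How  many   words", "[[:space:]]+"), length)
# [1] 3
```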

score 15 · Accepted Answer

str2 <- gsub(' {2,}',' ',str1)
length(strsplit(str2,' ')[[1]])

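The gsub(' {2,}',' ',str1) call makes sure all words are separated by exactly one space, by replacing every occurrence of two or more spaces with a single space.

The strsplit(str,' ') call splits the sentence at every space and returns the result in a list. The [[1]] grabs the vector of words out of that list. The length counts up how many words there are.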

> str1 <- "How many words are in this     sentence"
> str2 <- gsub(' {2,}',' ',str1)
> str2
[1] "How many words are in this sentence"
> strsplit(str2,' ')
[[1]]
[1] "How"      "many"     "words"    "are"      "in"       "this"     "sentence"
> strsplit(str2,' ')[[1]]
[1] "How"      "many"     "words"    "are"      "in"       "this"     "sentence"
> length(strsplit(str2,' ')[[1]])
[1] 7
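A possible gap in this recipe is leading whitespace: after squeezing, one leading space remains and yields an empty first element. The sketch below shows that case and one way around it; the helper names `squeeze` and `count_words` are made up, and `trimws()` is base R (3.2.0 or later).

```r
# Hypothetical helpers wrapping the steps described above.
squeeze <- function(x) gsub(" {2,}", " ", x)
count_words <- function(x) length(strsplit(x, " ")[[1]])

padded <- "  How many  words"

# The remaining leading space produces an empty string as the first element.
count_words(squeeze(padded))
# [1] 4

# Trimming before squeezing removes it.
count_words(squeeze(trimws(padded)))
# [1] 3
```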

score 13 · Accepted Answer

You can use str_match_all with a regular expression that identifies your words. The following works with leading, trailing, and duplicated spaces.

library(stringr)
s <-  "
  Day after day, day after day,
  We stuck, nor breath nor motion;
"
m <- str_match_all( s, "\\S+" )  # Sequences of non-spaces
length(m[[1]])

score 11 · Accepted Answer

Try this function from the stringi package

> require(stringi)
> s <- c("Lorem ipsum dolor sit amet, consectetur adipisicing elit.",
+        "nibh augue, suscipit a, scelerisque sed, lacinia in, mi.",
+        "Cras vel lorem. Etiam pellentesque aliquet tellus.",
+        "")
> stri_stats_latex(s)
    CharsWord CharsCmdEnvir    CharsWhite         Words          Cmds        Envirs 
          133             0            30            24             0             0 

score 7 · Accepted Answer

You can use the wc function in the qdap library.

> str1 <- "How many words are in this sentence"
> wc(str1)
[1] 7

score 6 · Accepted Answer

You can remove double spaces and then count the number of " " in the string to get the word count. Use stringr and rm_white {qdapRegex}:

str_count(rm_white(s), " ") +1

score 5 · Accepted Answer

Try this

length(unlist(strsplit(str1," ")))

answered 2014-07-04T06:38:32.883

score 4 · Accepted Answer

Solution 7 does not give the correct result when there is just one word. You should not simply count the elements in gregexpr's result (which is -1 if there are no matches) but count the elements > 0.

Ergo:

sapply(gregexpr("\\W+", str1), function(x) sum(x>0) ) + 1
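A quick check on short inputs, as a sketch: the fix handles the one-word case, but an empty string also comes out as 1, which is the zero-versus-one limitation mentioned in the top answer. The variable name `short` is just for this example.

```r
short <- c("", "x", "x y")

# Counting all elements of the gregexpr result: -1 ("no match") is counted as a separator.
lengths(gregexpr("\\W+", short)) + 1
# [1] 2 2 2

# Counting only positive match positions: one word is right, but "" also gives 1.
sapply(gregexpr("\\W+", short), function(x) sum(x > 0)) + 1
# [1] 1 1 2
```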

score 2 · Accepted Answer

You can use the stringr functions str_split() and boundary(), which recognize word boundaries while ignoring punctuation and extra spaces.

sapply(str_split("It's 12 o'clock already", boundary("word")), length)
#[1] 4
sapply(str_split("  It's  >12  o'clock already ?! ", boundary("word")), length)
#[1] 4

score 1 · Accepted Answer

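The following function and regular expression are useful for word counts, especially when dealing with single versus double hyphens. A single hyphen joins the parts of one word (as in e-mail), whereas a double hyphen is a punctuation delimiter that is not bounded by whitespace (for example, around parenthetical remarks).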

txt <- "Don't you think e-mail is one word--and not two!" #10 words
words <- function(txt) { 
length(attributes(gregexpr("(\\w|\\w\\-\\w|\\w\\'\\w)+",txt)[[1]])$match.length) 
}

words(txt) #10 words

Stringi is a useful package. However, it over-counts the words in this example because of the hyphen.

stringi::stri_count_words(txt) #11 words

score 1 · Accepted Answer

Use nchar

If the vector of strings is called x

(nchar(x) - nchar(gsub(' ','',x))) + 1

Find the number of spaces, then add one
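As a small sketch of how this behaves on a vector (the sample strings are assumptions for illustration), note that an empty string and a doubled space both shift the count:

```r
x <- c("", "How many words", "How  many words")

# Number of spaces in each string, plus one.
(nchar(x) - nchar(gsub(" ", "", x))) + 1
# [1] 1 3 4
```

So this works when words are separated by exactly one space and the string is non-empty; other inputs would need cleaning first.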
