ruby - Splitting a Ruby string by a list of words using a regular expression

Question

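I am trying to split a Ruby string into smaller substrings or phrases based on a list of stop words. The split method works when I define the regular expression pattern directly. However, it does not work when I try to build the pattern and evaluate it inside the split call itself.

In practice, I want to read an external file of stop words and use it to split my sentences. So I want to be able to build the pattern from the external file rather than specifying it directly. I have also noticed that "pp" and "puts" behave very differently, and I don't know why. I am using Ruby 2.0 and Notepad++ on Windows.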

 require 'pp'
 str = "The force be with you."     
 pp str.split(/(?:\bthe\b|\bwith\b)/i)
 => ["", " force be ", " you."]
 pp str.split(/(?:\bthe\b|\bwith\b)/i).collect(&:strip).reject(&:empty?)
 => ["force be", "you."]

The final array above is my desired result. However, the following does not work:

 require 'pp'
 stop_array = ["the", "with"]
 str = "The force be with you." 
 pattern = "(?:" + stop_array.map{|i| "\b#{i}\b" }.join("|") + ")"
 puts pattern
 => (?thwit)
 puts str.split(/#{pattern}/i)
 => The force be with you.
 pp pattern
 => "(?:\bthe\b|\bwith\b)"
 pp str.split(/#{pattern}/i)
 => ["The force be with you."]

Update: Using the comments below, I modified my original script. I also created a method to split the string.

 require 'pp'

 class String
   def splitstop(stopwords=[])
     stopwords_regex = /\b(?:#{ Regexp.union(*stopwords).source })\b/i
     return split(stopwords_regex).collect(&:strip).reject(&:empty?)
   end
 end

 stop_array = ["the", "with", "over"]

 pp "The force be with you.".splitstop stop_array
 => ["force be", "you."]
 pp "The quick brown fox jumps over the lazy dog.".splitstop stop_array
 => ["quick brown fox jumps", "lazy dog."]

score 4 · Accepted Answer

I'd do it this way:

str = "The force be with you."     
stop_array = %w[the with]
stopwords_regex = /(?:#{ Regexp.union(stop_array).source })/i
str.split(stopwords_regex).map(&:strip) # => ["", "force be", "you."]

When using Regexp.union, it is important to watch the actual pattern that is generated:

/(?:#{ Regexp.union(stop_array) })/i
=> /(?:(?-mix:the|with))/i

The embedded (?-mix: turns off the case-insensitivity flag inside the pattern. That can break the pattern and cause it to grab the wrong things. Instead, tell the engine to return only the pattern, without the flags:

/(?:#{ Regexp.union(stop_array).source })/i
=> /(?:the|with)/i
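
As a quick check of the point above, this sketch compares the two interpolations on the question's sample string. The variable names (with_flags, source_only, and the two result variables) are illustrative and not part of the original answer.

```ruby
stop_array = %w[the with]

# Interpolating the Regexp object embeds its flags: (?-mix:...) switches
# case-insensitivity off for the inner group, despite the outer /i.
with_flags = /(?:#{Regexp.union(stop_array)})/i

# Interpolating only the source text lets the outer /i apply to the words.
source_only = /(?:#{Regexp.union(stop_array).source})/i

str = "The force be with you."

# Only the lowercase "with" is matched here; the capitalized "The" is left alone.
flags_result = str.split(with_flags).map(&:strip)

# Both "The" and "with" are matched here.
source_result = str.split(source_only).map(&:strip)
```

With the embedded flags, the capitalized "The" stays in the first piece, which is easy to miss when every stop word in the test string happens to be lowercase.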

pattern = "(?:\bthe\b|\bwith\b)" does not work for this reason:

/#{pattern}/i # => /(?:\x08the\x08|\x08with\x08)/i

In a double-quoted string, Ruby treats "\b" as a backspace character. Use this instead:

pattern = "(?:\\bthe\\b|\\bwith\\b)"
/#{pattern}/i # => /(?:\bthe\b|\bwith\b)/i
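
This also accounts for the pp versus puts difference in the question. The sketch below is a minimal illustration, with made-up variable names: puts writes the raw backspace bytes, which a terminal typically renders by erasing the preceding character, while pp prints the inspect form, where a backspace is displayed as \b and so looks like a regex word boundary.

```ruby
backspace = "\b"   # double quotes: a single backspace character (byte 8)
boundary  = '\b'   # single quotes: a backslash followed by the letter b

stop_array = ["the", "with"]

# Built with double quotes: contains literal backspace characters.
bad_pattern  = "(?:" + stop_array.map { |w| "\b#{w}\b" }.join("|") + ")"

# Built with single-quoted pieces: contains the two-character sequence \b.
good_pattern = "(?:" + stop_array.map { |w| '\b' + w + '\b' }.join("|") + ")"

str = "The force be with you."
bad_split  = str.split(/#{bad_pattern}/i)    # no backspaces in str, so no split
good_split = str.split(/#{good_pattern}/i)   # splits on "The" and "with"
```

Comparing bad_pattern.bytes with good_pattern.bytes shows the difference directly, without depending on how the terminal renders control characters.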

score 0 · Accepted Answer

You need to escape the backslashes:

"\\b#{i}\\b"

That is:

pattern = "(?:" + stop_array.map{|i| "\\b#{i}\\b" }.join("|") + ")"

And a minor improvement/simplification:

pattern = "\\b(?:" + stop_array.join("|") + ")\\b"

Then:

str.split(/#{pattern}/i) # => ["", " force be ", " you."]

If the stop list is short, I think this is the right approach.
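
The question mentions loading the stop words from an external file. A minimal sketch of that step might look like the following. The file format (one word per line) is an assumption, and a Tempfile stands in for the real file so the example is self-contained. Regexp.union is used instead of a plain join so that any regex metacharacters in the words are escaped.

```ruby
require 'tempfile'

# Hypothetical stop-word file with one word per line (format assumed).
file = Tempfile.new('stopwords')
file.write("the\nwith\nover\n")
file.close

# Read the words, dropping surrounding whitespace and blank lines.
stop_words = File.readlines(file.path).map(&:strip).reject(&:empty?)

# Regexp.union escapes each word; .source drops the embedded flags so
# the outer /i applies, and \b keeps matches on whole words only.
stop_regex = /\b(?:#{Regexp.union(stop_words).source})\b/i

phrases = "The quick brown fox jumps over the lazy dog."
          .split(stop_regex).map(&:strip).reject(&:empty?)

file.unlink  # remove the temporary file
```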

score 0 · Accepted Answer

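require 'pp'

s = "the force be with you."
stop_words = %w|the with is|
# dynamically create a case-insensitive regexp
# (no \b anchors, so stop words also match inside longer words)
regexp = Regexp.new stop_words.join('|'), true
result = []
while (match = regexp.match(s))
  # keep the text before the match, skipping empty pieces
  result << match.pre_match unless match.pre_match.empty?
  s = match.post_match
end
# the last unmatched content, if any
result << s
# non-bang methods: compact! returns nil when there is nothing to remove
result = result.map(&:strip).reject(&:empty?)

pp result
=> ["force be", "you."]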

score 0 · Accepted Answer

stop_array = ["the", "with"]
re = Regexp.union(stop_array.map{|w| /\s*\b#{Regexp.escape(w)}\b\s*/i})

"The force be with you.".split(re) # =>
[
  "",
  "force be",
  "you."
]
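
As a small extension of the snippet above, this sketch shows two things: Regexp.escape keeps a stop word containing dots from being read as a wildcard pattern, and reject(&:empty?) drops the leading empty string. The extra stop word "a.k.a" and the second sample sentence are made up for illustration.

```ruby
# "a.k.a" is a hypothetical stop word containing regex metacharacters.
stop_array = ["the", "with", "a.k.a"]

# Each word is escaped, wrapped in word boundaries, and allowed to
# swallow surrounding whitespace so no separate strip is needed.
re = Regexp.union(stop_array.map { |w| /\s*\b#{Regexp.escape(w)}\b\s*/i })

force = "The force be with you.".split(re).reject(&:empty?)
alias_split = "Bruce Wayne a.k.a Batman".split(re)

# Without escaping, the dots would match any character; with escaping,
# a lookalike such as "axkxa" is not treated as a stop word.
lookalike = "Bruce Wayne axkxa Batman".split(re)
```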
