ruby - Parse all text files in a folder and save the text surrounding regex matches

Question

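I'm trying to write code that iterates over all the text files in a directory, parses them while searching for occurrences of a particular regex, and saves the 20 or so words before and after each match.

I'm using Dir.glob to select all the .txt files, and I want to loop over each of these text files (with each), search for occurrences of the word using a regex (line.match? File.find_all?), and then print the word and its surrounding selection to a base file.

I'm trying to puzzle it all together, but I don't think I've gotten very far. Any help is much appreciated.

This is what I have: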

    Dir::mkdir("summaries") unless File.exists?("summaries")
    Dir.chdir("summaries")
    all_text_files = Dir.glob("*.txt")

    all_text_files.each do |textfile|
        puts "currently summarizing " + textfile + "..."
        File.readlines(#{textfile}, "r").each do |line|
            if line.match /trail/ #does line.match work?
            if line =~ /trail/ #would this work?
                return true
                #save line to base textfile while referencing name of searchfile
            end
        end
    end

score 2 · Accepted Answer

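Your code looks pretty sloppy. It's full of errors. Here are a few (there may be more):

You're missing a + here:

puts "currently summarizing " textfile + "..."

Should be:

puts "currently summarizing " + textfile + "..."

#{} can only be used inside double-quoted strings. Outside a string, # starts a comment, so the rest of the line is ignored. So instead of:

File.open(#{textfile}, "r")

just do: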

File.open(textfile, "r")
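
As a minimal sketch of the difference (the file name notes.txt is made up for illustration), interpolation belongs inside the double-quoted string, while a plain variable is passed as is:

```ruby
# Hypothetical file name, used only for illustration.
textfile = "notes.txt"

# Interpolation works inside a double-quoted string...
message = "currently summarizing #{textfile}..."

# ...and gives the same result as concatenating with +.
same_message = "currently summarizing " + textfile + "..."

puts message
puts message == same_message
```

When the variable is the whole argument, as in File.open(textfile, "r"), no interpolation is needed at all.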

This doesn't make sense at all:

File.open(#{textfile}, "r")
textfile.each do line

Should be:

File.open(textfile, "r").each do |line|
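
A self-contained sketch of that loop follows. It assumes a throwaway sample file created in a temporary directory (the name sample.txt and its contents are invented here), and it uses the block form of File.open so the file is closed automatically, which is a small variation on the one-liner above:

```ruby
require "tmpdir"

matching_lines = []

Dir.mktmpdir do |dir|
  # Hypothetical sample file so the sketch runs on its own.
  path = File.join(dir, "sample.txt")
  File.write(path, "a quiet road\nthe trail begins here\nanother trail\n")

  # The block form of File.open closes the file when the block ends.
  File.open(path, "r") do |file|
    file.each do |line|
      # Keep only the lines that contain the pattern.
      matching_lines << line.chomp if line =~ /trail/
    end
  end
end

puts matching_lines
```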

This doesn't make sense either:

return true
print line

line will never be printed, because the print comes right after return true.
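
To illustrate with a sketch (the method names first_trail_line and all_trail_lines are hypothetical): inside a method, return leaves the whole method at once, so it only fits when you want the first hit. To keep every matching line, collect them instead of returning:

```ruby
# Returns the first matching line; return exits the method immediately,
# so nothing placed after it inside the loop would ever run.
def first_trail_line(lines)
  lines.each do |line|
    return line if line =~ /trail/
  end
  nil
end

# Collects every matching line instead of stopping at the first one.
def all_trail_lines(lines)
  lines.select { |line| line =~ /trail/ }
end

sample = ["a quiet road", "the trail begins", "another trail"]
puts first_trail_line(sample)
puts all_trail_lines(sample).inspect
```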

Edit:

As for your new question, both work, but match and =~ have different return values. It depends on what exactly you want to do.

foo = "foo trail bar"
foo.match /trail/ # => #<MatchData "trail">
foo =~ /trail/ # => 4
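
If you need the text around the match, the MatchData object is probably the more convenient of the two. A small sketch using the same string (the variable names are only illustrative):

```ruby
foo = "foo trail bar"

position = foo =~ /trail/      # Integer index of the match, or nil if none
data     = foo.match(/trail/)  # MatchData object, or nil if none

puts position                  # index where the match starts
puts data[0]                   # the matched text itself
puts data.begin(0)             # same index, read from the MatchData
puts data.pre_match.inspect    # everything before the match
puts data.post_match.inspect   # everything after the match
```

Both forms return nil when there is no match, so either one can be used directly as an if condition.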

score 2 · Accepted Answer

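The code below goes through each .txt file in the directory and prints every occurrence of the regex you decide on to the file base.txt, along with the name of the file it was found in. I chose to use scan, which is another regex method available that returns an array of the matching results. See the rubydoc for scan for details. You can also change the code if you only want one occurrence per file.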

##
# This method takes a string, an int and a string as arguments.
# It returns the indices that are padded on either side of the
# passed-in index by 20 (in our case), but never padded beyond the
# bounds of the passed-in text. The word parameter is used to decide
# the top index, as we do not want to include the word in our
# padding calculation.
#
# = Example
#
#  indices("hello bob how are you?", 5, "bob")
#      # => [0, 22] since the text is only 22 characters long
#
#  indices("this is a string of text that is long enough for a good example", 31, "is")
#      # => [11, 53] The extra 2 account for the length of the word 'is'.
#
def indices text, index, word
    #here's where you get the text from around the word you are interested in.
    #I have set the padding to 20 but you can change that as you see fit.
    padding = 20
    #Here we are getting the lowest point at which we can retrieve a substring.
    #We don't want to try and get an index before the beginning of our string.
    bottom_i = index - padding < 0 ? 0 : index - padding

    #Same concept as bottom except at the top end of the string.
    top_i = index + word.length + padding > text.length ? text.length : index + word.length + padding
    return bottom_i, top_i
end

#Script start.
#base.txt is opened before the chdir, so it is created in the starting directory.
base_text = File.open("base.txt", 'w')
#File.exists? was removed in Ruby 3.2; File.exist? works on old and new versions.
Dir::mkdir("summaries") unless File.exist?("summaries")
Dir.chdir("summaries")

Dir.glob("*.txt").each do |textfile|
    #File.read opens the file, reads it all and closes it again.
    whole_file = File.read(textfile)
    puts "Currently summarizing " + textfile + "..."
    #This is a placeholder for the 'current' index we are looking at.
    curr_i = 0
    #This will go through the entire file and find each occurrence of the specified regex.
    whole_file.scan(/trail/).each do |match|
      #This is the index of the matching string looking from the curr_i index onward.
      #We do this so that we don't find and report things twice.
      if i_match = whole_file.index(match, curr_i)
        top_bottom = indices(whole_file, i_match, match)
        base_text.puts(whole_file[top_bottom[0]..top_bottom[1]] + " : " + File.path(textfile))
        #We move our current index just past the match we found so that when
        #we ask for the matching index from curr_i onward, we don't get the
        #same index again.
        curr_i = i_match + match.length
        #If you only want one occurrence break here
      end
    end
    puts "Done summarizing " + textfile + "."
end
base_text.close
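
The script above pads by characters, while the question asked for roughly 20 words on either side. One possible word-based variation is sketched below. The helper name word_contexts and its count parameter are hypothetical, and the sketch assumes the pattern matches whole words (a match inside a longer word, such as "trails", would be split from the rest of that word).

```ruby
# Hypothetical helper: returns one context string per match, made of up to
# `count` words before the match, the match itself, and up to `count` words after.
def word_contexts(text, pattern, count = 20)
  contexts = []
  text.scan(pattern) do
    m = Regexp.last_match                     # MatchData for the current match
    before = m.pre_match.split.last(count)    # words preceding the match
    after  = m.post_match.split.first(count)  # words following the match
    contexts << (before + [m[0]] + after).join(" ")
  end
  contexts
end

sample = "one two three trail four five six trail seven"
puts word_contexts(sample, /trail/, 2)
```

Splitting on whitespace also collapses line breaks, so each context comes back as a single line that can be written to base.txt with puts.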
