
I am using this method to process a single text file that has about 220,000 lines. It takes a few minutes to process one, but I have lots of them. Are there any recommendations to make this process faster?

def parse_list(file_path, import=false)
  # Parse the fixed-length fields
  if File.exist?(file_path)
    result = []
    File.readlines(file_path)[5..-1].each do |rs|
      if rs.length > 140
        r = rs.strip
        unless r == ''
          filing = {
            'name'     => r[0..50].strip,
            'form'     => r[51..70].strip,
            'type'     => r[71..80].strip,
            'date'     => r[81..90].strip,
            'location' => r[91..-1].strip
          }
          result.push(filing)
        end
      end
    end
    return result
  else
    return false
  end
end

Update:

Originally, I thought there were massive time savings to be had from Nex's and thetinman's methods, so I went on to test them while keeping the parsing method consistent.

Using my original r[].strip parsing method, but with Nex's each_line block method and thetinman's foreach method:

Rehearsal ---------------------------------------------
Nex         8.260000   0.130000   8.390000 (  8.394067)
Thetinman   9.740000   0.120000   9.860000 (  9.862880)
----------------------------------- total: 18.250000sec

                user     system      total        real
Nex        14.270000   0.140000  14.410000 ( 14.397286)
Thetinman  19.030000   0.080000  19.110000 ( 19.118621)
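
For reference, the timings above come from Benchmark.bmbm, whose rehearsal pass produces the first block; the harness looks roughly like this, using the wrapper names I refer to further down (nex_method and thetinmans_method are my wrappers around each answer's parser):

require 'benchmark'

# bmbm runs a rehearsal pass to warm up, then the timed pass that matters
Benchmark.bmbm do |x|
  x.report('Nex')       { nex_method }
  x.report('Thetinman') { thetinmans_method }
end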

Running again using thetinman's unpack.map parsing method:

Rehearsal ---------------------------------------------
Nex         9.580000   0.120000   9.700000 (  9.694327)
Thetinman  11.470000   0.090000  11.560000 ( 11.567294)
----------------------------------- total: 21.260000sec

                user     system      total        real
Nex        15.480000   0.120000  15.600000 ( 15.599319)
Thetinman  18.150000   0.070000  18.220000 ( 18.217744)

unpack.map(&:strip) vs r[].strip: unpack with map does not seem to increase speed, but it is an interesting method to use in the future.
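
For reference, here is a minimal sketch of the two parsing styles side by side; the sample line is made up, padded so the fields line up with the fixed-width slices in my original method:

line = 'ACME CORP'.ljust(51) + '10-K'.ljust(20) + 'ANNUAL'.ljust(10) +
       '2013-05-11'.ljust(10) + 'NEW YORK'

# index-slicing, as in my original method
sliced = [line[0..50], line[51..70], line[71..80], line[81..90], line[91..-1]].map(&:strip)

# unpack, as in thetinman's method; 'A' fields already trim trailing spaces
unpacked = line.unpack('A51 A20 A10 A10 A*').map(&:strip)

sliced == unpacked   #=> true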

I found a different issue: having found what I thought were substantial time savings, I went on to run Nex's and thetinman's methods manually in pry. This is where I found my computer hanging, just like with my original code. So I tested again, this time including my original code.

Rehearsal ---------------------------------------------
Original    7.980000   0.140000   8.120000 (  8.118340)
Nex         9.460000   0.080000   9.540000 (  9.546889)
Thetinman  10.980000   0.070000  11.050000 ( 11.042459)
----------------------------------- total: 28.710000sec

                user     system      total        real
Original   16.280000   0.140000  16.420000 ( 16.414070)
Nex        15.370000   0.080000  15.450000 ( 15.454174)
Thetinman  20.100000   0.090000  20.190000 ( 20.195533)

My original method and Nex's and thetinman's seem comparable, with Nex's being the fastest under Benchmark. However, Benchmark does not seem to tell the whole story, because testing the code manually in pry makes every method take substantially longer, so long that I cancel out before getting the result back.

I have some remaining questions:

  1. Is there something specific about running code like this in IRB/pry that would produce these strange results and make it run massively slower?
  2. If I run original_method.count, nex_method.count, or thetinmans_method.count, they all seem to return quickly.
  3. Because of memory use and scalability, thetinman and Nex recommend against using the original method. But are there ways to test memory usage in the future with something like Benchmark? (A rough probe of what I mean is sketched after this list.)
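
For question 3, this is the kind of rough memory probe I have in mind; measure_memory is a name I made up, and it assumes a Unix-like system where ps can report resident set size:

# Hypothetical helper: report how much the process's resident set size
# (RSS) grows while a block runs. Assumes a Unix-like `ps` is available.
def measure_memory(label)
  rss = -> { `ps -o rss= -p #{Process.pid}`.to_i } # RSS in kilobytes
  before = rss.call
  result = yield
  puts "#{label}: RSS grew by #{rss.call - before} KB"
  result
end

measure_memory('original') { original_method }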

Update for Nex, using activerecord-import:

@nex, is this what you mean? This still seems to run slowly for me, but I'm not sure what you mean when you say:

import one set of data inside that block.

How do you recommend modifying it?

def parse_line(line)
  {
    'name'     => line[0..50].strip,
    'form'     => line[51..70].strip,
    'type'     => line[71..80].strip,
    'date'     => line[81..90].strip,
    'location' => line[91..-1].strip
  }
end

def import_files(file_path)
  result = []
  parse_list_nix(file_path) do |line|
    filing = parse_line(line)
    result.push(Filing.new(filing))
  end
  Filing.import result   # result is an array of new records that are all imported at once
end

Results from the activerecord-import method are, as you can see, substantially slower:

Rehearsal ------------------------------------------
import 534.840000   1.860000 536.700000 (553.507644)
------------------------------- total: 536.700000sec

             user     system      total        real
import 263.220000   1.320000 264.540000 (282.751891)

Does this slow import process seem normal?

It just seems super slow to me. I'm trying to figure out how to speed this up, but I am out of ideas.
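
One idea I have not yet verified: activerecord-import can skip per-row validations, which is supposed to speed up large imports. A sketch against my Filing model:

# Sketch: skip ActiveRecord validations during the bulk insert.
# Only safe if the parsed rows are already known to be clean.
Filing.import result, validate: false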


2 Answers


The problem is that you are filling up your memory. What do you do with that result? Does it have to stay in memory as a whole, or would processing it line by line with a block be an option?

Also, don't use readlines here; use an enumerator and process the file line by line instead, like this:

def parse_list(file_path, import=false)
  i = 0
  File.open(file_path, 'r') do |f|
    f.each_line do |line|
      line.strip!
      # skip the 5 header lines (the original code's [5..-1]) and short lines
      next if (i += 1) <= 5 || line.length < 141
      filing = { 'name'     => line[0..50].strip,
                 'form'     => line[51..70].strip,
                 'type'     => line[71..80].strip,
                 'date'     => line[81..90].strip,
                 'location' => line[91..-1].strip }
      yield(filing) if block_given?
    end
  end
end

# and calling it like this:
parse_list('/tmp/foobar') do |filing|
  Filing.import [Filing.new(filing)]
end
answered 2013-05-11 00:37:05

It's hard to verify this without sample data, but based on your original code, I'd probably write it like this:

require 'English'

# Parse the fixed-length fields
def parse_list(file_path,import=false)

  return false unless File.exist?(file_path)

  result=[]
  File.foreach(file_path) do |rs|
    next unless $INPUT_LINE_NUMBER > 5
    next unless rs.length > 140

    r = rs.strip
    if r > '' 
      # field widths mirror the original slices: A51 = r[0..50], A20 = r[51..70], and so on
      name, form, type, date, location = r.unpack('A51 A20 A10 A10 A*').map(&:strip)
      result << {
        'name'     => name,
        'form'     => form,
        'type'     => type,
        'date'     => date,
        'location' => location
      }
    end
  end

  result
end

220,000 lines isn't a big file where I come from. We get log files three times that size before mid-morning, so using any file I/O that slurps the whole file is out. Ruby's IO class has methods for line-by-line I/O, plus others that return arrays. You want the former, because it's scalable. Avoid the latter unless you can guarantee the file being read will comfortably fit in Ruby's memory.
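
To make the distinction concrete (the file name and processing step are placeholders):

# Line-by-line: only the current line is held in memory, so this scales.
File.foreach('huge.log') do |line|
  # ... process one line at a time ...
end

# Slurping: readlines loads the entire file into one array of strings.
# Avoid this unless the file is guaranteed to fit in memory.
lines = File.readlines('huge.log')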

answered 2013-05-11 00:32:22