I am using this method to process a single text file that has about 220,000 lines. It takes a few minutes to process one, but I have lots of them. Are there any recommendations to make this process faster?
def parse_list(file_path, import = false)
  # Parse the fixed-length fields
  if File.exist?(file_path)
    result = []
    File.readlines(file_path)[5..-1].each do |rs|
      if rs.length > 140
        r = rs.strip
        unless r == ''
          filing = {
            'name' => r[0..50].strip,
            'form' => r[51..70].strip,
            'type' => r[71..80].strip,
            'date' => r[81..90].strip,
            'location' => r[91..-1].strip
          }
          result.push(filing)
        end
      end
    end
    return result
  else
    return false
  end
end
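For comparison, here is a minimal sketch of what a streaming variant of parse_list might look like. It reads one line at a time with File.foreach instead of loading the whole file with File.readlines. The names each_filing, FIELDS, and the skip: keyword are made up for illustration, and this is not necessarily the code Nex or thetinman posted.

```ruby
# Field boundaries copied from parse_list above.
FIELDS = {
  'name'     => 0..50,
  'form'     => 51..70,
  'type'     => 71..80,
  'date'     => 81..90,
  'location' => 91..-1
}.freeze

# Hypothetical streaming variant: yields one hash per data line instead of
# building the whole result array in memory.
def each_filing(file_path, skip: 5)
  return enum_for(:each_filing, file_path, skip: skip) unless block_given?

  File.foreach(file_path).with_index do |raw, index|
    next if index < skip            # same header skip as [5..-1]
    next unless raw.length > 140    # same length filter as the original
    line = raw.strip
    next if line.empty?

    # to_s guards against nil when a stripped line is shorter than a range start
    yield FIELDS.each_with_object({}) { |(key, range), h| h[key] = line[range].to_s.strip }
  end
end
```

Called with a block, it processes each filing as it is read; called without one, it returns an Enumerator, so something like each_filing(path).count never holds every hash in memory at the same time.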
Update:
Originally, I thought there were massive time savings from using Nex's and thetinman's methods, so I went on to test them while keeping the parsing method consistent.
Using my original r[].strip parsing method, but with Nex's each_line block method and thetinman's foreach method:
Rehearsal ---------------------------------------------
Nex 8.260000 0.130000 8.390000 ( 8.394067)
Thetinman 9.740000 0.120000 9.860000 ( 9.862880)
----------------------------------- total: 18.250000sec
user system total real
Nex 14.270000 0.140000 14.410000 ( 14.397286)
Thetinman 19.030000 0.080000 19.110000 ( 19.118621)
Running again using thetinman's unpack.map parsing method:
Rehearsal ---------------------------------------------
Nex 9.580000 0.120000 9.700000 ( 9.694327)
Thetinman 11.470000 0.090000 11.560000 ( 11.567294)
----------------------------------- total: 21.260000sec
user system total real
Nex 15.480000 0.120000 15.600000 ( 15.599319)
Thetinman 18.150000 0.070000 18.220000 ( 18.217744)
unpack.map(&:strip) vs r[].strip: unpack with map does not seem to increase speed, but is an interesting method to use in the future.
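To make the comparison concrete, here is a rough sketch of the two parsing styles side by side. The format string 'A51A20A10A10A*' is derived from the slice ranges in my original code (51, 20, 10, and 10 characters, then the remainder); the names FORMAT, KEYS, parse_with_unpack, and parse_with_slices are illustrative only and may not match what thetinman actually wrote.

```ruby
# Field widths taken from the ranges 0..50, 51..70, 71..80, 81..90, 91..-1.
FORMAT = 'A51A20A10A10A*'
KEYS = %w[name form type date location].freeze

# unpack-based parsing: one call splits the line into all five fields.
# The 'A' directive drops trailing spaces; strip also removes leading ones.
def parse_with_unpack(line)
  KEYS.zip(line.unpack(FORMAT).map(&:strip)).to_h
end

# Slice-based parsing, as in the original parse_list.
def parse_with_slices(line)
  {
    'name'     => line[0..50].strip,
    'form'     => line[51..70].strip,
    'type'     => line[71..80].strip,
    'date'     => line[81..90].strip,
    'location' => line[91..-1].strip
  }
end
```

One caveat, assuming the files are not pure ASCII: unpack counts bytes while string slicing counts characters, so the two can disagree on lines that contain multibyte characters.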
I found a different issue: believing I had found substantial time savings, I went on to run Nex's and thetinman's methods manually in pry. There my computer hung, just as it did with my original code. So I tested again, this time including my original code.
Rehearsal ---------------------------------------------
Original 7.980000 0.140000 8.120000 ( 8.118340)
Nex 9.460000 0.080000 9.540000 ( 9.546889)
Thetinman 10.980000 0.070000 11.050000 ( 11.042459)
----------------------------------- total: 28.710000sec
user system total real
Original 16.280000 0.140000 16.420000 ( 16.414070)
Nex 15.370000 0.080000 15.450000 ( 15.454174)
Thetinman 20.100000 0.090000 20.190000 ( 20.195533)
My code and Nex's and thetinman's methods seem comparable under Benchmark, with Nex's being the fastest in the timed run (my original was fastest in the rehearsal). However, Benchmark does not seem to tell the whole story: running the same code manually in pry makes all of the methods take substantially longer, so long that I cancel before getting a result back.
I have some remaining questions:
- Is there something specific about running something like this in IRB/Pry that would produce these strange results, making the code run massively slower? If I run original_method.count, nex_method.count, or thetinmans_method.count, they all seem to return quickly.
- Due to memory issues and scalability, thetinman and Nex recommend that the original method not be used. However, in the future, are there ways to test memory usage with something like Benchmark?
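On the memory question, one rough option might be to count object allocations around a block with GC.stat, next to the wall-clock time from Benchmark.realtime. This is only a sketch: measure is a made-up helper name, the :total_allocated_objects key assumes a reasonably recent Ruby (2.2 or later, as far as I know), and the number is a count of allocated objects, not resident memory in bytes.

```ruby
require 'benchmark'

# Hypothetical helper: reports elapsed time and the number of objects
# allocated while the block runs. Allocation count is a proxy for memory
# pressure, not a measurement of actual memory usage.
def measure(label)
  GC.start                                        # start from a clean slate
  before = GC.stat(:total_allocated_objects)
  seconds = Benchmark.realtime { yield }
  allocated = GC.stat(:total_allocated_objects) - before
  { label: label, seconds: seconds, allocated_objects: allocated }
end
```

Usage would be something like measure('original') { parse_list(path) }, run once per method, then comparing the allocated_objects values.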
Update for Nex, using activerecord-import:
@nex, is this what you mean? This seems to run slow for me still, but I'm not sure what you mean when you say:
import one set of data inside that block.
How do you recommend modifying it?
def parse_line(line)
  # Returns a hash of the fixed-length fields for one line
  {
    'name' => line[0..50].strip,
    'form' => line[51..70].strip,
    'type' => line[71..80].strip,
    'date' => line[81..90].strip,
    'location' => line[91..-1].strip
  }
end

def import_files(file_path)
  result = []
  parse_list_nix(file_path) { |line|
    filing = parse_line(line)
    result.push(Filing.new(filing))
  }
  Filing.import result # result is an array of new records that are all imported at once
end
Results from the activerecord-import method are, as you can see, substantially slower:
Rehearsal ------------------------------------------
import 534.840000 1.860000 536.700000 (553.507644)
------------------------------- total: 536.700000sec
user system total real
import 263.220000 1.320000 264.540000 (282.751891)
Does this slow import process seem normal?
It just seems super slow to me. I'm trying to figure out how to speed this up, but I am out of ideas.
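One possible reading of "import one set of data inside that block" is to import in fixed-size batches rather than collecting every record first. The sketch below shows only the batching part in plain Ruby; import_in_batches, COLUMNS, and the batch size of 1,000 are assumptions for illustration, and the actual database call is left to the block.

```ruby
# Column order shared by every batch; matches the keys produced by parse_line.
COLUMNS = %w[name form type date location].freeze

# Hypothetical helper: groups parsed filings (hashes) into batches and yields
# the column names plus an array of value rows for each batch.
# Returns the total number of rows handed to the block.
def import_in_batches(filings, batch_size: 1_000)
  total = 0
  filings.each_slice(batch_size) do |batch|
    values = batch.map { |filing| filing.values_at(*COLUMNS) }
    yield COLUMNS, values
    total += values.size
  end
  total
end

# Usage (not run here; assumes a Filing model with activerecord-import loaded):
#   import_in_batches(parsed_filings) do |columns, values|
#     Filing.import(columns, values, validate: false)
#   end
```

Passing column names and plain value arrays avoids building a Filing object per line, and validate: false skips validations; whether either is acceptable depends on the model, so treat both as things to try rather than a definite fix.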