I have a list of 1.6 million lines that looks like this:
N123HN /var/foo/bar/baz/A/Alpha.file.1234.bin
N123HN /var/foo/bar/baz/A/Alpha.file.1235.bin
N123KL /var/foo/bar/baz/A/Alpha.file.1236.bin
I have a Perl script that basically greps this data on the second column as a way of looking up the value in the first column (it then does other magic with the "N123HN" value, etc.). As it stands, the app spends about 4 minutes ingesting the file and loading it into a huge hash (key/value structure). While the grep-like functions themselves are slow for obvious reasons, the slowest part of each run is this huge data ingest.
Does anyone have any clever ideas for accessing this data more quickly? Since it's just a list of two columns, a relational database seems pretty heavyweight for this use case.
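To be clearer about what I mean by "heavyweight": ideally the expensive ingest would happen only once, with later runs just doing lookups. I assume something like a tied DBM file (e.g. DB_File, if the Berkeley DB library is available) could give me that; here's a rough, untested sketch of the idea (the file name tapenumbers.db is made up):

use strict;
use warnings;
use DB_File;
use Fcntl;    # supplies O_RDWR and O_CREAT

# Tie the hash to an on-disk Berkeley DB file (name is hypothetical).
# The first run would populate it from the huge file; later runs just open it.
my %tapenumbers;
tie %tapenumbers, 'DB_File', 'tapenumbers.db', O_RDWR|O_CREAT, 0666, $DB_HASH
    or die "Cannot open tapenumbers.db: $!";

# Lookups then work like an ordinary hash access, without the 4-minute ingest:
my $tapenum = $tapenumbers{'/var/foo/bar/baz/A/Alpha.file.1234.bin'};   # "N123HN"

untie %tapenumbers;

Is something along those lines a reasonable direction, or is there a better fit for 1.6 million tiny records?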
I'm re-editing the original question here since pasting source code into the comment boxes is pretty ugly.
The algorithm I'm using to ingest the huge file is this:
while (<HUGEFILE>)
{
    # hugefile format:
    # nln N123HN ---- 1 0 1c44f5.4a6ee12 17671854355 /var/foo/bar/baz/A/Alpha.file.1234.bin 0
    next if /^\s*$/;                      # skip blank lines
    chomp;                                # remove the trailing newline
    my @auditrows = split;                # fields of the current row, split on whitespace
    my $file_url = $auditrows[7];         # /var/foo/bar/baz/A/Alpha.file.1234.bin
    my $tapenum  = $auditrows[1];         # N123HN
    $tapenumbers{$file_url} = $tapenum;   # key   = "/var/foo/bar/baz/A/Alpha.file.1234.bin"
}                                         # value = "N123HN"