I have a large (~100GB) text file structured like this:

A,foobar
A,barfoo
A,foobar
B,barfoo
B,barfoo
C,foobar

Each line is a comma-separated pair of values. The file is sorted by the first value in the pair, and the lines are of variable length. Define a group as all lines sharing a common first value; in the example above, all lines starting with "A," form one group and all lines starting with "B," form another.

The entire file is too large to fit into memory, but all the lines from any individual group will always fit into memory.

I have a routine for processing a single such group of lines and writing to a text file. My problem is that I don't know how best to read the file a group at a time. All the groups are of arbitrary, unknown size. I have considered two ways:

1) Scan the file using a BufferedReader, accumulating the lines of a group in a String or array. Whenever a line is encountered that belongs to a new group, hold that line in a temporary variable, process the previous group, clear the accumulator, add the held line, and then continue reading the new group from its second line. (A rough sketch of this is below.)

2) Scan the file using a BufferedReader; whenever a line is encountered that belongs to a new group, somehow reset the cursor so that when readLine() is next invoked it starts from the first line of the new group instead of the second. I have looked into mark() and reset(), but these require knowing the byte position of the start of the line.

I'm going to go with (1) at the moment, but I would be very grateful if someone could suggest a method that smells less.
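
For reference, (1) would look roughly like the sketch below; processGroup stands in for my existing routine, and the key extraction assumes the first value never contains a comma:

    import java.io.*;
    import java.util.*;

    void readGroups(File file) throws IOException {
        try (BufferedReader reader = new BufferedReader(new FileReader(file))) {
            List<String> group = new ArrayList<>();
            String currentKey = null;
            String line;
            while ((line = reader.readLine()) != null) {
                String key = line.substring(0, line.indexOf(','));
                if (currentKey != null && !key.equals(currentKey)) {
                    processGroup(currentKey, group);  // process the finished group
                    group.clear();
                }
                currentKey = key;
                group.add(line);  // the held-over line becomes the first line of the new group
            }
            if (!group.isEmpty()) {
                processGroup(currentKey, group);      // flush the last group
            }
        }
    }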

2 Answers

I think a PushbackReader would work:

    if (lineBelongsToNewGroup) {
        // unread() pushes onto the front of the pushback buffer, so push the
        // newline back first, then the line: the next reads return the line, then '\n'
        reader.unread('\n');
        reader.unread(lastLine.toCharArray());
    }
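
Untested sketch of how this might fit together, assuming your processing routine looks something like processGroup(key, lines). Note that you can't put a BufferedReader on top of the PushbackReader and call readLine(), because its read-ahead would bypass the pushback buffer; instead, read lines from the PushbackReader yourself, and make the pushback buffer at least one line long:

    import java.io.*;
    import java.util.*;

    // Read one line directly from the PushbackReader; returns null at end of file.
    static String readLine(PushbackReader in) throws IOException {
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = in.read()) != -1 && c != '\n') {
            if (c != '\r') sb.append((char) c);
        }
        return (c == -1 && sb.length() == 0) ? null : sb.toString();
    }

    static void readGroups(File file) throws IOException {
        int pushbackSize = 1 << 20;  // must be >= longest line + 1
        try (PushbackReader reader = new PushbackReader(
                new BufferedReader(new FileReader(file)), pushbackSize)) {
            List<String> group = new ArrayList<>();
            String currentKey = null;
            String line;
            while ((line = readLine(reader)) != null) {
                String key = line.substring(0, line.indexOf(','));
                if (currentKey != null && !key.equals(currentKey)) {
                    processGroup(currentKey, group);    // your existing routine
                    group.clear();
                    currentKey = null;
                    reader.unread('\n');                // push the whole line back so the
                    reader.unread(line.toCharArray());  // next iteration re-reads it
                    continue;
                }
                currentKey = key;
                group.add(line);
            }
            if (!group.isEmpty()) {
                processGroup(currentKey, group);
            }
        }
    }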
answered 2012-08-30T08:29:50.157

I think option 1 is the simplest. I would suggest parsing the text yourself rather than using a BufferedReader, as it will take a long time to parse 100 GB.

The only option likely to be faster is a binary search, accessing the file with RandomAccessFile. You can memory-map 100 GB on a 64-bit JVM. This avoids the need to parse every line, which is pretty expensive. Another advantage of this approach is that you can use multiple threads. It is far, far more complicated to implement, but should be much faster. Once you have found each group boundary, you can copy the raw data in bulk without having to parse all the lines.
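
As a rough illustration of the boundary-finding building block (hypothetical helper names, not production code): seek to an arbitrary byte offset, skip the partial line you land in, and read just the key of the next line. Probing a handful of offsets this way, in a binary search or to split the file into one chunk per worker thread, locates group boundaries without touching the data in between:

    import java.io.IOException;
    import java.io.RandomAccessFile;

    // Return the offset of the first line that starts at or after `offset`.
    static long nextLineStart(RandomAccessFile raf, long offset) throws IOException {
        if (offset == 0) return 0;
        raf.seek(offset - 1);   // back up one byte so an offset that already sits
        raf.readLine();         // on a line start is not skipped over
        return raf.getFilePointer();
    }

    // Read only the key (the part before the comma) of the line starting at `lineStart`.
    static String keyAt(RandomAccessFile raf, long lineStart) throws IOException {
        raf.seek(lineStart);
        String line = raf.readLine();
        return line == null ? null : line.substring(0, line.indexOf(','));
    }

RandomAccessFile.readLine() is slow, but here it is only called a couple of times per probe; the bulk copy between two boundaries can then be done with large byte-buffer reads (or a MappedByteBuffer) with no per-line work.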

answered 2012-08-30T08:30:40.593