java - Binary search in Java over a UTF-8 encoded text file with no fixed line size

Question

I have a tab-separated UTF-8 file whose records are sorted by one field. However, the lines do not have a fixed size, so I cannot jump directly to a specific record. How can I perform a binary search on it?

Example:

Line 1: Alfred Brendel /m/011hww /m/0crsgs6,/m/0crvt9h,/m/0cs5n_1,/m/0crtj4t,/m/0crwpnw,/m/0cr_n2s,/m/0crsgyh

Line 2: Rupert Sheldrake /m/011ybj /m/0crtszs

score 3 · Accepted Answer

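You know the number of bytes in the whole file; call it n. The search interval is [l, r], with l = 0 and r = n.

Estimate the middle of the search interval: m = l + (r - l) / 2. From this position, move left (right works as well) byte by byte until you find the line break that ends the previous record (byte == 10 for '\n'; like the tab, byte 9, it has the same single-byte code in ASCII and UTF-8 and never occurs inside a multi-byte character). Call the position just after it mReal and decode the one line that starts there.
Then decide whether to take the first half (new search interval [l, mReal]) or the second half (new search interval [mReal, r]) for the next search step.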
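
The steps above can be sketched roughly as follows. This is a minimal sketch, not the answerer's code: the class and method names (`SortedFileSearch`, `search`, `lineStart`, `readLine`) are made up for illustration, and it assumes that the sort key is the first tab-separated field, that lines end with '\n', and that the file is sorted in the order of `String.compareTo`. If the file was sorted with a different collation, the comparison has to be swapped for a matching one.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

final class SortedFileSearch {

    /** Returns the line whose first tab-separated field equals key, or null if there is none. */
    public static String search(RandomAccessFile file, String key) throws IOException {
        long lo = 0;              // always the offset of a line start
        long hi = file.length();  // exclusive end: a line start or end of file
        while (lo < hi) {
            long mid = lo + (hi - lo) / 2;
            long start = lineStart(file, mid, lo);   // "mReal" in the answer
            String line = readLine(file, start);
            long next = file.getFilePointer();       // offset just past this line
            int tab = line.indexOf('\t');
            String field = tab < 0 ? line : line.substring(0, tab);
            int cmp = key.compareTo(field);          // assumption: file sorted in this order
            if (cmp == 0) {
                return line;
            }
            if (cmp < 0) {
                hi = start;   // key, if present, lies before this line
            } else {
                lo = next;    // key, if present, lies after this line
            }
        }
        return null;
    }

    /** Scans left from pos to the byte just after the previous '\n', never going below lo. */
    private static long lineStart(RandomAccessFile file, long pos, long lo) throws IOException {
        while (pos > lo) {
            file.seek(pos - 1);
            if (file.read() == '\n') {
                break;
            }
            pos--;
        }
        return pos;
    }

    /** Reads bytes from start up to '\n' or end of file and decodes them as UTF-8. */
    private static String readLine(RandomAccessFile file, long start) throws IOException {
        file.seek(start);
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        int b;
        while ((b = file.read()) != -1 && b != '\n') {
            bytes.write(b);
        }
        return new String(bytes.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

A call would look like `SortedFileSearch.search(new RandomAccessFile("data.tsv", "r"), "Rupert Sheldrake")`, where `data.tsv` is a placeholder file name. Reading one byte at a time keeps the sketch short; for large files a buffered read around the midpoint would likely be faster.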

score 0 · Accepted Answer

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

public class YourTokenizer {

    /** Record separator. The original value was "\t", which is the field separator in the question's file. */
    public static final String EPF_EOL = "\n";

    public static final int READ_SIZE = 4 * 1024;

    /** The EPF stream buffer. */
    private final StringBuilder buffer = new StringBuilder();

    /** The EPF stream, decoded as UTF-8 so multi-byte characters split across reads stay intact. */
    private final Reader reader;

    public YourTokenizer(final InputStream stream) {
        this.reader = new InputStreamReader(stream, StandardCharsets.UTF_8);
    }

    /** Returns the next line without its separator, or null at the end of the stream. */
    public String getNextLine() throws IOException {
        int pos = buffer.indexOf(EPF_EOL);
        final char[] chars = new char[READ_SIZE];
        while (pos == -1) {
            // end-of-line sequence isn't available yet, read more of the file
            final int readSize = reader.read(chars, 0, READ_SIZE);
            if (readSize == -1) {
                // end of the stream: hand back whatever is left as the last line
                if (buffer.length() == 0) {
                    return null;
                }
                final String rest = buffer.toString();
                buffer.setLength(0);
                return rest;
            }
            // append only the characters that were actually read
            buffer.append(chars, 0, readSize);
            pos = buffer.indexOf(EPF_EOL);
        }

        final String data = buffer.substring(0, pos);
        buffer.delete(0, pos + EPF_EOL.length());
        return data;
    }

}

and in main:

try (InputStream stream = new FileInputStream(file)) {
    final YourTokenizer tokenizer = new YourTokenizer(stream);

    String line = tokenizer.getNextLine();
    while (line != null) {
        // do something
        line = tokenizer.getNextLine();
    }
}

score 0 · Accepted Answer

You can jump to the middle of the file by byte offset. From there you can find the end of the line you landed in and read the next line from that point. If the record you want comes earlier, jump to the one-quarter point; if it comes later, jump to the three-quarter point, and find the line in the same way each time. Eventually you will narrow it down to one line.
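
A possible helper for the "find the end of that line and read the next one" step is sketched below. The names `NextLineReader` and `lineAfter` are invented for this example. It relies on `RandomAccessFile.readLine()`, which turns each byte into one char with the same value, so the result is re-encoded as ISO-8859-1 to get the raw bytes back before decoding them as UTF-8. Note that the line containing the offset is always discarded, even when the offset happens to sit exactly on a line start, so the first line of the file has to be checked separately.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

final class NextLineReader {

    /** Returns the first complete line that starts after the line containing offset, or null if there is none. */
    public static String lineAfter(RandomAccessFile file, long offset) throws IOException {
        file.seek(offset);
        file.readLine();               // skip the rest of the line we landed in
        String raw = file.readLine();  // next complete line, one char per byte
        if (raw == null) {
            return null;               // we landed in the last line
        }
        // Recover the original bytes, then decode them as UTF-8.
        byte[] bytes = raw.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```

After the call, `file.getFilePointer()` is the offset just past the returned line, which can serve as the lower bound of the next interval when the key sorts after that line.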

score 0 · Accepted Answer

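I think you can estimate the line length from the file size.

If you cannot even estimate the line length, I think it would be better to pick positions using a random number generator.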
