java - How to perform a binary search of a text file

Question

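I have a large text file (5 MB) that I use in my Android application. I create the file as a pre-sorted list of strings, and the file does not change once it has been created. How can I perform a binary search on the contents of this file without reading it line by line to find the matching string?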

score 7 · Accepted Answer

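Since the content of the file does not change, you can break the file into multiple pieces, say A-G, H-N, O-T and U-Z. This lets you check the first character and immediately cut the possible set to a quarter of the original size. A linear search then won't take as long, or reading the whole piece into memory could become an option. If n/4 is still too large the process can be extended, but the idea is the same: build the search breakdown into the file structure instead of trying to do it all in memory.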
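
As a rough illustration of the split-file idea, the sketch below picks one bucket file from the key's first character and scans only that file. The class name BucketedLookup and the file names (words_a-g.txt and so on) are hypothetical, and the sketch assumes each bucket file is itself sorted in String.compareTo order so the scan can stop early.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

class BucketedLookup {

    // Hypothetical naming scheme: one file per first-letter range.
    static String bucketFileFor(String key) {
        char c = key.isEmpty() ? 'a' : Character.toLowerCase(key.charAt(0));
        if (c <= 'g') return "words_a-g.txt"; // also receives digits and punctuation
        if (c <= 'n') return "words_h-n.txt";
        if (c <= 't') return "words_o-t.txt";
        return "words_u-z.txt";
    }

    // Linear scan of the single bucket that could contain the key.
    static boolean contains(File dir, String key) throws IOException {
        File bucket = new File(dir, bucketFileFor(key));
        if (!bucket.exists()) return false;
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new FileInputStream(bucket), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                int cmp = line.compareTo(key);
                if (cmp == 0) return true;
                if (cmp > 0) return false; // assumes a sorted bucket: we have passed the key
            }
        }
        return false;
    }
}
```

The file that builds the buckets has to use the same bucketFileFor rule, otherwise a key could land in one file and be looked up in another.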

score 1 · Accepted Answer

A 5 MB file isn't that big. You should be able to read each line into a String[] array and then use java.util.Arrays.binarySearch() to find the line you want. This is my recommended approach.
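
A minimal sketch of the in-memory approach might look like the following. The class name InMemoryLookup is made up for illustration, and the code assumes the file is UTF-8 and sorted in String.compareTo order, which is what Arrays.binarySearch() requires.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class InMemoryLookup {

    private final String[] lines;

    // Reads every line of the (already sorted) file into an array once.
    InMemoryLookup(String path) throws IOException {
        List<String> list = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new FileInputStream(path), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                list.add(line);
            }
        }
        lines = list.toArray(new String[0]);
    }

    // Arrays.binarySearch returns a non-negative index only when the key is present.
    boolean contains(String key) {
        return Arrays.binarySearch(lines, key) >= 0;
    }
}
```

Loading happens once in the constructor, so every later contains() call is a pure in-memory search.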

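If you don't want to read the whole file into your app, it gets more complicated. If every line of the file is the same length and the file is already sorted, you can open the file with RandomAccessFile and perform the binary search yourself using seek(), like this...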

import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

// open the file for reading
RandomAccessFile raf = new RandomAccessFile("myfile.txt", "r");
String searchValue = "myline";
int lineSize = 50; // bytes per line, including the line terminator
int numberOfLines = (int) (raf.length() / lineSize); // length() returns a long

// perform the binary search...
byte[] lineBuffer = new byte[lineSize];
int bottom = 0;
int top = numberOfLines - 1; // index of the last line
while (bottom <= top) {
  int middle = (bottom + top) >>> 1;
  raf.seek((long) middle * lineSize); // jump to this line in the file
  raf.readFully(lineBuffer); // read the whole line from the file
  // convert the line to a String; trim() drops the line terminator and
  // assumes that short entries are padded with trailing spaces
  String line = new String(lineBuffer, StandardCharsets.US_ASCII).trim();

  int comparison = line.compareTo(searchValue);
  if (comparison == 0) {
    // found it
    break;
  } else if (comparison < 0) {
    // line comes before searchValue
    bottom = middle + 1;
  } else {
    // line comes after searchValue
    top = middle - 1;
  }
}

raf.close(); // close the file when you're finished

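However, if the file does not have fixed-width lines, you cannot easily binary-search it without first loading it into memory, because you cannot jump quickly to a particular line the way you can with fixed-width lines.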

score 1 · Accepted Answer

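In a text file with a uniform character width, you can seek character-wise to the middle of the interval in question, read characters until you hit the delimiter, and then use the string that follows as an approximation of the element-wise middle. The problem with doing this on Android, though, is that you apparently can't get random access to a resource (although I suppose you could just reopen it every time). Furthermore, this technique does not generalize to maps and sets of other types.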
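
The seek-and-resynchronize idea described above can be sketched roughly as follows for a regular file opened with RandomAccessFile (not an Android resource). The class name SortedFileSearch and its helpers are hypothetical. The sketch assumes newline-delimited UTF-8 lines sorted in String.compareTo order, and it binary-searches over byte offsets rather than line numbers.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

class SortedFileSearch {

    // Reads one line from the current file position; returns null at end of file.
    private static String readLine(RandomAccessFile raf) throws IOException {
        int b = raf.read();
        if (b == -1) return null;
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        while (b != -1 && b != '\n') {
            buf.write(b);
            b = raf.read();
        }
        String line = new String(buf.toByteArray(), StandardCharsets.UTF_8);
        return line.endsWith("\r") ? line.substring(0, line.length() - 1) : line;
    }

    // Returns the first complete line that starts at or after byte offset pos, or null.
    private static String firstLineAtOrAfter(RandomAccessFile raf, long pos) throws IOException {
        if (pos == 0) {
            raf.seek(0);
        } else {
            raf.seek(pos - 1);
            readLine(raf); // discard the rest of the line that contains byte pos - 1
        }
        return readLine(raf);
    }

    static boolean contains(String path, String key) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            long lo = 0;
            long hi = raf.length();
            // Find the smallest offset whose following line is >= key (or is missing).
            while (lo < hi) {
                long mid = (lo + hi) >>> 1;
                String line = firstLineAtOrAfter(raf, mid);
                if (line == null || line.compareTo(key) >= 0) {
                    hi = mid;
                } else {
                    lo = mid + 1;
                }
            }
            return key.equals(firstLineAtOrAfter(raf, lo));
        }
    }
}
```

Each probe reads at most two lines byte by byte, which is slow per byte but only happens about log2(file size) times. Wrapping the reads in a small buffer would be a natural refinement.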

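Another option would be to write (using a RandomAccessFile) an "array" of ints, one for each String, at the beginning of the file, then go back and update them with the locations of their corresponding Strings. Again, the search will require jumping around the file.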
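
One way the offset-array option could look is sketched below. The file layout (an int count, then one int offset per string, then the strings written with writeUTF) and the class name IndexedStringFile are assumptions made for this example, not something specified above. The keys must be passed in already sorted.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.List;

class IndexedStringFile {

    // Assumed layout: [int count][count int offsets][strings written with writeUTF].
    static void write(String path, List<String> sortedKeys) throws IOException {
        try (RandomAccessFile out = new RandomAccessFile(path, "rw")) {
            out.setLength(0); // start from an empty file
            out.writeInt(sortedKeys.size());
            for (int i = 0; i < sortedKeys.size(); i++) {
                out.writeInt(-1); // placeholder, filled in below
            }
            for (int i = 0; i < sortedKeys.size(); i++) {
                long pos = out.getFilePointer(); // where string i will be stored
                out.seek(4 + 4L * i);            // go back to slot i of the offset array
                out.writeInt((int) pos);
                out.seek(pos);                   // return to the end of the file
                out.writeUTF(sortedKeys.get(i));
            }
        }
    }

    // Binary search over the offset array, following each offset to its string.
    static boolean contains(String path, String key) throws IOException {
        try (RandomAccessFile in = new RandomAccessFile(path, "r")) {
            int lo = 0;
            int hi = in.readInt() - 1;
            while (lo <= hi) {
                int mid = (lo + hi) >>> 1;
                in.seek(4 + 4L * mid);
                in.seek(in.readInt());
                int cmp = in.readUTF().compareTo(key);
                if (cmp == 0) return true;
                if (cmp < 0) {
                    lo = mid + 1;
                } else {
                    hi = mid - 1;
                }
            }
            return false;
        }
    }
}
```

Because the offsets are fixed-width, the search can index them directly even though the strings themselves vary in length. The int offsets limit the file to 2 GB, which is far above the 5 MB in question.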

What I would do (and did do in my own app) is implement a hash set in a file. This one does separate chaining with trees.

import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedList;
import java.util.Set;

class StringFileSet {

    private static final double loadFactor = 0.75;

    public static void makeFile(String fileName, String comment, Set<String> set) throws IOException {
        new File(fileName).delete();
        RandomAccessFile fout = new RandomAccessFile(fileName, "rw");

        //Write comment
        fout.writeUTF(comment);

        //Make bucket array
        int numBuckets = Math.max(1, (int)(set.size()/loadFactor)); //At least one bucket, so an empty set does not cause division by zero

        ArrayList<ArrayList<String>> bucketArray = new ArrayList<ArrayList<String>>(numBuckets);
        for (int ii = 0; ii < numBuckets; ii++){
            bucketArray.add(new ArrayList<String>());
        }

        for (String key : set){
            bucketArray.get(Math.abs(key.hashCode()%numBuckets)).add(key);
        }

        //Sort key lists in preparation for creating trees
        for (ArrayList<String> keyList : bucketArray){
            Collections.sort(keyList);
        }

        //Make queues in preparation for creating trees
        class NodeInfo{

            public final int lower;
            public final int upper;
            public final long callingOffset;

            public NodeInfo(int lower, int upper, long callingOffset){
                this.lower = lower;
                this.upper = upper;
                this.callingOffset = callingOffset;
            }

        }

        ArrayList<LinkedList<NodeInfo>> queueList = new ArrayList<LinkedList<NodeInfo>>(numBuckets);
        for (int ii = 0; ii < numBuckets; ii++){
            queueList.add(new LinkedList<NodeInfo>());
        }

        //Write bucket array
        fout.writeInt(numBuckets);
        for (int index = 0; index < numBuckets; index++){
            queueList.get(index).add(new NodeInfo(0, bucketArray.get(index).size()-1, fout.getFilePointer()));
            fout.writeInt(-1);
        }

        //Write trees
        for (int bucketIndex = 0; bucketIndex < numBuckets; bucketIndex++){
            while (queueList.get(bucketIndex).size() != 0){
                NodeInfo nodeInfo = queueList.get(bucketIndex).poll();
                if (nodeInfo.lower <= nodeInfo.upper){
                    //Set respective pointer in parent node
                    fout.seek(nodeInfo.callingOffset);
                    fout.writeInt((int)(fout.length() - (nodeInfo.callingOffset + 4))); //Distance instead of absolute position so that the get method can use a DataInputStream
                    fout.seek(fout.length());

                    int middle = (nodeInfo.lower + nodeInfo.upper)/2;

                    //Key
                    fout.writeUTF(bucketArray.get(bucketIndex).get(middle));

                    //Left child
                    queueList.get(bucketIndex).add(new NodeInfo(nodeInfo.lower, middle-1, fout.getFilePointer()));
                    fout.writeInt(-1);

                    //Right child
                    queueList.get(bucketIndex).add(new NodeInfo(middle+1, nodeInfo.upper, fout.getFilePointer()));
                    fout.writeInt(-1);
                }
            }
        }

        fout.close();
    }

    private final String fileName;
    private final int numBuckets;
    private final int bucketArrayOffset;

    public StringFileSet(String fileName) throws IOException {
        this.fileName = fileName;

        DataInputStream fin = new DataInputStream(new BufferedInputStream(new FileInputStream(fileName)));

        int numBytes = fin.readUnsignedShort(); //writeUTF stores the length as an unsigned 16-bit value
        fin.skipBytes(numBytes);
        this.numBuckets = fin.readInt();
        this.bucketArrayOffset = numBytes + 6;

        fin.close();
    }

    public boolean contains(String key) throws IOException {
        boolean containsKey = false;

        DataInputStream fin = new DataInputStream(new BufferedInputStream(new FileInputStream(this.fileName)));

        fin.skipBytes(4*(Math.abs(key.hashCode()%this.numBuckets)) + this.bucketArrayOffset);

        int distance = fin.readInt();
        while (distance != -1){
            fin.skipBytes(distance);

            String candidate = fin.readUTF();
            if (key.compareTo(candidate) < 0){
                distance = fin.readInt();
            }else if (key.compareTo(candidate) > 0){
                fin.skipBytes(4);
                distance = fin.readInt();
            }else{
                fin.skipBytes(8);
                containsKey = true;
                break;
            }
        }

        fin.close();

        return containsKey;
    }

}

A test program:

import java.io.File;
import java.io.IOException;
import java.util.HashSet;

class Test {
    public static void main(String[] args) throws IOException {
        HashSet<String> stringMemorySet = new HashSet<String>();

        stringMemorySet.add("red");
        stringMemorySet.add("yellow");
        stringMemorySet.add("blue");

        StringFileSet.makeFile("stringSet", "Provided under ... included in all copies and derivatives ...", stringMemorySet);
        StringFileSet stringFileSet = new StringFileSet("stringSet");

        System.out.println("orange -> " + stringFileSet.contains("orange"));
        System.out.println("red -> " + stringFileSet.contains("red"));
        System.out.println("yellow -> " + stringFileSet.contains("yellow"));
        System.out.println("blue -> " + stringFileSet.contains("blue"));

        new File("stringSet").delete();

        System.out.println();
    }
}

If and when you modify it for Android, you'll also need to pass it a Context so that it can access the getResources() method.

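You're also probably going to want to stop the Android build tools from compressing the file, which, if you're working with the GUI, can apparently only be done by changing the file's extension to something such as jpg. This made the process about 100 to 300 times faster in my app.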

You might also look into giving yourself more memory by using the NDK.

score 0 · Accepted Answer

This might sound like overkill, but don't store data you need to search like this as a flat file. Create a database and query the data in it. This should be both effective and fast.
