java - Number of lines in a file in Java

Question

I work with huge data files. Sometimes I only need to know the number of lines in these files. Usually I open the file and read it line by line until I reach the end.

I was wondering if there is a smarter way to do that.

score 249 · Accepted Answer

This is the fastest version I have found so far, about 6 times faster than readLines. On a 150MB log file it takes 0.35 seconds, versus 2.40 seconds when using readLines(). Just for comparison, the Linux wc -l command takes 0.15 seconds.

public static int countLinesOld(String filename) throws IOException {
    InputStream is = new BufferedInputStream(new FileInputStream(filename));
    try {
        byte[] c = new byte[1024];
        int count = 0;
        int readChars = 0;
        boolean empty = true;
        while ((readChars = is.read(c)) != -1) {
            empty = false;
            for (int i = 0; i < readChars; ++i) {
                if (c[i] == '\n') {
                    ++count;
                }
            }
        }
        return (count == 0 && !empty) ? 1 : count;
    } finally {
        is.close();
    }
}

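EDIT, 9 1/2 years later: I have practically no Java experience, but it bothered me that nobody had benchmarked this code against the LineNumberReader solution below, so I tried it. My solution seems to be faster, especially for large files, although it takes a few runs before the optimizer does a decent job. I played with the code a bit and produced a new version that is consistently the fastest: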

public static int countLinesNew(String filename) throws IOException {
    InputStream is = new BufferedInputStream(new FileInputStream(filename));
    try {
        byte[] c = new byte[1024];

        int readChars = is.read(c);
        if (readChars == -1) {
            // bail out if nothing to read
            return 0;
        }

        // make it easy for the optimizer to tune this loop
        int count = 0;
        while (readChars == 1024) {
            for (int i=0; i<1024;) {
                if (c[i++] == '\n') {
                    ++count;
                }
            }
            readChars = is.read(c);
        }

        // count remaining characters
        while (readChars != -1) {
            for (int i=0; i<readChars; ++i) {
                if (c[i] == '\n') {
                    ++count;
                }
            }
            readChars = is.read(c);
        }

        return count == 0 ? 1 : count;
    } finally {
        is.close();
    }
}

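Benchmark results for a 1.3 GB text file, y axis in seconds. I performed 100 runs on the same file and measured each run with System.nanoTime(). You can see that countLinesOld has a few outliers while countLinesNew has none. countLinesNew is only slightly faster, but the difference is statistically significant. LineNumberReader is clearly slower.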
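The measurement approach described above (several runs before the optimizer settles, then repeated runs timed with System.nanoTime()) can be sketched roughly as follows. This is only an illustration under assumptions: the class name `TimingSketch`, the `medianNanos` helper, and the choice of the median as the summary value are not from the answer, which does not say how its runs were aggregated.

```java
import java.util.Arrays;

// Hypothetical timing helper; names and the use of the median are assumptions.
class TimingSketch {

    // Runs the task a few times untimed (warm-up), then times it repeatedly
    // with System.nanoTime() and returns the median duration in nanoseconds.
    static long medianNanos(Runnable task, int warmupRuns, int measuredRuns) {
        if (measuredRuns < 1) {
            throw new IllegalArgumentException("measuredRuns must be at least 1");
        }
        for (int i = 0; i < warmupRuns; i++) {
            task.run(); // give the JIT compiler a chance to optimize the code
        }
        long[] samples = new long[measuredRuns];
        for (int i = 0; i < measuredRuns; i++) {
            long start = System.nanoTime();
            task.run();
            samples[i] = System.nanoTime() - start;
        }
        Arrays.sort(samples);
        return samples[measuredRuns / 2];
    }

    public static void main(String[] args) {
        // Placeholder workload: count newlines in an in-memory buffer.
        byte[] data = new byte[1 << 20];
        Arrays.fill(data, (byte) '\n');
        long nanos = medianNanos(() -> {
            int count = 0;
            for (byte b : data) {
                if (b == '\n') {
                    count++;
                }
            }
            if (count != data.length) {
                throw new IllegalStateException("unexpected count");
            }
        }, 10, 100);
        System.out.println("median: " + nanos + " ns");
    }
}
```

To compare the file-based methods, the lambda would call countLinesOld or countLinesNew on the same file and handle the IOException inside the task.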

score 202 · Accepted Answer

I implemented another solution to this problem and found it to be more efficient at counting lines:

try
(
   FileReader       input = new FileReader("input.txt");
   LineNumberReader count = new LineNumberReader(input);
)
{
   while (count.skip(Long.MAX_VALUE) > 0)
   {
      // Loop just in case the file is > Long.MAX_VALUE or skip() decides to not read the entire file
   }

   result = count.getLineNumber() + 1;                                    // +1 because line index starts at 0
}

score 30 · Accepted Answer

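The accepted answer has an off-by-one error for multi-line files that do not end in a newline. A one-line file without a trailing newline returns 1, but a two-line file without a trailing newline also returns 1. Here is an implementation of the accepted solution that fixes this. The endsWithoutNewLine check is wasted work on every read except the last one, but it should be trivial time-wise compared to the function as a whole.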

public int count(String filename) throws IOException {
    InputStream is = new BufferedInputStream(new FileInputStream(filename));
    try {
        byte[] c = new byte[1024];
        int count = 0;
        int readChars = 0;
        boolean endsWithoutNewLine = false;
        while ((readChars = is.read(c)) != -1) {
            for (int i = 0; i < readChars; ++i) {
                if (c[i] == '\n')
                    ++count;
            }
            endsWithoutNewLine = (c[readChars - 1] != '\n');
        }
        if(endsWithoutNewLine) {
            ++count;
        } 
        return count;
    } finally {
        is.close();
    }
}
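The cases described in this answer can be checked against in-memory data with a small sketch like the one below. The class name `NewlineCountCheck`, the `InputStream`-based signature, and the `countText` helper are assumptions made for illustration; the counting logic follows count() above.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; names and signatures are illustrative only.
class NewlineCountCheck {

    // Same idea as count() above, but reading from any InputStream.
    static int count(InputStream is) throws IOException {
        byte[] c = new byte[1024];
        int count = 0;
        int readChars;
        boolean endsWithoutNewLine = false;
        while ((readChars = is.read(c)) != -1) {
            for (int i = 0; i < readChars; ++i) {
                if (c[i] == '\n') {
                    ++count;
                }
            }
            if (readChars > 0) {
                // Remember whether the most recent chunk ended mid-line.
                endsWithoutNewLine = (c[readChars - 1] != '\n');
            }
        }
        // A final line without a trailing newline still counts as a line.
        return endsWithoutNewLine ? count + 1 : count;
    }

    // Convenience wrapper for in-memory text.
    static int countText(String text) {
        try {
            return count(new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(countText("one"));        // prints 1
        System.out.println(countText("one\ntwo"));   // prints 2
        System.out.println(countText("one\ntwo\n")); // prints 2
    }
}
```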

score 24 · Accepted Answer

With Java 8, you can use streams:

try (Stream<String> lines = Files.lines(path, Charset.defaultCharset())) {
  long numOfLines = lines.count();
  ...
}
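A self-contained version of the stream approach might look like this. The class name `StreamLineCount` is hypothetical, and UTF-8 is an assumption made for the sketch (the snippet above uses the platform default charset). Note that Files.lines decodes the file, so malformed input for the chosen charset can surface as an UncheckedIOException while counting.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

// Hypothetical wrapper class; only Files.lines and Stream.count come from the answer.
class StreamLineCount {

    static long countLines(Path path) throws IOException {
        // try-with-resources closes the underlying file when counting is done.
        try (Stream<String> lines = Files.lines(path, StandardCharsets.UTF_8)) {
            return lines.count();
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("lines", ".txt");
        try {
            Files.write(tmp, "one\ntwo\nthree".getBytes(StandardCharsets.UTF_8));
            System.out.println(countLines(tmp)); // prints 3
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```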

score 13 · Accepted Answer

The answer with the count() method above gave me line miscounts when a file had no newline at the end: it failed to count the last line of the file.

This method works better for me:

public int countLines(String filename) throws IOException {
    LineNumberReader reader = new LineNumberReader(new FileReader(filename));
    int cnt = 0;
    String lineRead = "";
    while ((lineRead = reader.readLine()) != null) {}

    cnt = reader.getLineNumber();
    reader.close();
    return cnt;
}

score 8 · Accepted Answer

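I know this is an old question, but the accepted solution did not quite match what I needed. So I refined it to accept various line terminators (rather than just line feed) and to use a specified character encoding (rather than ISO-8859-n). All in one method (refactor as appropriate):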

public static long getLinesCount(String fileName, String encodingName) throws IOException {
    long linesCount = 0;
    File file = new File(fileName);
    FileInputStream fileIn = new FileInputStream(file);
    try {
        Charset encoding = Charset.forName(encodingName);
        Reader fileReader = new InputStreamReader(fileIn, encoding);
        int bufferSize = 4096;
        Reader reader = new BufferedReader(fileReader, bufferSize);
        char[] buffer = new char[bufferSize];
        int prevChar = -1;
        int readCount = reader.read(buffer);
        while (readCount != -1) {
            for (int i = 0; i < readCount; i++) {
                int nextChar = buffer[i];
                switch (nextChar) {
                    case '\r': {
                        // The current line is terminated by a carriage return or by a carriage return immediately followed by a line feed.
                        linesCount++;
                        break;
                    }
                    case '\n': {
                        if (prevChar == '\r') {
                            // The current line is terminated by a carriage return immediately followed by a line feed.
                            // The line has already been counted.
                        } else {
                            // The current line is terminated by a line feed.
                            linesCount++;
                        }
                        break;
                    }
                }
                prevChar = nextChar;
            }
            readCount = reader.read(buffer);
        }
        if (prevChar != -1) {
            switch (prevChar) {
                case '\r':
                case '\n': {
                    // The last line is terminated by a line terminator.
                    // The last line has already been counted.
                    break;
                }
                default: {
                    // The last line is terminated by end-of-file.
                    linesCount++;
                }
            }
        }
    } finally {
        fileIn.close();
    }
    return linesCount;
}

This solution is comparable in speed to the accepted solution, about 4% slower in my tests (though timing tests in Java are notoriously unreliable).

score 3 · Accepted Answer

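I concluded that wc -l's method of counting newlines is fine, but it returns non-intuitive results on files where the last line does not end with a newline.

The @er.vikas solution based on LineNumberReader, which adds one to the line count, returned non-intuitive results on files where the last line does end with a newline.

I therefore wrote an algorithm that handles the cases as follows: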

@Test
public void empty() throws IOException {
    assertEquals(0, count(""));
}

@Test
public void singleNewline() throws IOException {
    assertEquals(1, count("\n"));
}

@Test
public void dataWithoutNewline() throws IOException {
    assertEquals(1, count("one"));
}

@Test
public void oneCompleteLine() throws IOException {
    assertEquals(1, count("one\n"));
}

@Test
public void twoCompleteLines() throws IOException {
    assertEquals(2, count("one\ntwo\n"));
}

@Test
public void twoLinesWithoutNewlineAtEnd() throws IOException {
    assertEquals(2, count("one\ntwo"));
}

@Test
public void aFewLines() throws IOException {
    assertEquals(5, count("one\ntwo\nthree\nfour\nfive\n"));
}

And it looks like this:

static long countLines(InputStream is) throws IOException {
    try(LineNumberReader lnr = new LineNumberReader(new InputStreamReader(is))) {
        char[] buf = new char[8192];
        int n, previousN = -1;
        //Read will return at least one byte, no need to buffer more
        while((n = lnr.read(buf)) != -1) {
            previousN = n;
        }
        int ln = lnr.getLineNumber();
        if (previousN == -1) {
            //No data read at all, i.e file was empty
            return 0;
        } else {
            char lastChar = buf[previousN - 1];
            if (lastChar == '\n' || lastChar == '\r') {
                //Ending with newline, deduct one
                return ln;
            }
        }
        //normal case, return line number + 1
        return ln + 1;
    }
}

If you want intuitive results, you can use this. If you only need wc -l compatibility, simply use the @er.vikas solution, but do not add one to the result, and retry the skip:

try(LineNumberReader lnr = new LineNumberReader(new FileReader(new File("File1")))) {
    while(lnr.skip(Long.MAX_VALUE) > 0){};
    return lnr.getLineNumber();
}

score 2 · Accepted Answer

How about using the Process class from within Java code, and then reading the output of the command?

Process p = Runtime.getRuntime().exec("wc -l " + yourfilename);
p.waitFor();

BufferedReader b = new BufferedReader(new InputStreamReader(p.getInputStream()));
String line = "";
int lineCount = 0;
while ((line = b.readLine()) != null) {
    System.out.println(line);
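    // wc -l <file> prints the count followed by the file name, so parse only the first token
    lineCount = Integer.parseInt(line.trim().split("\\s+")[0]);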
}

I still need to try it, though. I will post the results.
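A slightly more defensive sketch of the same idea is shown below. It assumes a Unix-like system with `wc` on the PATH; the class name `WcLineCount` and the `parseWcOutput` helper are hypothetical. It uses ProcessBuilder with separate arguments so that file names containing spaces are passed through intact, and it reads the output before waiting for the process to exit.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

// Hypothetical helper; assumes a Unix-like system with `wc` available.
class WcLineCount {

    // `wc -l file` prints the count followed by the file name, e.g. "  42 data.log".
    static long parseWcOutput(String line) {
        String[] parts = line.trim().split("\\s+");
        return Long.parseLong(parts[0]);
    }

    static long countLines(String filename) throws IOException, InterruptedException {
        // Each argument is passed separately, so no shell quoting is involved.
        Process p = new ProcessBuilder("wc", "-l", filename).start();
        String firstLine;
        try (BufferedReader out =
                 new BufferedReader(new InputStreamReader(p.getInputStream()))) {
            firstLine = out.readLine();
        }
        int exit = p.waitFor();
        if (exit != 0 || firstLine == null) {
            throw new IOException("wc failed with exit code " + exit);
        }
        return parseWcOutput(firstLine);
    }
}
```

Like wc -l itself, this counts newline characters, so a final line without a trailing newline is not included.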

score 0 · Accepted Answer


On Unix-based systems, use the wc command on the command line.

answered 2009-01-17T09:03:02.633

score 0 · Accepted Answer

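The only way to know how many lines a file has is to count them. You can of course build a metric from your data that gives the average length of one line, then take the file size and divide it by that average length, but the result will not be exact.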
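The estimation idea can be sketched as follows: sample the start of the file, measure its newline density, and scale that up to the full file size. This is an assumption-laden illustration, not part of the answer: the class name `LineCountEstimate`, the `estimateLines` method, and the choice to sample only the beginning of the file are all hypothetical, and the result is exact only when line lengths are uniform.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical estimator; gives an approximation, not an exact count.
class LineCountEstimate {

    static long estimateLines(Path path, int sampleSize) throws IOException {
        if (sampleSize < 1) {
            throw new IllegalArgumentException("sampleSize must be at least 1");
        }
        long fileSize = Files.size(path);
        byte[] sample = new byte[sampleSize];
        int filled = 0;
        try (InputStream in = Files.newInputStream(path)) {
            int n;
            // A single read may return fewer bytes than requested, so loop.
            while (filled < sampleSize
                    && (n = in.read(sample, filled, sampleSize - filled)) != -1) {
                filled += n;
            }
        }
        if (filled == 0) {
            return 0; // empty file
        }
        long newlines = 0;
        for (int i = 0; i < filled; i++) {
            if (sample[i] == '\n') {
                newlines++;
            }
        }
        // Scale the newline density of the sample up to the whole file.
        return Math.round((double) fileSize * newlines / filled);
    }
}
```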

score 0 · Accepted Answer

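If you do not have any index structure, you cannot avoid reading the complete file. You can, however, optimize it by not reading line by line and instead using a regex to match all line terminators.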
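One way to read the regex suggestion is sketched below, operating on text that is already in memory, so it is not suitable as-is for very large files. The class name `RegexLineCount` and the rule that a trailing fragment without a terminator counts as a line are assumptions for the sketch.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; counts lines in text that is already loaded in memory.
class RegexLineCount {

    // Matches \r\n as a single terminator, otherwise a lone \r or \n.
    private static final Pattern LINE_END = Pattern.compile("\r\n|\r|\n");

    static long countLines(CharSequence text) {
        long terminators = 0;
        int lastEnd = 0;
        Matcher m = LINE_END.matcher(text);
        while (m.find()) {
            terminators++;
            lastEnd = m.end();
        }
        // Text after the last terminator (if any) is counted as one more line.
        return lastEnd < text.length() ? terminators + 1 : terminators;
    }

    public static void main(String[] args) {
        System.out.println(countLines("one\r\ntwo\rthree\n")); // prints 3
        System.out.println(countLines("one\ntwo"));            // prints 2
    }
}
```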

score 0 · Accepted Answer

Scanner with regex:

public int getLineCount() {
    Scanner fileScanner = null;
    int lineCount = 0;
    Pattern lineEndPattern = Pattern.compile("(?m)$");  
    try {
        fileScanner = new Scanner(new File(filename)).useDelimiter(lineEndPattern);
        while (fileScanner.hasNext()) {
            fileScanner.next();
            ++lineCount;
        }   
    }catch(FileNotFoundException e) {
        e.printStackTrace();
        return lineCount;
    }
    fileScanner.close();
    return lineCount;
}

I have not timed it.

score -2 · Accepted Answer

If you use this:

public int countLines(String filename) throws IOException {
    LineNumberReader reader  = new LineNumberReader(new FileReader(filename));
    int cnt = 0;
    String lineRead = "";
    while ((lineRead = reader.readLine()) != null) {}

    cnt = reader.getLineNumber(); 
    reader.close();
    return cnt;
}

you cannot handle a very large number of lines, such as 100K lines, because the value returned by reader.getLineNumber is an int. You need a long to handle the maximum number of lines.
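For reference, a Java int holds values up to Integer.MAX_VALUE (2,147,483,647), so an int counter only becomes a problem beyond that many lines. If that range is a concern, a long counter can be kept by hand, as in this minimal sketch. The class name `LongLineCount` and its methods are hypothetical; it uses a plain BufferedReader instead of LineNumberReader.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical helper; keeps the running line count in a long.
class LongLineCount {

    static long countLines(Reader source) throws IOException {
        long count = 0;
        try (BufferedReader reader = new BufferedReader(source)) {
            // readLine() returns null once the end of the input is reached.
            while (reader.readLine() != null) {
                count++;
            }
        }
        return count;
    }

    // Convenience wrapper for in-memory text.
    static long countText(String text) {
        try {
            return countLines(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For a file, the reader could be obtained with `new FileReader(filename)`, as in the snippet above.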
