python - What information describes the quantitative difference between two large files of the same size?

Question

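I usually use diff and hexdump tools to find the differences between two binary files. In some situations, though, given two large binary files of the same size, I only want to see their quantitative differences, such as the number of differing regions and the cumulative difference.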

Example: two files A and B. There are two difference regions, and the cumulative difference is 6c-a3 + 6c-11 + 6f-6e + 20-22.

File A = 48 65 6c 6c 6f 2c 20 57
File B = 48 65 a3 11 6e 2c 22 57
              |--------|  |--|
                 reg 1   reg 2

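How can I get this kind of information using standard GNU tools and Bash, or would a simple Python script be better? Other statistics about how the two files differ would also be useful, but I don't know what else there is or how it could be measured. Difference in entropy? Difference in variance?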

score 1 · Accepted Answer

For everything except the regions, you can use numpy. Something like this (untested):

import numpy as np
a = np.fromfile("file A", dtype="uint8")
b = np.fromfile("file B", dtype="uint8")

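# Compute the number of bytes that are different
different_bytes = np.count_nonzero(a != b)

# Widen to a signed type first: subtracting uint8 arrays directly
# wraps around modulo 256 and gives wrong sums
delta = a.astype(np.int16) - b.astype(np.int16)

# Compute the sum of the differences
difference = np.sum(delta, dtype=np.int64)

# Compute the sum of the absolute value of the differences
absolute_difference = np.sum(np.abs(delta), dtype=np.int64)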

# In some cases, the number of bits that have changed is a better
# measurement of change. To compute it we make a lookup array where 
# bitcount_lookup[byte] == number_of_1_bits_in_byte (so
# bitcount_lookup[0:16] == [0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4])
bitcount_lookup = np.array(
    [bin(i).count("1") for i in range(256)], dtype="uint8")

# Numpy allows using an array as an index. ^ computes the XOR of
# each pair of bytes. The result is a byte with a 1 bit where the
# bits of the input differed, and a 0 bit otherwise.
bit_diff_count = np.sum(bitcount_lookup[a ^ b])
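
As a sanity check, the same metrics can be evaluated on the eight example bytes from the question. This is only a sketch: it builds the arrays in memory with np.frombuffer instead of reading files, and it widens to int16 before subtracting, because uint8 arithmetic wraps modulo 256.

```python
import numpy as np

# The eight example bytes from the question (File A and File B)
a = np.frombuffer(bytes.fromhex("48656c6c6f2c2057"), dtype=np.uint8)
b = np.frombuffer(bytes.fromhex("4865a3116e2c2257"), dtype=np.uint8)

# Widen to a signed type before subtracting; uint8 would wrap modulo 256
delta = a.astype(np.int16) - b.astype(np.int16)

different_bytes = int(np.count_nonzero(a != b))
difference = int(delta.sum(dtype=np.int64))
absolute_difference = int(np.abs(delta).sum(dtype=np.int64))

# Same lookup-table idea as above: number of 1 bits in each possible byte
bitcount_lookup = np.array(
    [bin(i).count("1") for i in range(256)], dtype=np.uint8)
bit_diff_count = int(bitcount_lookup[a ^ b].sum(dtype=np.int64))

print(different_bytes, difference, absolute_difference, bit_diff_count)
# prints: 4 35 149 14
```

The signed sum 35 matches the question's 6c-a3 + 6c-11 + 6f-6e + 20-22 evaluated in hexadecimal.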

I couldn't find a numpy function for computing the regions, but it shouldn't be hard to write your own using a != b as input. See this question for inspiration.
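
One possible way to write such a function, as a minimal sketch: take the boolean mask a != b, pad it with False at both ends, and look for the positions where the value flips. The name diff_regions is hypothetical, and a region is assumed here to mean a maximal run of consecutive differing bytes.

```python
import numpy as np

def diff_regions(a, b):
    """Return (start, stop) index pairs, stop exclusive, one per run of differing bytes."""
    mask = a != b
    # Pad with False on both sides so runs touching either end are still detected
    padded = np.concatenate(([False], mask, [False]))
    # Positions where the mask flips; they alternate between run start and run end
    edges = np.flatnonzero(padded[1:] != padded[:-1])
    return list(zip(edges[0::2].tolist(), edges[1::2].tolist()))

# The example bytes from the question
a = np.frombuffer(bytes.fromhex("48656c6c6f2c2057"), dtype=np.uint8)
b = np.frombuffer(bytes.fromhex("4865a3116e2c2257"), dtype=np.uint8)
regions = diff_regions(a, b)
print(regions)       # [(2, 5), (6, 7)]
print(len(regions))  # 2
```

The number of regions is len(regions), and per-region statistics can be computed by slicing a and b with each (start, stop) pair.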

score 0 · Accepted Answer

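One approach that comes to mind is to hack a binary diff algorithm a little, for example a Python implementation of the rsync algorithm. Starting from that, you can fairly easily get a list of the block ranges where the files differ, and then run whatever statistics you want on those blocks.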
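
Because the two files have the same size, a much simpler stand-in for the block-range idea is to compare them block by block at fixed offsets. The sketch below is not the rsync algorithm (there are no rolling checksums), and the function name differing_block_ranges and the block_size default are assumptions chosen for illustration. It reads the files in chunks, so it does not need to hold them in memory.

```python
def differing_block_ranges(path_a, path_b, block_size=1 << 16):
    """Return (first_block, end_block) ranges, end exclusive, where the files differ."""
    ranges = []
    start = None  # index of the first block of the current differing run
    index = 0
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            block_a = fa.read(block_size)
            block_b = fb.read(block_size)
            if not block_a and not block_b:
                break  # both files exhausted
            if block_a != block_b:
                if start is None:
                    start = index
            elif start is not None:
                ranges.append((start, index))
                start = None
            index += 1
    # Close a run that extends to the end of the files
    if start is not None:
        ranges.append((start, index))
    return ranges
```

Multiplying each block index by block_size gives the byte offsets of a range, which can then be read back for more detailed statistics.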
