string - c++で2つの非常に長い文字列を区別する方法は?

Question

Levenshtein_distance文字列の長さが長すぎるこの問題を解決したいと思います。

Edit2: Bobah
が言ったように、タイトルはミスリーディングなので、questoin のタイトルを更新しました。c++ で 100000x100000 の 2 次元整数を宣言する最初の方法は? c++ で int x[100000][100000] を宣言する方法はあります。グローバルに宣言すると、コンパイラが生成します。 1つの方法は、を使用することです。ただし、割り当てと割り当て解除にはさらに時間がかかります。uisng のような他の方法があります。
title was
Content was

error: size of array ‘x’ is too large
map< pair< int , int > , int > mymap

vector<int> myvec

score 3 · Accepted Answer

これほど大きなメモリブロックの場合、最適な方法は、プロセスに仮想メモリを追加するためのオペレーティングシステムの機能を使用した動的割り当てです。

ただし、割り当てようとしているブロックの大きさを見てください。

 40 000 000 000 bytes

以前のアドバイスを撤回します。これほど大きなブロックの場合、最善のアプローチは、問題を分析してメモリの使用量を減らす方法を見つけることです。

score 2 · Accepted Answer

編集距離マトリックスの入力は、一度に各行で行うことができます。現在の行を計算するには、前の行を覚えていれば十分です。この観測により、スペースの使用量が 2 次から線形に減少します。理にかなっていますか？

score 0 · Accepted Answer

Your question is very interesting, but the title is misleading.

This is what you need in terms of data model (x - first string, y - second string, * - distance matrix).

      y <-- first string (scrolls from top down)

      y
  x  x  x  x  x  x  x  x  <- second string (scrolls from left to right)
      y *  *  *

      y *  *  *

      y *  *  * <-- distance matrix (a donut) scrolls together with strings
                    and grows/shrinks when needed, as explained below
      y

Have two relatively long (but still << N) character buffers and relatively small ( << buffers size) rectangular (start from square) distance matrix.

Make the matrix a donut - bi-dimentional ring buffer (can use the one from boost, or just std::deque).

When string fragments currently covered by the matrix are 100% match shift both buffers by one, rotate the donut around both axes, recalculating one new row/column in the distance matrix.

When match is <100% and is less than configured threshold then grow the size of the both dimensions of the donut without dropping any rows/columns and do it until either match gets above the threshold or you reach the maximum donut size. When match ratio hits the threshold from the below you need to scroll donut discarding head of x and y buffers and at the same time aligning them (only X needs moving by 1 when the distance matrix tells that X[i] does not exist in Y, but X[i+1,i+m] matches Y[j, j+m-1]).

As a result you will have a simple yet very efficient heuristic diff engine with deterministic limited memory footprint and all memory can be pre-allocated at startup so no dynamic allocation will slow it down at runtime.

Apache v2 license, in case you decide to go for it.

string - c++で2つの非常に長い文字列を区別する方法は?

3 に答える 3

Related

Reference