alignment - Reintroducing spaces into a document

Question

Imagine you have a reference text at hand:

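Four score and seven years ago our fathers brought forth on this continent a new nation, conceived in liberty, and dedicated to the proposition that all men are created equal. Now we are engaged in a great civil war, testing whether that nation, or any nation, so conceived and so dedicated, can long endure. We are met on a great battle-field of that war. We have come to dedicate a portion of that field, as a final resting place for those who here gave their lives that that nation might live. It is altogether fitting and proper that we should do this. But, in a larger sense, we can not dedicate, we can not consecrate, we can not hallow this ground. The brave men, living and dead, who struggled here, have consecrated it, far above our poor power to add or detract. The world will little note, nor long remember what we say here, but it can never forget what they did here. It is for us the living, rather, to be dedicated here to the unfinished work which they who fought here have thus far so nobly advanced. It is rather for us to be here dedicated to the great task remaining before us—that from these honored dead we take increased devotion to that cause for which they gave the last full measure of devotion—that we here highly resolve that these dead shall not have died in vain—that this nation, under God, shall have a new birth of freedom—and that government of the people, by the people, for the people, shall not perish from the earth.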

You are handed back a snippet of that text with no spaces or punctuation, and with some characters deleted, inserted, or substituted:

ieldasafinalrTstingplaceforwhofoughtheregavetheirliZesthatthatn

Using the reference text, what tools (in any programming language) could I use to try to space the words properly?

ield as a final rTsting place for who fought here gave their liZes that that n

You don't need to fix the errors, just restore the spacing.

score 1 · Accepted Answer

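You can do this with edit distance, by finding the substring of the reference that has the minimum edit distance to the snippet. See my answer (a PHP implementation) to a similar question: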

Longest Common Substring with wrong character tolerance

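Using the shortest_edit_substring() function from the link above, you can add the following to run the search after stripping out everything but letters (or whatever you want to keep: letters, digits, etc.), and then map the result back onto the original version correctly.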

// map a stripped down substring back to the original version
function map_substring($haystack_letters,$start,$length,$haystack, $regexp)
{
    $r_haystack = str_split($haystack);
    $r_haystack_letters = $r_haystack;
    foreach($r_haystack as $k => $l) 
    {   
        if (preg_match($regexp,$l))
        {       
            unset($r_haystack_letters[$k]);
        }       
    }   
    $key_map = array_keys($r_haystack_letters);
    $real_start = $key_map[$start];
    $real_end = $key_map[$start+$length-1];
    $real_length = $real_end - $real_start + 1;
    return array($real_start,$real_length);
}

$haystack = 'Four score and seven years ago our fathers brought forth on this continent a new nation, conceived in liberty, and dedicated to the proposition that all men are created equal. Now we are engaged in a great civil war, testing whether that nation, or any nation, so conceived and so dedicated, can long endure. We are met on a great battle-field of that war. We have come to dedicate a portion of that field, as a final resting place for those who here gave their lives that that nation might live. It is altogether fitting and proper that we should do this. But, in a larger sense, we can not dedicate, we can not consecrate, we can not hallow this ground. The brave men, living and dead, who struggled here, have consecrated it, far above our poor power to add or detract. The world will little note, nor long remember what we say here, but it can never forget what they did here. It is for us the living, rather, to be dedicated here to the unfinished work which they who fought here have thus far so nobly advanced. It is rather for us to be here dedicated to the great task remaining before us—that from these honored dead we take increased devotion to that cause for which they gave the last full measure of devotion—that we here highly resolve that these dead shall not have died in vain—that this nation, under God, shall have a new birth of freedom—and that government of the people, by the people, for the people, shall not perish from the earth.';

$needle = 'ieldasafinalrTstingplaceforwhofoughtheregavetheirliZesthatthatn';

// strip out all non-letters
$regexp_to_strip_out = '/[^A-Za-z]/';

$haystack_letters = preg_replace($regexp_to_strip_out,'',$haystack);

list($start,$length) = shortest_edit_substring($needle,$haystack_letters);
list($real_start,$real_length) = map_substring($haystack_letters,$start,$length,$haystack,$regexp_to_strip_out);

printf("Found |%s| in |%s|, matching |%s|\n",substr($haystack,$real_start,$real_length),$haystack,$needle);

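This also does error correction. In fact, it's easier to do it than not to. The minimum edit distance search is pretty straightforward to implement in other languages if you need something faster than PHP.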
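For anyone who wants the no-correction variant the question asked for, here is a rough Python sketch of the same minimum edit distance idea. It is not the linked PHP function: the name `respace` and the `gap_after` bookkeeping are my own, and it uses `str.isalpha()` where the PHP above uses `/[^A-Za-z]/`. It aligns the snippet against the letters-only reference, walks the alignment back, and copies only the word breaks into the snippet, leaving the snippet's own (possibly wrong) letters alone.

```python
def respace(needle, reference):
    """Insert the reference's word breaks into needle, keeping needle's letters.

    Returns (spaced_needle, edit_distance).
    """
    # Letters-only view of the reference. gap_after[k] is True when the
    # original text had spaces or punctuation right after the k-th letter.
    letters, gap_after = [], []
    for ch in reference:
        if ch.isalpha():
            letters.append(ch)
            gap_after.append(False)
        elif letters:
            gap_after[-1] = True

    n, m = len(needle), len(letters)
    # dist[i][j] = fewest edits between needle[:i] and some substring of the
    # letters ending at position j. Row 0 is all zeros, so a match may start
    # anywhere in the reference for free.
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
        for j in range(1, m + 1):
            sub = dist[i - 1][j - 1] + (needle[i - 1] != letters[j - 1])
            dist[i][j] = min(sub, dist[i - 1][j] + 1, dist[i][j - 1] + 1)

    # Pick the cheapest end position, then walk back to recover the alignment.
    j = min(range(m + 1), key=lambda k: dist[n][k])
    best = dist[n][j]
    i, ops = n, []
    while i > 0:
        if j > 0 and dist[i][j] == dist[i - 1][j - 1] + (needle[i - 1] != letters[j - 1]):
            i, j = i - 1, j - 1
            ops.append((needle[i], gap_after[j]))  # needle char paired with a reference letter
        elif dist[i][j] == dist[i - 1][j] + 1:
            i -= 1
            ops.append((needle[i], False))         # extra char that is only in the needle
        else:
            j -= 1
            ops.append(("", gap_after[j]))         # reference letter missing from the needle
    ops.reverse()

    # Replay the alignment forwards, emitting a space wherever the reference
    # had a break between two emitted needle characters.
    out, pending_break = [], False
    for ch, gap in ops:
        if ch:
            if pending_break and out:
                out.append(" ")
            out.append(ch)
            pending_break = False
        pending_break = pending_break or gap
    return "".join(out), best


print(respace("quickbrXwnfox", "The quick, brown fox jumps"))
```

Calling `respace($needle, $haystack)` with the snippet and reference from the question should give spacing close to what was asked for. Where exactly a break lands next to inserted words such as "fought" depends on which of several equally cheap alignments the walk-back happens to pick. The full table is needle length times reference length, which is fine at this size but would want a banded or row-wise version for long documents.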

score 1 · Accepted Answer

Odd problem you've got there :)

If you can't rely on capitalization for hints, just lowercase everything from the start.

Then get a dictionary of words. Maybe just a wordlist, or you could try WordNet.

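And a corpus of similar material that is properly spaced. If need be, download a Wikipedia dump. You'll need to clean it up and break it into n-grams. 3-grams will probably suit the task. Or save yourself the time and use Google's n-gram data: either the web n-grams (paid) or the book n-grams (free-ish).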

Set a cap on maximum word length. Let's say 20 characters.

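Take the first character of your mystery string and look it up in the dictionary. Then take the first two characters and look those up. Keep going until you reach 20. Store every match you get, though the longest ones are probably the best. Then move the starting point through the string one character at a time and repeat.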

You'll end up with an array of valid word matches.
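A minimal Python sketch of that lookup pass, assuming the dictionary is just a set of lowercase words. The function name `find_word_matches` and the tiny word set are made up for illustration.

```python
def find_word_matches(text, dictionary, max_len=20):
    """Return {start: [end, ...]} for every dictionary word found in text."""
    text = text.lower()
    matches = {}
    for start in range(len(text)):
        # Try every prefix from this start point, up to the length cap.
        last_end = min(start + max_len, len(text))
        ends = [end for end in range(start + 1, last_end + 1)
                if text[start:end] in dictionary]
        if ends:
            matches[start] = ends
    return matches


# Toy dictionary, only for the example.
words = {"a", "as", "final", "in", "place", "lace"}
print(find_word_matches("asafinalplace", words))
```

Each key is a start offset and each value lists the offsets where a dictionary word beginning there ends, so `text[start:end]` recovers the word.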

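Loop through this new array, pairing each match with the ones that follow it and checking against the original string, to identify every valid combination of words that doesn't overlap. You might end up with one output string, or with several.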
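One way to sketch the combining step in Python is a memoised recursion that chains words end to start. This assumes every character has to be covered by some dictionary word, so a corrupted word (like "rTsting" in the question) would leave no result unless you add a fallback for unknown chunks. The name `segmentations` is mine.

```python
from functools import lru_cache


def segmentations(text, dictionary, max_len=20):
    """Return every way to split text into dictionary words, as lists of words."""
    text = text.lower()

    @lru_cache(maxsize=None)
    def split_from(start):
        if start == len(text):
            return ((),)  # exactly one way to split nothing: no words
        results = []
        for end in range(start + 1, min(start + max_len, len(text)) + 1):
            word = text[start:end]
            if word in dictionary:
                # Chain this word with every split of the remainder.
                results.extend((word,) + rest for rest in split_from(end))
        return tuple(results)

    return [list(words) for words in split_from(0)]


print(segmentations("asafinal", {"a", "as", "final", "fin", "al"}))
```

The number of candidate splits can grow very quickly on long inputs with a permissive dictionary, so in practice you would probably prune as you go rather than list them all.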

If you have several, break each output string into 3-grams. Then look them up in your new n-gram database to see which combinations are used most frequently.
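A rough Python sketch of the ranking step, assuming a small in-memory corpus in place of a real n-gram database. The helper names and the scoring rule (average trigram count per candidate) are my own choices, not something prescribed above.

```python
from collections import Counter


def trigrams(words):
    """Word trigrams with start/end padding, so short candidates still score."""
    padded = ["<s>", "<s>"] + list(words) + ["</s>"]
    return [tuple(padded[i:i + 3]) for i in range(len(padded) - 2)]


def build_counts(corpus_sentences):
    """Count word trigrams over properly spaced sentences."""
    counts = Counter()
    for sentence in corpus_sentences:
        counts.update(trigrams(sentence.lower().split()))
    return counts


def best_candidate(candidates, counts):
    """Pick the candidate whose trigrams are, on average, most frequent."""
    def score(words):
        grams = trigrams(words)
        return sum(counts[g] for g in grams) / len(grams)
    return max(candidates, key=score)


# Tiny stand-in corpus, only for the example.
counts = build_counts(["as a final resting place",
                       "a final resting place for those"])
candidates = [["as", "a", "fin", "al"], ["as", "a", "final"]]
print(best_candidate(candidates, counts))
```

Averaging keeps candidates with many short words from winning just because they contain more trigrams. With real data you would more likely use smoothed log-probabilities.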

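There may also be some time-saving tricks, like starting with stop words, checking them against the dictionary, growing them by a letter at a time on either side, and adding the spaces there first.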

...or I'm overthinking the whole problem and there's some awkward one-liner that someone will humble me with :)
