python - Implementing Google's DiffMatchPatch API for Python 2/3

Question

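I want to write a simple diff application in Python using Google's Diff Match Patch API. I am quite new to Python, so I would like an example of how to use the Diff Match Patch API to semantically compare two paragraphs of text. I am not sure how to use the diff_match_patch.py file or what to import from it. Any help would be appreciated.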

Additionally, I tried difflib, but found it ineffective for comparing sentences that differ widely. I am using Ubuntu 12.04 x64.

score 21 · Accepted Answer

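Google's diff-match-patch API is the same in every language it is implemented in (Java, JavaScript, Dart, C++, C#, Objective C, Lua, and Python 2.x or Python 3.x). You can therefore usually rely on sample snippets written in a language other than your target language to work out which API calls a given diff/match/patch task needs.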

For a simple "semantic" comparison, this is what you need:

import diff_match_patch

textA = "the cat in the red hat"
textB = "the feline in the blue hat"

# Create a diff_match_patch object
dmp = diff_match_patch.diff_match_patch()

# Depending on the kind of text you work with, in terms of overall length
# and complexity, you may want to extend (or, as here, disable) the
# timeout feature
dmp.Diff_Timeout = 0   # 0 means no time limit; the default is 1.0 second

# All 'diff' jobs start with invoking diff_main()
diffs = dmp.diff_main(textA, textB)

# diff_cleanupSemantic() is used to make the diffs array more "human" readable
dmp.diff_cleanupSemantic(diffs)

# and if you want the results as a ready-to-display HTML snippet
htmlSnippet = dmp.diff_prettyHtml(diffs)
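
If HTML is not what you need, the list itself is easy to walk: each entry is an (op, text) tuple where op is -1 for a deletion, 1 for an insertion and 0 for unchanged text. Below is a minimal sketch of a plain-text renderer. render_inline is a hypothetical helper (not part of the library), and the hand-written sample_diffs list stands in for the output of diff_main() so that the snippet runs without the library installed.

```python
# Operation codes used in diff tuples (same values as the library's
# DIFF_DELETE, DIFF_EQUAL and DIFF_INSERT constants).
DIFF_DELETE, DIFF_EQUAL, DIFF_INSERT = -1, 0, 1


def render_inline(diffs):
    """Render (op, text) tuples as plain text.

    Deleted text is wrapped in [-...-], inserted text in {+...+}.
    This helper is illustrative only; it is not a library function.
    """
    parts = []
    for op, text in diffs:
        if op == DIFF_DELETE:
            parts.append("[-" + text + "-]")
        elif op == DIFF_INSERT:
            parts.append("{+" + text + "+}")
        else:
            parts.append(text)
    return "".join(parts)


# Hand-written sample in the same shape as the list diff_main() returns.
sample_diffs = [(0, "the "), (-1, "cat"), (1, "feline"), (0, " in the hat")]
rendered = render_inline(sample_diffs)
print(rendered)
```

In real use you would pass the diffs list from the snippet above instead of sample_diffs.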

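A word on "semantic" processing by diff-match-patch
Note that this processing is useful for presenting differences to a human reader, because it tends to produce a shorter list of differences by avoiding irrelevant resynchronization of the texts (for example, when two different words happen to share letters in the middle). The results are far from perfect, however: the processing relies on simple heuristics based on the length of the differences, surface patterns and so on, rather than on real NLP processing backed by lexicons and other semantic-level devices.
For example, the textA and textB values used above produce the following diffs arrays before and after diff_cleanupSemantic():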

[(0, 'the '), (-1, 'cat'), (1, 'feline'), (0, ' in the '), (-1, 'r'), (1, 'blu'), (0, 'e'), (-1, 'd'), (0, ' hat')]
[(0, 'the '), (-1, 'cat'), (1, 'feline'), (0, ' in the '), (-1, 'red'), (1, 'blue'), (0, ' hat')]

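Nice! The letter 'e' shared by 'red' and 'blue' makes diff_main() split this region into four pieces (three edits around a one-letter match), while diff_cleanupSemantic() reduces it to just two edits, neatly singling out the differing words 'red' and 'blue'.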
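
To check the two lists above yourself, a small sketch like the following rebuilds both input texts from a diff list and counts the non-equal entries across the whole list. The helper names (source_text, target_text, edit_count) are made up for illustration. The library offers comparable methods, diff_text1() and diff_text2(), but the plain-Python versions keep the snippet runnable without it.

```python
# The "before" and "after" lists quoted above, as literal data.
before = [(0, 'the '), (-1, 'cat'), (1, 'feline'), (0, ' in the '),
          (-1, 'r'), (1, 'blu'), (0, 'e'), (-1, 'd'), (0, ' hat')]
after = [(0, 'the '), (-1, 'cat'), (1, 'feline'), (0, ' in the '),
         (-1, 'red'), (1, 'blue'), (0, ' hat')]


def source_text(diffs):
    """Rebuild the first text: everything except insertions."""
    return "".join(text for op, text in diffs if op != 1)


def target_text(diffs):
    """Rebuild the second text: everything except deletions."""
    return "".join(text for op, text in diffs if op != -1)


def edit_count(diffs):
    """Count the entries that are insertions or deletions."""
    return sum(1 for op, _ in diffs if op != 0)


for name, diffs in (("before", before), ("after", after)):
    print(name, edit_count(diffs),
          repr(source_text(diffs)), repr(target_text(diffs)))
```

Both lists describe the same pair of texts; over the whole sentence the cleaned-up list needs 4 edits where the raw one needs 5.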

However, if we have, for example

textA = "stackoverflow is cool"
textB = "so is very cool"

the before/after arrays produced are:

[(0, 's'), (-1, 'tack'), (0, 'o'), (-1, 'verflow'), (0, ' is'), (1, ' very'), (0, ' cool')]
[(0, 's'), (-1, 'tackoverflow is'), (1, 'o is very'), (0, ' cool')]

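This shows that the supposedly semantically improved "after" list can be unduly "tortured" compared with the "before" list. Note, for example, how the leading 's' is kept as a match and how the added word 'very' is mixed with parts of the 'is cool' expression. Ideally, we would probably expect something like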

[(-1, 'stackoverflow'), (1, 'so'), (0, ' is'), (1, ' very'), (0, ' cool')]
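
One way to get closer to that ideal is to diff whole words rather than characters. The sketch below does this with the standard-library difflib on word tokens. This is a swapped-in technique, not diff-match-patch's own algorithm, and word_diff is a hypothetical helper name. It returns tuples in the same (op, text) shape so the result is directly comparable.

```python
import difflib
import re


def word_diff(text_a, text_b):
    """Diff two strings word by word, returning (op, text) tuples.

    Uses difflib.SequenceMatcher from the standard library,
    not diff-match-patch.
    """
    # Split into runs of whitespace and runs of non-whitespace so that
    # joining the tokens reproduces the input exactly.
    tokens_a = re.findall(r"\s+|\S+", text_a)
    tokens_b = re.findall(r"\s+|\S+", text_b)
    matcher = difflib.SequenceMatcher(None, tokens_a, tokens_b,
                                      autojunk=False)
    diffs = []
    for tag, a1, a2, b1, b2 in matcher.get_opcodes():
        if tag == "equal":
            diffs.append((0, "".join(tokens_a[a1:a2])))
            continue
        # A "replace" becomes a deletion followed by an insertion.
        if tag in ("delete", "replace"):
            diffs.append((-1, "".join(tokens_a[a1:a2])))
        if tag in ("insert", "replace"):
            diffs.append((1, "".join(tokens_b[b1:b2])))
    return diffs


result = word_diff("stackoverflow is cool", "so is very cool")
print(result)
# [(-1, 'stackoverflow'), (1, 'so'), (0, ' is '), (1, 'very '), (0, 'cool')]
```

The result is close to the ideal list above, with every word kept intact. The trade-off is that a one-letter typo now shows up as a whole replaced word.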
