sorting - Efficient Means of Implementing Collation & Sorting?

Question

I'm writing lexicography software, which may theoretically need to sort tens of thousands of strings with arbitrary (dictionary-project-specific) collations. There are two means of specifying custom collations:

a map of graphemes to unicode-style multi-level collation keys.
an array of alphabetic graphemes (possibly including digraphs, etc.) in sorting order, which can be internally transformed into a map of collation keys.

The naive method of comparing strings is to check grapheme-by-grapheme until you find a mismatch, and then look up the collation keys for the mismatched graphemes to compare, but I'm hoping there's a more efficient way of doing it.

The best idea I've got so far depends on noticing that strings of equal length can be treated as little-endian base-n numbers, so I can pre-compute an integer key for each string which turns collation into cheap integer comparison. But, this breaks for strings of different length (a big deal when sorting a dictionary), and there's no bound on the size of integers that could be generated. To account for length differences, I thought I could compute a list of keys for all prefixes of each string, and then just compare the keys for prefixes of length equal to the shorter string being compared. That seems to do pretty well, but key sizes are still unbounded, and storing the keys could use a lot of memory.

Is there a way to improve that approach? Or am I just going about it entirely wrong, and there's a much better means of sorting strings with arbitrary collations?

score 2 · Accepted Answer

書記素ごとの基数ソートはどうですか？Big O n（単語数）* m（最長単語の長さ）の並べ替えを取得します。アイデアはかなり単純で、Aで始まるすべての単語をAバケットに、BsをBバケットに、というように単語の文字を下に配置する必要があります。

score 1 · Accepted Answer

私は専門家ではありませんが、素朴なアプローチとあなたのアプローチの間のある種のハイブリッドを提案するかもしれません. 各文字列の固定バイト数を見る場合は、それをリトルエンディアンの数値として扱い、事前に計算された照合を使用します。次に、それらが同じ場合は、同じ長さの次のセットに移動し、同じことを行います. 注意が必要なのは、可変長の書記素 (UTF-8 や digraph など) の処理です。最も簡単な解決策は、辞書で固定幅表現を使用することですが、今は思いつかない、より洗練された別の解決策があるかもしれません。

短い文字列の最後に到達したら、それをゼロ拡張して次の境界に合わせてから比較を行います。

また、照合のオープンソース実装を調べて、それらがより洗練された処理を行っているかどうかを確認することもできます (たとえば、strcoll C 関数の GNU 実装)。

sorting - Efficient Means of Implementing Collation & Sorting?

2 に答える 2

Related

Reference