c++ - Portable usage of boost::locale::transform

Question

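I am implementing a substring search in strings, and I would like this search to be "accent-neutral" (it could also be called rough matching): searching for "aba" in "rábano" should succeed. This is my current solution: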

#include <locale>
#include <string>
#include <boost/locale.hpp>    
std::string NormalizeString(const std::string & input)
{
    std::locale loc =  boost::locale::generator()("");
    const boost::locale::collator<char>& collator = std::use_facet<boost::locale::collator<char> >(loc);      
    std::string result = collator.transform(boost::locale::collator_base::primary, input);
    return result;
}

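The only problem with this solution is that transform appends a few bytes to the end of the string. In my case they are "\x1\x1\x1\x1\x0\x0\x0": four bytes with value 1 followed by several zero bytes. It would of course be easy to strip these bytes, but I do not want to depend on such a subtle implementation detail. (The code is supposed to be cross-platform.)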

Is there a more reliable way?

score 0 · Accepted Answer

As @R. Martinho Fernandes said, it looks like such a search cannot be implemented with Boost. I found a solution in the Chromium source. It uses ICU.

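#include <cstddef>
#include <string>
#include <utility>
#include <unicode/ucol.h>
#include <unicode/uloc.h>
#include <unicode/usearch.h>

// This class is for speeding up multiple StringSearchIgnoringCaseAndAccents()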
// with the same |find_this| argument. |find_this| is passed as the constructor
// argument, and precomputation for searching is done only at that timing.
class CStringSearchIgnoringCaseAndAccents 
{
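public:
    explicit CStringSearchIgnoringCaseAndAccents(std::u16string find_this);
    ~CStringSearchIgnoringCaseAndAccents();
    // Non-copyable: the object owns the raw ICU search handle, so a copy
    // would close the same handle twice.
    CStringSearchIgnoringCaseAndAccents(const CStringSearchIgnoringCaseAndAccents&) = delete;
    CStringSearchIgnoringCaseAndAccents& operator=(const CStringSearchIgnoringCaseAndAccents&) = delete;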
    // Returns true if |in_this| contains |find_this|. If |match_index| or
    // |match_length| are non-NULL, they are assigned the start position and total
    // length of the match.
    bool SearchIn(const std::u16string& in_this, size_t* match_index = nullptr, size_t* match_length = nullptr);

private:
    std::u16string _find_this;
    UStringSearch* _search_handle;
};

CStringSearchIgnoringCaseAndAccents::CStringSearchIgnoringCaseAndAccents(std::u16string find_this) :
    _find_this(std::move(find_this)),
    _search_handle(nullptr)
{
    // usearch_open requires a valid string argument to be searched, even if we
    // want to set it by usearch_setText afterwards. So, supplying a dummy text.
    const std::u16string& dummy = _find_this;
    UErrorCode status = U_ZERO_ERROR;
    _search_handle = usearch_open((const UChar*)_find_this.data(), _find_this.size(),
                                  (const UChar*)dummy.data(), dummy.size(), uloc_getDefault(), NULL, &status);
    if (U_SUCCESS(status)) {
        UCollator* collator = usearch_getCollator(_search_handle);
        ucol_setStrength(collator, UCOL_PRIMARY);
        usearch_reset(_search_handle);
    }
}
CStringSearchIgnoringCaseAndAccents::~CStringSearchIgnoringCaseAndAccents() 
{
    if (_search_handle) usearch_close(_search_handle);
}
bool CStringSearchIgnoringCaseAndAccents::SearchIn(const std::u16string& in_this, size_t* match_index, size_t* match_length) 
{
    UErrorCode status = U_ZERO_ERROR;
    usearch_setText(_search_handle, (const UChar*) in_this.data(), in_this.size(), &status);
    // Default to basic substring search if usearch fails. According to
    // http://icu-project.org/apiref/icu4c/usearch_8h.html, usearch_open will fail
    // if either |find_this| or |in_this| are empty. In either case basic
    // substring search will give the correct return value.
    if (!U_SUCCESS(status)) {
        size_t index = in_this.find(_find_this);
        if (index == std::u16string::npos) {
            return false;
        }
        else {
            if (match_index)
                *match_index = index;
            if (match_length)
                *match_length = _find_this.size();
            return true;
        }
    }
    int32_t index = usearch_first(_search_handle, &status);

    if (!U_SUCCESS(status) || index == USEARCH_DONE) return false;
    if (match_index)
    {
        *match_index = static_cast<size_t>(index);
    }
    if (match_length)
    {
        *match_length = static_cast<size_t>(usearch_getMatchedLength(_search_handle));
    }
    return true;
}

Usage:

CStringSearchIgnoringCaseAndAccents searcher(a_utf16_string_what);
searcher.SearchIn(a_utf16_string_where);
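
The class takes std::u16string, while the question works with std::string. Assuming those strings hold UTF-8, they would have to be converted first. The following is only a minimal sketch of such a conversion using the standard library alone. Utf8ToUtf16 is a hypothetical helper name, not part of ICU or of the Chromium code, and in a project that already links ICU its own conversion facilities (for example icu::UnicodeString::fromUTF8) are probably the better choice.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (an assumption for illustration): decodes UTF-8 into UTF-16.
// Malformed or truncated sequences become U+FFFD. Overlong encodings are not
// rejected, so this is not a strict validator.
std::u16string Utf8ToUtf16(const std::string& in)
{
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        std::uint32_t cp = 0xFFFD;  // replacement character unless decoding succeeds
        std::size_t len = 1;        // number of bytes consumed for this code point

        // The lead byte determines the sequence length and the first payload bits.
        if (b0 < 0x80)                { cp = b0; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; len = 2; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; len = 3; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; len = 4; }

        if (len > 1) {
            if (i + len > in.size()) {
                cp = 0xFFFD;  // truncated sequence
                len = 1;
            } else {
                for (std::size_t k = 1; k < len; ++k) {
                    const unsigned char b = static_cast<unsigned char>(in[i + k]);
                    if ((b & 0xC0) != 0x80) {  // not a continuation byte
                        cp = 0xFFFD;
                        len = 1;
                        break;
                    }
                    cp = (cp << 6) | (b & 0x3F);
                }
            }
        }
        i += len;

        // Surrogate code points and values above U+10FFFF are not valid scalar values.
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) cp = 0xFFFD;

        if (cp >= 0x10000) {
            // Encode as a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        } else {
            out.push_back(static_cast<char16_t>(cp));
        }
    }
    return out;
}
```

With such a helper, the searcher could be constructed as CStringSearchIgnoringCaseAndAccents searcher(Utf8ToUtf16(what)) and queried with searcher.SearchIn(Utf8ToUtf16(where)). Note that the match index and length reported by SearchIn are then counted in UTF-16 code units, not in bytes of the original UTF-8 string.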

score 0 · Accepted Answer

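This is an old question, but I decided to post my solution. I used the Boost text conversion methods. First I applied Normalization Form Decomposition (NFD), which gave me the base characters and the accents as separate characters. Then I kept only the characters whose code is below 256 (in practice, since the string is UTF-8 and is processed byte by byte, this keeps the ASCII bytes and drops the combining marks). Finally I did a simple conversion to lowercase. It worked for your problem (and for mine), but I am not sure it applies to every case. Here is the solution: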

#include <algorithm>
#include <cstdlib>
#include <iostream>
#include <iterator>
#include <locale>
#include <string>
#include <boost/locale.hpp>

static std::locale loc =  boost::locale::generator()("en_US.UTF-8");

std::string NormalizeString(const std::string & input)
{
    std::string s_norm = boost::locale::normalize(input, boost::locale::norm_nfd, loc);

    std::string s;
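    // Keep only ASCII bytes; this drops the UTF-8 encoded combining marks produced by NFD.
    // Taking each byte as unsigned char makes the test independent of whether plain char is signed.
    std::copy_if(s_norm.begin(), s_norm.end(), std::back_inserter(s), [](unsigned char ch){ return ch < 0x80; });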

    return boost::locale::to_lower(s, loc);
}

void find_norm(const std::string& input, const std::string& query) {
    if (NormalizeString(input).find(NormalizeString(query)) != std::string::npos)
        std::cout << query << " found in " << input << std::endl;
    else
        std::cout << query << " not found in " << input << std::endl;
}

int main(int argc, char *argv[])
{
    find_norm("rábano", "aba");
    find_norm("rábano", "aaa");

    return EXIT_SUCCESS;
}
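
The filtering and lowercasing steps can also be tried in isolation, without Boost, to see where the approach stops working. The sketch below assumes its input is UTF-8 that has already been decomposed to NFD (for example by boost::locale::normalize as above). FoldDecomposedToAscii and ContainsFolded are hypothetical names introduced only for this illustration, and the lowercasing here covers ASCII letters only, which is a simplification of boost::locale::to_lower.

```cpp
#include <algorithm>
#include <iterator>
#include <string>

// Hypothetical helper: expects UTF-8 text that is ALREADY in NFD form.
// Drops every non-ASCII byte (which removes the combining accents) and
// lowercases the remaining ASCII letters.
std::string FoldDecomposedToAscii(const std::string& nfd_utf8)
{
    std::string out;
    std::copy_if(nfd_utf8.begin(), nfd_utf8.end(), std::back_inserter(out),
                 [](unsigned char ch) { return ch < 0x80; });
    std::transform(out.begin(), out.end(), out.begin(),
                   [](unsigned char ch) {
                       const bool upper = (ch >= 'A' && ch <= 'Z');
                       return static_cast<char>(upper ? ch - 'A' + 'a' : ch);
                   });
    return out;
}

// Accent-neutral substring test on two strings that are already in NFD form.
bool ContainsFolded(const std::string& haystack_nfd, const std::string& needle_nfd)
{
    return FoldDecomposedToAscii(haystack_nfd).find(FoldDecomposedToAscii(needle_nfd))
           != std::string::npos;
}
```

For the decomposed form of "rábano" (written as "ra\xCC\x81" "bano", where CC 81 is the combining acute accent U+0301), the folded string is "rabano" and a search for "aba" succeeds. The same sketch also shows the limits mentioned in the answer: a letter such as "ø" has no canonical decomposition, so its bytes are dropped entirely and "søk" folds to "sk". A query made only of non-ASCII characters folds to an empty string, which find reports as present in any text.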
