elasticsearch - Field not sorting alphabetically in Elasticsearch

Question

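I have several documents that contain a name field. I use the analyzed version of the name field for search and the not_analyzed version for sorting. The sorting works at only one level: the names are first grouped alphabetically by their initial letter, but within each letter's group they are sorted lexicographically rather than alphabetically. Here is the mapping I have used: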

{
  "mappings": {
    "seing": {
      "properties": {
        "name": {
          "type": "string",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        }
      }
    }
  }
}

Can anyone suggest a solution for this?

score 18 · Accepted Answer

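Digging into the Elasticsearch documentation, I found the following:

Sorting and Collations

Case-Insensitive Sorting

Imagine that we have three user documents whose name fields contain Boffey, BROWN, and bailey, respectively. First we will apply the technique described in String Sorting and Multifields, which uses a not_analyzed field for sorting: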

PUT /my_index
{
  "mappings": {
    "user": {
      "properties": {
        "name": {                    //1
          "type": "string",
          "fields": {
            "raw": {                 //2
              "type":  "string",
              "index": "not_analyzed"
            }
          }
        }
      }
    }
  }
}

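(1) The analyzed name field is used for search.
(2) The not_analyzed name.raw field is used for sorting.

A search request sorted on name.raw returns the documents in this order: BROWN, Boffey, bailey. This is known as lexicographical order, as opposed to alphabetical order. Essentially, the bytes used to represent capital letters have a lower value than the bytes used to represent lowercase letters, so the names are sorted with the lowest bytes first.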
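The byte-order effect can be reproduced outside Elasticsearch. The following is only a rough Python sketch of the idea, not what Elasticsearch runs internally; the variable names are illustrative, and sorting by UTF-8 bytes stands in for sorting on an unanalyzed string field.

```python
# Names from the example above.
names = ["Boffey", "BROWN", "bailey"]

# Sort by the raw UTF-8 bytes of each value. This approximates how an
# unanalyzed (not_analyzed) string field is ordered.
byte_order = sorted(names, key=lambda s: s.encode("utf-8"))
print(byte_order)  # ['BROWN', 'Boffey', 'bailey']

# Uppercase ASCII letters have smaller byte values than lowercase ones,
# which is why "BROWN" (second byte 'R') sorts before "Boffey" (second byte 'o').
print(ord("R"), ord("o"))  # 82 111
```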

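That may make sense to a computer, but it makes little sense to human beings, who would reasonably expect these names to be sorted alphabetically regardless of case. To achieve this, we need to index each name in such a way that the byte ordering corresponds to the sort order we want.

In other words, we need an analyzer that emits a single lowercase token.

Following this logic, instead of storing the raw value, you need to lowercase it with a custom analyzer built on the keyword tokenizer: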

PUT /my_index
{
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "case_insensitive_sort" : {
          "tokenizer" : "keyword",
          "filter" : ["lowercase"]
        }
      }
    }
  },
  "mappings" : {
    "seing" : {
      "properties" : {
        "name" : {
          "type" : "string",
          "fields" : {
            "raw" : {
              "type" : "string",
              "analyzer" : "case_insensitive_sort"
            }
          }
        }
      }
    }
  }
}

Ordering by name.raw should now sort alphabetically rather than lexicographically.
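To see why lowercasing the whole value fixes the order, here is a small Python sketch that imitates the analyzer's effect. It is an approximation under stated assumptions: the function name case_insensitive_sort_key is hypothetical, and str.lower() stands in for the keyword tokenizer (whole value kept as one token) followed by the lowercase filter.

```python
def case_insensitive_sort_key(value: str) -> str:
    """Imitate a keyword tokenizer plus lowercase filter:
    keep the whole value as a single token and lowercase it."""
    return value.lower()


names = ["Boffey", "BROWN", "bailey"]

# Sorting on the lowercased token gives the order a human would expect.
alphabetical = sorted(names, key=case_insensitive_sort_key)
print(alphabetical)  # ['bailey', 'Boffey', 'BROWN']
```

Only the sort key is lowercased. The original values are returned unchanged, just as _source keeps the original name while the sort value is the lowercased token.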

A quick test I ran on my local machine using Marvel:

Index structure:

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "case_insensitive_sort": {
          "tokenizer": "keyword",
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "user": {
      "properties": {
        "name": {
          "type": "string",
          "fields": {
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            },
            "keyword": {
              "type": "string",
              "analyzer": "case_insensitive_sort"
            }
          }
        }
      }
    }
  }
}

Test data:

PUT /my_index/user/1
{
  "name": "Tim"
}

PUT /my_index/user/2
{
  "name": "TOM"
}

Query using the raw field:

POST /my_index/user/_search
{
  "sort": "name.raw"
}

Result:

{
  "_index" : "my_index",
  "_type" : "user",
  "_id" : "2",
  "_score" : null,
  "_source" : {
    "name" : "TOM"
  },
  "sort" : [
    "TOM"
  ]
},
{
  "_index" : "my_index",
  "_type" : "user",
  "_id" : "1",
  "_score" : null,
  "_source" : {
    "name" : "Tim"
  },
  "sort" : [
    "Tim"
  ]
}

Query using the lowercased string:

POST /my_index/user/_search
{
  "sort": "name.keyword"
}

Result:

{
  "_index" : "my_index",
  "_type" : "user",
  "_id" : "1",
  "_score" : null,
  "_source" : {
    "name" : "Tim"
  },
  "sort" : [
    "tim"
  ]
},
{
  "_index" : "my_index",
  "_type" : "user",
  "_id" : "2",
  "_score" : null,
  "_source" : {
    "name" : "TOM"
  },
  "sort" : [
    "tom"
  ]
}

I believe the second result is the correct one for your case.
