elasticsearch - How to highlight nested fields in Elasticsearch

Question

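Despite the way Lucene structures documents internally, I am trying to get my nested fields highlighted when a search matches their content.

Here is the explanation from the Elasticsearch documentation (nested type mapping):

Internal implementation

Internally, nested objects are indexed as additional documents, but since they are guaranteed to be indexed within the same "block", joining them with their parent documents is extremely fast.

Those internal nested documents are automatically masked away when doing operations against the index (like searching with a match_all query), and they bubble out when using the nested query.

Because nested documents are always masked to the parent document, the nested documents can never be accessed outside the scope of the nested query. For example, stored fields can be enabled on fields inside nested objects, but there is no way of retrieving them, since stored fields are fetched outside of the nested query scope.

0. In my case

I have an Elasticsearch index with a mapping like the following: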

{
    "my_documents": {
        "dynamic_date_formats": [
            "dd.MM.yyyy",
            "yyyy-MM-dd",
            "yyyy-MM-dd HH:mm:ss"
        ],
        "index_analyzer": "Analyzer2_index",
        "search_analyzer": "Analyzer2_search_decompound",
        "_timestamp": {
            "enabled": true
        },
        "properties": {
            "identifier": {
                "type": "string"
            },
            "description": {
                "type": "multi_field",
                "fields": {
                    "sort": {
                        "type": "string",
                        "index": "not_analyzed"
                    },
                    "description": {
                        "type": "string"
                    }
                }
            },
            "files": {
                "type": "nested",
                "include_in_root": true,
                "properties": {
                    "content": {
                        "type": "string",
                        "include_in_root": true
                    }
                }
            },
            "and then some other": "normal string fields"
        }
    }
}

I am trying to run a query like the following:

{
    "size": 100,
    "query": {
        "bool": {
            "should": [
                {
                    "nested": {
                        "path": "files",
                        "query": {
                            "bool": {
                                "should": {
                                    "match": {
                                        "content": {
                                            "query": "burpcontrol",
                                            "minimum_should_match": "85%"
                                        }
                                    }
                                }
                            }
                        }
                    }
                },
                {
                    "match": {
                        "description": {
                            "query": "burpcontrol",
                            "minimum_should_match": "85%"
                        }
                    }
                },
                {
                    "match": {
                        "identifier": {
                            "query": "burpcontrol",
                            "minimum_should_match": "85%"
                        }
                    }
                }
            ]
        }
    },
    "highlight": {
        "pre_tags": [
            "<span style=\"background-color: yellow\">"
        ],
        "post_tags": [
            "</span>"
        ],
        "order": "score",
        "no_match_size": 100,
        "fragment_size": 50,
        "number_of_fragments": 3,
        "require_field_match": true,
        "fields": {
            "files.content": {},
            "description": {},
            "identifier": {}
        }
    }
}

The problems I am facing are as follows:

1. require_field_match

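If I use "require_field_match": false, the search term is highlighted in every field, even though highlighting does not otherwise work on the nested fields. This is the solution I am actually using, but the performance is terrible. The query takes 25 seconds for 50 documents, about 50 seconds for 100 documents, and 5 seconds for 10 documents. And if I remove the nested field from the highlighting, everything runs as fast as light!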
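As a quick check of the numbers above, the following Python sketch divides each reported time by the number of returned documents. It only restates the timings from the question; the reading that a constant per-document cost points at the per-hit highlighting step (rather than the search itself) is an assumption, not a measurement.

```python
# Timings reported in the question: (documents returned, seconds taken).
timings = [(10, 5.0), (50, 25.0), (100, 50.0)]


def seconds_per_document(samples):
    """Return the time spent per returned document for each sample."""
    return [seconds / docs for docs, seconds in samples]


per_doc = seconds_per_document(timings)
print(per_doc)  # prints [0.5, 0.5, 0.5]
```

The cost is about half a second per returned document regardless of "size", which is what one would expect if the time is spent once per hit.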

2. include_in_root

I would like a flattened version of my nested fields (so that they are also stored as normal objects/fields). To do this, I need to specify:

"files": { "type": "nested", "include_in_root": true, ...

But after reindexing, I do not see an additional flattened field in the document root, and I do not know why (I was expecting something like "files.content": ["content1", "content2", "..."]).

If that worked, I could access the content of the nested field (through the flattened field) and highlight it.
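To make the expected shape concrete, here is a minimal Python sketch of the flattening described above. It is only a model of the result the question hopes for, not how Elasticsearch implements include_in_root; the helper name flatten_nested and the sample identifier value are hypothetical.

```python
def flatten_nested(doc, path):
    """Collect each sub-field of the nested objects under `path`
    into a list keyed by "<path>.<subfield>"."""
    flat = {}
    for item in doc.get(path, []):
        for key, value in item.items():
            flat.setdefault(f"{path}.{key}", []).append(value)
    return flat


# Hypothetical sample document with two nested "files" entries.
doc = {
    "identifier": "a1",
    "files": [{"content": "content1"}, {"content": "content2"}],
}

flattened = flatten_nested(doc, "files")
print(flattened)  # prints {'files.content': ['content1', 'content2']}
```

To my understanding, include_in_root changes what is indexed rather than the stored _source, which could be why no such field shows up in the returned documents; treat that as an assumption to verify against your Elasticsearch version.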

Is it possible to achieve good (and performant) highlighting on nested fields, or can you at least tell me why my query is so slow? (I have already optimized the fragments.)

score 8 · Accepted Answer

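There is a lot you can do here with a parent/child relationship. Hopefully this will point you in the right direction. It will still take a fair amount of testing to determine whether this solution performs better. Also, for clarity, I have left out some of the details of your setup. Please forgive the long post.

I set up a parent/child mapping as follows: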

DELETE /test_index

PUT /test_index
{
   "settings": {
      "number_of_shards": 1,
      "number_of_replicas": 0
   },
   "mappings": {
      "parent_doc": {
         "properties": {
            "identifier": {
               "type": "string"
            },
            "description": {
               "type": "string"
            }
         }
      },
      "child_doc": {
         "_parent": {
            "type": "parent_doc"
         },
         "properties": {
            "content": {
               "type": "string"
            }
         }
      }
   }
}

Then I added some test documents:

POST /test_index/_bulk
{"index":{"_index":"test_index","_type":"parent_doc","_id":1}}
{"identifier": "first", "description":"some special text"}
{"index":{"_index":"test_index","_type":"child_doc","_parent":1}}
{"content":"text that is special"}
{"index":{"_index":"test_index","_type":"child_doc","_parent":1}}
{"content":"text that is not"}
{"index":{"_index":"test_index","_type":"parent_doc","_id":2}}
{"identifier": "second", "description":"some different text"}
{"index":{"_index":"test_index","_type":"child_doc","_parent":2}}
{"content":"different child text, but special"}
{"index":{"_index":"test_index","_type":"parent_doc","_id":3}}
{"identifier": "third", "description":"we don't want this parent"}
{"index":{"_index":"test_index","_type":"child_doc","_parent":3}}
{"content":"or this child"}

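If I understand your spec correctly, you want a query for "special" to return all of these documents except the last two (correct me if I am wrong). You want documents that match the text, documents that have a child that matches the text, or documents that have a parent that matches the text.

You can get the parents that match the query like this: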
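The selection rule in the previous paragraph can be checked against the test documents without Elasticsearch. The sketch below is plain Python; the crude word matching in matches() is an assumption standing in for the analyzer and will not reproduce real scoring or stemming.

```python
import re

# Test documents from the bulk request: parent descriptions keyed by id,
# and children as (parent id, content) pairs.
parents = {
    1: "some special text",
    2: "some different text",
    3: "we don't want this parent",
}
children = [
    (1, "text that is special"),
    (1, "text that is not"),
    (2, "different child text, but special"),
    (3, "or this child"),
]


def matches(text, term):
    """Crude stand-in for a match query: exact lowercase word match."""
    return term.lower() in re.findall(r"\w+", text.lower())


def expected_hits(term):
    """Apply the rule: own match, matching child, or matching parent."""
    parent_hit = {pid for pid, desc in parents.items() if matches(desc, term)}
    has_matching_child = {pid for pid, content in children if matches(content, term)}
    wanted_parents = sorted(parent_hit | has_matching_child)
    wanted_children = [
        (pid, content)
        for pid, content in children
        if matches(content, term) or pid in parent_hit
    ]
    return wanted_parents, wanted_children


wanted_parents, wanted_children = expected_hits("special")
print(wanted_parents)        # prints [1, 2]
print(len(wanted_children))  # prints 3
```

Parent 3 and its child are the only documents left out, so the rule yields five documents in total.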

POST /test_index/parent_doc/_search
{
    "query": {
        "match": {
           "description": "special"
        }
    },
    "highlight": {
        "fields": {
            "description": {},
            "identifier": {}
        }
    }
}
...
{
   "took": 1,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 1,
      "max_score": 1.1263815,
      "hits": [
         {
            "_index": "test_index",
            "_type": "parent_doc",
            "_id": "1",
            "_score": 1.1263815,
            "_source": {
               "identifier": "first",
               "description": "some special text"
            },
            "highlight": {
               "description": [
                  "some <em>special</em> text"
               ]
            }
         }
      ]
   }
}

And you can get the children that match the query like this:

POST /test_index/child_doc/_search
{
    "query": {
        "match": {
           "content": "special"
        }
    },
    "highlight": {
        "fields": {
            "content": {}
        }
    }
}
...
{
   "took": 1,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 2,
      "max_score": 0.92364895,
      "hits": [
         {
            "_index": "test_index",
            "_type": "child_doc",
            "_id": "geUFenxITZSL7epvB568uA",
            "_score": 0.92364895,
            "_source": {
               "content": "text that is special"
            },
            "highlight": {
               "content": [
                  "text that is <em>special</em>"
               ]
            }
         },
         {
            "_index": "test_index",
            "_type": "child_doc",
            "_id": "IMHXhM3VRsCLGkshx52uAQ",
            "_score": 0.80819285,
            "_source": {
               "content": "different child text, but special"
            },
            "highlight": {
               "content": [
                  "different child text, but <em>special</em>"
               ]
            }
         }
      ]
   }
}

You can get both the parents that match the text and the children that match the text like this:

POST /test_index/parent_doc,child_doc/_search
{
    "query": {
        "multi_match": {
           "query": "special",
           "fields": ["description", "content"]
        }
    },
    "highlight": {
        "fields": {
            "description": {},
            "identifier": {},
            "content": {}
        }
    }
}
...
{
   "took": 3,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 3,
      "max_score": 1.1263815,
      "hits": [
         {
            "_index": "test_index",
            "_type": "parent_doc",
            "_id": "1",
            "_score": 1.1263815,
            "_source": {
               "identifier": "first",
               "description": "some special text"
            },
            "highlight": {
               "description": [
                  "some <em>special</em> text"
               ]
            }
         },
         {
            "_index": "test_index",
            "_type": "child_doc",
            "_id": "geUFenxITZSL7epvB568uA",
            "_score": 0.75740534,
            "_source": {
               "content": "text that is special"
            },
            "highlight": {
               "content": [
                  "text that is <em>special</em>"
               ]
            }
         },
         {
            "_index": "test_index",
            "_type": "child_doc",
            "_id": "IMHXhM3VRsCLGkshx52uAQ",
            "_score": 0.6627297,
            "_source": {
               "content": "different child text, but special"
            },
            "highlight": {
               "content": [
                  "different child text, but <em>special</em>"
               ]
            }
         }
      ]
   }
}

However, to get all of the documents relevant to this query, you need to use a bool query:

POST /test_index/parent_doc,child_doc/_search
{
   "query": {
      "bool": {
         "should": [
            {
               "multi_match": {
                  "query": "special",
                  "fields": [
                     "description",
                     "content"
                  ]
               }
            },
            {
               "has_child": {
                  "type": "child_doc",
                  "query": {
                     "match": {
                        "content": "special"
                     }
                  }
               }
            },
            {
               "has_parent": {
                  "type": "parent_doc",
                  "query": {
                     "match": {
                        "description": "special"
                     }
                  }
               }
            }
         ]
      }
   },
    "highlight": {
        "fields": {
            "description": {},
            "identifier": {},
            "content": {}
        }
    },
    "fields": ["_parent", "_source"]
}
...
{
   "took": 5,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 5,
      "max_score": 0.8866254,
      "hits": [
         {
            "_index": "test_index",
            "_type": "parent_doc",
            "_id": "1",
            "_score": 0.8866254,
            "_source": {
               "identifier": "first",
               "description": "some special text"
            },
            "highlight": {
               "description": [
                  "some <em>special</em> text"
               ]
            }
         },
         {
            "_index": "test_index",
            "_type": "child_doc",
            "_id": "geUFenxITZSL7epvB568uA",
            "_score": 0.67829096,
            "_source": {
               "content": "text that is special"
            },
            "fields": {
               "_parent": "1"
            },
            "highlight": {
               "content": [
                  "text that is <em>special</em>"
               ]
            }
         },
         {
            "_index": "test_index",
            "_type": "child_doc",
            "_id": "IMHXhM3VRsCLGkshx52uAQ",
            "_score": 0.18709806,
            "_source": {
               "content": "different child text, but special"
            },
            "fields": {
               "_parent": "2"
            },
            "highlight": {
               "content": [
                  "different child text, but <em>special</em>"
               ]
            }
         },
         {
            "_index": "test_index",
            "_type": "child_doc",
            "_id": "NiwsP2VEQBKjqu1M4AdjCg",
            "_score": 0.12531912,
            "_source": {
               "content": "text that is not"
            },
            "fields": {
               "_parent": "1"
            }
         },
         {
            "_index": "test_index",
            "_type": "parent_doc",
            "_id": "2",
            "_score": 0.12531912,
            "_source": {
               "identifier": "second",
               "description": "some different text"
            }
         }
      ]
   }
}

(I included the "_parent" field, as shown here, to make it easy to see why each document is included in the results.)

Let me know if this helps.

Here is the code I used:
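Because parents and children come back as separate hits, the client has to put them back together. The following Python sketch shows one way to do that, assuming the response shape shown above (children carry fields._parent, parents do not). The function names group_by_parent and fragments are hypothetical, and the sample hits are trimmed copies of the response.

```python
from collections import defaultdict


def group_by_parent(hits):
    """Group search hits under the id of the parent they belong to."""
    groups = defaultdict(lambda: {"parent": None, "children": []})
    for hit in hits:
        if hit["_type"] == "parent_doc":
            groups[hit["_id"]]["parent"] = hit
        else:
            groups[hit["fields"]["_parent"]]["children"].append(hit)
    return dict(groups)


def fragments(group):
    """Collect highlight fragments from the parent first, then its children."""
    docs = ([group["parent"]] if group["parent"] else []) + group["children"]
    found = []
    for doc in docs:
        for field_fragments in doc.get("highlight", {}).values():
            found.extend(field_fragments)
    return found


# Trimmed copies of the five hits from the response above.
hits = [
    {"_type": "parent_doc", "_id": "1",
     "highlight": {"description": ["some <em>special</em> text"]}},
    {"_type": "child_doc", "_id": "geUFenxITZSL7epvB568uA",
     "fields": {"_parent": "1"},
     "highlight": {"content": ["text that is <em>special</em>"]}},
    {"_type": "child_doc", "_id": "IMHXhM3VRsCLGkshx52uAQ",
     "fields": {"_parent": "2"},
     "highlight": {"content": ["different child text, but <em>special</em>"]}},
    {"_type": "child_doc", "_id": "NiwsP2VEQBKjqu1M4AdjCg",
     "fields": {"_parent": "1"}},
    {"_type": "parent_doc", "_id": "2"},
]

groups = group_by_parent(hits)
for parent_id in sorted(groups):
    print(parent_id, fragments(groups[parent_id]))
```

Each parent id then maps to every highlighted fragment from the parent and its children, which is roughly what highlighting a nested field on the root document would have produced.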

http://sense.qbox.io/gist/d69a4d6531dc063faa4b4e094cff2a472a73c5a6
