elasticsearch - How do I get ElasticSearch to score the total number of nested hits across results (idf?) higher than the tf of a single hit?

Question

Apologies if I'm butchering the terminology, but I'm having trouble getting ES to score results in a way that makes sense for my app.

I'm indexing several thousand users with a few simple fields, along with potentially hundreds of child objects nested under each user in the index (i.e. a Book --> Pages data model). The JSON sent to the index looks something like this:

user_id: 1
  full_name: First User
  username: firstymcfirsterton
  posts: 
   id: 2
    title: Puppies are Awesome
    tags:
     - dog house
     - dog supplies
     - dogs
     - doggies
     - hot dogs
     - dog lovers

user_id: 2
  full_name: Second User
  username: seconddude
  posts: 
   id: 3
    title: Dogs are the best
    tags:
     - dog supperiority
     - dog
   id: 4
    title: Why dogs eat?
    tags: 
     - dog diet
     - canines
   id: 5
    title: Who let the dogs out?
    tags:
     - dogs
     - terrible music

The tags are of type "tag", use the "keyword" analyzer, and are boosted by 10. The title is not boosted.

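When I search for "dog", the first user gets a higher score than the second user. I assume this is because the first user has a higher tf-idf. In my app, however, the more posts a user has with hits for the term, the higher that user should ideally rank.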

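I tried sorting by number of posts, but that gives junk results whenever a user simply has a lot of posts. Basically, I want to sort by the number of unique post hits, so that users with more matching posts rank higher.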

How can I do this? Any ideas?

score 2 · Accepted Answer

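First of all, I agree with @karmi and @Zach: it's important to figure out what you mean by matching posts. For simplicity, I will assume that a matching post has the word "dog" somewhere in it, and that we are not using the keyword analyzer, which makes matching on tags and boosting more interesting.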

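If I understand your question correctly, you want to order users by their number of relevant posts. This means you need to search the posts to find the relevant ones and then use that information in the user query. That is only possible if posts are indexed separately, which means posts have to be either child documents or nested fields.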

Assuming that posts are child documents, we could prototype your data like this:

curl -XPOST 'http://localhost:9200/test-idx' -d '{
    "settings" : {
        "number_of_shards" : 1,
        "number_of_replicas" : 0
    },
    "mappings" : {
      "user" : {
        "_source" : { "enabled" : true },
        "properties" : {
            "full_name": { "type": "string" },
            "username": { "type": "string" }
        }
      },
      "post" : {
        "_parent" : {
          "type" : "user"
        },
        "properties" : {
            "title": { "type": "string"},
            "tags": { "type": "string", "boost": 10}
        }
      }
    }
}' && echo

curl -XPUT 'http://localhost:9200/test-idx/user/1' -d '{
    "full_name": "First User",
    "username": "firstymcfirsterton"
}'  && echo
curl -XPUT 'http://localhost:9200/test-idx/user/2' -d '{
    "full_name": "Second User",
    "username": "seconddude"
}'  && echo

#Posts of the first user
curl -XPUT 'http://localhost:9200/test-idx/post/1?parent=1' -d '{
    "title": "Puppies are Awesome",
    "tags": ["dog house", "dog supplies", "dogs", "doggies", "hot dogs", "dog lovers", "dog"]
}'  && echo
curl -XPUT 'http://localhost:9200/test-idx/post/2?parent=1' -d '{
    "title": "Cats are Awesome too",
    "tags": ["cat", "cat supplies", "cats"]
}'  && echo
curl -XPUT 'http://localhost:9200/test-idx/post/3?parent=1' -d '{
    "title": "One fine day with a woof and a purr",
    "tags": ["catdog", "cartoons"]
}'  && echo

#Posts of the second user
curl -XPUT 'http://localhost:9200/test-idx/post/4?parent=2' -d '{
    "title": "Dogs are the best",
    "tags": ["dog supperiority", "dog"]
}'  && echo
curl -XPUT 'http://localhost:9200/test-idx/post/5?parent=2' -d '{
    "title": "Why dogs eat?",
    "tags": ["dog diet", "canines"]
}'  && echo
curl -XPUT 'http://localhost:9200/test-idx/post/6?parent=2' -d '{
    "title": "Who let the dogs out?",
    "tags": ["dogs", "terrible music"]
}'  && echo

curl -XPOST 'http://localhost:9200/test-idx/_refresh' && echo

We can query this data using the Top Children Query. (Or, in the case of nested fields, we could achieve similar results with the Nested Query.)

curl 'http://localhost:9200/test-idx/user/_search?pretty=true' -d '{
  "query": {
    "top_children" : {
        "type": "post",
        "query" : {
            "bool" : {
                "should": [
                    { "text" : { "title" : "dog" } },
                    { "text" : { "tags" : "dog" } }
                ]
            }
        },
        "score" : "sum"
    }
  }
}' && echo

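This query returns the first user first because of the enormous boost factor coming from the matched tags. So it may not look like what you want, but there are a couple of simple ways to fix it. First, we can reduce the boost factor for the tags field. 10 is a really big factor, especially for a field that can be repeated several times. Alternatively, we can modify the query to disregard the scores of the child hits completely and use the number of top matched child documents as the score instead: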

curl 'http://localhost:9200/test-idx/user/_search?pretty=true' -d '{
  "query": {
    "top_children" : {
        "type": "post",
        "query" : {
            "constant_score" : {
                "query" : {            
                    "bool" : {
                        "should": [
                            { "text" : { "title" : "dog" } },
                            { "text" : { "tags" : "dog" } }
                        ]
                    }
                }
            }
        },
        "score" : "sum"
    }
  }
}' && echo
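To make the difference between the two queries easier to see, here is a small Python sketch run over the same sample posts. It does not reproduce Lucene's actual scoring: `toy_post_score` is a hypothetical stand-in that just counts occurrences of the term and multiplies tag hits by the boost of 10, and `tokens` assumes lowercasing with no stemming (so "dogs" and "catdog" do not match "dog"). It only illustrates why summing boosted child scores can favor one heavily tagged post, while summing a constant score per matching child amounts to counting matching posts.

```python
import re

# Sample data copied from the curl commands above: user id -> list of posts.
POSTS = {
    1: [
        {"title": "Puppies are Awesome",
         "tags": ["dog house", "dog supplies", "dogs", "doggies",
                  "hot dogs", "dog lovers", "dog"]},
        {"title": "Cats are Awesome too",
         "tags": ["cat", "cat supplies", "cats"]},
        {"title": "One fine day with a woof and a purr",
         "tags": ["catdog", "cartoons"]},
    ],
    2: [
        {"title": "Dogs are the best", "tags": ["dog supperiority", "dog"]},
        {"title": "Why dogs eat?", "tags": ["dog diet", "canines"]},
        {"title": "Who let the dogs out?", "tags": ["dogs", "terrible music"]},
    ],
}

TAG_BOOST = 10  # boost on the tags field, as in the mapping above


def tokens(text):
    # Assumption: lowercase and split on non-letters, with no stemming.
    return re.findall(r"[a-z]+", text.lower())


def toy_post_score(post, term):
    # Hypothetical stand-in for a relevance score (NOT Lucene's formula):
    # raw term counts, with tag occurrences multiplied by the boost.
    title_hits = tokens(post["title"]).count(term)
    tag_hits = sum(tokens(tag).count(term) for tag in post["tags"])
    return title_hits + TAG_BOOST * tag_hits


def rank_users(term, constant_score):
    totals = {}
    for user_id, posts in POSTS.items():
        scores = [toy_post_score(p, term) for p in posts]
        matching = [s for s in scores if s > 0]
        # With constant scoring each matching post contributes exactly 1,
        # so the sum is simply the number of matching posts.
        totals[user_id] = len(matching) if constant_score else sum(matching)
    # Highest total first, as a list of (user_id, total) pairs.
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


print(rank_users("dog", constant_score=False))
print(rank_users("dog", constant_score=True))
```

Under these assumptions the first user wins on summed toy scores (40 against 30), while the second user wins once every matching post counts as 1 (2 matching posts against 1). Note that the post tagged only with "dogs" does not count as a match here, because the sketch does no stemming.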
