elasticsearch - Indexing website/URL in Elastic Search

Question

I have a website field in a document indexed in Elasticsearch. Example value: http://example.com. The problem is that when I search for example, the document is not returned. How do I map the website/URL field correctly?

I created the index below:

{
  "settings":{
    "index":{
        "analysis":{
        "analyzer":{
            "analyzer_html":{
                  "type":"custom",
                  "tokenizer": "standard",
                "filter":"standard",
                "char_filter": "html_strip"
            }
        }
        }
    }
  },
  "mapping":{
    "blogshops": {
        "properties": {
            "category": {
                "properties": {
                    "name": {
                        "type": "string"
                    }
                }
            },
            "reviews": {
                "properties": {
                    "user": {
                        "properties": {
                            "_id": {
                                "type": "string"
                            }
                        }
                    }
                }
            }
        }
    }
  }
}

score 28 · Accepted Answer

I guess you are using the standard analyzer, which splits http://example.com into two tokens: http and example.com. You can check this at http://localhost:9200/_analyze?text=http://example.com&analyzer=standard.
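The query-string form of _analyze shown above works on older releases; more recent Elasticsearch versions expect the analyzer and text in a JSON request body instead. A minimal sketch of that form, assuming a node on localhost:9200 and text that needs no JSON escaping; the helper names analyze_body and analyze are hypothetical:

```shell
# Build the JSON request body for the _analyze API.
# Assumption: the text contains no characters that need JSON escaping.
analyze_body() {
  printf '{"analyzer":"%s","text":"%s"}' "$1" "$2"
}

# Send the body to a node; localhost:9200 is an assumption.
analyze() {
  curl -s -XGET 'http://localhost:9200/_analyze?pretty' \
    -H 'Content-Type: application/json' \
    -d "$(analyze_body "$1" "$2")"
}

# Usage: analyze standard 'http://example.com'
```

Calling analyze with simple in place of standard makes it easy to compare how different built-in analyzers tokenize the same URL.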

If you want the URL itself to be split, you need to use a different analyzer or define your own custom analyzer.

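You can see how the URL would be indexed with the simple analyzer at http://localhost:9200/_analyze?text=http://example.com&analyzer=simple. As you can see, the URL is now indexed as three tokens: ['http', 'example', 'com']. If you don't want to index tokens such as ['http', 'www'], you can define an analyzer that combines the lowercase tokenizer (the one used by the simple analyzer) with a stop filter. For example, something like this: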

# Delete index
#
curl -s -XDELETE 'http://localhost:9200/url-test/' ; echo
 
# Create index with mapping and custom analyzer
#
curl -s -XPUT 'http://localhost:9200/url-test/' -d '{
  "mappings": {
    "document": {
      "properties": {
        "content": {
          "type": "string",
          "analyzer" : "lowercase_with_stopwords"
        }
      }
    }
  },
  "settings" : {
    "index" : {
      "number_of_shards" : 1,
      "number_of_replicas" : 0
    },
    "analysis": {
      "filter" : {
        "stopwords_filter" : {
          "type" : "stop",
          "stopwords" : ["http", "https", "ftp", "www"]
        }
      },
      "analyzer": {
        "lowercase_with_stopwords": {
          "type": "custom",
          "tokenizer": "lowercase",
          "filter": [ "stopwords_filter" ]
        }
      }
    }
  }
}' ; echo

curl -s -XGET 'http://localhost:9200/url-test/_analyze?text=http://example.com&analyzer=lowercase_with_stopwords&pretty'

# Index document
#
curl -s -XPUT 'http://localhost:9200/url-test/document/1?pretty=true' -d '{
  "content" : "Small content with URL http://example.com."
}'

# Refresh index
#
curl -s -XPOST 'http://localhost:9200/url-test/_refresh'

# Try to search document
#
curl -s -XGET 'http://localhost:9200/url-test/_search?pretty' -d '{
  "query" : {
    "query_string" : {
        "query" : "content:example"
    }
  }
}'
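To see roughly which tokens the lowercase_with_stopwords analyzer keeps without a running cluster, the chain can be imitated with standard Unix tools. This is only an approximation under stated assumptions: the lowercase tokenizer splits on non-letters and lowercases each term, which is emulated here for ASCII letters only, and the function name url_tokens is hypothetical.

```shell
# Approximate the lowercase tokenizer plus the stop filter from the settings above.
# ASCII-only sketch; Elasticsearch itself handles full Unicode letters.
url_tokens() {
  printf '%s\n' "$1" |
    tr -cs 'A-Za-z' '\n' |              # split on any run of non-letters
    tr 'A-Z' 'a-z' |                    # lowercase every token
    grep -vxE '(http|https|ftp|www)?'   # drop stopwords and empty lines
}

url_tokens 'http://www.Example.com'
```

For this input the sketch prints example and com on separate lines, which is why the query content:example can match the indexed document.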

Note: if you would rather not use stopwords, there is an interesting article on the subject: "Stop stopping stop words: a look at common terms query".
