Elasticsearch top_hits performance using shingle filter


We're using Elasticsearch to return distinct search term suggestions from a dozen different fields across a large set of data. To accomplish this, we're using the 'terms' and 'top_hits' aggregations (the terms aggregation uses a wildcard term). We're also using a shingle_filter (min size 2, max size 3) on a custom analyzer, because one of the requirements of the project is to return search suggestions on multi-word search terms.

I've tried several different approaches, none of which are performant.

Approach 1 - suggestion criteria in _all

All criteria on which we want to return suggestions are put in the _all field, which utilizes a custom analyzer with the shingle filter:

'settings' : {
  'analysis' : {
    'analyzer' : {
      'autocomplete_analyzer' : {
        'type' : 'custom',
        'tokenizer' : 'suggestion_tokenizer',
        'filter' : [
          'lowercase',
          'shingle_filter'
        ]
      },
    },
    'tokenizer' : {
      'suggestion_tokenizer' : {
        'type' : 'whitespace'
      }
    },
    'filter' : {
      'shingle_filter' : {
        'type' : 'shingle',
        'min_shingle_size' : 2,
        'max_shingle_size' : 3
      }
    }
  }
},
'mappings' : {
  'core' : {
    '_all' : {
      'enabled' : 'yes',
      'index' : 'analyzed',
      'analyzer' : 'autocomplete_analyzer'
    },
    'properties' : {
      'suggestion_criteria_1': {
        'type' : 'multi_field',
        'fields' : {
          'analyzed' : {
            'type' : 'string',
            'index' : 'analyzed'
          },
          'suggestion_criteria_1': {
            'type' : 'string',
            'index' : 'not_analyzed',
            'include_in_all' : 'yes'
          }
        }
      },...
      'filter_criteria_1': {
        'type' : 'string',
        'include_in_all' : 'no',
        'index' : 'not_analyzed'
      },...
    }
  }
}
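To make the analyzer's output concrete, here is a rough Python emulation of the whitespace tokenizer, lowercase filter and shingle filter configured above. It is only a sketch: the function name `shingle_tokens` is made up for illustration, and it assumes the shingle filter's `output_unigrams` option is left at its default (true), so single words are emitted alongside the 2- and 3-word shingles.

```python
def shingle_tokens(text, min_size=2, max_size=3, output_unigrams=True):
    """Approximate the terms produced by whitespace + lowercase + shingle."""
    words = text.lower().split()  # whitespace tokenizer + lowercase filter
    tokens = []
    for start in range(len(words)):
        if output_unigrams:
            # Assumption: unigrams are kept, as with the filter's default.
            tokens.append(words[start])
        for size in range(min_size, max_size + 1):
            if start + size <= len(words):
                tokens.append(" ".join(words[start:start + size]))
    return tokens


print(shingle_tokens("Quick Brown Fox"))
```

In this emulation a three-word value already produces six terms, which gives a feel for how many more distinct terms the wildcard 'include' pattern has to be checked against once shingles are enabled.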

The aggregation/query utilizes filters and a suggestion term search array, because I need to know which field the suggestion match came from:

{
  'from' : 0,
  'size' : 0,
  'query' : {
    'filtered' : {
      'filter' : {
        'and' : [
          {search filter array / optional}
        ]
      }
    }
  },
  'aggs' : {
    'suggestions' : {
      'terms' : {
        'field' : '_all',
        'include' : '.*{search_term}.*'
      },
      'aggs' : {
        'field_matches' : {
          'top_hits' : {
            '_source' : {
              'include' : {criteria_array}
            },
            'size' : 1
          }
        }
      }
    }
  }
};
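For illustration, the request above could be assembled from the user's input along these lines. This is a hedged sketch rather than the code actually used in the project: `build_suggestion_request` and `escape_lucene_regex` are hypothetical helper names, and the escaped character set is my reading of the reserved characters in Lucene regular expression syntax. The search term is lowercased because the analyzer lowercases the indexed terms that the 'include' pattern is matched against.

```python
import json

# Assumption: characters treated as reserved in Lucene regular expressions.
LUCENE_REGEX_RESERVED = set('.?+*|{}[]()"\\#@&<>~')


def escape_lucene_regex(term):
    """Backslash-escape reserved characters so user input matches literally."""
    return "".join("\\" + ch if ch in LUCENE_REGEX_RESERVED else ch for ch in term)


def build_suggestion_request(search_term, criteria, filters):
    """Build the terms + top_hits body for the _all based approach."""
    pattern = ".*" + escape_lucene_regex(search_term.lower()) + ".*"
    return {
        "from": 0,
        "size": 0,
        "query": {"filtered": {"filter": {"and": filters}}},
        "aggs": {
            "suggestions": {
                "terms": {"field": "_all", "include": pattern},
                "aggs": {
                    "field_matches": {
                        "top_hits": {"_source": {"include": criteria}, "size": 1}
                    }
                },
            }
        },
    }


body = build_suggestion_request(
    "New Y",
    ["suggestion_criteria_1"],
    [{"term": {"filter_criteria_1": "example"}}],  # placeholder filter
)
print(json.dumps(body, indent=2))
```

The filter clause and field values in the usage lines are placeholders, not values from the real index.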

After filters are applied, we're dealing with a set of 100k documents, and the result comes in at over 500ms, which is far longer than ideal given that search suggestions need to occur on every keystroke.

Approach 2 - include suggestion criteria in aggregation / drop _all

For brevity, I'll only describe the changes to the index structure and query/aggregation above.

I disabled the _all field and instead applied the "autocomplete_analyzer" (which includes the shingle_filter) to each of the suggestion criteria (of which there are a dozen) in the mapping.

All suggestion terms are added to the query/aggregation...

  'aggs' : {
    'suggestion_term_1' : {
      'terms' : {
        'field' : 'suggestion_term_1',
        'include' : '.*{search_term}.*'
      },
      'aggs' : {
        'field_matches' : {
          'top_hits' : {
            '_source' : {
              'include' : 'suggestion_term_1'
            },
            'size' : 1
          }
        }
      }
    },
    'suggestion_term_2' : {
      'terms' : {
        'field' : 'suggestion_term_2',
        'include' : '.*{search_term}.*'
      },
      'aggs' : {
        'field_matches' : {
          'top_hits' : {
            '_source' : {
              'include' : 'suggestion_term_2'
            },
            'size' : 1
          }
        }
      }
    },
    etc...
  }
};
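Since the per-field blocks above differ only in the field name, the 'aggs' section can be generated from a list of fields. The following is a minimal sketch under that assumption; `build_per_field_aggs` is a hypothetical helper, and the search term is inserted into the pattern unescaped, so it is assumed to contain no regex metacharacters.

```python
def build_per_field_aggs(search_term, fields):
    """Return one terms + top_hits aggregation per suggestion field."""
    # Assumption: search_term has no regex metacharacters (not escaped here).
    pattern = ".*" + search_term.lower() + ".*"
    aggs = {}
    for field in fields:
        aggs[field] = {
            "terms": {"field": field, "include": pattern},
            "aggs": {
                "field_matches": {
                    "top_hits": {"_source": {"include": field}, "size": 1}
                }
            },
        }
    return aggs


aggs = build_per_field_aggs("new y", ["suggestion_term_1", "suggestion_term_2"])
print(sorted(aggs))
```

Every field in the list adds another terms aggregation to the same request, each scanning that field's terms against the same wildcard pattern.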

This performs at over 500ms once filters are applied. Still not ideal.

Approach 3 - perform multiple Elasticsearch queries - iterating over criteria

This is similar to approach 2, but instead of including all suggestion terms in the aggregation, I include only one term in each request. I iterate over the suggestion terms and perform multiple Elasticsearch aggregation requests, one for each of the dozen or so criteria.

Most of the results came in at 20-30ms or so, but when summed over the entire iteration we're still north of 300-400ms in total request time.
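The iteration described here can be sketched roughly as follows. This is an illustration, not the project's code: `run_per_field_requests` is a hypothetical name, `search` stands in for whatever client call sends a body to Elasticsearch, and the filter part of the query is left out for brevity. It also shows where the total comes from: a dozen sequential requests at 20-30ms each add up to roughly 240-360ms before any client overhead.

```python
import time


def run_per_field_requests(search, search_term, fields):
    """Issue one aggregation request per field and add up the wall time."""
    results = {}
    total_ms = 0.0
    for field in fields:
        body = {
            "size": 0,
            "aggs": {
                field: {
                    "terms": {"field": field, "include": ".*" + search_term + ".*"},
                    "aggs": {
                        "field_matches": {
                            "top_hits": {"_source": {"include": field}, "size": 1}
                        }
                    },
                }
            },
        }
        started = time.perf_counter()
        results[field] = search(body)  # caller-supplied function (assumption)
        total_ms += (time.perf_counter() - started) * 1000
    return results, total_ms


def fake_search(body):
    """Stand-in for a real client call; only echoes the aggregation name."""
    return {"aggregation": list(body["aggs"])[0]}


results, total_ms = run_per_field_requests(
    fake_search, "new y", ["suggestion_term_1", "suggestion_term_2"]
)
print(sorted(results), total_ms >= 0)
```

Because the requests run one after another, the total is the sum of the individual latencies rather than the slowest single request.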

Edge ngrams

I should note that as an alternative to using a wildcard search term, I tried applying an edge ngram filter to the analyzer as well. That, however, typically increased total response time by 50-70% and ballooned the index size with no apparent performance benefit, so I opted to stick with the wildcard approach.
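As a rough picture of why the index grows, here is a small Python emulation of what an edge ngram filter emits for a single term. It is a sketch only: `edge_ngrams` is a made-up helper, and the `min_gram` and `max_gram` values are assumptions, since the settings actually used are not given above.

```python
def edge_ngrams(token, min_gram=1, max_gram=10):
    """Return the prefixes of token with lengths min_gram..max_gram."""
    upper = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, upper + 1)]


print(edge_ngrams("fox"))
print(len(edge_ngrams("quick brown")))
```

With these assumed settings, every term coming out of the shingle filter is expanded into up to ten prefix terms, so the number of indexed terms multiplies on top of the growth the shingles already cause.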

Removing the shingle filter

I should note that I do see dramatic performance improvements when I remove the shingle filter, but unfortunately multi-word queries are a requirement of the project.

I suspect there may be an approach or two I've not yet tried that would significantly improve performance times, but at this point I'm grasping at straws. Any suggestions are appreciated.

