Elasticsearch top_hits performance using shingle filter
We're using Elasticsearch to return distinct search term suggestions from a dozen different fields across a large set of data. To accomplish this, we're using 'terms' and 'top_hits' aggregations (the terms aggregation uses a wildcard term). We're also using a shingle_filter (min-size: 2, max-size: 3) on a custom analyzer, as one of the requirements of the project is to return search suggestions on multi-word search terms.
I've tried several different approaches, none of which are performant.
Approach 1 - suggestion criteria in _all
All of the criteria on which we want to return suggestions are put in the _all field, which uses the custom analyzer with the shingle filter:
    'settings' : {
      'analysis' : {
        'analyzer' : {
          'autocomplete_analyzer' : {
            'type' : 'custom',
            'tokenizer' : 'suggestion_tokenizer',
            'filter' : [ 'lowercase', 'shingle_filter' ]
          }
        },
        'tokenizer' : {
          'suggestion_tokenizer' : { 'type' : 'whitespace' }
        },
        'filter' : {
          'shingle_filter' : {
            'type' : 'shingle',
            'min_shingle_size' : 2,
            'max_shingle_size' : 3
          }
        }
      }
    },
    'mappings' : {
      'core' : {
        '_all' : { 'enabled' : 'yes', 'index' : 'analyzed', 'analyzer' : 'autocomplete_analyzer' },
        'properties' : {
          'suggestion_criteria_1' : {
            'type' : 'multi_field',
            'fields' : {
              'analyzed' : { 'type' : 'string', 'index' : 'analyzed' },
              'suggestion_criteria_1' : { 'type' : 'string', 'index' : 'not_analyzed', 'include_in_all' : 'yes' }
            }
          },
          ...
          'filter_criteria_1' : { 'type' : 'string', 'include_in_all' : 'no', 'index' : 'not_analyzed' },
          ...
        }
      }
    }

The aggregation/query uses the filters, the suggestion search term, and the criteria array, since we need to know which field the suggestion match came from:
    {
      'from' : 0,
      'size' : 0,
      'query' : {
        'filtered' : {
          'filter' : {
            'and' : [ {search filter array / optional} ]
          }
        }
      },
      'aggs' : {
        'suggestions' : {
          'terms' : { 'field' : '_all', 'include' : '.*{search_term}.*' },
          'aggs' : {
            'field_matches' : {
              'top_hits' : {
                '_source' : { 'include' : {criteria_array} },
                'size' : 1
              }
            }
          }
        }
      }
    };

After the filters are applied, we're dealing with a set of 100k documents, and the result comes in at over 500ms, which is far longer than ideal given that search suggestions need to occur on every keystroke.
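Since the 'include' value is a regular expression built from whatever the user has typed, it may help to see the request assembled in code. The following is only a sketch under stated assumptions: the function names are hypothetical, the list of reserved characters is my understanding of Lucene's regular expression syntax, the term is lowercased on the assumption that it should match the lowercased terms the analyzer produces, and falling back to match_all when there are no filters is my own choice rather than something from the original setup.

```python
# Characters assumed to have special meaning in Lucene regular expressions.
LUCENE_REGEX_RESERVED = set('.?+*|{}[]()"\\#@&<>~')


def escape_lucene_regex(text):
    """Backslash-escape every reserved character so it is matched literally."""
    return "".join("\\" + ch if ch in LUCENE_REGEX_RESERVED else ch for ch in text)


def build_all_field_request(search_term, criteria_fields, filters=None):
    """Build the approach 1 request body (hypothetical helper)."""
    # The analyzer lowercases indexed terms, so lowercase the user input too.
    pattern = ".*" + escape_lucene_regex(search_term.lower()) + ".*"
    if filters:
        query = {"filtered": {"filter": {"and": list(filters)}}}
    else:
        # Assumption: with no filters, match every document.
        query = {"match_all": {}}
    return {
        "from": 0,
        "size": 0,
        "query": query,
        "aggs": {
            "suggestions": {
                "terms": {"field": "_all", "include": pattern},
                "aggs": {
                    "field_matches": {
                        "top_hits": {
                            "_source": {"include": list(criteria_fields)},
                            "size": 1,
                        }
                    }
                },
            }
        },
    }


body = build_all_field_request("New York (NY)", ["suggestion_criteria_1"])
print(body["aggs"]["suggestions"]["terms"]["include"])
```

Without the escaping step, a user typing a character such as "(" would produce an invalid or unintended pattern.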
Approach 2 - include suggestion criteria in aggregation / drop _all
For brevity, I'll only describe the changes to the index structure and the query/aggregation above.
I disabled the _all field and instead applied the "autocomplete_analyzer" (which includes the shingle_filter) to each of the suggestion criteria (of which there are a dozen) in the mapping.
All of the suggestion terms were then added to the query/aggregation:
    'aggs' : {
      'suggestion_term_1' : {
        'terms' : { 'field' : 'suggestion_term_1', 'include' : '.*{search_term}.*' },
        'aggs' : {
          'field_matches' : {
            'top_hits' : { '_source' : { 'include' : 'suggestion_term_1' }, 'size' : 1 }
          }
        }
      },
      'suggestion_term_2' : {
        'terms' : { 'field' : 'suggestion_term_2', 'include' : '.*{search_term}.*' },
        'aggs' : {
          'field_matches' : {
            'top_hits' : { '_source' : { 'include' : 'suggestion_term_2' }, 'size' : 1 }
          }
        }
      },
      etc...
      }
    };

This also performs at over 500ms once the filters are applied. Still not ideal.
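With a dozen fields, the 'aggs' section is easier to generate than to write by hand. A minimal sketch, assuming the field names follow the suggestion_term_N pattern shown above and that the search term has already been lowercased and escaped for use in a regular expression; build_per_field_aggs is a hypothetical helper name:

```python
import json


def build_per_field_aggs(fields, search_term):
    """Return one terms + top_hits aggregation per suggestion field."""
    pattern = ".*{}.*".format(search_term)
    aggs = {}
    for field in fields:
        aggs[field] = {
            "terms": {"field": field, "include": pattern},
            "aggs": {
                "field_matches": {
                    "top_hits": {"_source": {"include": field}, "size": 1}
                }
            },
        }
    return aggs


# Assumed field names, matching the naming used in the post.
fields = ["suggestion_term_{}".format(i) for i in range(1, 13)]
aggs = build_per_field_aggs(fields, "york")
print(json.dumps(aggs["suggestion_term_1"], indent=2))
```

Every one of these aggregations still has to walk the shingled terms of its own field, so the total work is roughly comparable to the single _all aggregation.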
Approach 3 - perform multiple Elasticsearch queries - iterating over criteria
This is similar to approach 2, but instead of including all of the suggestion terms in one aggregation, I include only one term in each request. I iterate over the suggestion terms and perform multiple Elasticsearch aggregation requests, one for each of the dozen or so criteria.
Most of the results came in at 20-30ms or so, but when summed over the entire iteration we're still north of 300-400ms in total request time.
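The arithmetic is easy to see in a sketch of the loop: a dozen sequential requests at 20-30ms each is already 240-360ms before any client or network overhead. The code below is illustrative only. The search argument is a hypothetical callable that sends one request body and returns the parsed response, and fake_search is a stand-in that always reports 25ms so the example runs without a cluster.

```python
def suggest_sequentially(fields, search_term, search):
    """Run one terms + top_hits request per field and sum the reported times."""
    suggestions = {}
    total_took_ms = 0
    for field in fields:
        body = {
            "size": 0,
            "aggs": {
                "suggestions": {
                    "terms": {"field": field, "include": ".*{}.*".format(search_term)},
                    "aggs": {
                        "field_matches": {
                            "top_hits": {"_source": {"include": field}, "size": 1}
                        }
                    },
                }
            },
        }
        response = search(body)
        # "took" is the server-side time in milliseconds reported per request.
        total_took_ms += response["took"]
        buckets = response["aggregations"]["suggestions"]["buckets"]
        suggestions[field] = [bucket["key"] for bucket in buckets]
    return suggestions, total_took_ms


def fake_search(body):
    """Stand-in for a real client call: one bucket per request, 25 ms each."""
    field = body["aggs"]["suggestions"]["terms"]["field"]
    return {
        "took": 25,
        "aggregations": {
            "suggestions": {"buckets": [{"key": field + " match", "doc_count": 1}]}
        },
    }


fields = ["suggestion_term_{}".format(i) for i in range(1, 13)]
suggestions, total = suggest_sequentially(fields, "york", fake_search)
print(total)  # 300 with the stand-in above (12 requests x 25 ms)
```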
Edge ngrams
I should note that, as an alternative to using a wildcard search term, I tried to apply an edge ngram filter to the analyzer as well. That, however, typically increased total response time by 50-70% and ballooned the index size with no apparent performance benefit, so I opted to stick with the wildcard approach.
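To give a feel for why the index grows, here is a rough pure-Python imitation of edge n-gram expansion applied to tokens that have already been shingled. The min_gram and max_gram values are assumptions (the settings actually used are not given above), and the token list is a made-up example value.

```python
def edge_ngrams(token, min_gram=1, max_gram=20):
    """Return the prefixes of token between min_gram and max_gram characters long."""
    upper = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, upper + 1)]


# Terms a 2-3 word shingle filter (with unigrams) would emit for "new york city".
shingled = ["new", "new york", "new york city", "york", "york city", "city"]

expanded = [gram for token in shingled for gram in edge_ngrams(token)]
print(edge_ngrams("york"))  # ['y', 'yo', 'yor', 'york']
print(len(shingled), "tokens become", len(expanded), "edge n-grams")
```

Edge n-grams also cover prefixes only, whereas the '.*{search_term}.*' pattern matches anywhere inside a term, so the two approaches are not exactly equivalent.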
Removing the shingle filter
I should also note that I see dramatic performance improvements when I remove the shingle filter, but unfortunately multi-word queries are a requirement of the project.
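One plausible contributor (not something I have measured) is the number of distinct terms the shingle filter adds to each field, since the terms aggregation has to test its 'include' pattern against them. The sketch below imitates the analyzer in plain Python: whitespace tokenizing, lowercasing, then shingles of 2 to 3 tokens, with single tokens kept as well, which is the shingle filter's default. The function names are hypothetical and this is an approximation of the analysis chain, not Elasticsearch's implementation.

```python
def shingles(tokens, min_size=2, max_size=3, output_unigrams=True):
    """Emit the unigrams and word n-grams a shingle filter would produce."""
    out = []
    for i in range(len(tokens)):
        if output_unigrams:
            out.append(tokens[i])
        for size in range(min_size, max_size + 1):
            if i + size <= len(tokens):
                out.append(" ".join(tokens[i:i + size]))
    return out


def analyze(text):
    """Rough stand-in for autocomplete_analyzer: whitespace split, lowercase, shingle."""
    return shingles(text.lower().split())


terms = analyze("New York City Hall")
print(terms)
print(len(terms))  # 9 terms from 4 words
```

A four-word value yields nine terms instead of four, and longer values grow faster still, which would be consistent with the speed-up seen once the filter is removed.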
I suspect there may be an approach or two I've not yet tried that would significantly improve performance times, but at this point I'm grasping at straws. Any suggestions are appreciated.