Non-full text search: Elasticsearch specific capabilities for complex tasks


Hello everyone, my name is Andrew, and I am a developer. A long time ago - last Friday, it seems - our team got a project where we needed to search for the ingredients that make up products: for example, the composition of a sausage. At the very beginning, not much was required from the search: show all recipes in which a given ingredient is present in a certain quantity; repeat for N ingredients.

However, the number of products and ingredients was expected to grow significantly, so the search had to not only cope with the increasing volume of data but also provide additional options - for example, automatically composing a product description from its prevailing ingredients.

Requirements


We decided not to rush and start simple.

First of all, we saved all the ingredients of each product in the database, which initially gave us 10,000 entries. Unfortunately, even at this size, searching in the database took too long, even with joins and indices, and in the near future the number of entries was expected to exceed 50,000. In addition, the customer insisted on using Elasticsearch (hereinafter - ES), since he had come across this tool before and apparently had warm feelings for it. We had not worked with ES before, but we knew its advantages and agreed with the choice, since, for example, we expected frequent new entries (by various estimates, 50 to 500 per day) that would need to be served to users immediately.

We decided to skip any abstraction layers at the driver level and simply use REST requests, since synchronization with the database happens only when a document is created and is not needed afterwards. This gave another advantage - search queries could even be sent to ES directly from the browser.

We built the first prototype, in which we transferred the structure from the database (PostgreSQL) into ES documents:

{
  "mappings": {
    "recipe": {
      "_source": { "enabled": true },
      "properties": {
        "recipe_id":   { "type": "integer" },
        "recipe_name": { "type": "text" },
        "ingredients": {
          "type": "nested",
          "properties": {
            "ingredient_id":     { "type": "integer" },
            "ingredient_name":   { "type": "text" },
            "manufacturer_id":   { "type": "integer" },
            "manufacturer_name": { "type": "text" },
            "percent":           { "type": "float" }
          }
        }
      }
    }
  }
}

Based on this mapping, we get approximately the following document (we cannot show a real one from the project because of the NDA):

{
  "recipe_id": 1,
  "recipe_name": "AAA & BBB",
  "ingredients": [
    {
      "ingredient_id": 1,
      "ingredient_name": "AAA",
      "manufacturer_id": 3,
      "manufacturer_name": "Manufacturer 3",
      "percent": 1
    },
    {
      "ingredient_id": 2,
      "ingredient_name": "BBB",
      "manufacturer_id": 4,
      "manufacturer_name": "Manufacturer 4",
      "percent": 3
    }
  ]
}

All this was done with the Elasticsearch PHP package. We decided not to use the Laravel extensions (Elastiquent, Laravel Scout, etc.) for one reason: the customer demanded high performance, going as far as to say that "300 ms for a request is too much," and all the Laravel packages added overhead and slowed things down. We could have worked directly through Guzzle, but decided not to go to extremes.

At first, the simplest recipe search was built right on PHP arrays. Yes, it was all moved out into configuration files, but the request still turned out too big. The search ran over the nested documents (those same ingredients) with boolean expressions using "should" and "must", plus the directive requiring traversal of the nested documents - as a result, the query ran to over a hundred lines, and its size was about three kilobytes.

Nor could we forget the requirements for response speed and size: by that time, API responses were formatted to maximize the share of useful information - the keys in each JSON object were shortened to a single letter. So ES requests several kilobytes in size became an unaffordable luxury.

At that point we also realized that building huge queries as associative arrays in PHP is sheer self-torture. In addition, the controllers became completely unreadable - see for yourself:

public function searchSimilar()
{
    /* ... */
    $conditions[] = [
        "nested" => [
            "path" => "ingredients",
            "score_mode" => "max",
            "query" => [
                "bool" => [
                    "must" => [
                        ["term" => ["ingredients.ingredient_id" => $ingredient_id]],
                        ["range" => ["ingredients.percent" => [
                            "lte" => $percent + 5,
                            "gte" => $percent - 5
                        ]]]
                    ]
                ]
            ]
        ]
    ];
    $parameters['body']['query']['bool']['should'][0]['bool']['should'] = $conditions;
    /* ... */
    $equal_conditions[] = [
        "nested" => [
            "path" => "ingredients",
            "query" => [
                "bool" => [
                    "must" => [
                        ["term" => ["ingredients.percent" => $percent]]
                    ]
                ]
            ]
        ]
    ];
    $parameters['body']['query']['bool']['should'][1]['bool']['must'] = $equal_conditions;
    /* ... */
    return $this->client->search($parameters);
}

A lyrical digression: when it came to nested fields in a document, it turned out that we could not execute a query of the form:

"query": {
  "bool": {
    "nested": {
      "bool": {
        "should": [ ... ]
      }
    }
  }
}

for one simple reason - you cannot perform a multi-search inside a nested filter. So we had to do this:

"query": {
  "bool": {
    "should": [
      {"nested": {
        "path": "ingredients",
        "score_mode": "max",
        "query": {
          "bool": { ... }
        }
      }}
    ]
  }
}

i.e., first an array of "should" conditions is declared, and inside each condition a search over a nested field is invoked. From Elasticsearch's point of view, this is more correct and logical, and we saw that ourselves once we started adding further search conditions.

And here, after some googling, we discovered the search templates built into ES. They use Mustache - a rather convenient logic-less template engine. The entire request body and all the transmitted data could be moved out with virtually no changes, so the final query took the form:

{
  "template": "template1",
  "params": { ... }
}

The body of the template turned out quite modest and readable - only JSON and directives of Mustache itself. The template is stored in Elasticsearch itself and is called by name.
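For reference, storing such a template might look roughly like this - a sketch only, and version-dependent: in ES 5.6+ stored templates go through the `_scripts` endpoint, while older versions used `_search/template/<name>`:

```
POST _scripts/search_similar
{
  "script": {
    "lang": "mustache",
    "source": "{ \"query\": { \"bool\": { ... } } }"
  }
}
```

The stored template is then invoked with `GET recipes/_search/template` and a body of the form `{"id": "search_similar", "params": { ... }}`.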

/* search_similar.mustache */
{
  "query": {
    "bool": {
      "should": [
        {"bool": {
          "minimum_should_match": {{ minimumShouldMatch }},
          "should": [
            {{#ingredientsList}}      // Mustache enters the ingredientsList object
            {{#ingredients}}          // and iterates over the ingredients array
            {"nested": {
              "path": "ingredients",
              "score_mode": "max",
              "query": {
                "bool": {
                  "must": [
                    {"term": {"ingredients.ingredient_id": {{ id }} }},
                    {"range": {"ingredients.percent": {
                      "lte": {{ lte }},
                      "gte": {{ gte }}
                    }}}
                  ]
                }
              }
            }}
            {{^isLast}},{{/isLast}}   // comma after every element except the last
            {{/ingredients}}
            {{/ingredientsList}}
          ]
        }}
      ]
    }
  }
}

/* the call */
{
  "template": "search_similar",
  "params": {
    "minimumShouldMatch": 1,
    "ingredientsList": {
      "ingredients": [
        {"id": 1, "lte": 10, "gte": 5, "isLast": true }
      ]
    }
  }
}

As a result, we got a template to which we simply pass the array of required ingredients. Logically, the query differed little from, roughly, the following one:

SELECT *
FROM ingredients
LEFT JOIN recipes ON recipes.id = ingredients.recipe_id
WHERE ingredients.id IN (1, 2, 3)
  AND ingredients.id NOT IN (4, 5, 6)
  AND ingredients.percent BETWEEN 10.0 AND 20.0

but it worked faster and gave us a ready-made basis for further requests.

Besides the search by percentages, we needed several other kinds of operations: search by name among ingredients, groups, and recipe names; search by ingredient id with a tolerance on its share in the recipe; the same query but with results computed over four conditions (later converted into a separate task); and the final query.

The next request required the following logic: each ingredient carries five tags that relate it to some group. Conventionally, pork and beef are meat, while chicken and turkey are poultry. Each tag lives on its own level. Based on these tags, we could build a conditional description for a recipe, which allowed us to generate a search tree and/or descriptions automatically - for example, "meat and dairy sausage with spices", "liver with soy", "halal chicken". One recipe can contain several ingredients with the same tag. This meant we did not have to fill in the tag chain by hand - the composition of a recipe already described it unambiguously. The structure of the nested document also changed:
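The tag-to-description logic can be sketched in a few lines of Python. This is an illustration only - the tag ids and group names below are made up, not the project's real data:

```python
# Hypothetical level-1 tag dictionary (the real tag ids/names are not in the article).
LEVEL_1_TAGS = {2: "meat", 3: "poultry", 4: "dairy"}

def describe(recipe):
    """Collect the distinct level-1 tags of a recipe's ingredients,
    in order of appearance, and join them into a conditional description."""
    tags = []
    for ingredient in recipe["ingredients"]:
        name = LEVEL_1_TAGS.get(ingredient["level_1"])
        if name and name not in tags:
            tags.append(name)
    return (" and ".join(tags) + " product") if tags else "product"

recipe = {"ingredients": [
    {"ingredient_name": "pork",        "level_1": 2, "percent": 60},
    {"ingredient_name": "beef",        "level_1": 2, "percent": 30},
    {"ingredient_name": "milk powder", "level_1": 4, "percent": 10},
]}
print(describe(recipe))  # -> "meat and dairy product"
```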

{
  "ingredient_id": 1,
  "ingredient_name": "AAA",
  "manufacturer_id": 3,
  "manufacturer_name": "Manufacturer 3",
  "percent": 1,
  "level_1": 2,
  "level_2": 4,
  "level_3": 6,
  "level_4": 7,
  "level_5": 12
}

There was also a need to search by the "purity" of a recipe. For example, we might need a recipe containing nothing but beef, salt and pepper. Then we had to weed out recipes where the first level contained only beef and the second level only spices (spices have a zero first tag). We had to cheat here: since Mustache is a logic-less template engine, no computation could happen in it; the calculations had to be embedded in the request in ES's built-in scripting language, Painless. Its syntax is as close to Java as possible, so it caused no difficulties. In the end, we had a Mustache template generating JSON in which part of the computation - namely, sorting and filtering - was implemented in Painless:

"filter": [
  {{#levelsList}}
  {{#levels}}
  {"script": {
    "script": "
      int total = 0;
      for (ingredient in params._source.ingredients) {
        if ([0, {{tag}}].contains(ingredient.level_{{id}})) total += 1;
      }
      return (total == params._source.ingredients.length);
    "
  }}
  {{^isLast}},{{/isLast}}
  {{/levels}}
  {{/levelsList}}
]

Here and below, the script body is formatted for readability; in real queries the script must be a single line, since line breaks are not allowed.
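For illustration, the "purity" check that the Painless script performs can be re-implemented in Python. This is a sketch only, assuming the document structure shown above (fields `level_1` … `level_5`):

```python
def passes_purity(ingredients, level_id, tag):
    """Return True if every ingredient at the given level is either
    untagged (0) or carries the required tag -- the "nothing but X" condition."""
    total = sum(1 for ing in ingredients
                if ing[f"level_{level_id}"] in (0, tag))
    return total == len(ingredients)

# Beef (tag 2) plus an untagged ingredient passes; an extra dairy item does not.
print(passes_purity([{"level_1": 2}, {"level_1": 0}], 1, 2))                  # True
print(passes_purity([{"level_1": 2}, {"level_1": 0}, {"level_1": 4}], 1, 2))  # False
```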

By that time we had dropped the ingredient-content tolerance and found a bottleneck: a recipe could be counted as beef sausage merely because beef appears somewhere in it. Then we added - again with Painless scripts - a filter requiring that the ingredient prevail in the composition:

"filter": [
  {"script": {
    "script": "
      double nest = 0, rest = 0;
      for (ingredient in params._source.ingredients) {
        if ([{{#tags}}{{tagId}}{{^isLast}},{{/isLast}}{{/tags}}].contains(ingredient.level_{{tags.0.levelId}})) {
          nest += ingredient.percent;
        } else {
          if (ingredient.percent > rest) { rest = ingredient.percent }
        }
      }
      return (nest >= rest);
    "
  }}
]
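The predominance filter boils down to one comparison: the combined share of the tagged ingredients must be at least the share of the largest single other ingredient. An illustrative Python re-implementation (not project code):

```python
def predominates(ingredients, tag_ids, level_id):
    """True if the summed percentage of ingredients carrying one of tag_ids
    at the given level is >= the percentage of every other single ingredient."""
    nest = rest = 0.0
    for ing in ingredients:
        if ing[f"level_{level_id}"] in tag_ids:
            nest += ing["percent"]
        elif ing["percent"] > rest:
            rest = ing["percent"]
    return nest >= rest

# Meat (tag 2) totals 60% vs. the biggest other ingredient at 50% -> prevails.
ings = [{"level_1": 2, "percent": 40},
        {"level_1": 2, "percent": 20},
        {"level_1": 0, "percent": 50}]
print(predominates(ings, {2}, 1))  # True
```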

As you can see, Elasticsearch lacked many things this project needed, so they had to be assembled from "improvised means". That is no surprise - the project is atypical for an engine built for full-text search.

At one of the intermediate stages of the project, we needed the following: list all available ingredient groups and the number of recipes in each. Here the same problem surfaced as in the predominance query: out of 10,000 recipes, about 10 groups were generated based on the contents, but in total these groups contained about 40,000 recipes, which did not correspond to reality at all. Then we began to dig in the direction of parallel queries.

With the first request we fetched the list of all groups on the first level, without occurrence counts. After that, a multi-search request was generated: for each group, a sub-request obtained the real number of recipes based on the prevailing percentage. All these requests were collected into one and sent to Elasticsearch. The response time of the combined request equaled the processing time of the slowest sub-request, since bulk search lets them run in parallel. Similar logic (just grouping by condition in a query) in SQL took about 15 times longer.
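The body of such a multi-search request is plain NDJSON: a header line naming the index, then a template-invocation line, repeated per group. A minimal Python sketch of building it (the index and template names here are hypothetical):

```python
import json

def build_msearch_body(group_ids, index="recipes"):
    """Build the newline-delimited body of an ES _msearch/template request:
    one header line plus one template-invocation line per group."""
    lines = []
    for group_id in group_ids:
        lines.append(json.dumps({"index": index}))
        lines.append(json.dumps({
            "id": "count_by_group",  # hypothetical stored-template name
            "params": {"group_id": group_id, "size": 0},
        }))
    return "\n".join(lines) + "\n"  # the body must end with a newline

body = build_msearch_body([1, 2])
```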

/* fetch the list of groups */
$params = config('elastic.params');
$params['body'] = config('elastic.top_list');

return (Elastic::getClient()->search($params))['aggregations']['tags']['buckets'];
/* ... */

After that we needed to evaluate in the framework of one request:

  1. how many recipes are available for the current composition;
  2. which other ingredients can be added to the composition (sometimes adding an ingredient produced an empty result);
  3. which of the selected ingredients can be marked as the only ones at their level.

Proceeding from the task, we combined the logic of the last query, which returns the recipe list, with the logic of obtaining exact counts from the list of all available groups:

/* the aggregation in the template */
"aggs": {
  "tags": {
    "terms": {
      "field": "ingredients.level_{{ level }}",
      "order": { "_term": "asc" },
      "exclude": [ {{#exclude}}{{ id }},{{/exclude}} 0 ]
    },
    "aggs": { "reverse_nested": {} }   // count parent documents rather than nested ones
  }
}

/* building the per-ingredient requests */
foreach ($not_only as $element) {
    $parameters['body'][] = config('elastic.params');
    $parameters['body'][] = self::getParamsBody(
        $body,
        collect($only->all())->push($element),
        $max_level, 0, 0
    );
}

/* the final request */
$parameters['body'][] = config('elastic.params');
$parameters['body'][] = self::getParamsBody(
    $body, $only, $max_level, $from, $size
);

/* limit on concurrent searches */
$parameters['max_concurrent_searches'] = 1 + $not_only->count();

return (Elastic::getClient()->msearchTemplate($parameters))['responses'];

As a result, we got a request that finds all matching recipes and their total number (taken from response["hits"]["total"]). For simplicity, this request was placed last in the list.

Additionally, through aggregation we obtained all ingredient ids for the next level. For each ingredient not yet marked as "only", we created a query in which it was tagged accordingly, and then simply counted the documents found. If the count was greater than zero, the ingredient was considered eligible for the "only" key. I think that even without my help you can reconstruct the full template we ended up with:

{
  "from": {{ from }},
  "size": {{ size }},
  "query": {
    "bool": {
      "must": [
        {{#ingredientTags}}
        {{#tagList}}
        {"bool": {
          "should": [
            {"term": {"level_{{ levelId }}": {{ tagId }} }}
          ]
        }}
        {{^isLast}},{{/isLast}}
        {{/tagList}}
        {{/ingredientTags}}
      ],
      "filter": [
        {"script": {
          "script": "
            double nest = 0, rest = 0;
            for (ingredient in params._source.ingredients) {
              if ([{{#tags}}{{tagId}}{{^isLast}},{{/isLast}}{{/tags}}].contains(ingredient.level_{{tags.0.levelId}})) {
                nest += ingredient.percent;
              } else {
                if (ingredient.percent > rest) { rest = ingredient.percent }
              }
            }
            return (nest >= rest);
          "
        }}
        {{#levelsList}},
        {{#levels}}
        {"script": {
          "script": "
            int total = 0;
            for (ingredient in params._source.ingredients) {
              if ([0, {{tag}}].contains(ingredient.level_{{id}})) total += 1;
            }
            return (total == params._source.ingredients.length);
          "
        }}
        {{^isLast}},{{/isLast}}
        {{/levels}}
        {{/levelsList}}
      ]
    }
  },
  "aggs": {
    "tags": {
      "terms": {
        "field": "ingredients.level_{{ level }}",
        "order": { "_term": "asc" },
        "exclude": [ {{#exclude}}{{ id }},{{/exclude}} 0 ]
      },
      "aggs": { "reverse_nested": {} }
    }
  },
  "sort": [
    {"_score": {"order": "desc"}}
  ]
}

Of course, we cache part of this pile of templates and requests (for example, the page of all available groups with their recipe counts), which adds a bit of performance to the main page. This solution allowed the data for the main page to be collected in 50 ms.

Project results

We implemented search over a database of at least 50,000 documents on Elasticsearch, which lets users search for ingredients in product compositions and obtain a product description from the ingredients it contains. Soon this database will grow about sixfold (the data is being prepared), so we are quite satisfied both with our results and with Elasticsearch as a search tool.

As for performance, we met the project's requirements, and we are glad that the average response time is 250-300 ms.

Three months after we started working with Elasticsearch, it no longer seems so confusing and unusual. And the advantages of templating are obvious: if we see that a request is growing too large again, we simply move the additional logic into the template and send the original request to the server almost unchanged.

“Have a nice day and thank you for the fish!” (C)

P.S. At the last moment we also needed sorting by Russian characters in the title, and it turned out that Elasticsearch does not handle the Russian alphabet adequately. The conventional sausage "Ultra mega pork 9000 calories" was reduced inside the sort to just "9000" and ended up at the bottom of the list. As it turned out, this problem is solved quite simply by converting Russian characters into Unicode escape notation of the form \u042B.
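One way to implement such a conversion - a sketch only; the project's exact scheme is not shown in the article - is the `unicode_escape` codec, which rewrites each non-ASCII character as its `\uXXXX` form:

```python
def sort_key(title):
    """Replace non-ASCII characters (e.g. Cyrillic) with their \\uXXXX
    escape form so that titles get a predictable, ASCII-only sort key."""
    return title.encode("unicode_escape").decode("ascii")

print(sort_key("Ы"))      # -> \u042b
print(sort_key("Ultra"))  # ASCII passes through unchanged -> Ultra
```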

Source: https://habr.com/ru/post/413075/
