Non-full text search: specific features of Elasticsearch for complex tasks

SHARE

ABNORMAL PROGRAMMING *SEARCH TECHNOLOGIES *PHP * DESIGN AND REFACTORING *API *

Non-full text search: specific features of Elasticsearch for complex tasks

Hi everyone, my name is Andrey, and I’m a developer. A long time ago, our team had a project where they needed to search for the ingredients that make up the products. Let’s say, the ingredients of a sausage. At the very beginning of the project, the search was not required to do much: show all recipes in which the desired ingredient is contained in a certain amount; repeat for N ingredients.

However, in the future, it was planned to significantly increase the number of products and ingredients, and the search had to not only cope with the growing volume of data, but also provide additional options – for example, automatically compiling a product description based on its prevailing ingredients.

Requirements

  • Create a search on Elacsticsearch with a database of at least 50,000 documents.
  • Provide high speed response to requests – less than 300 ms.
  • Ensure that requests have small volume and the service is available even in the worst mobile Internet conditions.
  • Make search logic as intuitive as possible in terms of UX. It was essentially about the fact that the interface will reflect the logic of the search – and vice versa.
  • Minimize the number of layers between system elements for higher performance and fewer dependencies.
  • Provide for the possibility at any time to supplement the algorithm with new conditions (for example, automatic generation of a product description).
  • Make further support of the project’s search part as simple and convenient as possible.

We decided to take our time and start easy.

First of all, we saved all the ingredients of the product composition in the database getting 10,000 records at first. Unfortunately, even with this size, searching the database took too much time, even taking into account the use of joins and indexes. And in the near future, the number of records was supposed to exceed 50,000. In addition, the customer insisted on using Elasticsearch (hereinafter referred to as ES), because he came across this tool and, apparently, had warm feelings for it. We had not worked with ES before, but we knew about its advantages and agreed with this choice, since, for example, it was planned that we would often have new records (according to various estimates, from 50 to 500 per day) that would be immediately needed to supply a user with.

We decided to abandon layers at the driver level and simply use REST-requests since synchronization with the database is done only at the time of document creation and is no longer needed. This was another advantage – up to sending search queries directly to ES from the browser.

We put together the first prototype where we transferred the structure from the database (PostgreSQL) to ES documents:

{"mappings" : {
    "recipe" : {
        "_source"    : {
            "enabled" : true
        },
        "properties" : {
            "recipe_id"     : {"type" : "integer"},
            "recipe_name"   : {"type" : "text"},
            "ingredients"   : {
                "type" : "nested",
                "properties": {
                    "ingredient_id":    "integer",
                    "ingredient_name":  "string",
                    "manufacturer_id":  "integer",
                    "manufacturer_name":    "string",
                    "percent":      "float"
                }
            }
        }
    }
}}

Based on this mapping, we approximately get the following document (we cannot show a working document from the project due to NDA):

{
    "recipe_id": 1,
    "recipe_name": "AAA & BBB",
    "ingredients": [
        {
            "ingredient_id": 1,
            "ingredient_name": "AAA",
            "manufacturer_id": 3,
            "manufacturer_name": "Manufacturer 3",
            "percent": 1
        },
        {
            "ingredient_id": 2,
            "ingredient_name": "BBB",
            "manufacturer_id": 4,
            "manufacturer_name": "Manufacturer 4",
            "percent": 3
        }
    ]
}

All this was done using the Elasticsearch PHP package. It was decided to omit extensions for Laravel (Elastiquent, Laravel Scout, etc.) for one reason – the customer demanded high performance, up to, as mentioned above, that “300 ms for a request is a lot.” And all the packages for Laravel acted as an extra overhead and slowed down the work. It would be possible to do it directly on Guzzle, but we decided not to rush into extremes.

At first, the simplest search for recipes was done directly on arrays. Yes, it was all moved to the configuration files, but the request still turned out to be too large. The search went through nested documents (those exact ingredients), and boolean expressions using “should” and “must”. In addition, there was a directive command of mandatory pass through nested documents that acted. As a result, the request consisted of one hundred lines and above, and its volume started from three kilobytes.

Do not forget about the requirements for the speed and size of the response – at this point, the responses in the API were formatted to increase the amount of useful information: the keys in each json-object were shortened to one letter. Therefore, requests in ES of a few kilobytes became an unattainable luxury.

And we also realized at that time that building giant requests in the form of associative arrays in PHP is some kind of fierce insanity. In addition, the controllers became completely unreadable, just have a look:

public function searchSimilar()
{
    /*...*/
 
    $conditions[] = [
        "nested" =>  [
            "path" =>  "ingredients",
            "score_mode" => "max",
            "query" =>  [
                "bool" =>  [
                    "must" =>  [
                        ["term" => ["ingredients.ingredient_id" => $ingredient_id]],
                        ["range" => ["ingredients.percent"=>[
                            "lte"=>$percent + 5,
                            "gte"=>$percent - 5
                        ]]]
                    ]
                ]
            ]
        ]
    ];
 
    $parameters['body']['query']['bool']['should'][0]['bool']['should'] = $conditions;
 
    /*...*/
 
    $equal_conditions[] = [
        "nested" =>  [
            "path" =>  "flavors",
            "query" =>  [
                "bool" =>  [
                    "must" =>  [
                        ["term" => ["ingredients.percent" => $percent]]
                    ]
                ]
            ]
        ]
    ];
 
    $parameters['body']['query']['bool']['should'][1]['bool']['must'] = $equal_conditions; 
 
    /*...*/
 
    return $this->client->search($parameters);
}

Lyrical digression: when it came to nested fields in the document, it turned out that we could not execute a query like this

"query": {
    "bool": {
        "nested": {
            "bool": {
                "should": [ ... ]
            }
        }
    }
}

for one simple reason – you can’t multisearch inside a nested-filter. So I had to do it this way:

"query": {
    "bool": {
        "should": [
            {"nested": {
                "path":  "flavors",
                "score_mode": "max",
                "query": {
                    "bool": { ... }
                }
            }}
        ]
    }
}

i.e. an array of conditions Should was first declared, and inside each condition, a search on the nested-field was called. From the point of view of Elasticsearch, this is more correct and logical. As a result, we ourselves saw that this is logical when we added additional search conditions.

And here we discovered Google templates built into ES. The choice fell on Mustache, a rather convenient logic-less template engine. It was possible to put the entire request body and all transmitted data into it with virtually no changes. As a result, the final request took the form of:

{ "template": "template1", "params": params{} }

The body of the template turned out to be quite modest and readable -– only JSON and Mustache directives. The template is stored in Elasticsearch itself and called by name./

/* search_similar.mustache */
{
    "query": {
        "bool": {
            "should":  [
                {"bool": {
                    "minimum_should_match":  {{ minimumShouldMatch }},
                    "should":  [
                    {{#ingredientsList}} // the feature of mustache is that it checks for a non-empty ingredientsList object
                        {{#ingredients}}   // and here the same directive is a pass along the ingredients array
                        {"nested": {
                            "path":  "ingredients",
                            "score_mode": "max",
                            "query": {
                                "bool": {
                                    "must":  [
                                        {"term": {"ingredients.flavor_id":  }},
                                        {"range": {"ingredients.percent" : {
                                            "lte": {{ lte }},
                                            "gte": {{ gte }}
                                        }}}
                                    ]
                                }
                            }
                        }}
                        {{^isLast}},{{#isLast}} // flag of the last element
                        {{#ingredients}}
                    {{#ingredientsList}}
                    ]
                }}
            ]
        }
    }
}
 
/* query */
{
    "template": "search_similar",
    "params": {
        "minimumShouldMatch": 1,
        "ingredientsList": {
            "ingredients": [
                {"id": 1, "lte": 10, "gte": 5, "isLast": true }
            ]
        }
    }
}

As a result, we got a template which we simply passed an array of the necessary ingredients to. Logically, the request differed a little from, roughly, the following:

SELECT * 
FROM ingredients 
LEFT JOIN recipes ON recipes.id = ingredient.recipe_id 
WHERE ingredients.id in (1,2,3) 
AND 
ingredients.id not in (4,5,6) 
AND ingredients.percent BETWEEN 10.0 AND 20.0

but it worked faster, and it was a ready-made basis for limiting requests.

Here, in addition to searching by percentages, we needed several more types of operations: search by name among ingredients, groups and names of recipes; search by ingredient ID taking into account the tolerance of its content in the recipe; the same query, but with the calculation of the results for four conditions (it was later redesigned for another task), as well as the final query.

The query required the following logic: for each ingredient, there are five tags that attribute it to a group. Conventionally, pork and beef are meat, and chicken and turkey are poultry. Each of the tags is located at its own level. Based on these tags, we could create a conditional description for the recipe, which allowed us to generate a search tree and/or descriptions automatically. For example, meat and milk sausage with spices, liver sausage with soy, halal chicken. One recipe can have several ingredients with the same tag. This allowed us not to fill the chain of tags with our hands – based on the ingredients of the recipe, we could already totally describe it. The structure of the nested document has also changed:

{
    "ingredient_id": 1,
    "ingredient_name": "AAA",
    "manufacturer_id": 3,
    "manufacturer_name": "Manufacturer 3",
    "percent": 1,
    "level_1": 2,
    "level_2": 4,
    "level_3": 6,
    "level_4": 7,
    "level_5": 12
}

There was also a need to set the search by the condition of the recipe’s “purity”. For example, we needed a recipe where there would be nothing but beef, salt and pepper. In this case, we had to filter out recipes where the first level would have only beef, and the second one would have only spices (the first tag for spices was zero). I had to cheat here, to tell the truth: since Mustache is a template without logic, there could be no question of any calculations; it was required to embed part of the script written in the built-in ES scripting language – Painless into the request. Its syntax is as close to Java as possible so there were no difficulties. As a result, we had a Mustache-template that generates JSON where part of the calculations, namely sorting and filtering, were implemented on Painless:

"filter": [
    {{#levelsList}}
        {{#levels}}
        {"script": {
            "script": "
                int total=0;
                for (ingredient in params._source.ingredients){
                    if ([0,].contains(ingredient.level_))
                        total+=1;
                }
                return (total==params._source.ingredients.length);
            "
        }}
        {{^isLast}},{{/isLast}}
        {{/levels}}
    {{/levelsList}}
]

Hereinafter, the body of the script is formatted for readability; line breaks cannot be used in queries.

By that time, we had removed the tolerance for the content of the ingredient and found a bottleneck – we could count the sausage as beef only because this ingredient is found there. Then we added, all using the same Painless scripts, filtering according to the condition that this ingredient should prevail in the composition:

"filter": [
    {"script":{
        "script": "
            double nest=0,rest=0;
            for (ingredient in params._source.ingredients){
                if([{{#tags}}{{tagId}}{{^isLast}},{{/isLast}}{{/tags}}].contains(flavor.level_{{tags.0.levelId}})){
                    nest+= ingredient.percent;
                }else{
                    if (ingredient.percent>rest){rest = ingredient.percent}
                }
            }
            return(nest>=rest);
        "
    }}
]

As you can see, for this project, Elasticsearch lacked many things, so they had to be assembled by “means turned out to be at hand”. It should not be surprising – the project is quite atypical for a machine that is used for a full-text search.

One of the intermediate stages of the project required the following thing from us: display a list of all available groups of ingredients and the number of positions in each. The same problem as in the prevailing request was revealed here: out of 10,000 recipes, about 10 groups were generated based on the content. However, in total these groups had about 40,000 recipes that did not correspond to reality at all. So then we started digging towards parallel queries.

With the first request, we received a list of all groups that are on the first level without the number of occurrences. After that, a multi-request was generated: for each group, a request was made to receive the real number of prescriptions according to the principle of the prevailing percentage. All these requests were collected into one and sent to Elasticsearch. The response time for the overall request was equal to the processing time for the slowest request. Bulk-aggregation made it possible to parallelize them. Similar logic (just grouping by condition in a query) in SQL took about 15 times more time.

/* first query */
$params = config('elastic.params');
$params['body'] = config('elastic.top_list');
 
return (Elastic::getClient()->search($params))['aggregations']['tags']['buckets'];
 
/* second query */

After that, within a single request we needed to evaluate:

  1. how many recipes are available for the current ingredients;
  2. what other ingredients can we add to the composition (sometimes we added an ingredient and got an empty sample);
  3. which ingredients among the selected above we can mark as the only ones at this level.

Based on the task, we combined the logic of the last received request for the recipe list and the logic of getting exact numbers from the list of all available groups:

/* aggregation */
 
"aggs" : {  // aggregation by the number of available tags
    "tags" :{   // will return the number of matches
        "terms" :{
            "field"   : "ingredients.level_",
            "order"   : {"_term" : "asc"},
            "exclude" : [ {{#exclude}}{{ id }},{{/exclude}} 0]
        },
        "aggs": {
            "reverse_nested": {}
        }   // will return the number of real documents, not matches
    }
}
 
/* generalized query */
 
foreach ($not_only as $element) {
    $parameters['body'][] = config('elastic.params');
    $parameters['body'][] = self::getParamsBody(
        $body,
        collect($only->all())->push($element),
        $max_level,
        0,
        0
    );
}
 
/* Main query */
$parameters['body'][] = config('elastic.params');
$parameters['body'][] = self::getParamsBody(
    $body,
    $only,
    $max_level,
    $from,
    $size')
);
 
/* Number of concurrent search threads */
$parameters['max_concurrent_searches'] = 1 + $not_only->count();
 
return (Elastic::getClient()->msearchTemplate($parameters))['responses'];

As a result, we received a request that finds all the necessary recipes and their total number (it was taken from response[“hits”][“total”]). For simplicity, this request was written at the last place in the list.

Additionally, through aggregation, we received all the ingredient IDs for the next level. For each of the ingredients not marked as “the only one” we created a request where we accordingly marked it, and then simply counted the number of documents found. If it was greater than zero, then the ingredient was considered available for assignment of the “single” key. I think here, having a mind of your own, you can reconstruct the entire template that we got at the output:

{
    "from": {{ from }},
    "size": {{ size }},
    "query": {
        "bool": {
            "must":  [
                {{#ingredientTags}}
                    {{#tagList}}
                    {"bool": {
                        "should":  [
                            {"term": {"level_{{ levelId }}": {{ tagId }} }}
                        ]
                    }}
                    {{^isLast}},{{/isLast}}
                    {{/tagList}}
                {{/ingredientTags}}
            ],
            "filter": [
                {"script":{
                    "script": "
                        double nest=0,rest=0;
                        for(ingredient in params._source. ingredients){
                            if([{{#tags}}{{tagId}}{{^isLast}},{{/isLast}}{{/tags}}].contains(ingredient.level_{{tags.0.levelId}})){
                                nest+= ingredient.percent;
                            }else{
                                if (ingredient.percent>rest){
                                    rest= ingredient.percent
                                }
                            }
                        }
                        return(nest>=rest);
                    "
                }}
                {{#levelsList}},
                    {{#levels}}
                        {"script": {
                            "script": "
                                int total=0;
                                for(ingredient in params._source.ingredients){
                                    if ([0,].contains(ingredient.level_))
                                        total+=1;
                                }
                                return (total==params._source.ingredients.length);
                            "
                        }}
                        {{^isLast}},{{/isLast}}
                    {{/levels}}
                {{/levelsList}}
            ]
        }
    },
    "aggs" : {
        "tags" :{
            "terms" :{
                "field"   : "ingredients.level_{{ level }}",
                "order"   : {"_term" : "asc"},
                "exclude" : [ {{#exclude}}{{ id }},{{/exclude}} 0]
            },
            "aggs": {
                "reverse_nested": {}
            }
        }
    },
    "sort": [
        {"_score": {"order": "desc"}}
    ]
}

Obviously, we cache part of this heap of templates and queries (like, for example, a page of all available groups with the number of available recipes) which adds a bit of performance to the main page. This decision made it possible to ensure that the data for the main one is collected in 50 ms.

Project results

We have implemented a database search of at least 50,000 documents on Elasticsearch which allows you to search for ingredients in the composition of products and get a description of the product by the ingredients it contains. Soon this database will grow by about six times (the data is being prepared), so we are quite satisfied with both our results and Elasticsearch as a search tool.

As to the performance issue – we met the requirements of the project, and we ourselves are glad that the response time to a request is 250-300 ms on average.

Three months after starting work with Elasticsearch, it no longer seems so confusing and unusual. And the advantages of templating are obvious: if we see that the request becomes too large again, we simply transfer additional logic to the template and again send the original request to the server with little or no changes.

«So Long, and Thanks for all the Fish!» (с)

P.S. Last minute, we also needed sorting by Russian characters in the title. And then it turned out that Elasticsearch does not adequately perceive the Russian alphabet. The assumed sausage “Ultra mega pork 9000 calories” turned into just “9000” inside the sorting and found itself in the end of the list. As it turned out, this problem is quite simply solved by converting Russian characters into a Unicode-notation like u042B.

Tags:

  • php
  • elasticsearch
  • developing
  • search technologies
  • performance
  • rest
  • postgresql
  • api
  • abnormal programming

Hubs:

  • Abnormal programming
  • Search technologies
  • PHP
  • Design and refactoring
  • API