Elasticsearch – how the full text search system works: pros and cons, alternatives and life hacks

When you work on e-commerce projects, online stores, online pharmacies, marketplaces, or even dating sites, you have to handle a large amount of data and provide a convenient search tool. Elasticsearch is such a tool.

In this article, we will tell you what Elasticsearch is and how to use it, consider its advantages and disadvantages, and show examples of use; we will also share life hacks for using this search tool.

WHAT IS ELASTICSEARCH

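Elasticsearch is a full-text search engine written in Java. It is also a non-relational JSON document store. It was originally released as an open source project under the Apache 2.0 license; since version 7.11 the source code has been distributed under the Elastic License and SSPL, with AGPLv3 added as an option in 2024.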

HOW IT WORKS

Data is sent to Elasticsearch as JSON documents using an API or a tool like Logstash. Elasticsearch saves the document to the cluster index and makes it searchable. The document can then be found and retrieved using the Elasticsearch API.
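
As a rough sketch of that flow, the snippet below builds the two HTTP requests involved: one that indexes a JSON document and one that searches for it. It uses only Python's standard library and does not send anything. The cluster address http://localhost:9200, the index name products, and the document fields are assumptions made for illustration.

```python
import json
from urllib.request import Request

ES_URL = "http://localhost:9200"  # assumed address of a local test cluster


def build_index_request(index: str, doc_id: str, document: dict) -> Request:
    """Build a PUT /<index>/_doc/<id> request that stores one JSON document."""
    return Request(
        url=f"{ES_URL}/{index}/_doc/{doc_id}",
        data=json.dumps(document).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def build_search_request(index: str, field: str, text: str) -> Request:
    """Build a POST /<index>/_search request with a full-text match query."""
    body = {"query": {"match": {field: text}}}
    return Request(
        url=f"{ES_URL}/{index}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical index and document, used only for illustration.
index_req = build_index_request("products", "1", {"itemName": "pills 25mg"})
search_req = build_search_request("products", "itemName", "pills")
print(index_req.get_method(), index_req.full_url)
print(search_req.get_method(), search_req.full_url)
```

Passing either object to urllib.request.urlopen would send it to a running cluster.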

ES lets you search documents in near real time, scales horizontally, and supports multithreading. It outperforms other NoSQL systems in the quality and speed of text processing and in flexible full-text search across the entire document database.

WHY ELASTICSEARCH IS CONVENIENT TO WORK WITH

In services with a lot of data, such as e-commerce projects, online stores, online pharmacies, marketplaces, and dating sites, search is built on facets. Facets are search criteria, such as a set of tags or a filter by value.
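
In ES, facet counts are typically obtained with terms aggregations. A minimal sketch of a request body that asks for such counts might look like this; the field names brand and color are hypothetical.

```python
import json


def build_facet_body(facet_fields: list[str], bucket_limit: int = 10) -> dict:
    """Return a search body that requests one terms aggregation per facet field."""
    return {
        "size": 0,  # only the facet counts are needed here, not the documents
        "aggs": {
            field: {"terms": {"field": field, "size": bucket_limit}}
            for field in facet_fields
        },
    }


body = build_facet_body(["brand", "color"])
print(json.dumps(body, indent=2))
```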

Working with a large amount of data, we face the standard data schema of a typical e-commerce project in a DBMS.

Such a CMS stores products, properties, categories, category properties and property categories, as well as many relations and entities. That is too much data, and to structure it all properly we need a search tool.

ALTERNATIVES TO ELASTICSEARCH

Before you start working with Elasticsearch, let’s look at alternative DBMSs that can help make search convenient: MongoDB and PostgreSQL.

MongoDB:

  • There is no schema, that is, you don’t need to worry about the data structure and can add anything you like;
  • Web scale – when the load increases, scaling is very easy both horizontally and vertically;
  • There is no analogue of the JOIN construct, which means that we do not waste time on joining tables, everything is already in the document;
  • All validation is done in code. If you open any manual, you will see that the driver for MongoDB is installed first, and only then the validation library;
  • There are no analogues of stored procedures. Again, it’s all in the code.

PostgreSQL:

  • There is non-relational storage in JSON format (the json and jsonb column types), and you can put constraints on fields. For example, if a user table holds personal data that needs restrictions, a JSON column will handle this just fine;
  • Built-in text search functions that provide full-text search;
  • There are stored procedures;
  • A large number of tables, columns, and indexes;
  • Not all ORMs understand JSON.

Using MongoDB and PostgreSQL for full text search, we get limited functionality. In some cases, it will be enough, but for full convenience, we suggest using Elasticsearch (hereinafter – ES). It also has its pros and cons.

PROS

  1. A RESTful API is used to communicate with Elasticsearch, in simple terms ordinary HTTP requests, so you can work with it directly from the browser. To view information about the service, just go to localhost:9200;
  2. You can validate data at the storage level; in ES this is done with mappings;
  3. ES can work with different queries (simple, complex, structured) and different types of data;
  4. The JSON query language (Query DSL) has an even more declarative syntax than SQL;
  5. Kibana is a visual tool for interacting with the data stored in ES indexes. The Kibana web interface lets you execute and test queries and configure the ES cluster itself, among many other things.
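
To illustrate the point about mappings, here is a minimal sketch of an index definition built in Python. The field names and types are hypothetical; "dynamic": "strict" is the mapping setting that makes ES reject documents containing fields that are not declared.

```python
import json

# Hypothetical product index: field names and types are illustrative.
create_index_body = {
    "mappings": {
        # "strict" makes ES reject documents that contain unmapped fields.
        "dynamic": "strict",
        "properties": {
            "itemName": {"type": "text"},
            "active": {"type": "boolean"},
            "packQuantity": {"type": "integer"},
        },
    }
}

# This JSON would be the body of the index creation request (PUT /<index>).
print(json.dumps(create_index_body, indent=2))
```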

In addition to the advantages, ES has, of course, disadvantages.

CONS

ES developers provide the necessary libraries for most popular languages in the Elasticsearch Clients section. And for a number of reasons, this is not the best choice. The client for each language has its own imperfections:

  • Java Client – nested chains of lambda calls, a syntax that is an acquired taste;
  • JavaScript Client – most similar to jQuery.ajax – in general, it works fine;
  • Go Client – you can call Search().Raw([]byte), i.e. we write the query as multi-line text, in the same way as in the example with PHP and HEREDOC;
  • .NET Clients – something between Java and Go clients;
  • PHP Client – does not know how to work with templates and, in general, knows little;
  • Python Client – offers a very strange DSL.

Using Elasticsearch Clients, we try to take a native ES request in JSON format like this:

{
 "_source" : [
   "guidEsId",
   ... 36 fields more ...
   "expireDates.77"
 ],
 "query" : {
   "function_score" : {
     "query" : {
       "bool" : {
         "must" : [
           {"exists" : {"field" : "priceBuyer.77"}},
           {"terms" : {"mnn" : ["phenolphtalein"]}},
           {"term" : {"active" : "1"}},
           {"regexp" : {"itemName" : ".*pills.*25mg.*"}},
           {"range" : {"packQuantity" : {"lte" : 10}}},
           {"range" : {"priceBuyer.77" : {"gt" : 0}}},
           {"terms" : {"itemGroup" : [96]}},
           {
             "bool" : {
               "should" : [
                 {"range" : {"quantities.77.1" : {"gt" : 0}}},
                 {"range" : {"quantities.77.8" : {"gt" : 0}}},
                 {"range" : {"quantities.77.7" : {"gt" : 0}}},
                 {"range" : {"quantities.77.6" : {"gt" : 0}}}
               ]
             }
           }
         ]
       }
     }
   }
 }
}

and build it in the usual OOP style. But what we end up with in ES Clients is the following:


SearchResponse<Product> search = new ElasticsearchClient().search(s -> s
   .query(q -> q
       .bool(b -> b
           .must(m -> m
               .exists(...)
               .terms(...)
               .term(...)
               .regexp(...)
               .range(...)
               .range(...)
               .terms(...)
               .bool(inner -> inner
                   .should(sh -> sh
                       .range(...)
                       .range(...)
                   )
               )
           )
       )
   )
);

At first, this may seem fine and the code will be easy to write. But once your queries reach 100-120 lines, you end up with huge, unreadable call chains that other developers will struggle to maintain.

That’s why we suggest an alternative to Elasticsearch Clients:

Programming language client | What we suggest replacing it with
Java Client | org.apache.http.client.HttpClient
JavaScript Client | node:http / axios / Fetch API
Ruby Client | Net::HTTP
Go Client | net/http
.NET Clients | System.Net.Http.HttpClient
PHP Client | curl / GuzzleHttp
Python Client | requests or httpx
Rust Client | reqwest
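
A sketch of that approach in Python, assuming a cluster at http://localhost:9200 and a hypothetical index named products. To stay dependency-free it uses the standard library's urllib rather than requests or httpx; the query is a shortened fragment of the request shown earlier, kept as plain JSON text.

```python
import json
from urllib.request import Request, urlopen

ES_URL = "http://localhost:9200"  # assumed cluster address

# The query stays readable JSON text, exactly as it would be sent to ES.
QUERY = """
{
  "query": {
    "bool": {
      "must": [
        {"term": {"active": "1"}},
        {"range": {"packQuantity": {"lte": 10}}}
      ]
    }
  }
}
"""


def build_search(index: str, raw_query: str) -> Request:
    """Validate the JSON text locally and wrap it in a POST /<index>/_search."""
    json.loads(raw_query)  # raises ValueError early if the text is not valid JSON
    return Request(
        url=f"{ES_URL}/{index}/_search",
        data=raw_query.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_search(index: str, raw_query: str) -> dict:
    """Send the request; needs a reachable cluster, so it is not called here."""
    with urlopen(build_search(index, raw_query)) as response:
        return json.load(response)


request = build_search("products", QUERY)
print(request.get_method(), request.full_url)
```

The query can be copied between this file, Kibana, and the documentation without any translation into builder calls, which is the main appeal of the plain HTTP approach.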

LIFE HACKS TO HELP YOU WHEN USING ELASTICSEARCH

  • You don’t always need ESClient

Despite what we said before, at the beginning of the journey you can and should use ESClient, since writing in the JSON query language can feel unusual and extremely difficult at first. Over time, you will build up a pool of typical queries that are executed again and again, and at that point you will most likely get tired of writing chained calls in a query builder. Then you can move on to the “drop ESClient” stage and study search templates, in particular the Mustache template engine.
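
As an illustration of the template stage, the sketch below builds the two request bodies used with stored search templates: one that saves a Mustache template and one that runs it with parameters. The template id product-search, the field itemName, and the parameter names are hypothetical.

```python
import json

# Body for PUT /_scripts/<template-id>: stores a Mustache search template.
# The field name and the parameter names are hypothetical.
store_template_body = {
    "script": {
        "lang": "mustache",
        "source": {
            "query": {"match": {"itemName": "{{search_text}}"}},
            "size": "{{page_size}}",
        },
    }
}


def build_template_search(template_id: str, **params) -> dict:
    """Body for POST /<index>/_search/template: run a stored template."""
    return {"id": template_id, "params": params}


search_body = build_template_search("product-search", search_text="pills", page_size=20)
print(json.dumps(store_template_body, indent=2))
print(json.dumps(search_body, indent=2))
```

With this split, the application only passes parameters, and the query text itself lives in one place on the cluster.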

  • Beware of nested fields

ES has “nested fields”, the nested object type. The contents of such a field are indexed as separate hidden documents. For example, suppose you have a list of users, and each user has an inner object “the organization this user belongs to”. If you index it in the usual way, the organization is flattened into ordinary fields of the user document. If you make it nested, each organization object is indexed as its own hidden document next to the user. This keeps the fields of one inner object together for querying, but it does not deduplicate anything: every user still carries their own copy of the organization, the number of indexed documents grows, and nested queries cost more. Use nested fields only where you really need to query inner objects independently.
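
A nested field has to be declared in the mapping and is then searched with a nested query. A minimal sketch of both, with hypothetical field names:

```python
import json

# Mapping fragment: "organization" is declared as a nested field.
# Field names are hypothetical.
nested_mapping = {
    "mappings": {
        "properties": {
            "name": {"type": "text"},
            "organization": {
                "type": "nested",
                "properties": {
                    "title": {"type": "keyword"},
                    "city": {"type": "keyword"},
                },
            },
        }
    }
}


def build_nested_query(path: str, conditions: dict) -> dict:
    """Build a nested query where every condition must match the same inner object."""
    return {
        "query": {
            "nested": {
                "path": path,
                "query": {
                    "bool": {
                        "must": [
                            {"term": {f"{path}.{field}": value}}
                            for field, value in conditions.items()
                        ]
                    }
                },
            }
        }
    }


query = build_nested_query("organization", {"title": "Acme", "city": "Berlin"})
print(json.dumps(query, indent=2))
```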

  • Compress responses 

ES has the filter_path query parameter, which lets you specify which fields to keep in the final response. A search response is an object with a hits field, which in turn contains a hits array of the matching documents. Each of those objects has fields such as _source and _score.

  • hits.hits._source – will return the body of the document;
  • hits.hits._score – will return the relevance score;
  • hits.hits._id – will return the document’s internal ID.

Besides, ES returns a lot of additional data, and filter_path allows you to trim the response.
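
A small sketch of how such a URL could be assembled in Python; the cluster address and the index name products are assumptions.

```python
from urllib.parse import urlencode

ES_URL = "http://localhost:9200"  # assumed cluster address


def build_search_url(index: str, response_fields: list[str]) -> str:
    """Return a _search URL whose filter_path keeps only the given response fields."""
    # safe="," keeps the commas that separate the filters readable in the URL.
    query_string = urlencode({"filter_path": ",".join(response_fields)}, safe=",")
    return f"{ES_URL}/{index}/_search?{query_string}"


url = build_search_url("products", ["hits.hits._id", "hits.hits._score"])
print(url)
```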

  • Filter what is already filtered 

In online stores, you have probably seen a filter with several options, for instance a phone with 16GB, 32GB, or 64GB of RAM. You check the box for 32GB, and all the other options become hidden or inactive: the filter works, but no other filtering options are displayed. Sometimes, however, the user needs the other options to stay available for selection. For this, ES has post_filter: it is applied to the search hits after the aggregations have been calculated, so the facet counts are not narrowed by it. As a result, the user gets a list filtered by the parameters they need in the phone, while the other options remain visible and selectable.


{
  "query": {
    "bool": {
      "filter": {
        "term": { "brand": "gucci" }
      }
    }
  },
  "aggs": {
    "colors": {
      "terms": { "field": "color" }
    },
    "color_red": {
      "filter": {
        "term": { "color": "red" }
      },
      "aggs": {
        "models": {
          "terms": { "field": "model" }
        }
      }
    }
  },
  "post_filter": {
    "term": { "color": "red" }
  }
}

In this example, we show the user what other colors are available to search, but only show the red models.
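
To make the order of operations concrete, here is a small in-memory imitation of it in Python. It does not call ES, and the product list is made up; the function only mimics the semantics, in which aggregations are computed over everything the main query matched, while post_filter narrows just the returned hits.

```python
from collections import Counter

# Made-up catalogue used only to illustrate the semantics.
PRODUCTS = [
    {"brand": "gucci", "color": "red", "model": "bag"},
    {"brand": "gucci", "color": "red", "model": "belt"},
    {"brand": "gucci", "color": "black", "model": "bag"},
    {"brand": "prada", "color": "red", "model": "bag"},
]


def search(products, brand, selected_color):
    """Imitate query + aggs + post_filter on a plain list of dicts."""
    # 1. The main query (bool/filter on brand) selects the working set.
    matched = [p for p in products if p["brand"] == brand]
    # 2. Aggregations see the whole working set, so every color stays visible.
    color_facet = Counter(p["color"] for p in matched)
    # 3. post_filter is applied last and affects only the returned hits.
    hits = [p for p in matched if p["color"] == selected_color]
    return hits, color_facet


hits, color_facet = search(PRODUCTS, brand="gucci", selected_color="red")
print([h["model"] for h in hits])
print(dict(color_facet))
```

If the color condition were moved into the main query instead, step 2 would only ever see red products and the black option would disappear from the facet.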

  • Use Aliases

Aliases are index nicknames. The underlying problem is that we usually query a single index whose static name is stored somewhere in the settings. Sometimes, to add data, you need to rebuild the index, but making the user wait for reindexing is unreasonable because it takes a long time. In ES, you can do this in the background: create a new index and populate it while the user keeps searching the old index, unaware that the product data is changing. After that, we simply switch the alias to the new index, which takes literally 10 milliseconds. The user won’t even notice.


{
  "actions": [
    { "remove": { "index": "OUR_INDEX_2000_01_01", "alias": "INDEX NAME" } },
    { "add": { "index": "OUR_INDEX_2022_09_17", "alias": "INDEX NAME" } }
  ]
}
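
The same request body can be generated from code when the rebuild is automated. A minimal sketch, with hypothetical alias and index names (note that real index names must be lowercase):

```python
import json


def build_alias_swap(alias: str, old_index: str, new_index: str) -> dict:
    """Body for POST /_aliases: both actions are applied atomically in one request."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


# Hypothetical names used only for illustration.
body = build_alias_swap("products", "products_2000_01_01", "products_2022_09_17")
print(json.dumps(body, indent=2))
```

Because the remove and add actions travel in one request, searches through the alias never see a moment when it points to no index.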

  • Don’t index what you’re not looking for

For example, there is an organization, the organization has people, and the people have cars. The organization also records each person’s driving experience in years. You can search for a person by any of this data: by first name, by surname, or by the license plate of their car. But hardly anyone will search for a person by driving experience. Accordingly, index only what you search by. Just specify the type and disable indexing for the field with {"index": false}:

{ "mappings": { "properties": { "exp": { "type": "integer", "index": false }}}}

  • Read How-to

How-to is a guide to optimizing work with ES, written by its developers. By reading it, you will better understand what to use and how, and you will be able to perform a number of optimizations to improve the performance of your project.

CONCLUSION

Elasticsearch is a good full-text search tool that will help speed up your project and reduce development time. But understanding all its capabilities takes a lot of practice, especially if you want to build NoSQL storage on top of it.