What is Elasticsearch, how does it work, and why is it so effective? Elasticsearch is simple to configure, has incredible flexibility, and is an excellent tool for complex searches. Let's take a closer look.
Elasticsearch is an open-source search engine based on the Lucene library. It was developed in Java and is designed to operate in real time. It can search and index document files in diverse formats. It was designed to be used in distributed environments by providing flexibility and scalability. Now, Elasticsearch is a widely popular enterprise search engine.
Elasticsearch handles large volumes of data without losing performance. Because it exposes a REST API, it can be integrated with any system, regardless of platform. It solves many problems by offering very complex search possibilities, including phonetic search, full-text search, multiple search, and parallelism.
To help understand how Elasticsearch handles data, we can make an analogy to a database.
Elasticsearch stores data using the "schema-less" concept. This means that you do not have to define the structure of the data in advance, as you do with well-known relational databases such as Oracle, MySQL, and SQL Server.
In our analogy of traditional relational databases, the structure of the data used by Elasticsearch would be:
In talking about the infrastructure that Elasticsearch provides, we have some important terms to learn.
With this structure, Elasticsearch offers us a good API. Besides data manipulation, the API is responsible for replication, and it also takes care of the success of transactions, cluster health, and other management jobs.
Here are the main APIs and their functions:
The "default" configuration of the tool is very robust and scales horizontally. If a change is needed, it is usually quite simple: you either edit the properties in the application configuration file or make REST API calls.
First, Elasticsearch was built to be managed in a simple and efficient way, as we have already explained, which lets you scale the infrastructure as needed.
Second, it keeps data cached internally to make results even faster. Queries are issued as HTTP GET requests that carry the desired parameters.
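To make the HTTP GET point concrete, here is a minimal sketch that only builds the URL of such a query. The host `http://localhost:9200`, the index name `article`, and the helper `build_search_url` are assumptions made for illustration; the `_search` endpoint and its `q` parameter belong to the Elasticsearch REST API.

```python
from urllib.parse import urlencode


def build_search_url(base_url, index, query):
    """Build the URL for a simple query-string search.

    base_url and index are placeholders for this example; adjust them
    to match your own cluster.
    """
    params = urlencode({"q": query})
    return f"{base_url}/{index}/_search?{params}"


# Hypothetical local node and index name, used only for illustration.
url = build_search_url("http://localhost:9200", "article", "title:Brazil")
print(url)
```

Sending a GET request to that URL with any HTTP client would return the matching documents as JSON, assuming a node is running at that address.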
Third, its storage structure records information differently from traditional relational databases: it uses a structure called an inverted index. The idea of an index is the same one you already know from relational databases, i.e. a structure that organizes stored data so it can be found quickly later.
For example, imagine that we want to search all articles that contain the word "Brazil" in the title or text. How would we do this in a relational database?
We would have to use a query like this:
SELECT * FROM article WHERE title LIKE '%Brazil%' OR text LIKE '%Brazil%';
In this case, indexes on the columns will not help much, especially on the "text" column, because the "LIKE" operator with a leading "%" wildcard prevents the database from using them efficiently, which introduces a big performance issue.
The query won't be efficient at all, since the database has to go through every record to find the word inside the columns. In other words, it performs a full scan of the table. Imagine that this table has millions of records...
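The full scan can be observed with a small experiment. The sketch below uses an in-memory SQLite database as a stand-in for a relational database (an assumption for illustration; other engines have different planners and plan output) and asks it how it would run the query, even though both columns are indexed. The sample rows are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE article (id INTEGER PRIMARY KEY, title TEXT, text TEXT)")
# Indexes on both columns, to show that they do not help this query.
conn.execute("CREATE INDEX idx_article_title ON article (title)")
conn.execute("CREATE INDEX idx_article_text ON article (text)")
conn.executemany(
    "INSERT INTO article (title, text) VALUES (?, ?)",
    [
        ("Travel in Brazil", "A guide to the beaches."),
        ("Cooking pasta", "An Italian recipe that is popular in Brazil."),
        ("Learning SQL", "Nothing about countries here."),
    ],
)

query = "SELECT * FROM article WHERE title LIKE '%Brazil%' OR text LIKE '%Brazil%'"

# The last column of each EXPLAIN QUERY PLAN row describes the chosen strategy.
plan_details = [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + query)]
matching_ids = [row[0] for row in conn.execute(query)]

print(plan_details)
print(matching_ids)
```

The plan reports a scan of the `article` table rather than an index search, which is the behavior described above.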
This is where our friend the inverted index stands out. Its structure is built from the words, which are called "terms" in Elasticsearch. #CodingExplained offers an excellent introduction to the inverted index:
"The purpose of an inverted index, is to store text in a structure that allows for very efficient and fast full-text searches. When performing full-text searches, we are actually querying an inverted index and not the JSON documents that we defined when indexing the documents...
An inverted index consists of all of the unique terms that appear in any document covered by the index. For each term, the list of documents in which the term appears, is stored. So essentially an inverted index is a mapping between terms and which documents contain those terms. Since an inverted index works at the document field level and stores the terms for a given field, it doesn’t need to deal with different fields." - #CodingExplained
Alright, so let’s look at an example. Suppose that we have two recipes with the following titles: “The Best Pasta Recipe with Pesto” and “Delicious Pasta Carbonara Recipe.” The following table shows what the inverted index would look like.
So the terms from both of the titles have been added to the index. For each term, we can see which document contains which term, and this enables Elasticsearch to efficiently match documents containing specific terms.
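The structure described above can be sketched in a few lines of Python. This is a simplified illustration, not how Elasticsearch or Lucene implement it: the names `build_inverted_index` and `search` are made up for this example, and terms are produced by simply lowercasing the title and splitting on whitespace.

```python
from collections import defaultdict

# The two recipe titles from the example, keyed by document id.
titles = {
    1: "The Best Pasta Recipe with Pesto",
    2: "Delicious Pasta Carbonara Recipe",
}


def build_inverted_index(documents):
    """Map each lowercase term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Count, for each document, how many of the query terms it contains."""
    matches = defaultdict(int)
    for term in query.lower().split():
        for doc_id in index.get(term, set()):
            matches[doc_id] += 1
    return dict(matches)


index = build_inverted_index(titles)
pasta_recipe = search(index, "pasta recipe")
delicious_recipe = search(index, "delicious recipe")

print(sorted(index["pasta"]))
print(pasta_recipe)
print(delicious_recipe)
```

A lookup never reads the titles themselves: it goes straight from each term to the documents that contain it, which is why both documents match both terms of "pasta recipe" while only the second title matches both terms of "delicious recipe".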
The first step of a search query is to find the documents that match the query in the first place. So if we were to search for “pasta recipe,” we would see that both documents contain both terms.
If we searched for “delicious recipe,” the results would be as follows:
As we can see, an inverted index structure contains the following information:
Elasticsearch applies some rules to save each word separately and in a simple form for future searches. This is what we call the analysis process.
Now you can see why Elasticsearch is so fast for searches: it already has each term saved and knows which documents contain it, so it doesn't need to do a full scan of the data.
Of course, Elasticsearch has an effective implementation to work with data in memory and manipulate the data in the operating system, but the fact that it uses inverted indexes definitely makes it more appropriate than relational databases when it comes to searches.
Elasticsearch "breaks" a text into terms so that it can create this index. To separate the words of the text, it uses a component called an analyzer, which is responsible for processing the text being saved. The default analyzer is called the "Standard Analyzer."
Generally speaking, what the analyzer does is:
With this, Elasticsearch can store words in a very uniform way. Here's what the following text would look like: "Is this dèja vu?"
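The terms would look like this: [ is, this, deja, vu ]. Turning "è" into "e" requires an ASCII-folding step in the analyzer; the Standard Analyzer by itself splits and lowercases the text but keeps accents.
Note that the procedure separated the words and applied the rules mentioned above before saving each term.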
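The example above can be reproduced with a rough Python sketch of an analysis chain. It is only an approximation written for this article: the function `analyze` is hypothetical, and the accent-folding step is included here to match the terms shown, which in Elasticsearch would need an analyzer configured with ASCII folding.

```python
import re
import unicodedata


def analyze(text):
    """Simplified analysis: fold accents, lowercase, split into word terms."""
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    folded = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Lowercase and keep only runs of word characters (punctuation is dropped).
    return re.findall(r"\w+", folded.lower())


terms = analyze("Is this dèja vu?")
print(terms)
```

Because the same function would be applied to both the stored text and the search query, a search for "Deja" or "DÈJA" would end up looking for the same term.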
These are some of the main procedures performed by analyzers. There are several types of analyzers, and each has its own set of rules to apply to the data. You can also create your own analyzer using an existing set of rules or even creating your own rules.
In addition to these techniques, Elasticsearch has several settings that weight a match, such as the frequency of the term, the length of the field, the inverse document frequency, etc., so that it can retrieve documents with accuracy and speed.
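To give a feel for how such weights interact, here is a simplified TF-IDF style score in Python. It is a teaching sketch under stated assumptions, not the formula Elasticsearch actually uses: the function `tf_idf`, the sample documents, and the exact weighting are chosen for illustration only.

```python
import math

# Two tiny documents, already lowercased, used only for illustration.
docs = {
    1: "the best pasta recipe with pesto",
    2: "delicious pasta carbonara recipe",
}


def tf_idf(term, doc_id, documents):
    """Score a term for one document: frequent in the document, rare overall."""
    tokens = documents[doc_id].split()
    # Term frequency, divided by length so shorter texts score higher.
    tf = tokens.count(term) / len(tokens)
    # Number of documents that contain the term at least once.
    df = sum(1 for text in documents.values() if term in text.split())
    if df == 0:
        return 0.0
    # Rarer terms get a larger inverse document frequency.
    idf = math.log(1 + len(documents) / df)
    return tf * idf


pasta_scores = {doc_id: tf_idf("pasta", doc_id, docs) for doc_id in docs}
delicious_scores = {doc_id: tf_idf("delicious", doc_id, docs) for doc_id in docs}
print(pasta_scores)
print(delicious_scores)
```

In this sketch "pasta" scores higher in the shorter second title, and "delicious", which appears in only one document, outweighs "pasta" within that document.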
Elasticsearch can be a great way to improve your product, making search faster and enabling more complex queries. It is simple to configure, has incredible flexibility, and is an excellent option for both plain data searches and complex searches.
It has the power to execute queries quickly and with excellent results. Elasticsearch provides excellent documentation, and its ever-growing community is active and helpful.