Elasticsearch Tutorial: Your Detailed Guide to Getting Started

Phil Vuollet Developer Tips, Tricks & Resources

In this Elasticsearch tutorial, I’m going to show you the basics. There’s so much to learn about Elasticsearch that I won’t be able to cover everything in this post. If you have experience searching Apache Lucene indexes, you’ll have a significant head start. Also, if you’ve worked with distributed indexes, this should be old hat. But if you’re new to these concepts, you’ll want to take some time to ingest the basics.

We’ll focus on the main arena of Elasticsearch: search. But first, I’ll give you the lay of the land so you can actually set it up and do some exercises of your own.

Hello Elasticsearch!

Elasticsearch is an open source, document-based search platform with fast searching capabilities. In other words, it’s optimized for needle-in-haystack problems rather than consistency or atomicity. Elasticsearch (the product) is the core of the Elastic Stack line of products from Elastic (the company, which has also gone by the name Elasticsearch). To avoid confusion, I’ll refer to the product as Elasticsearch or ES and the company as Elastic.

Elasticsearch runs on a clustered environment. A cluster can be one or more servers. Each server in the cluster is a node. As with all document databases, records are called documents. I’ll often refer to them as records because I’m stuck in my ways. Documents are stored in indexes, which can be sharded, or split into smaller pieces. Elasticsearch can run those shards on separate nodes to distribute the load across servers. You can and should replicate shards onto other servers in case of network or server issues (trust me, they happen).

Elasticsearch uses Apache Lucene to index documents for fast searching. Lucene has been around for nearly two decades and it’s still being improved! Although this search engine has been ported to other languages, its mainstay is Java. Thus, Elasticsearch is also written in Java and runs on the JVM. You’ll need that installed before you set up Elasticsearch. Let’s see how you can do that now.

Set up Elasticsearch

You can run Elasticsearch locally or consume it as a service via Amazon Web Services (AWS) or Google Cloud Platform (GCP). If Docker is more your thing, Elastic provides Docker containers with all versions of their products. If you’re just getting your feet wet, I recommend using a Docker container or installing on a VM. It’s also easy enough to run on your local machine. I’ll be doing this using the Apache 2.0 licensed version for the demos in this tutorial.

Elasticsearch is meant to be run in a cluster of servers to scale the load across nodes, but you can run it with just one node if you’re taking it for a spin. Elastic offers a free version that you can download and install. Since it runs on the JVM, you’ll need that installed as well, unless you pull the Docker image and run it that way.

Whichever method you choose to use, it’s easy to get the service up and running. The installed version is self-contained. You start the server simply by running a premade script. The containerized version takes nothing more than a docker run command to start it in development mode. Remember, development mode is for local use without clustering.

Production deployment takes a bit more finesse to configure. For production environments, you’ll need to set up security and all the nodes in the cluster. That topic is beyond the scope of this article. The documentation on the Elastic site has all the details.

I’m running the OSS version for Windows for this tutorial. With it installed, it’s a simple matter of running the batch file that’s in the “bin” directory to launch the server. Once the server is started, we’re ready to consume the service.

Consume Elasticsearch

You need flexibility in how you access your data. On the one hand, you might be building an Alexa skill to report sales rollups to executives. On the other hand, you might be building a tool to allow business analysts to perform ad-hoc queries on…well…anything! In order to support such a broad range of goals, Elasticsearch uses the ubiquitous HTTP protocol. We’ll take a look at how to search using that API. But first, a few words on security so we are keeping our minds in the right place.

Security first

Elasticsearch provides a RESTful API for consumption. The API is served over HTTP. If you’re hosting Elasticsearch, you’ll need to use X-Pack or brew up your own security layer. However, it can be a slippery slope of complexity when it comes to rolling out your own solutions.  Security is one area where you can’t weigh the investment lightly. Building your own security layer can become expensive in the long run. You might be better off investing in the X-Pack solution after all is said and done. Either way, you’ll need to have security in place once you’re in production, so plan accordingly so that your data is secure in transit and at rest!

The security game changes somewhat when you’re running Elasticsearch as a cloud service. The cloud providers offer their own platform-specific security models.

Of course, you are also free to host Elasticsearch on any cloud infrastructure on a VM or container service. You would use X-Pack and/or a combination of the providers’ security features. This option is similar to hosting a solution on your own servers, except that the infrastructure is on the cloud platform.

It’s actually pretty smart to separate the concerns of security from the concerns of the core search capabilities that Elasticsearch provides. This way, when your security needs change, you don’t have to change anything about your ES implementation. I’m seeing this separation between security and core more often these days. Typically, a reverse proxy or a load balancer handles the TLS and forwards all calls over plain HTTP to the actual hosted service.

With that out of the way, we can start looking at the interface.


RESTful API

Elasticsearch has quite a few APIs. Starting at the largest scope, we can use the “cluster” API to manage our clusters. The “index” APIs give us access to our indices, mappings, aliases, etc.  Of course, you’ll find the real action in the “search” APIs. This is what you use to query, count, and filter your data across multiple indexes and types. And you can’t search unless you add data using the “documents” APIs. In fact, let’s check that one out first!

Create and update records

In keeping with Elasticsearch’s other RESTful APIs, the document API accepts a PUT request to create a new document. The document is stored in an index using the following path pattern: “/{index}/{type}/{id}.” The given index will be created if it doesn’t yet exist. Types have mappings, which will be inferred if you don’t provide one. Mappings assign data types to attributes to describe the document structure to Lucene. Lucene does optimizations based on those attribute types.

Create

You don’t have to specify an “id” to create a record. Instead, you can use a POST to the “/{index}/{type}” endpoint. When you POST without an id, the engine will generate a unique id for you. Let’s try this now:

curl -X POST http://localhost:9200/my_index/my_type -H 'Content-Type: application/json' -d '{"user":"Phil","message":"Hello World!"}'

This request will create an index named “my_index” with a type “my_type” and place the document in that index. It’ll generate an id for the document. The result looks like this:

{"_index":"my_index","_type":"my_type","_id":"VutxJGUBn9IhJVP8xXFf","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":0,"_primary_term":1}

As you can see, it generated the “_id” for the record as “VutxJGUBn9IhJVP8xXFf.”

If I provide an id in the path, it’ll use that as the document’s “_id.” Let’s try that one now.

curl -X POST http://localhost:9200/my_index/my_type/G123 -H 'Content-Type: application/json' -d '{"user":"Phil","message":"Hello World!"}'

As you can see, I added “/G123” to the path. That resulted in the following response:

{"_index":"my_index","_type":"my_type","_id":"G123","_version":1,"result":"created","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":0,"_primary_term":1}

The “_id” for this document is “G123.”
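
If you’d rather script these calls than type them, here’s a minimal Python sketch that builds the same two kinds of requests using only the standard library. The helper name “build_index_request” and the “BASE_URL” constant are my own for illustration, and the sketch assumes the local single-node server used in this tutorial. It chooses PUT when an id is given, matching the path pattern described earlier, and it only builds the request without sending it.

```python
import json
import urllib.request

# Assumption: a local single-node server, as used throughout this tutorial.
BASE_URL = "http://localhost:9200"


def build_index_request(index, doc_type, document, doc_id=None):
    """Build (but do not send) a request that creates a document.

    Without an id: POST to /{index}/{type}, and the server generates the id.
    With an id:    PUT to /{index}/{type}/{id}.
    """
    path = f"/{index}/{doc_type}"
    method = "POST"
    if doc_id is not None:
        path += f"/{doc_id}"
        method = "PUT"
    return urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(document).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method=method,
    )


request = build_index_request(
    "my_index", "my_type", {"user": "Phil", "message": "Hello World!"}, doc_id="G123"
)
print(request.get_method(), request.full_url)
# To send it to a running server: urllib.request.urlopen(request)
```

Keeping the build step separate from the send step makes it easy to inspect the method, path, and body before anything touches your index.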

Read

We can retrieve as you’d expect from a RESTful API.

curl -X GET http://localhost:9200/my_index/my_type/G123

That will bring back the document record, which looks like this:

{"_index":"my_index","_type":"my_type","_id":"G123","_version":1,"found":true,"_source":{"user":"Phil","message":"Hello World!"}}

Notice how the response wraps the document in metadata attributes. The actual document is shown in the “_source” attribute. Also note the “_version” attribute, which comes into play next.

Update

Elasticsearch has built-in document versioning. The documents are versioned automatically by starting at version 1 and incrementing by one with each future operation. Use a PUT operation and specify the version to update.

# request
curl -X PUT "http://localhost:9200/my_index/my_type/G123?version=1" -H 'Content-Type: application/json' -d '{"user":"Phil","message":"Hello, World!"}'

# response
{"_index":"my_index","_type":"my_type","_id":"G123","_version":2,"result":"updated","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":1,"_primary_term":1}

Notice how the version is now at “2”? Let’s GET the document again.

# request
curl -X GET http://localhost:9200/my_index/my_type/G123

# response
{"_index":"my_index","_type":"my_type","_id":"G123","_version":2,"found":true,"_source":{"user":"Phil","message":"Hello, World!" }}
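
To make the version check concrete, here’s a toy in-memory model of the behavior shown above. It is not Elasticsearch code, and the “VersionedStore” and “VersionConflict” names are hypothetical. The idea is simply that a write naming a stale version is rejected, while a write naming the current version bumps it by one. As far as I know, the real server reports this kind of rejection as a version conflict, but treat the details here as a sketch.

```python
class VersionConflict(Exception):
    """Raised when the caller's expected version is stale."""


class VersionedStore:
    """Toy in-memory model of versioned documents (not Elasticsearch code)."""

    def __init__(self):
        self._docs = {}  # id -> (version, source)

    def put(self, doc_id, source, version=None):
        current_version, _ = self._docs.get(doc_id, (0, None))
        # If the caller names a version, it must match what is stored.
        if version is not None and version != current_version:
            raise VersionConflict(f"expected {version}, found {current_version}")
        new_version = current_version + 1
        self._docs[doc_id] = (new_version, source)
        return new_version


store = VersionedStore()
print(store.put("G123", {"message": "Hello World!"}))              # 1
print(store.put("G123", {"message": "Hello, World!"}, version=1))  # 2
try:
    store.put("G123", {"message": "stale write"}, version=1)
except VersionConflict as error:
    print("rejected:", error)  # rejected: expected 1, found 2
```

This is the classic optimistic concurrency pattern: nobody takes a lock, and the loser of a race finds out when the write is refused.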

Normally, Elasticsearch uses a hash function on the id to map it to the proper shard.
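
Here’s a rough sketch of that idea: hash the routing value (the id by default) and take the remainder by the number of primary shards. The function name “pick_shard” is mine, and “zlib.crc32” is only a stand-in hash for illustration. It is not the hash Elasticsearch actually uses, so the shard numbers it prints won’t match a real cluster.

```python
import zlib


def pick_shard(routing_value, number_of_primary_shards):
    """Map a routing value (the document id by default) to a shard number.

    zlib.crc32 is a stand-in hash chosen for this sketch only; it is NOT
    the hash function Elasticsearch uses internally.
    """
    hashed = zlib.crc32(routing_value.encode("utf-8"))
    return hashed % number_of_primary_shards


for doc_id in ("G123", "VutxJGUBn9IhJVP8xXFf"):
    print(doc_id, "-> shard", pick_shard(doc_id, 5))
```

The takeaway is that the same id always lands on the same shard, which is also why changing the number of primary shards later isn’t a trivial operation.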

Once you have data in your index, you can do some searching. Next, we’ll look at some basic search functions.

Search your data

The main event for Elasticsearch is, of course, the search feature. I’ve created a dump of my “System” event log, then made a quick application to move the records into an index named “syslogs.” I used the following command from the “cat” API to print out the document count for that index.

curl -X GET "localhost:9200/_cat/count/syslogs?v"

epoch timestamp count
1533611156 22:05:56 4137

As you can see, we have 4137 documents in that index to work with. It’s not much in terms of what we would actually use this technology for, but it’ll do for a demo. Let’s search!

Filter by context

First, let’s take a look at how many errors are in the logs. We’ll do this with the “_search” endpoint on the index as follows:

curl -X GET "http://localhost:9200/syslogs/_search?q=level:error"

The response looks like this:

{"took":7,"timed_out":false,"_shards":{"total":5,"successful":5,"skipped":0,"failed":0},"hits":{"total":355,"max_score":2.579004,"hits":[{"_index":"syslogs","_type":"event","_id":"ROtLEmUBn9IhJVP8VmEO","_score":2.579004,"_source":{"level":"Error","dateAndTime":"2018-08-06T20:41:14","source":"Microsoft-Windows-DistributedCOM","eventId":10016,"taskCategory":"None"}},{"_index":"syslogs","_type":"event","_id":"qetLEmUBn9IhJVP8hGHg","_score":2.579004,"_source":{"level":"Error","dateAndTime":"2018-08-02T08:37:55","source":"Microsoft-Windows-DistributedCOM","eventId":10016,"taskCategory":"None"}},{"_index":"syslogs","_type":"event","_id":"M-tKEmUBn9IhJVP8nGEu","_score":2.579004,"_source":{"level":"Error","dateAndTime":"2018-08-06T20:45:31","source":"Microsoft-Windows-DistributedCOM","eventId":10016,"taskCategory":"None"}},{"_index":"syslogs","_type":"event","_id":"e-tLEmUBn9IhJVP8bGFJ","_score":2.579004,"_source":{"level":"Error","dateAndTime":"2018-08-03T21:03:55","source":"Microsoft-Windows-DistributedCOM","eventId":10016,"taskCategory":"None"}},{"_index":"syslogs","_type":"event","_id":"mOtLEmUBn9IhJVP88WK8","_score":2.579004,"_source":{"level":"Error","dateAndTime":"2018-07-28T10:18:11","source":"Microsoft-Windows-WindowsUpdateClient","eventId":20,"taskCategory":"Windows Update Agent"}},{"_index":"syslogs","_type":"event","_id":"IetMEmUBn9IhJVP8K2ME","_score":2.579004,"_source":{"level":"Error","dateAndTime":"2018-07-27T04:32:36","source":"Microsoft-Windows-DistributedCOM","eventId":10016,"taskCategory":"None"}},{"_index":"syslogs","_type":"event","_id":"LOtMEmUBn9IhJVP8MGMd","_score":2.579004,"_source":{"level":"Error","dateAndTime":"2018-07-27T00:48:15","source":"Microsoft-Windows-WindowsUpdateClient","eventId":20,"taskCategory":"Windows Update Agent"}},{"_index":"syslogs","_type":"event","_id":"YutMEmUBn9IhJVP8RmPY","_score":2.579004,"_source":{"level":"Error","dateAndTime":"2018-07-25T20:11:54","source":"Microsoft-Windows-DistributedCOM","eventId":10016,"taskCategory":"None"}},{"_index":"syslogs","_type":"event","_id":"ZutMEmUBn9IhJVP8SmOS","_score":2.579004,"_source":{"level":"Error","dateAndTime":"2018-07-25T11:59:15","source":"Microsoft-Windows-WindowsUpdateClient","eventId":20,"taskCategory":"Windows Update Agent"}},{"_index":"syslogs","_type":"event","_id":"zetMEmUBn9IhJVP8e2M8","_score":2.579004,"_source":{"level":"Error","dateAndTime":"2018-07-23T12:33:49","source":"Microsoft-Windows-DistributedCOM","eventId":10016,"taskCategory":"None"}}]}}

Well, I don’t know if that’s even readable. It’s just a wall of JSON as far as I can tell.

Make it pretty

Let’s see if we can get a better look by using the “pretty” option like this:

curl -X GET "http://localhost:9200/syslogs/_search?q=level:error&pretty"

Notice how one of the query params is “pretty.” You can do “pretty=true” if it makes you feel better, but it isn’t necessary. The only thing is that it outputs 10 records by default. I’ll truncate for brevity, but it looks like this:

{
    "took" : 5,
    "timed_out" : false,
    "_shards" : {
        "total" : 5,
        "successful" : 5,
        "skipped" : 0,
        "failed" : 0
    },
    "hits" : {
        "total" : 355,
        "max_score" : 2.579004,
        "hits" : [
            {
                "_index" : "syslogs",
                "_type" : "event",
                "_id" : "ROtLEmUBn9IhJVP8VmEO",
                "_score" : 2.579004,
                "_source" : {
                    "level" : "Error",
                    "dateAndTime" : "2018-08-06T20:41:14",
                    "source" : "Microsoft-Windows-DistributedCOM",
                    "eventId" : 10016,
                    "taskCategory" : "None"
                }
            },
            ...
            {
                "_index" : "syslogs",
                "_type" : "event",
                "_id" : "zetMEmUBn9IhJVP8e2M8",
                "_score" : 2.579004,
                "_source" : {
                    "level" : "Error",
                    "dateAndTime" : "2018-07-23T12:33:49",
                    "source" : "Microsoft-Windows-DistributedCOM",
                    "eventId" : 10016,
                    "taskCategory" : "None"
                }
            }
        ]
    }
}

That’s a lot of information that we don’t need. We really just want a count of error events.

Trim the fat

We can trim down the result. Exclude the “_source” by adding “_source=false” to the query params.

curl -X GET "http://localhost:9200/syslogs/_search?q=level:error&pretty&_source=false"

That’ll tell Elasticsearch to skip the “_source” for each record. It’s better, but it’s still too much. It looks like this now:

{
    "took" : 5,
    "timed_out" : false,
    "_shards" : {
        "total" : 5,
        "successful" : 5,
        "skipped" : 0,
        "failed" : 0
    },
    "hits" : {
        "total" : 355,
        "max_score" : 2.579004,
        "hits" : [
            {
                "_index" : "syslogs",
                "_type" : "event",
                "_id" : "ROtLEmUBn9IhJVP8VmEO",
                "_score" : 2.579004
            },
            ...
            {
                "_index" : "syslogs",
                "_type" : "event",
                "_id" : "zetMEmUBn9IhJVP8e2M8",
                "_score" : 2.579004
            }
        ]
    }
}

And that’s the truncated version as before. Notice the “hits” no longer include the “_source” attribute.
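
Once you have more than a couple of query params, it’s easy to fumble the “?” and “&” separators by hand. Here’s a small Python sketch that assembles the same search URL with the standard library. The “search_url” helper and “BASE_URL” constant are my own names, and the sketch assumes the local server from this tutorial.

```python
from urllib.parse import urlencode

# Assumption: the local server used in this tutorial.
BASE_URL = "http://localhost:9200"


def search_url(index, **params):
    """Return a _search URL with properly encoded query parameters."""
    return f"{BASE_URL}/{index}/_search?{urlencode(params)}"


url = search_url("syslogs", q="level:error", pretty="true", _source="false")
print(url)
# http://localhost:9200/syslogs/_search?q=level%3Aerror&pretty=true&_source=false
```

Note that the colon in “level:error” comes out percent-encoded as “%3A,” which is an equivalent spelling of the same query string.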

One more thing we can do is limit the “hits” returned to zero. The point of “hits” is that we can page the results. There are “from” and “size” parameters that we can use for paging. I’m going to set “size” to zero.

curl -X GET "http://localhost:9200/syslogs/_search?q=level:error&pretty&_source=false&size=0"

{
    "took" : 7,
    "timed_out" : false,
    "_shards" : {
        "total" : 5,
        "successful" : 5,
        "skipped" : 0,
        "failed" : 0
    },
    "hits" : {
        "total" : 355,
        "max_score" : 0.0,
        "hits" : [ ]
    }
}

And now it’s pretty easy to see how many error events are in the logs! It’s not the best way to get a count, but it does show some interesting properties of the search API. Notice the “max_score” is “0.0” in our results here.
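
If a program rather than a person is reading that response, pulling out the count takes only a few lines. Here’s a minimal Python sketch. The “total_hits” helper is a name I made up, and the sample string is a trimmed-down copy of the response above. The responses in this tutorial report “total” as a plain number, while later Elasticsearch versions wrap it in an object with a “value” field, so the sketch accepts either shape.

```python
import json


def total_hits(response_text):
    """Return the hit count from a _search response body."""
    total = json.loads(response_text)["hits"]["total"]
    # The responses in this tutorial report a plain number. Later
    # Elasticsearch versions wrap it as {"value": ..., "relation": ...}.
    if isinstance(total, dict):
        return total["value"]
    return total


sample = '{"took":7,"timed_out":false,"hits":{"total":355,"max_score":0.0,"hits":[]}}'
print(total_hits(sample))  # 355
```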

Then again, this API isn’t for counting; it’s for searching and paging results. It’s always good to have a hit count in any paging API. Good design! Here’s how paging works…

Page results for a better UX

How do we page the results? Can’t we just return all 355 records in one query, or do we have to page them? Let’s try it!

curl -X GET "http://localhost:9200/syslogs/_search?q=level:error&pretty&_source=false&size=355"

I won’t bore you with the details, but it printed out all 355 hits as requested! There’s a practical limit on page size, though. For results a person has to read, I’d say somewhere around 20-25 hits at most. The default of 10 is a pretty good starting point.

Let’s try something sane with paging. We’ll get the first 25. Mind you, we aren’t sorting yet so these are being returned in a somewhat arbitrary order. The highest “_score” values are coming up first, but all search results match exactly (case insensitive). There is only a slight difference in scores across all 355 events (wouldn’t you like to know why? I would.).

curl -X GET "http://localhost:9200/syslogs/_search?q=level:error&pretty&_source=false&size=25"

I’ve added “&size=25,” which will return the first 25 events (“from” defaults to zero). To get the next 25, we’ll do this:

curl -X GET "http://localhost:9200/syslogs/_search?q=level:error&pretty&_source=false&size=25&from=25"

Here, I’ve added the “&from=25”, which brings back the next 25.

One thing to consider when paging is the last set. Usually, it’ll be less than the page size. When we start from 350 with a “size” of 25, we’ll get the last five back without any errors.
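
The paging arithmetic is simple enough to sketch in a few lines of Python. The “page_windows” helper below is hypothetical. It just lists the “from” and “size” values you’d send for each page, along with how many hits each page should return, using the 355 hits and page size of 25 from this example.

```python
def page_windows(total_hits, page_size):
    """Yield (from, size, expected_count) for each page of results."""
    for start in range(0, total_hits, page_size):
        yield start, page_size, min(page_size, total_hits - start)


pages = list(page_windows(355, 25))
print(len(pages))   # 15
print(pages[0])     # (0, 25, 25)
print(pages[-1])    # (350, 25, 5)
```

The last tuple is the short page: you still ask for 25 starting from 350, and you get back the five that remain.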

What’s more, we can even start past the number of results. Unlike our earlier example of “size=0,” we get a “max_score” in the response.

curl -X GET "http://localhost:9200/syslogs/_search?q=level:error&pretty&_source=false&size=1&from=360"

{
    "took" : 7,
    "timed_out" : false,
    "_shards" : {
        "total" : 5,
        "successful" : 5,
        "skipped" : 0,
        "failed" : 0
    },
    "hits" : {
        "total" : 355,
        "max_score" : 2.579004,
        "hits" : [ ]
    }
}

As you can see, this API has a pretty solid paging implementation. But what about sorting?


Sort for relevance

We can sort by adding the “sort” parameter. To sort by “_score,” add “&sort=_score:desc.” This way, we’ll get the most relevant hits first.

You can see how the scoring was done by adding the “explain” parameter. Let’s see how it scored our results:

{
    "_shard": "[syslogs][3]",
    "_node": "BcPIBXb_SR-HCQx_WJi3jg",
    "_index": "syslogs",
    "_type": "event",
    "_id": "mOtLEmUBn9IhJVP88WK8",
    "_score": 2.579004,
    "_explanation": {
        "value": 2.579004,
        "description": "weight(level:error in 70) [PerFieldSimilarity], result of:",
        "details": [
            {
                "value": 2.579004,
                "description": "score(doc=70,freq=1.0 = termFreq=1.0\n), product of:",
                "details": [
                    {
                        "value": 2.579004,
                        "description": "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
                        "details": [
                            {
                                "value": 62.0,
                                "description": "docFreq",
                                "details": []
                            },
                            {
                                "value": 823.0,
                                "description": "docCount",
                                "details": []
                            }
                        ]
                    },
                    {
                        "value": 1.0,
                        "description": "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
                        "details": [
                            {
                                "value": 1.0,
                                "description": "termFreq=1.0",
                                "details": []
                            },
                            {
                                "value": 1.2,
                                "description": "parameter k1",
                                "details": []
                            },
                            {
                                "value": 0.75,
                                "description": "parameter b",
                                "details": []
                            },
                            {
                                "value": 1.0,
                                "description": "avgFieldLength",
                                "details": []
                            },
                            {
                                "value": 1.0,
                                "description": "fieldLength",
                                "details": []
                            }
                        ]
                    }
                ]
            }
        ]
    }
}

As you can see, “explain” is fairly intense! I could only include one record here for the sake of space, but it’s instructive! Did you notice how the result comes from a specific shard?

Ranking value

Well, the first ranking function, “idf,” depends on the total number of documents in the shard. When I look at the results with lower “_score” values (2.38), I can see that they come from shard 4. Shard 4 matches on “error” 80 times out of 870 records, where shard 3 matches 62 times out of 823 records in that shard. That’s why the same match gets different weights! It’s all about frequency per shard. This implies that scores aren’t directly comparable across shards unless we do something about that difference in weight.
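
You can check those numbers yourself by plugging the per-shard counts into the two formulas printed in the “explain” output. Here’s a short Python sketch that does exactly that. The function names “idf” and “tf_norm” are mine, the formulas are copied from the descriptions above, and the inputs are the shard 3 and shard 4 counts I just quoted.

```python
import math


def idf(doc_freq, doc_count):
    """idf as described in the explain output."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


def tf_norm(freq, field_length, avg_field_length, k1=1.2, b=0.75):
    """tfNorm as described in the explain output."""
    return (freq * (k1 + 1)) / (
        freq + k1 * (1 - b + b * field_length / avg_field_length)
    )


# Every matching record has one term in a one-term field, so tfNorm is 1.0
# and the score is driven entirely by the per-shard idf.
shard_3 = idf(62, 823) * tf_norm(1.0, 1.0, 1.0)
shard_4 = idf(80, 870) * tf_norm(1.0, 1.0, 1.0)
print(round(shard_3, 6))  # 2.579004
print(round(shard_4, 2))  # 2.38
```

The only thing that differs between the two shards is the ratio of matching documents to total documents, and that alone accounts for the gap in “_score.”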

Since we’re talking about sorting, I wanted to make you aware of this ranking “caveat.” Text search scores are a balance of two functions, one of which depends on hit density within the shard. Whether or not this is a problem worth solving depends on your situation. Elasticsearch handles very big data well—like orders of magnitude larger than our current sample. At that scale, the imbalance is irrelevant. However, in case you were wondering, there are some things you can do to make it better.

Balancing act

In my sample data, there are four levels of events: informational, warning, error, and critical. I could index each log level separately. The search API allows us to search across multiple indices.

curl -X GET "http://localhost:9200/error,critical/_search?pretty"

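This will return 10 results and give me a count of how many there are across both indices. Results will be from both indices, but which ten we get depends on the default sort (by “_score”). With no query in the request, every document scores the same, so the order is effectively arbitrary.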

We don’t exactly have to use separate indexes in this case since those log records are all the same type. Before version 6.0.0, you could mix and match types within the same index. You can’t do this anymore. Instead, you’ll need to put each type into its own index. Adjust the shards to balance out the indexes for each type.

The rest of the RESTful API

The RESTful APIs have an enormous surface area! I could write a whole book on the topic and still not cover everything. Hopefully, you have enough now to get a good start. If you haven’t done so already, grab the OSS version and whip up a data migration. There are libraries for many of the major languages, some of which include JavaScript, Python, Java, PHP, and .NET. You can use your favorite language, grab some system logs or whatever you have available to seed an index and get a real feel for Elasticsearch using data you know.

While you’re at it, you might appreciate Kibana. It’s another one of Elastic’s products in the Elastic lineup. It’s a graphical interface for making sense of the data in a very visual way. Whip up some visuals and show off your analytical skills. You’d be sure to impress some folks with your valuable new Elasticsearch skills!

 

About Phil Vuollet

Phil Vuollet uses software to automate processes to improve efficiency and repeatability. He writes about topics relevant to technology and business, occasionally gives talks on the same topics, and is a family man who enjoys playing soccer and board games with his children.