ElasticSearch — Inverted Index, Source, Index, Norms, Routing…

Nil Seri
4 min read · Oct 5, 2021

Lucene vs ElasticSearch, Details about Inverted Index, source-index-norms-routing Explained in Detail

Photo by Ryan Geller on Unsplash

Lucene vs ElasticSearch

Differences between Lucene and ElasticSearch:

  • Lucene is a Java library
  • Elasticsearch is built on top of Lucene and exposes Lucene features through a JSON-based REST API.
  • Elasticsearch adds a distributed system on top of Lucene, which Lucene alone does not provide.
  • Elasticsearch provides other supporting features like monitoring and management APIs.

All data in Elasticsearch is stored internally in Apache Lucene, and searchable text is kept as an inverted index.

What is Inverted Index?

Sample documents and resulting inverted index — elastic.co

After some simple text processing (lowercasing, removing punctuation, and splitting the text into words; I cover these steps in a separate post), the “inverted index” is constructed as shown in the figure.

The inverted index maps terms to documents (and possibly positions in the documents) containing the term. Since the terms in the dictionary are sorted, we can quickly find a term, and subsequently its occurrences. This is contrary to a “forward index”, which lists terms related to a specific document.

You can think of both as maps: in an inverted index, the key is a term and the value is the list of documents containing it, whereas in a forward index, the key is a document and the value is the list of terms it contains.
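The two maps can be sketched in a few lines of Python. This is only a toy illustration: the sample documents, the `tokenize` helper, and the function names are made up for this example, and the regex-based analyzer is a stand-in for the real analysis chain.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Toy analyzer: lowercase, drop punctuation, split into words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_forward_index(docs):
    """Map each document ID to the sorted list of distinct terms it contains."""
    return {doc_id: sorted(set(tokenize(text))) for doc_id, text in docs.items()}


def build_inverted_index(docs):
    """Map each term to the sorted list of document IDs that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            postings[term].add(doc_id)
    # Keep the terms sorted so a term can be located quickly in the dictionary.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}


# Hypothetical sample documents.
docs = {
    1: "Winter is coming.",
    2: "Ours is the fury.",
    3: "The choice is yours.",
}

inverted = build_inverted_index(docs)
forward = build_forward_index(docs)

print(inverted["the"])  # IDs of the documents that contain "the"
print(forward[2])       # terms that occur in document 2
```

Looking up a term in `inverted` is a single dictionary access, while answering the same question with `forward` would require scanning every document.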

What are _source and source?

When you index a document in Elasticsearch, the original document (without any analyzing or tokenizing) is stored in a special field called _source.

To disable storing the original document (for example, for security reasons), set “enabled” to false under “_source” at the top of your index mapping, as shown below.

{
  "_source": {
    "enabled": false
  },
  "properties": {
    "organizationId": {
      "type": "keyword"
    },
    "accountId": {
      "type": "keyword"
    },
    "type": {
      "type": "text",
      "analyzer": "whitespace"
    }
  }
}

You can also use the _source parameter in a search request to select which fields of the source are returned. This is called source filtering.

For example, to return the source for only the “field1” and “field2” fields and their properties:

"_source": [ "field1.*", "field2.*" ],

To sum it up, the original value of a field can be retrieved if

  • there is no “_source” declaration at the top of the index mapping (all fields’ values are already part of the “_source” field, which is stored by default),
  • “_source” is enabled and the field is selected by source filtering (as in the “_source”: [ “field1.*”, “field2.*” ] example), or
  • you disabled “_source” by setting “_source” : { “enabled” : false } but set this field’s own “store” option to “true”.
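The last case can be sketched as follows. This is a simplified illustration rather than Elasticsearch’s actual logic: the `retrievable_fields` helper is hypothetical, it ignores source filtering, and the mapping reuses the field names from the earlier example with “store” added to one of them. The “stored_fields” search parameter is the way to ask for individually stored fields.

```python
import json

# Hypothetical mapping: _source is disabled, but "accountId" is stored on its own.
mapping = {
    "_source": {"enabled": False},
    "properties": {
        "organizationId": {"type": "keyword"},
        "accountId": {"type": "keyword", "store": True},
    },
}


def retrievable_fields(mapping):
    """Return the field names whose original values can be fetched back.

    Simplified rule: every field when _source is enabled (the default),
    otherwise only the fields mapped with "store": true.
    """
    source_enabled = mapping.get("_source", {}).get("enabled", True)
    props = mapping.get("properties", {})
    if source_enabled:
        return sorted(props)
    return sorted(name for name, opts in props.items() if opts.get("store", False))


# Search body that asks for the stored field explicitly.
search_body = {"query": {"match_all": {}}, "stored_fields": ["accountId"]}

print(retrievable_fields(mapping))
print(json.dumps(search_body))
```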

What is index?

For a field to be searchable, the mapping option “index” must be set to “true”, which is also the default value, so every field is indexed unless you say otherwise.

There used to be 3 different states: “analyzed”, “not_analyzed” and “no”.
In recent versions of Elasticsearch (5.0 and later), the only possible values are “true” and “false”.

We now make the “analysis” decision with the “type” property. The old “string” type has been split into “text” and “keyword”.
For “not_analyzed”, we use “keyword” and for “analyzed”, we use “text”.

If we do not want the field to be indexed, we set the “index” property to “false”.

This is an analyzed field declaration (you can remove index property as it is true by default):

{
  "foo": {
    "type": "text",
    "index": true
  }
}

And this one is a not_analyzed field declaration:

{
  "foo": {
    "type": "keyword",
    "index": true
  }
}
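The practical difference between the two declarations can be shown with a toy model. The `index_terms` function below is hypothetical, and its lowercase-and-split step is only a stand-in for a real analyzer; the point is that a “text” value is broken into several terms while a “keyword” value is kept as a single exact term.

```python
import re


def index_terms(value, field_type):
    """Return the terms a toy index would store for a field value.

    Assumption: "text" is analyzed with a simple lowercase + split analyzer,
    while "keyword" keeps the whole value as one exact term.
    """
    if field_type == "text":
        return re.findall(r"\w+", value.lower())
    if field_type == "keyword":
        return [value]
    raise ValueError(f"unsupported field type: {field_type}")


value = "Quick Brown Fox"
text_terms = index_terms(value, "text")
keyword_terms = index_terms(value, "keyword")

print("quick" in text_terms)               # a single word matches the text field
print("quick" in keyword_terms)            # but not the keyword field
print("Quick Brown Fox" in keyword_terms)  # only the exact value matches
```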

What is norms?

Norms store normalization factors (numbers that represent the relative field length and the index-time boost setting) that are used at query time to compute the score of a document relative to a query.

Although useful for scoring, norms also require quite a lot of memory (typically in the order of one byte per document per field in your index, even for documents that do not have this specific field). As a consequence, if you do not need scoring on a specific field, you should disable norms on that field.

You can disable norms on a field like this:

"properties": {
  "title": {
    "type": "text",
    "norms": false
  }
}
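To get a feel for the “one byte per document per field” figure, here is a rough back-of-the-envelope calculation. The function name, the document count, and the field counts are made-up values for illustration, and the one-byte cost is simply the approximation quoted above.

```python
def estimate_norms_bytes(doc_count, fields_with_norms, bytes_per_norm=1):
    """Rough estimate: one norm value per document per field with norms enabled."""
    return doc_count * fields_with_norms * bytes_per_norm


# Hypothetical index: 50 million documents, 8 text fields with norms enabled.
docs_in_index = 50_000_000
before = estimate_norms_bytes(docs_in_index, fields_with_norms=8)

# Disabling norms on 5 fields that never need scoring leaves 3 fields.
after = estimate_norms_bytes(docs_in_index, fields_with_norms=3)

print(f"{before / 1024**2:.0f} MiB -> {after / 1024**2:.0f} MiB")
```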

Nodes & Shards

  • An Elasticsearch node contains shards.
  • A single Elasticsearch index is spread across nodes using shards.
  • Each shard holds a partition of the documents in the Elasticsearch index.
  • Each one of these shards is an instance of Lucene.

What is routing?

Elasticsearch can store copies of an index’s data across multiple shards on multiple nodes. When running a search request, Elasticsearch selects a node containing a copy of the index’s data and forwards the search request to that node’s shards. This process is known as search shard routing or routing.

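By default, the document’s ID is used as the routing value. When you index a document, you can instead specify an optional routing value, which determines the shard the document is routed to. After that, it becomes important to provide the same routing value whenever indexing, getting, deleting, or updating that document. To make the routing value mandatory for these operations, set “required” to true under “_routing” in the mapping: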

{
  "_routing": {
    "required": true
  },
  "properties": {
    "organizationId": {
      "type": "keyword"
    },
    "accountId": {
      "type": "keyword"
    }
  }
}
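Why the same routing value has to be supplied every time becomes clearer with a simplified version of the routing rule, shard = hash(routing) % number of primary shards. The sketch below is an approximation: `pick_shard` is a hypothetical helper, and `zlib.crc32` stands in for the hash function Elasticsearch really uses, so the shard numbers will not match a real cluster.

```python
import zlib


def pick_shard(routing_value, num_primary_shards):
    """Simplified routing: shard = hash(routing) % number_of_primary_shards.

    Assumption: zlib.crc32 replaces the real hash function, so the resulting
    shard numbers are for illustration only.
    """
    if num_primary_shards <= 0:
        raise ValueError("num_primary_shards must be positive")
    digest = zlib.crc32(routing_value.encode("utf-8"))
    return digest % num_primary_shards


# The same routing value always lands on the same shard...
shard_on_index = pick_shard("org-42", 5)
shard_on_get = pick_shard("org-42", 5)
print(shard_on_index == shard_on_get)

# ...while a missing or different routing value may point at another shard,
# where the document would not be found.
print(pick_shard("org-42", 5), pick_shard("doc-id-1", 5))
```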

This is an example for a bulk request with routing:
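A minimal sketch of what such a request body could look like, built in Python. The index name “accounts”, the document IDs, and the `build_bulk_body` helper are assumptions made for this illustration; the sketch only builds the newline-delimited JSON body and does not send it anywhere. Each action line carries a “routing” value, here taken from the document’s “organizationId”.

```python
import json


def build_bulk_body(index_name, docs, routing_field):
    """Build a newline-delimited bulk body with a routing value per action.

    docs is a list of (doc_id, document) pairs; the routing value is read
    from routing_field in each document.
    """
    lines = []
    for doc_id, doc in docs:
        action = {
            "index": {
                "_index": index_name,
                "_id": doc_id,
                "routing": doc[routing_field],
            }
        }
        lines.append(json.dumps(action))  # action and metadata line
        lines.append(json.dumps(doc))     # document source line
    # The bulk body must end with a newline.
    return "\n".join(lines) + "\n"


# Hypothetical documents using the field names from the mapping above.
docs = [
    ("1", {"organizationId": "org-1", "accountId": "acc-1"}),
    ("2", {"organizationId": "org-2", "accountId": "acc-7"}),
]

body = build_bulk_body("accounts", docs, "organizationId")
print(body)
```

Routing by “organizationId” keeps all documents of one organization on the same shard, so a search that passes the same routing value only needs to visit that shard.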

Happy Coding!
