In this post, you will learn about Elasticsearch fundamentals and best practices through the following:
- Revision notes on Elasticsearch fundamentals
- A set of questions to test your knowledge and, in turn, help you learn Elasticsearch concepts related to indices and shards. These questions could also help you prepare for Elasticsearch interviews.
- A set of interview questions
Elasticsearch Fundamentals – Revision Notes
- Each Elasticsearch shard is a Lucene index
- The number of shards and replicas can be defined per index at the time of creation of the index. The number of replicas per shard can later be changed.
- A shard in Elasticsearch is a Lucene index made up of one or more Lucene segments, which store the document data in the form of an inverted index.
- Lucene segments are immutable
- Average shard size typically varies from 10 GB to 40 GB depending on the nature of the data stored in the index. Time-based data is commonly stored in shards of 20-40 GB.
- It is recommended to run the force-merge operation, which merges multiple smaller segments into a larger one, during off-peak hours and only on indices to which no more data is being written.
- It is recommended to keep the number of shards on a node at or below 20-25 per GB of heap space. Thus, a node with a 20 GB heap can hold up to 400-500 shards.
- Each shard has shard-level and segment-level metadata that needs to be kept in memory and thus uses heap space.
- The size of a shard can be managed using one of the following techniques:
  - Creating indices on a time basis (time-based indexing)
  - Creating a new index once a document count is reached, using the rollover API
  - Shrinking an existing index into a new index with fewer shards, using the shrink API
- It is recommended to determine the maximum shard size from a query performance perspective by benchmarking with realistic data and queries. There is no rule of thumb or one-size-fits-all solution for this.
- It is recommended to use time-based indices for managing data retention whenever possible. Data can be grouped into indices based on the retention period. This makes the indices easier to manage, since whole indices can be created and deleted as needed.
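The index-level operations mentioned in the notes above can be sketched as plain request descriptions. This is a minimal sketch, not a client implementation: each helper only builds a `(method, path, body)` tuple for the Elasticsearch REST API and nothing is sent over the network. The index names, the `logs-` prefix, the `logs-write` alias, and the daily `YYYY.MM.DD` naming scheme are assumptions made for illustration. Note that shrinking also assumes the source index has been made read-only and has a copy of every shard on one node, which is not shown here.

```python
from datetime import date, datetime, timedelta


def create_index(index, shards, replicas):
    """Shard and replica counts are set per index at creation time."""
    settings = {"number_of_shards": shards, "number_of_replicas": replicas}
    return ("PUT", f"/{index}", {"settings": {"index": settings}})


def set_replicas(index, replicas):
    """The replica count can still be changed on an existing index."""
    return ("PUT", f"/{index}/_settings", {"index": {"number_of_replicas": replicas}})


def rollover(alias, max_docs):
    """Roll the alias over to a new index once max_docs is reached."""
    return ("POST", f"/{alias}/_rollover", {"conditions": {"max_docs": max_docs}})


def shrink(source, target, source_shards, target_shards):
    """Shrink into a new index; the target count must be a factor of the source count."""
    if target_shards < 1 or source_shards % target_shards != 0:
        raise ValueError("target_shards must be a factor of source_shards")
    body = {"settings": {"index.number_of_shards": target_shards}}
    return ("POST", f"/{source}/_shrink/{target}", body)


def force_merge(index, max_num_segments=1):
    """Merge the segments of an index that is no longer written to."""
    return ("POST", f"/{index}/_forcemerge?max_num_segments={max_num_segments}", None)


def delete_expired(indices, today, retention_days, prefix="logs-"):
    """Build DELETE requests for daily indices older than the retention period."""
    cutoff = today - timedelta(days=retention_days)
    requests = []
    for name in indices:
        # Hypothetical naming scheme: <prefix>YYYY.MM.DD
        day = datetime.strptime(name[len(prefix):], "%Y.%m.%d").date()
        if day < cutoff:
            requests.append(("DELETE", f"/{name}", None))
    return requests


if __name__ == "__main__":
    daily = ["logs-2023.03.01", "logs-2023.03.10", "logs-2023.03.18"]
    plan = [
        create_index("logs-2023.03.19", shards=3, replicas=1),
        set_replicas("logs-2023.03.19", replicas=2),
        rollover("logs-write", max_docs=50_000_000),
        shrink("logs-2023.03.01", "logs-2023.03.01-small", 6, 2),
        force_merge("logs-2023.03.01-small"),
        *delete_expired(daily, date(2023, 3, 19), retention_days=14),
    ]
    for method, path, body in plan:
        print(method, path, body)
```

Keeping the helpers free of network calls makes it easy to review the requests before sending them with whichever HTTP client or Elasticsearch client library you prefer.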
Sample Interview Questions
- Explain the concepts of cluster, node, index, shard, and replica.
- How do you determine the shard size? What shard size is recommended for time-based data?
- How do updating and deleting documents in an index work?
- How many shards can be allocated to a node with a heap of 20 GB or so?
- Explain Lucene segments and the merging of segments.
- What are the rollover and shrink APIs used for?
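As a hint for the sizing questions, the arithmetic from the revision notes can be worked through in a few lines. This is a rough sketch that only encodes the guidelines stated above (20-25 shards per GB of heap, and a target shard size chosen by you). The 300 GB data volume and 30 GB target shard size in the usage lines are hypothetical figures, and the function names are made up for this example.

```python
import math


def shard_budget(heap_gb, shards_per_gb=(20, 25)):
    """Return the (low, high) shard-count guideline for a node's heap size."""
    low, high = shards_per_gb
    return heap_gb * low, heap_gb * high


def primary_shards_needed(data_gb, target_shard_gb):
    """Primary shards needed so that each one holds roughly target_shard_gb."""
    return max(1, math.ceil(data_gb / target_shard_gb))


if __name__ == "__main__":
    # A node with a 20 GB heap, as in the revision notes
    print(shard_budget(20))                # (400, 500)
    # Hypothetical index: 300 GB of data, 30 GB target shard size
    print(primary_shards_needed(300, 30))  # 10
```

The target shard size itself should still come from benchmarking with realistic data and queries, as the notes recommend.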
Sample Quiz (Objective Questions) on Elasticsearch
How many primary shards are created by default when an index is created in Elasticsearch?
How many replicas are created by default for each shard?
How many shards in total, including primary and replica shards, are created by default?
Shards can be further split into multiple shards
The number of shards of an index can be changed at any point in time
Data is available for querying as soon as _______
Lucene segments are immutable
Updating a document results in which of the following?
Deleting a document results in which of the following?
The more heap space a node has, the more data and shards it can handle.
The number of shards a node can hold depends on the available heap space
The smaller the shard size, the smaller the segments and the greater the overhead
Each query is executed in a single thread per shard
Which of the following APIs is used to create a new index once a pre-defined count of documents stored in an index is reached?
Which of the following APIs is used to reduce the number of shards when too many shards were configured initially?
Creating multiple shards of an index and partitioning the data into different indices are one and the same thing
Further Reading / References
- ElasticSearch Basic concepts
- ElasticSearch from the Bottom Up – Part 1
- How many shards should I have in my Elasticsearch cluster?
In this post, you reviewed key Elasticsearch concepts along with sample interview questions and a quiz. Did you find this article useful? Do you have any questions or suggestions about it? Leave a comment and I shall do my best to address your queries.