The following are some of the system components that a web-based search system could have:
The documents are stored in a database. An index builder retrieves the documents from the database and creates the Lucene index. For new data, the index builder retrieves the new documents and creates Lucene indices for them, and a tool such as rsync is used to sync the new indices with the existing Lucene indices.
When a user enters a search phrase and submits it, the following happens:
This approach might work well until you reach a couple of million documents. Beyond that, one may hit performance problems, primarily related to caching and the system's capacity to load the entire Lucene index into memory.
To work around these system cache issues, one could adopt a technique where the Lucene index stores only the document ID, and the document information is read from the database using that ID.
This could help you serve 20-30 million documents.
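To make the ID-only idea concrete, here is a minimal sketch. It is not Lucene: a plain dictionary stands in for the index, SQLite stands in for the searchable DB, and the `documents` table and the function names are assumptions made for illustration.

```python
import sqlite3
from collections import defaultdict


def build_id_index(rows):
    """Map each lowercase term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in rows:
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, conn, term):
    """Resolve IDs from the index, then load the documents from the database."""
    doc_ids = sorted(index.get(term.lower(), ()))
    if not doc_ids:
        return []
    placeholders = ",".join("?" * len(doc_ids))
    query = f"SELECT id, body FROM documents WHERE id IN ({placeholders}) ORDER BY id"
    return conn.execute(query, doc_ids).fetchall()


# Hypothetical data set: the index keeps only IDs, the database keeps the bodies.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, body TEXT)")
docs = [(1, "lucene index basics"), (2, "scaling search"), (3, "lucene at scale")]
conn.executemany("INSERT INTO documents VALUES (?, ?)", docs)

index = build_id_index(docs)
results = search(index, conn, "Lucene")
```

Because the index holds only IDs, it stays small enough to cache, at the cost of one database read per search.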
However, this model has the disadvantage of table locking, which degrades performance. When the search UI reads data from the searchable DB, the lock prevents the aggregation service from updating the DB with new datasets. Conversely, the search UI has to wait for the lock while the aggregation service is updating the DB with new datasets. This could be addressed by replicating the data to a redundant server. However, replication delays could lead to situations where no document information is found for a document ID that was retrieved from the Lucene index.
In this model, one could adopt an application cache such as Memcached. The search flow would then be as follows:
One could write a wrapper service that interfaces with the application cache and the database to retrieve the data. This would solve some of the problems of the above model. A daemon or background thread reads new data from the database and stores it in the application cache, and the search UI hits the application cache rather than the database, going to the database only on a cache miss.
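A minimal sketch of such a wrapper service follows. The `DocumentService` class, its method names, and the `documents` table are hypothetical; a plain dictionary stands in for Memcached and SQLite stands in for the database, so this shows only the lookup order, not a production client.

```python
import sqlite3


class DocumentService:
    """Hypothetical wrapper: read from the cache first, fall back to the database."""

    def __init__(self, conn, cache=None):
        self.conn = conn
        self.cache = {} if cache is None else cache  # stand-in for Memcached
        self.db_reads = 0  # counts cache misses that reached the database

    def get(self, doc_id):
        doc = self.cache.get(doc_id)
        if doc is not None:
            return doc
        self.db_reads += 1
        row = self.conn.execute(
            "SELECT body FROM documents WHERE id = ?", (doc_id,)
        ).fetchone()
        if row is None:
            return None
        self.cache[doc_id] = row[0]  # populate the cache for the next reader
        return row[0]

    def warm(self, since_id=0):
        """What a background daemon might do: copy rows newer than since_id into the cache."""
        rows = self.conn.execute(
            "SELECT id, body FROM documents WHERE id > ?", (since_id,)
        ).fetchall()
        for doc_id, body in rows:
            self.cache[doc_id] = body
        return len(rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany("INSERT INTO documents VALUES (?, ?)", [(1, "first"), (2, "second")])

service = DocumentService(conn)
first_read = service.get(1)   # cache miss, goes to the database
second_read = service.get(1)  # served from the cache
warmed = service.warm(since_id=1)
third_read = service.get(2)   # already warmed, no database read
```

With a real cache, `warm` would run periodically in the daemon, and `since_id` would be whatever marker of new data the aggregation service exposes.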
This could help you serve 100 million documents or more. However, this model may have some of the following limitations:
The above limitations were solved by introducing an intermediate serialized file system, where the data is serialized to disk. If the search UI cannot find the data in the application cache, it reaches out to this serialized file system and then to the DB. The search flow would look as follows:
This model helps to scale the system across different data centers. What is copied to different data centers is the set of files in this serialized filesystem.
This model helps avoid write contention, as the background thread now reads from this serialized filesystem, which is kept in sync with up-to-date data from the DB using rsync. It no longer requires access to the DB, locally or remotely. The serialized file system could store the documents by creating folders and file names from ID information, for example someid1/someid2/someno1/someno2/someno3.txt.
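One way to derive such a path from an ID is sketched below. The layout is an assumption chosen for illustration (a zero-padded numeric ID split into three-digit groups, stored as JSON), not the scheme the original system used; `doc_path`, `write_doc`, and `read_doc` are hypothetical helpers.

```python
import json
import tempfile
from pathlib import Path


def doc_path(root, doc_id, width=12, group=3):
    """Turn a numeric ID into nested folders, e.g. 12345678 -> 000/012/345/678.json."""
    if doc_id < 0:
        raise ValueError("doc_id must be non-negative")
    digits = str(doc_id).zfill(width)
    parts = [digits[i:i + group] for i in range(0, len(digits), group)]
    return Path(root).joinpath(*parts[:-1], parts[-1] + ".json")


def write_doc(root, doc_id, doc):
    """Serialize one document to its own file, creating folders as needed."""
    path = doc_path(root, doc_id)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(doc))
    return path


def read_doc(root, doc_id):
    """Return the stored document, or None when the file has not been synced yet."""
    path = doc_path(root, doc_id)
    if not path.exists():
        return None
    return json.loads(path.read_text())


with tempfile.TemporaryDirectory() as root:
    write_doc(root, 12345678, {"title": "Scaling Lucene"})
    found = read_doc(root, 12345678)
    missing = read_doc(root, 42)
```

Splitting the ID into groups keeps any single directory from holding millions of entries, and a tool such as rsync can copy the resulting tree to other data centers file by file.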
This could help one serve more than 200-300 million document searches a day. Beyond that, it may hit limitations, primarily due to slow updates to the intermediate filesystem, given the need to update a large number of files. This is because the writes are random. Next, we will look into sequential writes and the LSM-Tree to achieve greater efficiency.
Instead of storing the data in an intermediate filesystem that relies on random I/O, one could use an index built on a data structure such as the LSM-Tree. An LSM-Tree based implementation such as LevelDB could be used for this. The index is stored in the following manner:
In this kind of index, data is written with O(log n) complexity, which makes writes very efficient. The following are the steps for read and write:
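As a rough illustration of the general LSM-Tree pattern, the toy sketch below buffers writes in memory, flushes them as sorted immutable segments, and reads from the newest data first. It is not LevelDB's API or on-disk format: `TinyLSM` and its methods are hypothetical, segments are kept in memory rather than on disk, and the write-ahead log and multi-level compaction are left out.

```python
import bisect


class TinyLSM:
    """Toy LSM-style store: a mutable memtable plus sorted, immutable segments."""

    def __init__(self, memtable_limit=2):
        self.memtable = {}
        self.segments = []  # oldest first; each is a sorted list of (key, value)
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        """Writes only touch the memtable; a full memtable is flushed in one pass."""
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        """Write the memtable out as one sorted segment (a sequential write on disk)."""
        if self.memtable:
            self.segments.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        """Check the memtable, then binary-search segments from newest to oldest."""
        if key in self.memtable:
            return self.memtable[key]
        for segment in reversed(self.segments):
            i = bisect.bisect_left(segment, (key,))
            if i < len(segment) and segment[i][0] == key:
                return segment[i][1]
        return None

    def compact(self):
        """Merge all segments into one, keeping the newest value for each key."""
        merged = {}
        for segment in self.segments:  # oldest first, so newer values overwrite
            merged.update(segment)
        self.segments = [sorted(merged.items())] if merged else []


db = TinyLSM(memtable_limit=2)
db.put("a", "1")
db.put("b", "2")  # memtable is full, so it is flushed as the first segment
db.put("a", "3")
db.put("c", "4")  # second flush; "a" now exists in both segments
latest_a = db.get("a")
db.compact()
```

An update never rewrites an old segment in place: the newer value simply shadows the older one until compaction merges them, which is what replaces random writes with sequential ones.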