Document Search Architecture to Search Millions of Documents

This article represents different document search architectural models using which one could create a search architecture that could search through 100s of millions of documents in faster time (milliseconds) with most up-to-date and fresh results. If you are planning to create a document search infrastructure which could search millions of documents, and shows up results in less than a second time, go ahead and explore different models and adopt the one that suits your needs at this stage. Note that the models given below could scale to multiple data centers. In this blog, we shall try and examine different architecture models that could achieve the search timing of less than a second. Please feel free to comment/suggest if I missed to mention one or more important points. Also, sorry for the typos.

Following are some of the systems components that a web-based search system could have:

  • Search UI
  • Search application
  • RDBMS Database where data resides
  • Application cache or NoSQL
  • Intermediate filesystems for greater throughput

 

Model 1: Search UI Retrieves Results From Lucene – Search 5 Million Documents

The documents are written in a database. Index builder retrieves that documents from the database, and create lucene index. For the new data, the index builder retrieves the new document, creates the Lucene indices and tools like RSync is used to sync the new indices with existing lucene indices.

When the users enter the search keyword phrase and submits, following happens:

  • Document Ids for documents consisting of the search keyword phrase are found
  • While creating Lucene index, information related with each document Id is also stored in form of Stored Fields. The documents with document id and stored fields information is retrieved and passed on to front-end in some data formats such as JSON, which is then displayed in the UI.

This approach might work well until you reach a couple of million documents. Post that, one may hit performance problems, primarily, related with caching vis-a-vis systems’ capacity to load the entire Lucene index in memory.

 

Model 2: Search UI Retrieves Results using Both Lucene & Database – Search 20-30 Million Documents

Due to system cache related issues, one could adopt this technique where lucene index just stored the document id.

  • Use Lucene to search Document Id for all documents matching search keywords
  • Get the document detailed information from another dedicated database consisting of aggregated information, rather than using Lucene stored fields

This could help you serve 20-30 million documents.

However, this model has disadvantage of table locking leading to degraded performance. When the search UI tries to read the data from searchable DB, locking leads to another aggregation service being unable to update the DB with new datasets. Alternatively, search UI has to wait for the Lock when aggregation service is updating the DB with new datasets. This could however be sorted out using data replication on the redundant server. However, the replication delays could lead to situation when one may not find document information for a particular document Id that was retrieved from Lucene index.

 

Model 3: Search UI Retrieves Results using Lucene, Application Cache & Database – Search 100 Millions of Docs

In this model, one could adopt application caching such as Memcache. Thus, following would be search flow:

  • Search Lucene index for the documents Ids
  • Go to application cache to find the matching document
  • If not found, go to database

One could write a wrapper service that could interface with the application cache and database to retrieve the data. This would solve some of the problem of above model. Store the data/new data in application cache using a daemon/background thread (that reads from the database and put new data) and have Search UI hit the application cache rather than the database and, go to database if only it misses.

This could help you server 100 miilions documents or more. However, this model may have some of the following limitations:

  • Write contention due to dependence on the database; This was impacting the background thread to read from database and update application cache.
  • Loading large data from DB to the application cache
  • Need to replicate data to different data centres

 

Model 4: Search UI Retrieves Results using Lucene, Application Cache, An Intermediate Serialized Filesystem & Database – Search 200-300 Million Docs

Above limitations was solved by having an intermediate serialized file system where data used to get serialized on to the disk and, Search UI, if unable to find the data in application cache, reach out to this serialized file system and then, to the DB. Following is how the search flow would look like:

  • Search Lucene index for the documents Ids
  • Go to application cache to find the matching document
  • If not found, go to the serialized filesystem
  • If not found, do not do anything.

This model helps to scale the system across different data centers. What is copied to different data centers is the set of files in this serialized filesystem.
This model helps in avoiding the write contention as the background thread now reads from this serialized filesystem which gets synced with up-to-date data from DB using RSync. It no more requires the access to DB locally or remotely. The serialized file system could store the documents based on creating folders and file name using ID information. For example, someid1/someid2/someno1/someno2/someno3.txt

This could help one achieve the document searches of more than 200-300 million a day. Post that, it may hit limitations primarily due to slow updates to intermediate filesystem given the need to update large number of files. This is due to the fact that the writes were random. In next consideration, we will look into sequential writes and LSM Tree to achieve greater efficiency.

 

Model 5: Search UI Retrieves Results using Lucene, Application Cache, An Intermediate LSM Tree based Index & Database – Search 500+ Million Docs

Instead of storing the data in an intermediate filesystem based on random I/O, one could use the index created based data structure algorithm such as LSM Tree. One could use LSM-Tree based implementation such as LevelDB to achieve this implementation scenario. This index is stored in following manner:

  • A size of LSM-Tree based index is stored in-memory
  • Remaining index data is stored on the disk based on LSM-Tree structure.

In this kind of index, the data is written with complexity log(n) and is therefore very efficient. Following is the steps for read and write:

  • Read: This is how a read request is addressed.
    • Search Lucene index for the documents Ids
    • Go to application cache to find the matching document
    • If not found, go to LSM-Tree based index and search. Note that LSM-Tree based index is stored in-memory and disk. The document is first searched in in-memory. If not found, it is then, looked in the indices stored onto the disk.
  • Write: This is how a write request is addressed.
    • Data from the RDBMS database is aggregated to be written on application cache and LSM-Tree based index.
    • The new documents are stored appropriately in application cache.
    • These new documents are also stored in the LSM-Tree based index.

 

Ajitesh Kumar
Latest posts by Ajitesh Kumar (see all)

Ajitesh Kumar

I have been recently working in the area of Data analytics including Data Science and Machine Learning / Deep Learning. I am also passionate about different technologies including programming languages such as Java/JEE, Javascript, Python, R, Julia, etc, and technologies such as Blockchain, mobile computing, cloud-native technologies, application security, cloud computing platforms, big data, etc. I would love to connect with you on Linkedin. Check out my latest book titled as First Principles Thinking: Building winning products using first principles thinking.
Posted in Big Data. Tagged with , .