This article aims to describe how data is managed at LinkedIn.com, the most popular professional social networking site. Please shout out if you disagree with one or more of the aspects mentioned below. Also, do suggest if I missed one or more aspects.
Following are some of the data use cases that we may have come across while surfing various LinkedIn pages:
And, quite amazingly, the page load time in all of the above use cases has been in milliseconds, provided we have decent internet bandwidth. Kudos to the LinkedIn engineering team!
Like any other startup, LinkedIn in its early days lived on top of a single RDBMS containing a handful of tables for user profiles and connections. Quite natural, isn't it? This RDBMS was augmented by two additional database systems, one used to support full-text search over user profiles and the other used to support the social graph. These two systems were kept up to date by Databus, a change data capture system whose primary goal is to capture changes made to the source-of-truth systems (such as Oracle) and propagate them to the additional database systems.
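To make the change capture flow concrete, below is a minimal sketch (in Python, purely for illustration) of a Databus-style consumer loop. The event fields and the helpers fetch_changes_since, update_search_index and update_social_graph are hypothetical stand-ins, not LinkedIn's actual Databus API.

import time

def fetch_changes_since(last_scn):
    # Hypothetical stand-in: pull ordered change events committed after
    # last_scn from the source-of-truth database's change log.
    return []  # e.g. [{"scn": 42, "table": "profiles", "key": 7, "row": {...}}]

def update_search_index(event):
    print("reindexing profile", event["key"])

def update_social_graph(event):
    print("updating graph edges for", event["key"])

def run_consumer():
    last_scn = 0  # last change number already applied downstream
    while True:
        for event in fetch_changes_since(last_scn):
            # Apply every change, in commit order, to each derived system.
            update_search_index(event)
            update_social_graph(event)
            last_scn = event["scn"]
        time.sleep(1)  # poll again; a real pipeline streams instead of polling

# run_consumer() would be launched as a long-running daemon alongside the
# source-of-truth database.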
However, it soon became a challenge to meet the website's data needs with the above architecture, primarily because of the need to achieve all of the objectives below, which did not seem possible simultaneously as per Brewer's CAP theorem:
With the above said, the LinkedIn engineering team achieved what they call timeline consistency (a form of eventual consistency, delivered via the nearline systems explained below) along with the other two characteristics, availability and partition tolerance. Read further to see how the LinkedIn data architecture looks today.
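As a toy illustration of what timeline consistency means (my own simplification, not LinkedIn code): a replica may lag behind the master, but it always applies the master's committed writes in commit order, so a reader sees a consistent prefix of the timeline and never sees writes out of order.

class Master:
    def __init__(self):
        self.log = []                      # ordered commit log

    def write(self, key, value):
        self.log.append((key, value))      # commits form a single total order

class Replica:
    def __init__(self, master):
        self.master = master
        self.applied = 0                   # how far along the timeline we are
        self.data = {}

    def catch_up(self, max_events=1):
        # Apply at most max_events pending writes, never skipping ahead.
        for key, value in self.master.log[self.applied:self.applied + max_events]:
            self.data[key] = value
            self.applied += 1

    def read(self, key):
        return self.data.get(key)          # possibly stale, never out of order

master = Master()
replica = Replica(master)
master.write("skills:alice", ["java"])
master.write("skills:alice", ["java", "scala"])
replica.catch_up(max_events=1)
print(replica.read("skills:alice"))        # ['java'] -- stale but consistent
replica.catch_up()
print(replica.read("skills:alice"))        # ['java', 'scala'] -- caught up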
Given the need to support millions of users, with the related transactions completing in fractions of a second and no more, the above data architecture was definitely not enough. Thus, the LinkedIn engineering team came up with a three-phase data architecture consisting of online, offline, and nearline data systems. At a high level, LinkedIn data is stored in the following different kinds of data systems (look at the diagram below):
Following is a description of the three categories into which the above data stores fall:
Online DB Systems
The online systems handle users' real-time interactions; primary databases such as Oracle fall into this category. Primary data stores are used for users' writes and only a few reads. In the case of Oracle, writes are made to the Oracle master. Recently, LinkedIn has been working on yet another data system, named Espresso, to fulfill complex data needs that do not seem achievable with an RDBMS such as Oracle. Let's keep a tab on whether they will be able to move entirely to a NoSQL data store like Espresso and phase out Oracle completely or mostly.
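Here is a minimal sketch of that write/read split, with placeholder connection objects; the router class and its methods are illustrative assumptions, not LinkedIn code.

class FakeConnection:
    def __init__(self, name):
        self.name = name

    def execute(self, sql, params=()):
        print(f"[{self.name}] {sql} {params}")

class PrimaryStoreRouter:
    def __init__(self, master, replicas):
        self.master = master
        self.replicas = replicas

    def write(self, sql, params=()):
        # All writes go to the source-of-truth master.
        self.master.execute(sql, params)

    def read(self, sql, params=(), needs_fresh=False):
        # Only the few reads that need read-after-write freshness hit the
        # master; the rest are served by replicas or derived systems.
        conn = self.master if needs_fresh else self.replicas[0]
        conn.execute(sql, params)

router = PrimaryStoreRouter(FakeConnection("oracle-master"),
                            [FakeConnection("replica-1")])
router.write("UPDATE profiles SET title = ? WHERE id = ?", ("Engineer", 7))
router.read("SELECT title FROM profiles WHERE id = ?", (7,))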
Espresso is a horizontally scalable, indexed, timeline-consistent, document-oriented, highly available NoSQL data store. It is envisioned to replace legacy Oracle databases across the company's web operations. It was originally designed to provide a usability boost for LinkedIn's InMail messaging service. Currently, the following are some of the applications that use Espresso as a source-of-truth system. It is quite amazing to see a NoSQL data store handle so many applications' data needs.
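To illustrate the document-oriented access pattern (documents addressed by database, table, and key), here is a tiny in-memory stand-in; the class, method names, and the MailboxDB/Messages example are hypothetical and are not Espresso's actual API.

from collections import defaultdict

class DocumentStore:
    def __init__(self):
        # {database: {table: {key: document}}}
        self._data = defaultdict(lambda: defaultdict(dict))

    def put(self, database, table, key, document):
        self._data[database][table][key] = document

    def get(self, database, table, key):
        return self._data[database][table].get(key)

store = DocumentStore()
store.put("MailboxDB", "Messages", ("member123", "msg-1"),
          {"subject": "Hello", "body": "Congrats on the new role!"})
print(store.get("MailboxDB", "Messages", ("member123", "msg-1")))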
Offline DB Systems
The offline systems, primarily Hadoop and a Teradata warehouse, handle batch processing and analytic workloads. They are called offline because data is handled in batches. Apache Azkaban is used to manage the Hadoop and ETL jobs, which extract data from the primary source-of-truth systems, run map-reduce over it, store the results in HDFS, and notify consumers such as Voldemort to retrieve the data appropriately. Consumers like Voldemort retrieve the data from HDFS and swap their indexes so that they serve up-to-date data, as sketched below.
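The build-offline-then-swap pattern mentioned above can be sketched as follows; the directory layout, the CURRENT pointer file, and the function names are my own illustrative conventions, not Voldemort's actual on-disk format.

import json, os, tempfile

def build_new_version(base_dir, records):
    # Batch side: write a complete, immutable data version to a new directory.
    os.makedirs(base_dir, exist_ok=True)
    version_dir = tempfile.mkdtemp(prefix="version-", dir=base_dir)
    with open(os.path.join(version_dir, "data.json"), "w") as f:
        json.dump(records, f)
    return version_dir

def swap_in(base_dir, version_dir):
    # Serving side: atomically point readers at the freshly built version.
    pointer = os.path.join(base_dir, "CURRENT")
    tmp = pointer + ".tmp"
    with open(tmp, "w") as f:
        f.write(version_dir)
    os.replace(tmp, pointer)          # atomic rename = the "index swap"

def read_current(base_dir, key):
    with open(os.path.join(base_dir, "CURRENT")) as f:
        version_dir = f.read()
    with open(os.path.join(version_dir, "data.json")) as f:
        return json.load(f).get(key)

base = "pymk_store"                   # e.g. People You May Know results
v1 = build_new_version(base, {"member123": ["member456", "member789"]})
swap_in(base, v1)
print(read_current(base, "member123"))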
Nearline DB Systems (Timeline Consistency)
The nearline systems are created with the aim of achieving timeline consistency (eventual consistency) and handle features such as People You May Know (read-only data sets), search, and the social graph, which are updated constantly but whose latency requirements are slightly relaxed compared with the online systems. Following are the different types of nearline systems:
The following represents how data change capture events are propagated to the nearline systems using Databus:
Let's say you update your profile with your latest skills and position. You also accept a connection request. What happens underneath:
Following are lessons that one could learn and apply when designing a data architecture to achieve data consistency, scalability, and availability like that of LinkedIn.com:
Comments
Shouldn't the arrow between the business layer and the derived DB systems (reads) be going the other way?
Thanks for the wonderful article explaining the architecture and data flow at a high level.
One question: for a near-real-time system "where the reads are honored using Voldemort," I am wondering about your statement that the input to the Voldemort system comes from batch processing systems.
If that is the case, how does it become a near-real-time system, since batch processing on platforms like Hadoop is inherently slow and time consuming?
Did I miss something? Any clarification would be much appreciated.
Thanks
Thanks for sharing! I can recognise similar principles to the ones in CQRS: http://martinfowler.com/bliki/CQRS.html
I thought you guys replaced Databus with Kafka?