
Quick Cheat Sheet for Big Data Technologies

This article presents quick details on some of the key open-source technologies (tools and frameworks) associated with Big Data. The objective is to present these tools and frameworks in a well-categorized manner, using a top-down approach in which the data engineering and data science aspects of Big Data are mapped to the relevant tools and frameworks. Most of them ship with commercial Hadoop distributions such as Cloudera, Hortonworks, and MapR. Please feel free to comment or suggest additions if I have missed one or more important frameworks.

Following is the key classification of the tools/frameworks that are briefly described later in this article:

  • Data Engineering
    • Data collection/aggregation/transfer
    • Data processing
    • Data storage
    • Data access
    • Data coordination
  • Data Science
    • Machine Learning (Data analytics)

 

Key Open-Source Data Engineering Technologies

Following is the list of open-source tools and frameworks that address the main requirements of data engineering:

  • Data collection/aggregation/transfer
    • Data collection/aggregation from streaming sources
      • Apache Flume: A distributed service for collecting, aggregating and moving large amounts of log and event data from different sources, such as web server logs, into HDFS or HBase in near real time.
    • Data transfer between Hadoop & RDBMS systems
      • Apache Sqoop: A tool used for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
  • Data processing
    • Apache Hadoop: A framework that forms the core of Big Data technologies and is used for the distributed processing of large data sets across clusters of computers. Following are its key components:
      • Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
      • Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
      • Hadoop YARN: A framework for job scheduling and cluster resource management.
      • Hadoop Common: Common utilities that support the other Hadoop modules.
    • Apache Spark: A general-purpose engine for large-scale data processing that keeps working data in memory where possible and exposes high-level APIs in Scala, Java, Python and R.
    • Apache Crunch: A framework for writing, testing, and running MapReduce pipelines.
    • Apache Oozie: A workflow scheduler system to manage Hadoop jobs.
    • Apache Avro: A data serialization system
  • Data storage
    • HDFS: Part of the Apache Hadoop framework, HDFS is a distributed file system that provides high-throughput access to application data.
    • Apache HBase: A distributed, scalable data store suited to random, real-time read/write access to very large tables, on the order of billions of rows by millions of columns.
  • Data access
    • Apache Hive: A framework for querying and managing large datasets residing in distributed file storage systems.
    • Apache Pig: A platform for analyzing large data sets that consists of a high-level language (Pig Latin) for expressing data analysis programs, coupled with infrastructure for evaluating them.
    • Hue: An open-source web interface for analyzing data with Apache Hadoop, originally developed by Cloudera.
    • Apache DataFu: Libraries for large-scale data processing in Hadoop and Pig. The libraries fall into two groups:
      • Collection of useful user-defined functions for data analysis in Apache Pig
      • Collection of libraries for incrementally processing data using Hadoop MapReduce.
  • Data coordination
    • Apache ZooKeeper: A highly reliable distributed coordination service used for maintaining configuration information, naming, providing distributed synchronization, and providing group services to applications such as HBase.
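
To make the Hadoop MapReduce entry above a little more concrete, following is a minimal sketch of the map, shuffle and reduce phases, written as a word count in plain Python. It runs in a single process and does not use Hadoop's own API; the function names (map_phase, shuffle_phase, reduce_phase, word_count) are illustrative assumptions, not part of any framework. On a real cluster, the framework would run the map and reduce steps in parallel across machines and perform the shuffle itself.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group emitted values by key (done by the framework on a cluster)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Chain the three phases together for an in-memory list of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Hypothetical sample input, used only for illustration.
    print(word_count(["big data tools", "big data frameworks"]))
```

The point of the sketch is the split of responsibilities: the mapper and the reducer only see one line or one key at a time, which is what allows a framework to distribute them.
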

Key Open-Source Data Science Technologies

Following is the list of open-source/free tools that can be used for data analytics:

  • Machine Learning (Data analytics)
    • Apache Mahout: Collection of libraries for classification, clustering and collaborative filtering of Hadoop data.
    • R Project: A free software environment for statistical computing and graphics. Very useful for working with machine learning algorithms.
    • RStudio: An integrated development environment (IDE) for R.
    • GNU Octave: A high-level interpreted language, primarily intended for numerical computations such as solving linear and nonlinear problems, and for performing other numerical experiments. Very useful for prototyping machine learning models.
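
As an illustration of the clustering category mentioned for Apache Mahout above, following is a minimal k-means sketch in plain Python. It is not Mahout's implementation or API. The function name kmeans, the naive choice of the first k points as starting centroids, and the sample points are all assumptions made for this example.

```python
import math


def kmeans(points, k, iterations=10):
    """Group points into k clusters by repeatedly re-assigning and re-averaging."""
    # Naive initialisation (an assumption of this sketch): the first k points.
    centroids = [points[i] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for point in points:
            nearest = min(range(k), key=lambda i: math.dist(point, centroids[i]))
            clusters[nearest].append(point)
        # Update step: move each centroid to the mean of its members.
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centroids, clusters


if __name__ == "__main__":
    # Hypothetical 2-D sample data with two visibly separate groups.
    sample = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
    centroids, clusters = kmeans(sample, k=2)
    print(centroids)
    print(clusters)
```

A library such as Mahout runs the same assign-and-update loop at scale, with the distance computations distributed across a cluster.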

 

Ajitesh Kumar

I have recently been working in the area of data analytics, including Data Science and Machine Learning / Deep Learning. I am also passionate about different technologies, including programming languages such as Java/JEE, JavaScript, Python, R and Julia, and technologies such as Blockchain, mobile computing, cloud-native technologies, application security, cloud computing platforms and big data. I would love to connect with you on LinkedIn. Check out my latest book, First Principles Thinking: Building winning products using first principles thinking.
