Quick Cheat Sheet for Big Data Technologies

This article presents quick details on some of the key open-source technologies (tools and frameworks) associated with Big Data. The objective is to present these tools and frameworks in a well-categorized manner using a top-down approach, in which the data engineering and data science aspects of Big Data are associated with the relevant tools and frameworks. Most of these tools and frameworks can be found in commercial Hadoop distributions such as Cloudera, Hortonworks, and MapR. Please feel free to comment or suggest if I have missed one or more important frameworks.

Following is the key classification of the tools/frameworks that are briefly described later in this article:

  • Data Engineering
    • Data collection/aggregation/transfer
    • Data processing
    • Data storage
    • Data access
    • Data coordination
  • Data Science
    • Machine Learning (Data analytics)

 

Key Open-Source Data Engineering Technologies

Following is the list of open-source tools and frameworks used to meet the various requirements of data engineering:

  • Data collection/aggregation/transfer
    • Data collection/aggregation from streaming sources
      • Apache Flume: A distributed service for collecting, aggregating and moving streaming data from different sources, such as web server logs, into HDFS or HBase in real time.
    • Data transfer between Hadoop & RDBMS systems
      • Apache Sqoop: A tool used for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
  • Data processing
    • Apache Hadoop: A framework, forming the core of Big Data technologies, that is used for the distributed processing of large data sets across clusters of computers. Following are its key modules:
      • Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
      • Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
      • Hadoop YARN: A framework for job scheduling and cluster resource management.
      • Hadoop Common: Common utilities that support the other Hadoop modules.
    • Apache Spark: An in-memory data processing engine for large-scale data processing, with high-level APIs that make such jobs easier to write.
    • Apache Crunch: A framework for writing, testing, and running MapReduce pipelines.
    • Apache Oozie: A workflow scheduler system to manage Hadoop jobs.
    • Apache Avro: A data serialization system.
  • Data storage
    • HDFS: Part of the Apache Hadoop framework, HDFS is a distributed file system that provides high-throughput access to application data.
    • Apache HBase: A distributed, scalable data store suited to random, real-time read/write access to very large tables, on the order of billions of rows by millions of columns.
  • Data access
    • Apache Hive: A framework for querying and managing large datasets residing in distributed file storage systems.
    • Apache Pig: A platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating those programs.
    • Hue: An open-source web interface for analyzing data with Apache Hadoop, developed by Cloudera.
    • Apache DataFu: Libraries for large-scale data processing in Hadoop and Pig. The libraries fall into the following two categories:
      • Collection of useful user-defined functions for data analysis in Apache Pig.
      • Collection of libraries for incrementally processing data using Hadoop MapReduce.
  • Data coordination
    • Apache ZooKeeper: A highly reliable distributed coordination service used for maintaining configuration information, naming, providing distributed synchronization, and providing group services to applications such as HBase.
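
To make the "parallel processing of large data sets" idea behind Hadoop MapReduce a bit more concrete, here is a minimal single-machine sketch of the map, shuffle and reduce steps in plain Python, using the classic word-count example. It does not use Hadoop or any of its APIs; the function names (map_phase, shuffle_phase, reduce_phase, word_count) and the sample input are hypothetical and chosen only for illustration. On a real cluster, the framework would run many mappers and reducers in parallel and perform the shuffle across machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group emitted values by key (done by the framework on a cluster)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reducer: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence over an iterable of text lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["big data tools", "big data frameworks"]
    print(word_count(sample))
```

The mapper and reducer are kept free of shared state on purpose: that independence is what allows a framework to distribute them across many machines.
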

Key Open-Source Data Science Technologies

Following is the list of open-source/free tools that can be used for data analytics:

  • Machine Learning (Data analytics)
    • Apache Mahout: Collection of libraries for classification, clustering and collaborative filtering on data stored in Hadoop.
    • R Project: A free software environment for statistical computing and graphics. Very useful for working with machine learning algorithms.
    • RStudio: A powerful and productive user interface for R.
    • GNU Octave: A high-level interpreted language, primarily intended for numerical computations involving linear and nonlinear problems, and for performing other numerical experiments. Very useful for machine learning modeling.
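
As a rough illustration of the kind of clustering these tools are used for, here is a minimal k-means sketch (Lloyd's algorithm) on one-dimensional data in plain Python. It is not Mahout's, R's or Octave's implementation; the function name kmeans_1d, its parameters and the sample points are assumptions made only for this example, and the initial centroids are passed in explicitly so the result is deterministic.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Cluster 1-D points with Lloyd's algorithm from caller-supplied centroids."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids
    return centroids, clusters


if __name__ == "__main__":
    data = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
    final_centroids, final_clusters = kmeans_1d(data, centroids=[0.0, 5.0])
    print(final_centroids, final_clusters)
```

Libraries such as Mahout apply the same assign-then-update loop, but distribute the work over data that is too large for a single machine.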

 

Ajitesh Kumar
