Quick Cheat Sheet for Big Data Technologies

This article presents quick details on some of the key open-source technologies (tools and frameworks) associated with Big Data. The objective is to present these tools and frameworks in a well-categorized manner, using a top-down approach in which the data engineering and data science aspects of Big Data are associated with the relevant tools and frameworks. Most of these tools and frameworks can be found in commercial Hadoop distributions such as Cloudera, Hortonworks, and MapR. Please feel free to comment or suggest if I have missed one or more important frameworks.

Following is the key classification of the tools and frameworks described later in this article:

  • Data Engineering
    • Data collection/aggregation/transfer
    • Data processing
    • Data storage
    • Data access
    • Data coordination
  • Data Science
    • Machine Learning (Data analytics)

 

Key Open-Source Data Engineering Technologies

Following is the list of open-source tools and frameworks used to meet the various requirements of data engineering:

  • Data collection/aggregation/transfer
    • Data collection/aggregation from streaming sources
      • Apache Flume: A distributed, reliable service for collecting, aggregating, and moving large amounts of streaming data, such as web server logs, into HDFS or HBase in near real time.
    • Data transfer between Hadoop & RDBMS systems
      • Apache Sqoop: A tool used for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
  • Data processing
    • Apache Hadoop: The framework at the core of Big Data technologies, used for the distributed processing of large data sets across clusters of computers. Following are some of its key components:
      • Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data.
      • Hadoop MapReduce: A framework for the parallel processing of large data sets.
      • Hadoop YARN: A framework for job scheduling and cluster resource management.
      • Hadoop Common: The common utilities that support the other Hadoop modules.
    • Apache Spark: An in-memory data processing engine that simplifies large-scale data processing and is typically much faster than disk-based MapReduce for iterative workloads.
    • Apache Crunch: A framework for writing, testing, and running MapReduce pipelines.
    • Apache Oozie: A workflow scheduler system to manage Apache Hadoop jobs.
    • Apache Avro: A data serialization system.
  • Data storage
    • HDFS: Part of the Apache Hadoop framework; a distributed file system that provides high-throughput access to application data.
    • Apache HBase: A distributed, scalable data store that is ideally suited for random, real-time read/write access to very large tables, on the order of billions of rows and millions of columns.
  • Data access
    • Apache Hive: A data warehouse framework for querying and managing large datasets residing in distributed storage, using a SQL-like language (HiveQL).
    • Apache Pig: A platform for analyzing large data sets, consisting of a high-level language (Pig Latin) for expressing data analysis programs, together with infrastructure for evaluating those programs.
    • Hue: A web interface for analyzing data with Apache Hadoop. An open-source product developed by Cloudera.
    • Apache DataFu: Libraries for large-scale data processing in Hadoop and Pig. The libraries fall into two categories:
      • A collection of useful user-defined functions for data analysis in Apache Pig.
      • A collection of libraries for incrementally processing data using Hadoop MapReduce.
  • Data coordination
    • Apache ZooKeeper: A highly reliable distributed coordination service used for maintaining configuration information, naming, providing distributed synchronization, and providing group services to applications such as HBase.
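To give a feel for the MapReduce programming model mentioned above, the classic word-count job can be sketched in plain Python. This is a conceptual sketch only, not code that runs on a Hadoop cluster; the `map_fn`, `shuffle`, and `reduce_fn` names are made up for illustration, standing in for the mapper, the framework's shuffle phase, and the reducer of a real job:

```python
from collections import defaultdict

def map_fn(line):
    # Mapper: emit a (word, 1) pair for every word in an input line.
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs):
    # Shuffle/sort: group all intermediate values by key, as the
    # framework does between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_fn(key, values):
    # Reducer: sum the counts emitted for each word.
    return key, sum(values)

lines = ["big data tools", "big data frameworks"]
mapped = (pair for line in lines for pair in map_fn(line))
counts = dict(reduce_fn(k, v) for k, v in shuffle(mapped).items())
print(counts)  # → {'big': 2, 'data': 2, 'tools': 1, 'frameworks': 1}
```

On an actual cluster, the framework distributes the map tasks across nodes, sorts and groups the intermediate pairs by key, and runs the reducers in parallel; the logic above mirrors those three phases on a single machine.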

Key Open-Source Data Science Technologies

Following is the list of open-source/free tools that can be used for data analytics:

  • Machine Learning (Data analytics)
    • Apache Mahout: A collection of scalable machine learning libraries for the classification, clustering, and collaborative filtering of data stored in Hadoop.
    • R Project: A free software environment for statistical computing and graphics. Very useful for working with machine learning algorithms.
    • R Studio: A powerful and productive user interface for R.
    • GNU Octave: A high-level interpreted language, primarily intended for numerical computations related to linear and nonlinear problems, and for performing other numerical experiments. Very useful for machine learning modeling.
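To give a feel for what a clustering library such as Apache Mahout does (at far larger scale), here is a minimal k-means sketch in plain Python. The data points, starting centroids, and the `kmeans` helper are all made up for illustration; real workloads would use Mahout, R, or a similar tool instead:

```python
import math

def kmeans(points, centroids, iterations=10):
    # Plain-Python k-means: assign each point to its nearest centroid,
    # then move each centroid to the mean of its assigned points.
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(xs) / len(c) for xs in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Two obvious groups of 2-D points (made-up data).
points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(centroids)  # → [(1.25, 1.5), (8.5, 8.75)]
```

The same assign-then-recompute loop underlies distributed implementations; frameworks like Mahout simply run the assignment and averaging steps as parallel jobs over partitioned data.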

 

Ajitesh Kumar


I have recently been working in the area of data analytics, including data science and machine learning / deep learning. I am also passionate about different technologies, including programming languages such as Java/JEE, JavaScript, Python, R, Julia, etc., and technologies such as Blockchain, mobile computing, cloud-native technologies, application security, cloud computing platforms, big data, etc. I would love to connect with you on LinkedIn. Check out my latest book, titled First Principles Thinking: Building winning products using first principles thinking.
Posted in Big Data.