
Quick Cheat Sheet for Big Data Technologies

This article gives a quick overview of the key open-source technologies (tools and frameworks) associated with Big Data. The tools and frameworks are grouped top-down: the data engineering and data science aspects of Big Data are each mapped to the relevant tools and frameworks. Most of them ship with commercial Hadoop distributions such as Cloudera, Hortonworks, and MapR. Please feel free to comment or suggest additions if I have missed any important frameworks.

Following is the key classification of the tools/frameworks covered later in this article:

Data Engineering
- Data collection/aggregation/transfer
- Data processing
- Data storage
- Data access
- Data coordination
Data Science
- Machine Learning (Data analytics)

Key Open-Source Data Engineering Technologies

Following is the list of open-source tools and frameworks used to meet the main requirements of data engineering:

Data collection/aggregation/transfer
- Data collection/aggregation from streaming sources
  - Apache Flume: Collection of libraries for collecting, aggregating and moving data from different sources, such as Web Server Logs, into HDFS or HBase in real-time.
- Data transfer between Hadoop & RDBMS systems
  - Apache Sqoop: A tool used for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
Data processing
- Apache Hadoop: A framework that forms the core of Big Data technologies and is used for the distributed processing of large data sets across clusters of computers. Its key components are:
  - Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data
  - Hadoop MapReduce: A YARN-based system for parallel processing of large data sets
  - Hadoop YARN: A framework for job scheduling and cluster resource management.
  - Hadoop Common: Common utilities that support the other Hadoop modules
- Apache Spark: A general-purpose engine for large-scale data processing that can keep working data in memory, with high-level APIs in Scala, Java, Python and R.
- Apache Crunch: A framework for writing, testing, and running MapReduce pipelines
- Apache Oozie: A workflow scheduler system to manage Hadoop jobs.
- Apache Avro: A data serialization system
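
To make the MapReduce entry above more concrete, here is a minimal single-machine sketch of the map, shuffle and reduce steps, using word count as the example. It is plain Python and does not use the Hadoop API; the function names (map_phase, shuffle_phase, reduce_phase, word_count) are my own and only illustrate the idea. On a real cluster, Hadoop runs the map and reduce steps in parallel across many machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    # Each line is mapped independently, which is what makes the map
    # step easy to distribute across machines.
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["big data tools", "big data frameworks"]
    print(word_count(sample))
```

For the two sample lines, "big" and "data" are each counted twice and the other words once.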
Data storage
- HDFS: HDFS, being part of Apache Hadoop framework, is a distributed file system that provides high-throughput access to application data.
- Apache HBase: A distributed, scalable data store suited to random, real-time read/write access to very large tables, on the order of billions of rows by millions of columns.
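
A table with millions of possible columns is practical only if a row stores just the columns it actually uses. The sketch below is a toy in-memory stand-in for that idea, assuming a simple dictionary-of-dictionaries layout. It is not the HBase API: the SparseTable class, its methods, and the "family:qualifier" style column names in the usage example are hypothetical and chosen for illustration.

```python
from collections import defaultdict


class SparseTable:
    """Toy wide-column table: each row keeps only the columns it has."""

    def __init__(self):
        # row key -> {column name -> value}; absent columns take no space
        self._rows = defaultdict(dict)

    def put(self, row_key, column, value):
        """Random write: set one cell, creating the row if needed."""
        self._rows[row_key][column] = value

    def get(self, row_key, column, default=None):
        """Random read: fetch one cell by row key and column name."""
        # dict.get does not trigger the defaultdict factory, so reads
        # never create empty rows as a side effect.
        row = self._rows.get(row_key)
        if row is None:
            return default
        return row.get(column, default)

    def stored_cells(self):
        """Count the cells that are physically stored."""
        return sum(len(columns) for columns in self._rows.values())


if __name__ == "__main__":
    table = SparseTable()
    table.put("user#1", "profile:name", "Asha")
    table.put("user#2", "clicks:2024-01-01", 17)
    print(table.get("user#1", "profile:name"))
    print(table.get("user#1", "clicks:2024-01-01"))
    print(table.stored_cells())
```

Here two rows use two different columns, yet only two cells are stored; the missing combinations cost nothing, which is the property that lets a table grow very wide.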
Data access
- Apache Hive: A framework for querying and managing large datasets residing in distributed file storage systems.
- Apache Pig: A platform for analyzing large data sets that consists of a high-level language (Pig Latin) for expressing data analysis programs, together with infrastructure for evaluating them.
- Hue: An open-source web interface, developed by Cloudera, for analyzing data with Apache Hadoop.
- Apache DataFu: Libraries for large-scale data processing in Hadoop and Pig. They fall into two groups:
  - Collection of useful user-defined functions for data analysis in Apache Pig
  - Collection of libraries for incrementally processing data using Hadoop MapReduce.
Data coordination
- Apache ZooKeeper: A highly reliable distributed coordination service used for maintaining configuration information, naming, distributed synchronization, and group services for applications such as HBase.

Key Open-Source Data Science Technologies

Following is the list of open-source/free tools which could be used for data analytics:

Machine Learning (Data analytics)
- Apache Mahout: Collection of libraries for classification, clustering and collaborative filtering of Hadoop data.
- R Project: A free software environment for statistical computing and graphics. Very useful for working with machine learning algorithms.
- RStudio: An integrated development environment (IDE) that provides a powerful and productive user interface for R.
- GNU Octave: A high-level interpreted language, primarily intended for numerical computations such as solving linear and nonlinear problems, and for performing other numerical experiments. Very useful for machine learning modeling.

Ajitesh Kumar
