Quick Cheat Sheet for Big Data Technologies

This article gives a quick overview of some of the key open-source technologies (tools and frameworks) associated with Big Data. It presents them in a categorized, top-down manner, mapping the data engineering and data science aspects of Big Data to the relevant tools and frameworks. Most of these tools and frameworks ship with commercial Hadoop distributions such as Cloudera, Hortonworks and MapR. Please feel free to comment or suggest additions if I have missed one or more important frameworks.

Following is the key classification of the tools and frameworks described later in this article:

Data Engineering
- Data collection/aggregation/transfer
- Data processing
- Data storage
- Data access
- Data coordination
Data Science
- Machine Learning (Data analytics)

Key Open-Source Data Engineering Technologies

The following open-source tools and frameworks address the main requirements of data engineering:

Data collection/aggregation/transfer
- Data collection/aggregation from streaming sources
  - Apache Flume: A distributed service for collecting, aggregating and moving large amounts of data from different sources, such as web server logs, into HDFS or HBase in near real time.
- Data transfer between Hadoop & RDBMS systems
  - Apache Sqoop: A tool used for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
Data processing
- Apache Hadoop: The framework at the core of Big Data technologies, used for the distributed processing of large data sets across clusters of computers. Its key components are:
  - Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
  - Hadoop MapReduce: A programming model and framework for parallel processing of large data sets.
  - Hadoop YARN: A framework for job scheduling and cluster resource management.
  - Hadoop Common: Common utilities that support the other Hadoop modules.
- Apache Spark: A data processing engine that handles large-scale workloads through high-level APIs, keeping data in memory where possible.
- Apache Crunch: A framework for writing, testing, and running MapReduce pipelines
- Apache Oozie: A workflow scheduler system to manage Hadoop jobs.
- Apache Avro: A data serialization system
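
To make the MapReduce entry above more concrete, here is a minimal single-process sketch of the map, shuffle and reduce steps, using the classic word-count task. It does not use Hadoop at all; the function names (map_phase, shuffle_phase, reduce_phase, word_count) are my own, chosen only for illustration. In a real cluster the framework runs the map and reduce calls in parallel on different nodes and performs the shuffle itself.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle step: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    # Here the map calls run sequentially in one process; a real
    # MapReduce job would distribute them across the cluster.
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


sample = ["big data tools", "big data frameworks"]
print(word_count(sample))
```

The point of the split is that map_phase and reduce_phase never share state, which is what allows a framework to spread them over many machines.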
Data storage
- HDFS: Part of the Apache Hadoop framework; a distributed file system that provides high-throughput access to application data.
- Apache HBase: A distributed, scalable data store suited to random, real-time read/write access to very large tables, on the order of billions of rows by millions of columns.
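
The kind of access HBase offers can be pictured with a toy in-memory table. The sketch below is only an illustration of the data model (cells addressed by row key and a "family:qualifier" column, rows scanned in key order); the SparseTable class and its method names are hypothetical and are not the HBase client API.

```python
from collections import defaultdict


class SparseTable:
    """Toy table: row key -> "family:qualifier" -> value.

    Hypothetical illustration only; this is not the HBase client API.
    """

    def __init__(self):
        # Rows are created on first write; absent columns take no space.
        self._rows = defaultdict(dict)

    def put(self, row_key, column, value):
        """Random write: set one cell, addressed by row key and column."""
        self._rows[row_key][column] = value

    def get(self, row_key, column=None):
        """Random read: fetch one cell, or the whole row if column is None."""
        row = self._rows.get(row_key, {})
        return dict(row) if column is None else row.get(column)

    def scan(self, start_key, stop_key):
        """Range scan over row keys in sorted order, stop_key exclusive."""
        for key in sorted(self._rows):
            if start_key <= key < stop_key:
                yield key, dict(self._rows[key])


table = SparseTable()
table.put("user#001", "profile:name", "Ada")
table.put("user#001", "activity:last_login", "2015-01-01")
table.put("user#002", "profile:name", "Lin")
print(table.get("user#001", "profile:name"))
print(list(table.scan("user#001", "user#002")))
```

Because each row stores only the columns actually written, a table can be very wide without every row paying for every column.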
Data access
- Apache Hive: A data warehouse framework for querying and managing large datasets residing in distributed storage, using a SQL-like language (HiveQL).
- Apache Pig: A platform for analyzing large data sets that consists of a high-level language (Pig Latin) for expressing data analysis programs, together with the infrastructure for evaluating them.
- Hue: A web interface for analyzing data with Apache Hadoop; an open-source product developed by Cloudera.
- Apache DataFu: Libraries for large-scale data processing in Hadoop and Pig. The libraries fall into two groups:
  - Collection of useful user-defined functions for data analysis in Apache Pig
  - Collection of libraries for incrementally processing data using Hadoop MapReduce.
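
To give a feel for the kind of query Hive is typically used for, the sketch below runs a simple GROUP BY aggregation. Note the assumptions: Python's built-in sqlite3 module stands in for Hive so that the example runs anywhere, and the page_views table and its columns are made up for illustration. The SQL itself is plain enough that a very similar statement could be written in HiveQL, where it would run as distributed jobs over files in HDFS rather than against a local database.

```python
import sqlite3

# sqlite3 is only a stand-in for Hive here; the table and column
# names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (user_id TEXT, url TEXT)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?)",
    [("u1", "/home"), ("u2", "/home"), ("u1", "/docs")],
)

# Count views per URL, most viewed first.
query = """
    SELECT url, COUNT(*) AS views
    FROM page_views
    GROUP BY url
    ORDER BY views DESC, url
"""
rows = conn.execute(query).fetchall()
print(rows)
```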
Data coordination
- Apache ZooKeeper: A highly reliable distributed coordination service used for maintaining configuration information, naming, distributed synchronization and group services for applications such as HBase.

Key Open-Source Data Science Technologies

The following open-source or free tools can be used for data analytics:

Machine Learning (Data analytics)
- Apache Mahout: A collection of scalable machine learning libraries for classification, clustering and collaborative filtering on data stored in Hadoop.
- R Project: A free software environment for statistical computing and graphics. Very useful for working with machine learning algorithms.
- RStudio: A powerful and productive user interface for R.
- GNU Octave: A high-level interpreted language, primarily intended for numerical computations involving linear and nonlinear problems, and for performing other numerical experiments. Very useful for machine learning modeling.
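
Clustering, one of the techniques listed for Mahout above, can be illustrated with a small k-means sketch. This is plain Python written for this article, not Mahout's implementation or API; the kmeans function, its parameters and the sample points are all assumptions made for the example. Starting centroids are passed in explicitly so that the result is deterministic.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means on 2-D points with caller-supplied starting centroids."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for x, y in points:
            distances = [(x - cx) ** 2 + (y - cy) ** 2 for cx, cy in centroids]
            clusters[distances.index(min(distances))].append((x, y))
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for old, cluster in zip(centroids, clusters):
            if cluster:
                new_centroids.append((
                    sum(p[0] for p in cluster) / len(cluster),
                    sum(p[1] for p in cluster) / len(cluster),
                ))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return centroids, clusters


points = [(1, 1), (1, 2), (9, 9), (10, 9)]
centroids, clusters = kmeans(points, [(0, 0), (10, 10)])
print(centroids)
print(clusters)
```

Libraries such as Mahout exist because this loop has to be distributed once the points no longer fit on one machine; the assignment step in particular parallelizes naturally.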

Author
Ajitesh Kumar
