Following is the key classification of the tools/frameworks that are described later in this article:
- Data Engineering
  - Data collection/aggregation/transfer
  - Data processing
  - Data storage
  - Data access
  - Data coordination
- Data Science
  - Machine Learning (Data analytics)
Key Open-Source Data Engineering Technologies
Following is a list of open-source tools and frameworks that address the key requirements of data engineering:
- Data collection/aggregation/transfer
- Data collection/aggregation from streaming sources
- Apache Flume: A distributed service for collecting, aggregating, and moving data from sources such as web server logs into HDFS or HBase in real time.
- Data transfer between Hadoop & RDBMS systems
- Apache Sqoop: A tool used for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
- Data processing
- Apache Hadoop: The framework at the core of Big Data technologies, used for the distributed processing of large data sets across clusters of computers (a minimal MapReduce sketch appears after this list). Following are some of its key components:
- Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data
- Hadoop MapReduce: Parallel Processing of large data sets
- Hadoop YARN: A framework for job scheduling and cluster resource management.
- Hadoop Common: Common utilities that support the other Hadoop modules
- Apache Spark: A fast, in-memory data processing engine with simple, expressive APIs for large-scale data processing.
- Apache Crunch: A framework for writing, testing, and running MapReduce pipelines
- Apache Oozie: A workflow scheduler system to manage Hadoop jobs.
- Apache Avro: A data serialization system
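To make the processing layer concrete, here is a minimal sketch of a Hadoop MapReduce job: the classic word count, written against the org.apache.hadoop.mapreduce API. The class names and the input/output paths (passed as command-line arguments) are illustrative.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every token in the input line.
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce phase: sum the counts emitted for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Packaged as a JAR, such a job would typically be submitted with `hadoop jar wordcount.jar WordCount <input> <output>`, with YARN scheduling the tasks and HDFS serving the input splits.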
- Data storage
- HDFS: Part of the Apache Hadoop framework, HDFS is a distributed file system that provides high-throughput access to application data.
- Apache HBase: A distributed, scalable data store well suited for random, real-time read/write access to very large data sets, on the order of billions of rows by millions of columns (a client sketch follows this section).
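As a sketch of how the storage layer is accessed from application code, the snippet below writes and then reads a single cell using the HBase Java client API. The table name, column family, and row key are hypothetical, and an hbase-site.xml describing the cluster is assumed to be on the classpath.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        // Reads hbase-site.xml from the classpath for cluster/ZooKeeper settings.
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("web_metrics"))) { // hypothetical table

            // Random, real-time write: one cell in column family "d", qualifier "clicks".
            Put put = new Put(Bytes.toBytes("page#home"));
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("clicks"), Bytes.toBytes("42"));
            table.put(put);

            // Random, real-time read of the same row.
            Result result = table.get(new Get(Bytes.toBytes("page#home")));
            byte[] clicks = result.getValue(Bytes.toBytes("d"), Bytes.toBytes("clicks"));
            System.out.println("clicks = " + Bytes.toString(clicks));
        }
    }
}
```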
- Data access
- Apache Hive: A framework for querying and managing large datasets residing in distributed file storage systems (a JDBC query sketch follows this section).
- Pig: A platform for analyzing large data sets that consists of a high-level language for expressing and evaluating data analysis programs.
- Hue: A web interface for analyzing data with Apache Hadoop. An open-source product developed by Cloudera.
- Apache DataFu: Libraries for large-scale data processing in Hadoop and Pig. The libraries fall into two categories:
- Collection of useful user-defined functions for data analysis in Apache Pig
- Collection of libraries for incrementally processing data using Hadoop MapReduce.
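On the access side, a common pattern is to query Hive from Java over JDBC via HiveServer2. The sketch below assumes a HiveServer2 endpoint at localhost:10000 and a hypothetical web_logs table; Hive compiles the SQL-like query into jobs that execute on the cluster.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver (hive-jdbc must be on the classpath).
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        String url = "jdbc:hive2://localhost:10000/default"; // assumed HiveServer2 endpoint
        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement();
             // Hypothetical table: web_logs(page STRING, ...)
             ResultSet rs = stmt.executeQuery(
                     "SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page ORDER BY hits DESC LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
            }
        }
    }
}
```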
- Data coordination
- Apache ZooKeeper: A highly reliable distributed coordination service used for maintaining configuration information, naming, distributed synchronization, and group services for applications such as HBase (a client sketch follows below).
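The coordination layer is usually consumed through the ZooKeeper client API (directly, or via a higher-level wrapper such as Apache Curator). Below is a minimal sketch using the plain Java client; the znode path and payload are illustrative.

```java
import java.nio.charset.StandardCharsets;
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZooKeeperExample {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);

        // Connect to a ZooKeeper ensemble (here a single local server) with a 5s session timeout.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await(); // wait until the session is established

        String path = "/demo-config"; // hypothetical znode holding shared configuration
        byte[] payload = "feature-x=enabled".getBytes(StandardCharsets.UTF_8);

        // Create the znode if it does not exist yet, then read it back.
        if (zk.exists(path, false) == null) {
            zk.create(path, payload, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }
        System.out.println("config = " + new String(zk.getData(path, false, null), StandardCharsets.UTF_8));

        zk.close();
    }
}
```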
Key Open-Source Data Science Technologies
Following is a list of open-source/free tools that can be used for data analytics:
- Machine Learning (Data analytics)
- Apache Mahout: Collection of scalable machine learning libraries for classification, clustering, and collaborative filtering of data stored in Hadoop (a recommender sketch appears after this list).
- R Project: A free software environment for statistical computing and graphics. Very useful for working with machine learning algorithms.
- R Studio: A powerful and productive user interface for R.
- GNU Octave: A high-level interpreted language primarily intended for numerical computations involving linear and nonlinear problems, and for performing other numerical experiments. Very useful for machine learning modeling.
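To illustrate the machine-learning layer, here is a minimal sketch of user-based collaborative filtering with Mahout's Taste API. The ratings.csv file and its userID,itemID,rating format are assumptions made for the example; for cluster-scale data, the corresponding Mahout jobs would run over input stored in HDFS.

```java
import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class RecommenderExample {
    public static void main(String[] args) throws Exception {
        // Assumed input: ratings.csv with lines of the form userID,itemID,rating
        DataModel model = new FileDataModel(new File("ratings.csv"));

        // User-based collaborative filtering: similar users "vote" on items.
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Top 3 recommendations for a (hypothetical) user with ID 42.
        List<RecommendedItem> recommendations = recommender.recommend(42L, 3);
        for (RecommendedItem item : recommendations) {
            System.out.println("item " + item.getItemID() + " score " + item.getValue());
        }
    }
}
```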