Category Archives: Big Data

42 Free Online Books on Machine Learning & Data Science

Machine Learning Books

This post presents a comprehensive list of 42 free books on machine learning that are available online for self-paced learning. It should be very helpful for data scientists who are starting out in, or looking to gain expertise in, machine learning and deep learning. Please feel free to comment/suggest if I missed one or more important books that you like and would like to share. Following are the key areas under which the books are categorized: Pattern Recognition & Machine Learning; Probability & Statistics; Neural Networks & Deep Learning. List of 42 Online Free eBooks on Machine Learning: Following is a list of 42 FREE online …

Continue reading

Posted in Big Data, Data Science, Machine Learning.

Spark – How does Apache Spark Work?

This blog explains, with the help of diagrams, how Apache Spark works. Following are some of the key aspects of Apache Spark described in this blog: Apache Spark – basic concepts; Apache Spark with YARN & HDFS/HBase; Apache Spark with Mesos & HDFS/HBase. Apache Spark – Basic Concepts: The following represents the basic concepts in relation to Spark. Apache Spark with YARN & HBase/HDFS: Following are some of the key architectural building blocks representing how Apache Spark works with YARN and HDFS/HBase. The Spark driver program runs on the client node. YARN is used as the cluster manager. As part of the YARN setup, there would be multiple nodes running …
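To make the YARN picture above a bit more concrete, here is a minimal, hedged PySpark sketch of a driver that asks YARN for executors and reads input from HDFS; the application name, executor count, and HDFS path are illustrative assumptions, not taken from the original post.

```python
from pyspark.sql import SparkSession

# Hypothetical driver running on the client node: YARN acts as the cluster
# manager and allocates executors, while HDFS serves the input data.
spark = (
    SparkSession.builder
    .appName("yarn-hello")                      # illustrative app name
    .master("yarn")                             # YARN as cluster manager
    .config("spark.executor.instances", "2")    # assumed executor count
    .getOrCreate()
)

# Assumed HDFS path -- replace with a file that actually exists on your cluster.
lines = spark.read.text("hdfs:///tmp/sample.txt")
print(lines.count())

spark.stop()
```

In practice these settings are often passed on the spark-submit command line rather than hard-coded in the driver.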

Continue reading

Posted in Big Data.

HBase Architecture Components for Beginners

HBase Architectural Building Blocks

This blog presents high-level concepts on HBase architecture components. The following diagram represents the key building blocks: HBase Architecture Components – Key Building Blocks. Pay attention to some of the following in relation to the above diagram: HMaster: responsible for coordinating the region servers, including assigning regions on startup as well as during recovery, and monitoring region servers using Zookeeper. Region Servers: each manages one or more regions. Zookeeper: used as a distributed coordination service for maintaining the server state of the cluster. Regions: records in HBase tables are split horizontally based on key range; each of these splits is called a region. A region contains all rows in …
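To make the notion of regions and key ranges more tangible, the following is a small, hedged sketch using the happybase Python client, which talks to HBase through a Thrift gateway; the host, table name, and column family here are assumptions for illustration only.

```python
import happybase

# Assumed Thrift gateway host; HMaster, the region servers and Zookeeper do
# the coordination behind the scenes -- the client only sees tables and rows.
connection = happybase.Connection("hbase-thrift-host")

table = connection.table("users")   # assumed table with column family "info"

# Rows are stored sorted by row key; contiguous key ranges map to regions,
# which HMaster assigns to region servers.
table.put(b"user#0001", {b"info:name": b"alice"})
table.put(b"user#0002", {b"info:name": b"bob"})

# A scan over a key range only touches the regions covering that range.
for key, data in table.scan(row_start=b"user#0001", row_stop=b"user#0100"):
    print(key, data)

connection.close()
```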

Continue reading

Posted in Big Data.

What Happens When a Spark Application Starts on a Spark Standalone Cluster?

This article presents a detailed view of what happens when a driver program (Spark application) is started on one of the worker nodes of a Spark standalone cluster. Please feel free to comment/suggest if I missed one or more important points. Following are the key points described later in this article: a snapshot of what happens when the Spark standalone cluster starts; a snapshot of what happens when a Spark application (Spark shell) starts on one of the worker nodes; a snapshot of what happens when a Spark application (Spark shell) stops on the worker node. What happens when the Spark standalone cluster starts? In our …
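As a minimal sketch of the driver side of this story, the PySpark snippet below starts a driver against a standalone master; the master URL and application name are assumptions for illustration.

```python
from pyspark.sql import SparkSession

# Assumed standalone master URL -- spark://<master-host>:7077 by default.
MASTER_URL = "spark://spark-master:7077"

# Creating the SparkSession turns this process into the driver: it registers
# with the standalone Master, which asks Workers to launch executors for it.
spark = (
    SparkSession.builder
    .appName("standalone-driver-demo")
    .master(MASTER_URL)
    .getOrCreate()
)

# A trivial job so that executors actually get scheduled.
print(spark.sparkContext.parallelize(range(100)).sum())

# Stopping the session de-registers the application and releases its executors.
spark.stop()
```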

Continue reading

Posted in Big Data, Dockers.

Hello World with Apache Spark Standalone Cluster on Docker

This article presents instructions and code samples for Docker enthusiasts to quickly get started with setting up an Apache Spark standalone cluster using Docker containers. Thanks to the owner of this page for putting up the source code which has been used in this article. Please feel free to comment/suggest if I missed one or more important points. Following are the key points described later in this article: basic concepts of an Apache Spark cluster; steps to set up the Apache Spark standalone cluster; a code sample for setting up Spark; a code sample for docker-compose to start the cluster; a code sample for starting the driver program using Spark …
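As a rough, hedged sketch of what such a setup can look like when driven from Python rather than docker-compose, the snippet below uses the Docker SDK for Python to start one master and one worker container; the image name and its SPARK_MODE/SPARK_MASTER_URL environment variables are assumptions, not the images or scripts from the original article.

```python
import docker

client = docker.from_env()

# Assumed Spark image; any image that can run a standalone master/worker works.
SPARK_IMAGE = "bitnami/spark:latest"

client.networks.create("spark-net", driver="bridge")

master = client.containers.run(
    SPARK_IMAGE,
    name="spark-master",
    network="spark-net",
    environment={"SPARK_MODE": "master"},               # image-specific assumption
    ports={"8080/tcp": 8080, "7077/tcp": 7077},
    detach=True,
)

worker = client.containers.run(
    SPARK_IMAGE,
    name="spark-worker-1",
    network="spark-net",
    environment={
        "SPARK_MODE": "worker",                         # image-specific assumption
        "SPARK_MASTER_URL": "spark://spark-master:7077",
    },
    detach=True,
)

print(master.name, worker.name)
```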

Continue reading

Posted in Big Data.

Docker – How to Get Started with Spark on Windows

This article presents tips on how to get started with Apache Spark on Windows using Docker. Please feel free to comment/suggest if I missed one or more important points. If you are familiar with Docker, the instructions below will help you get started with Spark in no time. Download Spark from the https://spark.apache.org/downloads.html page. Remember to select a package type with an option such as “Pre-built…”. Once the zipped files are downloaded, unzip them under the location “C:\Users\<Username>”. Build the Java 8 image and start the container, following the instructions on this page: http://vitalflux.com/dockers-how-to-get-started-with-java8-dev-environment/. Once the container is started, go to the folder where you …
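Once the container is up, a quick way to confirm that Spark itself works is a local-mode smoke test; the sketch below is an illustrative check and not part of the original instructions.

```python
from pyspark.sql import SparkSession

# Local mode needs no cluster manager, so it behaves the same inside a
# Docker container on Windows as anywhere else.
spark = (
    SparkSession.builder
    .appName("windows-docker-smoke-test")
    .master("local[*]")
    .getOrCreate()
)

words = spark.sparkContext.parallelize(["spark", "on", "docker", "on", "windows"])
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
print(sorted(counts.collect()))

spark.stop()
```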

Continue reading

Posted in Big Data, Dockers.

9 Linux Foundation Projects for IoT, Cloud, Big Data

This article presents the top Linux Foundation projects related to IoT, Cloud and Big Data. With the convergence of these three technology domains, it becomes of utmost importance to keep track of the news/announcements happening in these areas. References for all the projects can be found on this page. Following are the key Linux Foundation projects related to IoT, Cloud and Big Data. IoT (Internet of Things): AllSeen Alliance: a cross-industry consortium dedicated to enabling interoperability among the billions of devices, services and apps that comprise the Internet of Things (IoT). Bookmark its announcements and news for the latest information. IoTivity: an open-source software framework enabling seamless device-to-device connectivity to address …

Continue reading

Posted in Big Data, Cloud, IOT.

Top 5 Pages listing Big Data Conferences in 2016

This article presents the top 5 pages listing global big data conferences coming up in 2016. Please feel free to comment/suggest if I missed any other important pages. Following are the top 5 pages: Global Big Data Conference; KDnuggets' list of meetings/conferences on analytics, big data, data mining and data science; important big data events coming up in 2016; a big data conference directory listing big data conferences happening around the world; O’Reilly's list of conferences on various topics, including big data.

Posted in Big Data.

Docker – How to Get Started with Cloudera

This article presents information and code/scripts which can be used to get started with Cloudera using Docker. Please feel free to comment/suggest if I missed one or more important points. Following are the key points described later in this article: Docker machine configuration; Cloudera & Docker; testing the Cloudera installation; scripts to install & run Cloudera. Docker Machine Configuration: To run Cloudera in a Docker container, one would need to apply the following configuration to the Docker machine. Open Oracle VM VirtualBox Manager. Stop the default machine. Then change the settings as shown below: change the processor (core) setting to 2; change the memory …
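For readers who prefer to script the container start-up, here is a hedged sketch using the Docker SDK for Python; the cloudera/quickstart image, its start command, and the port mappings are assumptions based on the (now archived) QuickStart container, not the exact scripts from this article.

```python
import docker

client = docker.from_env()

# Assumed image and start command for the Cloudera QuickStart container;
# adjust to whatever Cloudera image you actually use.
container = client.containers.run(
    "cloudera/quickstart",
    "/usr/bin/docker-quickstart",
    hostname="quickstart.cloudera",
    privileged=True,
    tty=True,
    detach=True,
    ports={
        "8888/tcp": 8888,   # Hue
        "7180/tcp": 7180,   # Cloudera Manager
    },
)

print(container.name, container.status)
```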

Continue reading

Posted in Big Data, DevOps, Dockers.

Hadoop Map-Reduce Explained with an Example

This article presents the key steps of a Hadoop MapReduce job using a word count example. Please feel free to comment/suggest if I missed one or more important points. Following are the key steps of how Hadoop MapReduce works for a word count problem: Input is fed to a program, say a RecordReader, that reads data line by line or record by record. The mapping process then starts, which includes the following steps: combining: combines each word with its count, such as 1; partitioning: creates one partition for each word occurrence; shuffling: moves words to the right partition; sorting: sorts each partition by word. The last step is reducing, which comes up with …
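For a concrete feel of the map and reduce steps, here is a hedged word-count sketch written for Hadoop Streaming in Python; the file name, role flag, and streaming invocation are illustrative and not taken from the original article.

```python
# wordcount.py -- run as the mapper or the reducer of a Hadoop Streaming job,
# e.g. (illustrative): -mapper "python wordcount.py map" -reducer "python wordcount.py reduce"
import sys


def mapper():
    """Emit (word, 1) pairs for every word read from stdin -- the map phase."""
    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")


def reducer():
    """Sum the counts per word; Hadoop delivers keys to the reducer already sorted."""
    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")


if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```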

Continue reading

Posted in Big Data.

Big Data – How is Data Retrieved from and Written to HDFS?

This blog represents my notes on how data is read from and written to HDFS. Please feel free to suggest corrections if it works otherwise. Following are the steps by which clients retrieve data from HDFS: the client asks the NameNode for a file/data block; the NameNode returns information (IDs) for the DataNodes where the file/data blocks are located; the client retrieves the data directly from those DataNodes. Following are the steps by which data is written to HDFS: the client tells the NameNode that it wants to write one or more data blocks pertaining to a file; the NameNode returns the DataNodes to which these data blocks need to be written; the client writes each data block to the suggested DataNodes. The …
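As a hedged illustration of this flow from a client's point of view, the sketch below uses the hdfs Python package (a WebHDFS client); the NameNode URL, user, and file path are assumptions.

```python
from hdfs import InsecureClient

# Assumed WebHDFS endpoint of the NameNode (HTTP port 9870 on recent Hadoop
# releases, 50070 on older ones) and an assumed HDFS user.
client = InsecureClient("http://namenode-host:9870", user="hdfs")

# Write: the NameNode decides which DataNodes should hold the blocks, and the
# client then streams the bytes directly to those DataNodes.
client.write("/tmp/notes.txt", data=b"hello hdfs\n", overwrite=True)

# Read: the NameNode returns block locations, and the data comes straight
# from the DataNodes holding those blocks.
with client.read("/tmp/notes.txt") as reader:
    print(reader.read())
```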

Continue reading

Posted in Big Data.

Hadoop Map-Reduce Described With Example

I came across a great page describing Hadoop MapReduce and the HDFS architecture. The page presents some of the following details: HDFS responsibilities and execution flows; key characteristics of the MapReduce lifecycle; a sample example involving a web crawler and a Hadoop MapReduce setup.

Posted in Big Data.

Learn R – How to Get Data Frame Columns as Vectors

This article presents different ways in which one can get a data frame column as a vector. Please feel free to comment/suggest if I missed one or more important points. 4 Techniques to Get a Data Frame Column as a Vector: In the examples below, the diamonds dataset from the ggplot2 package is used. This is what the diamonds dataset looks like. Following are four different techniques/methods using which one can retrieve a data frame column as a vector. # In the data set shown above, carat represents a column name, hence [['carat']]: carat1 <- diamonds[['carat']] # In the data set shown above, carat represents the 1st column …

Continue reading

Posted in Big Data.

Top 8 Data Science Training Institutes in India

This article lists the top 8 data science/analytics training institutes in India. Some of them, such as INSOFE, provide only classroom coaching, while others, such as Edureka, provide online training. Please feel free to comment/suggest if I missed one or more important points. Following is the list of training institutes, which are detailed later in this article: INSOFE, Jigsaw Academy, UReach Solutions, AnalytixLabs, Edureka, SpringPeople, SimpliLearn, EduPristine. INSOFE: The International School of Engineering was launched in 2011 with the aim of transforming the applied engineering education space in India. Their current focus area is Big Data Analytics / Data Science. Out of all of …

Continue reading

Posted in Big Data, Career Planning.

Top 5 Use Cases of Solr to Power Your Web & Mobile Search

This article presents the top 5 use cases for using Solr to power your web and mobile search. Note that for mobile search requirements, Solr exposes APIs that can be used to retrieve data from the Solr index server and serve it to the mobile client. It also presents a classification of websites that are using Solr to fulfill their search requirements. Please feel free to comment/suggest if I missed one or more important points. Following are the key points described later in this article: top 5 use cases for Solr search; different classes of websites using Solr to power their search engines. Top 5 Use Cases for Solr Search: Search engine: many …
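As a minimal, hedged example of the kind of API call a web or mobile backend would make, the snippet below queries Solr's standard /select handler over HTTP; the host, core name, and field names are assumptions for illustration.

```python
import requests

# Assumed Solr host and core name -- replace with your own deployment.
SOLR_SELECT_URL = "http://localhost:8983/solr/articles/select"

# The /select handler takes the query in the `q` parameter; `wt=json` asks
# for a JSON response that the backend can relay to a web or mobile client.
params = {"q": "title:spark", "wt": "json", "rows": 5}

response = requests.get(SOLR_SELECT_URL, params=params, timeout=10)
response.raise_for_status()

for doc in response.json()["response"]["docs"]:
    print(doc.get("id"), doc.get("title"))
```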

Continue reading

Posted in Big Data.

Dummies Notes on How Distributed Computing Works using Hadoop

Distributed Computing Using Hadoop

This article intends to present dummies' notes on how distributed computing works using Hadoop. As Hadoop is inspired by Google's GFS/Map-Reduce/BigTable papers, I have tried to refer to GFS/Map-Reduce/BigTable in this article wherever appropriate. One must note that the distributed computing paradigm has become mainstream given the large-scale Big Data projects being implemented in several companies. Please feel free to shout if you find discrepancies in my understanding and help me correct the mistakes. Simply speaking, distributed computing refers to the computing paradigm in which processing happens on multiple boxes holding the data, and the results are then aggregated appropriately to produce the final result. In traditional …
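As a toy, hedged illustration of this split-process-aggregate idea, the snippet below uses Python's multiprocessing on a single machine; it is not Hadoop itself, just the paradigm in miniature with made-up data blocks.

```python
from collections import Counter
from multiprocessing import Pool

# Toy "data blocks" -- in Hadoop these would live on different DataNodes.
BLOCKS = [
    "big data big compute",
    "data moves less compute moves more",
    "aggregate partial results at the end",
]


def count_words(block: str) -> Counter:
    """Process one block locally, like a map task running next to its data."""
    return Counter(block.split())


if __name__ == "__main__":
    # Each worker processes its own block in parallel...
    with Pool(processes=len(BLOCKS)) as pool:
        partial_counts = pool.map(count_words, BLOCKS)

    # ...and the partial results are then aggregated, like the reduce step.
    total = sum(partial_counts, Counter())
    print(total.most_common(5))
```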

Continue reading

Posted in Big Data, Dummies.