Tag Archives: bigdata

Hadoop Map-Reduce Explained with an Example

This article presents the key steps of a Hadoop MapReduce job using a word count example. Please feel free to comment or suggest if I missed one or more important points. Also, sorry for the typos. Following are the key steps of how Hadoop MapReduce works in a word count problem: Input is fed to a program, say a RecordReader, that reads data line-by-line or record-by-record. The mapping process then starts, which includes the following steps: Combining: combines the data (word) with its count, such as 1. Partitioning: assigns each word to a partition, so that every occurrence of a word lands in the same partition. Shuffling: moves words to the right partition. Sorting: sorts each partition by word. The last step is Reducing, which comes up with …
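The word count flow in this excerpt can be imitated in plain Python to make the stages concrete. This is only a single-process sketch, not Hadoop code: the function names (`record_reader`, `map_words`, `partition`, `shuffle_and_sort`, `reduce_counts`, `word_count`) and the byte-sum partitioning rule are assumptions made for illustration, and the final step is assumed to sum the counts per word.

```python
from collections import defaultdict


def record_reader(text):
    """Yield the input one non-empty line (record) at a time."""
    for line in text.splitlines():
        if line.strip():
            yield line


def map_words(line):
    """Pair every word in the line with a count of 1."""
    return [(word.lower(), 1) for word in line.split()]


def partition(word, num_partitions):
    """Pick a partition for a word (hypothetical rule: byte sum modulo
    the partition count), so the same word always gets the same partition."""
    return sum(word.encode("utf-8")) % num_partitions


def shuffle_and_sort(pairs, num_partitions):
    """Move each (word, 1) pair to its partition, then sort each partition by word."""
    partitions = [defaultdict(list) for _ in range(num_partitions)]
    for word, count in pairs:
        partitions[partition(word, num_partitions)][word].append(count)
    return [sorted(p.items()) for p in partitions]


def reduce_counts(sorted_partition):
    """Sum the counts collected for each word in one partition."""
    return [(word, sum(counts)) for word, counts in sorted_partition]


def word_count(text, num_partitions=2):
    """Run the read, map, shuffle/sort and reduce stages in sequence."""
    pairs = [pair for line in record_reader(text) for pair in map_words(line)]
    result = {}
    for part in shuffle_and_sort(pairs, num_partitions):
        result.update(reduce_counts(part))
    return result
```

Calling `word_count("the cat\nthe dog")` walks two records through every stage. In a real cluster each partition would be handled by a separate reducer rather than by a loop in one process.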

Posted in Big Data.

Top 7 Should-Have Skills of A Data Scientist

With all the hype around data scientist as one of the most lucrative career options in recent times, it is only natural to be tempted to explore whether we have in ourselves what it takes to become a successful data scientist. As a matter of fact, I have very frequently come across the question of what it would take to become a data scientist. This question has been addressed numerous times in many articles. However, I wanted to present a fresh perspective based on the grilling and rigorous journey of data science that I went through over the last year or so. Based out …

Posted in Big Data.

Machine Learning – How to Debug Learning Algorithm for Regression Model

This article presents some of the key reasons for large prediction error when working with regression models, and what one could do to reduce it. The techniques mentioned below could be used for both linear and logistic regression models. As a matter of fact, the arguments below could also be used to debug an artificial neural network; in place of features, what is considered is the number of hidden layers and units. Please feel free to comment or suggest if I missed one or more important points. Also, sorry for the typos. Following are the key points described later in this article: Key Reasons for Larger Prediction Error, Key Techniques to …

Posted in Big Data.

Machine Learning – How to Diagnose Underfitting/Overfitting of Learning Algorithm

This article presents a technique that could be used to identify whether a learning algorithm is suffering from a high bias (under-fitting) or high variance (over-fitting) problem. Please feel free to comment or suggest if I missed one or more important points. Also, sorry for the typos. Following are the key problems related to learning algorithms that are described later in this article: Under-fitting Problem, Over-fitting Problem. Diagnose Under-fitting & Over-fitting Problem of Learning Algorithm: The challenge is to identify whether the learning algorithm has one of the following problems: High bias or under-fitting: At times, our model is represented using a polynomial equation of relatively lower degree, although a higher degree of …
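The excerpt is cut off before the diagnostic itself is described. One widely used heuristic, assumed here and not confirmed to be the article's method, compares the error on the training set with the error on a held-out validation set. The function name `diagnose_fit` and both thresholds are hypothetical and would need tuning for a real problem.

```python
def diagnose_fit(train_error, validation_error, acceptable_error, gap_tolerance):
    """Classify a model from its training and validation errors.

    Assumed heuristic (not taken from the article):
    - training error above the acceptable level suggests high bias;
    - acceptable training error with a much larger validation error
      suggests high variance.
    """
    if train_error > acceptable_error:
        return "high bias (under-fitting)"
    if validation_error - train_error > gap_tolerance:
        return "high variance (over-fitting)"
    return "no obvious bias/variance problem"
```

For instance, a model with a training error of 0.02 and a validation error of 0.25 fits the training data well but generalizes poorly, which this sketch reports as high variance when `acceptable_error` is 0.10 and `gap_tolerance` is 0.05.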

Posted in Big Data.

Machine Learning – 7 Steps to Train a Neural Network

This article presents some of the key steps required to train a neural network. Please feel free to comment or suggest if I missed one or more important points. Also, sorry for the typos. Key Steps for Training a Neural Network: Following are 7 key steps for training a neural network. Pick a neural network architecture. This implies that you will primarily be considering the connectivity patterns of the neural network, including some of the following aspects: Number of input nodes: the number of input nodes is determined by the number of features. Number of hidden layers: The default is to use a single hidden …

Posted in Big Data.

Big Data – Free Hadoop Online Training Course from MapR

This article provides quick information on the free, on-demand Hadoop online training that was announced yesterday by MapR Technologies, the Hadoop distribution specialist. I took the Hadoop Essentials course and I must say that I liked the training session. The downside of these training sessions is that you soon hit MapR-related technologies in relation to MapReduce, HBase, and HDFS. That said, it's worth giving a shot. Please feel free to comment or suggest if I missed one or more important points. Also, sorry for the typos. Training Courses for Hadoop Developer, Hadoop Administrator & Data Analyst: The training includes topics related to a range of Hadoop technologies for …

Posted in Big Data, Career Planning.

Data Science – Top 5 Videos to Get Started with Neural Networks

This article lists some good YouTube videos that I found useful for getting started with understanding how the brain works and what neural networks are. Note that I needed to do this because I wanted to get started with machine learning and neural network algorithms. In order to do that effectively, I needed to understand what neural networks are, and the videos below helped me get started within an hour. Please feel free to suggest other great videos which I may have missed. Sorry for the typos. From Neurons to Networks: I would rate it as one of the best videos I saw on how the human brain works. MUST watch!!! …

Posted in Big Data.

Data Science – Common Exploratory R Commands for Classification Problems

This article presents common exploratory R commands that could be used during the data preparation stage when solving classification problems. I found them being used when going through the KNN or naive Bayes algorithms. I know that there may be more to the list below. I would love to hear those additional commands from you. Please feel free to comment or suggest if I missed one or more important points. Also, sorry for the typos. In the set of commands listed below, a data frame, messages_text, is used, which is a set of text data loaded using the read.table command, such as the following: messages_text <- read.table( file.choose(), sep="\t", …

Posted in Big Data.

Learn R – When to use Histogram, Scatterplot & Boxplot – Code Example

This article presents some facts on when to use which kind of plot, with code examples and plots, when working with the R programming language. Please feel free to comment or suggest if I missed one or more important points. Also, sorry for the typos. Following are the key plots described later in this article: Histogram, Scatterplot, Boxplot. Following is the description of the above-mentioned plots along with code examples based on the base R package. Note that each of these plots could be done using different commands when using the ggplot2 package. Histogram: A histogram is one of the best forms of visualization when working with a single continuous variable. It plots the relative …

Posted in Big Data.

Big Data – Top 6 Frameworks Required to Get Started

This article presents the top 6 software frameworks (or tools) to get started with Big Data POC projects. It may be of interest to those who are beginning with Big Data and want to understand the tools/frameworks required to get started with their Big Data POC projects. The article presents only the bare minimum set of frameworks required to get started. I am sure there could be more to this list. However, my objective is to cover only the minimum set. Please feel free to comment or suggest if I missed one or more important points. Also, sorry for the typos. Following are key functional areas in Big Data …

Posted in Big Data.

How to Start a Big Data Practice

This article presents key aspects of starting up a Big Data practice in your organization. I have recently started working in this area and this blog is the result of my research. Hope you find it useful. Please feel free to comment or suggest if I missed one or more important points. Also, sorry for the typos. Big Data Center of Excellence (COE): It may be a good idea to plan around setting up a Big Data Center of Excellence (COE) whose main objective would be to take a holistic approach towards the following two key aspects of Big Data from different perspectives, such as setting up a team, evaluating tools & frameworks, …

Posted in Big Data.