# Category Archives: Big Data

## Data Science – Key Algebra Topics to Master

This article covers some of the key topics in algebra that one may need to brush up on or master in order to understand different aspects of machine learning algorithms. If you are gearing up to become a data scientist, the topics below may be worth your attention, as I eventually had to brush up on them while learning different machine learning algorithms. The concepts listed below, especially those related to linear algebra, touch almost all machine learning algorithms. Please feel free to comment or suggest if I missed one or more important points. Also, sorry for the typos. Following are the key high-level topics which are …

## Data Science – Key Probability & Statistics Topics to Master

This article presents a list of key probability & statistics topics that one may need to master when aiming to become a data scientist. It lists topics that have worked for me so far when working on data science problems. One could also see the list below as a table of contents for key probability and statistics topics for data science. However, I do believe that there are some topics that I might not have mentioned. Please feel free to comment or suggest if I missed one or more important points. Also, sorry for the typos.

### Probability & Statistics Topics

Following are some of the …

## Learn R – How to Get Random Training and Test Data Set

This article presents sample source code that could be used to extract random training and test data sets from a data frame using the R programming language. The R code below could prove very handy while you are working to create a model using any machine learning algorithm. Please feel free to comment or suggest if I missed one or more important points. Also, sorry for the typos.

    # Read the data from a file. The command below assumes that the working
    # directory has already been set. One could set the working directory using
    # the setwd() command.
    sample_df <- read.csv("glass.data", header=TRUE, stringsAsFactors=FALSE)
    # get a vector comprising all indices …
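For readers who work in Python, the general idea of a random training/test split can be sketched as follows. This is a minimal sketch and not a translation of the R code: the function name `split_indices`, the 70/30 ratio, and the fixed seed are all assumptions made for illustration.

```python
import random


def split_indices(n_rows, train_fraction=0.7, seed=42):
    """Return (train_idx, test_idx), a random non-overlapping split of range(n_rows).

    The 0.7 default and the seed are illustrative assumptions, not values
    taken from the article.
    """
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    rng = random.Random(seed)      # local generator so the split is reproducible
    indices = list(range(n_rows))
    rng.shuffle(indices)           # put all row indices in random order
    n_train = round(n_rows * train_fraction)
    # The first n_train shuffled indices form the training set, the rest the test set.
    return sorted(indices[:n_train]), sorted(indices[n_train:])


train_idx, test_idx = split_indices(10)
```

The two index lists can then be used to select rows from whatever table structure holds the data, so that no row appears in both sets.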

## Machine Learning – Bookmarks for Great Tutorials, Books & Videos

This article presents quick bookmarks to some good machine learning web pages, including tutorials, documents and videos. Please feel free to comment or suggest if you know of further good bookmarks. I shall be adding more bookmarks in time to come. Also, sorry for the typos. Following are the key bookmarks:

- List of Tutorial Pages on Different Machine Learning Topics: You will surely want to bookmark this page, as it consists of some really cool links covering different topics in machine learning.
- List of Machine Learning Books: Those looking for machine learning books to get started would want to bookmark this page, which consists of a list of some great books recommended …

## Machine Learning – When to Use Logistic Regression vs. SVM

This article presents guidelines for determining whether to use logistic regression or SVM with kernels when working on a classification problem. I gathered these guidelines from one of Andrew Ng's videos on SVM from his machine learning course on Coursera.org. As I wanted a place to look up quickly in future which algorithm to use, logistic regression or SVM, when working on a classification problem, I decided to blog it here. Please feel free to comment or suggest if I missed one or more important points. Also, sorry for the typos.

### Key Criteria for Using Logistic Regression vs …

## Machine Learning – When to Use Linear vs Gaussian Kernel with SVM

This article presents guidelines that could be used to decide whether to use a linear kernel or a Gaussian kernel when working with a Support Vector Machine (SVM). Please feel free to comment or suggest if I missed one or more important points. Also, sorry for the typos. Following are the key points described later in this article:

- When to Use Linear Kernel
- When to Use Gaussian Kernel

### When to Use Linear Kernel

In case there is a large number of features and a comparatively smaller number of training examples, one would want to use the linear kernel. As a matter of fact, it can also be called SVM with no kernel. One may …
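To make the two options concrete, here is a minimal Python sketch of the kernels as plain similarity functions. It uses the standard textbook definitions (dot product for the linear kernel, and exp(-||x - z||² / (2σ²)) for the Gaussian kernel); the function names and the default `sigma=1.0` are assumptions made for illustration and are not tied to any particular SVM library.

```python
import math


def linear_kernel(x, z):
    """Linear kernel: the plain dot product of two feature vectors."""
    return sum(xi * zi for xi, zi in zip(x, z))


def gaussian_kernel(x, z, sigma=1.0):
    """Gaussian (RBF) kernel: similarity decays with squared distance.

    sigma controls how quickly similarity falls off; 1.0 is an
    illustrative default, not a recommended setting.
    """
    sq_dist = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


x = [1.0, 2.0, 3.0]
identical = gaussian_kernel(x, x)             # same point, so distance is zero
far_apart = gaussian_kernel([0.0, 0.0], [3.0, 4.0])
dot = linear_kernel([1, 2], [3, 4])
```

The Gaussian kernel returns its maximum for identical points and shrinks toward zero as points move apart, which is what lets it model non-linear decision boundaries; the linear kernel adds no such transformation.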

## Top 7 Should-Have Skills of A Data Scientist

With all the hype around data scientist as one of the most lucrative career options in recent times, it is but natural that we may be tempted to explore whether we have in ourselves what it takes to become a successful data scientist. As a matter of fact, I have very frequently come across the question of what it would take to become a data scientist. Well, this question has been addressed numerous times in many articles. However, I wanted to present a fresh perspective based on the grilling and rigorous journey of data science that I went through in the last year or so. Based out …

## 8 Key Steps to Follow When Solving A Machine Learning Problem

This article presents some of the key steps one could take in order to create the most effective model to solve a given machine learning problem, using different machine learning algorithms. Please feel free to comment or suggest if I missed one or more important points. Also, sorry for the typos.

### 8 Key Steps for Solving A Machine Learning Problem

1. Gather the data set: This is one of the most important steps, where the objective is to gather as large a volume of data as possible. Given that features have been selected appropriately, a large data set helps to minimize the training data set error and also enables cross-validation and training data set error …

## Machine Learning – How to Debug Learning Algorithm for Regression Model

This article covers some of the key reasons for larger prediction error when working with regression models, and what one could do to reduce it. The techniques mentioned below could be used for both linear and logistic regression models. As a matter of fact, the arguments below could also be used to debug an artificial neural network; in place of features, what is considered is the number of hidden layers and units. Please feel free to comment or suggest if I missed one or more important points. Also, sorry for the typos. Following are the key points described later in this article:

- Key Reasons for Larger Prediction Error
- Key Techniques to …

## Machine Learning – How to Diagnose Underfitting/Overfitting of Learning Algorithm

This article presents a technique that could be used to identify whether a learning algorithm is suffering from a high bias (under-fitting) or high variance (over-fitting) problem. Please feel free to comment or suggest if I missed one or more important points. Also, sorry for the typos. Following are the key problems related to learning algorithms that are described later in this article:

- Under-fitting Problem
- Over-fitting Problem

### Diagnose Under-fitting & Over-fitting Problem of Learning Algorithm

The challenge is to identify whether the learning algorithm has one of the following:

- High bias or under-fitting: At times, our model is represented using a polynomial equation of relatively lower degree, although a higher degree of …
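One common way to tell the two problems apart is to compare the training error with the cross-validation error: a high training error suggests high bias, while a low training error with a much higher cross-validation error suggests high variance. The Python sketch below illustrates that rule of thumb only. The function name `diagnose_fit` and both thresholds are assumptions chosen for illustration; sensible values depend on the problem and the error metric.

```python
def diagnose_fit(train_error, cv_error, acceptable_error=0.1, gap_tolerance=0.05):
    """Rough bias/variance diagnosis from training and cross-validation error.

    acceptable_error and gap_tolerance are hypothetical thresholds, not
    values from the article. The check is deliberately simple: it reports
    high bias first and does not flag the case where both problems occur.
    """
    if train_error > acceptable_error:
        # The model cannot even fit the data it was trained on.
        return "high bias (under-fitting)"
    if cv_error - train_error > gap_tolerance:
        # The model fits the training data but does not generalize.
        return "high variance (over-fitting)"
    return "no obvious bias/variance problem"


underfit = diagnose_fit(train_error=0.30, cv_error=0.32)
overfit = diagnose_fit(train_error=0.02, cv_error=0.25)
balanced = diagnose_fit(train_error=0.05, cv_error=0.07)
```

In practice the same comparison is often made visually, by plotting both errors against model complexity or training set size.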

## Machine Learning – 7 Steps to Train a Neural Network

This article presents some of the key steps required to train a neural network. Please feel free to comment or suggest if I missed one or more important points. Also, sorry for the typos.

### Key Steps for Training a Neural Network

Following are 7 key steps for training a neural network.

1. Pick a neural network architecture. This implies that you will be pondering primarily the connectivity patterns of the neural network, including some of the following aspects:
   - Number of input nodes: The way to identify the number of input nodes is to identify the number of features.
   - Number of hidden layers: The default is to use a single hidden …

## Data Science – 8 Steps to Multiple Regression Analysis

This article presents a list of steps and related details that one would want to follow when doing multiple regression analysis. Please feel free to comment or suggest if I missed one or more important points. Also, sorry for the typos. Following are the key points described later in this article:

- 8 Steps to Multiple Regression Analysis
- Techniques used in Multiple regression analysis

### 8 Steps to Multiple Regression Analysis

Following is a list of 8 steps that could be used to perform multiple regression analysis:

1. Identify a list of potential variables/features, both independent (predictor) and dependent (response).
2. Gather data on the variables.
3. Check the relationship between each predictor variable …

## Big Data – Top Education Resources from MIT

This article presents information on the Big Data initiative from MIT (Massachusetts Institute of Technology), including bookmarks to lecture notes for machine learning courses and MIT's machine learning video channel on YouTube. Please feel free to comment or suggest if I missed one or more important points. Also, sorry for the typos. Following are the key points described later in this article:

- MIT CSAIL Big Data Initiative
- Machine Learning Lecture Notes & Videos

### MIT CSAIL Big Data Initiative

MIT has a website dedicated to the Big Data initiative from MIT CSAIL (Computer Science and Artificial Intelligence Laboratory). The following pages are worth visiting to understand ongoing research and listen to or view talks …

## Weekly Roundup – Machine Learning & Statistics Bookmarks – 02 Feb 2015

This article presents links to some cool pages on machine learning & statistics that I thought worth sharing. Please feel free to comment or suggest any other web pages that you have found to be good. Sorry for the typos.

### Machine Learning & Statistics Bookmarks

- Andrew Ng: Anyone starting to learn machine learning is sure to come across a course, paper, or web page related to Andrew Ng, an Associate Professor at Stanford; Chief Scientist of Baidu; and Chairman and Co-Founder of Coursera. Some of the pages citing his work are the following: Courses, Publications, Research.
- Andrew W. Moore: Great set of tutorials by Andrew W. Moore, who is Dean of the School of Computer …

## Machine Learning – 9 Most Common Usecases for Higher Business Growth

This article presents some of the most common use cases of machine learning algorithms which have been found to impact business growth (in terms of revenue) positively. These use cases are most commonly seen with businesses that run some form of e-commerce site to support one or more aspects of their business. I have tried to provide information on which algorithm (or class of algorithms) could be used to come up with a solution for these use cases. Please feel free to comment or suggest if I missed one or more important points. Also, sorry for the typos. Following are different areas, at …

## Top 4 Machine Learning Usecases for Energy Forecasting

This article presents the top 4 machine learning use cases for energy forecasting. Please feel free to comment or suggest if I missed one or more important points. Also, sorry for the typos.

### Machine Learning Usecases for Energy Forecasting

Following are different use cases in relation to energy management where machine learning could be used for probabilistic energy forecasting. For those who are new to probabilistic forecasting, here is the definition from Wikipedia: Probabilistic forecasting summarises what is known, or opinions about, future events. In contrast to single-valued forecasts (such as forecasting that the maximum temperature at a given site on a given day will be 23 degrees Celsius or that the result …