Category Archives: Big Data

Dockers – How to Get Started with Cloudera

This article presents information and scripts that can be used to get started with Cloudera using Docker. Please feel free to comment or suggest if I missed one or more important points. Also, sorry for the typos. Following are the key points described later in this article: Docker machine configuration; Cloudera and Docker; testing the Cloudera installation; scripts to install and run Cloudera.

Docker Machine Configuration

To run Cloudera in a Docker container, one needs to make the following configuration changes to the Docker machine. Open Oracle VM VirtualBox Manager. Stop the default machine. Then, change the settings as shown below. Change the processor (core) setting to 2. Change the memory …

Posted in Big Data, DevOps, Dockers.

Hadoop Map-Reduce Explained with an Example

This article presents the key steps of a Hadoop MapReduce job using a word count example. Please feel free to comment or suggest if I missed one or more important points. Also, sorry for the typos. Following are the key steps of how Hadoop MapReduce works in a word count problem: Input is fed to a program, say a RecordReader, that reads data line-by-line or record-by-record. The mapping process starts, which includes the following steps: Combining: combines the data (word) with its count, such as 1. Partitioning: creates one partition for each word occurrence. Shuffling: moves words to the right partition. Sorting: sorts each partition by word. The last step is Reducing, which comes up with …
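The steps above can be sketched in plain Python. This is only an in-memory illustration of the word count flow, not Hadoop code: the function names (`map_phase`, `shuffle_and_sort`, `reduce_phase`, `word_count`) and the toy partitioning rule are assumptions made for this example, and a real job would run these phases on separate machines.

```python
from collections import defaultdict


def map_phase(lines):
    """Emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.lower().split():
            yield (word, 1)


def shuffle_and_sort(pairs, num_partitions=2):
    """Route each pair to a partition, group by word, and sort each partition."""
    partitions = [defaultdict(list) for _ in range(num_partitions)]
    for word, count in pairs:
        # Toy, deterministic stand-in for a hash partitioner (an assumption).
        index = sum(ord(ch) for ch in word) % num_partitions
        partitions[index][word].append(count)
    # Sorting the items of each partition orders them by word.
    return [sorted(partition.items()) for partition in partitions]


def reduce_phase(partitions):
    """Sum the counts collected for each word."""
    totals = {}
    for partition in partitions:
        for word, counts in partition:
            totals[word] = sum(counts)
    return totals


def word_count(lines, num_partitions=2):
    """Run the map, shuffle/sort, and reduce steps in sequence."""
    pairs = map_phase(lines)
    return reduce_phase(shuffle_and_sort(pairs, num_partitions))


if __name__ == "__main__":
    print(word_count(["the quick brown fox", "the lazy dog", "The fox"]))
```

Because every occurrence of a word lands in the same partition, each reducer sees all the counts for the words it owns and can sum them independently of the other partitions.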

Posted in Big Data.

Big Data – How Data is Retrieved and Written from/to HDFS?

This blog presents my notes on how data is read from and written to HDFS. Please feel free to suggest if it is done otherwise. Following are the steps by which clients retrieve data from HDFS: The client asks the NameNode for a file/data block. The NameNode returns information (ID) about the DataNodes where the file/data blocks are located. The client retrieves the data directly from the DataNode. Following are the steps by which data is written to HDFS: The client tells the NameNode that it wants to write one or more data blocks pertaining to a file. The NameNode returns information about the DataNodes to which these data blocks need to be written. The client writes each data block to the DataNodes suggested. The …
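The read and write steps above can be illustrated with a small in-memory model. This is a sketch under stated assumptions, not the HDFS client API: the classes `NameNode`, `DataNode`, and `Client`, the round-robin placement, and the tiny block size are all hypothetical, and replication is left out entirely.

```python
class DataNode:
    """Stores the actual block contents."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.blocks = {}  # block_id -> bytes

    def write_block(self, block_id, data):
        self.blocks[block_id] = data

    def read_block(self, block_id):
        return self.blocks[block_id]


class NameNode:
    """Holds only metadata: which blocks make up a file and where they live."""

    def __init__(self, datanode_ids):
        self.datanode_ids = list(datanode_ids)
        self.block_map = {}  # filename -> list of (block_id, node_id)
        self._next = 0

    def allocate_block(self, filename):
        """Pick a DataNode for the next block of a file (round-robin here)."""
        locations = self.block_map.setdefault(filename, [])
        block_id = f"{filename}#blk{len(locations)}"
        node_id = self.datanode_ids[self._next % len(self.datanode_ids)]
        self._next += 1
        locations.append((block_id, node_id))
        return block_id, node_id

    def locate_blocks(self, filename):
        """Return the (block_id, node_id) pairs for a file, in order."""
        return list(self.block_map[filename])


class Client:
    """Asks the NameNode where blocks live, then talks to DataNodes directly."""

    def __init__(self, namenode, datanodes, block_size=4):
        self.namenode = namenode
        self.datanodes = {node.node_id: node for node in datanodes}
        self.block_size = block_size

    def write(self, filename, data):
        for start in range(0, len(data), self.block_size):
            block_id, node_id = self.namenode.allocate_block(filename)
            chunk = data[start:start + self.block_size]
            self.datanodes[node_id].write_block(block_id, chunk)

    def read(self, filename):
        locations = self.namenode.locate_blocks(filename)
        return b"".join(
            self.datanodes[node_id].read_block(block_id)
            for block_id, node_id in locations
        )
```

The point of the sketch is the division of labour: file contents never pass through the `NameNode` object, which only answers "where" questions.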

Posted in Big Data.

Hadoop Map-Reduce Described With Example

I came across a great page describing Hadoop MapReduce and the HDFS architecture. The page presents some of the following details: HDFS responsibilities and execution flows; key characteristics of the MapReduce lifecycle; an example involving a web crawler and a Hadoop MapReduce setup.

Posted in Big Data.

Learn R – How to Get Data Frames Columns as Vectors

This article presents different ways in which one could get a data frame column as a vector. Please feel free to comment or suggest if I missed one or more important points. Also, sorry for the typos.

4 Techniques to Get Data Frame Column as Vector

In the examples below, the diamonds dataset from the ggplot2 package is used. This is what the diamonds dataset looks like: Following are four different techniques with which one could retrieve a data frame column as a vector.

# In the data set shown above, carat represents a column name and hence, [['carat']]
carat1 <- diamonds[['carat']]
# In the data set shown above, carat represents 1st column …

Posted in Big Data.

Top 8 Data Science Training Institutes in India

This article lists the top 8 data science/analytics training institutes in India. Some of them, including INSOFE, provide only classroom coaching, while others, such as Edureka, provide online training. Please feel free to comment or suggest if I missed one or more important points. Also, sorry for the typos. Following is the list of training institutes which are detailed later in this article: INSOFE; Jigsaw Academy; UReach Solutions; AnalytixLabs; Edureka; SpringPeople; SimpliLearn; EduPristine.

INSOFE

International School of Engineering was launched in 2011 with an aim to transform the applied engineering education space in India. Their current focus area is Big Data Analytics / Data Science. Out of all of …

Posted in Big Data, Career Planning.

Top 5 Usecases of Solr to Power Your Web & Mobile Search

This article presents the top 5 use cases for using Solr to power your web and mobile search. Note that for mobile search requirements, Solr exposes APIs that can be used to retrieve data from the Solr index server and serve it to the mobile client. The article also presents a classification of websites that use Solr to fulfill their search requirements. Please feel free to comment or suggest if I missed one or more important points. Also, sorry for the typos. Following are the key points described later in this article: top 5 use cases for Solr search; different classes of websites using Solr to power search engines.

Top 5 Usecases for Solr Search

Search Engine: Many …

Posted in Big Data.

Dummies Notes on How Distributed Computing Works using Hadoop

This article intends to present dummies notes on how distributed computing works using Hadoop. As Hadoop is inspired by the Google GFS/MapReduce/BigTable papers, I have tried to refer to GFS/MapReduce/BigTable in this article wherever possible. One must note that the distributed computing paradigm has become mainstream with the advent of large-scale Big Data projects in several companies. Please feel free to shout if you find discrepancies in my understanding and help me correct the mistakes. Simply speaking, distributed computing refers to the computing paradigm in which processing happens on multiple machines that hold the data, and the results are then aggregated appropriately to produce the final result. In traditional …

Posted in Big Data, Dummies.

60 Most Commonly Used R Packages in R Programming Language

This article presents a comprehensive list of the 60 most commonly used R packages, which help achieve some of the following objectives when working on data science/analytics projects: predictive modeling; data handling/manipulation; visualization; integration; Hadoop; GUI; database.

60 Most Commonly Used R Packages

Following is the list of 60 or so R packages which help take care of different aspects of creating predictive models: Predictive Modeling: packages which help in working with various predictive models (linear/multivariate/logistic regression models, SVM, neural networks etc.). caret: Stands for Classification And REgression Training. Provides a set of functions which could be used to do some of the following when …

Posted in Big Data.

Data Science – Who could become a Data Scientist?

This article presents information about the different classes of IT and non-IT professionals who could take different free data science courses (as mentioned) and get on the path to becoming a data scientist. Please feel free to comment or suggest if I missed one or more important points. Also, sorry for the typos. Following are the different classes of IT/non-IT professionals addressed later in this article: software development stakeholders working on non-analytics projects; data warehouse/BI developers; Big Data developers; statisticians; senior management executives; non-software professionals.

Could I become a Data Scientist?

Anyone matching the following criteria could become a data scientist. One is decent with Mathematics & Statistics …

Posted in Big Data.

Top 10 Solution Approaches for Supervised Learning Problems

This article presents the top 10 solution approaches that could be used to solve supervised learning problems. For those unaware of what a supervised learning problem is, here is the definition of supervised learning from Wikipedia: Supervised learning is the machine learning task of inferring a function from labeled training data. The training data consist of a set of training examples. In supervised learning, each example is a pair consisting of an input object (typically a vector) and a desired output value (also called the supervisory signal). A supervised learning algorithm analyzes the training data and produces an inferred function, which can be used for mapping new examples. Following are two different kinds of supervised …
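To make the definition above concrete, here is a minimal sketch of "inferring a function from labeled training data" using a one-variable least-squares line fit. It is only one simple instance of supervised learning, chosen for illustration; the helper names `fit_line` and `predict` are assumptions made for this example and are not taken from the article.

```python
def fit_line(xs, ys):
    """Infer slope and intercept from labeled (input, output) pairs
    by ordinary least squares."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) training pairs")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all inputs are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply the inferred function to a new, unseen input."""
    slope, intercept = model
    return slope * x + intercept


if __name__ == "__main__":
    # Training examples: each input is paired with its desired output.
    inputs = [0, 1, 2, 3]
    outputs = [1, 3, 5, 7]
    model = fit_line(inputs, outputs)
    print(predict(model, 10))
```

Here the training pairs play the role of the labeled examples, `fit_line` is the learning algorithm, and the returned `(slope, intercept)` pair is the inferred function that `predict` applies to new inputs.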

Posted in Big Data.

Document Search Architecture to Search Millions of Documents

This article presents different document search architecture models with which one could build a search architecture that searches through hundreds of millions of documents in milliseconds while returning up-to-date results. If you are planning to create a document search infrastructure that can search millions of documents and return results in less than a second, go ahead and explore the different models and adopt the one that suits your needs at this stage. Note that the models given below could scale to multiple data centers. In this blog, we shall try and examine different architecture models that could achieve the search timing of less than a …

Posted in Big Data.

Learn R or Python for Becoming Data Scientist?

This article presents an analysis of whether one should learn the R or the Python programming language to create one or more predictive models using different machine learning algorithms. Both languages, R and Python, are doing equally well and are sought after by developers and by the companies hiring such developers. So, you could choose either one of these languages. However, the majority has been found to favour Python for ease of learning and greater community support.

Data Scientist with expertise in R

The following indeed.com plot shows the job trends for the search term “Data Scientist R”. It clearly indicates the trend such as …

Posted in Big Data.

Machine Learning – Top 16 Learning Resources on Statistics

This article presents some of the top learning resources (webpages, videos etc.) on my frequent visit list. Please feel free to comment or suggest if I missed one or more important points. Also, sorry for the typos. Following are the key categories of webpages/videos that are expanded later in this article: websites; Quora; YouTube videos; Coursera courses; Khan Academy.

Top 16 Learning Resources on Statistics

Following is the list of URLs for these learning resources:

Websites on Statistics: Stattrek.com; Elementary Statistics with R; StatsDirect.com; Usable Stats.

Quora.com Statistics Channel: Probability & Statistics; Statistics (Academic Discipline); Bayesian Inference.

Youtube Videos Playlists on Statistics: Brandon Foltz; StatisticsFun; JBStatistics; Quantitative Specialists.

Coursera Courses …

Posted in Big Data.

Machine Learning Research in Top 10 US Universities

This article presents information about machine learning departments and related research projects in the top 10 US universities (as per the USNews ranking). I put it together for my own quick reference and thought to share it with you for the same purpose. Please feel free to comment or suggest if I missed one or more important points. Also, sorry for the typos. Following are the top 10 universities covered later in this article: Princeton University; Harvard University; Yale University; Columbia University; Stanford University; University of Chicago; MIT; Duke University; University of Pennsylvania; California Institute of Technology.

Machine Learning @ Top 10 US Universities

Princeton University: Machine Learning Department at Princeton University …

Posted in Big Data.

Data Science – 175 Probability & Statistics Interview Questions

This article presents the URLs and short descriptions of around 175 probability and statistics objective questions, which could prove useful for those planning to attend one or more data scientist interviews. I created these tests/quizzes some time back, while learning probability and statistics, when I found various concepts interesting enough to convert into quizzes for my future reference. As probability and statistics are key to data science, it may be worth spending some time on these tests to check your understanding. You may also use this for your future reference. These questions could also be used for checking your concepts …

Posted in Big Data, Career Planning, Interview questions.