Learn R – 5 Techniques to Create Empty Data Frames with Column Names
This article presents techniques for creating an empty data frame with column names. Please feel free to comment or suggest if I have missed one or more important points. Also, sorry for the typos. 5 Techniques to Create Empty Data Frames: In each of the examples below, the data frame is created with three columns, namely 'name', 'rating', and 'relyear', representing movie names, ratings, and release years.
# data.frame with empty strings (note: this yields one row of empty strings rather than zero rows)
df1 <- data.frame(name="", rating="", relyear="", stringsAsFactors=FALSE)
# data.frame with zero-length character vectors
df2 <- data.frame(name=character(), rating=character(), relyear=character(), stringsAsFactors=FALSE)
# read.table used to create an empty data frame
df3 <- read.table(text = "", colClasses = c("character", …
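As a rough, self-contained sketch of the idea in this excerpt, the snippet below builds the same three columns in two ways and checks the resulting dimensions. The object names (`movies_a`, `movies_b`, `movies_c`) are illustrative, and the `colClasses` and `col.names` values passed to `read.table` are assumptions, since the excerpt is cut off before showing the full call.

```r
# Approach A: zero-length typed vectors give a data frame with 0 rows.
movies_a <- data.frame(
  name = character(),
  rating = character(),
  relyear = character(),
  stringsAsFactors = FALSE
)

# Approach B: read.table on empty text. Supplying col.names is what lets
# read.table accept input with no lines (assumed arguments, see lead-in).
movies_b <- read.table(
  text = "",
  colClasses = c("character", "character", "character"),
  col.names = c("name", "rating", "relyear")
)

# For contrast: empty strings produce ONE row of "" values, not an empty frame.
movies_c <- data.frame(
  name = "", rating = "", relyear = "",
  stringsAsFactors = FALSE
)

print(dim(movies_a))
print(dim(movies_b))
print(dim(movies_c))
```

Checking `nrow()` right after construction is a cheap way to confirm that a frame really is empty before rows are appended to it.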
Learn R – How to Get Data Frame Columns as Vectors
This article presents different ways to get a data frame column as a vector. 4 Techniques to Get Data Frame Column as Vector: In the examples below, the diamonds dataset from the ggplot2 package is used. Following are four different techniques for retrieving a data frame column as a vector.
# In the diamonds data set, carat is a column name and hence, [['carat']]
carat1 <- diamonds[['carat']]
# In the diamonds data set, carat represents 1st column …
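The sketch below shows four common base R ways to pull a column out as a vector. They are not necessarily the four techniques the article goes on to list, since the excerpt is truncated. To stay runnable without ggplot2, it uses the built-in `mtcars` data set as a stand-in for `diamonds`; the variable names are illustrative.

```r
# Stand-in data: mtcars ships with base R (the article uses ggplot2's diamonds).
mpg_by_name     <- mtcars[["mpg"]]   # double brackets with the column name
mpg_by_position <- mtcars[[1]]       # double brackets with the position (mpg is column 1)
mpg_by_dollar   <- mtcars$mpg        # dollar-sign accessor
mpg_by_matrix   <- mtcars[, "mpg"]   # matrix-style indexing; a single column is dropped to a vector

# Single brackets with only a name keep the data frame structure instead.
mpg_as_frame <- mtcars["mpg"]

print(is.vector(mpg_by_name))
print(is.data.frame(mpg_as_frame))
```

The last line is the usual source of confusion: `mtcars["mpg"]` returns a one-column data frame, not a vector.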
Dummies Notes – How SAML-based SSO Authentication Works?
This article presents dummies notes on how one could implement SSO using SAML. Following are the key points described later in this article: What is SAML? How does SSO authentication happen using SAML? What are the key components of SSO design, in general? What is SAML? For those unaware of what SAML is, here is the definition from the Wikipedia page on SAML: Security Assertion Markup Language (SAML, pronounced sam-el[1]) is an XML-based, open-standard data format for exchanging authentication and authorization data between parties, in particular, between …
Top 8 Data Science Training Institutes in India
This article lists the top 8 data science/analytics training institutes in India. Some of them, including INSOFE, provide only classroom coaching, while others such as Edureka provide online training. Following is the list of training institutes which are detailed later in this article: INSOFE, Jigsaw Academy, UReach Solutions, AnalytixLabs, Edureka, SpringPeople, SimpliLearn, EduPristine. INSOFE: International School of Engineering was launched in 2011 with the aim of transforming the applied engineering education space in India. Their current focus area is Big Data Analytics / Data Science. Out of all of …
Top 5 Use Cases of Solr to Power Your Web & Mobile Search
This article presents the top 5 use cases for using Solr to power your web and mobile search. Note that for mobile search requirements, Solr exposes APIs that can be used to retrieve data from the Solr index server and serve it to the mobile client. It also presents a classification of websites that use Solr to fulfill their search requirements. Following are the key points described later in this article: Top 5 Use Cases for Solr Search; Different Classes of Websites Using Solr to Power Search Engines. Top 5 Use Cases for Solr Search: Search Engine: Many …
Dummies Notes – What is B-Tree and Why Use Them?
This article presents quick notes on what the B-Tree data structure is and why to use it. I found this page (Memory locality & the magic of B-Trees!) on B-Trees a very interesting read and would recommend that everyone go through it to quickly understand the nuances of B-Trees. A B-Tree could be defined as a linked sorted distributed range array with a predefined sub-array size which allows searches, sequential access, insertions, and deletions in logarithmic time. Simply speaking, a B-Tree is a generalization of a binary search tree. One may …
Key Training Topics for Hadoop Developer
This article presents key topics that one would want to learn in order to become a Hadoop developer. One may also check these topics against the topics provided by a training vendor. Following are the key areas to focus on for learning/training, which are described later in this article: Java Essentials, Hadoop Essentials. Java Essentials: As Hadoop is based on the Java programming language, one would want at least intermediate-level expertise in Java to do well with Hadoop development. Following are some of the key concepts that one would want to …
Dummies Notes on How Distributed Computing Works using Hadoop
This article presents dummies notes on how distributed computing works using Hadoop. As Hadoop is inspired by the Google GFS/Map-Reduce/BigTable papers, I have tried to refer to GFS/Map-Reduce/BigTable in this article wherever appropriate. Note that the distributed computing paradigm has become mainstream with the advent of large-scale Big Data project implementations in several companies. Please feel free to shout if you find discrepancies in my understanding and help me correct the mistakes. Simply speaking, distributed computing refers to the computing paradigm in which processing happens on multiple different boxes, each holding part of the data, and the results are then aggregated appropriately to display the final result. In traditional …
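The "process on each box, then aggregate" idea can be imitated on a single machine. The following is only a toy sketch in plain R, not Hadoop code: the `partitions` list, the word-count task, and the helper names `count_words` and `merge_counts` are all hypothetical, chosen to mirror a map step followed by a reduce step.

```r
# Hypothetical input: each list element stands for the data held on one box.
partitions <- list(
  c("big data", "data science"),
  c("hadoop data"),
  c("big hadoop")
)

# Map step: each partition counts its own words, independently of the others.
count_words <- function(lines) {
  words <- unlist(strsplit(lines, " ", fixed = TRUE))
  counts <- table(words)
  setNames(as.numeric(counts), names(counts))
}
partial_counts <- lapply(partitions, count_words)

# Reduce step: merge two partial results by summing counts per word.
# A word missing from one side indexes as NA, which na.rm = TRUE ignores.
merge_counts <- function(a, b) {
  keys <- union(names(a), names(b))
  vapply(keys, function(k) sum(a[k], b[k], na.rm = TRUE), numeric(1))
}
total_counts <- Reduce(merge_counts, partial_counts)

print(total_counts)
```

In a real cluster the map step would run on the machines that already hold each partition, so only the small partial results travel over the network.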
60 Most Commonly Used R Packages in R Programming Language
This article presents a comprehensive list of the 60 most commonly used R packages, which help achieve some of the following objectives when working on data science/analytics projects: predictive modeling, data handling/manipulation, visualization, integration, Hadoop, GUI, database. 60 Most Commonly Used R Packages: Following is the list of 60 or so R packages which help take care of different aspects of creating predictive models: Predictive Modeling: Represents packages which help in working with various predictive models (linear/multivariate/logistic regression models, SVM, neural network, etc.). caret: Stands for Classification And REgression Training. Provides a set of functions which could be used to do some of the following when …
API Tips – How to Write API Documentation
This article presents tips on how to write documentation for APIs that will be published to developers, both internal and external. It touches upon some of the important points that need to be included in API documentation so that developers find the APIs easy to work with. 3 Areas to Cover in API Documentation: A landing page which provides details such as high-level information about the APIs, links to API pages, release information, and changelog details. A summary page providing an overview of the APIs in general, list of API …
Quick Notes on What is CAP Theorem?
This article briefly explains what the CAP theorem is and provides appropriate examples. I have come across many candidates appearing for architect interviews who failed to answer questions such as the following: What is the CAP theorem? Which two of the following does an RDBMS such as Oracle achieve: Consistency, Availability, Partition Tolerance? Which two of the following does a NoSQL datastore such as HBase tend to achieve: Consistency, Availability, Partition Tolerance? The article below addresses some of the above questions. Following points are discussed later in this article: What is …
Data Science – Who could become a Data Scientist?
This article presents information on the different classes of IT and non-IT professionals who could take the free data science courses (as mentioned) and get on the path to becoming a data scientist. Following are the different classes of IT/non-IT professionals addressed later in this article: Software Development Stakeholders working on Non-analytics projects, Datawarehouse/BI Developers, Big Data Developers, Statisticians, Senior Management Executive, Non-Software Professionals. Could I become a Data Scientist? Anyone matching the following criteria could become a data scientist. One is decent with Mathematics & Statistics …
Top 10 Solution Approaches for Supervised Learning Problems
This article presents the top 10 solution approaches that could be used to solve supervised learning problems. For those unaware of what a supervised learning problem is, here is the definition from Wikipedia: Supervised learning is the machine learning task of inferring a function from labeled training data.[1] The training data consist of a set of training examples. In supervised learning, each example is a pair consisting of an input object (typically a vector) and a desired output value (also called the supervisory signal). A supervised learning algorithm analyzes the training data and produces an inferred function, which can be used for mapping new examples. Following are two different kinds of supervised …
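To make the quoted definition concrete, here is a minimal sketch using linear regression from base R's `stats` package. It is one possible illustration and is not taken from the article's list of approaches. The training data are made up (generated from y = 2x + 1), and the names `train`, `model`, and `new_examples` are illustrative.

```r
# Hypothetical labeled training data: each row pairs an input x
# with its desired output y (here y = 2x + 1 exactly).
train <- data.frame(
  x = c(1, 2, 3, 4, 5),
  y = c(3, 5, 7, 9, 11)
)

# "Inferring a function from labeled training data": fit a linear model.
model <- lm(y ~ x, data = train)

# "Mapping new examples": apply the inferred function to unseen inputs.
new_examples <- data.frame(x = c(6, 10))
predicted <- predict(model, newdata = new_examples)

print(coef(model))
print(predicted)
```

Because the output here is numeric, this is a regression problem; with a categorical output the same fit-then-predict pattern would be a classification problem.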
Document Search Architecture to Search Millions of Documents
This article presents different document search architecture models with which one could build a search architecture that searches through hundreds of millions of documents in milliseconds while returning up-to-date, fresh results. If you are planning to create a document search infrastructure which could search millions of documents and show results in less than a second, go ahead and explore the different models and adopt the one that suits your needs at this stage. Note that the models given below could scale to multiple data centers. In this blog, we shall examine different architecture models that could achieve a search time of less than a …
Top 10 Simpler Interview Questions, Architects Find Difficult to Answer
This article presents my list of the top 10 interview questions which I see people appearing for technical architect positions find difficult to answer. Although these questions seem simple and subjective, I found that candidates have difficulty answering them. Do check the list below and see if you can crack all of them. Please feel free to comment or suggest other questions you would like me to include. Sorry for the typos. Top 10 Interview Questions Technical Architects Find Difficult to Answer: Architecture & Design: The questions below are intended to test the candidate's understanding of architectural frameworks and their ability to lay down system architecture/design. What are 3-4 most common …
Learn R or Python for Becoming Data Scientist?
This article presents an analysis of whether one should learn the R or Python programming language to create predictive models using different machine learning algorithms. Note that both languages, R and Python, are doing equally well and are sought after by developers and the companies hiring such developers. So, you could choose either one. However, the majority has been found to favour Python for ease of learning and greater community support. Data Scientist with expertise in R: The following indeed.com plot represents the job trends for the search term, “Data Scientist R”. It clearly indicates the trend such as …
I found it very helpful. However, the differences are not very clear to me.