# Author Archives: Ajitesh Kumar

## Wilcoxon Signed Rank Test: Concepts, Examples

How can data scientists accurately analyze data when faced with non-normal distributions or small sample sizes? This is a challenge that often arises in the dynamic field of data science, where making precise inferences is crucial. Enter the Wilcoxon Signed Rank Test—a non-parametric statistical method that stands as a powerful alternative to the traditional t-test. This blog post aims to unravel the concepts and practical applications of the Wilcoxon Signed Rank Test, offering key insights for data scientists and researchers navigating complex data landscapes. The beauty of the Wilcoxon Signed Rank Test lies in its wide applicability across numerous fields. From healthcare, where it can compare the efficacy of different …

## R-squared in Linear Regression Models: Concepts, Examples

In linear regression, R-squared (R²) measures how close the data points are to the fitted line. It is also known as the coefficient of determination. Understanding R-squared is crucial for data scientists because it helps them evaluate the goodness of fit of linear regression models, compare the explanatory power of different models on the same dataset, and communicate model performance to stakeholders. In this post, you will learn how R-squared is used to assess the performance of multiple linear regression models, with the help of real-world examples explained in a simple manner. Before doing a deep dive, …
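
As a minimal sketch of the idea above, R-squared can be computed as one minus the ratio of the residual sum of squares to the total sum of squares. The function name `r_squared` and the sample values are illustrative assumptions and do not come from the post.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    # Residual sum of squares: distance of the data from the predictions.
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    # Total sum of squares: distance of the data from its own mean.
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical observations and two sets of predictions.
y_true = [3.0, 5.0, 7.0, 9.0]
perfect = [3.0, 5.0, 7.0, 9.0]      # predictions that match every point
mean_only = [6.0, 6.0, 6.0, 6.0]    # always predicting the mean of y_true

r2_perfect = r_squared(y_true, perfect)
r2_mean = r_squared(y_true, mean_only)
print(r2_perfect, r2_mean)
```

A model that reproduces every point scores 1, while a model that only predicts the mean scores 0, which is why R-squared is read as the share of variance explained by the fit.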

## Hierarchical Clustering: Concepts, Python Example

Hierarchical clustering is a type of unsupervised machine learning algorithm that stands out for its unique approach to grouping data points. Unlike counterparts such as k-means, it doesn’t require a predetermined number of clusters. This feature alone makes it an invaluable method for exploratory data analysis, where the true structure of the data is often hidden and waiting to be discovered. But the capabilities of hierarchical clustering go far beyond flexibility. It builds a tree-like structure, a dendrogram, that offers insights into the data’s relationships and similarities, which is more than just clustering: it’s about understanding the story your data wants to tell. In this blog, we’ll explore the key features that …
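
To make the tree-building idea concrete, here is a small sketch of agglomerative clustering with single linkage on one-dimensional points. It is an illustration under stated assumptions, not the post's own implementation: the function name `single_linkage`, the choice of single linkage, and the sample points are all hypothetical. The returned merge history is the information a dendrogram draws.

```python
def single_linkage(points):
    """Agglomerative clustering on 1-D points; returns the merge history.

    Each entry is (cluster_a, cluster_b, distance), in merge order.
    """
    clusters = [[p] for p in points]   # start with one cluster per point
    merges = []
    while len(clusters) > 1:
        best = None
        # Find the pair of clusters with the smallest single-linkage distance
        # (the distance between their two closest members).
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((sorted(clusters[i]), sorted(clusters[j]), d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


merges = single_linkage([1.0, 2.0, 10.0, 11.5, 30.0])
for left, right, dist in merges:
    print(left, "+", right, "at distance", dist)
```

No cluster count is passed in. Cutting the merge history at a chosen distance yields however many clusters exist at that level, which is the flexibility the paragraph above describes.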

## Minimum Description Length (MDL): Formula, Examples

Learning the concepts of Minimum Description Length (MDL) is valuable for several reasons, especially for those involved in statistics, machine learning, data science, and related fields. One of the fundamental problems in statistics and data analysis is choosing the best model from a set of potential models. The challenge is to find a model that captures the essential features of the data without overfitting. This is where methods such as MDL, AIC, and BIC come to the rescue. MDL offers a principled way to balance model complexity against goodness of fit. This is crucial in many areas, such as machine learning and statistical modeling, where overfitting is a common problem. …

## AIC & BIC for Selecting Regression Models: Formula, Examples

Are you grappling with the complexities of choosing the right regression model for your data? You are not alone. When working with regression models, selecting the most appropriate machine learning model is a critical step toward understanding the relationships between variables and making accurate predictions. With numerous regression models available, it becomes essential to employ robust criteria for model selection. This is where the two most widely used criteria come to the rescue. They are the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC). In this blog, we will learn about the concepts of AIC, BIC and how they can be used to select the most appropriate machine …

## Linear Regression Datasets: CSV, Excel

Linear regression is a fundamental machine learning algorithm that helps in understanding the relationship between independent and dependent variables. It is widely used in various fields for predicting numerical outcomes based on one or more input features. To practice and learn about linear regression, it is essential to have access to good-quality datasets. In this blog, we have compiled a list of 17 datasets suitable for training linear regression models, available in CSV or easily convertible to CSV (Excel) format. I have also provided sample Python code that you can use to train models on these datasets. List of Datasets for Training Linear Regression Models The following is a list …

## Pearson vs Spearman: Choosing the Right Correlation Coefficient

Are you, as a data scientist, trying to decipher the relationship between two or more variables within vast datasets to solve real-world problems? Whether it’s understanding the connection between physical exercise and heart health, or the link between study habits and exam scores, uncovering these relationships is crucial. But with different methods at our disposal, how do we choose the most suitable one? This is where the concept of correlation comes into play, and in particular, the choice between the Pearson and Spearman correlation coefficients becomes pivotal. The Pearson correlation coefficient is the go-to metric when both variables under consideration follow a normal distribution, assuming there’s a linear relationship between them. Conversely, the …
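
The difference between the two coefficients can be sketched with a few lines of plain Python. This is an illustration only: the helper names `pearson`, `ranks`, and `spearman` and the sample data are assumptions, and the ranking helper deliberately ignores ties. Spearman's coefficient is computed here as the Pearson correlation of the ranks.

```python
import math


def pearson(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """Ranks from 1 to n; assumes no tied values (a simplification)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation applied to the ranks."""
    return pearson(ranks(x), ranks(y))


# A monotonic but non-linear relationship (hypothetical data): y = x^3.
x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]

pearson_r = pearson(x, y)
spearman_rho = spearman(x, y)
print(round(pearson_r, 4), round(spearman_rho, 4))
```

Because `y` always increases with `x`, the rank-based coefficient reports a perfect monotonic relationship, while Pearson's value falls below 1 because the relationship is not a straight line.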

## Pearson Correlation Coefficient: Formula, Examples

In the world of data science, understanding the relationship between variables is crucial for making informed decisions or building accurate machine learning models. Correlation is a fundamental statistical concept that measures the strength and direction of the relationship between two variables. However, without the right tools and knowledge, calculating correlation coefficients and p-values can be a daunting task for data scientists. This can lead to suboptimal decision-making, inaccurate predictions, and wasted time and resources. In this post, we will discuss what Pearson’s r represents, how it works mathematically (formula), its interpretation, statistical significance, and importance for making decisions in real-world applications such as business forecasting or medical diagnosis. We will …

## t-distribution vs Normal distribution: Differences, Examples

Understanding the differences between the t-distribution and the normal distribution is crucial for anyone delving into the world of statistics, whether they’re students, professionals in research, or data enthusiasts trying to make sense of the world through numbers. But why should one care about the distinction between these two statistical distributions? The answer lies in the heart of hypothesis testing, confidence interval estimation, and predictive modeling. When faced with a set of data, choosing the correct distribution to describe it can greatly influence the accuracy of your conclusions. The normal distribution is often the default assumption due to its simplicity and the central limit theorem, which states that the means …

## One Sample T-test: Calculations, Formula & Examples

In statistics, the t-test is often used in research when the researcher wants to know whether there is a significant difference between the mean of a sample and the population mean, or between the means of two different groups. There are two types of t-tests: the one-sample t-test and the two-sample t-test. As data scientists, it is important for us to understand the concepts behind the t-test and how to use it in our data analysis. In this blog post, we will focus on the one-sample t-test and explain it with formulas and examples. What is a one-sample t-test? A one-sample t-test is a statistical hypothesis testing …
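
As a rough sketch of the calculation, the one-sample t statistic divides the gap between the sample mean and the hypothesized population mean by the standard error of the mean. The function name `one_sample_t`, the sample values, and the hypothesized mean `mu0` are illustrative assumptions.

```python
import math
import statistics


def one_sample_t(sample, mu0):
    """Return (t statistic, degrees of freedom) for a one-sample t-test."""
    n = len(sample)
    mean = statistics.mean(sample)
    s = statistics.stdev(sample)        # sample standard deviation (n - 1)
    standard_error = s / math.sqrt(n)
    t = (mean - mu0) / standard_error
    return t, n - 1


# Hypothetical measurements tested against a hypothesized mean of 5.0.
sample = [4.0, 6.0, 8.0, 10.0]
t_stat, dof = one_sample_t(sample, mu0=5.0)
print(round(t_stat, 4), dof)
```

The statistic is then compared with a critical value from the t-distribution with `dof` degrees of freedom. If SciPy is available, `scipy.stats.ttest_1samp` returns both the statistic and the p-value.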

## First Principles Thinking using ChatGPT

Have you ever wondered why an object such as a chair is shaped the way it is, or why it’s even needed in the first place? What mystery unravels when we dig into the very essence of everyday objects and concepts around us? In a world of well-established beliefs and conventional wisdom, the hunt for innovative answers and for the secrets hidden behind the everyday becomes not just a curiosity but a necessity. This is where first principles thinking comes to the rescue. I have posted a detailed blog on first principles thinking – First principles thinking: Concepts & Examples. In this blog, let’s explore how we can utilize …

## Generative AI Framework for Product Managers: Examples

Ever wondered how you as a product manager can stay ahead in the competitive era fuelled by technological advancements such as generative AI? Are you constantly grappling with the pressure to deliver groundbreaking solutions in line with your business goals? As a product manager, wouldn’t it be revolutionary to have some kind of a playbook that simplifies these challenges? What exactly can generative AI do for modern product managers? Which areas of your daily struggles can it alleviate, and what AI frameworks are best suited for your unique challenges? In this blog, we will dive into the potential of Generative AI vis-a-vis real-life use cases for product managers. Use Cases: …

## Problems with Categorical Variables: Examples

Have you ever encountered unfamiliar words while learning a new language and didn’t know their meanings? Or tried to fit all your belongings into a suitcase, only to realize it’s too full? Or started reading a book series from the third book and felt lost? These scenarios in our daily lives surprisingly resemble some challenges we face with categorical variables in machine learning. Categorical variables, while essential in many datasets, bring with them a unique set of challenges. In this article, we’ll be discussing three major problems associated with categorical features: Let’s explore each with real-life examples and supporting Python code snippets. Incomplete Vocabulary The “Incomplete Vocabulary” problem arises when …

## Central Tendency in Machine Learning: Python Examples

Have you ever wondered why your machine learning model is not performing as expected? Could the “average” behavior of your dataset be misleading your model? How does the “central” or “typical” value of a feature influence the performance of a machine learning model? In this blog, we will explore the concept of central tendency, its significance in machine learning, and the importance of addressing skewness in your dataset. All of this will be demonstrated with the help of Python code examples using a diabetes dataset. We will be working with the diabetes dataset, which can be found on Kaggle – Diabetes Dataset. The dataset consists of multiple columns such as …
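
A quick way to see how an "average" can mislead is to compare the mean, median, and mode of a skewed feature. The values below are made up for illustration and are not taken from the diabetes dataset mentioned above.

```python
import statistics

# Hypothetical right-skewed feature: one extreme value pulls the mean upward.
values = [1, 2, 2, 3, 3, 3, 4, 100]

mean_value = statistics.mean(values)
median_value = statistics.median(values)
mode_value = statistics.mode(values)

print("mean:", mean_value)
print("median:", median_value)
print("mode:", mode_value)
```

The median and mode stay near the bulk of the data, while the single outlier drags the mean far above any typical value. This is why skewness matters when choosing a measure of central tendency.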

## Feature Engineering in Machine Learning: Python Examples

Have you ever wondered why some machine learning models perform exceptionally well while others don’t? Could the magic ingredient be something other than the algorithm itself? The answer is often “Yes,” and the magic ingredient is feature engineering. Good feature engineering can make or break a model. In this blog, we will demystify various techniques for feature engineering, including feature extraction, encoding categorical variables, feature scaling, and feature selection. To demonstrate these methods, we’ll be using a real-world dataset containing car sales data. This dataset includes a variety of features such as ‘Company Name’, ‘Model Name’, ‘Price’, ‘Model Year’, ‘Mileage’, and more. Through this dataset, we’ll explore how to improve …

## Data Analytics for Car Dealers: Actionable Insights

Are you starting a car dealership and wondering how to leverage data to make informed business decisions? In today’s data-driven world, analytics can be the difference between a thriving business and a failing one. This blog aims to provide actionable insights for car dealers, especially those starting a new dealership, to help them excel in various aspects of the business. I will cover inventory management, pricing strategy, marketing and sales, customer service, and risk mitigation, all backed by data analytics. I will continue to update this blog with more methods over time. The data used for analysis can be found on Kaggle.com – Ultimate Car Price Prediction Dataset. First and …
