# Category Archives: Data Science

## Two independent samples t-tests: Formula & Examples

As a data scientist, you will often need to compare the means of two independent samples. In such cases, the two independent samples t-test, also known as the unpaired two-sample t-test, is an essential statistical tool for drawing meaningful conclusions from your data. The test lets you determine whether the difference between the means of two independent samples is statistically significant or plausibly due to chance. In this blog, we will cover the concept of the two independent samples t-test, its formula, real-world examples of its application, and a Python example (using the scipy.stats.ttest_ind function). We will begin with an overview of what a …
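
As a rough illustration of the scipy.stats.ttest_ind function named above, here is a minimal sketch. The two groups of scores and the 0.05 significance level are made-up assumptions for illustration, not data from the post.

```python
from scipy import stats

# Hypothetical scores for two unrelated groups (illustrative data only).
group_a = [85, 90, 78, 92, 88, 76, 95, 89]
group_b = [70, 65, 80, 72, 68, 75, 74, 71]

# equal_var=True runs Student's t-test with a pooled variance;
# equal_var=False would run Welch's t-test instead.
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=True)

alpha = 0.05  # assumed significance level
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
print("reject H0" if p_value < alpha else "fail to reject H0")
```

If the equal-variance assumption is doubtful, passing equal_var=False is usually the safer choice.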

Posted in Data Science, statistics.

## Neyman-Pearson Lemma: Hypothesis Test, Examples

Have you ever faced a crucial decision where you needed to rely on data to guide your choice? Whether it’s determining the effectiveness of a new medical treatment or assessing the quality of a manufacturing process, hypothesis testing becomes essential. That’s where the Neyman-Pearson Lemma steps in, offering a powerful framework for making informed decisions based on statistical evidence. The lemma matters most for problems that demand highly accurate decisions or conclusions. By understanding it, we learn to navigate the complexities of hypothesis testing and make our choices with greater confidence. In this blog post, we will explore …

Posted in Data Science, statistics.

## Pandas CSV to Dataframe Python Example

Converting CSV files to DataFrames is a common task in data analysis. In this blog, we’ll explore a Python code example that uses the Pandas library to convert CSV files to DataFrames. This approach offers flexibility, speed, and convenience, making it a valuable technique for handling large datasets.

### Read CSV into Pandas Dataframe

A CSV file on the local drive can be read with the read_csv function. If you want to read a CSV file from a URL, nothing changes except that you pass the URL to read_csv. The following are some …
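
The read_csv call described above can be sketched as follows. To keep the example self-contained, the CSV text is held in memory; the column names, the local path, and the URL in the comments are hypothetical placeholders.

```python
import io

import pandas as pd

# Small in-memory CSV standing in for a real file (illustrative data only).
csv_text = "name,age\nAna,36\nBen,29\n"

# read_csv accepts a file path, a URL, or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

# The same call with a hypothetical local path or URL would look like:
# df = pd.read_csv("data/people.csv")
# df = pd.read_csv("https://example.com/people.csv")

print(df.head())
```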

Posted in Data Science, Python.

## Huggingface Hello World Transformers: Python Example

Pre-trained models have revolutionized the field of natural language processing (NLP), enabling the development of advanced language understanding and generation systems. Hugging Face, a prominent organization in the NLP community, provides the “transformers” library, a powerful toolkit for working with pre-trained models. In this blog post, we’ll explore a “Hello World” example using Hugging Face’s Python library, uncovering the capabilities of pre-trained models in NLP tasks. With the transformers library, we can leverage state-of-the-art machine learning models, tokenization tools, and training pipelines for different NLP use cases. We’ll discuss the importance of pre-trained models in NLP, provide an overview of Hugging Face’s offerings, and guide you through an example …

Posted in Data Science.

## Mann-Whitney U Test (Wilcoxon Rank Sum): Python Example

In the ever-evolving world of data science, extracting meaningful insights from diverse data sets is a fundamental task. However, a significant problem arises when these data sets do not conform to the assumptions of normality and equal variances, rendering popular parametric tests like the t-test ineffectual. Real-world data often tends to be skewed, includes outliers, or originates from an unknown distribution. For instance, data related to salaries, house prices, or user behavior metrics often challenge traditional statistical methods. This is where the Wilcoxon Rank Sum Test, also known as the Mann-Whitney U test, proves to be an invaluable statistical test. As a non-parametric alternative to the independent two-sample t-test, it …
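
A minimal sketch of the test on skewed data, assuming SciPy's scipy.stats.mannwhitneyu function; the salary figures (in thousands) are invented for illustration, and the first group deliberately contains an extreme outlier.

```python
from scipy.stats import mannwhitneyu

# Hypothetical salaries in thousands; group A has one extreme outlier.
salaries_a = [42, 45, 47, 51, 53, 58, 250]
salaries_b = [61, 64, 66, 70, 72, 75, 79]

# The test works on ranks, so the outlier counts only as "the largest value".
u_stat, p_value = mannwhitneyu(salaries_a, salaries_b, alternative="two-sided")

print(f"U = {u_stat}, p = {p_value:.4f}")
```

Because only ranks enter the statistic, replacing 250 with any other value above 79 would leave the result unchanged.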

Posted in Data Science, statistics.

## Chi-square test – Formula, Concepts, Examples

Pearson’s Chi-square (χ²) test is a statistical test used to determine whether the distribution of observed data is consistent with the distribution expected under a particular hypothesis. The Chi-square test can be used to test the independence of two categorical variables, or to assess the goodness of fit of a given distribution to observed data. In this blog post, we will discuss different types of Chi-square tests, the concepts behind them, and how to perform them using Python / R. As data scientists, it is important to have a strong understanding of the Chi-square test so that we can use it to make informed decisions about …
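
For the test of independence mentioned above, a short sketch using SciPy's chi2_contingency may help. The 2×3 table of observed counts is hypothetical and chosen only for illustration.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical observed counts: 2 groups (rows) by 3 categories (columns).
observed = np.array([[30, 20, 10],
                     [20, 30, 40]])

# Returns the statistic, the p-value, the degrees of freedom, and the
# expected counts under independence (row total * column total / grand total).
chi2, p_value, dof, expected = chi2_contingency(observed)

print(f"chi2 = {chi2:.3f}, dof = {dof}, p = {p_value:.4f}")
print(expected)
```

For 2×2 tables, chi2_contingency applies Yates' continuity correction by default; its correction parameter turns that off.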

Posted in Data Science, Python, statistics.

## Pearson Correlation Coefficient: Formula, Examples

In the world of data science, understanding the relationship between variables is crucial for making informed decisions or building accurate machine learning models. Correlation is a fundamental statistical concept that measures the strength and direction of the relationship between two variables. However, without the right tools and knowledge, calculating correlation coefficients and p-values can be a daunting task for data scientists. This can lead to suboptimal decision-making, inaccurate predictions, and wasted time and resources. In this post, we will discuss what Pearson’s r represents, how it works mathematically (formula), its interpretation, statistical significance, and importance for making decisions in real-world applications such as business forecasting or medical diagnosis. We will …
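
To make the formula concrete, the sketch below computes Pearson’s r by hand (covariance of the centred values divided by the product of their spreads) and compares it with scipy.stats.pearsonr, which also returns a p-value. The x and y values are made up for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical paired observations (illustrative data only).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2])

# Manual formula: sum of products of deviations over the square root of
# the product of the sums of squared deviations.
dx, dy = x - x.mean(), y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# Library version, which additionally reports a two-sided p-value.
r_scipy, p_value = stats.pearsonr(x, y)

print(f"manual r = {r_manual:.4f}, scipy r = {r_scipy:.4f}, p = {p_value:.2e}")
```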

Posted in Data Science, statistics.

## Procurement Advanced Analytics Use Cases

Procurement analytics applications are poised to grow rapidly in the next few years. With so much data available and the need for digital transformation across procurement organizations, it’s important to know how procurement analytics can help you make better business decisions. This blog will cover procurement analytics and the key advanced analytics use cases that will be useful for business stakeholders such as category managers, sourcing managers, supplier relationship managers, and business analysts / product managers, as well as for data scientists who implement these use cases using machine learning. Procurement analytics will allow you to use data effectively to achieve data-driven decision making. Procurement analytics use cases can be initiated by utilizing …

Posted in Data Science, Machine Learning, Procurement.

## Occam’s Razor in Machine Learning: Examples

“Everything should be made as simple as possible, but not simpler.” – Albert Einstein

Consider this: According to a recent study by IDC, data scientists spend approximately 80% of their time cleaning and preparing data for analysis, leaving only 20% of their time for the actual tasks of analysis, modeling, and interpretation. Does this sound familiar to you? Are you frustrated by the amount of time you spend on complex data wrangling and model tuning, only to find that your machine learning model doesn’t generalize well to new data? As data scientists, we often find ourselves in a predicament. We strive for the highest accuracy and predictive power in our …

Posted in Data Science, Machine Learning.

## Pandas Dataframe: How to Add Rows & Columns

Pandas is a popular data manipulation library in Python, widely used for data analysis and data science tasks. A Pandas DataFrame is a two-dimensional labeled data structure with columns of potentially different types, similar to a spreadsheet or SQL table. One of the common tasks in data manipulation is adding new rows or columns to an existing DataFrame. It might seem like a trivial task, but choosing the right method to add rows or columns can significantly impact the performance and efficiency of your code. In this blog, we will explore the different ways to add rows and columns to a Pandas DataFrame. We will look into different methods available in …
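
A few common ways to add columns and rows can be sketched as below. This is one possible selection of methods, not necessarily the set the post covers, and all names and values in the frame are hypothetical.

```python
import pandas as pd

# Small hypothetical frame (illustrative data only).
df = pd.DataFrame({"name": ["Ana", "Ben"], "age": [36, 29]})

# Add a column by direct assignment (modifies df in place).
df["city"] = ["Lisbon", "Oslo"]

# Add a derived column with assign(), which returns a new DataFrame.
df = df.assign(age_next_year=df["age"] + 1)

# Add a single row by assigning to a new index label with .loc.
df.loc[len(df)] = ["Cy", 45, "Quito", 46]

# Add several rows at once with concat; repeated single-row additions in a
# loop are slow, so batching rows like this is usually preferable.
new_rows = pd.DataFrame(
    {"name": ["Dee"], "age": [40], "city": ["Perth"], "age_next_year": [41]}
)
df = pd.concat([df, new_rows], ignore_index=True)

print(df)
```

Note that DataFrame.append was removed in pandas 2.0, so pd.concat is the method to reach for when adding rows in current versions.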

Posted in Data Science, Python.

## Outlier Detection Techniques in Python: Examples

In the realm of data science, mastering outlier detection techniques is paramount for ensuring data integrity and robust machine learning model performance. Outliers are data points that deviate significantly from the norm. They can greatly impact the accuracy and reliability of statistical analyses and machine learning models. In this blog, we will explore a variety of outlier detection techniques using Python. The methods covered will include statistical approaches like the z-score method and the interquartile range (IQR) method, as well as visualization techniques like box plots and scatter plots. Whether you are a data science enthusiast or a seasoned professional, it is important to grasp these …
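
The two statistical approaches named above, the z-score method and the IQR method, can be sketched with NumPy. The data, the z-score cutoff of 3, and the 1.5 × IQR fences are conventional but assumed choices for illustration.

```python
import numpy as np

# Hypothetical measurements with one obvious outlier (illustrative data only).
values = np.array([10, 12, 11, 13, 12, 11, 14, 13, 12, 11,
                   10, 12, 13, 11, 12, 14, 13, 12, 11, 95])

# z-score method: flag points more than 3 standard deviations from the mean.
# np.std uses the population standard deviation (ddof=0) by default.
z_scores = (values - values.mean()) / values.std()
z_outliers = values[np.abs(z_scores) > 3]

# IQR method: flag points beyond 1.5 * IQR outside the quartiles.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = values[(values < lower) | (values > upper)]

print("z-score outliers:", z_outliers)
print("IQR outliers:", iqr_outliers)
```

With very small samples the z-score method can miss outliers entirely, because a single extreme point inflates the standard deviation it is measured against; the IQR method is less sensitive to this.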

Posted in Data Science, Machine Learning, Python.

## R-squared & Adjusted R-squared: Differences, Examples

There are two measures of the strength of linear regression models: adjusted r-squared and r-squared. While they are both important, they measure different aspects of model fit. In this blog post, we will discuss the differences between adjusted r-squared and r-squared, as well as provide some examples to help illustrate their meanings. As a data scientist, it is of utmost importance to understand the differences between adjusted r-squared and r-squared in order to select the most appropriate linear regression model out of different regression models.

### What is R-squared?

R-squared, also known as the coefficient of determination, is a measure of what proportion of the variance in the value of the …
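
As a small worked example of the difference, the sketch below computes R-squared as 1 − SS_res / SS_tot and adjusted R-squared as 1 − (1 − R²)(n − 1) / (n − p − 1), where n is the number of samples and p the number of predictors. The function names and the data are hypothetical, written only for this illustration.

```python
def r_squared(y_true, y_pred):
    """Proportion of variance in y_true explained by the predictions."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)
    return 1 - ss_res / ss_tot


def adjusted_r_squared(r2, n_samples, n_predictors):
    """R-squared penalised for the number of predictors in the model."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)


# Hypothetical observed values and predictions from a one-predictor model.
y_true = [3.0, 5.0, 7.0, 9.0, 11.0]
y_pred = [2.8, 5.3, 6.9, 9.4, 10.6]

r2 = r_squared(y_true, y_pred)
adj_r2 = adjusted_r_squared(r2, n_samples=len(y_true), n_predictors=1)

print(f"R-squared = {r2:.4f}, adjusted R-squared = {adj_r2:.4f}")
```

Adding a predictor never lowers R-squared, but it lowers adjusted R-squared unless the predictor improves the fit enough to offset the penalty, which is why the adjusted value is the more useful one when comparing models of different sizes.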

Posted in Data Science, Machine Learning.

## Boston Housing Dataset Linear Regression: Predicting House Prices

Predicting house prices accurately is crucial in the real estate industry. However, it can be challenging to determine the factors that significantly impact house prices. Without a clear understanding of these factors, accurate predictions are difficult to achieve. The Boston Housing Dataset addresses this problem by providing a comprehensive set of variables that influence house prices in the Boston area. However, effectively utilizing this dataset and building robust predictive models require appropriate techniques and evaluation methods. In this blog, we will provide an overview of the Boston Housing Dataset and explore linear regression, LASSO, and Ridge regression as potential models for predicting house prices. Each model has its unique properties …

Posted in Data Science, Machine Learning.

## ChatGPT Cheat Sheet for Data Scientists

With the explosion of data being generated, data scientists are facing increased pressure to analyze and interpret large amounts of text data effectively. However, this can be a challenging task, especially when dealing with unstructured data. Additionally, data scientists often spend a significant amount of time manually generating text and answering complex questions. Welcome ChatGPT! ChatGPT offers a powerful solution to these challenges. By learning different ChatGPT prompts, data scientists can become far more productive: generating relevant insights, answering complex questions, and performing machine learning tasks such as data preprocessing, hypothesis testing, and training models with ease. In this blog, I will provide …

Posted in ChatGPT, Data Science, Generative AI, Machine Learning.

## Python Tesseract PDF & OCR Example

Have you ever needed to extract text from an image or a PDF file? If so, you’re in luck! From Python you can use Tesseract, an engine that performs Optical Character Recognition (OCR), to extract text from images and PDFs. In this blog, I will share sample Python code with which you can use Tesseract to extract text from images and PDFs. As a data scientist, it can be very useful to be able to extract text from images or PDFs, especially when working with large amounts of data found in receipts, invoices, etc. Tesseract is an OCR engine widely used in the industry, known for its accuracy …