When building a regression model to predict a target variable, understanding the characteristics of your data, including both the independent and dependent variables, is key. While descriptive statistics like the mean and standard deviation provide a basic summary, they don't always tell the whole story, especially when comparing variables measured on different scales. This is where the Coefficient of Variation (CV) shines.
The Coefficient of Variation is a standardized measure of dispersion that expresses the standard deviation as a percentage of the mean. The formula is simple:
CV = (Standard Deviation / Mean) * 100%
Unlike the standard deviation, which is an absolute measure of variability, the CV is a relative measure. This makes it incredibly useful for comparing the variability of different features, even if they are measured in completely different units.
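To make this concrete, here is a minimal sketch with made-up height and weight measurements (not from the car dataset): their standard deviations live in different units and can't be compared directly, but their CVs can.

```python
import numpy as np

# Hypothetical measurements in different units
heights_cm = np.array([160.0, 170.0, 175.0, 180.0, 165.0])
weights_kg = np.array([55.0, 70.0, 80.0, 90.0, 60.0])

def cv_percent(x):
    """Coefficient of Variation: sample standard deviation as a percentage of the mean."""
    return (np.std(x, ddof=1) / np.mean(x)) * 100

# Comparing the raw standard deviations (cm vs. kg) is meaningless;
# the unitless CV shows weight is relatively more variable than height.
print(f"Height CV: {cv_percent(heights_cm):.2f}%")
print(f"Weight CV: {cv_percent(weights_kg):.2f}%")
```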
Let’s consider a car price prediction dataset with several numerical features, including ‘Year’, ‘Engine Size’, ‘Mileage’, and the target variable ‘Price’. Looking at the standard deviations from descriptive statistics alone wouldn’t give us a clear picture of which features have the most relative spread. For example, the standard deviation of ‘Mileage’ can be much larger than that of ‘Year’ simply because mileage values are typically much larger than year values. Here is sample Python code to get the descriptive statistics; it was run on Google Colab.
import pandas as pd

# Load the dataset (the path assumes the file was uploaded to the Colab session)
df = pd.read_csv('/content/car_price_prediction_.csv')

# Keep only the numerical columns
numerical_cols = df.select_dtypes(include=['int64', 'float64'])

# Summary statistics: count, mean, std, min, quartiles, max
descriptive_stats = numerical_cols.describe()
display(descriptive_stats)
However, let’s say we calculated the Coefficient of Variation for the numerical columns (as seen in our analysis) and got the following values. The following Python code calculates the coefficient of variation.
# Calculate the Coefficient of Variation for numerical columns
# CV = (Standard Deviation / Mean) * 100%
# Note: pandas .std() uses the sample standard deviation (ddof=1)
cv_values = (numerical_cols.std() / numerical_cols.mean()) * 100
print("Coefficient of Variation for Numerical Columns:")
display(cv_values)
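As an aside, SciPy ships a ready-made helper, `scipy.stats.variation`, which returns std/mean as a fraction rather than a percentage. A small sketch on illustrative numbers; note that its default uses the population standard deviation (ddof=0), so its output differs slightly from a pandas `.std()`-based calculation:

```python
from scipy import stats
import numpy as np

data = np.array([10.0, 12.0, 14.0, 16.0, 18.0])

# scipy's variation() returns std/mean as a fraction (population std, ddof=0)
cv_scipy = stats.variation(data) * 100

# Equivalent manual calculation with the same ddof
cv_manual = (np.std(data) / np.mean(data)) * 100
print(f"scipy: {cv_scipy:.2f}%  manual: {cv_manual:.2f}%")
```

For large datasets the ddof=0 vs. ddof=1 difference is negligible, but it's worth knowing which convention your tool uses.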
These percentages can offer valuable insights for our regression modeling task:
A high CV for the target variable, like the 51.86% we observed for ‘Price’, isn’t necessarily a bad thing. In fact, a certain level of variability is essential! If all car prices were exactly the same, there would be nothing to predict, and a regression model would be trivial.
A moderately high CV, like ours, indicates that there is significant variation in car prices relative to their average. This variability is precisely what we, as modelers, want to explain using our features (like ‘Year’, ‘Engine Size’, ‘Mileage’, ‘Brand’, etc.). The more variation in the target that can be attributed to the features, the better our model will be at making accurate predictions.
In this context, a CV of 51.86% suggests there’s plenty of interesting price variation for our model to learn from. It tells us that car prices are not tightly clustered around the mean, implying that factors captured by our features are indeed influencing the price. If the CV were very low (close to 0%), it would suggest that price is relatively constant, making prediction less impactful or necessary.
While a high CV is often desirable for the target variable, there are instances where it warrants closer inspection and potentially different modeling strategies. For example, a very high CV can signal a heavily right-skewed distribution, a handful of extreme outliers (such as a few luxury cars priced far above the rest), or a mixture of distinct sub-populations that might be better modeled separately.
If a high CV, coupled with visualizations like histograms and box plots, suggests such issues with the target variable’s distribution, here are some steps to consider: apply a variance-stabilizing transformation (such as a log transform) to the target, investigate and handle outliers, or choose models and loss functions that are robust to skew and extreme values.
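A log transform is the most common of these remedies for a right-skewed target. A minimal sketch on hypothetical prices (not the real dataset), showing how `np.log1p` compresses the long right tail and shrinks the CV:

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed prices: one expensive car stretches the tail
df = pd.DataFrame({"Price": [8000, 12000, 15000, 22000, 35000, 90000]})

# log1p compresses the long right tail; predictions can be inverted with np.expm1
df["LogPrice"] = np.log1p(df["Price"])

cv = lambda s: s.std() / s.mean() * 100
print(f"CV of Price:    {cv(df['Price']):.2f}%")
print(f"CV of LogPrice: {cv(df['LogPrice']):.2f}%")
```

The transformed target varies far less relative to its mean, which often makes it easier for linear models to fit.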
The CV helps us identify which features exhibit the most relative variation. In our analysis, ‘Mileage’ has the highest CV at 58.71%, closely followed by ‘Car ID’ (though we’ll likely exclude ‘Car ID’ as it’s an identifier). ‘Engine Size’ has a moderate CV of 41.33%, while ‘Year’ has a very low CV of just 0.35%.
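The point about excluding ‘Car ID’ can be folded directly into the CV calculation. A sketch using illustrative column values (the real analysis would use the car dataset loaded earlier): drop identifier columns first, then rank the remaining features by CV.

```python
import pandas as pd

# Illustrative numerical columns standing in for the car dataset
df = pd.DataFrame({
    "Car ID":      range(1, 101),
    "Year":        [2010 + (i % 14) for i in range(100)],
    "Mileage":     [5000 + 1500 * i for i in range(100)],
    "Engine Size": [1.0 + 0.03 * i for i in range(100)],
})

# Drop identifier columns before ranking: their CV is meaningless for modeling
features = df.drop(columns=["Car ID"])

# Rank features by relative variability, highest first
cv_values = (features.std() / features.mean()) * 100
print(cv_values.sort_values(ascending=False))
```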
Features with higher CVs, like ‘Mileage’, have a wider range of values relative to their mean. This suggests that ‘Mileage’ is not consistently close to its average value, and therefore, different mileage values could potentially lead to significantly different car prices. In regression modeling, features with higher relative variability are often more likely to be strong predictors of the target variable. Our scatter plot of ‘Price’ vs. ‘Mileage’ (from a later step in our analysis) visually supports this, showing a clear downward trend as mileage increases, indicating a strong relationship.
The power of the CV lies in its ability to compare the variability of features measured in different units. We could not directly compare the standard deviation of ‘Year’ (measured in years) to the standard deviation of ‘Mileage’ (measured in miles) to understand which is relatively more variable. The CV allows us to do this by providing a unitless percentage. Our analysis clearly shows that ‘Mileage’ (CV = 58.71%) is significantly more variable relative to its mean than ‘Year’ (CV = 0.35%).
In conclusion, the Coefficient of Variation is a valuable tool in the exploratory data analysis phase of regression modeling. It moves beyond absolute measures of spread to provide a relative understanding of variability within and across features. By examining the CV, we can gain insights into the nature of our target variable, identify potentially influential predictors, and make informed decisions about data transformations and modeling strategies, ultimately leading to more robust and accurate regression models.