Linear regression is a popular statistical method used to model the relationship between a dependent variable and one or more independent variables. In linear regression, the t-test is a hypothesis test used to check whether the coefficient of a predictor variable is significantly different from zero, that is, whether there is evidence of a linear relationship between the response variable and that predictor. In this blog, we will discuss linear regression and the t-test along with related formulas and examples. For a detailed read on linear regression, check out my related blog: Linear regression explained with real-life examples.
T-tests are used in linear regression to determine if a particular variable is statistically significant in the model. A statistically significant variable is one whose coefficient is unlikely to be zero given the data. Significance indicates evidence that a relationship with the dependent variable exists, not necessarily that the relationship is strong. The t-statistics and p-values of different variables can also be compared, which can help to identify which variables are most important for predicting the dependent variable.
In the following sections, we will explain the formula for t-tests in linear regression and provide examples of how t-tests are used in linear regression models. We will also explain how to interpret t-test results and provide best practices for using t-tests effectively in linear regression analysis. By understanding how t-tests are used in linear regression analysis, data scientists can gain valuable insights into the relationships between variables and develop more accurate and reliable predictive models.
What is Linear Regression?
Linear regression models the response variable as a linear function of one or more predictor variables. In other words, it is a statistical technique used to estimate and test the linear relationship between the response and predictor variables. The linear regression equation is also called the linear regression model or the statistical linear model.
The simple linear regression line can be represented by the following equation:
Y = mX + b
Where Y represents the response variable or dependent variable, X represents the predictor variable or independent variable, m represents the linear slope and b represents the linear intercept. The linear slope, m, can also be termed as the coefficient of the predictor variable. The diagram below represents the linear regression line, dependent (response) and independent (predictor) variables.
Linear regression comes in two types:
- Simple linear regression: Simple linear regression is defined as linear regression with a single predictor variable. An example of a simple linear regression is Y = mX + b.
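- Multiple linear regression: Multiple linear regression is defined as linear regression with more than one predictor variable, each with its own coefficient. An example of multiple linear regression is Y = aX + bZ + cX*Z, where X and Z are predictor variables, X*Z is their interaction term, and a, b and c are the coefficients.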
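To make the simple linear regression equation concrete, here is a minimal sketch that estimates m and b by least squares using only the Python standard library. The function name `fit_simple_linear_regression` and the data values are made up for illustration and are not from any real dataset.

```python
from statistics import mean

def fit_simple_linear_regression(x, y):
    """Return (m, b) for the least-squares line Y = mX + b."""
    x_bar, y_bar = mean(x), mean(y)
    # Sum of squared deviations of x, and cross-deviations of x and y
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    m = sxy / sxx            # slope
    b = y_bar - m * x_bar    # intercept
    return m, b

# Hypothetical data, for illustration only
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

m, b = fit_simple_linear_regression(x, y)
print(f"slope m = {m:.3f}, intercept b = {b:.3f}")
```

The slope is the ratio of the co-variation of X and Y to the variation of X, and the intercept is chosen so that the line passes through the point of means.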
Concepts, Formula for T-test in Linear Regression
When creating a linear regression model, you can come up with many different features or independent variables. However, including too many features makes the model more complex. Thus, it becomes important to select the most appropriate features, which makes the model faster to evaluate, easier to interpret and less prone to collinearity. This is where hypothesis tests such as the t-test come into the picture.
Whether a predictor variable has a statistically significant linear relationship with the response variable can be determined by calculating the t-test statistic for its coefficient. The t-statistic does not measure how linear the relationship is; it measures how far the estimated coefficient is from zero relative to its standard error. Let’s look at the hypothesis formulation for the relationship between dependent and independent variables and the values that the coefficients take to quantify that relationship.
Null Hypothesis (H₀): The null hypothesis states that there is no relationship between the feature (independent / predictor variable) and the dependent / response variable. In terms of coefficients, it suggests that the coefficient of the feature is equal to zero.
Alternative Hypothesis (H₁): The alternative hypothesis contradicts the null hypothesis and suggests that there is a relationship between the feature and the dependent variable. It implies that the coefficient of the feature is not equal to zero.
Going by the above, in a simple linear regression model such as Y = mX + b, the t-test can be used to evaluate the value of the coefficient, m, based on the following hypotheses:
- H0: m = 0
- Ha: m ≠ 0
H0: There is no relationship between Y (response variable) and X (predictor variable)
Ha: There is a relationship between Y and X.
Assuming that the null hypothesis (H0) is true, the linear regression line will be parallel to the X-axis, given that the Y-axis represents the response variable and the X-axis represents the predictor variable. The following diagram represents the null hypothesis:
The t-test is performed as a hypothesis test to assess the significance of individual coefficients (or features) in the linear regression model. In this case, t-test will be performed to assess the significance of the value, m.
The formula for calculating the t-statistic in the context of linear regression is as follows. The t-statistic measures the number of standard errors the estimated coefficient is away from the hypothesized value.
t = (β – 0) / SE(β)
- t is the t-statistic
- β is the estimated coefficient of the feature. If there are multiple features, we will have values such as β1, β2, β3, …, βn, and a t-statistic for each of the coefficients.
- 0 is the hypothesized value (usually zero under the null hypothesis)
- SE(β) is the standard error of the estimated coefficient
In linear regression, we estimate the coefficients (such as β1, β2, β3, …, βn) of the features (such as x1, x2, x3, …, xn) using methods like ordinary least squares (OLS). The t-test is then applied to examine the statistical significance of these coefficients. It compares the estimated coefficient to its standard error (as shown above) to determine if the coefficient is significantly different from zero. The standard error measures the variability in the estimated coefficients. Standard errors quantify the uncertainty associated with the coefficient estimates.
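As a rough sketch of the calculation described above, the following NumPy code estimates the coefficients with OLS, derives their standard errors from the residual variance, and divides one by the other to get a t-statistic per coefficient. The function name `ols_t_statistics` and the synthetic data are assumptions made for illustration; in practice a library such as statsmodels reports these values directly.

```python
import numpy as np

def ols_t_statistics(X, y):
    """Return (beta, se, t) for an OLS fit with an intercept column added."""
    X = np.column_stack([np.ones(len(X)), X])   # prepend intercept column
    n, k = X.shape                              # k includes the intercept
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y                    # OLS coefficient estimates
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - k)            # estimated error variance
    se = np.sqrt(sigma2 * np.diag(xtx_inv))     # standard error per coefficient
    t = beta / se                               # t = (beta - 0) / SE(beta)
    return beta, se, t

# Synthetic data: the first feature matters, the second has a true coefficient of 0
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 1.0 + 2.0 * X[:, 0] + 0.0 * X[:, 1] + rng.normal(size=200)

beta, se, t = ols_t_statistics(X, y)
for name, b_hat, s, t_val in zip(["intercept", "x1", "x2"], beta, se, t):
    print(f"{name}: coef={b_hat:.3f}, SE={s:.3f}, t={t_val:.2f}")
```

The feature with a true coefficient of zero should typically show a t-statistic close to zero, while the informative feature shows a large one.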
For simple linear regression, the same t-test statistic can be written in terms of the slope, m, as follows:
t = (m – m0) / SE
t is the t-test statistic
m is the linear slope or the coefficient value obtained using the least squares method. For multivariate regression models, it represents the coefficient estimate for the variable.
m0 is the hypothesized value of linear slope or the coefficient of the predictor variable. The value of m0 = 0.
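SE represents the standard error of the coefficient estimate. For the slope in simple linear regression, it can be computed using the following formula:
SE = S / √(Σ(xi – x̄)²)
Where S = √(Σ(yi – ŷi)² / (N – 2)) is the residual standard error (the standard deviation of the residuals), xi are the observed values of the predictor variable, x̄ is their mean, ŷi are the fitted values and N represents the total number of data points.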
The standard error (SE) of the coefficient estimate is a measure of the variability in the coefficient estimate. It quantifies how much the estimated coefficient would be expected to vary from sample to sample. It considers the variability of the data, the complexity of the model (as reflected by the number of features), and the estimated variance of the error term.
The degrees of freedom for the t-statistic are N – 2 in simple linear regression, where N is the number of data points. More generally, for a model with p predictor variables plus an intercept, the degrees of freedom are N – p – 1.
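The sketch below works through the slope's standard error and t-statistic for a simple linear regression using only the standard library. It assumes the usual OLS result that the standard error of the slope is the residual standard error divided by the square root of the sum of squared deviations of X. The function name `slope_t_statistic` and the data are hypothetical.

```python
from math import sqrt
from statistics import mean

def slope_t_statistic(x, y, m0=0.0):
    """Return (m, se, t, df) for the slope of the least-squares line Y = mX + b."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    m = sxy / sxx
    b = y_bar - m * x_bar
    # Residual standard error: sqrt(SSE / (N - 2))
    sse = sum((yi - (m * xi + b)) ** 2 for xi, yi in zip(x, y))
    s = sqrt(sse / (n - 2))
    se = s / sqrt(sxx)        # standard error of the slope
    t = (m - m0) / se         # t-statistic against the hypothesized slope m0
    return m, se, t, n - 2

# Hypothetical data, for illustration only
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

m, se, t, df = slope_t_statistic(x, y)
print(f"m = {m:.3f}, SE = {se:.4f}, t = {t:.2f}, df = {df}")
```

Note that the denominator uses the spread of the X values rather than just the number of data points: a wider range of X gives a more precisely estimated slope.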
The t-statistic is compared to a critical value from the t-distribution based on the degrees of freedom and a chosen significance level (commonly 0.05). If the absolute value of the t-statistic exceeds the critical value, we reject the null hypothesis and conclude that the relationship between the predictor variable and the dependent variable is statistically significant. Conversely, if the absolute value of the t-statistic does not exceed the critical value, we fail to reject the null hypothesis: the data do not provide enough evidence that the variable is statistically significant in the model, which is not the same as proving that no relationship exists.
Alternatively, based on the value of the t-statistic, one can calculate the p-value and compare it with the level of significance (0.05). A p-value less than 0.05 indicates that the outcome of the t-test is statistically significant, and you can reject the null hypothesis that there is no relationship between Y and X (m = 0). This means that an estimate as far from zero as the calculated m would be unlikely if the true slope were zero.
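Both decision rules can be sketched with SciPy's t-distribution. The helper name `t_test_decision` and the example t-statistics are assumptions for illustration, and the test is taken to be two-sided, matching the alternative hypothesis m ≠ 0.

```python
from scipy import stats

def t_test_decision(t_stat, df, alpha=0.05):
    """Return (critical value, two-sided p-value, reject H0?) for a t-statistic."""
    # Two-sided critical value: alpha is split across both tails
    t_crit = stats.t.ppf(1 - alpha / 2, df)
    # Two-sided p-value: probability of a |t| at least this large under H0
    p_value = 2 * stats.t.sf(abs(t_stat), df)
    reject = abs(t_stat) > t_crit
    return t_crit, p_value, reject

# Hypothetical t-statistics with 48 degrees of freedom (N = 50 data points)
for t_stat in (9.46, 1.0):
    t_crit, p_value, reject = t_test_decision(t_stat, df=48)
    print(f"t = {t_stat}: critical = {t_crit:.3f}, p = {p_value:.4g}, reject H0 = {reject}")
```

The two rules always agree: the absolute t-statistic exceeds the critical value exactly when the p-value falls below the significance level.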
When the number of variables is small, an exhaustive search amongst all feature subsets can be performed. However, as the number of features or independent variables increases, the hypothesis space grows exponentially and heuristic search procedures may be needed. Using the p-values, the variable space can be navigated in the following three possible ways.
- Forward regression starts from the empty model and always adds variables based on low p-values.
- Backward regression starts from the full model and always removes variables based on high p-values.
- Stepwise regression is a combination of both. It starts off like forward regression, but once the second variable has been added, it will always check the other variables in the model and remove them if they turn out to be insignificant according to their p-value.
Examples of T-Tests in Linear Regression
T-tests are a powerful tool for determining the significance of individual variables in linear regression models. Here are some examples of scenarios where t-tests are commonly used in linear regression analysis:
- Testing the Significance of Individual Coefficients: In linear regression, we estimate a coefficient for each independent variable to determine its relationship with the dependent variable. A t-test can be used to determine if the coefficient estimate is statistically significant. For example, consider a linear regression model that predicts the price of a house based on its size and the number of bedrooms. We can use a t-test to determine if the coefficient estimate for the size variable is statistically significant, indicating that house size has a significant impact on the price.
- Comparing the Significance of Different Variables: T-tests can also be used to compare the significance of different variables in the model. For example, consider a linear regression model that predicts a student’s GPA based on their SAT scores and high school GPA. Each coefficient gets its own t-statistic and p-value, and comparing them shows which variable has stronger evidence of a relationship with the student’s GPA. To test whether the two coefficients are statistically different from each other, a t-test can be applied to their difference, which is meaningful only when the predictors are on comparable scales (for example, after standardization).
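The second scenario can be sketched with the `t_test` method of a fitted statsmodels model, which accepts a linear hypothesis on the coefficients. The data below is synthetic, the column names `sat_z`, `hs_gpa_z` and `gpa` are hypothetical, and both predictors are generated on a standardized scale so that comparing their coefficients makes sense.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic, standardized predictors (hypothetical values, not real student data)
rng = np.random.default_rng(7)
n = 500
sat_z = rng.normal(size=n)
hs_gpa_z = rng.normal(size=n)
gpa = 3.0 + 0.2 * sat_z + 0.4 * hs_gpa_z + rng.normal(scale=0.3, size=n)
data = pd.DataFrame({"gpa": gpa, "sat_z": sat_z, "hs_gpa_z": hs_gpa_z})

model = smf.ols("gpa ~ sat_z + hs_gpa_z", data=data).fit()

# Individual t-tests: is each coefficient different from zero?
print(model.tvalues)
print(model.pvalues)

# t-test on the difference: H0 says the two coefficients are equal
contrast = model.t_test("sat_z = hs_gpa_z")
t_value = float(np.squeeze(contrast.tvalue))
p_value = float(np.squeeze(contrast.pvalue))
print(f"difference test: t = {t_value:.2f}, p = {p_value:.4g}")
```

A small p-value for the difference test suggests the two predictors contribute differently to the prediction; the individual t-tests alone do not establish that.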
Here’s how you can perform t-tests in linear regression using Python:
import statsmodels.api as sm

# Load the dataset (downloaded from the R datasets collection)
data = sm.datasets.get_rdataset('cars', 'datasets').data

# Fit the linear regression model
model = sm.formula.ols('dist ~ speed', data=data).fit()

# Print the t-test results
print(model.summary())

In the example above, we use the “statsmodels” package to load the “cars” dataset from R’s datasets collection (this step needs an internet connection) and fit a linear regression model to predict the stopping distance of a car based on its speed. We then print the t-test results using the summary() method, which includes the coefficient estimates, standard errors, t-values, and p-values for each variable in the model.
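If you want the t-test numbers as values rather than as a printed table, the fitted results object exposes them as attributes. The sketch below uses synthetic speed and distance data (generated here so that it runs offline; the numbers are not the real “cars” dataset) and collects the coefficient estimates, standard errors, t-values and p-values into one table.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for a speed/stopping-distance dataset (hypothetical values)
rng = np.random.default_rng(1)
speed = rng.uniform(4, 25, size=50)
dist = -17.0 + 4.0 * speed + rng.normal(scale=15.0, size=50)
data = pd.DataFrame({"speed": speed, "dist": dist})

model = smf.ols("dist ~ speed", data=data).fit()

# Collect the per-coefficient t-test results
table = pd.DataFrame({
    "coef": model.params,       # estimated coefficients
    "std_err": model.bse,       # standard errors
    "t": model.tvalues,         # t-statistics (coef / std_err)
    "p_value": model.pvalues,   # two-sided p-values
})
print(table)
print("residual degrees of freedom:", model.df_resid)
```

With 50 data points and one predictor, the residual degrees of freedom are 50 – 2 = 48, matching the N – 2 rule for simple linear regression.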
In this blog post, we have explored the formula for t-tests in linear regression, provided examples of when t-tests are used in linear regression models, and explained how to interpret t-test results.
We have learned that t-tests are a powerful tool for determining the significance of individual variables in linear regression models. By performing t-tests, data scientists can identify which variables are most important for predicting the dependent variable and gain valuable insights into the relationships between variables.
By following best practices for using t-tests in linear regression analysis, data scientists can develop more accurate and reliable predictive models and make better-informed decisions. It is recommended that data scientists use t-tests as a standard tool in their linear regression analysis, but with careful consideration of the underlying assumptions and practical implications. By doing so, they can gain valuable insights into the complex relationships between variables in their datasets.