Pearson Correlation Coefficient & Statistical Significance

In this post, we will discuss what Pearson’s r represents, how it works mathematically, its interpretation, statistical significance, and importance for making decisions in real-world applications such as business forecasting or medical diagnosis. We will also explore some examples of using Pearson’s r with real data sets so you can see how this powerful statistic works in action. As a data scientist, it is very important to understand Pearson’s r and its implications for making decisions based on data.

What is Pearson Correlation Coefficient?

The Pearson correlation coefficient is a statistical measure of the linear relationship between two variables. It is typically represented by the symbol ‘r’ and takes values from -1 to +1. Its sign gives the direction of the relationship: a positive value means one variable tends to increase as the other increases, and a negative value means one tends to decrease as the other increases. Its magnitude gives the strength: the closer r is to -1 or +1, the more closely the data points follow a straight line. A Pearson correlation coefficient of 1 indicates a perfect positive (direct) linear relationship, while -1 indicates a perfect negative (inverse) linear relationship. When Pearson’s r is 0, there is no linear relationship between the two variables, although a non-linear relationship may still exist.
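To make the three reference values concrete, here is a minimal sketch in plain Python. The helper name pearson_r and the data are illustrative assumptions, not part of any library. It computes r for a rising straight line, a falling straight line, and a curve (y = x²) where y clearly depends on x but not linearly.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r for two equal-length sequences (illustrative helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of paired deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each variable.
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(ss_x * ss_y)

x = [-3, -2, -1, 0, 1, 2, 3]

r_positive = pearson_r(x, [2 * v + 1 for v in x])   # y rises with x on a straight line
r_negative = pearson_r(x, [-2 * v + 1 for v in x])  # y falls as x rises on a straight line
r_curved = pearson_r(x, [v ** 2 for v in x])        # y depends on x, but not linearly

print(r_positive, r_negative, r_curved)
```

The curved case is the one worth remembering: r comes out as 0 even though y is fully determined by x, because r only captures the straight-line part of a relationship.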

The picture below represents a strong positive linear relationship (r close to 1).

[Figure: scatter plot showing a strong positive relationship]

The picture below represents a weak positive linear relationship (r closer to 0).

[Figure: scatter plot showing a weak positive relationship]

 

Pearson correlation coefficient can be used to examine relationships between variables in a variety of real-world applications such as some of the following:

  • In medicine, Pearson’s r can be used to measure the strength of the relationship between patient age and cholesterol levels.
  • In finance, Pearson’s r can be used to measure the strength of the relationship between stock prices and earnings per share.
  • In business forecasting, Pearson’s r can be used to measure the strength of the relationship between sales and marketing efforts.
  • In lifestyle research, Pearson’s r can be used to measure the strength of the relationship between exercise habits and obesity rates.
  • In customer research, Pearson’s r can be used to measure the correlation between customer satisfaction and customer loyalty, to ascertain whether customers who report higher levels of satisfaction also demonstrate higher levels of loyalty.
  • In studies of height and weight, Pearson’s r can be used to measure whether taller individuals tend to weigh more than their shorter counterparts on average, or whether there is no obvious connection between height and weight in the data.

Pearson’s correlation coefficient has implications for hypothesis testing as well as other decision-making processes. By measuring the strength of a linear relationship between two variables, researchers can make informed decisions based on their findings which can help guide future research studies or inform corporate policies and practices. Pearson’s correlation coefficient also provides a basis for making predictions about future outcomes when given certain inputs or conditions–which is incredibly valuable in various business settings where predicting customer behavior or market trends is critical for success.

The Pearson Correlation Coefficient formula is given as the following:

r = ∑(x – x̅)(y – y̅) / √[∑(x – x̅)² ∑(y – y̅)²]

A Pearson correlation coefficient should not be taken as definitive proof of a relationship between two variables, and it says nothing about whether one variable causes the other; it should serve as an indicator for further investigation, which can then lead to more conclusive results. In addition, significance tests on Pearson’s r are generally considered reliable only when the sample is large enough, the variables are approximately normally distributed, and there are no extreme outliers; if these conditions are not met, other methods (for example, a rank-based measure such as Spearman’s correlation) may be needed to assess the significance of any indicated correlation.
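One way to see why r deserves this caution is to watch how a single extreme observation can change it. The sketch below uses made-up numbers and an illustrative helper (pearson_r is not a library function): six points with almost no linear trend, then the same points plus one outlier.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r for two equal-length sequences (illustrative helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(ss_x * ss_y)

# Made-up data with almost no linear trend.
x = [1, 2, 3, 4, 5, 6]
y = [2, 1, 2, 1, 2, 1]
r_without_outlier = pearson_r(x, y)

# The same data plus one extreme observation at (20, 20).
r_with_outlier = pearson_r(x + [20], y + [20])

print(round(r_without_outlier, 2), round(r_with_outlier, 2))
```

With these numbers, r moves from a weak negative value (about -0.29) to a strong positive one (about 0.96) because of one point, which is why a scatter plot should be checked before trusting the coefficient.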

Scatterplots & Pearson Correlation Coefficient

Scatterplots are a powerful way of visualizing data and relationships between two variables.

They are graphs in which the values of two variables are plotted against each other. The x-axis usually displays one variable, and the y-axis displays the other. Each point on a scatter plot represents one observation, that is, one pair of values for the two variables being studied; plotted together, these points form clusters or patterns that allow us to analyze the strength and direction of the relationship between the variables.

Alongside a scatter plot, Pearson’s correlation coefficient puts a number on what the plot shows by measuring how closely the two variables are linearly related.

Pearson’s correlation coefficient is calculated using the formula:

r = ∑(x – x̅)(y – y̅) / √[∑(x – x̅)² ∑(y – y̅)²]

where x̅ and y̅ represent mean values for the respective x and y values.
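The formula can be followed term by term in a few lines of plain Python. This is a minimal sketch: the height and weight values are hypothetical, chosen only so the means come out as round numbers.

```python
import math

# Hypothetical sample: heights (cm) and weights (kg) of five people.
heights = [150, 160, 170, 180, 190]
weights = [50, 58, 66, 75, 81]

# Means of each variable (x-bar and y-bar in the formula).
mean_h = sum(heights) / len(heights)
mean_w = sum(weights) / len(weights)

# Numerator: sum of products of paired deviations from the means.
numerator = sum((h - mean_h) * (w - mean_w) for h, w in zip(heights, weights))

# Denominator: square root of the product of the two sums of squared deviations.
ss_h = sum((h - mean_h) ** 2 for h in heights)
ss_w = sum((w - mean_w) ** 2 for w in weights)
denominator = math.sqrt(ss_h * ss_w)

r = numerator / denominator
print(numerator, ss_h, ss_w, round(r, 4))
```

For this sample the numerator is 790 and the two sums of squares are 1000 and 626, so r is just under 1, consistent with weights that rise almost in a straight line with height.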

By examining how closely the points on a scatter plot cluster around a straight line, one can visually gauge both the direction and the strength of the linear relationship and form a rough estimate of Pearson’s correlation coefficient. The three pictures below illustrate scatter plots with different correlation coefficients.

The picture below might represent a very high correlation coefficient, close to 1.

[Figure: scatter plot with a high correlation coefficient]

The picture below might represent a moderately high correlation coefficient, close to 0.5.

[Figure: scatter plot with a medium correlation coefficient]

The picture below might represent a very low correlation coefficient, close to 0.

[Figure: scatter plot with a low correlation coefficient]

Statistical Significance of Pearson Correlation Coefficient

In order to determine whether any given Pearson correlation coefficient has a statistically significant result or not, we will need to go through the following steps:

  • Determine the null and alternate hypotheses: The null hypothesis states that there is no linear relationship between the two variables in the population (ρ = 0), while the alternate hypothesis states that there is one (ρ ≠ 0). Here ρ denotes the population correlation that the sample value r estimates.
  • Determine the test statistic: We will calculate the t-statistic and perform a t-test with (n - 2) degrees of freedom.
  • Determine the level of significance: The level of significance chosen is 0.05.
  • Calculate and compare the t-statistic with the critical value: We test significance by comparing the t-statistic with the two-tailed critical value read from a t-distribution table at the 0.05 significance level. If the absolute value of the t-statistic is greater than the critical value, the null hypothesis can be rejected. This would mean that there is enough evidence to support the alternate hypothesis that there is some linear relationship between the two variables.

The following is the formula for calculating the t-statistic used to determine the statistical significance of the Pearson correlation coefficient:

t = r √(n – 2) / √(1 – r²)

In the above formula, r is the correlation coefficient value and n is the sample size.
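The testing steps can be sketched end to end in plain Python. This assumes the standard t-statistic for a correlation, t = r √(n – 2) / √(1 – r²), and a two-tailed critical value of 2.306 for 8 degrees of freedom at the 0.05 level, as found in standard t-distribution tables. The data and the helper names pearson_r and t_statistic are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r for two equal-length sequences (illustrative helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(ss_x * ss_y)

def t_statistic(r, n):
    """t = r * sqrt(n - 2) / sqrt(1 - r^2), with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# Hypothetical sample of n = 10 paired observations.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2.1, 2.9, 3.2, 4.8, 5.1, 5.7, 7.4, 7.9, 8.6, 10.3]

r = pearson_r(x, y)
t = t_statistic(r, len(x))

# Two-tailed critical value for alpha = 0.05 and n - 2 = 8 degrees of freedom,
# taken from a standard t-distribution table (valid only for this sample size).
T_CRITICAL_DF8 = 2.306

reject_null = abs(t) > T_CRITICAL_DF8
print(round(r, 3), round(t, 2), reject_null)
```

As a quick hand check of the formula, r = 0.8 with n = 10 gives t = 0.8 × √8 / √0.36 ≈ 3.77, which also exceeds 2.306. For a different sample size, the critical value has to be looked up again for the new degrees of freedom.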

One can also calculate the p-value and compare it with the 0.05 significance level. Recall that a p-value is the probability of observing a result at least as extreme as the one actually observed, assuming the null hypothesis is true. If the p-value is less than 0.05, the Pearson correlation coefficient can be considered statistically significant and the null hypothesis rejected in favor of the alternate hypothesis. This means that a correlation this strong would be unlikely to arise by chance alone if the variables were truly uncorrelated, which supports the hypothesis that there is indeed some kind of linear relationship between the two variables being studied.
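For readers who want to see where a p-value comes from, the sketch below computes a two-tailed p-value using only the standard library, by numerically integrating the density of Student’s t distribution with Simpson’s rule. The function names and the choice of numerical integration are assumptions made for illustration; in practice a statistics package would normally be used for this step.

```python
import math

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)

def two_tailed_p_value(t, df, steps=2000):
    """P(|T| >= |t|) under the null hypothesis, via Simpson's rule on [0, |t|]."""
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        weight = 4 if i % 2 else 2
        total += weight * t_pdf(i * h, df)
    central_area = 2 * (h / 3) * total  # probability mass between -|t| and +|t|
    return max(0.0, 1.0 - central_area)

# Hypothetical result: r = 0.8 observed in a sample of n = 10 pairs.
r, n = 0.8, 10
t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
p = two_tailed_p_value(t, n - 2)
significant = p < 0.05
print(round(t, 3), significant)
```

In this hypothetical case the p-value falls well below 0.05, so the correlation would be declared statistically significant. Feeding the table value t = 2.306 with 8 degrees of freedom into the same function returns a p-value of about 0.05, which shows that the critical-value approach and the p-value approach are two views of the same test.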

In conclusion, the Pearson correlation coefficient is a powerful tool for measuring the strength and direction of the linear relationship between two variables. It has implications for decision-making processes and research studies as well as real-world applications such as medicine, finance, business forecasting and lifestyle research. Pearson’s correlation coefficient can also be tested for statistical significance, which helps researchers make informed decisions based on their findings. Thus, the Pearson correlation coefficient is an invaluable resource to have when conducting any form of quantitative analysis or data exploration.

Ajitesh Kumar