Author: Lenka Fiřtová

This article explains how to compute the correlation coefficient between two variables, and the correlation matrix between multiple variables.

How to compute the correlation coefficient

In our calculations, we are going to use the trees dataset, which ships with R. This dataset contains information about 31 trees, namely their girth, height and volume (of wood). First let’s take a look at the first few rows.

> head(trees)
  Girth Height Volume
1   8.3     70   10.3
2   8.6     65   10.3
3   8.8     63   10.2
4  10.5     72   16.4
5  10.7     81   18.8
6  10.8     83   19.7

To compute correlation in R we use the cor function.

If we want to compute the correlation of two variables (for example, the girth of the trees and their height), we simply pass these two variables to the cor function. A variable is written either as the name of the dataset, a dollar sign ($) and the name of the variable, or alternatively as the name of the dataset followed by [ , number of the column]. No other argument is needed.

By default, this function computes Pearson’s correlation coefficient, which is the coefficient we usually have in mind when talking about "correlation". It is the covariance of the two variables divided by the product of their standard deviations. The cor function can also compute Spearman’s rank correlation coefficient and Kendall’s rank correlation coefficient, which are, however, not the subject of this article.

> cor(trees$Girth, trees$Height)

[1] 0.5192801


> cor(trees[ , 1], trees[ ,2])

[1] 0.5192801

As we would expect, the correlation is positive: taller trees tend to have a larger girth. With a value of about 0.52, the relationship is moderate rather than strong.
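The definition of Pearson’s coefficient can be checked by hand. The following is a minimal sketch using the same two columns of trees; the names x, y, r_manual and r_builtin are introduced here only for illustration.

```r
# Pearson's r by hand: covariance divided by the PRODUCT of the standard deviations
x <- trees$Girth
y <- trees$Height

r_manual  <- cov(x, y) / (sd(x) * sd(y))
r_builtin <- cor(x, y, method = "pearson")  # "pearson" is the default method

print(r_manual)
print(all.equal(r_manual, r_builtin))
```

Both cov and sd use the same n - 1 denominator, so it cancels out and the two results agree up to floating-point error.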

A problem may arise when the dataset contains missing values (NA). Let us create a new dataset, trees2, into which we add a new row using the rbind function. This row will contain a missing value in the girth column (the values of the remaining variables are just made up).

> trees2 = rbind(trees, c(NA, 90, 60))

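The cor function does not raise an error in this case, but it returns NA instead of a number, because by default a single missing value makes the result undefined.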

> cor(trees2$Girth, trees2$Height)

[1] NA

Therefore, in case of missing observations, we have to specify that only complete observations (i.e. those without any missing values) should be used. This is done by adding one more argument to the function call: use = "complete.obs".

> cor(trees2$Girth, trees2$Height, use = "complete.obs")

[1] 0.5192801
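For two variables, use = "complete.obs" should give the same result as removing the incomplete rows ourselves. The sketch below illustrates this with the complete.cases function; the names ok, r_filtered and r_complete are made up for this example.

```r
# Recreate the dataset with one made-up row that has a missing girth
trees2 <- rbind(trees, c(NA, 90, 60))

# TRUE for rows where both variables are observed
ok <- complete.cases(trees2$Girth, trees2$Height)

r_filtered <- cor(trees2$Girth[ok], trees2$Height[ok])
r_complete <- cor(trees2$Girth, trees2$Height, use = "complete.obs")

print(sum(!ok))                           # number of rows dropped
print(all.equal(r_filtered, r_complete))  # both approaches agree
```

Because the added row is the only incomplete one, the coefficient is the same as for the original trees dataset.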


Correlation matrix

Let us go back to the original dataset, trees. We want to display the correlation coefficients for each pair of variables at the same time. To do this, we pass several columns to the cor function at once, or even the whole dataset (when it only contains numeric variables).

> cor(trees)
           Girth    Height    Volume
Girth  1.0000000 0.5192801 0.9671194
Height 0.5192801 1.0000000 0.5982497
Volume 0.9671194 0.5982497 1.0000000

If the trees dataset contained another, non-numeric variable (for example the location of the trees), we would have to specify we only want to use the first three columns:

> cor(trees[,1:3])
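If we do not want to rely on column positions, the numeric columns can also be selected by type. This is only a sketch: the trees3 dataset and its Location column are hypothetical, and the "north"/"south" labels are invented for illustration.

```r
# Hypothetical copy of trees with an invented, non-numeric Location column
trees3 <- trees
trees3$Location <- rep(c("north", "south"), length.out = nrow(trees3))

# Logical vector marking which columns are numeric
numeric_cols <- sapply(trees3, is.numeric)

# Correlation matrix of the numeric columns only
m <- cor(trees3[, numeric_cols])
print(m)
```

This keeps working if further numeric columns are added later or if the column order changes.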

When it is given a dataset with several columns, the cor function returns the so-called correlation matrix. The elements on its main diagonal are equal to one (the correlation of each variable with itself); the other elements are the correlation coefficients of the respective pairs of variables. The matrix is symmetric (the elements above and below the main diagonal are identical).

For example, we can see that the volume of the trees correlates more strongly with their girth (correlation equal to 0.97) than with their height (correlation equal to 0.60).
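The properties of the correlation matrix described above can be verified directly, and single coefficients can be looked up by name. A short sketch follows; the object name m is chosen here for illustration.

```r
m <- cor(trees)

print(isSymmetric(m))        # is the matrix equal to its transpose?
print(diag(m))               # correlation of each variable with itself
print(m["Volume", "Girth"])  # a single coefficient, looked up by name
print(round(m, 2))           # rounded matrix, easier to read
```

Looking up coefficients by row and column name is usually less error-prone than using numeric indices.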


Testing the significance of the correlation coefficient

When we want to test whether the correlation coefficient is significant, we use the cor.test function. This test is used to find out if the correlation is “as high as it is just by chance” (i.e. in our specific sample), or if we can make a general conclusion that there is indeed a non-zero correlation in the whole population.

Let us explore the significance of the correlation coefficient between the variables girth and height.

> cor.test(trees$Girth, trees$Height)

R returns the output shown below. The value of the test statistic (t) is 3.2722. We could compare it with the critical value, but there is a simpler way. The function also displays the p-value, so we can compare the p-value with the significance level, which is usually set to 0.05. If the p-value is smaller than 0.05 (as is true in our case), we reject the null hypothesis of zero correlation and conclude that there is a statistically significant linear relationship between the variables.

data:  trees$Girth and trees$Height
t = 3.2722, df = 29, p-value = 0.002758
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.2021327 0.7378538
sample estimates:
      cor
0.5192801

We can see that 0.002758 is smaller than 0.05. The girth and the height of the trees are significantly correlated.
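The result of cor.test is an object whose parts can be used in further calculations. The sketch below extracts them and recomputes the test statistic by hand, assuming the usual t-test for a Pearson correlation with n - 2 degrees of freedom; the names res, r, n, t_manual and p_manual are introduced only for this example.

```r
res <- cor.test(trees$Girth, trees$Height)

r <- unname(res$estimate)  # sample correlation coefficient
n <- nrow(trees)           # number of observations

# t statistic for H0: true correlation = 0, with n - 2 degrees of freedom
t_manual <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution
p_manual <- 2 * pt(abs(t_manual), df = n - 2, lower.tail = FALSE)

print(c(t = t_manual, p = p_manual))
print(res$conf.int)        # 95 percent confidence interval
print(res$p.value < 0.05)  # significant at the 5 % level?
```

Working with res$p.value and res$conf.int directly avoids copying numbers from the printed output.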
