When dealing with data, it's important to establish the unknown relationships between various variables.
Other than discovering the relationships between variables, it is important to quantify the degree to which they depend on each other. Such statistics are widely used in science and technology, and Python provides its users with tools to calculate them. In this article, I will show you how to use the SciPy, NumPy, and Pandas libraries in Python to calculate correlation coefficients between variables.

Table of Contents

You can skip to a specific section of this Python correlation statistics tutorial using the table of contents below:

What is Correlation?

The variables within a dataset may be related in different ways. For example, one variable may depend on the values of another variable, or two variables may both depend on a third, unknown variable. In statistics and data science, it is important to determine the underlying relationship between variables. Correlation is the measure of how strongly two variables are related to each other. Once data is organized in the form of a table, the rows of the table become the observations, while the columns become the features or attributes. There are three types of correlation:

- positive correlation: larger values of one variable tend to correspond to larger values of the other
- negative correlation: larger values of one variable tend to correspond to smaller values of the other
- no (zero) correlation: there is no discernible relationship between the variables
Correlation goes hand-in-hand with other statistical quantities like the mean, variance, standard deviation, and covariance. In this article, we will be focusing on the three major correlation coefficients:

- the Pearson's r coefficient
- the Spearman's rho coefficient
- the Kendall's tau coefficient
The Pearson's coefficient measures linear correlation, while the Kendall and Spearman coefficients compare the ranks of data. The SciPy, NumPy, and Pandas libraries come with numerous correlation functions that you can use to calculate these coefficients. If you need to visualize the results, you can use Matplotlib.

Correlation Calculation using NumPy

NumPy comes with many statistics functions. An example is the np.corrcoef() function, which gives a matrix of Pearson correlation coefficients. To use the NumPy library, we should first import it as shown below:
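The original code listing is not shown here, so this is a minimal sketch of the import (the alias np is the usual convention):

```python
# Import NumPy under its conventional alias
import numpy as np
```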
Next, we can use NumPy to define two arrays. We will call them x and y:
You can see the generated arrays by typing their names on the Python terminal. First, we have used the np.arange() function to generate an array named x, with values ranging between 10 and 20 (10 inclusive, 20 exclusive). We have then used the np.array() function to create an array y of arbitrary integers. We now have two arrays of equal length. You can use Matplotlib to plot the datapoints:
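The original listing is missing, so here is a sketch of the whole step, from creating the arrays to plotting them; the values in y are hypothetical stand-ins for the arbitrary integers mentioned above:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.arange(10, 20)                               # 10 inclusive, 20 exclusive
y = np.array([2, 1, 4, 5, 8, 12, 18, 25, 96, 48])   # hypothetical arbitrary integers

# Plot the datapoints as a scatter plot
plt.scatter(x, y)
plt.xlabel('x')
plt.ylabel('y')
plt.show()
```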
This will return the following plot: The colored dots are the datapoints. It's now time for us to determine the relationship between the two arrays. We simply call the np.corrcoef() function and pass it the two arrays as arguments. This is shown below:
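A self-contained sketch of the call (the y values are hypothetical):

```python
import numpy as np

x = np.arange(10, 20)
y = np.array([2, 1, 4, 5, 8, 12, 18, 25, 96, 48])   # hypothetical values

r = np.corrcoef(x, y)   # 2x2 matrix of Pearson correlation coefficients
print(r)
```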
Now, type the name of the returned matrix on the Python terminal to see the generated correlation matrix. The correlation matrix is a two-dimensional array showing the correlation coefficients. If you observe it keenly, you will notice that the values on the main diagonal, upper left and lower right, are equal to 1. The value on the upper left is the correlation coefficient for x with x, and the value on the lower right is the correlation coefficient for y with y; a variable is always perfectly correlated with itself, so these values will always be 1. However, the lower-left and upper-right values are the ones of most significance, and you will need them frequently. The two values are equal, and they denote the Pearson correlation coefficient for the variables x and y.

Correlation Calculation using SciPy

SciPy has a module called scipy.stats that comes with many routines for statistics. To calculate the three coefficients that we mentioned earlier, you can call the following functions:

- pearsonr() for the Pearson's r
- spearmanr() for the Spearman's rho
- kendalltau() for the Kendall's tau
Let me show you how to do it. First, we import NumPy and the stats module from SciPy. Next, we can generate two arrays. This is shown below:
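A sketch of the imports and data, again with hypothetical y values:

```python
import numpy as np
from scipy import stats

x = np.arange(10, 20)
y = np.array([2, 1, 4, 5, 8, 12, 18, 25, 96, 48])   # hypothetical values
```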
You can calculate the Pearson's r coefficient as follows:
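A self-contained sketch of the call (hypothetical y values):

```python
import numpy as np
from scipy import stats

x = np.arange(10, 20)
y = np.array([2, 1, 4, 5, 8, 12, 18, 25, 96, 48])   # hypothetical values

# pearsonr() returns the coefficient and the p-value
r, p = stats.pearsonr(x, y)
print(r, p)
```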
This should return the following: The value for Spearman's rho can be calculated as follows:
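A self-contained sketch of the call (hypothetical y values):

```python
import numpy as np
from scipy import stats

x = np.arange(10, 20)
y = np.array([2, 1, 4, 5, 8, 12, 18, 25, 96, 48])   # hypothetical values

# spearmanr() returns the coefficient and the p-value
rho, p = stats.spearmanr(x, y)
print(rho, p)
```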
You should get this: And finally, you can calculate the Kendall's tau as follows:
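A self-contained sketch of the call (hypothetical y values):

```python
import numpy as np
from scipy import stats

x = np.arange(10, 20)
y = np.array([2, 1, 4, 5, 8, 12, 18, 25, 96, 48])   # hypothetical values

# kendalltau() returns the coefficient and the p-value
tau, p = stats.kendalltau(x, y)
print(tau, p)
```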
The output should be as follows: The output from each of the three functions has two values. The first value is the correlation coefficient, while the second value is the p-value. In this case, our focus is on the correlation coefficient, the first value. The p-value becomes useful when testing hypotheses in statistical methods. If you only want to get the correlation coefficient, you can extract it using its index. Since it's the first value, it's located at index 0. The following demonstrates this:
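A sketch of the extraction (hypothetical y values):

```python
import numpy as np
from scipy import stats

x = np.arange(10, 20)
y = np.array([2, 1, 4, 5, 8, 12, 18, 25, 96, 48])   # hypothetical values

# Index 0 holds the coefficient; index 1 would hold the p-value
r = stats.pearsonr(x, y)[0]
rho = stats.spearmanr(x, y)[0]
tau = stats.kendalltau(x, y)[0]
```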
Each should return one value as shown below:

Correlation Calculation in Pandas

In some cases, the Pandas library is more convenient for calculating statistics than NumPy and SciPy. It comes with statistical methods for DataFrame and Series instances. For example, if you have two Series objects with an equal number of items, you can call the .corr() method on one of them with the other as the first argument. First, let's import the Pandas library and generate a Series object with a set of integers:
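The original listings are missing, so here is a sketch of the whole Pandas workflow described below; the values in y are hypothetical:

```python
import pandas as pd

x = pd.Series(range(20, 30))                          # 20 inclusive, 30 exclusive
y = pd.Series([11, 2, 4, 3, 9, 12, 35, 18, 25, 40])   # hypothetical values

pearson = x.corr(y)                      # Pearson's r (the default method)
pearson_sym = y.corr(x)                  # correlation is symmetric: same value
spearman = x.corr(y, method='spearman')  # Spearman's rho
kendall = x.corr(y, method='kendall')    # Kendall's tau
```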
To see the generated Series, type x on the Python terminal. It has generated numbers between 20 and 30, with 20 inclusive and 30 exclusive. We can generate the second Series object, y, in the same way; to see its values, type y on the Python terminal. To calculate the Pearson's r coefficient for x in relation to y, we call the .corr() method as x.corr(y). The Pearson's r coefficient for y in relation to x can be calculated as y.corr(x), and the two results are equal. You can then calculate the Spearman's rho as x.corr(y, method='spearman'); note that we had to set the parameter method to 'spearman'. The Kendall's tau can be calculated the same way, with the method parameter set to 'kendall'.

Linear Correlation

The purpose of linear correlation is to measure how close the mathematical relationship between the variables of a dataset is to a linear function. The closer the relationship between two variables is to a linear function, the stronger their linear correlation and the higher the absolute value of the correlation coefficient.

Pearson Correlation Coefficient

Let's say you have a dataset with two features, x and y. Each of these features has n values, meaning that x and y are n-tuples. The first value of feature x, x₁, corresponds to the first value of feature y, y₁. The second value of feature x, x₂, corresponds to the second value of feature y, y₂. Each of the x-y pairs denotes a single observation. The Pearson (product-moment) correlation coefficient measures the linear relationship between two features.
It is simply the ratio of the covariance of x and y to the product of their standard deviations. It is normally denoted with the letter r, and it can be expressed with the following mathematical equation:

r = Σᵢ(xᵢ − mean(x))(yᵢ − mean(y)) / (√Σᵢ(xᵢ − mean(x))² · √Σᵢ(yᵢ − mean(y))²)

The index i can take the values 1, 2, …, n. The mean values of x and y are denoted as mean(x) and mean(y), respectively. Note the following facts regarding the Pearson correlation coefficient:

- its value is always between −1 and 1
- r = 1 indicates a perfect positive linear relationship, and r = −1 a perfect negative linear relationship
- r > 0 means the correlation is positive (larger x values correspond to larger y values), while r < 0 means it is negative
- a value of r close to 0 indicates a weak linear relationship between the variables
So, a larger absolute value of r indicates a stronger correlation, closer to a linear function, while a smaller absolute value of r indicates a weaker correlation.

Linear Regression in SciPy

SciPy can give us the linear function that best approximates the existing relationship between two arrays, along with the Pearson correlation coefficient. Let's first import the libraries and prepare the data. Now that the data is ready, we can call the linregress() function and perform linear regression between the two features, x and y. We can then read the values of the different results it returns:

- slope: the slope of the regression line
- intercept: the intercept of the regression line
- rvalue: the Pearson correlation coefficient r
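The regression step described above can be sketched as follows (hypothetical y values):

```python
import numpy as np
from scipy import stats

x = np.arange(10, 20)
y = np.array([2, 1, 4, 5, 8, 12, 18, 25, 96, 48])   # hypothetical values

result = stats.linregress(x, y)
print(result.slope)       # slope of the regression line
print(result.intercept)   # intercept of the regression line
print(result.rvalue)      # Pearson correlation coefficient
```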
Pearson Correlation in NumPy and SciPy

At this point, you know how to use the np.corrcoef() and scipy.stats.pearsonr() functions to calculate the Pearson correlation coefficient. With pearsonr(), you can run the calculation and then access the values of r and p by typing their names on the terminal. Note that if you pass an array containing a nan value to the pearsonr() function, it will return a nan. There are a few more details that you should consider. First, remember that the np.corrcoef() function can take two NumPy arrays as arguments. You can instead pass it a single two-dimensional array holding the same values, with one variable per row; this gives the same results as in the previous examples. Now let's see what happens when you pass nan data to corrcoef(): if the third row of the array contains a nan value, every coefficient that doesn't involve that row is calculated correctly, while all the results that depend on it are nan.

Pearson Correlation in Pandas

First, let's import the Pandas library and create Series and DataFrame objects, for example Series named x, y, and z and DataFrame objects built from them. To see any of them, type its name on the Python terminal. At this point, you know how to use the .corr() method on Series objects to get the correlation coefficients.
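The details above can be sketched in one self-contained listing (the data values and Series names are hypothetical):

```python
import numpy as np
import pandas as pd
from scipy import stats

x = np.arange(10, 20)
y = np.array([2, 1, 4, 5, 8, 12, 18, 25, 96, 48])   # hypothetical values

# pearsonr() returns the coefficient r and the p-value p
r, p = stats.pearsonr(x, y)

# corrcoef() can take the two arrays...
matrix = np.corrcoef(x, y)

# ...or a single two-dimensional array whose rows are the variables
matrix2 = np.corrcoef(np.array([x, y]))              # same result

# A nan value poisons every coefficient that involves its row
arr = np.array([x, y, [3, 6, 1, 2, 9, 4, 5, 8, np.nan, 7]], dtype=float)
m = np.corrcoef(arr)                                 # m[0, 2], m[1, 2], m[2, 2] are nan

# The same Pearson coefficient via Pandas Series
xs, ys = pd.Series(x), pd.Series(y)
r_pd = xs.corr(ys)
```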
We can call the .corr() method on one Series object and pass the other object to the method as an argument. The .corr() method can also be used on DataFrame objects, where it gives you the correlation matrix for the columns of the DataFrame. For example, calling xy.corr() gives us the correlation matrix for the columns of the xy DataFrame object. To see the generated correlation matrix, type its name on the Python terminal. The resulting correlation matrix is a new DataFrame instance holding the correlation coefficients for the columns of xy. Such labeled results are very convenient to work with, since they can be accessed with either their labels or their integer position indices:

- by label, for example with .at[] or .loc[]
- by integer position, for example with .iat[] or .iloc[]
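A sketch of the DataFrame case; the column names and values are hypothetical:

```python
import pandas as pd

xy = pd.DataFrame({'x-values': list(range(10, 20)),
                   'y-values': [2, 1, 4, 5, 8, 12, 18, 25, 96, 48]})

corr_matrix = xy.corr()   # DataFrame of Pearson coefficients for the columns

by_label = corr_matrix.at['x-values', 'y-values']   # access by labels
by_position = corr_matrix.iat[0, 1]                 # access by integer position
```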
Rank Correlation

Rank correlation compares the orderings, or ranks, of the data related to two features or variables of a dataset. If the orderings are similar, the correlation is strong, positive, and high. On the other hand, if the orderings are close to reversed, the correlation is strong, negative, and low.

Spearman Correlation Coefficient

This is the Pearson correlation coefficient between the rank values of two features. It is calculated just like the Pearson correlation coefficient, but it uses the ranks of the values instead of the values themselves. It is denoted with the Greek letter rho (ρ), the Spearman's rho. Here are important points to note concerning the Spearman correlation coefficient:

- its value is also always between −1 and 1
- ρ = 1 indicates a perfectly monotonically increasing relationship, and ρ = −1 a perfectly monotonically decreasing one
- unlike Pearson's r, it does not assume a linear relationship; it measures monotonic relationships and is less sensitive to outliers
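To make the definition concrete, here is a small check (with hypothetical data) that Spearman's rho equals the Pearson coefficient computed on the ranks:

```python
import numpy as np
from scipy import stats

x = np.arange(10, 20)
y = np.array([2, 1, 4, 5, 8, 12, 18, 25, 96, 48])   # hypothetical values

rho, _ = stats.spearmanr(x, y)

# Spearman's rho is the Pearson coefficient of the rank values
rho_manual, _ = stats.pearsonr(stats.rankdata(x), stats.rankdata(y))
```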
Kendall Correlation Coefficient

Let's consider two n-tuples again, x and y. Each (xᵢ, yᵢ) pair denotes a single observation. Each pair of observations, (xᵢ, yᵢ) and (xⱼ, yⱼ), where i < j, will be one of the following:

- concordant, if either (xᵢ > xⱼ and yᵢ > yⱼ) or (xᵢ < xⱼ and yᵢ < yⱼ)
- discordant, if either (xᵢ > xⱼ and yᵢ < yⱼ) or (xᵢ < xⱼ and yᵢ > yⱼ)
- neither, if there is a tie in x (xᵢ = xⱼ) or in y (yᵢ = yⱼ)
The Kendall correlation coefficient compares the number of concordant and discordant pairs of data. The coefficient reflects the difference between the counts of concordant and discordant pairs relative to the total number of x-y pairs. Note the following points concerning the Kendall correlation coefficient:

- its value is also always between −1 and 1
- τ = 1 means the rankings of x and y agree for all pairs (all pairs concordant), while τ = −1 means they are fully reversed (all pairs discordant)
- a value of τ close to 0 means the numbers of concordant and discordant pairs are roughly equal
SciPy Implementation of Rank

The scipy.stats module can help you determine the rank of each value in an array. Let's first import the libraries and create NumPy arrays. Now that the data is ready, we can use scipy.stats.rankdata() to calculate the rank of each value in a NumPy array. The array x is monotonic, so its ranks are monotonic as well. The rankdata() function also takes the optional parameter method, which tells it what to do in case of ties in the array. By default, tied values are assigned the average of the ranks they would occupy. For example, if an array contains two values equal to 2 occupying the first two ranks, their total rank is 3, so when averaged each gets a rank of 1.5. You can also get ranks using np.argsort(). The argsort() function returns the indices the array items would have in the sorted array. These indices are zero-based, so you have to add 1 to all of them.

Rank Correlation Implementation in NumPy and SciPy

You can use scipy.stats.spearmanr() to calculate the Spearman correlation coefficient. It returns an object with both the correlation coefficient and the p-value. To get the Kendall correlation coefficient, you can use the kendalltau() function in the same way.

Rank Correlation Implementation in Pandas

You can use the Pandas library to calculate the Spearman and Kendall correlation coefficients. First, import the Pandas library and create the Series and DataFrame objects. You can then call the .corr() method and use its method parameter to specify the correlation coefficient that you want to calculate. It defaults to 'pearson'.
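The ranking routines and the Pandas method parameter described above can be sketched together (hypothetical data values):

```python
import numpy as np
import pandas as pd
from scipy import stats

x = np.arange(10, 20)
y = np.array([2, 1, 4, 5, 8, 12, 18, 25, 96, 48])   # hypothetical values

ranks_x = stats.rankdata(x)        # monotonic data -> ranks 1..10
ranks_y = stats.rankdata(y)

# Ties are averaged by default: the two 2s share ranks 1 and 2,
# so each receives (1 + 2) / 2 = 1.5
tied = stats.rankdata([2, 2, 5])   # [1.5, 1.5, 3.0]

# Ranks via argsort: argsort of argsort, plus 1 (indices are zero-based)
ranks_argsort = np.argsort(np.argsort(y)) + 1

# Spearman and Kendall coefficients, in SciPy and in Pandas
rho, rho_p = stats.spearmanr(x, y)
tau, tau_p = stats.kendalltau(x, y)

xs, ys = pd.Series(x), pd.Series(y)
rho_pd = xs.corr(ys, method='spearman')
tau_pd = xs.corr(ys, method='kendall')
```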
Consider the code given below: That's how to use the method parameter with the .corr() method. To calculate the Kendall's tau, set method='kendall'.

Visualizing Correlation

Visualizing your data can help you gain more insight into it. Luckily, you can use Matplotlib to visualize your data in Python. If you haven't installed the library, install it with the pip package manager by running pip install matplotlib. Next, import its pyplot module, conventionally as plt. You can then create the arrays of data that you will use to generate the plot. We will first demonstrate how to create an x-y plot with a regression line, its equation, and the Pearson correlation coefficient. You can use the linregress() function to get the slope, the intercept, and the correlation coefficient for the line; first import the stats module from SciPy, then run the regression. You can also build a string with the equation of the regression line and the value of the correlation coefficient using f-strings (note that f-strings require Python 3.6 and above). Finally, call the plotting functions to generate the x-y plot. The blue squares on the plot denote the observations, while the yellow line is the regression line.

Heatmaps of Correlation Matrices

The correlation matrix can be big and confusing when you are handling a huge number of features. However, you can present it as a heatmap in which each field has a color that corresponds to its value. You will need a correlation matrix, so let's create one.
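Both visualizations described above can be sketched in one listing; the data values, the third feature z, and the styling are hypothetical:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

x = np.arange(10, 20)
y = np.array([2, 1, 4, 5, 8, 12, 18, 25, 96, 48])     # hypothetical values

# 1. x-y plot with the regression line and its equation
slope, intercept, r, p, stderr = stats.linregress(x, y)
line = f'Regression line: y = {intercept:.2f} + {slope:.2f}x, r = {r:.2f}'

fig, ax = plt.subplots()
ax.plot(x, y, linewidth=0, marker='s', label='Data points')      # blue squares
ax.plot(x, intercept + slope * x, color='y', label=line)         # yellow line
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.legend(facecolor='white')
plt.show()

# 2. Heatmap of the correlation matrix for three features
z = np.array([5, 3, 2, 1, 0, -2, -8, -11, -15, -16])  # hypothetical third feature
corr = np.corrcoef([x, y, z])                          # 3x3 correlation matrix

fig, ax = plt.subplots()
im = ax.imshow(corr)
im.set_clim(-1, 1)                                     # full coefficient range
ax.set_xticks((0, 1, 2))
ax.set_yticks((0, 1, 2))
ax.set_xticklabels(('x', 'y', 'z'))
ax.set_yticklabels(('x', 'y', 'z'))
for i in range(3):
    for j in range(3):
        ax.text(j, i, f'{corr[i, j]:.2f}', ha='center', va='center')
fig.colorbar(im)
plt.show()
```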
First, this is our array. Now, let's generate the correlation matrix. When you type the name of the correlation matrix on the Python terminal, you will see the coefficients. You can then use the imshow() function to create the heatmap, passing the correlation matrix to it as the argument. The result shows a table with the coefficients, and the colors on the heatmap help you interpret the output, with different colors representing different ranges of values.

Final Thoughts

This is what you've learned in this article:

- what correlation is and the difference between the Pearson, Spearman, and Kendall correlation coefficients
- how to calculate the correlation coefficients with NumPy, SciPy, and Pandas
- how to visualize data, regression lines, and correlation matrices with Matplotlib
If you enjoyed this article, be sure to join my Developer Monthly newsletter, where I send out the latest news from the world of Python and JavaScript: