A Comparison of Correlation Measures - Michael Clark


Michael Clark
Center for Social Research
University of Notre Dame

A Comparison of Correlation Measures

Contents

Preface
Introduction
Pearson Correlation
Spearman’s Measure
Hoeffding’s D
Distance Correlation
Mutual Information and the Maximal Information Coefficient
Linear Relationships
    Results
Other Relationships
    Results
        Less noisy
        Noisier
Summary
Appendix
    Mine Results
    Statistics: Linear Relationships
    MIC tends toward zero for independent data
    Statistics: Nonlinear Relationships
    Pearson and Spearman Distributions
    Hoeffding’s D Distributions
    Less Noisy Patterns
    Noisier Patterns
Additional Information


Preface
This document provides a brief comparison of the ubiquitous Pearson product-moment correlation coefficient with other approaches measuring dependencies among variables, and attempts to summarize some recent articles regarding the newer measures. No particular statistical background is assumed besides a basic understanding of correlation. To view graphs as they are intended to be seen, make sure that the ’enhance thin lines’ option is unchecked in your Acrobat Reader preferences, or just use another pdf reader.

Current version date May 2, 2013. Original April 2013. I will likely be coming back to this for further investigation and may make notable changes at that time.


Introduction
The Pearson correlation coefficient has been the workhorse for understanding bivariate relationships for over a century of statistical practice. Its easy calculation and interpretability mean it is the go-to measure of association in the overwhelming majority of applied practice.
Unfortunately, the Pearson r is not a useful measure of dependency in general. Not only does correlation not guarantee a causal relationship, as Joe Blow on the street is quick to remind you, but a lack of correlation does not even mean there is no relationship between two variables. For one, it is best suited to continuous, normally distributed data1, and is easily swayed by extreme values. It is also a measure of linear dependency, and so will misrepresent any relationship that isn’t linear, which occurs very often in practice.
Here we will examine how the Pearson r compares to other measures we might use for both linear and nonlinear relationships. In particular we will look at two measures: distance correlation and the maximal information coefficient.

Pearson Correlation

As a reminder, the sample Pearson r is calculated as follows:

cov_xy = ∑ (x_i − X̄)(y_i − Ȳ) / (N − 1), with the sum over i = 1, …, N

r_xy = cov_xy / √(var_x var_y)

In the above, we have variables X and Y for which we have N paired observations. The Pearson r is a standardized covariance, and ranges from −1, indicating a perfect negative linear relationship, to +1, indicating a perfect positive relationship. A value of zero suggests no linear association, but does not mean the two variables are independent, an extremely important point to remember. The graph to the right shows examples of different correlations with the regression line imposed. The following code will allow you to simulate your own.

library(MASS)
cormat = matrix(c(1, 0.25, 0.25, 1), ncol = 2)  # .25 population correlation
set.seed(1234)
# the empirical argument will reproduce the correlation exactly if TRUE
mydat = mvrnorm(100, mu = c(0, 0), Sigma = cormat, empirical = T)
cor(mydat)

##      [,1] [,2]
## [1,] 1.00 0.25
## [2,] 0.25 1.00

1 Not that that stops people from using it for other things.

[Margin figure: scatterplots of simulated y vs. x data with fitted regression lines for correlations of −0.75, −0.25, 0, 0.25, and 0.75.]


Spearman’s Measure

More or less just for giggles, we’ll also take a look at Spearman’s ρ. It is essentially Pearson’s r computed on the ranked values rather than the observed values2. While it would perhaps be of use when there are extreme values in otherwise normally distributed data, we don’t have that situation here. However, since it is a common alternative and takes no more effort to produce, it is included.

cor(mydat, method = "spearman") #slight difference

##        [,1]   [,2]
## [1,] 1.0000 0.1919
## [2,] 0.1919 1.0000

cor(rank(mydat[, 1]), rank(mydat[, 2]))

## [1] 0.1919

2 If you have ordinal data you might also consider using a polychoric correlation, e.g. in the psych package.

Hoeffding’s D

Hoeffding’s D is another rank based approach that has been around a while3. It measures the difference between the joint ranks of (X,Y) and the product of their marginal ranks. Unlike the Pearson or Spearman measures, it can pick up on nonlinear relationships, and as such would be worth examining as well.

library(Hmisc)
hoeffd(mydat)$D

##         [,1]    [,2]
## [1,] 1.00000 0.01162
## [2,] 0.01162 1.00000

Hoeffding’s D lies on the interval [-.5,1] if there are no tied ranks, with larger values indicating a stronger relationship between the variables.

Distance Correlation

Distance correlation (dCor) is a newer measure of association (Székely et al., 2007; Székely and Rizzo, 2009) that uses the distances between observations as part of its calculation.
If we define transformed distance matrices4 A and B for the X and Y variables respectively, each with elements (i, j), then the distance covariance is defined as the square root of:

V²xy = (1/n²) ∑ Aij Bij

with the sum over i, j = 1, …, n.

3 Hoeffding (1948). A non-parametric test of independence.
4 The standard matrix of euclidean distances with the row/column means subtracted and grand mean added. Elements may be squared or not.


and the dCor as the square root of

R² = V²xy / (Vx Vy)

where Vx and Vy are the analogous quantities for X and Y each with itself.
Distance correlation satisfies 0 ≤ R ≤ 1, and R = 0 only if X and Y are independent. In the bivariate normal case, R ≤ |r| and equals one if r = ±1.
Note that one can obtain a dCor value for X and Y of arbitrary dimension (i.e. for whole matrices, one can obtain a multivariate estimate), and one could also incorporate a rank-based version of this metric.
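For illustration only (this is not the code used for the results later in the document, and the energy package is just one implementation choice), a minimal sketch computing dCor for the simulated data from earlier:

library(energy)                 # one implementation of distance correlation (assumed choice)
dcor(mydat[, 1], mydat[, 2])    # distance correlation for the simulated data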

Mutual Information and the Maximal Information Coefficient

Touted as a ’correlation for the 21st century’ (Speed, 2011), the maximal information coefficient (MIC) is based on concepts from information theory. We can note entropy as a measure of uncertainty, defined for a discrete distribution with K states as:
H(X) = − ∑ p(X = k) log2 p(X = k)

with the sum over the K states, k = 1, …, K. (A uniform distribution, where each state is equally likely, would have maximum entropy.)
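As a quick worked illustration (my own, not from the original), the entropy of a three-state distribution:

p = c(0.2, 0.3, 0.5)
-sum(p * log2(p))    # about 1.49 bits
log2(3)              # about 1.58 bits, the maximum possible for three (equally likely) states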
Mutual Information, a measure of how much information two variables share, then is defined as:

I(X; Y) = H(X) + H(Y) − H(X, Y)

or in terms of conditional entropy:

I(X; Y) = H(X) − H(X|Y)

Note that I(X; Y) = I(Y; X). Mutual information provides the amount of information one variable reveals about another for variables of any type; it ranges from 0 to ∞ and does not depend on the functional form underlying the relationship. Normalized variants are possible as well.
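As a small worked example (not from the original document) of the identity above, applied to a simple 2×2 joint distribution:

pxy = matrix(c(0.3, 0.2, 0.1, 0.4), nrow = 2)      # joint probabilities p(X, Y)
H = function(p) -sum(p[p > 0] * log2(p[p > 0]))    # entropy in bits
H(rowSums(pxy)) + H(colSums(pxy)) - H(pxy)         # I(X; Y), about 0.12 bits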
For continuous variables, the problem becomes more difficult, but if we ’bin’ or discretize the data, it then becomes possible. Conceptually we can think of placing a grid on a scatterplot of X and Y, and assign the continuous x (y) to the column (row) bin it belongs to. At that point we can then calculate the mutual information for X and Y.

I(X; Y) = ∑ p(X, Y) log2 [ p(X, Y) / ( p(X) p(Y) ) ]

with the sum over the X, Y bins, and with p(X, Y) the proportion of data falling into bin X, Y, i.e. p(X, Y) is the joint distribution of X and Y.
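To make the binned estimate concrete, here is a rough sketch (my own code, not from the original document) of the naïve estimate for the simulated mydat from earlier; the choice of 10 bins per variable is arbitrary:

x_bin = cut(mydat[, 1], breaks = 10)
y_bin = cut(mydat[, 2], breaks = 10)
pxy = table(x_bin, y_bin) / nrow(mydat)    # joint proportions p(X, Y)
px = rowSums(pxy)                          # marginal p(X)
py = colSums(pxy)                          # marginal p(Y)
terms = pxy * log2(pxy / outer(px, py))    # p(X,Y) log2[ p(X,Y) / (p(X)p(Y)) ]
sum(terms[pxy > 0])                        # naive I(X; Y), ignoring empty bins

As noted next, with modest N this crude estimate will tend to overestimate the true mutual information.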
The MIC (Reshef et al., 2011) can be seen as the continuous variable counterpart to mutual information. However, the above is a naïve estimate, and typically will overestimate I(X; Y). With the MIC a search is done over various possible grids, and the MIC is the maximum I found over that search, with some normalization to make the values from different grids comparable:

MIC(X; Y) = max over XYtotal < B of I(X; Y) / log2(min(nX, nY))

In the above, I is the naïve mutual information measure for a given grid, which is divided by log2 of the lesser of the number of X or Y bins (nX and nY); XYtotal is the total number of bins, and B is some number, somewhat arbitrarily chosen, though Reshef et al. (2011) suggest a default of N^.6 or N^.55 based on their experiences with various data sets.
The authors make code available on the web5, but just for demonstration we can use the minerva library as follows6.

library(minerva)
mine(mydat, n.cores = 3)$MIC

##        [,1]   [,2]
## [1,] 1.0000 0.2487
## [2,] 0.2487 1.0000

MIC is on the interval [0, 1], where zero would indicate independence and a 1 would indicate a noiseless functional relationship. With the MIC the goal is equitability: similar scores will be seen in relationships with similar noise levels regardless of the type of relationship. Because of this it may be particularly useful in high dimensional settings to find a smaller set of the strongest correlations. Whereas distance correlation might be better at detecting the presence of (possibly weak) dependencies, the MIC is more geared toward the assessment of strength and detecting patterns that we would pick up via visual inspection.

5 See the appendix regarding an issue I found with this function.
6 While the package availability is nice, the mine function in minerva is relatively slow (even with the parallel option as in the example).

Linear Relationships
Just as a grounding, we will start with an examination of linear relationships. Random (standardized) normal data of size N = 1000 has been generated 1000 times for population correlations of −.8, −.6, −.4, 0, .4, .6, and .8 (i.e. for each correlation we create 1000 x, y data sets of N = 1000). For each data set we calculate each statistic discussed above.
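For concreteness, here is a rough sketch of one cell of that design (not the code used for the results; the energy package for dCor is an assumption, while hoeffd and mine are the functions shown elsewhere in this document), using only a handful of replications for brevity:

library(MASS); library(Hmisc); library(energy); library(minerva)

sim_stats = function(rho, n = 1000) {
  d = mvrnorm(n, mu = c(0, 0), Sigma = matrix(c(1, rho, rho, 1), 2))
  c(pearson  = cor(d)[1, 2],
    spearman = cor(d, method = "spearman")[1, 2],
    hoeffd   = hoeffd(d)$D[1, 2],
    dcor     = dcor(d[, 1], d[, 2]),
    mic      = mine(d)$MIC[1, 2])
}

set.seed(1234)
res = t(replicate(10, sim_stats(0.4)))  # e.g. 10 replications at rho = .4
round(colMeans(res), 3)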
Results
Distributions of each statistic are provided in the margin7 (squared values for the Pearson and Spearman measures). Means, standard deviations, and quantiles at the 2.5% and 97.5% levels are provided in the appendix. There is no real news to tell with the Pearson and Spearman correlations. The other measures pick up on the relationships in a manner we’d hope, and assign the same values whether the original linear relation is positive or negative. Interestingly, Hoeffding’s D and the MIC appear to get more variable as the values move away from zero, while the dCor gets less so. At first glance the MIC may seem to find something in nothing, in the sense that it doesn’t really get close to zero for the zero linear relationship. As seen later, it finds about the same value for the same pattern split into four clusters. This is a function of sample size: the MIC for independent data approaches 0 as N → ∞. See the supplemental material for Reshef et al. (2011), particularly figure S1. I also provide an example of MICs for different sample sizes and standard deviations in the appendix.

7 For this and the later similar graph, rotate the document for a better view.

[Margin figure, Linear Relationships: distributions of the Pearson (squared), Spearman (squared), Hoeffding’s D, dCor, and MIC statistics for population correlations of 0, ±.4, ±.6, and ±.8.]
The take home message at this point is that one should feel comfortable using the other measures for standard linear relationships, as one would come to the same relative conclusions as one would with the Pearson r. Note that for perfect linear relations all statistics would be 1.0.
Other Relationships
More interesting is a comparison of these alternatives when the relationship is not linear. To this end, seven other patterns are investigated to see how these statistics compare. In the margin are examples of the patterns at two different noise levels. The patterns will be referred to as wave, trapezoid, diamond, quadratic, X, circle and cluster.
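As an illustration (a sketch only, not the simulation code used for the results below), one such pattern, a noisy circle, with a few of the statistics computed on it; the noise level is an arbitrary choice:

set.seed(1234)
theta = runif(1000, 0, 2 * pi)
circ = cbind(x = cos(theta) + rnorm(1000, sd = 0.1),
             y = sin(theta) + rnorm(1000, sd = 0.1))
c(pearson = cor(circ)[1, 2],
  dcor    = energy::dcor(circ[, 1], circ[, 2]),  # energy package assumed
  mic     = minerva::mine(circ)$MIC[1, 2])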
Results
Distributional results for the distance correlation and MIC can again be seen in the margin; related statistics and distributions for those and the other measures can be found in the appendix. Neither Pearson’s r nor Spearman’s ρ finds a relationship among any of the patterns regardless of noise level, and they will no longer be a point of focus.
Less noisy
Hoeffding’s D shows very little variability within its estimates for any particular pattern, and does not vary much between patterns (means range 0 to .1). In this less noisy situation Hoeffding’s D does pick up on the quadratic pattern relative to the others, followed by the X and circle patterns, but in general these are fairly small values for any pattern.
The dCor and MIC show notable values for most of these patterns. The dCor finds the strongest relationship for the quadratic function, followed by the X and circle, with the trapezoidal and wave patterns in a group, and the cluster pattern near zero. The MIC finds a very strong relationship for the wave pattern, followed by the quadratic, about the same for the circle and X patterns, and the rest filling out toward its lower values. Just as before, the MIC won’t quite reach 0 for N = 1000, so those last three are probably reflective of little dependence according to the MIC.
Noisier
With more noise, Hoeffding’s D does not pick up on the patterns well; the means now range from 0 to .02. The dCor maintains its previous ordering, although in general the values are smaller, or essentially the same in the case of the trapezoidal and cluster patterns for which strong dependency was not uncovered previously. The MIC still finds a strong dependency in the wave pattern, as well as the X and circle, but the drop-off for the quadratic relationship is notable, and it is now deemed less of a dependency than the circle and X patterns. The remaining patterns are essentially the same as in the less noisy situation. This suggests that for some relationships the MIC will produce the same value regardless of the amount of noise, and that the noise level may affect the ordering of the strength one would see across different relationships.

[Margin figure, Nonlinear Relationships: distributions of dCor and MIC for each pattern (legend: circle, cluster, quadra, trap1, trap2, wave, X) at the less noisy and noisier levels.]

Summary
Pearson’s r and similar measures are not designed to pick up nonlinear relationships or dependency in a general sense. Other approaches such as Hoeffding’s D might do a little better in some limited scenarios, and a statistical test8 for it might suggest dependence in a particular nonlinear situation. However, it appears one would not get a good sense of the strength of that dependence, nor would various patterns of dependency be picked up. It also does not have the sort of properties the MIC attempts to provide.
Both distance correlation and the MIC seem up to the challenge of finding dependencies beyond the linear realm. However, neither is perfect, and Kinney and Atwal (2013) note several issues with the MIC in particular. They show that the MIC is actually not equitable9, the key property Reshef and co. were claiming, nor is it ’self-equitable’ (neither is dCor). Kinney and Atwal also show, and we have seen here, that variable noise may not affect the MIC value for certain relationships, and they further note that one could have a MIC of 1.0 for differing noise levels.
As for distance correlation, Simon and Tibshirani (2011; see the additional information section for a link to their comment) show that dCor exhibits more statistical power than the MIC. We have also seen that it will tend to zero even with smaller sample sizes, and that it preserved the ordering of dependencies found across noise levels. Furthermore, dCor is straightforward to calculate and not an approximation.

8 The hoeffd function does return a p-value if interested.
9 The R² measure of equitability Reshef et al. were using doesn’t appear to be a useful way to measure it either.
In the end we may still need to be on the lookout for a measure that is both highly interpretable and possesses all the desirable qualities we want, but we certainly do have viable measures already. Kinney and Atwal (2013) suggest that the issues surrounding the mutual information I that concern Reshef et al. were more a sample size issue, and that those difficulties vanish with very large amounts of data. For smaller sample sizes and/or to save computational costs, Kinney and Atwal suggest dCor would be a viable approach.
I think the most important conclusion to draw is to try something new. Pearson’s r simply is not viable for understanding a great many dependencies that one will regularly come across in one’s data adventures. Both dCor and mutual information seem much better alternatives for picking up on a wide variety of relationships that variables might exhibit, and can be as useful as our old tools. Feel free to experiment!