
A statistical problem I've been pondering:
The normalization of the variance (well, standard deviation) can be accomplished by dividing by the mean to produce the coefficient of variation. However, the normalization of the covariance is accomplished by dividing by the standard deviations. Why can't the covariance be normalized by dividing by the means, producing a coefficient of covariation (CCV)?
If I take, say, N=10 one-acre samples from a forest, each sample will contain some number of trees of each of two species, A and B. You'll get an average number of each species per acre, XA and XB. Each will also have a standard deviation (SD), sA and sB. You can calculate a coefficient of variation (CV) for each species as s/X, and make meaningful comparisons of their dispersion, irrespective (largely) of the relative magnitude of their means--i.e., the dispersion measures have been normalized. I just generated some fake data, and species A has a mean of 14.4 and a SD of 4.2 individuals/acre, while species B has a mean of 7.9 and a SD of 3.2 individuals/acre. The CVs of the species, however, are 0.3 and 0.4, respectively, showing that B has a higher relative dispersion than A, even though its SD is lower.
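As a quick check on the arithmetic above, here is a minimal Python sketch that takes the CV to be the SD divided by the mean and recomputes it from the summary statistics quoted in the paragraph. The function and variable names are my own, chosen only for illustration.

```python
# Summary statistics quoted in the post (individuals per acre).
mean_a, sd_a = 14.4, 4.2
mean_b, sd_b = 7.9, 3.2


def coefficient_of_variation(sd, mean):
    """Return the SD divided by the mean, a dimensionless dispersion measure."""
    return sd / mean


cv_a = coefficient_of_variation(sd_a, mean_a)
cv_b = coefficient_of_variation(sd_b, mean_b)

print(f"CV of species A: {cv_a:.2f}")
print(f"CV of species B: {cv_b:.2f}")
```

Rounded to one decimal place, these come out to the 0.3 and 0.4 given above.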
You can find the covariance between A and B as sum((Xi-XA)(Xj-XB))/N, where Xi is the number of species A in the ith acre and Xj is the number of species B in that same acre. (I realize the denominator may or may not need to be N-1 for arcane reasons, but let's not complicate this any more than it already is.) So far it still seems to be a measure of dispersion (or squared dispersion), although it's getting harder to say dispersion of what. This covariance reduces to the variance if one species is being compared against itself (sum((Xi-XA)^2)/N). For this made-up data, the covariance is -11.46 individuals^2/acre^2, which indicates that they negatively covary: When the population of A is high, B tends to be low (which I know because I set the data up that way). But exactly how strongly? Is it significant? The respective variances are 17.8 and 10.0 (obviously: the squares of the SDs), so that at least gives us some sense of the magnitude of the covariation, but nothing very precise. If there were a third species, C, with a covariance with A of -12, it wouldn't really be clear whether it was more or less covariant with A than B is, or by how much.
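The covariance formula above, and its reduction to the variance, can be sketched in a few lines of Python. The four-plot counts below are hypothetical numbers invented purely for illustration; they are not the fake data set described in this post, which isn't reproduced here.

```python
def covariance(xs, ys):
    """Population covariance: the mean product of paired deviations (divides by N)."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


# Hypothetical counts per plot, invented for illustration only.
counts_a = [1, 2, 3, 4]
counts_b = [4, 3, 2, 1]

cov_ab = covariance(counts_a, counts_b)  # negative: A is high where B is low
var_a = covariance(counts_a, counts_a)   # a species against itself gives its variance

print(cov_ab, var_a)
```

Passing the same list twice is all it takes to get the variance, which is the special case mentioned above.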
I know that the standard practice to normalize the covariance is to divide by the product of the SDs (sAsB) and generate a correlation coefficient (r=sAB/(sAsB)=-0.8 in this case--pretty strongly negatively correlated). I also understand the nice property of the correlation coefficient that it's bounded between -1 and 1 (the upper bound being the variance normalized by itself; r=sAA/(sAsA)=1), which allows you to say if something is absolutely correlated, inversely correlated, uncorrelated, or somewhere in between. But the CV has a nice intuitive feel to it, because you're normalizing the SD by something easily understandable: the mean. You're removing the central tendency component from a dispersion measure, leaving just the dispersion, and using both parameters of the Normal distribution. Conversely, while the correlation coefficient itself is fairly intuitive, the method of generating it doesn't seem as concrete: You divide something that looks like a dispersion squared by two dispersions and get...well, something pretty darn useful, but it doesn't seem like what one would try a priori. To answer the a priori question "How do I normalize these covariances?" (not necessarily the question "How do I generate a metric of correlation?"), it seems like one would try the same trick that worked before--dividing something dispersion-y by something central tendency-y. Interestingly, the result of doing so is not very intuitive.
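For concreteness, here is a small Python sketch of the standard normalization, using only the rounded summary numbers quoted in this post (covariance -11.46, SDs 4.2 and 3.2). The function name is my own. With these rounded inputs the result lands a little below -0.8, presumably because of rounding somewhere along the way.

```python
# Rounded summary statistics quoted in the post.
cov_ab = -11.46
sd_a, sd_b = 4.2, 3.2


def correlation(cov, sd_x, sd_y):
    """Correlation coefficient: the covariance divided by the product of the SDs."""
    return cov / (sd_x * sd_y)


r_ab = correlation(cov_ab, sd_a, sd_b)
# A variable against itself: the variance over the squared SD, which should be 1.
r_aa = correlation(sd_a ** 2, sd_a, sd_a)

print(f"r(A, B) = {r_ab:.2f}")
print(f"r(A, A) = {r_aa:.2f}")
```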
To put it another way, if you can normalize the variance/SD in two ways (by dividing by the SD, or by the mean, generating the correlation coefficient and coefficient of variation, respectively), why can you only normalize the covariance/co-SD one way (by dividing by the SDs)? Where's the fourth entry in this table--the memristor, if you will? The covariance (sum((Xi-XA)(Xj-XB))/N) reduces to the variance (sum((Xi-XA)^2)/N) as a special case; the correlation coefficient (sAB/(sAsB)) reduces to unity (sAA/(sAsA)); it seems that some coefficient of covariation (sign(sAB)(|sAB|/XAB)^0.5, where XAB is the product of the means XA and XB) should exist for which the CV ((sAA/XAA)^0.5=sA/XA) is the special-case reduction. (The CCV for this made-up data was -0.32.)
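Here is a rough Python sketch of the CCV as I've written it above: the signed square root of the absolute covariance over the product of the means. The function name is mine, and the check that both means are positive is my own assumption (true for counts of trees), not something taken from any established definition.

```python
import math

# Rounded summary statistics quoted in the post.
mean_a, sd_a = 14.4, 4.2
mean_b = 7.9
cov_ab = -11.46


def coefficient_of_covariation(cov, mean_x, mean_y):
    """Signed square root of |covariance| divided by the product of the means."""
    scale = mean_x * mean_y
    if scale <= 0:
        # Assumption: both means are positive, as with counts per acre.
        raise ValueError("the product of the means must be positive")
    return math.copysign(math.sqrt(abs(cov) / scale), cov)


ccv_ab = coefficient_of_covariation(cov_ab, mean_a, mean_b)
# Special case: species A against itself should give back the ordinary CV.
ccv_aa = coefficient_of_covariation(sd_a ** 2, mean_a, mean_a)

print(f"CCV(A, B) = {ccv_ab:.2f}")
print(f"CCV(A, A) = {ccv_aa:.2f}, CV(A) = {sd_a / mean_a:.2f}")
```

Unlike r, nothing in this construction bounds the result to any fixed interval, which may be part of why it is harder to interpret.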
I'm not trying to use the CCV for anything in particular; r works perfectly well. I'm just wondering what the heck (if anything) this CCV thing is or would be, what its properties are, if it's been tried before, if it's called something else, if it's discussed anywhere, if anyone's heard of it, thought of it, tried it out, played with it, etc. None of my statistics books seem to mention it or anything like it, and Web of Science returned zero results. The one paper I found through Google, by a South African scientist and published in the Pakistan Journal of Applied Sciences, is incomplete in the available online PDF (missing the middle two of its five pages, where presumably some important math occurs), is unavailable to my interlibrary loan office from any source, and the author hasn't replied to my emails yet. But I can't see why it isn't mentioned more, since a priori it seems so much like the obvious way to approach things. It was certainly the first thing that popped into my head when I was sitting there tiredly looking at a few covariances I happened upon and wondering how to legitimately compare them. Normally, if I were trying to correlate something I would have gone straight to, duh, the correlation coefficient, without ever explicitly thinking about the covariances. But when unexpectedly confronted with covariances, I thought to myself, "Self, I need to normalize these. How? Well, covariances are, as far as I can tell, mathematical generalizations of variances. You normalize variances with the mean; I should be able to normalize these against an appropriately 'combined' mean--something like the square root of the product of the means." Google "coefficient of covariation" and, sure enough, the predominant hit is an (incomplete) paper containing precisely this equation.
Spend 20 minutes calculating these CCVs before doing a face-palm and realizing you're computing an obscure or non-existent metric for something that would've been perfectly obvious if you'd had more caffeine. Redo calculations, and spend the next week wondering what the heck you were calculating, and how two such seemingly similar objects can be normalized in such strangely different fashions. Spend too much time on Wikipedia getting your brain warped by excessive Greek notation.
Surely, if there's such a commonly utilized non-normalized matrix (which seems pretty useless otherwise), and there's a common normalization that applies to the diagonal of that matrix, somebody would have tried the mathematically simple analogization of that normalization, and it'd be mentioned pretty early in the literature. And if that obvious attempt is a mathematical or interpretational disaster, you'd think textbooks would give a warning as to why it isn't useful to do, or someone would have published on why exactly it doesn't work; how it violates some assumption of normality, or is a biased estimator of thus-and-such, or some other statistic-ish problem. But so far I haven't seen anything...
