Correlations and Studies

Some studies of which we hear or in which we indulge are simply designed to find something out. How many citizens are overweight? How many teenagers smoke? The techniques we characterized in the preceding topic are typically used to answer these questions. But often, if not incessantly, we are inundated with reports that tell us, for example, that consumption of a particular food is found to be related to a decrease in heart attacks. Many persons take astrology seriously to at least some degree. This involves the belief that there is some sort of correlation between astronomical phenomena at the time of our birth and what befalls us or what character we have. So let's turn to a study of correlations.

2.1 / The Basics of Correlations

In some cases phenomena vary together in a very direct way. Consider, for example, atmospheric pressure and altitude. But more often we find only what might be characterized as a tendency for phenomena to vary together. Weight and caloric intake are related, but we cannot simply say that the more you eat the more you will weigh.

We noted above that, at least in some cases, when we find correlations we suspect that there is a causal relation. But it is important to recognize that the notion of a correlation is a general one. A correlation is not in and of itself a causal relation. Let us look at a simple case to see exactly what correlations are. Suppose we are looking at a population which consists of round objects and square objects. The objects are either yellow or green. Here is the distribution in our population of 100 objects:
 
40 round and yellow
20 round and green
15 square and yellow
25 square and green

In this population being round is positively correlated with being yellow. This is not the claim that there are more round yellows than there are square yellows. It is rather the claim that the percentage of rounds that are yellow is greater than the percentage of squares that are yellow. Notice that we could equally have said that being green is positively correlated with being square. The correlation relation is symmetric (see topic 9). And here there is a negative correlation between being round and being green. Given the structure here where we have two alternatives this is equivalent to our statement that there is a positive correlation between being round and being yellow.
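The comparison just described can be checked mechanically. The following is a minimal sketch in Python using the counts given above; the helper name `share` is only an illustrative choice.

```python
from fractions import Fraction

# Counts taken from the population of 100 objects described above.
counts = {
    ("round", "yellow"): 40,
    ("round", "green"): 20,
    ("square", "yellow"): 15,
    ("square", "green"): 25,
}

def share(shape, color):
    """Fraction of the objects of the given shape that have the given color."""
    total = sum(n for (s, _), n in counts.items() if s == shape)
    return Fraction(counts[(shape, color)], total)

round_yellow = share("round", "yellow")    # 40 out of 60 rounds
square_yellow = share("square", "yellow")  # 15 out of 40 squares

# Positive correlation between round and yellow: compare the two percentages,
# not the raw counts of round yellows and square yellows.
positively_correlated = round_yellow > square_yellow
print(float(round_yellow), float(square_yellow), positively_correlated)
```

The same helper shows the symmetric claim: `share("square", "green")` is 25/40, which exceeds `share("round", "green")` at 20/60.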

Be very careful, in thinking about correlations, to make sure you understand what kinds of evidence do not at all support any claim about correlations. Suppose, for example, you are told that 70% of those who use cocaine have previously smoked marijuana. Does this show that there is a positive correlation between smoking marijuana and using cocaine? The answer is no. To show that there was such a correlation we would need to know that the percentage of those who have smoked marijuana who use cocaine is higher than the percentage of those who have not smoked marijuana who use cocaine. Here is one population that illustrates that there need not be a correlation.
 
 
85 marijuana no cocaine
7 marijuana cocaine
2 no marijuana no cocaine
3 no marijuana cocaine

Note that 70% of the cocaine users have smoked marijuana. But here there is a positive correlation between not having smoked marijuana and using cocaine. 3/5 of those who have not smoked are cocaine users while only 7/92 of those who have smoked are cocaine users. Notice that the above distribution might well be reported by the claim that more than twice as many cocaine users have smoked marijuana as cocaine users who have not. These kinds of claims do not support any claim that there is a correlation either. For all I know there might be a correlation between pot and cocaine, but statistics such as those cited have not the slightest tendency to show this.
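A short sketch, again in Python and using only the counts listed above, makes the contrast between the two kinds of percentage explicit. The variable names are illustrative.

```python
from fractions import Fraction

# Counts from the illustrative population above.
counts = {
    ("marijuana", "cocaine"): 7,
    ("marijuana", "no cocaine"): 85,
    ("no marijuana", "cocaine"): 3,
    ("no marijuana", "no cocaine"): 2,
}

# The reported statistic: what share of cocaine users have smoked marijuana?
cocaine_users = counts[("marijuana", "cocaine")] + counts[("no marijuana", "cocaine")]
share_of_users_who_smoked = Fraction(counts[("marijuana", "cocaine")], cocaine_users)

# What a correlation claim needs: the rate of cocaine use within each group.
smokers = counts[("marijuana", "cocaine")] + counts[("marijuana", "no cocaine")]
non_smokers = counts[("no marijuana", "cocaine")] + counts[("no marijuana", "no cocaine")]
rate_among_smokers = Fraction(counts[("marijuana", "cocaine")], smokers)
rate_among_non_smokers = Fraction(counts[("no marijuana", "cocaine")], non_smokers)

print(share_of_users_who_smoked)                   # share of users who smoked
print(rate_among_smokers, rate_among_non_smokers)  # rates within each group
```

The first figure is high even though the rate of cocaine use is lower among the smokers than among the non-smokers, which is why the first figure alone tells us nothing about the correlation.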

We now need some way of specifying the strength of a correlation. The strength is typically reported as a number in the interval from -1 to +1, with 0 indicating no correlation. One simple idea, not normally used in practice, is to subtract one proportion from the other. In the example we looked at above, 2/3 of the rounds were yellow and 1/3 were green. We would, utilizing this measure, say that the strength of the positive correlation between being round and being yellow is +.333. There is a negative correlation between being round and being green, and its strength is -.333.

Of course we are typically looking at evidence based on a sample. Let us just take a quick look at some possible situations. We will utilize a .95 confidence level and start with a sample size of 1000. Recall our quick chart from the preceding topic:
 
 
 

N (Sample Size)    Margin of Error (p=.5, 95% Confidence Level)
100                ±.1
500                ±.045
1000               ±.032
2000               ±.022
10000              ±.01

Out of our sample we find that 500 have used marijuana and 500 have not. Of the 500 who have used marijuana, 100 use cocaine and 400 do not. (Notice that obtaining a reliable sample would be difficult. Perhaps people might be honest with respect to the question as to whether they have used marijuana. But it is not very plausible to suppose that they will be honest with respect to whether they use cocaine. We shall ignore this problem. One way in which people attempt to sidestep it is to guarantee anonymity, presuming that this will make people more honest.)

In doing this sampling we have examined a sample of 500 marijuana smokers. In this sample 100, that is, 20%, are cocaine users. The margin of error at this confidence level is 4.5%. So between 15.5% and 24.5% of the marijuana smokers are cocaine users. To decide whether or not there is a correlation we need to know how many of those who have not smoked marijuana are cocaine users. Let us suppose that there are 50, that is to say 10% of the 500 non-smokers. So we can say that between 5.5% and 14.5% of those who have not smoked marijuana are cocaine users. Here there is a positive correlation. We cannot say precisely what it is but we do know that it is quite weak. With our .95 confidence level it could be as small as .01 (the difference between 15.5% and 14.5%) but only as high as .19 (the difference between 24.5% and 5.5%).

Suppose instead we had found 100 cocaine users in both the smoking and non-smoking groups. Notice that here we could have no correlation, a weak positive correlation or a weak negative correlation. That is, our data is compatible with 20% of both smokers and non-smokers being cocaine users, in which case there would be no correlation. Or there could be a positive correlation if 24.5% of the smokers, for example, are cocaine users and only 19.5% in the non-smoking population. Or there could be a negative correlation were those numbers reversed. But the evidence we have does not allow us to make any of these particular judgements.
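The interval arithmetic in this example can be sketched as follows. This is a rough illustration in Python: the margins are simply copied from the chart above (the chart's p=.5 margin is applied to both groups, as the text does), and the function names are assumptions of the sketch.

```python
# Margins of error at the .95 confidence level, copied from the chart above.
MARGIN = {100: 0.1, 500: 0.045, 1000: 0.032, 2000: 0.022, 10000: 0.01}

def interval(hits, n):
    """Observed proportion hits/n, widened by the chart's margin for sample size n."""
    p = hits / n
    m = MARGIN[n]
    return (p - m, p + m)

def difference_range(a, b):
    """Smallest and largest possible value of (a - b) for intervals a and b."""
    return (a[0] - b[1], a[1] - b[0])

# First case: 100 of 500 smokers and 50 of 500 non-smokers use cocaine.
smokers = interval(100, 500)      # roughly .155 to .245
non_smokers = interval(50, 500)   # roughly .055 to .145
low, high = difference_range(smokers, non_smokers)
print(round(low, 3), round(high, 3))

# Second case: 100 cocaine users in both groups. The range now straddles 0,
# so the data cannot distinguish positive, negative and no correlation.
equal_low, equal_high = difference_range(interval(100, 500), interval(100, 500))
print(round(equal_low, 3), round(equal_high, 3))
```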

More on Correlations

More commonly a rather more complex means of identifying the strength of a correlation is used. We will now turn to contexts in which the data we have is quantitative (height and weight, for example) so that we can utilize our old friends, the mean and the standard deviation, and one new concept, that of covariance. We introduced the notion of variance in a situation where we were concerned with only one feature. Covariance extends that notion to two variables, Xi (height, for example) and Yi (weight, for example). Covariance is then:
 
Covariance

$$\mathrm{Covariance}(X,Y) = \frac{\sum_{i} (X_i - \mu_X)(Y_i - \mu_Y)}{N}$$

Let's use rXY for "the correlation or relation between X and Y". Now we can introduce the correlation coefficient:
 
Correlation Coefficient (a computational formula)

$$r_{XY} = \frac{\mathrm{Covariance}(X,Y)}{\sigma_X \, \sigma_Y}$$

That is, we divide the covariance by the product of the standard deviations. This is a pretty dreadful formula. Another very common way to compute the correlation coefficient is to do the following. We spoke of standard deviation units, which we ended up calling z scores, in the preceding topic. When we use those we always have a mean of 0 and a standard deviation of 1. Our two variables are then measured in those units. Given this, we can use the following visibly simpler formula:
 
 

Correlation Coefficient (a computational formula)

$$r_{XY} = \frac{\sum_{i} (Z_{X_i} \times Z_{Y_i})}{N}$$

Here ZXi is the distance of Xi from the mean expressed as a z score. The computations are still complex if done by hand.
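To spare the hand computation, here is a minimal sketch in Python of both routes to the correlation coefficient. It assumes the standard deviation is the population version (dividing by N), so that it matches the division by N in the covariance formula. The height and weight figures are invented purely for illustration, and all function names are hypothetical.

```python
from math import sqrt

def mean(values):
    return sum(values) / len(values)

def std_dev(values):
    """Population standard deviation (divides by N; an assumption of this sketch)."""
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / len(values))

def covariance(xs, ys):
    """Average product of the paired deviations from the two means."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def r_from_covariance(xs, ys):
    """Covariance divided by the product of the standard deviations."""
    return covariance(xs, ys) / (std_dev(xs) * std_dev(ys))

def z_scores(values):
    m, s = mean(values), std_dev(values)
    return [(v - m) / s for v in values]

def r_from_z_scores(xs, ys):
    """Average product of the paired z scores."""
    zx, zy = z_scores(xs), z_scores(ys)
    return sum(a * b for a, b in zip(zx, zy)) / len(xs)

# Hypothetical heights (cm) and weights (kg), made up for this example.
heights = [150, 160, 165, 170, 180]
weights = [52, 58, 63, 70, 77]

r1 = r_from_covariance(heights, weights)
r2 = r_from_z_scores(heights, weights)
print(round(r1, 4), round(r2, 4))  # the two routes agree up to rounding
```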

Here is one simple example. Suppose we have given a standardized test to the 10 students in our class. Our interest is in whether there is any correlation between age and test performance. We will present the data using z scores.
 
Student    Age     Mark
1          -1      -.5
2          -.5     +1
3          -.5     -1.5
4          0       +1
5          0       -1
6          +.5     +.5
7          +.5     -.5
8          +1      +1
9          +1.5    +2
10         +2      +1

Recall that we are using z scores. So student 1's age is one standard deviation younger than the mean student age. Student 10's mark is one standard deviation above the mean student mark. Our formula tells us to calculate (I have omitted the ones which are 0):

(-1 × -.5) + (-.5 × 1) + (-.5 × -1.5) + (.5 × .5) + (.5 × -.5) + (1 × 1) + (1.5 × 2) + (2 × 1)

and divide by the number of cases, in this case 10. The result is .675. But of course this bare number, while it does indicate a positive correlation, is not self-interpreting. This is, for example, a very small class. We will later look at some of the ways of assessing the significance of such results.
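The arithmetic for the class of 10 can be reproduced with a few lines of Python. This sketch takes the tabulated age and mark values exactly as given and applies the z score formula to them.

```python
# Age and mark values for students 1 through 10, as tabulated above.
ages = [-1, -0.5, -0.5, 0, 0, 0.5, 0.5, 1, 1.5, 2]
marks = [-0.5, 1, -1.5, 1, -1, 0.5, -0.5, 1, 2, 1]

# Multiply each student's age value by their mark value, then average.
products = [a * m for a, m in zip(ages, marks)]
r = sum(products) / len(products)
print(sum(products), r)
```

Students 4 and 5 contribute nothing to the sum because their age value is 0, which is why the text omits them.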

Correlations and Causality

Tobacco companies conceded quite early on that there was a correlation between smoking and lung cancer. But they denied that there was a causal relation. We have already conceded that they could have been right. There was a correlation between shape and color but, presumably, there was no causal or explanatory relation between those two features. That would remain true even if the correlation had been 1.

There are a number of points to keep in mind. Correlations are indifferent to temporal order. The shape does not come before the color. Under our standard way of thinking, causes come before their effects. But do not make the mistake of thinking that where we have two correlated phenomena one of which precedes the other we can automatically conclude that the earlier phenomena are causes of the later phenomena. To proceed in such a fashion is to argue fallaciously.
 
 

Fallacy of Post Hoc ergo Propter Hoc

Arguing that phenomena are causally related simply on the basis that they are correlated and that one precedes the other.

What is sometimes taken as a particularly striking example of this occurred on a radio talk show quite recently. One of the topics under discussion had been the use of fluorides. A caller announced, with some distress, that her mother had had a fluoride mouthwash and then had died four days later. The clear suggestion, though it was not made explicitly, was that there was some causal connection. Clearly the caller had taken notice that one unusual event was followed by another unusual event. However nothing was said which indicated that the caller had any beliefs as to whether there is any correlation between fluoride treatments and deaths. So we will not take this as a case of this fallacy. It would be better viewed as a fallacious application of the method of difference.
 
Case    Circumstances                            Phenomena
1       Ordinary Routine save Fluorides Taken    Death
2       Ordinary Routine                         Life

We are not in a position to assert for sure that the fluoride treatment played no role. But I would count this as a fallacious application of the method of difference because here we are in a situation where there are a number of alternative techniques available for investigating the death and a number of standard causes of death that we need to consider. There seems no reason initially even to consider the fluorides. This is particularly true as there is no (known) correlation between the treatment and deaths. Eclipses of the sun are quite rare events. Suppose the caller had said that there had been an eclipse and that her mother had died four days after seeing an eclipse. We would hardly take that as a reason to suspect the eclipse.

Let us grant that there is a correlation between smoking marijuana and then using cocaine. If we were, on that basis alone, to assert that smoking marijuana causes cocaine use, we would be committing the fallacy that we have labelled post hoc ergo propter hoc. This has nothing to do with the character of this particular example. It simply reflects the fact that there are far more correlations than there are causal relationships. The only legitimate conclusion we can draw on the basis of this kind of correlational evidence is that there may be a causal relationship.

The way in which this example was stated reveals another problem. The temporal relation between phenomena is not always clear. (One might call this a chicken-egg problem.) Suppose that 'behaving badly' (b in the chart below) is correlated with physical punishment (p in the chart below). Of course it might well be true that a first punishment might follow a first bit of bad behaviour. But that does not really help us at all. During any given time period they both may occur. Which is 'first' will depend simply on our choice of these periods. Look at the following sequence:
 
b, p, b, p, b, p, b, p, b

There is another possibility with respect to such sequences. b might be causally related to p, and p to b. For example eating more might cause one to weigh more, and weighing more (if, for example, one ceased to care about one's weight) might contribute to one's eating more. The presence of such sequences does not of course guarantee that there is any causal relation.

In some cases there may be an extraordinarily high correlation but no causal relationship. One not uncommon circumstance in which this might happen is when the phenomena that are correlated have a common cause. Storms and falling barometers are correlated. But neither causes the other. There is an underlying atmospheric condition responsible for both. One of the early claims of tobacco companies, after they had conceded the presence of a correlation, was that there might well be some underlying causal condition which both disposed people toward smoking and toward lung cancer. In the absence of any other evidence or any other kind of evidence this claim is correct. What later happened was that various studies were undertaken which indicated that there was no such underlying condition.

And we may discover correlations that are to be explained in a way other than what we initially think. Suppose we find that women pay more than men for car repairs for any particular kind of repair. There is a positive correlation between being a woman and paying more. We might then be inclined to believe that repair shops are more inclined to rip off women than they are men, or that men are better bargainers. But if we were to investigate further we might find the following. There is one group of repair shops, which are expensive, and another group which are 'reasonable'. For whatever reason women go more often to the more expensive shops. But at each shop men and women are treated equally. So our speculative explanation regarding repair shops being more inclined to rip women off is simply misguided. It is perhaps true, and I will suppose true, that the number of churches in any given area is correlated with the number of murders in that area. Cities have more murders and more churches. Both of these examples are examples of what are often called spurious correlations. But this term is very misleading. It is not that the correlation is spurious in the sense of not being there. It is that we draw inappropriate conclusions from the correlations. Or that the correlation in question is irrelevant to the hypothesis which we are considering.
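The car repair case can be made concrete with a small sketch in Python. Every figure here is hypothetical, chosen only to show how the overall averages can differ even though each shop charges men and women the same price.

```python
# Hypothetical visits as (shop, customer, price paid). Each shop charges
# everyone the same price; only the mix of customers differs between shops.
visits = (
    [("expensive", "woman", 400)] * 8 + [("expensive", "man", 400)] * 2 +
    [("reasonable", "woman", 250)] * 2 + [("reasonable", "man", 250)] * 8
)

def average_price(customer, shop=None):
    """Average price paid by a kind of customer, optionally within one shop."""
    prices = [p for s, c, p in visits
              if c == customer and (shop is None or s == shop)]
    return sum(prices) / len(prices)

# Overall, women pay more on average than men ...
overall_women = average_price("woman")
overall_men = average_price("man")
print(overall_women, overall_men)

# ... yet within each shop the two averages are identical.
for shop in ("expensive", "reasonable"):
    print(shop, average_price("woman", shop), average_price("man", shop))
```

The correlation between being a woman and paying more is really there in these made-up figures. What fails is the inference that the shops treat women differently.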

What happens if we find that there is no correlation? Should we conclude that there is no causal relation? We should not do this. Our correlations capture only what we might call linear relations. But consider the following situation (presented using z scores):
 
Cases    X     Y
1        -2    +2
2        -1    0
3        0     0
4        +1    0
5        +2    +2

Here the correlation is 0. But there are situations like this where there is nonetheless a causal relation. Think of a measuring device that responds to a very light weight and a very heavy weight but does not respond to intermediate weight. The weight is causally relevant but the relation between weights and responses is not a linear one. Plants might die with too little sunlight or too much sunlight but grow much the same with intermediate amounts of sunlight. A somewhat different sort of case is one in which the ingestion of small amounts of a substance is only minimally related to an increase in the rate of incidence of a certain disease, but in which large amounts are dramatically related. This would not be a strong correlation from our current point of view either, but clearly there might be a causal relation. The primary moral of all this is that the absence of the sort of correlation obtained with our measures does not show the absence of causal relationships.
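A quick check in Python, applying the z score formula to the five cases tabulated above, confirms that the products cancel exactly even though Y plainly depends on how far X is from the middle.

```python
# X and Y values for cases 1 through 5, as tabulated above.
xs = [-2, -1, 0, 1, 2]
ys = [2, 0, 0, 0, 2]

# Average product of the paired values.
products = [x * y for x, y in zip(xs, ys)]
r = sum(products) / len(products)
print(products, r)

# Y responds only at the extremes of X: a definite pattern, but not a linear one.
responds = {x: y != 0 for x, y in zip(xs, ys)}
print(responds)
```

The products for cases 1 and 5 are equal in size and opposite in sign, so they cancel, and the formula reports no correlation.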

We have talked much of problems regarding the relation of correlations and causes. But it must be emphasized that we are not denying that the presence of the correlation gives us reason to suspect that there might be a causal relation. Nor are we denying that we might have sufficient reason to investigate further. All we are saying is that such evidence does not in and of itself justify us in claiming that there is a causal relationship. Again we must ensure that our conclusion is proportionate to the evidence at hand. And we should go on to look at some of the other techniques available which will enable us to provide good evidence for our hypotheses.