| We often think of experimentation and testing as central
features of the pursuit of science. That is certainly true. But it is also
true that experimentation, at least at a simple level, occurs on a
day-to-day basis. Let us turn to a brief examination of experimentation.
Experimental Studies
Many experimental studies are undertaken in the following circumstances. We have found a correlation between F and G, and want to find out whether there is a causal relation. Or we might suspect that there is a correlation, but not yet know. We have seen that we are rarely in a position to examine whole populations. So we have relied upon sampling. Ideally, when we undertake an experimental study we proceed as follows. We randomly select a test group from the population with which we are concerned. What population we are concerned with depends, of course, on the question we have. If we are interested in whether a particular vaccine is effective against a childhood disease, our population would be children who do not have and have not had the disease. If we are interested in the efficacy of a particular treatment, we would choose from a population of those who suffer from the illness or condition with which we are concerned. How large a sample we choose depends upon all sorts of considerations, not all of which are related to such factors as a desired margin of error. (We might have very limited funding, for example.) Our sample is then divided randomly into two groups. One group is called the control group, the other the experimental group. One might wonder why it is important to get a random sample. Suppose that we are testing a vaccine. There might well be some characteristic I such that those with that characteristic are immune to the disease in question. Let us suppose that there is such a factor, but that we have no feasible way of distinguishing those with and those without I. By randomly choosing the study group and then randomly putting subjects into the experimental and control groups we make it unlikely, in so far as we can, that markedly more people in the experimental group will have I than will people in the control group.
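The two random steps just described, selecting the study group and then dividing it, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the population of 10,000 numbered subjects, the 20% prevalence of the hidden characteristic I, the sample size of 1,000, and the function name `select_and_assign` are all hypothetical and are not drawn from any actual study.

```python
import random


def select_and_assign(population, sample_size, seed=None):
    """Randomly draw a sample, then split it into two equal groups."""
    rng = random.Random(seed)
    # random.sample draws without replacement and returns the subjects
    # in random selection order, so each slice is itself a random sample.
    sample = rng.sample(population, sample_size)
    half = sample_size // 2
    return sample[:half], sample[half:]


# Hypothetical population: 10,000 subject IDs (an assumption for illustration).
population = list(range(10_000))

# Hypothetical hidden characteristic I, carried by roughly 20% of subjects.
# The experimenters cannot observe it; we track it only to check the balance.
trait_rng = random.Random(0)
has_trait_i = {s for s in population if trait_rng.random() < 0.20}

experimental, control = select_and_assign(population, 1000, seed=1)

print("experimental group, subjects with I:",
      sum(s in has_trait_i for s in experimental))
print("control group, subjects with I:",
      sum(s in has_trait_i for s in control))
```

Running the sketch with different seeds should show the two counts staying close to one another, although nothing guarantees that they will be equal. That is the sense in which randomization guards against a hidden factor only "in so far as we can".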
Particularly if our population consists of humans, a number of additional techniques are frequently utilized. We often would not want our subjects to know which of the two groups, control or experimental, they were in. We call a study blind if the subjects do not know which group they are in. If a treatment involves, for example, taking a pill three times a day, the study could not be blind if we simply gave nothing to the people in the control group. So often something known to be irrelevant, a placebo, is given to those in the control group. A placebo is sufficiently similar to what is given to the subjects in the experimental group to prevent the subjects in the control group from telling that they are not members of the experimental group. There are many kinds of cases in which it is important to use a blind study. For example, the study may rely upon reports from the subjects. Clearly a subject might, even quite unintentionally, give one report if she believed she was a member of the experimental group, another if she believed that she was not. You most likely have noticed placebo effects in children. Sometimes they say they feel bad. You give them something, almost anything, emphasizing that it will make them feel better. Often they will report that it has made them feel better. And in certain other kinds of cases it is important that one use not just a blind study but what is called a double-blind study. Here the experimenters who are assessing the results are also kept uninformed as to which group the subjects are in. This is particularly important where assessment involves a judgment call. For example, the experimenter might have to judge whether a subject is 'more responsive'. Even the most conscientious of researchers can suffer from unconscious biases. Here are some aspects of a typical study, one now known as the BCPT (Breast Cancer Prevention Trial). As you know, breast cancer is a very serious problem for women.
For some time the drug tamoxifen has been used to treat some cases of breast cancer. (Indeed it is one of the most widely used treatments.) What tamoxifen does is to some extent known: it interferes with the activity of estrogen, a female hormone which promotes the growth of cancer cells. It was thought that the drug might also be effective in preventing breast cancer, because it seemed to inhibit the development of cancer in the other breast of women receiving the treatment. As you can imagine, the question of whether it inhibits the development of breast cancer is not something that can be determined by any quick glance. Here is a brief account of the way in which the study proceeded. (More complete information is available from the National Cancer Institute in the US.) The population to be studied was that of women at high risk for breast cancer. Previous studies and computer modeling were used in order to identify "high risk". Of course, it is not possible to simply go out to the population and randomly grab people and start experimenting. Persons must enter such trials willingly, after being fully informed of exactly how the study will proceed. That is, informed consent is required. Some 13,000 women were involved in the study. Three age groups were involved: 40% of the women were in the 35-49 age group, and 30% in each of the 50-59 and 60+ age groups. This was necessary as age is a factor in determining risk. All the participants were identified as having a risk of getting breast cancer in the next five years equal to or greater than that of a woman aged 60+. In this group 17 out of 1000 can be expected to develop breast cancer within the next five years. The study was to last five years; most participants were tracked for four years. The study was a double-blind study. Approximately half of the women received a placebo, and the women's physicians did not know whether their patients were in the experimental or the control group.
The participants were placed in the groups via a random procedure. The preliminary results indicated the following. In the control group 154 women developed invasive breast cancer, whereas 85 women in the experimental group developed it. Notice that the absolute numbers here are very small. This, of course, is one reason why such a large-scale study was required. But the result is nonetheless a significant one. (Simply to report that there was a 45% decrease does not show this. Such a percentage decrease is not always significant, though in this case it is. We will look at some of the ways in which significance is measured in 4.1.) The study results do not, of course, 'prove' that the drug is effective, but we now have very good reason to believe that it is. Should women now rush out and start taking the drug? Not necessarily, for there are some other considerations. For example, there are, as is not uncommon in such cases, side effects that have to be taken into account. Drugs like tamoxifen were antecedently known to increase the risk of endometrial cancer. As was expected, the study indicated that this risk was still there in this study, where the drug was used for prevention rather than cure. And there was as well an increased risk of blood clots. So at the very least those who contemplate using the drug have to weigh the benefits and risks. Further, this study focused only upon the high-risk population. We do not know, on the basis of the study, whether similar results would be obtained in other populations. The results obtained were sufficiently impressive for the experimenters to make the study 'unblind'. So now all participants can decide for themselves whether to use the drug. One further point has to be mentioned. If an experiment is to be of any value, care has to be taken, in so far as it is possible, to ensure that the members of the two groups are not treated differently.
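The 45% figure reported for the BCPT can be checked with a little arithmetic, and doing so shows why a percentage decrease alone does not establish significance. The sketch below uses the case counts given in the text (154 and 85). The group size of 6,500 each is an assumption made only for illustration, since the text says just that some 13,000 women took part and that approximately half received a placebo. The two-proportion z statistic is one common way of gauging significance; it is not necessarily the measure the study's authors used.

```python
import math

# Case counts reported in the text.
control_cases = 154
experimental_cases = 85

# ASSUMPTION: equal groups of 6,500 (about half of "some 13,000" women).
group_size = 6500

control_rate = control_cases / group_size
experimental_rate = experimental_cases / group_size

# Relative decrease: how much lower the experimental rate is, as a
# fraction of the control rate.
relative_reduction = (control_rate - experimental_rate) / control_rate

# Absolute decrease: the plain difference between the two rates.
absolute_reduction = control_rate - experimental_rate

# Two-proportion z statistic using the pooled rate across both groups.
pooled = (control_cases + experimental_cases) / (2 * group_size)
std_error = math.sqrt(pooled * (1 - pooled) * (2 / group_size))
z = absolute_reduction / std_error

print(f"control rate:       {control_rate:.4f}")
print(f"experimental rate:  {experimental_rate:.4f}")
print(f"relative reduction: {relative_reduction:.1%}")
print(f"absolute reduction: {absolute_reduction:.4f}")
print(f"z statistic:        {z:.2f}")
```

With equal groups the relative reduction comes to roughly 45% whatever the group size, which matches the text, while the absolute difference under the assumed sizes is only about one percentage point. The z statistic, by contrast, does depend on the group size: the same 45% decrease observed in two groups of 100 would give a much smaller z, which is why the percentage by itself does not show significance.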
Oral contraceptives have certain kinds of effects that could have affected the results of the study. So steps were taken to ensure that none of the women were using these contraceptives. This is, of course, an aspect of experimentation in general. Experimenters try to ensure that there are no other differences present in the groups that might make a difference. In many cases the only way to do this is by way of using random procedures. That we have tried to ensure that there are no other differences is one of the reasons we take the results of experimental studies to indicate that we have causal relations. The BCPT is an example of a randomized experimental design. A slightly different kind of design is known as a prospective study. Here we start off with a population in which one portion has the feature that we suspect might be a causal factor. For example, we might look at police officers and non-police officers who have not yet had heart attacks. We then track them and see what the incidence of heart attacks is in the future. Notice one difference between this study and the BCPT. Here the subjects are initially different, that is, some are police officers and some are not. In the BCPT the women were all the same in the sense of being in the high-risk population. So the prospective study has the problem that it will not identify any factor which might dispose people both toward becoming a police officer and toward having a heart attack. That we initially find it implausible to suppose that there is a factor that disposes us both toward becoming a police officer and having a heart attack is not itself evidence that there is no such factor. So in a good study we try to ensure that those we study are as similar as possible precisely so as to guard against this problem. But this is very difficult to do. It should be clear that the randomized design would be preferable.
But we are not in a position to randomly select, say, a group of 18-year-olds and divide them into an experimental and a control group. We would then have to get the members of the experimental group to become police officers. This is neither feasible nor morally permissible. And consider replacing a prospective study of the health effects of smoking with a randomized one. Here we would, for example, obtain a sample of 12-year-olds and get the members of the experimental group to become heavy smokers. Even if this were feasible, it would be morally unacceptable. Prospective studies may be contrasted with retrospective studies. Here we look into the past. We look at a population at a given point. For example, we look at the population of persons who have lung cancer. We then attempt to determine how many of these were smokers and how many were not. Let us suppose that 60% of the patients were smokers. Note that at this point we cannot even draw a conclusion regarding a correlation. (We saw this when we considered the claim that a huge percentage of cocaine users had been marijuana smokers.) What is typically done is to sample those who do not have lung cancer and determine how many of them have smoked and how many have not. (Note that we would typically try to make the two groups as similar as possible in terms of such factors as age, weight and lifestyle.) If only 10% of those who do not have lung cancer have smoked, we would have some evidence for the dangers of smoking. But, as you can see, this approach has problems similar to those in prospective studies. In some cases animal experimentation is used. For example, animal experiments are typically used to determine whether or not a given substance is a carcinogen (a cancer-causing agent). One of the reasons that it is now taken for granted that smoking is causally related to lung cancer is that it was found, via animal experimentation, that 'tar' contained carcinogens. And tar is taken into the lungs when smoking.
Animals are not, of course, humans, and what happens to animals is not necessarily the same as what would happen to humans. So how to decide when such studies are justified and how to assess the results of such studies raise difficult issues which we cannot here undertake to resolve. Nonetheless, animal experiments have sometimes provided us with useful information which it would have been difficult, if possible at all, to obtain in any other way. This is not to say that animal experimentation is always legitimate. Deciding whether it is depends upon moral considerations, upon the value of finding out whatever it is the study is meant to find out, and upon numerous other considerations as well.
Some Experiments
Most of us think of experimentation, testing and careful observation as premier characteristics of science. They are, but it should not be forgotten that they play a role in ordinary life as well. Clearly in any experiment we attempt (in so far as it is possible) to ensure that there are no variables that make a difference, other than the ones in which we are interested. Suppose we are interested in whether a particular fertilizer enhances the growth of grass. We plant our grass and carefully fertilize one area while not fertilizing another area. Sure enough, the grass in the fertilized area grows well. We have a lush, well-rooted lawn. The grass in the other area is not nearly so nice. Is this a good experiment? Well, given the information we have, we cannot tell. For example, did we ensure that the soil in both areas is much the same? Did both receive the same amount of sunlight and water? What one typically attempts to do, as we in effect noted in the preceding section, is to control for these other variables which might be expected to have an effect. So one might use areas that are all sunny or areas that are all shady. Or one might break the plots down into sunny and shady areas for both the fertilized and unfertilized plots.
(Note that this might give us additional information.) Clearly it is difficult to assess a bare report of the result of an experiment if we do not know which, if any, potentially relevant variables were controlled for by the experimenters. This experiment was devised in order to test the hypothesis that the fertilizer was effective. But in many cases it might be more apt to characterize an experiment as oriented toward collecting data or toward finding something out rather than being oriented toward the testing of a hypothesis. Suppose we noticed the following. Grass tends not to grow well under pine trees. The rest of our yard has no pine trees and is rather sunny. We suspect that the cause might be either the shade or the soil. We might devise an experiment to find out. As an initial crude stab we could remove some of the soil from each portion of the yard, put it in containers and plant some grass in the containers. Some containers might be placed under the pine trees, some in the sunny portion of our yard. This experiment is not a particularly good one. But it should indicate that most of us do use experimental techniques when we are curious about something. Notice that there is no significant difference between experiments that test hypotheses and experiments that collect data. The primary difference is that we tend to speak of testing hypotheses when someone or another has put forward a claim. (Suppose I had been told that the soil under pine trees is very bad. I might then characterize the container experiment just described as testing the hypothesis that the soil is bad.) If no one has put forward a claim we tend to speak of ourselves as collecting data. Let us follow the history of one classic episode in the history of science. If you pull a piston up in a tube filled with water the water will 'follow' for around 33 feet. If you use this sort of pump you can only raise water around 33 feet. One question is simply, why does this happen at all?
During the seventeenth century the theory was that 'nature abhors a vacuum'. Given this, the water will rush upwards to fill the vacuum created by pulling up the piston. This might not be the sort of explanation with which we are familiar, but it is an explanation nonetheless. The scientist Evangelista Torricelli (a friend of Galileo in his later life) contemplated another explanation. Given that air has weight and that we live in, as it were, a sea of air, the weight of the air 'pushing down' would equal the weight of the liquid 'going up'. Note that here we have as well a potential explanation of the particular heights which we could expect. Torricelli proposed the following experiment. Take a glass tube closed at one end. Fill it with mercury. In some way close the open end of the tube (for example, hold your finger over the end), invert it and stand it vertically in a bowl of mercury. The mercury level in the tube should fall if Torricelli is right. Torricelli did not himself perform this experiment, but the device was soon used in further experiments. Blaise Pascal (after whom the programming language was named) proposed the following experiment. If there is a sea of air, then the weight of the air above us should decrease with altitude. If nature abhors a vacuum, then altitude should not matter. Our experiment will simply consist of carrying a barometer up a mountain. If there is such a thing as atmospheric pressure, the level of the mercury should decrease as we go upward. If nature abhors a vacuum, the level should not change. As it happens the height does change: the level goes down. The device we have described is of course the first barometer, a device we use to measure atmospheric pressure. Notice one additional feature of the sequence of events we have described: the inversion of the tube would seem to create a vacuum. If nature abhors a vacuum, why does the mercury fall at the outset? In some cases what are typically called crucial experiments are devised.
Typically we speak of crucial experiments when there are two important hypotheses or two general theories between which scientists wish to decide. The task is then to devise an experiment which should have one result if the one theory is correct, another if the other is correct. There need not be anything 'special' about the structure of a crucial experiment. We speak of it as crucial given the context in which it is carried out. The barometer experiment that we just discussed could well be described as a crucial experiment. But we have neglected one point. Did our experiment show conclusively that the atmospheric pressure hypothesis is correct? Not really, for after all perhaps nature's abhorrence of a vacuum is a function of altitude. This suggestion was made. It is an example of what is typically called an ad hoc hypothesis. We are not in a position to enter into any extended discussion of what makes for good experimental design. But before we turn from the subject, there is one final point that should be made. In most areas of science it is very important that experiments be replicable. That is, if the same experiment is carried out in another place it should yield the same results. This is in effect a means of guarding against errors that can creep in by accident or by design. Some time ago some scientists announced the discovery of what is called cold fusion. Fusion, one way of producing energy, is a thermonuclear reaction that is normally supposed to require very high temperatures if it is to take place. The experiment involved placing an electrode into heavy water. It was said that excess energy (in the form of heat) was produced. This report quickly received vast press attention. If there is such a thing as cold fusion we might well have a source of virtually cost- and pollution-free energy. The problem that arose was that the results did not, initially at any rate, seem to be replicable.
That is, the experiment when performed in other laboratories did not yield the excess. Of course, the fact that this happened does not resolve the question. Perhaps the further experiments involved some kind of mistake. But the consequence, after further investigation, was that mainstream scientists decided that there was no such thing as cold fusion. The controversy has not ended. A quick visit to the Internet will provide anyone with numerous reports regarding cold fusion and its immense promise. But it is not our task to decide. What we should note is that if there is such a thing as cold fusion then anyone who does follow the same procedures should get the same result. That is the primary force of the demand for replication. If the phenomenon is there, it should be found by anyone who does the same thing in the same way in circumstances which are the same. Note that anyone who carried a barometer up the mountain would obtain much the same result. |
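The barometer episode lends itself to a short worked calculation. On the 'sea of air' hypothesis, a column of liquid rises until its weight balances the pressure of the air, so its height is the pressure divided by the product of the liquid's density and gravitational acceleration. The sketch below applies this to water and to mercury, and then to a mountain. The constants are standard textbook values; the 1,000 m mountain and the simple exponential model of pressure with an 8,400 m scale height are assumptions chosen only for illustration, not figures from the historical experiments.

```python
import math

G = 9.80665                     # gravitational acceleration, m/s^2
SEA_LEVEL_PRESSURE = 101_325.0  # standard atmosphere, Pa
RHO_WATER = 1_000.0             # density of water, kg/m^3
RHO_MERCURY = 13_595.0          # density of mercury, kg/m^3

# ASSUMPTION: pressure falls off exponentially with altitude, with a
# scale height of about 8,400 m. This is a rough isothermal model.
SCALE_HEIGHT = 8_400.0          # m


def column_height(pressure, density):
    """Height (m) of a liquid column whose weight balances the pressure."""
    return pressure / (density * G)


def pressure_at(altitude_m):
    """Approximate air pressure (Pa) at a given altitude above sea level."""
    return SEA_LEVEL_PRESSURE * math.exp(-altitude_m / SCALE_HEIGHT)


water_m = column_height(SEA_LEVEL_PRESSURE, RHO_WATER)
mercury_sea_m = column_height(SEA_LEVEL_PRESSURE, RHO_MERCURY)
# Hypothetical mountain, 1,000 m high.
mercury_high_m = column_height(pressure_at(1000.0), RHO_MERCURY)

print(f"water column at sea level:   {water_m / 0.3048:.1f} ft")
print(f"mercury column at sea level: {mercury_sea_m:.3f} m")
print(f"mercury column at 1,000 m:   {mercury_high_m:.3f} m")
```

The water column comes out in the neighborhood of the 33 feet mentioned earlier, and the mercury column is far shorter because mercury is so much denser. The predicted drop in the mercury level on the mountain is the outcome that the 'nature abhors a vacuum' theory, in its original form, does not lead us to expect.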