On Appeal From the United States District Court For the District of New Jersey (Trenton); D.C. Civil Action No. 87-226.
Becker, and Stapleton, Circuit Judges and Robert F. Kelly, District Judge*fn*
STAPLETON, Circuit Judge.
This is an appeal in a diversity action brought under New Jersey law by the DeLuca family against Merrell Dow Pharmaceuticals Corporation, the manufacturer of Bendectin.*fn1 The DeLucas seek damages for severe birth defects suffered by Cindy DeLuca's daughter Amy. Amy was born with limb reduction defects of the lower extremities: the lower portion of her left leg is deformed with anterior bowing of the tibia, absence of the fibula and three toes, and considerable shortening; and her right foot is missing a toe. The DeLucas allege that these birth defects were caused by Cindy DeLuca's use of Bendectin during the time she was pregnant with Amy.
Merrell Dow filed a motion for summary judgment alleging that the only causation evidence produced by the DeLucas was inadmissible because all relevant epidemiological studies have determined there is no statistically significant link between the use of Bendectin during pregnancy and the type of birth defects suffered by Amy DeLuca and these studies were the only reasonable basis for expert opinions. In response, the DeLucas proffered affidavits and deposition testimony by Dr. Alan Done, an expert in pediatric pharmacology, in which Dr. Done opined that the available epidemiological data does support the conclusion that Bendectin causes limb reduction defects and that he believed, to a reasonable degree of medical certainty, Bendectin caused Amy's defects. The district court held that Dr. Done's testimony would be inadmissible at trial because it was not based on data of a type reasonably relied upon by experts in the pertinent fields in issuing opinions on these subjects, as is required by Federal Rule of Evidence 703. Since Dr. Done's testimony was the sole causation evidence the DeLucas tendered in response to Merrell Dow's motion, the district court entered summary judgment for Merrell Dow. On appeal, the DeLucas argue that the district court misapplied Federal Rule of Evidence 703 in excluding Dr. Done's testimony. We agree and we will reverse and remand for proceedings consistent with the principles articulated herein.
I. THE LEGAL AND SCIENTIFIC SETTING
This is one of the last of over 1,000 suits alleging that birth defects were caused by the drug Bendectin. Bendectin, a prescription drug prescribed for morning sickness in pregnant women, was first approved for sale by the Food and Drug Administration in 1956. Public expressions of concern about Bendectin's relationship to birth defects mounted in the 1970's. In response, Bendectin's safety was examined by the FDA, and in 1980, the FDA's Advisory Committee on Fertility and Maternal Health concluded that the relevant information "did not demonstrate an increased risk of birth defects with Bendectin use" but urged that studies be continued. App. at 195. The FDA continues to approve its sale for use during pregnancy.
Despite the committee report and the fact that no published study has concluded that Bendectin increases the risk of birth defects, thousands of tort cases were filed by plaintiffs alleging that Bendectin had caused their children's birth defects. While Merrell Dow prevailed in the most prominent of the trials arising out of these numerous cases, a multi-district common issues trial involving over 800 cases, it has also had large verdicts entered against it in other suits, though most of these have been reversed on appeal or overturned on a motion for judgment n.o.v. As a result of escalating insurance and litigation costs resulting from these cases, and decreased use of Bendectin flowing from the controversy surrounding its safety, Merrell Dow has ceased production of Bendectin.
In this case, the district court faced one of the difficult questions that has pervaded Bendectin litigation to this point: whether an expert may testify, in light of existing scientific knowledge, that Bendectin is a teratogen, i.e., an agent that causes birth defects. The district court held Dr. Done's testimony to be inadmissible, citing the requirement of Federal Rule of Evidence 703, that expert opinion be based on data reasonably relied upon by experts in the relevant field. The district court reached this conclusion despite the fact that most of the data relied upon by Dr. Done was data from peer reviewed articles in medical journals that was relied upon by the authors of these articles, as well as by Merrell Dow's own expert.
In the record that served as the basis for the district court's decision, Merrell Dow did not identify particular data sets it believed Dr. Done could not reasonably rely upon. Nor did it address the specific methodology and reasoning underlying Dr. Done's conclusion that Bendectin is a teratogen.*fn2 Instead, Merrell Dow relied upon the great weight of scientific opinion in its favor and upon prior cases in which testimony that Bendectin is a teratogen was held to be inadmissible or insufficient to support a verdict.*fn3 This was consistent with its apparent litigation strategy which was to emphasize that "in all material respects, the instant case is identical to the cases where summary judgment has been granted in Merrell Dow's favor." App. at 38.
Following Merrell Dow's lead, the district court did not point to specific deficiencies in the data utilized by Dr. Done and while it cited Rule 703, it made no record-supported, factual finding that Dr. Done had relied upon data experts in the field would have considered unreliable. Instead, the district court devoted most of its opinion to surveying the case law cited by Merrell Dow. In only two brief sentences of its opinion did the district court address Dr. Done's statistical analysis of the available epidemiological evidence. The first sentence states that the authors of the studies used by Dr. Done concluded that a "statistically significant" link between Bendectin and birth defects existed only for defects other than limb reduction defects or concluded that Bendectin does not cause birth defects. App. at 29. Dr. Done, as we shall see, readily admits that his interpretation of the data collected for these studies differs from the authors'. The second sentence appears to discard Dr. Done's analysis because he is not an epidemiologist, id., despite Merrell Dow's express agreement to assume, for purposes of its motion for summary judgment, that Dr. Done was qualified to read and interpret epidemiological studies. On this basis, the district court held that the DeLucas had "not approached a showing that Dr. Done's opinion has a foundation as required by Federal Rule of Evidence 703." Id.
Our review of a district court's decision to exclude the testimony of an expert is ordinarily limited to ensuring there has been no abuse of discretion, but to the extent the district court's ruling turns on an interpretation of a Federal Rule of Evidence our review is plenary. In re Japanese Electronic Products Litigation, 723 F.2d 238, 276 (3d Cir. 1983), rev'd on other grounds sub. nom., Matsushita Electronic Industrial Co. Ltd. v. Zenith Radio Corp., 475 U.S. 574, 89 L. Ed. 2d 538, 106 S. Ct. 1348 (1986); United States v. Furst, 886 F.2d 558, 571 (3d Cir. 1989). The standard of review of a district court's entry of summary judgment is plenary, and we apply the same standard as the district court. Erie Telecommunications, Inc. v. City of Erie, 853 F.2d 1084, 1093 (3d Cir. 1988). Summary judgment is appropriate when, after considering the record evidence in the light most favorable to the nonmoving party, no genuine issue of material fact exists and the moving party is entitled to judgment as a matter of law. Fed. R. Civ. P. 56(c).
B. The Relevant Scientific Principles and Tendered Evidence
To competently analyze the legal issues presented by this appeal, an understanding of the relevant scientific principles, albeit necessarily a rudimentary one drawn primarily from the relevant sources cited to by the parties, is essential. Problematic issues of causation arise in Bendectin cases because the etiology of most birth defects is unknown. There is no apparent way to determine from clinical examinations of Amy DeLuca whether her limb defects were the result of her mother's exposure to Bendectin, as opposed to another possible teratogen, or whether her birth defects are simply an inexplicable natural occurrence not induced by her mother's exposure to an outside agent. Rather, the only particularistic evidence the DeLucas can show to strengthen the inference that Amy DeLuca's birth defects were caused by Bendectin is to rule in Bendectin as a possible cause by showing that Amy was exposed to it during the time her limbs were developing, i.e., during organogenesis, and to rule out other possible causes by showing that Amy was not exposed to them during the critical period of organogenesis. Merrell Dow did not contend before the district court that the DeLucas failed to present sufficient evidence in this regard.*fn4
Thus, the DeLucas must rely primarily on inferences drawn from epidemiological data to show causation in Amy's case. Epidemiology, a branch of science and medicine, uses studies to "observe the effect of exposure to a single factor upon the incidence of disease in two otherwise identical populations." Black & Lilienfeld, Epidemiological Proof In Toxic Tort Litigation, 52 Fordham L. Rev. 732, 755 (1984).*fn5 In the Bendectin context, an epidemiological study ideally attempts to determine the incidence of birth defects among the children of two groups of women, identical in all respects except for their use of Bendectin during pregnancy. Epidemiological studies do not provide direct evidence that a particular plaintiff was injured by exposure to a substance.*fn6 Such studies have the potential, however, of generating circumstantial evidence of cause and effect through a process known as hypothesis testing, a process which "amounts to an attempt to falsify the null hypothesis and by exclusion accept the alternative." K.J. Rothman, Modern Epidemiology 116 (1986) ("Rothman"). The null hypothesis is the hypothesis that there is no association between two studied variables, id.; in this case the key null hypothesis would be that there is no association between Bendectin exposure and an increase in limb reduction defects. The important alternative hypothesis in this case is that Bendectin use is associated with an increased incidence of limb reduction defects.
The great weight of scientific opinion, as is evidenced by the FDA committee results, sides with the view that Bendectin use does not increase the risk of having a child with birth defects. Sailing against the prevailing scientific breeze is the DeLucas' expert Dr. Alan Done, formerly a Professor of Pharmacology and Pediatrics at Wayne State University School of Medicine,*fn7 who continues to hold fast to his position that Bendectin is a teratogen. In spite of his impressive curriculum vitae, Dr. Done's opinion on this subject has been rejected as inadmissible by several courts.
Dr. Done's opinion that Bendectin is a teratogen largely rests on inferences he draws from epidemiological data, most of which he contends is the same data that was utilized by the experts, including the FDA committee, whom Merrell Dow cites to bolster its contention that Bendectin does not cause birth defects.*fn8 The principal difference is that Dr. Done analyzes that data using an approach, advocated by Professor Kenneth Rothman of the University of Massachusetts Medical School, that places diminished weight on so-called "significance testing." See K.J. Rothman, Modern Epidemiology (1986) ("Rothman"); see also, Rothman, A Show of Confidence, New Eng. J. of Medicine, Dec. 14, 1978, 1362.
Epidemiological studies, of necessity, look to the experience of sample groups as indicative of the experience of a far larger population. Epidemiologists recognize, however, that the experience of the sample groups may vary from that of the larger population by chance. Thus, a showing of increased risk for birth defects among women using Bendectin in a particular study does not automatically prove that Bendectin use creates a higher risk of having a child with birth defects because the discrepancy between the exposed and unexposed groups could be the product of chance resulting from the use of only a small sample of the relevant populations.*fn9 As a result of the acknowledged risk of this so-called "sampling error," researchers typically have rejected the associations suggested by epidemiological data unless those associations survive the rigors of "significance testing." This practice has also found favor in the legal context. A number of judicial opinions, discussed infra, have found Bendectin plaintiffs' causation evidence inadmissible because every published epidemiological study of the relationship of Bendectin exposure to the incidence of birth defects has concluded that there is not a "statistically significant" relationship between these two events.
Significance testing has a "P value" focus; the P value "indicates the probability, assuming the null hypothesis is true, that the observed data will depart from the absence of association to the extent that they actually do, or to a greater extent, by actual chance." Rothman, supra, at 116. If P is less than .05 (or 5%), a study's finding of a relationship supportive of the alternative hypothesis is considered statistically significant; if P is greater than 5%, the relationship is rejected as insignificant. Accordingly, the results of a particular study are reported as simply "significant" or "not significant" or as P < .05 or P > .05.
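The P value calculation described above can be illustrated with a minimal sketch. It assumes a large-sample, two-sided test for a difference between two proportions, which is only one of several ways a P value might be computed and is not stated to be the method used in any Bendectin study. The function name and the counts are hypothetical and chosen purely for illustration.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p_value(exposed_cases, exposed_total, control_cases, control_total):
    """Two-sided P value for a difference in two proportions (pooled z-test).

    Assumption: a normal approximation is adequate; with very rare outcomes
    an exact method may be preferable.
    """
    p_exposed = exposed_cases / exposed_total
    p_control = control_cases / control_total
    # Under the null hypothesis both groups share one underlying rate.
    pooled = (exposed_cases + control_cases) / (exposed_total + control_total)
    se = sqrt(pooled * (1 - pooled) * (1 / exposed_total + 1 / control_total))
    z = (p_exposed - p_control) / se
    # Probability of a departure at least this large in either direction.
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical counts, not drawn from any actual study.
p = two_sided_p_value(12, 1000, 8, 1000)
label = "significant" if p < 0.05 else "not significant"
print(f"P = {p:.3f} ({label} at the .05 level)")
```

In this hypothetical, the exposed group shows a higher defect rate than the control group, yet the result is reported as "not significant" because the P value exceeds .05.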
Use of a .05 P value to determine whether to accept or reject the null hypothesis necessarily enhances one of two types of possible error. Type one error is when the null hypothesis is rejected when it is in fact true. Type two error is when the null hypothesis is in fact false but is not rejected. Rothman notes that at .05, the null hypothesis will "be rejected about 5 per cent of the time when it is true," a relatively small risk of type one error. Id. at 117. Unfortunately, the relationship between type one error and type two error is not simple; however, one study in the context of an employment discrimination case concluded that when the risk of type one error equalled 5%, the risk of type two error was 50%. Cohen, Confidence in Probability: Burdens of Persuasion in a World of Imperfect Knowledge, 60 N.Y.U.L. Rev. 329, 411 & n. 116 (1985) (citing Dawson, Investigation of Fact - The Role of the Statistician, 11 Forum 896, 907-08 (1976)). Type one error may be viewed here as the risk of concluding that Bendectin is a teratogen when it is not. Type two error is the risk of concluding that Bendectin is not a teratogen, when it in fact is.
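The trade-off between the two types of error can be sketched numerically. The example below assumes a two-sided z-test comparing two proportions with equal group sizes and uses a standard normal approximation for power. The base rate, the true relative risk, and the group sizes are hypothetical inputs and do not come from the record.

```python
from math import sqrt
from statistics import NormalDist


def type_two_error(base_rate, true_relative_risk, n_per_group, alpha=0.05):
    """Approximate probability of failing to reject a false null hypothesis.

    alpha is the accepted risk of type one error. All inputs are
    hypothetical; the normal approximation is an assumption.
    """
    norm = NormalDist()
    p_control = base_rate
    p_exposed = base_rate * true_relative_risk
    se = sqrt(p_control * (1 - p_control) / n_per_group
              + p_exposed * (1 - p_exposed) / n_per_group)
    z_crit = norm.inv_cdf(1 - alpha / 2)
    # Expected value of the test statistic when the true effect is present.
    shift = (p_exposed - p_control) / se
    power = (1 - norm.cdf(z_crit - shift)) + norm.cdf(-z_crit - shift)
    return 1 - power


# Hypothetical scenario: a defect occurring in 1 of 1,000 births, truly doubled by exposure.
for n in (5_000, 50_000):
    beta = type_two_error(0.001, 2.0, n)
    print(f"n per group = {n}: risk of type two error is about {beta:.2f}")
```

Under these assumptions, holding the risk of type one error at 5% leaves the risk of type two error dependent on sample size and on the size of the true effect, which is why no single figure describes the relationship between the two.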
Rothman contends that there is nothing magical or inherently important about .05 significance; rather this is just a common value on the tables scholars use to calculate significance. Rothman, supra, at 117; see also Cohen, supra, at 412 (noting that the .05 level of significance used in the social and physical sciences is a conservative and arbitrary value choice not necessarily valuable in the legal setting); Kaye, Is Proof of Statistical Significance Relevant?, 61 Wash. L. Rev. 1333, 1343-44 (1986). He stresses that the data in a certain study may indicate a strong relationship between two variables but still not be "statistically significant" and that the level of significance which should be required depends on the type of decision being made and the relative values placed on avoiding the two types of risk.
To convey both the extent to which two variables are associated in the data, and the extent to which this association might be the product of chance, Rothman advocates reporting both a "relative risk" (or point estimate) and "confidence intervals." In the context of an epidemiological study of Bendectin's relationship to birth defects, the relative risk is the ratio of the incidence rate of birth defects in the study group exposed to Bendectin divided by the rate in the control group not exposed to Bendectin. Black & Lilienfeld, supra, at 758. If a study found no difference in the rate of birth defects between the Bendectin exposed group and the control group, it yields a relative risk identical to the null hypothesis that Bendectin exposure is not associated with an increased incidence of birth defects. The relative risk would thus be reported as "1", signifying no difference between the rate of birth defects in each group.
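The relative risk calculation described above reduces to a ratio of two rates. The following is a minimal sketch with a hypothetical function name and hypothetical counts.

```python
def relative_risk(exposed_cases, exposed_total, control_cases, control_total):
    """Incidence rate in the exposed group divided by the rate in the control group."""
    if control_cases == 0:
        raise ValueError("relative risk is undefined when the control rate is zero")
    exposed_rate = exposed_cases / exposed_total
    control_rate = control_cases / control_total
    return exposed_rate / control_rate


# Hypothetical counts, not drawn from any actual study.
print(relative_risk(10, 1000, 10, 1000))  # equal rates: the null value
print(relative_risk(12, 1000, 8, 1000))   # higher rate among the exposed
```

A value above 1 indicates a higher rate of defects in the exposed group, and a value below 1 indicates a lower rate.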
A confidence interval is a way of graphically representing the probability that the relative risk figure or any other relationship between two studied variables is the actual relationship. The interval is a range of sets of possible values for the true parameter that is consistent with the observed data within specified limits. Rothman, supra, at 119; D. Barnes & J. Conley, Statistical Evidence in Litigation, § 3.15 at 107 (1986) (defining a confidence interval as a limit above or below or a range around the sample mean, beyond which the true population is unlikely to fall). A 95% confidence interval is constructed with enough width so that one can be confident that it is only 5% likely that the relative risk attained would have occurred if the true parameter, i.e., the actual unknown relationship between the two studied variables, were outside the confidence interval. If a 95% confidence interval thus contains "1", or the null hypothesis, then a researcher cannot say that the results are "statistically significant," that is, that the null hypothesis has been disproved at a .05 level of significance. Kaye, Is Proof of Statistical Significance Relevant?, supra, at 1348.
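The link between a 95% confidence interval and significance at the .05 level can be illustrated with a short sketch. It assumes one common large-sample construction, a normal approximation on the logarithm of the relative risk. The sources cited above do not specify this method, and the counts are hypothetical.

```python
from math import exp, log, sqrt
from statistics import NormalDist


def relative_risk_interval(exposed_cases, exposed_total,
                           control_cases, control_total, level=0.95):
    """Return (relative risk, lower limit, upper limit).

    Assumption: log-scale normal approximation; it needs nonzero case
    counts in both groups.
    """
    rr = (exposed_cases / exposed_total) / (control_cases / control_total)
    se_log = sqrt(1 / exposed_cases - 1 / exposed_total
                  + 1 / control_cases - 1 / control_total)
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return rr, exp(log(rr) - z * se_log), exp(log(rr) + z * se_log)


# Hypothetical counts, not drawn from any actual study.
rr, lower, upper = relative_risk_interval(12, 1000, 8, 1000)
contains_null = lower <= 1.0 <= upper
print(f"relative risk {rr:.2f}, 95% interval {lower:.2f} to {upper:.2f}")
print("interval contains 1:", contains_null)
```

When the interval contains 1, as in this hypothetical, the result would not be reported as statistically significant at the .05 level, even though the point estimate lies above 1.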
The result of a study should be reported, in Rothman's view, by reference to the confidence intervals at various confidence levels, e.g., 90%, 95%, 99%. The inclusion of confidence intervals of a variety of levels reflects Rothman's view that the predominating choice of a 95% confidence level is but an arbitrarily selected convention of his discipline. More importantly, however, Rothman insists that the precise locations of the boundaries of the confidence intervals, the all important focus of "significance testing," are far less important than their size and location. According to Rothman, statistical theory suggests that it is "much more likely that the [true] parameter [i.e. the true relationship between the studied variables] is located centrally within an interval than it is that the parameter is located near the limits of the interval." Rothman, supra, at 124. As such, the primary focus should not be on the ends of an interval but rather on the "approximate position of the interval as a whole on its scale of measurement. . . ." Id.
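Reporting intervals at several confidence levels, as Rothman recommends, can be sketched as follows. The example again assumes a log-scale normal approximation for the relative risk, which is an illustrative choice rather than a method taken from Rothman or the record. The function name and counts are hypothetical.

```python
from math import exp, log, sqrt
from statistics import NormalDist


def intervals_by_level(exposed_cases, exposed_total, control_cases, control_total,
                       levels=(0.90, 0.95, 0.99)):
    """Return the relative risk and a {level: (lower, upper)} mapping."""
    rr = (exposed_cases / exposed_total) / (control_cases / control_total)
    se_log = sqrt(1 / exposed_cases - 1 / exposed_total
                  + 1 / control_cases - 1 / control_total)
    result = {}
    for level in levels:
        z = NormalDist().inv_cdf(1 - (1 - level) / 2)
        result[level] = (exp(log(rr) - z * se_log), exp(log(rr) + z * se_log))
    return rr, result


# Hypothetical counts, not drawn from any actual study.
rr, intervals = intervals_by_level(12, 1000, 8, 1000)
print(f"point estimate: {rr:.2f}")
for level, (lower, upper) in intervals.items():
    print(f"{level:.0%} interval: {lower:.2f} to {upper:.2f}")
```

Each interval is centered on the same point estimate on the log scale. Only the boundaries move as the confidence level changes, which illustrates why the approximate position of the interval as a whole may be more informative than whether a boundary falls on one side of 1.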
Finally, Rothman contends that the use of significance testing is especially unhelpful when a decisionmaker is attempting to draw inferences from more than one study. Different studies may each be rejected as insignificant, yet, when the studies are looked at collectively, a majority of the data may be moderately or strongly contradictory to the null hypothesis. By failing to look at the collective data in the context of confidence intervals and the most likely estimate for the true parameter ...