Introduction
Only what is exhaustive is of interest. The truth comes from an
accumulation of details.
Thomas Mann (1875-1955)
Meta-analysis can be defined as the process of combining information from several studies that address the same question.[1] When attempting to answer an important clinical question, such as the value of some new therapy, clinicians informally combine pieces of knowledge that come from various sources. Meta-analysis is a systematic, quantitative approach to this combination process. As such, it is an essential tool for evidence-based medicine.[2]
This article will focus on the combination of results from randomized clinical trials. For the sake of simplicity, we will assume that the comparison of interest in all trials is between two groups of patients randomly allocated to an experimental treatment (treatment arm) or a control treatment (control arm). We will further assume that the arms are being compared with respect to their effect on some untoward event, such as the patient's death,[3] progression of the tumor,[4] the lack of documented tumor shrinkage in patients with measurable metastases,[5] or some form of toxicity.[6]
Table 1 shows the basic data needed for each trial in order to perform a meta-analysis. Essentially, these data consist of the number of patients randomized and the number of untoward events (or failures) per treatment arm in each trial.
How Is a Meta-analysis Done?
The basic statistical techniques needed to perform a meta-analysis were described in the 1950s[7] and are rather straightforward,[8] even though more sophisticated approaches have sometimes been advocated.[9] The essential tenet in a meta-analysis is that all trials measure a common treatment effect, such that any differences between the treatment effects observed in the trials are due primarily to chance. This common treatment effect can then be estimated as the weighted average of the treatment effects in the individual trials.
Stratification by Trial
The essential statistical technique used in a meta-analysis is stratification by trial, which guarantees that patients are compared within each trial and not across different trials.[10] Medically, it makes sense to compare like with like; ie, patients randomized to the treatment arm in a particular trial are compared with those randomized to the control arm in that same trial.[3] Statistically, comparing like with like is achieved through stratification by trial. The need for such stratification is illustrated through the data presented in Table 2.
Table 2 shows hypothetical failure rates in two randomized trials. The ratio of failure rates, which measures the effect of treatment, is equal to 0.5 in both trials, indicating that the risk of the event is halved by treatment as compared to control.
The three panels in Table 2 show the overall failure rates that would be estimated in three different scenarios if the data of the two trials were simply pooled without stratification (ie, considering all patients together). In the upper panel, equal numbers of patients are allocated to control and treatment arms in both trials A and B. In this case, the ratio of overall failure rates is 0.5, as expected.
In the middle panel, three patients are allocated to the treatment arm for every patient allocated to the control arm in trial A, whereas the converse is true in trial B. In this case, the ratio of overall failure rates is 1.0; ie, the overall results show no benefit of the experimental treatment.
In the lower panel, 19 patients are allocated to the treatment arm for every patient allocated to the control arm in trial A, whereas the converse is true in trial B. In this case, the ratio of overall failure rates is 2.0; ie, the overall results now favor the control arm! This third scenario is admittedly rather extreme, but it illustrates the paradoxical situations that can arise if a naive pooling is used, rather than a properly stratified analysis.
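The effect of naive pooling can be reproduced with a few lines of Python. The sketch below is illustrative only: the per-arm failure rates (50% vs 25% in trial A, 10% vs 5% in trial B) and the 200 patients per trial are assumptions, chosen so that treatment halves the failure rate within each trial. They are not taken from Table 2, and the name pooled_ratio is hypothetical.

```python
# Hypothetical failure rates (assumed for illustration, not taken from Table 2).
# Within each trial the treatment halves the failure rate.
RATES = {
    "A": {"treatment": 0.25, "control": 0.50},
    "B": {"treatment": 0.05, "control": 0.10},
}


def pooled_ratio(allocation):
    """Ratio of pooled failure rates (treatment / control), ignoring the trial.

    allocation maps each trial to its (n_treatment, n_control) patient counts.
    Expected numbers of failures are used, so there is no sampling noise.
    """
    fail_t = sum(n_t * RATES[trial]["treatment"]
                 for trial, (n_t, _) in allocation.items())
    fail_c = sum(n_c * RATES[trial]["control"]
                 for trial, (_, n_c) in allocation.items())
    total_t = sum(n_t for n_t, _ in allocation.values())
    total_c = sum(n_c for _, n_c in allocation.values())
    return (fail_t / total_t) / (fail_c / total_c)


# Three allocation scenarios, mirroring the three panels described above.
scenarios = {
    "1:1 in both trials": {"A": (100, 100), "B": (100, 100)},
    "3:1 in A, 1:3 in B": {"A": (150, 50), "B": (50, 150)},
    "19:1 in A, 1:19 in B": {"A": (190, 10), "B": (10, 190)},
}

for label, allocation in scenarios.items():
    print(f"{label}: pooled ratio = {pooled_ratio(allocation):.2f}")
```

Under these assumed rates, the pooled ratio moves from 0.5 to 1.0 to 2.0 as the allocation becomes more unbalanced, although the ratio within each trial remains 0.5 throughout.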
A Real-Life Example
We now describe a specific, real-life example to illustrate the concept of stratification. Consider three trials that were performed in patients with advanced colorectal cancer to compare a standard treatment, fluorouracil and leucovorin (5-FU/LV), with the investigational drug raltitrexed (Tomudex).[11,12] One of these three trials was performed in the United States (study 10[13]), while the other two were performed in Europe, South Africa, and Australasia (study 3[14] and study 12[15]).
Table 3 shows the outcomes of these trials in terms of tumor response.[12] In this table, nonresponders refers to patients with stable or progressive disease, while responders refers to patients who achieved a partial or complete response. Note that the numbers in Table 3 are taken from a compilation of published results, an approach that is not generally recommended and is used here only for illustrative purposes.
Statistical Calculations
The O, E, and V Statistics
First, we will consider study 3.[14] This trial randomized 223 patients to raltitrexed and 216 patients to 5-FU/LV (Table 3). The response rates were 43/223 (19%) with raltitrexed therapy vs 36/216 (17%) with 5-FU/LV therapy.
One way to express the benefit of raltitrexed is to calculate the tumor response odds ratio, which is the odds of failing to achieve a tumor response with raltitrexed divided by the odds of failing to achieve a tumor response with 5-FU/LV therapy. This definition implies that the odds ratio is smaller than 1 if raltitrexed is better than 5-FU/LV, equal to 1 if there is no difference between raltitrexed and 5-FU/LV, and larger than 1 if raltitrexed is worse than 5-FU/LV.
The basic statistics needed to perform the meta-analysis, denoted O, E, and V, are calculated as follows:
O is the observed number of failures on treatment (ie, of nonresponders on raltitrexed). Using the notation of Table 1, O = Ft. Using the data from study 3 in Table 3, O = 180.
E is the expected number of failures on treatment if there were no difference between the treatment and control arms in terms of response rate. In that case, one would expect the same rate of response and nonresponse on both treatments. Using the notation of Table 1, the overall rate of nonresponse for all patients is F/N, and since there are Nt patients in the treatment arm, the expected number of failures in this arm is E = Nt × F/N. Using the data from study 3 in Table 3, E = 223 × 360/439 = 182.9.
V is the variance of the number of failures on treatment. It can be computed by the formula V = Nt × Nc × F × S / [N² × (N − 1)]. If approximately equal numbers of patients are randomized to the treatment and control arms, Nt ≈ Nc ≈ N/2; therefore, V ≈ F × S / [4 × (N − 1)]. Using the data from study 3 in Table 3, V ≈ 360 × 79/(4 × 438) = 16.2.
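The three statistics can be computed directly from the counts of a single 2 × 2 table. The following is a minimal sketch using the study 3 counts quoted above; the function name oev and its argument names are hypothetical, and S is assumed to denote the total number of successes (N − F).

```python
def oev(fail_t, n_t, fail_c, n_c):
    """Return (O, E, V) for one trial from its 2 x 2 table.

    fail_t, fail_c: failures (nonresponders) on treatment and control
    n_t, n_c:       patients randomized to treatment and control
    """
    n = n_t + n_c            # N: all patients randomized
    f = fail_t + fail_c      # F: all failures
    s = n - f                # S: all successes (assumed meaning of S)
    o = fail_t               # observed failures on treatment
    e = n_t * f / n          # expected failures on treatment under no difference
    v = n_t * n_c * f * s / (n ** 2 * (n - 1))  # exact variance, no N/2 shortcut
    return o, e, v


# Study 3: 43 of 223 patients responded on raltitrexed, 36 of 216 on 5-FU/LV.
o, e, v = oev(fail_t=223 - 43, n_t=223, fail_c=216 - 36, n_c=216)
print(f"O = {o}, E = {e:.1f}, V = {v:.1f}")
```

The sketch uses the exact variance formula rather than the approximation for equal allocation; for study 3 the two agree to one decimal place.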
Odds Ratio and its Statistical Significance
Having calculated O, E, and V, one can estimate the odds ratio and test for its significance as follows:
- An estimate of the logarithm of the odds ratio is provided by the equation log OR = (O − E)/V, and its standard error by 1/√V. Note that if O = E, then log OR = 0 and OR = 1, showing again that the odds ratio is equal to 1 when there is no difference between the two treatment arms.
Using the data from study 3 in Table 3, log OR = (180 − 182.9)/16.2 = −0.18; therefore, OR = e^(−0.18) = 0.84. The 95% confidence limits of the odds ratio are 0.52 and 1.37. Since these limits include the value of 1, the odds ratio is not significantly different from 1, and thus, there is no statistical evidence that raltitrexed is different from 5-FU/LV in terms of the response rate.
- The statistical significance of the odds ratio can be tested via the statistic X² = (O − E)²/V, which is compared to a table of the χ² distribution with 1 degree of freedom. Using the data from study 3 in Table 3, X² = (180 − 182.9)²/16.2 = 0.52. A table of the χ² distribution with 1 degree of freedom indicates that the probability (P) of observing a value of the statistic at least that large is equal to .48. Again, this shows that the difference between the response rates with raltitrexed (19%) and 5-FU/LV (17%) may be ascribed to chance fluctuations.
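The two steps above can be sketched in Python as follows. The function name odds_ratio_from_oev is hypothetical, and the 95% confidence limits are assumed to be computed as exp(log OR ± 1.96/√V), which the text does not state explicitly. For 1 degree of freedom, the upper tail of the χ² distribution equals erfc(√(X²/2)), so no statistical table or extra library is needed.

```python
import math


def odds_ratio_from_oev(o, e, v, z=1.96):
    """Odds ratio, 95% confidence limits, chi-square statistic, and P value.

    Uses log OR = (O - E) / V with standard error 1 / sqrt(V).
    The confidence limits exp(log OR +/- z * SE) are an assumed convention.
    """
    log_or = (o - e) / v
    se = 1 / math.sqrt(v)
    odds_ratio = math.exp(log_or)
    lower = math.exp(log_or - z * se)
    upper = math.exp(log_or + z * se)
    x2 = (o - e) ** 2 / v
    # Upper tail of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(x2 / 2))
    return odds_ratio, lower, upper, x2, p_value


# Rounded O, E, and V for study 3, as quoted in the text.
odds_ratio, lower, upper, x2, p_value = odds_ratio_from_oev(180, 182.9, 16.2)
print(f"OR = {odds_ratio:.2f} (95% CI {lower:.2f} to {upper:.2f})")
print(f"X2 = {x2:.2f}, P = {p_value:.2f}")
```

Because the inputs are the rounded values of O, E, and V, the results should land close to the figures quoted above, but may differ from them in the second decimal place.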
Combination of Results from Different Trials
We will now consider the other two trials. Similar calculations yield O = 186, E = 185.0, and V = 13.5 for study 10 and O = 201, E = 201.6, and V = 18.6 for study 12. We can now combine the results of all three trials in terms of tumor response rates:
An estimate of the logarithm of the overall odds ratio is provided by the formula log OR = Σ(O − E)/ΣV, with standard error 1/√ΣV, where Σ stands for summation over all trials. In our example, log OR = (−2.9 + 1.0 − 0.6)/(16.2 + 13.5 + 18.6) = −0.05; therefore, OR = e^(−0.05) = 0.95. The 95% confidence limits of the overall odds ratio are equal to 0.72 and 1.27. Again, since these limits include the value of 1, the overall odds ratio is not significantly different from 1, and there is no statistical evidence that raltitrexed is different from 5-FU/LV in terms of the response rate.
The statistical significance of the overall odds ratio can be tested via a χ² statistic, given simply by X²overall = [Σ(O − E)]²/ΣV; this is compared to a table of the χ² distribution with 1 degree of freedom. In our example, X²overall = (−2.9 + 1.0 − 0.6)²/(16.2 + 13.5 + 18.6) = 0.12. A table of the χ² distribution with 1 degree of freedom indicates that the probability (P) of observing a value of the statistic at least that large is equal to .72, which is far from statistical significance.
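The combination step amounts to summing O − E and V over the trials before applying the same formulas as for a single trial. The sketch below uses the rounded O, E, and V values quoted in the text; the function name combine and the dictionary layout are hypothetical, and the confidence limits are again assumed to be exp(log OR ± 1.96/√ΣV).

```python
import math

# Rounded (O, E, V) per trial, as quoted in the text.
TRIALS = {
    "study 3": (180, 182.9, 16.2),
    "study 10": (186, 185.0, 13.5),
    "study 12": (201, 201.6, 18.6),
}


def combine(trials, z=1.96):
    """Overall odds ratio, 95% limits, chi-square statistic, and P value."""
    sum_oe = sum(o - e for o, e, _ in trials.values())   # sum of (O - E)
    sum_v = sum(v for _, _, v in trials.values())        # sum of V
    log_or = sum_oe / sum_v
    se = 1 / math.sqrt(sum_v)
    x2 = sum_oe ** 2 / sum_v
    return {
        "or": math.exp(log_or),
        "lower": math.exp(log_or - z * se),
        "upper": math.exp(log_or + z * se),
        "x2": x2,
        # Upper tail of the chi-square distribution with 1 degree of freedom.
        "p": math.erfc(math.sqrt(x2 / 2)),
    }


overall = combine(TRIALS)
print(f"overall OR = {overall['or']:.2f} "
      f"(95% CI {overall['lower']:.2f} to {overall['upper']:.2f})")
print(f"X2 = {overall['x2']:.2f}, P = {overall['p']:.2f}")
```

Each trial contributes to the overall estimate in proportion to its V, so larger trials carry more weight. As with the single-trial calculation, working from rounded inputs can shift the results slightly relative to the published figures.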
Test for Heterogeneity
It is often important to test that the odds ratios are roughly equal in the trials being combined. This hypothesis can be tested using the statistic X²hetero = ΣX² − X²overall, where Σ, again, stands for summation over all trials, and the X² statistics are the same as those calculated above. The X²hetero statistic is compared to a table of the χ² distribution with n − 1 degrees of freedom, where n is the number of trials.
In our example, X2hetero = 0.48 with 2 degrees of freedom, and the test for heterogeneity is far from statistically significant; hence we have no evidence that treatment effects differ between the three trials.
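The heterogeneity statistic can be checked with a short sketch based on the rounded O, E, and V values quoted earlier. The function names are hypothetical. To stay free of external libraries, the P value uses the closed-form upper tail of the χ² distribution, which is valid only for an even number of degrees of freedom (as is the case here, with 2); a general implementation would need a statistical library.

```python
import math

# Rounded (O, E, V) for studies 3, 10, and 12, as quoted in the text.
TRIALS = [(180, 182.9, 16.2), (186, 185.0, 13.5), (201, 201.6, 18.6)]


def heterogeneity(trials):
    """Return (X2_hetero, degrees of freedom) for a list of (O, E, V)."""
    sum_x2 = sum((o - e) ** 2 / v for o, e, v in trials)
    sum_oe = sum(o - e for o, e, _ in trials)
    sum_v = sum(v for _, _, v in trials)
    x2_overall = sum_oe ** 2 / sum_v
    return sum_x2 - x2_overall, len(trials) - 1


def chi2_upper_tail_even_df(x, df):
    """P(chi-square > x) for even df: exp(-x/2) * sum_{k < df/2} (x/2)^k / k!"""
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form requires a positive, even df")
    half = x / 2
    return math.exp(-half) * sum(half ** k / math.factorial(k)
                                 for k in range(df // 2))


x2_hetero, df = heterogeneity(TRIALS)
p_hetero = chi2_upper_tail_even_df(x2_hetero, df)
print(f"X2_hetero = {x2_hetero:.2f} on {df} degrees of freedom, P = {p_hetero:.2f}")
```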
Graphical Presentation
The results of our meta-analysis can also be presented in graphical form, as suggested by Peto and as shown in Figure 1. Each line of the graph shows the data and results of one trial.[3] The point estimate of the odds ratio is depicted as a square whose surface area is proportional to the variance. (Hence, a trial with many patients will be represented by a large square and a trial with few patients, by a small square.) The 95% confidence interval of the odds ratio is depicted by a horizontal line extending from the lower to the upper limit.
The last line of the graph shows the combined data and results of the overall analysis. (The overall odds ratio is depicted as a diamond, the extremities of which are the lower and upper limits of the 95% confidence interval.)
This representation, sometimes called a forest plot (an opportune name when many trials are included in the meta-analysis), has become fairly standard, as it provides an exhaustive and intuitively appealing summary of the available information. One nice feature of forest plots is that they present summary data on all individual trials, so that the reader can assess the impact of each trial on the overall results.
Time-Related Failures
Although the above calculations are quite simple, they only apply to treatment failures that can be tabulated as 2 × 2 tables of the type shown in Table 1. If interest focuses on time-related failures, such as death or tumor progression, one needs to consider not only the number of failures in each treatment arm but also the time at which each failure is observed.[16] This approach is more tedious than that explained above but is conceptually similar. The formulas for O, E, and V are derived from life tables based on individual survival times instead of a single 2 × 2 table as above, and (O − E)/V is an estimate of the log hazard ratio, rather than the log odds ratio. The interpretation of hazard ratios is similar to that of odds ratios, however, in that values smaller than 1 indicate a survival benefit for the experimental treatment, a value of 1 indicates no difference, and values larger than 1 indicate a survival benefit for the control treatment.
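To make the idea concrete, the sketch below computes O, E, and V for one trial from individual survival times using the usual logrank construction: at each distinct failure time, a 2 × 2 table is formed from the patients still at risk, and the contributions are summed. This is an assumption about how the life-table calculation is carried out; the article does not spell out the formulas. The function name logrank_oev and the toy data are hypothetical.

```python
import math


def logrank_oev(times_t, events_t, times_c, events_c):
    """Logrank O, E, and V for the treatment arm of one trial.

    times_*:  follow-up time for each patient
    events_*: 1 if the failure was observed at that time, 0 if censored
    """
    data = [(t, e, "t") for t, e in zip(times_t, events_t)]
    data += [(t, e, "c") for t, e in zip(times_c, events_c)]
    failure_times = sorted({t for t, e, _ in data if e})

    o = e_sum = v_sum = 0.0
    for time in failure_times:
        # Patients still at risk just before this failure time.
        n_t = sum(1 for u, _, arm in data if u >= time and arm == "t")
        n_c = sum(1 for u, _, arm in data if u >= time and arm == "c")
        # Failures observed exactly at this time.
        d_t = sum(1 for u, ev, arm in data if u == time and ev and arm == "t")
        d_c = sum(1 for u, ev, arm in data if u == time and ev and arm == "c")
        n, d = n_t + n_c, d_t + d_c
        o += d_t
        e_sum += d * n_t / n
        if n > 1:
            # Hypergeometric variance of the failures on treatment.
            v_sum += d * n_t * n_c * (n - d) / (n ** 2 * (n - 1))
    return o, e_sum, v_sum


# Toy data (hypothetical): three patients per arm, one censored on treatment.
o, e, v = logrank_oev(times_t=[5, 8, 12], events_t=[1, 1, 0],
                      times_c=[3, 6, 9], events_c=[1, 1, 1])
hazard_ratio = math.exp((o - e) / v)
print(f"O = {o:.0f}, E = {e:.2f}, V = {v:.2f}, hazard ratio = {hazard_ratio:.2f}")
```

The resulting O, E, and V can be combined across trials exactly as in the 2 × 2 case, with (O − E)/V now interpreted as a log hazard ratio.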
Figure 2 shows the statistics and forest plot for survival data given in the compilation of published results of the three trials comparing raltitrexed to 5-FU/LV. As emphasized above, the use of such published data is less than satisfactory, but Figure 2 includes some interesting features, as compared with Figure 1.
First, the survival hazard ratio is significantly higher than 1 in one of the three trials (study 10[13]), indicating a statistically significantly longer survival with 5-FU/LV than with raltitrexed. The survival hazard ratios in the other two trials also favor 5-FU/LV, although not significantly so. The combination of the three trials confirms a significantly worse survival with raltitrexed therapy than with 5-FU/LV treatment (P = .009). Finally, the test for heterogeneity fails to indicate significant heterogeneity among the three trials (P = .38), with a consistent tendency for raltitrexed to be inferior to 5-FU/LV in terms of overall survival.
Again, these analyses are based on the published results of selected trials, and, therefore, their results should be viewed with caution.[17] A meta-analysis based on updated individual patient data from all of the randomized trials comparing raltitrexed with 5-FU/LV would undoubtedly be more reliable, should these data become available for analysis. Also, with individual patient data, more informative analyses would be possible, such as the estimation of survival curves and hazard rates over time, and an adjustment for covariates through a Cox model.
