Meta-analysis: Methods, Strengths, and Weaknesses
Introduction
Only what is exhaustive is of interest. The truth comes from an
accumulation of details.
Thomas Mann (1875-1955)
Meta-analysis can be defined as the process of
combining information from several studies that address the same
question.[1] When attempting to answer an important clinical
question, such as the value of some new therapy, clinicians
informally combine pieces of knowledge that come from various
sources. Meta-analysis is a systematic, quantitative approach to this
combination process. As such, it is an essential tool for
evidence-based medicine.[2]
This article will focus on the combination of results from randomized
clinical trials. For the sake of simplicity, we will assume that the
comparison of interest in all trials is between two groups of
patients randomly allocated to an experimental treatment
("treatment arm") or a control treatment ("control
arm"). We will further assume that the arms are being compared
with respect to their effect on some untoward event, such as the
patient's death,[3] progression of the tumor,[4] the lack of
documented tumor shrinkage in patients with measurable metastases,[5]
or some form of toxicity.[6]
Table 1 shows the basic data needed for each trial in order to
perform a meta-analysis. Essentially, these data consist of the
number of patients randomized and the number of untoward events (or
"failures") per treatment arm in each trial.
How Is a Meta-analysis Done?
The basic statistical techniques needed to perform a meta-analysis
were described in the 1950s[7] and are rather straightforward,[8]
even though more sophisticated approaches have sometimes been
advocated.[9] The essential tenet in a meta-analysis is that all
trials measure a common treatment effect, such that any differences
between the treatment effects observed in the trials are due primarily
to chance. This common treatment effect can then be estimated as a
weighted average of the treatment effects in the individual trials.
Stratification by Trial
The essential statistical technique used in a meta-analysis is
stratification by trial, which guarantees that patients are compared
within each trial and not across different trials.[10] Medically, it
makes sense to compare "like with like"; ie, patients
randomized to the treatment arm in a particular trial are compared
with those randomized to the control arm in that same trial.[3]
Statistically, comparing like with like is achieved through
stratification by trial. The need for such stratification is
illustrated through the data presented in Table 2.
Table 2 shows hypothetical failure rates in two randomized trials.
The ratio of failure rates, which measures the effect of treatment,
is equal to 0.5 in both trials, indicating that the risk of the event
is halved by treatment as compared to control.
The three panels in Table 2 show the ratio of overall failure rates
that would be estimated in three different scenarios if the data of
the two trials were simply pooled without stratification (ie,
considering all patients together). In the upper panel, equal numbers
of patients are allocated to control and treatment arms in both
trials A and B. In this case, the ratio of overall failure rates is
0.5, as expected.
In the middle panel, three patients are allocated to the treatment
arm for every patient allocated to the control arm in trial A,
whereas the converse is true in trial B. In this case, the ratio of
overall failure rates is 1.0; ie, the pooled results show no benefit
of the experimental treatment.
In the lower panel, 19 patients are allocated to the treatment arm
for every patient allocated to the control arm in trial A, whereas
the converse is true in trial B. In this case, the ratio of overall
failure rates is 2.0; ie, the pooled results now favor the control
arm! This third scenario is admittedly rather extreme, but it
illustrates the paradoxical situations that can arise if naive
pooling is used rather than a properly stratified analysis.
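The three scenarios can be reproduced with a short Python sketch. The failure rates below are our own assumptions, chosen only so that the ratio is 0.5 within each trial and the control failure rate is five times higher in trial A than in trial B; they are not the values of Table 2, and the trial sizes and the function name `pooled_ratio` are likewise illustrative.

```python
from fractions import Fraction

# Hypothetical failure rates (assumed for illustration, not taken from Table 2).
# Within each trial the treatment rate is half the control rate (ratio 0.5).
RATES = {
    "A": {"control": Fraction(1, 2), "treatment": Fraction(1, 4)},
    "B": {"control": Fraction(1, 10), "treatment": Fraction(1, 20)},
}

def pooled_ratio(allocation):
    """Ratio of pooled failure rates (treatment / control), ignoring the trial.

    allocation maps each trial to (patients on treatment, patients on control).
    """
    failures = {"treatment": Fraction(0), "control": Fraction(0)}
    patients = {"treatment": 0, "control": 0}
    for trial, (n_t, n_c) in allocation.items():
        for arm, n in (("treatment", n_t), ("control", n_c)):
            failures[arm] += n * RATES[trial][arm]  # expected failures in this arm
            patients[arm] += n
    rate_t = failures["treatment"] / patients["treatment"]
    rate_c = failures["control"] / patients["control"]
    return rate_t / rate_c

# Upper panel: 1:1 allocation in both trials
balanced = pooled_ratio({"A": (200, 200), "B": (200, 200)})
# Middle panel: 3:1 in trial A, 1:3 in trial B
three_to_one = pooled_ratio({"A": (300, 100), "B": (100, 300)})
# Lower panel: 19:1 in trial A, 1:19 in trial B
nineteen_to_one = pooled_ratio({"A": (380, 20), "B": (20, 380)})

print(float(balanced), float(three_to_one), float(nineteen_to_one))
```

With these assumed rates, the pooled ratio moves from 0.5 to 1.0 to 2.0 as the allocation becomes more unbalanced, even though the within-trial ratio never changes. The distortion arises because most treated patients come from the trial with the higher underlying failure rate.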
A Real-Life Example
We now describe a specific, real-life example to illustrate the
concept of stratification. Consider three trials that were performed
in patients with advanced colorectal cancer to compare a standard
treatment, fluorouracil and leucovorin (5-FU/LV), with the
investigational drug raltitrexed (Tomudex).[11,12] One of these three
trials was performed in the United States (study 10[13]), while the
other two were performed in Europe, South Africa, and Australasia
(study 3[14] and study 12[15]).
Table 3 shows the outcomes of these trials in terms of tumor
response.[12] In this table, "nonresponders" refers to
patients with stable or progressive disease, while
"responders" refers to patients who achieved a partial or
complete response. Note that the numbers in Table 3 are taken from a
compilation of published results, an approach that is not
generally recommended and is used here only for illustrative purposes.
Statistical Calculations
The O, E, and V Statistics
First, we will consider study
3.[14] This trial randomized 223 patients to raltitrexed and 216
patients to 5-FU/LV (Table 3). The response rates were 43/223 (19%)
with raltitrexed therapy vs 36/216 (17%) with 5-FU/LV therapy.
One way to express the benefit of raltitrexed is to calculate the
tumor response "odds ratio," which is the odds of failing
to achieve a tumor response with raltitrexed divided by the odds of
failing to achieve a tumor response with 5-FU/LV therapy. This
definition implies that the odds ratio is smaller than 1 if
raltitrexed is better than 5-FU/LV, equal to 1 if there is no
difference between raltitrexed and 5-FU/LV, and larger than 1 if
raltitrexed is worse than 5-FU/LV.
The basic statistics needed to perform the meta-analysis, denoted
O, E, and V, are calculated as follows:
O is the observed number of failures on treatment (ie, the number
of nonresponders on raltitrexed). Using the notation of Table 1, O =
F_{t}. Using the data from study 3 in Table 3, O = 223 - 43 = 180.
E is the expected number of failures on treatment
if there were no difference between the treatment and control arms in
terms of response rate. In that case, one would expect the same rate
of response and nonresponse on both treatments. Using the notation of
Table 1, the overall rate of nonresponse for all patients is F/N, and
since there are N_{t} patients in the treatment arm, the expected
number of nonresponses in this arm is E = N_{t} × F/N. Using the data
from study 3 in Table 3, E = 223 × 360/439 = 182.9.
V is the variance of the number of failures on
treatment. It can be computed by the formula V = N_{t} ×
N_{c} × F × S / [N^{2} × (N - 1)], where S is the total number of
patients without a failure (here, the 79 responders). If
approximately equal numbers of patients are randomized to the
treatment and control arms, N_{t} ≈ N_{c} ≈ N/2;
therefore, V ≈ F × S / [4 × (N - 1)].
Using the data from study 3 in Table 3, V ≈
360 × 79/(4 × 438) = 16.2.
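For readers who prefer to check the arithmetic by computer, the three statistics can be obtained from the four counts of a trial with the following Python sketch. The function name `oev` and its argument names are ours and are used for illustration only; the sketch uses the exact formula for V rather than the approximation for equal allocation.

```python
def oev(n_t, n_c, f_t, f_c):
    """Return (O, E, V) for one trial, from its 2 x 2 table.

    n_t, n_c : patients randomized to treatment and to control
    f_t, f_c : failures (here, nonresponders) in each arm
    """
    n = n_t + n_c          # N, all patients randomized
    f = f_t + f_c          # F, all failures
    s = n - f              # S, all patients without a failure
    o = f_t                # observed failures on treatment
    e = n_t * f / n        # expected failures on treatment if no difference
    v = n_t * n_c * f * s / (n ** 2 * (n - 1))  # exact variance formula
    return o, e, v

# Study 3: 43 responders among 223 patients on raltitrexed,
#          36 responders among 216 patients on 5-FU/LV
o, e, v = oev(n_t=223, n_c=216, f_t=223 - 43, f_c=216 - 36)
print(o, round(e, 1), round(v, 1))
```

For study 3, the exact and approximate formulas for V agree to one decimal place because the two arms are of nearly equal size.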
Odds Ratio and Its Statistical Significance
Having calculated O, E, and V, one can estimate the odds ratio and
test for its significance as follows:

An estimate of the logarithm of the odds ratio is provided by the
equation log OR = (O - E)/V, and its standard error by 1/√V.
Note that if O = E, then log OR = 0 and OR = 1, showing again that
the odds ratio is equal to 1 when there is no difference between the
two treatment arms.
Using the data from study 3 in Table 3, log OR = (180 -
182.9)/16.2 = -0.18; therefore, OR = e^{-0.18} =
0.84. The 95% confidence limits of the odds ratio are 0.52 and 1.37.
Since these limits include the value of 1, the odds ratio is not
significantly different from 1, and thus, there is no statistical
evidence that raltitrexed is different from 5-FU/LV in terms of the
response rate.

The statistical significance of the odds ratio can be tested via the
statistic X^{2} = (O - E)^{2}/V, which is
compared to a table of the χ²
distribution with 1 degree of freedom. Using the data from study 3 in
Table 3, X^{2} = (180 - 182.9)^{2}/16.2 = 0.52.
A table of the χ² distribution with
1 degree of freedom indicates that the probability (P) of
observing a value of the statistic at least that large is equal to
.48. Again, this shows that the difference between the response rates
with raltitrexed (19%) and 5-FU/LV (17%) may be ascribed to chance fluctuations.
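The same steps can be sketched in Python, starting from the rounded values of O, E, and V for study 3. The function name `odds_ratio_summary` is illustrative. Instead of a printed χ² table, the sketch uses the fact that, for 1 degree of freedom, the upper tail probability equals erfc(√(X²/2)).

```python
import math

def odds_ratio_summary(o, e, v, z=1.96):
    """Odds ratio, 95% confidence limits, X^2, and P value from O, E, and V."""
    log_or = (o - e) / v
    se = 1 / math.sqrt(v)                      # standard error of log OR
    odds_ratio = math.exp(log_or)
    limits = (math.exp(log_or - z * se), math.exp(log_or + z * se))
    x2 = (o - e) ** 2 / v
    # Upper tail of the chi-square distribution with 1 degree of freedom
    p = math.erfc(math.sqrt(x2 / 2))
    return odds_ratio, limits, x2, p

# Rounded O, E, and V for study 3, as given in the text
odds_ratio, limits, x2, p = odds_ratio_summary(o=180, e=182.9, v=16.2)
print(f"OR = {odds_ratio:.2f}, 95% limits = {limits[0]:.2f} to {limits[1]:.2f}")
print(f"X2 = {x2:.2f}, P = {p:.2f}")
```

Because the inputs are rounded to one decimal place, the confidence limits and the P value obtained in this way may differ in the second decimal from the figures quoted above.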
Combination of Results from Different Trials
We will now
consider the other two trials. Similar calculations yield O = 186, E
= 185.0, and V = 13.5 for study 10 and O = 201, E = 201.6, and V =
18.6 for study 12. We can now combine the results of all three trials
in terms of tumor response rates:
An estimate of the logarithm of the overall odds ratio is
provided by the formula log OR = Σ(O -
E)/ΣV, with standard error 1/√(ΣV),
where Σ stands for summation over
all trials. In our example, log OR = (-2.9 + 1.0 -
0.6)/(16.2 + 13.5 + 18.6) = -0.05; therefore, OR = e^{-0.05}
= 0.95. The 95% confidence limits of the overall odds ratio are equal
to 0.72 and 1.27. Again, since these limits include the value of 1,
the overall odds ratio is not significantly different from 1, and
there is no statistical evidence that raltitrexed is different from
5-FU/LV in terms of the response rate.
The statistical significance of the overall odds ratio can be
tested via a χ² statistic, given simply by X^{2}_{overall}
= [Σ(O - E)]^{2}/ΣV;
this is compared to a table of the χ²
distribution with 1 degree of freedom. In our example, X^{2}_{overall}
= (-2.9 + 1.0 - 0.6)^{2}/(16.2 + 13.5 + 18.6) =
0.12. A table of the χ² distribution with
1 degree of freedom indicates that the probability (P) of observing a
value of the statistic at least that large is equal to .72, which is
far from statistical significance.
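A minimal Python sketch of the combination step is given below, using the rounded O, E, and V values quoted in the text for the three trials. The variable names are ours. The last lines check that Σ(O - E)/ΣV is the same quantity as the average of the per-trial log odds ratios weighted by V, which is the "weighted average" referred to earlier.

```python
import math

# (O, E, V) for each trial, as quoted in the text
TRIALS = {
    "study 3": (180, 182.9, 16.2),
    "study 10": (186, 185.0, 13.5),
    "study 12": (201, 201.6, 18.6),
}

sum_o_minus_e = sum(o - e for o, e, _ in TRIALS.values())
sum_v = sum(v for _, _, v in TRIALS.values())

log_or = sum_o_minus_e / sum_v
se = 1 / math.sqrt(sum_v)                      # standard error of overall log OR
overall_or = math.exp(log_or)
limits = (math.exp(log_or - 1.96 * se), math.exp(log_or + 1.96 * se))

x2_overall = sum_o_minus_e ** 2 / sum_v
# Upper tail of the chi-square distribution with 1 degree of freedom
p_overall = math.erfc(math.sqrt(x2_overall / 2))

# Same estimate, written as a weighted average of the per-trial log odds
# ratios (O - E)/V, each trial being weighted by its variance V
weighted = sum(v * ((o - e) / v) for o, e, v in TRIALS.values()) / sum_v

print(f"overall OR = {overall_or:.2f}, P = {p_overall:.2f}")
```

Since the inputs are rounded, the second decimal of the confidence limits and of X^{2}_{overall} may not match the quoted figures exactly.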
Test for Heterogeneity
It is often important to test that
the odds ratios are roughly equal in the trials being combined. This
hypothesis can be tested by the formula X^{2}_{hetero}
= ΣX^{2} - X^{2}_{overall},
where Σ, again, stands for
summation over all trials, and the X^{2}
statistics are the same as those calculated above. The X^{2}_{hetero}
statistic is compared to a table of the χ²
distribution with n - 1 degrees of freedom, where n is the
number of trials.
In our example, X^{2}_{hetero} = 0.48 with 2 degrees
of freedom, and the test for heterogeneity is far from statistically
significant; hence we have no evidence that treatment effects differ
between the three trials.
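The heterogeneity statistic can be obtained from the same inputs, as in the following Python sketch (variable names are illustrative). The P value line relies on a property that holds only for 2 degrees of freedom, where the upper tail of the χ² distribution is exp(-X²/2); with a different number of trials, a χ² table or a statistics library would be needed instead.

```python
import math

# (O, E, V) for each trial, as quoted in the text
TRIALS = {
    "study 3": (180, 182.9, 16.2),
    "study 10": (186, 185.0, 13.5),
    "study 12": (201, 201.6, 18.6),
}

# X^2 of each individual trial
x2_by_trial = [(o - e) ** 2 / v for o, e, v in TRIALS.values()]

# X^2 of the overall (combined) analysis
sum_o_minus_e = sum(o - e for o, e, _ in TRIALS.values())
sum_v = sum(v for _, _, v in TRIALS.values())
x2_overall = sum_o_minus_e ** 2 / sum_v

x2_hetero = sum(x2_by_trial) - x2_overall
df = len(TRIALS) - 1                           # n - 1 degrees of freedom

# Valid for df == 2 only: chi-square upper tail is exp(-x / 2)
p_hetero = math.exp(-x2_hetero / 2)

print(f"X2 hetero = {x2_hetero:.2f} on {df} df, P = {p_hetero:.2f}")
```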
Graphical Presentation
The results of our meta-analysis
can also be presented in graphical form, as suggested by Peto and as
shown in Figure 1. Each line of the graph shows the data and results
of one trial.[3] The point estimate of the odds ratio is depicted as
a square whose surface area is proportional to the variance V, which
grows with the size of the trial. (Hence,
a trial with many patients will be represented by a large square and
a trial with few patients, by a small square.) The 95% confidence
interval of the odds ratio is depicted by a horizontal line extending
from the lower to the upper limit.
The last line of the graph shows the combined data and results of the
overall analysis. (The overall odds ratio is depicted as a diamond,
the extremities of which are the lower and upper limits of the 95%
confidence interval.)
This representation, sometimes called a "forest plot" (an
opportune name when many trials are included in the meta-analysis),
has become fairly standard, as it provides an exhaustive and
intuitively appealing summary of the available information. One nice
feature of forest plots is that they present summary data on all
individual trials, so that the reader can assess the impact of each
trial on the overall results.
Time-Related Failures
Although the above calculations are
quite simple, they only apply to treatment failures that can be
tabulated as 2 × 2 tables of the type shown in Table 1. If
interest focuses on time-related failures, such as death or tumor
progression, one needs to consider not only the number of failures in
each treatment arm but also the time at which each failure is
observed.[16] This approach is more tedious than that explained above
but is conceptually similar. The formulas for O, E, and V are derived
from life tables based on individual survival times instead of a
single 2 × 2 table as above, and (O - E)/V is an estimate
of the log hazard ratio, rather than the log odds ratio. The
interpretation of hazard ratios is similar to that of odds
ratios, however, in that values smaller than 1 indicate a survival
benefit for the experimental treatment, a value of 1 indicates no
difference, and values larger than 1 indicate a survival benefit for
the control treatment.
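To make the parallel with the 2 × 2 case concrete, the following Python sketch computes O, E, and V from individual follow-up times. It assumes the usual logrank construction, in which a 2 × 2 table is formed at each distinct failure time from the patients still at risk and the contributions are summed; the article does not spell out its formulas, so this is one plausible reading. The six patients are entirely hypothetical, and the function name `logrank_oev` is ours.

```python
import math

def logrank_oev(times, events, arms):
    """O, E, and V for the treatment arm from individual patient data.

    times  : follow-up time of each patient
    events : 1 if the failure was observed, 0 if the patient was censored
    arms   : 1 for the treatment arm, 0 for the control arm
    """
    o = e = v = 0.0
    patients = list(zip(times, events, arms))
    failure_times = sorted({t for t, failed, _ in patients if failed})
    for t in failure_times:
        # Patients still at risk at time t, flagged if they fail exactly at t
        at_risk = [(arm, failed and time == t)
                   for time, failed, arm in patients if time >= t]
        n = len(at_risk)                         # N at this failure time
        n_t = sum(arm for arm, _ in at_risk)     # N_t at this failure time
        d = sum(1 for _, fails_now in at_risk if fails_now)   # failures at t
        o += sum(1 for arm, fails_now in at_risk if arm == 1 and fails_now)
        e += d * n_t / n
        if n > 1:
            # Same variance formula as for a single 2 x 2 table
            v += n_t * (n - n_t) * d * (n - d) / (n ** 2 * (n - 1))
    return o, e, v

# Hypothetical data for six patients (not taken from any of the trials)
times = [1, 2, 3, 4, 5, 6]
events = [1, 1, 0, 1, 1, 1]
arms = [0, 1, 0, 0, 1, 1]

o, e, v = logrank_oev(times, events, arms)
hazard_ratio = math.exp((o - e) / v)
print(o, round(e, 2), round(v, 2), round(hazard_ratio, 2))
```

In this toy data set the treated patients tend to fail later, so O is smaller than E and the estimated hazard ratio is below 1.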
Figure 2 shows the statistics and forest plot for survival data given
in the compilation of published results of the three trials comparing
raltitrexed to 5-FU/LV. As emphasized above, the use of such
published data is less than satisfactory, but Figure 2 includes some
interesting features, as compared with Figure 1.
First, the survival hazard ratio is significantly higher than 1 in
one of the three trials (study 10[13]), indicating a statistically
significantly longer survival with 5-FU/LV than with raltitrexed. The
survival hazard ratios in the other two trials also favor 5-FU/LV,
although not significantly so. The combination of the three trials
confirms a significantly worse survival with raltitrexed therapy than
with 5-FU/LV treatment (P = .009). Finally, the test for heterogeneity
fails to indicate significant heterogeneity among the three trials (P
= .38), with a consistent tendency for raltitrexed to be inferior to
5-FU/LV in terms of overall survival.
Again, these analyses are based on the published results of selected
trials, and, therefore, their results should be viewed with
caution.[17] A meta-analysis based on updated individual patient data
from all of the randomized trials comparing raltitrexed with
5-FU/LV would undoubtedly be more reliable, should these data become
available for analysis. Also, with individual patient data, more
informative analyses would be possible, such as the estimation
of survival curves and hazard rates over time, and an adjustment for
covariates through a Cox model.