In an effort to standardize study reporting using this objective PSA parameter, the American Society for Therapeutic Radiology and Oncology (ASTRO) assembled an expert panel in 1996 and charged the members with reaching a consensus on "the significance of the depth of the PSA nadir, the definition of a rising PSA, the optimal PSA surrogate end point for total eradication of tumor or for relapse after irradiation, and guidelines for using PSA end points for reporting (publishing) success or failure after irradiation."[7] Data on patient outcomes with various PSA characteristics and trends after external-beam radiation along with illustration of various failure definitions were supplied by investigators from seven institutions, who had made significant contributions to the literature in the PSA era. Following presentations by these individuals and additional information obtained from recursive partitioning techniques, the ASTRO panel agreed on four guidelines:
- Biochemical failure is not justification per se to initiate additional treatment. It is not equivalent to clinical failure. It is, however, an appropriate early end point for clinical trials.
- Three consecutive rises in PSA is a reasonable definition of biochemical failure after radiation therapy. For clinical trials, the date of failure should be the midpoint between the postirradiation nadir PSA and the first of the three consecutive rises.
- No definition of PSA failure has, as yet, been shown to be a surrogate for clinical progression or survival.
- Nadir PSA is a strong prognostic factor, but no absolute level is a valid cutoff point for separating successful and unsuccessful treatments. Nadir PSA is similar in prognostic value to pretreatment prognostic variables.
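The second guideline can be made concrete with a minimal sketch. The function name `astro_failure_date`, the `Measurement` tuple layout, and the sample values are illustrative assumptions, not part of the consensus statement. The sketch also assumes that the nadir is the lowest PSA recorded before the first of the three rises and that a "rise" is any value higher than the preceding one; the panel's guideline does not specify these details.

```python
from datetime import date, timedelta
from typing import List, Optional, Tuple

# (draw date, PSA in ng/mL), listed in chronological order (assumed layout)
Measurement = Tuple[date, float]


def astro_failure_date(series: List[Measurement]) -> Optional[date]:
    """Return the backdated failure date, or None if there is no failure.

    Failure is three consecutive PSA rises; the date is the midpoint
    between the nadir and the first of those three rises.
    """
    for i in range(len(series) - 3):
        window = [psa for _, psa in series[i:i + 4]]
        # Four strictly increasing values amount to three consecutive rises.
        if window[0] < window[1] < window[2] < window[3]:
            # Assumption: nadir = lowest PSA up to the value before the first rise.
            nadir_date, _ = min(series[:i + 1], key=lambda m: m[1])
            first_rise_date = series[i + 1][0]
            gap_days = (first_rise_date - nadir_date).days
            return nadir_date + timedelta(days=gap_days // 2)
    return None


# Hypothetical patient: nadir of 0.4 ng/mL, then three consecutive rises.
example = [
    (date(2000, 1, 1), 1.0),
    (date(2000, 7, 1), 0.6),
    (date(2001, 1, 1), 0.4),
    (date(2001, 7, 1), 0.7),
    (date(2002, 1, 1), 1.1),
    (date(2002, 7, 1), 1.8),
]
failure = astro_failure_date(example)
```

In this hypothetical series the failure cannot be declared until the third rise in July 2002, yet the reported failure date falls between the January 2001 nadir and the July 2001 rise.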
As often happens when attention is focused on one primary recommendation, the less salient, seemingly minor points that arose from the consensus conference, such as requiring a minimum of 24 months of follow-up before reporting a study and in some way indicating patients who had two but not three rises in PSA, were largely overlooked. Additionally, as frequently occurs in piloting a new procedure, more experience with the proposed definition and mature follow-up in the PSA era led to the emergence of deficiencies and interpretative issues. Major criticisms of the ASTRO definition included the lack of consideration of laboratory variation and standard error, the extensive period of time (which depended on the follow-up interval) required to document three PSA rises, the bias associated with backdating the failure date, the potential for greater sensitivity and specificity of other definitions in predicting clinical failure, and the substantial difference between this and surgical definitions of failure. Several methods have been proposed to deal with these areas of concern.
To accommodate the variation in laboratory testing of PSA levels, to reduce the problems associated with PSA values near the lower limit of detection appearing to change by large percentages when increases occurred, and to discount minor fluctuations in PSA production in normal prostate tissue, a definition that quantified the minimum amount of each rise in PSA was proposed, for example, three PSA rises of at least 0.5 ng/mL each.[9] Another method recommended as a way to compensate for small but consecutive increases in PSA level, which also reportedly enhanced the predictive power of the ASTRO definition, was to stipulate a required minimum total PSA level (eg, 1.5 ng/mL) in addition to the requirement of three consecutive rises in PSA.[10] Proposals to decrease the amount of time necessary to document three PSA rises, especially with follow-up intervals of 6 months or more, included defining failure as fewer rises, such as two but of a certain value each, or any elevation above an absolute nadir value, such as 0.2 ng/mL.[9,11] Some authors suggested that the bias introduced by backdating could be remedied by allowing for adequate length of follow-up, perhaps an additional 3 years beyond the point chosen for analysis.[12] The bias stems from reporting failure at the backdated date, when it actually takes considerably longer (ie, time for three PSA rises) to declare a treatment failure. If there is not enough follow-up time available to allow for three rises in PSA, the failure rate may be significantly underestimated at the backdated date. This bias could also be dealt with by moving the reported failure date to the date when the failure was actually determined (in the case of the ASTRO definition, the date of the third PSA rise).[13] This, of course, would remove the backdating aspect of the definition.
Finally, the backdating bias could be handled by a more complicated option that would involve backdating the censor date for patients with one or two PSA rises for whom no additional information was available.[14] To address the lack of uniformity between the definitions of failure used for differing treatment modalities, a single, surgically oriented definition using a solitary cutoff point was proposed.[11,15] As might be expected, without appropriate sensitivity and specificity testing, enthusiasm was decidedly lacking.
Comparing Failure Definitions
External-Beam Irradiation
To expand the work of Shipley et al and the first multi-institutional outcome study,[8] nine institutions recently contributed 4,839 patients with T1/2 prostate cancer to a single, combined database.[16] All patients were treated in the PSA era and, therefore, had both pretreatment and a series of posttreatment PSA measurements. These men were treated with definitive external-beam irradiation alone no more recently than 1995, to provide potential follow-up of at least 5 years. Median follow-up was calculated at 6.3 years with 2,049 patients still available for analysis at 5 years, 616 at 8 years, and 179 at 10 years posttreatment. Not only did this database provide the most robust outcome report on external-beam radiation to date, but it also offered a sound basis for objectively testing and comparing the many failure definitions that had been suggested.
- Sensitivity and Specificity: After examining the work on this subject, Thames et al tested 101 definitions of PSA failure with regard to their sensitivity and specificity in predicting clinical failure using the 4,839-patient multi-institution database.[17] A selection of these findings is listed in Table 1. The investigators defined current nadir as the lowest PSA measurement prior to the current measurement. Thus, the current nadir can change during follow-up. Absolute nadir is the lowest PSA value during the entire follow-up period as assessed retrospectively. (The absolute nadir is the minimum of all the current nadirs measured during follow-up.) Failure definitions based on two, three, or four PSA rises are dated either at the time of the last rise ("call date"), or are "backdated" to the date halfway between the first PSA rise and the previous PSA. Clinical failure was defined as local disease recurrence, distant metastasis, PSA > 25 ng/mL, or the administration of hormone therapy. Although the last two of these factors are subject to debate, this definition took into account reasonable criteria and opinion to date.
The last issue, the administration of hormone therapy before clinical failure, is especially problematic. In any dataset, hormone therapy is at times administered prior to clinical failure, altering the natural progression of disease, so it is not certain that these patients would indeed have developed clinical failure had treatment not been administered. One must assume, however, that they are the very patients with the greatest suspicion of failure, so this circumstance cannot be ignored and the patients simply censored. Therefore, each of these four criteria was included in the definition of clinical failure.
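The distinction between current nadir and absolute nadir described earlier can be illustrated with a short sketch. The function names and the PSA values are hypothetical and chosen only for illustration. The sketch assumes the absolute nadir is taken over every value in the series, including the last one.

```python
from typing import List, Optional


def current_nadirs(psa: List[float]) -> List[Optional[float]]:
    """For each measurement, the lowest PSA recorded before it.

    The first measurement has no prior value, so its entry is None.
    """
    result: List[Optional[float]] = []
    lowest: Optional[float] = None
    for value in psa:
        result.append(lowest)  # lowest value seen before this measurement
        lowest = value if lowest is None else min(lowest, value)
    return result


def absolute_nadir(psa: List[float]) -> float:
    """Lowest PSA over the whole follow-up, assessed retrospectively."""
    return min(psa)


# Hypothetical posttreatment series in ng/mL, in chronological order.
values = [2.0, 0.8, 0.5, 0.9, 1.4]
running = current_nadirs(values)   # [None, 2.0, 0.8, 0.5, 0.5]
overall = absolute_nadir(values)   # 0.5
```

The current nadir is available at each visit, whereas the absolute nadir is known only once follow-up is complete, which is why definitions based on it are necessarily retrospective.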
To calculate the sensitivity and specificity of each definition of failure, each patient was scored as either true-negative (no biochemical failure and no clinical failure), true-positive (biochemical failure followed by clinical failure), false-negative (clinical failure without a preceding biochemical failure or biochemical failure occurring after clinical failure), or false-positive (biochemical failure not followed by clinical failure). Then:
Sensitivity = true-positive / (true-positive + false-negative)
Specificity = true-negative / (true-negative + false-positive)
Thus, the sensitivity of a definition is the proportion of patients with a clinical recurrence who have had a prior biochemical failure, and the specificity is the proportion of patients without clinical failure who have not had a biochemical failure. As seen in Table 1, only four definitions of biochemical failure were both more sensitive and more specific than the ASTRO definition. In retrospect, the ASTRO definition appears to have been a remarkably good choice.
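The scoring scheme described above can be sketched as follows. The data layout (time to biochemical failure and time to clinical failure in months, with `None` meaning the event did not occur), the function names, and the eight-patient cohort are all hypothetical and are not taken from the Thames et al analysis. Treating simultaneous biochemical and clinical failure as a false-negative is an assumption of this sketch.

```python
from collections import Counter
from typing import List, Optional, Tuple

# (months to biochemical failure, months to clinical failure); None = no event
Patient = Tuple[Optional[float], Optional[float]]


def score(patient: Patient) -> str:
    """Score one patient as TP, FP, TN, or FN."""
    biochemical, clinical = patient
    if clinical is None:
        # No clinical failure: any biochemical failure is a false-positive.
        return "FP" if biochemical is not None else "TN"
    # Clinical failure occurred: a true-positive needs an earlier
    # biochemical failure (assumption: ties count as false-negative).
    if biochemical is not None and biochemical < clinical:
        return "TP"
    return "FN"


def sensitivity_specificity(cohort: List[Patient]) -> Tuple[float, float]:
    """Return (sensitivity, specificity) for a failure definition."""
    counts = Counter(score(p) for p in cohort)
    positives = counts["TP"] + counts["FN"]
    negatives = counts["TN"] + counts["FP"]
    if positives == 0 or negatives == 0:
        raise ValueError("cohort needs patients with and without clinical failure")
    return counts["TP"] / positives, counts["TN"] / negatives


# Hypothetical cohort: 3 TN, 2 TP, 2 FN, 1 FP.
cohort = [
    (None, None), (None, None), (None, None),
    (24, 60), (30, 48),
    (None, 36), (50, 40),
    (18, None),
]
sens, spec = sensitivity_specificity(cohort)
```

For this made-up cohort the sensitivity is 2/4 and the specificity is 3/4. Repeating the calculation with the biochemical failure times produced by each candidate definition is what allows definitions to be compared on the same patients.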
