Genome-Wide Association Studies of Cancer: Principles and Potential Utility
ABSTRACT: Genome-wide association studies (GWAS) have emerged as a new approach for investigating the genetic basis of complex diseases. In oncology, genome-wide studies of nearly all common malignancies have been performed and more than 100 genetic variants associated with increased risks have been identified. GWAS approaches are powerful research tools that are revealing novel pathways important in carcinogenesis and promise to further enhance our understanding of the basis of inherited cancer susceptibility. However, “personal genomic tests” based on cancer GWAS results that are currently being offered by for-profit commercial companies for cancer risk prediction have unproven clinical utility and may risk false conveyance of reassurance or alarm.
Until relatively recently, the field of medical genetics has focused on the identification and treatment of rare, single-gene disorders that are usually associated with a high risk of a particular disease or trait. Since Watson and Crick’s groundbreaking explanation of DNA structure in 1953, our understanding of patterns of inheritance has progressed considerably. We now recognize that the genetic architecture of complex diseases such as cancer is probably better characterized by polygenic and multifactorial inheritance, wherein heritability is determined by the joint action of multiple genes and their interaction with environmental factors.
Technological advances have facilitated detailed interrogation of the human genome and moved investigations of genetic causation from linkage studies in high-risk cancer families to genome-wide association studies (GWAS). Through a hypothesis-neutral genome-based approach, GWAS compare common DNA variations in a large set of unrelated cases and controls to identify genetic variants associated with disease risk. Common genetic variations that contribute to susceptibility to more than 40 different diseases, including heart disease, diabetes, asthma, psychiatric disorders and inflammatory bowel disease, have been identified.
GWAS have also been performed for most common malignancies and more than 100 genomic variants associated with risks of various cancers have been described. This unprecedented rapid amassing of new genetic risk variants associated with cancer risk has generated hope that these germ-line markers may also prove useful for cancer prevention, improve our understanding of cancer pathogenesis, and possibly direct treatment of patients. However, considerable scientific barriers must be overcome before these genomic data can meaningfully contribute to patient care. In this review we detail the role of genomic variation in cancer susceptibility and discuss the potential clinical implications of recent discoveries. We review warnings by professional societies and government advisory bodies regarding the use of genomic information to guide clinical decisions, with particular emphasis on the potential risks posed by for-profit, direct-to-consumer marketing of genomic risk panels.
The Genetics of Cancer Predisposition
Genetic linkage studies performed in the 1980s and 1990s identified a number of highly penetrant syndromes responsible for cancer predisposition within certain families. Molecular insight into breast-ovarian cancer (BRCA1/2), Lynch (mismatch repair genes), Li-Fraumeni (p53), and Cowden (PTEN) syndromes, among others, has had a powerful impact on current clinical management of families affected by these syndromes. These syndromes, however, account for only a fraction of the familial risk of cancer, leaving much of the heritability of cancer unexplained.
Epidemiologic evidence provides strong support for a hereditary component to cancer. Twin studies demonstrating greater concordance for most cancers among monozygotic than among dizygotic twins and siblings, and population studies of familial clustering, suggest that genetic factors predispose to even environmentally induced diseases such as lung cancer. Using a candidate gene approach, pathways thought to be important in carcinogenesis have been investigated. In breast cancer, for example, a small number of candidate genes involved in the response to DNA damage (ATM,[4-6] CHEK2,[7-9] BRIP1, and PALB2) have been associated with a modest increase in breast cancer risk (approximately two-fold). It is estimated that these genes, in combination with the high-penetrance but rare cancer predisposition syndrome genes such as BRCA, account for only about one-quarter to one-third of familial risk, leaving the remainder unexplained. One explanation for this seemingly “missing” fraction of heritable disease is the common variant common disease (CVCD) hypothesis, which assumes that many different common genetic variants, each with a small effect size, collectively cause disease.
Recent developments, including the completion of the Human Genome Project, the HapMap Project, and the emergence of high-throughput genotyping, have allowed further investigation of this hypothesis, through the GWAS approach.
Experimental Design of Genome-Wide Association Studies
Genomic variation in GWAS reported to date is largely represented by single base pair changes known as single nucleotide polymorphisms (SNPs). Although structural genomic variations such as deletions, duplications, inversions, and copy-number variations also exist and can be measured, assessment of these more complex genomic changes is just beginning to be incorporated into large genome-wide studies. GWAS rely on the phenomenon of linkage disequilibrium, wherein SNPs are not inherited individually but instead are in linkage disequilibrium blocks, with many nearby SNPs being highly correlated. This enables the selection of one SNP (the tag SNP) that represents up to 50,000 surrounding base pairs. Through the HapMap project we have learned that genotyping sets of 500,000 to 1,000,000 tag SNPs can cover approximately 80% of all common SNPs in the genome.
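The correlation between nearby SNPs that makes tagging possible is commonly summarized by the r² statistic. The following is a minimal Python sketch, not taken from any of the studies reviewed here, that computes r² for two biallelic SNPs from phased haplotypes. The function name and the 0/1 allele coding are illustrative assumptions, and both SNPs are assumed to be polymorphic in the sample.

```python
def ld_r_squared(haplotypes):
    """Return the r^2 measure of linkage disequilibrium between two SNPs.

    haplotypes: list of (a, b) pairs, one per phased chromosome, where a and b
    are 0 or 1 for the allele carried at the first and second SNP.
    Assumes both SNPs are polymorphic (allele frequencies strictly between 0 and 1).
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n            # frequency of allele 1 at SNP A
    p_b = sum(b for _, b in haplotypes) / n            # frequency of allele 1 at SNP B
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n  # joint frequency
    d = p_ab - p_a * p_b                               # disequilibrium coefficient D
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))


# Two SNPs that always travel together: either one could serve as the tag SNP.
perfect = [(1, 1)] * 30 + [(0, 0)] * 70
# Two SNPs whose alleles are inherited independently: neither tags the other.
independent = [(1, 1)] * 25 + [(1, 0)] * 25 + [(0, 1)] * 25 + [(0, 0)] * 25
```

An r² close to 1 means that genotyping one SNP captures nearly all of the information carried by the other, whereas an r² close to 0 means that the two SNPs must be genotyped separately.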
These striking technological advances have facilitated GWAS using multistep designs (Figure 1). As large GWAS are often cost-prohibitive, a tiered, multistage approach is commonly used. In a simple two-stage design, all tag SNPs are tested during the first stage on a small subset of cases and controls, usually between 25% and 50% of total participants. During the second stage, significant SNPs, comprising approximately 10% of the SNPs tested, are genotyped on all remaining samples. The same sample of SNPs is then tested on the initial subset, allowing for a combined analysis. For study designs with three or more stages, the significant SNPs are included for replication testing in different case-control sample sets.
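The saving offered by a two-stage design can be illustrated with simple arithmetic on the number of genotype calls. The sketch below is an illustration only: the function name, the sample size of 4,000 participants, and the use of genotype calls as a stand-in for cost are assumptions, while the stage 1 fraction (25%) and the proportion of SNPs carried forward (10%) follow the figures given above.

```python
def genotype_calls(n_samples, n_snps, stage1_fraction, carried_fraction):
    """Compare genotyping burden of a one-stage and a two-stage design.

    Returns (one_stage_calls, two_stage_calls), where a "call" is one SNP
    genotyped in one participant.
    """
    one_stage = n_samples * n_snps
    # Stage 1: every tag SNP in a subset of participants.
    stage1 = stage1_fraction * n_samples * n_snps
    # Stage 2: only the SNPs carried forward, in the remaining participants.
    stage2 = (1 - stage1_fraction) * n_samples * carried_fraction * n_snps
    return one_stage, stage1 + stage2


# Hypothetical study: 4,000 participants, 500,000 tag SNPs,
# 25% of participants in stage 1, 10% of SNPs carried into stage 2.
one_stage, two_stage = genotype_calls(4000, 500_000, 0.25, 0.10)
ratio = two_stage / one_stage
```

Under these assumptions the two-stage design needs roughly one-third of the genotype calls of a one-stage design (0.25 + 0.75 × 0.10 = 0.325), at the price of somewhat lower power for SNPs that narrowly miss the stage 1 cutoff.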
A typical GWAS, containing 1,000 to 2,000 cases with often a larger number of controls, is powered to detect differences in common SNPs with an effect size of 1.5–2.0-fold. Importantly, some smaller GWAS reporting effect sizes of 1.1–1.3 must be interpreted with caution, as these studies may have been underpowered to detect such small differences. In fact, the detection of modest genetic effects with odds ratios of 1.3 or less and minor allele frequencies under 10% may require more than 10,000 cases and 10,000 controls for adequate statistical power. Most recent significant findings have been for alleles with relative risks of 1.1 to 1.3, and their detection has required large, global, collaborative group efforts.
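The dependence of power on sample size, allele frequency, and effect size can be explored with a normal approximation to a test comparing allele frequencies in cases and controls. This is a rough sketch under stated assumptions rather than the method used in any cited study: it assumes a per-allele (multiplicative) odds ratio, treats the two alleles of each participant as independent, and uses a two-sided significance level supplied by the user. All function and parameter names are hypothetical.

```python
from statistics import NormalDist


def case_allele_frequency(control_freq, odds_ratio):
    """Risk-allele frequency in cases implied by a per-allele odds ratio."""
    odds = odds_ratio * control_freq / (1 - control_freq)
    return odds / (1 + odds)


def allelic_power(n_cases, n_controls, control_freq, odds_ratio, alpha):
    """Approximate power of a two-sided test of allele frequency difference."""
    norm = NormalDist()
    p0 = control_freq
    p1 = case_allele_frequency(p0, odds_ratio)
    a1, a0 = 2 * n_cases, 2 * n_controls          # two alleles per participant
    pooled = (a1 * p1 + a0 * p0) / (a1 + a0)
    se_null = (pooled * (1 - pooled) * (1 / a1 + 1 / a0)) ** 0.5
    se_alt = (p1 * (1 - p1) / a1 + p0 * (1 - p0) / a0) ** 0.5
    z_crit = norm.inv_cdf(1 - alpha / 2)
    # Probability of exceeding the critical value in the direction of the effect;
    # the negligible contribution of the opposite tail is ignored.
    return norm.cdf((abs(p1 - p0) - z_crit * se_null) / se_alt)


# Same variant (10% allele frequency, odds ratio 1.2) at two study sizes,
# tested at a genome-wide threshold of 1e-7.
power_small = allelic_power(1_000, 1_000, 0.10, 1.2, 1e-7)
power_large = allelic_power(10_000, 10_000, 0.10, 1.2, 1e-7)
```

Comparing `power_small` with `power_large` shows why variants with odds ratios near 1.2 are essentially undetectable in a study of 1,000 cases and 1,000 controls at genome-wide significance, and why their discovery has depended on pooled, collaborative cohorts.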
GWAS approaches require careful selection and accurate characterization of case and control participants. In breast cancers, for example, gene-expression profiling has defined different subtypes of breast tumors (eg, luminal, basal subtype) with variations in response to treatment and clinical outcomes. As we learn more about the heterogeneity of malignancy it may be necessary to perform individual GWAS for different subtypes of disease, to accurately detect underlying genetic heterogeneity. As an example, the 10q26 locus, which maps to FGFR2 and has been implicated in a number of breast cancer GWAS, is most strongly associated with the diagnosis of estrogen receptor (ER)-positive disease.
GWAS for women with triple-negative breast cancer are ongoing and may demonstrate different genetic associations. A well-matched and carefully characterized control cohort is another key component of GWAS design that is a challenging but essential prerequisite for a precise study. For example, cases and controls must be well-matched to avoid population stratification, which can bias results due to differences in race or ethnicity of those with and without the disease of interest.[20-23]
Genome-Wide Association Study Interpretation
Analysis of GWAS data produces an odds ratio and P value for each SNP, but great care is required to avoid false-positive or false-negative reporting. A quality control ‘data cleaning’ is first performed to detect problems such as poor genotyping results or unexpected relatedness among participants. Genotype frequencies among controls, for example, are compared with an accepted standard (Hardy-Weinberg equilibrium), as deviations from this equilibrium often indicate a methodologic flaw in the study. Population heterogeneity can introduce a bias in studies of participants with different ancestries, although a post-hoc principal components analysis can be performed to account for this. Further stringent statistical analysis must be applied to account for the multiple testing performed in GWAS. This is usually achieved using a Bonferroni correction, in which the threshold P value (usually 5 × 10⁻²) is divided by the number of tests performed (~500,000, depending on the array used), resulting in a significance threshold in the region of P < 1 × 10⁻⁷.
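The three calculations described above (a per-SNP odds ratio and P value, a Hardy-Weinberg check in controls, and a Bonferroni threshold) can be sketched in a few lines of Python. This is an illustrative sketch only: the function names are hypothetical, the association test shown is a simple Pearson chi-square test on allele counts rather than the regression models typically used in practice, and the Hardy-Weinberg check assumes a polymorphic SNP so that no expected genotype count is zero.

```python
import math


def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


def allelic_test(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Odds ratio and Pearson chi-square P value from a 2x2 table of allele counts."""
    odds_ratio = (case_alt * ctrl_ref) / (case_ref * ctrl_alt)
    n = case_alt + case_ref + ctrl_alt + ctrl_ref
    chi2 = n * (case_alt * ctrl_ref - case_ref * ctrl_alt) ** 2 / (
        (case_alt + case_ref) * (ctrl_alt + ctrl_ref)
        * (case_alt + ctrl_alt) * (case_ref + ctrl_ref)
    )
    return odds_ratio, chi2_sf_1df(chi2)


def hwe_p_value(n_aa, n_ab, n_bb):
    """Chi-square goodness-of-fit P value for Hardy-Weinberg equilibrium."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)               # frequency of allele A
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return chi2_sf_1df(chi2)


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests


threshold = bonferroni_threshold(0.05, 500_000)    # 0.05 / 500,000 = 1e-7
# Hypothetical allele counts for one SNP in 500 cases and 500 controls.
odds_ratio, p_value = allelic_test(300, 700, 200, 800)
is_significant = p_value < threshold
```

A SNP would be reported only if `is_significant` is true and its control genotypes do not depart markedly from Hardy-Weinberg expectations, for example as judged by `hwe_p_value`.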
Only associations between a SNP and an outcome of interest that fall below this threshold should be considered statistically significant, but despite this rigid approach false-positive results frequently occur, as evidenced by failure to replicate findings in subsequent studies using the same SNPs and similar populations. Despite optimal design and careful analysis, interpretation of GWAS results has been complicated by a publication bias whereby positive associations are preferentially published.
Lessons From Genome-Wide Association Studies in Cancer
Genomic variation has not yet contributed positively to patient management in cancer care; nonetheless, the 50-plus GWAS that have been performed for more than 15 malignancies have generated many interesting associations, stimulating widespread interest in the clinical utility of this tool. Breast cancer (nine GWAS),[25-33] prostate cancer (12 GWAS),[34-45] colorectal cancer (seven GWAS),[46-52] and lung cancer (seven GWAS)[53-57] have been extensively investigated, and similar studies have been performed in pancreatic,[58,59] gastric, esophageal, bladder,[62,63] testicular,[64,65] ovarian, skin,[67-71] thyroid, neuroblastoma,[73,74] brain,[75,76] and hematological cancers[77-81] (Tables 2 and 3). A detailed discussion of each study is beyond the scope of this manuscript, but some notable lessons have been learned from individual findings.
The 8q24 locus, first reported in Icelandic patients, is of particular interest. It is located in a region containing no known genes (referred to as a ‘gene desert’) but has been associated with prostate, bladder, breast, and colorectal cancer, and the association has been successfully replicated in independent studies. It is possible that genomic variation at this locus marks a carcinogenic pathway common to many cancers, and recent evidence suggests a role in regulation of the myc oncogene.[82-84] A second locus at 5p15.33 has also been associated with multiple cancers. A multicancer study of > 30,000 cases and > 45,000 controls identified an association between the 5p15.33 locus and risk of lung, bladder, prostate, cervical, and basal cell cancer, with a trend towards a protective effect for cutaneous melanoma.
Epidemiological studies have identified a higher incidence of aggressive prostate cancer in African-American men, and there has been much debate as to whether this represents a socioeconomic difference in how African-American men are screened and treated for prostate cancer or a distinct natural history. Recent GWAS data suggest that seven independent risk variants identified in the 8q24 region are more commonly associated with prostate cancer in African-American men, possibly suggesting a genetic link for this epidemiological observation. The first GWAS in pancreatic cancer provided another example of genomic data supporting prior epidemiological observations. Pancreatic cancer risk had long been associated with ABO blood type, and a SNP located in the first intron of the ABO blood group gene on chromosome 9q34 significantly increased the risk of pancreatic cancer by 1.2-fold.
Cancer GWAS have also identified associations in genes encoding proteins known to be important in organogenesis and organ function. Prostate cancer GWAS have identified a number of significant associations between prostate cancer diagnosis and SNPs in genes that encode key products of the prostate; serum levels of microseminoprotein and a number of kallikreins have been identified as potential prostate cancer biomarkers, and GWAS have now offered associated genomic markers for further study.[34,35] In testicular germ cell tumors, in which genetic susceptibility is supported by an 8- to 10-fold and a 4- to 6-fold increased risk of disease in brothers and fathers of patients, respectively, the two GWAS both identified risk variants at 12q22,[64,65] a region that contains KITLG, which is necessary for germ cell development. The risk variant at 12q22 has one of the highest associated risks of any GWAS, conferring more than a 2.5-fold increase in cancer risk. The only thyroid cancer GWAS performed to date identified two significant predisposing SNPs, one at 9q22.33 near FOXE1, which has been implicated in thyroid organogenesis, and the other at 14q13.3 near NKX2-1, a gene important in thyroid gland differentiation.
GWAS in hematological malignancies have identified a number of predisposing SNPs for chronic lymphocytic leukemia, non-Hodgkin lymphoma, and childhood leukemia[77-80]; however, in a study of myeloproliferative neoplasms (MPN) we noted an interesting finding. A somatic point mutation in JAK2 is commonly found in MPN, and the GWAS identified a germline SNP in this same gene that resulted in a three-fold increased risk of developing MPN. This result links germline and somatic variation in the same gene, a relationship similar to that seen with the I1307K APC variant and colorectal cancer.
Melanoma GWAS have provided further results of biological interest. CDKN2A is a well-recognized high-penetrance melanoma susceptibility gene, and a recent GWAS associated an adjacent locus with increased risk of melanoma, confirming this as an important region in melanoma pathogenesis. This same study reported a risk allele at 16q24, a region encompassing MC1R, a gene involved in pigmentation, and this SNP was previously reported in a GWAS of hair color and skin pigmentation. Significant associations with variants affecting hair, eye, and skin color were identified in another melanoma GWAS, confirming the ability of this approach to reflect underlying biology.