Oncologist Dr. Vivek Subbiah and computer scientist Dr. Jason Roszik, both from MD Anderson, discuss optimizing “omics” databases to inform cancer patient care.
Cancer Network spoke with oncologist Vivek Subbiah, MD, and mathematician and computer scientist Jason Roszik, PhD, MBA, who highlighted important current genetics/genomics and proteomics databases in the United States, and how they can be improved to better inform real-world treatment of patients with cancer.
Dr. Subbiah is an assistant professor in the Department of Investigational Cancer Therapeutics and Associate Medical Director for Clinical Center for Targeted Therapy at The University of Texas MD Anderson Cancer Center, where he studies precision oncology strategies including molecularly targeted therapies, immunotherapies, and radiopharmaceuticals for patients with cancer.
Dr. Roszik is an assistant professor in the departments of Genomic Medicine and Melanoma Medical Oncology at MD Anderson Cancer Center. He is developing computational algorithms and pipelines, and visualization tools for integration and analysis of large data sets to identify novel targets and predictive signatures for targeted therapies and immunotherapies. He participated in the international project known as The Cancer Genome Atlas (TCGA).
-Interviewed by Bryant Furlow
Cancer Network: How many public cancer genomics databases exist currently, and how are they used to inform clinical decision making?
Dr. Subbiah: The one database which is indirectly used in clinical decision making is the COSMIC database. Most of the databases provide objective frequency of aberrations.
Dr. Roszik: Most of these databases that contain genomic information on thousands of tumor samples are from major projects like the TCGA, COSMIC (the Wellcome Sanger Institute’s Catalogue of Somatic Mutations in Cancer), CCLE (Cancer Cell Line Encyclopedia), and AACR Project GENIE (Genomics Evidence Neoplasia Information Exchange). There are also NCI GDC (the National Cancer Institute’s Genomic Data Commons) and NCBI dbGaP, which contain many additional large datasets.
Cancer Network: When you say that COSMIC is used indirectly in clinical decision making, what do you mean by that? Are its data accessed by a decision tool or application?
Dr. Subbiah: COSMIC has its own website where genes and mutations can be queried. In addition to this, decision-supporting tools that offer mutation analysis often use the COSMIC database and provide that information if a variant was found in COSMIC. COSMIC also contains information about whether the mutation is known or predicted to be pathogenic or not. This is very useful indirectly in clinical practice.
Cancer Network: You’ve recently argued that the “siloing” of cancer genomics and proteomics databases has hampered real-world implementation of precision oncology. What do you mean by that?
Dr. Subbiah: First of all, I would like to applaud the various stakeholders for getting together and putting up these databases that Dr. Roszik mentioned. One of the major challenges now is that we do not have the ability to track outcomes of these patients with molecularly targeted therapies. This is probably because of the lack of data structure for this. Also, “siloing” prevents us from including additional data types in analyses.
Cancer Network: Do you believe that the clinical relevance of “variants of unknown significance” (“VUS”) could be clarified if data from different databases were pooled for analysis?
Dr. Roszik: Connecting many databases makes it possible to include additional variables in analyses. In addition, there are databases and also scientific publications that contain the same type of data for different patient cohorts. Sample size is often a limiting factor when studying variants, especially in the case of rare cancers, or when only a very small subset of a common cancer is affected by a variant. Increasing the number of samples with genomics information would definitely help identify whether a variant is pathogenic or not.
Dr. Subbiah: Great question. Let’s discuss what constitutes a VUS. If genetic testing results are neither positive nor negative, then they fall into the VUS category. Variant classification and attribution rely heavily on building levels of evidence. As we build evidence, the VUS category can be amended to “pathogenic” or “benign.” As I mentioned in my previous answer, we do not have annotated clinical information on molecularly targeted therapies in these public databases. We can, at best, know the prevalence of these VUSs from these databases, but not the conversion from “unknown” to “known.” The COSMIC database that I mentioned before, from the Sanger Institute in the UK, is one that updates these data.
Cancer Network: Private companies that provide clinical sequencing services consider their databases to be business-proprietary information. Do these companies share their insights into the clinical relevance of VUS? If not, could or should they do so, in your opinion?
Dr. Subbiah: Although the clinical sequencing companies have access to vast databases, they are not clinically annotated. They will need to work with academic centers and other cancer network centers to clinically annotate their databases and outcomes to matched therapies. It would be terrific if there were a private-academia-NCI partnership to clinically annotate all the databases with respect to the outcomes of patients on targeted therapy, immunotherapy, or standard-of-care therapy.
Cancer Network: How might data from public, nonproprietary isolated databases be integrated or connected to create larger databases with which the significance of particular gene variants or gene expression signatures, for example, might be clarified? Have any such integration efforts been made?
Dr. Roszik: There are a few projects in development, and some of them already offer analysis tools as well or can be accessed through cBioPortal. For example, the goal of AACR Project GENIE is to provide larger data sets and statistical power to improve clinical decision making. NCI GDC already contains clinical sequencing data from Foundation Medicine Inc., for integrated analyses with TCGA and TARGET [The Therapeutically Applicable Research to Generate Effective Treatments] data. This is a good example [of how] a private company helped to more than double the number of cases in a database.
Cancer Network: How big of a challenge would it be to integrate information from the large public databases? Do these databases utilize the same data fields and categorize or organize data in easily integrated ways?
Dr. Subbiah: It is a huge challenge to integrate information that is already there in public repositories. The challenge mainly arises from the heterogeneity of the databases. Recognizing this, several stakeholders have come together for defining basic common elements, and [setting] standards to allow data pooling for maximum analytical power for the future. The Center for Medical Technology Policy (CMTP) has identified 49 elements as a core set of data elements essential to understanding the clinical utility of molecularly targeted therapies in oncology. Hopefully this can facilitate future precision oncology efforts.
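As an illustration of what mapping heterogeneous databases onto common data elements could look like in practice, here is a minimal Python sketch. The field names, the alias table, and the `harmonize` helper are all hypothetical, chosen only for this example; they are not the 49 CMTP elements and do not reflect the schema of any database named in this interview.

```python
# Hypothetical alias table: source-specific field names -> common field names.
# None of these names come from CMTP or from any real database schema.
FIELD_ALIASES = {
    "gene": "gene_symbol",
    "hugo_symbol": "gene_symbol",
    "gene_symbol": "gene_symbol",
    "dx": "diagnosis",
    "cancer_type": "diagnosis",
    "diagnosis": "diagnosis",
    "therapy": "treatment",
    "drug": "treatment",
    "treatment": "treatment",
}

# The (assumed) minimal set of common elements every pooled record should carry.
COMMON_FIELDS = ("gene_symbol", "diagnosis", "treatment")


def harmonize(record):
    """Return a copy of `record` restricted to the common fields.

    Unknown source fields are dropped; common fields that the source
    does not provide are kept with the value None so gaps stay visible.
    """
    out = {field: None for field in COMMON_FIELDS}
    for key, value in record.items():
        common = FIELD_ALIASES.get(key.strip().lower())
        if common is not None:
            out[common] = value
    return out


if __name__ == "__main__":
    source_a = {"Hugo_Symbol": "BRAF", "Cancer_Type": "melanoma"}
    source_b = {"gene": "BRAF", "dx": "melanoma", "drug": "example-inhibitor"}
    print(harmonize(source_a))
    print(harmonize(source_b))
```

Keeping missing elements as explicit `None` values, rather than dropping them, makes it easy to see which sources lack outcome or treatment annotation once records are pooled.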
Cancer Network: What about patient privacy? Do the public databases use patient-specific identifiers? And if they do, is there a concern that pooling data from different databases might create a “pseudo-replication” problem with the same tumor sample being counted more than once, creating the illusion of a stronger association between gene variants and tumor behavior?
Dr. Roszik: Public databases are designed to prevent identification of patients. Unfortunately, when using public, de-identified data, it is difficult to avoid duplication. The same sample might be used in multiple studies, or multiple versions of a database might be re-analyzed and pooled with other datasets by various researchers. Or a patient may have genomic data from private companies and also participate in studies at cancer centers. One solution would be for the data sources to work together to prevent duplication while also protecting patient privacy.
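One way to picture the duplication problem, and a possible mitigation that needs no patient identifiers, is to fingerprint each de-identified record by its set of variants and keep only one record per fingerprint when pooling. The sketch below is a heuristic under stated assumptions, not a method used by any of the databases discussed: the record layout, the function names, and the coordinates are hypothetical, and samples with very few variants could collide by chance, so a real effort would need richer fingerprints and cooperation between data sources.

```python
import hashlib


def variant_fingerprint(variants):
    """Hash an order-independent set of (chrom, pos, ref, alt) tuples."""
    canonical = sorted(f"{c}:{p}:{r}>{a}" for c, p, r, a in set(variants))
    return hashlib.sha256("|".join(canonical).encode("utf-8")).hexdigest()


def pool_without_duplicates(*datasets):
    """Pool records from several datasets, keeping the first record seen
    for each variant fingerprint (assumed to mark the same tumor sample)."""
    seen = {}
    for dataset in datasets:
        for record in dataset:
            fp = variant_fingerprint(record["variants"])
            seen.setdefault(fp, record)
    return list(seen.values())


if __name__ == "__main__":
    # Illustrative placeholder records; IDs and coordinates are made up.
    dataset_a = [
        {"sample_id": "A-001",
         "variants": [("7", 1000, "A", "T"), ("17", 2000, "C", "T")]},
    ]
    dataset_b = [
        # Same variants as A-001, listed in a different order under a new ID.
        {"sample_id": "B-917",
         "variants": [("17", 2000, "C", "T"), ("7", 1000, "A", "T")]},
        {"sample_id": "B-918",
         "variants": [("12", 3000, "C", "A")]},
    ]
    pooled = pool_without_duplicates(dataset_a, dataset_b)
    print([r["sample_id"] for r in pooled])
```

Because only a hash of the variant set is compared, the data sources would not need to exchange identifiers to detect a likely repeat, although the fingerprint itself could still be sensitive and would need to be handled accordingly.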
Cancer Network: Are there online tools or smart-device applications available that allow clinicians to pull what’s known about a given variant or gene from multiple public databases?
Dr. Subbiah: There are several that have been developed. One that I use, developed at MD Anderson Cancer Center, is called “Personalized Cancer Therapy.” This website compiles the available scientific knowledge on cancer-associated abnormal genes and gene products and their implications for cancer therapy. Another useful one is “My Cancer Genome,” which is a precision cancer medicine knowledge resource for physicians, patients, caregivers, and researchers. My Cancer Genome provides up-to-date information on what mutations make cancers grow, as well as related therapeutic implications, including available clinical trials.
Dr. Roszik: In addition, as I mentioned earlier, cBioPortal provides tools to analyze and visualize data from TCGA and many other projects, and the GDC Data Portal provides tools for integrated analysis of data from TCGA, TARGET, and Foundation Medicine.
Cancer Network: In what other ways can genomic data be better leveraged for patients?
Dr. Roszik: Many new big data sets are being created and published these days. It would be useful to have a common format, for example for variants or gene-expression data, to make integration easier. Furthermore, a few scientific journals already require that datasets be made available upon publication. However, authors often publish them in a way that makes it difficult or impossible to use the data. I think it would be important for all journals to require depositing the raw data in a safe place, for example NCBI dbGaP, and also to publish all the de-identified, processed genomics data in a common, appropriate format, especially if the data generation is paid for by the NIH.
Dr. Subbiah: As I said earlier, it would be terrific if there were a private-academia-NCI partnership to clinically annotate all the databases with respect to patient outcomes. In addition, this registry should also add data from real-life patients, who can also contribute a lot to data generation. Ultimately all stakeholders should form a national registry, not just from academic centers but also from community practices, with input from everyone, including patients. It may be hypothetical or a dream, but anything is possible when all of us come together for a common cause to end cancer.