Genomic Resource Projects

PART | 2 Genomics Technologies, Concepts and Resources

Chapter 10

Genomic Resource Projects

Matthew Parker, Erin Hedlund and Jinghui Zhang

Computational Biology Department, St Jude Children’s Research Hospital, Memphis, TN, USA

Contents

Introduction: The Genomic Data Goldmine
Key Large-Scale Cancer Genomics Projects
    The Cancer Genome Atlas (TCGA)
    International Cancer Genome Consortium (ICGC)
    The St Jude Children’s Research Hospital–Washington University Pediatric Cancer Genome Project (PCGP)
Other Genomics Projects of Note
    Catalogue of Somatic Mutations in Cancer (COSMIC) and dbSNP
    Exome Variant Server (EVS)
    1000 Genomes Project
Data Sources
    Archives
    Portals
Processors
    Online/Offline Analysis Tools
    cBio Cancer Genomics Portal (MSKCC)
    PCGP “Explore”
    Galaxy
    BioMart
    My Cancer Genome
    Standalone Genome Sequencing Viewers: IGV and More...
Cloud Computing
Synopsis and Prospects (Genomic Resources and the Clinic)
Glossary
Abbreviations
References

Cancer Genomics. DOI: http://dx.doi.org/10.1016/B978-0-12-396967-5.00010-4
© 2014 Elsevier Inc. All rights reserved.

Key Concepts

- Genomic data are widely available online, both for bioinformaticians who want to analyze raw data and for biologists/clinicians seeking biologically meaningful conclusions from the analyses
- Large consortia/projects have been launched with the goal of sequencing all types of cancer, from pediatric to adult; these groups are producing data at an incredible rate
- Genomic data exist in four main categories: (1) raw, (2) processed/normalized, (3) interpreted, and (4) summarized within and across diseases
- Data can be downloaded from large data warehouses or from data portals with varying access restrictions; the data portals also provide a means to view interpreted data
- A number of online and offline tools exist to view and analyze genomic data
- To determine the spectrum of disease-related mutations, which helps inform clinical decisions, genomic knowledge must be aggregated from as many sources as possible

INTRODUCTION: THE GENOMIC DATA GOLDMINE

Cancer is a genetic disease caused by alterations in DNA. Ever since the discovery of HRAS G12 mutations 30 years ago [1,2], identification of somatic mutations has provided important insight into the initiation, progression and prognosis of cancer. With the completion of the reference human genome sequence [3] and advances in sequencing and computing technologies, a systematic survey of the cancer genome landscape began in earnest in 2005. The first studies of the modern genomic era took the approach of targeted candidate gene re-sequencing using polymerase chain reaction (PCR) and Sanger sequencing. Notable examples include sequencing of the protein kinase gene family in multiple cancer types [4–6] and sequencing of candidate genes chosen on the basis of genetic lesions identified by copy number changes, gene expression profiling, or a functional role in known disease pathways [7–10]. Some of these studies were ultimately expanded to analyze the entire spectrum of protein-coding genes in the human genome [11–13].

With the advent of next-generation sequencing (NGS) technology, comprehensive investigation of somatic alterations across the entire cancer genome at base-pair resolution has become feasible. The drastic reduction in sequencing cost coupled with the enhanced sequencing capacity of NGS has made it the technology of choice for ongoing genome resource projects like the Cancer Genome Project (CGP) and The Cancer Genome Atlas (TCGA). NGS also catalyzed the formation of new initiatives for genome-wide characterization of somatic lesions in cancer, such as the International Cancer Genome Consortium (ICGC) and the St Jude Children’s Research Hospital–Washington University Pediatric Cancer Genome Project (PCGP). Together, these cancer genome sequencing efforts are expected not only to yield an unparalleled view of the altered signaling pathways in adult and pediatric cancer, but also to identify new gene targets against which novel therapeutics can be developed.

One common feature of the cancer genome sequencing projects is the huge volume of multidimensional data, including sequencing data, copy number variation (CNV), methylation, gene expression and microRNA expression. The raw data files required to carry out these analyses are accessible via public repository databases like the National Center for Biotechnology Information (NCBI) and the European Bioinformatics Institute (EBI).
These raw data are of most interest to bioinformaticians, who can subject them to reanalysis with new algorithms that may identify additional novel genetic lesions or provide new biological insights from a different perspective than the initial scientific report of the data. In addition, the raw data can be used to test novel computational methods, providing further opportunities for reanalysis of individual datasets or meta-analyses across datasets. The overarching theme is that data generated to answer one specific question have the potential to be repurposed to answer another.

Although the final analytical results of the genetic lesions found by the genome resource projects are usually made available as large supplementary files appended to the primary publications, access to the processed data via a data portal is extremely important for both biomedical researchers and computational biologists. Different types of web-based data access portals have been developed to meet the diverse needs of the research community. Several cancer genome sequencing projects such as TCGA, ICGC and PCGP have developed their own data portals that allow researchers to access both open and controlled datasets. These project data portals also provide user interfaces for gene-orientated queries as well as complex queries that integrate genomic, clinical and functional information. A third type of web portal collects data across multiple projects, performs reanalysis and quality control (QC) to ensure consistency and integrity across projects, and provides query and graphical interfaces that support cross-project data interrogation. Through the use of these portals, biomedical researchers who are not specialized in computational analysis can now access processed data for cross-study comparison, for aggregated analysis across multiple disease cohorts, and for information on genes and pathways that may not have been the main focus of the primary publications.

The generation of genomic data is increasing at an exponential rate, and these data archives and portals will preserve the data for many years to come. As our knowledge and experience in analyzing these data grow, it will be possible to go back and reanalyze them with new algorithms that could extract additional biologically meaningful information.

Box 10.1 Types of Genomic Data

Next-Generation Sequencing (NGS)
Second-generation, or next-gen, sequencing allows for the discovery of multiple types of genomic aberration: single nucleotide variations, small insertions/deletions, copy number variations, loss of heterozygosity and structural variations. NGS can be applied to the whole genome, the exome or the transcriptome.

Whole Genome Sequencing (WGS)
The most complete DNA sequencing, covering the majority of the genome: exons, introns, and intergenic regions.

Exome Capture Sequencing (ECS)
This method captures just the coding exons of the human genome using specially designed probes complementary to these exons. The DNA of interest is hybridized to the probes and all other DNA is washed away.

Targeted Sequencing
Like exome capture, targeted sequencing uses probes, but here they are designed against specific regions of the genome that are of interest, e.g. mutated genes. Targeted sequencing is usually used for validation and high-throughput screening of recurring mutations.

RNA-Seq (Transcriptome Sequencing)
Next-generation sequencing of cDNA from all transcribed mRNAs, or from a subset selected using the capture techniques described above. This allows for differential expression analysis like microarray-based methods (see below) but, in addition, facilitates the discovery of SNVs, SVs, and novel isoforms/exons.

miRNA-Seq
As in RNA sequencing, small non-coding RNAs can be prepared from cells and sequenced. miRNA-seq data provide the nucleotide sequence in addition to the expression levels.

ChIP-Seq
This technique uses next-generation sequencing to discover the DNA bound by proteins of interest. It is an extension of chromatin immunoprecipitation (ChIP), in which specific interactions between proteins and DNA are investigated; using NGS enables a genome-wide view to be discerned.

Chromosome Conformation Capture (3C)
This technique allows researchers to examine the higher order structure of the genome (the chromatin) by crosslinking physically interacting chromatin with formaldehyde, followed by enzymatic digestion and ligation; the frequency with which two restriction fragments are ligated is a measure of the frequency of their interaction within the nucleus.

Chromosome Conformation Capture Carbon-Copy (5C)
5C is an extension of 3C that uses ligation-mediated amplification to copy and amplify a subset of the 3C library, followed by detection of the ligation products using NGS [14]. 5C allows three-dimensional interaction maps of the genome to be generated, uncovering long-range interactions of promoters and distal elements that can potentially affect gene regulation [15].

Microarrays
Microarrays contain small sequences from thousands of genes or other genomic regions embedded on a solid surface such as a glass slide, which is subsequently hybridized with DNA or RNA. While less expensive and faster than NGS technologies, they offer only one or two data types per array type.

SNP Array
A DNA microarray (DNA probes immobilized on a glass slide/chip) used to determine single nucleotide changes; it can also be used to estimate copy number variations.

RNA Expression
RNA expression arrays (“microarrays”) measure differences in the expression of genes between two populations of samples. As with DNA microarrays, probes are spotted onto glass slides or chips.

miRNA Expression
miRNA microarrays have been developed to measure the expression of these small non-coding RNAs.

RPPA
Antibody-based reverse phase protein arrays measure protein expression levels and phosphorylation state, including the levels of phosphorylated isoforms.


KEY LARGE-SCALE CANCER GENOMICS PROJECTS

A number of large genomics projects have been launched with the aim of cataloging the somatic changes that lead to the development of cancer. These projects have a number of features in common:

- Sequencing multiple cancer types
- Sequencing cancer tissue and a matched non-cancer sample
- Sequencing to a high (bp) resolution with deep coverage
- Ensuring all samples have rigorous accompanying clinical information
- Data released to the public (with restrictions)

These projects differ in the type of sequencing undertaken, with some choosing whole genome sequencing (WGS) and others exome sequencing. These data are often supplemented with further genome-wide analyses such as RNA-seq, gene expression, methylation, and genotyping.

The Cancer Genome Atlas (TCGA)

The Cancer Genome Atlas (TCGA) project was initiated jointly by the United States National Cancer Institute (NCI) and National Human Genome Research Institute (NHGRI), and aims to catalog and discover the major cancer-causing somatic lesions in over 20 types of adult cancer. The overarching goal is to improve our ability to diagnose, treat and prevent cancer. Nearly 20 research institutes have participated in TCGA through biospecimen collection, genome characterization, genome sequencing, genome data analysis, data coordination and, most recently, proteome characterization. To ensure scientific rigor and comprehensiveness of cancer genome profiling, DNA copy number alteration, messenger RNA expression, microRNA expression and CpG methylation were characterized using multiple complementary platforms, and the resulting genomic data were analyzed by multiple computational algorithms. All analytical results for genome characterization and sequence mutations were deposited in standard common formats in the TCGA Data Coordination Center (DCC) at http://cancergenome.nih.gov/dataportal/.

TCGA took a phased-in strategy that began with a 3-year pilot project in 2006 targeting glioblastoma, lung cancer and ovarian cancer. The first published TCGA study focused on glioblastoma (World Health Organization grade IV), a deadly cancer with a median survival of approximately 1 year and generally poor responses to all therapeutic modalities. Recurrent somatic lesions were identified in three key signaling pathways: PI-3K/RAS, TP53 and RB. Furthermore, an integrative analysis of mutation, DNA methylation and clinical treatment data identified a link between MGMT promoter methylation and a hypermutator phenotype caused by mismatch repair deficiency in treated glioblastomas [8]. The next publication was a study of ovarian cancer, identifying subgroups based on transcription, microRNA and methylation profiling. TP53 mutations were found in almost all tumors (96%), while mutations with low prevalence but high significance were found in NF1, BRCA1, BRCA2, RB1 and CDK12, along with 113 significant focal DNA copy number aberrations [16].

One important component of TCGA is the development of infrastructure that provides public access to genomic data through the Data Coordinating Center and the TCGA Data Portal, enabling researchers anywhere in the world to make and validate important discoveries. Over 100 peer-reviewed publications have been authored by investigators who are not part of the TCGA Research Network but whose work is based on TCGA data.

In parallel with TCGA, NCI launched the Therapeutically Applicable Research to Generate Effective Treatments (TARGET) initiative in 2006, which applies genomic technologies to identify new drug targets in high-risk childhood cancers. The initiative’s goal is to characterize completely the genome, transcriptome, and epigenome of 100 to 200 cancer specimens from patients with five pediatric cancers: acute lymphoblastic leukemia (ALL), acute myeloid leukemia (AML), Wilms tumor, neuroblastoma and osteosarcoma. Recurrent oncogenic mutations and novel translocations resulting in activation of the JAK/STAT signaling pathway were found in a subgroup of patients with a transcription profile similar to that of patients harboring the BCR–ABL translocation. Genome characterization data for TARGET can be downloaded from the TARGET data portal (http://target.cancer.gov/dataportal/).

For both TCGA and TARGET, the sequence trace files generated with Sanger sequencing technologies are stored in NCBI’s Trace Archive, while the aligned reads (BAM files) from second- and third-generation sequencing technologies are held in NCI’s Cancer Genomics Hub (CGHub) for TCGA and in NCBI’s Sequence Read Archive (SRA) for TARGET.

International Cancer Genome Consortium (ICGC)

Like TCGA, the ICGC is a collaboration between many research groups and aims to sequence 25 000 cancer genomes over a 10-year period, supplementing these with epigenomic and transcriptomic studies for each case [17]. As well as coordinating multiple smaller disparate sequencing projects around the globe, the ICGC also includes the data from the TCGA project described above. The consortium hopes to collect data from at least 50 cancer types of clinical and societal importance, and aims to create a resource of rapidly updated and freely available cancer genomic data. The ICGC provides a forum for collaboration and coordinates current large-scale projects for the elucidation of the genomic changes in cancer. All participants have agreed upon common standards for sample collection and storage that protect the identities of the donors.

This is the largest consortium of its type in the world, and the number of samples to be sequenced is impressive: it was calculated that 500 samples per tumor type would be required to have the statistical power to detect variations from the normal sequence occurring at a 5% or greater frequency in the tumor sample. This number may have to be reduced for rare tumors, or increased for heterogeneous tumors that may contain different populations, or “clones”, of cancer cells as well as normal and immune cells.

Data are stored in local databases at each of the consortium members’ institutions, and a web portal, built on BioMart, provides access to these datasets via a single interface (see below) alongside public databases such as Ensembl, Reactome, the Kyoto Encyclopedia of Genes and Genomes (KEGG), and the Catalogue of Somatic Mutations in Cancer (COSMIC). Currently, data for 6590 donors, including samples from 31 different malignancies of adulthood and childhood, are available. The ICGC Data Portal offers a gene search tool that directs users to the datasets containing aberrations in the particular gene of interest. It integrates information from COSMIC and other online sources and directs the user to other databases of interest. The main use of this portal is downloading interpreted data as text files for further offline analysis. It is also possible to perform rudimentary pathway and affected-gene analyses and to build more complex queries if desired.

Notable findings of this project include SF3B1 mutations in low-grade myelodysplasia [18] as well as frequent NOTCH1, XPO1, MYD88 and KLHL6 mutations in chronic lymphocytic leukemia (CLL) [19]. In particular, the CLL study integrated extensive clinical information and showed a significant reduction in survival for individuals harboring NOTCH1 mutations. The NOTCH1 and MYD88 mutations were found to be gain-of-function and, as such, could be targeted by inhibitors. This work highlights how genomic studies can lead to potential new therapeutic approaches for treating the disease.
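The 500-samples-per-tumor-type rationale above can be illustrated with a simple binomial sketch. This is purely illustrative, not the consortium's actual design calculation; in particular, requiring a mutation to recur in at least three sequenced cases before it is considered "detected" is an assumption made for the example.

```python
from math import comb

def detection_power(n_samples, mut_freq, min_recurrences=3):
    """P(a mutation present in `mut_freq` of tumors is observed in at
    least `min_recurrences` of `n_samples` sequenced cases)."""
    p_below = sum(
        comb(n_samples, k) * mut_freq**k * (1 - mut_freq) ** (n_samples - k)
        for k in range(min_recurrences)
    )
    return 1 - p_below

# With ~500 cases a 5%-frequency mutation is all but guaranteed to
# recur; with only 30 cases it would frequently be missed.
print(f"n=500: {detection_power(500, 0.05):.4f}")
print(f"n=30:  {detection_power(30, 0.05):.4f}")
```

The same calculation also shows why rarer variants (e.g. 1% frequency) or heterogeneous tumors, where the effective mutation frequency is diluted by normal and immune cells, push the required cohort size up further.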

The St Jude Children’s Research Hospital–Washington University Pediatric Cancer Genome Project (PCGP)

The St Jude Children’s Research Hospital–Washington University Pediatric Cancer Genome Project (PCGP) is a privately funded initiative for identifying the somatic mutations that drive the initiation and the biological and clinical behavior of pediatric cancer [20]. It was launched in 2010 with the aim of obtaining 30-fold haploid coverage of the whole genomes of 600 pediatric tumors and matched non-tumor DNA samples (1200 genomes in total). The large scope of this pediatric-focused cancer sequencing effort is necessary to explore fully the genetic basis of the unique cancers seen in children, because the spectrum of cancers occurring in the pediatric population is markedly different from that seen in adults. For example, the major brain and peripheral solid tumors that arise in children, including medulloblastoma, neuroblastoma, rhabdomyosarcoma, Ewing’s sarcoma, osteosarcoma, and Wilms tumor, are exceedingly rare in adults. Similarly, the specific genetic subtypes of ALL, the most common malignancy in children, differ markedly between children and adults. The PCGP was thus specifically designed to complement the larger government-funded genomic efforts focused on adult cancers, such as the TCGA and ICGC projects.

Because structural variations (SVs) such as inter- and intrachromosomal rearrangements are a common mechanism of mutagenesis in pediatric leukemias and solid tumors, the PCGP chose a WGS approach instead of exome or transcriptome sequencing in order to detect the full spectrum of somatic lesions in pediatric cancers. Given the relative rarity of pediatric cancers coupled with the heterogeneity of tumor subtypes, analysis of a large number of pediatric cancers of a specific subtype by WGS was unfeasible in the short term. The PCGP therefore prioritized for sequencing those pediatric cancer subtypes for which outcome (cure) with current treatment is poor and/or where there is a conspicuous lack of knowledge regarding the genetic basis of the disease. All samples were analyzed by SNP arrays for quality control, and a subset of samples was also analyzed by transcriptome sequencing. In addition to the discovery cohort analyzed by WGS for each subtype of cancer, a validation cohort was analyzed by either exome or transcriptome sequencing to define mutation frequency in a combined larger cohort. Mouse models harboring the most significant mutations were employed in combination with gene expression and epigenetic profiling to aid understanding of the functional impact of these gene mutations.

Major findings published in 2012 included the identification of key pathways mutated in early T-cell precursor acute lymphoblastic leukemia (T-ALL) patients [21], the discovery of SYK as a novel drug target in retinoblastoma [22], the identification of recurrent K27M mutations in histone H3.3 present in 70% of pediatric glioblastomas (GBM) but absent in adult GBM [23], as well as high-frequency ATRX mutations in neuroblastoma [24] and subgroup-specific mutations in medulloblastoma [25].

In May 2012, the PCGP uploaded 260 tumor and germline DNA sequence files (520 in all) from 15 pediatric cancers to the European Bioinformatics Institute data portal (the European Genome-Phenome Archive, EGA), providing researchers with immediate access to both published and unpublished data. At the time of data release, this more than doubled the high-coverage human WGS data available to the scientific community. The immediate release of PCGP data is expected to catalyze research in pediatric malignancies and lead to improvements in our ability to diagnose, monitor and treat patients with targeted therapies aimed at a subset of the identified alterations.

OTHER GENOMICS PROJECTS OF NOTE

Catalogue of Somatic Mutations in Cancer (COSMIC) and dbSNP

COSMIC differs significantly from the other projects described in this chapter in that it mines the literature to catalog somatic mutations, SVs, and CNVs in genes that are causally, but not necessarily experimentally, implicated in the development of cancer. Alongside the mutation information, metadata on the sample from the original publication are standardized and recorded. This effort has produced a rich database of aberrations that can be interrogated by gene, sample, tissue type or mutation description [26]. As more data are collected, these resources will no doubt be expanded to cover the full spectrum of genetic changes in cancer.

The most common type of genetic variation between individuals is the single nucleotide polymorphism (SNP); SNPs occur at a surprisingly high frequency (every 500–1000 bases) and vary greatly from individual to individual. dbSNP [27] aims to catalog these genetic variations, with the current build (dbSNP 137) of the database containing 22 508 883 SNPs that fall within a gene. dbSNP can therefore be used to filter out germline polymorphisms that have been erroneously classified as somatic mutations in COSMIC. For example, we have carried out such filtering using dbSNP 135, which contains 46 160 022 SNPs: of the 39 867 SNVs listed in COSMIC, 6659 were found to overlap. Integrating databases in this manner enables more intelligent filtering and makes them more useful to cancer researchers.
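The COSMIC-against-dbSNP filtering described above reduces, in essence, to a set-membership test on genomic coordinates and alleles. A minimal sketch follows; the coordinates and record layout are toy examples, and a real pipeline would load the sites from a specific dbSNP build (typically a VCF release) rather than a hand-written set.

```python
def filter_germline(candidate_snvs, dbsnp_sites):
    """Drop candidate somatic SNVs whose (chromosome, position, alt allele)
    matches a known germline polymorphism."""
    return [v for v in candidate_snvs
            if (v["chrom"], v["pos"], v["alt"]) not in dbsnp_sites]

# Toy data: two candidate calls, one of which coincides with a known SNP.
dbsnp = {("17", 7579472, "G"), ("1", 115258747, "T")}
calls = [
    {"chrom": "17", "pos": 7579472, "alt": "G", "gene": "TP53"},   # known SNP
    {"chrom": "12", "pos": 25398284, "alt": "T", "gene": "KRAS"},  # retained
]
somatic = filter_germline(calls, dbsnp)
print([v["gene"] for v in somatic])  # ['KRAS']
```

A hash set makes each lookup O(1), so even tens of millions of dbSNP sites can be screened against a mutation list in a single pass.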

Exome Variant Server (EVS)

Initiated by the National Heart, Lung and Blood Institute (NHLBI), this project aims to combine large amounts of exome sequencing data from diverse germline samples to uncover coding mutations that could affect disorders of the heart, lung and blood. SNPs are available from 6503 thoroughly phenotyped samples [28] for bulk download and online exploration. A web interface allows searching for SNPs in a gene and its upstream or downstream sequence, giving information on the variants discovered and their ethnic distribution. The sheer volume of germline data allows this resource to serve both as a tool for finding new polymorphisms affecting disease and as a control population for cancer research.

1000 Genomes Project

The 1000 Genomes Project aims to describe and characterize the variation found in human genomes [29] by sequencing 2600 individuals from 26 populations around the world, and will then serve as a resource for investigating the relationship between genetic polymorphisms and phenotypes. Whole genomes are sequenced from blood samples at low coverage, combined with array-based genotyping (SNP arrays) and supplemented with deep-coverage exome sequencing. With this design, the 1000 Genomes Project hopes to pick up where the first-generation studies, which may have missed rare variants, left off. The project aims to identify 95% of variants that reside in currently accessible genomic regions and are present at a frequency of 1% or greater (i.e. common polymorphisms) in each of five major population groups: African, Ad Mixed American, East Asian, European, and South Asian. For rare coding variants, the frequency threshold is reduced from 1% to 0.1%. Like the NHLBI project, the 1000 Genomes Project will serve as a useful large-scale control group for detailed cancer genomic analysis.

DATA SOURCES

Cancer genomic data can be broadly split into four categories [30], which often determine how the data are protected, stored, and distributed:

1 Raw: Data produced by genome sequencing machines or array analysis hardware without modification. In the case of NGS, these are raw FASTQ files. Distribution of these data is rare.
2 Processed/normalized: Data that have been subject to a small amount of computational post-processing. In the case of NGS, this processing usually involves alignment to a reference genome to produce an aligned BAM file. The owners of these data often tightly control their distribution, which is handled by public data archives. For example, to access Category 2 data from the PCGP or TCGA, qualified users must apply to these groups and provide research proposals describing how the data will be stored and analyzed and what research question the applicant intends to answer. The application is scrutinized to ensure that it meets the requirements of the data provider, and is often also subject to review by the data requester’s own institution to ensure the requirements can be met.


3 Interpreted: Data that often represent the conclusions reached from analysis of Category 2 data, including SNVs, indels, SVs and other biologically relevant findings. Most Category 3 data are freely available after publication, either from the publisher’s website in the form of supplementary tables or through data access portals developed by sequencing projects to disseminate interpreted results.
4 Summarized: Cross-sample analyses that result in the discovery of significant events in the cohort under scrutiny. These are the data often reported in detail in the published findings of a genomics study.
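The "raw" category above has a simple on-disk representation: a FASTQ file stores four lines per read (identifier, sequence, a `+` separator, and per-base qualities). The toy parser below illustrates that structure; real files are gzipped, can contain hundreds of millions of reads, and are normally handled with established libraries such as Biopython rather than hand-rolled code.

```python
import io

def parse_fastq(handle):
    """Yield (read_id, sequence, quality) tuples from a FASTQ text stream."""
    while True:
        header = handle.readline().rstrip()
        if not header:               # end of stream
            return
        seq = handle.readline().rstrip()
        handle.readline()            # '+' separator line, ignored
        qual = handle.readline().rstrip()
        yield header[1:], seq, qual  # strip the leading '@'

# Toy two-read FASTQ stream:
raw = io.StringIO("@read1\nACGTACGT\n+\nIIIIHHHH\n@read2\nTTGA\n+\nIIII\n")
for read_id, seq, qual in parse_fastq(raw):
    print(read_id, seq, len(qual))
```

Aligning such reads to a reference genome is what turns Category 1 data into the Category 2 BAM files distributed by the archives described below.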

Archives

Large data warehouses include the European Genome-Phenome Archive [31] (EGA, http://www.ebi.ac.uk/ega), NCBI’s Sequence Read Archive (SRA, http://www.ncbi.nlm.nih.gov/sra) and NCI’s Cancer Genomics Hub (CGHub, https://cghub.ucsc.edu/). Access to the raw data requires authorization. For both SRA and CGHub, users must apply for authorized access through NCBI’s Database of Genotypes and Phenotypes [32] (dbGaP, http://www.ncbi.nlm.nih.gov/gap). These organizations provide permanent data storage primarily to enable three activities:

1 The reproduction of published results as part of due diligence within the community.
2 The discovery of previously unknown aberrations.
3 The development of new methods.

Data archiving is often required by the research grant funding the generation of the data, and when publishing studies involving large data files, as in NGS, the receiving journal more often than not requires deposition of the data in a public archive. These requirements ensure that the public archive of genomic data continues to expand at an exponential rate.

dbGaP stores data from any study that investigates genotype–phenotype relationships, such as genome-wide association studies (GWAS), clinical sequencing and disease sequencing. It stores these data in a two-tiered system, with some data being “open access” while other, more sensitive data are held under “controlled access”. Data must be submitted with all relevant metadata and are then subject to an intensive review process by dbGaP to ensure that consent from subjects and approval from the submitter’s organization have been obtained. Open-access data can be browsed without restriction on the website, including the studies, associated documents, phenotypes and the genotype–phenotype analyses. Any of these categories of data stored on dbGaP may be designated controlled access on a per-study basis.
EGA operates under a similar umbrella and contains a separate NGS warehouse storing raw Category 1/2 data in its short read archive, with other data stored in relevant databases that are separated from the phenotypic information. Each of these archives provides project/study landing pages that list the datasets contained within each project, the data access policies and how to obtain access if required. For example, at the time of writing, the PCGP project page (https://www.ebi.ac.uk/ega/dacs/EGAC00001000044) contains links to data for 314 samples contained in nine datasets. Clicking on one of these studies provides further information on the data if they are freely available to download, or contact information if access is controlled.

Many of the available files are incredibly large; for example, BAM files for a whole genome average approximately 90 GB, depending on the sequencing coverage (Figure 10.2). These large files can take days to download, depending on the speed of the network infrastructure of the archive and of the receiving institution’s Internet and network backbone. Reliability is also an issue with large data downloads, though many archives have the ability to restart failed downloads where the user left off. Once the large data files are downloaded they must be stored, which can be costly (Figure 10.2). Subsequent processing and analysis of these data requires high-performance computing facilities and fast network resources. For example, the Washington University in St Louis genomic data center is 16 000 square feet in size and contains 120 racks of data analysis and storage servers. These high-performance systems allow complex calculations to be carried out simultaneously and are supported by a fast network infrastructure that can transfer large sequencing files to computational servers in minutes. Desktop computers can run these kinds of analyses, but slowly, requiring weeks to months to complete an analysis.
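To put those file sizes in perspective, a back-of-the-envelope transfer-time estimate can be sketched as below. The 60% efficiency factor is an assumption for the example; sustained throughput from a busy public archive over shared academic links is often far lower than the nominal link speed, which is how a single whole-genome BAM can indeed take days.

```python
def transfer_hours(file_gb, link_mbps, efficiency=0.6):
    """Rough wall-clock hours to move `file_gb` gigabytes over a link
    that sustains `efficiency` of its nominal `link_mbps` bandwidth."""
    bits = file_gb * 8e9                          # decimal GB -> bits
    return bits / (link_mbps * 1e6 * efficiency) / 3600

# A ~90 GB whole-genome BAM over common nominal link speeds:
for mbps in (10, 100, 1000):
    print(f"{mbps:>4} Mb/s: {transfer_hours(90, mbps):7.1f} h")
```

Estimates like this are why some archives also support shipped hard drives or dedicated transfer tools rather than plain HTTP downloads.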

Portals

Because of the specialized knowledge and resources required to reanalyze publicly available Category 1 and 2 data, many of the large sequencing projects have developed data access portals as a way of disseminating interpreted data (Categories 3 and 4) in an interactive forum for de novo research, offering the ability to analyze the large datasets produced without requiring programming knowledge or specialized infrastructure. Publications resulting from the large sequencing projects discuss the major findings, but there is also a wealth of SNV, indel, SV, CNV and other genomic data not featured in the primary publication. These data can often be found in supplementary tables, so that researchers with particular interests can browse them in spreadsheets. They are additionally deposited on data portals that present the data in a more integrated fashion alongside other datasets and unpublished results.



FIGURE 10.1 The flow of genomic data. Data produced by a genome sequencing project are minimally processed to create the data files that are either uploaded directly to data warehouses, such as dbGaP, for download by the end user, or are analyzed in-house to discover biologically meaningful aberrations. These data are summarized and published and/or uploaded to a genomic data portal for end users to access. Data can also be summarized by a third party to create additional databases and data portals.

as "Is RB1 mutated frequently in ovarian carcinoma?" are quickly answerable, and often a one-word query is all that is required to guide the user into more complex views where it is possible to explore networks, pathways, data summaries and additional data related to the original question. Offering these related data enables researchers to immerse themselves fully in the data, potentially leading to novel discoveries or stimulating questions the user would not initially have thought to ask. All online portals attempt to summarize data in a visually pleasing manner, providing plots of complex information, and also offer downloads of raw data tables so that users may carry out their own analyses offline. A common set of features, implemented slightly differently in each case, can be found on most online data access portals:

- Gene search
- Disease search
- Display of specific data types (i.e. SNV, CNV, SV)
- Display of overlapping aberrations.

Data access portals differ from institution to institution with respect to the datasets and data types they contain. For example, the TCGA Data Portal features exclusively adult malignancies while the PCGP Explore portal provides data from childhood malignancies. It is common for the portals to provide a breakdown of what data are contained when the user first enters the site so it is clear what information can be queried. The types of data available vary from portal to portal; while most have sequence mutations, others supplement sequencing data with other analyses such as methylation, RNA expression, or protein levels assayed by high-throughput protein arrays. Most of the portals allow bulk data downloads or programmatic access to the data through application programming interfaces (APIs). Bulk data download formats differ from project to project but contain similar information. For example, sequence mutations are often provided with additional information alongside the requisite chromosome, position and variant allele, such as the amino acid change for a particular protein, the flanking sequence, the number of sequencing reads containing the variant, etc. Note that the level of validation, i.e. the confirmation of the aberration using an independent assay, can differ significantly from portal to portal and, unless explicitly stated, users should treat the data with caution to avoid overinterpretation of the results.
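As a concrete illustration of working with such a bulk download, the sketch below filters a tab-delimited mutation file by the variant allele fraction computed from its read counts. The column names and coordinates are hypothetical; each portal uses its own headers, so consult the accompanying documentation before adapting this.

```python
# Filter a (hypothetical) tab-delimited mutation download by variant
# allele fraction (VAF = mutant reads / total reads). Column names and
# the coordinates in the example are invented for illustration only.
import csv
import io

def high_confidence(tsv_text: str, min_vaf: float = 0.2):
    """Yield (gene, chromosome, position, vaf) for rows passing a VAF cutoff."""
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        vaf = int(row["mutant_reads"]) / int(row["total_reads"])
        if vaf >= min_vaf:
            yield row["gene"], row["chromosome"], row["position"], vaf

example = (
    "gene\tchromosome\tposition\tvariant_allele\tmutant_reads\ttotal_reads\n"
    "RB1\t13\t48941648\tT\t18\t60\n"       # VAF 0.30 -> kept
    "TP53\t17\t7577121\tA\t2\t55\n"        # VAF 0.04 -> filtered out
)
print(list(high_confidence(example)))
```

The same pattern (read, compute a per-row metric, filter) covers most offline reanalysis of portal exports.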


[Figure 10.2 comprises three panels: "How Many Whole Genomes Can an Average User Store?" (internal drive vs external drive), "Common Analysis Raw Data File Sizes (Gb)" (average BAM and array file sizes for DNA and RNA assays), and "What About a Large Scale Sequencing Project?".]

FIGURE 10.2 Data sizes. Genomic data can be very large depending on the type of analysis being performed. EXCAP = Exome capture, FREQCAP = Targeted capture, SNP6 = SNP 6.0 Array, WGS = Whole genome sequencing, RNASEQ = RNA (transcriptome) sequencing, U133 = Gene expression array. The average desktop computer drive is approximately 500 Gb, which means a user could store around four whole genomes; a large external drive could store 33, and a sequencing project such as the PCGP would require 18 external drives for storage of one copy of all the WGS.

The TCGA portal, for example, provides sequence mutation data as text files in “MAF” format, which includes the validation status and method along with other information on the mutation. We will discuss, in detail, specific examples of data analysis portals later in this chapter.
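MAF files are plain tab-delimited text, so simple summaries need no special tooling. The sketch below counts validated mutations per gene; the column names used here follow the TCGA MAF specification, but exact names and capitalization vary between MAF versions, so check the header of the file you download.

```python
# Minimal MAF summarization sketch. MAF is tab-delimited; the columns
# used (Hugo_Symbol, Variant_Classification, Validation_Status) come
# from the TCGA MAF specification, though names vary between versions.
import csv
import io
from collections import Counter

def mutations_per_gene(maf_text: str, validated_only: bool = True) -> Counter:
    """Count mutation records per gene, optionally keeping only validated calls."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(maf_text), delimiter="\t"):
        if validated_only and row.get("Validation_Status") != "Valid":
            continue
        counts[row["Hugo_Symbol"]] += 1
    return counts

maf = (
    "Hugo_Symbol\tVariant_Classification\tValidation_Status\n"
    "TP53\tMissense_Mutation\tValid\n"
    "TP53\tNonsense_Mutation\tValid\n"
    "RB1\tMissense_Mutation\tUnknown\n"
)
print(mutations_per_gene(maf))  # counts only the two validated TP53 calls
```

Filtering on Validation_Status is exactly the caution urged above: treat unvalidated calls separately from orthogonally confirmed ones.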

Processors

Genomic data can seem overwhelming and disparate: sites from different institutions each hold small pieces of the puzzle, and integrating all these data is a major challenge for genomic research. Several groups have addressed these concerns by processing sequencing data into unified resources; we therefore categorize these organizations/services as "processors" of genomic data. There are two major types of genomic data processors: those that create repositories of disparate genomic data in one easily accessible location, and those that provide tools to analyze disparate data de novo. Projects like COSMIC serve the role of a data processor because they take published Category 3 data and create a powerful summary database of significant utility to the research community. Other organizations, such as BioMart, provide tools to access multiple data sources in one location, allowing integration of independent datasets. BioMart presents both a unified interface to databases around the world and tools to query and combine the data. Other data processing tools, like Galaxy, allow the computational analysis of public or private data in a web interface that gives biomedical researchers access to what is essentially a supercomputer in the cloud. Galaxy can be used for tasks ranging from simple file formatting to complex analyses like processing raw WGS data.

ONLINE/OFFLINE ANALYSIS TOOLS

A number of online data access portals and on/offline data analysis tools are available to integrate and analyze genomic data. In this section, we provide a summary of a large number of tools and the data that they contain (Table 10.1) as well as a more in-depth description of the best-in-class tools.

cBio Cancer Genomics Portal (MSKCC)

The cBio Cancer Genomics Portal currently contains published and preliminary data from TCGA and ICGC [33]

TABLE 10.1 Data Available via Online Data Access Portals. Summary of the Types of Data Contained within Each Data Portal

Data Portal | Institution | Data Available | URL
PCGP "Explore" | St Jude Children's Research Hospital | 53 cancer WGS & germline pairs | http://explore.pediatriccancergenomeproject.org/
Exome Variant Server | NHLBI | 6503 germline exomes | http://evs.gs.washington.edu/EVS/
TCGA | NCI | 6019 "downloadable" tumor samples | https://tcga-data.nci.nih.gov/tcga/
COSMIC | Sanger Center | 4 991 081 experiments | http://www.sanger.ac.uk/genetics/CGP/cosmic/
cBio | MSKCC | 5 published datasets & 15 provisional TCGA datasets | http://www.cbioportal.org/public-portal/
Oncomine | Compendia Bioscience | Subscription required for some features | https://www.oncomine.org/resource/login.html
G-DOC | Georgetown | 5385 donors | https://gdoc.georgetown.edu/gdoc/
ICGC | Ontario Institute for Cancer Research | 3561 donors | http://dcc.icgc.org/web/
Tumorscape | Broad | TCGA copy number data | http://www.broadinstitute.org/tumorscape/pages/portalHome.jsf
TARGET | NCI | No summary numbers provided | http://target.nci.nih.gov/dataMatrix/TARGET_DataMatrix.html


on which a user may carry out integrative genomic analysis of pre-computed results and ad hoc statistical testing. In addition to the SNV, indel, CNV, and mRNA expression data displayed on most genomics portals, cBio also includes methylation and proteomics data such as protein and phosphoprotein levels. Another feature is the ability to generate survival curves on the fly for a given query, quickly informing the user of any clinical significance. The authors of this portal have concentrated on a gene-centric view of the data to help consolidate the vast amount of information stored and allow complex analyses to be carried out with no computational knowledge. Once a gene or genes are selected for analysis, multiple tabs allow the user to browse all the data related to that gene or set of genes. Genes can be queried across cancers, and mutual exclusivity or co-occurrence of aberrations can be examined in user-selected gene sets. As well as simple one-gene searches, cBio can also build complex queries using a language dubbed the "Onco Query Language". For example, to display mutations and amplifications of RB1 in all cancers, the user simply enters the single line "RB1: AMP MUT". To compare more than one gene, users can enter multiple genes and multiple restrictions to build a complex picture of the aberrations of interest. A summary of cases is displayed on the resulting query page, with summaries for each individual cancer shown below. When clicked, these links reveal a visualization termed an "OncoPrint" that gives an immediate representation of the selected aberrations and their frequency. In addition, OncoPrints are editable in the browser and exportable in SVG format, allowing publication-quality figures to be produced from the data.
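The survival curves such portals draw on the fly are Kaplan-Meier estimates. A minimal estimator is sketched below for intuition only; the portal's own implementation and accompanying statistics are more complete.

```python
# Minimal Kaplan-Meier survival estimator. Input is follow-up time per
# patient plus an event flag (1 = death observed, 0 = censored).
from itertools import groupby

def kaplan_meier(times, events):
    """Return (time, survival probability) at each time where a death occurs."""
    paired = sorted(zip(times, events))
    at_risk = len(paired)
    surv, curve = 1.0, []
    for t, group in groupby(paired, key=lambda p: p[0]):
        group = list(group)
        deaths = sum(e for _, e in group)
        if deaths:
            surv *= 1 - deaths / at_risk       # step down at each death time
            curve.append((t, surv))
        at_risk -= len(group)                  # deaths and censored leave the risk set
    return curve

# Five patients: deaths at 2, 3 and 5 months; censoring at 3 and 8 months.
print(kaplan_meier([2, 3, 3, 5, 8], [1, 1, 0, 1, 0]))
```

Censored patients contribute to the at-risk denominator up to their last follow-up but never trigger a step, which is what distinguishes this from a naive fraction-surviving plot.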
Concentrating on a single disease shows a more in-depth analysis: the protein-protein interaction network, plots of expression alongside CNV, survival analysis, a plot of the protein and its domains with mutations displayed, the protein array data, an Integrative Genomics Viewer (IGV) link (see below), downloads, and the ability to bookmark a search for sharing or returning to later. Pathway networks are sourced from the Pathway Commons project [34] and are overlaid with genomic information, allowing the user to determine quickly whether other genes in a related set are disrupted. The source code is freely available and can be installed on a local machine, or users can request an Amazon Virtual Machine (see below) that has been preloaded with the cBio portal to analyze their own data.

PCGP "Explore"

The PCGP (described above) is the largest pediatric cancer WGS project in the world. Its Category 2 data are available from EBI, while the Category 3 and 4 data are available via a web portal called PCGP Explore (Figure 10.3). PCGP Explore allows users to browse summary data such as CIRCOS plots, CNV heatmaps, gene expression heatmaps, summary data tables, and supplementary files for any or all PCGP published diseases. For all genes, or a subset of the user's choosing, a matrix-like graphical summary of all aberrations can be customized to allow the selection of aberration types, disease types, patient groups, and pathway information. This matrix can be exported to Microsoft Excel for publication. Additional features include highly customizable gene expression boxplots, germline SNP searches, and R libraries that allow direct access to the data to facilitate further statistical analysis. Explore provides users with the opportunity to search a gene of interest and view a downloadable and editable SVG image of all mutations found in that gene on a "Protein Paint", which also displays domains annotated by the conserved domain database [35]. In contrast to some of the other genome portals, a structural model of the protein is displayed, when available, allowing the user to highlight mutations of interest within the structure, providing clues to functional impact. Beneath the images is a table of all aberrations found within that gene in all available PCGP samples, including SNVs, indels, SVs and CNVs.

Galaxy

Galaxy is an online library of executable tools that allows de novo manipulation and interactive analysis [36] of data that can be uploaded or chosen from a library of shared data available to Galaxy users. Historically, the user would have needed command-line programming and data management skills to carry out the type of analyses Galaxy allows. Now, however, the biomedical scientist can carry out these analyses via a readily accessible web interface. The types of manipulation that can be carried out by Galaxy can be broadly split into three categories:

1. Queries
2. Sequence analysis tools
3. Output displays.

When a user submits an analysis for Galaxy to carry out, it is queued and run when computational resources become available (for small, simple analyses, this usually takes a minute or two). Multiple analyses, or jobs, are queued in the order that they were received. Like the job submission systems used on high-performance computing clusters, Galaxy can also perform complex job management, allowing the output of one analysis to be the input for another job in the queue. These jobs are stored in the


FIGURE 10.3 PCGP Explore. Explore displays data from the PCGP. In this example a “Gene Search” for NOTCH1 was performed. NOTCH1 was found to be the subject of aberrations in both T-cell precursor ALL and medulloblastoma. On the left, mutations are displayed on a schematic of the protein, “Protein Paint”, which includes manually curated domains. On the right is a structural model of a region of NOTCH1 encompassing some of the residues that are mutated in these cancers. The mutations are highlighted in the protein structure allowing assessment of the mutation effect.

user's history and can be saved, shared and re-run. Additionally, these histories can be converted into "workflows" for analyses that are to be carried out multiple times on different sets of data. Galaxy contains an enormous number of data formatting and analysis tools, ranging from the simple, such as merging or adding columns to files, computing an expression on every row in a file, and converting formats, to more complex operations such as converting coordinates between different versions of genomes (Figure 10.4), statistical analysis of data, and sequence alignments. It is possible to perform complex command-line operations in a graphical manner, such as the annotation of SNVs and prediction of the functional impact of missense mutations using SIFT [37]. One of the most powerful features of the site is the "NGS Toolbox" that allows users to run complex computational algorithms on raw NGS data. Analysis of BAM files is usually reserved for users of the Unix command line, but Galaxy provides a large selection of popular tools to perform analyses on these

files, such as examining coverage and searching for and filtering genetic variants such as SNVs and indels. An extension of Galaxy's features has been undertaken by the biomedical developer community in an Apple App Store fashion, termed the "Galaxy Tool Shed", in which developers can create or convert tools that can be installed in a Galaxy instance on the web for use on their own datasets. At the time of writing, 2044 tools were available in the shed. The creation of the tool shed greatly expands the scope of analyses that can be performed on Galaxy. Peers can also rate tools so that users can see which tools are best suited to their analyses. Galaxy has been described or utilized in more than 400 research articles, including publications in top-tier journals where it has been used for the analysis of raw ChIP-seq data [38] and RNA-seq mapping and analysis [39]. Data from NGS studies are also being deposited in Galaxy to facilitate third-party analysis [40]. In addition to being used on the web, Galaxy can also be installed locally to


FIGURE 10.4 Galaxy. Galaxy is a web-based platform for the computational analysis of complex biological data. Galaxy is split into three sections: on the left the available “Tools” are listed while the right panel displays a “History” showing analyses that are either queued, running, or finished. The center panel displays both parameter options for the tool selected and analysis results. In this example, we have converted coordinates from an older to a newer genome build.

increase the speed of data analysis and provide more privacy for clinically sensitive data. Excellent walkthroughs of Galaxy’s features are available to assist new users [41], and the site itself provides a variety of training videos.
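Galaxy's history-to-workflow conversion amounts to chaining tools so that each job consumes the previous job's output. The same idea in miniature, with invented step names standing in for real Galaxy tools:

```python
# A workflow as an ordered list of named steps, each a function of the
# previous step's result -- a toy analogue of a Galaxy workflow. The
# step names and the toy "reads" below are invented for illustration.

def run_workflow(data, steps):
    """Run each step on the output of the one before, logging record counts."""
    for name, step in steps:
        data = step(data)
        print(f"step '{name}' -> {len(data)} records")
    return data

reads = ["ACGT", "NNNN", "GGCA", "ANGT"]
workflow = [
    ("drop_ambiguous", lambda rs: [r for r in rs if "N" not in r]),
    ("uppercase",      lambda rs: [r.upper() for r in rs]),
]
print(run_workflow(reads, workflow))
```

Because the workflow is data (a list), it can be saved, shared, and re-run on a different dataset unchanged, which is precisely what makes Galaxy histories reproducible.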

BioMart

Biological databases can be disparate in their location and access style; often the user has to be familiar with a particular database/web interface to get the most out of a resource, and gene identifiers frequently differ between databases, making direct comparisons clumsy and unintuitive. To solve these issues, a database integration system called BioMart has been created [42]. BioMart provides a generic interface for querying over 40 disparate biological databases around the world, giving users access to a variety of data without requiring them to learn the intricacies of each data center's interface. It can be used to get gene annotations or convert from one gene ID

to another, problems that can be daunting and frustrating to many biomedical researchers. BioMart enables querying across multiple data sources simultaneously: datasets that share common identifiers (e.g. Ensembl gene IDs) allow BioMarts to be linked, with integrated queries running on the same server and across servers around the globe. Because of its power, BioMart has been integrated into a variety of large website frameworks like the ICGC Data Portal. Each of the centers in the consortium maintains its own BioMart server that can be seamlessly accessed from the ICGC portal alongside external databases like Ensembl and COSMIC. COSMIC has also populated its own instance of BioMart, termed COSMICMart, which contains all of the somatic mutations in COSMIC as well as associated phenotypic data (http://www.sanger.ac.uk/genetics/CGP/cosmic/biomart/martview). This makes filtering COSMIC data extremely straightforward, allowing complex integrated queries to be constructed.
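The kind of cross-database join BioMart performs can be pictured as merging records that share a common key. In this sketch the Ensembl gene IDs are real identifiers for RB1 and TP53, but the mutation counts are invented and the two dictionaries merely stand in for remote data sources.

```python
# BioMart-style integrated query in miniature: two "data sources" that
# share Ensembl gene IDs are merged into one annotated record per gene.
# The mutation counts are invented illustration values.

symbols = {"ENSG00000139687": "RB1", "ENSG00000141510": "TP53"}
mutation_counts = {"ENSG00000139687": 42, "ENSG00000141510": 311}

def integrated_query(ids):
    """Emulate a cross-database query keyed on a shared identifier."""
    return [
        {"ensembl_id": g, "symbol": symbols[g], "mutations": mutation_counts[g]}
        for g in ids
        if g in symbols and g in mutation_counts
    ]

print(integrated_query(["ENSG00000139687", "ENSG00000141510"]))
```

Everything hinges on the shared key: when two sources disagree on identifiers, the join silently drops records, which is exactly the clumsiness BioMart's unified interface is designed to remove.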


My Cancer Genome

My Cancer Genome (http://www.mycancergenome.org/) is a web resource that primarily enables clinicians and patients to view information on particular genes and mutations pertinent to their cancer of interest. It provides mutation-specific information on relevant therapies and clinical trials, thereby giving clinicians the latest knowledge without their having to perform extensive, time-consuming literature searches. Selecting a cancer gives the user the ability to select a gene and then a variant within it before proceeding to the gene information pages. Users are presented with a simple table listing statistics such as the frequency of that mutation in the selected disease (data from COSMIC) and the implications of that mutation for targeted therapeutics. For example, selecting Melanoma > BRAF > V600E quickly informs the physician that this mutation confers increased sensitivity to BRAF inhibitors and provides a reference to the primary literature. After a description of the mutation written by clinicians, a table lists the treatment agents and response rates for that treatment, while tabs at the top of the page provide lists of clinical trials in the USA, more information on the gene, and its role in the selected cancer. Actionable mutations are, therefore, readily distinguishable from mutations for which there is no clinical information. Currently, My Cancer Genome is a manually curated resource and has a limited number of mutations in its database. Experts in each disease create and edit sections of the site. Adding new data takes time, but it is an ever-expanding resource that will become increasingly important for personalized medicine as more mutations are investigated in clinical trials. It is databases like My Cancer Genome that will help to revolutionize cancer medicine and provide meaningful mutation information affecting patient care.

Standalone Genome Sequencing Viewers: IGV and More...

Although there is a wide range of NGS viewers available, one of the most powerful and comprehensive is the Integrative Genomics Viewer, which allows cross-platform visualization of a broad range of sequencing and other genomic data such as gene expression [43]. IGV is an offline tool that can be downloaded from http://www.broadinstitute.org/igv and has been streamlined to run on almost all desktop computers independent of operating system (Windows, OS X and Unix). Although the tool runs locally, it is possible to import a wide range of data in a variety of formats from local drives, websites or the cloud. Hundreds to thousands of samples can be viewed concurrently, allowing very


large-scale comparisons to be made. A command-line version of IGV is also available, allowing automated generation of images for human assessment as well as integration into computational analysis pipelines. Alongside the genomic data, it is also possible to import sample metadata, such as clinical information, for display in a color-coded matrix beside each sample. These annotations can also be used to group, sort and filter the data. The data are visualized with respect to a reference genome selectable from within the software. Genome navigation is achieved using Google Maps-like features such as pan and zoom. Users can zoom from a low-resolution whole-genome view to a high-resolution display of individual base pairs. Searching for genes or locations within the genome is also possible. Useful features include customizable "Gene Lists" (or a "multilocus" view), allowing the user to view multiple analyses for a set of genes in an integrated manner. NGS reads can be viewed and SNV calls manually reviewed by the user. Finally, IGV automatically highlights "interesting" features of reads that fail to match the reference genome. The IGV viewer is a very powerful visualization suite but, for more detailed analyses of specific genetic aberrations, other tools may be more appropriate. During manual review of SNPs, indels and SVs, a sequence viewer like Bambino may be more suitable (Figure 10.5). Bambino is a platform-independent viewer (as well as a variant calling suite) that reads BAM/SAM data and can be used to display the NGS reads of two samples (i.e. normal vs tumor) aligned to the reference genome [44]. It displays protein isoform information (RefSeq), allowing changes to the protein sequence to be shown.
One standout feature of this viewer is the display of read quality information: the background of each base pair is shaded according to its quality score, allowing a more informed judgment on differences from the reference genome. Changes with respect to the reference are highlighted in red and indels can be readily distinguished. Other similar viewers include Savant, Tablet, Artemis and EagleView. All viewers have their strengths and weaknesses, and the choice of viewer depends on the data type, platform, feature sets and personal preference. Viewers such as IGV and Bambino allow genome-wide analyses and computational predictions to be examined and interrogated thoroughly by a human reviewer.
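The quality shading Bambino applies is driven by per-base Phred scores, which FASTQ and SAM store as ASCII characters offset by 33 under the Sanger convention. Decoding them is a one-line operation:

```python
# Decode Sanger-encoded (Phred+33) base quality strings, as used in
# FASTQ and SAM/BAM files.

def phred_scores(quality_string: str, offset: int = 33):
    """Convert an ASCII quality string to a list of Phred scores."""
    return [ord(c) - offset for c in quality_string]

def error_probability(q: int) -> float:
    """Phred Q relates to base-call error probability p by Q = -10 log10(p)."""
    return 10 ** (-q / 10)

print(phred_scores("II5!"))   # [40, 40, 20, 0]
print(error_probability(30))  # 0.001, i.e. one miscall per 1000 bases
```

A Q40 base is wrong roughly once in 10 000 calls while a Q10 base is wrong once in 10, which is why quality-aware shading materially changes how a reviewer judges a putative variant.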

CLOUD COMPUTING

Cloud computing has been gaining popularity lately, with a number of companies offering storage of personal information on their servers via the Internet. Users can store


FIGURE 10.5 Bambino. Bambino is an NGS read viewer, suited for the manual review of computationally predicted genomic aberrations such as single nucleotide and structural variations.

Box 10.2 Common Genomic Formats
This box briefly describes the genomic formats common to portals and viewers (FASTQ, FASTA, SAM, BAM, MAF, VCF, BED, GFF, segmentation, Wiggle).
FASTA A text file containing sequence information. Each sequence has an identifier line indicated by a greater-than symbol (>) followed on the next line by the sequence data.
FASTQ This text format is an extension of FASTA that includes sequencing quality alongside the sequence itself. FASTQ files usually contain four lines per sequence: line 1 is the sequence ID, denoted by an @ character; line 2 is the sequence; line 3 is denoted by a + and can optionally repeat the sequence ID; and line 4 encodes the quality values for each base in line 2.
SAM Tab-delimited text file containing sequence alignment data.

their photos, music, etc. for access on multiple devices, including smartphones, often free for a small amount of storage and pay-as-you-go for larger amounts. Computing in the cloud is quickly becoming a useful activity for biomedical researchers, and it is now possible

BAM A binary version of a SAM file.
BED A tab-delimited text file that defines genomic features, containing genomic position information (chromosome, start and end) and optional features of that region.
GFF General feature format files are tab-delimited text files containing genomic regions and their features. There are several versions of this format.
MAF The mutation annotation format is a tab-delimited text file containing lists of mutations. This format is required for TCGA mutation reports.
VCF Variant call format files are used by the 1000 Genomes Project to record variants. More information can be found at http://www.broadinstitute.org/software/igv/FileFormats
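The four-line FASTQ layout described in the box can be parsed in a few lines of code. This sketch assumes unwrapped records (sequence and quality each on a single line), which is the common case for sequencer output:

```python
# Minimal parser for the four-line FASTQ layout: @-prefixed ID, the
# sequence, a "+" separator, and a quality string matching the sequence
# length. Assumes unwrapped records; the example reads are invented.

def parse_fastq(text: str):
    """Yield (read_id, sequence, quality) tuples from FASTQ text."""
    lines = text.strip().splitlines()
    for i in range(0, len(lines), 4):
        ident, seq, plus, qual = lines[i : i + 4]
        assert ident.startswith("@") and plus.startswith("+")
        assert len(seq) == len(qual)
        yield ident[1:], seq, qual

fastq = "@read1\nACGT\n+\nIIII\n@read2\nGGTA\n+\nII5!\n"
print(list(parse_fastq(fastq)))
```

Real-world parsers add error handling and streaming, but the record structure itself is this simple, which is part of why FASTQ became the de facto interchange format for raw reads.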

for individuals and research groups without dedicated infrastructure to store, manage, and analyze raw sequencing data. A number of services, like Amazon Elastic Compute Cloud (EC2) or Windows Azure, allow access to whole servers or "virtual machines" that can be customized


based on the user's computing requirements. Like the cloud services for storing personal data, these services are offered on a pay-as-you-go basis; Amazon EC2, for example, allows users to pay only for the computation time used running their analyses. These services provide access to a shared pool of resources that can be utilized on an ad hoc basis, dependent on the type of analyses being run. For example, analyses that could take 10 hours on a single machine could be split onto 10 servers, with the results available in 1 hour. The highly scalable nature of cloud computing makes it an attractive service for researchers who do not have access to computing clusters. One of the big advantages of cloud computing, besides access to computational resources, is reproducibility. A "snapshot" of a virtual machine can be taken, archived, and restored at any time with all data and software faithfully replicated, allowing for reproducible computing. Some research groups already provide their software as Amazon Virtual Machine snapshots so users can run it with little configuration or intervention, providing a more plug-and-play approach: the user simply requests the virtual machine and uploads data for analysis. cBio, for example, has implemented this with their cancer genomics portal, allowing users to run a personal version of the portal for analysis and display of their own data. The 1000 Genomes Project has also uploaded all current data (1700 genomes) to the Amazon cloud as a free public dataset. Users can analyze these data directly using Amazon's EC2 service without having to download the 200 TB of files stored there. Once a user has set up a computing server, the data are accessible and the user may run analyses freely, with the results stored on the user's own server.
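The "10 hours on one machine versus 1 hour on ten servers" arithmetic is simple work division. The sketch below uses a local thread pool to stand in for a fleet of cloud workers; a real cloud deployment would also pay data-transfer and start-up overheads that are ignored here.

```python
# Split a dataset into chunks and analyze them in parallel. A thread
# pool stands in for cloud servers here; the per-chunk "analysis" is a
# trivial stand-in for a real per-chromosome or per-region job.
from concurrent.futures import ThreadPoolExecutor

def analyze_chunk(chunk):
    """Stand-in for one worker's share of the analysis."""
    return sum(x * x for x in chunk)

def parallel_analysis(data, workers=4):
    size = -(-len(data) // workers)  # ceiling division
    chunks = [data[i : i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(analyze_chunk, chunks))

print(parallel_analysis(list(range(1000))))
```

For CPU-bound genomic workloads the workers would be separate processes or machines rather than threads, but the split-map-combine structure is identical.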
Projects such as 1000 Genomes use technologies like the cloud to demonstrate their wide applicability and encourage development in this area. BioLinux [45] (http://cloudbiolinux.org/) has been pre-configured to run on an Amazon EC2 cloud server or a desktop machine and includes popular NGS analysis software like BWA, Bowtie, bedtools, Picard, SAMtools, and Galaxy. This prevents the user from having to endure lengthy installation procedures that require Unix administration knowledge. Distributions like BioLinux provide a smoother entry into the bioinformatic analysis of sequencing data by providing detailed user guides and easily installable virtual machines that enable users to start to become familiar with running these tools from the command line. Despite the increasing accessibility of the cloud for biomedical research, there are a number of roadblocks. It can be costly because raw NGS data files can be very large and disk space is expensive. Upload times for these


files can also be very long (dependent on connection speeds); the cloud may therefore be better suited to downstream analyses, which often involve smaller files. However, the deposition of BAM files by large genomics projects could be a positive step towards easing these concerns. Restrictions are also placed on the number of transactions that can be made (i.e. the upload or download of a file). Patient privacy could also be an issue, especially if germline data are being analyzed. Clinical protocols that allow collection of patient material may not allow the analysis of the resulting data on what is, for all intents and purposes, a public server. Identification of a patient from NGS data, though, is extremely unlikely, especially if the data have been diligently de-identified.

SYNOPSIS AND PROSPECTS (GENOMIC RESOURCES AND THE CLINIC)

Sequencing data generation is no longer a major hurdle to deciphering the cancer genome, because large genome sequencing projects like TCGA, PCGP and ICGC have released data from hundreds to thousands of cancer genomes and their matched normal (germline) samples. These data resources are growing exponentially as sequencing technology becomes more accessible. The high standards set by these projects ensure the data being released to the research community are of high quality and suitable for downstream analyses. Important discoveries have already been made using these data, and many more will appear in the coming decade. The cancer genome is still, for the most part, providing information that could lead to new treatment modalities and, above all, is changing the way we think about delivering cancer therapy. No longer will it be satisfactory to treat patients suffering from cancer of a particular tissue, like the breast, with one umbrella treatment regimen. Even now breast cancer patients are divided into large treatment subgroups based on hormone receptor status, and soon the standard of care may involve subdivision of patients into even smaller groups in light of recent genomic studies [46]. Clinical genomic sequencing will allow specific treatments to be tailored to the aberrations contained within the individual's genome. To realize this future of personalized medicine, genomic data resources must first be mined extensively to build databases that provide detailed information on specific aberrations, and their potential must then be transformed into actionable treatments in the clinic. The numerous data sources and analysis tools we described in this chapter serve as the foundations for these databases, but we must first integrate this disparate stockpile of raw genomic information into something that is applicable in the clinic.
Projects like My Cancer Genome have started the arduous task of manually populating such a database and, even though it is in its infancy, this resource sets an example for the integration of mutation data with actionable clinical information, such as identifying drugs known to be effective against certain aberrations and providing details of relevant ongoing clinical trials.

For NGS to become commonplace in clinical data analysis, pipelines must be standardized and the reproducibility of results must be held in the highest regard. With tools like Galaxy, whose history and workflow features allow the same analysis to be run on multiple datasets and whose pipelines can be transferred easily between collaborators or institutions, these goals seem attainable. The more biomedical researchers turn their hand to making sense of the wealth of cancer genomic data, the more understanding we will gain. New cloud computing initiatives, such as the deposition of the 1000 Genomes data on the Amazon cloud service, put high-performance computing and complex data analysis algorithms increasingly at our fingertips.

Issues still remain in the use of genome sequencing in the clinic, however, such as how much of the data should be reported to clinicians and, ultimately, to patients. Sequencing of a patient's genome may reveal aberrations not directly related to the current diagnosis: what should be done, for instance, with information relating to a predisposition for heart disease or an elevated risk of mental illness? Such findings can be difficult to interpret and could lead to further expensive testing or unnecessary worry for the patient; yet withholding them could deny the patient and clinician valuable information. These issues will need to be addressed alongside the need for better education of clinicians in the application of genomics to medical practice.
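The mutation-to-treatment lookup that resources like My Cancer Genome curate can be thought of as an annotation step over a curated table of aberrations. The minimal sketch below illustrates the idea only: the BRAF V600E/vemurafenib pairing is mentioned elsewhere in this chapter, but the table structure, function names and the second variant are hypothetical stand-ins for real curated clinical content.

```python
# Illustrative only: a toy curated table keyed by (gene, protein change).
# Real resources would also carry evidence levels, trial identifiers, etc.
ACTIONABILITY = {
    ("BRAF", "V600E"): {
        "drug": "vemurafenib",
        "note": "V600E carriers respond more favorably to this BRAF inhibitor",
    },
}


def annotate_variants(variants):
    """Attach actionable-treatment hints to (gene, protein_change) calls.

    Variants absent from the curated table are reported as
    'no actionable entry' rather than silently dropped, so every
    call that was checked appears in the report.
    """
    report = []
    for gene, change in variants:
        hit = ACTIONABILITY.get((gene, change))
        report.append({
            "gene": gene,
            "change": change,
            "actionable": hit is not None,
            "detail": hit or "no actionable entry",
        })
    return report


calls = [("BRAF", "V600E"), ("TP53", "R175H")]
for row in annotate_variants(calls):
    print(row["gene"], row["change"], row["actionable"])
# prints:
# BRAF V600E True
# TP53 R175H False
```

The deliberate design point is that absence from the table is itself reported: in a clinical setting a silent miss is worse than an explicit "no actionable entry".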

GLOSSARY

ChIP-seq This technique uses next-generation sequencing to discover the DNA bound by proteins of interest. It is an extension of chromatin immunoprecipitation (ChIP), in which specific interactions between proteins and DNA are investigated.

Chromosome conformation capture (3C) This technique allows researchers to examine the higher-order structure of the genome (the chromatin) by cross-linking the physical interactions of chromatin with formaldehyde, followed by enzymatic digestion and ligation; the frequency of ligation of two restriction fragments is a measure of the frequency of their interaction within the nucleus.

Chromosome conformation capture carbon-copy (5C) 5C is an extension of 3C that uses NGS technology to identify the ligation products, employing ligation-mediated amplification to copy and amplify a subset of the 3C library followed by detection with NGS [14]. 5C is allowing three-dimensional interaction maps of the genome to be generated, uncovering long-range interactions of promoters and distal elements that can potentially affect gene regulation.


Cloud computing The use of computational resources delivered over a network, often the Internet. Third parties provide software or computing services on an "as needed" basis, giving the user instant scalability if required. Examples include Amazon's EC2 cloud computing service and Microsoft Azure, both of which allow one to rent servers or clusters of servers to carry out complex calculations.

Data access control/policy Certain data, or categories of data, are often subject to controlled access. This is determined by the data producer, and access is usually granted after an application and verification process.

Data access portal A user-friendly web-based resource for accessing mainly category 3 and 4 data from large-scale genome sequencing projects.

Data integration Data integration takes many types of sequencing data, for example whole genome and transcriptome, and attempts to combine them in a meaningful way. If a structural variation is found through whole genome sequencing of the DNA, then integrating this with transcriptome sequencing could provide further evidence for the variation.

Genomic data categories The level of analysis usually dictates the category of data: (1) raw; (2) processed/normalized; (3) interpreted; and (4) summarized within and across diseases.

Data warehouse/archive A third-party provider of large-scale storage of category 1 and 2 data generated by genome sequencing projects. Data warehouses ensure universal accessibility to raw data under "open" or "controlled" access policies.

Exome capture sequencing (ECS) This method captures just the coding exons of the human genome using specially designed probes complementary to these exons. The DNA of interest is hybridized to the probes and all other DNA is washed away.
Microarrays Microarrays contain small sequences from thousands of genes or other genomic regions embedded on a solid surface, such as a glass slide, which is subsequently hybridized with DNA or RNA. While less expensive and faster than NGS technologies, they offer only one or two data types per array type.

miRNA expression miRNA microarrays have been developed to measure the expression of these small non-coding RNAs.

miRNA-seq As in RNA sequencing, small non-coding RNAs can be prepared from cells and sequenced. miRNA-seq data provide the nucleotide sequence in addition to the expression levels.

Next-generation sequencing (NGS) Modern high-throughput sequencing technologies that parallelize sequencing, allowing thousands to millions of DNA molecules to be sequenced simultaneously, lowering the cost and making the sequencing of large genomes possible in weeks. Whole genome, exome, transcriptome and ChIP-seq all use next-generation sequencing technology.

Personalized medicine The adaptation of therapeutic modalities to individual patients based on the use of their (mainly) genetic information. An example of this approach is the use of mutation information to deliver a drug known to have activity against that mutation (e.g. patients with a BRAF V600E mutation respond more favorably to the BRAF inhibitor vemurafenib).


Reverse phase protein array (RPPA) An antibody-based array used to measure protein expression levels and phosphorylation state, including the levels of phosphorylated isoforms.

RNA expression array RNA expression arrays ("microarrays") measure differences in the expression of genes between two populations of samples. RNA "probes" are, as in DNA microarrays, spotted onto glass slides or chips.

RNA-seq (transcriptome sequencing) Next-generation sequencing of cDNA from all transcribed mRNAs, or from a subset obtained using the capture techniques described above. This allows differential expression analysis like microarray-based methods but, in addition, facilitates the discovery of SNVs, SVs and novel isoforms/exons.

SNP array A DNA microarray (DNA probes immobilized on a glass slide/chip) used to determine single nucleotide changes as well as copy number variation in the genome.

Targeted sequencing Like exome capture, targeted sequencing uses probes, but in this technique they are designed against specific regions of the genome that are of interest, e.g. mutated genes. Targeted sequencing is usually used for validation and for high-throughput screening of recurring mutations.

Whole genome sequencing (WGS) The most complete form of DNA sequencing, covering the majority of the genome: exons, introns and intergenic regions.

ABBREVIATIONS

ALL Acute lymphoblastic leukemia
AML Acute myeloid leukemia
API Application programming interface
CGP Cancer Genome Project
CLL Chronic lymphocytic leukemia
CNV Copy number variation
COSMIC Catalogue of Somatic Mutations in Cancer
dbGaP Database of Genotypes and Phenotypes
DCC Data Coordination Center
EBI European Bioinformatics Institute
EC2 Elastic Compute Cloud
ECS Exome capture sequencing
EGA European Genome-Phenome Archive
EVS Exome Variant Server
GWAS Genome-wide association studies
ICGC International Cancer Genome Consortium
IGV Integrative Genomics Viewer
Indels Small insertions or deletions
KEGG Kyoto Encyclopedia of Genes and Genomes
LOH Loss of heterozygosity
MSKCC Memorial Sloan-Kettering Cancer Center
NCBI National Center for Biotechnology Information
NCI National Cancer Institute
NGS Next-generation sequencing
NHGRI National Human Genome Research Institute
NHLBI National Heart, Lung, and Blood Institute
PCGP Pediatric Cancer Genome Project
QC Quality control
RPPA Reverse phase protein array


SNP Single nucleotide polymorphism
SNV Single nucleotide variation
SRA Sequence Read Archive
T-ALL T-cell precursor ALL
TARGET Therapeutically Applicable Research to Generate Effective Treatments
TCGA The Cancer Genome Atlas
WGS Whole genome sequencing

REFERENCES

[1] Reddy EP, Reynolds RK, Santos E, Barbacid M. A point mutation is responsible for the acquisition of transforming properties by the T24 human bladder carcinoma oncogene. Nature 1982;300:149–52.
[2] Tabin CJ, Bradley SM, Bargmann CI, Weinberg RA, Papageorge AG, Scolnick EM, et al. Mechanism of activation of a human oncogene. Nature 1982;300:143–9.
[3] International Human Genome Sequencing Consortium. Finishing the euchromatic sequence of the human genome. Nature 2004;431:931–45.
[4] Bardelli A, Parsons DW, Silliman N, Ptak J, Szabo S, Saha S, et al. Mutational analysis of the tyrosine kinome in colorectal cancers. Science 2003;300:949.
[5] Bignell G, Smith R, Hunter C, Stephens P, Davies H, Greenman C, et al. Sequence analysis of the protein kinase gene family in human testicular germ-cell tumors of adolescents and adults. Genes Chromosomes Cancer 2006;45:42–6.
[6] Davies H, Hunter C, Smith R, Stephens P, Greenman C, Bignell G, et al. Somatic mutations of the protein kinase gene family in human lung cancer. Cancer Res 2005;65:7591–5.
[7] Wood LD, Parsons DW, Jones S, Lin J, Sjoblom T, Leary RJ, et al. The genomic landscapes of human breast and colorectal cancers. Science 2007;318:1108–13.
[8] Cancer Genome Atlas Research Network. Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature 2008;455:1061–8.
[9] Weir BA, Woo MS, Getz G, Perner S, Ding L, Beroukhim R, et al. Characterizing the cancer genome in lung adenocarcinoma. Nature 2007;450:893–8.
[10] Ding L, Getz G, Wheeler DA, Mardis ER, McLellan MD, Cibulskis K, et al. Somatic mutations affect key pathways in lung adenocarcinoma. Nature 2008;455:1069–75.
[11] Jones S, Zhang X, Parsons DW, Lin JC, Leary RJ, Angenendt P, et al. Core signaling pathways in human pancreatic cancers revealed by global genomic analyses. Science 2008;321:1801–6.
[12] Sjoblom T, Jones S, Wood LD, Parsons DW, Lin J, Barber TD, et al. The consensus coding sequences of human breast and colorectal cancers. Science 2006;314:268–74.
[13] Parsons DW, Li M, Zhang X, Jones S, Leary RJ, Lin JC, et al. The genetic landscape of the childhood cancer medulloblastoma. Science 2011;331:435–9.
[14] Dostie J, Richmond TA, Arnaout RA, Selzer RR, Lee WL, Honan TA, et al. Chromosome Conformation Capture Carbon Copy (5C): a massively parallel solution for mapping interactions between genomic elements. Genome Res 2006;16:1299–309.
[15] Sanyal A, Lajoie BR, Jain G, Dekker J. The long-range interaction landscape of gene promoters. Nature 2012;489:109–13.


[16] Cancer Genome Atlas Research Network. Integrated genomic analyses of ovarian carcinoma. Nature 2011;474:609–15.
[17] The International Cancer Genome Consortium. International network of cancer genome projects. Nature 2010;464:993–8.
[18] Papaemmanuil E, Cazzola M, Boultwood J, Malcovati L, Vyas P, Bowen D, et al. Somatic SF3B1 mutation in myelodysplasia with ring sideroblasts. N Engl J Med 2011;365:1384–95.
[19] Puente XS, Pinyol M, Quesada V, Conde L, Ordonez GR, Villamor N, et al. Whole-genome sequencing identifies recurrent mutations in chronic lymphocytic leukaemia. Nature 2011;475:101–5.
[20] Downing JR, Wilson RK, Zhang J, Mardis ER, Pui CH, Ding L, et al. The Pediatric Cancer Genome Project. Nat Genet 2012;44:619–22.
[21] Zhang J, Ding L, Holmfeldt L, Wu G, Heatley SL, Payne-Turner D, et al. The genetic basis of early T-cell precursor acute lymphoblastic leukaemia. Nature 2012;481:157–63.
[22] Zhang J, Benavente CA, McEvoy J, Flores-Otero J, Ding L, Chen X, et al. A novel retinoblastoma therapy from genomic and epigenetic analyses. Nature 2012;481:329–34.
[23] Wu G, Broniscer A, McEachron TA, Lu C, Paugh BS, Becksfort J, et al. Somatic histone H3 alterations in pediatric diffuse intrinsic pontine gliomas and non-brainstem glioblastomas. Nat Genet 2012;44:251–3.
[24] Cheung NK, Zhang J, Lu C, Parker M, Bahrami A, Tickoo SK, et al. Association of age at diagnosis and genetic mutations in patients with neuroblastoma. J Am Med Assoc 2012;307:1062–71.
[25] Robinson G, Parker M, Kranenburg TA, Lu C, Chen L, Ding L, et al. Novel mutations target distinct subgroups of medulloblastoma. Nature 2012;488:43–8.
[26] Reva B, Antipin Y, Sander C. Predicting the functional impact of protein mutations: application to cancer genomics. Nucleic Acids Res 2011;39:e118.
[27] Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM, et al. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res 2001;29:308–11.
[28] NHLBI. Exome Variant Server. NHLBI Exome Sequencing Project; 2012.
[29] The 1000 Genomes Project Consortium. A map of human genome variation from population-scale sequencing. Nature 2010;467:1061–73.
[30] Chin L, Hahn WC, Getz G, Meyerson M. Making sense of cancer genomic data. Genes Dev 2011;25:534–55.
[31] Leinonen R, Akhtar R, Birney E, Bower L, Cerdeno-Tarraga A, Cheng Y, et al. The European Nucleotide Archive. Nucleic Acids Res 2011;39:D28–31.
[32] Mailman MD, Feolo M, Jin Y, Kimura M, Tryka K, Bagoutdinov R, et al. The NCBI dbGaP database of genotypes and phenotypes. Nat Genet 2007;39:1181–6.


[33] Cerami E, Gao J, Dogrusoz U, Gross BE, Sumer SO, Aksoy BA, et al. The cBio cancer genomics portal: an open platform for exploring multidimensional cancer genomics data. Cancer Discov 2012;2:401–4.
[34] Cerami EG, Bader GD, Gross BE, Sander C. cPath: open source software for collecting, storing, and querying biological pathways. BMC Bioinformatics 2006;7:497.
[35] Marchler-Bauer A, Lu S, Anderson JB, Chitsaz F, Derbyshire MK, DeWeese-Scott C, et al. CDD: a Conserved Domain Database for the functional annotation of proteins. Nucleic Acids Res 2011;39:D225–9.
[36] Goecks J, Nekrutenko A, Taylor J, The Galaxy Team. Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol 2010;11:R86.
[37] Kumar P, Henikoff S, Ng PC. Predicting the effects of coding non-synonymous variants on protein function using the SIFT algorithm. Nat Protoc 2009;4:1073–81.
[38] Kadauke S, Udugama MI, Pawlicki JM, Achtman JC, Jain DP, Cheng Y, et al. Tissue-specific mitotic bookmarking by hematopoietic transcription factor GATA1. Cell 2012;150:725–37.
[39] Bhatt DM, Pandya-Jones A, Tong AJ, Barozzi I, Lissner MM, Natoli G, et al. Transcript dynamics of proinflammatory genes revealed by sequence analysis of subcellular RNA fractions. Cell 2012;150:279–90.
[40] Schuster SC, Miller W, Ratan A, Ratan LP, Giardine B, Kasson LR, et al. Complete Khoisan and Bantu genomes from southern Africa. Nature 2010;463:943–7.
[41] Blankenberg D, Von Kuster G, Coraor N, Ananda G, Lazarus R, Mangan M, et al. Galaxy: a web-based genome analysis tool for experimentalists. Curr Protoc Mol Biol 2010; Chapter 19: Unit 19.10.1–21.
[42] Smedley D, Haider S, Ballester B, Holland R, London D, Thorisson G, et al. BioMart – biological queries made easy. BMC Genomics 2009;10:22.
[43] Thorvaldsdottir H, Robinson JT, Mesirov JP. Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration. Brief Bioinform 2012.
[44] Edmonson MN, Zhang J, Yan C, Finney RP, Meerzaman DM, Buetow K. Bambino: a variant detector and alignment viewer for next-generation sequencing data in the SAM/BAM format. Bioinformatics 2011;27:865–6.
[45] Krampis K, Booth T, Chapman B, Tiwari B, Bicak M, Field D, et al. Cloud BioLinux: pre-configured and on-demand bioinformatics computing for the genomics community. BMC Bioinformatics 2012;13:42.
[46] Shah SP, Roth A, Goya R, Oloumi A, Ha G, Zhao Y, et al. The clonal and mutational evolution spectrum of primary triple-negative breast cancers. Nature 2012;486:395–9.