The Emerging Importance of Genetics in Epidemiologic Research. I. Basic Concepts in Human Genetics and Laboratory Technology


SPECIAL REPORT

DARRELL L. ELLSWORTH, PhD, AND TERI A. MANOLIO, MD, MHS

PURPOSE: To define a general framework of current approaches to the discovery of disease-associated genes and the role of genetic factors in influencing disease risk through the integration of genome technology and traditional epidemiologic methods.

METHODS: An overview of basic concepts in human genetics, laboratory methodology for measuring genetic variation believed to influence common diseases, and issues concerning preparation and utilization of genetic materials is provided as a foundation for genetic epidemiologic research.

RESULTS: Identification and characterization of human genetic variation is providing new risk factors for disease in the form of DNA sequence variation. The availability of genetic material from participants in large epidemiologic studies and appropriate informed consent represents an invaluable resource for exploring genetic and environmental influences on disease risk.

CONCLUSIONS: Advances in genome technology coupled with vast amounts of genetic data resulting from the Human Genome Project are broadening the scope of epidemiologic research and providing tools to identify individuals at increased risk of disease. Combining diverse expertise from the fields of epidemiology and human genetics provides unique opportunities to localize disease-susceptibility genes and examine molecular mechanisms of complex disease etiology.

Ann Epidemiol 1999;9:1–16. Published by Elsevier Science Inc.

KEY WORDS: Epidemiology, Molecular Genetics, Human Genome, Polymorphism, Genetic Techniques, Hereditary Diseases.

EDITOR’S NOTE

Rapid advances in the understanding of the genetic determinants and environmental triggers of complex disease offer great opportunities for population research. This issue contains the first of a series of three articles on “The Emerging Importance of Genetics in Epidemiologic Research.” The purpose of this series is to acquaint the reader with the basic concepts and terminology of human genetics that are necessary to better understand the burgeoning literature in this area.

PURPOSE

Genetic epidemiology relates genetic characteristics and their environmental influences to the distribution of disease within diverse human populations. Primary objectives of genetic epidemiologic research are to: 1) localize genes influencing disease risk in the general population; 2) identify genetic variants in disease susceptibility genes that are responsible for interindividual differences in disease risk (functional mutations); 3) determine physiologic and biochemical mechanisms by which the gene products (proteins) contribute to disease onset and progression; 4) identify modifiable environmental factors that may affect the impact of deleterious genes (gene-environment interactions); 5) facilitate early detection of subclinical disease; and 6) design more effective intervention strategies. Familial clustering of common disorders independent of known risk factors suggests that genetic epidemiology studies

From the Epidemiology and Biometry Program, Division of Epidemiology and Clinical Applications, National Heart, Lung, and Blood Institute, National Institutes of Health, Bethesda, Maryland. Address reprint requests to: Darrell L. Ellsworth, PhD, Division of Epidemiology and Clinical Applications, National Heart, Lung, and Blood Institute, 6701 Rockledge Drive, MSC 7934, Bethesda, MD 20892–7934. Received July 8, 1998; revised August 10, 1998; accepted August 12, 1998. Published by Elsevier Science Inc., 655 Avenue of the Americas, New York, NY 10010.

1047-2797/99/$–see front matter PII S1047-2797(98)00047–7


Ellsworth and Manolio BASIC GENETIC CONCEPTS IN EPIDEMIOLOGIC RESEARCH

Selected Abbreviations and Acronyms
LDL = low-density lipoprotein cholesterol
CHD = coronary heart disease
apo = apolipoprotein
CFTR = cystic fibrosis transmembrane conductance regulator
SSCP = single-stranded conformation polymorphism
RFLP = restriction fragment length polymorphism

may uncover important clues to understanding the pathogenesis of many complex diseases and help identify individuals at increased risk. Although genetic factors may influence disease risk through measurable effects on intermediate risk factor traits such as cholesterol levels and blood pressure, genes may also influence risk through novel molecular pathways that are difficult to measure in vivo (1). The multifactorial nature of many complex diseases suggests that numerous genes, each with multiple forms (alleles) having small to moderate effects, may account for the majority of genetic variation involved in defining risk of common chronic diseases on a population basis. Common genetic variants or polymorphisms associated with risk for complex diseases have been identified in a variety of candidate genes, many of which encode products related to known risk factors such as apolipoproteins (2) and blood pressure-regulating neurohormones (3). Recent advances in genome technology, computational methods, and the rapidly increasing amounts of genetic data being generated by the Human Genome Project are providing the necessary tools to uncover the molecular mechanisms of complex disease etiology (4). Genetic information is broadening the scope of epidemiologic research by providing a wide array of potential new risk factors in the form of deoxyribonucleic acid (DNA) sequence variation. However, the identification of genetic variants is merely the first step. Determining the role of such polymorphisms in influencing disease in concert with other genetically determined traits and/or environmental risk factors will be an enormous and daunting task. The power of traditional epidemiologic methods to identify and correct for bias, confounding, and interaction on patterns of disease occurrence will be critically important to the proper interpretation and application of genetic epidemiologic findings.
For these reasons, epidemiologists must be actively involved in investigating the role of genetic factors, and their potential modification by the environment, in human health and disease. The purpose of this three-part series is to provide an overview of basic concepts and principles in human genetics, genomics, and genetic epidemiology with potential relevance to epidemiologic research. Throughout the series, we will define a general framework of current approaches to the discovery of disease-associated genes and their role in

AEP Vol. 9, No. 1 January 1999: 1–16

influencing disease risk. In this first segment, we: 1) review the structure and function of the genetic material with emphasis on molecular mechanisms responsible for genetic differences among individuals; 2) describe common methods for detection of polymorphisms believed to influence common chronic diseases including evolving techniques that use DNA chips for high-throughput genetic analysis; and 3) discuss techniques for proper collection and long-term storage of genetic material including ethical issues related to human subjects.

THE GENETIC MATERIAL

Structure of DNA

Deoxyribonucleic acid (DNA) is a macromolecule that carries genetic information and represents the molecular basis of heredity. DNA is normally composed of two complementary polynucleotide chains that are oriented in opposite directions (antiparallel) and are connected by hydrogen bonds. Each polynucleotide chain consists of a linear array of deoxyribonucleotides that contain a five-carbon sugar molecule (deoxyribose) linked to a phosphate group and a nitrogenous base. There are four common nitrogenous bases in DNA: two purines, adenine (A) and guanine (G), and two pyrimidines, cytosine (C) and thymine (T). The double-stranded molecule is twisted in the form of a helix with a constant width maintained by restrictions to base pairing such that A only pairs with T and G only pairs with C (Figure 1).

Gene Expression

Gene expression in eukaryotes generally involves: 1) transcription of a messenger RNA (mRNA) molecule from DNA in the nucleus; 2) processing of the primary mRNA transcript to yield the mature mRNA molecule; 3) transport of the mRNA to the cytoplasm; 4) translation of the mRNA (protein synthesis); 5) posttranslational modification of the protein; and 6) assembly of amino acid chains (protein subunits) into multimeric proteins. Transcription involves the production of an RNA molecule, using a strand of DNA as a template, by complementary base pairing in the presence of an RNA polymerase. RNA is normally a single-stranded molecule and differs from DNA in containing ribose rather than deoxyribose sugar molecules and uracil (U) instead of thymine (T). Once an mRNA molecule is synthesized and appropriately modified, it must be transported out of the nucleus to the cytoplasm where it will function by specifying the sequence of amino acids during protein synthesis. Ribosomes, which serve as the sites of protein synthesis, move along the mRNA molecule reading the genetic code in units of three nucleotides called


THE STRUCTURE OF GENES

The 5′ flanking portion of a gene usually contains the promoter which regulates gene expression (transcription) by controlling whether the gene will be transcribed and if so, the corresponding level of expression. Many regulatory regions in eukaryotes contain several conserved sequences such as the TATA box [TATA(A or T)A(T or A)] and the CCAAT motif [GC(C or T)CAATCT], which are believed to play an important role in binding other components (including RNA polymerase and transcription factors) necessary for transcription. Exons generally contain genetic information that specifies the sequence of amino acids for constructing the corresponding protein, though some exons may occur in regions that do not ultimately encode protein (5′ and 3′ untranslated regions). Introns are DNA sequences that are removed from the mRNA after synthesis and thus do not encode amino acid sequence. The downstream or 3′ flanking region of a gene contains a termination codon specifying the end of the coding sequences and a polyadenylation signal that functions in processing the mRNA transcript once it has been synthesized (Figure 2).

THE PROCESSES OF CELL DIVISION

FIGURE 1. A portion of double-stranded DNA depicting chemical structure of the antiparallel polynucleotide chains and nature of the hydrogen bonding (dashed lines) between nitrogenous bases from complementary strands. The 5′→3′ direction of each strand is defined by the orientation of the #5 and #3 carbon atoms in the deoxyribose sugar molecules (numbered at upper left/right). Note that each base pair contains one purine (A or G) and one pyrimidine (C or T). Adapted from Suzuki et al. (52) with permission.

codons. Because some amino acids are coded by only one codon (tryptophan) while others have as many as six (serine), only certain nucleotide substitutions or mutations in the DNA will result in an amino acid substitution. Translation begins at the initiation codon (AUG) and continues until one of the three termination (stop) codons (UAA, UAG, or UGA) is encountered, after which elongation of the polypeptide chain terminates.
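The codon-by-codon reading described above can be illustrated with a short sketch. It is not part of the original report: the function names `translate` and `codons_for` and the example mRNA string are hypothetical, and the table is the standard genetic code written in one-letter amino acid symbols.

```python
from itertools import product

# Standard genetic code, with codons enumerated in U, C, A, G order.
# '*' marks the three termination codons (UAA, UAG, UGA).
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna: str) -> str:
    """Translate from the first AUG until a stop codon (or the end) is reached."""
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no initiation codon, nothing is translated
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break  # termination codon ends elongation
        protein.append(amino_acid)
    return "".join(protein)


def codons_for(amino_acid: str) -> list[str]:
    """List every codon that specifies the given one-letter amino acid."""
    return sorted(c for c, aa in CODON_TABLE.items() if aa == amino_acid)


# Hypothetical example transcript: AUG GCC UGG UAA, with untranslated bases on each side.
example = "GGAUGGCCUGGUAAGC"
print(translate(example))                           # Met-Ala-Trp in one-letter code
print(len(codons_for("W")), len(codons_for("S")))   # tryptophan vs. serine codon counts
```

The redundancy visible in `codons_for` is what allows some nucleotide substitutions to leave the amino acid sequence unchanged.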

A chromosome which is visible under a light microscope during cell division actually consists of a very long molecule of DNA that has been packaged into the compact entity we recognize as a chromosome. While double-helical DNA is normally present in extended linear form in order to function in gene expression and be available for replication, the process of cell division requires the DNA to be carefully condensed and tightly packaged. Somatic cell division (mitosis) produces many cells from a single progenitor cell (e.g., development of a multicellular organism from a single fertilized egg). Mitosis maintains the parental chromosome number by duplicating the genetic material before the cell divides. A photograph of human metaphase chromosomes taken under a microscope (karyotype) reveals that normal humans possess a total of 46 chromosomes—22 pairs of autosomes and two sex chromosomes. Each pair consists of two homologous chromosomes (or homologs) that are very similar but may contain copies of genes or other DNA sequences that differ slightly from each other (alleles). By contrast, the process of germ cell division (meiosis) reduces the number of chromosomes to one-half of the number present in the parental cell. Gametes (sperm and egg) usually contain only one copy of each chromosome (haploid complement) so that when gametes unite, the normal (diploid) number of chromosomes is restored. During meiosis, the homologous chromosomes undergo pairing (synapsis) and may physically exchange segments of DNA


FIGURE 2. (A) Schematic drawing representing the overall structure of a typical gene, the human ADP-ribosylation factor 5 (ARF5) gene, with exon and intron sizes drawn approximately to scale. The 5′→3′ orientation of the gene is determined by the direction of transcription. DNA sequence has been determined for the main portion of the gene (solid line) but is unavailable for part of the flanking regions (dashed line). (B) Nucleotide sequence of the 5′ portion of the human ARF5 gene surrounding exon 1 which is depicted in upper case letters. Partial sequence from the 5′ flanking region and intron 1 are shown in lower case letters (50). The predicted amino acid sequence (abbreviated) is listed above the corresponding exon sequence. (C) DNA sequence from the 3′ portion of the human ARF5 gene. The termination codon is indicated by asterisks.

by recombination or crossing-over (Figure 3). Recombination is very important because it promotes genetic variation among the gametes.

CYTOGENETIC MAPS

Condensed metaphase chromosomes can be stained with specific dyes that differentially stain areas of the chromosome depending on the base composition (% AT versus % GC) and accessibility of the stain to the DNA in that region. Using sophisticated banding and visualization techniques, more than 1000 bands have been distinguished in a normal human karyotype and are depicted in representative drawings or idiograms (Figure 4) (5). These “cytogenetic maps” may play a critical role in early diagnosis and study of human genetic diseases. Banded karyotypes can reveal chromosomal rearrangements, large deletions, and other abnormalities that have potential utility for identifying genes responsible for genetic disorders such as fragile X syndrome (6) and chronic granulomatous disease (7). Recent technological advances, such as spectral karyotyping, which unequivocally discerns all unique human chromosomes in different colors, represent powerful new tools for identifying chromosomal aberrations associated with disease (8).

ERRORS IN DNA REPLICATION: MUTATIONS

DNA must be duplicated (replicated) during cell division for progeny cells to receive a full complement of the genetic material. The process of DNA replication is very accurate due to the high specificity of hydrogen bonding between nitrogenous bases. Occasionally (once in 10⁸–10¹² bases) however, mistakes or mutations occur during replication that alter the newly synthesized DNA molecule despite mechanisms to ensure accurate reproduction (such as a proof-reading ability found in certain DNA polymerases that can remove a misincorporated base). Mutations are typically classified by the length of the sequence affected by the mutational event as well as the type of molecular change. For example, a given mutation may involve a single nucleotide (point mutation) or multiple adjacent nucleotides. An incorrect nucleotide may be incorporated during replication resulting in the replacement of one nucleotide


FIGURE 3. Physical exchange of DNA segments through recombination or crossing-over between paired homologous chromosomes during germ cell division (meiosis).

by another (substitution). The most common nucleotide substitutions are transitions, which replace one purine with another or one pyrimidine with another (A↔G or C↔T). Less common transversions occur when a purine is replaced by a pyrimidine or vice versa. Alternatively, one or more nucleotides may be removed from (deletion) or added to (insertion) the newly synthesized molecule. The ultimate physiologic effects (if any) of mutations will depend on the nature and location of the genetic change. Errors of DNA replication in noncritical regions such as introns usually have no effect on the corresponding RNA molecule or the structure and function of the protein product, although mutations at the junction between an intron and exon (splice-site mutations) have been shown to cause diseases such as neurofibromatosis (9). Mutations in the promoter region, particularly in essential sequences such as the TATA box and the CCAAT motif, do not alter the sequence of amino acids in the protein but may affect promoter activity and thus alter the level of gene expression and the quantity of gene product produced. Silent or synonymous substitutions occur within the protein coding regions of the DNA but do not result in amino acid substitutions due to inherent redundancy in the genetic code. However, other types of mutations (nonsynonymous substitutions) may greatly affect the structure and function of a given protein (Figure 5).
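The distinction between transitions and transversions reduces to whether the two bases belong to the same chemical class. A minimal sketch follows; the function name `substitution_type` is an illustrative assumption, not terminology from the report.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_type(ref: str, alt: str) -> str:
    """Classify a single-base DNA substitution as a transition or a transversion."""
    ref, alt = ref.upper(), alt.upper()
    valid = PURINES | PYRIMIDINES
    if ref not in valid or alt not in valid:
        raise ValueError("bases must be one of A, C, G, T")
    if ref == alt:
        raise ValueError("reference and alternate bases are identical")
    # Same class (purine to purine, or pyrimidine to pyrimidine) is a transition.
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


for ref, alt in [("A", "G"), ("C", "T"), ("A", "C"), ("G", "T")]:
    print(f"{ref}->{alt}: {substitution_type(ref, alt)}")
```

Of the twelve possible single-base changes, four are transitions and eight are transversions, so the predominance of transitions noted above is not what random substitution alone would produce.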
Missense mutations result in the replacement of one amino acid by another (a classic example is the substitution of valine for glutamic acid in the hemoglobin β-chain leading to sickle cell hemoglobin) (10), nonsense mutations produce a stop codon that terminates translation prematurely thus truncating the protein (as in β⁰ thalassemia) (11), and the insertion or deletion of nucleotides may cause a shift in the reading frame (frameshift mutation) resulting in a completely different sequence of amino acids (a nondeletion thalassemia in some African populations) (11). Numerous genetic disorders in humans are attributable to

genetic mutations like those described above. For example, many mutations are known to cause cystic fibrosis, a common hereditary disorder that produces thickened mucosal secretions resulting in respiratory tract inflammation and lung infections (12). In particular, deletion of a three-base-pair sequence, codon 508 in the cystic fibrosis transmembrane conductance regulator (CFTR) gene, which controls chloride ion transport through cell membranes, causes loss of a single amino acid (phenylalanine) in an important functional region of the protein (13). The ΔF508 mutation, present in approximately 80% of Caucasian cystic fibrosis patients, results in altered chloride ion transport and subsequent pathologic manifestations. Early research on the genetics of cardiovascular disease also identified a number of single-gene defects that significantly increase disease susceptibility. For example, familial hypercholesterolemia is an autosomal dominant disorder characterized by elevated low-density lipoprotein (LDL) cholesterol levels and premature coronary heart disease (CHD) that is attributable to a variety of mutations in the LDL receptor gene (14). It is important to recognize, however, that the majority of genetic variation involved in defining risk of “complex diseases” such as CHD on a population basis is believed to encompass many genes that encode products related to known risk factors. Mutations in these genes are generally not “deterministic” of disease (as are mutations in the LDL receptor which are usually accompanied by severe systemic disease) but are often associated with only modest increases in disease risk. For example, two point mutations cause amino acid substitutions at positions 112 and 158 of the apolipoprotein (apo) E gene resulting in three common forms of the apo E gene in humans (designated ε2, ε3, ε4) (15) which have small to moderate effects on disease risk (1).
The apo E alleles are related to quantitative differences in plasma lipid levels (16) such that persons carrying the ε4 allele tend to have significantly


higher cholesterol levels than the population mean, while levels in those with the ε2 allele are significantly lower (17). The ε4 allele has been suggested to increase risk for heart and vascular disease (18), but ε2 may protect against the development of CHD (19).

THE GENOME PROJECT

The Human Genome Project is a cooperative initiative with the ultimate goal of determining the complete genomic (DNA) sequence of the human genome as well as the genomes of several model organisms (20). To characterize the entire human genome, the project adopted a hierarchical approach to proceed through increasing levels of resolution and detail. The first objective was to complete several types of genomic maps (discussed in the next paper of this series) that organize and position specific landmarks (genetic markers).

GENETIC MARKERS

FIGURE 4. Idiogram depicting a high resolution banding pattern of human chromosome 7 (5). The position of the centromere divides the chromosome into a short (p) and long (q) arm. Numbers on the left identify specific chromosomal bands. Genes known to cause human disease or believed to influence various complex disorders (on the right) have been localized to the indicated regions. Gene abbreviations: CFTR—cystic fibrosis transmembrane conductance regulator; COL1A2—collagen type I, alpha-2; HERG—human ether-a-go-go-related gene; IL6—interleukin-6; KEL—Kell-Cellano system; LEP—leptin; NPY—neuropeptide Y. Reproduced with permission of S. Karger AG, Basel.

The types of genetic markers comprising genome maps, as well as methods for their detection, have evolved along with rapid advancements in genome technology. The first generation of genome maps primarily utilized biallelic markers—usually point mutations where a single base substitution has created two forms of a DNA sequence that differ by a single nucleotide (e.g., GAATTC and GACTTC). Numerous methods (21) have been developed to identify and genotype such single-nucleotide polymorphisms (SNPs). Single-stranded conformation polymorphism (SSCP) analysis, a popular method for detecting unknown mutations (22), relies on characteristic secondary structures formed when single-stranded DNA is allowed to self-anneal (nucleotides will form bonds with other nucleotides in the same molecule). A mutation is anticipated to change the secondary structure of a molecule and alter its migration (or movement) in nondenaturing acrylamide gels. Although SSCP is a popular method for identifying sequence differences between molecules, the technique varies in efficiency and reproducibility and does not identify the precise location or nature of the structural change (23). Modern DNA sequencing methodologies have now been sufficiently refined to permit the search for DNA variation to proceed by direct sequencing without the need to conduct single-stranded conformation or similar analyses (24). Restriction fragment length polymorphism (RFLP) analysis is a common method for genotyping known mutations. Restriction enzymes recognize short, specific DNA sequences known as restriction sites and function by cutting the DNA at those sequences (Figure 6). If a nucleotide substitution occurs within a restriction site, the polymorphism can be detected by subjecting the DNA to a restriction enzyme


FIGURE 5. Errors in DNA replication that alter the structure of the corresponding protein. A missense mutation results in an amino acid substitution where one amino acid replaces another, a nonsense mutation produces a stop codon prematurely terminating the protein, and a frameshift mutation resulting from the insertion of a single nucleotide causes a shift in the reading frame and a completely different sequence of amino acids (53).
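The three outcomes shown in Figure 5 can be reproduced on a toy coding sequence. The sketch below is an illustration under stated assumptions: the 12-base sequence and the function names `translate` and `classify_substitution` are hypothetical, the sequence is treated as the coding strand read from its first base, and the table is the standard genetic code. The GAG to GTG change mirrors the glutamic acid to valine substitution of sickle cell hemoglobin.

```python
from itertools import product

# Standard genetic code for coding-strand DNA, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate in frame from position 0, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


def classify_substitution(dna: str, position: int, alt: str) -> str:
    """Label a single-base substitution in a sense codon as silent, missense, or nonsense."""
    codon_start = position - position % 3
    mutated = dna[:position] + alt + dna[position + 1:]
    old_aa = CODON_TABLE[dna[codon_start:codon_start + 3]]
    new_aa = CODON_TABLE[mutated[codon_start:codon_start + 3]]
    if new_aa == old_aa:
        return "silent"
    if new_aa == "*":
        return "nonsense"
    return "missense"


coding = "ATGGAGTGGAAA"                        # hypothetical: Met-Glu-Trp-Lys
print(translate(coding))
print(classify_substitution(coding, 4, "T"))   # GAG -> GTG (Glu -> Val)
print(classify_substitution(coding, 5, "A"))   # GAG -> GAA (Glu unchanged)
print(classify_substitution(coding, 8, "A"))   # TGG -> TGA (premature stop)

# Inserting one nucleotide shifts the reading frame for every downstream codon.
frameshifted = coding[:4] + "C" + coding[4:]
print(translate(frameshifted))
```

The classifier assumes the affected codon is not itself a stop codon; handling stop-loss changes would need an extra branch.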

digestion and then visualizing the resulting fragments. Although RFLP markers were initially assayed by a technique known as Southern blotting (where fragments of genomic DNA were separated on an agarose gel, transferred to a nylon membrane, and then allowed to hybridize with a complementary probe labeled with a radioactive tag), the polymerase chain reaction (PCR) (Figure 7) is now routinely used to generate millions of copies of a short DNA segment containing a restriction site from a minute sample of native DNA. The amplified DNA is then cut with a restriction enzyme and the resulting fragments are visualized by ethidium bromide staining on acrylamide or agarose gels. RFLP genotyping has been used to examine numerous genetic polymorphisms in genes believed to influence complex diseases such as coronary disease and atherosclerosis. For example, a common C-to-T substitution in the N(5, 10)-methylenetetrahydrofolate reductase (MTHFR) gene, which increases thermolability of the enzyme resulting in elevated levels of plasma homocysteine (associated with increased risk for cardiovascular disease (25)), can be efficiently genotyped by PCR/RFLP methods (26). The three common apo E alleles, which have been associated with differential risk for Alzheimer disease (27) as well as heart and vascular disease (28, 29), can also be readily distinguished in a simple restriction digest (30). The oligonucleotide ligation assay (OLA) can also be utilized to genotype most polymorphic single-nucleotide

substitutions, especially those that do not occur within a restriction site. The method requires prior knowledge as to the exact nature of the point mutation as well as DNA sequence information in the surrounding region. Short synthetic segments of DNA specific to each form of the DNA sequence (allele specific oligonucleotides or ASOs) are used to characterize both alleles of the polymorphism in a large number of study subjects (Figure 8) (31, 32). A separate class of genetic markers consists of highly polymorphic regions of DNA where the primary variation of interest reflects differences in the number of tandemly repeated sequences (the same sequence is repeated consecutively numerous times). Variable number of tandem repeat loci (VNTRs) have rapidly become powerful tools with a wide array of applications including: 1) examining migration and gene flow among populations (33); 2) identifying individuals in forensic medicine and paternity analyses (DNA fingerprinting) (34); 3) constructing high resolution genetic maps for genome mapping (35); and 4) locating genes associated with disease through linkage analyses. Repeated sequences have also been discovered to be a common cause of major human diseases such as Huntington disease, myotonic dystrophy, spinocerebellar ataxia-type I, Fragile X syndrome, and various human cancers (36). The disease syndromes usually result from expanded three-base-pair (trinucleotide) repeats within the protein coding portion of the gene or the 5′ or 3′ untranslated regions. Huntington disease, for example,


is caused by the expansion of a polymorphic trinucleotide repeat (CAG) that ranges in size from 9 to 37 repeats in normal individuals but is repeated 37 to 86 times in Huntington patients (37). Expanded two-base-pair (dinucleotide) repeats have been implicated in some forms of cancer such as hereditary nonpolyposis colorectal cancer (HNPCC) (38). VNTR loci are typically classified into two groups depending on the length and structure of the core repeat sequences. Microsatellites or short tandem repeats (STRs) contain two to four base pair sequences such as CA or GATA that are tandemly repeated a variable number of times (Figure 9). Microsatellites are easily genotyped using PCR by constructing oligonucleotides in regions flanking the repeated region and then separating the PCR products on acrylamide gels. Minisatellites contain longer repeat units of ten to several hundred base pairs. The highly variable nature of many minisatellites is utilized in forensic applications where the composite banding profile resulting from several minisatellites represents a pattern (DNA fingerprint) unique to each person.
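Counting repeat units is the computational core of genotyping a tandem repeat locus from sequence data. The following is a minimal sketch, assuming a hypothetical helper `longest_tandem_run` and made-up flanking bases; laboratory genotyping as described above infers the same count from PCR product length rather than from sequence.

```python
import re


def longest_tandem_run(sequence: str, motif: str) -> int:
    """Return the largest number of consecutive copies of `motif` found in `sequence`."""
    pattern = re.compile(f"(?:{re.escape(motif)})+")
    runs = (match.group(0) for match in pattern.finditer(sequence))
    return max((len(run) // len(motif) for run in runs), default=0)


# Hypothetical alleles: flanking bases are invented, only the repeat counts matter.
cag_allele = "TTG" + "CAG" * 12 + "CCGCAGTT"
gata_allele = "AC" + "GATA" * 7 + "GG"

print(longest_tandem_run(cag_allele, "CAG"))    # trinucleotide repeat count
print(longest_tandem_run(gata_allele, "GATA"))  # tetranucleotide microsatellite count
```

Because each allele differs only in the number of repeat units, two alleles with n1 and n2 copies of a motif of length k yield PCR products that differ by k × (n1 − n2) base pairs, which is what gel separation resolves.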


DNA CHIP TECHNOLOGY AND HIGH THROUGHPUT MUTATION DETECTION A variety of techniques such as SSCP and direct DNA sequencing have been used to identify genetic mutations, but their reliance on gel electrophoresis complicates the ability to conduct high-throughput analysis. High density oligonucleotide probe arrays (DNA chips) offer the opportunity to significantly increase throughput while decreasing cost in population screening for deleterious mutations that cause genetic disease (39) and single nucleotide polymorphisms (40). One type of DNA chip is produced by synthesizing oligonucleotides (or probes) in situ on a silicon surface using procedures similar to that used in manufacturing computer chips (Figure 10). Using this technology, DNA chips containing many thousands of distinct probes can be produced in several hours which can be customized to provide sequence and genotype information on a gene of interest. DNA chips have been used for large-scale screening of mutations in genes such as the BRCA1 (breast cancer sus-

FIGURE 6. A restriction fragment length polymorphism (RFLP) may be characterized by generating many copies of the DNA region containing the RFLP marker using the polymerase chain reaction (see Figure 7). (A) A short DNA sequence (5′-GAATTC-3′) recognized by the restriction enzyme Eco RI (isolated from the bacterium Escherichia coli) is located within the amplified fragment of DNA on the left. The restriction enzyme will cut both strands of the DNA within this sequence between the bases indicated by open arrowheads. A single base substitution (in bold) has altered the restriction site (5′-GACTTC-3′) such that Eco RI will not cut the DNA fragment on the right. (B) The fragment on the left has been cut into two smaller fragments, but the DNA fragment on the right remains uncut. (C) Fragments resulting from the restriction digestion can be separated by size and visualized on an agarose or polyacrylamide gel. The DNA is applied to the gel in the wells at the top and migrates toward the bottom of the gel when an electric current is applied.
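The digest in Figure 6 can be mimicked in a few lines. The sketch below is illustrative only: the function name `digest`, the 14-base example alleles, and the single-strand representation are assumptions made here, and the default cut position (after the first G of GAATTC) reflects the well-known Eco RI cleavage pattern rather than a detail stated in the caption.

```python
def digest(sequence: str, site: str = "GAATTC", cut_offset: int = 1) -> list[str]:
    """Cut `sequence` at every occurrence of `site`, `cut_offset` bases into the site.

    Only the top strand is modeled; fragment lengths are what a gel would resolve.
    """
    fragments = []
    start = 0
    position = sequence.find(site)
    while position != -1:
        cut = position + cut_offset
        fragments.append(sequence[start:cut])
        start = cut
        position = sequence.find(site, position + 1)
    fragments.append(sequence[start:])
    return fragments


# Hypothetical amplified fragments differing by a single base within the site.
allele_with_site = "AAAAAGAATTCTTT"     # contains GAATTC
allele_without_site = "AAAAAGACTTCTTT"  # GACTTC, site abolished

for name, allele in [("cut", allele_with_site), ("uncut", allele_without_site)]:
    lengths = [len(fragment) for fragment in digest(allele)]
    print(name, lengths)
```

A heterozygous individual would show the uncut band plus both smaller bands, since each allele is digested independently.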


FIGURE 7. Polymerase chain reaction (PCR) methodology utilizes short segments of man-made DNA (called oligonucleotides or primers which anneal to the DNA of interest and serve as sites where polymerization is initiated) and a thermostable DNA polymerase (such as Taq which is isolated from the thermophilic bacterium Thermus aquaticus) to synthesize many copies of a given DNA segment (A). The DNA is heated to approximately 94°C to separate the complementary strands and then cooled (to 45–60°C depending on the region of interest) to allow the primers to bind to the flanking sequences. Warming the reaction mixture to 72°C allows the enzyme to polymerize the complementary strands. Repeating this thermocycle profile (B) will yield approximately one million fold amplification after 20 cycles and nearly one billion fold amplification after 30 cycles.
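The amplification figures quoted in the caption follow from the target doubling in each cycle. A brief sketch makes the arithmetic explicit; the `efficiency` parameter is an assumption added here to show how imperfect doubling lowers the yield, and is not discussed in the report.

```python
def fold_amplification(cycles: int, efficiency: float = 1.0) -> float:
    """Fold increase in target copies after `cycles` PCR cycles.

    efficiency = 1.0 means every template is copied in every cycle (perfect doubling).
    """
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must be between 0 and 1")
    return (1.0 + efficiency) ** cycles


for cycles in (20, 30):
    print(cycles, f"{fold_amplification(cycles):,.0f}")

# With 90% per-cycle efficiency the yield after 30 cycles is markedly lower.
print(f"{fold_amplification(30, efficiency=0.9):,.0f}")
```

Under perfect doubling, 2²⁰ is about 1.05 million and 2³⁰ about 1.07 billion, consistent with the approximate values given in the caption.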

ceptibility) gene. More than 96,600 different probes on a single chip have been used to detect mutations within exon 11 (approximately 3450 base pairs long) and the immediate flanking regions (39). The results showed that many known mutations were readily detectable in breast cancer patients but sensitivity was imperfect for some mutations. Although DNA chip technology holds great promise for rapid genotype assessment, the results of this study suggest that technical modifications may be necessary to improve accuracy and reliability.

FIGURE 8. The oligonucleotide ligation assay (OLA) can be used to genotype polymorphic single-nucleotide substitutions if the nature of the mutation and surrounding DNA sequence are known. TOP: (A) Two DNA fragments differing at a single nucleotide position (T/C) are subjected to heat, causing the strands to dissociate. (B) An oligonucleotide specific to the T form of the DNA sequence (an allele specific oligonucleotide, ASO) is constructed such that it contains a compound (fluorescein) that will change color attached to the 5′ end and the last (3′) nucleotide corresponds with the polymorphic position. The oligonucleotide specific to the T allele is allowed to anneal to both fragments. (C) A second oligonucleotide is introduced that begins at the nucleotide immediately adjacent to the polymorphic site and has a biotin molecule attached to the 3′ end. The 3′ base of the ASO can properly base-pair to the corresponding nucleotide in the fragment on the left, allowing the two oligonucleotides to be ligated (joined). Ligation cannot occur if the 3′ base of the ASO cannot correctly base pair with the DNA (right). (D) Reactions occur in microtiter plates that have been coated with streptavidin; the oligonucleotides are captured in the wells because the biotin label on the nonspecific (joining) oligonucleotide will bind to the streptavidin. Subsequent washing will remove the original DNA and any unligated oligonucleotides. The assay containing the ligated oligonucleotides will be detectable due to the fluorescein label, but the assay containing unligated oligonucleotides will remain colorless. BOTTOM: The parallel assay for the alternate (C) allele at the polymorphic site. The oligonucleotide specific to the C allele is labeled with digoxigenin. This assay can be conducted simultaneously in the same reaction vessel such that an individual homozygous for the T allele would appear red, a heterozygote would be purple (mixture of red and blue), and a C homozygote would show blue (32).

FIGURE 9. (A) Tetranucleotide repeat in the 3′ portion of intron 7 of the human renin gene (51). Intron 7 sequence is shown in lower case letters and exon 8 sequence is depicted in upper case letters. Arrows indicate positions of PCR primers used to genotype the polymorphism. Note the imperfect nature of repeat units #7 and #10. (B) Characterization of a microsatellite marker by PCR using primers constructed in regions flanking the repeat. Amplification products differing in length due to differences in the number of repeats can be visualized on acrylamide gels: lane 1, homozygote for the 255 bp allele; lane 2, homozygote for the 263 bp allele; lane 3, heterozygote containing both alleles.

PREPARATION OF GENETIC MATERIALS IN EPIDEMIOLOGIC STUDIES

DNA Isolation and Purification: Sources and Techniques

The availability of high quality genomic DNA from participants in large epidemiologic studies represents an invaluable resource for exploring genetic and environmental influences on disease and observing relationships between clinical outcomes and the genetic composition of individuals. The ability to link phenotypic information on risk factors with genetic data provides opportunities to: 1) examine the relationship between genes and clinical or subclinical disease and determine whether DNA information enables accurate and early identification of individuals at increased risk; 2) improve our understanding of the etiology and pathophysiology of disease by exposing molecular pathways through which genetic variation gives rise to interindividual differences in disease risk; and 3) develop more rational and effective approaches to the treatment and prevention of disease.

A variety of techniques and commercial reagents are available for easy and cost-effective isolation of genomic DNA from whole blood or other nucleated cells (41) that yield high-molecular-weight DNA suitable for most molecular biology applications (42, 43). Quality DNA can be efficiently recovered from blood or other cells using techniques that require as little as a few hours to as much as several days at a total cost of less than $50 per sample. Basic steps involved in extraction of genomic DNA from the buffy coat fraction (primarily white blood cells) of whole blood include: 1) isolating and washing the DNA-containing cells; 2) lysing the cells and digesting proteins that may be complexed with the DNA; 3) removing residual proteins in solution with salt; 4) concentrating the DNA and removing salt and any residual organic solvents by ethanol precipitation; and 5) lyophilizing (drying) the DNA and resolubilizing it in a low-ionic-strength buffer for long term storage. This process typically yields 150–250 µg of purified DNA from 10 ml of whole blood. Although an alternative procedure incorporating phenol extraction and dialysis may yield higher molecular weight DNA (43), the lengthy dialysis step may make it unfeasible to use this protocol with large numbers of samples.

Following DNA isolation, optical densities (ODs) should be measured on a small portion of each sample to determine concentration and purity. This information is essential in preparing constant-concentration, constant-volume aliquots necessary to achieve high uniformity of expression in subsequent DNA assays. Aliquots from systematic samples (e.g., every 50th sample) can also be visualized on 0.4% ethidium-stained agarose gels containing high molecular weight DNA markers to assess the average molecular weight of the preparations and the degree of residual RNA. Finally, multiple long-term storage aliquots should be prepared in order to: 1) archive the material for extended periods in different locations to guard against loss; 2) minimize the chances of contaminating the entire sample; and 3) maximize utility and eliminate waste.

Establishment of Immortalized Cell Lines by Epstein-Barr Virus Transformation

Epstein-Barr virus (EBV) transformation of human B lymphocytes is used to generate permanent or “immortalized” cell lines that can be maintained indefinitely to provide a renewable source of DNA for genotype analysis in epidemiologic studies (44). Using relatively simple methodology that can be conducted in any laboratory with access to tissue culture facilities, white blood cells are isolated from whole blood samples by centrifugation and then incubated in the presence of EBV. Within 3–4 weeks a sufficient number of cells will be transformed and immortalized to be frozen for future utilization. Frozen cells can be recultured at any time to provide DNA for genetic analysis. Untransformed white blood cells can be preserved for transformation at a later time (cryopreservation), although with reduced transformation efficiency, by freezing at ultra-low temperatures (less than −80°C) in the presence of a cryoprotectant.

ETHICAL, LEGAL, AND SOCIAL ISSUES IN GENETIC EPIDEMIOLOGY

Informed Consent for Conducting Genetic Research on Human Subjects

As in all human subjects research, informed consent discloses information regarding the nature and objectives of the research, potential risks and anticipated benefits that may result from participation, future contact for additional information, and procedures that will be implemented to minimize inadvertent release of personal information (45, 46). Informed consent procedures in genetic epidemiologic studies also include informing prospective participants that biological specimens (tissue, blood, or any material that can serve as a source of DNA) will be used for genetic analyses and disclosing the extent to which genetic test results, such as genetic predisposition to disease and family relationships (nonpaternity, adoption), will be kept confidential and shared with the participant. Informed consent that is sensitive to issues unique to particular cultural or ethnic groups will minimize the potential for adverse personal and/or social outcomes.

FIGURE 10. DNA chips may be produced by synthesizing oligonucleotides (or probes) in situ on a silicon surface. All probes are synthesized simultaneously on the chip by photoactivating only those areas in which the desired chemical reactions are to occur. Selective activation and synthesis continues until all probes are complete. Length of the probes (usually 20 base pairs long) can be modified to optimize binding intensity and specificity. The end result is that a small area (square) on the chip will contain thousands of identical single-stranded DNA probes. (A) Probes may be designed to detect all possible single base substitutions, single base insertions, and one to five base deletions on each strand in a particular DNA region. Each position in the sequence of interest is assayed with 28 separate oligonucleotide probes (14 for each DNA strand) which differ at the central position as follows: four probes contain T, G, C, or A; the next four account for single base insertions; one probe contains the normal or wild-type (wt) sequence; and the remaining five incorporate one to five base deletions. The series progresses along the DNA region such that each nucleotide position in turn is varied to detect potential mutations at that particular site. This design provides redundancy that increases sensitivity and specificity. (B) Representation of a small portion of a DNA chip used to screen mutations in a breast cancer susceptibility gene (39). Probes synthesized on the chips function by hybridizing to DNA (or RNA) that is applied to the surface of the chip if the probe and DNA sequences are complementary. The DNA to be assayed is typically labeled with a compound that will fluoresce when exposed to ultraviolet (UV) light. Labeled DNA to be assayed is applied to the surface of the chip, allowed to anneal to complementary probes, then all unbound DNA is washed away. Note the positive signal near the center of the array. (C) Close-up of the area surrounding the positive signal from panel B with different contrast to increase clarity demonstrating detection of the 2457 C→T mutation in BRCA1 (39). Note that true visualization procedures use a two-color fluorescent assay system to detect mutations which is based on measuring the relative intensity of fluorescence (signal) from a reference person known to have normal sequence versus a test patient’s sample with unknown DNA sequence. DNA from the normal individual can be labeled with a compound that will fluoresce GREEN, while the patient DNA can be labeled with a RED fluorescent tag. Both DNA samples are applied to the chip simultaneously (competitively co-hybridized) and allowed to anneal to the probes on the surface. For each probe position, the relative strength of the signal for the reference and patient samples is then measured. (D) A different representation of the fluorescent signal identifying the 2457 C→T mutation. Adapted from Hacia et al. (39) with permission.
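As a check on the probe tiling design described in the Figure 10 caption, the per-position probe counts can be multiplied out against the length of the BRCA1 exon 11 region cited in the text (approximately 3450 base pairs, reference 39). The sketch below is only bookkeeping. The dictionary keys and function names are hypothetical, and the counts are taken directly from the caption.

```python
# Probe bookkeeping for the tiling design in the Figure 10 caption.
# Counts per strand come from the caption; names are illustrative only.

PROBES_PER_STRAND = {
    "substitution": 4,  # T, G, C, or A at the central position
    "insertion": 4,     # single base insertions
    "wild_type": 1,     # normal (wt) sequence
    "deletion": 5,      # one to five base deletions
}


def probes_per_position(strands=2):
    """Number of probes interrogating one nucleotide position."""
    return strands * sum(PROBES_PER_STRAND.values())


def total_probes(region_length_bp, strands=2):
    """Probes needed to tile every position of a region on the given strands."""
    return region_length_bp * probes_per_position(strands)


if __name__ == "__main__":
    print(probes_per_position())  # probes per position, both strands
    print(total_probes(3450))     # probes for a 3450 bp region
```

Fourteen probes per strand give 28 per position, and 28 × 3450 = 96,600, which matches the "more than 96,600" probes reported for exon 11. Probes covering the immediate flanking regions would presumably account for the remainder.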
New prospective studies should obtain informed consent in a manner that will foster collaborative research among investigators and facilitate future genetic studies using a layered approach that allows participants to consent separately to: 1) participation in current genetic studies involving broadly defined disease research; 2) sharing of specimens with collaborating investigators; and 3) long-term storage of biological materials for potential utilization in future studies (47). Although the need for informed consent for retrospective access to previously collected samples that are anonymous (lack identifiers) or are anonymized (identifiers permanently removed) is the subject of considerable debate, utilization of such specimens may be limited to ensure adequate human subjects protection.

Privacy and Confidentiality

There is growing public concern regarding the confidentiality of personal medical and genetic information that may be generated as new genetic tests are developed and become routine in clinical practice. Although ordinary clinical results such as measurements of blood pressure and cholesterol levels are typically provided to patients or to their personal physician with written permission, genetic findings are usually not incorporated into the medical records of participants. Privacy and strict confidentiality of the genetic data are necessary to: 1) prevent insurance and/or employment discrimination against asymptomatic individuals; 2) minimize social stigmatization; and 3) accommodate the needs of those likely to develop disease later in life (48). Personal genetic information is not normally provided to patients, but may be medically beneficial to both patients and their families under certain circumstances. For example, knowledge of a genetic predisposition to disease may suggest the need or increase the urgency for preventive treatment or encourage beneficial lifestyle changes such as a healthier diet, increased physical activity, and discontinued use of tobacco and alcohol. However, such information may lead to ethical dilemmas and difficult personal or reproductive decisions. Therefore, personal genetic information usually should be presented by a qualified professional who can educate the patient as to the benefits and limitations of molecular diagnostic tests and provide counsel regarding therapeutic options or personal issues.

Sharing Genetic Data among Qualified Investigators

Epidemiology and human genetics traditionally have been independent disciplines with minimal interaction between them. The exciting potential afforded by integrating epidemiologic and genetic expertise calls for cooperation between epidemiologists involved in the collection of biosamples and measurement of risk factor phenotypes and geneticists with the technical capabilities to identify and measure variation in candidate genes. Present barriers to collaborative research in genetic epidemiology include: 1) insufficient genetic material and/or funds for long-term storage and distribution; 2) limited informed consent in previous cohort studies; 3) difficulties in initiating studies relating genetic variants to disease; 4) limited exchange of information and resources due to concerns regarding scientific credit and recognition of intellectual contributions; and 5) an insufficient number of individuals with both genetic and epidemiologic expertise. Approaches encouraging collaborations between disciplines are the subject of extensive discussion, including several recently summarized in a research report from the National Heart, Lung, and Blood Institute (49).

SUMMARY

This paper provided an overview of opportunities made available by joining genetic and epidemiologic research, introduced basic terminology and genetic concepts, described laboratory techniques and their applications to genetic epidemiology, and outlined several ethical, legal, and social issues encountered in genetic studies. In the next paper of this series, we will explore problems in defining complex disease phenotypes, examine various study designs for genetic epidemiologic research including population sampling strategies and approaches to localizing disease susceptibility genes, and discuss the utility of traditional epidemiologic methods in genetic studies of complex diseases.


REFERENCES

1. Sing CF, Haviland MB, Reilly SL. Genetic architecture of common multifactorial diseases. In: Chadwick D, Cardew G, eds. Variation in the Human Genome. Chichester, England: John Wiley and Sons; 1996:211–232.
2. Humphries SE. DNA polymorphisms of the apolipoprotein genes—their use in the investigation of the genetic component of hyperlipidaemia and atherosclerosis. Atherosclerosis. 1988;72:89–108.
3. Jeunemaître X, Soubrier F, Kotelevtsev YV, Lifton RP, Williams CS, Charru A, et al. Molecular basis of human hypertension: Role of angiotensinogen. Cell. 1992;71:169–180.
4. Green ED, Cox DR, Myers RM. The Human Genome Project and its impact on the study of human disease. In: Scriver CR, Beaudet AL, Sly WS, Valle D, eds. The Metabolic and Molecular Bases of Inherited Disease. v. 1. 7th ed. New York: McGraw-Hill; 1995:401–436.
5. Francke U. Digitized and differentially shaded human chromosome idiograms for genomic applications. Cytogenet Cell Genet. 1994;65:206–218.
6. Verkerk AJMH, Pieretti M, Sutcliffe JS, Fu Y-H, Kuhl DPA, Pizzuti A, et al. Identification of a gene (FMR-1) containing a CGG repeat coincident with a breakpoint cluster region exhibiting length variation in fragile X syndrome. Cell. 1991;65:905–914.
7. Royer-Pokora B, Kunkel LM, Monaco AP, Goff SC, Newburger PE, Baehner RL, et al. Cloning the gene for an inherited human disorder—chronic granulomatous disease—on the basis of its chromosomal location. Nature. 1986;322:32–38.
8. Schrock E, du Manoir S, Veldman T, Schoell B, Wienberg J, Ferguson-Smith MA, et al. Multicolor spectral karyotyping of human chromosomes. Science. 1996;273:494–497.
9. Kluwe L, MacCollin M, Tatagiba M, Thomas S, Hazim W, Haase W, et al. Phenotypic variability associated with 14 splice-site mutations in the NF2 gene. Am J Med Genet. 1998;77:228–233.
10. Ingram VM. A specific chemical difference between the globins of normal human and sickle-cell anæmia hæmoglobin. Nature. 1956;178:792–794.
11. Weatherall DJ. The thalassemias. In: Stamatoyannopoulos G, Nienhuis AW, Majerus PW, Varmus H, eds. The Molecular Basis of Blood Diseases. 2nd ed. Philadelphia: W. B. Saunders; 1994:157–205.
12. Online Mendelian Inheritance in Man, OMIM™. Baltimore: Johns Hopkins University. http://www.ncbi.nlm.nih.gov (MIM Number 602421). Last edited 28 September 1998.
13. Kerem B, Rommens JM, Buchanan JA, Markiewicz D, Cox TK, Chakravarti A, et al. Identification of the cystic fibrosis gene: Genetic analysis. Science. 1989;245:1073–1080.
14. Online Mendelian Inheritance in Man, OMIM™. Baltimore: Johns Hopkins University. http://www.ncbi.nlm.nih.gov (MIM Number 143890). Last edited 1 October 1998.
15. Zannis VI, Breslow JL. Human very low density lipoprotein apolipoprotein E isoprotein polymorphism is explained by genetic variation and posttranslational modification. Biochemistry. 1981;20:1033–1041.
16. Boerwinkle E, Hixson JE. Genes and normal lipid variation. Curr Opin Lipidol. 1990;1:151–159.
17. Boerwinkle E, Utermann G. Simultaneous effects of the apolipoprotein E polymorphism on apolipoprotein E, apolipoprotein B, and cholesterol metabolism. Am J Hum Genet. 1988;42:104–112.
18. Wilson PWF, Schaefer EJ, Larson MG, Ordovas JM. Apolipoprotein E alleles and risk of coronary disease. Arterioscler Thromb Vasc Biol. 1996;16:1250–1255.
19. Davignon J. Apolipoprotein E polymorphism and atherosclerosis. In: Born GVR, Schwartz CJ, eds. New Horizons in Coronary Heart Disease. London, UK: Science Press Ltd; 1993:5.1–5.21.
20. Olson MV. The Human Genome Project. Proc Natl Acad Sci USA. 1993;90:4338–4344.
21. Landegren U, ed. Laboratory Protocols for Mutation Detection. Oxford, UK: Oxford University Press; 1996.
22. Orita M, Iwahana H, Kanazawa H, Hayashi K, Sekiya T. Detection of polymorphisms of human DNA by gel electrophoresis as single-strand conformation polymorphisms. Proc Natl Acad Sci USA. 1989;86:2766–2770.
23. Grompe M. The rapid detection of unknown mutations in nucleic acids. Nat Genet. 1993;5:111–117.
24. Olson MV. A time to sequence. Science. 1995;270:394–396.
25. Boushey CJ, Beresford SAA, Omenn GS, Motulsky AG. A quantitative assessment of plasma homocysteine as a risk factor for vascular disease. Probable benefits of increasing folic acid intakes. JAMA. 1995;274:1049–1057.
26. Frosst P, Blom HJ, Milos R, Goyette P, Sheppard CA, Matthews RG, et al. A candidate genetic risk factor for vascular disease: A common mutation in methylenetetrahydrofolate reductase. Nat Genet. 1995;10:111–113.
27. Strittmatter WJ, Roses AD. Apolipoprotein E and Alzheimer disease. Proc Natl Acad Sci USA. 1995;92:4725–4727.
28. de Andrade M, Thandi I, Brown S, Gotto A Jr, Patsch W, Boerwinkle E. Relationship of the apolipoprotein E polymorphism with carotid artery atherosclerosis. Am J Hum Genet. 1995;56:1379–1390.
29. Nieminen MS, Mattila KJ, Aalto-Setälä K, Kuusi T, Kontula K, Kauppinen-Mäkelin R, et al. Lipoproteins and their genetic variation in subjects with and without angiographically verified coronary artery disease. Arterioscler Thromb. 1992;12:58–69.
30. Hixson JE, Vernier DT. Restriction isotyping of human apolipoprotein E by gene amplification and cleavage with Hha I. J Lipid Res. 1990;31:545–548.
31. Landegren U, Kaiser R, Sanders J, Hood L. A ligase-mediated gene detection technique. Science. 1988;241:1077–1080.
32. Tobe VO, Taylor SL, Nickerson DA. Single-well genotyping of diallelic sequence variations by a two-color ELISA-based oligonucleotide ligation assay. Nucleic Acids Res. 1996;24:3728–3732.
33. Chakraborty R, Fornage M, Gueguen R, Boerwinkle E. Population genetics of hypervariable loci: Analysis of PCR based VNTR polymorphism within a population. In: Burke T, Dolf G, Jeffreys AJ, Wolff R, eds. DNA Fingerprinting: Approaches and Applications. Basel, Switzerland: Birkhauser Verlag; 1991:127–143.
34. Chakraborty R, Kidd KK. The utility of DNA typing in forensic work. Science. 1991;254:1735–1739.
35. Dib C, Fauré S, Fizames C, Samson D, Drouot N, Vignal A, et al. A comprehensive genetic map of the human genome based on 5,264 microsatellites. Nature. 1996;380:152–154.
36. Richards RI, Sutherland GR. Simple repeat DNA is not replicated simply. Nat Genet. 1994;6:114–116.
37. Online Mendelian Inheritance in Man, OMIM™. Baltimore: Johns Hopkins University. http://www.ncbi.nlm.nih.gov (MIM Number 143100). Last edited 27 October 1998.
38. Aaltonen LA, Peltomäki P, Leach FS, Sistonen P, Pylkkänen L, Mecklin J-P, et al. Clues to the pathogenesis of familial colorectal cancer. Science. 1993;260:812–816.
39. Hacia JG, Brody LC, Chee MS, Fodor SPA, Collins FS. Detection of heterozygous mutations in BRCA1 using high density oligonucleotide arrays and two-colour fluorescence analysis. Nat Genet. 1996;14:441–447.
40. Wang DG, Fan J-B, Siao C-J, Berno A, Young P, Sapolsky R, et al. Large-scale identification, mapping, and genotyping of single-nucleotide polymorphisms in the human genome. Science. 1998;280:1077–1082.
41. Steinberg KK, Sanderlin KC, Ou C-Y, Hannon WH, McQuillan GM, Sampson EJ. DNA banking in epidemiologic studies. Epidemiol Rev. 1997;19:156–162.
42. Miller SA, Dykes DD, Polesky HF. A simple salting out procedure for extracting DNA from human nucleated cells. Nucleic Acids Res. 1988;16:1215.
43. Sambrook J, Fritsch EF, Maniatis T. Molecular Cloning: A Laboratory Manual. 2nd ed. Cold Spring Harbor, NY: Cold Spring Harbor Laboratory Press; 1989.
44. Gilbert J. Establishment of permanent cell lines by Epstein-Barr virus transformation. In: Dracopoli NC, Haines JL, Korf BR, Moir DT, Morton CC, Seidman CE, et al., eds. Current Protocols in Human Genetics. v. 2. New York: John Wiley and Sons; 1994:A.3H.1–A.3H.5.
45. Clayton EW, Steinberg KK, Khoury MJ, Thomson E, Andrews L, Kahn MJE, et al. Informed consent for genetic research on stored tissue samples. JAMA. 1995;274:1786–1792.
46. The American Society of Human Genetics. ASHG report: Statement on informed consent for genetic research. Am J Hum Genet. 1996;59:471–474.
47. Knoppers BM, Laberge CM. Research and stored tissues: Persons as sources, samples as persons? JAMA. 1995;274:1806–1807.
48. Durfy SJ. Ethics and the human genome project. Arch Pathol Lab Med. 1993;117:466–469.
49. National Heart, Lung, and Blood Institute. Opportunities and Obstacles to Genetic Research in NHLBI Clinical Studies. Bethesda, MD: National Institutes of Health; 1997.
50. McGuire RE, Daiger SP, Green ED. Localization and characterization of the human ADP-ribosylation factor 5 (ARF5) gene. Genomics. 1997;41:481–484.
51. Benson DA, Boguski MS, Lipman DJ, Ostell J, Ouellette BF. GenBank. http://www.ncbi.nlm.nih.gov (Accession number M10151). Nucleic Acids Res. 1998;26:1–7.
52. Suzuki DT, Griffiths AJF, Miller JH, Lewontin RC. An Introduction to Genetic Analysis. 3rd ed. New York: W.H. Freeman; 1986.
53. Jorde LB, Carey JC, White RL. Medical Genetics (accompanying slide set). St. Louis, MO: Mosby-Year Book; 1995.

GLOSSARY

Allele specific oligonucleotides (ASOs): short synthetic segments of DNA specific to each form (allele) of a given DNA sequence
Alleles: alternate forms of a gene or genetic locus that differ in DNA sequence
Annealing: pairing of complementary base pairs in a DNA or RNA molecule (adenine with thymine and cytosine with guanine)
Anonymized samples: previously collected biosamples with identifiers permanently removed
Anonymous samples: previously collected biosamples with no identifying information available
Autosomes: the 22 pairs of human chromosomes excluding the sex chromosomes (X and Y)
Biallelic marker: genetic marker consisting of two forms of a DNA sequence which usually differ by a single nucleotide (often created by a nucleotide substitution)
Candidate genes: genes believed to influence expression of complex phenotypes due to known biological and/or physiological properties of their products
CCAAT motif: a conserved gene sequence, like the TATA box, believed to function in gene expression by binding components such as RNA polymerase and transcription factors necessary for transcription
Codon: a unit of three nucleotides in the mRNA which specifies the amino acid to be incorporated at a specific position during protein synthesis
Crossing-over: process by which homologous chromosomes physically exchange segments of DNA (also known as recombination)
Cryopreservation: technique of freezing viable cells at ultra-low temperatures (less than −80°C) in the presence of a cryoprotectant that does not kill the cells and permits them to be thawed and cultured at a later time
Cytogenetic map: a banded karyotype produced using sophisticated staining techniques on human chromosomes used in the study of genetic diseases and gene mapping
Deletion: mutation in DNA (or RNA) involving removal of one or more nucleotides; large deletions may be detectable by banded (stained) karyotypes
Diploid: cells containing two copies of each autosome (and two sex chromosomes, XX or XY), one maternal and one paternal in origin
DNA chip: a high density array of oligonucleotide probes generated on a silicon surface using procedures similar to that used in manufacturing computer chips; used to rapidly produce sequence and/or genotype information in population screening for deleterious mutations that cause genetic disease
Exon: a DNA sequence that usually specifies the sequence of amino acids in translation
Frameshift mutation: insertion or deletion of nucleotides in the DNA causing a shift in the reading frame resulting in a completely different sequence of amino acids in the protein
Functional mutations: variants in disease susceptibility genes responsible for inter-individual differences in disease risk
Gene-environment interactions: environmental factors (generally but not necessarily modifiable) interacting with and modifying the impact of potentially deleterious genes
Haploid: cells containing only one copy of each autosome and a single sex chromosome
Homologous chromosomes (or homologs): each member of a pair of autosomes
Immortalized cell lines: cell lines (usually lymphocytes) that can be maintained indefinitely (also known as transformed cell lines) because they have been immortalized by Epstein-Barr virus (EBV) transformation
Initiation codon: the codon (AUG in mRNA) that specifies the first amino acid (methionine) in a protein
Insertion: mutation in DNA (or RNA) involving addition of one or more nucleotides
Intron: an intervening DNA sequence removed from mRNA after transcription and thus does not encode protein in translation
Karyotype: photograph of human metaphase chromosomes taken under a microscope
Meiosis: process of germ cell division which produces haploid gametes with one-half the number of chromosomes present in the parental cell
Microsatellites: two to four base pair sequences such as CA or GATA that are tandemly repeated a variable number of times (also known as short tandem repeats or STRs); a type of VNTR
Minisatellites: in contrast to microsatellites, contain longer repeat units of ten to several hundred base pairs
Missense mutation: mutation in DNA (or RNA) resulting in replacement of one amino acid by another
Mitosis: somatic cell division producing daughter cells with the same number of chromosomes as the parental cell
Mutations: occasional errors that occur during DNA replication
Nonsense mutation: mutation in DNA (or RNA) producing a stop codon which terminates translation prematurely
Nonsynonymous substitution: mutation in the protein coding regions of DNA that may affect the structure and function of the protein product
Oligonucleotide ligation assay (OLA): technique of using short synthetic segments of DNA specific to each form of the DNA sequence (allele specific oligonucleotides or ASOs) in parallel reactions to characterize a polymorphism in a large number of study subjects
Point mutation: a mutation involving a single nucleotide
Polyadenylation signal: nucleotide sequence that functions in processing a mRNA transcript after transcription
Polymerase chain reaction (PCR): method for amplifying (making millions of copies of) a specific DNA region of interest from minute amounts of native DNA
Polymorphism: the existence of multiple forms of a gene or genetic locus (alleles) that differ in DNA sequence
Promoter: region of DNA to which an RNA polymerase binds and initiates transcription; the promoter regulates gene expression by controlling the amount of mRNA transcribed
Rearrangement: chromosomal abnormality in which a segment of a chromosome changes position (translocation) or is “flipped” 180° (inversion); may be detected by banded (stained) karyotypes
Recombination: process by which homologous chromosomes physically exchange segments of DNA (also known as crossing-over)
Restriction enzyme: enzyme that recognizes a short, specific DNA sequence (restriction site) and cuts the DNA at that sequence
Restriction fragment length polymorphism (RFLP): genetic marker based on DNA fragments generated by a restriction enzyme that differ in length due to the presence or absence of a specific sequence (restriction site)
Sex chromosomes: the pair of chromosomes that are dissimilar in morphology in human males (X and Y); gender is determined by presence (male) or absence (female) of Y
Short tandem repeat or STR: two to four base pair sequences such as CA or GATA that are tandemly repeated a variable number of times (also known as microsatellites); a type of VNTR
Silent or synonymous substitution: mutation within the protein coding regions of DNA (exons) that does not result in an amino acid substitution
Single nucleotide polymorphism (SNP): point mutation where a single base substitution has created two forms of a DNA sequence that differ by a single nucleotide; currently of great interest for locating genes associated with complex diseases
Single-stranded conformation polymorphism (SSCP): technique for identifying DNA sequence variation by detecting differences in characteristic secondary structures or conformations formed when single-stranded DNA is allowed to self-anneal
Spectral karyotyping: a cytogenetic technique which discerns all unique human chromosomes in different colors. Such “chromosome painting” uses chromosome-specific probes labeled with different combinations of fluorochromes (fluorescent dyes) and sophisticated spectral imaging
Splice-site mutation: mutation at the junction between an intron and exon that usually inhibits excision (splicing) of the intron
Substitution: incorporation of an incorrect nucleotide during DNA replication or an incorrect amino acid during protein synthesis
Synapsis: pairing of homologous chromosomes during meiosis
TATA box: a conserved gene sequence, like the CCAAT motif, believed to function in gene expression by binding components such as RNA polymerase and transcription factors necessary for transcription
Termination (stop) codon: one of three codons (UAA, UAG, UGA) which terminates elongation of an amino acid chain during translation
Transition: common form of nucleotide substitution resulting in replacement of one purine for another or one pyrimidine for another (A↔G or C↔T)
Transcription: generation of a messenger RNA (mRNA) molecule from DNA in the nucleus
Translation: protein synthesis; generation of an amino acid chain (polypeptide) from a mRNA molecule in the cytoplasm
Transversion: less common form of nucleotide substitution resulting in replacement of a purine by a pyrimidine or vice versa
Variable number of tandem repeat locus (VNTR): a highly polymorphic region of DNA reflecting differences in the number of tandemly repeated sequences
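
Several of the glossary definitions (transition, transversion, silent, missense, and nonsense mutation) amount to simple classification rules, and the sketch below applies them to single-base changes. It is illustrative only. The function names are hypothetical, codons are written in DNA coding-strand notation (T in place of the U used in the glossary's mRNA codons), and the standard genetic code is assumed.

```python
# Classifying single-base substitutions using the glossary definitions.
# Assumes the standard genetic code; codons use T rather than U.

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")  # "*" marks a stop
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_substitution(ref, alt):
    """Return 'transition' or 'transversion' for a single-base change."""
    ref, alt = ref.upper(), alt.upper()
    if ref == alt or not {ref, alt} <= PURINES | PYRIMIDINES:
        raise ValueError("expected two different bases from A, C, G, T")
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def codon_effect(ref_codon, alt_codon):
    """Return 'silent', 'missense', or 'nonsense' for a codon change.

    The reference codon is assumed to encode an amino acid (not a stop).
    """
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == "*":
        raise ValueError("reference codon is a stop codon")
    if alt_aa == ref_aa:
        return "silent"
    return "nonsense" if alt_aa == "*" else "missense"


if __name__ == "__main__":
    print(classify_substitution("C", "T"))  # C-to-T, as in Figure 10
    print(codon_effect("GAG", "GTG"))       # Glu to Val
    print(codon_effect("TAC", "TAA"))       # Tyr to stop
    print(codon_effect("GAA", "GAG"))       # Glu to Glu
```

Under these definitions a C→T change such as the BRCA1 mutation in Figure 10 is a transition, and whether it is silent, missense, or nonsense depends on the codon in which it falls.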