McArdle Laboratory for Cancer Research, University of Wisconsin Medical School, Madison, Wisconsin 53706
1 To whom requests for reprints should be addressed at UC Davis Genome Center, Genome and Biomedical Sciences Facility, 451 East Health Sciences Drive, Davis, CA 95616. E-mail: pjfarnham{at}ucdavis.edu
| Abstract |
|---|
Key Words: polycomb-group proteins; chromatin immunoprecipitation; ChIP-chip; gene expression; bioinformatics
| Introduction |
|---|
Regardless of the mechanism by which it is achieved, altering the activity of a transcription factor can lead to major changes in the gene expression patterns of the abnormal cell when compared with its normal counterpart. Of course, many of the changes observed are due to a domino-like effect in which the altered expression of gene A results in the altered expression of gene B, which results in the altered expression of gene C, and so forth. An understanding of the biological consequences of altering a transcriptional activator or repressor in a specific cancer requires the cataloging of all such changes in gene expression. However, an understanding of the molecular mechanisms by which the changes are mediated requires that genes directly regulated by a factor be distinguished from those in the line of dominoes. In this review, we will use the Polycomb proteins as an example of transcription factors that display altered levels of expression in human cancers, describe how microarrays can be used to develop gene expression profiles and identify direct target genes, summarize the experiments performed to date by using these arrays to study Polycomb group (PcG) proteins, and then conclude with suggestions for future experimental approaches.
| PcG Proteins |
|---|
The polycomb (pc) gene, discovered in Drosophila melanogaster by P.H. Lewis in 1947 (7), was the first gene shown to be involved in the control of hox gene expression. The gene was given the name polycomb to describe one of the phenotypes observed in male flies carrying mutations of that gene. Normally, male flies have a set of bristles on their first pair of legs, known as the sex comb, which assists them during mating. Mutations in pc caused the development of multiple sex combs (hence the term polycomb) on all pairs of legs in adult male flies. This prominent phenotype of pc mutant flies (i.e., having all legs resemble front legs) is an example of a homeotic transformation that is caused by de-repression of the hox gene clusters Bithorax-complex and Antennapedia-complex in the anterior portion of mutant embryos (6). Since the discovery of the pc gene, 15 other genes have been identified in Drosophila that display similar homeotic transformation phenotypes when mutated, indicating their involvement in the regulation of hox expression (8–21). The discovery of the Drosophila PcG proteins prompted the identification of mammalian homologues. Unlike many families of transcription factors, the PcG proteins are not grouped on the basis of common domains in their protein structure. Rather, they are classified in the same group because they were all discovered on the basis of their ability to repress transcription of hox genes. A comprehensive list of mouse and human PcG homologues is shown in Table 1.
The discovery that PcGs affect cell proliferation suggests that deregulation of PcG expression might play an important role in tumorigenesis. Accordingly, deregulation of PcG proteins has been observed in several types of cancer. For example, EZH2 upregulation has been significantly correlated with the metastatic progression of prostate and breast cancers (35, 36). Another PcG protein, suppressor of zeste 12 (SUZ12), is often upregulated in tumors of the colon, breast, and liver (37). Additionally, the SUZ12 gene is frequently translocated in endometrial stromal sarcomas and, as a result, is fused to a gene encoding a zinc finger protein (38). Finally, the PcG protein Bmi1 is frequently overexpressed in human medulloblastoma cell lines and primary tumors (32). These and many other studies clearly implicate PcG protein misregulation in cancer.
A growing body of evidence suggests that PcG proteins are important regulators of cell proliferation and development. However, a major limitation in our understanding of how they control these processes is the lack of known mammalian PcG target genes. The different phenotypes observed in PcG null mice and the implication of PcG proteins in various cellular processes suggest that these proteins regulate a broad spectrum of target genes. For example, the observed skeletal transformations in PcG null mice can be explained by the misregulation of developmental control genes such as the Hox family (22). However, the involvement of PcG proteins in the development of various human cancers might be due to altered expression of genes that control processes such as cell proliferation, differentiation, and apoptosis.
PcG-Mediated Transcriptional Regulation.
Although the target genes responsible for PcG-mediated regulation of mammalian development and proliferation have not yet been identified, recent studies have led to the development of a model by which PcG proteins may regulate transcription. Early genetic studies of Drosophila predicted that PcG proteins exert their functions by forming multimeric complexes. For example, double and triple PcG mutant flies exhibit enhanced homeotic transformations when compared with single PcG mutant flies, suggesting functional interactions among the various PcG proteins (11). Recent biochemical studies have defined the composition of two such complexes that are present in both Drosophila and mammalian cells (Figure 1). The first complex identified was named Polycomb Repressive Complex 1 (PRC1). The human complex includes the PcG proteins human polycomb homolog, human polyhomeotic homolog, BMI1, ring finger protein 1, and sex combs on midleg human homolog 1 (39–41). The second complex was more recently defined by four independent groups. This complex is referred to as Polycomb Repressive Complex 2 (PRC2, or EED-EZH2), and the human complex consists of five core subunits: the three PcG proteins EZH2, SUZ12, and EED, as well as the histone binding factors retinoblastoma associated proteins p46 and p48 (RbAp46 and RbAp48) (42–45). In addition to the identified 600-kDa PRC2 complex, Tie et al. (46) isolated a variant of the PRC2 in Drosophila that was about 1 megadalton. This 1-megadalton complex contained the previously identified core subunits plus the PcG protein polycomblike and the histone deacetylase Rpd3 (46). The identification of these two PRC2s raises the question whether the 600-kDa PRC2 is a stable intermediate of the larger 1-megadalton complex or whether the two complexes form independently and have distinct biological functions.
It has recently been shown that one of the components of the PRC2, EED, exists in four different isoforms in human cells (53). Accordingly, Kuzmichev and colleagues (54) have demonstrated the existence of three PRC2-like complexes in human cells that each contain different isoforms of EED. All three complexes, which are now called PRC2, -3, and -4, contain the core subunits EZH2, SUZ12, RbAp46, and RbAp48. In addition to the core subunits, PRC2 contains the longest form of EED (EED1), PRC3 contains the two shortest forms of EED (EED3 and EED4), and PRC4 contains the intermediate form of EED (EED2) plus the histone deacetylase sirtuin 1 (footnote 2). Intriguingly, the presence of the different EED isoforms results in different catalytic specificity of the HKMT-EZH2 in vitro. For example, PRC2 can methylate both H3-K27 and lysine 26 of histone H1 (H1-K26) on nucleosomal arrays. In contrast, PRC3 can methylate only H3-K27, whereas PRC4 methylates H1-K26 (Figure 1).
Identifying PRC Target Genes by a Candidate-Gene Approach.
It is not known whether the complexes described above have redundant functions or whether they play different roles in distinct cellular processes. Distinguishing between these two possibilities requires the identification of target genes for each of the three complexes. It would seem that a simple approach would be to examine candidate target genes based on the presence of a consensus PcG element in a promoter region. However, none of the proteins purified in the different mammalian PRCs have been shown to be site-specific DNA binding proteins; therefore, simple sequence inspection cannot suffice to identify PcG target genes. How the mammalian PRCs are targeted to DNA remains unknown. In contrast, several DNA binding proteins have been implicated in targeting PcGs to the DNA in Drosophila. The identification of the hox genes as PcG target genes in Drosophila led to the discovery of cis-regulatory elements in the fly genome that are required for PcG-mediated repression. Genetic studies in combination with reporter assays defined the minimal DNA elements, termed PcG response elements (PREs), that regulate hox gene expression and mediate repression of reporter genes in Drosophila (55–57). Sequence alignments among the various PREs that control different hox genes reveal little similarity, and thus a strict consensus sequence has not been defined. However, one similarity among the different PREs is the presence of binding sites for three site-specific DNA binding proteins: the PcG protein pleiohomeotic (pho), GAGA, and zeste. In fact, Ringrose et al. (58) used the finding that these three sites occur frequently in PREs to develop a bioinformatic approach to identify other PREs in the Drosophila genome. Using this approach, the authors discovered 167 candidate PREs, many of which map close to genes that are involved in development and cell proliferation.
Unfortunately, although it is clear that Drosophila PcG complexes use the pho, GAGA, or zeste sites in the PREs (59–63), none of these DNA binding factors or their mammalian orthologs copurify with the PRCs. The mammalian homologue of pho, known as YY1, was shown to physically interact with the WD-40 domains of EED in a yeast two-hybrid analysis. However, whether this factor facilitates recruitment of the PRCs to DNA in vivo (64) remains unclear.
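The idea of searching for regions in which pho, GAGA, and zeste sites co-occur can be illustrated with a minimal sliding-window sketch. It is not the method of Ringrose et al. (58): the motif patterns, window size, and step below are placeholder assumptions chosen for illustration, only the forward strand is scanned, and the function name `candidate_windows` is hypothetical.

```python
import re

# Placeholder motif patterns (assumptions for illustration only; they are
# not the binding-site models used in the published analysis).
MOTIFS = {
    "pho": re.compile(r"GCCAT"),
    "GAGA": re.compile(r"GAGAG"),
    "zeste": re.compile(r"[CT]GAG[CT]G"),
}

def candidate_windows(seq, window=500, step=100, min_hits=1):
    """Return (start, end) windows in which every motif occurs at least
    min_hits times on the forward strand."""
    seq = seq.upper()
    found = []
    last_start = max(len(seq) - window, 0)
    for start in range(0, last_start + 1, step):
        chunk = seq[start:start + window]
        if all(len(p.findall(chunk)) >= min_hits for p in MOTIFS.values()):
            found.append((start, min(start + window, len(seq))))
    return found

if __name__ == "__main__":
    toy = "A" * 100 + "GCCAT" + "A" * 20 + "GAGAG" + "A" * 20 + "CGAGTG" + "A" * 400
    print(candidate_windows(toy))
```

A sketch like this only flags co-occurrence; as the text notes, no equivalent set of motifs is known for mammalian PRCs, so it cannot be transferred to mammalian genomes as written.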
Because the cis element corresponding to a mammalian PRE has not been identified, bioinformatic approaches have not been used to identify potential mammalian PRC target genes. In an attempt to identify mammalian PRC target genes, Jacobs and colleagues (65) demonstrated that Bmi1, a component of PRC1, cooperates with the oncogene c-Myc to repress the activity of the p16INK4A and p19ARF tumor suppressor genes, leading to transformation of lymphoid cells. Another suggestion that the p16INK4A/p19ARF locus may be a mammalian PcG target came from a study where overexpression of the protein chromobox homolog 7 led to the downregulation of both the p16INK4A and the p19ARF gene, resulting in a longer lifespan than in normal cells (66). Other cell-cycle regulatory genes appear to be controlled by the modulation of the PRC components. Bracken et al. (34) proposed that p53, cyclin D1, cyclin E1, cyclin A2, and cyclin B1 are potential PRC target genes because transformation of normal fibroblasts by depletion of EZH2 or EED correlates with altered expression of these genes. Additionally, Varambally and colleagues (36) used gene expression analysis to identify a large number of genes whose expression is reduced upon overexpression of EZH2 in prostate cancer cells. However, none of the above studies show recruitment of the PRCs to the promoters of the affected target genes. Therefore, it remains unclear whether the PcG complexes directly regulate these target genes or whether the target genes' expression is changed indirectly as a result of an altered cellular milieu. As described below, a recently developed high-throughput, microarray-based genomic approach can address this issue.
| Array-Based Approaches for the Identification of Target Genes |
|---|
Two different types of arrays can be used for the study of gene expression changes mediated by a transcription factor. The first type involves physically depositing (spotting) cDNAs or PCR fragments that were derived from mRNAs onto microscope slides. Such arrays were first produced in 1995 to study gene expression in Arabidopsis thaliana (67). Mammalian arrays containing cDNAs corresponding to about 1000 human genes were used in 1996 (68, 69), and 8600 human genes could be analyzed with arrays by 1999 (70). One of the first studies to analyze mRNAs from cells specifically lacking a transcription factor was the study of ATF-2 null mice (71). Innumerable studies using over- or underexpression of a particular transcription factor have been performed since then. Although a major step forward in gene expression analyses, the spotted arrays have several disadvantages. For example, the PCR fragments must be prepared, purified, quantitated, carefully catalogued, and stored. Each of these steps is expensive and subject to technical difficulties.
The second type of array used to analyze gene expression is composed of oligonucleotides that are synthesized directly on the solid phase surface based upon the sequence of known mRNAs (72). Because the oligonucleotides (which are commonly 20–25 nts in length but have been synthesized up to 60 nts) are synthesized directly on the array, many of the disadvantages associated with spotted arrays are eliminated. Initial arrays contained 65,000 probes that represented about 100 mammalian genes, but expanded sets of 4 arrays representing 6500 genes were soon created. An early example of the use of high density arrays to identify transcription factor target genes was the analysis of genes whose expression is altered after inducible expression of Wilms tumor 1 (73). Currently, commercially available arrays are used to examine gene expression changes for thousands of human and mouse genes. One commercially available array represents over 47,000 human transcripts corresponding to at least 14,500 well-characterized genes; the same company also produces a mouse array that represents about 39,000 transcripts corresponding to at least 14,000 genes (www.affymetrix.com).
Whether spotted or oligonucleotide arrays are used, it is necessary to collect mRNA from two samples that differ in the abundance of a factor of interest. One of the most common means investigators use to modulate a factor is through the introduction of a plasmid expressing a protein into a cell line and then the preparation of mRNA from the transfected cells, with mRNA preparation from the parental cell line serving as a control. This has proven to be a popular approach because it is technically easier to overexpress a protein than it is to remove a factor from a population of cells. However, overexpression has a distinct disadvantage when studying multisubunit complexes such as the PRCs. For example, if all components of a complex must be present in equal ratios, then overexpressing one component may have very little effect on expression of the target genes. A better approach is to reduce (or eliminate) one component of the complex, which would presumably lead to complex dissolution, and then to search for genes whose expression is increased or decreased (depending on whether the complex primarily activates or represses transcription). Traditionally, this has been performed by using mouse embryo fibroblasts from a knockout animal. Of course, loss of a transcription factor can be lethal, and in the past it has been difficult to study such factors. However, with the advent of RNAi technology, transient knockdowns of a factor in tissue culture cells can be achieved, but it is important to consider that this approach cannot overcome the problems associated with functional redundancy (i.e., multiple proteins, usually members of a family of transcription factors, may be able to regulate a common set of genes).
Once one prepares mRNA samples from the cells expressing normal versus altered levels of a transcription factor, the samples are labeled with fluorescent dyes and applied to microarrays. For some arrays, generally those consisting of spotted PCR fragments, two different dyes are used and the samples are applied to a single array. For oligonucleotide arrays, the samples are labeled with the same dye but applied to two separate arrays. In both cases, analysis programs are used to calculate a fold difference in expression levels of each analyzed mRNA in the two starting cell populations. Most mRNAs will not be changed by removal of the transcription factor and therefore will show fold differences close to 1. However, if studying normal versus knockout cells, levels of mRNAs whose expression is dependent upon the removed factor will decrease and levels of mRNAs whose expression is repressed by the removed factor will increase. Alternatively, if studying normal versus overexpressing cells, levels of mRNAs from promoters activated by the factor will increase and levels of mRNAs from promoters repressed by the factor will decrease.
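The fold-difference calculation described above can be sketched as follows. This is a minimal illustration, assuming already-normalized signal intensities per gene; the pseudocount, the twofold threshold, and the function names `log2_fold_changes` and `split_by_direction` are assumptions introduced here, not part of any particular analysis program.

```python
import math

def log2_fold_changes(control, altered, pseudocount=1.0):
    """log2(altered / control) for every gene measured in both samples.
    The pseudocount (an assumption) avoids division by zero."""
    return {
        gene: math.log2((altered[gene] + pseudocount) / (control[gene] + pseudocount))
        for gene in control
        if gene in altered
    }

def split_by_direction(log2fc, threshold=1.0):
    """Genes whose signal changed at least 2**threshold-fold, up or down."""
    up = sorted(g for g, v in log2fc.items() if v >= threshold)
    down = sorted(g for g, v in log2fc.items() if v <= -threshold)
    return up, down

if __name__ == "__main__":
    # Hypothetical intensities: normal cells versus cells lacking the factor.
    normal = {"geneA": 99, "geneB": 399, "geneC": 100}
    knockout = {"geneA": 399, "geneB": 99, "geneC": 100}
    up, down = split_by_direction(log2_fold_changes(normal, knockout))
    print("increased after removal (candidate repressed genes):", up)
    print("decreased after removal (candidate activated genes):", down)
```

In a knockout comparison, genes in `up` behave as if repressed by the factor and genes in `down` as if dependent on it; in an overexpression comparison the interpretation is reversed. A real analysis would also need replicates and a statistical test rather than a fixed cutoff.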
In such studies, one can often end up with long lists of deregulated genes. The reason that a large number of genes are identified in such experiments is that the observed changes in mRNA may be due to both direct and indirect effects of the removed factor. Clearly, removal of a factor can have major effects on multiple signaling pathways in a cell, with the deregulation of the direct target genes setting up cascades of effects on the expression of other genes. Investigators have tried to distinguish direct from indirect effects using, in the case of overexpression of a factor, approaches such as cycloheximide treatment or kinetic studies (74). However, it is possible to definitively prove that a gene is directly regulated by a factor only if one can demonstrate binding of that factor to a promoter or enhancer region of the gene in question. In vitro gel-shift studies have been used for such purposes; however, this type of in vitro experiment is no longer considered sufficient because multiple factors (e.g., different members of a family of transcription factors) can bind to the same sequence of DNA in vitro, especially when isolated from other cellular proteins. Therefore, binding analyses should take into consideration the cellular milieu and the chromatin environment. Such analyses can be achieved by using the ChIP assay to determine if a candidate gene is directly regulated by a factor. Briefly, this assay involves the treatment of cells or tissue with formaldehyde, a procedure that was developed by Solomon and Varshavsky (75), to cross-link the factor to its genomic binding site. Protein-DNA cross-linking is followed by IP with an antibody specific for the factor of interest and then analysis by PCR with primers specific for a particular promoter region. With this assay, the promoters of the genes identified on the expression array can be analyzed to determine if they are directly or indirectly regulated by the factor.
Unfortunately, follow-up ChIP analysis of each of the perhaps hundreds of genes identified on an mRNA expression array would be very laborious. Also, it is often unclear which region of the promoter to analyze for direct binding. Although some factors tend to bind near the transcription start site, other factors (e.g., PRCs in Drosophila) bind to regions located at a great distance from the proximal promoter region. A recent study that focused on identifying target genes of human PRCs has taken an approach that reduces both of these concerns. Kirmizis et al. (52) first used siRNA to SUZ12 (a common component of PRC2/3/4), coupled with expression arrays, to identify a set of genes regulated directly and indirectly by SUZ12. The authors then prepared a custom oligonucleotide array consisting of 5 kb of promoter sequence from each gene that displayed significantly different expression levels. Using a ChIP-chip approach (described in more detail in the next section), they identified within the overall set of deregulated genes a set of genes bound by SUZ12. Although this subset of the genes could be conclusively classified as direct targets, it remains possible that other genes identified by the mRNA arrays are also direct targets but with the binding site located outside of the tested 5 kb region or with the antibody prevented from binding its epitope during the IP because of an unusual nucleoprotein conformation in that particular transcriptional complex.
In summary, the advantage of starting with a gene expression array is that the eventual list of identified genes will consist of those genes whose expression is regulated by the transcription factor in that particular tissue or cell type. The disadvantage is that many indirect targets will be identified, and therefore each gene must be checked as a direct versus indirect target with either individual ChIP assays or customized oligonucleotide arrays in a ChIP-chip assay. Another disadvantage is that it is not possible to know where the binding site for the factor occurs, relative to the transcription start site. Therefore, many direct targets may be mistakenly classified as indirect targets if the genomic region containing the binding site is not included in the follow-up ChIP experiments. The potential for false negatives is a serious problem for the study of mammalian PRCs because the binding site that recruits the complexes is still unknown; it is not yet possible to identify PRC binding sequences in the adjacent regions of the regulated genes and then include the identified region in follow-up analyses.
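The bookkeeping behind this expression-first strategy amounts to comparing two gene lists. The sketch below is illustrative only: the gene names are hypothetical, and `partition_targets` is a name introduced here. Note that the second category is deliberately labeled as unresolved, because, as discussed above, a deregulated gene with no detected binding may still be a direct target whose binding site lies outside the tested region.

```python
def partition_targets(deregulated, bound):
    """Compare genes deregulated on the expression array with genes whose
    tested region was bound in ChIP or ChIP-chip."""
    deregulated, bound = set(deregulated), set(bound)
    return {
        # Expression changes and binding is detected: direct targets.
        "direct": deregulated & bound,
        # Expression changes but no binding detected in the tested region:
        # indirect targets OR false negatives (site outside the region).
        "indirect_or_unresolved": deregulated - bound,
        # Bound, but expression unchanged in this cell type.
        "bound_without_change": bound - deregulated,
    }

if __name__ == "__main__":
    expression_hits = ["gene1", "gene2", "gene3", "gene4"]  # hypothetical
    chip_hits = ["gene2", "gene4", "gene9"]                 # hypothetical
    for label, genes in partition_targets(expression_hits, chip_hits).items():
        print(label, sorted(genes))
```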
ChIP-Chip Followed by RNA Expression Analysis.
A second general approach for identifying target genes of a given transcription factor is to begin with a high throughput analysis of a large number of promoter regions or a large span of genomic DNA to identify binding sites for a transcription factor. Most of the studies using this analysis rely on the technique of ChIP. However, one caveat of ChIP is that it provides information only about the binding activity of a transcription factor and does not link binding to a functional effect. For this reason, this second approach requires follow-up studies that can determine if the identified binding sites are functionally important in the regulation of a nearby gene.
The application of ChIP to the analysis of site-specific transcription factors has provided a major advance in the study of mammalian gene regulation. Although this technique has been used in mammalian systems only in the past decade (76, 77), it has now become the accepted method of linking a specific factor to the regulation of a specific gene. The success in adaptation of this technology to mammalian cells has now led to the subsequent modification of the assay from the one-gene-at-a-time approach to a more global screening of thousands of promoters. Although the ChIP-chip approach (i.e., ChIP followed by microarray analysis) was first used to study yeast transcription factors (78–84), several groups have now applied this technology to the study of mammalian factors. Several different types of microarrays have been used for the mammalian studies. One type, which consists of spotted PCR fragments corresponding to promoter regions, has been used to identify target genes of E2F, c-Myc, and hepatocyte nuclear factor (HNF) family members. For these studies, specific promoters were selected and small (less than 1 kb) regions of these promoters were created by PCR. These fragments were then spotted onto microscope slides. These studies began a few years ago with a modest number of promoter regions. For example, PCR fragments spanning from −700 to +200 of 1444 human genes were used to determine that ~9% of the promoters were bound by E2F4 (85). However, it may not be correct to assume that 9% of all promoters are regulated by E2F4, because the promoters chosen for the array were selected on the basis of their regulation during the cell cycle (86), a process known to be controlled, in part, by E2F family members. More recently, arrays containing thousands of promoters have been created. For example, Li et al. (87) used an array containing PCR products spanning from −650 to +250 of 4839 human genes to identify c-Myc binding sites in human Daudi cells and found that 15% of the tested promoters were occupied by c-Myc. Odom et al. (88) used arrays containing 13,000 human promoters (spanning from 700 bp upstream to 200 bp downstream of the transcription start sites) to identify binding sites for HNF1α (a homeodomain protein), HNF4α (a nuclear receptor), and HNF6 (a member of the onecut family of transcription factors). The promoters chosen for analysis were those that are well characterized according to the National Center for Biotechnology Information annotation. The authors found that 1.6% and 0.8% of the promoters tested were bound by HNF1α in hepatocytes and pancreatic islets, respectively. Similarly, HNF6 bound to 1.7% and 1.4% of the promoters on the array when analyzed with hepatocytes or islets, respectively. In contrast, HNF4α bound to 11%–12% of the genes on the array in both tissues, suggesting that, like c-Myc and E2F, HNF4α may regulate a large percentage of mammalian genes. Unfortunately, these selected promoter arrays are not yet commercially available, and the cost and manpower associated with creating unique primers for tens of thousands of different promoters prohibit many labs from using this technology.
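The promoter windows used on these arrays are defined relative to the transcription start site, so the genomic coordinates depend on the strand of the gene. The sketch below shows one way to compute such a window; the function name `promoter_window`, the coordinate convention (a single integer position per start site), and the example position are assumptions made for illustration.

```python
def promoter_window(tss, strand, upstream=700, downstream=200):
    """Genomic (start, end) of a promoter fragment spanning `upstream` bp
    before to `downstream` bp after the transcription start site.
    Defaults follow the -700 to +200 windows mentioned in the text."""
    if strand == "+":
        return tss - upstream, tss + downstream
    if strand == "-":
        # On the minus strand, "upstream" lies at higher coordinates.
        return tss - downstream, tss + upstream
    raise ValueError("strand must be '+' or '-'")

if __name__ == "__main__":
    print(promoter_window(10_000, "+"))           # plus-strand gene
    print(promoter_window(10_000, "-"))           # minus-strand gene
    print(promoter_window(10_000, "+", 650, 250)) # the -650 to +250 design
```

Any binding site that falls outside the returned interval is invisible to such an array, which is the limitation the following paragraphs return to.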
A slightly different approach in ChIP-chip assays has been to use libraries of CpG islands as a source of promoters. CpG islands are G+C-rich regions at least 200 bp long with an observed to expected ratio of CpG dinucleotides of at least 0.8. CpG islands are found in the promoters and first exons of an estimated 70% of human genes or at other regulatory regions in the genome (89). Arrays consisting of 8,000–12,000 CpG islands have been used to identify E2F, c-Myc, and SUZ12 target genes. Mao et al. (90) used a CpG island array and found that 12% of the clones were bound by c-Myc in human HL60 cells. The same CpG arrays have been used to identify CpG island clones bound by E2F1, E2F4, and E2F6 (91–93). Although the vast majority of the clones bound by E2F4 and E2F6 corresponded to CpG islands near promoter regions, many of the clones bound by E2F1 represented certain types of repeats. For example, sequences repeated on chromosomes 1 and 16 were specifically detected by the E2F1-immunoprecipitated DNA but not by the E2F4- or E2F6-precipitated DNA. Unfortunately, the different studies used different cell types, so it is not yet clear if the different types of identified sites are reflective of differences in the E2Fs or in the cell types used. For the E2F6 study, siRNA analysis was used to demonstrate that a subset of the identified genes were negatively regulated by E2F6 (93). One distinct advantage of the CpG island arrays is that they are now commercially available at a fairly low cost and therefore can be used by many different investigators (www.microarrays.ca).
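The CpG island criteria quoted above can be checked on a sequence with a few lines of code. The sketch uses the commonly used observed-to-expected formula, (number of CpG × length) / (number of C × number of G). The length and ratio cutoffs come from the text; the G+C cutoff of 0.5 is an assumption added here, because the text says only "G+C-rich", and the function names are hypothetical.

```python
def cpg_stats(seq):
    """Return (G+C fraction, observed/expected CpG ratio) for a sequence."""
    seq = seq.upper()
    n = len(seq)
    c, g = seq.count("C"), seq.count("G")
    cpg = seq.count("CG")
    gc_fraction = (c + g) / n if n else 0.0
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return gc_fraction, obs_exp

def looks_like_cpg_island(seq, min_len=200, min_obs_exp=0.8, min_gc=0.5):
    """Apply the length and ratio criteria from the text; min_gc is an
    assumed threshold standing in for 'G+C-rich'."""
    gc_fraction, obs_exp = cpg_stats(seq)
    return len(seq) >= min_len and obs_exp >= min_obs_exp and gc_fraction >= min_gc

if __name__ == "__main__":
    print(looks_like_cpg_island("CG" * 100))  # CpG-dense toy sequence
    print(looks_like_cpg_island("AT" * 100))  # no C or G at all
```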
A major disadvantage of both the selected promoter arrays and the CpG arrays is that they are not optimal for studying factors that regulate transcription by binding at a great distance from the start site of transcription. To overcome this problem, arrays consisting of PCR fragments of about 700 bp corresponding to 93% of the nonrepetitive regions of human chromosome 22 were created and used to identify nuclear factor-kappa B (NF-κB) and cAMP response element-binding protein (CREB) target genes (94, 95). The authors found that NF-κB bound to both noncoding and coding regions, primarily within 5 kb of the 5' ends of genes and in introns. In the annotated region of chromosome 22, NF-κB bound to 15.5% of the loci, similar to the results obtained from the c-Myc and HNF4α ChIP-chip studies. Importantly, 90% of the identified NF-κB sites fell outside the 1-kb region upstream of a start site; using a selected promoter array would have missed these binding sites. Interestingly, NF-κB sites were also detected in unannotated regions of the genome, suggesting that yet-undiscovered genes may reside in these regions. The utility of a global genomic tiling approach to identify target genes was clearly demonstrated in this initial study. However, not only is this array not commercially available, it also suffers from the problem of having to create 21,024 unique PCR products to study this single chromosome. Clearly, expanding to the entire genome would require hundreds of thousands of PCR fragments and be very costly.
Perhaps the most promising type of array for whole genome profiling is a high-density oligonucleotide array. Such arrays have been used to identify thousands of binding sites for c-Myc, Sp1, and p53 on human chromosomes 21 and 22 (96). For these studies, tiled arrays containing on average one 25mer oligonucleotide spaced every 35 bp through the nonrepetitive regions of these two chromosomes were used in the ChIP-chip assays. The authors found 353 Sp1 sites, 756 c-Myc sites, and 48 p53 sites; extrapolation to the whole genome would suggest 25,000 Myc sites, 12,000 Sp1 sites, and 1,600 p53 sites (assuming that chromosomes 21 and 22 contain an average number of genes and transcription factor binding sites as compared with the rest of the genome). The authors found that 43%, 24%, and 17% of the Sp1, c-Myc, and p53 sites, respectively, were located within 1 kb of CpG islands, indicating that only a fraction of sites would have been discovered by using CpG arrays. Interestingly, the authors found that 27%, 18%, and 0% of the Sp1, c-Myc, and p53 sites, respectively, were within 1 kb of a 5' exon, suggesting that selected promoter arrays would have detected fewer binding sites than the CpG island arrays. Unfortunately, the authors did not attempt to determine which genes were regulated either positively or negatively by the binding of Sp1, Myc, or p53. A different array technology has recently been developed that allows the synthesis of custom high-density microarrays that can represent any genomic region of interest (97). These arrays have been used to identify PRC binding sites from a set of candidate target genes (52), as well as to identify E2F binding sites in 1% of the human genome (footnote 3). Because these custom oligonucleotide arrays are produced by commercial sources, it is likely that they will soon be available to the scientific community. Scaling to the entire human genome will, of course, require many arrays and will most likely be quite expensive.
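The arithmetic behind these figures can be made explicit. The sketch below uses only the numbers quoted in the paragraph above; the scaling factor is back-calculated from them and is not a value stated in the source, so it should be read as an implied quantity rather than a reported one.

```python
# Site counts on chromosomes 21 and 22 and the whole-genome extrapolations,
# both as quoted in the text.
observed = {"c-Myc": 756, "Sp1": 353, "p53": 48}
extrapolated = {"c-Myc": 25_000, "Sp1": 12_000, "p53": 1_600}

# Implied genome-wide scaling factor (derived here, not stated in the text).
implied_scale = {f: extrapolated[f] / observed[f] for f in observed}

# Percentages of sites within 1 kb of a CpG island, as quoted in the text.
pct_near_cpg = {"Sp1": 0.43, "c-Myc": 0.24, "p53": 0.17}
near_cpg = {f: round(pct_near_cpg[f] * observed[f]) for f in observed}
missed_by_cpg_array = {f: observed[f] - near_cpg[f] for f in observed}

if __name__ == "__main__":
    for factor in observed:
        print(
            f"{factor}: scale x{implied_scale[factor]:.1f}, "
            f"{near_cpg[factor]} sites near CpG islands, "
            f"{missed_by_cpg_array[factor]} not near one"
        )
```

All three factors give a scaling factor of roughly 33 to 34, consistent with a single multiplier having been applied to the counts from the two chromosomes.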
Although most studies have used variations of the ChIP-chip assay to identify target genes, two alternative approaches have also been described. One method uses a sequencing-based approach, and a second method is based on creation of a fusion between a transcription factor and a DNA-adenine methyltransferase (Dam). In the sequencing-based approach, the immunoprecipitated chromatin is not applied to an array. Rather, it is either directly cloned and then sequenced (77, 98) or turned into small tags similar to those used in SAGE analysis, concatamerized, cloned, and sequenced (footnote 4). The sequencing-based approaches are not comprehensive and are very laborious, but they may identify targets that are not represented on selected promoter or CpG arrays. Another approach, termed DamID, circumvents the ChIP step entirely (99). In this approach, a DNA binding protein is fused to Escherichia coli Dam, permitting methylation of DNA within 1.5 to 2 kb of the binding site of the DNA-bound fusion protein. Briefly, the fusion protein is introduced into cells, and the cellular DNA is then extracted, digested with a restriction enzyme that cuts at GATC only if the sequence is methylated, and size fractionated. As a reference, the Dam protein (not fused to a DNA binding factor) is introduced into parallel cultures, and the DNA is extracted, digested, and size fractionated in the same way. The small DNA fragments produced by the Dam fusion protein and by the unfused Dam protein are labeled with different fluorescent dyes and hybridized to a microarray. Initial experiments used cDNA-based microarrays, but more recent studies have used arrays containing long contiguous regions of Drosophila genomic DNA. This technique has not yet been applied to mammalian cells and has the disadvantage that an artificial protein must be expressed in cells, running the risk that non-physiological levels of the factor of interest may influence the number of binding sites identified. However, this technique might prove useful for identifying targets of factors that associate transiently with the chromatin and thus cannot be captured at the target locus by a cross-linking method.
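The two-color comparison at the end of the DamID procedure can be sketched as a per-probe log ratio of the Dam-fusion signal over the Dam-only reference. This is a minimal illustration under stated assumptions: the probe names, intensities, pseudocount, and threshold are all hypothetical and are not part of the published protocol.

```python
import math


def damid_log_ratios(fusion, dam_only, pseudocount=1.0):
    """Return log2(fusion / Dam-only) for every probe present in both channels.

    `fusion` and `dam_only` map probe IDs to fluorescence intensities.
    The pseudocount is an assumption added here to avoid division by zero.
    """
    return {
        probe: math.log2(
            (fusion[probe] + pseudocount) / (dam_only[probe] + pseudocount)
        )
        for probe in fusion
        if probe in dam_only
    }


def candidate_targets(log_ratios, threshold=1.0):
    """Probes whose pseudocount-adjusted log2 ratio is at least `threshold`."""
    return sorted(p for p, r in log_ratios.items() if r >= threshold)


# Hypothetical intensities for three probes.
fusion = {"probeA": 799.0, "probeB": 99.0, "probeC": 49.0}
dam_only = {"probeA": 99.0, "probeB": 99.0, "probeC": 199.0}

ratios = damid_log_ratios(fusion, dam_only)
print(candidate_targets(ratios))
```

Normalizing to the unfused Dam reference is the point of the design: it corrects for regions that are simply more accessible to methylation regardless of where the factor binds.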
The Ideal Array Combination.
All the approaches described above (ChIP-chip with PCR fragments or oligonucleotide arrays, Sequence Tag Analysis of Genomic Enrichment [STAGE], or DamID arrays) provide relatively unbiased information concerning the location of binding sites for a particular transcription factor. However, they all suffer from a similar problem: it is not possible to know the precise function of each of the binding sites without additional experimentation. The genes closest to the identified binding sites must be checked individually for responsiveness to alterations in levels of the factor. However, some of the identified sites may be critical for regulation of the nearby gene in some, but not all, cells. Therefore, real targets that are regulated in a different cell type or under a different physiological condition may be inadvertently discarded with this approach. Despite these limitations, some studies have used these approaches to determine if regulation is mediated by a subset of identified binding sites. For example, some of the c-Myc target genes identified by ChIP-chip assays were analyzed for changes in gene expression by RT-PCR in experiments in which c-Myc levels were increased or decreased (90). Also, a subset of E2F6 sites identified by ChIP-chip assays were analyzed by RT-PCR after removal of E2F6 by using siRNA technology (93). However, it is clear that a complete follow-up analysis by RT-PCR or Northern blots is not possible if thousands of target genes have been identified in the binding site assays.
The ideal approach would be to create an array platform that could allow both the examination of RNA expression changes and the identification of DNA binding sites. The promoter arrays and the CpG island arrays correspond to the 5' ends of genes and, as such, do not contain much of the transcribed regions of the genes. This makes it difficult to use these arrays for mRNA expression analysis. However, it is possible to produce 5'-end-enriched cDNA populations for use with promoter or CpG arrays (100). Therefore, although not optimal, these arrays could be used to study changes in mRNA levels of the genes regulated by the CpG islands on the arrays. Clearly, a better approach would be to create arrays that tile through an entire genome at a resolution sufficient to identify a binding site. These arrays could be used to identify all the binding sites for a particular factor and to identify all RNAs (including protein-coding and noncoding RNAs) that respond to loss or over-expression of that factor. For example, Martone et al. (94) used a tiled genomic array platform, consisting of PCR fragments of about 700 bp in length, for both expression and DNA binding studies of NF-κB. Interestingly, they found that not all the promoters that are bound by NF-κB responded to changes in levels of NF-κB, suggesting either that some of the binding was nonfunctional or that these targets are regulated under different conditions or in different cell types. It also remains possible that, due to inherent problems with microarray analysis, such arrays will not always provide a definitive set of target genes. Although the NF-κB study is a step in the right direction, only a small portion of the human genome (chromosome 22) was examined. Because of the size of the human genome, a comprehensive analysis using this approach would take dozens of arrays and would be quite expensive. An alternative approach, which would not be as comprehensive but which would be perhaps more generally useful, would be to create a one- or two-array set that represents 10 kb upstream of each gene plus a 1-kb portion of the 3' end of the coding region of each known gene. The probes representing the 10-kb region would mainly assist in the identification of binding sites in ChIP-chip experiments, whereas the 1-kb portion of the 3' end would primarily serve for determining RNA levels in gene expression experiments. However, binding sites located outside the 10-kb regions would not be detected. Recent studies suggest that this is a real concern. For example, Cawley et al. (96) demonstrated that 36% of identified binding sites for the transcription factors Sp1, p53, and c-Myc were located within genes or downstream of the most 3' exon. As a compromise between a comprehensive genomic array and a promoter array, a conserved region array could be produced. Once more mammalian genome projects are completed and the comparative genomic approaches are improved, one could represent on arrays all the conserved mammalian genomic regions. This method would rely on the assumption that conserved regions represent functional domains of the genome where DNA-protein interactions and transcription would most likely occur. One problem with this method might be the need to use a considerable number of arrays to cover all the evolutionarily conserved regions, as indicated by the fact that, at the nucleotide level, approximately 40% of the mouse genome is aligned to the human genome (101). 
However, unless the entire genome is represented on arrays, probably no other approach will provide a comprehensive identification of binding sites and examination of the transcriptome. We hope that future advances in microarray technology will allow the fabrication of whole-genome arrays in a way that is both economical and practical.
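The compromise array described above (10 kb upstream of each gene for binding studies plus 1 kb at the 3' end for expression studies) amounts to computing two strand-aware intervals per gene. The sketch below is one hypothetical way to derive them; the `Gene` record, the coordinate convention, and the use of the transcript end as a stand-in for the end of the coding region are all assumptions made for illustration.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    name: str
    chrom: str
    start: int   # 0-based leftmost coordinate of the gene
    end: int     # exclusive rightmost coordinate of the gene
    strand: str  # "+" or "-"


def array_regions(gene, upstream=10_000, three_prime=1_000):
    """Return (binding_region, expression_region) as (chrom, start, end).

    binding_region: `upstream` bp 5' of the gene, for ChIP-chip probes.
    expression_region: last `three_prime` bp at the 3' end, for RNA probes.
    Chromosome length is not checked for minus-strand upstream regions.
    """
    if gene.strand == "+":
        binding = (gene.chrom, max(0, gene.start - upstream), gene.start)
        expression = (gene.chrom, max(gene.start, gene.end - three_prime), gene.end)
    elif gene.strand == "-":
        binding = (gene.chrom, gene.end, gene.end + upstream)
        expression = (gene.chrom, gene.start, min(gene.end, gene.start + three_prime))
    else:
        raise ValueError(f"unknown strand: {gene.strand!r}")
    return binding, expression


# Hypothetical genes on opposite strands.
plus_gene = Gene("geneA", "chr1", 50_000, 80_000, "+")
minus_gene = Gene("geneB", "chr1", 50_000, 80_000, "-")
print(array_regions(plus_gene))
print(array_regions(minus_gene))
```

Any binding site outside the returned upstream window, for example within the gene body or 3' of it, would be invisible to such a design, which is the limitation raised in the text.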
| Using Bioinformatic Tools to Identify Transcription Factor Binding Sites |
|---|
|
|
|---|
Over the past several years, many computational programs have been developed that use global gene expression data to identify regulatory elements. In general, these computational programs use two different methodologies for identifying regulatory motifs. The first method is based on the ability to cluster genes according to their gene expression pattern (102). The underlying assumption is that genes classified in the same cluster are co-regulated and thus share similar regulatory motifs within their promoters. For example, Roth et al. (103) have used cDNA microarrays to identify genes that are involved in different cellular processes in yeast (e.g., galactose response). To identify regulatory elements that might play a role in the control of each cellular process, the authors first ranked the deregulated genes from each experimental system according to their changes in gene expression (i.e., from most upregulated to least upregulated). Then they selected the promoter sequences of the 10 genes with the highest changes in gene expression and used the AlignACE program to identify all the common DNA motifs in their promoter sequences. To validate the functionality of the identified motifs, the authors searched for these motifs in the promoters of other yeast genes and showed that additional genes containing the motifs were regulated similarly to the ones that were originally used for the identification of the motif.
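The rank-then-search workflow described above can be sketched in a few lines. Note that AlignACE itself is a Gibbs-sampling motif finder; the exact k-mer counting used here is a much simpler stand-in chosen only to make the logic visible. Gene names, sequences, and fold changes are hypothetical.

```python
from collections import Counter


def top_responders(fold_changes, n=10):
    """Names of the n genes with the largest expression change."""
    return sorted(fold_changes, key=fold_changes.get, reverse=True)[:n]


def shared_kmers(promoters, k=6, min_fraction=0.8):
    """k-mers present in at least `min_fraction` of the promoter sequences."""
    counts = Counter()
    for seq in promoters:
        seq = seq.upper()
        # Use a set so each promoter contributes each k-mer at most once.
        counts.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    needed = min_fraction * len(promoters)
    return sorted(kmer for kmer, c in counts.items() if c >= needed)


# Hypothetical expression changes and promoter sequences.
fold_changes = {"g1": 8.0, "g2": 6.5, "g3": 5.0, "g4": 1.1}
promoter_seqs = {
    "g1": "ATCGGAAGTTA",
    "g2": "GGCGGAAGCAT",
    "g3": "TACGGAAGGCC",
    "g4": "ATATATATATA",
}

top = top_responders(fold_changes, n=3)
motifs = shared_kmers([promoter_seqs[g] for g in top], k=6, min_fraction=1.0)
print(top, motifs)
```

The validation step in the text would then correspond to scanning the remaining promoters for the recovered k-mers and asking whether genes that contain them respond in the same way.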
Although clustering genes according to their expression and finding common motifs in their promoters is informative, it has limitations. This approach is based on the assumption that all co-regulated promoters share a common motif, and it does not take into account that some of the genes found in a given gene expression cluster might be a result of secondary gene expression perturbations and thus would not contain the same motif as the primary response genes. In addition, some promoters might contain the identified motif, but those genes are not regulated in a manner dictated by the identified motif because of context-dependent regulation at those promoters (i.e., control of expression is dependent on the synergy of adjacent motifs or transcription factors) (104). To avoid these limitations, the second method that uses gene expression data for the identification of regulatory motifs does not use clustering analysis. Rather, this second method initially uses computational programs to identify regulatory motifs occurring commonly in the promoter sequences of known genes, and then these motifs are correlated to collected gene expression data. The fitting of motifs to gene expression allows for the identification of the most relevant elements and also takes into account the combinatorial effects of these motifs on the control of gene expression (105, 106). However, this method is effective for discovering short and highly conserved motifs but is not reliable for identifying longer elements or motifs with degenerate sequences. To circumvent the disadvantages of both methods described above, Conlon et al. (107) have used a strategy, which they named motif regressor, that combines both approaches. Using yeast that overexpress a particular transcription factor, the authors first cluster the genes according to their changes in gene expression. 
Then they use a motif-finding program (Motif Discovery scan [MDscan]) that allows the identification of all DNA elements that occur frequently in the promoters of the most highly responsive genes. After finding all the candidate elements, they correlate each sequence with the entire gene expression dataset to determine which motifs most likely affect transcription. Compared with the previous two approaches, this method provides higher specificity and sensitivity for finding relevant regulatory elements.
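The second stage of this combined strategy, relating each candidate motif to the whole expression dataset, can be illustrated with a deliberately simplified univariate regression. The published method is more elaborate, so the following is only a sketch under stated assumptions: exact match counts stand in for motif scores, each motif is tested on its own, and all sequences and expression values are hypothetical.

```python
def count_matches(seq, motif):
    """Number of (possibly overlapping) exact occurrences of motif in seq."""
    k = len(motif)
    return sum(seq[i:i + k] == motif for i in range(len(seq) - k + 1))


def slope_and_r(x, y):
    """Least-squares slope of y on x and the Pearson correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        return 0.0, 0.0  # constant counts or expression: uninformative
    return sxy / sxx, sxy / (sxx * syy) ** 0.5


def rank_motifs(promoters, log_expr, motifs):
    """Rank candidate motifs by |r| between match count and expression."""
    genes = sorted(promoters)
    y = [log_expr[g] for g in genes]
    results = []
    for motif in motifs:
        x = [count_matches(promoters[g], motif) for g in genes]
        slope, r = slope_and_r(x, y)
        results.append((motif, slope, r))
    return sorted(results, key=lambda t: abs(t[2]), reverse=True)


# Hypothetical promoters and log expression changes.
promoters = {
    "g1": "CGGAAGTTCGGAAG",
    "g2": "AACGGAAGTT",
    "g3": "ATATATATAT",
    "g4": "TTTTGCGCAA",
}
log_expr = {"g1": 2.0, "g2": 1.0, "g3": 0.0, "g4": 0.0}

ranked = rank_motifs(promoters, log_expr, ["ATATAT", "CGGAAG"])
print(ranked[0])
```

Because every gene enters the fit, a motif that is frequent in the top responders but equally frequent in unresponsive promoters scores poorly, which is the intended gain in specificity.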
All the approaches described above were performed with yeast as a model system because the relatively small and simple yeast genome (which has a high gene density, small intergenic regions, and relatively few transcription factors) is amenable to bioinformatic analyses. Although application of bioinformatic approaches is much more difficult when studying higher eukaryotes, several groups have attempted to use computational programs to identify regulatory motifs in mammalian genomes. One example is a study that used a previous gene expression dataset (108) to cluster all the human genes that are cell-cycle regulated (109). Using a computer program known as Promoter Integration in Microarray Analysis (PRIMA), Elkon et al. identified eight transcription factor binding sites that were over-represented in the promoter sequences of their clustered genes. Reassuringly, some of those sites corresponded to binding sites of transcription factors that are known to be involved in cell-cycle regulation (e.g., E2F).
In many of the experiments described above, different computational programs were used to identify regulatory motifs from gene expression results. Each method provides information about regulatory motifs that might control the expression of a set of genes under a specific experimental condition. However, unless the experimental condition entails either increased or decreased activity of a single transcriptional regulator, the promoter sequences of the majority of the deregulated genes identified in a microarray study will not contain a common transcription factor binding site. To circumvent this limitation, computational programs have recently been used in combination with location analyses to identify binding sites for specific transcription factors. The advantage of this approach is that all the genomic fragments that are enriched in binding analyses should contain a site that mediates the function of the transcription factor under examination. Using data from ChIP-chip experiments performed in yeast, Liu et al. (110) have shown that their computational method, MDscan, is able to identify known, as well as novel, consensus sites for a transcription factor. In the same study, the authors also tried three other algorithms (BioProspector, AlignACE, and CONSENSUS) in combination with the ChIP-chip dataset for the identification of the transcription factor binding sites, and they reported that those algorithms were much slower and less precise compared with MDscan. The MDscan program has also been applied to sequences derived from ChIP-chip experiments performed in human cells. Cawley and colleagues (96) identified consensus and degenerate binding sites of Sp1 in DNA fragments that were enriched with an antibody against the Sp1 transcription factor. However, in the same study, MDscan failed to discover binding sites in DNA sequences enriched by antibodies against two other known DNA binding transcription factors. 
This suggests that MDscan might be able to detect only those mammalian binding sites, such as sites for Sp1, that are most frequently found in core promoters. Another computational program was also used to identify binding sites in DNA sequences that were isolated by the DamID approach. Orian and colleagues (99) identified a large number of genomic loci bound by the Myc/Max/Mad family of transcription factors in Drosophila cells and then used the REDUCE algorithm to show a high correlation between the presence of the canonical E-box sequence (a Myc/Max/Mad binding site) and the identified transcription factor-bound regions.
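The association between a known consensus site and factor-bound regions can be illustrated by comparing how often the site occurs in bound versus unbound sequences. REDUCE is a regression-based method, so this frequency comparison is a simpler substitute, and the sequences below are hypothetical. The canonical E-box is taken to be CACGTG.

```python
def contains_ebox(seq, ebox="CACGTG"):
    """True if the sequence contains the canonical E-box.

    CACGTG is its own reverse complement, so scanning one strand suffices.
    """
    return ebox in seq.upper()


def fraction_with_ebox(seqs):
    """Fraction of sequences that contain at least one E-box."""
    return sum(contains_ebox(s) for s in seqs) / len(seqs)


# Hypothetical bound and unbound genomic fragments.
bound = ["TTCACGTGAA", "GGCACGTGCC", "ATATATATAT", "cacgtgTTTT"]
unbound = ["ATATATATAT", "GGGGCCCCAA", "TTCACGTGAA", "ACACACACAC"]

bound_fraction = fraction_with_ebox(bound)
unbound_fraction = fraction_with_ebox(unbound)
print(bound_fraction, unbound_fraction)
```

With real data, the difference between the two fractions would need a statistical test against an appropriate background, since a six-base site occurs by chance in many long fragments.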
Although, as described above, computational programs have been used successfully to identify binding sites from sequences enriched in location studies, in many situations they have failed to reveal the correct binding motifs. This might be because transcription factors can bind to non-consensus sequences. For example, previous ChIP-chip studies identified a large number of target promoters that did not contain a consensus site for the factor in question (87, 91, 94). Therefore, advanced bioinformatic approaches must be created that will allow the identification of degenerate binding sites. Such advancement may lie in the use of comparative genomics, also known as phylogenetic footprinting. This approach is based on the assumption that functionally important sequences are conserved through evolution and thus are maintained across several related species. One example of a study that integrates bioinformatics, phylogenetic footprinting, and experimental methods was performed by Kel et al. (111). The authors first identified putative binding sites for the E2F transcription factors within a large set of mammalian promoters by computer-based predictions and sequence conservation between mouse and human promoters, then they verified the binding of various E2F family members to those sites by performing ChIP assays in cultured cells. The E2F study did not begin with a set of promoters identified by ChIP-chip but instead selected the promoters by a consensus sequence. However, Kellis et al. (112) have used yeast ChIP-chip data and applied phylogenetic footprinting to genomic regions that are bound by transcription factors having known consensus binding sites. Surprisingly, only a few of the motifs were discovered by the comparative genomics approach. This result emphasizes the need for the development of improved computational methods that will aid in the identification of functional DNA motifs from sequences enriched in binding analysis.
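The core idea of phylogenetic footprinting, keeping only predicted sites that are preserved at the same position in an orthologous sequence, can be sketched as follows. This is a minimal illustration that assumes a precomputed pairwise alignment and exact matching to a single site; the motif TTTCGCGC is used as an example E2F-like site, and both aligned sequences are hypothetical.

```python
def conserved_sites(human_aln, mouse_aln, motif):
    """Alignment columns where `motif` occurs in both aligned sequences.

    Both inputs are rows of one pairwise alignment (equal length, '-' for
    gaps). Only ungapped, exact matches at the same columns are reported.
    """
    if len(human_aln) != len(mouse_aln):
        raise ValueError("aligned sequences must have equal length")
    k = len(motif)
    hits = []
    for i in range(len(human_aln) - k + 1):
        human_window = human_aln[i:i + k].upper()
        mouse_window = mouse_aln[i:i + k].upper()
        if human_window == motif and mouse_window == motif:
            hits.append(i)
    return hits


# Hypothetical aligned promoter fragments; the second human site is not
# conserved in mouse because of a single substitution.
human = "AATTTCGCGCAT--GGTTTCGCGCAA"
mouse = "AATTTCGCGCATCCGGTTTGGCGCAA"

sites = conserved_sites(human, mouse, "TTTCGCGC")
print(sites)
```

Requiring exact conservation is a strict filter; it would discard degenerate but functional sites, which may be one reason such approaches recover only a subset of known motifs.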
Finally, primary DNA sequence is not the only determinant of where and when a factor will bind. DNA and histones can be modified, resulting in chromatin that carries epigenetic information influencing the binding of factors to specific genomic regions (113, 114). For example, over 25 posttranslational modifications of histone H3 have been identified that involve acetylation, methylation, and phosphorylation; these and many other modifications regulate recruitment of transcription factors and gene activity. No in silico approach has yet attempted to combine epigenetic information with primary sequence information in algorithms that predict factor binding.
| Conclusions and Future Directions |
|---|
|
|
|---|
How Can PcGs Activate Certain Genes and Repress Others?
A greater knowledge of mammalian PRC target genes will allow the clarification of a dichotomy concerning PcG protein activity. Although PcG proteins and their complexes have been primarily studied in the context of their transcriptional silencing activities, several lines of evidence indicate that some of these proteins can also activate transcription in certain circumstances (34, 52, 115, 116). In fact, such PcG proteins are now classified as Enhancer of Polycomb and Trithorax proteins (117). Therefore, it will be of interest to determine how binding of PcG complexes to some target genes results in activation of gene expression and what mechanisms underlie this activation (e.g., is histone methylation involved?). One approach to address this question would be to perform ChIP-chip assays with antibodies to components of the PRCs, differently modified histones, and components of the basal transcriptional machinery. The overlap of the array results could be used to determine if binding of PRCs correlates with active versus inactive chromatin and to develop hypotheses as to how recruitment of the PRC can lead to each type of chromatin state.
Do the PRCs Use the Same Mechanism to Imprint the X Chromosome As They Do to Silence Autosomal Target Genes?
Evidence exists in support of the hypothesis that different mechanisms are involved in the regulation of genes on the X chromosome versus genes on the autosomes. For example, although PRC1 is needed for PRC2-mediated silencing of the hox genes in Drosophila, recent studies have demonstrated that PRC1 does not colocalize on the inactivated X chromosome with PRC2. Furthermore, PRC recruitment to the imprinted X chromosome is uniquely dependent on the Xist RNA, raising the possibility that a protein-RNA interaction mediates the recruitment of PRCs to the X chromosome (28, 29). Elucidation of the mechanisms by which PRCs mediate repression requires the identification of mammalian PRC target loci located on both autosomal and X-chromosomal regions. A ChIP-chip assay (with antibodies to the PRC components) with an X-chromosome-specific tiling array may show unique recruitment patterns.
How Do PRCs Communicate with the Core Promoter Region?
Polycomb repressive complexes (PRCs) could use either of the two modes of action depicted in Figure 3 to regulate their target genes. In the first model, the PRCs bind to a distant enhancer element and then, via a DNA looping mechanism, contact the core promoter via protein-protein interactions to regulate transcriptional activity. Alternatively, the PRCs could use an extensive spreading mechanism, which would entail binding of a PRC to a high-affinity binding site, followed by consecutive recruitment of additional PRCs to nearby low-affinity sites. Studies in Drosophila and recent preliminary evidence in human cells favor the DNA looping mechanism, even though strong evidence that would exclude the spreading mechanism is lacking (52, 118). The ability to develop special tiling arrays of target genes identified in ChIP-chip experiments could help distinguish between the two modes of action of PRCs.
|
| Acknowledgments |
|---|
| Footnotes |
|---|
3 Matthew Oberley and P.F., unpublished observations.
4 V. Iyer, personal communication.
| References |
|---|
|
|
|---|