|
|
||||||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||


* Computational Biology Group, Fred Hutchinson Cancer Research Center, Seattle, Washington 98109; and
Institute for Advanced Study, Princeton, New Jersey 08540
1 To whom requests for reprints should be addressed at Computational Biology Group, Fred Hutchinson Cancer Research Center, 1100 Fairview Ave. N, Seattle, WA 98109. E-mail: hrobins{at}fhcrc.org
| Abstract |
|---|
|
|
|---|
Key Words: motifs codon usage relative entropy
| Introduction |
|---|
|
|
|---|
Any genome of an organism can be divided into the portion of a gene which encodes the amino acid sequence of a protein and the non-coding portion that contains the regulatory motifs that impact the levels of a gene product, the timing of its expression, and the place (cell type) where it is produced. Clearly the sequence motifs or signals that have biological function are selected for and retained at high frequencies in a genome. Are there similar sets of sequence information in the protein coding regions of the genome which result from the degeneracy of the genetic code and preferential use of selected codons? In this discussion the search for functional sequence motifs or signals in the genome of an organism will be restricted to the protein coding regions of the genome. Of course, there is the confounding effect that some motifs are conserved for nonfunctional reasons. Only experiments can distinguish the truly functional motifs. The same approaches employed here can also be usefully applied to the regulatory regions of genomes and this is work in progress.
The proportion of each codon used for a given amino acid varies widely between organisms as well as between genes in the same organism. For example, in Saccharomyces cerevisiae the highly expressed genes have different codon usage than poorly expressed genes (3). The typical explanation is that isoaccepting tRNAs have very different concentrations within a yeast cell, so the tRNAs that occur in lower abundance can cause rate-limiting steps in translation of an mRNA. The yeast genome selects against those rare codons that correspond to these low abundance tRNAs. The highly expressed genes have a stronger negative selection pressure on the rare codons than poorly expressed genes. Is this a sufficient explanation to shape the unique DNA sequence and use of codons of S. cerevisiae? The results reviewed here suggest other additional factors may play a role. Yet another example of this issue is the LINE-1 retrotransposon elements found in the genomes of many organisms. These LINE-1 elements are very poorly expressed in the somatic cells of their hosts. This is in part due to the heavy methylation of cytosine residues which decreases transcription rates but a second factor, inherent in the nucleotide sequences of LINE-1 elements, slows the rate of transcription of these genomes. Changing the nucleotides in the third position of codons in a LINE-1 element (no amino acid changes) increases the rate of transcription of these elements many fold (5, 6). This can impact the frequency of transposition and the forces of selection for these sequence motifs. Because the poor expression is a transcriptional problem, not translation, this increase due to codon change is not related to codon usage in translation. Could it be that inherent in these LINE-1 elements are sequence motifs that modulate transcription rates or RNA processing rates? Yet a third example of a possible role for nucleotide sequence motifs in protein coding regions of genes, which are not related to codon usage, is found in the Human Immunodeficiency Virus. The DNA copy of this virus integrated into the host cell genome is transcribed poorly and RNA processing of HIV mRNA is inefficient. Indeed two viral proteins are required to enhance transcription (Tat) and RNA processing (Rev). In addition to these aids in gene expression, however, synonymous changes in the third positions of this HIV DNA sequence greatly increases the gene expression of this virus even in the absence of Tat and Rev (7). Thus, there is growing evidence that the RNA sequence of an organism contains sequence motifs, other than codons, which impart biological information. Despite the direct biological relevance of codon usage, there has been very little effort to search for additional information within coding regions (8–11). Here we review the Robins-Krasnitz algorithm, a new method to search this vast amount of additional information contained in the coding region of each organism for biologically functional signals (12). In particular, the algorithm systematically determines the set of small motifs which are under-and overrepresented in a genome.
This search algorithm has overcome two major obstacles. First, the algorithm attempts to find signals which are obscured by the already known biological properties: coding for proteins and codon usage. These make up a set of complicated constraints that must be controlled. Second, motifs come in large families that are highly codependent. For example, CpG is significantly underrepresented in the human genome because a common mutation is C
U through the removal of an amine group. This negative evolutionary pressure significantly reduces the entire superfamily of motifs that include CG as a substring, e.g. ACG, CCG, CGA, etc. Additionally, the mutation increases the number of TAs and its superfamily. Disentangling the primary object of selection is a difficult challenge. Taking advantage of powerful information theory tools, the Robins-Krasnitz algorithm overcomes both of these obstacles. One drawback of the algorithm in the present form is that it is limited to exact motifs. Extending the formulation to include degenerate motifs is a difficult problem.
These obstacles, particularly the first one, are a specific case of a general problem in bioinformatics as well as other areas of computational biology (13). Finding a pattern or signal requires a null-hypothesis for what the genome would look like without signal. An incorrect null-hypothesis (or background model) leads to the discovery of false signals. Our background model needs to account for as much biological knowledge as possible without damping the signal too dramatically. For the case of finding transcription factor binding sites in upstream regions of genes, we have very little biological knowledge to use. Genomic algorithms have had much difficulty separating out false signals. In the case of coding regions, we have substantially more biological knowledge to incorporate into our background model. For coding regions, we must be careful of the opposite issue, over constraining the background and losing the signal. Of the two constraints we consider, amino acid order is not controversial. The interesting decision is how to include codon usage. We are limited for practical reasons to two choices; either we can fix codon usage in the whole genome or we can fix it individually in each gene. We have strong biological evidence that codon usage within a gene has a large effect on rates of gene expression in bacteria. Additionally, the variation of codon usage between genes in the same genome varies widely, with the variation strongly correlated with expression. If overall codon usage in the genome is used as a constraint, we will be ignoring substantial biological information. On a practical level, fixing codon usage in each gene also fixes mononucleotide content. Lifting this constraint creates false signals that correlate with GC content.
This review is divided into five sections. Sections one and two present the algorithm, explaining how both obstacles described above are overcome. The third section provides evidence for biological function of the motifs. The fourth section presents a set of possible applications for this information. The last section explores the reasons why such coding region sequence motifs might arise in different organisms and differ in different organisms.
| 1) Creating a Maximal Entropy Distribution |
|---|
|
|
|---|
|
|
For practical purposes, we actually only need to keep track of the set of the longest motifs we wish to consider. All shorter motifs can be retrieved from this set. We determine the longest length motif that is statistically robust for the set of sequences we are analyzing. For the case of a bacterium with 1 Mb of coding sequence, we can find motifs up to length seven. Since 106/47 ~ 60, all the motifs of length seven would be expected to have reasonably large counts. Then, we move a sliding window of this length across the sets of sequences, moving one base at a time. This process determines the counts for the largest length motifs. However, the shorter length motifs are over counted, because they appear in more than one window each. The formula to deconstruct the counts for all the shorter length motifs is found in Box 1A
. An example is found in Box 1B
. The estimated probabilities for each motif are simply the counts divided by the total length of sequence. We now have two distributions to compare: one is the motif probabilities from the real set of sequences and the other is the MED distribution which is found by averaging the motif probabilities from a number of Monte Carlo iterations.
| Box 1A. Extracting the Maximal Entropy Distribution from a Set of Sequences. NMED(w) is the average count for each word w from the Monte Carlo runs. Wimax is the set of words of maximum length where i = 0, . . . , 4Lmax–1. All counts NMED(w) can be determined from NMED(Wimax), but a couple of additional quantities need to be defined. C(Wimax,w) is the number of times the word w is contained in the word Wimax. L(w) is the length of word w. The equation to extract the counts of all the shorter motifs from the counts of the longest motifs is
PMED(w) = NMED(w)/L is the probability of motif w, where L is the total length of sequence. PR(w) = NR(w)/L is the probability of motif w in the real sequence (this is found by counting).
|
| Box 1B. An Example. Consider the case where the maximum length is 7, which is the maximum size motif that we can consider for most bacterial genomes. Let w = AAC and W2577 = AACAAAC, and where the label 257 is the base four conversion of the motif with A = 0, C = 1, G = 2, T = 3, i.e. 0010001 = 4^4 + 4^0. L(w) = 3, C(W7257,w) = 2, and L(Wmax) = 7. To get NMED(w), this process is repeated for all 47 Wimax terms. The values are then added in the formula above.
|
| 2) Using Relative Entropy to Choose Over- and Underrepresented Motifs |
|---|
|
|
|---|
| Box 2. The Relative Entropy or Kullback-Leibler Distance. The Kullback-Leibler distance between the real and MED probability distributions is
A figure of merit that measures the extent to which *any* word w, of length 2 to max, contributes to DKL is
This can also be thought of as a Kullback-Leibler distance between two probability distributions, namely the coarse-grained real and background distributions where we only know if a given word is or is not w.
|
The other choice of distance is a symmetrized version of the Kullback-Leibler (KL) distance, called the Jensen-Shannon (JS) divergence. The practical difference is small for our application as the KL distance is symmetric to lowest order in our application. The JS divergence also smoothes the result, which removes some of the signal. JS converges faster than KL upon rescaling, so we use KL.
Now that we have found the most under- or over-represented motif in the sequence, we have to tackle the second obstacle. We need to pick out the motif which is the next most under- or overrepresented. However, we cannot simply take the motif which has the next largest Relative Entropy. This is because the motifs are overlapping, so under- or overrepresentation of a given motif affects the distribution of all the other motifs. The example of CpG from the introduction illustrates this point. If we take the human genome, CG will have the largest Relative Entropy. However, all eight trimers which contain CG as a subset fall within the top 50 highest Relative Entropy of all motifs. This is simply an artifact of the selection against CG. We must first remove the contribution of CG from the MED before recalculating the Relative Entropy to find the next motif. If we call the motif w, we rescale all motifs that contain w, by the same amount such that the rescaled MED has the same distribution for w as the real distribution. This forces the Relative Entropy of w to zero and, at the same time, removes the contribution of w from all other motifs.
The mathematical details of the rescaling are as follows with many of the definitions in Boxes 1![]()
and 2
. After a motif w is found to be over- or underrepresented, its contribution to the difference between the real distribution and the MED needs to be eliminated. The maximal entropy distribution is rescaled in a minimal way, such that the contribution of w becomes identical in both the real distribution and the MED. For the rescaling to be minimal, the ratios of frequencies of words Wimax of length max that contain w the same number of times should not change. That is, all words Wimax with the same C(Wimax,w) must be rescaled by an equal factor. Consider a coarse graining of the detailed probability distributions. The MED is defined as the set of words Wimax of length max, with the probabilities PMED(Wimax). The set of Wimax is partitioned into disjoint subsets where each element of a given subset contains the motif w an equal number of times. These sets are
![]() |
with J = {0, ... , max – 1} and
![]() |
These disjoint subsets, KJ(w), need to be rescaled such that the probability of being in a given subset is equal in the real distribution and the MED. Define
![]() |
![]() |
These are well-defined probability distributions because they are grouped elements from the old probability distribution (and their probabilities are added). A rescaling that factors out the contribution of w while conserving probability is given by
![]() |
for all i. Note that with this rescaled distribution the figure of merit for w is now Srescaled(w) = 0, so the contribution of w to DKL has been removed. Now, the whole procedure can be repeated to find the next word w, etc.
It is not hard to see that in this iterative algorithm, the MEDPMED(Wimax) converges to the real distribution PR(Wimax). This is because DKL is monotonically decreasing (proof is found in Robins et al. 2005). It is well known that DKL is nonnegative, and is 0 if and only if the two distributions are identical. The algorithm will not halt before convergence is achieved since S(w) = 0 for all w also implies that the real and background distributions are identical. Finally, DKL cannot converge to a positive value, since one could then find a word that would reduce it below that value.
The procedure is iterated, so that we remove the contribution of one motif at a time from contributing to the Relative Entropy through rescaling of the MED. Then, we choose the next motif. The rescaling automatically over-comes the second obstacle. The impact of selection on a motif w, affects all other motifs, especially its supersets and subsets. Our scaling procedure systematically handles all of these motifs at the same time.
Iterations are performed until the largest Relative Entropy for any motif is small enough that we would expect it by chance. This cutoff is the point where it becomes likely that chance fluctuations would create the most significant remaining word, appropriately corrected for multiple hypotheses (the set of all words of length w). The cutoff occurs when the selected word w satisfies
![]() |
where
(w) is the standard deviation of the background count for w. For all our applications we stopped after 100 iterations, which is substantially below the cutoff.
The algorithm gives a set of motifs of varying lengths which are under- or overrepresented in the coding region. Each motif varies in a statistically significant manner between the real and Maximal Entropy Distributions. Having this list, the question then arises; do these sequence motifs found under- or overrepresented in the coding regions of genes have any biological function?
| 3) Evidence for Biological Function |
|---|
|
|
|---|
|
Additionally, we have analyzed the frame bias for the start of the motifs. We asked the question, are the motifs preferentially beginning in a particular reading frame. If the answer was yes, this could either be an artifact of the method or be a biologically functional. However, there is no frame preference overall. The distribution among individual motifs agrees with a random distribution (data not shown).
We use the E. coli example to explore the relative effect of the constraints. If we remove the constraint on amino acid order in each gene, we maintain the constraints on GC bias and codon usage. Lifting this constraint makes the least change between the distributions. The result is that 40% of the top 100 motifs are completely different, while 60% are the same. Removing further constraints drops the overlap of motifs below 25%.
How do the motifs in one bacterium relate to the motifs in another bacterium? Using the list of motifs the Robins-Krasnitz algorithm produces for each of the 164 bacterial species found in the NCBI database, we defined a Hamming distance as a metric. We constructed a phylogenetic tree of these bacteria based entirely on the motifs and this metric (Figs. 3A, 3B
). The tree formed from over- or under-represented sequences was then compared with an evolutionary tree constructed from the ribosomal gene family sequences, which is considered at this time the gold standard. The tree created by the Robins-Krasnitz algorithm captures most of the standard bacterial taxonomy with a whole or partial genome approach (12). This distribution of lists of over- and underrepresented sequences for each organism should permit the assembly and classification of other unknown organisms from their sequence data. Thus, this method can be used to take advantage of shotgun sequence data, because it does not require alignment of orthologous genes. As long as a sufficient amount of sequence from a bacterium, e.g. over 50,000 bases, are sequenced, this method can accurately place the unknown organism into a region of the bacterial phylogenetic tree.
|
|
| 4) Applications |
|---|
|
|
|---|
Two projects are presently underway for direct application of these motifs for medical research. The first project extrapolates the phage/bacterial analysis to humans and our viruses. Like bacteria, many of the motifs derived from the human genome match the motifs from our infecting viruses. However, viral genes can have regulatory differences when compared to the sequences in the human genome. The evolutionary pressures on the viral genomes may result in design differences when compared to those in the human genome. These differences may well be reflected in differences in the motifs detected by the Robins-Krasnitz algorithm. There are striking differences between the motifs found in the mRNA of HIV-1 and those found in the set of all human genes in their coding regions. As mentioned previously the genes of HIV are expressed very poorly in their human host. Focusing on a single motif which is overrepresented in the HIV-1 genome and underrepresented in the coding regions of the human genome, we have recoded the Gag gene of HIV-1 to remove this signal. This simple change in coding doubled the expression of Gag in a human kidney cell line. The same recoded sequence was then used as a DNA vaccine in a mouse model. The anti-gag immune response was six fold higher for our recoded sequence. This avenue of research is presently ongoing and would be a first step in improving an HIV DNA vaccine. It might also help explain the modest success of the HIV DNA vaccines.
The second practical application is to increase the gene expression of coding regions of genes from one host expressed in a second host. Human or viral genes are often expressed in foreign organisms, usually bacteria, yeast, or baculoviruses, for either medical purposes or basic research applications. There is a promising computational project underway to create a tool which combines codon optimization and the contribution of the motifs to recreate a sequence for a particular protein which has increased expression in a given organism.
Additionally, the motifs can be used as a bioinformatic tool to determine the origin and classification of DNA sequences of unknown origin or to assemble a genome from its parts when multiple organisms compose a sample for DNA sequencing. With ongoing shotgun sequencing projects to explore the bacterial diversity in the Sargasso Sea, the air above Manhattan, and the human digestive track, among others, classifying bacterial phylogeny based on novel sequence is a useful tool (15). For this purpose a genome wide property that accurately classifies a genome will be essential. The common tool presently used is dinucleotide content. This method has advantages over the Robins Krasnitz algorithm for very short sequences, but as is demonstrated by phage/host analysis, our algorithm becomes significantly better than the other methods when the sequences reach a few tens of thousands of bases.
Finally, as shown in section 3, the motifs themselves appear to have biological function. We expect a subset of these motifs to be protein binding sequences. Further work needs to be done to determine which proteins are binding these particular sequences and what their biological function could be in a cell. Having these overrepresented sequences provides the tools to isolate and identify DNA sequence specific binding proteins and then to determine their functions. Some of the DNA sequence motifs could be providing physical properties such as flexibility for bending of DNA, packaging DNA in a cell or interactions with RNA polymerases. These DNA sequence motifs could function in the DNA or the mRNA produced from that DNA sequence. One way to distinguish between these alternatives is to look for the reverse complement of an over- or underrepresented sequence in the non-coding strand of DNA (note all of the motifs discussed here come from the coding strand). An analysis of the genes in the human genome with a modified Robins-Krasnitz algorithm has identified overrepresented sequences that function in both DNA and RNA molecules (transcription factor binding sites, splice junctions, splice enhancers, etc.). More than half of these sequence motifs have unknown functions.
If it is correct that these sequence motifs that are over-and underrepresented in a genome are derived from bases in the third position of the coding regions of a gene and the motifs that are formed are selected for or against, then some third position mutations may not be neutral to selection even though they dont change an amino acid. Synonymous single nucleotide polymorphisms that have a phenotype could well uncover sequence motifs (out of reading frame) with functions that have important consequences. For example a synonymous single nucleotide sequence in the multidrug resistance gene-1 alters the functionality of that gene product (16). Many of the mutations in the IARC p53 gene mutant database reside in the third position but no one has tested them for their functional consequences, if any. Clearly these ideas bring up some new directions for research in this area.
| 5) Why Do These DNA Sequence Motifs Arise in the Genome? |
|---|
|
|
|---|
These observations provide good evidence that these sequence signals in a genome are under positive or negative evolutionary selection pressures and so genome sequences are shaped by many diverse sources. But it could be that these sequence motifs arise in a more complex way. The mutation rate and even the nature or the frequency of different types of mutations (transitions, transversions, direction of mutations; A to G or G to A) can differ in different organisms. Mutation rates can be set by the variation in the environment or by diversity of the environment. Thus, there is an evolutionary bias for some sequence motifs in a genome and this could differ between organisms, be distributed evenly around the genome, provide evolutionary diversity and an evolutionary tree of related organisms and would impact upon both the host and viral genomes. It seems likely that some of these sequence motifs could arise from a bias in the mutation type and frequency and might not have a functional consequence even though it was selected for by the host. It may be likely that both functional and non-functional sequence motifs reside in this distribution of over- and underrepresented sequences. What is clear is that such sequence motifs identify important processes in the evolution of the host and contain information about those processes that will need to be decoded in the future.
| Footnotes |
|---|
| References |
|---|
|
|
|---|
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| HOME | HELP | FEEDBACK | SUBSCRIPTIONS | ARCHIVE | SEARCH | TABLE OF CONTENTS |