Scaling laws in the functional content of genomes

  • Published on
    16-Sep-2016


Transcript

specific protein interactions provide a sufficient constraint for a DNA sequence to be recognized, it will ultimately be possible to design cell-specific response elements to block transcription factors in a specific subset of cells [25,26]. Our findings suggest that the understanding of how specific functions are discriminated as different blueprints on the DNA might be within our reach.

Acknowledgements

We apologize for the fact that, because of space constraints, full citation of all relevant articles is not possible. P. Hogeweg, J. Deschamps and a referee are thanked for their critical reading of the manuscript. This study was supported in part by HPRN and EU quality of life programs and G.M. is a recipient of a Marie Curie fellowship.

References

1 Claverie, J.M. (2000) From bioinformatics to computational biology. Genome Res. 10, 1277–1279
2 Benos, P.V. et al. (2002) Is there a code for protein–DNA recognition? Probab(ilistical)ly. BioEssays 24, 466–475
3 Hong, W.K. and Sporn, M.B. (1997) Recent advances in chemoprevention of cancer. Science 278, 1073–1077
4 Mangelsdorf, D.J. et al. (1994) In The Retinoids: Biology, Chemistry and Medicine (Sporn, M.B. et al., eds), pp. 319–349, Raven Press
5 Durston, A.J. et al. (1998) Retinoids and related signals in early development of the vertebrate central nervous system. Curr. Top. Dev. Biol. 40, 111–175
6 Rastinejad, F. (2001) Retinoid X receptor and its partners in the nuclear receptor family. Curr. Opin. Struct. Biol. 11, 33–38
7 Gronemeyer, H. and Miturski, R. (2001) Molecular mechanisms of retinoid action. Cell. Mol. Biol. Lett. 6, 3–52
8 Kurokawa, R. et al. (1995) Polarity-specific activities of retinoic acid receptors determined by a co-repressor. Nature 377, 451–454
9 Gould, A. et al. (1998) Initiation of rhombomeric Hoxb4 expression requires induction by somites and a retinoid pathway. Neuron 21, 39–51
10 Huang, D. et al. (2002) Analysis of two distinct retinoic acid response elements in the homeobox gene Hoxb1 in transgenic mice. Dev. Dyn. 223, 353–370
11 Nolte, C. et al. (2003) The role of a retinoic acid response element in establishing the anterior neural expression border of Hoxd4 transgenes. Mech. Dev. 120, 325–335
12 Amores, A. et al. (1998) Zebrafish hox clusters and vertebrate genome evolution. Science 282, 1711–1714
13 Duboule, D. and Morata, G. (1994) Colinearity and functional hierarchy among genes of the homeotic complexes. Trends Genet. 10, 358–364
14 Krumlauf, R. (1994) Hox genes in vertebrate development. Cell 78, 191–201
15 Moroni, M.C. et al. (1993) Regulation of the human HOXD4 gene by retinoids. Mech. Dev. 44, 139–154
16 Popperl, H. and Featherstone, M.S. (1993) Identification of a retinoic acid response element upstream of the murine Hox-4.2 gene. Mol. Cell. Biol. 13, 257–265
17 Morrison, A. et al. (1996) In vitro and transgenic analysis of a human HOXD4 retinoid-responsive enhancer. Development 122, 1895–1907
18 Langston, A.W. et al. (1997) Retinoic acid-responsive enhancers located 3′ of the Hox A and Hox B homeobox gene clusters. Functional analysis. J. Biol. Chem. 272, 2167–2175
19 Packer, A.I. et al. (1998) Expression of the murine Hoxa4 gene requires both autoregulation and a conserved retinoic acid response element. Development 125, 1991–1998
20 Kim, C.B. et al. (2000) Hox cluster genomics in the horn shark, Heterodontus francisci. Proc. Natl. Acad. Sci. U. S. A. 97, 1655–1660
21 Kurokawa, R. et al. (1993) Differential orientations of the DNA-binding domain and carboxy-terminal dimerization interface regulate binding site selection by nuclear receptor heterodimers. Genes Dev. 7, 1423–1435
22 Rastinejad, F. et al. (1995) Structural determinants of nuclear receptor assembly on DNA direct repeats. Nature 375, 203–211
23 Escriva, H. et al. (1997) Ligand binding was acquired during evolution of nuclear receptors. Proc. Natl. Acad. Sci. U. S. A. 94, 6803–6808
24 Sluder, A.E. et al. (1999) The nuclear receptor superfamily has undergone extensive proliferation and diversification in nematodes. Genome Res. 9, 103–120
25 Cho, Y.S. et al. (2002) A genomic-scale view of the cAMP response element-enhancer decoy: a tumor target-based genetic tool. Proc. Natl. Acad. Sci. U. S. A. 99, 15626–15631
26 Wang, L.H. et al. (2003) The cis decoy against the estrogen response element suppresses breast cancer cells via target disrupting c-fos not mitogen-activated protein kinase activity. Cancer Res. 63, 2046–2051
27 Chang, B.E. et al. (1997) Axial (HNF3beta) and retinoic acid receptors are regulators of the zebrafish sonic hedgehog promoter. EMBO J. 16, 3955–3964
28 Zhang, F. et al. (2000) Murine hoxd4 expression in the CNS requires multiple elements including a retinoic acid response element. Mech. Dev. 96, 79–89
29 Valcarcel, R. et al. (1994) Retinoid-dependent in vitro transcription mediated by the RXR/RAR heterodimer. Genes Dev. 8, 3068–3079
30 Astrom, A. et al. (1994) Retinoic acid induction of human cellular retinoic acid-binding protein-II gene transcription is mediated by retinoic acid receptor-retinoid X receptor heterodimers bound to one far upstream retinoic acid-responsive element with 5-base pair spacing. J. Biol. Chem. 269, 22334–22339
31 Wingender, E. et al. (2001) The TRANSFAC system on gene expression regulation. Nucleic Acids Res. 29, 281–283

0168-9525/$ - see front matter © 2003 Elsevier Ltd. All rights reserved.
doi:10.1016/S0168-9525(03)00202-6

Scaling laws in the functional content of genomes

Erik van Nimwegen

Center for Studies in Physics and Biology, the Rockefeller University, 1230 York Avenue, New York, NY 12001, USA

With the number of sequenced genomes now totaling more than 100, and the availability of rough functional annotations for a substantial proportion of their genes, it has become possible to study the statistics of gene content across genomes. In this article I show that, for many high-level functional categories, the number of genes in each category scales as a power-law of the total number of genes in the genome.
The occurrence of such scaling laws can be explained using a simple theoretical model, and this model suggests that the exponents of the observed scaling laws correspond to universal constants of the evolutionary process. I discuss some consequences of these scaling laws for our understanding of organism design.

Corresponding author: Erik van Nimwegen (erik@golem.rockefeller.edu).
Update TRENDS in Genetics Vol.19 No.9 September 2003 479
http://tigs.trends.com

What fraction of the gene content of a genome is allotted to different functional tasks, and how does this depend on the complexity of the organism? Until recently, there were simply no data to address such questions in a quantitative way. At present, however, there are more than 100 sequenced genomes in public databases (http://www.ncbi.nlm.nih.gov/Genomes/index.htm), and protein-family classification algorithms allow functional annotations for a considerable fraction of the genes in each genome. Thus, it has become possible to analyze the statistics of functional gene-content across different genomes, and in this article I present results on the dependency of the number of genes in different high-level categories on the total number of genes in the genome.

Evaluating the functional gene-content of genomes

To estimate the number of genes in different functional categories, each genome has to be functionally annotated. The main results presented in this article were obtained using the InterPro [1] annotations of sequenced genomes available from the European Bioinformatics Institute [2]. To map the InterPro annotations to high-level functional categories I used the Gene Ontology (GO) biological process hierarchy [3] and a mapping from InterPro entries to GO categories, both of which can be obtained from the Gene Ontology website (http://www.geneontology.org). For each GO category I collected all InterPro entries that map to it or to one of its descendants in the biological process hierarchy.
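The collection step just described (every InterPro entry that maps to a category or to any of its descendants) can be sketched as a graph traversal. This is a minimal illustration, not the author's code: the dictionaries `children` and `interpro2go`, the function names, and the toy identifiers are all hypothetical stand-ins for the GO hierarchy and the InterPro-to-GO mapping files.

```python
from collections import defaultdict


def descendants(children, root):
    """Return root plus every category reachable from it through child links."""
    seen = {root}
    stack = [root]
    while stack:
        node = stack.pop()
        for child in children.get(node, ()):
            if child not in seen:  # the hierarchy is a DAG, so guard against revisits
                seen.add(child)
                stack.append(child)
    return seen


def entries_per_category(children, interpro2go):
    """Map each category to the InterPro entries assigned to it or to a descendant."""
    direct = defaultdict(set)
    for entry, categories in interpro2go.items():
        for category in categories:
            direct[category].add(entry)
    all_categories = set(children) | set(direct)
    for kids in children.values():
        all_categories.update(kids)
    return {
        category: set().union(*(direct.get(d, set()) for d in descendants(children, category)))
        for category in all_categories
    }


def high_level_categories(entries, min_entries=50):
    """Keep only categories represented by at least min_entries InterPro entries."""
    return {c: e for c, e in entries.items() if len(e) >= min_entries}


# Toy data (hypothetical identifiers, for illustration only).
children = {
    "metabolism": ["protein metabolism"],
    "protein metabolism": ["protein biosynthesis"],
}
interpro2go = {
    "ENTRY_A": ["protein biosynthesis"],
    "ENTRY_B": ["metabolism"],
    "ENTRY_C": ["protein metabolism"],
}
entries = entries_per_category(children, interpro2go)
```

With the toy data, "metabolism" inherits all three entries while "protein biosynthesis" keeps only its own; the threshold of 50 entries used in the article is exposed as the `min_entries` parameter.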
To minimize the effects of potential bias in the mappings from InterPro to GO I only used high-level functional categories that are represented by at least 50 different InterPro entries. This leaves 44 high-level GO categories.

A gene with multiple hits to InterPro entries that are associated with a GO category has a higher probability of belonging to that category than a gene with only a single hit. To take this information into account, I assumed that if a gene $i$ has $n_i^c$ independent hits to InterPro entries associated with GO category $c$, then with probability $1 - \exp(-\beta n_i^c)$ the gene belongs to this particular GO category. The results are generally insensitive to the value of $\beta$ (Box 1), and I used $\beta = 3$ for the results shown below. The estimated number of genes $n_c$ in a genome for a given GO category is then the sum $n_c = \sum_i [1 - \exp(-\beta n_i^c)]$ over all genes $i$ in the genome.

The results for the categories of transcription regulatory genes, metabolic genes and cell-cycle-related genes for bacteria and eukaryotes are shown in Fig. 1. Remarkably, for each functional category shown, we find an approximately power-law relationship (solid line fits)*. That is, if $n_c$ is the number of genes in the category, and $g$ is the total number of genes in the genome, we observe laws of the form $n_c = \lambda g^{\alpha}$, where both $\lambda$ and the exponent $\alpha$ depend on the category under investigation. In fact, such power-laws are observed for most of the 44 high-level categories, and the estimated values of the exponents for several functional categories are shown in Table 1. A potential source of bias in estimating these exponents is the occurrence of multiple genomes from the same or closely related species in the data. Removing this redundancy from the data does not alter the observed exponents (Box 1).

The power-law fits and 99% posterior probability estimates for their exponents were obtained using a Bayesian procedure described in Box 1.
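The estimate of the category size defined earlier in this section, in which a gene with n hits belongs to the category with probability 1 − exp(−βn), is simple enough to write out directly. The sketch below assumes the per-gene hit counts are already available as a list of integers; the function names are illustrative and not taken from the article.

```python
import math


def membership_probability(n_hits, beta=3.0):
    """Probability that a gene with n_hits independent hits belongs to the category."""
    return 1.0 - math.exp(-beta * n_hits)


def estimated_category_size(hits_per_gene, beta=3.0):
    """Expected number of genes in the category: sum of membership probabilities."""
    return sum(membership_probability(n, beta) for n in hits_per_gene)


# Hypothetical genome with five genes and their hit counts to one category.
hits = [0, 0, 1, 2, 1]
size_default = estimated_category_size(hits)           # beta = 3, as in the article
size_low_beta = estimated_category_size(hits, beta=1)  # the alternative tested in Box 1
```

Genes without hits contribute nothing, and with β = 3 a single hit already counts as almost a full gene, so lowering β mainly discounts single-hit genes.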
To assess the quality of the fits, I measured, for each fit, the fraction of the variance in the data that is explained by the fit. In bacteria, 26 out of 44 GO categories have more than 95% of the variance explained by the fit, and 38 categories have more than 90% explained by the fit. In eukaryotes, 26 categories have more than 95% explained by the fit and 32 categories have more than 90% explained by the fit.

However, with total gene number varying by less than a factor of 20 in bacteria, and the small number of data points in eukaryotes, one might wonder how one can claim that power-laws have been observed. First, the fact that I fit the data to power-laws should not be mistaken for a claim that the data can only be described by power-law functions. I only claim that the power-law is by far the simplest functional form that fits almost all the observed data. Second, when scatter plots such as those shown in Fig. 1 are plotted on linear as opposed to logarithmic scales, it is clear even by eye that the fluctuations in the number of genes in a category scale with the total number of genes in the genome. That is, the fluctuations in the data suggest that logarithmic scales are the natural scales for these data. This is further supported by the simple evolutionary model presented below.

One might also wonder to what extent the results are sensitive to the specific functional annotation procedure. I performed a variety of tests to assess the robustness of the results (i.e. the observed power-law scaling and the values of the exponents) to changes in the annotation methodology (Box 1). These involve using entirely independent annotations based on clusters of orthologous groups of proteins (COGs) [7], and a simple (crude) annotation scheme based on keyword searches of protein tables for sequenced microbial genomes from the NCBI website (ftp.ncbi.nlm.nih.gov/genomes/bacteria).
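The keyword-based scheme is specified in Box 1: a gene is counted in a category if any of that category's keywords occurs in its description line. A minimal sketch of that counting rule follows. The keywords are the ones listed in Box 1 for four of the categories; the case-insensitive matching and the example description lines are assumptions made here for illustration.

```python
# Keywords from Box 1 for a subset of the categories.
KEYWORDS = {
    "protein biosynthesis": ("ribosom", "translation", "tRNA"),
    "transcription": ("transcription",),
    "protein degradation": ("protease", "peptidase"),
    "kinase": ("kinase",),
}


def count_genes_by_keyword(descriptions, keywords=KEYWORDS):
    """Count, per category, the description lines containing any of its keywords."""
    counts = {category: 0 for category in keywords}
    for line in descriptions:
        lowered = line.lower()  # assumption: matching ignores case
        for category, words in keywords.items():
            if any(word.lower() in lowered for word in words):
                counts[category] += 1
    return counts


# Made-up description lines in the style of a protein table.
example_lines = [
    "50S ribosomal protein L2",
    "sensor histidine kinase",
    "ATP-dependent protease subunit",
    "hypothetical protein",
    "transcriptional regulator",
]
counts = count_genes_by_keyword(example_lines)
```

Because the match is a plain substring test, a line can count toward several categories at once and partial words such as "ribosom" catch "ribosome" and "ribosomal" alike, which is part of why the scheme is described as crude.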
As shown in Box 1, the observed power-law scaling and the values of the exponents are generally insensitive to these and other changes in annotation methodology. However, because all currently available annotation schemes assume at some level that functional homology can be inferred from sequence homology, our results implicitly depend on this assumption as well.

* These power-laws as a function of total gene number should be distinguished from the power-law distributions of gene-family sizes and other genomic attributes within a single genome [4,5].

Box 1. Appendix

Results for Archaea

Figure I shows the number of transcription regulatory genes, metabolic genes and cell-cycle-related genes as a function of the total number of genes in archaeal genomes. Note that the size of the largest archaeal genome differs from that of the smallest by only a factor of ~3. Consequently, there is large uncertainty regarding the values of the exponents for archaea. The maximum likelihood values and 99% posterior probability intervals are 2.1 and [1.3–5.7] for transcription regulatory genes, 0.81 and [0.44–1.48] for metabolic genes, and 0.83 and [0.54–1.35] for cell-cycle-related genes.

Fig. I. The number of transcription regulatory genes (red), metabolic genes (blue), and cell-cycle-related genes (green) as a function of the total number of genes in archaea. Both axes are shown on logarithmic scales. Each dot corresponds to a genome. The straight lines are power-law fits.
[Figure: Archaea; axes: Genes in genome, Genes in category.]

Power-law fitting

A power-law relation of the form $y = c x^{\alpha}$ translates into a linear relation between the logarithms of the variables $x$ and $y$ [i.e. $\log y = \log c + \alpha \log x$]. Thus, power-law fitting translates into fitting a straight line to the log-transformed data. The power-law fits shown in Fig. 1 of the main text were obtained using a Bayesian straight-line fitting procedure. For each gene ontology (GO) category, I log-transformed the data such that each data point $(x_i, y_i)$ corresponds to the logarithm $x_i$ of the number of genes in the genome and the logarithm $y_i$ of the number of genes in the category. I assumed that these transformed data were drawn from a linear model:

$y = \alpha (x + \eta) + \lambda + \epsilon$,   (Eqn I)

where $\eta$ and $\epsilon$ are noise terms in the x- and y-coordinates, respectively, and $\lambda$ is an unknown off-set. I assumed that the joint-distribution $P(\eta, \epsilon)$ is a two-dimensional Gaussian with means zero and unknown variances and co-variance. That is, I used scale invariant priors for the variances, and integrated these nuisance variables out of the likelihood. A uniform prior was used for the location parameter $\lambda$, and for the slope $\alpha$ I used a rotation invariant prior:

$P(\alpha)\,d\alpha = \frac{d\alpha}{(1 + \alpha^2)^{3/2}}$.   (Eqn II)

The use of these priors guarantees that the results are invariant under all shifts and rotations of the plane.

Integrating over all variables except for $\alpha$, we obtain for the posterior $P(\alpha|D)$ given the data $D$:

$P(\alpha|D)\,d\alpha = C\,\frac{(\alpha^2 + 1)^{(n-3)/2}\,d\alpha}{(\alpha^2 s_{xx} - 2\alpha s_{yx} + s_{yy})^{(n-1)/2}}$,   (Eqn III)

where $n$ is the number of genomes in the data, $s_{xx}$ is the variance in x-values (logarithms of the total gene numbers), $s_{yy}$ is the variance in y-values (logarithms of the number of genes in the category), $s_{yx}$ is the co-variance, and $C$ is a normalizing constant. The values of the exponents reported in Table I are the values of $\alpha$ that maximize $P(\alpha|D)$, and the boundaries of the 99% posterior probability interval around it.

For each fit, I also measured what fraction of the variance in the data is explained by the fit. That is, I compared the average distance $d$ of the points in the plane to their center of mass with the average distance $d_l$ of the points to the fitted line and defined the fraction $q = 1 - d_l/d$ as the variance in the data explained by the fit.

Robustness of the results

First, I checked that the total amount of available annotation information is not itself dependent on genome size. If the total amount of available annotation information were to vary with the number of genes in the genome, this could lead to biases in the estimated exponents. To exclude this possibility, I counted the total number of genes with any InterPro hits in each genome and found that, for both bacteria and eukaryotes, the fraction of genes in the genome with InterPro hits is approximately two-thirds, independent of the total number of genes in the genome. Consistent with this observation, when one fits power-laws to the number of genes $n_c$ in a GO category $c$ as a function of the total number of annotated genes in the genome, one finds exponents that are very close to those found for $n_c$ as a function of the total number of genes in the genome.

Second, I tested that the results are insensitive to the value of the parameter $\beta$. The default value $\beta = 3$ gives a gene with a single InterPro hit a probability of $1 - e^{-3} \approx 0.95$ to belong to the category. This is reasonable because InterPro is designed to only report statistically significant hits. To assess the effect of changing $\beta$, results for $\beta = 1$ were generated. For bacteria the change in fitted exponent is less than 5% for 26 of 44 categories, and less than 10% for 39 categories. For eukaryotes the exponent changes by less than 5% for all but three categories. In all cases, the change in fitted exponent is significantly smaller than the 99% posterior probability intervals associated with the exponents of the fits.

Third, I tested the robustness of the results against removal of potential redundancies in the data. For bacteria there are several examples where multiple genomes of the same species or genomes of very closely related species occur in the data, and one might suspect that these could bias the results. To this end, I parsed the names of all bacterial species into a general and a specific part; for example, for Escherichia coli, Escherichia is the general part and coli is the specific part, for Listeria innocua, Listeria is the general part and innocua is the specific part, and so on. Groups of genomes with the same general part were then collected together and for each group the gene numbers were replaced with a single average of total gene number and average gene counts in each of the functional categories. This reduces the size of the dataset by approximately a third. Power-law fitting was then applied to this reduced set and the fitted exponents were compared with those of the full dataset. The exponent was changed by less than 5% in 31 out of 44 categories. The largest observed change was an 18% change in the exponent. All changes to the exponents were well within their 99% posterior probability intervals.

More importantly, the results could depend on the use of InterPro, Gene Ontology, and the mapping of InterPro entries to GO categories. To test the robustness of the results to bias inherent in InterPro annotation and/or the mapping from InterPro to Gene Ontology, I analyzed selected functional categories using two other annotation schemes.

The first is based on clusters of orthologous groups of proteins (COGs) [7] annotation of 63 bacterial genomes that can be obtained from the NCBI database (ftp://ftp.ncbi.nih.gov/pub/COG/COG). In this dataset, proteins of the 63 bacterial genomes are assigned to COGs, and the COGs have been assigned to functional categories. I used these assignments to count the number of genes in different functional categories according to the COG annotation scheme. A comparison of the exponents for COG functional categories and the exponents for the closest GO categories obtained using the InterPro annotations is shown in Table I.

The second alternative annotation scheme I used is based on simple keyword searches of protein tables for fully sequenced bacterial genomes available from the NCBI ftp site (ftp.ncbi.nlm.nih.gov/genomes/Bacteria). Removing genomes for which little or no annotation exists, this leaves protein tables for 90 bacterial genomes. Each protein in these tables is annotated with a short description line. The number of genes in different functional categories was counted by searching each description line for hits to a set of keywords that characterize the category. For instance, I chose the keywords 'ribosom', 'translation' and 'tRNA' for the category protein biosynthesis, and a gene is counted as belonging to this category if any of these keywords occurs in its description line. For the other categories I used the following keywords: 'transcription' for the category transcription; 'transport', 'channel', 'efflux', 'pump', 'porin', 'export', 'permease', 'symport', 'transloca' and 'PTS' for the category transport; all combinations 'X Y' with X being one of 'ion', 'sodium', 'calcium', 'potassium', 'magnesium' and 'manganese' and Y being one of 'channel', 'efflux', 'transport' and 'uptake' for the category ion transport; 'protease' and 'peptidase' for the category protein degradation; 'kinase' for the category kinase; and finally the phrases 'DNA polymerase', 'topoisomerase', 'DNA gyrase', 'DNA ligase', 'replication', 'helicase', 'DNA primase', 'DNA repair', 'cell division' and 'septum' for the category cell cycle. The exponents resulting from this (crude) annotation scheme are also shown in Table I. There is good quantitative agreement between the exponents that are obtained with the different annotation schemes.

Table I. Estimated 99% posterior probability intervals for the scaling exponents obtained with three different annotation schemes (a)

Annotation   Category                                                        Exponent
GO           Protein biosynthesis                                            0.11–0.15
COG          Translation, ribosomal structure and biogenesis                 0.21–0.37
NCBI         Protein biosynthesis                                            0.09–0.15
GO           Signal transduction                                             1.55–1.90
COG          Signal transduction mechanisms                                  1.66–2.14
GO           Protein metabolism and modification                             0.68–0.80
GO           Protein degradation                                             0.89–1.06
COG          Posttranslational modification, protein turnover, chaperones    0.88–1.15
NCBI         Protein degradation                                             0.68–0.91
GO           Cell cycle                                                      0.39–0.54
GO           DNA repair                                                      0.52–1.14
COG          Replication, recombination and repair                           0.66–0.83
NCBI         Cell cycle                                                      0.45–0.64
GO           Ion transport                                                   1.15–1.70
COG          Inorganic ion transport and metabolism                          1.19–1.47
NCBI         Ion transport                                                   1.12–1.88
GO           Regulation of transcription                                     1.74–2.00
COG          Transcription regulation                                        1.69–2.36
NCBI         Transcription                                                   1.90–2.42
GO           Kinase                                                          0.96–1.16
NCBI         Kinase                                                          0.80–1.03
GO           Transport                                                       1.08–1.32
NCBI         Transport                                                       1.16–1.50

(a) Abbreviations: COG, clusters of orthologous groups of proteins; GO, gene ontology.

Observed exponents

Some functional categories, such as the large category of metabolic genes, occupy a roughly constant fraction of the gene content of a genome, as evidenced by their exponent of 1. However, many categories show significant deviations from this trivial exponent. Genes related to cell-cycle or protein biosynthesis have exponents significantly below 1, whereas for transcription factors (TFs) the exponent is significantly above 1. These trends are strongest in bacteria, where the exponent for TFs is almost 2, implying that as the number of genes in the genome doubles, the number of TFs quadruples. This has some interesting implications for regulatory design in bacteria. It implies that the number of TFs per gene grows in proportion to the size of the genome (see [6] for a similar observation). This in turn implies that, in larger genomes, each gene must be regulated by a larger number of TFs and/or each TF must be regulating a smaller set of genes. An exponent of 2 is also observed for two-component systems†, which are the primary means by which bacteria sense their environment. This suggests that the relative increase in transcription regulators in more complex bacteria is accompanied by an equal relative increase in sensory systems.

The difficulties with gene prediction and annotation in eukaryotes, the small number of available genomes, and our lack of understanding of the role of alternative splicing across eukaryotic genomes make it premature to draw many conclusions from Fig. 1b. However, the main trends from Fig. 1a are reproduced: the super-linear scaling of TFs, the sub-linear scaling of cell-cycle genes, and the small exponents for DNA replication and protein biosynthesis genes.

The observed super-linear scaling of TFs also has implications for our understanding of combinatorial control in transcription regulation. It is well established that in complex organisms, different TFs combine into complexes to affect transcription control. Therefore, a relatively small number of TFs can implement a combinatorially large number of different transcription regulatory states, which might correspond to particular external environments, developmental stages, tissues or combinations of external stimuli. Each such regulatory state will be associated with a unique set of genes that is expressed in that state. If the number of such regulatory states were proportional to the total number of genes, then the number of TFs would increase more slowly than the total number of genes. However, the scaling results show that, instead, the number of TFs increases more rapidly than does the total number of genes. This implies that the number of regulatory states is also combinatorial in the total number of genes: a relatively small number of genes is used in different combinations to implement combinatorially many regulatory states.

The picture that emerges is not one of TFs being used in different combinations to implement the regulatory needs of individual genes, but rather one in which, as one moves from simple to more complex organisms, the number of regulatory states grows so much faster than the total number of genes that, even with combinatorial control of transcription, the number of TFs grows much faster than the total number of genes.

† Note that two-component systems were not in the list of 44 high-level categories.

Evolutionary model

One of course wonders about the origins of these scaling laws in genome organization, and I would like to present some speculations in this regard. Assume that most changes in the number of genes $n_c$ in a functional category $c$ are caused by duplications and deletions. Then, $n_c(t)$ generally evolves according to the equation:

$\frac{dn_c(t)}{dt} = [\beta(t) - \delta(t)]\,n_c(t) = \rho(t)\,n_c(t)$,   (Eqn 1)

with $\beta(t)$ and $\delta(t)$, respectively, the duplication rate and deletion rate of the genes in this category at time $t$ in the evolutionary history of the genome. For simplicity of notation I have introduced the difference of duplication and deletion rates $\rho(t)$, which can be thought of as an effective duplication rate. This rate $\rho(t)$ is presumably proportional to the difference between the average probability that selection will favor fixation of a duplicated gene from this category and the average probability that selection will favor deletion of a gene from this category. Similarly, the total number of genes $g(t)$ obeys the equation:

$\frac{dg(t)}{dt} = \gamma(t)\,g(t)$,   (Eqn 2)

with $\gamma(t)$ the overall effective rate of gene duplication in the genome at time $t$ in its evolutionary history. When we solve for $n_c$ as a function of $g$ we find:

$n_c = \lambda\,g^{\langle\rho\rangle/\langle\gamma\rangle}$,   (Eqn 3)

where $\langle\rho\rangle$ and $\langle\gamma\rangle$ are the mean effective duplication rates of genes in category $c$ and the entire genome, respectively, averaged over the evolutionary history of the genome, and $\lambda$ is a constant that depends on the boundary conditions. In order for all bacterial genomes to obey the same functional relation, the constant $\lambda$ and the ratios $\langle\rho\rangle/\langle\gamma\rangle$ have to be the same for all bacterial evolutionary lineages.
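The step from Eqns 1 and 2 to Eqn 3 is a short integration. The worked derivation below makes it explicit; the symbols $T$ (elapsed evolutionary time) and $n_c(0)$, $g(0)$ (ancestral gene numbers) are notation introduced here for illustration and do not appear in the article.

```latex
% Divide Eqn 1 by n_c(t) and Eqn 2 by g(t), then integrate from t = 0 to t = T:
\ln\frac{n_c(T)}{n_c(0)} = \int_0^T \rho(t)\,dt = T\,\langle\rho\rangle ,
\qquad
\ln\frac{g(T)}{g(0)} = \int_0^T \gamma(t)\,dt = T\,\langle\gamma\rangle ,
% where the angle brackets denote time averages over the lineage's history:
\qquad
\langle\rho\rangle = \frac{1}{T}\int_0^T \rho(t)\,dt ,
\quad
\langle\gamma\rangle = \frac{1}{T}\int_0^T \gamma(t)\,dt .

% Eliminate T between the two relations:
\ln\frac{n_c(T)}{n_c(0)}
  = \frac{\langle\rho\rangle}{\langle\gamma\rangle}\,\ln\frac{g(T)}{g(0)}
\quad\Longrightarrow\quad
n_c = \lambda\, g^{\langle\rho\rangle/\langle\gamma\rangle},
\qquad
\lambda = n_c(0)\, g(0)^{-\langle\rho\rangle/\langle\gamma\rangle} .
```

Written this way, the exponent is the ratio of the two time-averaged rates, and the prefactor $\lambda$ is fixed by the ancestral gene numbers together with that ratio, which is the sense in which it depends on the boundary conditions.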
Because all life shares a common ancestor, the boundary conditions for Eqns 1 and 2 are trivially the same for all bacterial lineages, implying that the constant $\lambda$ is indeed the same for all bacterial lineages. In summary, simply assuming that changes in gene-number occur mostly through duplications and deletions implies our observed power-law scaling if the ratios $\langle\rho\rangle/\langle\gamma\rangle$ are the same for all evolutionary lineages.

I thus propose that the explanation for the observed scaling laws is that the ratios $\langle\rho\rangle/\langle\gamma\rangle$ are indeed the same for all bacterial lineages. That is, these ratios of average duplication rates are universal constants of the evolutionary process. For instance, the exponent 2 for TFs in bacteria indicates that, in all bacterial lineages, evolution selects duplicated TFs twice as frequently as duplicated genes in general. It seems likely that such universal constants are intimately connected to fundamental design principles of the evolutionary process. It is tempting to become even more speculative in this regard, and suggest that this factor of 2 in duplication rate is related to the switch-like function of transcription factors: with each addition of a TF the number of transcription regulatory states of the cell doubles. It is not entirely implausible to assume that with twice the number of internal states available, the probability of such a duplication being fixed in evolution is twice as large as the probability of fixing a duplicated gene that does not double the number of internal states of the cell.

Fig. 1. The number of transcription regulatory genes (red), metabolic genes (blue), and cell-cycle-related genes (green) as a function of the total number of genes in the genome for bacteria (a) and eukaryotes (b). Both axes are shown on logarithmic scales. Each dot corresponds to a genome. The straight lines are power-law fits. Archaea are not shown because the range of genome sizes in archaea is too small for meaningful fits. For completeness the archaeal results are shown in Fig. I in Box 1.
[Figure: panels (a) Bacteria and (b) Eukaryotes; axes: Genes in genome, Genes in category.]

Table 1. Estimates for the exponents of a selection of functional categories (a)

Category                   Bacteria       Eukaryotes
Transcription regulation   1.87 ± 0.13    1.26 ± 0.10
Metabolism                 1.01 ± 0.06    1.01 ± 0.08
Cell cycle                 0.47 ± 0.08    0.79 ± 0.16
Signal transduction        1.72 ± 0.18    1.48 ± 0.39
DNA repair                 0.64 ± 0.08    0.83 ± 0.31
DNA replication            0.43 ± 0.08    0.72 ± 0.23
Protein biosynthesis       0.13 ± 0.02    0.41 ± 0.15
Protein degradation        0.97 ± 0.09    0.90 ± 0.11
Ion transport              1.42 ± 0.28    1.43 ± 0.20
Catabolism                 0.88 ± 0.07    0.92 ± 0.08
Carbohydrate metabolism    1.01 ± 0.11    1.36 ± 0.36
Two-component systems      2.07 ± 0.21    NA (b)
Cell communication         1.81 ± 0.19    1.58 ± 0.34
Defense response           NA (b)         3.35 ± 1.41

(a) The first number gives the maximum likelihood estimate of the exponent and the second number indicates the boundaries of the 99% posterior probability interval.
(b) NA, not analyzed.

Fig. 2. The number of fully sequenced genomes in the NCBI database (http://www.ncbi.nlm.nih.gov/Genomes/index.htm) as a function of time. The vertical axis is shown on a logarithmic scale. The straight line is a least squares fit to an exponential function: $n(t) = 2^{(t - 1994)/1.3}$.
[Figure: Growth of the number of sequenced genomes; axes: Year, Number of genomes.]

Finally, as Table 1 shows, there is still substantial uncertainty about the exact numerical values of the exponents given the current data, and many more genomes are needed to estimate these values more accurately.
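The uncertainty in the exponents comes from the width of the posterior over the slope given as Eqn III in Box 1, namely P(α|D) ∝ (α² + 1)^((n−3)/2) / (α² s_xx − 2α s_yx + s_yy)^((n−1)/2). A rough numerical sketch of how a most probable exponent and a 99% interval can be read off that posterior is given below. Several choices here are assumptions rather than the author's procedure: the posterior is evaluated on a finite grid of slopes, the interval is taken as equal-tailed, and the data are synthetic.

```python
import math


def slope_posterior(xs, ys, alphas):
    """Normalized posterior weights of Eqn III evaluated on a grid of slopes."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs) / n
    syy = sum((y - mean_y) ** 2 for y in ys) / n
    syx = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    # Work with logarithms to avoid overflow; the quadratic form is positive
    # unless the points are exactly collinear, which this sketch does not handle.
    log_p = [
        0.5 * (n - 3) * math.log(a * a + 1.0)
        - 0.5 * (n - 1) * math.log(a * a * sxx - 2.0 * a * syx + syy)
        for a in alphas
    ]
    top = max(log_p)
    weights = [math.exp(value - top) for value in log_p]
    total = sum(weights)
    return [w / total for w in weights]


def summarize(alphas, weights, mass=0.99):
    """Return the most probable slope and an equal-tailed interval holding `mass`."""
    best = alphas[max(range(len(alphas)), key=weights.__getitem__)]
    tail = (1.0 - mass) / 2.0
    cumulative = 0.0
    low = high = None
    for a, w in zip(alphas, weights):
        cumulative += w
        if low is None and cumulative >= tail:
            low = a
        if high is None and cumulative >= 1.0 - tail:
            high = a
    return best, low, high


# Synthetic example: category size grows roughly as the square of genome size.
genome_sizes = [500, 1000, 2000, 4000, 8000]
noise = [1.10, 0.90, 1.05, 0.95, 1.00]
category_sizes = [1e-5 * g * g * f for g, f in zip(genome_sizes, noise)]
xs = [math.log(g) for g in genome_sizes]
ys = [math.log(c) for c in category_sizes]
alphas = [i / 1000 for i in range(1, 5001)]  # slopes from 0.001 to 5.000
best, low, high = summarize(alphas, slope_posterior(xs, ys, alphas))
```

With only five synthetic genomes the interval is wide relative to what a larger sample would give, which mirrors the point made in the text: the posterior sharpens as n grows because n enters the exponents of Eqn III.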
A survey of the NCBI genome database (http://www.ncbi.nlm.nih.gov/Genomes/index.htm) shows that the number of sequenced genomes is increasing exponentially, with a doubling time of ~16 months (Fig. 2). This suggests that within a few years thousands of genomes will become available. With such an increase in available data it will become possible to look at much more fine-grained gene-content statistics than those presented here. One can imagine, for instance, going beyond looking at single functional categories at a time, and investigating whether there are correlations in the variations of gene number in more fine-grained functional categories. I believe that such investigations have the potential to teach us much about the functional design principles of the evolutionary process.

References

1 Apweiler, R. et al. (2001) The InterPro database, an integrated documentation resource for protein families, domains and functional sites. Nucleic Acids Res. 29, 37–40
2 Apweiler, R. et al. (2001) Proteome Analysis Database: online application of InterPro and CluSTr for the functional classification of proteins in whole genomes. Nucleic Acids Res. 29, 44–48
3 The Gene Ontology Consortium (2000) Gene ontology: tool for the unification of biology. Nat. Genet. 25, 25–29
4 Huynen, M.A. and van Nimwegen, E. (1998) The frequency distribution of gene family sizes in complete genomes. Mol. Biol. Evol. 15, 583–589
5 Luscombe, N.M. et al. (2002) The dominance of the population by a selected few: power-law behaviour applies to a wide variety of genomic properties. Genome Biol. 3, RESEARCH0040
6 Stover, C.K. et al. (2000) Complete genome sequence of Pseudomonas aeruginosa PA01, an opportunistic pathogen. Nature 406, 959–964
7 Tatusov, R.L. et al. (1997) A genomic perspective on protein families. Science 278, 631–637

0168-9525/$ - see front matter © 2003 Elsevier Ltd. All rights reserved.
doi:10.1016/S0168-9525(03)00203-8

Errata

Erratum: FlyMove - a new way to look at development of Drosophila
Trends in Genetics 19 (2003), 310–311

In the article by Weigmann et al., which was published in the June issue of TIG, there was an error regarding the contribution of the authors. Thomas Strasser, Katrin Weigmann and Robert Klapper contributed equally to the article. TIG apologises to the authors and readers for this error. doi of original article: 10.1016/S0168-9525(03)00050-7.

0168-9525/$ - see front matter © 2003 Elsevier Ltd. All rights reserved.
doi:10.1016/S0168-9525(03)00198-7

Erratum: Control of protein degradation by E3 ubiquitin ligases in Drosophila eye development
Trends in Genetics 19 (2003), 382–389

In the article by Ou et al., which was published in the July issue of TIG, there was an error regarding the name of the corresponding author. Cheng-Ting Chien (ctchien@gate.sinica.edu.tw) is the corresponding author for this article. TIG apologises to the authors and readers for this production error. doi of original article: 10.1016/S0168-9525(03)00146-X.

0168-9525/$ - see front matter © 2003 Elsevier Ltd. All rights reserved.
doi:10.1016/S0168-9525(03)00199-9
