Symmetrical Structures in Genetic Texts of Prokaryotic DNA Replication Origins
N. I. Akberova and A. Yu. Leontiev
Kazan State University
Kazan, Lenin Street, 18
E-mail: natasha@charlie.ksu.ras.ru
http://www.netsci.org/Science/Bioinform/feature05.html
Abstract
A set of genetic texts involved in the initiation of prokaryotic DNA replication has been analyzed for the presence of symmetrical structures. Along with potential hairpins and direct and inverted repeats, purine-pyrimidine and amino-keto structures have been revealed in the replication origins. Among the symmetry structures, some correspond to protein-binding sites and transcription initiation sites. The subsets of sequences that share common symmetry types were found to represent phenotypically related groups of organisms. The significance of symmetrical structures for the classification of genetic texts and for pattern recognition has been assessed.
Introduction
The problem of recognizing functional sites, although formulated at least two decades ago, remains unsolved even for such well-studied sites as transcription promoters, and there is doubt whether it can be solved in principle.
One of the reasons is that there is no reliable tool (a kind of formal language) for describing what a functional site is. At present we can describe a functional site (here, the replication origin, ori) only by listing its known and theoretically expected properties; e.g., ori must contain binding sites for the proteins of the initiation complex, and if replication is shown to be bidirectional, ori should contain dyad symmetries. Such attempts have been undertaken (for reviews see [1-4]) but met with considerable difficulties. As pointed out in the cited works, one of the difficulties is the problem of obtaining a representative sample of sites. We believe that this problem is not a temporary one and cannot be solved by increasing the number of known sequences, because the success of any statistical approach (and all recognition methods suggested so far are statistical in nature) depends on knowing the form of the expected probability distribution for finding any particular string of symbols. For an evolving system such as a genetic text, it is very unlikely that this probability distribution can be found in analytical form.
Excluding the trivial way of describing functional sites by listing all the representatives of the sites under consideration, we are left with only one possibility: to find some feature of the genetic text that is context-independent and does not require any statistical calculations. We know of only one such characteristic of a genetic text: its symmetry [5]. In this work we consider, along with the usually considered direct and inverted repeats and dyad symmetries, the purine-pyrimidine and amino-keto symmetries (described in detail below).
The problem of recognizing functional sites is closely connected with the problem of classifying the sequences and the organisms from which these sequences were taken. The degree of relatedness of symmetrical subsequences may not coincide with the degree of relatedness of the host organisms, but where the two coincide, this indicates the importance of the subsequences under consideration for the performance of the site function.
In this work we report an assessment of the importance of symmetry structures for the problem of recognizing functional sites, namely prokaryotic DNA replication origins.
We used the common symmetry patterns found in replication origins to divide the sample of analyzed sequences into subclasses and constructed a tree of relatedness of the origins according to the criterion of the presence of symmetrical structures.
By symmetries we mean fragments of the text that are invariant with respect to spatial transformations and transformations of the DNA alphabet. Direct and inverted repeats are examples of pure spatial symmetry, while potential hairpins represent symmetry with respect to both a spatial transformation and the transformation of the DNA alphabet according to the rule {a <-> t, c <-> g}. In the text we call this transformation SW because it preserves the pattern of strong (c, g) and weak (a, t) hydrogen bonding. From the formal point of view of the theory of symmetry (group theory), this transformation is only one of the possible nontrivial transformations of the genetic text alphabet. We also take into consideration two others: {a <-> g, c <-> t} and {a <-> c, g <-> t}. We call the first of these the RY transformation because it does not affect the purine-pyrimidine pattern of the genetic text, and the second the KM transformation because it retains the keto-amino pattern. From the point of view of group theory, these three alphabet transformations together with the trivial one (a -> a, etc.) form a four-group, and from the biochemical point of view they are connected with the most important properties of nucleotides. So the choice of these transformations is not arbitrary.
Together with the two spatial transformations, the trivial transformation and these three so-called "colored" transformations yield 8 symmetry types, examples of which are shown in Table 1.
Table 1
The Symmetry Structures' Nomenclature in Genetic Text
| Name | Direct Repeats | Inverted Repeats |
|---|---|---|
| Common | aagct...aagct | aagct...tcgaa |
| Complementary | aagct...ttcga | aagct...agctt |
| RY | aagct...ggatc | aagct...ctagg |
| KM | aagct...cctag | aagct...gatcc |
The pentanucleotide "aagct" is an example of a sequence without inner symmetry.
The 8 symmetry types presented are generated by combining the 4 colored transformations with the 2 spatial transformations. The four colored transformations form a true four-group.
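The construction of the eight symmetry types can be reproduced mechanically. The following Python sketch is illustrative only and is not taken from the original work; the names `COLOR`, `transform`, `symmetry_table` and `is_four_group` are assumptions introduced here. It applies each alphabet transformation to a word, optionally reverses the result, and checks that the four alphabet transformations are closed under composition and are each their own inverse.

```python
# Alphabet ("colored") transformations as translation tables.
# The dictionary keys follow the row names of Table 1.
COLOR = {
    "common": str.maketrans("acgt", "acgt"),         # trivial: a -> a, etc.
    "complementary": str.maketrans("acgt", "tgca"),  # SW: a <-> t, c <-> g
    "RY": str.maketrans("acgt", "gtac"),             # a <-> g, c <-> t
    "KM": str.maketrans("acgt", "catg"),             # a <-> c, g <-> t
}


def transform(word, color, inverted=False):
    """Apply one alphabet transformation, then optionally reverse the word."""
    image = word.translate(COLOR[color])
    return image[::-1] if inverted else image


def symmetry_table(word):
    """Return {(color, kind): partner word} for all eight symmetry types."""
    return {
        (color, kind): transform(word, color, inverted=(kind == "inverted"))
        for color in COLOR
        for kind in ("direct", "inverted")
    }


def is_four_group():
    """Check closure under composition and that each map is an involution."""
    images = {name: "acgt".translate(table) for name, table in COLOR.items()}
    closed = all(
        images[a].translate(COLOR[b]) in images.values()
        for a in COLOR
        for b in COLOR
    )
    involutions = all(
        images[a].translate(COLOR[a]) == "acgt" for a in COLOR
    )
    return closed and involutions
```

For the word "aagct", `symmetry_table` returns the eight partner words listed in Table 1, for example "ttcga" for the direct complementary repeat and "gatcc" for the inverted KM repeat.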
The biomedical significance of common and complementary structures is well known, while that of RY-structures is under discussion. The significance of KM-structures is discussed in this work.
We searched for such structures in a sample of sequences using a simple algorithm and retained for further analysis only those found in two or more sequences.
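The search algorithm itself is not described in this text. One possible brute-force sketch in Python is given below under stated assumptions: a structure is taken to be a word of fixed length `k` followed later in the same sequence, without overlap, by its partner under one of the eight symmetry types, and the function names `structures_in` and `shared_structures` are hypothetical.

```python
from collections import defaultdict

# The four alphabet transformations (row names as in Table 1).
COLOR = {
    "common": str.maketrans("acgt", "acgt"),
    "complementary": str.maketrans("acgt", "tgca"),
    "RY": str.maketrans("acgt", "gtac"),
    "KM": str.maketrans("acgt", "catg"),
}


def partner(word, color, inverted):
    """Image of a word under one of the eight symmetry types."""
    image = word.translate(COLOR[color])
    return image[::-1] if inverted else image


def structures_in(seq, k):
    """Return the set of (color, kind, word) structures of word length k."""
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    found = set()
    for word, starts in positions.items():
        for color in COLOR:
            for inverted in (False, True):
                mate = partner(word, color, inverted)
                # .get avoids adding keys to the dict while iterating over it
                mate_starts = positions.get(mate, [])
                # the partner must start after the end of the word
                if any(j >= i + k for i in starts for j in mate_starts):
                    kind = "inverted" if inverted else "direct"
                    found.add((color, kind, word))
    return found


def shared_structures(seqs, k, min_seqs=2):
    """Keep only structures present in at least min_seqs named sequences."""
    owners = defaultdict(set)
    for name, seq in seqs.items():
        for structure in structures_in(seq, k):
            owners[structure].add(name)
    return {s: names for s, names in owners.items() if len(names) >= min_seqs}
```

With the toy input `{"x": "aagctccttcga", "y": "ggaagctttcga"}` and `k=5`, the direct complementary structure built on "aagct" is reported as shared by both sequences. Real origins would require scanning a range of word lengths, which this sketch leaves to the caller.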
Results
In this work we report the results obtained for a set of prokaryotic replication origins (Table 2). Although EMBL contains 17 sequences with a replication initiation function, we took only these 13 for analysis because the sequences from different species of Shigella were identical.
Table 2
The List of Studied Sequences
| No. | Organism | GenBank Index | Length (bp) | Designation |
|---|---|---|---|---|
| 1 | Bacillus subtilis | v01490 | 486 | bs |
| 2 | Escherichia coli | v00308 | 555 | ec |
| 3 | Enterobacter aerogenes | j01576 | 556 | ea |
| 4 | Klebsiella pneumoniae | j01744 | 554 | kp |
| 5 | Erwinia carotovora | v00255 | 696 | er |
| 6 | Pseudomonas aeruginosa | m30125 | 651 | pa |
| 7 | Pseudomonas putida | m30126 | 651 | pp |
| 8 | Shigella dysenteriae | x67657 | 450 | sd |
| 9 | Salmonella typhimurium | j01808 | 552 | st |
| 10 | Streptomyces coelicolor | m82836 | 921 | sc |
| 11 | Vibrio harveyi | k00829 | 277 | vh |
| 12 | Caulobacter crescentus | s43898 | 998 | cc |
| 13 | plasmid R751 | pAR757 | 870 | pl |
Different types of symmetries are distributed in replication initiation sites in different ways. Some types of symmetries were entirely absent, while others cover almost the whole length of the experimentally detected replication origin. The purine-pyrimidine and amino-keto symmetrical structures are represented by fragments that can be detected in protein binding sites as well as in regions of unknown function, whereas in eukaryotic genomes such structures are randomly distributed along the DNA sequence. We call patterns in which certain symmetry structures are found at identical distances in each sequence symmetry patterns (SP). These SP were used to group sequences from different organisms and to describe the sites with a replication initiation function (see Fig. 1, Fig. 2, Fig. 3).
At the stage of the analysis when only the presence of certain structures, and not the distances between them, was taken into account, the whole set of sequences fell into four subsets: { ec, ea, kp, er, sd, st, sc, vh, pl }, { pa, pp }, { cc } and { bs } (the sequences are named according to the designations in Table 2). A more detailed classification became possible when RY and KM symmetries were taken into consideration. The sequences ec, ea, kp, er, sd, st, sc, vh and pl share a complementary repeat that is the binding site of the DnaA protein. The sequences { sc } and { pl } drop out of this group because in them the repeat is located at distances of 389 and 116 bases, respectively, whereas in the group { ec, ea, kp, er, sd, st, vh } it is located at a distance of 182 bases. The sequence { pl }, belonging to a plasmid that is viable in a wide spectrum of hosts, contains fragments of symmetry structures inherent to almost all the natural sequences investigated, but at different distances. The group { ec, ea, kp, er, sd, st, vh } has an SP consisting of one complementary repeat, ttatccaca, which represents the binding site of the DnaA protein. When { vh } is excluded from this group, the SP is enriched by a second complementary repeat, agatct, which by virtue of its internal symmetry is simultaneously a direct common repeat (Fig. 1). When { er } is removed from the remaining group of sequences, the SP is substantially enriched by structures that include symmetries of the RY and KM types (Fig. 2). Excluding any of the five sequences from the group { ec, ea, kp, sd, st } results neither in a significant increase in the total length of the SP nor in its enrichment by new types of symmetry.
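The splitting of a group by repeat distance can be illustrated with a short Python sketch. It assumes, which the text does not state explicitly, that "distance" means the offset between the start of a word and the start of its partner further along the same sequence; the function names and the toy sequences are hypothetical and are not the real origins.

```python
from collections import defaultdict


def spacing(seq, word, mate):
    """Offset from the first `word` to the next non-overlapping `mate`.

    Returns None if either half of the repeat is missing.
    """
    i = seq.find(word)
    if i < 0:
        return None
    j = seq.find(mate, i + len(word))
    return None if j < 0 else j - i


def group_by_spacing(seqs, word, mate):
    """Group sequence names by the spacing of one repeat."""
    groups = defaultdict(list)
    for name, seq in seqs.items():
        groups[spacing(seq, word, mate)].append(name)
    return dict(groups)


# Toy sequences built around ttatccaca and its inverted complement.
toy = {
    "s1": "ttatccaca" + "ggg" + "tgtggataa",
    "s2": "cc" + "ttatccaca" + "ggg" + "tgtggataa",
    "s3": "ttatccaca" + "g" * 10 + "tgtggataa",
    "s4": "gggg",
}
groups = group_by_spacing(toy, "ttatccaca", "tgtggataa")
```

Here `s1` and `s2` fall into one group because the spacing is the same in both, `s3` forms its own group, and `s4` lacks the repeat altogether. This mirrors the way sequences with the same repeat at a different distance drop out of a group.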
Of the eight types of symmetry structures, only inverted purine-pyrimidine and direct amino-keto repeats are absent from the SP of the { ec, ea, kp, sd, st } group. Some symmetries in the SP are sites of interaction with proteins participating in the initiation of prokaryotic DNA replication, such as the DnaA protein and DNase I. The SP of the { ec, ea, kp, sd, st } group contains two DnaA protein binding sites; the distance between them is strictly fixed at 180 bases, which may be due to the bidirectional nature of the replication process. The recognition site of DNase I was revealed in the same SP. The site agatct, an inverted complementary repeat, is the recognition site of the restriction endonucleases BglII and NspMAC. Since this repeat is present in the SP at a fixed distance, it is possible to assume that a restriction endonuclease may play an essential role in the process of replication initiation.
The sequences { pa, pp } are grouped by repeats of the same six types of symmetry as the group { ec, ea, kp, sd, st }, and the distances between repeats are in most cases also strictly fixed (Fig. 3). The symmetry pattern of the two sequences from Pseudomonas consists of two blocks. One block comprises DnaA protein binding sites together with common direct and inverted complementary repeats located at fixed distances from them. The second block is located 211-282 nucleotides from the first; it contains various symmetry structures whose function is unknown but the distances between which are strictly fixed. Such a pattern structure may reflect the simultaneous participation of two proteins (or protein complexes) in the initiation of DNA replication. The sequences { pa, pp } are also grouped by the presence of the inverted complementary repeats tttccaacc, gatatcc and ccgtgt.
It is possible to introduce a measure of "SP efficiency" for the problems of classification and recognition: the ratio of the number of nucleotides in the symmetries of the SP to the total length of the sequence covered by the pattern. For example, for the pattern shown in Fig. 2 this measure is 30 / 247. Thus, only about 1 / 8 of ori for the group { ec, ea, kp, er, sd, st } is sufficient for detecting this site in a genome and for distinguishing the family Enterobacteriaceae from all the other sequences of the studied set.
We consider the symmetries of the SP to be the more conserved part of replication origins in prokaryotes and think that they are very important for the replication initiation function. The type and length of specific purine-pyrimidine and amino-keto repeats can be used as a distance measure in studies of the evolution of DNA sequences.
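The arithmetic behind this measure can be written out in a few lines of Python. The function name `sp_efficiency` is an assumption introduced for illustration; the two numbers are the ones quoted above.

```python
from fractions import Fraction


def sp_efficiency(symmetry_nucleotides, pattern_span):
    """Ratio of nucleotides inside symmetries to the span covered by the SP."""
    if pattern_span <= 0:
        raise ValueError("pattern_span must be positive")
    return Fraction(symmetry_nucleotides, pattern_span)


# Values quoted in the text: 30 nucleotides in symmetries over 247 bases.
efficiency = sp_efficiency(30, 247)
as_decimal = float(efficiency)  # roughly 0.12, i.e. close to 1/8
```

Keeping the value as an exact fraction makes it easy to compare patterns from different groups without rounding.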
To evaluate the significance of the symmetries for the recognition of DNA replication origins, we searched all DNA sequences belonging to E. coli in the EMBL database for the symmetries found in the E. coli ori and found that these symmetries are present together at only one site, the DNA replication origin. We also assessed the validity of symmetry structures for solving the problem of classifying related sequences. The set of studied sequences was divided into subsets by the criterion of the presence of certain symmetry structures, and the subsets were organized into a tree of relatedness (Fig. 4). The sequences from Bacillus subtilis (the group of Gram-positive endospore-forming rods) and Caulobacter crescentus (Gram-negative prosthecate bacteria) differ very strongly from the other sequences (Gram-negative aerobic and anaerobic bacteria and an actinomycete) and are not similar to each other. From the group of Gram-negative anaerobic bacteria, the two sequences from Pseudomonas (aerobic) and the sequence from Streptomyces (an actinomycete) are excluded. The remaining group, when other symmetry structures are taken into account, is divided into two subgroups belonging to the families Enterobacteriaceae and Vibrionaceae. The group of sequences from the family Enterobacteriaceae is in turn divided into groups from the tribe Erwinieae (Erwinia), the tribe Escherichieae and the tribe Klebsielleae. The relationship tree so defined coincides with the taxonomic one.
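A database scan of this kind can be sketched in Python as a search for all positions where every element of a pattern occurs at its fixed relative offset. The representation of an SP as a list of (offset, word) pairs, the function name `find_pattern`, and the offsets in the toy example are assumptions made for illustration and are not taken from the figures.

```python
def find_pattern(seq, pattern):
    """Return every start position where all words sit at their offsets.

    pattern is a list of (offset, word) pairs; offsets are measured
    from the start of the pattern.
    """
    span = max(offset + len(word) for offset, word in pattern)
    return [
        start
        for start in range(len(seq) - span + 1)
        if all(seq.startswith(word, start + offset) for offset, word in pattern)
    ]


# Toy pattern: two words from the text placed at a hypothetical spacing.
toy_pattern = [(0, "ttatccaca"), (12, "agatct")]
toy_genome = "cc" + "ttatccaca" + "ggg" + "agatct" + "aaaa" + "agatct"
hits = find_pattern(toy_genome, toy_pattern)
```

In the toy genome the word "agatct" occurs twice, but only one position satisfies the whole pattern, which is the sense in which a pattern with fixed distances is more selective than its individual elements.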
Summary
The analyzed DNA replication origins contain common symmetry patterns that constitute an essential part of the functional core of replication origins, because they include binding sites of DNase and the DnaA protein and their sequences are identical in closely related organisms.
The degree of relatedness of organisms based on the criterion of the presence of symmetrical structures in their replication origins coincides with the phenotypically defined taxa, which can be considered an additional indication of the importance of symmetrical structures for performing the replication initiation function. The symmetry structures of different types are represented in replication origins with significantly different frequencies: the most frequent are direct and inverted common repeats and inverted complementary repeats, while KM- and RY-structures are usually shorter and cover only a few percent of the origin sequences.
The recognition site of DNase I has the structure of an ideal KM inverted repeat, which may indicate the significance of such structures for DNA-protein interactions.
The main result of our work may be summarized as follows: taking into account the RY and KM symmetrical structures, along with the symmetries usually considered when describing the structure of functional sites, makes the description more adequate and the recognition procedures more effective.
References
1. Trifonov E.N., Brendel V. Gnomic, Balaban Publishers, 1986
2. Gelfand M.S., The methods of statistical analysis of functional sites and their application, Moscow, 1989
3. Borodovsky M., Pevzner P. in Computer aided analysis of genetic text, Nauka Publishers, 1990, p.36
4. Alexandrov A., Kalambet Yu. in Computer aided analysis of genetic text, Nauka Publishers, 1990, p.113
5. Leontyev A.Yu. Symmetry of single chain DNA molecules. Biophysics, 1992, 37(5), pp.771-774.
Figure 1
Symmetry Pattern for {ec, ea, kp, er, sd, st} Group

The sequences from six microorganisms are grouped by the presence of certain symmetry structures that are found at identical distances in each of these sequences. We call such a pattern, which in this particular case contains the DnaA protein binding site, a symmetry pattern (SP).
Figure 2
The Part of Symmetry Pattern for {ec, ea, kp, sd, st} Group

For the {ec, ea, kp, sd, st} group of sequences, the SP is enriched by KM- and RY-structures in comparison with the {ec, ea, kp, sd, st, er} group. The KM-type structures are represented by the DNase I recognition site and another structure of undefined function.
Figure 3
The Relationship Tree of Studied Sequences Constructed Using the Criterion of Symmetry Structures' Presence
- bs, ec, ea, kp, er, pa, pp, sd, st, sc, vh, cc, pl
  - f. Bacillaceae, Bacillus: bs
  - f. I Pseudomonadaceae, Pseudomonas: pa, pp
  - ec, ea, kp, er, sd, st, sc, vh, pl
    - ec, ea, kp, er, sd, st, vh
      - f. Enterobacteriaceae: ec, ea, kp, er, sd, st
        - T. I Escherichieae and T. II Klebsielleae: ec, ea, kp, sd, st
          - T. I Escherichieae
            - Escherichia, Shigella: ec, sd
            - Salmonella: st
          - T. II Klebsielleae, Klebsiella, Enterobacter: ea, kp
        - T. V Erwinieae, Erwinia: er
      - f. Vibrionaceae, Vibrio: vh
    - pl
    - f. Streptomycetaceae, Streptomyces: sc
  - Caulobacter: cc
NetSci, ISSN 1092-7360, is published by Network Science Corporation.