New Machine Learning Technique for Analysis and Prediction of Sequence and Structure Features: Protein Secondary Structure Prediction

Victor B. Strelets

Computational Genetics & Biophysics
Supercomputer Computations Research Institute
Florida State University
Tallahassee, FL 32306-4052, USA
Tel: +1-904-644-6563
Fax: +1-904-644-0098
E-mail: strelets@scri.fsu.edu

http://www.netsci.org/Science/Bioinform/feature03.html

Abstract

A new machine learning technique for the prediction of sequence or structural features was developed. The algorithm allows automatic extraction and refinement of sequence or structure organization rules as feature-related patterns. The algorithm does not use data from multiple alignments and, in comparison with neural network technologies, does not implement feedback in learning. The method represents a new approach to machine learning on biological sequences and structural information, relying on weight matrix and profile methods combined with a pattern-based classification technique. When applied to the prediction of protein secondary structure, it provides a mechanism for revealing prediction failure cases which can arise from the absence of particular folding themes in the available database. Screening of these cases allows prediction of secondary structure with high accuracy, but at the cost of rejecting some query proteins. For more than 60% of novel structures submitted recently to the PDB, as well as for 40% of those without homologs in the current database, a mean accuracy of more than 90% in the three-state prediction (helix, beta, coil/other) was achieved. Such a prediction with detection/screening of failure cases is fully equivalent to the prediction itself, and its structure output demonstrates the same mean per-residue accuracy of about 90%. Other possible applications of the algorithm are discussed.

Keywords: machine learning, sequence analysis, sequence patterns, secondary structure prediction

Introduction

One of the important problems on the way to the prediction of protein tertiary structure from the primary sequence is secondary structure (or local folding) prediction. This problem is now regarded as one of the major obstacles in the process of elucidating protein structure [1]. At present, the most accurate prediction of secondary structure relies on the use of homology-derived information from proteins of known structure [2]. However, for more than 80% of novel protein sequences, no homologous structures are available [3], and structural information must come from analysis of the sequence alone.

A large class of methods is based upon the incorporation of data from multiple alignments with homologous sequences [4-6]. These methods [7] usually incorporate a consensus prediction that yields per-residue accuracy of up to 70%. Unfortunately, for many novel proteins no homologous sequences are available in databases. In addition, it has been shown that the best accuracy for secondary structure prediction from multiple alignments cannot exceed 80-85%, because the secondary structure of particular sequences deviates correspondingly from the consensus variants [8,9]. Therefore, the secondary structure prediction problem must usually be resolved in the absence of homologous sequences.

Starting from the first successful method of Chou-Fasman [10], a variety of available algorithms have used secondary structure propensities of single amino acids [10] and of their combinations [11]. Such propensities were derived from statistical evaluations of the occurrence of particular residues in known secondary structure elements of proteins. Predictions based on such propensities do not take into account the complex, context-dependent manner of secondary structure formation [1] and appear to have reached their maximum accuracy of approximately 60%.

The most powerful current prediction methods implement a neural network approach to revealing sequence-structure dependence by learning from examples of known structure [6, 12-15]. These methods allow prediction of secondary structure with up to 70% accuracy. Combined with evolutionary information derived from multiple alignments of homologs, they demonstrate the highest known mean prediction accuracy (72%) and are superior to other algorithms. But these methods are highly learning-dependent, and their performance on novel proteins (i.e., those without homologs in the learning set) is significantly lower.

With regard to the understanding of local folding, the most promising methods are designed to assess sequence-structure association rules, patterns or motifs [16,17]. The learning problem can be described as follows: given a sequence of residues from a fixed length window of a protein chain, classify the (central) residue in the window as having a particular secondary structure type [13]. These algorithms usually implement the nearest-neighbor approach, where the secondary structure propensity of the residue in a sequence window of some predefined length depends on the occurrence of the different secondary structure types in different positions of the window, accumulated over all instances of similar sequence contexts in the learning set [15]. Nearest-neighbor methods combined with neural network technologies yield an accuracy of 65-71% [13,15,18]. Many sequence-structure associations with high intrinsic predictive power were found in these studies, some of which turn out to be correct 78% of the time even when applied individually to proteins outside of the learning set [19]. Yet a relatively weak correlation between the predictive power of these individual sequence patterns and overall prediction accuracy was reported [19]. One of the main limitations of these algorithms is that they rely heavily on the local sequence similarity identified during the accumulation of the pattern/motif statistics. This leads to problems such as overtraining and a dependence on the presence of similar patterns in learning set examples.

It appears that all methods described above exhibit similar accuracies when applied to proteins having similarity to the examples in learning sets (near 70%) or to novel structures (60-65%). This suggests a largely fundamental barrier in the methodology of data accumulation and analysis. The goals of the work described here are (1) reconsideration of the general algorithmic aspects of the analysis of sequence-structure associations and (2) creation of a more powerful technique for data accumulation, analysis and motif discovery which is free of some shortcomings of the previous methods.

Methods

To avoid an excessively detailed description of the data available in the learning set (like overtraining in neural nets) without necessarily revealing biologically significant sequence-structure associations (referred to hereafter as rules), feedback techniques were not used and the learning stage of our method was limited to the accumulation of sequence-structure statistics. As a starting approximation in developing the new method, we used the algorithm for the discovery of rules described by Rooman and Wodak [16,17,19]. A brief description of this algorithm (referred to hereafter as RW) follows:

Sequence patterns consist of a number of consecutive positions along the polypeptide chain and are referred to as patterns of length L. Of these L positions, only a certain number Nact (Nact < L) are specified. Specified positions may be occupied either by one of the 20 naturally occurring amino acids, or by an amino acid property (hydrophobicity, polarity, etc.). Patterns with amino acids and/or properties may match identical positions in the sequence, thus providing redundant physical information. Patterns with high intrinsic predictive power are those that indicate the presence of the same secondary structure assignment (structure motif) in most occurrences of the sequence pattern in the database, at the same position in the sequence relative to it. The structure motifs need not necessarily lie within the sequence patterns, but may reside in neighboring regions along the polypeptide chain. A sequence pattern is retained if it occurs at least Mmin times in the database and if it is associated with the same structural motif in at least Mp (in %) of its occurrences. For the prediction of structure motifs, the protein sequence is tested for matches against the library of patterns, for all possible placements of each pattern in the sequence. If a match occurs, the structure motif associated with the pattern is predicted. Because pattern matches may overlap on particular parts of the sequence, information from the predicted structure motifs is accumulated additively in the form of three-state (helix, beta, coil/other) profiles, with weights assigned to the particular structure motifs. After all database patterns are tested against the protein sequence, a final prediction of the secondary structure is made by choosing the structure type with the highest weight for every residue in the sequence.
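The RW prediction step described above can be sketched in a few lines of Python. This is a hypothetical illustration, not the authors' implementation: the encodings of patterns (a dict mapping window offsets to required amino acids) and motifs (a dict mapping offsets to predicted structure types) are invented for clarity.

```python
def predict_rw(sequence, rules):
    """Accumulate weighted three-state profiles from pattern matches.

    rules: list of (pattern, motif, weight), where pattern is a dict
    {offset: amino_acid} and motif is a dict {offset: structure} with
    structure in {'H', 'E', 'C'} (helix, beta, coil/other).
    """
    n = len(sequence)
    profiles = {s: [0.0] * n for s in 'HEC'}
    for pattern, motif, weight in rules:
        span = max(list(pattern) + list(motif)) + 1
        for start in range(n - span + 1):
            # test this placement of the pattern against the sequence
            if all(sequence[start + off] == aa for off, aa in pattern.items()):
                # add the rule's weight at the motif's relative positions
                for off, struct in motif.items():
                    profiles[struct][start + off] += weight
    # final prediction: the structure type with the highest weight per residue
    # (ties fall back to 'H' simply because it is tried first here)
    return ''.join(max('HEC', key=lambda s: profiles[s][i]) for i in range(n))

# two toy rules: [A..L] predicts helix inside the match, [G] predicts coil
rules = [({0: 'A', 3: 'L'}, {1: 'H', 2: 'H'}, 1.0),
         ({0: 'G'}, {0: 'C'}, 2.0)]
print(predict_rw('AXXLG', rules))
```

The additive accumulation over overlapping matches is the essential point: each residue's final assignment is a vote among all rules whose motifs cover it.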

Rather than selecting associations with reasonably high predictive power (i.e., a high probability of obtaining the same structure motif at matches with the sequence pattern), a search was made for universal near-Boolean rules which demonstrate stable sequence-structure associations in almost all occurrences in the database. If revealed, such rules may correspond to (elements of) the stereochemical code anticipated since the beginnings of structure prediction. Instead of an "absolute" Mp = 100%, we used Mp = 95% to permit some structure-independent deviations such as secondary structure mapping errors [9] and side effects arising from the sequence termini during collection of the pattern statistics.

The RW algorithm shares a shortcoming with other algorithms: it relies on the repeated occurrence of a pattern in database sequences, which may be caused by the presence of homologs rather than by the structural importance of the corresponding rules. To resolve this problem, most of the cited methods perform learning on a limited representative set [20] of structures containing no homologous proteins. This limits the amount of data available for learning [19]. The algorithm described here is constructed so as to (1) reveal (and delete from the final database) any rules inferred from local sequence homology, and (2) learn on all currently available structures without regard to representativeness.

For the construction of the sequence patterns (referred to hereafter as "Rule Initiators" or RI), we used the 20 natural amino acids and a set of amino acid properties including hydrophobicity, charge, polarity and bulkiness (amino acids which possess the corresponding properties were defined as the members of the strings "AVLICMFYWHKG", "DEHRK", "YWDENQHSRKBZ" and "VILMFYWEQHRKZ", respectively). In contrast to the RW algorithm, an RI was allowed to contain a mixture of amino acids and amino acid properties. In addition, all amino acid properties were allowed to appear in an RI in two states (presence or absence of the property), extending our set of properties from four to eight.
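A minimal sketch of RI matching with mixed amino-acid and property positions follows. The four property strings are quoted from the text; the RI encoding itself (a dict from window position to a specification, with a `('not', property)` tuple for the negated state) is a hypothetical illustration.

```python
# property membership strings as given in the text
PROPERTIES = {
    'hydrophobic': set('AVLICMFYWHKG'),
    'charged':     set('DEHRK'),
    'polar':       set('YWDENQHSRKBZ'),
    'bulky':       set('VILMFYWEQHRKZ'),
}

def position_matches(residue, spec):
    """spec is a single amino acid, a property name, or ('not', property)."""
    if isinstance(spec, tuple):          # absence of a property (negated state)
        return residue not in PROPERTIES[spec[1]]
    if spec in PROPERTIES:               # presence of a property
        return residue in PROPERTIES[spec]
    return residue == spec               # a concrete amino acid

def ri_matches(window, ri):
    """ri: dict {position: spec}; unspecified positions match anything."""
    return all(position_matches(window[pos], spec) for pos, spec in ri.items())

# an RI of length 10 with Nact = 3 specified positions, mixing an amino
# acid, a property, and a negated property
ri = {0: 'A', 4: 'hydrophobic', 8: ('not', 'charged')}
print(ri_matches('AXXXLXXXGX', ri))
```

Allowing each property in two states is what extends the effective alphabet at a specified position from 24 symbols (20 amino acids + 4 properties) to 28.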

One of the important features of the new algorithm is the representation used for the description of the structure motifs. Generally speaking, previous attempts to reveal rules were based solely on the hypothesis that there should be some patterns with favorable stereochemical positional combinations of amino acids and/or amino acid properties. Such combinations were supposed to have lower conformational energy in local interactions and were therefore expected to occur more frequently in the database. Because of the conformational nature of this preference, such patterns should have some frequently associated structure motifs. This approach is based on the obvious "favor/optimality/allowance" strategy. But taking into account all available information about protein folding and some well-known aspects of folding modeling, it appears that this strategy reflects only one of the possible ways of determining local structure: specifically, an analog of local conformational energy minimization with fixation of optimal variants. Perhaps more efficient and frequent, however, is the opposite event in local folding: the screening/exclusion of stereochemically prohibited or energetically inappropriate variants. Projected back onto the problem of pattern/motif discovery, this forces one toward a parallel strategy which assumes that a typical structural motif in the anticipated stereochemical code may contain restrictions on the presence of particular structure types in some positions, relative to the placement of the corresponding sequence patterns. This approach is simply the other side of the complete sequence-structure dependence model: a "restriction/unacceptance/prohibition" strategy.
It is interesting to note that the expected occurrence of these prohibitions within the rules should be several times or even orders of magnitude higher than for the acceptable/optimal motifs, because of the analogous difference in the possible number of corresponding conformations (peptides typically display a few low-energy native-like conformations, whereas all other conformations demonstrate unfavorable stereochemical positioning and relatively high conformational energy). To account for such prohibitive elements in the sequence-structure associations, in our prediction of the three-state structure we use a six-state internal structure representation (helix, non-helix, beta, non-beta, coil/other, non-coil/other) for the construction of the predictive rules. Note that the non-X description is not a simple inversion of its X analog, but (in accordance with our near-Boolean definition of rules) reflects the positional absence of the concrete structure type in all instances of the corresponding rule in the database. Because the data collection is based only on pattern matches, rules including structural prohibitions do not indicate a lack of statistical data in the database (i.e., they could not be inferred from the underrepresentation of some rule types caused by the absence of corresponding structures in the learning set).


Data in Table I provide a specific example of the rule extraction. The starting RI corresponds to the potentially predictive pattern [AxxxxxxxFM] of length L=10, where the symbol 'x' ([-] in Table I) designates nonspecified positions. The data matrix for the accumulation of sequence/structure statistics consists of three functionally different parts: an amino acid part, an amino acid properties part and a structure (or rule result) part. These matrix parts are used to collect statistics about the occurrences of particular amino acids or property/structure types in the corresponding positions within the sequence. Because a structure motif is not necessarily located within the sequence pattern, our data matrix contains side extensions of length Ls so that the corresponding data can be collected. After one-pass database screening, Ni=71 matches of this RI sequence pattern were found within the database. Simple threshold filtering deletes all elements of the data matrix with an occurrence below the predefined Mp=95%. For convenience, all other "significant" elements are converted to an indicative Boolean description (absence/presence of stable association, designated as [-]/[*] in Table I). In this example, the presence of stable property/structure associations outside of the initial RI scope demonstrates that the effective length of the possible rule is higher than the initially assumed L=10, so that the optimal value L=20 will be more predictive; the corresponding extended pattern variant is written as [xxxxxAxxxxxxxFMxxxxx].
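The data-matrix accumulation and Mp threshold filtering can be sketched as follows. This is an illustrative simplification (assuming the matched peptides and their structure strings already cover the extended pattern, side extensions included), with a toy Mp so the tiny example shows both a retained and a discarded element.

```python
from collections import Counter

def accumulate_matrix(matches, length):
    """matches: list of (peptide, structure) strings of equal length
    covering the extended pattern (core plus side extensions)."""
    aa_cols = [Counter() for _ in range(length)]   # amino acid part
    ss_cols = [Counter() for _ in range(length)]   # structure part
    for pep, struct in matches:
        for i in range(length):
            aa_cols[i][pep[i]] += 1
            ss_cols[i][struct[i]] += 1
    return aa_cols, ss_cols

def stable_elements(cols, n_matches, mp=0.95):
    """Keep only elements present in at least Mp of all occurrences;
    everything else is threshold-filtered away (the [-]/[*] description)."""
    return [{sym for sym, c in col.items() if c >= mp * n_matches}
            for col in cols]

matches = [('ALG', 'HHC'), ('ALG', 'HHC'), ('AVG', 'HHC'), ('ALG', 'HEC')]
aa_cols, ss_cols = accumulate_matrix(matches, 3)
print(stable_elements(aa_cols, 4, mp=0.75))
print(stable_elements(ss_cols, 4, mp=0.75))
```

Column 2 of the structure part survives filtering as a stable coil association, while the one-off E at column 1 is discarded, mirroring how the real matrix yields the Boolean [-]/[*] description.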

The next focus of data matrix analysis is the amino acid part of the matrix. In one of the columns (which corresponds to position 16 in the extended pattern), only one type of amino acid (L) was obtained for nearly all instances of the RI pattern in the database. First, this means that a pattern in the form [xxxxxAxxxxxxxFMLxxxx] will provide the same data accumulation as the initial variant [xxxxxAxxxxxxxFMxxxxx]. This example illustrates how a stable RI-associated amino acid placement may be used for the refinement of the initial pattern variant. Second, the information from this part of the matrix shows that for this particular rule, four of the 20 possible pattern positions are always occupied by the same types of amino acids, reflecting a mean sequence similarity of 20% between the corresponding peptides from database sequences. Such an evaluation of the sequence similarity of the participating peptides provides a mechanism for threshold filtering of rules which are inferred from homology. In our study, we decided to delete from our final library all rules which were extracted from protein segments with more than 30% positional similarity (six positions out of 20 possible).
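The homology filter described above reduces to a simple column count, sketched here under the assumption that the matched peptides are available as equal-length strings (the peptides below are synthetic, with shared residues only at the pattern's specified positions).

```python
def positional_similarity(peptides):
    """Fraction of columns occupied by a single amino acid type
    across all peptides that matched the pattern."""
    length = len(peptides[0])
    identical = sum(1 for i in range(length)
                    if len({p[i] for p in peptides}) == 1)
    return identical / length

def passes_homology_filter(peptides, threshold=0.30):
    """A rule is kept only if its source segments share at most 30%
    of positions (six of 20 in the text's example)."""
    return positional_similarity(peptides) <= threshold

# shared residues only at the specified positions A, F, M, L (4 of 20)
peps = ['XXXXXAXXXXXXXFMLXXXX',
        'YYYYYAYYYYYYYFMLYYYY']
print(positional_similarity(peps))   # 4 identical columns of 20 -> 0.2
print(passes_homology_filter(peps))
```

A rule derived from near-duplicate segments (say, two copies of the same peptide, similarity 1.0) would fail the filter and be excluded from the final library.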

Analysis of the properties part of the data matrix does not provide any information about the associated structure motif but, like the amino acid part analysis, it permits the use of stable RI-associated placements of properties for the enhancement of the initial rule. In general, the described simple construction of the data accumulation matrix allows one to generate, and to test against the structure database, different RI variants, with the possibility of rule corrections in cases where the sequence pattern of the rule appears to be more complicated than was assumed in the RI structure. The RI as used in this approach serves only to initiate the data accumulation process. The real sequence pattern of the would-be rule is described by the amino acid and properties parts of the final data matrix. Such data refinement permits simple rule corrections without implementing feedback.

Through the analysis of the structure part of the data matrix, stable structural motifs associated with our refined sequence pattern are revealed. If no stable structural motifs were found to be associated with the pattern, the corresponding rule was not included in the final library (although the amino acid and properties parts of the matrix may demonstrate some stable amino acid or property associations that are interesting from an evolutionary point of view). Otherwise, the rule was included in the library, with the amino acid and properties parts describing the final sequence pattern, and with the structure part (rule result) describing the associated structure motif. Before writing to the library, all rules were subjected to maximal shortening, deleting those side columns of the data matrix which did not contain information about stable associations. Consequently, our final library contains rules with sequence patterns of lengths varying from three to 20 positions.

Because of some computational limitations of a combinatorial nature, RI were constructed using only two general types of possible sequence patterns. The first type represents RI with three specified positions (L=10, Nact=3) where each may contain either amino acids or properties. The second type contains four specified positions (L=4, Nact=4) where each is allowed to contain properties, but no more than three are allowed to contain amino acids. The following model parameters were standard for both sets: Mp=95%, Mmin=5. For the acquisition of statistical data about structure and sequence outside of the RI pattern, data matrix extensions with Ls=5 were used.

In a manner similar to many other algorithms, prediction of the structure was performed using three profiles for the helix, beta and coil/other structure types. By analogy with the RW method, the profile for a specific type of structure was obtained as a sum of weights for matching sequence patterns, in the relative sequence positions described by the structure motifs. Only the numbers of matches Ni are available for the construction of the corresponding weights in our final library. The approach to rule weighting was based on very general assumptions. First, consider two rules with different numbers of database matches, N1 and N2 (N1 > N2). A higher N1 value probably means that the first rule describes associations which are more general and less detailed than those derived from the second rule. Therefore, the weight for the first rule should be less than for the second. We chose rule weights in the simple form Wi=Const/Ni, where Ni is the number of rule matches with database sequences. To account for the lesser importance of rules whose sequence patterns occur in the database more frequently for random combinatorial reasons, the weight was divided by the expected number of matches in the database, Nex (evaluated by multiplying the frequencies of the amino acids or properties in the specified positions of the sequence pattern), so that the final weight took the form Wi=Const/(Ni*Nex). The constant Const was chosen so that the weights, transformed to integer values, were not less than one for all rules in our final library.
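The weighting scheme Wi = Const/(Ni*Nex) can be sketched as follows. The background frequency table, the number of windows and the value of Const are stand-in assumptions for illustration; in the actual method the frequencies would come from the learning database.

```python
def expected_matches(specs, n_windows, freq):
    """Nex: expected number of random matches, computed by multiplying
    the background frequencies at the specified pattern positions."""
    p = 1.0
    for spec in specs:
        p *= freq[spec]
    return n_windows * p

def rule_weight(n_i, n_ex, const=1000.0):
    """Wi = Const / (Ni * Nex): rules that are more general (high Ni)
    or combinatorially likelier (high Nex) receive smaller weights."""
    return const / (n_i * n_ex)

# assumed uniform-ish background frequencies for the specified residues
freq = {'A': 0.05, 'F': 0.05, 'M': 0.05}
n_ex = expected_matches(['A', 'F', 'M'], n_windows=80000, freq=freq)
print(n_ex)                 # ~10 expected random matches
print(rule_weight(71, n_ex))  # weight for the Ni=71 rule of Table I
```

With Ni=71 observed against roughly 10 expected matches, the rule is clearly non-random, and its weight is scaled down accordingly relative to rarer, more specific rules.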

During the accumulation of data from rule matches with the sequence in question, the weight value Wi was added to all profile positions corresponding to the sequence positions predicted as having a particular type of structure, and the same value Wi was subtracted if the corresponding positions were predicted as not having this type of structure. Calculated from all matches of the sequence with rules in the final library, the weight profiles were used as approximate descriptors of the probability of the corresponding structure types. A prevalence of rules with prohibitions in the final rule library was observed. To address this, we introduced an additional level of data analysis in the form of three additional Boolean profiles reflecting the presence of positive structure predictions for sequence positions. At the end of sequence testing against the whole rule library, positions in these profiles contain a value of one if at least one rule predicted a positive result for the corresponding position in the sequence, and a value of zero if only negative predictions were obtained. Final analysis and comparison of these six profiles was done by a simple expert-like system which includes several rules mimicking possible human-like profile analysis. The system scanned for the highest weight Wi among the three available weight profiles, predicting the corresponding structure type. In addition, for all sequence positions in which only negative rules for the coil/other structure type were obtained, the system attempted to test the possibility of helix or beta structure assignments if their weights were close enough to the weight of the coil/other structure. After the prediction phase, additional filtration of the predicted structures was implemented to delete all isolated helical segments of length less than three positions and all extended (beta) segments of length less than two positions. The corresponding structure assignments were exchanged for the coil/other structure type assignment.
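The final filtration step is self-contained enough to sketch directly: helix runs shorter than three residues and beta runs shorter than two are reassigned to coil/other. The H/E/C string encoding is an assumption for illustration.

```python
import re

def filter_short_segments(prediction, min_helix=3, min_beta=2):
    """Reassign too-short helix ('H') and beta ('E') runs to coil ('C')."""
    out = list(prediction)
    for match in re.finditer(r'H+|E+', prediction):
        seg = match.group()
        if (seg[0] == 'H' and len(seg) < min_helix) or \
           (seg[0] == 'E' and len(seg) < min_beta):
            out[match.start():match.end()] = 'C' * len(seg)
    return ''.join(out)

# the two-residue helix and the isolated helix are removed;
# the two-residue beta strand is long enough to survive
print(filter_short_segments('CHHCCEECHC'))
```

This simple post-pass enforces the physical minimum lengths of the secondary structure elements on the raw per-residue output.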

Results And Discussion

The learning stage of the algorithm was performed on structures from the Brookhaven Protein Data Bank [21]. The PIR-41 [21] file NRL3D, as translated from PDB-65, was used. We used the secondary structure descriptions presented in NRL3D without modification, except for encoding (1) helical positions of all types as "helix", (2) beta positions of all types as "extended", and (3) all positions other than helix or beta as "coil/other". We excluded from our consideration (at both the learning and testing stages) those structures which (1) were determined using computer modeling or NMR techniques, (2) did not contain helix/beta structure placement information, or (3) were shorter than 50 amino acids. The learning stage produced approximately 4x10^4 rules with stable sequence-structure associations. Most contain only stable negative structure associations (prohibitions). An example of a typical rule is shown in Table I.

As controls of the prediction accuracy, our algorithm allows the use of all learning proteins which have fewer than Mmin-1 homologs in the learning database. This is because even in the case in which there is some homology influence (although such an influence should be screened out by our homology test for all retained rules), the corresponding rules will not be present in the final database due to filtration by the number of occurrences via the threshold Mmin. Therefore, as Test Set I we used 56 proteins [23], each with fewer than four homologous sequences in PDB-65. In the search for homologs, the decision threshold was chosen as a positional similarity of 20% in the pairwise alignment [24] of the corresponding protein sequences. Test Set I (as well as the others described below) contains no homologs and is quite representative from the point of view of the differences in possible topologies. Results of the structure predictions for Test Set I, as well as for all learning proteins, are shown in Table II. Prediction accuracy was evaluated by the standard per-residue method. For both, the mean prediction accuracy was greater than 95%.
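For concreteness, the standard per-residue (three-state) accuracy used throughout the tables is simply the fraction of residues whose predicted assignment matches the observed one, assuming both are encoded as equal-length H/E/C strings:

```python
def q3_accuracy(predicted, observed):
    """Per-residue three-state accuracy: fraction of positions where
    the predicted H/E/C assignment matches the observed one."""
    assert len(predicted) == len(observed)
    correct = sum(p == o for p, o in zip(predicted, observed))
    return correct / len(observed)

print(q3_accuracy('HHHCCEEC', 'HHHCCEEE'))  # 7 of 8 residues correct
```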

The distribution of structures by their prediction accuracy values (Figure 1) demonstrates the high efficacy of the algorithm on most proteins from the database. Nevertheless, the distribution plot for the largest set (containing all learning proteins) clearly indicates the presence of some structures with relatively low (less than 80%) predictability. These structures constitute only 3% of all learning proteins. Detailed analyses revealed a near absence of rules for (some parts of) these sequences. For a typical residue in a sequence, the mean number of local matches to the rules in the database is well above 100. This number of matches results in relatively good separation of the weight profiles and provides easy structure assignments. In contrast, "failed" regions in proteins with relatively low predictability represent long (50 amino acids and longer) stretches of residues with fewer than 10 matching rules per residue. This clearly demonstrates either (1) the absence of corresponding local folding examples in the learning database (see the description of Mmin filtering) or (2) determination of the local folding in such areas by principles other than those which fit our model of a local stereochemical code (for instance, by long-range interactions). The absence of folding examples in the learning database seemed the most probable cause of such prediction failures; this can be studied only through the prediction of a large number of new structures, for which the probability of encountering new folding themes is high enough to draw statistical conclusions.

To simulate the prediction of a large number of novel structures (the equivalent of an extensive blind test), we decided to predict novel proteins which appeared in the PDB database during a long period of data submissions after release 65. For this purpose we used structures from PDB-70 (nearly one year of data submissions after PDB-65, with a growth in the number of entries of approximately 20%) which were absent from PDB-65. By analogy with Test Set I, we prepared a Test Set II containing 51 novel proteins [25] from PDB-70, each with fewer than four homologous sequences in PDB-65. Results of the structure predictions for Test Set II, as well as for all novel proteins, are shown in Table III. The distribution of structures by their prediction accuracy values (Figure 2) clearly indicates the presence of two independent groups of structures, one with high predictability by our algorithm and another with relatively low predictability. Placing the boundary between the groups at 70% prediction accuracy, we recalculated the data for the groups in Test Set II as well as among all novel proteins (see Table IV). The "right" group contains well-predicted structures with mean accuracy and standard deviation similar to the results from the prediction of the learning structures (see Table II), and probably reflects cases in which the corresponding folding themes are found in the learning set often enough to produce effective predictive rules. The "left" group demonstrates different variations of prediction failure and may illustrate cases where the corresponding folding themes are absent from (or rare in) the learning set. The numbers of proteins in the two groups provide the simplest estimate of the failure probability.
For a typical novel structure submitted to the PDB database, the probability of predicting its secondary structure with high accuracy (mean value of more than 90%) is more than 60%. If the novel protein has no homologs (or only a couple of homologous sequences) in the learning set, the probability of such an accurate prediction is only about 40%.

Even in cases of prediction failure, our method provides a convenient mechanism for controlling prediction quality. For successfully predicted secondary structure elements, peaks of the corresponding weight profiles are usually confirmed by peaks in the Boolean profiles, and both match valleys in the weight and Boolean profiles for the alternative types of secondary structure. In addition, a local lack of matching rules often causes remarkably low separation of the weight profiles, a near-total prevalence of the coil profile and an obvious absence of positive values in the Boolean profiles for the helix and extended structure types. This can be detected by visual inspection or by analysis of a dispersion profile, which can be calculated from the relative separation values of the three available weight profiles (smoothed along the sequence). To illustrate this point, we present prediction profiles for typical proteins from the "right" group (Figure 3) and from the "left" group (Figure 4) of Test Set II.
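One plausible reading of the dispersion profile, sketched here as an assumption (the text does not give the exact formula): at each residue take the separation between the best and second-best weight profiles, then smooth it with a running mean along the sequence. A sustained low-separation stretch would then flag a probable prediction failure.

```python
def dispersion_profile(profiles, window=5):
    """profiles: dict {'H': [...], 'E': [...], 'C': [...]} of equal-length
    weight profiles. Returns the smoothed best-vs-second-best separation."""
    n = len(next(iter(profiles.values())))
    sep = []
    for i in range(n):
        vals = sorted((profiles[s][i] for s in profiles), reverse=True)
        sep.append(vals[0] - vals[1])        # separation at this residue
    # simple running-mean smoothing along the sequence
    half = window // 2
    smoothed = []
    for i in range(n):
        seg = sep[max(0, i - half):i + half + 1]
        smoothed.append(sum(seg) / len(seg))
    return smoothed

# well-separated positions followed by an ambiguous one (all weights ~0)
profiles = {'H': [5, 5, 0, 0], 'E': [1, 0, 0, 0], 'C': [0, 1, 4, 0]}
print(dispersion_profile(profiles, window=3))
```

The last residue, where no profile dominates, pulls the smoothed separation down, which is the kind of signal a reader would otherwise pick out by visual inspection of Figures 3 and 4.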

To evaluate the performance of the algorithm when updated with current information, we recompiled the rule database using structures from PDB-70. We prepared a Test Set III containing 36 proteins [26], each with fewer than four homologous sequences in PDB-70. Results of the structure predictions for Test Set I, Test Set II and Test Set III, as well as for all learning proteins from PDB-70, are shown in Table V. For all of them, the mean prediction accuracy was greater than 95%.

Conclusion

Our results demonstrate that the described method for the prediction of protein secondary structure provides a convenient mechanism for revealing prediction failure cases which may arise from the absence of corresponding folding themes in the available database. Screening of these cases allows the prediction of secondary structure with high accuracy, but at the cost of rejecting some query proteins. Specifically, for more than 60% of novel structures submitted to the PDB, as well as for 40% of novel structures without homologs in the current database, a mean accuracy of more than 90% in the three-state prediction was achieved. This accuracy corresponds approximately to the level of structure description at which secondary structure mapping errors, combined with the imperfect quality of experimental data, may produce similar 10% deviations in the assignment of secondary structure. In effect, this means that protein secondary structure is much more predictable than previously thought, and that local context formations of the amino acids determine the local folding of the main chain. As for the utilization of this algorithm, accurate prediction of the structure with the detection of failure cases is no different from accurate prediction itself, simply because the structure output from such a prediction demonstrates a mean per-residue accuracy of near 90%. Taking into account the presumably divergent type of molecular evolution and the fundamentally limited number of naturally existing proteins, periodic updating of the rule database should provide a steady decrease in the percentage of rejected sequences (although a more careful study is needed to evaluate the possible rate of such a decrease and the representativeness of the current database from the point of view of the presence of possible folding themes).

Our work demonstrates the high efficacy of the described learning algorithm in the prediction of sequence features. Generally speaking, any feature associated with sequence positions can be predicted in this manner, provided one accepts a simplified description of localization as a presence/absence indication for each position. This includes not only protein secondary structures but, we hope, other sites and domains as well. Successful application of the method requires only one assumption: the feature in question must be context-dependent in some way (which is probably true for most known sequence features). Because some feature types (such as protein secondary structure) may be realized in an enormous variety of ways, a further, combinatorial limitation applies: it must be possible to approximate the context dependence of the feature by additive contributions of the corresponding contexts. The method was originally developed with improved prediction of eukaryotic splice sites in mind, but secondary structure proved more attractive for illustrating its performance.
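As a toy illustration of this generality, presence/absence prediction from fixed-width context patterns can be sketched as follows. The pattern syntax, window width, and rule set here are assumptions for illustration only, not the package's actual machinery:

```python
# Minimal sketch of pattern-based presence/absence prediction. A rule is a
# fixed-width pattern in which '-' is an unconstrained position; a residue
# is marked feature-positive when any rule matches its context window.

def matches(window, pattern):
    """True when every constrained pattern position agrees with the window."""
    return all(p == '-' or p == w for p, w in zip(pattern, window))

def predict_feature(sequence, rules, width=5):
    """Return a 0/1 call per position; edge positions without a full
    context window are left at 0 in this simplified sketch."""
    half = width // 2
    calls = [0] * len(sequence)
    for i in range(half, len(sequence) - half):
        window = sequence[i - half:i + half + 1]
        if any(matches(window, r) for r in rules):
            calls[i] = 1
    return calls
```

For example, predict_feature("AAAGPAAA", ["--GP-"]) marks only position 3, the G of a hypothetical centered G-P motif.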

The parameters of the described algorithm (specifically, the values Mmin and Mp, and the threshold for local sequence similarity filtering) can be adjusted to the needs of a particular object of study. It should be noted that the criterion used to define amino acid properties is far from perfect because of the large overlap between the corresponding classes (compare, for example, the strings describing hydrophobic and bulky residues). Screening of homology-inferred rules can be accomplished more precisely when a non-Boolean description of homology is used for every position in the sequence pattern. For example, defining an amino acid as both hydrophobic and bulky potentially allows nine of the 20 possible amino acids; this partial homology can be described as 9/20 rather than the zero value used in this work. Also, for some types of sequence features, screening of homology-related rules may be entirely unnecessary.
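The non-Boolean homology score can be sketched as follows. The property-class memberships below are illustrative assumptions, not the definitions used in this work; with the property strings actually used here, the hydrophobic-and-bulky case yields the 9/20 value quoted above:

```python
# Sketch of a fractional (non-Boolean) homology score for one pattern
# position: the fraction of the 20 amino acids consistent with all property
# classes required at that position. The class memberships below are
# illustrative assumptions, not the paper's own definitions.

AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")
HYDROPHOBIC = set("AVLIMFWYC")   # assumed membership
BULKY = set("FWYRKHLIM")         # assumed membership

def fractional_homology(required_classes):
    """Fraction of amino acids satisfying every required property class."""
    allowed = set(AMINO_ACIDS)
    for cls in required_classes:
        allowed &= cls
    return len(allowed) / len(AMINO_ACIDS)
```

With these assumed classes, fractional_homology([HYDROPHOBIC, BULKY]) gives 6/20 = 0.3 rather than the Boolean zero; substituting the property strings used in this work would give 9/20.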

The package for the prediction of protein secondary structure is available for use via the Internet e-mail server BIO@SCRI.FSU.EDU. In addition, a library of rules, together with the text of the corresponding C subroutine, is available by anonymous FTP from FTP.SCRI.FSU.EDU in the directory /pub/genetics/SSP.

Acknowledgments

We thank Dr. Kenneth Roux (Dept. of Biological Sciences, FSU) for critical discussion and correction of the manuscript, and Dr. Lee Makowski (Dept. of Biological Sciences and Institute of Molecular Biophysics, FSU) for reading the manuscript and for critical remarks. This work was partially supported by SCRI, which is partially funded by the US DOE under Contract Number DE-FC05-85ER250000. The computational part of the work was made possible by computer time on the DEC5000 and IBM RISC6000 workstations allocated by SCRI.

References

1. Nishikawa, K., Noguchi, T. Predicting protein secondary structure based on amino acid sequence. Methods Enzymol., 202: 31-44, 1991.

2. Blundell, T., Sibanda, B., Sternberg, M., Thornton, J. Knowledge-based prediction of protein structures and design of novel molecules. Nature, 326: 347-352, 1987.

3. Rost, B., Schneider, R., Sander, C. Progress in protein structure prediction? TIBS, 18: 120-123, April 1993.

4. Niermann, T., Kirschner, K. Use of homologous sequences to improve protein secondary structure prediction. Methods Enzymol., 202: 45-59, 1991.

5. Wako, H., Blundell, T.L. Use of amino acid environment-dependent substitution tables and conformational propensities in structure prediction from aligned sequences of homologous proteins. II. Secondary structures. J. Mol. Biol., 238: 693-708, 1994.

6. Rost, B., Sander, C. Combining evolutionary information and neural networks to predict protein secondary structure. Proteins, 19: 55-72, 1994.

7. Benner, S.A., Gerloff, D.L. Predicting the conformation of proteins. Man versus machine. FEBS Lett., 325: 29-33, 1993.

8. Russell, R.B., Barton, G.J. The limits of protein secondary structure prediction accuracy from multiple sequence alignment. J. Mol. Biol., 234: 951-957, 1993.

9. Rost, B., Sander, C., Schneider, R. Redefining the goals of protein secondary structure prediction. J. Mol. Biol., 235: 13-26, 1994.

10. Chou, P.Y., Fasman, G.D. Prediction of the secondary structure of proteins from their amino acid sequence. Adv. Enzymol., 47: 45-148, 1978.

11. Garnier, J., Osguthorpe, D.J., Robson, B. Analysis and implications of simple methods for predicting the secondary structure of globular proteins. J. Mol. Biol., 120: 97-120, 1978.

12. Muskal, S.M., Kim, S.-H. Predicting protein secondary structure content. J. Mol. Biol., 225: 713-727, 1992.

13. Salzberg, S., Cost, S. Predicting protein secondary structure with a nearest-neighbor algorithm. J. Mol. Biol., 227: 371-374, 1992.

14. Rost, B., Sander, C. Prediction of protein secondary structure at better than 70% accuracy. J. Mol. Biol., 232: 584-599, 1993.

15. Yi, T.-M., Lander, E.S. Protein secondary structure prediction using nearest-neighbor methods. J. Mol. Biol., 232: 1117-1129, 1993.

16. Rooman, M.J., Wodak, S.J. Identification of predictive sequence motifs limited by protein structure database size. Nature (London), 335: 45-49, 1988.

17. Rooman, M.J., Rodriguez, J., Wodak, S.J. Relations between protein sequence and structure and their significance. J. Mol. Biol., 213: 337-350, 1990.

18. Salamov, A.A., Solovyev, V.V. Prediction of protein secondary structure by combining nearest-neighbor algorithms and multiple sequence alignment. J. Mol. Biol., 247: 11-15, 1995.

19. Rooman, M.J., Wodak, S.J. Weak correlation between predictive power of individual sequence patterns and overall prediction accuracy in proteins. Proteins, 9: 69-78, 1991.

20. Hobohm, U., Sander, C. Enlarged representative set of protein structures. Protein Sci., 3: 522-524, 1994.

21. Abola, E.E., Bernstein, F.C., Bryant, S.H., Koetzle, T.F., Weng, J. Protein Data Bank. In: Crystallographic Databases - Information Content, Software Systems, Scientific Applications. Allen, F.H., Bergerhoff, G., Sievers, R. (Eds.), Data Commission of the International Union of Crystallography, Bonn/Cambridge/Chester, 1987:107-132.

22. Barker, W.C., George, D.G., Mewes, H.-W., Pfeiffer, F., Tsugita, A. The PIR-International databases. Nucleic Acids Res., 21: 3089-3092, 1993.

23. PDB indexes for protein structures in the Test Set I:
1LE2, 2DPVA1, 1BMVA, 1CPB1, 1CPB10, 1FCBA1, 1FCBA2, 1FHA, 1GP1A, 1HSDA, 1MLI, 1PHY, 1PYP, 1R1EE, 1RHD, 2FNR, 2STV, 2TMVP, 3BCL2, 3BCL4, 3PGM, 1ULA, 1ATND1, 1ABK, 1BAA, 1BYH, 1COLA, 1DHR, 1END, 1GLAG1, 1GLAG2, 1GPR, 1PDA2, 1TPLA1, 1TPLA2, 2BPAB, 2HHMA7, 2HHRB3, 2SAS, 3SC2A, 3SC2B, 4GCR, 1ABN1, 1CPT1, 1CPT2, 1DSBA, 1GLT2, 1HUW1, 1IFA1, 1MAT, 1MYPA, 1NAR, 1PYAB, 1TML, 2DNJA2, 2TGI.

24. Strelets, V.B., Shindyalov, I.N., Kolchanov, N.A., Milanesi, L. Fast, statistically based alignment of amino acid sequences on the base of diagonal fragments of DOT-matrices. Comput. Appl. Biosci., 8: 529-534, 1992.

25. PDB indexes for protein structures in the Test Set II:
1BNH, 1DLAA1, 1GCS, 1HMCA, 1LBA, 1PYDA2, 1PYDA3, 1SERA2, 3HHRA1, 3HHRB3, 1BET, 1BYA1, 1COMA, 1CPM, 1CTM, 1CUS, 1DDT1, 1DDT2, 1DIRA, 1GLCG1, 1GLCG2, 1GLM, 1GPHA, 1HPLA, 1HRS, 1LST1, 1MDTA1, 1MDTA2, 1RPA, 1SACA, 1TAHB, 1TCA, 1TNDA, 1YSC, 2BTFA, 2LAO, 1WHSA, 1WHSB, 1ACJ1, 1AGM, 1AMP, 1BGT3, 1DGD, 1EFT, 1ENJ, 1GIA, 1IAG, 1PBB, 2BBVA, 2DKB, 3HVTB3.

26. PDB indexes for protein structures in the Test Set III:
2DPVA2, 1BMVB, 1R1EE, 2STV, 1BAA, 1BPM, 1HMY, 1LGAA, 1PEC, 1SBP, 1SIL, 1TPLA2, 1UDPA, 2HHRB3, 2PIA, 2SAS, 4GCR, 1GLT2, 1OMF, 1PYAB, 1PYDA3, 1FNR, 1GP1A, 1CPN, 1CTM, 1DDT2, 1HRS, 1RPA, 1TNDA, 1RHD, 2LBP, 1WHSA, 3BCL2, 1AMP, 1BGT3, 1IAG.



 

 

 


 

 

 
   Part       Extension       RI Window         Extension  
===========================================================
    RI      | - - - - - | A - - - - - - - F M | - - - - - |
============|===========|=====================|===========|
Amino Acids | - - - - - | A - - - - - - - F M | L - - - - |
============|===========|=====================|===========|
P|    Hydro+| - - - - - | - - - - - - - * - - | - - - * - |
R|    Hydro-| - - - - - | - - - - - - - - - - | - - - - - |
O|   Charge+| - - - - - | - - - - - - - - - - | - - - - - |
P|   Charge-| * * - - - | - - * * * * - - - - | - - * * * |
E|    Polar+| - - - - - | - - - - - - - - - - | - - - - - |
R|    Polar-| * * - - - | - - * - * - - - - - | - - * * - |
T|    Bulky+| - - - - * | - - - - - - - - - - | - - - - * |
Y|    Bulky-| - - - - - | - - - - * - - - - - | - - * * - |
===========================================================
R|    Helix+| - - - - - | - - - - - - - - - - | - - - - - |
E|    Helix-| - - - - - | - - - - - - - - - - | - - - - - |
S| Extended+| - - - - - | - - - - - - - - - - | - - - - - |
U| Extended-| - - - - - | - * * * * - - - - - | - - * * - |
L|     Coil+| - - - - - | - - - - - - - - - - | - - - - - |
T|     Coil-| - - - - - | - - - - - * * * * * | * * - - - |
===========================================================


Table I. A detailed example of a rule (71 instances in the learning set): structure of the data accumulation matrix for a typical rule. [-] marks matrix elements that were filtered out using the threshold Mp=95%; [*] marks elements describing stable associations found in more than 95% of all instances of the rule in the learning set. For convenience, the amino acid part of the matrix is presented as a one-row amino acid pattern instead of the 20 rows actually used (one for each possible placement of a particular amino acid type, for all positions in the rule window). In the description of sequence and structural features, the designation (feature)+ is used when the feature is present and (feature)- when it is absent.
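The Mp filtering described in the Table I caption can be sketched as follows. The data layout (one 0/1 row per rule instance) is an assumed simplification for illustration:

```python
# Sketch of the Mp=95% filtering step for one rule: given all instances of
# the rule collected from the learning set, keep ('*') only the matrix
# elements set in at least Mp percent of instances; the rest become '-'.
# The one-row-per-instance layout is an assumed simplification.

def filter_rule(instances, mp=95.0):
    """instances: list of equal-length 0/1 rows, one per rule occurrence."""
    n = len(instances)
    out = []
    for col in range(len(instances[0])):
        hits = sum(row[col] for row in instances)
        out.append('*' if 100.0 * hits / n >= mp else '-')
    return ''.join(out)
```

Lowering mp admits weaker associations: an element present in two of three instances survives at mp=60 but not at the 95% threshold used for Table I.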

 
Data Set   | No. of Structures | Mean Accuracy | St. Deviation |
===========|===================|===============|===============|
Learning   |       2369        |    96.30%     |     5.20%     |
Test Set I |        56         |    97.04%     |     3.95%     |
===========|===================|===============|===============|


Table II. Prediction results for (filtered) structures from PDB-65, using the rules database derived from the same PDB-65.
Learning -- all structures;
Test Set I -- structures with fewer than four homologs in PDB-65, as the simplest model of structure prediction when homologs are unavailable in the learning set (or, more precisely, when, due to filtering, homologs did not produce homology-inferred rules in the final rules database).

 
Data Set    | No. of Structures | Mean Accuracy | St. Deviation |
============|===================|===============|===============|
All New     |       575         |    77.77%     |    19.79%     |
Test Set II |        51         |    69.51%     |    18.51%     |
============|===================|===============|===============|


Table III. Prediction results for (filtered) new structures from PDB-70 (those appearing in the database between releases 65 and 70), using the rules database derived from PDB-65.
All New -- all new structures, as a model of structure prediction for a large number of new structures;
Test Set II -- new structures with fewer than four homologs in PDB-65, as the best model of structure prediction when homologs are unavailable in the learning set (or when, due to filtering, homologs did not produce homology-inferred rules in the final rules database).

 
   Data Set          | # Structs | Mean Accuracy | St. Deviation |
=====================|===========|===============|===============|
New,"Left" group     |    201    |    53.11%     |     8.86%     |
New,"Right" group    |    374    |    91.03%     |     7.54%     |
Set II,"Left" group  |     31    |    55.91%     |     6.12%     |
Set II,"Right" group |     20    |    90.58%     |     8.56%     |
=====================|===========|===============|===============|


Table IV. Variant of Table III, with the data accumulated separately for the "left" and "right" groups in the distribution of new structures by prediction accuracy (reflecting a substantial lack of the corresponding folding examples/rules in the learning database and the prior occurrence of such examples/rules, respectively).

 
   Data Set    | # Structs | Mean Accuracy | St. Deviation |
===============|===========|===============|===============|
Learning       |   2944    |    95.38%     |     6.45%     |
Test Set I     |     56    |    96.14%     |     4.68%     |
Test Set II    |     51    |    97.80%     |     2.69%     |
Test Set III   |     36    |    97.13%     |     4.74%     |
===============|===========|===============|===============|


Table V. Prediction results for (filtered) structures from PDB-70, using the rules database derived from the same PDB-70.
Learning -- all structures;
Test Set I -- structures present in PDB-65 with fewer than four homologs in PDB-65;
Test Set II -- new structures (appearing in the PDB after release 65) with fewer than four homologs in PDB-65;
Test Set III -- structures from PDB-70 with fewer than four homologs in PDB-70, as the model of structure prediction when homologs are unavailable in the learning set (or, more precisely, when, due to filtering, homologs did not produce homology-inferred rules in the final rules database).

 

Figure 1. Distribution of structures by prediction accuracy, for (filtered) structures from PDB-65 using the rules database derived from the same PDB-65.
a -- all structures;
b -- Test Set I, structures with fewer than four homologs in PDB-65, as the simplest model of structure prediction when homologs are unavailable in the learning set (or, more precisely, when, due to filtering, homologs did not produce homology-inferred rules in the final rules database).

 

Figure 2. Distribution of structures by prediction accuracy, for (filtered) new structures from PDB-70 (those appearing in the database between releases 65 and 70) using the rules database derived from PDB-65.
a -- all new structures, as a model of structure prediction for a large number of new structures;
b -- Test Set II, new structures with fewer than four homologs in PDB-65, as the best model of structure prediction when homologs are unavailable in the learning set (or when, due to filtering, homologs did not produce homology-inferred rules in the final rules database).

 

Figure 3. A detailed example of structure prediction (PDB entry 3HHRA1). Secondary structure profiles (weight and Boolean) for a typical structure from the "right" group of Test Set II (see Table IV), illustrating the case with prior occurrence of the corresponding folding examples/rules in the learning database.

 

Figure 4. A detailed example of structure prediction (PDB entry 1HMCA). Secondary structure profiles (weight and Boolean) for a typical structure from the "left" group of Test Set II (see Table IV), illustrating the case with a substantial lack of the corresponding folding examples/rules in the learning database.

 

Figure 5. Distribution of structures by prediction accuracy, for (filtered) structures from PDB-70 using the rules database derived from the same PDB-70.
a -- all structures;
b -- Test Set I, structures present in PDB-65 with fewer than four homologs in PDB-65;
c -- Test Set II, new structures with fewer than four homologs in PDB-65;
d -- Test Set III, structures with fewer than four homologs in PDB-70, as the model of structure prediction when homologs are unavailable in the learning set (or, more precisely, when, due to filtering, homologs did not produce homology-inferred rules in the final rules database).



NetSci, ISSN 1092-7360, is published by Network Science Corporation.