Reliability of models generated by SWISS-MODEL


During the 3DCrunch project we also generated models for proteins of known 3D structure, using modelling templates sharing 25-95% sequence identity with the submitted sequences. We thereby collected over 1200 model - control structure pairs. This is the first time that a modelling method has been assessed at this scale; in CASP, by contrast, only a very limited number of data sets is available. The results provide relevant information about the reliability of the method. We computed the sequence identity between each target and its template, and the root mean square deviation (rmsd) of each model from its corresponding experimental control structure (Fig. 1), and summarised the data in Table I. For instance, 63% of the sequences sharing 40-49% identity with their template, submitted to SWISS-MODEL through the "First Approach mode", yield a model deviating by less than 3 Å from its control structure. This number increases to 79% for sequence identities ranging from 50 to 59%. Table I also shows that below 30% sequence identity the accuracy of completely automated protein modelling degrades rapidly.
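The tabulation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure actually used in the 3DCrunch project: the function names `identity_class` and `summarise` are hypothetical, the example values are made up, and "lower than" is taken as a strict inequality.

```python
from collections import defaultdict

# Identity classes as listed in Table I (inclusive bounds, in percent).
IDENTITY_CLASSES = [(25, 29), (30, 39), (40, 49), (50, 59),
                    (60, 69), (70, 79), (80, 89), (90, 95)]
RMSD_THRESHOLDS = [1.0, 2.0, 3.0, 4.0, 5.0]  # in Å


def identity_class(identity):
    """Return the (low, high) class containing an identity in percent, or None."""
    for low, high in IDENTITY_CLASSES:
        if low <= identity < high + 1:
            return (low, high)
    return None


def summarise(pairs):
    """Summarise (percent identity, rmsd in Å) pairs, one pair per model.

    Returns {class: (number of models, [percent of models below each threshold])}.
    """
    by_class = defaultdict(list)
    for identity, rmsd in pairs:
        cls = identity_class(identity)
        if cls is not None:
            by_class[cls].append(rmsd)
    summary = {}
    for cls, rmsds in by_class.items():
        n = len(rmsds)
        percents = [round(100 * sum(r < t for r in rmsds) / n)
                    for t in RMSD_THRESHOLDS]
        summary[cls] = (n, percents)
    return summary


# Made-up example values, NOT data from the 3DCrunch project.
example = [(42.0, 0.8), (45.5, 2.5), (47.0, 3.4), (48.9, 6.1), (55.0, 1.5)]
print(summarise(example))
```

Each percentage is cumulative: a model deviating by less than 1 Å is also counted in the columns for 2, 3, 4 and 5 Å.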

While analysing the results we could confirm (Fig. 2) the observation that the common cores of proteins sharing 50% sequence identity deviate by approximately 1 Å root mean square deviation [1]. This deviation can, however, be much larger, even for identical sequences, if the protein structures were solved by NMR [2,3]. We could also confirm the long-known fact that the most reliable part of a protein model is the portion it shares with the modelling template [2,4], while the rebuilt non-conserved loops are a major contributor to model inaccuracy [2,3,4]. Two further observations are noteworthy. Firstly, the choice of the modelling template is crucial, as most high deviations between model and experimental control structures can be traced back to the selected modelling templates. It is well known that the experimental conditions at the time of structure elucidation [2,3,4] strongly impact the resulting structure. Indeed, the presence or absence of co-factors, other sub-units and substrates can strongly influence an experimental protein structure and thereby all models derived from it. It is thus crucial that users be aware of the templates used by the modelling procedure, or make their own informed choice. Secondly, as comparative protein modelling is guided by the alignment between target and template sequences, any error introduced by the alignment algorithm will have profound effects on the model. Models resulting from such alignments are wrong and can generally not be used. We have observed that such errors start to appear when the sequence identity between target and template is lower than 40%. It is, however, noteworthy that fewer than 5% of such models generated by SWISS-MODEL are inaccurate because of errors introduced by the sequence alignment algorithms.

  1. Chothia, C. and Lesk, A. M. (1986) EMBO J. 5, 823-826
  2. Harrison, R. W., Chatterjee, D. and Weber, I. T. (1995) Proteins Struct. Funct. Genet. 23, 463-471
  3. Guex, N. and Peitsch, M. C. (1997) Electrophoresis 18, 2714-2723
  4. Bajorath, J., Stenkamp, R. and Aruffo, A. (1993) Protein Sci. 2, 1798-1810

Table I

Probabilities of SWISS-MODEL accuracy for target-template identity classes.

| Percent sequence identity (a) | Total number of models (b) | Percent (c) models with rmsd lower than 1 Å | Percent models with rmsd lower than 2 Å | Percent models with rmsd lower than 3 Å | Percent models with rmsd lower than 4 Å | Percent models with rmsd lower than 5 Å | Percent models with rmsd higher than 5 Å |
|---|---|---|---|---|---|---|---|
| 25-29 | 125 | 0 | 10 | 30 | 46 | 67 | 33 |
| 30-39 | 222 | 0 | 18 | 45 | 66 | 77 | 23 |
| 40-49 | 156 | 9 | 44 | 63 | 78 | 91 | 9 |
| 50-59 | 155 | 18 | 55 | 79 | 86 | 91 | 9 |
| 60-69 | 145 | 38 | 72 | 85 | 91 | 92 | 8 |
| 70-79 | 137 | 42 | 71 | 82 | 85 | 88 | 12 |
| 80-89 | 173 | 45 | 79 | 86 | 94 | 95 | 5 |
| 90-95 | 88 | 59 | 78 | 83 | 86 | 91 | 9 |

a: Range of sequence identity between target and template sequence.
b: Total number of models in any given class of sequence identity. The table summarises 1201 model – control structure pairs.
c: Probability in percent that a model, sharing X% sequence identity with its template, deviates by 1 Å or less from the corresponding experimental control structure. The following columns provide these probabilities for other rms deviations.
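As a convenience for readers who want to query Table I programmatically, the values can be transcribed into a small lookup. The sketch below is illustrative only: the names `TABLE_I` and `percent_below` are hypothetical, and identities falling between two classes (for example 29.5%) are assigned to the lower class by assumption.

```python
# Table I, transcribed: identity class -> (number of models,
# percent of models with rmsd below 1, 2, 3, 4 and 5 Å).
TABLE_I = {
    (25, 29): (125, (0, 10, 30, 46, 67)),
    (30, 39): (222, (0, 18, 45, 66, 77)),
    (40, 49): (156, (9, 44, 63, 78, 91)),
    (50, 59): (155, (18, 55, 79, 86, 91)),
    (60, 69): (145, (38, 72, 85, 91, 92)),
    (70, 79): (137, (42, 71, 82, 85, 88)),
    (80, 89): (173, (45, 79, 86, 94, 95)),
    (90, 95): (88, (59, 78, 83, 86, 91)),
}


def percent_below(identity, threshold):
    """Percent of models in the identity class with rmsd below threshold (1-5 Å)."""
    if threshold not in (1, 2, 3, 4, 5):
        raise ValueError("Table I only lists thresholds of 1, 2, 3, 4 and 5 Å")
    for (low, high), (_, percents) in TABLE_I.items():
        # Assumption: fractional identities belong to the class of their integer part.
        if low <= identity < high + 1:
            return percents[int(threshold) - 1]
    raise ValueError("identity outside the 25-95% range covered by Table I")


print(percent_below(45, 3))  # 63
print(percent_below(55, 3))  # 79
print(sum(n for n, _ in TABLE_I.values()))  # 1201
```

The last line checks that the class sizes add up to the 1201 model - control structure pairs stated in footnote b.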


These graphs were generated with SGI's MineSet software.

Fig. 1

Relative frequency of model accuracy classes (colour coded) as a function of sequence identity levels. The numbers in white correspond to the sequence identity between model and template. The root mean square deviations are computed on Cα atoms only.
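For readers unfamiliar with the measure, a root mean square deviation over Cα atoms can be computed as sketched below. This is a minimal illustration, not the SWISS-MODEL implementation: the function name `ca_rmsd` is hypothetical, and the sketch assumes that the two structures have already been superposed and that their Cα atoms are listed in matching order. The superposition step itself is not shown.

```python
import math


def ca_rmsd(model_ca, control_ca):
    """Root mean square deviation between two equal-length lists of (x, y, z).

    Assumes both coordinate sets are already superposed and in matching order.
    """
    if len(model_ca) != len(control_ca) or not model_ca:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = sum((a - b) ** 2
                  for p, q in zip(model_ca, control_ca)
                  for a, b in zip(p, q))
    return math.sqrt(squared / len(model_ca))


# Made-up coordinates in Å: every atom of the control is shifted by 1 Å along z.
model = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
control = [(0.0, 0.0, 1.0), (3.8, 0.0, 1.0), (7.6, 0.0, 1.0)]
print(ca_rmsd(model, control))  # 1.0
```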

Fig. 2

Root mean square deviation of the common core of proteins as a function of their sequence identity. The colours of the data points indicate the experimental method used for both structure determinations.


Torsten Schwede, Manuel C. Peitsch & Nicolas Guex.