Project

3DCrunch:
Very Large Scale Comparative Protein Modelling and Fold Recognition

 

Dates: May 18 to 22, 1998

 

Manuel C. Peitsch, Nicolas Guex, Mansoor Saqi2, Laurent Duret3 & Michael Sternberg4


Glaxo Wellcome Experimental Research S.A.
16, chemin des Aulx
1228 Plan-les-Ouates / Switzerland
Manuel.Peitsch@isb-sib.ch

2 Glaxo Wellcome Medicine Research Centre

3 Université Claude Bernard - Lyon

4 Imperial Cancer Research Fund - London

 

 

Parties Involved

Silicon Graphics Inc., Glaxo Wellcome, the Swiss Institute of Bioinformatics, the Imperial Cancer Research Fund and the Université Claude Bernard in Lyon.

 

 

Introduction

Proteins are the working molecules in most biological processes, and a large portion of experimental biology is aimed at understanding their function. Site-directed mutagenesis experiments, for instance, are generally designed to provide information about the involvement of specific residues in enzyme-based reactions, molecular recognition events and protein stability, and to support drug design projects. Such experiments can be greatly rationalised when the three-dimensional (3-D) structure of the studied molecules is available. The experimental elucidation of these 3-D structures by X-ray crystallography or NMR is, however, often hampered by difficulties in obtaining sufficient pure protein or diffracting crystals, and by many other technical obstacles. This is illustrated by the fact that the SWISS-PROT/TrEMBL database (Bairoch and Apweiler, 1996) contains more than 150,000 sequences, while the Protein Data Bank (PDB) (Bernstein et al., 1977) contains the 3-D structures of fewer than 4000 protein chains with distinct sequences. The number of solved 3-D structures thus increases very slowly compared to the rate of sequencing of novel cDNAs, and no structural information is available for the vast majority of the known protein sequences. This gap will of course widen further as genome sequencing projects continue to yield large numbers of novel sequences.

 

Comparative protein modelling

Comparative molecular modelling (for a review see Bajorath et al., 1993) makes it possible to expand the number of protein sequences for which we have structural information. Such models can be effectively used to design mutagenesis experiments and to support drug design projects (Peitsch & Guex, 1997). Over the last few years we have developed an automated comparative protein modelling server (SWISS-MODEL) and a sequence-to-structure workbench (Swiss-PdbViewer). Taken together, these tools form a very powerful protein modelling environment, which features numerous algorithms for both model construction and structure analysis (Guex & Peitsch, 1997).

 

Fold recognition

The number of protein sequences for which structural information can be derived by computational methods has further increased in the last few years. Indeed, many proteins that adopt the same 3-D fold despite having no detectable sequence similarity have been documented. Conventional comparative modelling methods cannot be applied to such cases, as the sequence alignment necessary for the modelling process cannot be generated by current algorithms. Fold recognition methods instead attempt to evaluate the compatibility of a new sequence with a library of 3-D folds. These methods are very promising and may prove a useful tool in enhanced genome analysis.

 

Large-scale comparative protein modelling

The software framework of the SWISS-MODEL server can be used to generate large collections of protein models. This was demonstrated in spring 1996, when we built models for all E. coli proteins that were identified by proteomics approaches (Peitsch et al., 1997). In order to store these models, we have established the SWISS-MODEL Repository, a new database for protein models generated by theoretical approaches. This repository is directly integrated with SWISS-PROT and other databases through the ExPASy World-Wide Web server (http://www.expasy.org).

More recently, we have applied the same technology to larger sets of protein sequences, all derived from complete microbial genomes. Our most recent large-scale comparative protein modelling experiment (spring 1997), which used the proteins identified from several complete microbial genomes, demonstrated that approximately 15% of the protein sequences of these organisms could be modelled successfully. This number increased by 5 to 10% when we used the SWISS-PROT database. The difference is explained by the bias of the database content towards well-studied and over-represented protein families (e.g. serine proteases and dehydrogenases). Another interesting observation was made when we compared these numbers between autumn 1996 and spring 1997: the number of proteins for which we could derive a model increased by more than 3%. This is due to the growth of the Protein Data Bank, and clearly demonstrates that a concerted experimental protein structure elucidation effort has a very significant influence on the global knowledge of protein structure.

 

Structural genomics

For some time, crystallographers, genome scientists and protein modellers have been discussing a major "post-genome" project, which should have partners around the world in both academic and industrial settings. This project is aimed at the experimental elucidation of a large number of protein structures. The proteins to be studied by either NMR or X-ray crystallography should ideally comprise at least one member of each protein family. The selection process will be based on advanced sequence analysis methods and a reliable classification of proteins into families and fold classes.

 

Shaping the future

This structural genomics project will obviously have a major impact on the number of sequences for which a protein model can be derived, and will thereby gradually allow the gathering of structural information on every protein family. A wild extrapolation of the above experiment suggests that we might be able to derive molecular models for "every protein" by 2020. This will have a major effect on the efficiency of drug discovery and on the time it requires, and will thereby improve the quality of healthcare.

 

Very large-scale protein modelling

In this context we will conduct a very large-scale protein modelling project at the Supercomputing Technology Center of Silicon Graphics Inc. at Cortaillod in Switzerland. We plan to build models for every protein in the SWISS-PROT/TrEMBL database by comparative methods, using the software framework of SWISS-MODEL. Furthermore, we intend to use a fold recognition algorithm to examine those sequences of bacterial origin that are not significantly similar to any protein of known structure and thus cannot be modelled by comparative methods. We will use FoldFit, which was developed by the Imperial Cancer Research Fund and Glaxo Wellcome (Russell et al., 1997a, 1997b). The potential of such methods has been suggested by recent blind prediction challenges (Moult et al., 1995).
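The division of work described above can be pictured as a simple triage step. The following is only a minimal sketch under stated assumptions, not the procedure used by SWISS-MODEL or FoldFit: the `Sequence` record, the `route` function, the accession strings and the 30% identity cut-off are all hypothetical, since the text does not define what counts as "significantly similar".

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical cut-off: the project description does not state
# what level of similarity is considered "significant".
MIN_IDENTITY = 0.30


@dataclass
class Sequence:
    accession: str                            # database identifier (illustrative)
    is_bacterial: bool                        # True if the sequence is of bacterial origin
    best_template_identity: Optional[float]   # None if no template of known structure was hit


def route(seq: Sequence, min_identity: float = MIN_IDENTITY) -> str:
    """Decide which method a sequence would be sent to in this sketch."""
    identity = seq.best_template_identity
    # Sufficiently similar to a known structure: comparative modelling.
    if identity is not None and identity >= min_identity:
        return "comparative_modelling"
    # No usable template, but of bacterial origin: fold recognition.
    if seq.is_bacterial:
        return "fold_recognition"
    # Everything else is left untreated here.
    return "unassigned"
```

For example, `route(Sequence("X1", True, 0.12))` falls through the first test and is sent to fold recognition, whereas a non-bacterial sequence without a template remains unassigned.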

 

Impact

3DCrunch will be the first such large-scale computational project initiated in structural biology. The project will yield an unprecedented amount of useful structural information, which we will make available to the scientific community.

 

Availability of the results

The very large number of protein models generated by the project will be stored in a database: the Model Database (MODB). This new database will replace today's SWISS-MODEL Repository, which presently holds close to 4000 protein models, mainly of sequences of bacterial origin (Peitsch et al., 1997; Peitsch and Guex, 1997).

Each completed model will be fully annotated and stored as a Swiss-PdbViewer project file. This gives the user a complete modelling environment with all the template structures superposed onto the final model. Users can thus further refine the model from within the Swiss-PdbViewer and use the direct SWISS-MODEL access to submit Optimise mode requests.

The results obtained from FoldFit will be made available on the Internet through a new section of the SWISS-MODEL server and through the ICRF.

All results will be accessible to the scientific community via the SWISS-MODEL section of the ExPASy Web Server.

 

Follow-up

After the project is completed, we will implement the procedures used as completely automated processes. These will constantly look for new sequences to model, and for new experimental structures that allow us to build models of proteins previously not fit for comparative modelling because of limited sequence similarity with the available template structures. We will thereby be able to maintain a complete protein model database with only limited human intervention. This will be implemented on our own Silicon Graphics servers.
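One pass of such an automated update might look as follows. This is a hedged sketch only: the `update_pass` function, the `identity` callback that scores a sequence against a template, and the 30% cut-off are assumptions introduced for illustration and are not taken from the SWISS-MODEL software.

```python
from typing import Callable, Dict, Iterable


def update_pass(
    unmodelled: Iterable[str],
    new_templates: Iterable[str],
    identity: Callable[[str, str], float],
    min_identity: float = 0.30,  # hypothetical cut-off, not stated in the text
) -> Dict[str, str]:
    """Map each previously unmodelled sequence to the best newly released
    template, if that template is similar enough to allow modelling."""
    templates = list(new_templates)
    assignments: Dict[str, str] = {}
    for seq_id in unmodelled:
        # Pick the new template with the highest identity to this sequence.
        best = max(templates, key=lambda t: identity(seq_id, t), default=None)
        if best is not None and identity(seq_id, best) >= min_identity:
            assignments[seq_id] = best
    return assignments
```

Running this pass each time new experimental structures are released would pick up only those sequences that a new template makes modellable, which keeps the amount of repeated work small.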

 

References

Bairoch A, Apweiler R (1996) The SWISS-PROT protein sequence data bank and its new supplement TrEMBL. Nucleic Acids Res. 24:21-25.

Bajorath J, Stenkamp R, Aruffo A (1993) Knowledge-based model building of proteins: concepts and examples. Prot. Sci. 2:1798-1810.

Bernstein FC, Koetzle TF, Williams GJB, Meyer EF, Brice MD, Rodgers JR, Kennard O, Shimanouchi T, Tasumi M (1977) The protein data bank: a computer-based archival file for macromolecular structures. J. Mol. Biol. 112:535-542.

Guex N, Peitsch MC (1997) The Swiss-PdbViewer and SWISS-MODEL, an environment for comparative protein modelling. Electrophoresis 18:2714-2723.

Moult J, Pedersen JT, Judson R, Fidelis K (1995) A large scale experiment to assess protein structure prediction methods. Proteins: Struct. Funct. Genet. 23:II-V.

Peitsch MC, Wilkins MR, Tonella L, Sanchez J-C, Appel RD, Hochstrasser DF (1997) Large scale protein modelling and integration with the SWISS-PROT and SWISS-2DPAGE databases: the example of Escherichia coli. Electrophoresis 18:498-501.

Peitsch MC, Guex N (1997) Large-scale comparative protein modelling. In: Proteome research: new frontiers in functional genomics, p 177-186, Wilkins MR, Williams KL, Appel RD, Hochstrasser DF eds., Springer.

Russell RB, Saqi MAS, Sayle RA, Sternberg MJE (1997a) Recognition of analogous and homologous folds: analysis of sequence and structure conservation. J. Mol. Biol. 269:423-429.

Russell RB, Saqi MAS, Sayle RA, Bates, Sternberg MJE (1997b) Recognition of analogous and homologous folds: assessment of prediction success and associated alignment accuracy using empirical scoring matrices. Prot. Eng. (in press).