Natural sequence database


SCOP(v1.75)-DB (version 1): Sequence homologs for 16,712 sequences from 3,901 SCOP domain families (version 1.75) were identified using PSI-BLAST searches against the UniRef90 database (search criteria: E-value and H-value = 0.0001, query coverage = 80%, iterations = 5). Hits were clustered at 90% sequence identity to obtain a non-redundant set of sequences. The 16,712 sequences and their respective non-redundant homologs were pooled to generate the natural (or CONTROL) database comprising 4,694,921 sequences.
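The E-value and query-coverage criteria above can be illustrated with a minimal sketch. It assumes the search output has already been parsed into simple records; the `Hit` record, its field names and the helper functions are hypothetical and not part of the actual pipeline.

```python
from dataclasses import dataclass

E_VALUE_CUTOFF = 1e-4      # E-value threshold quoted in the text
MIN_QUERY_COVERAGE = 0.80  # 80% query coverage quoted in the text


@dataclass(frozen=True)
class Hit:
    """One query-to-subject alignment (hypothetical record layout)."""
    subject_id: str
    evalue: float
    query_start: int  # 1-based, inclusive
    query_end: int    # 1-based, inclusive


def query_coverage(hit: Hit, query_length: int) -> float:
    """Fraction of query residues spanned by the aligned region."""
    return (hit.query_end - hit.query_start + 1) / query_length


def passes_criteria(hit: Hit, query_length: int) -> bool:
    """Apply the E-value and query-coverage criteria to a single hit."""
    return (hit.evalue <= E_VALUE_CUTOFF
            and query_coverage(hit, query_length) >= MIN_QUERY_COVERAGE)


def select_homologs(hits, query_length):
    """Return the subject identifiers of all hits that pass both criteria."""
    return [h.subject_id for h in hits if passes_criteria(h, query_length)]
```

For a query of length 100, a hit aligned over residues 1 to 90 with an E-value of 1e-10 would be kept, whereas a hit covering only residues 1 to 50 would be rejected regardless of its E-value.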

Pfam(v27.0)-DB (version 1): 10,626,097 non-redundant protein sequences for 14,831 Pfam families were retrieved from the Pfam database (version 27).

 

Why design sequences between families?


Protein domains are organized into families based on sequence, structural or functional similarities. If we were to visualize this protein family landscape, we would find it riddled with gaps. Sequence-based remote homology detection methods are rendered less effective by this non-uniform dispersion of sequences and by the paucity of sequences that can act as ‘linkers’ to facilitate homology detection. To address this problem, we developed an algorithm that designs protein-like sequences and attempts to link distantly related proteins by ‘populating gaps in protein sequence space’.

 

Algorithm to design sequences and generate NrichD database


 

AS-DB (Artificial Sequence database, version 1)


3,611,010 intermediate sequences were designed for 374 of the 419 folds with more than two families in the SCOP database (version 1.75). This dataset is referred to as the AS-DB (Artificial Sequence database). Each designed intermediate sequence is annotated with its two parent families and the level at which it was designed. These designed intermediate sequences can be plugged into any commonly used database.

Table 1: Description of the designed sequences for each of the major SCOP classes

| SCOP class | Total number of folds | Number of folds with more than two families | Number of folds for which sequences could be designed | Number of designed sequences |
|---|---|---|---|---|
| All alpha proteins | 284 | 87 | 78 | 851,150 |
| All beta proteins | 174 | 72 | 68 | 520,235 |
| Alpha/beta proteins | 147 | 77 | 72 | 1,474,107 |
| Alpha + Beta proteins | 376 | 133 | 121 | 741,087 |
| Multi-domain proteins | 66 | 17 | 15 | 4,619 |
| Membrane and cell surface proteins | 58 | 9 | 9 | 14,596 |
| Small proteins | 90 | 24 | 12 | 5,216 |
| Total | 1195 | 419 | 374 | 3,611,010 |

 

NrichD (Natural sequences enriched with designed intermediate sequences) databases


The 3,611,010 intermediate sequences designed between 27,882 pairs of families belonging to 374 folds in the SCOP database (v1.75) were added to the natural sequence databases of SCOP and Pfam to create the SCOP-NrichD database (version 1) and the Pfam-NrichD database (version 1), respectively.

(1) SCOP(v1.75)-NrichD sequence database (version 1)


4,678,209 sequence homologues for 16,712 SCOP domain family sequences were obtained using PSI-BLAST searches against the UniRef90 database. The resulting 4,694,921 natural sequences (domain family sequences plus homologues) were augmented with the computationally designed intermediate sequences for 374 SCOP folds, giving 8,305,931 sequences in total.

(2) Pfam(v27.0)-NrichD sequence database (version 1)


The Pfam sequence database was augmented with the computationally designed intermediate sequences that were previously generated using multiple PSSMs of SCOP domain families, giving 14,237,107 sequences in total.
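The database sizes quoted above follow directly from the counts given in the text. The short check below only restates that arithmetic; the variable names are chosen here for illustration.

```python
# Counts quoted in the text
scop_family_sequences = 16_712   # SCOP domain family sequences
scop_homologues = 4_678_209      # non-redundant homologues from UniRef90
pfam_natural = 10_626_097        # Pfam(v27.0)-DB sequences
designed = 3_611_010             # designed intermediate sequences (AS-DB)

scop_natural = scop_family_sequences + scop_homologues
scop_nrichd = scop_natural + designed
pfam_nrichd = pfam_natural + designed

print(f"SCOP-DB:     {scop_natural:,}")   # 4,694,921
print(f"SCOP-NrichD: {scop_nrichd:,}")    # 8,305,931
print(f"Pfam-NrichD: {pfam_nrichd:,}")    # 14,237,107
```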

In our previous analysis we demonstrated that the sensitivity of searches made in the SCOP-NrichD database is significantly higher than that of searches made in the SCOP-DB sequence database. However, a few important hits can sometimes be missed because of profile drift. Therefore, this web resource queries both (1) the natural sequence database (SCOP-DB or Pfam-DB) and (2) the NrichD database (SCOP-NrichD or Pfam-NrichD), and reports the results as the union of both searches.
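Reporting the union of both searches can be sketched as follows. This is an illustration only: it assumes each search has been reduced to a mapping from hit identifier to E-value, and the rule of keeping the lower E-value when a hit is found in both searches is an assumption made here, not a documented behaviour of the web resource.

```python
def union_of_hits(natural_hits, nrichd_hits):
    """Merge two {hit_id: evalue} mappings into one list sorted by E-value.

    A hit found in only one search is kept as it is; a hit found in both
    keeps the lower (better) E-value. This tie-breaking rule is an assumption.
    """
    merged = dict(natural_hits)
    for hit_id, evalue in nrichd_hits.items():
        if hit_id not in merged or evalue < merged[hit_id]:
            merged[hit_id] = evalue
    return sorted(merged.items(), key=lambda item: item[1])


# Hypothetical identifiers and E-values, for illustration only
natural = {"d1": 1e-5, "d2": 1e-8}
nrichd = {"d2": 1e-12, "d3": 1e-6}
combined = union_of_hits(natural, nrichd)
```

In this example `d1` is recovered only from the natural database and `d3` only from the NrichD database, so a hit lost to profile drift in one search can still appear in the combined list.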

 

Download databases


The compressed data files of the (a) SCOP-DB, (b) SCOP-NrichD database, (c) Pfam-DB, (d) Pfam-NrichD database, (e) Pfam31-NrichD and (f) AS-DB (Artificial Sequence database) can be downloaded from the download page.

 

Design protein-like sequences


This resource also provides a unique feature for designing protein-like sequences. The user can design sequences for a single family using the amino acid frequencies observed at each position of the protein family alignment: at each position, an amino acid is chosen according to its fitness/frequency at that position. Generated sequences are considered bona fide designed sequences only if they detect the parent profile at an E-value <= 0.0001 with at least 80% query coverage and detect profiles from the parent fold only.
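Position-wise, frequency-based selection for a single family can be illustrated with a minimal sketch. It assumes the family alignment has already been reduced to one dictionary of amino acid frequencies per position; the function name and the toy frequencies are hypothetical, and the subsequent profile-search validation is not shown.

```python
import random


def design_sequence(position_frequencies, rng):
    """Draw one residue per alignment position.

    Each position is a {amino_acid: frequency} dictionary; a residue is drawn
    with probability proportional to its frequency at that position.
    """
    residues = []
    for column in position_frequencies:
        amino_acids = list(column)
        weights = [column[aa] for aa in amino_acids]
        residues.append(rng.choices(amino_acids, weights=weights, k=1)[0])
    return "".join(residues)


# Toy three-position profile (illustrative numbers, not real family data)
toy_profile = [
    {"A": 1.0},
    {"G": 0.7, "S": 0.3},
    {"L": 0.5, "I": 0.5},
]
designed = design_sequence(toy_profile, random.Random(0))
```

A fully conserved position always yields the conserved residue, whereas variable positions yield different residues across repeated draws.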
The process for designing intermediate sequences between families is described in the flow chart. Briefly, we first represent the protein families as profiles or PSSMs. These profiles are then aligned, all vs. all, using AlignHUSH, and the best-scoring alignment is taken as a guide to combine scores at a user-defined LEVEL for each alignment position. These scores are transferred to a roulette wheel, and an amino acid is selected at each alignment position using a random number generator to generate an entire sequence. The user-defined LEVEL dictates the amount of divergence in the designed sequences: the higher the level, the more divergent the designed sequences. The generated artificial sequences are subjected to eligibility checks: they should detect both parent families in reverse searches with a query coverage greater than or equal to 80%, and they should not detect any false positives (i.e., sequences from other folds, if fold information is known).
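The roulette-wheel step and the eligibility checks can be sketched as below. This is an illustration under stated assumptions, not the published implementation: the two profiles are assumed to be already aligned into pairs of {amino_acid: score} columns with positive scores, and `combine_columns` uses a stand-in rule (average the two columns, then flatten with exponent 1/level) because the exact way LEVEL enters the combination is not specified here. All function names are hypothetical.

```python
import random
from itertools import accumulate


def combine_columns(col_a, col_b, level):
    """Stand-in combination rule (an assumption): average the two parent
    columns, then flatten with exponent 1/level so that a higher level
    gives a more even wheel and hence more divergent sequences."""
    residues = sorted(set(col_a) | set(col_b))
    averaged = {aa: (col_a.get(aa, 0.0) + col_b.get(aa, 0.0)) / 2 for aa in residues}
    return {aa: score ** (1.0 / level) for aa, score in averaged.items() if score > 0}


def spin_wheel(weights, rng):
    """Roulette-wheel selection: each residue owns a slice of the wheel
    proportional to its weight; a random pointer picks the slice."""
    residues = list(weights)
    cumulative = list(accumulate(weights[aa] for aa in residues))
    pointer = rng.random() * cumulative[-1]
    for aa, upper in zip(residues, cumulative):
        if pointer < upper:
            return aa
    return residues[-1]  # guard against floating-point round-off


def design_intermediate(aligned_columns, level, rng):
    """Generate one sequence from pairs of aligned parent profile columns."""
    return "".join(spin_wheel(combine_columns(a, b, level), rng)
                   for a, b in aligned_columns)


def is_eligible(detected_families, detected_folds, parents, parent_fold):
    """Both parent families must be detected and no other fold may be hit.
    Detected families are assumed to have already passed the 80% coverage filter."""
    return (set(parents) <= set(detected_families)
            and set(detected_folds) <= {parent_fold})


# Toy aligned columns (illustrative numbers, not real profile scores)
columns = [({"A": 1.0}, {"A": 1.0}), ({"G": 0.9, "S": 0.1}, {"T": 1.0})]
intermediate = design_intermediate(columns, level=2, rng=random.Random(1))
```

Positions where both parents agree reproduce the shared residue, while positions where the parents differ can yield a residue from either parent, which is what lets a designed sequence sit between the two families.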

References:

(1) Filling-in void and sparse regions in protein sequence space by protein-like artificial sequences enables remarkable enhancement in remote homology detection capability.
Mudgal R., Sowdhamini R., Chandra N., Srinivasan N. and Sandhya S.
J. Mol. Biol. (2014) 426: 962-979.  (http://dx.doi.org/10.1016/j.jmb.2013.11.026)


(2) Cascaded walks in protein sequence space: use of artificial sequences in remote homology detection between natural proteins.
Sandhya S., Mudgal R., Jayadev C., Abhinandan K.R., Sowdhamini R. and Srinivasan N.
Mol. Biosyst. (2012) 8: 2076-2084.  (doi: 10.1039/c2mb25113b)