NAME

ssearch - scan a protein or DNA sequence library for similar sequences

SYNOPSIS

ssearch [-a -b # -d # -E # -f # -g # -h -i -l FASTLIBS -L -r STATFILE -m # -O filename -Q -s SMATRIX -w # -z ] query-sequence-file library-file

ssearch [-QabdEfghilmOrswz] query-file @library-name-file

ssearch [-QabdEfghilmOrswz] query-file "%PRMVI"

ssearch [-aEfghilmrsw] - interactive mode

DESCRIPTION

ssearch compares a protein or DNA sequence to all of the entries in a sequence library using the rigorous Smith-Waterman algorithm (Smith and Waterman, J. Mol. Biol. (1981) 147:195-197). For example, ssearch can compare a protein sequence to all of the sequences in the NBRF PIR protein sequence database. ssearch automatically decides whether the query sequence is DNA or protein by reading the query sequence as protein and determining whether the `amino-acid composition' is more than 85% A+C+G+T. The program can be invoked either with command line arguments or in interactive mode.

ssearch compares a query sequence to a sequence library, which consists of sequence data interspersed with comments (see below). The fasta programs, including ssearch, use a standard text format sequence file. Lines beginning with '>' are treated as comments; sequences may be in upper or lower case, and blanks, tabs and unrecognizable characters are ignored. ssearch expects sequences to use the single letter amino acid codes; see protcodes(1). Library files for ssearch should have the form shown below.
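
The DNA-versus-protein decision described above can be illustrated with a short Python sketch. This is not the ssearch source code: the function name looks_like_dna is hypothetical, and the handling of non-letter characters and the strict "more than" comparison are assumptions based only on the description in this page.

```python
def looks_like_dna(sequence: str, threshold: float = 0.85) -> bool:
    """Guess whether a sequence is DNA from its letter composition.

    Hypothetical helper: it mirrors the rule described in this page,
    not the actual ssearch implementation.
    """
    # Keep letters only; blanks, tabs and other characters are skipped.
    letters = [ch for ch in sequence.upper() if ch.isalpha()]
    if not letters:
        return False
    acgt = sum(1 for ch in letters if ch in "ACGT")
    # "More than 85%" is read here as a strict inequality (assumption).
    return acgt / len(letters) > threshold


print(looks_like_dna("ATGGCGTACGTTAGC"))   # True
print(looks_like_dna("MKTAYIAKQRQISFVK"))  # False
```

A protein that happens to be rich in alanine, cysteine, glycine and threonine could be misclassified by a rule of this kind, which is worth keeping in mind for very short queries.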

OPTIONS

ssearch can be directed to change the scoring matrix, search parameters, output format, and default search directories by entering options on the command line (preceded by a `-'). All of the options should precede the file name arguments. Alternatively, these options can be changed by setting environment variables. The options and environment variables are:

-a: (SHOWALL) Modifies the display of the two sequences in alignments. Normally, both sequences are shown only where they overlap (SHOWALL=0); if -a is given or the environment variable SHOWALL = 1, both sequences are shown in their entirety.
-b #: The number of similarity scores to be shown when the -Q option is used. This value is usually calculated based on the actual scores.
-d #: The number of alignments to be shown. Normally, ssearch shows the same number of alignments as similarity scores. By using ssearch -Q -b 200 -d 50, one would see the top scoring 200 sequences and alignments for the 50 best scores.
-E #: The expectation value threshold for displaying similarity scores and sequence alignments. ssearch -Q -E 2.0 would show all library sequences with scores expected to occur no more than 2 times by chance in a search of the library.
-f #: Penalty for the first residue in a gap (-12 by default).
-g #: Penalty for additional residues in a gap (-2 by default).
-h: Do not display histogram of similarity scores.
-l file: (FASTLIBS) The name of the library menu file. Normally this will be determined by the environment variable FASTLIBS. However, a library menu file can also be specified with -l.
-L: display more information about the library sequence in the alignment.
-m #: (MARKX) =0,1,2,3. Alternate display of matches and mismatches in alignments. MARKX=0 uses ":", ".", and " " for identities, conservative replacements, and non-conservative replacements, respectively. MARKX=1 uses " ", "x", and "X". MARKX=2 does not show the second sequence, but uses the second alignment line to display matches with a "." for identity, or with the mismatched residue for mismatches. MARKX=2 is useful for aligning large numbers of similar sequences. MARKX=3 writes out a file of library sequences in FASTA format. MARKX=3 should always be used with the "SHOWALL" (-a) option, but this does not completely ensure that all of the sequences output will be aligned.
-O filename: Sends copy of results to "filename".
-Q: Quiet option. This allows ssearch to search a database and report the results without asking any questions. ssearch -Q file library > output can be put in the background or run at a later time with the unix 'at' command. The number of similarity scores and alignments displayed with the -Q option can be modified with the -b (scores) and -d (alignments) options.
-r STATFILE: Causes ssearch to write out the sequence identifier, superfamily number (if available), and similarity scores to STATFILE for every sequence in the library. These results are not sorted.
-s str: (SMATRIX) the filename of an alternative scoring matrix file. For protein sequences, BLOSUM50 is used by default; PAM250 can be used with the command line option -s 250.
-w #: (LINLEN) Output line length for sequence alignments (normally 60; can be set up to 200).
-z: Do not do statistical significance calculation.
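
To show how the -f and -g penalties enter a Smith-Waterman score, here is a minimal Python sketch of local alignment scoring with separate penalties for the first and for each additional residue in a gap. It is an illustration only, not the ssearch implementation: the function name is hypothetical, and a flat match/mismatch score (+5/-4) is an assumption standing in for a real matrix such as BLOSUM50. With the defaults, a gap of n residues costs -12 + (n - 1) * (-2).

```python
def smith_waterman_score(a, b, match=5, mismatch=-4,
                         gap_first=-12, gap_extend=-2):
    """Best local alignment score with first/additional gap penalties.

    Hypothetical sketch: match/mismatch values are assumptions;
    gap_first and gap_extend mirror the -f and -g defaults.
    """
    neg_inf = float("-inf")
    rows, cols = len(a) + 1, len(b) + 1
    # H: best score ending at (i, j); E/F: best score ending in a gap.
    H = [[0] * cols for _ in range(rows)]
    E = [[neg_inf] * cols for _ in range(rows)]
    F = [[neg_inf] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            # Open a new gap or extend an existing one along each axis.
            E[i][j] = max(H[i][j - 1] + gap_first, E[i][j - 1] + gap_extend)
            F[i][j] = max(H[i - 1][j] + gap_first, F[i - 1][j] + gap_extend)
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The floor of 0 is what makes the alignment local.
            H[i][j] = max(0, diag, E[i][j], F[i][j])
            best = max(best, H[i][j])
    return best


# 14 matches (70) minus one single-residue gap (12) gives 58.
print(smith_waterman_score("ACDEFGHIKLMNPQ", "ACDEFGHWIKLMNPQ"))  # 58
```

The sketch fills three full matrices, so time and memory both grow with the product of the two sequence lengths; this is consistent with the advice in EXAMPLES that searching a complete library with ssearch is very slow.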

EXAMPLES

(1): ssearch musplfm.aa $AABANK

Compare the amino acid sequence in the file musplfm.aa with the complete PIR protein sequence library. This is extremely slow and should almost never be done. ssearch is designed to search very small libraries of sequences.

Library files have the following form:

>LCBO bovine preprolactin
WILLLSQ ...
>LCHU human ...
...

(2): ssearch -a -w 80 musplfm.aa lcbo.aa

Compare the amino acid sequence in the file musplfm.aa with the sequences in the file lcbo.aa. Show both sequences in their entirety, with 80 residues on each output line.
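
When ssearch is driven from a script, the command line can be assembled as an argument list. The Python sketch below is a hypothetical helper, not part of the fasta package; it uses only the -Q, -a, -b, -d and -w options documented in OPTIONS, places them before the file names, and builds the command without running it.

```python
import shlex


def build_ssearch_command(query, library, show_all=False, line_length=None,
                          quiet=False, scores=None, alignments=None):
    """Return an argv list for ssearch (hypothetical helper)."""
    argv = ["ssearch"]
    if quiet:
        argv.append("-Q")           # no interactive questions
    if show_all:
        argv.append("-a")           # show both sequences in full
    if scores is not None:
        argv += ["-b", str(scores)]
    if alignments is not None:
        argv += ["-d", str(alignments)]
    if line_length is not None:
        argv += ["-w", str(line_length)]
    # Options come first, then the query file and the library file.
    argv += [query, library]
    return argv


cmd = build_ssearch_command("musplfm.aa", "lcbo.aa",
                            show_all=True, line_length=80)
print(shlex.join(cmd))  # ssearch -a -w 80 musplfm.aa lcbo.aa
```

The resulting list can be passed to subprocess.run on a system where ssearch is installed; adding quiet=True is advisable there so that the program does not wait for interactive input.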

(3): ssearch

Run the ssearch program in interactive mode. The program will prompt for the file name of the query sequence and list alternative libraries to be searched (if FASTLIBS is set).

You can use your own sequence files with ssearch; just be certain to put a '>' and a comment as the first line before the sequence.
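
As a way to check that a file follows this layout before searching with it, the following Python sketch reads text in which each entry starts with a '>' comment line followed by sequence lines. The function name read_fasta is hypothetical, and the choice to keep only letters and convert them to upper case is an assumption that loosely follows the description of ignored characters above.

```python
def read_fasta(text):
    """Parse '>'-delimited entries into (comment, sequence) pairs.

    Hypothetical helper for checking your own sequence files.
    """
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        if line.startswith(">"):
            # A new comment line closes the previous entry, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            # Keep letters only; blanks, tabs and digits are dropped.
            chunks.append("".join(ch for ch in line.upper() if ch.isalpha()))
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


sample = ">LCBO bovine preprolactin\nWILL LSQ\n"
print(read_fasta(sample))  # [('LCBO bovine preprolactin', 'WILLLSQ')]
```

Sequence lines that appear before the first '>' line are skipped by this sketch, which makes a missing comment line easy to spot: the file yields no records.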

AUTHOR