|
|
| |
SPIDEY(1) |
NCBI Tools User's Manual |
SPIDEY(1) |
spidey - align mRNA sequences to a genome
spidey [-] [-F N] [-G]
[-L N] [-M filename]
[-N filename] [-R filename]
[-S p/m] [-T N] [-X]
[-a filename] [-c N] [-d]
[-e X] [-f X]
[-g X] -i filename [-j]
[-k filename] [-l N]
-m filename [-n N]
[-o str] [-p N]
[-r c/d/m/p/v] [-s]
[-t filename] [-u] [-w]
spidey is a tool for aligning one or more mRNA sequences to a given
genomic sequence. spidey was written with two main goals in mind: find
good alignments regardless of intron size; and avoid getting confused by
nearby pseudogenes and paralogs. Towards the first goal, spidey uses
BLAST and Dot View (another local alignment tool) to find its alignments;
since these are both local alignment tools, spidey does not
intrinsically favor shorter or longer introns and has no maximum intron size.
To avoid mistakenly including exons from paralogs and pseudogenes,
spidey first defines windows on the genomic sequence and then performs
the mRNA-to-genomic alignment separately within each window. Because of the
way the windows are constructed, neighboring paralogs or pseudogenes should be
in separate windows and should not be included in the final spliced alignment.
spidey takes as input a single genomic sequence and a set of mRNA
accessions or FASTA sequences. All processing is done one mRNA sequence at a
time. The first step for each mRNA sequence is a high-stringency BLAST against
the genomic sequence. The resulting hits are analyzed to find the genomic
windows.
The BLAST alignments are sorted by score and then assigned into
windows by a recursive function which takes the first alignment and then
goes down the alignment list to find all alignments that are consistent with
the first (same strand of mRNA, both the mRNA and genomic coordinates are
nonoverlapping and linearly consistent). On subsequent passes, the remaining
alignments are examined and are put into their own nonoverlapping,
consistent windows, until no alignments are left. Depending on how many gene
models are desired, the top n windows are chosen to go on to the next
step and the others are deleted.
Once the genomic windows are constructed, the initial BLAST alignments are freed
and another BLAST search is performed, this time with the entire mRNA against
the genomic region defined by the window, and at a lower stringency than the
initial search. spidey then uses a greedy algorithm to generate a
high-scoring, nonoverlapping subset of the alignments from the second BLAST
search. This consistent set is analyzed carefully to make sure that the entire
mRNA sequence is covered by the alignments. When gaps are found between the
alignments, the appropriate region of genomic sequence is searched against the
missing mRNA, first using a very low-stringency BLAST and, if the BLAST fails
to find a hit, using DotView functions to locate the alignment. When gaps are
found at the ends of the alignments, the BLAST and DotView searches are
actually allowed to extend past the boundaries of the window. If the 3' end of
the mRNA does not align completely, it is first examined for the presence of a
poly(A) tail. No attempt is made to align the portion of the mRNA that seems
to be a poly(A) tail; sometimes there is a poly(A) tail that does align to the
genomic sequence, and these are noted because they indicate the possibility of
a pseudogene.
Now that the mRNA is completely covered by the set of alignments,
the boundaries of the alignments (there should be one alignment per exon
now) are adjusted so that the alignments abut each other precisely and so
that they are adjacent to good splice donor and acceptor sites. Most
commonly, two adjacent exons' alignments overlap by as much as 20 or 30 base
pairs on the mRNA sequence. The true exon boundary may lie anywhere within
this overlap, or (as we have seen empirically) even a few base pairs outside
the overlap. To position the exon boundaries, the overlap plus a few base
pairs on each side is examined for splice donor sites, using functions that
have different splice matrices depending on the organism chosen. The top few
splice donor sites (by score) are then evaluated as to how much they affect
the original alignment boundaries. The site that affects the boundaries the
least is chosen, and is evaluated as to the presence of an acceptor site.
The alignments are truncated or extended as necessary so that they terminate
at the splice donor site and so that they do not overlap.
The windows are examined carefully to get the percent identity per exon, the
number of gaps per exon, the overall percent identity, the percent coverage of
the mRNA, presence of an aligning or non-aligning poly(A) tail, number of
splice donor sites and the presence or absence of splice donor and acceptor
sites for each exon, and the occurrence of an mRNA that has a 5' or 3' end (or
both) that does not align to the genomic sequence. If the overall percent
identity and percent length coverage are above the user-defined cutoffs, a
summary report is printed, and, if requested, a text alignment showing
identities and mismatches is also printed.
spidey is capable of performing interspecies alignments. The major
difference in interspecies alignments is that the mRNA-genomic identity will
not be close to 100% as it is in intraspecies alignments; also, the alignments
have numerous and lengthy gaps. If spidey is used in its normal mode to
do interspecies alignments, it produces gene models with many, many short
exons. When the interspecies flag is set, spidey uses different BLAST
parameters to encourage longer and more gaps and to not penalize as heavily
for mismatches. This way, the alignments for the exons are much longer and
more closely approximate the actual gene structure.
When spidey is run in network-aware mode or when ASN.1 files are used for
the mRNA records, it is capable of extracting a CDS alignment from an mRNA
alignment and printing the CDS information also. Since the CDS alignment is
just a subset of the mRNA alignment, it is relatively straightforward to
truncate the exon alignments as necessary and to generate a CDS alignment.
Furthermore, the untranslated regions are now defined, so the percent identity
for the 5' and 3' untranslated regions is also calculated.
A summary of options is included below.
- -
- Print usage message.
- -F N
- Start of genomic interval desired (from; 0-based).
- -G
- Input file is a GI list.
- -L N
- The extra-large intron size to use (default = 220000).
- -M filename
- File with donor splice matrix.
- -N filename
- File with acceptor splice matrix.
- -R filename
- File (including path) to repeat blast database for filtering.
- -S p/m
- Restrict to plus (p) or minus (m) strand of genomic sequence.
- -T N
- Stop of genomic interval desired (to; 0-based).
- -X
- Use extra-large intron sizes (increases the limit for initial and terminal
introns from 100kb to 240kb and for all others from 35kb to 120kb); may
result in significantly longer compute times.
- -a filename
- Output file for alignments when directed to a separate file with
-p 3 (default = spidey.aln).
- -c N
- Identity cutoff, in percent, for quality control purposes.
- -d
- Also try to align coding sequences corresponding to the given mRNA records
(may require network access).
- -e X
- First-pass e-value (default = 1.0e-10). Higher values increase speed at
the cost of sensitivity.
- -f X
- Second-pass e-value (default = 0.001).
- -g X
- Third-pass e-value (default = 10).
- -i filename
- Input file containing the genomic sequence in ASN.1 or FASTA format. If
your computer is running on a network that can access GenBank, you can
substitute the desired accession number for the filename.
- -j
- Print ASN.1 alignment?
- -k filename
- File for ASN.1 output with -k (default = spidey.asn).
- -l N
- Length coverage cutoff, in percent.
- -m filename
- Input file containing the mRNA sequence(s) in ASN.1 or FASTA format, or a
list of their accessions (with -G). If your computer is running on
a network that can access GenBank, you can substitute a single accession
number for the filename.
- -n N
- Number of gene models to return per input mRNA (default = 1).
- -o str
- Main output file (default = stdout; contents controlled by
-p).
- -p N
- Print alignment?
- 0
- summary and alignments together (default)
- 1
- just the summary
- 2
- just the alignments
- 3
- summary and alignments in different files
- -r c/d/m/p/v
- Organism of genomic sequence, used to determine splice matrices.
- c
- C. elegans
- d
- Drosophila
- m
- Dictyostelium discoideum
- p
- plant
- v
- vertebrate (default)
- -s
- Tune for interspecies alignments.
- -t filename
- File with feature table, in 4 tab-delimited columns:
- seqid
- (e.g., NM_04377.1)
- name
- (only repetitive_region is currently supported)
- start
- (0-based)
- stop
- (0-based)
- -u
- Make a multiple alignment of all input mRNAs (which must overlap on the
genomic sequence).
- -w
- Consider lowercase characters in input FASTA sequences to be masked.
Sarah Wheelan and others at the National Center for Biotechnology Information;
Steffen Moeller contributed to this documentation.
<http://www.ncbi.nlm.nih.gov/spidey>
Visit the GSP FreeBSD Man Page Interface. Output converted with ManDoc. |