cmalign(1)    Infernal Manual    cmalign(1)
cmalign - align sequences to a covariance model
cmalign [options] <cmfile> <seqfile>
cmalign aligns the RNA sequences in <seqfile> to the
covariance model (CM) in <cmfile>. The new alignment is output to
stdout in Stockholm format, but can be redirected to a file
<f> with the -o <f> option.
Either <cmfile> or <seqfile> (but not
both) may be '-' (dash), which means reading this input from stdin
rather than a file.
The sequence file <seqfile> must be in FASTA or
Genbank format.
By default, cmalign accelerates alignment with an HMM banding
technique, described below for the --hbanded option. HMM
banding can be turned off with the --nonbanded option.
By default, cmalign computes the alignment with maximum
expected accuracy that is consistent with constraints (bands) derived from
an HMM, using a banded version of the Durbin/Holmes optimal accuracy
algorithm. This behavior can be changed with the --cyk or
--sample options.
cmalign takes special care to correctly align truncated
sequences, where some nucleotides from the beginning (5') and/or end (3') of
the actual full length biological sequence are not present in the input
sequence (see DL Kolbe and SR Eddy, Bioinformatics, 25:1236-1243, 2009).
This behavior is on by default, but can be turned off with --notrunc.
In previous versions of cmalign the --sub option was required
to appropriately handle truncated sequences. The --sub option is
still available in this version, but the new default method for handling
truncated sequences should be as good or superior to the sub method in
nearly all cases.
The --mapali <f> option allows inclusion of
the fixed training alignment used to build the CM from file <f>
within the output alignment of cmalign.
It is possible to merge two or more alignments created by the same
CM using the Easel miniapp esl-alimerge (included in the
easel/miniapps/ subdirectory of Infernal). Previous versions of
cmalign included options to merge alignments but they were deprecated
upon development of esl-alimerge, which is significantly more memory
efficient.
By default, cmalign will output the alignment to stdout.
The alignment can be redirected to an output file <f> with the
-o <f> option. With -o, information on each
aligned sequence, including score and model alignment boundaries, will be
printed to stdout (more on this below).
The output alignment will be in Stockholm format by default. This
can be changed to Pfam, aligned FASTA (AFA), A2M, Clustal, or Phylip format
using the --outformat <s> option, where <s>
is the name of the desired format. As a special case, if the output
alignment is large (more than 10,000 sequences or more than 10,000,000 total
nucleotides), then the output format will be Pfam format, with each sequence
appearing on a single line, for reasons of memory efficiency. For alignments
larger than this, using --ileaved will force interleaved Stockholm
format, but the user should be aware that this may require a lot of memory.
--ileaved will only work for alignments up to 100,000 sequences or
100,000,000 total nucleotides.
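The size thresholds above can be summarized as a small decision function. The following Python sketch is illustrative only and is not part of Infernal. The function name expected_output_format and its arguments are hypothetical, it ignores --outformat, and raising an error above the --ileaved limits is an assumption (the text above states only that --ileaved will not work beyond them).

```python
# Thresholds quoted from the text above.
LARGE_NSEQ = 10_000             # "more than 10,000 sequences"
LARGE_NRES = 10_000_000         # "more than 10,000,000 total nucleotides"
ILEAVED_MAX_NSEQ = 100_000      # --ileaved limit on sequences
ILEAVED_MAX_NRES = 100_000_000  # --ileaved limit on total nucleotides


def expected_output_format(nseq, nres, ileaved=False):
    """Guess the alignment format cmalign would write (hypothetical helper).

    nseq is the number of sequences and nres the total number of
    nucleotides in the output alignment. --outformat is not modeled.
    """
    if ileaved:
        if nseq > ILEAVED_MAX_NSEQ or nres > ILEAVED_MAX_NRES:
            # Assumption: treated here as an error; the manual only says
            # --ileaved "will only work" up to these sizes.
            raise ValueError("alignment too large for --ileaved")
        return "interleaved Stockholm"
    if nseq > LARGE_NSEQ or nres > LARGE_NRES:
        return "Pfam"
    return "Stockholm"
```

A helper like this could be used to decide ahead of time whether downstream tools need to handle one-line-per-sequence Pfam output.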
If the output alignment format is Stockholm or Pfam, the output
alignment will be annotated with posterior probabilities which estimate the
confidence level of each aligned nucleotide. This annotation appears as
lines beginning with "#=GR <seq name> PP", one per sequence,
each immediately below the corresponding aligned sequence "<seq
name>". Characters in PP lines have 12 possible values:
"0-9", "*", or ".". If ".", the
position corresponds to a gap in the sequence. A value of "0"
indicates a posterior probability of between 0.0 and 0.05, "1"
indicates between 0.05 and 0.15, "2" indicates between 0.15 and
0.25 and so on up to "9" which indicates between 0.85 and 0.95. A
value of "*" indicates a posterior probability of between 0.95 and
1.0. Higher posterior probabilities correspond to greater confidence that
the aligned nucleotide belongs where it appears in the alignment. With
--nonbanded, the calculation of the posterior probabilities considers
all possible alignments of the target sequence to the CM. Without
--nonbanded (i.e. in default mode), the calculation considers only
possible alignments within the HMM bands. Further, the posterior
probabilities are conditional on the truncation mode of the alignment. For
example, if the sequence alignment is truncated 5', a PP value of
"9" indicates between 0.85 and 0.95 of all 5' truncated alignments
include the given nucleotide at the given position. The posterior annotation
can be turned off with the --noprob option. If --small is
enabled, posterior annotation must also be turned off using
--noprob.
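The PP character encoding described above can be decoded mechanically. The following Python sketch is not part of Infernal. The function names pp_range and uncertain_positions and the example PP string are hypothetical, and the 0.85 default cutoff is an arbitrary choice for illustration.

```python
def pp_range(ch):
    """Return the (low, high) posterior probability range for one PP
    character, or None if the character marks a gap (".")."""
    if len(ch) != 1:
        raise ValueError("expected a single character")
    if ch == ".":
        return None
    if ch == "*":
        return (0.95, 1.0)
    if ch in "0123456789":
        d = int(ch)
        # "0" covers 0.0-0.05; digit d >= 1 covers (d/10 - 0.05) to (d/10 + 0.05).
        low = max(0, 10 * d - 5) / 100
        high = (10 * d + 5) / 100
        return (low, high)
    raise ValueError("not a PP character: %r" % ch)


def uncertain_positions(pp_line, min_low=0.85):
    """Return 0-based indices of aligned nucleotides whose PP range starts
    below min_low. Gap positions are skipped."""
    positions = []
    for i, ch in enumerate(pp_line):
        bounds = pp_range(ch)
        if bounds is not None and bounds[0] < min_low:
            positions.append(i)
    return positions
```

For the made-up PP string "**9.5*", uncertain_positions reports only index 4: the "5" there has a range starting at 0.45, the "9" starts exactly at 0.85, and the "." is a gap.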
The tabular output that is printed to stdout if the -o
option is used includes one line per sequence and twelve fields per line:
"idx": the index of the sequence in the input file; "seq
name": the sequence name; "length": the length of the
sequence; "cm from" and "cm to": the model start and end
positions of the alignment; "trunc": "no" if the
sequence is not truncated, "5'" if the beginning of the sequence
is truncated, "3'" if the end of the sequence is truncated, and
"5'&3'" if both the beginning and the end are truncated;
"bit sc": the bit score of the alignment; "avg pp": the
average posterior probability of all aligned nucleotides in the alignment;
"band calc", "alignment" and "total": the time
in seconds required for calculating HMM bands, computing the alignment, and
complete processing of the sequence, respectively; "mem (Mb)": the
size in Mb of all dynamic programming matrices required for aligning the
sequence. This tabular data can be saved to file <f> with the
--sfile <f> option.
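The twelve fields listed above lend themselves to simple scripted parsing. The following Python sketch is not part of Infernal and rests on assumptions that are not stated in this manual: that fields are separated by whitespace, that sequence names contain no whitespace, that header lines begin with "#", and that neither --verbose nor --noprob has changed the set of columns. The names SeqScore, parse_score_lines and SAMPLE are hypothetical, and the sample lines are made up for illustration.

```python
from typing import NamedTuple


class SeqScore(NamedTuple):
    """One row of the tabular output, in the field order given above."""
    idx: int
    name: str
    length: int
    cm_from: int
    cm_to: int
    trunc: str          # "no", "5'", "3'" or "5'&3'"
    bit_score: float
    avg_pp: float
    band_calc_s: float
    alignment_s: float
    total_s: float
    mem_mb: float


def parse_score_lines(lines):
    """Parse score lines into SeqScore records, skipping blank lines and
    lines starting with "#" (assumed to be headers or comments)."""
    records = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) != 12:
            raise ValueError("expected 12 fields, got %d" % len(fields))
        records.append(SeqScore(
            int(fields[0]), fields[1], int(fields[2]),
            int(fields[3]), int(fields[4]), fields[5],
            *(float(x) for x in fields[6:])))
    return records


# Made-up example input; real files may be laid out differently.
SAMPLE = [
    "# idx  seq name  length  cm from  cm to  trunc  bit sc  avg pp  band calc  alignment  total  mem (Mb)",
    "1  seqA  72  1  71  no  60.5  0.98  0.00  0.01  0.01  0.10",
    "2  seqB  40  30  71  5'  25.0  0.91  0.00  0.00  0.00  0.05",
]
```

Calling parse_score_lines(SAMPLE) yields two records. Filtering on the trunc field, for example, would select the sequences that cmalign treated as truncated.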
- -h
- Help; print a brief reminder of command line usage and available options.
- -o <f>
- Save the alignment in Stockholm format to a file <f>. The
default is to write it to standard output.
- -g
- Configure the model for global alignment of the query model to the target
sequences. By default, the model is configured for local alignment. Local
alignments can contain large insertions and deletions in the structure,
called "local ends", which are penalized differently than normal
indels. These are annotated as "~" columns in the RF line of the
output alignment. The -g option can be used to disallow these local
ends. The -g option is required if the --sub option is also
used.
- --optacc
- Align sequences using the Durbin/Holmes optimal accuracy algorithm. This
is the default. The optimal accuracy alignment will be constrained by HMM
bands for acceleration unless the --nonbanded option is enabled.
The optimal accuracy algorithm determines the alignment that maximizes the
posterior probabilities of the aligned nucleotides within it. The
posterior probabilities are determined using (possibly HMM banded) variants
of the Inside and Outside algorithms.
- --cyk
- Do not use the Durbin/Holmes optimal accuracy algorithm to align the
sequences; instead, use the CYK algorithm, which determines the optimally
scoring (maximum likelihood) alignment of the sequence to the model, given
the HMM bands (unless --nonbanded is also enabled).
- --sample
- Sample an alignment from the posterior distribution of alignments. The
posterior distribution is determined using an HMM banded (unless
--nonbanded) variant of the Inside algorithm.
- --seed <n>
- Seed the random number generator with <n>, an integer >=
0. This option can only be used in combination with --sample. If
<n> is nonzero, stochastic sampling of alignments will be
reproducible; the same command will give the same results. If
<n> is 0, the random number generator is seeded arbitrarily,
and stochastic samplings may vary from run to run of the same command. The
default seed is 181.
- --notrunc
- Turn off truncated alignment algorithms. All sequences in the input file
will be assumed to be full length, unless --sub is also used, in
which case the program can still handle truncated sequences but will use
an alternative strategy for their alignment.
- --sub
- Turn on the sub model construction and alignment procedure. For each
sequence, an HMM is first used to predict the model start and end
consensus columns, and a new sub CM is constructed that only models
consensus columns from start to end. The sequence is then aligned to this
sub CM. Sub alignment is an older method than the default one for aligning
sequences that are possibly truncated. By default, cmalign uses
special DP algorithms to handle truncated sequences which should be more
accurate than the sub method in most cases. --sub is still included
as an option mainly for testing against this default truncated sequence
handling. This "sub CM" procedure is not the same as the
"sub CMs" described by Weinberg and Ruzzo.
- --hbanded
- This option is turned on by default. Accelerate alignment by pruning away
regions of the CM DP matrix that are deemed negligible by an HMM. First,
each sequence is scored with a CM plan 9 HMM derived from the CM using the
Forward and Backward HMM algorithms to calculate posterior probabilities
that each nucleotide aligns to each state of the HMM. These posterior
probabilities are used to derive constraints (bands) on the CM DP matrix.
Finally, the target sequence is aligned to the CM using the banded DP
matrix, during which cells outside the bands are ignored. Usually most of
the full DP matrix lies outside the bands (often more than 95%), making
this technique faster because fewer DP calculations are required, and more
memory efficient because only cells within the bands need be allocated.
Importantly, HMM banding sacrifices the guarantee of
determining the optimally accurate or optimal alignment, which will be
missed if it lies outside the bands. The tau parameter is the amount of
probability mass considered negligible during HMM band calculation;
higher values of tau yield greater speedups but also a greater chance of
missing the optimal alignment. The default tau is 1E-7, determined
empirically as a good tradeoff between sensitivity and speed, though
this value can be changed with the --tau <x> option. The
level of acceleration increases with both the length and primary
sequence conservation level of the family. For example, with the default
tau of 1E-7, tRNA models (low primary sequence conservation with length
of about 75 nucleotides) show about 10X acceleration, and SSU bacterial
rRNA models (high primary sequence conservation with length of about
1500 nucleotides) show about 700X. HMM banding can be turned off with
the --nonbanded option.
- --tau <x>
- Set the tail loss probability used during HMM band calculation to
<x>. This is the amount of probability mass within the HMM
posterior probabilities that is considered negligible. The default value
is 1E-7. In general, higher values will result in greater acceleration,
but increase the chance of missing the optimal alignment due to the HMM
bands.
- --mxsize <x>
- Set the maximum allowable total DP matrix size to <x>
megabytes. By default this size is 1028 Mb. This should be large enough
for the vast majority of alignments; however, if it is not, cmalign
will attempt to iteratively tighten the HMM bands it uses to constrain the
alignment by raising the tau parameter and recalculating the bands until
the total matrix size needed falls below <x> megabytes or the
maximum allowable tau value (0.05 by default, but changeable with
--maxtau) is reached. At each iteration of band tightening, tau is
multiplied by 2.0. The band tightening strategy can be turned off with
the --fixedtau option. If the maximum tau is reached and the
required matrix size still exceeds <x>, or if HMM banding is
not being used and the required matrix size exceeds <x>, then
cmalign will exit prematurely and report an error message that the
matrix exceeded its maximum allowable size. In this case,
--mxsize can be used to raise the size limit or the maximum tau can
be raised with --maxtau. The limit will commonly be exceeded when
the --nonbanded option is used without the --small option,
but can still occur when --nonbanded is not used. Note that if
cmalign is being run with <n> threads on a
multicore machine then each thread may have an allocated matrix of up to
size <x> Mb at any given time.
- --fixedtau
- Turn off the HMM band tightening strategy described in the explanation of
the --mxsize option above.
- --maxtau <x>
- Set the maximum allowed value for tau during band tightening, described in
the explanation of --mxsize above, to <x>. By default
this value is 0.05.
- --nonbanded
- Turn off HMM banding. The returned alignment is guaranteed to be the
globally optimally accurate one (by default) or the globally optimally
scoring one (if --cyk is enabled). The --small option is
recommended in combination with this option, because standard alignment
without HMM banding requires a lot of memory (see --small).
- --small
- Use the divide and conquer CYK alignment algorithm described in SR Eddy,
BMC Bioinformatics 3:18, 2002. The --nonbanded option must be used
in combination with this option. Also, it is recommended whenever
--nonbanded is used that --small is also used because
standard CM alignment without HMM banding requires a lot of memory,
especially for large RNAs. --small allows CM alignment within
practical memory limits, reducing the memory required for aligning LSU
rRNA, the largest known RNAs, from 150 Gb to less than 300 Mb. This option
can only be used in combination with --nonbanded, --notrunc,
and --cyk.
- --sfile <f>
- Dump per-sequence alignment score and timing information to file
<f>. The format of this file is described above (it's the
same data in the same format as the tabular stdout output when the
-o option is used).
- --tfile <f>
- Dump tabular sequence tracebacks for each individual sequence to a file
<f>. Primarily useful for debugging.
- --ifile <f>
- Dump per-sequence insert information to file <f>. The format
of the file is described by "#"-prefixed comment lines included
at the top of the file <f>. The insert information is valid
even when the --matchonly option is used.
- --elfile <f>
- Dump per-sequence EL state (local end) insert information to file
<f>. The format of the file is described by
"#"-prefixed comment lines included at the top of the file
<f>. The EL insert information is valid even when the
--matchonly option is used.
- --mapali <f>
- Read the alignment used to build the model from file <f> and
align it as a single object to the CM; i.e. the alignment in
<f> is held fixed. This allows you to align sequences to a
model with cmalign and view them in the context of an existing
trusted multiple alignment. <f> must be the alignment file
that the CM was built from. The program verifies that the checksum of the
file matches that of the file used to construct the CM. A similar option
to this one was called --withali in previous versions of
cmalign.
- --mapstr
- Must be used in combination with --mapali <f>.
Propagate structural information for any pseudoknots that exist in
<f> to the output alignment. A similar option to this one was
called --withstr in previous versions of cmalign.
- --informat <s>
- Assert that the input <seqfile> is in format
<s>. Do not run Babelfish format autodetection. This increases
the reliability of the program somewhat, because the Babelfish can make
mistakes; particularly recommended for unattended, high-throughput runs of
Infernal. Acceptable formats are: FASTA, GENBANK, and DDBJ.
<s> is case-insensitive.
- --outformat <s>
- Specify the output alignment format as <s>. Acceptable
formats are: Pfam, AFA, A2M, Clustal, and Phylip. AFA is aligned fasta.
Only Pfam and Stockholm alignment formats will include consensus structure
annotation and posterior probability annotation of aligned residues.
- --dnaout
- Output the alignments as DNA sequence alignments, instead of RNA ones.
- --noprob
- Do not annotate the output alignment with posterior probabilities.
- --matchonly
- Only include match columns in the output alignment; do not include any
insertions relative to the consensus model. This option may be useful when
creating very large alignments that require a lot of memory and disk
space, most of which is necessary only to deal with insert columns that
are gaps in most sequences.
- --ileaved
- Output the alignment in interleaved Stockholm format of a fixed width that
may be more convenient for examination. This was the default output
alignment format of previous versions of cmalign. Note that
cmalign requires more memory when this option is used. For this
reason, --ileaved will only work for alignments of up to 100,000
sequences or a total of 100,000,000 aligned nucleotides.
- --regress <s>
- Save an additional copy of the output alignment with no author information
to file <s>.
- --verbose
- Output additional information in the tabular scores output (output to
stdout if -o is used, or to <f> if --sfile
<f> is used). These are mainly useful for testing and debugging.
- --cpu <n>
- Specify that <n> parallel CPU workers be used. If
<n> is set as "0", then the program will be run in
serial mode, without using threads. You can also control this number by
setting an environment variable, INFERNAL_NCPU. This option will
only be available if the machine on which Infernal was built is capable of
using POSIX threading (see the Installation section of the user guide for
more information).
- --mpi
- Run as an MPI parallel program. This option will only be available if
Infernal has been configured and built with the "--enable-mpi"
flag (see the Installation section of the user guide for more
information).
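Several of the option descriptions above state dependencies between options: --small requires --nonbanded, --notrunc, --cyk and --noprob; --sub requires -g; --seed can only be used with --sample; and --mapstr requires --mapali. The following Python sketch collects those rules so that a wrapper script could check a command line before launching a run. It is not part of Infernal and is not exhaustive; cmalign's own option checking remains authoritative. The names REQUIRES, missing_requirements and build_command, and the file names in the usage note, are hypothetical.

```python
# Option dependencies as stated in the descriptions above.
REQUIRES = {
    "--small": {"--nonbanded", "--notrunc", "--cyk", "--noprob"},
    "--sub": {"-g"},
    "--seed": {"--sample"},
    "--mapstr": {"--mapali"},
}


def missing_requirements(options):
    """Map each option in `options` to the sorted list of options it
    requires that are absent. An empty dict means no rule is violated."""
    present = set(options)
    problems = {}
    for opt, needed in REQUIRES.items():
        if opt in present:
            missing = sorted(needed - present)
            if missing:
                problems[opt] = missing
    return problems


def build_command(cmfile, seqfile, options=None):
    """Build an argument list for cmalign.

    `options` maps option names to their argument, or to None for
    options that take no argument.
    """
    options = dict(options or {})
    if cmfile == "-" and seqfile == "-":
        # Only one of the two inputs may be read from stdin.
        raise ValueError("cmfile and seqfile cannot both be '-'")
    problems = missing_requirements(options)
    if problems:
        raise ValueError("missing required options: %r" % problems)
    argv = ["cmalign"]
    for flag, value in options.items():
        argv.append(flag)
        if value is not None:
            argv.append(str(value))
    argv += [cmfile, seqfile]
    return argv
```

For example, build_command("my.cm", "seqs.fa", {"-o": "out.sto", "--cyk": None}) returns an argument list that could be passed to subprocess.run, while requesting --small without its companion options raises ValueError before anything is executed.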
See infernal(1) for a master man page with a list of all the individual
man pages for programs in the Infernal package.
For complete documentation, see the user guide that came with your
Infernal distribution (Userguide.pdf); or see the Infernal web page.
Copyright (C) 2019 Howard Hughes Medical Institute.
Freely distributed under the BSD open source license.
For additional information on copyright and licensing, see the
file called COPYRIGHT in your Infernal source distribution, or see the
Infernal web page.
The Eddy/Rivas Laboratory
Janelia Farm Research Campus
19700 Helix Drive
Ashburn VA 20147 USA
http://eddylab.org