FA2HTGS(1) NCBI Tools User's Manual FA2HTGS(1)
fa2htgs - formatter for high throughput genome sequencing project submissions
fa2htgs [-] [-6 str]
[-7 str] [-A filename]
[-C str] [-D] [-L filename]
[-M str] [-N] [-O filename]
[-P str] [-Q filename]
[-S str] [-T filename] [-X]
[-a str] [-b N]
[-c str] [-d str]
[-e filename] [-f] -g str
[-h str] [-i filename]
[-k str] [-l N] [-m]
[-n str] [-o filename]
[-p N] [-q] [-r str]
-s str [-t filename] [-u]
[-v] [-w] [-x str]
fa2htgs is a program used to generate Seq-submits (an ASN.1 sequence
submission file) for high throughput genome sequencing projects.
fa2htgs will read a FASTA file (or an Ace Contig file with
Phrap sequence quality values), a Sequin submission template file (to get
contact and citation information for the submission), and a series of
command line arguments (see below). The program then combines this
information to make a submission suitable for GenBank. Once you have
generated your submission file, you need to follow the submission protocol
(see the README present on your FTP account or mailed out to your
Center).
fa2htgs is intended for use in scripts that automate bulk
submission of unannotated genome sequence. It can easily be extended from
its current simple form to allow more complicated processing. A submission
prepared with fa2htgs can also be read into Psequin(1), and
then annotated more extensively.
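As a rough illustration of the scripted use described above, the following Python sketch assembles one fa2htgs argument list per FASTA file without running anything. The helper name build_command, the center tag "mycenter", the input file names, and the ".sqn" output suffix are all hypothetical; only the option letters (-g, -s, -i, -t, -o, -p) are taken from the option summary in this page.

```python
from pathlib import Path


def build_command(fasta_path, center, template="template.sub", phase=1):
    """Return an fa2htgs argument list for one FASTA file (nothing is executed)."""
    fasta = Path(fasta_path)
    # Assumption: the file stem is a sequence name that is unique within the center.
    seq_name = fasta.stem
    return [
        "fa2htgs",
        "-g", center,                          # Genome Center tag (required)
        "-s", seq_name,                        # sequence name (required)
        "-i", str(fasta),                      # FASTA input
        "-t", template,                        # Seq-submit template
        "-o", str(fasta.with_suffix(".sqn")),  # ASN.1 output (suffix is an assumption)
        "-p", str(phase),                      # HTGS phase
    ]


if __name__ == "__main__":
    for name in ["clone_a.fa", "clone_b.fa"]:
        print(" ".join(build_command(name, "mycenter")))
```

On a machine where fa2htgs is installed, each list could be passed to subprocess.run(cmd, check=True). Building an argument list rather than a shell string avoids quoting problems with titles or organism names that contain spaces.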
Questions and concerns about this processing protocol, or about how to
use this tool, should be sent to <htgs@ncbi.nlm.nih.gov>.
A summary of options is included below.
- -
- Print usage message
- -6 str
- SP6 clone (e.g., Contig1,left)
- -7 str
- T7 clone (e.g., Contig2,right)
- -A filename
- Filename for accession list input (mutually exclusive with -T and
-i). The input file contains a tab-delimited table with three to
five columns, which are accession number, start position, stop position,
and (optionally) length and strand. If start > stop, the minus strand
on the referenced accession is used. A gap is indicated by the word
"gap" instead of an accession, 0 for the start and stop
positions, and a number for the length.
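The table layout described for -A can be sketched in Python as follows. This is only an illustration of the three-column form plus the gap row; the helper names and the file name accessions.txt are hypothetical, and the accession numbers are placeholders. The optional length and strand columns are left out.

```python
def accession_row(accession, start, stop):
    """Tab-delimited row: accession, start, stop.

    A start greater than stop selects the minus strand of that accession.
    """
    return "\t".join([accession, str(start), str(stop)])


def gap_row(length):
    """Gap row: the word 'gap', 0 for start and stop, then the gap length."""
    return "\t".join(["gap", "0", "0", str(length)])


def write_accession_list(path, rows):
    """Write one row per line to the given file."""
    with open(path, "w", encoding="ascii") as handle:
        for row in rows:
            handle.write(row + "\n")


if __name__ == "__main__":
    rows = [
        accession_row("U10000", 1, 5000),   # plus strand
        gap_row(100),
        accession_row("L11000", 3000, 1),   # minus strand, since start > stop
    ]
    write_accession_list("accessions.txt", rows)
```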
- -C str
- Clone library name (will appear as
/clone-lib="str" on the source
feature)
- -D
- HTGS_DRAFT sequence
- -L filename
- Read phrap contig order from filename. This is a tab-delimited file
that can be used to drive the order of contigs (normally specified by
-P), as well as indicating the SP6 and T7 ends. It can also be used
when contigs are known to be in opposite orientation. For example:
Contig2 + 1 SP6 left
Contig3 + 1
Contig1 - T7 right
The first column is the contig name, the second is the orientation, the
third is the fragment_group, the fourth indicates the SP6 or T7 end, and
the fifth says which side of SP6 or T7 end had vector removed.
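The rows of the example above could be generated with a small Python helper such as the one below. The function name is hypothetical, and representing a missing fragment_group as an empty column (as in the Contig1 row) is an assumption about how the file is parsed, not something this page states.

```python
def contig_order_row(name, orientation, fragment_group=None, end=None, side=None):
    """Build one tab-delimited row of a contig order file for -L.

    Columns: contig name, orientation, fragment_group, SP6/T7 end, vector side.
    Trailing empty columns are dropped; an omitted interior column is left empty
    (assumption).
    """
    if orientation not in ("+", "-"):
        raise ValueError("orientation must be '+' or '-'")
    fields = [
        name,
        orientation,
        "" if fragment_group is None else str(fragment_group),
        end or "",
        side or "",
    ]
    while fields and fields[-1] == "":
        fields.pop()
    return "\t".join(fields)


if __name__ == "__main__":
    print(contig_order_row("Contig2", "+", 1, "SP6", "left"))
    print(contig_order_row("Contig3", "+", 1))
    print(contig_order_row("Contig1", "-", end="T7", side="right"))
```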
- -M str
- Map name (will appear as /map="str" on the
source feature)
- -N
- Annotate assembly_fragments
- -O filename
- Read comment from filename (100-character-per-line maximum;
~ is a linebreak and `~ is a literal ~. You can check
the format with Psequin(1).)
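A comment file for -O could be prepared along the lines of this Python sketch. It assumes that the 100-character limit applies to each logical comment line before escaping, which this page does not spell out; the function name encode_comment is hypothetical.

```python
MAX_LINE = 100  # per-line maximum stated for -O


def encode_comment(lines):
    """Join comment lines using '~' as the linebreak marker.

    A literal '~' inside a line is written as '`~'.
    Assumption: the length limit is checked on each line before escaping.
    """
    encoded = []
    for line in lines:
        if len(line) > MAX_LINE:
            raise ValueError("comment line exceeds %d characters" % MAX_LINE)
        encoded.append(line.replace("~", "`~"))
    return "~".join(encoded)


if __name__ == "__main__":
    print(encode_comment(["Draft assembly", "coverage ~8x"]))
```

Checking the result in Psequin(1), as suggested above, remains the reliable way to confirm the formatting.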
- -P str
- Contigs to use, separated by commas. If -P is not given with
the -T option, the fragments are used in the order in which they
appear in the ace file (which is appropriate for a phase 1 record, but not
for a phase 2 or 3). If you need to set the order of the segments of the
ace file, set it with the -P flag, like this: -P
"Contig1,Contig4,Contig3,Contig2,Contig5"
- -Q filename
- Read quality scores from filename
- -S str
- Strain name
- -T filename
- Filename for phrap input (mutually exclusive with -A and
-i)
- -X
- The coordinates in the input file are on the resulting segmented sequence.
(Bases 1 through n of each accession are used.) Otherwise, the
coordinates are on the individual accessions, which need not start at base
1 of the record.
- -a str
- GenBank accession; use if and only if updating a sequence.
- -b N
- Gap length (default = 100; anything from 0 to 1000000000 is legal)
- -c str
- Clone name (will appear as /clone in the source feature; can be the
same as -s)
- -d str
- Title for sequence (will appear in GenBank DEFINITION line)
- -e filename
- Log errors to filename
- -f
- htgs_fulltop keyword
- -g str
- Genome Center tag (probably the same as your login name on the NCBI FTP
server)
- -h str
- Chromosome (will appear as /chromosome in the source feature)
- -i filename
- Filename for fasta input (default is stdin; mutually exclusive with
-A and -T)
- -k str
- Add the supplied string as a keyword.
- -l N
- Length of sequence in bp (default = 0). The length is checked against the
actual number of bases supplied. For phase 1 and 2 sequence it is also used
to estimate gap lengths. For phase 1 and 2 records, it is important to use
a number GREATER than the amount of provided nucleotide; otherwise this
will generate false `gaps'. It is assumed here that the putative full length
of the BAC or cosmid will be used. There should be at least 20 to 30 `n'
between the segments (you can check for these in Sequin), as this
ensures proper behavior when the sequence is used with BLAST. Otherwise
`artifactual' unrelated segment neighbors may be brought into proximity of
each other.
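Before running fa2htgs on a phase 1 or 2 record, a script could check the -l value against the bases actually supplied. The Python sketch below covers only that comparison; the function name and the example lengths are hypothetical, and how fa2htgs turns the difference into gap estimates is not modeled.

```python
def length_exceeds_bases(declared_length, segment_lengths):
    """Return True if the -l value is greater than the total bases provided.

    This is the condition asked for phase 1 and 2 records.
    """
    return declared_length > sum(segment_lengths)


if __name__ == "__main__":
    segments = [40000, 60000, 30000]  # placeholder contig lengths in bp
    for declared in (150000, 130000):
        status = "ok" if length_exceeds_bases(declared, segments) else "too small"
        print(declared, status)
```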
- -m
- Take comment from template
- -n str
- Organism name (default = Homo sapiens)
- -o filename
- Filename for asn.1 output (default = stdout)
- -p N
- HTGS phase:
- 1
- A collection of unordered contigs with gaps of unknown length. A Phase 1
record must at the very least have two segments with one gap.
(default)
- 2
- A series of ordered contigs, possibly with known gap lengths. This could
be a single sequence without gaps, if the sequence has ambiguities to
resolve.
- 3
- A single contiguous sequence. This sequence is finished, but not
necessarily annotated.
- -q
- htgs_cancelled keyword
- -r str
- Remark for update (brief comment describing the nature of the update, such
as "new sequence", "new citation", or "updated
features")
- -s str
- Sequence name. The sequence must have a name that is unique within the
genome center. We use the combination of the genome center name (-g
argument) and the sequence name (-s) to track this sequence and to
talk to you about it. The name can have any form you like but must be
unique within your center.
- -t filename
- Filename for Seq-submit template (default = template.sub)
- -u
- Take biosource from template
- -v
- htgs_activefin keyword
- -w
- Whole Genome Shotgun flag
- -x str
- Secondary accession numbers, separated by commas, e.g. U10000,L11000.
In some cases a large segment will supersede one or more other
accession numbers (records). Records that are no longer wanted in GenBank
should be made secondary. Use the -x argument to list the accession
numbers you want to make secondary. This instructs us to remove those
accession numbers from GenBank, so they will no longer be part of the
GenBank release. They will nonetheless remain available from Entrez.
GREAT CARE should be taken when using this argument!
Improper use of accession numbers here will result in the inappropriate
withdrawal of records from GenBank, EMBL and DDBJ. We provide this
parameter as a convenience to submitting centers, but it may need to be
removed if it is not used carefully.
The National Center for Biotechnology Information.
Psequin(1), fa2htgs/README