6.4. Datatype-specific Command Line Qualifiers

6.4.1. Introduction

Datatype-specific qualifiers are inbuilt qualifiers available for specific ACD datatypes only. They are used to specify a particular application option in more detail. They control such things as:

Formats of input and output files
File names
Start and end regions of sequences and features
Sequence type
Output options

The datatype-specific qualifiers are defined for individual datatypes or groups of related datatypes. They are summarised below, organised by ACD datatype grouping.

6.4.2. Sequences

Sequences are referenced on the command line by the USA (Uniform Sequence Address) mechanism (Section 6.6, “The Uniform Sequence Address (USA)”). To provide maximum flexibility, parts of this specification may be given on the command line as qualifiers. The qualifiers are used to specify such things as the name and format of the file, the sequence type and the region of interest in the sequence.

The qualifiers can refer to one or to all sequence parameters. If given at the start of the command line before any sequence parameters then they refer to all parameters. If a qualifier is given after a sequence parameter, then it refers to that parameter only. If a number is given after the qualifier name, then the qualifier refers to the 'n'th instance of the associated datatype in the ACD file. For example, -sbegin3 refers to the third input sequence parameter. More information on numbering qualifiers is given elsewhere (Section 6.4, “Datatype-specific Command Line Qualifiers”).
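The numbering rule can be pictured with a short Python sketch. The helper name split_numbered_qualifier is hypothetical and is not part of EMBOSS; it only separates a trailing number from a qualifier name, on the assumption that qualifier names consist of letters.

```python
import re


def split_numbered_qualifier(token):
    """Split '-sbegin3' into ('sbegin', 3).

    An unnumbered qualifier such as '-sbegin' yields ('sbegin', None).
    Assumption: qualifier names are made of letters only.
    """
    match = re.fullmatch(r"-([A-Za-z]+)(\d*)", token)
    if match is None:
        raise ValueError(f"not a qualifier: {token!r}")
    name, number = match.groups()
    return name, (int(number) if number else None)


if __name__ == "__main__":
    print(split_numbered_qualifier("-sbegin3"))
    print(split_numbered_qualifier("-sbegin"))
```

Here -sbegin3 is split into the qualifier name sbegin and the instance number 3, which would select the third input sequence parameter.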

The qualifiers have the general format:

-QualifierName "QualifierDatatype" ("DefaultValue")

For example:

-sbegin "integer" ("0")

6.4.2.1. Sequence Input

The qualifiers below apply to the following ACD datatypes:

sequence
seqall
seqset
seqsetall

-sbegin "integer" ("0"). Start of the sequence to be used. The default 0 is the start of the sequence. Negative numbers are counted from the end with -1 indicating the last character of the sequence.

-send "integer" ("0"). End of the sequence to be used. The default 0 is the end of the sequence. Negative numbers are counted from the end with -1 indicating the last character of the sequence.

-sreverse "boolean" ("N"). Reverse complement (if DNA).

-sask "boolean" ("N"). Prompt for begin/end/reverse interactively.

-snucleotide "boolean" ("N"). Sequence is nucleotide. EMBOSS will identify most sequences correctly but some short protein sequences are also valid as nucleotide sequences.

-sprotein "boolean" ("N"). Sequence is protein. EMBOSS will identify most sequences correctly but some short protein sequences are also valid as nucleotide sequences.

-slower "boolean" ("N"). Make sequence characters lower case.

-supper "boolean" ("N"). Make sequence characters upper case.

-sformat "string" (""). Specifies the input sequence format.

-sdbname "string" (""). Database name to be used for sequence output; useful for formats that include a database name.

-sid "string" (""). Sequence identifier to be used for sequence output; useful if the sequence has no identifier.

-ufo "string" (""). UFO (uniform feature object) for reading features if not included in the sequence data.

-fformat "string" (""). Features format (can also be specified in the UFO).

-fopenfile "string" (""). Features input file name.

6.4.2.1.1. Specifying Input Sequence Format

The format of your input sequence can be specified by adding -sformat Format on the command line, where Format is the name of a supported sequence format (Section A.1, “Supported Sequence Formats”). For example:

seqret myfile.seq -sformat embl

The format may also be specified in the USA of the input filename. For example:

seqret embl::myfile.seq

The behaviour of -sformat and Format:: in a USA is identical. The format is not required, however. When reading in a sequence, EMBOSS will work out the sequence format by trying all known formats until one succeeds. Very few formats (only those that can be misinterpreted) are unsuitable for autodetection.
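The "try each format until one succeeds" idea can be sketched in Python as follows. This is an illustration only, not the EMBOSS implementation: the two toy parsers, the PARSERS list and the function read_sequence are all hypothetical, and real format detection is considerably more careful.

```python
def parse_fasta(text):
    """Toy FASTA parser: a '>' header line followed by sequence lines."""
    lines = text.strip().splitlines()
    if not lines or not lines[0].startswith(">"):
        raise ValueError("not FASTA")
    header = lines[0][1:].split()
    identifier = header[0] if header else ""
    return identifier, "".join(line.strip() for line in lines[1:])


def parse_raw(text):
    """Toy raw parser: letters only, whitespace ignored, no identifier."""
    sequence = "".join(text.split())
    if not sequence.isalpha():
        raise ValueError("not raw sequence")
    return "", sequence


# Order matters: stricter formats are tried before permissive ones.
PARSERS = [("fasta", parse_fasta), ("raw", parse_raw)]


def read_sequence(text, sformat=None):
    """Return (format, identifier, sequence).

    If sformat is given, only that parser is tried (like -sformat);
    otherwise every known parser is tried in turn.
    """
    for name, parser in PARSERS:
        if sformat not in (None, name):
            continue
        try:
            identifier, sequence = parser(text)
        except ValueError:
            continue  # this format did not fit, try the next one
        return name, identifier, sequence
    raise ValueError("no known format matched")


if __name__ == "__main__":
    print(read_sequence(">L07770 Xenopus laevis\nggtagaac\nagcttcag\n"))
```

The permissive raw parser is tried last because almost any text of letters would satisfy it, which is the kind of format that is easy to misinterpret.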

Here the input sequence format is specified in the USA:

% seqret embl::L07770.embl -outseq L07770.fasta

% more L07770.fasta
>L07770 L07770.1 Xenopus laevis rhodopsin mRNA, complete cds.
ggtagaacagcttcagttgggatcacaggcttctagggatcctttgggcaaaaaagaaac
acagaaggcattctttctatacaagaaaggactttatagagctgctaccatgaacggaac
.......

In the example below, a sequence is read from GCG format and written in FASTA (default) format:

% cat L07770.gcg
!!NA_SEQUENCE 1.0

Xenopus laevis rhodopsin mRNA, complete cds.

L07770  Length: 1684  Type: N  Check: 9453 ..

    1 ggtagaacag cttcagttgg gatcacaggc ttctagggat cctttgggca

   51 aaaaagaaac acagaaggca ttctttctat acaagaaagg actttataga

  101 gctgctacca tgaacggaac agaaggtcca aatttttatg tccccatgtc
  .........................

% seqret L07770.gcg
Reads and writes (returns) sequences
output sequence(s) [L07770.fasta]: L07770.fasta

% cat L07770.fasta
>L07770
ggtagaacagcttcagttgggatcacaggcttctagggatcctttgggcaaaaaagaaac
acagaaggcattctttctatacaagaaaggactttatagagctgctaccatgaacggaac
.........

6.4.2.1.2. Reading from a Database

To read sequence(s) from a database, the database name and sequence id or accession number must be specified in the USA:

seqret uniprot:P10932

The database name must be already defined. The application showdb lists all known database definitions.

In the example below, entry P10932 is retrieved from the UniProt database via its accession number.

% seqret
Reads and writes (returns) sequences
Input sequence(s): uniprot:p10932
Output sequence [amir_pseae.fasta]: 

% more amir_pseae.fasta
>AMIR_PSEAE P10932 Aliphatic amidase regulator;
MSANSLLGSLRELQVLVLNPPGEVSDALVLQLIRIGCSVRQCWPPPESFDVPVDVVFTSI
FQNRHHDEIAALLAAGTPRTTLVALVEYESPAVLSQIIELECHGVITQPLDAHRVLPVLV
SARRISEEMAKLKQKTEQLQERIAGQARINQAKALLMQRHGWDEREAHQYLSREAMKRRE
PILKIAQELLGNEPSA
........

Here the same entry is retrieved by its database ID:

% seqret
Reads and writes (returns) sequences
Input sequence(s): uniprot:amir_pseae
Output sequence [amir_pseae.fasta]: p10932.fasta

%  more p10932.fasta 
>AMIR_PSEAE P10932 Aliphatic amidase regulator;
MSANSLLGSLRELQVLVLNPPGEVSDALVLQLIRIGCSVRQCWPPPESFDVPVDVVFTSI
FQNRHHDEIAALLAAGTPRTTLVALVEYESPAVLSQIIELECHGVITQPLDAHRVLPVLV
SARRISEEMAKLKQKTEQLQERIAGQARINQAKALLMQRHGWDEREAHQYLSREAMKRRE
PILKIAQELLGNEPSA
.......

6.4.2.1.3. Reading from an Input File

To read sequence(s) from an input file, use FileName where FileName is the name of the source file. For example:

seqret myfile.seq

If the file contains more than one sequence, you can specify the sequence identifier or accession number in the same way as for databases:

seqret myfile.seq:P10932

If the file has no sequence identifier, or you need to specify a new sequence identifier, use the -sid qualifier. Note that this is also used for the default output file name.

seqret myfile.seq:P10932 -sid amidase_regulator
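The USA forms used so far (Format::Source and Source:Entry) can be taken apart with a few lines of Python. This is a simplified sketch covering only the forms shown in this section; the function name parse_simple_usa is hypothetical, and the full USA syntax is described in Section 6.6.

```python
def parse_simple_usa(usa):
    """Split a simple USA into (format, source, entry).

    Handles only 'format::source', 'source:entry' and combinations of
    the two. 'source' may be a file name or a database name; the two
    cannot be told apart from the string alone.
    """
    fmt, separator, rest = usa.partition("::")
    if not separator:
        # No 'format::' prefix was given.
        fmt, rest = None, usa
    source, _, entry = rest.partition(":")
    return fmt, source, (entry or None)


if __name__ == "__main__":
    for usa in ("embl::myfile.seq", "myfile.seq:P10932", "uniprot:P10932"):
        print(usa, "->", parse_simple_usa(usa))
```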

6.4.2.1.4. Specifying the Sequence Type

EMBOSS determines whether a sequence is nucleotide or protein by the proportion of possible nucleotide characters in the sequence. It is usually correct, but might make a mistake with an unusually short and ambiguous sequence. You can force EMBOSS to accept that the sequence is nucleotide or protein using -snucleotide or -sprotein. These are boolean qualifiers and are set on if they are given on the command line. For example:

seqret myfile.seq:P10932 -sprotein
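A proportion-based guess of this kind might look like the Python sketch below. The alphabet and the 0.85 threshold are assumptions chosen for illustration and are not the values EMBOSS uses; guess_type is a hypothetical name.

```python
NUCLEOTIDE_CHARS = set("ACGTUN")  # assumed alphabet, for illustration only


def guess_type(sequence, snucleotide=False, sprotein=False, threshold=0.85):
    """Return 'nucleotide' or 'protein'.

    snucleotide/sprotein mimic the boolean qualifiers and override the
    guess. The threshold is an assumed value, not taken from EMBOSS.
    """
    if snucleotide and sprotein:
        raise ValueError("only one of snucleotide and sprotein may be set")
    if snucleotide:
        return "nucleotide"
    if sprotein:
        return "protein"
    residues = [char for char in sequence.upper() if char.isalpha()]
    if not residues:
        raise ValueError("empty sequence")
    fraction = sum(char in NUCLEOTIDE_CHARS for char in residues) / len(residues)
    return "nucleotide" if fraction >= threshold else "protein"


if __name__ == "__main__":
    print(guess_type("ggtagaacagcttcagttgg"))
    print(guess_type("MSANSLLGSLRELQVLVLNPPGEV"))
    # 'GATTACA' is a valid peptide but consists only of nucleotide letters,
    # so the guess must be overridden to treat it as protein.
    print(guess_type("GATTACA"), guess_type("GATTACA", sprotein=True))
```

The last line shows the ambiguity described above: a short protein made up only of the letters A, C, G and T is guessed to be nucleotide unless the type is forced.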

6.4.2.1.5. Specifying a Sub-sequence

To specify a subsequence, use -sbegin Start and -send End, where Start and End are the start and end respectively of the region of interest. For example:

seqret myfile.seq:P10932 -sbegin 25 -send 100

Or via the USA:

seqret myfile.seq:P10932[25:100]
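The meaning of the begin and end values, including the default 0 and negative positions, can be sketched in Python. The sketch assumes positions are 1-based and inclusive, and it rejects out-of-range values; how EMBOSS itself treats out-of-range positions is not covered here. The function names are hypothetical.

```python
def resolve_region(length, sbegin=0, send=0):
    """Return the 1-based inclusive (start, end) for a sequence of 'length'.

    0 means 'use the default' (start or end of the sequence); negative
    values count from the end, with -1 being the last character.
    """
    def absolute(position, default):
        if position == 0:
            return default
        if position < 0:
            return length + position + 1
        return position

    start = absolute(sbegin, 1)
    end = absolute(send, length)
    if not 1 <= start <= end <= length:
        raise ValueError(f"region {start}..{end} is outside 1..{length}")
    return start, end


def subsequence(sequence, sbegin=0, send=0):
    """Extract the region selected by sbegin/send from a Python string."""
    start, end = resolve_region(len(sequence), sbegin, send)
    return sequence[start - 1:end]  # convert to 0-based, end-exclusive


if __name__ == "__main__":
    print(resolve_region(200, 25, 100))
    print(resolve_region(200, -10, -1))
    print(subsequence("ABCDEFGHIJ", 3, -2))
```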

6.4.2.1.6. Command Line Styles

In practice, various styles of command line are supported (see Section 6.1, “Introduction to the EMBOSS Command Line”) and the qualifier names can be abbreviated so long as they remain unambiguous. The following command lines all tell seqret to read sequence P10932 in FASTA format, starting at position 25:

seqret P10932 -sf fasta -sbeg 25

seqret fasta::P10932 -sbegin=25

seqret -sbegin=25 fasta::P10932

seqret -sbegin=25 P10932 -sformat fasta

seqret -sbeg 25 P10932 -sf=fasta

seqret -sbeg 25 -sequence=P10932 -sf=fasta

seqret sbeg=25 -sequence=P10932 sf=fasta

seqret -sbeg 25 -sequence P10932 -sf fasta

seqret /SBEG=25 /SEQUENCE=P10932 /SF=fasta

This may seem rather confusing, but only because there is no enforcement of a single way for users to specify the command lines. For general use, the first style above is strongly recommended.
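The abbreviation rule (a qualifier name may be shortened as long as it remains unambiguous) can be illustrated with a small Python sketch. The qualifier list is only a subset taken from this section, the function names are hypothetical, and the choices to let an exact match win and to ignore case are assumptions rather than documented EMBOSS behaviour.

```python
QUALIFIERS = [
    "sequence", "sbegin", "send", "sreverse", "sask", "snucleotide",
    "sprotein", "slower", "supper", "sformat", "sdbname", "sid",
]


def expand_qualifier(abbreviation, names=QUALIFIERS):
    """Expand an abbreviated qualifier name to its unique full name."""
    if abbreviation in names:
        return abbreviation  # assumption: an exact match always wins
    matches = [name for name in names if name.startswith(abbreviation)]
    if len(matches) == 1:
        return matches[0]
    if not matches:
        raise ValueError(f"unknown qualifier: {abbreviation}")
    raise ValueError(f"ambiguous qualifier {abbreviation}: {matches}")


def normalise(token):
    """Turn '-sbeg=25', 'sbeg=25' or '/SBEG=25' into ('sbegin', '25')."""
    name, _, value = token.lstrip("-/").partition("=")
    return expand_qualifier(name.lower()), (value or None)


if __name__ == "__main__":
    for token in ("-sbegin=25", "sbeg=25", "/SBEG=25", "-sf=fasta"):
        print(token, "->", normalise(token))
```

With this subset of names, sbeg and sf each match a single qualifier, whereas s or se would be rejected as ambiguous.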

6.4.2.2. Sequence Output

The qualifiers below apply to the following ACD datatypes:

seqout
seqoutall
seqoutset

-osformat "string" (""). Output sequence format. The default is FASTA format.

-osextension "string" (""). File name extension. The default is to use the name of the format.

-osname "string" (""). Base file name. The default is to use the sequence identifier, in lower case.

-osdirectory "string" (""). Output directory. The default is the current directory.

-osdbname "string" (""). Database name to include in output formats that report it.

-ossingle "boolean" ("N"). Write a separate file for each entry.

-oufo "string" (""). UFO (uniform feature object) for output features.

-offormat "string" (""). Output feature format. The default is GFF3.

-ofname "string" (""). Features file name.

-ofdirectory "string" (""). Output directory for features output.

6.4.2.2.1. Specifying Output Sequence Format

The format of your output sequence can be specified with -osformat Format on the command line, where Format is the name of a supported sequence format (Section A.1, “Supported Sequence Formats”). In the following examples, -outseq is used to refer explicitly to the output parameter of seqret, which is the second parameter in the ACD file:

seqret -outseq myfile.seqout -osformat embl

The format may also be specified in the USA of the output filename:

seqret -outseq embl::myfile.seqout

The behaviour of -osformat and Format:: in a USA is identical. It is not necessary to specify the output format, however, because EMBOSS will use FASTA format by default.

The following example reads sequence L07770 from the EMBL database and writes it to a file in GCG format:

%  seqret embl:L07770 -outseq gcg::L07770.gcg
Reads and writes (returns) sequences

%  cat L07770.gcg
!!NA_SEQUENCE 1.0

Xenopus laevis rhodopsin mRNA, complete cds.

L07770  Length: 1684  Type: N  Check: 9453 ..

    1 ggtagaacag cttcagttgg gatcacaggc ttctagggat cctttgggca

   51 aaaaagaaac acagaaggca ttctttctat acaagaaagg actttataga

  101 gctgctacca tgaacggaac agaaggtcca aatttttatg tccccatgtc

.........................

This example reads entry L07770 from EMBL and writes it to L07770.ncbi in NCBI format.

%  seqret embl:L07770 L07770.ncbi -osformat ncbi
Reads and writes (returns) sequences

%  cat L07770.ncbi
>gnl|embl|L07770 (L07770.1) Xenopus laevis rhodopsin mRNA, complete cds.
ggtagaacagcttcagttgggatcacaggcttctagggatcctttgggcaaaaaagaaac
acagaaggcattctttctatacaagaaaggactttatagagctgctaccatgaacggaac

Command lines to write sequences in the file myfile.seq to the output file myfile.seqout in the format EMBL are:

seqret myfile.seq embl::myfile.seqout

or

seqret myfile.seq myfile.seqout -osformat embl

6.4.2.2.2. Output File Naming

The output file name defaults to the sequence identifier in lower case as the base name, and the sequence output format as the extension. These can be overridden with -osname FileBaseName and -osextension FileExtension, where FileBaseName is the base name of the output file and FileExtension is the filename extension. For example:

seqret -osname myfile -osextension ".seq"

Or to do this via the USA:

seqret -outseq myfile.seq
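The default naming rule and its overrides can be summarised in a short Python sketch. The function default_output_name is hypothetical; stripping a leading dot from the extension is an assumption made so that both "seq" and ".seq" give the same result.

```python
from pathlib import PurePosixPath


def default_output_name(seq_id, osformat="fasta", osname=None,
                        osextension=None, osdirectory=None):
    """Build an output file name from the rules described above.

    Base name: osname if given, else the sequence identifier in lower case.
    Extension: osextension if given, else the output format name.
    """
    base = osname if osname else seq_id.lower()
    # Assumption: a leading '.' in the extension is tolerated and removed.
    extension = (osextension if osextension else osformat).lstrip(".")
    filename = f"{base}.{extension}"
    if osdirectory:
        return str(PurePosixPath(osdirectory) / filename)
    return filename


if __name__ == "__main__":
    print(default_output_name("AMIR_PSEAE"))
    print(default_output_name("IXI_567", osformat="gcg"))
    print(default_output_name("AMIR_PSEAE", osname="myfile", osextension=".seq"))
```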

6.4.2.2.3. Identifier and Database Name

In some output sequence formats (e.g. NCBI) the output includes the sequence identifier and a database name. NCBI format defaults to 'unk' for an unknown database, or uses the database name of the input sequence. The database name can be specified or replaced using -osdbname DatabaseName where DatabaseName is the name of the database to be used in the output file. For example:

seqret myfile.seq ncbi::myfile.seqout -osdbname embl

Similarly, the output sequence identifier can be specified or replaced with the -sid input qualifier:

seqret myfile.seq -sid amidase_regulator ncbi::myfile.seqout

It is also possible to specify the database name with the -sdbname input qualifier, but only if the sequence is read from a file as any input database name takes precedence.

6.4.2.2.4. Handling Multiple Sequence Output

Not all of the sequence formats can hold multiple sequences. Some formats, such as gcg, plain, raw and staden, have no indication of where one sequence ends and the next starts. They cannot, therefore, hold more than one sequence.

To write multiple sequences to a new file per sequence, you can use the -ossingle qualifier. Any output file name specified will be ignored and names of output files will be assigned automatically from the sequence ID name, with the format name used as the extension. So a sequence with the ID name IXI_567 being written in gcg format would be written to the file ixi_567.gcg. For example:

seqret myfile.seq gcg::myfile.seqout -ossingle

One output file would be created for each sequence in myfile.seq. Even though the output file (myfile.seqout) was specified, this would be ignored and files with names of EntryName.gcg would be generated.

6.4.2.2.5. Output to Screen

Sequence output can be printed to the screen rather than written to a file or database. This is useful if you want to see the output immediately after running the program, for example when testing an application. It is also useful for piping output to another application on UNIX. To do this, specify stdout in place of the file name; stdout is the standard UNIX name for the standard output stream, which is normally the screen. For example:

seqret -outseq stdout -osformat gcg

Or to do this via the USA:

seqret -outseq gcg::stdout
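Because the sequence is written to standard output, another program can capture it. The Python sketch below builds the same command line as the first example above and, if EMBOSS is installed, captures the output with the standard subprocess module. The function names are hypothetical, and the input name myfile.seq is a placeholder.

```python
import subprocess


def seqret_stdout_command(usa, osformat="fasta"):
    """Build the argument list for a seqret call that writes to stdout."""
    return ["seqret", usa, "-outseq", "stdout", "-osformat", osformat]


def capture_sequence(usa, osformat="fasta"):
    """Run seqret and return whatever it writes to standard output.

    Requires EMBOSS to be installed and on the PATH; raises
    subprocess.CalledProcessError if seqret exits with an error.
    """
    completed = subprocess.run(
        seqret_stdout_command(usa, osformat),
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout


if __name__ == "__main__":
    # Only print the command here; call capture_sequence() on a system
    # where seqret and the input file are available.
    print(" ".join(seqret_stdout_command("myfile.seq", "gcg")))
```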

6.4.3. Sequence Features

All the common feature formats are supported for input (see Section A.2, “Supported Feature Formats”). Features may be read either as part of an entry in a sequence database or file or from a raw feature table file containing the feature table only. The sequence, seqall, seqset & seqsetall datatypes (for sequence input) can read features if the application will use the feature data.

If the feature table is included in the sequence input file (as is generally the case when you are reading the sequence from a database), then the feature table will be read with no problem, so long as the features: attribute is set for the input sequence(s) in the ACD file (see the EMBOSS Developers Guide).

Features are referenced on the command line by the UFO (Uniform Feature Object) mechanism (Section 6.7, “The Uniform Feature Object (UFO)”). To provide maximum flexibility, parts of this specification may be given on the command line as qualifiers. The qualifiers are used to specify such things as the name and format of the file and the start and end sequence position of the feature of interest.

6.4.3.1. Feature Input

The qualifiers below apply to the following ACD datatype:

features

Feature input is also covered by the sequence, seqall, seqset & seqsetall datatypes if their features ACD attribute is set in the sequence definition in the ACD file.

-fformat "string" (""). Features format.

-fopenfile "string" (""). Features file name.

-fask "boolean" ("N"). Prompt for begin/end/reverse.

-fbegin "integer" ("0"). Start of the features to be used. The default (0) is the start of the sequence. Negative numbers are counted from the end with -1 indicating the last character of the sequence.

-fend "integer" ("0"). End of the features to be used. The default (0) is the end of the sequence. Negative numbers are counted from the end with -1 indicating the last character of the sequence.

-freverse "boolean" ("N"). Reverse complement all feature locations (if nucleotide).

6.4.3.1.1. Specifying Input Feature Format and File Name

The format of your input features can be specified by adding -fformat Format on the command line, where Format is the name of a supported feature format (Section A.2, “Supported Feature Formats”). For example:

extractfeat myfile.feat -fformat embl

The name of a raw features file can be set directly by adding -fopenfile FileName where FileName is the name of the raw features file. For example:

extractfeat -fopenfile myfile.feat -fformat embl

The file name and format may also be specified in the UFO of the input filename. For example:

extractfeat embl::myfile.feat

The behaviour of -fformat and Format:: in a UFO is identical. The format is not required, however. When reading in features, EMBOSS will guess the sequence feature format by trying all known formats until one succeeds.

6.4.3.1.2. Reading Features from a Sequence File

In this example, showfeat is used to extract features from the EMBL entry U23808:

%  showfeat embl:U23808
Show features of a sequence.
Output file [U23808.showfeat]: 

%  cat U23808.showfeat 
U23808
Xenopus laevis rhodopsin gene, complete cds.
|==========================================================| 4734
|----------------------------------------------------------> source
             |----->                                         mRNA
               |--->                                         CDS
                       |->                                   CDS
                       |->                                   mRNA
                                |->                          CDS
                                |->                          mRNA
                                      |-->                   CDS
                                      |-->                   mRNA
                                                  |>         CDS
                                                  |------->  mRNA

The -feature option of seqret is used to extract the sequence with its features from the EMBL entry U23808:

%  seqret -feature embl:U23808
Reads and writes (returns) sequences
Output sequence [U23808.fasta]: 

%  cat U23808.gff
##gff-version 3
#!sequence-region U23808 1 8914
#!date 2008-03-10
#!Type DNA
#!Source-version EMBOSS 6.1.0
U23808 EMBL databank_entry 1    8194 0.000 + . ID="U23808.1";
          organism="Xenopus laevis";mol_type="genomic DNA";db_xref="taxon:8355"
U23808 EMBL promoter 5128 5360 0.000 + . ID="U23808.2"
U23808 EMBL sequence_feature 5128 5158 0.000 + . ID="U23808.3";
          note="XOP4 cis element"
U23808 EMBL sequence_feature 5191 5215 0.000 + . ID="U23808.4";
          note="XOP3 cis element"
U23808 EMBL sequence_feature 5225 5239 0.000 + . ID="U23808.5";
          note="Ret1 cis element"
U23808 EMBL sequence_feature 5254 5270 0.000 + . ID="U23808.6";
          note="Bat1 cis element"
U23808 EMBL sequence_feature 5277 5302 0.000 + . ID="U23808.7";
          note="NRE cis element"
U23808 EMBL sequence_feature 5309 5318 0.000 + . ID="U23808.8";
          note="Ret4 cis element"
U23808 EMBL mRNA 5361 5830 0.000 + . ID="U23808.9";featflags="0x100";
          product="rhodopsin"
U23808 EMBL mRNA 6079 6247 0.000 + . ID="U23808.9";featflags="0x104"
U23808 EMBL mRNA 6849 7014 0.000 + . ID="U23808.9";featflags="0x104"
U23808 EMBL mRNA 7265 7504 0.000 + . ID="U23808.9";featflags="0x104"
U23808 EMBL mRNA 8210 8867 0.000 + . ID="U23808.9";featflags="0x104"
U23808 EMBL CDS 5470 5830 0.000 + 0 ID="U23808.10";featflags="0x100";
         codon_start=1;product="rhodopsin";
         note="cDNA sequence deposited under GenBank Accession Number L07770";
         db_xref="GOA:P29403";db_xref="HSSP:1F88";db_xref="InterPro:IPR017452";
         db_xref="UniProtKB/Swiss-Prot:P29403";protein_id="AAC59901.1";
         translation="MNGTEGPN...SQVSPA"
U23808 EMBL CDS 6079 6247 0.000 + 0 ID="U23808.10";featflags="0x104"
U23808 EMBL CDS 6849 7014 0.000 + 0 ID="U23808.10";featflags="0x104"
U23808 EMBL CDS 7265 7504 0.000 + 0 ID="U23808.10";featflags="0x104"
U23808 EMBL CDS 8210 8338 0.000 + 0 ID="U23808.10";featflags="0x104"

6.4.3.1.3. Reading a Raw Feature Table

To read a raw feature table from file, you must either name it explicitly using -fopenfile as shown above, or specify the -ufo on the command line; this is specific to the sequence datatypes (see Section 6.4.2, “Sequences”). For example:

extractfeat myfile.seq -ufo myfile.feat -fformat embl

This behaviour cannot be set in the sequence USA directly: -ufo must be used for separate feature input with any sequence. For example:

extractfeat embl::myfile.seq -ufo myfile.feat

In the example below, seqret is used to read features (-feature) from the file U23808.gff which is specified as containing a feature table (-ufo) with automatically detected format, and to write the features out in EMBL format to the file U23808.embl:

%  seqret -feature myfile.seq -ufo U23808.gff embl::U23808.embl
Reads and writes (returns) sequences

%  cat U23808.embl
ID   U23808    standard; DNA; UNC; 4734 BP.
AC   U23808;
SV   U23808.1
DE   Xenopus laevis rhodopsin gene, complete cds.
FH   Key             Location/Qualifiers
FH
FT   source          1..4734
FT                   /db_xref="taxon:8355"
FT                   /mol_type="genomic DNA"
FT                   /organism="Xenopus laevis"
FT   mRNA            join(1181..1650,1899..2067,2669..2834,3085..3324,
FT                   4030..4687)
FT                   /product="rhodopsin"
FT   CDS             join(1290..1650,1899..2067,2669..2834,3085..3324,
FT                   4030..4158)
FT                   /codon_start=1
FT                   /db_xref="GOA:P29403"
FT                   /db_xref="HSSP:1F88"
FT                   /db_xref="UniProt/Swiss-Prot:P29403"
FT                   /note="cDNA sequence deposited under GenBank Accession
FT                   Number L07770"
FT                   /product="rhodopsin"
FT                   /protein_id="AAC59901.1"
FT                   /translation="MNGTEGPNFYVPMSNKTGVVRSPFDYPQYYLAEPWQYSALAAYMF
FT                   LLILLGLPINFMTLFVTIQHKKLRTPLNYILLNLVFANHFMVLCGFTVTMYTSMHGYFI
FT                   FGQTGCYIEGFFATLGGEVALWSLVVLAVERYMVVCKPMANFRFGENHAIMGVAFTWIM
FT                   ALSCAAPPLFGWSRYIPEGMQCSCGVDYYTLKPEVNNESFVIYMFIVHFTIPLIVIFFC
FT                   YGRLLCTVKEAAAQQQESATTQKAEKEVTRMVVIMVVFFLICWVPYAYVAFYIFTHQGS
FT                   NFGPVFMTVPAFFAKSSAIYNPVIYIVLNKQFRNCLITTLCCGKNPFGDEDGSSAATSK
FT                   TEASSVSSSQVSPA"
SQ   Sequence 4734 BP; 1315 A; 1046 C; 985 G; 1388 T; 0 other;
     cgtaactagg accccaggtc gacacgacac cttccctttc ccagttattt cccctgtaga        60
     cgttagaagg ggaaggggtg tacttatgtc acgacgaact acgtccttga ctacttaggg       120

6.4.3.1.4. Specifying a Region of Features

On input, a region of features can be specified by using -fbegin and -fend. You can force a program to prompt for these values, and also the sense in a nucleotide sequence, by specifying -fask on the command line.

6.4.3.2. Feature Output

The qualifiers below apply to the following ACD datatype:

featout

-offormat "string" (""). Output feature format. The default is GFF3.

-ofopenfile "string" (""). Features output file name.

-ofextension "string" (""). Features output file name extension.

-ofdirectory "string" (""). Features output directory.

-ofname "string" (""). Features output base file name. Defaults to the sequence identifier in lower case.

-ofsingle "boolean" ("N"). Separate file for each entry.

6.4.3.2.1. Specifying Output Feature Format and File Name

Note

The seqout, seqoutall and seqoutset datatypes (for sequence output) can write features if their features ACD attribute is set.

All the common feature formats (Section A.2, “Supported Feature Formats”) are supported for output. If a program is capable of writing out sequences with features (for example seqret -feature), then the feature table will be written out as part of the output sequence file, if the format of the sequence file was designed to hold a feature table, i.e. is one of:

embl

gff

swissprot

pir

For example:

seqret -feat uniprot:p10932 myfile.seqfeat -osformat swissprot

If the specified output sequence format cannot hold a feature table (e.g. FASTA format), then a file with the extension .gff is written with the feature table in GFF format.
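The decision described above (features stay in the sequence file when the format can hold them, otherwise they go to a separate .gff file) can be sketched in Python. The set of formats is the one listed above; the function name and the feature_base argument are hypothetical, and the base name of the separate file is passed in rather than derived, since that detail depends on the naming rules described later in this section.

```python
# Sequence formats listed above as able to hold a feature table.
FEATURE_CAPABLE_FORMATS = {"embl", "gff", "swissprot", "pir"}


def feature_destination(sequence_file, osformat, feature_base):
    """Return the file that would receive the feature table.

    sequence_file: name of the sequence output file
    osformat:      output sequence format name
    feature_base:  base name to use for a separate feature file
    """
    if osformat.lower() in FEATURE_CAPABLE_FORMATS:
        return sequence_file  # features are written with the sequence
    return f"{feature_base}.gff"  # separate feature table in GFF format


if __name__ == "__main__":
    print(feature_destination("myfile.seqfeat", "swissprot", "p10932"))
    print(feature_destination("U23808.fasta", "fasta", "U23808"))
```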

The -ofname and -offormat options enable you to specify a name and format for separate output to a feature table file, even for formats that are capable of holding the feature table with its sequence.

The format of your output features can be specified by adding -offormat Format on the command line, where Format is the name of a supported output feature format (Section A.2, “Supported Feature Formats”). In the following examples, the output parameter is named explicitly (-outseq) because seqret has two parameters but only the second parameter in the ACD file is specified on the command line:

seqret -feat uniprot:p10932 -outseq myfile.featout -offormat swissprot

The file name can be set directly with -ofname FileName where FileName is the name of the features file. For example:

seqret -feat -ofname myfile.featout -offormat swissprot

The file name and format may also be specified together in a UFO string. For example:

seqret -feat uniprot:p10932 myfile.seqout -oufo swissprot::myfile.featout

The output directory (-ofdirectory) and file name (-ofname) may be set for the feature output files. If these are not set then the output files will be assigned names automatically from the sequence ID name, with the format name used as the extension, and written to the current directory. So a sequence with the ID name IXI_567 being written in gcg format would be written to the file ixi_567.gcg with features in the file ixi_567.gff.

The behaviour of -offormat and Format:: in a UFO is identical. The output format is not required, however. When writing features, EMBOSS will use GFF by default.

6.4.4. Sequence Alignments

The reading of sequence alignment files is a special case of general sequence input. The seqset and seqsetall datatypes are used (there is no dedicated ACD datatype), therefore all of the command line qualifiers available for sequence input (see above) also apply to input alignments. All the common alignment formats are supported, those being a subset of the supported sequence formats (Section A.1, “Supported Sequence Formats”).

There is a dedicated ACD datatype (align) for alignment output for which specific qualifiers are defined. All the common alignment formats (Section A.3, “Supported Alignment Formats”) are supported for output, some of which are suitable for multiple sequences and some for pairwise alignments (i.e. of two sequences) only.

The qualifiers below apply to the following ACD datatypes:

seqset
seqsetall

Thus, the command line qualifiers for sequence input are all available (see above). The sequence type defined in the ACD file should include gaps, and the sequence set should have the aligned: attribute set to "Y".

Typically, a program that writes an alignment will define a default output format in its ACD file. This is usually "simple" format for multiple alignments and "pair" for pairwise alignments. You are not restricted to these formats though. The qualifiers may be used to specify the name and format of the alignment output file and other output options, such as alignment width and what bibliographic information is printed to the alignment header.

The options below specify alignment output and apply to the ACD datatype:

align

-aformat "string" (""). Alignment format.

-aextension "string" (""). File name extension.

-adirectory "string" (""). Output directory.

-aname "string" (""). Base file name.

-awidth "integer" ("0"). Alignment width.

-aaccshow "boolean" ("N"). Show accession number in the header.

-adesshow "boolean" ("N"). Show description in the header.

-ausashow "boolean" ("N"). Show the full USA in the alignment.

-aglobal "boolean" ("N"). Show the full sequence in alignment.

6.4.4.1. Specifying the Alignment File Format and Name

Alignments have their own specific formats, but can also use many of the most common multiple sequence formats. To specify the required format use -aformat Format on the command line, where Format is the name of a supported alignment format (Section A.3, “Supported Alignment Formats”). For example:

water Seq1.seq Seq2.seq SeqOut.seq -aformat msf

Here the water application is called on two input sequences (Seq1.seq and Seq2.seq) to generate an alignment output file (SeqOut.seq) in MSF format.

As an alternative to naming the sequence alignment file directly, an output directory (-adirectory), base file name (-aname) and file name extension (-aextension) may be set. For example:

water Seq1.seq Seq2.seq -aextension "msf" -aname "SeqOut" -adirectory "alignments" -aformat msf

6.4.5. General Input

Qualifiers are used to specify the format of the input file.

The qualifiers below apply to the following ACD datatypes:

scop
codon
cpdb

-format "string" (""). Data format specific to the input data type.

6.4.6. Patterns

Qualifiers are used to specify the name and format of the file containing the patterns.

6.4.6.1. Regular Expressions

Regular expressions are covered by the following ACD datatype:

regexp

-pformat "string" (""). Pattern file format where the regular expression(s) are specified as a file of patterns with the syntax @FileName.

-pname "string" (""). Pattern base name.

6.4.6.2. Sequence Patterns

Sequence patterns are covered by the following ACD datatype:

pattern

-pformat "string" (""). Pattern file format where the pattern(s) are specified as a file of patterns with the syntax @FileName.

-pname "string" (""). Pattern base name.

-pmismatch "integer" ("0"). Pattern mismatches allowed.

6.4.7. General Output

Qualifiers are used to specify the format (excluding outfile and outfileall) and output directory for the output files.

6.4.7.1. General Output

The qualifiers below apply to the following ACD datatypes:

outcodon
outcpdb
outdata
outdiscrete
outdistance
outfreq
outmatrix
outmatrixf
outproperties
outscop
outtree

-odirectory "string" (""). Output directory.

-oformat "string" (""). Output format specific to this data type.

6.4.7.2. Output Files

General output also includes the following ACD datatypes:

outfile
outfileall

-odirectory "string" (""). Output directory.

6.4.8. Application Report Output

Qualifiers are used to specify the name and format of the report output file and which fields are written.

Each program that writes a report will have a default report format defined for that program. This format is usually table but other more appropriate formats can be chosen as the default.

The qualifiers below apply to the following ACD datatype:

report

-rformat "string" (""). Report format.

-rname "string" (""). Base file name.

-rextension "string" (""). File name extension.

-rdirectory "string" (""). Output directory.

-raccshow "boolean" ("N"). Show accession number in the report.

-rdesshow "boolean" ("N"). Show sequence description in the report.

-rscoreshow "boolean" ("N"). Show the score in the report.

-rusashow "boolean" ("N"). Show the full USA in the report.

-rmaxall "integer" ("0"). Maximum total hits to report.

-rmaxseq "integer" ("0"). Maximum hits to report for one sequence.

6.4.8.1. Usage Examples

You are not restricted to the default format. You specify the required format by adding -rformat Format on the command line, where Format is the name of a supported report format (Section A.4, “Supported Report Formats”). Many output feature formats are also valid as report formats. For example:

garnier -rformat gff

The output file can be named directly on the command line, for example:

garnier seq.in report.out -rformat gff

Alternatively, use -rname FileBaseName, -rextension FileExtension and -rdirectory Directory to specify the base file name, file extension and a directory for the report. For example:

garnier seq.in -rname report -rextension ".out" -rdirectory reports -rformat gff
