6.2. Specifying Values for Application Options

6.2.1. General Rules

When specifying values on the command line, the following rules apply:

  • Flags (qualifier or parameter names) can be shortened as long as they remain unambiguous.

  • Flags can appear in any order, although care must be taken with options of the same datatype (see Section 6.1.4.1, “Multiple Qualifiers”).

  • Datatype-specific qualifiers (specific for a certain datatype instance) should immediately follow an option with that datatype. In this position, these flags apply only to that option and not to all options with that datatype.

  • Flags must start with either the hyphen - (UNIX style) or the forward slash / (OpenVMS style), unless there is an = sign between the qualifier/parameter name and the value (SeqPup command style).

  • Values are separated from the qualifier/parameter name by either a space (UNIX style) or an = sign (OpenVMS or SeqPup style).

  • If the equal sign (=) is used to assign a value to a qualifier, the prefix hyphen (-) or forward slash (/) can be omitted (SeqPup style).

  • Boolean (Yes/No, True/False) options have no attached value and are set True by giving the qualifier/parameter name, and set to False by adding the prefix no to the name.

  • Values given after flags are not usually case sensitive. An obvious exception is filenames, which are case sensitive on UNIX systems.
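
The rules above can be sketched in a few lines of Python. This is an illustration only, not how EMBOSS itself parses its command line: the function names resolve_flag and parse_flags and the option names in the usage note are hypothetical, and matching flag names without regard to case is an assumption.

```python
def resolve_flag(name, known):
    """Return the single known flag that `name` abbreviates, or raise ValueError."""
    if name in known:
        return name
    hits = [full for full in known if full.startswith(name)]
    if len(hits) != 1:
        raise ValueError(f"unknown or ambiguous flag: {name!r}")
    return hits[0]


def parse_flags(tokens, known):
    """Parse flag tokens; `known` maps full flag names to "boolean" or "value"."""
    result = {}
    i = 0
    while i < len(tokens):
        token = tokens[i]
        i += 1
        has_prefix = token[:1] in ("-", "/")      # UNIX or OpenVMS style
        body = token[1:] if has_prefix else token
        name, eq, value = body.partition("=")
        if not has_prefix and not eq:
            # Without "=" a flag needs a leading "-" or "/".
            raise ValueError(f"not a flag: {token!r}")
        name = name.lower()                       # assumption: names ignore case
        negated = False
        try:
            full = resolve_flag(name, known)
        except ValueError:
            # Boolean options are switched off with a "no" prefix.
            if not name.startswith("no") or len(name) == 2:
                raise
            full = resolve_flag(name[2:], known)
            if known[full] != "boolean":
                raise
            negated = True
        if known[full] == "boolean":
            result[full] = not negated
        elif eq:
            result[full] = value                  # name=value form
        else:
            if i >= len(tokens):
                raise ValueError(f"missing value for {full!r}")
            result[full] = tokens[i]              # value follows after a space
            i += 1
    return result
```

With the hypothetical option table known = {"sequence": "value", "outseq": "value", "sbegin": "value", "reverse": "boolean"}, the call parse_flags(["-seq", "in.fasta", "/outseq=out.fasta", "sbegin=10", "-norev"], known) returns {"sequence": "in.fasta", "outseq": "out.fasta", "sbegin": "10", "reverse": False}. The abbreviation -s is rejected because it could mean either sequence or sbegin.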

The value that must be given depends upon the ACD datatype of the option in question (see below). For convenience, the available ACD datatypes (and hence options) are organised into five groupings, reflecting similar properties or modes of usage:

  • Simple Datatypes

  • Input Datatypes

  • Selection Datatypes

  • Output Datatypes

  • Graphics Datatypes

6.2.2. Simple ACD Datatypes

The simple datatypes include primitive types such as string and integer, and more complex datatypes such as ranges.

6.2.2.1. Primitive Datatypes

Primitive ACD datatypes include:

boolean

Simple boolean value

float

Simple floating point number

integer

Simple integer number

string

Simple string

toggle

Simple boolean switch for controlling other parameters

6.2.2.1.1. boolean

The data value is "true" or "false" and is specified as follows:

"Y"
"yes"
"true"
"N"
"no"
"false"

The value will be "Y" when the parameter name is entered on the command line as a flag, for example:

-ToggleOption

If the qualifier is absent from the command line the default value is used. The flag can also be prefixed by no, for example:

-noToggleOption

to force the value to be "N". This is needed if the default value is "Y".
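
The accepted spellings listed above can be mapped onto a true/false value as in the following sketch. The function name parse_boolean is hypothetical, and ignoring case is an assumption based on the general rule that values are not usually case sensitive.

```python
TRUE_VALUES = {"y", "yes", "true"}
FALSE_VALUES = {"n", "no", "false"}


def parse_boolean(text):
    """Map one of the documented spellings onto True or False."""
    value = text.strip().lower()   # assumption: case is ignored
    if value in TRUE_VALUES:
        return True
    if value in FALSE_VALUES:
        return False
    raise ValueError(f"not a boolean value: {text!r}")
```

Any other spelling is rejected with a ValueError in this sketch; how EMBOSS itself reacts to an unrecognised value is not covered here.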

6.2.2.1.2. float

The data value is any valid floating point number. For example:

"100.24"

6.2.2.1.3. integer

The data value is any integer value. For example:

"100"

6.2.2.1.4. string

The data value is any valid ASCII text string which should be enclosed in quotes. For example:

"This is a valid text string"

6.2.2.1.5. toggle

The data value is "true" or "false" and is specified as follows:

"Y"
"yes"
"true"
"N"
"no"
"false"

Toggle parameters work in exactly the same way as boolean parameters (see above) but are used to control prompting for other parameters (turn prompting on or off). See the EMBOSS Developers Guide for further information.

6.2.2.2. Other Simple Datatypes

Other simple datatypes include:

array

List of either integer or floating point numbers

range

Range of sequence positions

regexp

Regular expression pattern

pattern

A sequence pattern

6.2.2.2.1. array

The data value is a list of numbers separated by spaces or commas. For example:

"1 2 3 4 5"
"1.5, 2.0, 2.5, 3.0"

6.2.2.2.2. range

One or more ranges may be defined on the command line or in a range file.

On the command line, a range is defined by a pair of integer numbers and multiple ranges may be given. The numbers may be delimited by any non-digit, non-alphabetic character. For example:

"24-45, 56-78"
"1:45, 67=99;765..888"
"1,5,8,10,23,45,57,99"

A range file contains a list of pairs of numbers with optional text comments. One pair of numbers must be given per line and the file can contain comment lines which are preceded with a '#' character. For example:

# A set of ranges in a range file.
 12      23      
  4      5       This is an optional comment.
 67      10348   Another comment.

Range files are specified on the command line by preceding the filename with @. For example, for the range file RangeFileName:

@RangeFileName

In cases where the numbers are sequence positions, the upper and lower bounds will in practice depend on the length of the sequence to which they are applied. You should bear in mind that sequence positions can be negative, in which case they count back from the end of the sequence.
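
As a rough illustration of the two forms described above, the sketch below pairs up the numbers in a command-line range string and reads the number pairs from the lines of a range file. The function names are hypothetical. Because a hyphen can be either a delimiter or a minus sign, the sketch reads unsigned positions only; handling negative positions is deliberately left out.

```python
import re


def parse_ranges(text):
    """Pair up the unsigned integers found in a command-line range string."""
    numbers = [int(n) for n in re.findall(r"\d+", text)]
    if len(numbers) % 2:
        raise ValueError("each range needs a start and an end position")
    return list(zip(numbers[0::2], numbers[1::2]))


def parse_range_file(lines):
    """Read one (start, end) pair per line, skipping blank and '#' lines."""
    ranges = []
    for line in lines:
        fields = line.split()
        if not fields or fields[0].startswith("#"):
            continue
        # Anything after the first two numbers is an optional comment.
        ranges.append((int(fields[0]), int(fields[1])))
    return ranges
```

For example, parse_ranges("1:45, 67=99;765..888") returns [(1, 45), (67, 99), (765, 888)], and the range file shown above yields [(12, 23), (4, 5), (67, 10348)].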

6.2.2.2.3. regexp

EMBOSS uses the "Perl-Compatible Regular Expression Library" (PCRE) release 4.3 to process regular expressions, so any regular expression that is valid in Perl 5.0 (http://search.cpan.org/~nwclark/perl-5.8.7/pod/perlre.pod) should be valid here.

6.2.3. Input ACD Datatypes

The input datatypes cater for input of sequences, sequence features, files and directories, inputs specific to EMBASSY packages (e.g. phylipnew), data files and other files of biological data.

6.2.3.1. Sequence Input

Input datatypes for handling biological sequences include:

sequence

A single sequence for reading

seqall

A set of single sequences that are addressed one after another

seqset

A set of single sequences that can be used all at the same time

seqsetall

One or more sets of single sequences that can be used all at the same time

The data value in all cases is the Uniform Sequence Address or USA (Section 6.6, “The Uniform Sequence Address (USA)”) of one or more sequences. The USA might specify a literal sequence, database reference, file or some other sequence reference.

6.2.3.1.1. sequence

The data value is the USA of a single sequence.

6.2.3.1.2. seqall

The data value is the USA of a set of sequences to be read one at a time. For example, the USA might specify a sequence database for sequential reading of entries.

6.2.3.1.3. seqset

The data value is the USA of a set of single sequences. For example, a set of sequences from a multiple alignment file, or sequences from a database.

6.2.3.1.4. seqsetall

The data value is the USA of one or more sets of single sequences. For example, sets of sequences from two databases or two alignment files. The data value would typically be a listfile: a file containing a list of USAs (see Section 6.6, “The Uniform Sequence Address (USA)”).

6.2.3.2. Feature Input

There is a single datatype for handling biological sequence features input:

features

Sequence feature annotation in any known feature format

6.2.3.2.1. features

The data value is the name of a features file. A features file contains sequence feature information. Several feature formats are supported (Section A.2, “Supported Feature Formats”).

6.2.3.3. Files and Directories

Input datatypes for handling general files and directories include:

directory

A directory that can be used for input or output

dirlist

A list of file names that are read from a directory

filelist

A list of input files

infile

Non-sequence-related input file

6.2.3.3.1. directory

The data value is the name of any valid directory. For example:

"."
"/data"
"/data/sequences"

6.2.3.3.2. dirlist

The data value is the name of any valid directory. For example:

"."
"/data"
"/data/sequences"

6.2.3.3.3. filelist

The data value is a list of file names separated by commas. For example:

"../data/file1.dat, file2.dat"

6.2.3.3.4. infile

The data value is the name of an input file. For example:

"data.in"
"/data/infile.1" 

infile is used for files of data not catered for by some other ACD datatype. For example, an infile would not normally contain sequence data.

6.2.3.4. Data Files

Input datatypes for handling data files include:

datafile

A formatted data file read from the standard EMBOSS data search path

matrix

Comparison matrix file (integer values)

matrixf

Comparison matrix file (floating point values)

In all cases, the data value is the name of a file in the EMBOSS data search path (Section 2.8, “Maintenance”).

Typically where a comparison matrix is specified, gap penalties will also be required. These must be specified separately in one or more other data definitions (see the EMBOSS Developers Guide). The matrix files distributed with BLAST are also distributed with EMBOSS in the EMBOSS data directory.

6.2.3.4.1. datafile

The data value is the name of a data file. Many data files already have their own ACD datatype, for example, matrix, matrixf and codon. Other data files do not have or need their own ACD definition and datafile is used for these.

6.2.3.4.2. matrix

The data value is the name of an integer comparison matrix file. Applications using integer matrices are usually faster than those using floating point matrices.

6.2.3.4.3. matrixf

The data value is the name of a floating point comparison matrix file in the EMBOSS data search path (Section 2.8, “Maintenance”).

The matrixf datatype defines floating point matrices, which usually involve slower calculation times than integer matrices. An integer matrix file can of course also be read as floating point.

6.2.3.5. Datatypes for phylipnew EMBASSY Package

Input datatypes specific to the phylipnew EMBASSY package are given below. These provide detailed type checking, and can automatically detect and validate the various alternative formats that phylip supports without the need for complex extra command line options.

discretestates

Discrete states file

distances

Distance matrix

frequencies

Frequency value(s)

properties

Property value(s)

tree

Phylogenetic tree

6.2.3.5.1. discretestates

The data value is the name of a discrete states file and is used by the phylip "discrete character" applications.

6.2.3.5.2. distances

The data value is the name of a distances file as used by the phylip "distance matrix" applications.

6.2.3.5.3. frequencies

The data value is the name of a frequencies file as used by the phylip "gene frequency and continuous character" applications.

6.2.3.5.4. properties

The data value is the name of a properties file as used by the phylip applications to define weights, ancestral states and factors (multi-state characters).

6.2.3.5.5. tree

The data value is the name of a tree file and is used as input to the phylip applications to define one or more phylogenetic trees.

6.2.3.6. Other Biological Inputs

Other biological input datatypes include:

codon

Codon usage table file

cpdb

Protein coordinate data in a simple file format (clean coordinate file)

scop

SCOP and CATH domain classification information in a simple file format (domain classification file)

6.2.3.6.1. codon

The data value is the name of a codon usage table file in the EMBOSS data search path (Section 2.8, “Maintenance”).

Codon usage files are distributed in the EMBOSS data directory. They are ASCII text files and can be read in several formats.

6.2.3.6.2. cpdb

The data value is the name of a CCF file.

CCF (clean coordinate file) format is a simple "clean" file format for protein and domain coordinate data. See the documentation for pdbparse, part of the EMBASSY domainatrix package, which generates CCF files from PDB file input.

6.2.3.6.3. scop

The data value is the name of a DCF file.

DCF (domain classification file) format is a simple "clean" file format for domain classification data. See the documentation for domainer, part of the EMBASSY domainatrix package, which generates DCF files from SCOP and CATH file input.

6.2.3.6.4. pattern

The standard IUPAC one-letter codes for the amino acids and nucleotides are used. The symbol x is used for a position where any amino acid is accepted. The symbol n is used for a position where any nucleotide is accepted.

Ambiguities are indicated by listing the acceptable amino acids or bases for a given position between square brackets [ ]. For example:

[ALT]

stands, in the case of proteins, for Ala or Leu or Thr.

Ambiguities are also indicated by listing between a pair of curly brackets { } the amino acids or bases that are not accepted at a given position. For example:

{AM}

stands, in the case of proteins, for any amino acid except Ala and Met.

Each element in a pattern is separated from its neighbor by a '-' (dash). Repetition of an element of the pattern can be indicated by following that element with a numerical value or a numerical range in parentheses. For example:

x(3) corresponds to x-x-x
x(2,4) corresponds to x-x or x-x-x or x-x-x-x

When a pattern is restricted to either the N- or C-terminal (5' or 3') end of a sequence, the pattern starts with a '<' (reverse chevron) symbol or ends with a '>' (forward chevron) symbol, respectively. A period ends the pattern (in most cases optionally). For example:

[DE](2)HS{P}X(2)PX(2,4)C.
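
One way to see what such a pattern means is to translate it into an ordinary regular expression. The sketch below does this for protein patterns using Python's re syntax. It is not the matcher EMBOSS uses, the function name pattern_to_regex is hypothetical, and nucleotide patterns (where n means any base) are not handled.

```python
import re


def pattern_to_regex(pattern):
    """Translate a protein sequence pattern into a Python regular expression."""
    body = pattern.strip()
    if body.endswith("."):
        body = body[:-1]                  # the closing period is optional
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch == "-":
            pass                          # element separator, nothing to emit
        elif ch == "<":
            out.append("^")               # anchored at the N-terminal end
        elif ch == ">":
            out.append("$")               # anchored at the C-terminal end
        elif ch in "xX":
            out.append(".")               # any amino acid
        elif ch == "[":
            end = body.index("]", i)
            out.append("[" + body[i + 1:end].upper() + "]")
            i = end
        elif ch == "{":
            end = body.index("}", i)      # residues that are not accepted
            out.append("[^" + body[i + 1:end].upper() + "]")
            i = end
        elif ch == "(":
            end = body.index(")", i)      # repeat count or range
            out.append("{" + body[i + 1:end].replace(" ", "") + "}")
            i = end
        else:
            out.append(ch.upper())
        i += 1
    return "".join(out)
```

For the example pattern above, pattern_to_regex("[DE](2)HS{P}X(2)PX(2,4)C.") returns [DE]{2}HS[^P].{2}P.{2,4}C, which matches the sequence DEHSAAAPAAC when used with re.search.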

6.2.4. Output ACD Datatypes

The output datatypes cater for output of sequences, sequence features, alignments, files and directories, outputs specific to EMBASSY packages (e.g. phylipnew), data files, other files of biological data and formatted application output files (reports).

6.2.4.1. Sequence Output

Output datatypes for handling biological sequences include:

seqout

Output file for single sequence

seqoutall

Output file for multiple sequences

seqoutset

A set of single sequences stored in memory together, to be written to a file

The behaviour of these datatypes is identical, but they are provided for consistency with the input sequence datatypes (see above).

The data value in all cases is the USA (Section 6.6, “The Uniform Sequence Address (USA)”) of an output sequence stream. FASTA format is used by default for the output sequence(s). The format is normally set at the command line but a default may be hard-coded with osformat: in an ACD file.

6.2.4.1.1. seqout

The data value is a USA for a single output sequence, for example, the name of a file.

6.2.4.1.2. seqoutall

The data value is a USA for multiple output sequences, for example, the name of a file.

6.2.4.1.3. seqoutset

The data value is a USA for multiple output sequences stored as a set in memory together, to be written to file.

6.2.4.2. Features

There is a single output datatype for handling biological sequence features:

featout

Output file for sequence feature annotation

6.2.4.2.1. featout

The data value is any valid file name. The data is stored as a feature table. Most common sequence feature formats are supported (Section A.2, “Supported Feature Formats”).

GFF format is used by default for the output feature(s). The format is normally set at the command line but a default may be hard-coded in the ACD file using the offormat: attribute.

6.2.4.3. Alignments

There is a single output datatype for handling alignments:

align

Output file for sequence alignments

6.2.4.3.1. align

An alignment output file is defined in the same way as a plain output file (outfile datatype) but has extra qualifiers (Section 6.4, “Datatype-specific Command Line Qualifiers”) to allow a choice of alignment formats and attributes. These can specify whether the alignment will have 2 or more sequences (which limits the possible formats).

The data value is any valid file name. The data is stored as sequences and all of the common alignment formats are supported (Section A.3, “Supported Alignment Formats”).

6.2.4.4. Output Files and Directories

Output datatypes for handling general files and directories of files include:

outdir

Output directory for the writing of multiple output files

outfile

General output file

outfileall

Multiple general output files

outfile and outfileall are used for data not catered for by some other output ACD datatype. For example, the output file would not normally contain sequence data. They are suitable for general application output in plain text.

6.2.4.4.1. outdir

The data value is the name of any valid directory. For example:

"."
"/data"
"/data/sequences"

6.2.4.4.2. outfile

The data value is the name of an output file.

6.2.4.4.3. outfileall

The data value is the base file name for multiple output files.

6.2.4.5. Output Data Files

Output datatypes for handling data files include:

outdata

Output file for data formatted cleanly as a table or list

outmatrix

Output file for integer comparison matrix data

outmatrixf

Output file for floating point comparison matrix data

In all cases the data value is any valid file name.

6.2.4.5.1. outdata

The output corresponding to multiple outdata definitions in an ACD file is appended to a single file. The individual ACD definitions allow the format of each file section to be defined.

6.2.4.5.2. outmatrix

The data value is the name of an integer substitution matrix in the EMBOSS data search path (Section 2.8, “Maintenance”).

6.2.4.5.3. outmatrixf

The data value is the name of a floating point substitution matrix in the EMBOSS data search path (Section 2.8, “Maintenance”).

6.2.4.6. Datatypes for phylipnew EMBASSY Package

Output datatypes specific to the phylipnew EMBASSY package are given below. By defining specific ACD datatypes for phylipnew, EMBOSS can provide detailed type checking, and can automatically detect and validate the various alternative formats that phylip supports without the need for complex extra command line options.

outdiscrete

Output file for phylogenetics discrete characteristics data

outdistance

Output file for phylogenetics distance matrix data

outfreq

Output file for phylogenetics character frequency data

outproperties

Output file for phylogenetics property data

outtree

Output file for phylogenetic tree data

In all cases, the data value is any valid file name.

6.2.4.6.1. outdiscrete

The data value is a name for the discrete states output file.

6.2.4.6.2. outdistance

The data value is a name for the distances output file.

6.2.4.6.3. outfreq

The data value is a name for the frequencies output file.

6.2.4.6.4. outproperties

The data value is a name for the properties output file.

6.2.4.6.5. outtree

The data value is a name for the tree output file.

6.2.4.7. Other Biological Outputs

Other biological output datatypes include:

outcodon

Output file for codon usage data

outcpdb

Output file for protein coordinate data in CCF (clean coordinate file) format

outscop

Output file for SCOP and CATH domain classification information in DCF (domain classification file) format

The data value is any valid file name.

6.2.4.7.1. outcodon

The data value is a name for the codon usage output file.

The data is stored as a codon usage table. Codon usage table files are ASCII text files and can be written in several formats.

6.2.4.7.2. outcpdb

The data value is a name for the CCF output file.

CCF (clean coordinate file) format is a simple "clean" file format for protein and domain coordinate data. See the documentation for pdbparse, part of the EMBASSY domainatrix package, which generates CCF files from PDB file input.

6.2.4.7.3. outscop

The data value is a name for the DCF output file.

DCF (domain classification file) format is a simple "clean" file format for domain classification data. See the documentation for domainer, part of the EMBASSY domainatrix package, which generates DCF files from SCOP and CATH file input.

6.2.4.8. Report Output

The datatype for handling formatted application output is:

report

Output file for sequence annotation

6.2.4.8.1. report

The data value is any valid file name.

Report data is stored internally as a feature table, so the available formats (Section A.4, “Supported Report Formats”) include the most common feature formats.

A report file is defined in the same way as a plain output file (outfile datatype) but has extra qualifiers (Section 6.4, “Datatype-specific Command Line Qualifiers”) to allow a choice of report formats.

6.2.5. Selection ACD Datatypes

Two datatypes cater for menus. In either case, you'll be presented with a limited list of options, each with a label and descriptive text, to choose from.

list

A list of options (typically terse text descriptions) with text labels

selection

A list of options (typically verbose text descriptions) with automatically-generated numerical labels

The data value is one (or more) of the valid options. An option is specified by label (whether text or numerical) or by a non-ambiguous part of the descriptive text itself given after the label. If multiple selections are allowed, you must supply a comma-separated list of options.
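
The matching described above might be sketched as follows. The function name choose is hypothetical, and two details are assumptions rather than documented behaviour: an exact label match takes priority, and "part of the descriptive text" is read as a leading part, compared without regard to case. The menu is the one from the selection example in this section.

```python
def choose(reply, options, multiple=False):
    """Resolve a reply against (label, description) pairs and return labels."""
    picks = reply.split(",") if multiple else [reply]
    chosen = []
    for pick in picks:
        key = pick.strip().lower()
        # An exact label match wins outright (assumption).
        hits = [label for label, _ in options if label.lower() == key]
        if not hits:
            # Otherwise the reply must begin exactly one description.
            hits = [label for label, text in options
                    if text.lower().startswith(key)]
        if len(hits) != 1:
            raise ValueError(f"unknown or ambiguous option: {pick!r}")
        chosen.append(hits[0])
    return chosen


# Menu copied from the "Directories to ignore" selection example.
DIRECTORIES = [("1", "None"), ("2", "AAINDEX"), ("3", "CVS"), ("4", "CODONS"),
               ("5", "PRINTS"), ("6", "PROSITE"), ("7", "REBASE")]
```

Here choose("3,5,6", DIRECTORIES, multiple=True) returns ["3", "5", "6"] and choose("PRI", DIRECTORIES) returns ["5"], while choose("PR", DIRECTORIES) raises ValueError because both PRINTS and PROSITE begin with those letters.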

6.2.5.1. list

Here is the prompt for a list datatype:

Translation frames

   1     1
   2     2
   3     3
   F     Forward three frames
  -1    -1
  -2    -2
  -3    -3
   R     Reverse three frames
   6     All six frames

Frame(s) to translate [1]:

Assuming a single selection only is allowed, these are all valid selections:

"1"
"F"
"Forward"
"For"
"R"
"Reverse"
"Rev"

6.2.5.2. selection

Here is the prompt for a selection datatype:

Directories to ignore
1        None
2        AAINDEX
3        CVS
4        CODONS
5        PRINTS
6        PROSITE
7        REBASE

Select directories [3, 5, 6]:

Assuming multiple selections are allowed then here are some valid selections:

"3,5,6"
"3"
"CVS"
"5"
"PRINTS"
"PRI"

6.2.6. Graphics ACD Datatypes

The graphics datatypes cater for graphical output:

graph

Graphical output of any general kind

xygraph

Graphical output as a simple two dimensional (2D) XY plot with the sequence along the x-axis

The data value is the graphics device, as limited by the PLPLOT graphics library currently used by EMBOSS. The currently supported devices include ps for Postscript, png for PNG files, and X11 for X-Windows. A value of ? in answer to the prompt will list the available graphics devices on your installation:

"ps"
"png"
"X11"
"gif"
"ps"
"cps"
"?"

6.2.6.1. graph

The data value is the graphics device for a general graph. Dotplots may be generated with the graph datatype.

6.2.6.2. xygraph

The data value is the graphics device for a 2D graph.