EMBOSS supports most of the common sequence alignment formats for input and output (see the EMBOSS Users Guide for a complete list). Alignments to be read or written are defined in the application ACD file although it is possible to create alignment objects directly if this is required.
Most of the alignment formats can include a standard header (given at the start of the alignment file) and in some cases a tail (given at the end) which gives information including the program, date, output filename, ID names of the sequences and some of the parameters and statistics of the alignment. There is also a subheader and subtail used for additional comments and annotation.
The tail section is used by some applications (e.g. merger in EMBOSS) to report special features of the alignment.
All alignments have certain basic properties:
Type of sequences (protein or nucleotide)
Number of sequences in the alignment
Minimum permissible number of sequences
Maximum permissible number of sequences
In addition there is often associated data including:
An integer or floating point matrix
Name of matrix
Gap insertion penalty
Gap extension penalty
For global alignments the full sequence or just the matching regions can be displayed. Optionally the alignment can include the accession number, sequence description and full USA of the aligned sequences.
The options above are usually set in the application ACD file (via attributes of the data definition) or on the command line (via qualifiers that are specific to alignments). For a description of these attributes and qualifiers see Section A.5, “Datatype-specific Attributes”.
Functions are provided to set these directly in case this is required.
Functions for manipulating alignments are organised into four groups:
Retrieving elements of an alignment object
Setting elements of an alignment object
AJAX library files for handling alignments are listed in the table (Table 6.17, “AJAX Library Files for Handling Alignments”). Library file documentation, including a complete description of datatypes and functions, is available at:
|Library File Documentation||Description|
|ajalign||Functions for handling sequence alignments|
|ajseq||General sequence handling|
ajalign.h/c. Defines the main alignment object (AjPAlign). It can be used for retrieving an input sequence alignment via ACD file processing. The header file contains most of the functions you will ever need for general handling of sequence alignments. The library also includes static datatypes and functions for handling alignments at a low level. You are unlikely to need these unless you plan to implement code to support new alignment formats. For advice on how to do this, ask the EMBOSS developers.
ajseq.h/c. Defines the AjPSeqset object. This is a set of sequences for general use and is used for handling input alignments from ACD files. ajseq.h/c contain extensive functions for handling sequence sets and thereby for general manipulations of sequence alignments.
Alignment input is handled as a special case of general sequence set input. The seqset ACD datatype is used, which reads multiple sequences as a single set. The ACD data definition must include the aligned: attribute (see below).
There is a dedicated datatype, align, for output alignments.
A typical ACD definition for an input alignment:
# multiple sequence input read as a single aligned set
seqset: sequence [
  parameter: "Y"
  type: "protein"
  aligned: "Y"
]
A typical ACD definition for an output alignment:
align: outfile [
  parameter: "Y"
  aformat: "srspair"
  type: "protein"
  minseqs: "2"
  maxseqs: "2"
  aglobal: "Y"
]
All data definitions for alignment input and output should have standard parameter names. These include:
sequence for any aligned input sequences
outfile for alignment output
Alternatives and variations (e.g. bfile for multiple outputs) are allowed
For more information see Appendix A, ACD Syntax Reference.
Attributes that are typically specified are summarised below. They are datatype-specific (Section A.5, “Datatype-specific Attributes”) unless they are indicated as being global attributes (Section A.4, “Global Attributes”).
parameter: Alignments are typically the primary input or output of an EMBOSS application and, as such, should be defined as parameters by using the global attribute parameter: "Y".
type: Specifies the type of the sequences in the input or output alignment and is used for validation purposes. See:
aligned: An input defined with the seqset or seqsetall datatype must have the aligned: attribute set to indicate whether the sequences are aligned or not.
aformat: The output format is normally set at the command line but a default may be hard-coded with aformat:. All common alignment formats are supported (see the EMBOSS Users Guide).
minseqs: Specifies the minimum number of expected sequences and is used for validation of output.
maxseqs: Specifies the maximum number of expected sequences and is used for validation of output.
aglobal: A boolean attribute which is set to "Y" if the output can contain more than one alignment from the same input.
For handling alignments, including input alignments defined in the ACD file, use the AjPSeqset object (ajseq.h).
For handling output alignments defined in the ACD file, use the AjPAlign object (ajalign.h).
Datatypes and functions for handling alignments via the ACD file are shown below (Table 6.18, “Datatypes and Functions for Alignment Input and Output”).
|To read an alignment||To write an alignment|
|To retrieve from ACD|
Your application code will call embInit to process the ACD file and command line (see Section 6.3, “Handling ACD Files”). All values from the ACD file are read into memory and files are opened as necessary. You have a handle on the files and memory through the ajAcdGet* family of functions, which return pointers to appropriate objects.
To retrieve an input alignment, an object pointer is declared and then initialised using ajAcdGetSeqset:
AjPSeqset seqset = NULL;
seqset = ajAcdGetSeqset("sequence");
To retrieve an output alignment stream, an object pointer is declared and initialised using ajAcdGetAlign:
AjPAlign outfile = NULL;
outfile = ajAcdGetAlign("outfile");
It is your responsibility to close any files and free up memory at the end of the program.
To close an output alignment stream the AJAX function ajAlignClose is used.
Alignment output objects are typically loaded from ACD file processing (see above). In the unlikely event that you need to create one manually you can use the default alignment object constructor ajAlignNew. All constructors return the address of a new object. In the following code the pointer does not need to be initialised to NULL but it is good practice to do so:
AjPAlign align = NULL;
align = ajAlignNew();
/* The object is instantiated and ready for use */
You must free the memory for an object once you have finished with it. The default destructor function is:
/* Destructor for Alignment objects */
void ajAlignDel (AjPAlign* pthys);
It is used as follows:
AjPAlign align = NULL;
align = ajAcdGetAlign("align");
/* Do something with alignment */
ajAlignDel(&align);
Applications that create alignment outputs usually generate aligned sequences which are then used to populate the alignment object. The following functions are available:
/* Defines a sequence set as an alignment. The sequences are stored
   internally and may be edited by alignment processing. */
AjBool ajAlignDefine (AjPAlign pthys, AjPSeqset seqset);

/* Defines a sequence pair as an alignment. The sequences are stored
   internally and may be edited by alignment processing. */
AjBool ajAlignDefineSS (AjPAlign pthys, AjPSeq seqa, AjPSeq seqb);

/* Defines a pair of char* strings as an alignment
   (names of sequences are also required) */
AjBool ajAlignDefineCC (AjPAlign pthys,
                        const char* seqa, const char* seqb,
                        const char* namea, const char* nameb);
There are several AJAX functions for writing out alignment information. Applications will usually create an alignment object through ACD processing, populate it with aligned sequences (see above) and then write it out using the functions below:
/* Writes an alignment file */
void ajAlignWrite (AjPAlign thys);

/* Reset to allow reuse of Alignment objects */
void ajAlignReset (AjPAlign thys);

/* Opens a new align file. Called by ACD processing */
AjBool ajAlignOpen (AjPAlign thys, const AjPStr name);

/* Writes an alignment header. Called by ajAlignWrite */
void ajAlignWriteHeader (AjPAlign thys);

/* Writes an alignment tail. Called by ajAlignWrite */
void ajAlignWriteTail (AjPAlign thys);

/* Sets the default format for an alignment to 'gff' if not already defined */
AjBool ajAlignFormatDefault (AjPStr* pformat);
Alignment object elements rarely need to be examined by the programmer. Functions are available to retrieve internal values.
/* Returns the length of an alignment. If the alignment has more
   than one subalignment, returns the total. */
ajint ajAlignGetLen (const AjPAlign thys);

/* Returns the filename */
const char* ajAlignGetFilename (const AjPAlign thys);

/* Returns the alignment format */
const AjPStr ajAlignGetFormat (const AjPAlign thys);
Alignment objects have elements that are used to populate the header and tail sections of the output (where the output format can include such extra detail).
Elements that can be set in the alignment header include the gap penalties, the scoring matrix (or its name) and the alignment score.
Functions for this are below. Note that the matrix name can be set directly or from a matrix object:
/* Setting elements of the alignment header */
void ajAlignSetGapI (AjPAlign thys, ajint gappen, ajint extpen);
void ajAlignSetGapR (AjPAlign thys, float gappen, float extpen);
void ajAlignSetMatrixName (AjPAlign thys, const AjPStr matrix);
void ajAlignSetMatrixNameC (AjPAlign thys, const char* matrix);
void ajAlignSetMatrixInt (AjPAlign thys, AjPMatrix matrix);
void ajAlignSetMatrixFloat (AjPAlign thys, AjPMatrixf matrix);
void ajAlignSetScoreI (AjPAlign thys, ajint score);
void ajAlignSetScoreL (AjPAlign thys, ajlong score);
void ajAlignSetScoreR (AjPAlign thys, float score);
The standard properties in the alignment subheader are the alignment length, identity, similarity, gaps and score.
The function to set these is:
void ajAlignSetSubStandard (AjPAlign thys, ajint iali);
Alternatively, you may set these manually using:
void ajAlignSetStats (AjPAlign thys, ajint iali, ajint len, ajint ident, ajint sim, ajint gaps, const AjPStr score);
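The counts passed to ajAlignSetStats can be derived from the aligned sequences themselves. The following is a minimal standalone sketch in plain C (it does not call the AJAX library) of how the length, identity and gap counts might be computed for a pair of gapped strings. The PairStats struct and the pair_stats function are hypothetical names used only for illustration, the gap character is assumed to be '-', and similarity is left out because it depends on the scoring matrix in use.

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical container for illustration; not an AJAX datatype */
typedef struct {
    int len;    /* number of alignment columns */
    int ident;  /* columns where both residues are the same */
    int gaps;   /* columns where either sequence has a gap */
} PairStats;

/*
** Hypothetical helper; not part of AJAX.
** Counts columns, identities and gaps for two gapped strings.
** Assumes '-' is the gap character and compares residues
** case-insensitively.
** Returns 1 on success, 0 if the strings differ in length.
*/
int pair_stats(const char *a, const char *b, PairStats *out)
{
    size_t n = strlen(a);
    size_t i;

    if (strlen(b) != n)
        return 0;

    out->len = (int) n;
    out->ident = 0;
    out->gaps = 0;

    for (i = 0; i < n; i++) {
        if (a[i] == '-' || b[i] == '-')
            out->gaps++;
        else if (toupper((unsigned char) a[i]) ==
                 toupper((unsigned char) b[i]))
            out->ident++;
    }

    return 1;
}
```

For the pair "ACGT-ACGT" and "ACCTTAC-T" this gives 9 columns, 6 identities and 2 gaps. In a real application these values would come from the alignment algorithm and be passed on as the len, ident and gaps arguments.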
The header section can include an optional comment. Similarly, the tail section is free text available to report any special notes on the alignment. Comments in the (sub)header and (sub)tail can be set, appended to or prepended to using the following functions:
/* Setting comments of the alignment (sub)header and (sub)tail */
void ajAlignSetHeader (AjPAlign thys, const AjPStr header);
void ajAlignSetHeaderApp (AjPAlign thys, const AjPStr header);
void ajAlignSetHeaderC (AjPAlign thys, const char* header);
void ajAlignSetSubHeader (AjPAlign thys, const AjPStr subheader);
void ajAlignSetSubHeaderApp (AjPAlign thys, const AjPStr subheader);
void ajAlignSetSubHeaderC (AjPAlign thys, const char* subheader);
void ajAlignSetSubHeaderPre (AjPAlign thys, const AjPStr subheader);
void ajAlignSetSubTail (AjPAlign thys, const AjPStr tail);
void ajAlignSetSubTailC (AjPAlign thys, const char* tail);
void ajAlignSetSubTailApp (AjPAlign thys, const AjPStr tail);
void ajAlignSetTail (AjPAlign thys, const AjPStr tail);
void ajAlignSetTailApp (AjPAlign thys, const AjPStr tail);
void ajAlignSetTailC (AjPAlign thys, const char* tail);
The alignment type (protein or nucleic) may be set directly (if it is not set already):
void ajAlignSetType (AjPAlign thys);
A range (or sub-ranges) of sequences to output can be set using:
/* Setting sequence range to output */
AjBool ajAlignSetRange (AjPAlign thys,
                        ajint start1, ajint end1, ajint len1, ajint off1,
                        ajint start2, ajint end2, ajint len2, ajint off2);
AjBool ajAlignSetSubRange (AjPAlign thys,
                           ajint substart1, ajint start1, ajint end1,
                           AjBool rev1, ajint len1,
                           ajint substart2, ajint start2, ajint end2,
                           AjBool rev2, ajint len2);
Finally, there is a function to set the alignment object to use external sequence references, which are references (copied pointers) rather than clones of the actual sequences:
/* Setting properties of alignment object */
void ajAlignSetExternal (AjPAlign thys, AjBool external);
This is intended for alignments of large sequences where it is undesirable to keep many copies.
The ajAlignConsStats function calculates a consensus sequence and statistics (percent identity and similarity, and alignment length) for a multiple alignment.
ajAlignFindFormat is used in ACD processing to match the specified alignment format to the internal list of known formats.
AjBool ajAlignConsStats (const AjPSeqset thys, AjPMatrix mymatrix,
                         AjPStr *cons, ajint* retident, ajint* retsim,
                         ajint* retgap, ajint* retlen);
AjBool ajAlignFindFormat (const AjPStr format, ajint* iformat);
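To make the idea of a consensus concrete, here is a minimal standalone sketch in plain C that takes the most frequent character in each column of a small multiple alignment. It is not the algorithm used by ajAlignConsStats, which also takes a scoring matrix and reports identity, similarity and gap counts. The function name column_consensus, the treatment of '-' as an ordinary character and the tie-breaking rule are all assumptions made for illustration.

```c
#include <stddef.h>
#include <string.h>

/*
** Hypothetical helper for illustration; not part of AJAX.
** Writes the most frequent character of each column into cons,
** which must have room for the row length plus a terminating NUL.
** Gap characters are counted like any other character.
** Ties go to the character that occurs in the top-most row.
** Returns 1 on success, 0 if nseq is zero or the rows differ in length.
*/
int column_consensus(const char *const *rows, size_t nseq, char *cons)
{
    size_t len, col, i;

    if (nseq == 0)
        return 0;

    len = strlen(rows[0]);
    for (i = 1; i < nseq; i++)
        if (strlen(rows[i]) != len)
            return 0;

    for (col = 0; col < len; col++) {
        int counts[256] = {0};
        int best = 0;

        /* Count each character seen in this column */
        for (i = 0; i < nseq; i++)
            counts[(unsigned char) rows[i][col]]++;

        /* Keep the first character that reaches the highest count */
        cons[col] = rows[0][col];
        for (i = 0; i < nseq; i++) {
            int c = counts[(unsigned char) rows[i][col]];

            if (c > best) {
                best = c;
                cons[col] = rows[i][col];
            }
        }
    }

    cons[len] = '\0';
    return 1;
}
```

For the three rows "ACG-T", "ACGAT" and "TCG-A" the sketch produces "ACG-T". A production consensus would normally weight residues with the scoring matrix and apply a plurality threshold rather than a simple majority count.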