6.10. Handling Comparison Matrices

6.10.1. Introduction

Matrices are commonly used in molecular sequence analysis to compare sequence characters at the same position in two or more aligned sequences.

Matrix objects are created by reading a matrix local data file, either through ACD (where the user has a choice of files), or by directly reading a named file where the filename is fixed.

EMBOSS includes sets of comparison matrix files in the data directory which can be used as examples when creating new files.

Matrix objects are in two very similar forms, using integers (AjPMatrix) for speed, and floating point numbers (AjPMatrixf) for flexibility. Both types include a set of column labels (usually sequence characters), and a set of row labels which usually matches the column labels (asymmetric matrices are used in some applications). For rows and columns there is also a matrix size value. The numbers in the data file become a two-dimensional table of comparison values.

For sequence alignments, functions are provided that use a matrix object to align two sequences.

Applications that analyse sequence alignments, for example prettyplot, can directly use the conversion table and character codes in a matrix to look up comparison scores using an AjPSeqCvt object.
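The lookup described above can be sketched in plain C. This is a minimal illustration and does not use the AJAX library: the names sketch_scores, sketch_code and sketch_pair_score are hypothetical, the score values are invented for the example, and index 0 is assumed to be reserved for characters that are not defined in the matrix.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical 5x5 table (not a real EMBOSS matrix). Index 0 is
   reserved for unknown characters; indices 1..4 are A, C, G, T.
   The values are illustrative only. */
enum { SKETCH_SIZE = 5 };

static const int sketch_scores[SKETCH_SIZE][SKETCH_SIZE] = {
    /*        ?   A   C   G   T */
    /* ? */ { 0,  0,  0,  0,  0 },
    /* A */ { 0,  5, -4, -4, -4 },
    /* C */ { 0, -4,  5, -4, -4 },
    /* G */ { 0, -4, -4,  5, -4 },
    /* T */ { 0, -4, -4, -4,  5 }
};

/* Map a sequence character to its table index (0 if not defined). */
int sketch_code(char ch)
{
    switch (ch) {
    case 'A': return 1;
    case 'C': return 2;
    case 'G': return 3;
    case 'T': return 4;
    default:  return 0;
    }
}

/* Sum the comparison scores over the aligned positions of two
   equal-length sequences. */
int sketch_pair_score(const char *a, const char *b)
{
    size_t i;
    size_t len;
    int total = 0;

    assert(a != NULL && b != NULL);
    len = strlen(a);
    assert(strlen(b) == len);

    for (i = 0; i < len; i++)
        total += sketch_scores[sketch_code(a[i])][sketch_code(b[i])];

    return total;
}
```

Calling sketch_pair_score("ACGT", "ACGA") adds three matches and one mismatch. In a real application the character codes and the score table would come from the matrix object rather than from hard-coded values.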

6.10.2. AJAX Library Files

AJAX library files for handling matrices are listed in the table (Table 6.15, “AJAX Library Files for Handling Matrices”). Library file documentation, including a complete description of datatypes and functions, is available at:

http://emboss.open-bio.org/rel/dev/libs/

Table 6.15. AJAX Library Files for Handling Matrices
Library File Documentation    Description
ajmatrices                    Comparison matrix handling functions

ajmatrices.h/c: Defines the AjPMatrix and AjPMatrixf objects and functions for handling comparison matrices.

6.10.3. ACD Datatypes

There are two datatypes for handling comparison matrix input:

matrix

Integer comparison matrix.

matrixf

Floating point comparison matrix.

There are two datatypes for handling comparison matrix output:

outmatrix

Output integer comparison matrix.

outmatrixf

Output floating point comparison matrix.

6.10.4. ACD Data Definition

Typical ACD definitions for comparison matrix input:

# Integer matrix (input)
matrix: matrix
[
    information: "Matrix file"
    protein: "$(acdprotein)"
]

# Floating point matrix (input)
matrixf: matrixf
[
    information: "Matrix file"
    protein: "$(acdprotein)"
]

Typical ACD definitions for comparison matrix output:

# Integer matrix (output)
outmatrix: outmatrix
[
    information: "Matrix file"
    protein: "$(acdprotein)"
]

# Floating point matrix (output)
outmatrixf: outmatrixf
[
    information: "Matrix file"
    protein: "$(acdprotein)"
]

6.10.4.1. Parameter Name

All data definitions for comparison matrix input and output should have a standard parameter name, which is matrix. For further information see Appendix A, ACD Syntax Reference.

6.10.4.2. Common Attributes

Attributes that are typically specified are summarised below. They are datatype-specific (Section A.5, “Datatype-specific Attributes”) unless they are indicated as being global attributes (Section A.4, “Global Attributes”).

information: A global attribute. It specifies the user prompt and is used in the application documentation.

protein: A boolean attribute which if set specifies that the matrix is for proteins. If not set the matrix is presumed to be for nucleic acids.

6.10.5. AJAX Datatypes

For handling comparison matrices, including input matrices defined in the ACD file, use:

AjPMatrix

Integer comparison matrix (for matrix ACD datatype).

AjPMatrixf

Floating point comparison matrix (for matrixf ACD datatype).

For handling comparison matrix output use:

AjPOutfile

General output file (for outmatrix and outmatrixf ACD datatypes).

It is sometimes necessary to convert a sequence into numerical form for convenient processing. The AJAX datatype for this is:

AjPSeqCvt

Used for sequence conversion into numerical form.

6.10.6. ACD File Handling

Datatypes and functions for handling comparison matrices via the ACD file are shown below (Table 6.16, “Datatypes and Functions for Comparison Matrix Input and Output”).

Table 6.16. Datatypes and Functions for Comparison Matrix Input and Output
ACD datatype    AJAX datatype    To retrieve from ACD
Comparison Matrix Input
matrix          AjPMatrix        ajAcdGetMatrix
matrixf         AjPMatrixf       ajAcdGetMatrixf
Comparison Matrix Output
outmatrix       AjPOutfile       ajAcdGetOutmatrix
outmatrixf      AjPOutfile       ajAcdGetOutmatrixf

Your application code will call embInit to process the ACD file and command line (see Section 6.3, “Handling ACD Files”). All values from the ACD file are read into memory and files are opened as necessary. You have a handle on the files and memory through the ajAcdGet* family of functions which return pointers to appropriate objects.

6.10.6.1. Input Comparison Matrix Retrieval

To retrieve a comparison matrix, an object pointer is declared and then initialised using the appropriate ajAcdGet* function.

6.10.6.1.1. Integer comparison matrix

    AjPMatrix matrix = NULL;

    matrix = ajAcdGetMatrix("matrix");

6.10.6.1.2. Floating point comparison matrix

    AjPMatrixf matrix = NULL;

    matrix = ajAcdGetMatrixf("matrix");

6.10.6.2. Output Comparison Matrix Retrieval

To retrieve an output comparison matrix an object pointer is declared and initialised using the appropriate ajAcdGet* function.

6.10.6.2.1. Integer comparison matrix

    AjPOutfile outmatrix = NULL;

    outmatrix = ajAcdGetOutmatrix("outmatrix");

6.10.6.2.2. Floating point comparison matrix

    AjPOutfile outmatrixf = NULL;

    outmatrixf = ajAcdGetOutmatrixf("outmatrixf");

6.10.6.3. Processing Command line Options and ACD Attributes

Currently there are no functions for this.

6.10.6.4. Memory and File Management

It is your responsibility to close any files and free up memory at the end of the program.

6.10.6.4.1. Closing Files

You must close the output file for any outmatrix or outmatrixf definitions in the ACD file by calling ajOutfileClose with the address of the output file object:

    AjPOutfile outmatrix  = NULL;
    AjPOutfile outmatrixf = NULL;

    outmatrix = ajAcdGetOutmatrix("outmatrix");
    outmatrixf = ajAcdGetOutmatrixf("outmatrixf");

    /* Do something with matrices */

    ajOutfileClose(&outmatrix);
    ajOutfileClose(&outmatrixf);

6.10.6.4.2. Freeing Memory

You must call the default destructor function (see below) on any comparison matrix objects returned by calls to ajAcdGet*.

6.10.7. Matrix Object Memory Management

6.10.7.1. Default Object Construction

Matrix objects are usually created through an ACD definition, reading a named EMBOSS local data file. To create a matrix object directly (perhaps where there is no choice of matrix filename) the object must be constructed from a given local data file by calling:

AjPMatrix  ajMatrixNewFile (const AjPStr filename);
AjPMatrixf  ajMatrixfNewFile (const AjPStr filename);

The functions take the name of the data file to open. The file must be found in the EMBOSS data path, including the current directory and the installed data files.

All constructors return the address of a new object. The pointers do not need to be initialised to NULL but it is good practice to do so:

AjPMatrix  intmatrix   = NULL;
AjPMatrixf floatmatrix = NULL;
AjPStr filename = NULL;

filename = ajStrNewC("EBLOSUM62");

intmatrix = ajMatrixNewFile(filename);
floatmatrix = ajMatrixfNewFile(filename);

6.10.7.2. Default Object Destruction

You must close any output files and free the memory for your objects once you are finished with them.

To close an output file (AjPOutfile) call ajOutfileClose, or call ajFileClose for general file objects (AjPFile):

void  ajFileClose (AjPFile* Pfile);         
void  ajOutfileClose (AjPOutfile* Pfile);  

The objects are freed by calling the destructor functions:

void  ajMatrixDel (AjPMatrix *thys);    
void  ajMatrixfDel (AjPMatrixf *thys);  

They are used as follows:

    AjPMatrix matrix  = NULL;
    AjPMatrixf matrixf = NULL;

    matrix = ajAcdGetMatrix("matrix");
    matrixf = ajAcdGetMatrixf("matrixf");

    /* Do something with matrices */

    ajMatrixDel(&matrix);
    ajMatrixfDel(&matrixf);
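The destructors above take the address of the object pointer rather than the pointer itself. A minimal sketch of that convention in plain C follows. It does not use the AJAX library: SketchObj, sketch_obj_new and sketch_obj_del are hypothetical names, and the sketch assumes that the purpose of passing the address is to let the destructor reset the caller's pointer to NULL.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical object that owns one block of memory. */
typedef struct SketchObj {
    int *data;
} SketchObj;

/* Constructor: returns a new object, or NULL if allocation fails. */
SketchObj *sketch_obj_new(size_t n)
{
    SketchObj *obj;

    assert(n > 0);

    obj = malloc(sizeof *obj);
    if (obj == NULL)
        return NULL;

    obj->data = calloc(n, sizeof *obj->data);
    if (obj->data == NULL) {
        free(obj);
        return NULL;
    }

    return obj;
}

/* Destructor: takes the address of the pointer, frees the object and
   resets the caller's pointer so that it cannot be used afterwards. */
void sketch_obj_del(SketchObj **pobj)
{
    assert(pobj != NULL);

    if (*pobj == NULL)
        return;

    free((*pobj)->data);
    free(*pobj);
    *pobj = NULL;
}
```

With this design a second call to sketch_obj_del on the same pointer does nothing, which is one reason for initialising object pointers to NULL when they are declared.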

6.10.7.3. Alternative Object Construction and Loading

Internally, the matrix object constructor functions are:

AjPMatrix  ajMatrixNew (const AjPPStr codes, ajint n, const AjPStr filename);
AjPMatrixf  ajMatrixfNew (const AjPPStr codes, ajint n, const AjPStr filename);

These will create a new matrix with values initialised to zero. The functions take an array of strings (codes) containing the matrix labels, an integer (n) that is the number of labels, and the matrix name (filename). If the matrix is a residue substitution matrix then each label should be a defined sequence character.

The matrices that are created by ajMatrixNew and ajMatrixfNew are square, having the same number of rows and columns. To create a matrix with an unequal number of rows and columns call:

AjPMatrix  ajMatrixNewAsym (const AjPPStr codes, ajint n,  const AjPPStr rcodes, ajint rn, const AjPStr filename);
AjPMatrixf  ajMatrixfNewAsym (const AjPPStr codes, ajint n, const AjPPStr rcodes, ajint rn, const AjPStr filename);

These will create a new matrix with values initialised to zero. The functions take two arrays of strings (codes and rcodes) containing the matrix column and row labels respectively, two integers (n and rn) that are the number of column and row labels, and the matrix name (filename).
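As an illustration of a matrix with separate row and column counts and zero-initialised values, a plain C sketch is given below. It is not the AJAX implementation: SketchMatrix, sketch_matrix_new_asym and sketch_matrix_free are hypothetical names, labels are omitted, and storing the table as values[row][col] is an assumption made for this example.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for an asymmetric matrix. */
typedef struct SketchMatrix {
    int ncols;     /* number of column labels */
    int nrows;     /* number of row labels */
    int **values;  /* values[row][col], an assumption of this sketch */
} SketchMatrix;

/* Allocate an nrows x ncols table with every value set to zero.
   Returns NULL if any allocation fails. */
SketchMatrix *sketch_matrix_new_asym(int ncols, int nrows)
{
    SketchMatrix *m;
    int r;

    assert(ncols > 0 && nrows > 0);

    m = malloc(sizeof *m);
    if (m == NULL)
        return NULL;

    m->ncols = ncols;
    m->nrows = nrows;
    m->values = calloc((size_t) nrows, sizeof *m->values);
    if (m->values == NULL) {
        free(m);
        return NULL;
    }

    for (r = 0; r < nrows; r++) {
        m->values[r] = calloc((size_t) ncols, sizeof **m->values);
        if (m->values[r] == NULL) {
            /* Undo the rows allocated so far. */
            while (r-- > 0)
                free(m->values[r]);
            free(m->values);
            free(m);
            return NULL;
        }
    }

    return m;
}

/* Free every row, then the row index, then the object itself. */
void sketch_matrix_free(SketchMatrix *m)
{
    int r;

    if (m == NULL)
        return;

    for (r = 0; r < m->nrows; r++)
        free(m->values[r]);
    free(m->values);
    free(m);
}
```

A square matrix is the special case where both counts are equal.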

EMBOSS requires all matrix objects to be loaded from data files. No functions are provided to add or change the matrix object values. The following functions read from an EMBOSS data file. For more information on EMBOSS data files, see the EMBOSS Users Guide.

A matrix can be constructed from a given local data file by calling:

AjPMatrix  ajMatrixNewFile (const AjPStr filename);
AjPMatrixf  ajMatrixfNewFile (const AjPStr filename);

The functions take the name of the data file to open.

6.10.8. Functions for Retrieving the Properties of a Matrix

Most elements of a matrix object can be retrieved by calling one of the ajMatrixGet* or ajMatrixfGet* functions.

To return the comparison matrix as an array of integer or floating point arrays call:

AjIntArray*  ajMatrixGetMatrix (const AjPMatrix thys);                                  
AjFloatArray*  ajMatrixfGetMatrix (const AjPMatrixf thys);

Sequence characters are indexed in this array using the internal sequence conversion table in the matrix. AjIntArray and AjFloatArray are defined in ajdefine.h as arrays of C-type integer and floating point numbers:

typedef float*  AjFloatArray;
typedef int*  AjIntArray;
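Because each array type is a plain C pointer, a pointer to such arrays can be indexed with two subscripts. The sketch below shows this in plain C without the AJAX library. SketchIntArray mirrors the typedef above under a hypothetical name, and sketch_is_symmetric is an invented helper that checks whether a square table equals its transpose.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical equivalent of the integer array typedef. */
typedef int *SketchIntArray;

/* Return 1 if the size x size table equals its transpose, else 0.
   table[i] is one array of integers and table[i][j] is one value. */
int sketch_is_symmetric(SketchIntArray *table, size_t size)
{
    size_t i;
    size_t j;

    assert(table != NULL);

    for (i = 0; i < size; i++) {
        for (j = i + 1; j < size; j++) {
            if (table[i][j] != table[j][i])
                return 0;
        }
    }

    return 1;
}
```

The same two-subscript indexing applies to a table of floating point arrays.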

To return the label (sequence character or the column name for an asymmetric matrix) for a matrix row or column in position i call:

AjPStr  ajMatrixGetLabelNum (const AjPMatrix thys, ajint i);
AjPStr  ajMatrixfGetLabelNum (const AjPMatrixf thys, ajint i);

To return the sequence character conversion table for a matrix call:

AjPSeqCvt  ajMatrixGetCvt (const AjPMatrix thys);                                       
AjPSeqCvt  ajMatrixfGetCvt (const AjPMatrixf thys);

This table converts any character defined in the matrix to a positive integer, and any other character is converted to zero.

To return the character codes for each offset in the matrix call:

AjPStr  ajMatrixGetCodes (const AjPMatrix thys);
AjPStr  ajMatrixfGetCodes (const AjPMatrixf thys);

To return the name of a matrix object (which typically is the filename from which it was read), call:

const AjPStr  ajMatrixGetName (const AjPMatrix thys);
const AjPStr  ajMatrixfGetName (const AjPMatrixf thys);

To return the comparison matrix size (or the number of columns for an asymmetric matrix) call:

ajuint  ajMatrixGetSize (const AjPMatrix thys);                                         
ajuint  ajMatrixfGetSize (const AjPMatrixf thys);

For an asymmetric matrix the number of rows can be returned by calling:

ajuint  ajMatrixGetRows (const AjPMatrix thys);                                         
ajuint  ajMatrixfGetRows (const AjPMatrixf thys);

6.10.9. Functions for Indexing a Matrix

To convert a sequence to index numbers using the matrix's internal conversion table call:

AjBool  ajMatrixSeqIndex (const AjPMatrix thys, const AjPSeq seq, AjPStr* numseq);
AjBool  ajMatrixfSeqIndex (const AjPMatrixf thys, const AjPSeq seq, AjPStr* numseq);

Sequence characters not defined in the matrix are converted to zero.
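The conversion can be sketched in plain C as follows. This is an illustration and not the AJAX implementation: sketch_cvt_build and sketch_seq_index are hypothetical names, the output is written to an int array rather than to an AjPStr, and the sketch assumes that the first matrix label is given index 1 so that 0 is left for undefined characters.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { SKETCH_TABLE_SIZE = 256 };

/* Build a conversion table from a string of label characters:
   codes[0] maps to 1, codes[1] maps to 2, and so on.
   Every other character maps to 0. */
void sketch_cvt_build(const char *codes, int table[SKETCH_TABLE_SIZE])
{
    size_t i;

    assert(codes != NULL && table != NULL);

    memset(table, 0, SKETCH_TABLE_SIZE * sizeof table[0]);
    for (i = 0; codes[i] != '\0'; i++)
        table[(unsigned char) codes[i]] = (int) i + 1;
}

/* Convert each character of seq to its index number. The caller
   must supply an output array with at least strlen(seq) elements. */
void sketch_seq_index(const int table[SKETCH_TABLE_SIZE],
                      const char *seq, int *out)
{
    size_t i;

    assert(table != NULL && seq != NULL && out != NULL);

    for (i = 0; seq[i] != '\0'; i++)
        out[i] = table[(unsigned char) seq[i]];
}
```

For a table built from the labels "ACGT", the sequence "GATX" converts to 3, 1, 4, 0, with the undefined character X becoming zero.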

6.10.10. Sequence Conversion

These functions handle sequence conversion objects. The basic constructor ajSeqcvtNewStr uses an array of strings as the column labels. For sequence comparison matrices these strings will be one character each. The other constructors renumber the codes to meet the specific expectations of some legacy code and are not recommended for general use. The most useful functions are those which return the numeric code for a base or residue, and are frequently used to look up a sequence character in a conversion table.

/* constructors with base codes as a string */
AjPSeqCvt  ajSeqcvtNewStr (const AjPPStr bases, ajint n);
AjPSeqCvt  ajSeqcvtNewC (const char* bases);
AjPSeqCvt  ajSeqcvtNewNumberC (const char* bases);
AjPSeqCvt  ajSeqcvtNewEndC (const char* bases);

/* asymmetric conversion table constructor */

AjPSeqCvt    ajSeqcvtNewStrAsym (const AjPPStr bases, ajint n, const AjPPStr rbases, ajint rn);

/* destructor */

void         ajSeqcvtDel (AjPSeqCvt* thys);

/* return conversion table length */

ajuint       ajSeqcvtGetLen (const AjPSeqCvt thys);

/* return numeric code for a residue code (matrix column label) */

ajint        ajSeqcvtGetCodeK (const AjPSeqCvt thys, char ch);
ajint        ajSeqcvtGetCodeS (const AjPSeqCvt thys, const AjPStr ch);

/* return numeric code for a column or row in an asymmetric matrix */

ajint        ajSeqcvtGetCodeAsymS (const AjPSeqCvt cvt, const AjPStr str);
ajint        ajSeqcvtGetCodeAsymrowS (const AjPSeqCvt cvt, const AjPStr str);