2.4. Modifying matcher

The EMBOSS application matcher generates pairwise local alignments given either DNA or protein sequences. It has different options depending on the type of input. When wrapping such programs, for instance for incorporation into a graphical user interface, it is often convenient to split them into separate nucleotide and protein applications with the same application code underneath. SRS is a good example of a system which wraps EMBOSS applications in this way.

To deepen your knowledge, use matcher as a template to produce new, separate applications for protein (matcherpro) and nucleotide (matchernuc) sequences.

2.4.1. Planning

You need to change how the matcher functionality is presented to the user, therefore you'll need two new ACD files and a copy of the matcher source code for each. Both new applications will effectively use the same source code with only trivial differences.

The steps to create these applications are as follows:

  1. Create the application source code (matcherpro.c and matchernuc.c files) in:

    .../embassy/myemboss/src

    Copy the source code for matcher from .../emboss/matcher.c.

  2. Create the application ACD files (matcherpro.acd and matchernuc.acd) in:

    .../embassy/myemboss/emboss_acd

    Use .../emboss/matcher.acd as a template.

  3. Add the applications to the myemboss package by updating the two Makefile.am files:

    .../embassy/myemboss/src/Makefile.am
    .../embassy/myemboss/emboss_acd/Makefile.am
  4. Compile and test the applications.

2.4.2. Editing the ACD Files

The original matcher.acd is shown below (help: attributes are not shown):

application: matcher 
[
  documentation: "Waterman-Eggert local alignment of two sequences"
  groups: "Alignment:Local"
]

section: input 
[
  information: "Input section"
  type: "page"
]

  sequence: asequence  
  [
    parameter: "Y"
    type: "any"
  ]

  sequence: bsequence  
  [
    parameter: "Y"
    type: "@($(acdprotein) ? stopprotein : nucleotide)"
  ]

  matrix: datafile  
  [
    additional: "Y"
    information: "Matrix file"
    protein: "$(acdprotein)"
  ]

endsection: input

section: additional 
[
  information: "Additional section"
  type: "page"
]

  integer: alternatives  
  [
    additional: "Y"
    information: "Number of alternative matches"
    default: "1"
    minimum: "1"
  ]

  integer: gapopen  
  [
    additional: "Y"
    information: "Gap penalty"
    default: "@($(acdprotein)? 14 : 16)"
    minimum: "0"
    valid: "Positive integer"
    expected: "14 for protein, 16 for nucleic"
  ]

  integer: gapextend  
  [
    additional: "Y"
    information: "Gap length penalty"
    default: "@($(acdprotein)? 4 : 4)"
    minimum: "0"
    valid: "Positive integer"
    expected: "4 for any sequence"
  ]

endsection: additional

section: output 
[
  information: "Output section"
  type: "page"
]

  align: outfile  
  [
    parameter: "Y"
    aformat: "markx0"
    minseqs: "2"
    maxseqs: "2"
  ]

endsection: output

The ACD file introduces several new concepts:

  • The groups: attribute in the application definition assigns the application to a group (see Section 4.2, “Application Definition”).

  • The sequence ACD datatype is used to define two input sequences; asequence and bsequence.

  • The sequence type of asequence is set by the type: attribute, in this case to "any", i.e. any type of sequence is acceptable.

  • For bsequence, the sequence type is calculated from the ACD variable acdprotein; if acdprotein is true then type: is set to stopprotein, otherwise it's set to nucleotide.

    acdprotein is an "automatic ACD variable" with a boolean type whose value is set automatically when the first sequence is read in. So, if the first sequence is a protein, then acdprotein will be true. Automatic ACD variables are described in detail elsewhere (Section 4.4, “Operations”).

  • The matrix datatype is used to define a substitution matrix (called matrixfile). EMBOSS will search for this data file in the EMBOSS data directory (see the EMBOSS Users Guide).

  • information: is used to set a user-prompt for some of the data items. It is not needed for the sequence inputs (EMBOSS will automatically ghenerate a suitable prompt) but can be given for the other types used (see Section 4.3, “Data Definition”).

  • Qualifiers and parameters in the ACD file are organised into sections (input, additional and output). These help to tidy the ACD file and are exploited by user interfaces (see Section 4.3, “Data Definition”).

  • Options in the additional section are defined to be 'additional qualifiers' by the attribute additional: "Y". Values for additional qualifiers are not prompted for (the default value will be used instead) unless -options is given on the command line, which will turn prompting on for these qualifiers (see Section 4.1, “Introduction to ACD File Development” for more information.)

  • Two gap penalties (gapopen and gapextend) are defined as integer ACD types. The minimum:, valid: and expected attributes are used to set minimum and expected values and a corresponding message to the user.

  • There is a single output, a sequence alignment (outfile) which is defined by the type align. The format (markx0) and minimum and maximum number of sequences (2 in both cases, i.e. a pairwise alignment) are set using the attributes aformat:, minseqs: and maxseqs respectively.

The changes necessary for matcherpro.acd are:

  • The application name should be changed to matcherpro.

  • The documentation: attribute should state that the application works on protein sequences only.

  • The type: attribute of the first input sequence should be changed from any to protein.

  • The type of the second input sequence should be stopprotein.

  • The residue substitution matrix should be of type protein. Currently this is given as protein: "$(acdprotein)" which means that the protein: attribute will be set to true if the first sequence is a protein. $(acdprotein) should be replaced with y.

  • All other occurrences of lines containing acdprotein should be replaced with the protein term; the first value in such lines containing ACD @ function calls.

The parts of matcherpro.acd which have been modified and differ from matcher.acd are shown below:

application: matcherpro [
  documentation: "Waterman-Eggert local alignment of two sequences"
  groups: "Alignment:Local"
]

... lines omitted  

  sequence: asequence  [
    parameter: "Y"
    type: "protein"
  ]

  sequence: bsequence  [
    parameter: "Y"
    type: "stopprotein"
  ]

... lines omitted  


  matrix: datafile  [
    additional: "Y"
    information: "Matrix file"
    protein: "Y"
  ]

... lines omitted  

  integer: gapopen  [
    additional: "Y"
    information: "Gap penalty"
    default: "14"
    minimum: "0"
    valid: "Positive integer"
    expected: "14"
  ]

  integer: gapextend  [
    additional: "Y"
    information: "Gap length penalty"
    default: "4"
    minimum: "0"
    valid: "Positive integer"
    expected: "4 for any sequence"
  ]

... lines omitted  

2.4.3. Editing the C Source File

The main() function for matcher is shown below. The application includes several functions and macros that are not shown:

#include "emboss.h"


/* @prog matcher **************************************************************
**
** Finds the best local alignments between two sequences
**
******************************************************************************/

int main(int argc, char **argv)
{
    AjPStr aa0str = 0;
    AjPStr aa1str = 0;
    const char *s1;
    const char *s2;
    ajint gdelval;
    ajint ggapval;
    ajuint i;
    ajint K;

    AjPAlign align = NULL;

    embInit("matcher", argc, argv);

    seq     = ajAcdGetSeq("asequence");
    ajSeqTrim(seq);
    seq2    = ajAcdGetSeq("bsequence");
    ajSeqTrim(seq2);
    matrix  = ajAcdGetMatrix("datafile");
    K       = ajAcdGetInt("alternatives");
    gdelval = ajAcdGetInt("gapopen");
    ggapval = ajAcdGetInt("gapextend");
    align   = ajAcdGetAlign("outfile");


    /*
      create sequence indices. i.e. A->0, B->1 ... Z->25 etc.
      This is done so that ajBasecodeToInt has only to be done once for
      each residue in the sequence
    */

    ajSeqFmtUpper(seq);
    ajSeqFmtUpper(seq2);

    s1 = ajStrGetPtr(ajSeqGetSeqS(seq));
    s2 = ajStrGetPtr(ajSeqGetSeqS(seq2));

    sub = ajMatrixGetMatrix(matrix);
    cvt = ajMatrixGetCvt(matrix);


    aa0str = ajStrNewRes(2+ajSeqGetLen(seq)); /* length + blank + trailing null */
    aa1str = ajStrNewRes(2+ajSeqGetLen(seq2));
    ajStrAppendK(&aa0str,' ');
    ajStrAppendK(&aa1str,' ');

    for(i=0;i<ajSeqGetLen(seq);i++)
        ajStrAppendK(&aa0str,(char)ajSeqcvtGetCodeK(cvt, *s1++));

    for(i=0;i<ajSeqGetLen(seq2);i++)
        ajStrAppendK(&aa1str,ajSeqcvtGetCodeK(cvt, *s2++));

    matcher_Sim(align, ajStrGetPtr(aa0str),ajStrGetPtr(aa1str),
                seq,seq2,
                K,(gdelval-ggapval),ggapval,
                ajSeqGetOffset(seq), ajSeqGetOffset(seq2), 2);

    ajStrDel(&aa0str);
    ajStrDel(&aa1str);

    ajSeqDel(&seq);
    ajSeqDel(&seq2);

    embExit();

    return 0;
}

matcher.c introduces many new concepts:

  • Two AJAX strings (aa0str and aa1str) are defined using AjPStr.

  • An alignment object (align) is created using AjPAlign.

  • The values from the ACD file are picked up by calls beginning with ajAcdGet. The functions are named to make the types returned obvious, ajAcdGetSeq returns sequences, ajAcdGetMatrix returns a matrix and so on.

  • The next block of code creates the sequences indices. You needn't concern yourself with the details. Functions for this include ajMatrixGetMatrix, ajMatrixGetCvt and ajSeqcvtGetCodeK (see the library documentation for an explanation). The function matcher_Sim (not shown) is also called and is a function local to matcher.

  • Strings and sequences that were allocated are freed by calling ajStrDel and ajSeqDel.

  • The application then exits cleanly by calling embExit.

One small edit is needed in the src/matcherpro.c file:

  • Change the embInit call to embInitP and add "myemboss" as the last argument so EMBOSS knows you are in the myemboss package

  • Change the program name in the embInitP call to "matcherpro" so that it uses the new ACD file.

  • Change the program name in the main function comment call to "matcherpro" to match the program name.

The parts of matcherpro.c which have been modified and differ from matcher.c are shown below:

/* @prog matcher **************************************************************;

embInitP("matcherpro", argc, argv, "myemboss");

2.4.4. Compilation

To compile and install the application from the myemboss directory, type:

make

Or, if you are using myemboss as part of a fully installed EMBOSS system then type:

make install

2.4.5. Testing All Is Well

matcherpro has just been added to your system, therefore you may need to make it visible in your path, for example, by running the UNIX command rehash, if you use a csh shell.

To check all is well, you can run:

acdvalid matcherpro

and check that no errors or warnings are reported. As matcher.acd has already been validated it should be fine.

You should now have a new application matcherpro which should produce the same results as matcher with protein input, but will refuse to run with DNA sequences. This is shown below:

% matcherpro
Finds the best local alignments between two protein sequences
Input protein sequence: embl:L07770
Error: Sequence is not a protein

Error: Unable to read sequence 'embl:L07770'
Input protein sequence:
...

Here it is being used correctly:

% matcherpro
Finds the best local alignments between two protein sequences
Input protein sequence: swissprot:UBR5_RAT
Second protein sequence:
...

2.4.6. Further Developments

From the above it should be obvious what's required to implement matchernuc.acd and matchernuc.c, so generating two distinct applications, one for nucleotide and one for protein sequences.

If you are familiar with EMBOSS, you might notice that the output format of matcherpro is not the same as water, another popular alignment program. This could be confusing to users, so as a final modification you can make matcherpro consistent with water. To do this, set the aformat: attribute of the outfile data definition, which currently has the value "markx0" to the value ("srspair") defined in the water.acd file.

A similar effect could be achieved by running matcherpro with -aformat srspair on the command line. If for some reason the old format was required, matcherpro could be run with -aformat markx0 on the command line.