10.6. HMMER Wrapper: hmmbuild

The basic function of hmmbuild is to read a multiple sequence alignment file, build a new profile HMM and save the HMM to file. It is called as follows:

ehmmbuild [options] alignfile hmmfile

By default the model is configured to find one or more non-overlapping alignments with the complete model: multiple global alignments with respect to the model and local with respect to the sequence. Various other alignment strategies can be set by using the appropriate option.

One limitation is that the user must provide the full filename of an alignment for the alignfile option and not an indirect reference to a set of sequences, so a USA (see the EMBOSS Users Guide) is not acceptable. This is because hmmbuild (which ehmmbuild wraps) requires an alignment and does not support USAs.

Differences between the wrapper and the original software are as follows:

10.6.1. HMMER Wrapper: ehmmbuild.acd

10.6.1.1. Application Definition and Inputs

The start of the ACD file is shown below. Text for the help: attribute is not shown but is given in the ACD files.

You can see that the alignfile option is handled by a seqset sequence input type. As mentioned before, this has to be an alignment file and not a USA referring indirectly to a set of sequences. This limitation could be overcome by first translating the USA into a local file, but this wasn't done for this version of the wrapper:

application: ehmmbuild 
[
# EMBOSS wrapper to hmmbuild from Sean Eddy's HMMER package 
# v.2.3.2  
    documentation: "Build a profile HMM from an alignment."  
    groups: "HMM"  
    gui: "yes"  
    batch: "yes"  
    cpu: "medium"  
    embassy: "hmmernew"
]

section: input 
[  
    information: "Input section"  
    type: "page"
]

seqset: alignfile
[
# User must provide the full filename of an alignment, not an
# indirect reference to a set of sequences, e.g. a USA is NOT 
# acceptable.    
    parameter: "Y"    
    type: "gapstopprotein"    
    aligned: "Y"
] 

...

There are then three infile ACD definitions to handle various HMMER input files. All of these are advanced ACD options not normally set by the user. Note that a NULL default value is set for them which requires the nullok: attribute to be specified and set to True.

infile: prior
[
#   Advanced input file    
    information: "Dirichlet prior file."    
    knowntype: "dirichlet prior"    
    default: ""    
    nullok: "Y"
]

infile: null 
[
#   Advanced input file    
    information: "NULL model file"    
    knowntype: "hmmer null model"    
    default: ""    
    nullok: "Y"
]

infile: pam  
[
#   Advanced input file    
    information: "PAM file"    
    knowntype: "hmmer matrix file"    
    default: ""    
    nullok: "Y"
]

float: pamwgt
[    
    default: "20.0"    
    information: "Weighting for PAM."
]

endsection: input

10.6.1.2. Required Section

The required section is shown below. As you can see it is quite sparse. It contains a string to specify the name of the HMM and a list which is used to set the alignment strategy. The list replaces 3 individual HMMER options. There is also a default setting so this list has 4 entries in total:

section: required
[  
    information: "Required section"  
    type: "page"
]

string: n
[    
    standard: "Y"    
    default: ""    
    information: "Name for this HMM."    
    word: "Y"    
    knowntype: "name"
]

list: strategy  
[    
    standard: "Y"    
    default: "D"    
    minimum: "1"    
    maximum: "1"    
    values: "D:global-multidomain,F:local-multidomain,G:global-singledomain,S:local-singledomain"    
    delimiter: ","    
    codedelimiter: ":"    
    header: "Alignment preference"    
    information: "Select preference"    
    button: "Y"
]

endsection: required

10.6.1.3. Advanced Section

The bulk of the HMMER options are defined as "expert" options in the original HMMER documentation and so are given in the advanced section of the ACD file. These options are not normally set by the user and a default value, taken from the HMMER documentation, is given:

section: advanced 
[  
    information: "Advanced section"  
    type: "page"
]

integer: pbswitch
[    
    default: "1000"    
    information: "Threshold to switch to position-based weights."
]

float: archpri
[    
    default: "0.85"    
    information: "Architecture prior"
]

boolean: binary
[    
    default: "N"    
    information: "Write HMM as binary."
]

boolean: fast
[    
    default: "N"    
    information: "Work in fast mode"
]

float: gapmax
[    
    default: "0.5"    
    information: "Fast mode control"
]

boolean: hand
[    
    default: "N"    
    information: "Specify model by hand."
]

float: sidlevel
[    
    default: "0.62"    
    information: "Cutoff ID threhold"
]

The sequence weighting algorithm is also specified as an advanced ACD qualifier. This one list replaces the 6 command line options given in the original HMMER:

boolean: noeff
[    
    default: "N"    
    information: "Turn off the effective sequence number calculation."
]

float: swentry
[    
    default: "0.5"    
    information: "Probability control for local entries"
]

float: swexit
[    
    default: "0.5"    
    information: "Probability control for exits"
]

boolean: verbosity
[    
    default: "N"    
    information: "Verbosity."
]

list: weighting  
[    
    default: "G"    
    minimum: "1"    
    maximum: "1"    
    values:  "B:Blosum, G:Gerstein/Sonnhammer/Chothia, K:Krogh/Mitchison, W:Henikoff, V:Sibbald/Argos Voronoi, N:None"    
    delimiter: ","    
    codedelimiter: ":"    
    header: "Weighting method"    
    information: "Select weighting"    
    button: "Y"  
]

endsection: advanced 

10.6.1.4. Output Section

The output section is shown here. This contains the new parameter defined for the HMM output file, which was written directly to stdout, and two other output files used by HMMER.

section: output:
[  
    information: "Output section"  
    type: "page"
]

outfile: hmmfile  
[    
    parameter: "Y"    
    knowntype: "hmm file"    
    append: "Y"
]

outfile: o
[    
    nullok: "Yes"    
    nulldefault: "Yes"    
    information: "Resave starting alignment."    
    knowntype: "selex file"
]

outfile: cfile  
[    
    nullok: "Yes"    
    nulldefault: "Yes"    
    information: "Emission and transition count file"    
    knowntype: "hmmer count file"
]

endsection: output 

10.6.2. HMMER Wrapper: ehmmbuild.c

10.6.2.1. Header Documentation

The start of the file of C source code is shown below. This just shows the standard documentation that should be given for any EMBOSS application. There is also a line (#include emboss.h) to import the AJAX and NUCLEUS library interfaces:

/* @source ehmmbuild application
**
** EMBOSS wrapper to hmmbuild from Sean Eddy's HMMER package v.2.3.2
** Build a profle HMM from an alignment.
** 
** @author Copyright (C) Jon Ison 
** @@
**
** This program is free software; you can redistribute it and/or
** modify it under the terms of the GNU General Public License
** as published by the Free Software Foundation; either version 2
** of the License, or (at your option) any later version.
**
** This program is distributed in the hope that it will be useful,
** but WITHOUT ANY WARRANTY; without even the implied warranty of
** MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
** GNU General Public License for more details.
**
** You should have received a copy of the GNU General Public License
** along with this program; if not, write to the Free Software
** Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA  02111-1307, USA.
******************************************************************************/

#include "emboss.h"

10.6.2.2. main() Function

This shows the main() function and the variable declarations. All the variables for handling the ACD data items have the same name as the corresponding qualifier. This style is not enforced but it is recommended because it makes reading the source code much easier.

Housekeeping variables are given sensible names. All variables are initialised to NULL or zero. It is good practice to do that, in fact dangerous not to, because some parts of the EMBOSS libraries assume that non-NULL pointers have had memory assigned to them; if there is a junk value assigned to them at run time then you may be heading for a segmentation fault.

/* @prog ehmmbuild ***********************************************************
**
** EMBOSS wrapper to hmmbuild from Sean Eddy's HMMER package v.2.3.2
** Build a profle HMM from an alignment.
**
******************************************************************************/
int main(int argc, char **argv)
{
    /* ACD data item variables */
    AjPSeqset alignfile = NULL; 
    AjPFile       prior = NULL;
    AjPFile        null = NULL;
    AjPFile         pam = NULL;
    float        pamwgt = 0.0;
    AjPStr            n = NULL;
    AjPStr    *strategy = NULL;
    ajint      pbswitch = 0;
    float       archpri = 0.0;
    AjBool      binary  = ajFalse;
    AjBool         fast = ajFalse;
    float        gapmax = 0.0;
    AjBool         hand = ajFalse;
    float       idlevel = 0.0;
    AjBool        noeff = ajFalse;
    float       swentry = 0.0;
    float        swexit = 0.0;
    AjBool    verbosity = ajFalse;
    AjPStr   *weighting = NULL;
    AjPFile     hmmfile = NULL;
    AjPFile           o = NULL;
    AjPFile       cfile = NULL;

    /* Housekeeping variables */
    AjPStr          cmd = NULL;
    AjPStr         rnd1 = NULL;
    AjPStr         rnd2 = NULL;
    AjPStr          tmp = NULL;
    AjPStr          fmt = NULL;
    char            option;
    AjBool        fmtok = ajFalse;
    AjPStr  hmmfilename = NULL;

10.6.2.3. Processing the ACD File

The code below shows the function calls for processing the ACD file. embInitP processes the ACD file and prompts the user for any required values that are not specified on the command line. The prefix ajAcdGet family of functions are used to retrieve values from the ACD data definitions and store them in the variables defined earlier:

 /* ACD file processing */
    embInitP("ehmmbuild",argc,argv,"HMMERNEW");

    alignfile = ajAcdGetSeqset("alignfile");
    prior     = ajAcdGetInfile("prior");
    null      = ajAcdGetInfile("null");
    pam       = ajAcdGetInfile("pam");
    pamwgt    = ajAcdGetFloat("pamwgt");
    n         = ajAcdGetString("n");
    strategy  = ajAcdGetList("strategy");
    pbswitch  = ajAcdGetInt("pbswitch");
    archpri   = ajAcdGetFloat("archpri");
    binary    = ajAcdGetBoolean("binary");
    fast      = ajAcdGetBoolean("fast");
    gapmax    = ajAcdGetFloat("gapmax");
    hand      = ajAcdGetBoolean("hand");
    idlevel   = ajAcdGetFloat("sidlevel");
    noeff     = ajAcdGetBoolean("noeff");
    swentry   = ajAcdGetFloat("swentry");
    swexit    = ajAcdGetFloat("swexit");
    verbosity = ajAcdGetBoolean("verbosity");
    weighting = ajAcdGetList("weighting");
    hmmfile   = ajAcdGetOutfile("hmmfile");
    o         = ajAcdGetOutfile("o");
    cfile     = ajAcdGetOutfile("cfile"); 

10.6.2.4. Housekeeping and File Format Handling

The start of the application code proper is shown below. First of all there is some housekeeping code. Then there is a block of code to check that the sequence alignment input file is in a format that HMMER can understand. An exception is raised if an unsupported format is specified. This could be replaced in the future with code to reformat the alignment file into an appropriate format. At the time of writing, it was not fully tested whether all alignment formats, including SELEX and STOCKHOLM, could be interconverted without any loss of data or annotation, so the safe option was chosen:

 /* MAIN APPLICATION CODE */
    /* 1. Housekeeping */
    cmd  = ajStrNew();
    rnd1 = ajStrNew();
    rnd2 = ajStrNew();
    tmp  = ajStrNew();
    fmt  = ajStrNew();
    hmmfilename = ajStrNew();

    ajStrAssignC(&hmmfilename, ajFileGetNameC(hmmfile));


    /* 2. Ensure alignfile is in format HMMER can understand.  These include
       FASTA, GENBANK,EMBL, GCG, PIR, STOCKHOLM, SELEX, MSF,CLUSTAL and PHYLIP.
       EMBOSS name definitions are taken from seqInFormatDef in ajseqread.c and
       seqOutFormat in ajseqwrite.c */
    fmtok=ajFalse;
    ajStrAssignS(&fmt, ajSeqsetGetFormat(alignfile));
    if(ajStrMatchC(fmt, "fasta")    ||
       ajStrMatchC(fmt, "genbank")  ||
       ajStrMatchC(fmt, "embl")     ||
       ajStrMatchC(fmt, "gcg")      ||
       ajStrMatchC(fmt, "pir")      ||
       ajStrMatchC(fmt, "stockholm")||
       ajStrMatchC(fmt, "selex")    ||
       ajStrMatchC(fmt, "msf")      ||
       ajStrMatchC(fmt, "clustal")  ||
       ajStrMatchC(fmt, "phylip"))
         fmtok = ajTrue;
    /* This could be replaced with code to reformat the file. */
    if(!fmtok)
        ajFatal("Input alignment ('alignfile' ACD option) is not in a format "
        "HMMER understands. Please use a file in FASTA, GENBANK, "
        "EMBL, GCG, PIR, STOCKHOLM, SELEX, MSF,CLUSTAL or PHYLIP format."); 

10.6.2.5. Building the Command line

The first part of the code for building the command line is shown below. The command line is constructed in a specific order to make updating the wrapper for new releases easier. First the application name is pasted into a string, then the original HMMER options are given in the order they appear in the ACD file. Next the HMMER options that do not have any parallel in the ACD file are given. Finally, new parameters and options that are specific to the EMBASSY wrapper are given:

 /* 3. Build hmmbuild command line */
    /* Command line is built in this order: 
       i.  Application name.
       ii. HMMER 'options' (in order they appear in ACD file)
       iii.HMMER 'options' (that don't appear in ACD file)
       iv. HMMER and new parameters.
       */
    ajFmtPrintS(&cmd, "hmmbuild ");

    if(prior)
        ajFmtPrintAppS(&cmd, " --prior %s ", ajFileGetNameC(prior));

    if(null)
        ajFmtPrintS(&cmd, " --null %s ", ajFileGetNameC(null));

    if(pam)
        ajFmtPrintAppS(&cmd, " --pam %s  --pamwgt %f ",
                       ajFileGetNameC(pam), pamwgt);

    ajFmtPrintAppS(&cmd, " -n %S ", n);

    /* ACD option only allows one selection */
    option = ajStrGetCharFirst(strategy[0]);

    if(option == 'F')
        ajStrAppendC(&cmd, " -f ");
    else if(option == 'G')
        ajStrAppendC(&cmd, " -g ");
    else if(option == 'S')
        ajStrAppendC(&cmd, " -s ");

    /* else go with default ('D' option in ACD file) */
    ajFmtPrintAppS(&cmd, " --pbswitch %d ", pbswitch);
    ajFmtPrintAppS(&cmd, " --archpri %f ", archpri);

    if(binary)
        ajStrAppendC(&cmd, " --binary ");

    if(fast)
        ajFmtPrintAppS(&cmd, " --fast --gapmax %f ", gapmax);

    if(hand)
        ajStrAppendC(&cmd, " --hand ");

    ajFmtPrintAppS(&cmd, " --idlevel %f ", idlevel);

    if(noeff)
        ajStrAppendC(&cmd, " --noeff ");

    ajFmtPrintAppS(&cmd, " --swentry %f ", swentry);
    ajFmtPrintAppS(&cmd, " --swexit %f ", swexit);

    if(verbosity)
        ajStrAppendC(&cmd, " --verbose "); 

The rest of the code for building the command line is below. The only thing to point out is that the append option is always set. This means that output should always be appended to whatever is given in the specified output file. EMBOSS clears its output files by default though, so for this to work the append: attribute of the hmmfile ACD data item must be set to True:

  /* ACD option only allows one selection */
    option = ajStrGetCharFirst(weighting[0]);

    if(option == 'B')
         ajStrAppendC(&cmd, " --wblosum ");
    else if(option == 'G')
         ajStrAppendC(&cmd, " --wgsc ");
    else if(option == 'K')
         ajStrAppendC(&cmd, " --wme ");
    else if(option == 'W')
         ajStrAppendC(&cmd, " --wpb ");
    else if(option == 'V')
         ajStrAppendC(&cmd, " --wvoronoi ");
    else if(option == 'N')
         ajStrAppendC(&cmd, " --wnone ");

    if(o)
         ajFmtPrintAppS(&cmd, " -o %s ", ajFileGetNameC(o));

    if(cfile)
         ajFmtPrintAppS(&cmd, " --cfile %s ", ajFileGetNameC(cfile));

    /* -A (append) always set but file will be wiped by EMBOSS first unless 
    ** append: "Y" is set for "hmmfile" in the ACD file.
    */
    ajStrAppendC(&cmd, " -A -F ");
    ajFmtPrintAppS(&cmd, " %S %S", hmmfilename, ajSeqsetGetFilename(alignfile)); 

10.6.2.6. Invoking the Application

The code below shows the system call to invoke the hmmbuild application using the command line just constructed. Note that system() is used here but that should probably be replaced with a call to exec() for reasons explained earlier. There is also some housekeeping code for memory management to ensure that the application can close cleanly:

 /* 4. Close ACD files */
    ajSeqsetDel(&alignfile);
    ajFileClose(&prior);
    ajFileClose(&null);
    ajFileClose(&pam);
    ajFileClose(&hmmfile);
    ajFileClose(&o);
    ajFileClose(&cfile);


    /* 5. Call hmmbuild */
    ajFmtPrint("\n%S\n", cmd); 
    system(ajStrGetPtr(cmd));    


    /* 6. Exit cleanly */
    ajStrDel(&n);
    ajStrDel(&cmd);
    ajStrDel(&rnd1);
    ajStrDel(&rnd2);
    ajStrDel(&tmp);
    ajStrDel(&fmt);
    ajStrDel(&hmmfilename);
   
    embExit();

    return 0;
}