6.8. Handling Sequence Translation

6.8.1. Introduction

Translation of a nucleotide sequence into a protein sequence is a common task. AJAX provides all the basic functionality you would expect. The nucleic sequence can be in a variety of forms: an AJAX string (AjPStr), a C-type string (char *) or an AJAX sequence object (AjPSeq). It can be translated in all reading frames, and the reverse complement of a sequence can also be translated.

6.8.2. AJAX Library Files

The AJAX library file for handling sequence translation is listed in the table (Table 6.11, “AJAX Library Files for Handling Sequence Translation”). Library file documentation, including a complete description of datatypes and functions, is available at:

http://emboss.open-bio.org/rel/dev/libs/

Table 6.11. AJAX Library Files for Handling Sequence Translation

Library File Documentation    Description
ajtranslate                   Sequence translation

ajtranslate.h/c

Defines a sequence translation object (AjPTrn) and includes functions for handling sequence translation.

6.8.3. ACD Datatypes

There is no dedicated ACD datatype for handling translation. Such operations are performed on a nucleotide sequence and so require a sequence input of the appropriate type. A genetic code is also required, a choice of which is provided to the user (usually) via a menu implemented by a list ACD datatype. The ACD datatypes you'll require are therefore:

list

A list of options (text descriptions) with text labels. The user is presented with a limited list of options they can choose from. The choices can be labelled by any arbitrary text label.

sequence

A single input sequence.

For general information on menu and sequence handling see:

Handling of ACD menus (Section 6.19, “Handling Menus”)
Handling of sequences (Section 6.7, “Handling Sequences”)

6.8.4. ACD Data Definition

A typical ACD definition for single sequence input:

sequence: sequence  
[
    parameter: "Y"
    type:      "nucleotide"
]

The available genetic codes must be defined in the ACD file and the list datatype may be used for this. EMBOSS supports a standard set of genetic codes which are given as follows:

list: table
[
    additional: "Y"
    default: "0"
    minimum: "1"
    maximum: "1"
    header: "Genetic codes"
    values: "0:Standard; 1:Standard (with alternative initiation
             codons); 2:Vertebrate Mitochondrial; 3:Yeast Mitochondrial;
             4:Mold, Protozoan, Coelenterate Mitochondrial and
             Mycoplasma/Spiroplasma; 5:Invertebrate Mitochondrial; 6:Ciliate
             Macronuclear and Dasycladacean; 9:Echinoderm Mitochondrial;
             10:Euplotid Nuclear; 11:Bacterial; 12:Alternative Yeast Nuclear;
             13:Ascidian Mitochondrial; 14:Flatworm Mitochondrial;
             15:Blepharisma Macronuclear; 16:Chlorophycean Mitochondrial;
             21:Trematode Mitochondrial; 22:Scenedesmus obliquus;
             23:Thraustochytrium Mitochondrial"
    delimiter: ";"
    codedelimiter: ":"
    information: "Code to use"
]

The order of the codes is currently important; the list must be given in the exact order shown above.

6.8.5. AJAX Datatypes

For handling sequence translation, which requires sequence and menu input, use:

AjPTrn

Used for sequence translation.

AjPSeq

Single input sequence (for sequence ACD datatype).

AjPStr

Single selection from a menu (for list ACD datatype).

6.8.6. ACD File Handling

Datatypes and functions for handling translation via the ACD file are shown below (Table 6.12, “Datatypes and Function for Sequence Translation”). Here a single selection from the list is retrieved but other types of menu, input sequence or access methods could be used.

Table 6.12. Datatypes and Function for Sequence Translation

                        To read a sequence    To read a single selection from a list
ACD datatype            sequence              list
Object                  AjPSeq                AjPStr
To retrieve from ACD    ajAcdGetSeq           ajAcdGetListSingle

Your application code will call embInit to process the ACD file and command line (see Section 6.3, “Handling ACD Files”). All values from the ACD file are read into memory and files are opened as necessary. You have a handle on the files and memory through the ajAcdGet* family of functions which return pointers to appropriate objects.

6.8.6.1. Sequence or Menu Selection Retrieval

To retrieve the sequence or menu selection, object pointers are declared and then initialised using the appropriate ajAcdGet* function.

6.8.6.1.1. Input sequence

To retrieve an input sequence:

    AjPSeq seq=NULL;

    seq = ajAcdGetSeq("sequence");

6.8.6.1.2. Menu selection

The option selected from the list of genetic codes is required as an integer. ajAcdGetListSingle returns the selection as a string, the initial part of which is converted to an integer using ajStrToInt. This integer is passed to ajTrnNewI for the creation of the translation table object. This is why the list order in the ACD file is important! You must also declare a translation object pointer:

    AjPTrn trnTable = NULL; /* Translation object pointer */
    AjPStr gcode = NULL; /* Genetic code (selection from list) */
    ajint  n = 0;  /* Selection */

    gcode    = ajAcdGetListSingle("table");
    ajStrToInt(gcode,&n);

    trnTable = ajTrnNewI(n);
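
The conversion step above turns the leading number of the selected label into an integer. The following is a minimal standalone sketch of that idea in plain C, not AJAX code: parse_code_number is a hypothetical helper written for illustration, and it uses the C library's strtol rather than ajStrToInt. It also shows one way to detect a label that does not begin with a number, a check the fragment above omits.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper (not part of AJAX): parse the leading integer of a
 * menu label such as "11" or "11:Bacterial".
 * Returns 1 and stores the value in *n on success, 0 otherwise. */
int parse_code_number(const char *label, int *n)
{
    char *end = NULL;
    long value;

    assert(label != NULL && n != NULL);

    value = strtol(label, &end, 10);
    if (end == label)       /* no digits were consumed */
        return 0;

    *n = (int) value;
    return 1;
}
```

With this sketch, parse_code_number("11", &n) succeeds and sets n to 11, while parse_code_number("Standard", &n) returns 0 and leaves n untouched.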

6.8.6.2. Memory Management

It is your responsibility to free up memory at the end of the program. You must call the default destructor function for the translation, sequence and string objects used for the ACD data definitions:

/* Deletes a translation table object */
void  ajTrnDel(AjPTrn* pthis);    

/* Delete a string object. */
void  ajStrDel (AjPStr *Pstr);    

/* Delete a sequence object. */
void  ajSeqDel (AjPSeq* Pseq);

Function ajTrnExit is automatically called on exit to clean up internal memory used for housekeeping of translation processing:

void  ajTrnExit(void);   

6.8.7. Translation Object Memory Management

6.8.7.1. Default Object Construction

To use a translation object you must first instantiate the appropriate object pointer. Default construction functions are provided. They will read a translation data file from the EMBOSS data search directory (see the EMBOSS Users Guide) called EGC.n, where n is the number of the genetic code to use. This number can be provided explicitly to ajTrnNewI. Alternatively a file can be opened by filename by calling ajTrnNew or ajTrnNewC:

/* Reads EGC.trnFileNameInt where trnFileNameInt is supplied as a parameter. */
AjPTrn  ajTrnNewI (ajint trnFileNameInt);     

/* Reads trnFileName. */
AjPTrn  ajTrnNew (const AjPStr trnFileName);  

/* Reads trnFileName. */
AjPTrn  ajTrnNewC (const char *trnFileName);  

All constructors return the address of a new object. The pointers do not need to be initialised to NULL but it is good practice to do so:

    AjPStr gcode =NULL;
    AjPTrn trnTable = NULL;
    ajint n = 0;

    gcode = ajAcdGetListSingle("table");
    ajStrToInt(gcode,&n);

    trnTable = ajTrnNewI(n);
    /* The object is instantiated and ready for use */

Alternatively:

    AjPStr name     = NULL;
    AjPStr gcode    = NULL;
    AjPTrn trnTable = NULL;
    ajint n = 0;

    gcode = ajAcdGetListSingle("table");
    ajStrToInt(gcode,&n);

    name = ajStrNew();
    ajFmtPrintS(&name, "EGC.%d", n);   /* Create the string EGC.n */
    trnTable = ajTrnNew(name);
    /* The object is instantiated and ready for use */

6.8.7.2. Default Object Destruction

For the examples above you must free a single string and sequence:

    AjPSeq seq   = NULL;
    AjPStr gcode = NULL;
    ajint  n     = 0;

    seq   = ajAcdGetSeq("sequence");
    gcode = ajAcdGetListSingle("table");
    ajStrToInt(gcode,&n);

    /* Do something */

    ajSeqDel(&seq);
    ajStrDel(&gcode);

You must free the memory for the translation object before the pointer is re-used and also once you are finished with it. A default destructor function is provided:

/* Deletes a translation table object */
void  ajTrnDel(AjPTrn* pthis);    

It is used as follows:

    AjPStr gcode =NULL;
    AjPTrn trnTable = NULL;
    ajint n = 0;

    gcode = ajAcdGetListSingle("table");
    ajStrToInt(gcode,&n);

    trnTable = ajTrnNewI(n);
    /* The object is instantiated and ready for use */

    ajTrnDel(&trnTable);

    /* The memory is freed and the pointer reset to NULL, ready for re-use. */

6.8.8. Translation

ajTrnSeqSeqOrig creates a peptide sequence containing the full translation of a nucleotide sequence, including any trailing partial codon (1 or 2 bases), which translates to X unless the first 2 bases can only define one amino acid:

AjPSeq  ajTrnSeqSeqOrig (const AjPTrn trnObj, const AjPSeq seq, ajint frame);

A nucleotide sequence held in an AJAX string (AjPStr), C-type string (char *) or AJAX sequence object (AjPSeq) can be translated into protein using:

void  ajTrnSeqS (const AjPTrn trnObj, const AjPStr str, AjPStr *pep);
void  ajTrnSeqC (const AjPTrn trnObj, const char *str, ajint len, AjPStr *pep);    
void  ajTrnSeqSeq (const AjPTrn trnObj, const AjPSeq seq, AjPStr *pep);

These functions translate in frame 1 (from the first base) to the last full triplet codon. If there are 1 or 2 bases extra at the end then they are ignored.
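
To make the frame 1 behaviour concrete, here is a minimal standalone sketch in plain C. It does not use AJAX: the function names are hypothetical, the standard genetic code is hard-coded rather than read from an EGC.n file, and a fixed-size char buffer stands in for the AjPStr peptide. It appends to the peptide and ignores 1 or 2 trailing bases, as described above.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Standard genetic code; bases are indexed in the order T, C, A, G. */
static const char STANDARD_CODE[] =
    "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG";

/* Map a base to 0..3, or -1 for anything that is not T/U, C, A or G. */
int base_index(char base)
{
    switch (toupper((unsigned char) base)) {
    case 'T': case 'U': return 0;
    case 'C': return 1;
    case 'A': return 2;
    case 'G': return 3;
    default:  return -1;
    }
}

/* Translate one codon (3 bases) to a one-letter code; 'X' if unknown. */
char translate_codon(const char *codon)
{
    int i0 = base_index(codon[0]);
    int i1 = base_index(codon[1]);
    int i2 = base_index(codon[2]);

    if (i0 < 0 || i1 < 0 || i2 < 0)
        return 'X';
    return STANDARD_CODE[16 * i0 + 4 * i1 + i2];
}

/* Append the frame 1 translation of dna to the NUL-terminated buffer pep,
 * which holds at most cap bytes. Trailing 1 or 2 bases are ignored. */
void translate_frame1(const char *dna, char *pep, size_t cap)
{
    size_t len, used, i;

    assert(dna != NULL && pep != NULL);
    len  = strlen(dna);
    used = strlen(pep);
    assert(used < cap);

    for (i = 0; i + 3 <= len && used + 1 < cap; i += 3)
        pep[used++] = translate_codon(dna + i);
    pep[used] = '\0';
}
```

Starting from an empty buffer, translating "ATGGCCTAAGG" yields "MA*": the final GG is dropped. A second call with "TGG" on the same buffer appends W, giving "MA*W".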

To translate the reverse complement of a sequence call:

void  ajTrnSeqRevC (const AjPTrn trnObj, const char *str, ajint len, AjPStr *pep);
void  ajTrnSeqRevS (const AjPTrn trnObj, const AjPStr str,  AjPStr *pep);
void  ajTrnSeqRevSeq (const AjPTrn trnObj, const AjPSeq seq, AjPStr *pep);

These functions translate in frame -1 (from the last base) to the first full triplet codon. If there are 1 or 2 bases extra at the start then they are ignored. All functions will append the translation to the input peptide.

Alternative translation is available for people who define frame '-1' as being the frame starting from the first base of a reverse-complemented sequence. To translate the reverse complement of a sequence call:

void  ajTrnSeqAltRevC (const AjPTrn trnObj, const char *str, ajint len, AjPStr *pep);
void  ajTrnSeqAltRevS (const AjPTrn trnObj, const AjPStr str, AjPStr *pep);
void  ajTrnSeqAltRevSeq (const AjPTrn trnObj, const AjPSeq seq, AjPStr *pep);

These functions translate in frame -4 (from the last base) to the first full triplet codon, i.e. if there are 1 or 2 bases extra at the start then they are ignored. All functions will append the translation to the input peptide.

The frame of translation may be specified:

void    ajTrnSeqFrameC (const AjPTrn trnObj, const char *seq, ajint len, ajint frame, AjPStr *pep);
void    ajTrnSeqFrameS (const AjPTrn trnObj, const AjPStr seq, ajint frame, AjPStr *pep);
void    ajTrnSeqFrameSeq (const AjPTrn trnObj, const AjPSeq seq, ajint frame, AjPStr *pep);
AjPSeq  ajTrnSeqFramePep (const AjPTrn trnObj, const AjPSeq seq, ajint frame);

The first three functions append the translation to the input peptide. In contrast, ajTrnSeqFramePep returns an AjPSeq object with the new peptide.

These functions translate in the specified frame (which must be one of 1,2,3,-1,-2,-3,4,5,6,-4,-5,-6) to the last full triplet codon, i.e. if there are 1 or 2 bases extra at the end, they are ignored. Frames -6 to -1 give translations in the reverse sense, frames 1 to 3 give normal forward translations. Frames 4 to 6 reverse complement the DNA sequence then reverse the peptide sequence. Frames 4 to 6 are therefore reversed protein sequences useful mainly for displaying beneath the original DNA sequence.

Frame -1 is defined as the translation of the reverse complemented sequence which matches the codons used in frame 1. For example, in the sequence ACGT the first codon of frame 1 is ACG and the last codon of frame -1 is the reverse complement of ACG i.e. CGT.

Frame -4 is defined as the translation of the reverse complement, starting the translation in the first codon of the reversed sequence. In the sequence ACGT, the last codon is CGT and so frame -4 translates from the reverse complement of CGT (i.e. ACG) - this is for those people who define frame -1 as using the first codon when the sequence is reverse-complemented. This is also known as the 'alternative frame -1'.

Frame -5 starts on the penultimate base. (Alternative frame -2). Frame -6 starts on the ante-penultimate base. (Alternative frame -3). Frame 4 is the same as frame -1, 5 is -2, 6 is -3.
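
The difference between frame -1 and the alternative frames -4 to -6 can be expressed as an offset into the reverse-complemented sequence. The sketch below is plain C written for illustration only, with hypothetical function names. It follows the definitions in the three paragraphs above and is not taken from the AJAX source. Frames -2 and -3 are left out because the text does not define them explicitly.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Complement a single base; anything unrecognised becomes 'N'. */
char complement(char base)
{
    switch (toupper((unsigned char) base)) {
    case 'A': return 'T';
    case 'T': return 'A';
    case 'C': return 'G';
    case 'G': return 'C';
    default:  return 'N';
    }
}

/* Write the reverse complement of dna into out (strlen(dna) + 1 bytes). */
void reverse_complement(const char *dna, char *out)
{
    size_t len, i;

    assert(dna != NULL && out != NULL);
    len = strlen(dna);
    for (i = 0; i < len; i++)
        out[i] = complement(dna[len - 1 - i]);
    out[len] = '\0';
}

/* Offset into the reverse complement at which translation starts.
 * Frame -1 reuses the codon boundaries of frame 1, so it skips the
 * len % 3 bases that frame 1 leaves over. Frames -4, -5 and -6 start on
 * the last, penultimate and ante-penultimate base of the original.
 * Returns -1 for frames this sketch does not cover. */
int revcomp_start(int frame, size_t len)
{
    switch (frame) {
    case -1: return (int) (len % 3);
    case -4: return 0;
    case -5: return 1;
    case -6: return 2;
    default: return -1;
    }
}
```

For the four-base example ACGT, which is its own reverse complement, frame -1 starts at offset 1 (codon CGT) and frame -4 starts at offset 0 (codon ACG), matching the description above. When the sequence length is a multiple of three, frames -1 and -4 coincide.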

To complete a translation by attempting to translate the last one or two bases of a frame, call:

ajint  ajTrnSeqDangleC (const AjPTrn trnObj, const char *seq, ajint frame, AjPStr *pep);
ajint  ajTrnSeqDangleS (const AjPTrn trnObj, const AjPStr seq, ajint frame, AjPStr *pep);

In both cases, the translation is appended to the input peptide.
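
The rule stated earlier for trailing partial codons (X unless the first 2 bases can only define one amino acid) can be sketched in plain C as follows. This is an illustration under assumptions, not the AJAX implementation: translate_dangle is a hypothetical name, the standard genetic code is hard-coded, and it is assumed here that the dangle functions apply the same rule.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Standard genetic code; bases are indexed in the order T, C, A, G. */
static const char STANDARD_CODE[] =
    "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG";

/* Map a base to 0..3, or -1 for anything that is not T/U, C, A or G. */
int base_index(char base)
{
    switch (toupper((unsigned char) base)) {
    case 'T': case 'U': return 0;
    case 'C': return 1;
    case 'A': return 2;
    case 'G': return 3;
    default:  return -1;
    }
}

/* Translate n trailing bases (n is 1 or 2). Two bases give a definite
 * amino acid only if every possible third base encodes the same residue;
 * every other case gives 'X'. */
char translate_dangle(const char *bases, size_t n)
{
    int i0, i1, k;
    char first;

    assert(bases != NULL);
    if (n != 2)
        return 'X';

    i0 = base_index(bases[0]);
    i1 = base_index(bases[1]);
    if (i0 < 0 || i1 < 0)
        return 'X';

    first = STANDARD_CODE[16 * i0 + 4 * i1];
    for (k = 1; k < 4; k++)
        if (STANDARD_CODE[16 * i0 + 4 * i1 + k] != first)
            return 'X';
    return first;
}
```

Under the standard code, a trailing GC gives A because GCT, GCC, GCA and GCG all encode alanine, whereas a trailing AT gives X because ATG encodes methionine and the other three encode isoleucine.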

There are functions to translate a single codon, or its reverse complement, into a one-letter amino acid code (each returns a single char). Versions are available for a codon held in an AJAX string (AjPStr) or a C-type string (char *):

/* Translates a codon held in an AjPStr into a 1-letter code. */
char  ajTrnCodonS (const AjPTrn trnObj, const AjPStr codon);

/* Translates the reverse complement of a codon held in an AjPStr into a 1-letter code. */
char  ajTrnCodonRevS (const AjPTrn trnObj, const AjPStr codon);

/* Translates a codon held in a C-type string into a 1-letter code. */
char  ajTrnCodonC (const AjPTrn trnObj, const char *codon);

/* Translates the reverse complement of a codon held in a C-type string into a 1-letter code. */
char  ajTrnCodonRevC (const AjPTrn trnObj, const char *codon);

6.8.9. Miscellaneous Functions

There are a couple of functions for retrieving the elements of a translation object:

AjPStr  ajTrnGetTitle (const AjPTrn thys);       
AjPStr  ajTrnGetFilename (const AjPTrn thys);

To check whether the input codon is a start codon, a stop codon or something else, call:

ajint  ajTrnCodonstrTypeC (const AjPTrn trnObj, const char *codon, char *aa);

ajint ajTrnCodonstrTypeS (const AjPTrn trnObj, const AjPStr codon, char *aa);
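
The kind of classification these functions perform can be sketched in plain C. The example below is illustrative only: codon_type and its return values are hypothetical, it treats ATG as the only start codon and TAA, TAG and TGA as the stop codons (the standard genetic code), and it does not report the amino acid. The AJAX functions instead consult the loaded translation table, so other genetic codes may classify codons differently.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical return values, chosen for this sketch only. */
enum { CODON_STOP = -1, CODON_OTHER = 0, CODON_START = 1 };

/* Classify a 3-base codon under the standard genetic code. */
int codon_type(const char *codon)
{
    char up[4];
    int i;

    assert(codon != NULL && strlen(codon) >= 3);

    /* Work on an upper-case copy so that "atg" and "ATG" are equivalent. */
    for (i = 0; i < 3; i++)
        up[i] = (char) toupper((unsigned char) codon[i]);
    up[3] = '\0';

    if (strcmp(up, "ATG") == 0)
        return CODON_START;
    if (strcmp(up, "TAA") == 0 || strcmp(up, "TAG") == 0 ||
        strcmp(up, "TGA") == 0)
        return CODON_STOP;
    return CODON_OTHER;
}
```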

To return the genetic code description as a string for a given genetic code number (the n in the EGC.n translation table filename), call:

const AjPStr  ajTrnName(ajint trnFileNameInt);

To create a suitably named sequence object to hold a peptide translation call:

AjPSeq  ajTrnNewPep(const AjPSeq nucleicSeq, ajint frame);

To read a translation data file (used internally when translation is initialised):

void  ajTrnReadFile(AjPTrn trnObj, AjPFile trnFile);