5.3. Objects (C Data Structures)

To program effectively using the AJAX and NUCLEUS libraries you need to understand how the EMBOSS derived datatypes (objects) are defined and used. These include simple objects such as dynamic AJAX strings and arrays and more complex biological datatypes, such as sequences and alignments. To extend the functionality of the libraries with new datatypes and functions you'll need a deeper understanding of C pointers and memory management, and their particular implementation in EMBOSS. This section covers objects, pointers and memory management and provides a foundation for using and developing new EMBOSS datatypes and functions.

EMBOSS borrows the concept of objects from C++. An object can be thought of as a 'black box' with clearly defined inputs and outputs, but possibly more opaque internals with which the developer need not be concerned. An object stores its own (member) data and knows how to perform certain actions via member functions. From the perspective of the developer it doesn't matter what is going on inside so long as the interface, i.e. the inputs and outputs, remains the same. The use of objects allows the programmer to model their code on the problem more closely, breaking it down into small easily managed pieces.

In EMBOSS the objects are the C data structure definitions, and the elements of the structures are the member data. There are no member functions as such; however, all the functions that use an object are documented, along with its data elements, in the structured documentation in the C source file. The consistent structuring of code sections and their documentation enforces the naming and classification of all functions in sections for each datatype. This documentation (see Appendix D, Code Documentation Standards) is used online and is accessible via SRS (see Section 1.3, “Developer Documentation”). EMBOSS maintains this link between datatypes and the functions that act upon them so that it's easy to find the objects and functions you need.

5.3.1. Object Definition

Objects in EMBOSS are defined (Appendix C, C Coding Standards) and documented (Appendix D, Code Documentation Standards) in a standard way.

A typical definition, for the public AJAX string object, is shown below and includes the standard documentation:

/* @data AjPStr ***************************************************************
**
** Ajax string object.
**
** Holds a null terminated character string with additional data.
** The length is known and held internally.
** The reserved memory size is known and held internally.
** The reference count is known and held internally.
** New pointers can refer to the same string without needing
** to duplicate the character data.
**
** If a string has multiple references it cannot be changed. Any
** instance to be changed is first copied to a new string. This
** means that any function which can change the character data must
** pass a pointer to the string so that the string can be moved.
**
** A default null string is provided. New strings are by default
** implemented as pointers to this with increased reference counters.
**
** AjPStr is implemented as a pointer to a C data structure.
**
** @alias AjPPStr
** @alias AjSStr
** @alias AjOStr
** @iterator AjIStr
**
** @attr Res [ajuint] Reserved bytes (usable for expanding in place)
** @attr Len [ajuint] Length of current string, excluding NULL at end
** @attr Ptr [char*] The string, as a NULL-terminated C string.
** @attr Use [ajuint] Use count: 1 for single reference, more if several
**                   pointers share the same string.
**                   Must drop to 0 before deleting. Modifying means making
**                   a new string if not 1.
** @attr Padding [ajint] Padding to alignment boundary
** @@
******************************************************************************/

typedef struct AjSStr 
{
    ajuint  Res;
    ajuint  Len;
    char   *Ptr;
    ajuint  Use;
    ajint   Padding;
}   AjOStr;
#define AjPStr AjOStr*
typedef AjPStr* AjPPStr;

You can see that the declaration defines:

  • The object name (AjSStr)

  • A datatype for the string object proper (AjOStr)

  • A datatype for the string object pointer (AjPStr)

  • A datatype for a pointer to the string object pointer (AjPPStr)

The object pointer (AjPStr) is the datatype you'll commonly use, and for this reason an AjPStr is often referred to as a "string object" rather than the more cumbersome "string object pointer". Strictly speaking, an AjPStr is a pointer to a string object (AjOStr) held in memory.
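The same four names can be reproduced for a toy object. The following is a minimal sketch, not library code: the DemoSPoint object and the functions demoPointNew, demoPointGetX and demoPointDel are hypothetical names invented for illustration. It assumes the common pattern in which a constructor returns the object pointer and a destructor takes the address of that pointer so it can reset it.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical object for illustration only; it is not part of AJAX. */
typedef struct DemoSPoint            /* the object (structure) name      */
{
    int X;
    int Y;
}   DemoOPoint;                      /* the object proper                */
#define DemoPPoint DemoOPoint*       /* the object pointer               */
typedef DemoPPoint* DemoPPPoint;     /* pointer to the object pointer    */

/* Constructor: allocates an object and returns the object pointer. */
DemoPPoint demoPointNew(int x, int y)
{
    DemoPPoint ret = malloc(sizeof(DemoOPoint));

    if(!ret)
        return NULL;

    ret->X = x;
    ret->Y = y;

    return ret;
}

/* Element retrieval: read-only, so the plain object pointer is passed.
** Because DemoPPoint is a macro, "const DemoPPoint" expands to
** "const DemoOPoint*", i.e. a pointer to a constant object. */
int demoPointGetX(const DemoPPoint thys)
{
    return thys->X;
}

/* Destructor: takes the address of the object pointer so that the
** caller's pointer can be set to NULL after the memory is freed. */
void demoPointDel(DemoPPPoint Pthis)
{
    assert(Pthis != NULL);

    free(*Pthis);
    *Pthis = NULL;
}
```

A caller would hold a DemoPPoint, create it with demoPointNew(3, 4), read it with demoPointGetX(p) and release it with demoPointDel(&p), after which p is NULL.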

The datatype naming conventions are intended to make the names intuitive:

  • Aj indicates that the object belongs to the AJAX library

  • P indicates that the datatype is a pointer

  • Str gives a clue to the name of the AJAX library file in which the object is defined, in this case the string-handling library header ajstr.h

Many other basic datatypes are available. For example, the basic file object AjPFile is defined in the AJAX file ajfile.h, whereas the input (AjPSeq) and output (AjPSeqout) sequence objects are in ajseqdata.h and ajseqwrite.h respectively. Different naming conventions apply for public NUCLEUS objects and for private objects, including those listed in the application code (see Appendix C, C Coding Standards).

5.3.2. Object Functions

You should never access the elements of an object directly; that is what the library functions are for. Objects should always be accessed by calling the appropriate AJAX or NUCLEUS functions. These are fully described by structured comments in the source files in a similar way to the objects themselves (see Appendix D, Code Documentation Standards). Functions are organised by the datatype they act upon and, for easier navigation, into sections of related functionality. This documentation is available online and via SRS (see Section 1.3, “Developer Documentation”).

The function sections are intended to help you quickly find the functions you need. Functions in the same section tend to have similar names and return types, and a similar number, order and type of parameters. Because functions in the same section are all used in a similar way, programming with the libraries is reasonably intuitive.

Most of the sections are unique to a given library file; however, there are some common sections. For example, many of the library files have an "element retrieval" section for functions which return a data element of an object and an "element set" section for functions that set the value of a data element directly. Most of the complex biological datatypes have "input" and "output" sections for reading or writing the data to file in a formatted way. This includes input (AjPSeq) and output (AjPSeqout) sequence objects, application reports (AjPReport) and sequence alignments (AjPAlign). The common sections are described in more detail in Appendix D, Code Documentation Standards.

The main thing to be aware of when using objects with functions is that object pointers (for example an AjPStr) are always used: for reasons of efficiency, a data structure proper is never passed to or returned from a function. Furthermore, for consistency, all functions in EMBOSS should obey the following rules:

  • If a function changes the pointer (so that it points to a new object) or changes the data pointed to in any way then the address of the object pointer is passed.

  • If the function merely reads the data pointed to and does not change the pointer itself then the plain object pointer is passed.
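The need for the address follows from C passing arguments by value: a function that receives a plain pointer works on a copy of it, so any new allocation is invisible to the caller. The sketch below illustrates this with a hypothetical ToySCount object; none of these names exist in AJAX or NUCLEUS, and toyCountIncrBroken is a deliberately wrong variant included only for contrast.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical counter object for illustration only. */
typedef struct ToySCount
{
    int N;
}   ToyOCount;
#define ToyPCount ToyOCount*

/* Reads the data only: the plain object pointer is passed. */
int toyCountGet(const ToyPCount thys)
{
    return thys->N;
}

/* Changes the data, and may allocate a new object if the caller's
** pointer is NULL: the address of the object pointer is passed. */
void toyCountIncr(ToyPCount* Pthis)
{
    assert(Pthis != NULL);

    if(!*Pthis)
    {
        *Pthis = calloc(1, sizeof(ToyOCount));
        if(!*Pthis)
            return;
    }

    (*Pthis)->N++;
}

/* Deliberately wrong: 'thys' is a copy of the caller's pointer, so an
** object allocated here can never reach the caller. It is freed
** before returning only to avoid leaking it. */
void toyCountIncrBroken(ToyPCount thys)
{
    if(!thys)
    {
        thys = calloc(1, sizeof(ToyOCount));
        if(!thys)
            return;
        thys->N++;
        free(thys);
    }
    else
        thys->N++;
}

/* Destructor: frees the object and resets the caller's pointer. */
void toyCountDel(ToyPCount* Pthis)
{
    assert(Pthis != NULL);

    free(*Pthis);
    *Pthis = NULL;
}
```

Starting from a NULL ToyPCount, a call to toyCountIncrBroken(c) leaves c as NULL, whereas toyCountIncr(&c) leaves c pointing to a new object whose count is 1.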

Consider, for example, two functions in the AJAX string library ajstr.c: ajStrMatchS and ajStrAssignS. ajStrMatchS compares two strings and returns ajTrue if they are the same whereas ajStrAssignS copies one string value into another.

You can deduce that ajStrMatchS merely reads two string values and therefore object pointers will be passed. In contrast, ajStrAssignS must change the value of the destination string. It may have to allocate a new string (as a failsafe in case a NULL pointer is passed to it) or reallocate a bigger string where a destination string is passed but is too small to hold the new value. In either case the string value will be changed and possibly the pointer too, therefore the address of the object pointer for the destination string is required.

The prototypes show that this is indeed the case:

AjBool ajStrMatchS(const AjPStr str, const AjPStr str2);
AjBool ajStrAssignS(AjPStr* Pstr, const AjPStr str);

The two strings (str and str2) passed to ajStrMatchS are only read from, therefore the parameters are object pointers (AjPStr). The source string (str) of ajStrAssignS is also read-only, whereas the destination string (Pstr) is modified and therefore the address must be passed (AjPStr* Pstr).
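The behaviour described for the two prototypes can be imitated with a cut-down string object. This is a sketch under stated assumptions, not the AJAX implementation: ToySStr and the toyStr functions are hypothetical, the object has no use count, and only the character buffer is reallocated, so the object pointer changes only when a NULL destination is passed.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical cut-down string object; the real AjOStr differs. */
typedef struct ToySStr
{
    size_t Res;    /* reserved bytes                       */
    size_t Len;    /* current length, excluding terminator */
    char  *Ptr;    /* null-terminated character data       */
}   ToyOStr;
#define ToyPStr ToyOStr*

/* Read-only comparison: plain object pointers are passed. */
int toyStrMatchS(const ToyPStr str, const ToyPStr str2)
{
    if(!str || !str2 || !str->Ptr || !str2->Ptr)
        return 0;

    return str->Len == str2->Len &&
           memcmp(str->Ptr, str2->Ptr, str->Len) == 0;
}

/* Assignment from C text: the destination may be created or grown,
** so the address of the object pointer is passed. Returns 1 on
** success and 0 if memory could not be allocated. */
int toyStrAssignC(ToyPStr* Pstr, const char* txt)
{
    size_t need;

    assert(Pstr != NULL && txt != NULL);
    need = strlen(txt) + 1;

    if(!*Pstr)                       /* failsafe for a NULL destination */
    {
        *Pstr = calloc(1, sizeof(ToyOStr));
        if(!*Pstr)
            return 0;
    }

    if((*Pstr)->Res < need)          /* destination too small: grow it  */
    {
        char* tmp = realloc((*Pstr)->Ptr, need);
        if(!tmp)
            return 0;
        (*Pstr)->Ptr = tmp;
        (*Pstr)->Res = need;
    }

    memmove((*Pstr)->Ptr, txt, need); /* memmove tolerates self-assignment */
    (*Pstr)->Len = need - 1;

    return 1;
}

/* Assignment from another toy string: the source is read-only. */
int toyStrAssignS(ToyPStr* Pstr, const ToyPStr str)
{
    assert(str != NULL && str->Ptr != NULL);

    return toyStrAssignC(Pstr, str->Ptr);
}

/* Destructor: frees the buffer and object, resets the pointer. */
void toyStrDel(ToyPStr* Pstr)
{
    assert(Pstr != NULL);

    if(*Pstr)
        free((*Pstr)->Ptr);
    free(*Pstr);
    *Pstr = NULL;
}
```

With two ToyPStr variables initialised to NULL, toyStrAssignC(&a, "ACGT") followed by toyStrAssignS(&b, a) creates both objects, and toyStrMatchS(a, b) then reports that they hold the same text.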

Tip

When you are using the existing library functions you needn't worry about function internals. The documentation describes whether an object pointer or the address of it is required. So long as you pass to functions what is shown in their prototype you will be fine. Furthermore, the EMBOSS application code is a rich source of examples of how the functions are used in practice.