Software development and maintenance under EMBOSS is made easy. EMBOSS has powerful inbuilt functionality which any native application can make use of with little or no additional coding, saving you a great deal of effort. It includes extensive C programming libraries for extending the core functionality and developing new applications. Well defined processes are in place for key aspects such as quality assurance testing, installation, maintenance and support. General aspects are handled by the EMBOSS developers, leaving you to support the parts specific to your own software.
Major inbuilt areas of functionality include:
Support for biological datatypes
Support for common file formats
Simple database configuration
Command line handling
Command line qualifiers
Sequence and sequence feature specification
EMBOSS supports a variety of standard and biological datatypes (Section 4.3, “Data Definition”). Datatypes for application inputs and outputs are defined in the AJAX Command Definition (ACD) file, which describes in detail the application parameters and command line interface. Datatype attributes allow an input or output to be specified in greater detail, for example by setting a default value or by defining the permitted type of an input sequence. Calculated attributes, such as the length of an input sequence, are available once the data has been read during ACD file processing but before the application proper starts, allowing for further fine control in the ACD file.
EMBOSS supports a wide range of file formats for input and output and is moving towards using standard report formats for application output. Biological sequences can be read and written using all common formats. EMBOSS detects the format of an input sequence automatically, but this can be specified for efficiency. Many other formats for sequence features, alignments, data files and so on are handled automatically. Any application you write will support these formats automatically too. Furthermore, when new input and output data formats are added to EMBOSS, your applications will automatically be able to use them; no application code needs to change. For information on supported file formats, see the EMBOSS Users Guide.
Database configuration is simple and flexible. A variety of databases (which can include remote data servers) and diverse access methods are supported. New databases and access methods are easily added. For more information see the EMBOSS Administrators Guide.
Your application will use the EMBOSS command line, which is consistent across all applications. The ACD file defines the datatype and permissible values for all application options. The ACD file and all input at the command line are validated at startup, before the application proper starts. Command line behaviour, for example the production of a sensible prompt and reprompting for values that are out of range, is handled automatically. For more information, see the EMBOSS Users Guide.
EMBOSS includes various command line qualifiers which are not (typically) set in an ACD file. These include global qualifiers and various datatype-specific qualifiers. Global qualifiers are available to all applications and control general application behaviour. Datatype-specific qualifiers are used to specify particular inputs and outputs, for instance sequences, in more detail. For more information see the EMBOSS Users Guide.
Sequences are referenced on the command line by their Uniform Sequence Address (USA) which is a powerful and flexible specification for defining the location and format of sequence data. Similarly, sequence features are defined using a Uniform Feature Object (UFO). Any application you write will automatically support USAs and UFOs. For more information see the EMBOSS Users Guide.
Before you start coding, please discuss your ideas with the EMBOSS developers:
Also let other EMBOSS users know. Many people may wish to collaborate with you or suggest easier ways of doing things:
There are a few basic things to consider before you start coding, especially if you intend your code to be incorporated into the public package:
Application ACD file. Any new application will need an ACD file. They are easy to write once you are familiar with the general layout, syntax (Appendix A, ACD Syntax Reference) and programming methods (Chapter 5, C Programming).
Linking to the EMBOSS Libraries. Application code must be linked to the EMBOSS libraries. The minimum requirement is code for processing the application ACD file (see Section 6.3, “Handling ACD Files”). In some cases you will want to leave the original source code untouched and develop an EMBOSS wrapper application to it instead (Chapter 10, Incorporating Third-party Applications). If you are not clear on what to do then the EMBOSS developers will assist you.
Choice of programming language. Ideally code should be in ANSI C or C++; however, contributions in other languages are welcome and will be considered.
Coding standards. Code should adhere to basic conventions and standards (Appendix C, C Coding Standards). This is not strictly required but is encouraged. Certainly, do not be put off submitting code because you don't think it's up to scratch. It can be improved later if necessary.
Licensing. Ideally code will be licensed under a GPL licence; however, code with non-GPL licences can be incorporated under EMBASSY (see Section 1.1, “Licence Information”).
Testing. All submitted code should be well tested and bug-free. If you've developed new applications then ideally you will provide formal quality assurance tests for them. For further information see Chapter 7, Quality Assurance.
Documentation. All applications and the underlying code should be well documented. The minimum requirement is that others are able to follow what you have done and use the software. Ideally the documentation will conform to basic standards defined for the code (Appendix D, Code Documentation Standards) and the applications (Section 8.1, “Application Documentation Standards”). Don't be put off if you think your code is not up to scratch. The EMBOSS developers will advise you on how it can be improved.
For more information, see Section 3.1, “EMBOSS Programming”.
There's more to software than just coding, especially when developing for EMBOSS which is widely distributed and deployed in production environments. All code that is to be integrated into the package must be tested, documented and managed throughout its lifetime. Regardless of whether you intend to develop the libraries or new applications, many of the steps are the same:
ACD file development (applications only)
C source code development
Integration and compilation
Quality assurance testing
Maintenance, support and training
It's worth familiarising yourself with these steps, described below, before starting a project. This should help you plan your software effectively, develop it efficiently and deliver your projects on time.
All software begins with an idea and ends, ultimately, with the software being retired. An appreciation of the steps involved will help the developer avoid pitfalls and problems along the way, or at least be prepared for them, saving much time and effort. Yet many projects are started without this foresight, especially by programmers who are eager to get on coding. This section gives a basic overview of the software life cycle for new EMBOSS programmers.
There is a single major release of EMBOSS on St Swithun's day (15th July) each year. The version number usually, but not always, increases by 1 each year. Other versions are released in the interim as required. The EMBASSY packages do not have a fixed release date or formal version numbering scheme. To avoid confusion, version numbers for EMBASSY packages which wrap third party software are set to the same as the version number of the software that has been wrapped. Individual EMBOSS or EMBASSY applications do not have their own version stage or number. New programs are announced on the EMBOSS mailing lists to solicit feedback from users.
You should decide upon a general model for the development and release of your software. No model has been universally adopted; however, all projects begin with a pre-development stage, before coding starts, which is followed by one or more versions of the software reflecting the maturity of the code. The following stages are typical:
In practice, very often there is no formal software release corresponding to every stage. For example, the users might only see a beta version and the final release. Nonetheless it is helpful to bear these stages in mind as it helps when managing the expectations of your users. It should go without saying that, for instance, it would be foolish to distribute widely any code that had not been well tested in a beta version. In practice, early releases often contain many bugs and for this reason people are usually wary of software until it has matured over a period of months or even years.
Pre-development. The pre-development stage occurs before coding starts. Requirements are specified and expected work throughout the software lifetime is planned.
Pre-Alpha. Pre-Alpha software is still under active design and development. Usually it is a prototype for the version that might eventually be released. From the user's perspective a pre-alpha release is useful to get the gist of what's to come; from the developer's perspective it provides a means to get early feedback from users in order to inform the design and development.
Alpha. Alpha software is usually reasonably well developed but not ready for general release. There will be very limited, if any, deployment. Such deployment is often restricted to the primary user group or some other select group of users who are known to provide good feedback. Typically, most major features will be implemented and most major bugs removed, but there will be many missing minor features, bugs and other issues. The alpha release might be a complete rewrite or an evolution of the pre-alpha version.
Beta. Beta software is the first significant release. Its main purpose is to get the software tested by as large a group of users as possible and to solicit feedback. Beta software might not be perfect and might not work under all circumstances, but it should be relatively bug-free. Typically, the majority of known major bugs, if not all, should be fixed.
Release candidate. The release candidate is a refinement of the beta version in which all known bugs are fixed. It may have several new features if these were heavily requested during beta testing, but these should not be a major departure from its basic function. Such major requests should (ideally) have been caught earlier on. Major features are now "frozen", and only bug fixes or minor new features are implemented from this point on. The release candidate can be thought of as a half-way house between versions used for testing and that which is deployed. Release candidates should be subject to heavy testing by the developer and in real use-case scenarios.
Release. The release software is fully developed and tested and should be ready for use in anger by your users. It is the final version of the software arising from testing and refinement of the release candidate. The release will normally have a number, for example, Version 1.0.
The EMBOSS developers use a combination of the "Code and Fix" and "Synchronise and Stabilise" models (see below). This has proved to be the most pragmatic way to keep users happy, especially balancing the requirements for new functionality with that for stability and consistency in the package as a whole.
The general approach is to fix bugs as soon as they are reported and provide bug fix files (see the EMBOSS Users Guide) as required. New features or applications are developed in close collaboration with the users, typically as requested, and are developed and released in a scheduled manner. This is done in as timely a manner as possible allowing for all bugs to be fixed and major issues to be addressed. All bug fixes and requests for features that cannot immediately be implemented are logged on the SourceForge website (see Section 1.5, “Contributing Software to EMBOSS”). In this way valuable contributions are not lost even if they cannot be acted upon immediately.
Many software life cycle models have emerged and all tend to focus on the development process itself. They are summarised below. It is up to you to adopt one of these models (or something else) that is most appropriate to your needs and situation.
Code and fix model. The simplest model iterates a cycle of coding and bug-fixing until the software is fit for purpose:
Write source code
Compile and run program
Test and find bugs
Waterfall model. In this model development proceeds in a linear series of definite management or development steps. The steps might be followed by testing, which must be passed before proceeding to the next step. Typically, return to a previous step is only allowed if a new problem is identified in the existing step. Thus development proceeds incrementally and logically. The drawback however is that such models are rather inflexible to changing requirements.
Spiral model. The spiral model attempts to address the inflexibility of the waterfall model by iterative cycles of development. This introduces more chances for user involvement but the drawback is that the model is rather cumbersome and therefore difficult to stick to in practice. Each cycle has four phases:
Specification: user requirements, constraints and possible approaches are identified.
Evaluation: alternative approaches are evaluated, typically through development of prototype software.
Development: code is designed in detail, implemented and tested.
Planning: the software and plans for the next cycle are evaluated in light of user feedback.
Prototyping. The prototyping models try to ensure that software meets user requirements by introducing definite evaluation points in the development process. At these points the user reviews the software, typically a prototype version, via a demonstration or by hands-on testing. The software under review might be a small part of a larger system. As development progresses the developers tend naturally to focus on design and implementation issues, whilst the users remain focused on their requirements. The prototyping models attempt to avoid any divergence of views through regular appraisal.
Rapid application development. This is a general approach to software development and a move away from inflexible formal models. Typically a series of prototypes is reviewed by the user and developer to ensure user requirements are met and fine-tuned in the face of the continuing developments. Analysis, design and implementation at each stage are usually limited to a definite time. The drawbacks to this approach are that, in the absence of a formal statement of a design up front, it can be difficult to agree when a project is finished and the software can evolve in an undesirable or uncontrolled way.
Synchronise and stabilise model. This model coordinates multiple developers who are working in parallel and attempts to quickly produce software that fits most needs, and to improve it in subsequent versions. It relies on frequent depositions of code which are synchronised to ensure compatibility. Testing occurs at all stages and in parallel with development and all reported bugs are fixed immediately. Features tend to be implemented incrementally, starting with those of highest priority to the user. The software is stabilised at key stages in its lifetime through extensive user testing and bug fixing.
Iterative and incremental models. Such models rely on a set of specific use-cases and a detailed software architecture. Each use-case describes what an individual or group of users wants from the system, and together they describe the complete functionality. The software architecture describes all significant components of the software and environments in which it will run. The use-cases drive the development whereas testing ensures that the use-cases are implemented correctly. Coding is usually divided to cover groups of related use-cases which are tackled iteratively. Each iteration results in an increment of the software, i.e. one with more use-cases catered for. In this model a design may well evolve as required, especially in the early stages, with later iterations tending to add to, rather than modify, what has already been done.
The EMBOSS mailing lists (Section 1.4, “Project Mailing Lists”) allow discussions between the user community and developers. As a developer you should subscribe to the lists. Any new software should be discussed on the mailing lists before it's implemented, especially to establish whether it is really needed or already exists.
Software must fulfil the requirements of the intended users and use cases. It is therefore valuable, essential even, that your users are involved in the development of the software throughout its lifespan, especially these four stages:
Pre-development
Development
Testing and validation
Evolution
There are additional requirements (e.g. software testing and surveying of user requirements) and strategy (e.g. how best to implement for evolving requirements) which are beyond the scope of this manual, but the essential message is simple:
Stay in touch with your users, even if you're the only user, at all stages in the software lifespan.
Pre-development. It's essential to get as complete a picture as possible of the user or use-case requirements at the pre-development stage, before coding starts. Try to pin your users down to exact, specific requirements. The devil is in the detail as any misinterpretations can lead to wasted developments. The requirements must be balanced by the constraints on the developer and it might be that some features are planned for the first release and others are left for a later date, but are considered in the design. It is usually far harder to modify existing code for new features than to code (or at least consider) them from the start, so any effort here will be amply rewarded.
Development. Demonstrate your software to its intended users at all stages of development, but especially early on. This allows for confirmation that functionality is developing as planned. As implementation proceeds and becomes fixed on a particular approach, there can be a tendency for the developer's and user's notion of the software to diverge. This is mitigated if good communication is maintained. Inevitably, during implementation, issues arise and ideas evolve. By keeping in touch you avoid unnecessary developments and can fine-tune the user requirements in light of the continuing developments.
Testing and validation. A most valuable contribution of users is in testing software. They will find ways to use (and abuse) the software that you did not anticipate. Crucially, this should validate that the software works as required in real-case scenarios and meets the user's needs. Often, new bugs and requirements will arise from testing, which must be fixed or implemented as the beta software is refined. Testing should result in a release candidate in which major features are frozen. From that point on only minor bug fixes and features should be implemented for the release proper. Be organised so that suggestions can be incorporated at the appropriate time: suggestions for major new features are not appreciated the day before the release date! Try to get and maintain a clear picture of your users' requirements and plan your work accordingly.
Evolution. Even software implemented to the very best design can only fulfil the requirements that were known at the time. In practice, your software must evolve to keep pace with your users, whose needs are often evolving. In some cases, when adapting software for changing requirements, it's necessary to move the software back into beta or even to start from scratch where a complete redesign is necessary. In such cases an appreciation of the current stage of the software helps to manage expectations.
Before you start coding you should think deeply about what exactly the code is supposed to achieve. Do this on at least two different occasions so that you are certain you've considered everything. Once you start coding it'll be much more difficult to incorporate new ideas. The more detailed the planning the better!
Except for trivial cases it is helpful to code to some sort of design. This could be a "disposable" first version of the code, intended only to test out ideas, that is discarded when coding proper begins. Otherwise use formal and informal methods such as flowcharts, logical steps in prose etc. Anything here that helps to form a crystal clear idea of the purpose and behaviour of the code will suffice. Give particular attention to inputs and outputs, major logical steps and any tricky steps. The process should result in a design that is fit to adopt for coding.
EMBOSS includes many applications and it's likely that one exists which does something similar to what you need. Before you start coding you should therefore check the application documentation:
There is a guide on navigating the documentation in the EMBOSS Users Guide.
There are three basic types of application development:
Modify an existing application
Develop a new application
Develop a wrapper application to existing third-party software
If it seems that only a simple modification to an existing application is required, this might easily be achieved by modifying the ACD file without any new C coding at all. See ACD file development (Chapter 5, C Programming) and ACD syntax (Appendix A, ACD Syntax Reference).
In other cases an entirely new application is required. The AJAX and NUCLEUS libraries (see Appendix B, Libraries Reference) provide a comprehensive toolkit for developing applications from scratch. The existing application ACD files and C source code might provide a valuable starting point or offer clues as to how to proceed.
There are also cases where you either already have the source code for an application that you want to incorporate into EMBOSS, or you otherwise want to provide an EMBOSS-style command line interface to an existing application. The wrapping and porting of applications under EMBOSS is described later (Chapter 10, Incorporating Third-party Applications).
When considering a new application there are two basic rules (with a few exceptions):
An EMBOSS application should perform a single clearly defined operation.
A new application should only be written if it differs by more than one extra major parameter from an existing program.
Three areas in planning deserve special attention:
Documentation
Inputs and outputs
Objects and functions
Write some preliminary documentation for the application and its code before coding. This might include, but is by no means limited to:
A single-line description of what the application does
A more detailed description of what the application does
A list of the input and output files
A list of the other application parameters
A single line description of each parameter
A default value for each parameter and whether a value is required from the user
Numbered comments for all the major logical steps to your program (goes into the C source code file)
Comments for tricky steps in your program
Pseudo-code or other structured comments for each major step to capture the program logic, loop structures and so on.
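The preliminary comments listed above can be drafted directly as a code skeleton and filled in later. The following is a minimal sketch in plain C, not EMBOSS library code: the helper `demo_gc_percent` is a hypothetical name invented for illustration and is not part of AJAX or NUCLEUS. The numbered comments mark the major logical steps and a separate comment flags the one tricky step.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper (not an EMBOSS function): percentage of G and C
 * residues in a nucleotide string. Returns 0.0 for NULL or for a string
 * that contains no alphabetic characters. */
double demo_gc_percent(const char *seq)
{
    size_t gc = 0;      /* count of G or C residues */
    size_t total = 0;   /* count of all alphabetic residues */
    const char *p;

    /* 1. Validate the input */
    if (seq == NULL)
        return 0.0;

    /* 2. Count residues, ignoring case */
    for (p = seq; *p != '\0'; p++)
    {
        int c = toupper((unsigned char) *p);

        /* Tricky step: skip gap characters and whitespace so that
         * they do not dilute the percentage */
        if (!isalpha(c))
            continue;

        total++;
        if (c == 'G' || c == 'C')
            gc++;
    }

    /* 3. Report the result, guarding against division by zero */
    if (total == 0)
        return 0.0;

    return 100.0 * (double) gc / (double) total;
}
```

Writing the numbered steps first, and only then the code beneath each one, keeps the program logic visible in the finished source file.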
Give particular attention to the application parameters; all of its inputs, outputs and any other options. You should make a definitive list of the data definitions (Section 4.3, “Data Definition”) you'll need in your ACD file.
You should know what objects (C data structures) and functions you'll need before you start coding, so that everything you need is available to you. This is just the same as collecting all your tools together before starting a DIY project. Progress will be slow if you have to repeatedly break off to find functions when what you should be doing is concentrating on the program logic. If any new objects and functions are required then refer to the guidelines for library development (Section 5.5, “Programming with Objects”).
The AJAX and NUCLEUS libraries are easily extended, but might already contain the functionality you need. The first thing to do therefore is to check the documentation for AJAX and NUCLEUS (see Appendix B, Libraries Reference).
Library developments include:
Extending the functionality of the existing code
Developing new objects and functions
Developing entirely new library files
When developing library code, be careful to avoid redundancy with the existing code. You should, if possible, modify existing objects or functions, or at least use the C source code as a starting point or for ideas on how to proceed.
Where the code is specialised (used by one application only) it should stay in the application C source code file until it is more generally used. Only code that is likely to be of general use should go into AJAX or NUCLEUS.
Where new code is in the area of an existing AJAX or NUCLEUS library file it should be added to that file. Where it introduces a new category of functionality, a new library file should be created for it.
An ACD file describes in detail the parameters and command line interface for a single EMBOSS application. It specifies exactly what input values are required or permitted, and how user input is prompted for, to ensure that the application can run correctly. It also defines the basic functionality of other interfaces, e.g. GUIs, derived from the ACD files.
An ACD file must contain a single application definition (Section 4.2, “Application Definition”) and then a data definition (Section 4.3, “Data Definition”) for each parameter. A definition consists of one or more attributes which are name:value pairs and describe the application or parameter in detail. The application definition is given first followed by the data definitions, which are organised into sections for "Input", "Output" and so on.
It's recommended to write the ACD file before the C source code because it defines much of the program's function. Writing the ACD file should be easy if you already have a detailed list of all the application parameters from your design. ACD files are written in the ACD syntax (Appendix A, ACD Syntax Reference); see the programming guidelines (Chapter 5, C Programming).
Various utilities (Section 4.6, “ACD Utilities”) are provided for validating, debugging and processing ACD files. You should make use of these tools to ensure that the ACD file is valid and properly formatted. For example, you can test an ACD file for functionality by using the acdc utility. Once you have a functional version you can generate one with standard formatting by using the acdpretty utility.
To ensure consistency in the EMBOSS code, all C code that you write should conform to a basic style. You should familiarise yourself with the standards (Appendix C, C Coding Standards) before coding.
The coding standards mostly concern the layout of code, but include some general guidelines for C programming in EMBOSS, particularly for programming datatypes and functions.
You will need to create (if necessary) a new application C source code file in the appropriate directory. The file should have a short meaningful name which, if possible, bears some resemblance to applications of similar function. Typically the EMBASSY myemboss package is used for individual developments. Alternatively the applications directory in another EMBASSY package or EMBOSS itself could be used:
A few basic steps are recommended when writing application C source code:
Consider the correct layout for the code (see Appendix C, C Coding Standards) including structured comments. Copy the layout of an existing application, or (better still) use the template file (.../myemboss/src/template.c) that is provided in the myemboss package.
Add an empty
Add prototypes for any new functions. These can be moved to AJAX or NUCLEUS later if required.
Add definitions for any new datatypes (see Section 5.5, “Programming with Objects”). Again, these can be moved to the libraries later.
Add any comments, pseudo-code etc from your preliminary design.
Declare variables and implement C code for reading any ACD data items (see Section 6.3, “Handling ACD Files”).
Add code at the end of main() for cleaning up memory. You know from your ACD file the parameters for which memory will have been allocated.
Write the remaining C source code for the
Write the C source code for any functions.
Ensure the code in general and functions and datatypes in particular are adequately documented (see Appendix D, Code Documentation Standards).
New C source code (*.c) and / or header files (*.h) may be created in the core AJAX or NUCLEUS directory as required:
The files should have a short, meaningful name. AJAX files must begin with the prefix aj; for example, ajstr.c is used for the string-handling library. NUCLEUS files must have the prefix emb; for example, embaln.c is used for the alignment algorithm code.
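The prefix rule is simple enough to check mechanically, for instance in a local pre-submission script. The sketch below is an illustration only: `demo_library_for_file` is a hypothetical helper, not an EMBOSS function, and it tests nothing beyond the two prefixes named in the text.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of EMBOSS): report which core library a
 * source file name belongs to, judged only by its prefix ("aj" for AJAX,
 * "emb" for NUCLEUS). Returns NULL if neither prefix matches. */
const char *demo_library_for_file(const char *filename)
{
    if (filename == NULL)
        return NULL;

    if (strncmp(filename, "aj", 2) == 0)
        return "AJAX";

    if (strncmp(filename, "emb", 3) == 0)
        return "NUCLEUS";

    return NULL;
}
```

With this sketch, `demo_library_for_file("ajstr.c")` yields `"AJAX"` and `demo_library_for_file("embaln.c")` yields `"NUCLEUS"`; a name with neither prefix yields `NULL`.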
To integrate large collections of code, for example third-party libraries, a new directory would usually be created under the main AJAX directory:
For example, files in the zlib library live in:
There are some basic steps when writing library code:
Consider the correct layout for the code (see Appendix C, C Coding Standards) including structured comments. Copying the layout of an existing library file is a good strategy.
Add function prototypes.
Add datatype definitions (see Section 5.5, “Programming with Objects”). This includes private datatypes (in the source file) and public datatypes (in the header file).
Write the C source code for the functions. This includes private (static) and public (external) functions, both of which go in the source file.
Ensure the code in general and functions and datatypes in particular are adequately documented (see Appendix D, Code Documentation Standards).
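The split between public and private items described in the steps above can be sketched in a single file. This is a minimal illustration in plain C under stated assumptions: every name (`DemoSRange`, `demoRangeNew`, `rangeIsValid` and so on) is hypothetical and none of it is AJAX or NUCLEUS code. The naming style and the destructor that takes the address of the caller's pointer are assumptions made for the example; Appendix C, C Coding Standards, remains the authority on the actual conventions.

```c
#include <assert.h>
#include <stdlib.h>

/* Public datatype: in a real library this belongs in the header file */
typedef struct DemoSRange
{
    long Start;    /* first position (inclusive) */
    long End;      /* last position (inclusive) */
} DemoORange;
typedef DemoORange *DemoPRange;

/* Public function prototypes: these also belong in the header file */
DemoPRange demoRangeNew(long start, long end);
long       demoRangeLen(const DemoORange *range);
void       demoRangeDel(DemoPRange *prange);
long       demoRangeLiveCount(void);

/* Private datatype: defined in the source file only */
typedef struct RangeSCounts
{
    long Live;     /* objects allocated and not yet deleted */
} RangeOCounts;

static RangeOCounts rangeCounts = { 0 };

/* Private (static) function: not visible outside this file */
static int rangeIsValid(long start, long end)
{
    return start >= 1 && start <= end;
}

/* Public constructor: returns NULL for an invalid range or on failure */
DemoPRange demoRangeNew(long start, long end)
{
    DemoPRange ret;

    if (!rangeIsValid(start, end))
        return NULL;

    ret = malloc(sizeof(*ret));
    if (ret == NULL)
        return NULL;

    ret->Start = start;
    ret->End = end;
    rangeCounts.Live++;

    return ret;
}

/* Public accessor: number of positions covered by the range */
long demoRangeLen(const DemoORange *range)
{
    assert(range != NULL);
    return range->End - range->Start + 1;
}

/* Public destructor: frees the object and resets the caller's pointer */
void demoRangeDel(DemoPRange *prange)
{
    if (prange == NULL || *prange == NULL)
        return;

    free(*prange);
    *prange = NULL;
    rangeCounts.Live--;
}

/* Public query: live object count, handy when testing for leaks */
long demoRangeLiveCount(void)
{
    return rangeCounts.Live;
}
```

Keeping the validity check static means it can be changed freely without affecting callers, while the live-object counter gives tests a cheap way to confirm that every constructor call was matched by a destructor call.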
New applications, regardless of whether they are destined for EMBOSS or an EMBASSY package, will typically be developed first in myemboss (Section 3.2, “Integration and Compilation”). This is an almost empty EMBASSY package and provides a way to isolate local application development from the rest of EMBOSS. Applications are built using a Makefile that is generated from a Makefile.am file, which must be edited to mention the new applications. An application developed and tested in myemboss can be added to the main package or an EMBASSY package using essentially the same process.
To add an entirely new EMBASSY package you must create the directory structure and required files, which include the application ACD files, C source code files and documentation, and edit the file configure.in and several
New functions and datatypes for AJAX and NUCLEUS must have structured comments to be properly integrated with the rest of the libraries. Entirely new library files are integrated by editing the Makefile.am in the AJAX or NUCLEUS directory as appropriate. You must also add any new header files to ajax.h and / or
For more information see Section 3.2, “Integration and Compilation”.
Any contributed code should be as bug-free as possible. EMBOSS includes inbuilt features for debugging applications. In addition, debugging functions are provided for fixing specific modules of code. For further information see Section 3.3, “Debugging”.
Various quality assurance (QA) tests are performed on the EMBOSS code and documentation to maintain quality. All applications are run on test data to ensure they work as advertised. Regular compilation and memory leak tests of the whole package ensure the integrity of the code; for example that applications are not broken by recent changes to the library. The structured documentation for objects and functions is validated. This ensures, for example, that functions and parameters have meaningful and consistent names and functions are listed in the correct section of the file.
QA testing is handled by the EMBOSS developers but there are ways to help. Your own code should be tested thoroughly in real use-case scenarios. In particular, it should be tested for memory leakage (see Section 3.3, “Debugging”) before it is contributed: if you are unclear how to do this then ask on the EMBOSS mailing lists (Section 1.4, “Project Mailing Lists”). Any library code for submission, or at least new public datatypes and functions, should be documented with structured comments (see Appendix D, Code Documentation Standards) so that validation checks can be performed. Any new applications should, ideally, have test cases and test data (see Chapter 7, Quality Assurance).
Software without documentation often has little value whereas good documentation can enhance the usefulness of software immensely. All EMBOSS application and library code for submission should be adequately documented. End-user documentation is also required for any new applications.
The process for generating the application end-user documentation, which is split into different sections, involves combining manually written components with parts that are generated automatically (for example, from parsing the ACD file). The process is largely automated. If you develop a new EMBASSY package then it will require documentation above and beyond that for individual applications.
All library code must be adequately documented. In particular, structured comments are required for public datatypes and functions. If you develop a new library file then it will require documentation, above and beyond that for individual objects and functions.
Structured comments at the head of application code mostly contain licence and bibliographic information. However, all public functions and datatypes must also be documented using structured comments, which are parsed to generate the online library documentation (Section 1.3, “Developer Documentation”). Functions are organised in the library files under functional sections, which themselves are documented using structured comments. These define, amongst other things, naming rules for functions and their parameters. The structured comments at the head of the library file usually contain extensive documentation. If you develop a new library file then you will need to provide documentation to the same level.
See the instructions on code documentation (Appendix D, Code Documentation Standards) and application end-user documentation (Section 8.1, “Application Documentation Standards”) for more information.
Any contributed code will be widely distributed and installed on many different platforms. EMBOSS is known to work on almost every UNIX system, Microsoft Windows and MacOSX. There is nothing to do here because this is managed by the EMBOSS developers, although it helps greatly if you integrate, test and properly document any new software as summarised above.
Any contributed code will need maintenance and support. This, to an extent, is covered by the EMBOSS developers who also run EMBOSS training courses. However, it helps greatly if you maintain your own source code and support it through the EMBOSS mailing lists (Section 1.4, “Project Mailing Lists”).