5.3. Introduction to Feature Formats

5.3.1. What is a Feature?

A feature is a region of interest in a molecular sequence. Examples include restriction enzyme cut sites, protein secondary structure prediction states, exon positions and regions of motif matches.

A vast number of programs generate features in one form or another, leading to a huge number of file formats used for features. The output types range from graphical displays of where restriction enzymes cut, to probabilities of the three states of a protein secondary structure prediction along a sequence, to rigidly defined text tables of the start and end positions of predicted exons or motif matches.

5.3.2. Supported Feature Formats

To handle this diversity, EMBOSS uses, where possible, the well-defined and flexible feature formats that were developed for the major sequence databases:

  • EMBL, Genbank, DDBJ

  • Swissprot

  • PIR, NBRF

  • GFF (GFF3 format)

  • GFF2 (the older and less strict GFF2 format)

The feature formats used by EMBOSS are identical to those used in the sequence database formats of the same name, e.g. the EMBL feature format is the feature table subset of the EMBL sequence database format. This holds true regardless of whether features are written together with their sequence or in a raw feature table (see below).

Support for a set of standard feature formats makes programs interoperable: they can read and write each other's output without the need for file interconversion. As the EMBOSS project matures, the feature formats will become the default way of reporting features. This will also give a consistent look and feel, helping you to compare features in different sequences and from different programs more easily. For descriptions and examples of the supported formats see Section A.2, “Supported Feature Formats”.

The supported feature formats are summarised in the tables below. The columns are as follows: Input Format or Output Format (format name), Nuc ("Yes" indicates that nucleotide sequence data may be represented), Pro ("Yes" indicates that protein sequence data may be represented) and Description (short description of the format).

Table 5.3. Input feature formats

Input Format  Nuc  Pro  Description
embl          Yes  No   embl/genbank/ddbj format
gff2          Yes  Yes  GFF version 1 or 2
gff3          Yes  Yes  GFF version 3
pir           No   Yes  PIR format
swiss         No   Yes  SwissProt format

Table 5.4. Output feature formats

Output Format  Nuc  Pro  Description
dasgff         Yes  Yes  DAS GFF format
debug          Yes  Yes  Debugging trace of full internal data content
embl           Yes  No   embl format
genbank        Yes  No   genbank format
gff            Yes  Yes  GFF version 3
gff2           Yes  Yes  GFF version 2
pir            No   Yes  PIR format
swiss          No   Yes  SwissProt format

5.3.3. How are Features Stored?

In EMBOSS, a feature is a region of interest in a nucleic or protein sequence and is described by:

  • Name describing the feature

  • Start and end position

  • The sense (in a nucleic sequence)

  • The reading frame (in a translated nucleic sequence)

  • A score

Features may also explicitly or implicitly hold the name of the program or database that they are derived from and various other descriptive data (see the EMBOSS Developers Guide).
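The attributes listed above can be pictured as a simple record. The following Python sketch is illustrative only: the class and field names are hypothetical and do not reflect the internal EMBOSS data structures (see the EMBOSS Developers Guide for those). It also shows how such a record could be written as one nine-column, tab-separated GFF3 line; placing the reading frame directly in the GFF3 phase column is a simplification made for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Feature:
    """Illustrative feature record; names are hypothetical, not EMBOSS's own."""
    name: str                      # name (key) describing the feature, e.g. "exon"
    start: int                     # start position
    end: int                       # end position
    sense: Optional[str] = None    # "+" or "-" (nucleic sequences only)
    frame: Optional[int] = None    # reading frame (translated nucleic sequences only)
    score: Optional[float] = None  # optional score
    source: str = "."              # program or database the feature is derived from

    def to_gff3(self, seqid: str) -> str:
        """Format the feature as one tab-separated GFF3 line (nine columns)."""
        def dot(value):
            # GFF3 uses "." for an undefined column value.
            return "." if value is None else str(value)

        columns = [
            seqid, self.source, self.name,
            str(self.start), str(self.end),
            dot(self.score), dot(self.sense),
            dot(self.frame),  # simplification: frame written in the phase column
            ".",              # no attributes in this sketch
        ]
        return "\t".join(columns)


exon = Feature(name="exon", start=120, end=480, sense="+", source="myprogram")
line = exon.to_gff3("seq1")
```

Here `seq1` and `myprogram` are made-up values; absent attributes (score and frame in this case) are written as `.`.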

A feature table is simply a group of features. Feature tables are stored in one of three ways:

  • As part of a sequence file

  • As part of a database entry

  • As a raw feature table: a file that does not contain the sequence the features refer to.

Most feature table definitions have a controlled vocabulary, i.e. there is a specified list of feature key names that can be used to label features. This means that new feature keys cannot simply be invented when a feature table is edited: any feature that is added must use one of the allowed feature keys.
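As a minimal sketch of what a controlled vocabulary implies in practice, the Python fragment below checks a list of feature keys against an allowed set. The set shown is a small illustrative subset of feature keys, not the complete vocabulary of any format, and the function name is hypothetical.

```python
# Illustrative subset only; each real format defines its own complete key list.
ALLOWED_KEYS = {"CDS", "exon", "intron", "mRNA", "misc_feature"}


def invalid_keys(feature_keys, allowed=ALLOWED_KEYS):
    """Return the feature keys that are not in the allowed vocabulary, in order."""
    return [key for key in feature_keys if key not in allowed]


# "my_new_key" is a made-up key and is therefore reported as invalid.
bad = invalid_keys(["CDS", "exon", "my_new_key"])
```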

5.3.4. Applications for Features

Some applications for handling generic sequence features are summarised below.

Feature Applications

Application  Description
coderet      Extract CDS, mRNA and translations from feature tables.
extractfeat  Extract features from a sequence.
maskfeat     Mask off features of a sequence.
showfeat     Show features of a sequence.
twofeat      Finds neighbouring pairs of features in sequences.

In addition, the diffseq and seqret applications also handle the feature table of an input sequence.

5.3.5. Specifying Features on the Command line

The Uniform Feature Object or UFO (Section 6.7, “The Uniform Feature Object (UFO)”) is the standard way, used by EMBOSS, of referring to feature input and output files on the command line. A UFO is used to specify a feature file by name and by the format of the features in the file.

Various qualifiers are provided for flexible handling of features on the command line (see Section 6.4, “Datatype-specific Command Line Qualifiers”). These allow you to set such things as file name and format and the region of the sequence containing the features of interest.

For example, the format of input features is specified with -fformat Format, where Format is the name of a supported feature format. Here, embl format is specified:

extractfeat myfile.feat -fformat embl

The format could also have been specified in the UFO (Section 6.7, “The Uniform Feature Object (UFO)”) of the input file:

extractfeat embl:myfile.feat
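The `format:filename` form used in the last command can be read as a format name and a file name separated by a colon. The Python sketch below splits such a string; it is an illustration of that one form only, not the parser used by EMBOSS, and the function name is hypothetical. The full UFO syntax is described in Section 6.7.

```python
def split_ufo(ufo, default_format=None):
    """Split a 'format:filename' string into (format, filename).

    Handles only the simple form shown in the text. If no colon is present,
    the whole string is treated as the file name and default_format is returned.
    """
    fmt, sep, filename = ufo.partition(":")
    if not sep:
        return default_format, ufo
    return fmt, filename


with_format = split_ufo("embl:myfile.feat")
without_format = split_ufo("myfile.feat")
```

With a format prefix the result is the pair `("embl", "myfile.feat")`; without one, the format is left unset, which corresponds to supplying it separately with -fformat.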