5.3. Introduction to Feature Formats

5.3.1. What is a Feature?

A feature is a region of interest in a molecular sequence. Examples include restriction enzyme cut sites, predicted protein secondary structure states, exon positions and regions of motif matches.

A vast number of programs generate features in one form or another, leading to a huge number of file formats used for features. The output types range from graphical displays of where restriction enzymes cut, to probabilities of the three states of a protein secondary structure prediction along a sequence, to rigidly defined text tables of the start and end positions of predicted exons or motif matches.

5.3.2. Supported Feature Formats

To handle this diversity, EMBOSS uses, where possible, the well-defined and flexible feature formats that were developed for the major sequence databases:

EMBL, Genbank, DDBJ
Swissprot
PIR, NBRF
GFF: the GFF3 format
GFF2: the older and less strict GFF2 format

The feature formats used by EMBOSS are identical to those used in the sequence database formats of the same name. For example, the EMBL feature format is the feature table subset of the EMBL sequence database format. This holds true regardless of whether features are written together with their sequence or in a raw feature table (see below).

Support for a set of standard feature formats makes programs interoperable: they can read and write each other's output without the need for file interconversion. As the EMBOSS project matures, the feature formats will become the default way of reporting features. This will also give a consistent look and feel, making it easier to compare features in different sequences and from different programs. For descriptions and examples of the supported formats see Section A.2, “Supported Feature Formats”.

The supported feature formats are summarised in the tables below. The columns are as follows: Input Format or Output Format (format name), Nuc ("Yes" indicates nucleotide sequence data may be represented), Pro ("Yes" indicates protein sequence data may be represented) and Description (short description of the format).

Table 5.3. Input feature formats
Input Format	Nuc	Pro	Description
embl	Yes	No	embl/genbank/ddbj format
gff2	Yes	Yes	GFF version 1 or 2
gff3	Yes	Yes	GFF version 3
pir	No	Yes	PIR format
swiss	No	Yes	SwissProt format

Table 5.4. Output feature formats
Output Format	Nuc	Pro	Description
dasgff	Yes	Yes	DAS GFF format
debug	Yes	Yes	Debugging trace of full internal data content
embl	Yes	No	embl format
genbank	Yes	No	genbank format
gff	Yes	Yes	GFF version 3
gff2	Yes	Yes	GFF version 2
pir	No	Yes	PIR format
swiss	No	Yes	SwissProt format
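
When a script has to choose a format that suits the sequence type, the Nuc and Pro columns can be treated as a simple lookup. The sketch below only transcribes Table 5.4; the dictionary and the function name `formats_for` are illustrative and are not part of EMBOSS.

```python
# Output feature formats transcribed from Table 5.4.
# Each value is (nucleotide supported, protein supported).
OUTPUT_FEATURE_FORMATS = {
    "dasgff": (True, True),
    "debug": (True, True),
    "embl": (True, False),
    "genbank": (True, False),
    "gff": (True, True),
    "gff2": (True, True),
    "pir": (False, True),
    "swiss": (False, True),
}


def formats_for(sequence_type):
    """Return the sorted output format names usable for 'nucleotide' or 'protein'."""
    index = {"nucleotide": 0, "protein": 1}[sequence_type]
    return sorted(
        name for name, flags in OUTPUT_FEATURE_FORMATS.items() if flags[index]
    )


print(formats_for("protein"))
```

Any other value for `sequence_type` raises a `KeyError`, which is deliberate in such a small helper.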

5.3.3. How are Features Stored?

In EMBOSS, a feature is a region of interest in a nucleic or protein sequence and is described by:

Name describing the feature
Start and end position
The sense (in a nucleic sequence)
The reading frame (in a translated nucleic sequence)
A score

Features may also explicitly or implicitly hold the name of the program or database that they are derived from and various other descriptive data (see the EMBOSS Developers Guide).
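
The fields listed above map naturally onto a small record type. The following is a minimal sketch for illustration only: the class `Feature`, its field names and the assumption of 1-based inclusive positions are choices made here, not the internal EMBOSS data structure (for that, see the EMBOSS Developers Guide).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Feature:
    """Illustrative feature record; not the EMBOSS internal representation."""

    name: str                      # name describing the feature, e.g. "exon"
    start: int                     # start position (assumed 1-based)
    end: int                       # end position (assumed inclusive)
    strand: Optional[str] = None   # sense, "+" or "-", for nucleic sequences
    frame: Optional[int] = None    # reading frame, for translated nucleic sequences
    score: Optional[float] = None  # score, if the source program assigns one
    source: Optional[str] = None   # program or database the feature came from

    def length(self) -> int:
        """Number of positions covered, given inclusive coordinates."""
        return self.end - self.start + 1


# Hypothetical example values.
exon = Feature(name="exon", start=101, end=250, strand="+", source="example_program")
print(exon.length())
```

Fields that do not apply, such as the sense of a protein feature, are simply left as `None`.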

A feature table is simply a group of features. Feature tables are stored in one of three ways:

As part of a sequence file
As part of a database entry
As a raw feature table: a file that does not contain the sequence the features refer to.
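
To make the idea of a raw feature table concrete, the sketch below renders a made-up feature as GFF3 text. GFF3 defines nine tab-separated columns and uses "." for an undefined value; the sequence name, coordinates and the helper `to_gff3` are invented for this example, and real EMBOSS output will typically carry more attributes.

```python
def to_gff3(seqid, features):
    """Render (source, type, start, end, score, strand, phase) tuples as GFF3 text."""
    lines = ["##gff-version 3"]
    for source, ftype, start, end, score, strand, phase in features:
        # Ninth column (attributes) is left undefined in this sketch.
        columns = [seqid, source, ftype, start, end, score, strand, phase, "."]
        # GFF3 writes "." for any undefined column.
        lines.append("\t".join("." if c is None else str(c) for c in columns))
    return "\n".join(lines) + "\n"


# Hypothetical feature on a sequence called "seq1".
table = to_gff3("seq1", [("demo", "exon", 101, 250, None, "+", None)])
print(table, end="")
```

The table names the sequence in its first column but contains no residues, which is what distinguishes a raw feature table from a sequence file with features.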

Most feature table definitions have a controlled vocabulary, i.e. there is a specified list of feature key names that can be used to label features. This means that you cannot add features with new keys when editing a feature table: any edit must stick to the allowed set of feature keys.
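
A controlled vocabulary is easy to enforce before a hand-edited table is passed to other programs. The sketch below assumes a tiny illustrative subset of keys; a real feature table definition lists many more, and the names `ALLOWED_KEYS` and `invalid_keys` are hypothetical.

```python
# Illustrative subset only; consult the relevant feature table
# definition for the full list of permitted keys.
ALLOWED_KEYS = {"CDS", "exon", "intron", "mRNA", "gene"}


def invalid_keys(feature_keys):
    """Return the keys that are not in the controlled vocabulary, in input order."""
    return [key for key in feature_keys if key not in ALLOWED_KEYS]


print(invalid_keys(["exon", "my_new_key", "CDS"]))
```

An empty result means every key in the edited table is permitted by the (assumed) vocabulary.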

5.3.4. Applications for Features

Some applications for handling generic sequence features are summarised below.

Feature Applications

Application	Description
coderet	Extract CDS, mRNA and translations from feature tables.
extractfeat	Extract features from a sequence.
maskfeat	Mask off features of a sequence.
showfeat	Show features of a sequence.
twofeat	Finds neighbouring pairs of features in sequences.

In addition, the diffseq and seqret applications also handle the feature table of an input sequence.

5.3.5. Specifying Features on the Command line

The Uniform Feature Object or UFO (Section 6.7, “The Uniform Feature Object (UFO)”) is the standard way in EMBOSS of referring to feature input and output files on the command line. A UFO specifies a feature file by its name and by the format of the features in the file.

Various qualifiers are provided for flexible handling of features on the command line (see Section 6.4, “Datatype-specific Command Line Qualifiers”). These allow you to set such things as file name and format and the region of the sequence containing the features of interest.

For example, the format of input features is specified with -fformat Format, where Format is the name of a supported feature format. Here, embl format is specified:

extractfeat myfile.feat -fformat embl

The format could also have been specified in the UFO (Section 6.7, “The Uniform Feature Object (UFO)”) of the input feature file:

extractfeat embl:myfile.feat
