EMBOSS provides excellent database support. All the common sequence formats you are likely to come across are supported. See the EMBOSS Users Guide.
A variety of indexing and access methods are supported. For example, EMBL entries can be read from:
A non-indexed EMBL-format flatfile held locally.
Original EMBL flatfiles using the CD-ROM, Staden or EMBOSS indexes
Original EMBL flatfiles using local SRS indexes
A file indexed for use with BLAST version 2 indexes
GCG database format
A query to the EMBL-EBI DBFETCH service
A query to the EMBL-EBI web server
A query to the Entrez web server
A query to any MRS web server
A query to any SRS web server (local or remote)
A relational database such as Sybase or Oracle by calling a local application.
Databases can be held locally and both indexed and non-indexed local files are supported. Tools for database indexing (Section 4.5, “Database Indexing”) are provided. One is a variation on the emblcd system, the other uses an updatable tree. They provide rapid access to single sequences and rapid queries of flat file databases. The dbi* indexing applications assume that you have one or both of ID and accession number in each record and that they are unique for the whole database index, whereas the dbx* applications can handle non-unique (duplicate) IDs and source files >2Gb in size. Use of the dbx* indexing applications is preferred.
EMBOSS also provides methods for retrieving sequences via the WWW. If sequences on a server are in a format unknown to EMBOSS, it might be possible to request that they be converted to FASTA format before they are served. There are three methods for interacting with a local SRS installation or with SRS on a remote public server. SRS queries can be made not only by ID and accession number but also (depending on the way a database has been indexed) by words in the description line, sequence version (or GI number), keywords or organism names.
Specialised access methods are provided for databases served by entrez and by EMBL-EBI's services (such as the DBFETCH service listed above).
For more general access through web servers, the url access method allows a database to be defined as a URL into which a user-specified ID is inserted.
For other non-flatfile databases or flat file databases in formats not currently supported by EMBOSS, it is possible to configure an external application to retrieve sequences.
There are three basic levels of query:
A single entry specified by database ID or accession number is retrieved.
One or more entries matching a wildcard string in the Uniform Sequence Address (USA, see the EMBOSS Users Guide) are retrieved (this can be slow for some methods).
All entries are read sequentially from a database.
One or more query levels may be specified for each database configuration.
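The three query levels can be told apart from the entry part of a USA. The following is a minimal Python sketch, not EMBOSS code: the function name query_level is hypothetical, and it assumes the simple database:entry form of a USA with * and ? as the wildcard characters.

```python
def query_level(usa):
    """Classify a 'database:entry' USA as 'entry', 'query' or 'all'.

    Illustration only: assumes the plain database:entry form and
    treats '*' and '?' as wildcard characters.
    """
    database, sep, entry = usa.partition(":")
    if not sep or not database or not entry:
        raise ValueError(f"expected database:entry, got {usa!r}")
    if entry == "*":
        # A bare wildcard asks for every entry in the database.
        return "all"
    if "*" in entry or "?" in entry:
        # A wildcard pattern may match one or more entries.
        return "query"
    # Anything else names a single ID or accession number.
    return "entry"


for usa in ("embl:x65923", "embl:hs*", "embl:*"):
    print(usa, "->", query_level(usa))
```

A database that only defines an access method for the 'entry' level would, under this reading, be unable to serve the second and third addresses.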
There are many methods (Section 4.3, “Database Access Methods”) for accessing databases. The available methods depend on the query level: i.e. whether a single entry, a wildcard-specified set of entries or all of the database entries are to be retrieved. For example, a web server might be suitable for retrieving a single or few entries but probably, quite sensibly, will not allow an entire database to be retrieved over the Internet. In contrast, a flat file database with no index is often (depending upon its size) only useful for reading all the entries sequentially ('all' retrieval level).
A database can be defined with a single retrieval method using the method attribute. Alternatively, multiple methods may be defined, depending on which type (entry, query, all) of access is required. The attributes methodentry, methodquery and methodall are used for this. This would be essential in the cases described above, where the database has to be accessed in different locations.
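How a definition with several method attributes might be resolved can be sketched in Python. This is an illustration, not the EMBOSS implementation: resolve_method is a hypothetical helper, the attribute values are example data, and the rule that a level-specific attribute takes precedence over a plain method attribute is an assumption.

```python
QUERY_LEVELS = ("entry", "query", "all")


def resolve_method(attrs, level):
    """Return the access method configured for a query level, or None."""
    if level not in QUERY_LEVELS:
        raise ValueError(f"unknown query level: {level!r}")
    # Assumption: methodentry/methodquery/methodall win over plain method.
    return attrs.get("method" + level, attrs.get("method"))


# Example data: single entries from a remote server, full reads from local files.
remote_and_local = {
    "type": "N",
    "format": "embl",
    "methodentry": "srswww",
    "methodall": "direct",
}

for level in QUERY_LEVELS:
    print(level, resolve_method(remote_and_local, level))
```

In this example no method is available for the 'query' level, which matches the idea that a definition may support only some of the three levels.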
In addition, each access method needs to know something about the database. What is needed will be different for each method, although there is, of course, much overlap between them. This information is specified by using the 'key: value' attributes. The required attributes depend on the access method and the query level.
key: value attributes and access methods (Section 4.3, “Database Access Methods”) are described below.
Every database you intend to use must be defined in one of the EMBOSS configuration files:
emboss.default is kept in the top-level EMBOSS directory (e.g. /usr/local/emboss/share/EMBOSS/emboss.default) and is used for defining site-wide databases. In contrast, .embossrc lives in your home directory and is used for defining your own databases or, for example, testing database definitions before adding them to the site-wide emboss.default.
Each database is configured using a database definition. The generalised form is:

DBNAME DatabaseName [ key: value ... ]

DBNAME, which is usually shortened to DB, is followed by the database name (DatabaseName) then a set of key: value attributes that specify that database. The key: value attributes are all enclosed by a pair of square brackets.
The key: value pairs are the configuration options and must contain:
A description of the access method (using method:) or one or more of: methodentry:, methodquery:, methodall:
A description of the original format of the sequences (using format:)
Other key: value pairs might be required depending on the access methods. Others are optional.
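The structure of a definition can be made concrete with a small Python sketch that splits one definition into its name and key: value attributes and then checks for the access method and format described above. It is not the EMBOSS parser: the function names are hypothetical, and it assumes that values are either double-quoted or contain no whitespace.

```python
import re

# "DB name [ ... ]" or "DBNAME name [ ... ]"; the body may span several lines.
DEFINITION_RE = re.compile(r'^\s*(?:DBNAME|DB)\s+(\S+)\s*\[(.*)\]\s*$', re.DOTALL)
# key: "quoted value"   or   key: bare_value
ATTRIBUTE_RE = re.compile(r'(\w+):\s*(?:"([^"]*)"|([^\s\]]+))')


def parse_db_definition(text):
    """Return (database name, attribute dict) for one definition."""
    match = DEFINITION_RE.match(text)
    if match is None:
        raise ValueError("not a DB definition")
    name, body = match.groups()
    attrs = {}
    for key, quoted, bare in ATTRIBUTE_RE.findall(body):
        attrs[key] = quoted if quoted else bare
    return name, attrs


def missing_required(attrs):
    """List the attributes named as mandatory in the text that are absent."""
    missing = []
    methods = ("method", "methodentry", "methodquery", "methodall")
    if not any(key in attrs for key in methods):
        missing.append("method")
    if "format" not in attrs:
        missing.append("format")
    return missing


EXAMPLE = '''
DB embl [
   type: "N"
   method: "direct"
   format: "embl"
   dir: "/home/auser/EMBOSS-6.2.0/test/embl/"
   file: "*.dat"
]
'''

name, attrs = parse_db_definition(EXAMPLE)
print(name, attrs, missing_required(attrs))
```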
As an illustration, to set up direct access to the EMBL and SwissProt test databases distributed with EMBOSS, your .embossrc file should look something like this:

DB embl [
   type: "N"
   method: "direct"
   format: "embl"
   dir: "/home/auser/EMBOSS-6.2.0/test/embl/"
   file: "*.dat"
   comment: "Test EMBL in EMBOSS distribution"
]

DB swissprot [
   type: "P"
   method: "direct"
   format: "swiss"
   dir: "/home/auser/EMBOSS-6.2.0/test/swiss/"
   file: "seq.dat"
   comment: "Test Swissprot in EMBOSS distribution"
]
Or, to set up access to the EMBL and SwissProt databases via SRS at the EMBL-EBI, your .embossrc file should look like this:

DB swissprot [
   type: "P"
   method: "srswww"
   format: "swiss"
   url: "http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz"
   comment: "Swissprot via EBI SRS"
]

DB embl [
   type: "N"
   method: "srswww"
   format: "embl"
   url: "http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz"
   comment: "EMBL via EBI SRS"
]
An emboss.default.template file is provided in the EMBOSS distribution. As its name suggests, it gives examples of some of the possible database definitions supported by EMBOSS (see the next section). An excerpt of the emboss.default.template file is shown below:

#SET emboss_tempdata path_to_directory_$EMBOSS/test

# Logfile - set this to a file that any user can append to
# and EMBOSS applications will automatically write log information

#SET emboss_logfile /packages/emboss/emboss/log

# pir (cytochrome C plus first entries in other divisions)
# ===

DB tpir [
   type: P
   dir: $emboss_tempdata/pir
   method: gcg
   file: pir*.seq
   format: nbrf
   fields: "des org key"
   comment: "PIR in 4 files in GCG format indexed by dbigcg"
]

# Genbank (Remote access to an MRS server)
# =======

DB genbank [
   type: N
   methodentry: mrs3
   format: genbank
   dbalias: "genbank_release"
   url: "http://mrs.cmbi.ru.nl/mrs-3/plain.do"
   comment: "GenBank IDs via MRS"
]

# genbank (the first few entries from several sub-section files)
# =======

DB tgenbank [
   type: N
   dir: $emboss_tempdata/genbank
   method: emblcd
   format: genbank
   release: 01
   fields: "sv des org key"
   comment: "GenBank native format indexed by dbiflat"
]
To see how databases are set up under EMBOSS, you should look at the configurations for the test databases included in the EMBOSS distribution. The EMBOSS developers use these databases to test database indexing and sequence reading. They also contain the sequences that are used in the usage examples for the applications (see the application documentation online or by running tfm). They include:
emrod (DNA) and
swnew (protein) are in BLAST format)
*.dat for EMBL format,
.seq for gcg format)
.seq for nbrf format)
.dat for swissprot format, 1 file)
.dat for swissprot format, 3 files)
wormpep is in FASTA and BLAST format)
The template file (emboss.default.template) in the EMBOSS distribution (e.g. /usr/local/emboss/share/EMBOSS/emboss.default.template) contains configurations for all the test databases. You can use emboss.default.template as a template for entries in your own emboss.default file. For any database definitions you use, change the definition of emboss_tempdata to point to your test directory and uncomment the line. You'll then be able to use the test databases as "tembl", "tsw" and so on.
One of the first things an EMBOSS application does when it runs is to read in the installed emboss.default (and then the ~/.embossrc file, if it exists). This means that any changes to these definition files take effect as soon as they are made.
For example, change:

# swissprot (Puffer fish entries)
# =========

DB tsw [
   type: P
   dir: $emboss_tempdata/swiss
   method: emblcd
   format: swiss
   release: 36
   fields: "sv des org key"
   comment: "Swissprot native format with EMBL CD-ROM index"
]

to:

# swissprot (Puffer fish entries)
# =========

DB tsw [
   type: P
   dir: /home/auser/EMBOSS-6.2.0/test/swiss
   method: emblcd
   format: swiss
   release: 36
   fields: "sv des org key"
   comment: "Swissprot native format with EMBL CD-ROM index"
]
Alternatively, to get all the test databases supported, rename or copy emboss.default.template to emboss.default and edit the file as follows. This line:
# SET emboss_tempdata path_to_directory_$EMBOSS/test
must be uncommented and the definition changed to the directory where the databases are installed. In the following example this is /usr/local/share/EMBOSS/test. For example:

SET emboss_tempdata /usr/local/share/EMBOSS/test
# or
SET emboss_tempdata /home/auser/workspace/emboss/emboss/test/
# or something else
The directory where the test databases are installed can be changed with --prefix when you configure EMBOSS.
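To illustrate why a single uncommented SET line is enough to make every test database definition usable, the Python sketch below reads SET lines and substitutes the variable into a dir: value. It is only a model of the behaviour described above: the function names are hypothetical, and Python's string.Template stands in for EMBOSS's own variable expansion, whose exact rules are not covered here.

```python
from string import Template


def read_set_variables(config_text):
    """Collect 'SET name value' lines; commented-out lines are ignored."""
    variables = {}
    for line in config_text.splitlines():
        parts = line.split(None, 2)
        if len(parts) == 3 and parts[0] == "SET":
            variables[parts[1]] = parts[2].strip()
    return variables


def expand(value, variables):
    """Replace $name references; unknown names are left untouched."""
    return Template(value).safe_substitute(variables)


CONFIG = """\
#SET emboss_logfile /packages/emboss/emboss/log
SET emboss_tempdata /usr/local/share/EMBOSS/test
"""

variables = read_set_variables(CONFIG)
print(expand("$emboss_tempdata/swiss", variables))
```

While the SET line is still commented out, the variable stays undefined and the dir: value cannot point at a real directory.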
Having defined your databases (see Section 4.1, “General Database Configuration”), you can run showdb -full and you should see them all appear in the list of databases. If the message Warning: Bad database definition is generated or if a database doesn't appear then something is seriously wrong with your definition. Go back to it and check things. Common mistakes include:
Have you left off the terminal square bracket ]?
Did you leave out a colon character : in an attribute?
Have you forgotten to put in the closing quotes around some text?
Is the emboss.default file world-readable?
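The first three mistakes in the list above are purely syntactic, so they can be caught before running showdb. The Python sketch below is a rough heuristic, not an EMBOSS tool: lint_definition is a hypothetical name, and the check assumes that a space follows each colon and that values are double-quoted or free of whitespace.

```python
import re

# A token is either a double-quoted string or a run of non-space characters.
TOKEN_RE = re.compile(r'"[^"]*"|\S+')


def lint_definition(text):
    """Return a list of likely syntax problems in one DB definition."""
    problems = []
    if text.count("[") != text.count("]"):
        problems.append("unbalanced square brackets")
    quotes_ok = text.count('"') % 2 == 0
    if not quotes_ok:
        problems.append("unclosed double quote")
    start = text.find("[")
    if start == -1 or not quotes_ok:
        # Without an opening bracket or with broken quoting the
        # attribute check below would only produce noise.
        return problems
    end = text.rfind("]")
    body = text[start + 1 : end if end > start else len(text)]
    tokens = TOKEN_RE.findall(body)
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if token.startswith('"') or ":" not in token:
            problems.append(f"attribute {token!r} has no colon")
            i += 2  # skip what was probably meant as its value
        elif token.endswith(":"):
            i += 2  # key followed by a separate value token
        else:
            i += 1  # key and value written without a space
    return problems


print(lint_definition('DB tsw [ type: P method emblcd format: swiss ]'))
```

A clean result only means the definition is well formed; it says nothing about whether the files, indexes or servers it names are usable.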
If showdb displays your database, check that all of your required access methods are listed as OK. If something is not OK then another access method might be required.
The fact that showdb finds a database definition does not mean the database is working correctly: showdb does not attempt to extract any entries from your database. Therefore you should try extracting one or more known entries from the database using seqret. If you get errors, you should check that the database is set up correctly and defined correctly. Things to check include:
Are the data files and indexes world-readable?
If using emboss, did you index the data files?
If using app, is the application in your PATH?
If using app, is the PATH specified correctly?
If using app, is the application world-executable?
If using srswww, is the server up?
If using srswww, is the server URL correct?
Are the file: wildcards specified correctly?
Are the directory: paths specified correctly?
Have you put the files there yet?
If using any SRS method, did you use
If using any SRS method, check the dbalias: name in the SRS server.
If accessing by
ORG, did you remember to specify these when you indexed the database?
If accessing by
ORG, did you specify
Take another look at the format. Is that really
fasta, or is it
Do you have duplicate entries? The dbi* program indices must have unique entry names.
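The last check in the list can be automated for EMBL-style and SwissProt-style flat files, where each entry starts with an ID line. The Python sketch below is a hypothetical helper rather than part of EMBOSS, and the sample lines use made-up identifiers; it assumes that the entry name is the first field after ID and may carry a trailing semicolon.

```python
from collections import Counter


def entry_ids(lines):
    """Yield the entry name from each 'ID' line of a flat file."""
    for line in lines:
        if line.startswith("ID "):
            fields = line.split()
            if len(fields) > 1:
                yield fields[1].rstrip(";")


def duplicate_ids(lines):
    """Return the sorted entry names that occur more than once."""
    counts = Counter(entry_ids(lines))
    return sorted(name for name, count in counts.items() if count > 1)


# Made-up sample data in the style of EMBL ID lines.
SAMPLE = [
    "ID   TEST0001; SV 1; linear; mRNA; STD; HUM; 518 BP.",
    "XX",
    "ID   TEST0002; SV 1; linear; mRNA; STD; HUM; 600 BP.",
    "XX",
    "ID   TEST0001; SV 2; linear; mRNA; STD; HUM; 518 BP.",
]

print(duplicate_ids(SAMPLE))
```

To check a real file, pass an open file object instead of the list, for example duplicate_ids(open(path)). If the result is not empty, the dbx* indexing applications are the ones that can handle the duplicates.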