4.3. Database Access Methods

4.3.1. Introduction

The available database access methods are described below (Table 4.2, “Database Access Methods”).

Table 4.2. Database Access Methods
Method | Scope | Comments
EMBOSS | * | Uses a B+-tree index from the programs dbxflat (used for "flat" files, i.e. files in their native database format) or dbxfasta (FASTA-format files).
EMBLCD | * | Uses an EMBLCD index from the programs dbiflat (used for "flat" files, i.e. files in their native database format) or dbifasta (FASTA-format files).
SRS | * | Calls getz locally, using the -e switch to return whole entries in the original format. Query fields supported are 'id' 'acc' 'gi' 'sv' 'des' 'org' and 'key'.
SRSFASTA | * | As for SRS, but uses getz -d -sf fasta to read the sequence in FASTA format. Query fields supported are 'id' 'acc' 'gi' 'sv' 'des' 'org' and 'key'.
SRSWWW | single entry | Uses a defined SRS WWW server to read a single entry. Query fields supported are 'id' 'acc' 'gi' 'sv' 'des' 'org' and 'key'.
MRS | * | Uses a defined MRS server to read a single entry. Query fields supported are 'id' and 'acc'.
Entrez | * | Uses Entrez at NCBI to read data. Query fields supported are 'id' 'acc' 'gi' 'sv' 'des' 'org' and 'key'. Data is returned in the original format (e.g. GenBank).
Dbfetch | * | Uses Dbfetch REST access at EBI to read data. Query fields supported are 'id' and 'acc'.
BLAST | * | Uses an EMBLCD index from the program dbiblast.
EMBOSSGCG | * | Uses a B+-tree index from the program dbxgcg to access a database reformatted for GCG 8, 9 or 10 by GCG programs such as embltogcg.
GCG | * | Uses an EMBLCD index from the program dbigcg to access a database reformatted for GCG 8, 9 or 10 by GCG programs such as embltogcg.
DIRECT | all | Opens the database file(s) and returns each entry sequentially. Query fields supported are 'id' 'acc' 'gi' 'sv' 'des' 'org' and 'key'.
URL | single entry | Uses any other Web server (for example the EBI's emblfetch or swissfetch queries) to return an entry.
APP, EXTERNAL | * | Runs an external application or a simple script which returns one, more or all entries.

4.3.2. Description of Database Access Methods

4.3.2.1. EMBOSS

The EMBOSS index method is preferred over the older EMBLCD method. It allows non-unique index terms (e.g. non-unique IDs) and can cope with files over 2 GB in size.

This method uses B+-tree indexes from the programs dbxflat (flatfiles - database native format files) or dbxfasta (FASTA format files). This can cope with all levels of access. Queries use the index files. Reading all entries uses the list of files in the [database].ent file and opens each in turn.

Supports queries by

id
acc
sv
key
org
des

(Not by key and org if the database was indexed by dbxfasta as these cannot be found in the FASTA format description line.)

The directory containing the sequence files and indexes to be read must be specified using the directory: attribute. If the indexes are in a directory other than the one containing the sequence files, then the index directory can be explicitly set using the indexdirectory: attribute.

The available fields should be specified using the fields: attribute if more than just the default ID name and Accession number fields have been indexed. Because these indexes allow non-unique IDs, each of the fields may return a list of matches, i.e. type query is used throughout.

For example:

DB mydb [
  type: N
  method: emboss
  format: embl
  fields: "sv des org key"
  directory: /data/embl
]

The EMBOSS B+-tree index files include the filenames indexed by dbxflat or dbxfasta. You can use the file: and exclude: attributes to create file-specific subsets from a single index.

4.3.2.2. EMBLCD

Uses an EMBLCD index from the programs dbiflat (flatfiles - database native format files) or dbifasta (FASTA format files). This can cope with all levels of access. Queries use the index files. Reading all entries uses the list of files in the division.lkp file and opens each in turn.

Supports queries by

id
acc
sv
key
org
des

(Not by key and org if the database was indexed by dbifasta because there is no way to find these in the FASTA format description line.)

The directory containing the sequence files and indices to be read must be specified using the directory: attribute. If the indices are in a directory other than the one containing the sequence files, then the index directory can be explicitly set using the indexdirectory: attribute.

The available fields should be specified using the fields: attribute if more than just the default ID name and Accession number fields have been indexed. A wildcard search for unique fields (id or sv), or any search for acc, des, org or key is of type query and returns a list of entries. A search for a single id or sv is of type entry and will find the first match in the index and assume no other matches. The ID has to be unique in an EMBLCD database.

For example:

DB mydb [
  type: N
  method: emblcd
  format: embl
  fields: "sv des org key"
  directory: /data/embl
]
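
The entry/query distinction described above can be sketched in a few lines of Python. This is only an illustration of the rule as stated in the text, not EMBOSS code: the function name classify_search is hypothetical, and treating '*' and '?' as the wildcard characters is an assumption.

```python
# Illustrative sketch only; not part of EMBOSS.
UNIQUE_FIELDS = {"id", "sv"}                # fields treated as unique by EMBLCD
LIST_FIELDS = {"acc", "des", "org", "key"}  # fields that always return a list
WILDCARDS = ("*", "?")                      # assumption: wildcard characters


def classify_search(field: str, term: str) -> str:
    """Return 'entry' or 'query' for a search on an EMBLCD-indexed database."""
    field = field.lower()
    if field in LIST_FIELDS:
        return "query"
    if field in UNIQUE_FIELDS:
        # A wildcard on a unique field can match several entries.
        has_wildcard = any(ch in term for ch in WILDCARDS)
        return "query" if has_wildcard else "entry"
    raise ValueError(f"unsupported field: {field}")


print(classify_search("id", "hsfau"))    # single id
print(classify_search("id", "hs*"))      # wildcard id
print(classify_search("acc", "x65923"))  # any acc search
```

A search classified as 'entry' stops at the first match in the index, which is why IDs must be unique in an EMBLCD database.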

The EMBLCD index files include the filenames indexed by dbiflat or dbifasta. You can use the file: and exclude: attributes to create file-specific subsets from a single index. Use of the indexdirectory: attribute is common, allowing index files to be in a different directory from the source flat files.

4.3.2.3. SRS

This requires a local installation of SRS. This calls getz locally, using the -e switch to return whole entries in the original format. It is expected that getz is in the path.

Supports queries by

id
acc
sv
key
org
des

If the SRS server has a different name for this database than that specified as the DBNAME, then you must specify it using the dbalias: attribute.

EMBOSS expects the SRS local access program to be called getz, but you can explicitly override this using the app: attribute. This can be used to call getz using its explicit path, rather than relying on getz being in the path.

Database definitions using method: srs should also specify methodall: direct plus directory: and file: for reading all entries directly. This is much faster than using getz to read and format all entries (unless the database is very small).

For example:

DB mydb [
  type: N
  format: embl
  method: srs
  dbalias: embl
  fields: "sv des org key"

# define 'all' access method
  methodall: direct
  directory: /data/embl
  file: *.seq   
]

As SRS returns the results using getz -e, the format should match the format of the original data. For some formats this might be problematic (PIR for example). In that case you can consider using SRSFASTA, although this will lose any information that is not included in the FASTA format SRS output.

4.3.2.4. SRSFASTA

As for SRS, but uses:

getz -d -sf fasta

to read the sequence in FASTA format. It is used for databases like dbEST.reports where EMBOSS does not understand the entry format but SRS can convert it to FASTA. As the database format is not understood by EMBOSS, a search of the entire database would be forced to use getz to convert each entry, which would be slow.

Supports queries by

id
acc
sv
key
org
des

If the SRS server has a different name for this database from that specified as the DBNAME, then you must specify it using the dbalias: attribute.

EMBOSS expects the SRS local access program to be called getz, but you can explicitly override this using the app: attribute. This can be used to call getz using its explicit path, rather than relying on getz being in the path.

Database definitions need to specify methodall: direct plus directory: and file: to read all entries directly. This is much faster than using getz to read and format all entries.

For example:

DB mydb [
  type: N
  format: fasta
  method: srsfasta
  dbalias: embl
  fields: "sv des org key"

# define 'all' access method
  methodall: direct
  directory: /data/embl
  file: *.seq 
]

4.3.2.5. SRSWWW

Uses a defined SRS WWW server to read a single entry. This can be useful, for example, to get the GenBank version of an EMBL entry. Wildcard entry names are not recommended because SRS servers are not intended to return large numbers of entries.

Supports queries by

id
acc
sv
key
org
des

If the SRS server has a different name for this database from that specified as the DBNAME, then you must specify it using the dbalias: attribute.

The remote SRS web server must be specified using the url: attribute.

Database definitions should define this as methodentry: or methodquery:. Failure to do so could lead to a request to return the entire database. Although an SRS web server can cope with this, EMBOSS would then have to keep the entire web page in memory before stripping out HTML tags in order to read the first entry.

For example:

DB mydb [
  type:        "N"
  format:      "embl"
  methodquery: "srswww"
  dbalias:     "embl"
  fields:      "sv des org key"
  url:         "http://srs.redbrick.ac.uk/srsbin/cgi-bin/wgetz"

# define 'all' access method
  methodall: "direct"
  directory: "/data/embl"
  file:      "*.seq" 
]

Various database definitions for remote retrieval of sequences over the web via SRS are shown below:

DB embl [  
type:    "N" 
method:  "srswww" 
format:  "embl" 
release: "EBI"
url:     "http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz"
comment: "EMBL from the EBI" ]

DB em [  
type:    "N" 
method:  "srswww" 
format:  "embl" 
release: "EBI"
url:     "http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz"
dbalias: "EMBL"
comment: "EMBL from the EBI" ]

DB uniprot [  
type:    "P" 
method:  "srswww" 
format:  "swiss" 
release: "EBI"
url:     "http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz"
comment: "UNIPROT from the EBI" ]

DB uni [  
type:    "P" 
method:  "srswww" 
format:  "swiss"
release: "EBI"
url:     "http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz"
dbalias: "UNIPROT"
comment: "UNIPROT from the EBI" ]

4.3.2.6. BLAST

Note

Currently dbiblast can't use the new (format 4) style of BLAST indexes. You must create the old (format 3) style of BLAST indexes by adding -A F to the formatdb command line.

Uses an EMBLCD index from the program dbiblast to access databases in BLAST format. The BLAST database can be DNA or protein, produced by formatdb, pressdb or setdb, with or without the original FASTA format file. This can cope with all levels of access. Queries use the index files. Reading all entries uses the list of files in the division.lkp file and opens each in turn.

Supports queries by

id
acc
sv
des

(Not by key and org as there is no way to find these in the BLAST database description line).

The directory containing the BLAST index files (*.nin, *.nhr, *.nsq, *.pin, *.phr, *.psq, etc.) and the index files produced by dbiblast must be specified using the directory: attribute. If the dbiblast indices are in a directory other than the one containing the BLAST index files, then the dbiblast index directory can be explicitly set using the indexdirectory: attribute. The available fields should be specified using the fields: attribute if more than just the default ID name and Accession number fields have been indexed.

A wildcard search for unique fields (id or sv), or any search for acc, des, org or key is of type 'query' and returns a list of entries. A search for a single id or sv is of type 'entry' and will find the first match in the index and assume no other matches. The ID has to be unique in an EMBLCD database.

For example:

DB mydb [
  type: N
  format: embl
  method: blast
  fields: "sv des"
  directory: /data/embl
]

4.3.2.7. EMBOSSGCG

Uses a B+-tree index from the program dbxgcg to access a database reformatted for GCG 8, 9 or 10 by GCG programs such as embltogcg. As only the .ref and .seq files are used, any "GCG" distribution of the databases can be used with dbxgcg without the need to create GCG-specific index files. This can cope with all levels of access. Queries use the index files. Reading all entries uses the list of files in the [database].ent index files and opens each in turn.

Supports queries by

id
acc
sv
key
org
des

The directory containing the sequence files and indexes to be read must be specified using the directory: attribute. If the indexes are in a directory other than the one containing the sequence files, then the index directory can be explicitly set using the indexdirectory: attribute. The available fields should be specified using the fields: attribute if more than just the default ID name and Accession number fields have been indexed.

As the B+-tree indexes allow duplicate keys, all queries may return a list of entries, i.e. type query is used throughout.

For example:

DB mydb [
  type:      "N"
  format:    "embl"
  method:    "embossgcg"
  fields:    "sv des org key"
  directory: "/data/gcg/gcgembl"
]

You can use the file: and exclude: attributes to create file-specific subsets from a single index.

4.3.2.8. GCG

Uses an EMBLCD index from the program dbigcg to access a database reformatted for GCG 8, 9 or 10 by GCG programs such as embltogcg. As only the .ref and .seq files are used, any "GCG" distribution of the databases can be used with dbigcg without the need to create GCG-specific index files. This can cope with all levels of access. Queries use the index files. Reading all entries uses the list of files in the division.lkp index files and opens each in turn.

Supports queries by

id
acc
sv
key
org
des

The directory containing the sequence files and indices to be read must be specified using the directory: attribute. If the indices are in a directory other than the one containing the sequence files, then the index directory can be explicitly set using the indexdirectory: attribute. The available fields should be specified using the fields: attribute if more than just the default ID name and Accession number fields have been indexed.

A wildcard search for unique fields (id or sv), or any search for acc, des, org or key is of type query and returns a list of entries. A search for a single id or sv is of type entry and will find the first match in the index and assume no other matches. The ID has to be unique in an EMBLCD database.

For example:

DB mydb [
  type:      "N"
  format:    "embl"
  method:    "gcg"
  fields:    "sv des org key"
  directory: "/data/gcg/gcgembl"
]

You can use the file: and exclude: attributes to create file-specific subsets from a single index.

4.3.2.9. DIRECT

Opens database flat file(s) and returns each entry sequentially.

This method assumes there is no indexing done on the data, so it can only process all entries. You should explicitly set up other methods for entry and query access to the same database if these are required. It is possible to access a database with the direct method and an ID or field in the USA, but EMBOSS will read the entire database to look for matching entries if no other method is specified.

The directory containing the sequence files to be read must be specified using the directory: attribute. The files to be read must be specified using the file: attribute. You may use the exclude: attribute to exclude some selected files from consideration.

For EMBL, the files can be defined as *.dat to avoid listing the explicit filenames (e.g. est18, hum3, htg2).

If the file format supports additional fields, they can be included in the definition as fields: to allow their use in USAs.

For example:

DB mydb [
  type: N
  format: embl
  methodall: direct
  directory: /data/embl
  file: *.dat
]

4.3.2.10. URL

Uses any other Web server (for example the EBI's emblfetch or swissfetch queries) to return an entry.

The remote web server's URL must be specified using the url: attribute. This URL is expected to contain one or more instances of the character pair '%s'. Each of these pairs is replaced by the value of the ID name when this database is accessed. Any HTML formatting will be stripped from the resulting web page. For example:

DB mydb [
  type: N
  format: embl
  methodentry: url
  url: "http://server.commercial.com/cgi-bin/getseq?%s&format=embl"
]
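
The '%s' substitution described above can be pictured with the following Python sketch. It is not EMBOSS code: the function name expand_url and the ID X65923 are made up for illustration, and whether EMBOSS escapes special characters in the ID is not covered here.

```python
# Illustrative sketch only; not part of EMBOSS.
def expand_url(template: str, entry_id: str) -> str:
    """Replace every '%s' in a url: attribute value with the entry ID."""
    return template.replace("%s", entry_id)


url = "http://server.commercial.com/cgi-bin/getseq?%s&format=embl"
print(expand_url(url, "X65923"))
```

Because every instance of '%s' is replaced, a URL that needs the ID in two places (for example in the path and in a query parameter) can simply contain '%s' twice.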

4.3.2.11. APP

Run an external application or a simple script which returns one/more/all entries. The application can be in the user's path or have an explicit path provided. EXTERNAL is the same thing as APP, but it is obsolete and its use is discouraged.

The database definition must have app: defined to specify the application command.

The database and entry name will be appended to the application command as

application dbname:entry

Both ID and Accession number can be used to specify the entry. Alternatively, if the app: attribute value contains the character pair '%s', it is replaced by the value of the ID name or Accession number when this database is accessed. You can also use GCG's typedata as an external application, to save reindexing a GCG database.

This could be a good way to search a set of databases, for example to get the first entry from SwissNew, SwissProt, TrEmbl and TrEmblNew with the ID, accession number or PID as the entryname.

For example:

DB mydb [
  type:   "N"
  format: "embl"
  method: "app"
  app:    "/usr/local/bin/accessdb -db embl -query %s"
]
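
The two ways of passing the entry to the application, as described above, can be summarised in a short Python sketch. It only mirrors the behaviour stated in this section and is not taken from the EMBOSS source; build_app_command and the entry names are hypothetical.

```python
# Illustrative sketch only; not part of EMBOSS.
def build_app_command(app: str, dbname: str, entry: str) -> str:
    """Build the command line for an app: definition.

    If the app: value contains '%s', the ID or accession number is
    substituted for it; otherwise 'dbname:entry' is appended.
    """
    if "%s" in app:
        return app.replace("%s", entry)
    return f"{app} {dbname}:{entry}"


print(build_app_command("/usr/local/bin/accessdb -db embl -query %s",
                        "mydb", "X65923"))
print(build_app_command("customapp myproteindb", "mydb", "P12345"))
```

The '%s' form is useful when the external program expects the entry name as the value of a specific option rather than as a trailing dbname:entry argument.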

4.3.3. Mixed Access Methods

For any given method: declaration, EMBOSS will use that method for those access modes supported by the method. If you wish to specify which access level (all entries, query or single entry) should be handled by which database retrieval method, then the methodentry:, methodquery: and methodall: declarations should be used instead of method:.

For example:

DB mydb [
methodentry:  "app"
format:       "fasta"
app:          "customapp myproteindb"
methodall:    "direct"
dir:          $emboss_db_dir/myproteindb
file:         "myproteindb.dat"
type:         "P"
comment:      "single and all access for myproteindb"
]

You can mix these, for example, to use a script to query a file and direct access to read all entries. For example:

  methodall: 'direct'
  methodquery: 'app'
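
One way to picture how a definition with mixed methods is resolved is the Python sketch below. It is a simplified model, not the EMBOSS implementation: the function method_for is hypothetical, the level names follow the methodentry:, methodquery: and methodall: attributes used in this section's examples, and the assumption that a level-specific attribute takes precedence over a plain method: is ours.

```python
# Illustrative sketch only; not part of EMBOSS.
from typing import Optional

LEVELS = ("entry", "query", "all")


def method_for(definition: dict, level: str) -> Optional[str]:
    """Pick the access method for one access level of a DB definition.

    Assumption: a level-specific attribute (e.g. 'methodall') wins over
    the general 'method' attribute; None means nothing was defined.
    """
    if level not in LEVELS:
        raise ValueError(f"unknown access level: {level}")
    specific = definition.get(f"method{level}")
    if specific is not None:
        return specific
    return definition.get("method")


mydb = {"methodall": "direct", "methodquery": "app"}
print(method_for(mydb, "all"))
print(method_for(mydb, "query"))
print(method_for(mydb, "entry"))
```

In this model the definition above has no method for single-entry access, so a complete definition would add either methodentry: or a general method:.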

4.3.4. Database Farms

Currently there is no simple way to combine several data sources into a single, composite database. The closest you can get is to define a database that calls an application that can return sequences from any one of a set of previously-defined EMBOSS databases.

A script has been developed for this task by Simon Andrews. It is shown below or you can download it as the file http://emboss.open-bio.org/downloads/databasefarm.sh.

#!/usr/bin/perl -w
#
# change the above line to match the location of perl on your system
#


use strict;

# EMBOSS farm file script
#
# Written by Simon Andrews
# simon.andrews@bbsrc.ac.uk
# Dec 2001
#
# This script allows you to set up a farm
# of EMBOSS databases which can be queried
# by a single instance of seqret.  The
# program must be accompanied by an entry
# in emboss.default which looks like this:
#
# DB name_of_database [
#       type: N (or P if we're dealing with proteins)
#       method: app
#       format: fasta
#       app: "/path/to/this/emboss_farm.script"
#       comment: "Whatever text you'd like to see in showdb" 
# ]
#

# First we need to set a few preferences
#
# What is the full path to seqret?
# If you are sure that seqret will always
# be somewhere in your path, then you can
# just leave this as 'seqret'.

my $seqret_path = 'seqret';


# Now we need to know the names of the
# databases you'd like included in the
# search.  These must be databases which
# have already been indexed, and installed
# correctly into emboss.default.  Simply
# enter the database names between the
# brackets, separated by spaces.

my @databases = qw(dbase1 dbase2 dbase3);


##### End of bits which need to be edited #########

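my ($reference) = @ARGV;

# Guard against a missing argument so the error message below is always defined
$reference = '' unless defined $reference;

if ($reference =~ /:(.+)$/){
  $reference = $1;
}

else {
  die "\n*** FARM ERROR *** Couldn't get accession after : from $reference\n\n";
}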


foreach my $database (@databases){

  my $sequence = `$seqret_path $database:$reference fasta::stdout 2>/dev/null`;

  if ($sequence){
        print $sequence;
        exit;
  }

}

warn "\n*** FARM ERROR *** Couldn't find $reference in any of '@databases'\n\n";

To use this, simply copy and paste the text of the script to a file on your system, then make sure that this file is readable and executable by everyone (chmod 755 filename). The comments in the script tell you what changes you need to make to the script itself, and the format of the entry you need to create in emboss.default.

It will work with seqret (and will output any format you like), and can also be used as part of a USA for any of the standard EMBOSS programs. The script requires a UNIX-like OS, but could trivially be adapted to run under Win32.
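
For readers who prefer Python, the logic of the farm script can be sketched as follows. This is a hedged re-expression of the Perl script above, not a tested replacement for it: the names seqret_lookup and farm_lookup are hypothetical, the seqret command line simply mirrors the one in the Perl script, and the lookup function is passed in as a parameter so the search order can be tried out without EMBOSS installed.

```python
# Illustrative sketch only; mirrors the logic of the Perl farm script.
import subprocess
from typing import Callable, Iterable, Optional


def seqret_lookup(database: str, entry: str) -> str:
    """Run seqret for one database; return FASTA text, or '' if none."""
    result = subprocess.run(
        ["seqret", f"{database}:{entry}", "fasta::stdout"],
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        text=True,
        check=False,
    )
    return result.stdout


def farm_lookup(reference: str,
                databases: Iterable[str],
                lookup: Callable[[str, str], str] = seqret_lookup
                ) -> Optional[str]:
    """Return the first hit for 'dbname:entry' among the given databases."""
    _, sep, entry = reference.partition(":")
    if not sep or not entry:
        raise ValueError(f"couldn't get accession after ':' from {reference!r}")
    for database in databases:
        sequence = lookup(database, entry)
        if sequence:
            return sequence  # stop at the first database with a match
    return None


# Example call (requires seqret and the named databases to be configured):
# print(farm_lookup("farm:X65923", ["dbase1", "dbase2", "dbase3"]))
```

As in the Perl version, the databases are tried in the order listed and the first one that returns a sequence wins, so the order of the list decides which copy of a duplicated entry is returned.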