4.2. Database Attributes

4.2.1. Introduction

There are a few things to consider when specifying attributes for a database:

  • Each database must have attributes that specify what it is and how to access it. This information is given as a set of key: value pairs. These attributes are held in the DB definition structure (see above).

  • The key: value pairs in a DB structure can be specified either on separate lines or separated by spaces on the same line.

  • If the value part of the attribute contains spaces then it should be quoted to prevent it being prematurely terminated at the first space. For example, key: "value with many words in".

  • The minimum set of attribute keys is method: and format: - these two are mandatory. It is also typical (but not mandatory) to specify the type: attribute.

  • Some forms of method: require subsidiary attributes giving further information on how to access the data.
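
The quoting and layout rules above can be illustrated with a short Python sketch. It is not the EMBOSS parser: the function name parse_attributes is hypothetical, and it assumes that every key ends in a colon and is separated from its value by white space.

```python
import shlex


def parse_attributes(body):
    """Parse the text between '[' and ']' of a DB definition into a dict.

    Illustrative only: assumes 'key: value' tokens separated by white space,
    with double quotes protecting values that contain spaces.
    """
    # shlex.split honours double quotes and treats spaces and newlines alike,
    # so pairs may share a line or sit on separate lines.
    tokens = shlex.split(body)
    if len(tokens) % 2 != 0:
        raise ValueError("attributes must come in key: value pairs")
    attrs = {}
    for key, value in zip(tokens[0::2], tokens[1::2]):
        if not key.endswith(":"):
            raise ValueError(f"expected 'key:', got {key!r}")
        attrs[key[:-1]] = value
    return attrs


body = '''
  type: N  format: embl
  method: emblcd
  comment: "All of EMBL"
'''
print(parse_attributes(body))
# {'type': 'N', 'format': 'embl', 'method': 'emblcd', 'comment': 'All of EMBL'}
```

Without the quotes, the comment value would end at the first space and the remaining words would be read as further keys and values.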

The available attributes are described below (Table 4.1, “Attributes used to Specify a Database”).

Table 4.1. Attributes used to Specify a Database
Key | Value | Description
method, methodall, methodentry, methodquery | srs, srsfasta, srswww, url, app, external, direct, emblcd, emboss, entrez, gcg, embossgcg, blast, dbfetch, mrs | Specifies the method used to access the database.
format, formatentry, formatquery, formatall | A valid sequence format name (see the EMBOSS Users Guide) | Specifies what sequence format to expect when reading entries from the database.
type | N or P | Specifies whether the database is nucleic or protein.
fields | One or more of: sv, des, org, key | Specifies which search fields have been indexed and are available for searching.
directory | Any valid directory path | Specifies the directory of files that have been specified with the filename: attribute. It also specifies the default directory of indexes and files produced by the dbi* and dbx* indexing programs (see indexdirectory:).
filename | A file name (may be wildcarded) or list of file names | Specifies the sequence file(s) to read in when accessing the database.
exclude | A file name (may be wildcarded) or list of file names | This is used to exclude a subset of files from consideration.
indexdirectory | Any valid directory path | Specifies the directory of index files (produced by the dbi* and dbx* programs) if this is different to the directory specified by directory:.
url | Any valid URL | Specifies the URL to use when getting sequences from remote Web sites.
httpversion | 1.0 or 1.1 | Specifies the HTTP protocol version to be used. Version 1.0 transmits the results in one block. Version 1.1 chunks data and is preferred for large data transfers. The default is 1.1.
proxy | host:port | In the access methods srswww and url, you can specify a proxy host and port to use when accessing the URL. If a proxy is globally defined, it can be bypassed for any database by specifying ":" as an empty value.
app, appentry, appquery, appall | Any script or program name | Specifies the name or command line of an external (i.e. non-EMBOSS) program or script (application) that should be run to extract the sequence from the database.
dbalias | The true name of a database | This is used to specify the name of a database at a site (e.g. SRS) where the name differs from the name that is given as the DBNAME. This allows the EMBOSS database definition to use another name (e.g. srsembl) or to specify a less obvious name when contacting the server (e.g. emblrelease).
caseidmatch | Used to flag databases that have case-sensitive identifiers | A boolean set to "Y" to define a database where identifiers can differ only in upper or lower case characters. An example is a sequence database derived from PDB entries where the chain identifiers 'a' and 'A' are not the same.
hasaccession | Used to flag databases that do not have access by accession number | A boolean set to "N" to define a database with no accession numbers (e.g. PDB used as a source of sequence data).
comment | Any text | A comment, usually to describe the database.
release | Any text | This is the release number or date.

4.2.2. Description of Attributes

4.2.2.1. method, methodall, methodentry, methodquery

This specifies the method used to access the database.

This field is mandatory - there must be at least one form of the method key specified. More than one different type of method key can be specified.

If method: is specified, then this is the default method covering all forms of access ('query', 'entry' or 'all'). Specific methods for the 'query', 'entry' or 'all' forms of access (i.e. methodquery:, methodentry: or methodall:) should be specified explicitly if you wish to have several ways of accessing the data e.g.

method: "emblcd"
methodall: "direct"
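
As a rough illustration of how the specific keys relate to the default, the following Python sketch looks up the method for one form of access and falls back to method: when no specific key is given. The function name resolve_method and the dictionary representation of the attributes are assumptions made for this example, not part of EMBOSS.

```python
def resolve_method(attrs, access):
    """Return the access method for one form of access.

    Illustrative only: 'attrs' is a dict of attribute keys (without the
    trailing colon) to values; 'access' is 'query', 'entry' or 'all'.
    """
    if access not in ("query", "entry", "all"):
        raise ValueError(f"unknown form of access: {access!r}")
    # A specific key such as methodall: wins; otherwise use method:.
    method = attrs.get("method" + access, attrs.get("method"))
    if method is None:
        raise KeyError(f"no method: or method{access}: attribute defined")
    return method


attrs = {"method": "emblcd", "methodall": "direct"}
print(resolve_method(attrs, "entry"))  # emblcd
print(resolve_method(attrs, "all"))    # direct
```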

4.2.2.2. format, formatentry, formatquery, formatall

The format: attribute specifies what sequence format to expect when reading entries from the database.

This attribute is mandatory. If you need to specify different formats for any of the different access methods (Section 4.3, “Database Access Methods”), then you may use the variants of format: with the suffix entry, query or all. An example of format is:

format: ncbi

4.2.2.3. type

This specifies whether the database is nucleic or protein.

Although it is not strictly required, it is normal to specify the type of the database as this should be known. If the type is not specified it will be determined by the EMBOSS applications when they read sequences in. (You will not get error messages when you run showdb as this doesn't read in sequences.) The value Nucleotide or N specifies a nucleic database; Protein or P specifies a protein database, e.g.

type: "Nucleotide"

4.2.2.4. fields

This specifies which search fields have been indexed and are available for searching.

It is assumed that Accession number and ID name are always available when a database is set up. Depending on how you set up the database, access by one or more of these fields might be possible:

sv - Sequence Version or GI Number
des - Description line
org - Organism's taxonomic classification
key - Keywords

The access methods srs, srsfasta and srswww allow access to these search fields. The methods emboss, emblcd and gcg may or may not have some or all of these fields indexed, depending on the parameters given to the programs dbxflat, dbxgcg, dbiflat and dbigcg. The programs dbxfasta, dbiblast and dbifasta only allow you to select any of sv, des and acc (the default). An example specification is:

fields: "sv des org key"

The use of these fields in searches is described elsewhere (see the EMBOSS Users Guide).

FASTA format has only an ID and a parsable description line. If accession numbers are not defined then set hasaccession: "N" to turn off the default attempt to include this field in searches. A common case is the PDB protein structure database when used as a source of sequences, as PDB has no accession number system.

4.2.2.5. directory

This specifies the directory of files that have been specified with the filename: attribute. It also specifies the directory of indexes and files produced by the dbx* or dbi* programs.

It is only required with the access methods (see Section 4.3, “Database Access Methods”):

emboss
direct
gcg
emblcd
blast

It is common to use variables (see the EMBOSS Users Guide) to specify part or all of the path:

directory: $dbdir/genomes

4.2.2.6. filename

This specifies the sequence file(s) to read in when accessing the database.

It is only required with the access method direct (see Section 4.3, “Database Access Methods”). It may also be used with the access methods:

emboss
gcg
emblcd
blast

to indicate which files should be included back in after using the exclude: attribute to specify which indexed files should be ignored. (See exclude: below). The files may be wildcarded using *. The attribute key filename: is commonly abbreviated to file: e.g.

file: pir*.seq

A list of file names may also be given; each name must be separated with a space or comma.

4.2.2.7. exclude

This is used to exclude a subset of files from consideration.

To exclude certain files, specify exclude: *file*. This is used in conjunction with filename: to specify a subset of files in a directory. exclude: is checked first, then files are included back in with filename:. The files searched are therefore:

  • the files in the directory specified by directory:

  • but not the exclude: files (if any)

  • but including the filename: files (if any)

For example:

exclude: mouse.*

If you have indexed all of the files in the EMBL database, then you can specify subsets using the same set of files and indexes as:

DB embl [
  type:     "N"
  format:   "embl"
  method:   "emblcd"
  dir:      "/data/embl"
  comment:  "All of EMBL"
]

DB emblminus [
  type:     "N"
  format:   "embl"
  method:   "emblcd"
  dir:      "/data/embl"
  exclude:  "est*.dat"
  comment:  "EMBL without the ESTs"
]

DB emblhumest [
  type:     "N"
  format:   "embl"
  method:   "emblcd"
  dir:      "/data/embl"
  exclude:  "*.dat"
  filename: "est_hum*.dat"
  comment:  "EMBL human ESTs"
]

DB human [
  type:     "N"
  format:   "embl"
  method:   "emblcd"
  dir:      "/data/embl"
  exclude:  "*.dat"
  filename: "hum*.dat"
  comment:  "EMBL human"
]
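
The selection rule used by these definitions can be sketched in Python. This is only a model of the behaviour described above, not EMBOSS code: the function names and the sample file names are hypothetical, and shell-style matching via fnmatch is assumed to be close enough to the EMBOSS wildcard rules for illustration.

```python
import re
from fnmatch import fnmatchcase


def split_patterns(value):
    """Split a space- or comma-separated list of file names or wildcards."""
    return [p for p in re.split(r"[,\s]+", value.strip()) if p]


def select_files(files, exclude="", filename=""):
    """Return the files that would be searched.

    A file is dropped if it matches exclude:, unless it also matches
    filename:, in which case it is included back in.
    """
    excluded = split_patterns(exclude)
    included = split_patterns(filename)
    selected = []
    for name in files:
        is_excluded = any(fnmatchcase(name, p) for p in excluded)
        is_included = any(fnmatchcase(name, p) for p in included)
        if not is_excluded or is_included:
            selected.append(name)
    return selected


# Hypothetical set of indexed files in /data/embl
indexed = ["est_hum1.dat", "est_mus1.dat", "hum1.dat", "mus1.dat"]

print(select_files(indexed))                                            # embl: all four files
print(select_files(indexed, exclude="est*.dat"))                        # emblminus: ['hum1.dat', 'mus1.dat']
print(select_files(indexed, exclude="*.dat", filename="est_hum*.dat"))  # emblhumest: ['est_hum1.dat']
print(select_files(indexed, exclude="*.dat", filename="hum*.dat"))      # human: ['hum1.dat']
```

Because the pattern must match the whole file name, hum*.dat does not pick up est_hum1.dat, which is why the human and emblhumest definitions select different files.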

4.2.2.8. indexdirectory

This specifies the directory of index files (produced by the dbx* or dbi* programs) if this is different to the directory specified by directory:.

For the dbi* applications it is sensible to hold the indexes in a different directory to the one holding the sequence database files when you have many sequence databases in the same directory. This is because the indices for every database all have the same names (acnum.hit, acnum.trg, division.lkp, etc.) and these would be over-written if you have indexed several databases in the same directory. In this case, you should create the indices in a different directory (often but not necessarily a subdirectory) for each database. That way the index files will not become confused. These index directories can be specified using the attribute indexdirectory:, while the directory containing the sequence data files can still be specified using dir:.

It is only used with the access methods (see Section 4.3, “Database Access Methods”):

emboss
gcg
emblcd
blast

It is common to use variables to specify part or all of the path. The attribute key indexdirectory: is commonly abbreviated to indexdir: e.g.

indexdir: $dbdir/genomes/embl

4.2.2.9. url

This specifies the URL to use when retrieving sequences from remote Web sites.

It is only required with the access methods (see Section 4.3, “Database Access Methods”):

srswww
url

The database (or the name specified in a dbalias attribute) and entry Accession number (or Sequence version, GI number, Description, Organism, or Keyword) can then be appended to create a functional SRS query line. Often it is only necessary to specify the remote wgetz application alone e.g.

url: "http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz"

The URL can also contain one or more instances of the character pair %s - each of these is replaced by the value of the ID name when this database is accessed. Any HTML formatting will be stripped from the resulting web page e.g.

url: "http://www.ebi.ac.uk/htbin/emblfetch?%s"
# or
url: "http://www.ncbi.nlm.nih.gov/htbin-post/Entrez/query?db=s&form=6&dopt=g&html=no&uid=%s"

The URL must begin with http:// and have a lower case host address.
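
A minimal Python sketch of the %s substitution and of the two constraints just stated follows. The function name build_entry_url and the entry identifier X65923 are placeholders chosen for illustration; the sketch does no URL escaping and is not how EMBOSS itself builds the request.

```python
def build_entry_url(template, entry_id):
    """Substitute an entry ID for every %s in a url: attribute value.

    Illustrative only: checks the two constraints from the text, then
    replaces each %s pair with the ID.
    """
    prefix = "http://"
    if not template.startswith(prefix):
        raise ValueError("URL must begin with http://")
    # The host is everything between the prefix and the next '/'.
    host = template[len(prefix):].split("/", 1)[0]
    if host != host.lower():
        raise ValueError("host address must be lower case")
    return template.replace("%s", entry_id)


print(build_entry_url("http://www.ebi.ac.uk/htbin/emblfetch?%s", "X65923"))
# http://www.ebi.ac.uk/htbin/emblfetch?X65923
```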

4.2.2.10. proxy

In the access methods srswww, mrs, entrez, dbfetch and url, you can specify a proxy host and port to use when accessing the URL.

For example:

proxy: "proxy.mydomain.com:8888"

If the global variable EMBOSS_PROXY is defined in the emboss.default file (see the EMBOSS Users Guide) then the attribute

proxy: ":"

will turn off proxy access for this database. This is useful if the database is on an internal server.
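
The interaction between the global setting and the per-database attribute can be modelled as follows. This Python sketch assumes that a proxy: attribute on the database takes precedence over EMBOSS_PROXY; the function name resolve_proxy and its return type are hypothetical.

```python
def resolve_proxy(db_proxy=None, global_proxy=None):
    """Return (host, port) for the proxy to use, or None for direct access.

    Illustrative only: 'db_proxy' is the database's proxy: value and
    'global_proxy' the value of EMBOSS_PROXY; either may be None if unset.
    """
    value = db_proxy if db_proxy is not None else global_proxy
    if value is None or value == ":":
        # No proxy configured, or proxy access explicitly turned off.
        return None
    host, sep, port = value.rpartition(":")
    if not sep or not host or not port.isdigit():
        raise ValueError(f"expected host:port, got {value!r}")
    return host, int(port)


print(resolve_proxy("proxy.mydomain.com:8888"))       # ('proxy.mydomain.com', 8888)
print(resolve_proxy(":", "proxy.mydomain.com:8888"))  # None
```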

4.2.2.11. httpversion

In the access methods srswww, mrs, entrez, dbfetch and url, you can specify the HTTP protocol version to use when accessing the URL. The default version 1.1 supports delivery of results in chunks. The older 1.0 protocol can only deliver all results in one block.

For example:

httpversion: "1.0"

If the global variable EMBOSS_HTTPVERSION is defined in the emboss.default file (see the EMBOSS Users Guide) then this will set a global default for all URL-based data access. The default is 1.1.

4.2.2.12. app, appentry, appquery, appall

This specifies the command line of an external (third party) application that should be run to extract a sequence from a database.

This application can be in the user's path or have an explicit path provided. The database and entry name will be appended to the application command as application dbname:entry. Both ID and Accession number can be used to specify the entry. Alternatively, if the app: attribute value contains the character pair %s, it is replaced by the value of the ID name or Accession number when this database is accessed.

This attribute is only required with the access method app (see Section 4.3, “Database Access Methods”). If you need to specify different applications for any of the different access methods, then you may use the variants of app: with the suffix entry, query or all. e.g.

app: efetch
# or
app: "getz [embl:%s]"
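
The two ways of passing the entry to the application can be sketched in Python. The function name build_app_command and the entry identifier X65923 are placeholders for this example, and the sketch only builds the command string; it does not run anything.

```python
def build_app_command(app, dbname, entry):
    """Build the command line for an app: attribute value.

    Illustrative only: if the value contains %s, the entry (ID or
    Accession number) replaces it; otherwise 'dbname:entry' is appended.
    """
    if "%s" in app:
        return app.replace("%s", entry)
    return f"{app} {dbname}:{entry}"


print(build_app_command("efetch", "embl", "X65923"))           # efetch embl:X65923
print(build_app_command("getz [embl:%s]", "embl", "X65923"))   # getz [embl:X65923]
```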

4.2.2.13. dbalias

This is used to specify the name of a database at a (e.g. SRS) site where the name differs from the DBNAME.

It is only required with the access methods (see Section 4.3, “Database Access Methods”):

mrs
mrs3
srswww
srsfasta
srs

e.g.

dbalias: emblnew

4.2.2.14. comment

This is a comment to describe the database.

It is displayed in showdb e.g.

comment: "This is my subset of refseq"

4.2.2.15. release

This is the release number or date.

It is displayed in showdb.

Caution

Unless you are zealous in updating release: values, this will rapidly become out of synch with the actual data.

The dbx* and dbi* indexing programs ask for the database name, release number and index date. These are stored in the index files. This information is not available to EMBOSS programs and is not reported by showdb. They are part of the index file formats, but EMBOSS does not currently make use of them.

release: "21.0 (Oct 2009)"

4.2.2.16. hasaccession

This turns off attempts to read data by accession number.

Most sequence databases follow the example set by the major public protein and nucleotide resources by providing unique accession numbers. Where these are not available the accession number search can be disabled by defining

hasaccession: "N"

4.2.2.17. caseidmatch

This makes identifier tests case-sensitive.

Most sequence databases attach no significance to upper or lower case for identifiers. In a few cases, especially in site-specific local data, there may be a distinction between two otherwise identical names. An early example was a database of sequences derived from PDB where the chain name 'a' or 'A' in the identifier was significant.

caseidmatch: "Y"