4.5. Database Indexing

4.5.1. Introduction

To gain experience in database indexing under EMBOSS, you can practice with the example databases included in the EMBOSS distribution. These include:

  • test/data

  • test/embl

  • test/pir

  • test/swiss

  • test/swnew

  • test/wormpep

You can reindex these files using the dbx* or the dbi* programs.

The dbx* applications are preferred.

4.5.2. Resource Definitions, Cachesize and Pagesize

The dbx* programs require two variables to be set in the emboss.default file and at least one Resource Definition to be present. In contrast, the dbi* programs do not require these definitions.

For example:

SET PAGESIZE 2048
SET CACHESIZE 200

RES embl 
[
  type: Index
  idlen:  15
  acclen: 15
  svlen:  15
  keylen: 25
  deslen: 25
  orglen: 25
]

The dbx* applications buffer disc pages in order to improve performance. The PAGESIZE should usually be set to the size, in bytes, that your operating system uses to buffer disc pages, though the value is not critical. The CACHESIZE should be set to the number of such pages that you wish to be cached. The values of 2048 and 200 given above are good general purpose ones. We recommend a CACHESIZE greater than 100.
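
As a rough illustration of how the two settings combine, the Python sketch below reports the memory page size of the local system and the amount of memory implied by a given PAGESIZE and CACHESIZE. It assumes the cache footprint is simply the product of the two values, and it uses the memory page size only as a proxy for the disc buffering size mentioned above. The function name is illustrative and not part of EMBOSS.

```python
import mmap


def cache_bytes(pagesize: int, cachesize: int) -> int:
    """Bytes implied by caching `cachesize` pages of `pagesize` bytes each."""
    return pagesize * cachesize


if __name__ == "__main__":
    # mmap.PAGESIZE is the memory page size reported by the operating system.
    print("System page size:", mmap.PAGESIZE, "bytes")
    print("PAGESIZE 2048 x CACHESIZE 200 =", cache_bytes(2048, 200), "bytes")
```

For the values given above, this works out at 409600 bytes (400 KiB) of cached pages.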

You should have at least one Resource (RES) definition in your emboss.default file, though we recommend having one per database you wish to index. The dbx* programs will ask for the name of a RES entry when they run. Each definition has a compulsory type: Index attribute followed by a length attribute for each field that can be indexed. Each length is the maximum number of characters kept for that field; longer values are truncated. Truncation of ID keys should be avoided as it can lead to duplicate IDs being indexed. Set the idlen, acclen and svlen attributes a little larger than the longest value you expect in the source file. Values for keylen, deslen and orglen are more a matter of preference.
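
One way to choose idlen and acclen is to measure the longest ID and accession number actually present in the source file. The Python sketch below does this for EMBL-format lines. It assumes the usual two-letter line codes (ID and AC) followed by three spaces, and that accession numbers are separated by semicolons. The function name is illustrative only.

```python
from typing import Dict, Iterable


def max_field_lengths(lines: Iterable[str]) -> Dict[str, int]:
    """Return the longest ID and accession seen in EMBL-format lines."""
    longest = {"id": 0, "acc": 0}
    for line in lines:
        if line.startswith("ID   "):
            # The entry name is the first token after the line code;
            # strip a trailing ';' if the release uses one.
            tokens = line[5:].split()
            if tokens:
                longest["id"] = max(longest["id"], len(tokens[0].rstrip(";")))
        elif line.startswith("AC   "):
            # An AC line may carry several accession numbers.
            for acc in line[5:].split(";"):
                acc = acc.strip()
                if acc:
                    longest["acc"] = max(longest["acc"], len(acc))
    return longest


if __name__ == "__main__":
    sample = [
        "ID   X56734; SV 1; linear; mRNA; STD; PLN; 1859 BP.",
        "AC   X56734; S46826;",
    ]
    print(max_field_lengths(sample))
```

To scan a real file, pass an open file handle instead of the sample list, then set idlen and acclen somewhat above the reported values.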

4.5.3. Indexing and Configuration

4.5.3.1. Flatfile Databases

Flatfile databases are plain text files in a defined format, such as those released by EMBL and GenBank. The EMBOSS program dbxflat is used to generate EMBOSS indexes that can be used for all types of database access. The dbiflat application can also be used but cannot cope with large source database files (greater than 2 GB) or with duplicate IDs or accession numbers.

dbxflat (and the EMBOSS access method) requires the databases to be uncompressed. The examples given here will not probe the deeper secrets of dbxflat (for which the reader is referred to the application documentation, or failing that the source code) but will show a typical installation for a common database.

We assume that EMBOSS has been installed and works. This can be tested with the command:

wossname -auto

which should list all the programs available.

In this example you will index and configure the EMBL database for use with EMBOSS. First download and unpack the EMBL database. This will require a considerable amount of disc space. If you do not have sufficient space available then just download a subset of the database. Use cd to move to the directory in which you have unpacked EMBL. The directory should look something like this when you run ls:

% ls
.
rel_est_fun_01_r98.dat
rel_est_fun_02_r98.dat
rel_est_fun_03_r98.dat
.
Output truncated
.
wgs_cabc_pro.dat
wgs_cabd_mam.dat
wgs_cabe_fun.dat

Run dbxflat to create the EMBOSS indices. This assumes you have set up a RES definition and cache and page sizes as described above.

% dbxflat

Index a flat file database using b+tree indices
Basename for index files: embl
Resource name: embl
      EMBL : EMBL
     SWISS : Swiss-Prot, SpTrEMBL, TrEMBLnew
        GB : Genbank, DDBJ
    REFSEQ : Refseq
Entry format [SWISS]: EMBL   
Wildcard database filename: *.dat
Database directory [.]: .
        id : ID
       acc : Accession number
        sv : Sequence Version and GI
       des : Description
       key : Keywords
       org : Taxonomy
Index fields [id,acc]: id,acc
General log output file [outfile.dbxflat]: embllog.dbxflat

dbxflat should happily chug away for some considerable time (depending on the speed of your machine) and will generate (eventually) the following index files:

% ls
embl.ent
embl.xid
embl.xac
embl.pxid
embl.pxac
embllog.dbxflat
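
A quick way to confirm that indexing completed is to check for the files listed above. The Python sketch below does this for the id and acc fields only, using the suffixes seen in that listing. The function names are illustrative, and suffixes for other index fields are deliberately left out because they are not shown here.

```python
from pathlib import Path
from typing import List

# Suffixes taken from the listing above (id and acc fields only).
FIELD_SUFFIXES = {"id": ("xid", "pxid"), "acc": ("xac", "pxac")}


def expected_index_files(basename: str, fields: List[str]) -> List[str]:
    """Names of the index files expected for the given basename and fields."""
    names = [f"{basename}.ent"]
    for field in fields:
        for suffix in FIELD_SUFFIXES[field]:
            names.append(f"{basename}.{suffix}")
    return names


def missing_index_files(directory: str, basename: str, fields: List[str]) -> List[str]:
    """Expected index files that are not present in `directory`."""
    root = Path(directory)
    return [
        name
        for name in expected_index_files(basename, fields)
        if not (root / name).is_file()
    ]


if __name__ == "__main__":
    print(missing_index_files(".", "embl", ["id", "acc"]))
```

An empty list means all five index files from the example were found in the current directory.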

Now create an entry in the EMBOSS configuration files to access the database. It is probably a good idea to try a new database definition in your local configuration file first. Put the following entry in your .embossrc:

DB embl 
[
   type:      "Nucleotide"
   method:    "emboss"
   format:    "embl"
   directory: "$emboss_db_dir/embl"
   filename:  "*.dat"
   release:   "98.0"
   comment:   "EMBL release 98.0"
]

You will need to have defined $emboss_db_dir somewhere in your emboss.default or .embossrc using a directive such as:

set emboss_db_dir /path_to_databases

Save .embossrc and try running showdb. You should see a line that looks like:

% showdb
.. output deleted
embl          N    OK  OK  OK  EMBL release 98.0
.. output deleted

4.5.3.1.1. Fine Tuning the Installation

It can be a good idea to set up subsections of the database so that end-users can search just the regions they wish to search. This section applies to all access methods (Section 4.3, “Database Access Methods”) that use EMBOSS style indexes and to others as well (e.g. EMBLCD).

Files can be included with the declaration:

filename:

or excluded with the declaration:

exclude:

To select just the EST files in our EMBL database, try the following:

DB emblest 
[
   type:      "Nucleotide"
   method:    "emboss"
   format:    "embl"
   directory: "$emboss_db_dir/embl"
   filename:  "rel_est*.dat"
   release:   "98.0"
   comment:   "EMBL release 98.0"
]

Files can also be given as a space-separated list enclosed in quotes. For example, to set up a database of all mammalian sequences (except genomes) try the following:

DB emblallmam 
[
   type:     "Nucleotide"
   method:   "emboss"
   format:   "embl"
   directory: "$emboss_db_dir/embl"
   filename:  "rel_std_rod*.dat rel_std_mus*.dat rel_std_hum*.dat rel_std_mam*.dat"
   release:  "98.0"
   comment:  "EMBL release 98.0"
]

As you can see from these two examples, the filename: tag takes a space-delimited list of filenames, enclosed in quotes, that can contain the normal wildcard characters (? and *). It can be quite tedious to set up a long list of files to search. In many cases you can use the exclude: tag to make things easier:

DB emblnoest 
[
   type:      "Nucleotide"
   method:    "emboss"
   format:    "embl"
   directory: "$emboss_db_dir/embl"
   filename:  "*.dat"
   exclude:   "rel_est*.dat"
   release:   "98.0"
   comment:   "EMBL release 98.0"
]

This configures the emblnoest database to contain all of EMBL except the ESTs.
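
To preview which files a given filename: and exclude: combination would select, the selection rule can be approximated in Python. This is only a sketch: it uses the standard fnmatch module, whose wildcard rules may differ in detail from those EMBOSS applies, and the function name is illustrative.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List


def select_files(names: Iterable[str], filename: str, exclude: str = "") -> List[str]:
    """Keep names matching any filename: pattern and no exclude: pattern."""
    include_patterns = filename.split()  # space-delimited list, as in the DB entry
    exclude_patterns = exclude.split()
    selected = []
    for name in names:
        included = any(fnmatchcase(name, p) for p in include_patterns)
        excluded = any(fnmatchcase(name, p) for p in exclude_patterns)
        if included and not excluded:
            selected.append(name)
    return selected


if __name__ == "__main__":
    names = [
        "rel_est_fun_01_r98.dat",
        "rel_est_fun_02_r98.dat",
        "wgs_cabc_pro.dat",
        "wgs_cabd_mam.dat",
    ]
    # Equivalent of the emblnoest definition.
    print(select_files(names, "*.dat", "rel_est*.dat"))
```

With the file names shown, the emblnoest patterns keep the two wgs_* files and drop the two rel_est_* files.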

4.5.3.2. GCG Format Databases

EMBOSS can access GCG formatted databases, thus avoiding having multiple copies of the same databases in different formats for those who still use GCG alongside the flatfiles. EMBOSS creates b+tree indices for the GCG format databases using the program dbxgcg. This runs in much the same way as dbxflat. You will need the GCG format .seq and .ref files in order to create an EMBOSS indexed database.

Move to the GCG database directory containing your data and run dbxgcg:

% dbxgcg
Index a GCG formatted database
Basename for index files: emblgcg
Resource name: embl
EMBL : EMBL
SWISS : Swiss-Prot, SpTrEMBL, TrEMBLnew
GENBANK : Genbank, DDBJ
PIR : NBRF
Entry format [SWISS]: embl
Database directory [.]: 
Wildcard database filename [*.seq]: *.seq
Wildcard database filename [*.seq]: 
        id : ID
       acc : Accession number
        sv : Sequence Version and GI
       des : Description
       key : Keywords
       org : Taxonomy
Index fields [id,acc]: 
General log output file [outfile.dbxgcg]: emblgcglog.dbxgcg

When dbxgcg prompts for the entry format:

Entry format [SWISS]:

you should enter the format the database was in before you ran embltogcg (or a similar program) to generate the GCG databases. The program will run for a while and will then generate the EMBOSS index files for the GCG format database.

The following entry should be put in your .embossrc file:

DB gcgembl 
[
   type:      "Nucleotide"
   method:    "embossgcg"
   format:    "embl"
   directory: "$emboss_db_dir/embl"
   filename:  "*.seq"
   release:   "98.0"
   comment:   "EMBL release 98.0"
]

showdb should show your newly configured database.

You can configure subsets of the databases in the same way as for the original format databases, as described above. One difference from dbxflat indexing is that both the .seq and .ref files are listed in the [database].ent file. The filename: and exclude: directives should therefore be of the form:

exclude: */rel_est*

instead of just:

exclude: */rel_est*.seq

4.5.3.3. BLAST Databases

BLAST format databases are generated for efficient homology searching using the BLAST programs. It can be convenient to avoid redundant copies of databases so EMBOSS provides a mechanism for accessing these databases.

BLAST format databases are those generated using the tools distributed with NCBI-BLAST or with WU-BLAST.

To index a single BLAST database, move to the directory containing your BLAST format databases and run dbiblast:

% dbiblast
Index a BLAST database
Database name: blastsw
Database directory [.]: 
database base filename [blastsw]: 
Release number [0.0]: 
Index date [00/00/00]: 
         N : nucleic
         P : protein
         ? : unknown
Sequence type [unknown]: p
         1 : wublast and setdb/pressdb
         2 : formatdb
         0 : unknown
Blast index version [unknown]: 2

The program will run for a while and will then generate the EMBLCD index files for the BLAST format database.

The following entry (or one like it that is more appropriate to your particular installation) should be put in your .embossrc file:

DB blastsw 
[
   type:      "Protein"
   method:    "blast"
   format:    "ncbi"
   directory: "$emboss_db_dir/blastsw"
   filename:  "blastsw"
   release:   "38.9"
   comment:   "BLAST format Swissprot"
]

showdb should show your newly configured database.

Because of the way BLAST works, many sites group their BLAST databases in the same directory. You can index these in situ with dbiblast, but this may require some extra steps if your databases are not of the same type, because generating a new set of index files will overwrite those that already exist. To avoid overwriting index files you can index many databases with one set of index files, or you can use the -indexdir option to place the indexes in a different directory.

There are two requirements for indexing several databases together in one index. The first is that the databases are the same type (protein/nucleic acid) and generated with the same tool (pressdb or formatdb); the second is that all the ID and accession numbers in the combined databases are unique.
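
The second requirement can be checked before indexing. The Python sketch below takes the identifiers from each database and reports any that occur more than once across the combined set. How the identifiers are extracted from the BLAST databases is left open; the function name and the sample identifiers are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def duplicate_ids(databases: Dict[str, Iterable[str]]) -> Dict[str, List[str]]:
    """Map each identifier seen more than once to the databases it came from."""
    seen: Dict[str, List[str]] = defaultdict(list)
    for dbname, identifiers in databases.items():
        for identifier in identifiers:
            seen[identifier].append(dbname)
    return {ident: dbs for ident, dbs in seen.items() if len(dbs) > 1}


if __name__ == "__main__":
    # Placeholder identifiers; real ones would come from the source databases.
    combined = {
        "dbone": ["P1", "P2"],
        "dbtwo": ["P2", "P3"],
    }
    print(duplicate_ids(combined))
```

An empty result suggests the databases can share one index; any entry in the result names an identifier that would clash.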

Run dbiblast as before but specify all the databases you wish to be included when prompted for the database filename:

% dbiblast
Index a BLAST database
Database name: alldbs
Database directory [.]: 
database base filename [alldbs]: dbone dbtwo dbthree dbfour 
Release number [0.0]: 
Index date [00/00/00]: 
         N : nucleic
         P : protein
         ? : unknown
Sequence type [unknown]: p
         1 : wublast and setdb/pressdb
         2 : formatdb
         0 : unknown
Blast index version [unknown]: 2

These can then be configured by using the filename: and exclude: tags as appropriate.

When you have databases of different types, generated with different programs, or where the ID/accession numbers are duplicated between databases, the preferred strategy is probably to keep the source data for the individual databases in separate directories and index them there.

Alternatively, you can place the index files in a separate directory. This requires that you run dbiblast with the -indexdir option and set the indexdirectory: tag in the database configuration to point to the correct directory.

The example below illustrates database configuration using the -indexdir option:

% dbiblast -indexdir /databases/indices/mydb
Index a BLAST database
Database name: mydb
Database directory [.]: 
database base filename [mydb]: 
Release number [0.0]: 
Index date [00/00/00]: 
         N : nucleic
         P : protein
         ? : unknown
Sequence type [unknown]: p
         1 : wublast and setdb/pressdb
         2 : formatdb
         0 : unknown
Blast index version [unknown]: 2

The corresponding entry in .embossrc or emboss.default would look like:

DB mydb 
[
   type:           "Protein"
   method:         "blast"
   format:         "ncbi"
   directory:      "$emboss_db_dir/blastsw"
   indexdirectory: "/databases/indices/mydb"
   filename:       "mydb"
   release:        "1.0"
   comment:        "My BLAST DB with an index in a different directory"
]

Again, multiple indexes cannot coexist in the same directory so care should be taken when using the -indexdir option that an existing database index is not overwritten.

4.5.3.4. FASTA Databases

The FASTA specification defines a sequence record only as a header line beginning with > followed by lines containing the sequence. The header line can take a seemingly infinite number of forms, several of which can be processed by EMBOSS. EMBOSS attempts to determine the accession number and/or ID for each sequence. For indexing purposes there is no semantic difference between an accession number and an ID. In the real world, accession numbers should be immutable (they do not change with subsequent releases of the database), whereas IDs may change.

One of the programs that can be used to process FASTA format databases is dbxfasta. It can recognise the following header line formats, specified on the command line:

simple. 

>id ...

idacc. 

>id accno ...

gcgid. 

>db:id ...

gcgidacc. 

>db:id acc ...

dbid. 

>db id ...

ncbi. 

>...[|accno]|id ...

Other header formats will not be recognised by dbxfasta and will cause indexing and/or database lookup to fail. If you have a header format that dbxfasta cannot yet handle you have two options:

  1. (The preferred option) Get a C programmer to modify the source code for dbxfasta and recompile. If you are a community-spirited person you will also contribute these changes to the main EMBOSS source tree. (email emboss-dev@emboss.open-bio.org for more information on contributing changes to the EMBOSS source code and/or read the EMBOSS developers documentation)

  2. (The quick hack) Write a custom script (using e.g. BioPerl http://www.bioperl.org) to access your database and use method: external to configure it. This is less desirable as you may be limited in the access modes you can use.

To index a FASTA format database, run dbxfasta:

% dbxfasta
Index a fasta file database using b+tree indices
Basename for index files: mydb
Resource name: myresdef
    simple : >ID
     idacc : >ID ACC or >ID (ACC)
     gcgid : >db:ID
  gcgidacc : >db:ID ACC
      dbid : >db ID
      ncbi : | formats
ID line format [idacc]: idacc
Database directory [.]: 
Wildcard database filename [*.dat]: mydb.fasta
        id : ID
       acc : Accession number
        sv : Sequence Version and GI
       des : Description
Index fields [id,acc]: id,acc
General log output file [outfile.dbxfasta]: mydb.dbxfasta

dbxfasta will run for a while and will produce the index files. You can use the same -indexdir option as for dbxflat, dbxgcg and dbiblast to place the indexes in a different directory.
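
Before indexing, it can help to check which of the header styles offered at the ID line format prompt your file actually follows. The Python sketch below approximates five of them with regular expressions. It is not the parsing code used by dbxfasta, the ncbi style is left out, and the identifiers in the example are invented.

```python
import re
from typing import Dict, Optional

# One regular expression per header style; 'ncbi' is intentionally omitted.
HEADER_PATTERNS = {
    "simple": re.compile(r"^>(?P<id>\S+)"),
    # idacc accepts both '>ID ACC' and '>ID (ACC)'.
    "idacc": re.compile(r"^>(?P<id>\S+)\s+\(?(?P<acc>[^\s()]+)\)?"),
    "gcgid": re.compile(r"^>(?P<db>[^\s:]+):(?P<id>\S+)"),
    "gcgidacc": re.compile(r"^>(?P<db>[^\s:]+):(?P<id>\S+)\s+(?P<acc>\S+)"),
    "dbid": re.compile(r"^>(?P<db>\S+)\s+(?P<id>\S+)"),
}


def parse_header(line: str, style: str) -> Optional[Dict[str, str]]:
    """Return the fields found in a header line, or None if it does not fit."""
    match = HEADER_PATTERNS[style].match(line)
    return match.groupdict() if match else None


if __name__ == "__main__":
    print(parse_header(">ABC_HUMAN P12345 example protein", "idacc"))
    print(parse_header(">sw:ABC_HUMAN P12345 example protein", "gcgidacc"))
    print(parse_header(">ABC_HUMAN example protein", "gcgid"))
```

A result of None for most header lines in a file suggests that the chosen style does not match it.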

Place (e.g.) the following entry in your .embossrc:

DB mydb 
[
    type:      "Protein"
    method:    "emboss"
    format:    "fasta"
    directory: "$emboss_db_dir/mydb"
    filename:  "mydb.fasta"
    comment:   "My database"
]

format: should be dbid, ncbi or fasta (the latter for every format except dbid or ncbi). The same filename: and exclude: tags can be used as for the other database indexing programs.

4.5.3.5. Other Databases

Many institutions may have local databases set up in their own Laboratory Information Management System. EMBOSS provides a simple mechanism for interfacing with such systems.

As long as a program is available that can be called non-interactively and returns the specified sequence on standard output, EMBOSS can interface with it. Set method: app and give the command that runs the program in the app: tag. The ID given in the USA (Uniform Sequence Address) will be appended to that command. It is often best to specify the available access modes using the method subsets methodall:, methodquery: and methodsingle: rather than the generic method: tag.
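
A program of the kind described above can be very small. The Python sketch below takes the requested ID as its last command-line argument and writes the matching sequence to standard output in FASTA format. The in-memory dictionary stands in for a real local system, and all names (including the sample IDs and sequences) are placeholders rather than anything defined by EMBOSS.

```python
import sys
from typing import Dict, Optional

# Stand-in for a local LIMS; a real script would query the site's own system.
SEQUENCES: Dict[str, str] = {
    "seq1": "ACGTACGTAC",
    "seq2": "TTGACCATGA",
}


def fetch_fasta(identifier: str) -> Optional[str]:
    """Return the FASTA record for `identifier`, or None if it is unknown."""
    sequence = SEQUENCES.get(identifier)
    if sequence is None:
        return None
    return f">{identifier}\n{sequence}\n"


def emit(identifier: str) -> int:
    """Write the record to standard output; return a shell-style exit status."""
    record = fetch_fasta(identifier)
    if record is None:
        print(f"{identifier}: not found", file=sys.stderr)
        return 1
    sys.stdout.write(record)
    return 0


if __name__ == "__main__" and len(sys.argv) > 1:
    # The ID is appended to the configured command, so take the last argument.
    sys.exit(emit(sys.argv[-1]))
```

Errors go to standard error so that standard output contains only sequence data.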

4.5.4. Configuring EMBOSS to use SRS for Database Look-up

SRS is a powerful database querying system that can cross reference different databases, launch applications etc. SRS can be run either through a web interface (see the description of the SRSWWW method above for an example) or via the command line program getz. Indexing and configuring databases for SRS is not described here, just how to connect to preconfigured and indexed SRS databases. If getz is already within the scope of your PATH environment variable then insert the following (or similar) into your .embossrc file:

DB emblgetz
[
    type:    "N"
    method:  "srs"
    release: "98"
    format:  "embl"
    comment: "EMBL using getz"
    dbalias: "embl"
    app:     "getz"
]

This will provide access to the SRS database embl as emblgetz:acc. If the SRS database has a different name from the DBNAME (as is the case here) then the dbalias: tag should be used to access the correct SRS database.

This configuration can be extremely slow for the all access mode. It is probably a better idea to set up the database as follows:

DB emblgetz
[
    type:        "Nucleotide"
    methodquery: "srs"
    release:     "98"
    format:      "embl"
    comment:     "EMBL using getz"
    dbalias:     "embl"
    app:         "getz"
    methodall:   "direct"
    filename:    "*.dat"
    directory:   "$emboss_db_dir/embl"
]

This will use method: srs for the query access mode but will use method: direct for the all access mode, thus speeding up reading of the whole database.

The SRSFASTA access method is identical to the normal SRS method except that it returns the sequence in FASTA format and so does not need a format: tag.

4.5.5. Size of the dbx* Indexes

You might notice that the index files produced by the dbx* applications can be very large. This is normal and is a consequence of three things: a tree structure is used, the tree is not tightly packed, and 64-bit pointers are used throughout. The tree structure allows on-the-fly updating of the index, the loose packing speeds up construction and updating, and the 64-bit pointers allow very large files to be addressed. Another consideration is that, in some cases, the indexes are trees of trees so that duplicate codes (e.g. keywords) can be indexed.