6.6. The Uniform Sequence Address (USA)

6.6.1. Introduction

The Uniform Sequence Address (USA) is a standard sequence naming scheme used by all EMBOSS applications. Typically, one or more sequences are read from a file or from a larger database. However, other sources such as an application or web server can be specified in a USA (Section 6.6, “The Uniform Sequence Address (USA)”).

A USA specifies:

  • The sequence format to expect

  • The file or database to open

  • The entry or entries to read

The general format of a USA specification is:

Format :: FileName or DatabaseName : Entry

where Format is the database format of a file of sequences (FileName) or installed database (DatabaseName) you have provided and Entry is the database entry code.

Only FileName or DatabaseName is strictly necessary. If the expected format is omitted then EMBOSS will attempt parsing with a carefully organised list of supported formats (Section A.1, “Supported Sequence Formats”) until one succeeds. If the database entry code is omitted, then all of the entries in the file or database are read.

Here are some common variants of USAs:

FileName
Format::FileName
Format::FileName:Entry
DatabaseName
DatabaseName:Entry
@ListFileName

ListFileName is the name of a listfile which itself can contain a list of valid USAs. The :: and : syntax is to allow, for example, "embl" and "pir" to be both database names and sequence formats.

In the following examples, AccessionNumber is the sequence's accession number in the database, and DatabaseId is its identifier:

Filename
myfile.seq
DatabaseName:AccessionNumber
embl:X65923
DatabaseName:DatabaseId
swissprot:opsd_xenla
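
The Format::Name:Entry pattern described in this introduction can be illustrated with a small Python sketch. This is not how EMBOSS itself parses a USA; the function name parse_usa is hypothetical, and the sketch deliberately ignores listfiles, search fields, subsequence specifiers and piped applications, all of which are covered later in this section.

```python
def parse_usa(usa):
    """Split a simple USA into (format, source, entry).

    Only the basic 'Format::Name:Entry' shape is handled. Missing
    parts are returned as None.
    """
    fmt = None
    rest = usa
    # '::' separates the optional format from the rest of the USA.
    if "::" in rest:
        fmt, rest = rest.split("::", 1)
    # A single ':' separates the file or database name from the entry.
    source, sep, entry = rest.partition(":")
    return fmt, source, (entry if sep else None)
```

For example, parse_usa("embl:X65923") gives (None, "embl", "X65923"), and parse_usa("myfile.seq") gives (None, "myfile.seq", None), which corresponds to reading every entry in the file.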

6.6.2. USA Syntax

The full command line syntax of the possible USAs is given below. Whitespace has been added for clarity but would not be used on the command line:

  • asis :: Sequence [start : end : reverse]

  • Format :: @ ListFileName [start : end : reverse]

  • Format :: list : ListFileName [start : end : reverse]

  • Format :: DatabaseName : Entry [start : end : reverse]

  • Format :: DatabaseName-SearchField : Word [start : end : reverse]

  • Format :: FileName : Entry [start : end : reverse]

  • Format :: FileName : SearchField : Word [start : end : reverse]

  • Format :: ProgramName ProgramParameters | [start : end : reverse]

The tokens (Sequence, Format etc.) are described below.

6.6.2.1. Sequence

Sequence is an explicit sequence in either upper or lower case, for example:

atgctgacgatgcg
TPRPGKNTEARLNCF
etc.

6.6.2.2. Format

Format must be a name of one of the valid sequence formats (Section A.1, “Supported Sequence Formats”).

The sequence format may usually be omitted when reading in a sequence; EMBOSS will try most known sequence formats until it can read the sequence.

6.6.2.3. ListFileName

ListFileName is the name of a listfile: a file of USAs with one USA per line. Either @ or list: is required before the name of the listfile to indicate that it is a listfile. Listfiles may be nested (a listfile may contain the USA of another listfile).

Where the sequence specification [start : end : reverse] is used, then all the USAs in the listfile are affected, unless these USAs have their own [start : end : reverse] specifier in which case that given on the command line is overridden.

This also holds true where the sequence is specified with the -sbegin or -send or any other command line qualifier (Section 6.4, “Datatype-specific Command Line Qualifiers”) which affects the input sequence: all USAs in the listfile are affected unless they have their own sequence specification.

6.6.2.4. DatabaseName

DatabaseName must be a valid database name as defined in the EMBOSS configuration files (Section 2.8, “Maintenance”).

If the name is not a valid database, a file with the same name is looked for instead. Database names may have Search Field names appended to them (for example embl-des, embl-id) (see below).

6.6.2.5. FileName

FileName is a filename which can be wildcarded.

6.6.2.6. Entry

Entry specifies the ID name or accession number of one or more sequences in a database or file. If it is omitted, then all the entries in the database or file will be read. Entry may be wildcarded. For example, hs* will match all ID names starting with hs, and * indicates that all entries in the database or file will be read.

There may be restrictions on certain databases preventing access to a single entry, wildcarded entries or reading in all entries. This is a consequence of the way some databases are accessed. The restrictions are given in the database definition (see the EMBOSS Administrators Guide).

A database or file location must be given as part of a USA that has an Entry; you cannot give an entry name on its own, i.e. you cannot give just an accession number or ID name and expect EMBOSS to deduce that it is indeed an accession number or ID name and to which database it might refer.

6.6.2.7. SearchField

SearchField is the name of one of the available search fields shown in the table (Table 6.3, “Sequence Retrieval Search Fields”).

Table 6.3. Sequence Retrieval Search Fields
Name  Search Field
acc   Accession number
des   Description
id    ID name
key   Keyword
org   Organism name
sv    Sequence version/GI number

6.6.2.8. Word

Word is the keyword to search for in the search field. Words may be wildcarded.

Words in ORG and KEY fields may contain spaces because the complete key-phrase or organism classification level (the text field (including spaces) between the semicolons (;) delimiting sections of these fields) is indexed as one 'word'.

Words in the DES field contain only alphanumeric characters and thus end at spaces or other non-alphanumeric characters.

The words in ID and ACC fields are equivalent to Entry above.

6.6.2.9. Program and ProgramParameters

Program is the name of a sequence retrieval application in the current path. ProgramParameters are any parameters it takes in order to specify one or more entries.

6.6.2.10. [start : end : reverse]

Any USA may optionally take a subsequence specifier after the main body of the USA in one of the following forms:

[start : end]
[start : end : r]

Where start and end are the required start and end positions. Negative positions count from the end of the sequence. Zero values for start and end stand for the default values, i.e. position 1 and the length of the sequence respectively.

Use of the USA subsequence specifier is equivalent to using the -sbegin, -send and -sreverse command line qualifiers. For more information see Section 6.4, “Datatype-specific Command Line Qualifiers”.
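
The position rules above (zero means the default, negative values count back from the end) can be sketched in Python as follows. The function name resolve_range is hypothetical and the code only illustrates the arithmetic; it is not taken from EMBOSS.

```python
def resolve_range(start, end, length):
    """Turn USA-style start/end values into 1-based inclusive positions.

    0 stands for the default (position 1 for start, the sequence
    length for end); a negative value counts from the end of the
    sequence, so -1 is the last position.
    """
    def resolve(position, default):
        if position == 0:
            return default
        if position < 0:
            return length + position + 1
        return position

    return resolve(start, 1), resolve(end, length)
```

For a sequence of 100 residues, resolve_range(-10, -1, 100) gives (91, 100), i.e. the last ten residues, and resolve_range(0, 0, 100) gives (1, 100), i.e. the whole sequence.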

6.6.3. Specifying the Format

The format, if specified, goes right at the start of the USA. For example:

Format :: DatabaseName : Entry
Format :: FileName : Entry

The sequence format can be any of those supported by EMBOSS (Section A.1, “Supported Sequence Formats”).

If the format is omitted from the USA, EMBOSS will check supported formats, in a carefully defined order, until the sequences are read successfully. Therefore it's not usually necessary to specify the format, although the application may run faster if you do as the tests will not need to be performed.

It's never necessary to specify the format of entries in a sequence database. All databases must be defined in the EMBOSS configuration files (Section 2.8, “Maintenance”) and the definitions include the format of the database.

The one case where it is recommended to specify the format is for sequence input in "plain" format, i.e. just the sequence without annotation, title or comments. This is because some variations of "plain" format may not otherwise be recognised by EMBOSS. If a format is not recognised, the application will fail with an informative error message.
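
The idea of trying formats in a fixed order until one succeeds can be sketched in Python. This is a toy illustration only: the two parsers, their order and all names (parse_fasta, parse_plain, PARSERS, read_sequence) are invented for the example and do not reflect the formats, order or code that EMBOSS actually uses.

```python
def parse_fasta(text):
    """Toy FASTA parser: a '>' title line followed by sequence lines."""
    lines = text.strip().splitlines()
    if not lines or not lines[0].startswith(">"):
        raise ValueError("not FASTA")
    return "".join(line.strip() for line in lines[1:])


def parse_plain(text):
    """Toy 'plain' parser: letters only, whitespace ignored."""
    sequence = "".join(text.split())
    if not sequence.isalpha():
        raise ValueError("not plain")
    return sequence


# Hypothetical order; stricter formats are tried first.
PARSERS = [("fasta", parse_fasta), ("plain", parse_plain)]


def read_sequence(text, fmt=None):
    """Return (format_name, sequence), trying parsers in order.

    If fmt is given, only that parser is tried, so no detection
    tests are needed.
    """
    for name, parser in PARSERS:
        if fmt is not None and name != fmt:
            continue
        try:
            return name, parser(text)
        except ValueError:
            continue
    raise ValueError("input did not match any supported format")
```

Giving fmt explicitly skips the detection loop, which mirrors the reason a USA with an explicit format may run slightly faster.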

6.6.4. Specifying a Database

6.6.4.1. Database Name

The database name is specified in a USA before either an entry to retrieve or a search field:

DatabaseName : Entry
DatabaseName-SearchField : Word

The name of any database you've defined in your EMBOSS installation can be used. Databases are defined in your EMBOSS configuration files (Section 2.8, “Maintenance”). To find out what local databases are available run:

showdb

This will give a table of the database names, whether they are protein or nucleic, and the types of access that are possible (see below). If EMBOSS was set up by your system administrator it's likely that one or more of the following major databases will have been set up:

  • EMBL - nucleic sequences from the EMBL-EBI

  • GenBank - nucleic sequences from the NCBI

  • SwissProt - protein sequences from the EMBL-EBI/ExPASy

  • PIR - protein sequences from the NBRF

Abbreviations of these names are often used, for example em for databases in EMBL format. There is no standard naming scheme for databases because total control over database setup (including naming) is given to you or your local system administrator (the person who set up EMBOSS at your site). The dot character ('.') is, however, not allowed in database names. EMBOSS interprets a '.' character as being part of a file name.

6.6.4.2. Database Entry

The simplest way to specify a database entry in a USA is:

DatabaseName:Entry

where DatabaseName is the name of a database and Entry is either the sequence's accession number or ID in that database. For example:

embl:x13776
swissprot:opsd_xenla

EMBOSS will try searching for your specified sequence by both the accession number field and the ID name field. You don't need to specify whether you gave the accession number or ID. The database name and entry are case-insensitive: they can be in either upper or lower-case. For example: EM:AF061303 is the same as em:af061303.

You cannot specify a sequence in EMBOSS by giving just the ID name or accession number; the database name must be given. You cannot therefore just give X65923 and expect EMBOSS to know what this is - it will assume that X65923 is the name of a database or a file which of course is unlikely to exist.

6.6.4.3. Set of Database Entries

It's common to run an application on all the entries in a database. This can be done by just giving the name of the database. Typically, however, an asterisk is used to indicate that all entries are required. Either of the following therefore refers to all of the entries in the EMBL database:

embl
embl:*

Often a set of wildcarded entry names in a database is required. A * matches any run of characters, whereas a ? matches exactly one character. For example:

swissprot:*_human

refers to all the human entries in swissprot (strictly, it is all the entries in swissprot whose names end in _human.)
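
The behaviour of the two wildcard characters can be sketched in Python by translating a wildcarded entry name into a regular expression. The function names are hypothetical and the list of entry names is made up for the example; real lookups go through the database index rather than a list in memory.

```python
import re


def wildcard_to_regex(pattern):
    """Translate '*' (any run of characters) and '?' (exactly one
    character) into a case-insensitive regular expression."""
    parts = []
    for char in pattern:
        if char == "*":
            parts.append(".*")
        elif char == "?":
            parts.append(".")
        else:
            parts.append(re.escape(char))
    return re.compile("".join(parts), re.IGNORECASE)


def match_entries(pattern, names):
    """Return the names that match the whole wildcarded pattern."""
    regex = wildcard_to_regex(pattern)
    return [name for name in names if regex.fullmatch(name)]
```

With the made-up names ["OPSD_HUMAN", "opsd_xenla", "pax6_human"], the pattern *_human selects the first and third, and pax?_human selects only the third.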

6.6.4.4. Restrictions on Accessing Databases

The specifications for a complete database or wildcarded entry names both refer to multiple entries in a database, but are implemented in EMBOSS in a very different way. When all entries are read, the application starts at the beginning of the database and reads an entry at a time. In contrast, reading wildcarded entries requires an index file of entry ID names and accession numbers. The index file is queried and gives the positions in the database of those entries whose names match the wildcarded specification. For more information on database indexing see the EMBOSS Administrators Guide.

Not all databases will be searchable by all types of sequence specifications. For example, databases that are set up to access a web site will probably not allow retrieval of wildcarded entry name specifications or complete databases: it would take too long to transfer the files across the Internet!

The application showdb will give a list of the available databases, together with the ways in which they can be accessed. This information is given under the three columns ID, Query and All:

ID

Applications can extract a single explicitly-named entry from the database, e.g. embl:x13776

Query

Applications can extract a set of matching wildcard entry names, e.g. swissprot:pax*_human

All

Applications can read all entries sequentially, e.g. embl:*

Ideally all of the databases available on your site will be available using all three methods, but this may well not be the case, so you should check how you can access the databases by running showdb.

Quoting on the UNIX Command line

Be aware that using * or ? on the UNIX command line is problematic. UNIX tries to interpret the word containing the * or ? as a wildcarded filename to be matched to existing files. When this fails UNIX gives an error message without running the application. To avoid this, these characters need to be hidden in quotes or preceded by a backslash on the UNIX command line. For example:

seqret "embl:*"

or

seqret embl:\*

Quoting of wildcard characters is only required on the command line. It is not required when replying to an application prompt or when filling in a field on a GUI's form. This, for example, is fine:

% seqret
Reads and writes (returns) sequences
Input sequence(s): embl:*
..
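
If you build command lines from a script, the quoting described above can be left to the scripting language. The following Python sketch uses the standard library function shlex.quote; the helper names seqret_command and seqret_argv are hypothetical, and the sketch assumes the USA is passed to seqret as its first argument, as in the examples above.

```python
import shlex


def seqret_command(usa):
    """Return a command line string that is safe to hand to a UNIX
    shell: wildcard characters in the USA are quoted if necessary."""
    return "seqret " + shlex.quote(usa)


def seqret_argv(usa):
    """Return an argument list instead. When a program is started
    from such a list without a shell, no quoting is needed."""
    return ["seqret", usa]
```

Here seqret_command("embl:*") produces seqret 'embl:*', while a USA without special characters, such as embl:X65923, is left unquoted.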

6.6.5. Specifying a Sequence File

The file stdin

There is a system filename (stdin) that you can give whenever an input filename is requested. If you enter this name, then the resulting sequence will be read from the keyboard. This is only useful when you wish to type the sequence immediately, or are 'piping' the results from a previous application into the current application.

You can specify the format to read in by using format::stdin. For example:

gcg::stdin

A sequence filename is specified in a USA before an entry to retrieve or a search field:

FileName : Entry
FileName : SearchField : Word

Any file containing sequences can be used but the sequence must be in one of the formats that EMBOSS supports (Section A.1, “Supported Sequence Formats”). The filename is case-sensitive: FRED.SEQ is not the same filename as fred.seq.

6.6.5.1. Multiple Sequence Files

Most sequence formats allow a file to contain more than one sequence. Some formats, however, such as gcg, plain, raw and staden, do not: they have no indication of where one sequence ends and the next sequence starts.

If just the name of the file containing multiple sequences is specified, then all the sequences in that file will be read. This is the equivalent of specifying filename:*. For example

myclones.seq

is the same thing as

myclones.seq:*

6.6.5.2. Specifying One or More File Entries

The simplest way to specify a single specific sequence in a file containing multiple sequences is:

FileName:Entry

where FileName is the name of a file and Entry is the sequence's ID name or accession number in that file. For example the following USA would specify a sequence in the file myfile.fasta whose ID name is xyz_123:

myfile.fasta:xyz_123

As for database entries, you cannot specify a sequence in EMBOSS by giving just the ID name; the file name must be given.

To help GCG users, an additional syntax is allowed where the entry name is enclosed in curly brackets:

FileName{Entry}

When given on the command line the brackets must be escaped as follows:

FileName\{Entry\}

To specify wildcarded sequence names, the wildcard characters '*' and '?' are again used. When used on the command line (but not in response to an EMBOSS prompt) they must be enclosed in quotes or preceded by a backslash. For example:

myfile.fasta:IXI* (in response to a prompt)
"myfile.fasta:IXI*" (on the command line)

will read in all sequences in the file myfile.fasta whose ID name starts with IXI.

6.6.5.3. Specifying a Set of Files

To specify a wildcarded set of file names the characters * and ? are again used. For example:

myfile.*
"myfile.*"

will read in all sequences in the files whose base names start with myfile.

6.6.6. Specifying a Listfile

A listfile is specified by giving @ or list: before the name of the listfile as follows:

@ ListFileName
list: ListFileName

An EMBOSS listfile is a file of USAs with one USA per line. They are essentially the same idea as a "File of Filenames" used in the Staden Package. However, instead of containing the sequences themselves, a listfile contains references (USAs) to sequences. Any valid USA can be given as a reference so, for example, you might include database entries, the names of files containing sequences, or even the names of other listfiles. For example, here's a valid listfile:

opsd_abyko.fasta
sw:opsd_xenla
sw:opsd_c*
@another_list

The contents are as follows:

opsd_abyko.fasta is the name of a sequence file.
sw:opsd_xenla is the name of a specific sequence in the swissprot database
sw:opsd_c* specifies all the sequences in swissprot whose ID names start with opsd_c
another_list is the name of a second (nested) listfile

Notice the @ in front of the last entry. This indicates the file is a listfile, not a regular sequence file. Alternatively, list: may be used in place of @.

Any blank lines or lines starting with a # character (typically used for informative comments) are ignored.
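
The listfile rules above (one USA per line, blank lines and # comments ignored, nested listfiles introduced by @ or list:) can be sketched in Python. The function name expand_listfile is hypothetical. Two details are assumptions made for the sketch and are not stated in this guide: nested listfile names are opened relative to the current directory, and a listfile that includes itself is reported as an error.

```python
import os


def expand_listfile(path, _stack=()):
    """Return the USAs named in a listfile, expanding nested listfiles.

    _stack holds the listfiles currently being expanded so that a
    listfile which (directly or indirectly) includes itself is
    detected (an assumption of this sketch).
    """
    real = os.path.abspath(path)
    if real in _stack:
        raise ValueError("listfile includes itself: " + path)
    usas = []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            # Blank lines and comment lines are ignored.
            if not line or line.startswith("#"):
                continue
            if line.startswith("@"):
                nested = line[1:].strip()
            elif line.lower().startswith("list:"):
                nested = line[len("list:"):].strip()
            else:
                usas.append(line)
                continue
            usas.extend(expand_listfile(nested, _stack + (real,)))
    return usas
```

Wildcarded USAs such as sw:opsd_c* are returned unchanged; expanding them into individual entries would be the job of the database or file reader.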

6.6.7. Specifying a Sequence "As Is"

The simplest USA specification uses asis to specify a sequence directly, i.e. as a string and not in a file or database. The syntax is:

asis::Sequence

For example: asis::atgctagcttagctgac specifies the sequence atgctagcttagctgac.

Note

asis can only specify one sequence at a time. The sequence has no ID name or title.

6.6.8. Applications

An unusual way of getting a sequence is to run an application to extract it from some other system. This is done by specifying the application's name and the sequence. These must be followed by a pipe (|) character.

ProgramName ProgramParameters |

For example:

getz -e [embl-id:AF061303] |

will invoke getz (the SRS sequence retrieval application) to extract entry AF061303 from EMBL. Any application or script which writes one or more sequences to screen (stdout) can be used in this way.

6.6.9. Specifying Search Fields

So far you have specified individual sequences in files or databases by using their ID name or their accession numbers, which are the default search fields. There are, however, other ways to specify sequences using other data fields defined in sequence database entries. An excerpt from a typical sequence entry in EMBL format is shown below:

ID   X65923; SV 1; linear; mRNA; STD; HUM; 518 BP.
XX
AC   X65923;
XX
DT   13-MAY-1992 (Rel. 31, Created)
DT   18-APR-2005 (Rel. 83, Last updated, Version 11)
XX
DE   H.sapiens fau mRNA
XX
KW   fau gene.
XX
OS   Homo sapiens (human)
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia;
OC   Eutheria; Euarchontoglires; Primates; Haplorrhini; Catarrhini; Hominidae;
OC   Homo.
XX
... The rest of entry is not shown

You can see the accession number (AC) and ID name (ID). Sequence retrieval is also possible by sequence version number (SV) and by specifying sequences that contain words occurring in their short description field (the DE line), their "Keyword" field (KW) or the Organism fields (OS and OC lines).

A search for ID name, accession number and version number, which are all usually unique to a sequence, will retrieve a single sequence only. In contrast, words in the description or organism name, for example, are not unique and searches against such fields will probably find more than one match. In this case you will get more than one sequence entry returned, as is often the case when you specify a wildcarded ID name.

You must explicitly specify which field type to search by using one of the search field names given in the table below (Table 6.4, “Database Search Fields”), together with the data to search for.

Table 6.4. Database Search Fields
Name  Search Field
acc   Accession number
des   Description
id    ID name
key   Keyword
org   Organism Name
sv    Sequence Version/GI Number

The type of field to search by is specified by adding a field name to the database name, for example:

embl-des:fau

When specifying a search field in a sequence file (as opposed to a database) the notation is a little different: you use a ':' (colon) instead of a '-' (dash), for example:

myclones.seq:des:fau

This is because myclones.seq-des could be a valid file name whereas myclones.seq:des is not.

Currently you can only specify one search field at a time.
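
The two notations (a dash after a database name, a colon after a file name) can be sketched in Python as follows. The function name parse_query is hypothetical and the sketch only splits the text; it does not check whether the name is really a database or a file. It also shows why the dash form is unsafe for files: a file whose name happens to end in, say, -des would be misread.

```python
SEARCH_FIELDS = {"acc", "des", "id", "key", "org", "sv"}


def parse_query(usa):
    """Split a USA into (source, field, word).

    'Name-field:word' and 'Name:field:word' both yield the named
    field; 'Name:word' yields field None, meaning the default
    search of the id and acc fields.
    """
    parts = usa.split(":")
    if len(parts) == 3 and parts[1].lower() in SEARCH_FIELDS:
        # File notation: FileName:SearchField:Word
        return parts[0], parts[1].lower(), parts[2]
    if len(parts) == 2:
        # Database notation: DatabaseName-SearchField:Word
        base, dash, field = parts[0].rpartition("-")
        if dash and base and field.lower() in SEARCH_FIELDS:
            return base, field.lower(), parts[1]
        return parts[0], None, parts[1]
    raise ValueError("not a Name:Entry or search field USA: " + usa)
```

For example, both embl-des:fau and myclones.seq:des:fau yield the field des and the word fau.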

Missing description, keyword, organism or sequence version fields cause queries to fail. If the file or database you are searching doesn't contain the field you are searching for then you will get an error message, something like:

"Error: Unable to read sequence xxx.seq:org:homo"

6.6.9.1. ACC and ID

The id and acc search fields can normally be omitted. If no search field is specified (for example embl:X13776), then the default is to search for a match in both the id and acc fields.

Using database-acc:AccessionNumber or file:acc:AccessionNumber is a way of telling EMBOSS that it need not try to search for the entry by testing both the ID name field and the accession number field; it only needs to test accession number. This is allowed for ID too, for example, database-id:ID. Specifying the acc and id search fields will make accessing the sequences slightly faster, but they are not required. EMBOSS applications report USAs in this style however, so do not get alarmed when you see it.

6.6.9.2. ORG, KEY and DES

The ORG, KEY and DES fields have the following meaning:

ORG

The full organism classification names (OC field in EMBL).

KEY

Words and phrases that classify the entry by form and function, as specified by the database curators. (KW field in EMBL).

DES

Brief one-line description of the sequence entry. This field is the title line in simple sequence formats, such as fasta format (DE field in EMBL).

Searches in these fields are by word. For example embl-des:fau will search for the text "fau" in the description field. If you wish to search for part of a word, use an asterisk to indicate a wildcard. For example: embl-des:h*emoglobin. The searches are case-insensitive: 'Human' is the same as 'human'.

The definition of a 'word' in KEY and ORG searches is anything that matches the text field (including spaces) between the semicolons (;) delimiting the sections of these fields, or the entire field if no sections are described as is the case for the KW field in the EMBL example above.

Therefore, embl-key:"fau gene" would match the entry X65923 displayed above, as would embl-key:fau*, but embl-key:fau would not match it.

Similarly, embl-org:"homo sapiens (human)" and embl-org:*human* and embl-org:hominidae would match this entry, but embl-org:human would not match it as the 'word' that contains "human" is "Homo sapiens (human)". The search embl-org:homo would match as the word "Homo" occurs in its own field at the end of the second OC line.

The definition of a 'word' is much more intuitive in DES searches: a 'word' is bounded by spaces and other non-alphanumeric characters. Words start with a letter or number, and end with a letter or number. SRS typically does the same, but allows a single quote at the end. This catches words such as 3' and 5' but is a problem with some quoted text.

Therefore embl-des:fau and embl-des:sapiens match. "H.sapiens" is not a word - it is split into the words 'H' and 'sapiens' because the dot (.) is not an alphanumeric character. Phrases don't work for the DES field; it is word based, so the search embl-des:"fau mRNA" will fail.
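
The two definitions of a 'word' can be sketched in Python. All function names are hypothetical, and the sketch is not the EMBOSS indexing code. It assumes that the full stop ending a KW or OC field is not part of the last word, which is inferred from the "fau gene" example above, and it uses the standard library function fnmatchcase for wildcards, which also gives special meaning to square brackets.

```python
import re
from fnmatch import fnmatchcase


def des_words(text):
    """DES-style words: runs of letters and digits only."""
    return re.findall(r"[A-Za-z0-9]+", text)


def phrase_words(text):
    """KEY/ORG-style words: the text between semicolons, spaces kept.

    The trailing full stop of the field is dropped (an assumption
    of this sketch).
    """
    words = []
    for part in text.split(";"):
        part = part.strip().rstrip(".").strip()
        if part:
            words.append(part)
    return words


def field_matches(query, words):
    """Case-insensitive, whole-word match; '*' and '?' are wildcards."""
    query = query.lower()
    return any(fnmatchcase(word.lower(), query) for word in words)
```

With the entry shown earlier, des_words("H.sapiens fau mRNA") gives the words H, sapiens, fau and mRNA, whereas phrase_words("fau gene.") gives the single word "fau gene". This is why fau matches in the DES field but not in the KEY field, where fau* is needed.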

6.6.9.3. SV

Sequence versions are formed from the accession number followed by a full stop ('.') and then the number of releases there have been of this sequence (e.g. X65923.1). This makes it possible to find the current version of any sequence and to find the SV of all previous versions. Further, a sequence may be unambiguously identified by the sequence version, for example: embl-sv:X65923.1. Care is needed, however. In February 1999, everything in DDBJ/EMBL/GenBank was assigned version 1, even if it was the 1st or 10th version for a given sequence. Consider the entry below:

ID   AC000003; SV 1; linear; genomic DNA; STD; HUM; 122228 BP.
XX
AC   AC000003;
XX
DT   01-OCT-1996 (Rel. 49, Created)
DT   07-MAR-2000 (Rel. 63, Last updated, Version 6)
XX
DE   Homo sapiens chromosome 17, clone 104H12, complete sequence.
XX
KW   HTG.
XX

The entry AC000003 shows version 1, but is really the third sequence version (3rd gi) for that record (see http://www.ncbi.nlm.nih.gov:80/entrez/sutils/girevhist.cgi?val=AC000003). Rather confusingly, the version on the DT line has nothing to do with the sequence version (SV).

If, after Feb 1999, the author had updated the sequence of AC000003, then that new one would be version 2 (AC000003.2) and it is a lot easier for a human to track sequence version changes when you see the incremental increase. Bear in mind that just because you are looking at SV X00001.1 it doesn't mean you have the first version that was ever in the databases (DDBJ, EMBL, GenBank).

Both sequence version identifiers and GI numbers (see below) share the sv field in USAs.

6.6.9.4. GI Number

GI numbers are assigned to entries in GenBank and other sequence databases originating from the NCBI. They are an integer key for identifying the entry version. For example:

VERSION     AF181452.1  GI:6017929
            ^^^^^^^^^^  ^^^^^^^^^^
            Compound    NCBI GI
            Accession   Identifier
            Number

The NCBI GI identifier on the VERSION line serves as a method for identifying the sequence data that has existed for a database entry over time. GI identifiers are numeric values of one or more digits. Since they are integer keys they are less human-friendly than the accession version system described above. If the sequence changes a new integer GI will be assigned.

A sequence may be unambiguously identified by the GI Number, for example: genbank-sv:6017929.

Two methods for identifying the version of the sequence associated with a database entry are used because:

  • Some data sources processed by NCBI for incorporation into its Entrez sequence retrieval system do not version their own sequences.

  • GIs provide a uniform integer identifier system for every sequence NCBI has processed. Some products and systems derived from (or reliant upon) NCBI products and services prefer to use these integer identifiers because they can all be processed in the same manner.

Both sequence version identifiers (see above) and GI numbers share the sv field in USAs.
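
Because both kinds of identifier share the sv field, a script that handles such words may want to tell them apart. The following Python sketch does this purely by shape: all digits is treated as a GI number, and Accession.Number as a sequence version. The function name parse_sv is hypothetical, and this guide does not say how EMBOSS itself distinguishes the two, so the rule is an assumption.

```python
def parse_sv(word):
    """Classify an sv search word as a GI number or a sequence version."""
    if word.isascii() and word.isdigit():
        return {"kind": "gi", "gi": int(word)}
    accession, dot, version = word.rpartition(".")
    if dot and accession and version.isascii() and version.isdigit():
        return {"kind": "version", "accession": accession,
                "version": int(version)}
    raise ValueError("not a GI number or sequence version: " + word)
```

For example, parse_sv("X65923.1") reports accession X65923 with version 1, and parse_sv("6017929") reports a GI number.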

6.6.9.5. Start, End, Reverse

The start and end of the sequence is specified by appending [start:end] to the end of the USA. For example:

myfile.fasta[20:45]

specifies the sequences in the file myfile.fasta starting at 20 and ending at position 45.

If the 'start' or 'end' position is given as a negative number, then the position is counted from the end of the sequence. For example:

myfile.fasta[-10:-1]

specifies the last 10 residues.

If [start:end:r] is given at the end of the USA, then nucleotide sequences are reverse-complemented. For example:

myfile.fasta[1:-1:r]

is the whole sequence reverse-complemented.

Zeros can be used to denote the start and end of the complete sequence. For example, the entire sequence may be specified by:

myfile.fasta[0:0]
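
The effect of the [start:end] and [start:end:r] suffixes can be sketched in Python. The function names are hypothetical and the code is an illustration, not the EMBOSS implementation. Two assumptions are made: the subsequence is cut out first and then reverse-complemented, and only the bases a, c, g and t are complemented.

```python
import re

# Matches a trailing [start:end] or [start:end:r] specifier.
_SUFFIX = re.compile(r"\[\s*(-?\d+)\s*:\s*(-?\d+)\s*(?::\s*(r)\s*)?\]$")
_COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")


def split_subsequence(usa):
    """Return (usa_without_suffix, start, end, reverse).

    Without a suffix, start and end are 0 (the defaults) and
    reverse is False.
    """
    match = _SUFFIX.search(usa)
    if match is None:
        return usa, 0, 0, False
    start, end, flag = match.groups()
    return usa[:match.start()], int(start), int(end), flag is not None


def apply_subsequence(sequence, start, end, reverse):
    """Cut out positions start..end (1-based, inclusive) and
    optionally reverse-complement the result."""
    length = len(sequence)
    first = 1 if start == 0 else (length + start + 1 if start < 0 else start)
    last = length if end == 0 else (length + end + 1 if end < 0 else end)
    sub = sequence[first - 1:last]
    if reverse:
        sub = sub.translate(_COMPLEMENT)[::-1]
    return sub
```

For example, split_subsequence("myfile.fasta[1:-1:r]") gives ("myfile.fasta", 1, -1, True), and applying those values to the sequence atgcatgcaa yields ttgcatgcat.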

6.6.10. USA Summary

The following are valid USAs for sequences:

asis::Sequence
@ListFileName
list:ListFileName
DatabaseName
DatabaseName:Entry
DatabaseName-SearchField:word
FileName
FileName:Entry
FileName:SearchField:word

Each of the above can have [start : end] or [start : end : reverse] appended to them.

The FileName and DatabaseName forms of USA can have format:: in front of them to specify the format, although this is not normally necessary. Some examples are shown below.

6.6.10.1. USA Examples

USA Examples
Type                                    Example                       Description
FileName                                xxx.seq                       A sequence file xxx.seq in any format
Format::FileName                        fasta::xxx.seq                A sequence file xxx.seq in FASTA format
DatabaseName:IDname                     embl:X13776                   EMBL entry X13776, using whatever access method is defined locally for the EMBL database
DatabaseName:AccessionNumber            embl:X13776                   EMBL entry X13776, using whatever access method is defined locally for the EMBL database and searching by accession number and entry name (X13776 is the accession number in this case)
DatabaseName-acc:AccessionNumber        embl-acc:X13776               EMBL entry X13776, using whatever access method is defined locally for the EMBL database and searching by accession number only
DatabaseName-id:IDname                  embl-id:X13776                EMBL entry X13776, using whatever access method is defined locally for the EMBL database, and searching by ID only
DatabaseName-SearchField:word           embl-des:lectin               EMBL entries containing the word 'lectin' in the 'Description' line
DatabaseName-SearchField:wildcard word  embl-org:*human*              EMBL entries containing the wildcarded word 'human' in the 'Organism' fields
DatabaseName:wildcard ID                embl:X1377*                   EMBL entries with the prefix X1377, usually in alphabetical order, using whatever access method is defined locally for the EMBL database
DatabaseName or DatabaseName:*          embl or EMBL:*                All sequences in the EMBL database
@ListFileName                           @mylist                       Reads file mylist and uses each line as a separate USA. Listfiles can contain references to other listfiles or any other standard USA.
list:ListFileName                       list:mylist                   Same as @mylist above
'program parameters |'                  'getz -e [embl-id:X13776] |'  The pipe character | causes EMBOSS to fire up getz (the SRS sequence retrieval program) to extract entry X13776 from EMBL in EMBL format. Any application or script which writes one or more sequences to stdout can be used in this way.
asis::sequence                          asis::atacgcagttatctgaccat    For specifying literal sequences on the command line.