The Uniform Sequence Address (USA) is a standard sequence naming scheme used by all EMBOSS applications. Typically, one or more sequences are read from a file or from a larger database. However, other sources such as an application or web server can be specified in a USA (Section 6.6, “The Uniform Sequence Address (USA)”).
A USA specifies:
The sequence format to expect
The file or database to open
The entry or entries to read
The general format of a USA specification is:
Format is the database format of a file of sequences (
FileName) or installed database (
DatabaseName) you have provided and
Entry is the database entry code.
Only FileName or DatabaseName is strictly necessary. If the expected format is omitted, EMBOSS attempts to parse the input with each format in a carefully organised list of supported formats (Section A.1, “Supported Sequence Formats”) until one succeeds. If the database entry code is omitted, then all of the entries in the file or database are read.
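As a rough illustration of how the three parts fit together, the following Python sketch splits a simple USA into format, source and entry. It is not EMBOSS code: the names `split_usa` and `ParsedUsa` are hypothetical, the entry name `opsd_abyko` is illustrative, and the sketch assumes only the plain form with `::` after the format and `:` before the entry. It does not handle listfiles, search fields, program pipes or subsequence specifiers.

```python
from typing import NamedTuple, Optional


class ParsedUsa(NamedTuple):
    fmt: Optional[str]    # sequence format, or None if omitted
    source: str           # file name or database name
    entry: Optional[str]  # entry code, or None meaning "all entries"


def split_usa(usa: str) -> ParsedUsa:
    """Split a simple 'Format::Source:Entry' USA into its parts (illustrative only)."""
    fmt, sep, rest = usa.partition("::")
    if not sep:  # no '::' present, so no format was given
        fmt, rest = None, usa
    source, sep, entry = rest.partition(":")
    return ParsedUsa(fmt, source, entry if sep and entry else None)


print(split_usa("fasta::myfile.fasta:opsd_abyko"))
print(split_usa("embl:X65923"))
print(split_usa("myfile.fasta"))
```

Only the source is mandatory in this sketch, which mirrors the rule that the format and the entry code may both be omitted.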
Here are some common variants of USAs:
ListFileName is the name of a listfile which itself can contain a list of valid USAs. The
: syntax is to allow, for example, "embl" and "pir" to be both database names and sequence formats.
In the following examples,
AccessionNumber is the sequence's accession number in the database, and
DatabaseId is its identifier:
The full command line syntax of the possible USAs is given below. Whitespace has been added for clarity but would not be used on the command line:
end : reverse]
Format :: @
end : reverse]
Format :: list :
end : reverse]
end : reverse]
end : reverse]
end : reverse]
end : reverse]
ProgramParameters | [
end : reverse]
The tokens (
Format etc.) are described below.
Sequence is an explicit sequence in either upper or lower case, for example:
Format must be a name of one of the valid sequence formats (Section A.1, “Supported Sequence Formats”).
The sequence format may usually be omitted when reading in a sequence; EMBOSS will try most known sequence formats until it can read the sequence.
ListFileName is the name of a listfile: a file of USAs with one USA per line. Either @ or list: is required before the name of the listfile to indicate that it is a listfile. Listfiles may be nested (a listfile may contain the USA of another listfile).
Where the sequence specification [start : end : reverse] is used, all the USAs in the listfile are affected, unless these USAs have their own [start : end : reverse] specifier, in which case the one given on the command line is overridden.
This also holds true where the sequence is specified with the
-send or any other command line qualifier (Section 6.4, “Datatype-specific Command Line Qualifiers”) which affects the input sequence: all USAs in the listfile are affected unless they have their own sequence specification.
DatabaseName must be a valid database name as defined in the EMBOSS configuration files (Section 2.8, “Maintenance”).
If the name is not a valid database, a file with the same name is looked for instead. Database names may have Search Field names appended to them (for example
embl-id) (see below).
Entry specifies the ID name or accession number of one or more sequences in a database or file. If it is omitted, then all the entries in the database or file will be read.
Entry may be wildcarded. For example, hs* will match all ID names starting with hs, while * on its own indicates that all entries in the database or file will be read.
There may be restrictions on certain databases preventing access to a single entry, wildcarded entries or reading in all entries. This is a consequence of the way some databases are accessed. The restrictions are given in the database definition (see the EMBOSS Administrators Guide).
A database or file location must be given as part of a USA that has an
Entry; you cannot give an entry name on its own, i.e. you cannot give just an accession number or ID name and expect EMBOSS to deduce that it is indeed an accession number or ID name and to which database it might refer.
SearchField is the name of one of the available search fields shown in the table (Table 6.3, “Sequence Retrieval Search Fields”).
|Sequence version/GI number|
Word is the keyword to search for in the search field. Words may be wildcarded.
Words in the ORG and KEY fields may contain spaces, because the complete key-phrase or organism classification level (the text, including spaces, between the semicolons (;) that delimit sections of these fields) is indexed as one 'word'.
Words in the
DES field contain only alphanumeric characters and thus end at spaces or other non-alphanumeric characters.
The words in
ACC fields are equivalent to
Program is the name of a sequence retrieval application in the current path.
ProgramParameters are any parameters it takes in order to specify one or more entries.
Any USA may optionally take a subsequence specifier after the main body of the USA in one of the following forms:
start and end are the required start and end positions. Negative positions count from the end of the sequence. Zero values for start and end stand for the default values, i.e. position 1 and the length of the sequence respectively.
Use of the USA subsequence specifier is equivalent to using the -sbegin, -send and -sreverse command line qualifiers. For more information see Section 6.4, “Datatype-specific Command Line Qualifiers”.
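The position rules above (zero means the default, negative values count from the end) can be sketched in Python. This is an illustration rather than the EMBOSS implementation: `resolve_range` and `subsequence` are hypothetical names, and the sketch assumes that -1 denotes the last position of the sequence.

```python
from typing import Tuple


def resolve_range(start: int, end: int, length: int) -> Tuple[int, int]:
    """Turn USA-style start/end values into 1-based inclusive positions."""
    def resolve(pos: int, default: int) -> int:
        if pos == 0:   # zero stands for the default value
            return default
        if pos < 0:    # negative counts from the end (assumption: -1 is the last position)
            return length + pos + 1
        return pos

    return resolve(start, 1), resolve(end, length)


def subsequence(seq: str, start: int = 0, end: int = 0) -> str:
    """Return the residues selected by start/end, using the rules above."""
    first, last = resolve_range(start, end, len(seq))
    return seq[first - 1:last]  # convert 1-based inclusive to Python slicing


print(resolve_range(0, 0, 518))       # (1, 518)
print(resolve_range(-10, -1, 518))    # (509, 518)
print(subsequence("atgctagcttagctgac", -3, 0))  # gac
```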
The format, if specified, goes right at the start of the USA. For example:
The sequence format can be any of those supported by EMBOSS (Section A.1, “Supported Sequence Formats”).
If the format is omitted from the USA, EMBOSS will check supported formats, in a carefully defined order, until the sequences are read successfully. Therefore it's not usually necessary to specify the format, although the application may run faster if you do as the tests will not need to be performed.
It's never necessary to specify the format of entries in a sequence database. All databases must be defined in the EMBOSS configuration files (Section 2.8, “Maintenance”) and the definitions include the format of the database.
The one case where it is recommended to specify the format is for sequence input in "plain" format, i.e. just the sequence without annotation, title or comments. This is because some variations of "plain" format may not otherwise be recognised by EMBOSS. If a format is not recognised, the application will fail with an informative error message.
The database name is specified in a USA before either an entry to retrieve or a search field:
The name of any database you've defined in your EMBOSS installation can be used. Databases are defined in your EMBOSS configuration files (Section 2.8, “Maintenance”). To find out what local databases are available run:
This will give a table of the database names, whether they are protein or nucleic, and the types of access that are possible (see below). If EMBOSS was set up by your system administrator it's likely that one or more of the following major databases will have been set up:
EMBL - nucleic sequences from the EMBL-EBI
GenBank - nucleic sequences from the NCBI
SwissProt - protein sequences from the EMBL-EBI/ExPASy
PIR - protein sequences from the NBRF
Abbreviations of these names are often used, for example
em for databases in EMBL format. There is no standard naming scheme for databases because total control over database setup (including naming) is given to you or your local system administrator (the person who set up EMBOSS at your site). The dot character ('.') is, however, not allowed in database names. EMBOSS interprets a '.' character as being part of a file name.
The simplest way to specify a database entry in a USA is:
DatabaseName is the name of a database and
Entry is either the sequence's accession number or ID in that database. For example:
EMBOSS will try searching for your specified sequence by both the accession number field and the
ID name field. You don't need to specify whether you gave the accession number or ID. The database name and entry are case-insensitive: they can be in either upper or lower-case. For example:
EM:AF061303 is the same as
You cannot specify a sequence in EMBOSS by giving just the ID name or accession number; the database name must be given. You cannot therefore just give
X65923 and expect EMBOSS to know what this is - it will assume that
X65923 is the name of a database or a file which of course is unlikely to exist.
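The lookup order described here can be sketched as follows. This is a hypothetical illustration, not EMBOSS code: `classify_source` and the set of defined database names are assumptions. It relies only on rules stated in this section: database names are case-insensitive, a dot is never part of a database name, and a name that is not a defined database is treated as a file name.

```python
from typing import Set


def classify_source(name: str, databases: Set[str]) -> str:
    """Decide whether a bare name would be treated as a database or as a file."""
    if "." in name:                  # a dot marks a file name, never a database
        return "file"
    if name.lower() in databases:    # database names are case-insensitive
        return "database"
    return "file"                    # otherwise a file of that name is looked for


defined = {"embl", "sw"}             # hypothetical local database definitions
print(classify_source("EMBL", defined))          # database
print(classify_source("X65923", defined))        # file
print(classify_source("myfile.fasta", defined))  # file
```

This is why a bare accession number such as X65923 fails: it is looked up as a database and then as a file, and neither is likely to exist.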
It's common to run an application on all the entries in a database. This can be done by just giving the name of the database. Typically, however, an asterisk is used to indicate that all entries are required. Either of the following therefore refers to all of the entries in the EMBL database:
Often a set of wildcarded entry names in a database is required. Wildcard text is specified by a
* whereas a single wildcard character is specified by using a
? character. For example:
refers to all the human entries in swissprot (strictly, it is all the entries in swissprot whose names end in
The specifications for a complete database or wildcarded entry names both refer to multiple entries in a database, but are implemented in EMBOSS in a very different way. When all entries are read, the application starts at the beginning of the database and reads an entry at a time. In contrast, reading wildcarded entries requires an index file of entry ID names and accession numbers. The index file is queried and gives the positions in the database of those entries whose names match the wildcarded specification. For more information on database indexing see the EMBOSS Administrators Guide.
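The wildcard semantics themselves (* for any run of text, ? for exactly one character, matched without regard to case) can be sketched in Python. The sketch below only illustrates the pattern rules. It does not model the index lookup EMBOSS performs, and the function names and the list of entry names are illustrative assumptions.

```python
import re
from typing import List


def wildcard_to_regex(pattern: str) -> "re.Pattern":
    """Translate a USA-style wildcard into a case-insensitive regular expression."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")           # any run of characters, possibly empty
        elif ch == "?":
            parts.append(".")            # exactly one character
        else:
            parts.append(re.escape(ch))  # everything else matches literally
    return re.compile("".join(parts), re.IGNORECASE)


def match_entries(pattern: str, names: List[str]) -> List[str]:
    """Return the entry names that match the whole wildcard pattern."""
    regex = wildcard_to_regex(pattern)
    return [name for name in names if regex.fullmatch(name)]


names = ["OPSD_HUMAN", "OPSD_XENLA", "HBA_HUMAN", "OPS1_DROME"]
print(match_entries("*_human", names))  # ['OPSD_HUMAN', 'HBA_HUMAN']
print(match_entries("ops?_*", names))   # ['OPSD_HUMAN', 'OPSD_XENLA', 'OPS1_DROME']
```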
Not all databases will be searchable by all types of sequence specifications. For example, databases that are set up to access a web site will probably not allow retrieval of wildcarded entry name specifications or complete databases: it would take too long to transfer the files across the Internet!
The application showdb will give a list of the available databases, together with the ways in which they can be accessed. This information is given under the three columns
Applications can extract a single explicitly-named entry from the database, e.g.
Applications can extract a set of matching wildcard entry names, e.g.
Applications can read all entries sequentially, e.g.
Ideally all of the databases available on your site will be available using all three methods, but this may well not be the case, so you should check how you can access the databases by running showdb.
Be aware that using * or ? on the UNIX command line is problematic. The UNIX shell tries to interpret the word containing the * or ? as a wildcarded filename to be matched against existing files. When this fails, the shell may give an error message without running the application. To avoid this, these characters need to be hidden in quotes or preceded by a backslash on the UNIX command line. For example:
Quoting of wildcard characters is only required on the command line. It is not required when replying to an application prompt or when filling in a field on a GUI's form. This, for example, is fine:
seqret
Reads and writes (returns) sequences
Input sequence(s): embl:*
..
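When a script builds a shell command line, the quoting rule above can be delegated to a library instead of being applied by hand. The following Python sketch uses the standard `shlex.quote` function. The helper name `shell_safe` is hypothetical and the USAs are examples only.

```python
import shlex


def shell_safe(usa: str) -> str:
    """Quote a USA so that a POSIX shell passes wildcards through untouched."""
    return shlex.quote(usa)


for usa in ["embl:x65923", "embl:hs*", "sw:opsd_?????"]:
    print(shell_safe(usa))
```

A USA without wildcards is returned unchanged, whereas one containing * or ? is wrapped in single quotes. If the application is started without a shell (for example with an argument list passed to `subprocess.run`), no quoting is needed, just as when replying to a prompt.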
There is a system filename (
stdin) that you can give whenever an input filename is requested. If you enter this name, then the resulting sequence will be read from the keyboard. This is only useful when you wish to type the sequence immediately, or are 'piping' the results from a previous application into the current application.
You can specify the format to read in by using
. For example:
A sequence filename is specified in a USA before an entry to retrieve or a search field:
Any file containing sequences can be used, but the sequences must be in one of the formats that EMBOSS supports (Section A.1, “Supported Sequence Formats”). The filename is case-sensitive:
FRED.SEQ is not the same filename as
Most sequence formats allow a file to contain more than one sequence. Some formats, however, such as gcg, plain, raw and staden, do not: they have no indication of where one sequence ends and the next starts.
If just the name of the file containing multiple sequences is specified, then all the sequences in that file will be read. This is the equivalent of specifying
filename:*. For example
is the same thing as
The simplest way to specify a single specific sequence in a file containing multiple sequences is:
FileName is the name of a file and
Entry is the sequence's ID name or accession number in that file. For example the following USA would specify a sequence in the file
myfile.fasta whose ID name is
As for database entries, you cannot specify a sequence in EMBOSS by giving just the ID name; the file name must be given.
To help GCG users, an additional syntax is allowed where the entry name is enclosed in curly brackets:
When given on the command line the brackets must be escaped as follows:
To specify wildcarded sequence names, the wildcard characters '
*' and '
?' are again used. When used on the command line (but not in response to an EMBOSS prompt) they must be enclosed in quotes or preceded by a backslash. For example:
will read in all sequences in the file
myfile.fasta whose ID name starts with
A listfile is specified by giving
list: before the name of the listfile as follows:
An EMBOSS listfile is a file of USAs with one USA per line. They are essentially the same idea as a "File of Filenames" used in the Staden Package. However, instead of containing the sequences themselves, a listfile contains references (USAs) to sequences. Any valid USA can be given as a reference so, for example, you might include database entries, the names of files containing sequences, or even the names of other listfiles. For example, here's a valid listfile:
opsd_abyko.fasta
sw:opsd_xenla
sw:opsd_c*
@another_list
The contents are as follows:
Note the @ in front of the last entry. This indicates that the file is a listfile, not a regular sequence file. Alternatively, list: may be used in place of @.
Any blank lines or lines starting with a
# character (typically used for informative comments) are ignored.
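The listfile rules (one USA per line, blank lines and # comments ignored, nested listfiles introduced by @ or list:) can be sketched as a small recursive expander. This is an illustration, not the EMBOSS reader: `expand_listfile` is a hypothetical name, files are simulated with a dictionary so the example is self-contained, and the file name `mylist` and the contents of `another_list` are made up. The sketch does not guard against listfiles that include each other in a cycle.

```python
from typing import Callable, Iterator


def expand_listfile(name: str, read_text: Callable[[str], str]) -> Iterator[str]:
    """Yield every USA referenced by a listfile, following nested listfiles."""
    for raw in read_text(name).splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):   # skip blank lines and comments
            continue
        if line.startswith("@"):               # nested listfile, '@' form
            yield from expand_listfile(line[1:], read_text)
        elif line.startswith("list:"):         # nested listfile, 'list:' form
            yield from expand_listfile(line[len("list:"):], read_text)
        else:
            yield line                         # an ordinary USA


# Hypothetical file contents, keyed by file name.
files = {
    "mylist": "# rhodopsin sequences\nopsd_abyko.fasta\nsw:opsd_xenla\n\nsw:opsd_c*\n@another_list\n",
    "another_list": "embl:x65923\n",
}

print(list(expand_listfile("mylist", files.__getitem__)))
# ['opsd_abyko.fasta', 'sw:opsd_xenla', 'sw:opsd_c*', 'embl:x65923']
```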
The simplest USA specification uses
asis to specify a sequence directly, i.e. as a string and not in a file or database. The syntax is:
asis::atgctagcttagctgac specifies the sequence
asis can only specify one sequence at a time. The sequence has no ID name or title.
An unusual way of getting a sequence is to run an application to extract it from some other system. This is done by specifying the application's name and the sequence. These must be followed by a pipe character (|). For example:
getz -e [embl-id:AF061303] |
will invoke getz (the SRS sequence retrieval application) to extract entry
AF061303 from EMBL. Any application or script which writes one or more sequences to screen (
stdout) can be used in this way.
So far you have specified individual sequences in files or databases by using their ID names or their accession numbers, which are the default search fields. There are, however, other ways to specify sequences using other data fields defined in sequence database entries. An excerpt from a typical sequence entry in EMBL format is shown below:
ID   X65923; SV 1; linear; mRNA; STD; HUM; 518 BP.
XX
AC   X65923;
XX
DT   13-MAY-1992 (Rel. 31, Created)
DT   18-APR-2005 (Rel. 83, Last updated, Version 11)
XX
DE   H.sapiens fau mRNA
XX
KW   fau gene.
XX
OS   Homo sapiens (human)
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia;
OC   Eutheria; Euarchontoglires; Primates; Haplorrhini; Catarrhini; Hominidae;
OC   Homo.
XX
...
The rest of entry is not shown
You can see the accession number (
AC) and ID name (
ID). Sequence retrieval is also possible by sequence version number (
SV) and by specifying sequences that contain words occurring in their short description field (the
DE line), their "Keyword" field (
KW) or the Organism fields (
A search for ID name, accession number and version number, which are all usually unique to a sequence, will retrieve a single sequence only. In contrast, words in the description or organism name, for example, are not unique and searches against such fields will probably find more than one match. In this case you will get more than one sequence entry returned, as is often the case when you specify a wildcarded ID name.
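To make the relationship between line codes and searchable fields concrete, here is a small Python sketch that groups the lines of an EMBL-format excerpt by their two-letter code. It is an illustration only: `group_by_line_code` is a hypothetical helper, the excerpt is an abridged copy of the entry shown in this section, and the column spacing in the excerpt is an assumption about the layout.

```python
from collections import defaultdict
from typing import Dict, List

ENTRY = """\
ID   X65923; SV 1; linear; mRNA; STD; HUM; 518 BP.
XX
AC   X65923;
XX
DE   H.sapiens fau mRNA
XX
KW   fau gene.
XX
OS   Homo sapiens (human)
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia;
OC   Eutheria; Euarchontoglires; Primates; Haplorrhini; Catarrhini; Hominidae;
OC   Homo.
"""


def group_by_line_code(text: str) -> Dict[str, List[str]]:
    """Collect the data part of each line under its two-letter line code."""
    fields = defaultdict(list)
    for line in text.splitlines():
        code, _, data = line.partition(" ")
        if code != "XX" and data.strip():   # 'XX' lines are spacers with no data
            fields[code].append(data.strip())
    return dict(fields)


fields = group_by_line_code(ENTRY)
print(fields["AC"])        # ['X65923;']
print(fields["DE"])        # ['H.sapiens fau mRNA']
print(len(fields["OC"]))   # 3
```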
You must explicitly specify which field type to search by using one of the search field names given in the table below (Table 6.4, “Database Search Fields”), together with the data to search for.
|Sequence Version/GI Number|
The type of field to search by is specified by adding a field name to the database name, for example:
When specifying a search field in a sequence file (as opposed to a database) the notation is a little different: you use a '
:' (colon) instead of a '
-' (dash), for example:
This is because
myfile.seq-des could be a valid file name whereas
myfile.seq:des is not.
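The difference between the two notations can be captured in a few lines of Python. The helper below is a hypothetical sketch and not part of EMBOSS. The caller states whether the source is a file, and the function picks the separator accordingly.

```python
def search_usa(source: str, field: str, word: str, is_file: bool) -> str:
    """Build a search-field USA: '-' after a database name, ':' after a file name."""
    separator = ":" if is_file else "-"
    return f"{source}{separator}{field}:{word}"


print(search_usa("embl", "des", "fau", is_file=False))       # embl-des:fau
print(search_usa("myfile.seq", "des", "fau", is_file=True))  # myfile.seq:des:fau
```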
Currently you can only specify one search field at a time.
Missing description, keyword, organism or sequence version fields cause queries to fail. If the file or database you are searching doesn't contain the field you are searching for then you will get an error message, something like:
The id and acc search fields can normally be omitted. If no search field is specified (for example
embl:X13776), then the default is to search for a match in both the
id and acc fields.
Specifying file:acc: is a way of telling EMBOSS that it need not search for the entry by testing both the ID name field and the accession number field; it only needs to test the accession number. This is allowed for ID too, for example
database-id:. Specifying the acc or
id search field makes accessing the sequences slightly faster, but is not required. EMBOSS applications report USAs in this style, however, so do not be alarmed when you see it.
The ORG, KEY and DES fields have the following meanings:
ORG: The full organism classification names (OC field in EMBL).
KEY: Words and phrases that classify the entry by form and function, as specified by the database curators (KW field in EMBL).
DES: A brief one-line description of the sequence entry (DE field in EMBL). This field is the title line in simple sequence formats, such as fasta format.
Searches in these fields are by word. For example
embl-des:fau will search for the text "fau" in the description field. If you wish to search for part of a word, use an asterisk to indicate a wildcard. For example:
embl-des:h*emoglobin. The searches are case-insensitive: 'Human' is the same as 'human'.
The definition of a 'word' in KEY and
ORG searches is anything that matches the text field (including spaces) between the semicolons (
;) delimiting the sections of these fields, or the entire field if no sections are described, as is the case for the
KW field in the EMBL example above.
embl-key:"fau gene" would match the entry
X65923 displayed above, as would
embl-key:fau would not match it.
embl-org:"homo sapiens (human)" and
embl-org:hominidae would match this entry, but
embl-org:human would not match it as the 'word' that contains "human" is "Homo sapiens (human)". The search
embl-org:homo would match as the word "Homo" occurs in its own field at the end of the second
The definition of a 'word' is much more intuitive in
DES searches: a 'word' is bounded by spaces and other non-alphanumeric characters. Words start with a letter or number, and end with a letter or number. SRS typically does the same, but allows a single quote at the end. This catches words such as 3' and 5' but is a problem with some quoted text.
embl-des:sapiens match. "H.sapiens" is not a word - it is split into the words 'H' and 'sapiens' because the dot (
.) is not an alphanumeric character. Phrases don't work for the DES field; it is word based, so the search
embl-des:"fau mRNA" will fail.
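The two definitions of a 'word' can be compared with a short Python sketch. It is an approximation for illustration, not the EMBOSS indexing code: the function names are hypothetical, and stripping the trailing full stop from KW and OC data is an assumption made so that the results agree with the examples above.

```python
import re
from typing import List


def des_words(description: str) -> List[str]:
    """DES-style words: runs of letters and digits, compared case-insensitively."""
    return [word.lower() for word in re.findall(r"[A-Za-z0-9]+", description)]


def phrase_words(field_text: str) -> List[str]:
    """KEY/ORG-style 'words': whole phrases delimited by semicolons."""
    parts = field_text.rstrip(". \n").split(";")  # assumption: trailing '.' is not indexed
    return [part.strip().lower() for part in parts if part.strip()]


print(des_words("H.sapiens fau mRNA"))  # ['h', 'sapiens', 'fau', 'mrna']
print(phrase_words("fau gene."))        # ['fau gene']
print(phrase_words("Homo sapiens (human)"))  # ['homo sapiens (human)']
print(phrase_words("Catarrhini; Hominidae; Homo."))  # ['catarrhini', 'hominidae', 'homo']
```

In this sketch "sapiens" is a DES word but "fau mrna" is not, and "human" is not an ORG 'word' while "homo sapiens (human)" and "homo" are.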
Sequence versions are formed from the accession number followed by a full stop ('
.') and then the number of releases there have been of this sequence. (e.g.
X65923.1). It makes it possible to find the current version of any sequence and to find the
SV of all previous versions. Further, a sequence may be unambiguously identified by the sequence version, for example:
embl-sv:X65923.1
Care is needed, however. In February 1999, every sequence in DDBJ/EMBL/GenBank was assigned version 1, whether it was the 1st or the 10th version of that sequence. Consider the entry below:
ID   AC000003; SV 1; linear; genomic DNA; STD; HUM; 122228 BP.
XX
AC   AC000003;
XX
DT   01-OCT-1996 (Rel. 49, Created)
DT   07-MAR-2000 (Rel. 63, Last updated, Version 6)
XX
DE   Homo sapiens chromosome 17, clone 104H12, complete sequence.
XX
KW   HTG.
XX
AC000003 shows version 1, but is really the third sequence version (3rd
gi) for that record (see http://www.ncbi.nlm.nih.gov:80/entrez/sutils/girevhist.cgi?val=AC000003). Rather confusingly, the version on the
DT line has nothing to do with the sequence version (
If, after Feb 1999, the author had updated the sequence of
AC000003, then that new one would be version 2 (
AC000003.2) and it is a lot easier for a human to track sequence version changes when you see the incremental increase. Bear in mind that just because you are looking at
SV X00001.1 it doesn't mean you have the first version that was ever in the databases (DDBJ, EMBL, GenBank).
Both sequence version identifiers and GI numbers (see below) share the
sv field in USAs.
GI numbers are assigned to entries in GenBank and other sequence databases originating from the NCBI. They are an integer key for identifying the entry version. For example:
VERSION     AF181452.1  GI:6017929
            ^^^^^^^^^^  ^^^^^^^^^^
            Compound    NCBI GI
            Accession   Identifier
            Number
The NCBI GI identifier on the
VERSION line serves as a method for identifying the sequence data that has existed for a database entry over time. GI identifiers are numeric values of one or more digits. Since they are integer keys they are less human-friendly than the accession version system described above. If the sequence changes a new integer GI will be assigned.
A sequence may be unambiguously identified by the GI Number, for example:
Two methods for identifying the version of the sequence associated with a database entry are used because:
Some data sources processed by NCBI for incorporation into its Entrez sequence retrieval system do not version their own sequences.
GIs provide a uniform integer identifier system for every sequence NCBI has processed. Some products and systems derived from (or reliant upon) NCBI products and services prefer to use these integer identifiers because they can all be processed in the same manner.
Both sequence version identifiers (see above) and GI numbers share the
sv field in USAs.
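Because both kinds of identifier share one field, a script that prepares queries may want to tell them apart. The Python sketch below is a hypothetical helper, not EMBOSS behaviour. It assumes that a GI number is written as bare digits and that a sequence version is written as an accession number, a full stop and an integer.

```python
from typing import Tuple, Union

SvValue = Union[Tuple[str, int], Tuple[str, str, int]]


def classify_sv(value: str) -> SvValue:
    """Classify an sv-field value as a GI number or an accession.version pair."""
    if value.isdigit():                           # GI numbers are plain integers
        return ("gi", int(value))
    accession, dot, version = value.rpartition(".")
    if dot and accession and version.isdigit():   # e.g. X65923.1
        return ("version", accession, int(version))
    raise ValueError(f"not a GI number or sequence version: {value!r}")


print(classify_sv("6017929"))    # ('gi', 6017929)
print(classify_sv("X65923.1"))   # ('version', 'X65923', 1)
```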
The start and end of the sequence are specified by appending
[start:end] to the end of the USA. For example:
specifies the sequences in the file
myfile.fasta starting at 20 and ending at position 45.
If the 'start' or 'end' position is given as a negative number, then the position is counted from the end of the sequence. For example:
specifies the last 10 residues.
If [start:end:r] is given at the end of the USA, then nucleotide sequences are reverse-complemented. For example:
is the whole sequence reverse-complemented.
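For readers unfamiliar with the operation, reverse-complementing can be sketched in Python as follows. This is a minimal illustration for unambiguous DNA bases only. The function name is hypothetical, and ambiguity codes, which a real implementation would need to handle, are left unchanged.

```python
# Translation table for the four unambiguous DNA bases, in both cases.
_COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")


def reverse_complement(seq: str) -> str:
    """Complement each base, then reverse the order of the sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


print(reverse_complement("aacg"))               # cgtt
print(reverse_complement("atgctagcttagctgac"))  # gtcagctaagctagcat
```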
Zeros can be used to denote the start and end of the complete sequence. For example, the entire sequence may be specified by:
The following are valid USAs for sequences:
Each of the above can have [start : end : reverse] appended to it.
DatabaseName forms of USA can have
format:: in front of them to specify the format, although this is not normally necessary. Some examples are shown below.
|A sequence file |
|A sequence file |
|EMBL entry |
|EMBL entry |
|EMBL entry |
|EMBL entry |
|EMBL entries containing the word 'lectin' in the 'Description' line|
|EMBL entries containing the wildcarded word 'human' in the 'Organism' fields|
|EMBL entries with the prefix |
|All sequences in the EMBL database|
|Reads file |
|Same as |
|The pipe character |
|For specifying literal sequences on the command lines.|