5.2. Introduction to Sequence Formats


5.2.1. What is a Sequence Format?

A sequence format defines the permitted layout and content of text in a file. This includes text tokens that define fields used in a databank. These fields include the sequence itself, the sequence identifier name and accession number, amongst others. Non-printable control characters are not generally used, allowing most formats to be viewed on screen or printed out.

The FASTA format is a very widely used (and abused) format. It consists of a header line starting with a > character followed by a code identifying the sequence and, very often, some text describing the sequence. The header line is followed by one or more lines containing the sequence itself. FASTA files may contain one or more sequences:

>crab_anapl ALPHA CRYSTALLIN B CHAIN (ALPHA(B)-CRYSTALLIN).             
MDITIHNPLIRRPLFSWLAPSRIFDQIFGEHLQESELLPASPSLSPFLMR
SPIFRMPSWLETGLSEMRLEKDKFSVNLDVKHFSPEELKVKVLGDMVEIH
GKHEERQDEHGFIAREFNRKYRIPADVDPLTITSSLSLDGVLTVSAPRKQ
SDVPERSIPITREEKPAIAGAQRK
>crab_bovin ALPHA CRYSTALLIN B CHAIN (ALPHA(B)-CRYSTALLIN).             
MDIAIHHPWIRRPFFPFHSPSRLFDQFFGEHLLESDLFPASTSLSPFYLR
PPSFLRAPSWIDTGLSEMRLEKDRFSVNLDVKHFSPEELKVKVLGDVIEV
HGKHEERQDEHGFISREFHRKYRIPADVDPLAITSSLSSDGVLTVNGPRK
QASGPERTIPITREEKPAVTAAPKK
>crab_chick ALPHA CRYSTALLIN B CHAIN (ALPHA(B)-CRYSTALLIN).             
MDITIHNPLVRRPLFSWLTPSRIFDQIFGEHLQESELLPTSPSLSPFLMR
SPFFRMPSWLETGLSEMRLEKDKFSVNLDVKHFSPEELKVKVLGDMIEIH
GKHEERQDEHGFIAREFSRKYRIPADVDPLTITSSLSLDGVLTVSAPRKQ
SDVPERSIPITREEKPAIAGSQRK
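The layout described above is simple enough to read with a few lines of code. The following is a minimal sketch in Python, not the parser EMBOSS itself uses; the function name parse_fasta is made up for this illustration. It treats the first word after the > character as the identifying code and the rest of the header line as the description.

```python
from typing import Iterator, List, Optional, Tuple


def parse_fasta(text: str) -> Iterator[Tuple[str, str, str]]:
    """Yield (identifier, description, sequence) for each FASTA record."""
    identifier: Optional[str] = None
    description = ""
    chunks: List[str] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if identifier is not None:
                yield identifier, description, "".join(chunks)
            # First word after '>' is the code; the rest is free text.
            parts = line[1:].split(None, 1)
            identifier = parts[0] if parts else ""
            description = parts[1] if len(parts) > 1 else ""
            chunks = []
        elif identifier is not None:
            chunks.append(line)  # sequence may span several lines
    if identifier is not None:
        yield identifier, description, "".join(chunks)


# First record of the example above.
example = """>crab_anapl ALPHA CRYSTALLIN B CHAIN (ALPHA(B)-CRYSTALLIN).
MDITIHNPLIRRPLFSWLAPSRIFDQIFGEHLQESELLPASPSLSPFLMR
SPIFRMPSWLETGLSEMRLEKDKFSVNLDVKHFSPEELKVKVLGDMVEIH
GKHEERQDEHGFIAREFNRKYRIPADVDPLTITSSLSLDGVLTVSAPRKQ
SDVPERSIPITREEKPAIAGAQRK
"""

records = list(parse_fasta(example))
for identifier, description, sequence in records:
    print(identifier, len(sequence), description)
```

Joining the sequence lines only when the next header (or the end of the input) is reached is what lets a single file hold any number of records.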

Beyond FASTA, the most widespread sequence formats are those used by the major sequence databases:

EMBL: http://www.ebi.ac.uk/embl/Documentation/User_manual/format.html
GenBank: http://www.ncbi.nlm.nih.gov/Genbank/GenbankOverview.html
SwissProt: http://ca.expasy.org/sprot/userman.html#whatis
PIR: http://pir.georgetown.edu/pirwww/search/textpsd.shtml

Sadly, sequences are occasionally stored in non-standard formats. These include proprietary word processor formats (e.g. MS Word and MS WordPad) and text formatting languages (e.g. PostScript, PDF, RTF, TeX and HTML). EMBOSS will not read a sequence in any of these formats.

If you have a sequence in a non-standard format you should:

Save the sequence to a file as plain ASCII text, without any formatting whatsoever. The file should contain the sequence only. EMBOSS will recognise this "plain" format. The program you are using to view the file should have an option to "Save as..." plain text.
If there is not an option to save your sequence in plain text format directly, there may well be a utility program to convert the file to plain text format. The EMBOSS user community will be able to help you with this (see Section 3.5, “How to Get Help”).
In future, use a text editor that can write files in plain text format. These include pico, nedit, emacs and MS WordPad. When using a text editor to create a sequence file, the best (simplest) format to use is FASTA, as described above. Be sure to save your sequence as plain text.
If you intend to manipulate or edit the sequences substantially, investigate using a full-blown sequence editor such as mse. Such editors should have an option to save the sequence to a file in one or more of the standard formats.
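Before passing a file to EMBOSS it can be useful to check whether it really is plain text. The sketch below, in Python, looks for the leading signatures of a few of the formatted file types mentioned above and otherwise tests that every byte is printable ASCII. The function name describe_content and the signature list are illustrative assumptions, and the list is far from exhaustive; this is not a check that EMBOSS performs.

```python
import string

# Leading byte signatures for a few formatted file types (illustrative only).
SIGNATURES = {
    b"%PDF": "PDF",
    b"%!PS": "PostScript",
    b"{\\rtf": "RTF",
}

# Byte values of printable ASCII characters, including common whitespace.
PRINTABLE = set(string.printable.encode("ascii"))


def describe_content(data: bytes) -> str:
    """Return a rough label for the content of a would-be sequence file."""
    head = data.lstrip()[:64]
    for magic, name in SIGNATURES.items():
        if head.startswith(magic):
            return name
    if head[:5].lower() in (b"<html", b"<!doc"):
        return "HTML"
    if all(byte in PRINTABLE for byte in data):
        return "plain ASCII text"
    return "binary or non-ASCII"


print(describe_content(b">IDName An informative comment\nttcctctttc\n"))
print(describe_content(b"{\\rtf1\\ansi ttcctctttc}"))
```

To check a file on disk, read it in binary mode and pass the bytes to the function.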

5.2.2. Supported Sequence Formats

Some sequence formats can hold multiple sequences in one file. Typically there will be multiple entries (one per sequence) that are concatenated in the file. Other formats, such as Staden, can only hold one sequence per file. Concatenating several such sequences in one file would produce a mess in which the sequences could not easily be told apart from the annotation. Most systems, including EMBOSS, will not parse such files, so you should never use a single-sequence format to hold multiple sequences.

Sequences are also held in alignment files. These contain the results of aligning (lining up similar or equivalent characters in) two or more sequences. EMBOSS supports most common sequence alignment formats (Section A.3, “Supported Alignment Formats”).

All of the common sequence formats are supported in EMBOSS for both application input (reading) and output (writing). These are summarised below. Some support single sequences only, some multiple sequences. The names of the sequence formats are taken from common EMBOSS database configurations. Some of these are obviously synonyms e.g. "embl" and "em". In practice, the names available will depend on what's defined in your EMBOSS configuration files (see Section 2.8, “Maintenance”). For descriptions and examples of the supported formats see Section A.1, “Supported Sequence Formats”.

The supported sequence formats are summarised in the tables below. The columns are as follows:

Input Format / Output Format: the format name.
Sngl: whether each sequence is written to a new file by default. This behaviour can also be set with the -ossingle command line qualifier.
Save: whether sequence data is stored internally and written when the output is closed. This is needed for 'interleaved' formats such as Phylip and MSF.
Try: whether the format can be detected automatically on input.
Nuc: "Yes" indicates that nucleotide sequence data may be represented.
Pro: "Yes" indicates that protein sequence data may be represented.
Feat: whether the format includes feature annotation data. EMBOSS can also read feature data from a separate feature file.
Gap: whether the format supports sequence data with gap characters, for example the results of an alignment.
Mset: "Yes" indicates that more than one set of sequences can be stored in a single file. This is used by, for example, phylogenetic analysis applications to store many versions of a multiple alignment for statistical analysis.
Description: a short description of the format.

Table 5.1. Input sequence formats
Input Format	Try	Nuc	Pro	Feat	Gap	Mset	Description
abi	Yes	Yes	Yes	No	Yes	No	ABI trace file
acedb	Yes	Yes	Yes	No	Yes	No	ACEDB sequence format
clustal	Yes	Yes	Yes	No	Yes	No	Clustalw output format
codata	Yes	Yes	Yes	Yes	Yes	No	Codata entry format
dbid	No	Yes	Yes	No	Yes	No	Fasta format variant with database name before ID
embl	Yes	Yes	No	Yes	Yes	No	EMBL entry format
experiment	Yes	Yes	Yes	No	Yes	No	Staden experiment file
fasta	Yes	Yes	Yes	No	Yes	No	FASTA format including NCBI-style IDs
fastq	Yes	Yes	No	No	No	No	FASTQ short read format ignoring quality scores
fastq-illumina	No	Yes	No	No	No	No	FASTQ Illumina 1.3 short read format
fastq-sanger	No	Yes	No	No	No	No	FASTQ short read format with phred quality
fastq-solexa	No	Yes	No	No	No	No	FASTQ Solexa/Illumina 1.0 short read format
fitch	Yes	Yes	Yes	No	Yes	No	Fitch program format
gcg	Yes	Yes	Yes	No	Yes	No	GCG sequence format
genbank	Yes	Yes	No	Yes	Yes	No	Genbank entry format
genpept	No	No	Yes	Yes	Yes	No	Refseq protein entry format (alias)
gff2	Yes	Yes	Yes	Yes	Yes	No	GFF feature file with sequence in the header
gff3	Yes	Yes	Yes	Yes	Yes	No	GFF3 feature file with sequence
gifasta	No	Yes	Yes	No	Yes	No	FASTA format including NCBI-style GIs (alias)
hennig86	Yes	Yes	Yes	No	Yes	No	Hennig86 output format
ig	No	Yes	Yes	No	Yes	No	Intelligenetics sequence format
igstrict	Yes	Yes	Yes	No	Yes	No	Intelligenetics sequence format strict parser
jackknifer	Yes	Yes	Yes	No	Yes	No	Jackknifer interleaved and non-interleaved formats
mase	No	Yes	Yes	No	Yes	No	Mase program format
mega	Yes	Yes	Yes	No	Yes	No	Mega interleaved and non-interleaved formats
msf	Yes	Yes	Yes	No	Yes	No	GCG MSF (multiple sequence file) file format
nbrf	Yes	Yes	Yes	Yes	Yes	No	NBRF/PIR entry format
nexus	Yes	Yes	Yes	No	Yes	No	Nexus/paup interleaved format
pdb	Yes	No	Yes	No	No	No	PDB protein databank format ATOM lines
pdbnuc	No	Yes	No	No	No	No	PDB protein databank format nucleotide ATOM lines
pdbnucseq	No	Yes	No	No	No	No	PDB protein databank format nucleotide SEQRES lines
pdbseq	Yes	No	Yes	No	No	No	PDB protein databank format SEQRES lines
pearson	Yes	Yes	Yes	No	Yes	No	Plain old fasta format with IDs not parsed further
phylip	Yes	Yes	Yes	No	Yes	Yes	Phylip interleaved and non-interleaved formats
phylipnon	No	Yes	Yes	No	Yes	Yes	Phylip non-interleaved format
raw	Yes	Yes	Yes	No	No	No	Raw sequence with no non-sequence characters
refseqp	No	No	Yes	Yes	Yes	No	Refseq protein entry format
selex	No	Yes	Yes	No	Yes	No	Selex format
staden	No	Yes	Yes	No	Yes	No	Old staden package sequence format
stockholm	Yes	Yes	Yes	No	Yes	No	Stockholm (pfam) format
strider	Yes	Yes	Yes	No	Yes	No	DNA strider output format
swiss	Yes	No	Yes	Yes	Yes	No	Swissprot entry format
text	No	Yes	Yes	No	Yes	No	Plain text
treecon	Yes	Yes	Yes	No	Yes	No	Treecon output format

Table 5.2. Output sequence formats
Output Format	Sngl	Save	Nuc	Pro	Feat	Gap	Mset	Description
acedb	No	No	Yes	Yes	No	Yes	No	ACEDB sequence format
asn1	No	No	Yes	Yes	No	Yes	No	NCBI ASN.1 format
clustal	No	Yes	Yes	Yes	No	Yes	No	Clustalw multiple alignment format
codata	No	No	Yes	Yes	No	Yes	No	Codata entry format
das	No	No	Yes	Yes	No	Yes	No	DASSEQUENCE DAS any sequence
dasdna	No	No	Yes	No	No	Yes	No	DASDNA DAS nucleotide-only sequence
debug	No	No	Yes	Yes	No	Yes	No	Debugging trace of full internal data content
embl	No	No	Yes	No	Yes	Yes	No	EMBL entry format
experiment	No	No	Yes	Yes	No	Yes	No	Staden experiment file
fasta	No	No	Yes	Yes	No	Yes	No	FASTA format
fastq-illumina	No	No	Yes	No	No	No	No	FASTQ Illumina 1.3 short read format
fastq-sanger	No	No	Yes	No	No	No	No	FASTQ short read format with phred quality
fastq-solexa	No	No	Yes	No	No	No	No	FASTQ Solexa/Illumina 1.0 short read format
fitch	No	No	Yes	Yes	No	Yes	No	Fitch program format
gcg	No	No	Yes	Yes	No	Yes	No	GCG sequence format
genbank	No	No	Yes	No	No	Yes	No	Genbank entry format
gff2	No	No	Yes	Yes	Yes	Yes	No	GFF2 feature file with sequence in the header
gff3	No	No	Yes	Yes	Yes	Yes	No	GFF3 feature file with sequence in FASTA format after
gifasta	No	No	Yes	Yes	No	Yes	No	NCBI fasta format with NCBI-style IDs using GI number
hennig86	No	Yes	Yes	Yes	No	Yes	No	Hennig86 output format
ig	No	No	Yes	Yes	No	Yes	No	Intelligenetics sequence format
jackknifer	No	Yes	Yes	Yes	No	Yes	No	Jackknifer output interleaved format
jackknifernon	No	Yes	Yes	Yes	No	Yes	No	Jackknifer output non-interleaved format
mase	No	No	Yes	Yes	No	Yes	No	Mase program format
mega	No	Yes	Yes	Yes	No	Yes	No	Mega interleaved output format
meganon	No	Yes	Yes	Yes	No	Yes	No	Mega non-interleaved output format
msf	No	Yes	Yes	Yes	No	Yes	No	GCG MSF (multiple sequence file) file format
nbrf	No	No	Yes	Yes	Yes	Yes	No	NBRF/PIR entry format
ncbi	No	No	Yes	Yes	No	Yes	No	NCBI fasta format with NCBI-style IDs
nexus	No	Yes	Yes	Yes	No	Yes	No	Nexus/paup interleaved format
nexusnon	No	Yes	Yes	Yes	No	Yes	No	Nexus/paup non-interleaved format
phylip	No	Yes	Yes	Yes	No	Yes	Yes	Phylip interleaved format
phylipnon	No	Yes	Yes	Yes	No	Yes	No	Phylip non-interleaved format
selex	No	Yes	Yes	Yes	No	Yes	No	Selex format
staden	No	No	Yes	Yes	No	Yes	No	Old staden package sequence format
strider	No	No	Yes	Yes	No	Yes	No	DNA strider output format
swiss	No	No	No	Yes	Yes	Yes	No	Swissprot entry format
text	No	No	Yes	Yes	No	Yes	No	Plain text
treecon	No	Yes	Yes	Yes	No	Yes	No	Treecon output format

5.2.3. Contents of a Sequence Entry

An entry in a sequence databank will typically include a code and other information to identify the sequence, some bibliographic information, sequence annotation including a description of any features and, of course, the sequence itself.

An excerpt of the EMBL entry for a beta-glucosidase mRNA sequence is shown below:

ID   X56734; SV 1; linear; mRNA; STD; PLN; 1859 BP.
XX
AC   X56734; S46826;
XX
DT   12-SEP-1991 (Rel. 29, Created)
DT   25-NOV-2005 (Rel. 85, Last updated, Version 11)
XX
DE   Trifolium repens mRNA for non-cyanogenic beta-glucosidase
XX
KW   beta-glucosidase.
XX
OS   Trifolium repens (white clover)
OC   Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
OC   Spermatophyta; Magnoliophyta; eudicotyledons; core eudicotyledons; rosids;
OC   eurosids I; Fabales; Fabaceae; Papilionoideae; Trifolieae; Trifolium.
XX
RN   [5]
RP   1-1859
RX   PUBMED; 1907511.
RA   Oxtoby E., Dunn M.A., Pancoro A., Hughes M.A.;
RT   "Nucleotide and derived amino acid sequence of the cyanogenic
RT   beta-glucosidase (linamarase) from white clover (Trifolium repens L.)";
RL   Plant Mol. Biol. 17(2):209-219(1991).
XX
RN   [6]
RP   1-1859
RA   Hughes M.A.;
RT   ;
RL   Submitted (19-NOV-1990) to the EMBL/GenBank/DDBJ databases.
RL   Hughes M.A., University of Newcastle Upon Tyne, Medical School, Newcastle
RL   Upon Tyne, NE2 4HH, UK
XX
FH   Key             Location/Qualifiers
FH
FT   source          1..1859
FT                   /organism="Trifolium repens"
FT                   /mol_type="mRNA"
FT                   /clone_lib="lambda gt10"
FT                   /clone="TRE361"
FT                   /tissue_type="leaves"
FT                   /db_xref="taxon:3899"
FT   CDS             14..1495
FT                   /product="beta-glucosidase"
FT                   /EC_number="3.2.1.21"
FT                   /note="non-cyanogenic"
FT                   /db_xref="GOA:P26204"
FT                   /db_xref="HSSP:P26205"
FT                   /db_xref="InterPro:IPR001360"
FT                   /db_xref="UniProtKB/Swiss-Prot:P26204"
FT                   /protein_id="CAA40058.1"
FT                   /translation="MDFIVAIFALFVISSFTITSTNAVEASTLLDIGNLSRSSFPRGFI
FT                   FGAGSSAYQFEGAVNEGGRGPSIWDTFTHKYPEKIRDGSNADITVDQYHRYKEDVGIMK
FT                   DQNMDSYRFSISWPRILPKGKLSGGINHEGIKYYNNLINELLANGIQPFVTLFHWDLPQ
FT                   VLEDEYGGFLNSGVINDFRDYTDLCFKEFGDRVRYWSTLNEPWVFSNSGYALGTNAPGR
FT                   CSASNVAKPGDSGTGPYIVTHNQILAHAEAVHVYKTKYQAYQKGKIGITLVSNWLMPLD
FT                   DNSIPDIKAAERSLDFQFGLFMEQLTTGDYSKSMRRIVKNRLPKFSKFESSLVNGSFDF
FT                   IGINYYSSSYISNAPSHGNAKPSYSTNPMTNISFEKHGIPLGPRAASIWIYVYPYMFIQ
FT                   EDFEIFCYILKINITILQFSITENGMNEFNDATLPVEEALLNTYRIDYYYRHLYYIRSA
FT                   IRAGSNVKGFYAWSFLDCNEWFAGFTVRFGLNFVD"
FT   mRNA            1..1859
FT                   /experiment="experimental evidence, no additional details
FT                   recorded"
XX
SQ   Sequence 1859 BP; 609 A; 314 C; 355 G; 581 T; 0 other;
     aaacaaacca aatatggatt ttattgtagc catatttgct ctgtttgtta ttagctcatt        60
     cacaattact tccacaaatg cagttgaagc ttctactctt cttgacatag gtaacctgag       120
     tcggagcagt tttcctcgtg gcttcatctt tggtgctgga tcttcagcat accaatttga       180
.
. sequence omitted for brevity
.
     aagttgttag gctgttattt ctattatact atgttgtagt aataagtgca ttgttgtacc      1740
     agaagctatg atcataacta taggttgatc cttcatgtat cagtttgatg ttgagaatac      1800
     tttgaattaa aagtcttttt ttattttttt aaaaaaaaaa aaaaaaaaaa aaaaaaaaa       1859
//

5.2.3.1. Identification

IDs and Accessions

An entry in a database must have some way of being uniquely identified. Most sequence databases have two such identifiers for each sequence - an ID name and an accession number.

Accession numbers are unique alphanumeric identifiers that are guaranteed to remain with that sequence through the life of the database. If two sequences are merged, then the new sequence will get a new accession number and the accession numbers of the merged sequences will be retained as 'secondary' accession numbers. EMBL, GenBank and Swissprot share an accession numbering scheme - an accession number uniquely identifies a sequence within these three databases. In contrast, ID names are not guaranteed to remain the same between different versions of a database, although in practice they usually do.

Why are there two such identifiers? The ID name was originally intended to be a human-readable name that indicated the function of its sequence. In EMBL and GenBank the first two (or three) letters indicated the species and the rest indicated the function, for example hsfau is the 'Homo sapiens FAU pseudogene'. This naming scheme became a problem when so many entries were being added each day that people could not make up ID names fast enough. Instead, the accession number started to be assigned as the ID name as well. You will therefore now find ID names, such as AF061303, that are the same as the accession number for that sequence in EMBL.

Most sequence formats include an identifier code in some form or another. Typically this is an accession number and/or identifier name (ID) and is given near the top of the entry. They uniquely identify an entry in the database.

For our EMBL entry, the accession number X56734 is given on the ID line and separately in the AC line:

ID   X56734; SV 1; linear; mRNA; STD; PLN; 1859 BP.
XX
AC   X56734; S46826;
XX

In contrast, FASTA format often gives the ID as the first word of an informative title line:

>IDName An Informative comment
ttcctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgc
agctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcg
cccagatcaaggctcatgtagcctcactggagggcatt
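The ID and AC lines of the EMBL entry shown earlier in this section can be pulled apart with a few lines of code. The following Python sketch assumes the layout seen in that excerpt: a two-letter line code, three spaces, then semicolon-separated fields. The function name parse_id_and_ac is hypothetical and the code is not taken from EMBOSS.

```python
from typing import Dict, List


def parse_id_and_ac(entry: str) -> Dict[str, object]:
    """Extract the ID line fields and the accession numbers from an entry."""
    name = None
    id_fields: List[str] = []
    accessions: List[str] = []
    for line in entry.splitlines():
        code, data = line[:2], line[5:].strip()
        if code == "ID":
            # The first field is the ID name; the line ends with a full stop.
            fields = [f.strip() for f in data.rstrip(".").split(";")]
            name, id_fields = fields[0], fields[1:]
        elif code == "AC":
            # The first accession is the primary one; any others are secondary.
            accessions.extend(a.strip() for a in data.split(";") if a.strip())
    return {"id": name, "id_fields": id_fields, "accessions": accessions}


entry = """ID   X56734; SV 1; linear; mRNA; STD; PLN; 1859 BP.
XX
AC   X56734; S46826;
XX
"""

info = parse_id_and_ac(entry)
print(info["id"], info["accessions"])
```

Here the ID name and the primary accession number are the same, as described above for recent entries.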

5.2.3.2. Bibliographic Information

Most sequence formats have records for bibliographic information such as literature references, experimental details, author contact information, cross-links to other databases, and much more besides. In the example below, the dates of creation and last update (DT), a description (DE), keywords (KW), organism species (OS), organism classification (OC) and reference information (RN, RP, RX, RA, RT and RL) are given:

DT   12-SEP-1991 (Rel. 29, Created)
DT   25-NOV-2005 (Rel. 85, Last updated, Version 11)
XX
DE   Trifolium repens mRNA for non-cyanogenic beta-glucosidase
XX
KW   beta-glucosidase.
XX
OS   Trifolium repens (white clover)
OC   Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
OC   Spermatophyta; Magnoliophyta; eudicotyledons; core eudicotyledons; rosids;
OC   eurosids I; Fabales; Fabaceae; Papilionoideae; Trifolieae; Trifolium.
XX
RN   [5]
RP   1-1859
RX   PUBMED; 1907511.
RA   Oxtoby E., Dunn M.A., Pancoro A., Hughes M.A.;
RT   "Nucleotide and derived amino acid sequence of the cyanogenic
RT   beta-glucosidase (linamarase) from white clover (Trifolium repens L.)";
RL   Plant Mol. Biol. 17(2):209-219(1991).
XX
RN   [6]
RP   1-1859
RA   Hughes M.A.;
RT   ;
RL   Submitted (19-NOV-1990) to the EMBL/GenBank/DDBJ databases.
RL   Hughes M.A., University of Newcastle Upon Tyne, Medical School, Newcastle
RL   Upon Tyne, NE2 4HH, UK
XX
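Because every line starts with a two-letter code, the reference lines are easy to group by program. The Python sketch below collects the R* lines of the excerpt above into one dictionary per reference, starting a new reference at each RN line and joining continuation lines that share a code. The function name parse_references is an assumption made for this example.

```python
from typing import Dict, List


def parse_references(entry: str) -> List[Dict[str, str]]:
    """Collect the R* lines of an EMBL entry into one dict per reference."""
    references: List[Dict[str, str]] = []
    for line in entry.splitlines():
        code, data = line[:2], line[5:].strip()
        if code == "RN":
            references.append({"RN": data})  # RN opens a new reference
        elif code.startswith("R") and references:
            current = references[-1]
            # Repeated codes are continuation lines: join them with a space.
            current[code] = f"{current[code]} {data}" if code in current else data
    return references


entry = '''RN   [5]
RP   1-1859
RX   PUBMED; 1907511.
RA   Oxtoby E., Dunn M.A., Pancoro A., Hughes M.A.;
RT   "Nucleotide and derived amino acid sequence of the cyanogenic
RT   beta-glucosidase (linamarase) from white clover (Trifolium repens L.)";
RL   Plant Mol. Biol. 17(2):209-219(1991).
XX
RN   [6]
RP   1-1859
RA   Hughes M.A.;
RT   ;
RL   Submitted (19-NOV-1990) to the EMBL/GenBank/DDBJ databases.
'''

for reference in parse_references(entry):
    print(reference["RN"], reference["RA"], reference["RL"])
```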

5.2.3.3. Annotation and Features

Most sequence formats have records for descriptions, annotations and comments provided with the sequence. Molecular features associated with the sequence, such as protein secondary structure or molecular recognition sites, are kept in a feature table. These are marked up by FT records in the EMBL entry below.

XX
FH   Key             Location/Qualifiers
FH
FT   source          1..1859
FT                   /organism="Trifolium repens"
FT                   /mol_type="mRNA"
FT                   /clone_lib="lambda gt10"
FT                   /clone="TRE361"
FT                   /tissue_type="leaves"
FT                   /db_xref="taxon:3899"
FT   CDS             14..1495
FT                   /product="beta-glucosidase"
FT                   /EC_number="3.2.1.21"
FT                   /note="non-cyanogenic"
FT                   /db_xref="GOA:P26204"
FT                   /db_xref="HSSP:P26205"
FT                   /db_xref="InterPro:IPR001360"
FT                   /db_xref="UniProtKB/Swiss-Prot:P26204"
FT                   /protein_id="CAA40058.1"
FT                   /translation="MDFIVAIFALFVISSFTITSTNAVEASTLLDIGNLSRSSFPRGFI
FT                   FGAGSSAYQFEGAVNEGGRGPSIWDTFTHKYPEKIRDGSNADITVDQYHRYKEDVGIMK
FT                   DQNMDSYRFSISWPRILPKGKLSGGINHEGIKYYNNLINELLANGIQPFVTLFHWDLPQ
FT                   VLEDEYGGFLNSGVINDFRDYTDLCFKEFGDRVRYWSTLNEPWVFSNSGYALGTNAPGR
FT                   CSASNVAKPGDSGTGPYIVTHNQILAHAEAVHVYKTKYQAYQKGKIGITLVSNWLMPLD
FT                   DNSIPDIKAAERSLDFQFGLFMEQLTTGDYSKSMRRIVKNRLPKFSKFESSLVNGSFDF
FT                   IGINYYSSSYISNAPSHGNAKPSYSTNPMTNISFEKHGIPLGPRAASIWIYVYPYMFIQ
FT                   EDFEIFCYILKINITILQFSITENGMNEFNDATLPVEEALLNTYRIDYYYRHLYYIRSA
FT                   IRAGSNVKGFYAWSFLDCNEWFAGFTVRFGLNFVD"
FT   mRNA            1..1859
FT                   /experiment="experimental evidence, no additional details
FT                   recorded"
XX

Further information on sequence features is available (Section A.2, “Supported Feature Formats”).
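As an illustration of how a feature table is organised, the Python sketch below reads FT lines into a list of features, each with a key, a location and a set of qualifiers. It relies only on the indentation visible in the excerpt above (feature keys sit close to the line code, qualifiers are indented further) and does not handle locations that wrap over several lines. The names Feature and parse_features are hypothetical; this is not how EMBOSS reads features.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Feature:
    key: str
    location: str
    qualifiers: Dict[str, str] = field(default_factory=dict)


def parse_features(entry: str) -> List[Feature]:
    """Parse FT lines into Feature objects."""
    features: List[Feature] = []
    current: Optional[str] = None  # qualifier being continued, if any
    for line in entry.splitlines():
        if not line.startswith("FT"):
            continue
        body = line[2:]
        text = body.strip()
        if not text:
            continue
        if body[:4].strip():
            # Text close to the line code: a new feature key and location.
            key, _, location = text.partition(" ")
            features.append(Feature(key, location.strip()))
            current = None
        elif not features:
            continue
        elif text.startswith("/"):
            current, _, value = text[1:].partition("=")
            features[-1].qualifiers[current] = value
        elif current is not None:
            # Continuation line: translations are joined directly,
            # other free text with a single space.
            sep = "" if current == "translation" else " "
            features[-1].qualifiers[current] += sep + text
    for feature in features:
        # Remove the surrounding double quotes once values are complete.
        feature.qualifiers = {n: v.strip('"') for n, v in feature.qualifiers.items()}
    return features


entry = '''FT   source          1..1859
FT                   /organism="Trifolium repens"
FT                   /mol_type="mRNA"
FT   CDS             14..1495
FT                   /product="beta-glucosidase"
FT   mRNA            1..1859
FT                   /experiment="experimental evidence, no additional details
FT                   recorded"
'''

for feature in parse_features(entry):
    print(feature.key, feature.location, sorted(feature.qualifiers))
```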

5.2.3.4. The Sequence

Sequences are usually represented in IUBMB standard one-letter codes (see http://www.chem.qmul.ac.uk/iubmb/misc/naseq.html). There are exceptions, for example Staden format uses non-standard ambiguity codes. In the case of FASTA format the sequence is anything after the '>' line until the next entry starts. For other databases, records are used to delineate the sequence.

In EMBL entries, an SQ label is used to identify the sequence (the full sequence is not given):

XX
SQ   Sequence 1859 BP; 609 A; 314 C; 355 G; 581 T; 0 other;
     aaacaaacca aatatggatt ttattgtagc catatttgct ctgtttgtta ttagctcatt        60
     cacaattact tccacaaatg cagttgaagc ttctactctt cttgacatag gtaacctgag       120
     tcggagcagt tttcctcgtg gcttcatctt tggtgctgga tcttcagcat accaatttga       180
.
. sequence omitted for brevity
.
     aagttgttag gctgttattt ctattatact atgttgtagt aataagtgca ttgttgtacc      1740
     agaagctatg atcataacta taggttgatc cttcatgtat cagtttgatg ttgagaatac      1800
     tttgaattaa aagtcttttt ttattttttt aaaaaaaaaa aaaaaaaaaa aaaaaaaaa       1859
//
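The SQ line and the sequence lines that follow it can be read with a short routine. The Python sketch below takes the declared counts from the SQ line, strips the spacing and the right-hand position counter from each sequence line, and stops at the // terminator. The function name parse_sq_block is hypothetical, and the 12-base entry is made up for the example so that the declared counts can be compared with the sequence.

```python
import re
from collections import Counter
from typing import Dict, List, Tuple


def parse_sq_block(entry: str) -> Tuple[Dict[str, int], str]:
    """Return the counts declared on the SQ line and the sequence itself."""
    declared: Dict[str, int] = {}
    residues: List[str] = []
    in_sequence = False
    for line in entry.splitlines():
        if line.startswith("SQ"):
            in_sequence = True
            # "Sequence 12 BP; 4 A; ..." gives {"BP": 12, "A": 4, ...}
            for count, label in re.findall(r"(\d+)\s+(\w+);", line):
                declared[label] = int(count)
        elif line.startswith("//"):
            break  # end of entry
        elif in_sequence:
            # Drop the spacing and the position counter at the line end.
            residues.append(re.sub(r"[\s\d]", "", line))
    return declared, "".join(residues)


# A made-up entry, not a real database record.
entry = """SQ   Sequence 12 BP; 4 A; 3 C; 3 G; 2 T; 0 other;
     aaccggttac ga                                                            12
//
"""

declared, sequence = parse_sq_block(entry)
observed = Counter(sequence.upper())
print(sequence, declared["BP"] == len(sequence))
print(all(observed[base] == declared[base] for base in "ACGT"))
```

Comparing the declared counts with the parsed sequence is a cheap way to detect a truncated or damaged entry.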

5.2.4. Specifying Sequences on the Command Line

Sequences are referred to on the EMBOSS command line by their Uniform Sequence Address or USA (Section 6.6, “The Uniform Sequence Address (USA)”). This is a standard sequence naming scheme used by all EMBOSS applications. A USA specifies one or more sequences that might be read from or written to a file or to a larger databank. Other sequence sources, such as applications or web servers, can also be specified.

There is also a set of command line qualifiers (Section 6.4, “Datatype-specific Command Line Qualifiers”) that are used for sequence input and output. These allow you to set such things as file format, sequence regions, database and entry names.

For example, the format of an output sequence may be set on the command line as follows:

seqret seq.in seq.out -osformat embl

... or by giving it in the USA of the output filename:

seqret seq.in embl::seq.out
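When seqret is called from a script, the two equivalent forms above can be produced by one small helper. The Python sketch below only builds the argument list and does not run anything; the function name seqret_args and its parameters are made up for this example, and only the -osformat qualifier and the format::filename form shown above are used.

```python
from typing import List, Optional


def seqret_args(infile: str, outfile: str, fmt: Optional[str] = None,
                in_usa: bool = False) -> List[str]:
    """Build a seqret argument list, setting the output format if given."""
    if fmt is None:
        return ["seqret", infile, outfile]
    if in_usa:
        # Format given as part of the output USA: format::filename
        return ["seqret", infile, f"{fmt}::{outfile}"]
    # Format given with the -osformat qualifier
    return ["seqret", infile, outfile, "-osformat", fmt]


print(" ".join(seqret_args("seq.in", "seq.out", "embl")))
print(" ".join(seqret_args("seq.in", "seq.out", "embl", in_usa=True)))
```

On a system with EMBOSS installed, the resulting list could be passed to subprocess.run.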

5.2.5. Applications for Basic Sequence Manipulation

Most of the EMBOSS applications are for sequence manipulation. The generic sequence-handling applications are summarised in the table below.

Applications for Basic Sequence Manipulation

Application	Description
backtranseq	Backtranslate a protein sequence
compseq	Count composition of dimer/trimer/etc words in a sequence
cutseq	Removes a specified section from a sequence
degapseq	Removes gap characters from sequences
descseq	Alter the name or description of a sequence
diffseq	Find differences between nearly identical sequences
extractseq	Extract regions from a sequence
infoseq	Displays some simple information about sequences
maskseq	Mask off regions of a sequence
newseq	Type in a short new sequence
notseq	Exclude a set of sequences and write out the remaining ones
nthseq	Writes one sequence from a multiple set of sequences
pasteseq	Insert one sequence into another
prettyseq	Output sequence with translated ranges
revseq	Reverse and complement a sequence
seqmatchall	All-against-all comparison of a set of sequences
seqret	Reads and writes (returns) sequences
seqretsplit	Reads and writes (returns) sequences in individual files
showseq	Display a sequence with features, translation etc
shuffleseq	Shuffles a set of sequences maintaining composition
skipseq	Reads and writes (returns) sequences, skipping first few
transeq	Translate nucleic acid sequences
trimseq	Trim ambiguous bits off the ends of sequences
