5.2. Introduction to Sequence Formats

5.2.1. What is a Sequence Format?

A sequence format defines the permitted layout and content of text in a file. This includes text tokens that define fields used in a databank. These fields include the sequence itself, the sequence identifier name and accession number, amongst others. Non-printable control characters are not generally used, allowing most formats to be viewed on screen or printed out.

The FASTA format is a very widely used (and abused) format. It consists of a header line starting with a > character followed by a code identifying the sequence and, very often, some text describing the sequence. The header line is followed by one or more lines containing the sequence itself. FASTA files may contain one or more sequences:

>crab_anapl ALPHA CRYSTALLIN B CHAIN (ALPHA(B)-CRYSTALLIN).             
MDITIHNPLIRRPLFSWLAPSRIFDQIFGEHLQESELLPASPSLSPFLMR
SPIFRMPSWLETGLSEMRLEKDKFSVNLDVKHFSPEELKVKVLGDMVEIH
GKHEERQDEHGFIAREFNRKYRIPADVDPLTITSSLSLDGVLTVSAPRKQ
SDVPERSIPITREEKPAIAGAQRK
>crab_bovin ALPHA CRYSTALLIN B CHAIN (ALPHA(B)-CRYSTALLIN).             
MDIAIHHPWIRRPFFPFHSPSRLFDQFFGEHLLESDLFPASTSLSPFYLR
PPSFLRAPSWIDTGLSEMRLEKDRFSVNLDVKHFSPEELKVKVLGDVIEV
HGKHEERQDEHGFISREFHRKYRIPADVDPLAITSSLSSDGVLTVNGPRK
QASGPERTIPITREEKPAVTAAPKK
>crab_chick ALPHA CRYSTALLIN B CHAIN (ALPHA(B)-CRYSTALLIN).             
MDITIHNPLVRRPLFSWLTPSRIFDQIFGEHLQESELLPTSPSLSPFLMR
SPFFRMPSWLETGLSEMRLEKDKFSVNLDVKHFSPEELKVKVLGDMIEIH
GKHEERQDEHGFIAREFSRKYRIPADVDPLTITSSLSLDGVLTVSAPRKQ
SDVPERSIPITREEKPAIAGSQRK
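
The layout described above (a header line starting with >, then one or more sequence lines) can be read with a few lines of code. The following is a minimal sketch in Python, not part of EMBOSS: parse_fasta is a hypothetical helper name, and it assumes that blank lines may be ignored and that everything between two header lines belongs to one sequence.

```python
from typing import Iterable, Iterator, List, Optional, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs; the header excludes the leading '>'."""
    header: Optional[str] = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # assumption: blank lines carry no information
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequence lines are joined without spaces
    if header is not None:
        yield header, "".join(chunks)


example = """\
>crab_anapl ALPHA CRYSTALLIN B CHAIN (ALPHA(B)-CRYSTALLIN).
MDITIHNPLIRRPLFSWLAPSRIFDQIFGEHLQESELLPASPSLSPFLMR
SPIFRMPSWLETGLSEMRLEKDKFSVNLDVKHFSPEELKVKVLGDMVEIH
GKHEERQDEHGFIAREFNRKYRIPADVDPLTITSSLSLDGVLTVSAPRKQ
SDVPERSIPITREEKPAIAGAQRK
>crab_bovin ALPHA CRYSTALLIN B CHAIN (ALPHA(B)-CRYSTALLIN).
MDIAIHHPWIRRPFFPFHSPSRLFDQFFGEHLLESDLFPASTSLSPFYLR
PPSFLRAPSWIDTGLSEMRLEKDRFSVNLDVKHFSPEELKVKVLGDVIEV
HGKHEERQDEHGFISREFHRKYRIPADVDPLAITSSLSSDGVLTVNGPRK
QASGPERTIPITREEKPAVTAAPKK
"""

for header, sequence in parse_fasta(example.splitlines()):
    # The first word of the header is the identifying code.
    print(header.split()[0], len(sequence))
```

The header is returned whole because tools differ in how they interpret the text after the identifying code.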

Beyond FASTA, the most widespread sequence formats are those used by the major sequence databases, such as EMBL, GenBank and Swissprot.

Sadly, sequences are occasionally stored in non-standard formats. These include proprietary word processor formats (e.g. MS Word and MS WordPad) and text formatting languages (e.g. PostScript, PDF, RTF, TeX and HTML). EMBOSS will not read a sequence in any of these formats.

If you have a sequence in a non-standard format you should:

  • Save the sequence to a file as plain ASCII text, without any formatting whatsoever. The file should contain the sequence only. EMBOSS will recognise this "plain" format. The program you are using to view the file should have an option to "Save as..." plain text.

  • If there is not an option to save your sequence in plain text format directly, there may well be a utility program to convert the file to plain text format. The EMBOSS user community will be able to help you with this (see Section 3.5, “How to Get Help”).

  • In future, use a text editor that can write files in plain text format. These include pico, nedit, emacs and MS WordPad. When using a text editor to create a sequence file, the best (simplest) format to use is FASTA as described above. Be sure to save your sequence as plain text.

  • If you intend to manipulate or edit the sequences substantially, investigate using a full-blown sequence editor such as mse. Such editors should have an option to save the sequence to a file in one or more of the standard formats.
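
One quick way to check that a saved file really is plain ASCII text is to look for bytes outside the printable range. The sketch below is an illustration only: find_non_plain_bytes is a hypothetical helper, and the choice to allow tab, line feed and carriage return is an assumption. It cannot detect text-based markup such as RTF or HTML, which is itself made of printable characters.

```python
from typing import List, Tuple


def find_non_plain_bytes(data: bytes) -> List[Tuple[int, int]]:
    """Return (offset, byte value) pairs for bytes that are not plain ASCII text."""
    allowed_controls = {0x09, 0x0A, 0x0D}  # tab, line feed, carriage return
    return [
        (offset, value)
        for offset, value in enumerate(data)
        if not (0x20 <= value <= 0x7E or value in allowed_controls)
    ]


plain = b">myseq example\nACGTACGT\n"
suspect = b">myseq example\nACGT\x00ACGT\n"

print(find_non_plain_bytes(plain))    # an empty list means nothing suspicious was found
print(find_non_plain_bytes(suspect))  # reports where the unexpected byte sits
```

For a real file, the bytes could be obtained with open(path, "rb").read(), where path is the name of your sequence file.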

5.2.2. Supported Sequence Formats

Some sequence formats can hold multiple sequences in one file. Typically there will be multiple entries (one per sequence) that are concatenated in the file. Other formats, such as Staden, can only hold one sequence per file. An attempt to concatenate several such sequences in one file would result in a mess in which it would be difficult to tell the sequences from the annotation. Most systems, including EMBOSS, will not parse such files, so you should never use a single-sequence format to hold multiple sequences.

Sequences are also held in alignment files. These contain the results of aligning (lining up similar or equivalent characters in) two or more sequences. EMBOSS supports most common sequence alignment formats (Section A.3, “Supported Alignment Formats”).

All of the common sequence formats are supported in EMBOSS for both application input (reading) and output (writing). These are summarised below. Some support single sequences only, some multiple sequences. The names of the sequence formats are taken from common EMBOSS database configurations. Some of these are obviously synonyms e.g. "embl" and "em". In practice, the names available will depend on what's defined in your EMBOSS configuration files (see Section 2.8, “Maintenance”). For descriptions and examples of the supported formats see Section A.1, “Supported Sequence Formats”.

The supported sequence formats are summarised in the tables below. The columns are as follows:

  • Input Format / Output Format: the format name.

  • Sngl: indicates whether each sequence is written to a new file by default for that format. This behaviour can also be set with the -ossingle command line qualifier.

  • Save: indicates that sequence data is stored internally and written when the output is closed. This is needed for 'interleaved' formats such as Phylip and MSF.

  • Try: indicates whether the format can be detected automatically on input.

  • Nuc: "Yes" indicates that nucleotide sequence data may be represented.

  • Pro: "Yes" indicates that protein sequence data may be represented.

  • Feat: whether the format includes feature annotation data. EMBOSS can also read feature data from a separate feature file.

  • Gap: whether the format supports sequence data with gap characters, for example the results of an alignment.

  • Mset: "Yes" indicates that more than one set of sequences can be stored in a single file. This is used by, for example, phylogenetic analysis applications to store many versions of a multiple alignment for statistical analysis.

  • Description: a short description of the format.

Table 5.1. Input sequence formats

Input Format    Try  Nuc  Pro  Feat  Gap  Mset  Description
abi             Yes  Yes  Yes  No    Yes  No    ABI trace file
acedb           Yes  Yes  Yes  No    Yes  No    ACEDB sequence format
clustal         Yes  Yes  Yes  No    Yes  No    Clustalw output format
codata          Yes  Yes  Yes  Yes   Yes  No    Codata entry format
dbid            No   Yes  Yes  No    Yes  No    Fasta format variant with database name before ID
embl            Yes  Yes  No   Yes   Yes  No    EMBL entry format
experiment      Yes  Yes  Yes  No    Yes  No    Staden experiment file
fasta           Yes  Yes  Yes  No    Yes  No    FASTA format including NCBI-style IDs
fastq           Yes  Yes  No   No    No   No    FASTQ short read format ignoring quality scores
fastq-illumina  No   Yes  No   No    No   No    FASTQ Illumina 1.3 short read format
fastq-sanger    No   Yes  No   No    No   No    FASTQ short read format with phred quality
fastq-solexa    No   Yes  No   No    No   No    FASTQ Solexa/Illumina 1.0 short read format
fitch           Yes  Yes  Yes  No    Yes  No    Fitch program format
gcg             Yes  Yes  Yes  No    Yes  No    GCG sequence format
genbank         Yes  Yes  No   Yes   Yes  No    Genbank entry format
genpept         No   No   Yes  Yes   Yes  No    Refseq protein entry format (alias)
gff2            Yes  Yes  Yes  Yes   Yes  No    GFF feature file with sequence in the header
gff3            Yes  Yes  Yes  Yes   Yes  No    GFF3 feature file with sequence
gifasta         No   Yes  Yes  No    Yes  No    FASTA format including NCBI-style GIs (alias)
hennig86        Yes  Yes  Yes  No    Yes  No    Hennig86 output format
ig              No   Yes  Yes  No    Yes  No    Intelligenetics sequence format
igstrict        Yes  Yes  Yes  No    Yes  No    Intelligenetics sequence format strict parser
jackknifer      Yes  Yes  Yes  No    Yes  No    Jackknifer interleaved and non-interleaved formats
mase            No   Yes  Yes  No    Yes  No    Mase program format
mega            Yes  Yes  Yes  No    Yes  No    Mega interleaved and non-interleaved formats
msf             Yes  Yes  Yes  No    Yes  No    GCG MSF (multiple sequence file) file format
nbrf            Yes  Yes  Yes  Yes   Yes  No    NBRF/PIR entry format
nexus           Yes  Yes  Yes  No    Yes  No    Nexus/paup interleaved format
pdb             Yes  No   Yes  No    No   No    PDB protein databank format ATOM lines
pdbnuc          No   Yes  No   No    No   No    PDB protein databank format nucleotide ATOM lines
pdbnucseq       No   Yes  No   No    No   No    PDB protein databank format nucleotide SEQRES lines
pdbseq          Yes  No   Yes  No    No   No    PDB protein databank format SEQRES lines
pearson         Yes  Yes  Yes  No    Yes  No    Plain old fasta format with IDs not parsed further
phylip          Yes  Yes  Yes  No    Yes  Yes   Phylip interleaved and non-interleaved formats
phylipnon       No   Yes  Yes  No    Yes  Yes   Phylip non-interleaved format
raw             Yes  Yes  Yes  No    No   No    Raw sequence with no non-sequence characters
refseqp         No   No   Yes  Yes   Yes  No    Refseq protein entry format
selex           No   Yes  Yes  No    Yes  No    Selex format
staden          No   Yes  Yes  No    Yes  No    Old staden package sequence format
stockholm       Yes  Yes  Yes  No    Yes  No    Stockholm (pfam) format
strider         Yes  Yes  Yes  No    Yes  No    DNA strider output format
swiss           Yes  No   Yes  Yes   Yes  No    Swissprot entry format
text            No   Yes  Yes  No    Yes  No    Plain text
treecon         Yes  Yes  Yes  No    Yes  No    Treecon output format

Table 5.2. Output sequence formats

Output Format   Sngl  Save  Nuc  Pro  Feat  Gap  Mset  Description
acedb           No    No    Yes  Yes  No    Yes  No    ACEDB sequence format
asn1            No    No    Yes  Yes  No    Yes  No    NCBI ASN.1 format
clustal         No    Yes   Yes  Yes  No    Yes  No    Clustalw multiple alignment format
codata          No    No    Yes  Yes  No    Yes  No    Codata entry format
das             No    No    Yes  Yes  No    Yes  No    DASSEQUENCE DAS any sequence
dasdna          No    No    Yes  No   No    Yes  No    DASDNA DAS nucleotide-only sequence
debug           No    No    Yes  Yes  No    Yes  No    Debugging trace of full internal data content
embl            No    No    Yes  No   Yes   Yes  No    EMBL entry format
experiment      No    No    Yes  Yes  No    Yes  No    Staden experiment file
fasta           No    No    Yes  Yes  No    Yes  No    FASTA format
fastq-illumina  No    No    Yes  No   No    No   No    FASTQ Illumina 1.3 short read format
fastq-sanger    No    No    Yes  No   No    No   No    FASTQ short read format with phred quality
fastq-solexa    No    No    Yes  No   No    No   No    FASTQ Solexa/Illumina 1.0 short read format
fitch           No    No    Yes  Yes  No    Yes  No    Fitch program format
gcg             No    No    Yes  Yes  No    Yes  No    GCG sequence format
genbank         No    No    Yes  No   No    Yes  No    Genbank entry format
gff2            No    No    Yes  Yes  Yes   Yes  No    GFF2 feature file with sequence in the header
gff3            No    No    Yes  Yes  Yes   Yes  No    GFF3 feature file with sequence in FASTA format after
gifasta         No    No    Yes  Yes  No    Yes  No    NCBI fasta format with NCBI-style IDs using GI number
hennig86        No    Yes   Yes  Yes  No    Yes  No    Hennig86 output format
ig              No    No    Yes  Yes  No    Yes  No    Intelligenetics sequence format
jackknifer      No    Yes   Yes  Yes  No    Yes  No    Jackknifer output interleaved format
jackknifernon   No    Yes   Yes  Yes  No    Yes  No    Jackknifer output non-interleaved format
mase            No    No    Yes  Yes  No    Yes  No    Mase program format
mega            No    Yes   Yes  Yes  No    Yes  No    Mega interleaved output format
meganon         No    Yes   Yes  Yes  No    Yes  No    Mega non-interleaved output format
msf             No    Yes   Yes  Yes  No    Yes  No    GCG MSF (multiple sequence file) file format
nbrf            No    No    Yes  Yes  Yes   Yes  No    NBRF/PIR entry format
ncbi            No    No    Yes  Yes  No    Yes  No    NCBI fasta format with NCBI-style IDs
nexus           No    Yes   Yes  Yes  No    Yes  No    Nexus/paup interleaved format
nexusnon        No    Yes   Yes  Yes  No    Yes  No    Nexus/paup non-interleaved format
phylip          No    Yes   Yes  Yes  No    Yes  Yes   Phylip interleaved format
phylipnon       No    Yes   Yes  Yes  No    Yes  No    Phylip non-interleaved format
selex           No    Yes   Yes  Yes  No    Yes  No    Selex format
staden          No    No    Yes  Yes  No    Yes  No    Old staden package sequence format
strider         No    No    Yes  Yes  No    Yes  No    DNA strider output format
swiss           No    No    No   Yes  Yes   Yes  No    Swissprot entry format
text            No    No    Yes  Yes  No    Yes  No    Plain text
treecon         No    Yes   Yes  Yes  No    Yes  No    Treecon output format
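
The columns of Table 5.1 can be treated as simple yes/no properties when choosing a format. The sketch below copies five rows of that table by hand into a Python structure and filters them. The names InputFormat, FORMATS and select are illustrative only and are not part of EMBOSS. A real script would need the complete table.

```python
from typing import List, NamedTuple


class InputFormat(NamedTuple):
    name: str
    auto_detect: bool  # the "Try" column
    nuc: bool
    pro: bool
    feat: bool
    gap: bool
    mset: bool


# A hand-copied subset of Table 5.1 (not the full table).
FORMATS = [
    InputFormat("dbid", False, True, True, False, True, False),
    InputFormat("embl", True, True, False, True, True, False),
    InputFormat("fasta", True, True, True, False, True, False),
    InputFormat("phylip", True, True, True, False, True, True),
    InputFormat("swiss", True, False, True, True, True, False),
]


def select(formats: List[InputFormat], **required: bool) -> List[str]:
    """Return the names of formats whose columns match every keyword given."""
    return [
        fmt.name
        for fmt in formats
        if all(getattr(fmt, column) == wanted for column, wanted in required.items())
    ]


# Formats in the subset that are detected automatically and can hold protein data.
print(select(FORMATS, auto_detect=True, pro=True))
# Formats in the subset that carry feature annotation.
print(select(FORMATS, feat=True))
```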

5.2.3. Contents of a Sequence Entry

An entry in a sequence databank will typically include a code and other information to identify the sequence, some bibliographic information, sequence annotation including a description of any features and, of course, the sequence itself.

An excerpt of the EMBL entry for a beta-glucosidase mRNA sequence is shown below:

ID   X56734; SV 1; linear; mRNA; STD; PLN; 1859 BP.
XX
AC   X56734; S46826;
XX
DT   12-SEP-1991 (Rel. 29, Created)
DT   25-NOV-2005 (Rel. 85, Last updated, Version 11)
XX
DE   Trifolium repens mRNA for non-cyanogenic beta-glucosidase
XX
KW   beta-glucosidase.
XX
OS   Trifolium repens (white clover)
OC   Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
OC   Spermatophyta; Magnoliophyta; eudicotyledons; core eudicotyledons; rosids;
OC   eurosids I; Fabales; Fabaceae; Papilionoideae; Trifolieae; Trifolium.
XX
RN   [5]
RP   1-1859
RX   PUBMED; 1907511.
RA   Oxtoby E., Dunn M.A., Pancoro A., Hughes M.A.;
RT   "Nucleotide and derived amino acid sequence of the cyanogenic
RT   beta-glucosidase (linamarase) from white clover (Trifolium repens L.)";
RL   Plant Mol. Biol. 17(2):209-219(1991).
XX
RN   [6]
RP   1-1859
RA   Hughes M.A.;
RT   ;
RL   Submitted (19-NOV-1990) to the EMBL/GenBank/DDBJ databases.
RL   Hughes M.A., University of Newcastle Upon Tyne, Medical School, Newcastle
RL   Upon Tyne, NE2 4HH, UK
XX
FH   Key             Location/Qualifiers
FH
FT   source          1..1859
FT                   /organism="Trifolium repens"
FT                   /mol_type="mRNA"
FT                   /clone_lib="lambda gt10"
FT                   /clone="TRE361"
FT                   /tissue_type="leaves"
FT                   /db_xref="taxon:3899"
FT   CDS             14..1495
FT                   /product="beta-glucosidase"
FT                   /EC_number="3.2.1.21"
FT                   /note="non-cyanogenic"
FT                   /db_xref="GOA:P26204"
FT                   /db_xref="HSSP:P26205"
FT                   /db_xref="InterPro:IPR001360"
FT                   /db_xref="UniProtKB/Swiss-Prot:P26204"
FT                   /protein_id="CAA40058.1"
FT                   /translation="MDFIVAIFALFVISSFTITSTNAVEASTLLDIGNLSRSSFPRGFI
FT                   FGAGSSAYQFEGAVNEGGRGPSIWDTFTHKYPEKIRDGSNADITVDQYHRYKEDVGIMK
FT                   DQNMDSYRFSISWPRILPKGKLSGGINHEGIKYYNNLINELLANGIQPFVTLFHWDLPQ
FT                   VLEDEYGGFLNSGVINDFRDYTDLCFKEFGDRVRYWSTLNEPWVFSNSGYALGTNAPGR
FT                   CSASNVAKPGDSGTGPYIVTHNQILAHAEAVHVYKTKYQAYQKGKIGITLVSNWLMPLD
FT                   DNSIPDIKAAERSLDFQFGLFMEQLTTGDYSKSMRRIVKNRLPKFSKFESSLVNGSFDF
FT                   IGINYYSSSYISNAPSHGNAKPSYSTNPMTNISFEKHGIPLGPRAASIWIYVYPYMFIQ
FT                   EDFEIFCYILKINITILQFSITENGMNEFNDATLPVEEALLNTYRIDYYYRHLYYIRSA
FT                   IRAGSNVKGFYAWSFLDCNEWFAGFTVRFGLNFVD"
FT   mRNA            1..1859
FT                   /experiment="experimental evidence, no additional details
FT                   recorded"
XX
SQ   Sequence 1859 BP; 609 A; 314 C; 355 G; 581 T; 0 other;
     aaacaaacca aatatggatt ttattgtagc catatttgct ctgtttgtta ttagctcatt        60
     cacaattact tccacaaatg cagttgaagc ttctactctt cttgacatag gtaacctgag       120
     tcggagcagt tttcctcgtg gcttcatctt tggtgctgga tcttcagcat accaatttga       180
.
. sequence omitted for brevity
.
     aagttgttag gctgttattt ctattatact atgttgtagt aataagtgca ttgttgtacc      1740
     agaagctatg atcataacta taggttgatc cttcatgtat cagtttgatg ttgagaatac      1800
     tttgaattaa aagtcttttt ttattttttt aaaaaaaaaa aaaaaaaaaa aaaaaaaaa       1859
//

5.2.3.1. Identification

IDs and Accessions

An entry in a database must have some way of being uniquely identified. Most sequence databases have two such identifiers for each sequence - an ID name and an accession number.

Accession numbers are unique alphanumeric identifiers that are guaranteed to remain with that sequence through the life of the database. If two sequences are merged, then the new sequence will get a new accession number and the accession numbers of the merged sequences will be retained as 'secondary' accession numbers. EMBL, GenBank and Swissprot share an accession numbering scheme - an accession number uniquely identifies a sequence within these three databases. In contrast, ID names are not guaranteed to remain the same between different versions of a database, although in practice they usually do.

Why are there two such identifiers? The ID name was originally intended to be a human-readable name that indicated the function of its sequence. In EMBL and GenBank the first two (or three) letters indicated the species and the rest indicated the function, for example hsfau is the 'Homo sapiens FAU pseudogene'. This naming scheme became a problem when the number of entries added each day grew so large that people could not make up ID names fast enough. Instead, the accession number started to be used as the ID name as well. You will therefore now find ID names such as AF061303 that are the same as the accession number for that sequence in EMBL.

Most sequence formats include an identifier code in some form or another. Typically this is an accession number and/or identifier name (ID) and is given near the top of the entry. They uniquely identify an entry in the database.

For our EMBL entry, the accession number X56734 is given on the ID line and separately in the AC line:

ID   X56734; SV 1; linear; mRNA; STD; PLN; 1859 BP.
XX
AC   X56734; S46826;
XX

In contrast, FASTA format often gives the ID as the first word of an informative title line:

>IDName An Informative comment
ttcctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgc
agctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcg
cccagatcaaggctcatgtagcctcactggagggcatt
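
The two conventions shown above can be compared side by side in code. This is a minimal sketch with hypothetical helper names (embl_identifiers, fasta_identifier). It assumes the EMBL line layout of the excerpt, that is, a two-letter line code, three spaces, then semicolon-separated fields. Other releases of the database may lay out the ID line differently.

```python
from typing import Iterable, List, Optional, Tuple


def embl_identifiers(entry_lines: Iterable[str]) -> Tuple[Optional[str], List[str]]:
    """Return (ID name, accession numbers) from the ID and AC lines."""
    id_name: Optional[str] = None
    accessions: List[str] = []
    for line in entry_lines:
        code, data = line[:2], line[5:].strip()
        if code == "ID":
            id_name = data.split(";")[0].strip()  # first field of the ID line
        elif code == "AC":
            accessions.extend(a.strip() for a in data.split(";") if a.strip())
    return id_name, accessions


def fasta_identifier(header: str) -> Tuple[str, str]:
    """Split a FASTA header line into (ID, comment)."""
    if not header.startswith(">"):
        raise ValueError("not a FASTA header line")
    words = header[1:].split(None, 1)
    id_name = words[0] if words else ""
    comment = words[1].strip() if len(words) > 1 else ""
    return id_name, comment


embl_lines = [
    "ID   X56734; SV 1; linear; mRNA; STD; PLN; 1859 BP.",
    "XX",
    "AC   X56734; S46826;",
    "XX",
]
print(embl_identifiers(embl_lines))
print(fasta_identifier(">IDName An Informative comment"))
```

In the EMBL case the second accession number on the AC line is kept in the list alongside the first.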

5.2.3.2. Bibliographic Information

Most sequence formats have records for bibliographic information such as literature references, experimental details, author contact information, cross-links to other databases, and much more besides. In the example below, the date of release (DT), a description (DE), keywords (KW), organism species (OS), organism classification (OC) and reference information (RN, RP, RX, RA, RT and RL) are given:

DT   12-SEP-1991 (Rel. 29, Created)
DT   25-NOV-2005 (Rel. 85, Last updated, Version 11)
XX
DE   Trifolium repens mRNA for non-cyanogenic beta-glucosidase
XX
KW   beta-glucosidase.
XX
OS   Trifolium repens (white clover)
OC   Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
OC   Spermatophyta; Magnoliophyta; eudicotyledons; core eudicotyledons; rosids;
OC   eurosids I; Fabales; Fabaceae; Papilionoideae; Trifolieae; Trifolium.
XX
RN   [5]
RP   1-1859
RX   PUBMED; 1907511.
RA   Oxtoby E., Dunn M.A., Pancoro A., Hughes M.A.;
RT   "Nucleotide and derived amino acid sequence of the cyanogenic
RT   beta-glucosidase (linamarase) from white clover (Trifolium repens L.)";
RL   Plant Mol. Biol. 17(2):209-219(1991).
XX
RN   [6]
RP   1-1859
RA   Hughes M.A.;
RT   ;
RL   Submitted (19-NOV-1990) to the EMBL/GenBank/DDBJ databases.
RL   Hughes M.A., University of Newcastle Upon Tyne, Medical School, Newcastle
RL   Upon Tyne, NE2 4HH, UK
XX
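
Because every record in the excerpt starts with a two-letter line code, the records can be grouped by that code. The following Python sketch shows the idea. group_by_line_code is a hypothetical helper, and the assumption that data starts at the sixth character is taken from the excerpt rather than from a format specification.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def group_by_line_code(entry_lines: Iterable[str]) -> Dict[str, List[str]]:
    """Collect the text of each record under its two-letter line code."""
    records: Dict[str, List[str]] = defaultdict(list)
    for line in entry_lines:
        code = line[:2]
        # Skip spacer lines (XX), the terminator (//) and unlabelled lines.
        if code in ("XX", "//") or not code.strip():
            continue
        records[code].append(line[5:].rstrip())
    return dict(records)


excerpt = [
    "DT   12-SEP-1991 (Rel. 29, Created)",
    "DT   25-NOV-2005 (Rel. 85, Last updated, Version 11)",
    "XX",
    "DE   Trifolium repens mRNA for non-cyanogenic beta-glucosidase",
    "XX",
    "KW   beta-glucosidase.",
    "XX",
    "OS   Trifolium repens (white clover)",
]

records = group_by_line_code(excerpt)
for code, values in records.items():
    print(code, values)
```

Records that span several lines, such as OC or RT, end up as several list items under the same code and can be joined afterwards if needed.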

5.2.3.3. Annotation and Features

Most sequence formats have records for descriptions, annotations and comments provided with the sequence. Molecular features associated with the sequence, such as protein secondary structure or molecular recognition sites, are kept in a feature table. These are marked up by FT records in the EMBL entry below.

XX
FH   Key             Location/Qualifiers
FH
FT   source          1..1859
FT                   /organism="Trifolium repens"
FT                   /mol_type="mRNA"
FT                   /clone_lib="lambda gt10"
FT                   /clone="TRE361"
FT                   /tissue_type="leaves"
FT                   /db_xref="taxon:3899"
FT   CDS             14..1495
FT                   /product="beta-glucosidase"
FT                   /EC_number="3.2.1.21"
FT                   /note="non-cyanogenic"
FT                   /db_xref="GOA:P26204"
FT                   /db_xref="HSSP:P26205"
FT                   /db_xref="InterPro:IPR001360"
FT                   /db_xref="UniProtKB/Swiss-Prot:P26204"
FT                   /protein_id="CAA40058.1"
FT                   /translation="MDFIVAIFALFVISSFTITSTNAVEASTLLDIGNLSRSSFPRGFI
FT                   FGAGSSAYQFEGAVNEGGRGPSIWDTFTHKYPEKIRDGSNADITVDQYHRYKEDVGIMK
FT                   DQNMDSYRFSISWPRILPKGKLSGGINHEGIKYYNNLINELLANGIQPFVTLFHWDLPQ
FT                   VLEDEYGGFLNSGVINDFRDYTDLCFKEFGDRVRYWSTLNEPWVFSNSGYALGTNAPGR
FT                   CSASNVAKPGDSGTGPYIVTHNQILAHAEAVHVYKTKYQAYQKGKIGITLVSNWLMPLD
FT                   DNSIPDIKAAERSLDFQFGLFMEQLTTGDYSKSMRRIVKNRLPKFSKFESSLVNGSFDF
FT                   IGINYYSSSYISNAPSHGNAKPSYSTNPMTNISFEKHGIPLGPRAASIWIYVYPYMFIQ
FT                   EDFEIFCYILKINITILQFSITENGMNEFNDATLPVEEALLNTYRIDYYYRHLYYIRSA
FT                   IRAGSNVKGFYAWSFLDCNEWFAGFTVRFGLNFVD"
FT   mRNA            1..1859
FT                   /experiment="experimental evidence, no additional details
FT                   recorded"
XX

Further information on sequence features is available (Section A.2, “Supported Feature Formats”).
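
As an illustration of how the FT records above are laid out, the sketch below picks out each feature key and its location, and computes the length of simple start..end locations. The helper names are hypothetical. The code assumes that a feature key starts at the sixth character of an FT line and that qualifier lines (those beginning with /) are indented further, as in the excerpt. More complex location expressions are not handled.

```python
import re
from typing import Iterable, List, Optional, Tuple


def feature_locations(entry_lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (feature key, location) for each feature in the FT records."""
    features: List[Tuple[str, str]] = []
    for line in entry_lines:
        # A key line has a non-blank sixth character; qualifier lines do not.
        if line.startswith("FT") and len(line) > 5 and line[5] != " ":
            parts = line[5:].split(None, 1)
            if len(parts) == 2:
                features.append((parts[0], parts[1].strip()))
    return features


def simple_span_length(location: str) -> Optional[int]:
    """Length of a 'start..end' location, or None for anything more complex."""
    match = re.fullmatch(r"(\d+)\.\.(\d+)", location)
    if match is None:
        return None
    start, end = int(match.group(1)), int(match.group(2))
    return end - start + 1


ft_lines = [
    'FT   source          1..1859',
    'FT                   /organism="Trifolium repens"',
    'FT   CDS             14..1495',
    'FT                   /product="beta-glucosidase"',
    'FT   mRNA            1..1859',
]

for key, location in feature_locations(ft_lines):
    print(key, location, simple_span_length(location))
```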

5.2.3.4. The Sequence

Sequences are usually represented in IUBMB standard one-letter codes (see http://www.chem.qmul.ac.uk/iubmb/misc/naseq.html). There are exceptions, for example Staden format uses non-standard ambiguity codes. In the case of FASTA format the sequence is anything after the '>' line until the next entry starts. For other databases, records are used to delineate the sequence.

In EMBL entries, an SQ label is used to identify the sequence (the full sequence is not given):

XX
SQ   Sequence 1859 BP; 609 A; 314 C; 355 G; 581 T; 0 other;
     aaacaaacca aatatggatt ttattgtagc catatttgct ctgtttgtta ttagctcatt        60
     cacaattact tccacaaatg cagttgaagc ttctactctt cttgacatag gtaacctgag       120
     tcggagcagt tttcctcgtg gcttcatctt tggtgctgga tcttcagcat accaatttga       180
.
. sequence omitted for brevity
.
     aagttgttag gctgttattt ctattatact atgttgtagt aataagtgca ttgttgtacc      1740
     agaagctatg atcataacta taggttgatc cttcatgtat cagtttgatg ttgagaatac      1800
     tttgaattaa aagtcttttt ttattttttt aaaaaaaaaa aaaaaaaaaa aaaaaaaaa       1859
//
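
The SQ line and the sequence lines that follow it can be checked against each other. The sketch below is illustrative only, with hypothetical helper names. It assumes a nucleotide entry (length given in BP) laid out as in the excerpt, reads the length and base counts from the SQ line, and strips the spacing and position counts from a sequence line.

```python
import re
from typing import Dict, Tuple


def parse_sq_header(line: str) -> Tuple[int, Dict[str, int]]:
    """Return (sequence length, base counts) from an EMBL SQ line."""
    match = re.match(r"SQ\s+Sequence\s+(\d+)\s+BP;(.*)", line)
    if match is None:
        raise ValueError("not a nucleotide SQ line")
    length = int(match.group(1))
    counts: Dict[str, int] = {}
    for field in match.group(2).split(";"):
        field = field.strip()
        if field:
            number, label = field.split(None, 1)
            counts[label] = int(number)
    return length, counts


def clean_sequence_line(line: str) -> str:
    """Keep only the letters, dropping blanks and the trailing position count."""
    return "".join(ch for ch in line if ch.isalpha())


sq_line = "SQ   Sequence 1859 BP; 609 A; 314 C; 355 G; 581 T; 0 other;"
first_line = "     aaacaaacca aatatggatt ttattgtagc catatttgct ctgtttgtta ttagctcatt        60"

length, counts = parse_sq_header(sq_line)
print(length, counts)
print(sum(counts.values()) == length)        # the counts should add up to the length
print(len(clean_sequence_line(first_line)))  # residues on the first sequence line
```

For this entry the base counts on the SQ line (609 + 314 + 355 + 581 + 0) add up to the stated length of 1859.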

5.2.4. Specifying Sequences on the Command Line

Sequences are referred to on the EMBOSS command line by their Uniform Sequence Address or USA (Section 6.6, “The Uniform Sequence Address (USA)”). This is a standard sequence naming scheme used by all EMBOSS applications. A USA specifies one or more sequences that might be read from or written to a file or to a larger databank. Other sequence sources, such as applications or web servers, can also be specified.

There is also a set of command line qualifiers (Section 6.4, “Datatype-specific Command Line Qualifiers”) that are used for sequence input and output. These allow you to set such things as file format, sequence regions, database and entry names.

For example, the format of an output sequence may be set on the command line as follows:

seqret seq.in seq.out -osformat embl

... or by giving it in the USA of the output filename:

seqret seq.in embl::seq.out
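
When EMBOSS is driven from a script, the two forms above can be built programmatically. The Python sketch below only constructs the argument lists and does not run anything. seqret_command is a hypothetical helper, and it covers only the format::filename form of the USA shown here, not the full syntax described in Section 6.6.

```python
from typing import List, Optional


def seqret_command(
    infile: str,
    outfile: str,
    out_format: Optional[str] = None,
    use_qualifier: bool = True,
) -> List[str]:
    """Build a seqret argument list, giving the output format in one of two ways."""
    if out_format is None:
        return ["seqret", infile, outfile]
    if use_qualifier:
        # Form 1: the -osformat command line qualifier.
        return ["seqret", infile, outfile, "-osformat", out_format]
    # Form 2: the format given as a prefix in the output USA.
    return ["seqret", infile, f"{out_format}::{outfile}"]


print(" ".join(seqret_command("seq.in", "seq.out", "embl")))
print(" ".join(seqret_command("seq.in", "seq.out", "embl", use_qualifier=False)))
```

On a machine with EMBOSS installed, such a list could be passed to subprocess.run to execute the command.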

5.2.5. Applications for Basic Sequence Manipulation

Most of the EMBOSS applications are for sequence manipulation. The generic sequence-handling applications are summarised in the table below.

Applications for Basic Sequence Manipulation

Application   Description
backtranseq   Backtranslate a protein sequence
compseq       Count composition of dimer/trimer/etc words in a sequence
cutseq        Removes a specified section from a sequence
degapseq      Removes gap characters from sequences
descseq       Alter the name or description of a sequence
diffseq       Find differences between nearly identical sequences
extractseq    Extract regions from a sequence
infoseq       Displays some simple information about sequences
maskseq       Mask off regions of a sequence
newseq        Type in a short new sequence
notseq        Exclude a set of sequences and write out the remaining ones
nthseq        Writes one sequence from a multiple set of sequences
pasteseq      Insert one sequence into another
prettyseq     Output sequence with translated ranges
revseq        Reverse and complement a sequence
seqmatchall   All-against-all comparison of a set of sequences
seqret        Reads and writes (returns) sequences
seqretsplit   Reads and writes (returns) sequences in individual files
showseq       Display a sequence with features, translation etc
shuffleseq    Shuffles a set of sequences maintaining composition
skipseq       Reads and writes (returns) sequences, skipping first few
transeq       Translate nucleic acid sequences
trimseq       Trim ambiguous bits off the ends of sequences