EMBL format (skbio.io.format.embl)

EMBL format stores sequence and its annotation together. The start of the annotation section is marked by a line beginning with the word “ID”. The start of sequence section is marked by a line beginning with the word “SQ”. The “//” (terminator) line also contains no data or comments and designates the end of an entry. More information on EMBL file format can be found here 1.

The EMBL file may end with .embl or .txt extension. An example of EMBL file can be seen here 2.

Feature Level Products

As described in 3 “Feature-level products contain nucleotide sequence and related annotations derived from submitted ENA assembled and annotated sequences. Data are distributed in flatfile format, similar to that of parent ENA records, with each flatfile representing a single feature”. While only the sequence of the feature is included in such entries, features are derived from the parent entry, and can’t be applied as interval metadata. For such reason, interval metatdata are ignored from Feature-level products, as they will be ignored by subsetting a generic Sequence object.

Format Support

Has Sniffer: Yes

NOTE: No protein support at the moment

Current protein support development is tracked in issue-1499 4

Reader

Writer

Object Class

Yes

Yes

skbio.sequence.Sequence

Yes

Yes

skbio.sequence.DNA

Yes

Yes

skbio.sequence.RNA

No

No

skbio.sequence.Protein

Yes

Yes

generator of skbio.sequence.Sequence objects

Format Specification

State: Experimental as of 0.5.1-dev.

Sections before FH (Feature Header)

All the sections before FH (Feature Header) will be read into the attribute of metadata. The header and its content of a section are stored as key-value pairs in metadata. For the RN (Reference Number) section, its value is stored as a list, as there are often multiple reference sections in one EMBL record.

SQ section

The sequence in the SQ section is always in lowercase for the EMBL files downloaded from ENA. For the RNA molecules, t (thymine), instead of u (uracil) is used in the sequence. All EMBL writers follow these conventions while writing EMBL files.

Examples

Reading EMBL Files

Suppose we have the following EMBL file example:

>>> embl_str = '''
... ID   X56734; SV 1; linear; mRNA; STD; PLN; 1859 BP.
... XX
... AC   X56734; S46826;
... XX
... DT   12-SEP-1991 (Rel. 29, Created)
... DT   25-NOV-2005 (Rel. 85, Last updated, Version 11)
... XX
... DE   Trifolium repens mRNA for non-cyanogenic beta-glucosidase
... XX
... KW   beta-glucosidase.
... XX
... OS   Trifolium repens (white clover)
... OC   Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
... OC   Spermatophyta; Magnoliophyta; eudicotyledons; Gunneridae;
... OC   Pentapetalae; rosids; fabids; Fabales; Fabaceae; Papilionoideae;
... OC   Trifolieae; Trifolium.
... XX
... RN   [5]
... RP   1-1859
... RX   DOI; 10.1007/BF00039495.
... RX   PUBMED; 1907511.
... RA   Oxtoby E., Dunn M.A., Pancoro A., Hughes M.A.;
... RT   "Nucleotide and derived amino acid sequence of the cyanogenic
... RT   beta-glucosidase (linamarase) from white clover
... RT   (Trifolium repens L.)";
... RL   Plant Mol. Biol. 17(2):209-219(1991).
... XX
... RN   [6]
... RP   1-1859
... RA   Hughes M.A.;
... RT   ;
... RL   Submitted (19-NOV-1990) to the INSDC.
... RL   Hughes M.A., University of Newcastle Upon Tyne, Medical School,
... RL   Newcastle
... RL   Upon Tyne, NE2 4HH, UK
... XX
... DR   MD5; 1e51ca3a5450c43524b9185c236cc5cc.
... XX
... FH   Key             Location/Qualifiers
... FH
... FT   source          1..1859
... FT                   /organism="Trifolium repens"
... FT                   /mol_type="mRNA"
... FT                   /clone_lib="lambda gt10"
... FT                   /clone="TRE361"
... FT                   /tissue_type="leaves"
... FT                   /db_xref="taxon:3899"
... FT   mRNA            1..1859
... FT                   /experiment="experimental evidence, no additional
... FT                   details recorded"
... FT   CDS             14..1495
... FT                   /product="beta-glucosidase"
... FT                   /EC_number="3.2.1.21"
... FT                   /note="non-cyanogenic"
... FT                   /db_xref="GOA:P26204"
... FT                   /db_xref="InterPro:IPR001360"
... FT                   /db_xref="InterPro:IPR013781"
... FT                   /db_xref="InterPro:IPR017853"
... FT                   /db_xref="InterPro:IPR033132"
... FT                   /db_xref="UniProtKB/Swiss-Prot:P26204"
... FT                   /protein_id="CAA40058.1"
... FT                   /translation="MDFIVAIFALFVISSFTITSTNAVEASTLLDIGNLSRS
... FT                   SFPRGFIFGAGSSAYQFEGAVNEGGRGPSIWDTFTHKYPEKIRDGSNADITV
... FT                   DQYHRYKEDVGIMKDQNMDSYRFSISWPRILPKGKLSGGINHEGIKYYNNLI
... FT                   NELLANGIQPFVTLFHWDLPQVLEDEYGGFLNSGVINDFRDYTDLCFKEFGD
... FT                   RVRYWSTLNEPWVFSNSGYALGTNAPGRCSASNVAKPGDSGTGPYIVTHNQI
... FT                   LAHAEAVHVYKTKYQAYQKGKIGITLVSNWLMPLDDNSIPDIKAAERSLDFQ
... FT                   FGLFMEQLTTGDYSKSMRRIVKNRLPKFSKFESSLVNGSFDFIGINYYSSSY
... FT                   ISNAPSHGNAKPSYSTNPMTNISFEKHGIPLGPRAASIWIYVYPYMFIQEDF
... FT                   EIFCYILKINITILQFSITENGMNEFNDATLPVEEALLNTYRIDYYYRHLYY
... FT                   IRSAIRAGSNVKGFYAWSFLDCNEWFAGFTVRFGLNFVD"
... XX
... SQ   Sequence 1859 BP; 609 A; 314 C; 355 G; 581 T; 0 other;
...      aaacaaacca aatatggatt ttattgtagc catatttgct ctgtttgtta ttagctcatt
...      cacaattact tccacaaatg cagttgaagc ttctactctt cttgacatag gtaacctgag
...      tcggagcagt tttcctcgtg gcttcatctt tggtgctgga tcttcagcat accaatttga
...      aggtgcagta aacgaaggcg gtagaggacc aagtatttgg gataccttca cccataaata
...      tccagaaaaa ataagggatg gaagcaatgc agacatcacg gttgaccaat atcaccgcta
...      caaggaagat gttgggatta tgaaggatca aaatatggat tcgtatagat tctcaatctc
...      ttggccaaga atactcccaa agggaaagtt gagcggaggc ataaatcacg aaggaatcaa
...      atattacaac aaccttatca acgaactatt ggctaacggt atacaaccat ttgtaactct
...      ttttcattgg gatcttcccc aagtcttaga agatgagtat ggtggtttct taaactccgg
...      tgtaataaat gattttcgag actatacgga tctttgcttc aaggaatttg gagatagagt
...      gaggtattgg agtactctaa atgagccatg ggtgtttagc aattctggat atgcactagg
...      aacaaatgca ccaggtcgat gttcggcctc caacgtggcc aagcctggtg attctggaac
...      aggaccttat atagttacac acaatcaaat tcttgctcat gcagaagctg tacatgtgta
...      taagactaaa taccaggcat atcaaaaggg aaagataggc ataacgttgg tatctaactg
...      gttaatgcca cttgatgata atagcatacc agatataaag gctgccgaga gatcacttga
...      cttccaattt ggattgttta tggaacaatt aacaacagga gattattcta agagcatgcg
...      gcgtatagtt aaaaaccgat tacctaagtt ctcaaaattc gaatcaagcc tagtgaatgg
...      ttcatttgat tttattggta taaactatta ctcttctagt tatattagca atgccccttc
...      acatggcaat gccaaaccca gttactcaac aaatcctatg accaatattt catttgaaaa
...      acatgggata cccttaggtc caagggctgc ttcaatttgg atatatgttt atccatatat
...      gtttatccaa gaggacttcg agatcttttg ttacatatta aaaataaata taacaatcct
...      gcaattttca atcactgaaa atggtatgaa tgaattcaac gatgcaacac ttccagtaga
...      agaagctctt ttgaatactt acagaattga ttactattac cgtcacttat actacattcg
...      ttctgcaatc agggctggct caaatgtgaa gggtttttac gcatggtcat ttttggactg
...      taatgaatgg tttgcaggct ttactgttcg ttttggatta aactttgtag attagaaaga
...      tggattaaaa aggtacccta agctttctgc ccaatggtac aagaactttc tcaaaagaaa
...      ctagctagta ttattaaaag aactttgtag tagattacag tacatcgttt gaagttgagt
...      tggtgcacct aattaaataa aagaggttac tcttaacata tttttaggcc attcgttgtg
...      aagttgttag gctgttattt ctattatact atgttgtagt aataagtgca ttgttgtacc
...      agaagctatg atcataacta taggttgatc cttcatgtat cagtttgatg ttgagaatac
...      tttgaattaa aagtcttttt ttattttttt aaaaaaaaaa aaaaaaaaaa aaaaaaaaa
... //
... '''

Now we can read it as DNA object:

>>> import io
>>> from skbio import DNA, RNA, Sequence
>>> embl = io.StringIO(embl_str)
>>> dna_seq = DNA.read(embl)
>>> dna_seq
DNA
----------------------------------------------------------------------
Metadata:
    'ACCESSION': 'X56734; S46826;'
    'CROSS_REFERENCE': <class 'list'>
    'DATE': <class 'list'>
    'DBSOURCE': 'MD5; 1e51ca3a5450c43524b9185c236cc5cc.'
    'DEFINITION': 'Trifolium repens mRNA for non-cyanogenic beta-
                   glucosidase'
    'KEYWORDS': 'beta-glucosidase.'
    'LOCUS': <class 'dict'>
    'REFERENCE': <class 'list'>
    'SOURCE': <class 'dict'>
    'VERSION': 'X56734.1'
Interval metadata:
    3 interval features
Stats:
    length: 1859
    has gaps: False
    has degenerates: False
    has definites: True
    GC-content: 35.99%
----------------------------------------------------------------------
0    AAACAAACCA AATATGGATT TTATTGTAGC CATATTTGCT CTGTTTGTTA TTAGCTCATT
60   CACAATTACT TCCACAAATG CAGTTGAAGC TTCTACTCTT CTTGACATAG GTAACCTGAG
...
1740 AGAAGCTATG ATCATAACTA TAGGTTGATC CTTCATGTAT CAGTTTGATG TTGAGAATAC
1800 TTTGAATTAA AAGTCTTTTT TTATTTTTTT AAAAAAAAAA AAAAAAAAAA AAAAAAAAA

Since this is a mRNA molecule, we may want to read it as RNA. As the EMBL file usually have t instead of u in the sequence, we can read it as RNA by converting t to u:

>>> embl = io.StringIO(embl_str)
>>> rna_seq = RNA.read(embl)
>>> rna_seq
RNA
----------------------------------------------------------------------
Metadata:
    'ACCESSION': 'X56734; S46826;'
    'CROSS_REFERENCE': <class 'list'>
    'DATE': <class 'list'>
    'DBSOURCE': 'MD5; 1e51ca3a5450c43524b9185c236cc5cc.'
    'DEFINITION': 'Trifolium repens mRNA for non-cyanogenic beta-
                   glucosidase'
    'KEYWORDS': 'beta-glucosidase.'
    'LOCUS': <class 'dict'>
    'REFERENCE': <class 'list'>
    'SOURCE': <class 'dict'>
    'VERSION': 'X56734.1'
Interval metadata:
    3 interval features
Stats:
    length: 1859
    has gaps: False
    has degenerates: False
    has definites: True
    GC-content: 35.99%
----------------------------------------------------------------------
0    AAACAAACCA AAUAUGGAUU UUAUUGUAGC CAUAUUUGCU CUGUUUGUUA UUAGCUCAUU
60   CACAAUUACU UCCACAAAUG CAGUUGAAGC UUCUACUCUU CUUGACAUAG GUAACCUGAG
...
1740 AGAAGCUAUG AUCAUAACUA UAGGUUGAUC CUUCAUGUAU CAGUUUGAUG UUGAGAAUAC
1800 UUUGAAUUAA AAGUCUUUUU UUAUUUUUUU AAAAAAAAAA AAAAAAAAAA AAAAAAAAA

We can also trascribe a sequence and verify that it will be a RNA sequence

>>> rna_seq == dna_seq.transcribe()
True

Reading EMBL Files using generators

Soppose we have an EMBL file with multiple records: we can instantiate a generator object to deal with multiple records

>>> import skbio
>>> embl = io.StringIO(embl_str)
>>> embl_gen = skbio.io.read(embl, format="embl")
>>> dna_seq = next(embl_gen)

For more informations, see skbio.io

References

1

ftp://ftp.ebi.ac.uk/pub/databases/embl/release/doc/usrman.txt

2

http://www.ebi.ac.uk/ena/data/view/X56734&display=text

3

http://www.ebi.ac.uk/ena/browse/feature-level-products

4

https://github.com/biocore/scikit-bio/issues/1499